bro-cut should be rewritten for speed and to not depend on gawk

Description

The current implementation of bro-cut is too slow on large log files: it takes more than a minute to process a single log file a few hundred MB in size. Justin Azoff rewrote bro-cut in C and found that it runs an order of magnitude faster. A C version of bro-cut also means that we will no longer depend on gawk for anything, which matters because some of Bro's supported platforms do not include gawk by default.

Environment

None

Activity

cubic1271
July 11, 2014, 1:59 PM

Why a static array with local code to resize instead of using something like std::vector? Is it a requirement that bro-cut be C and not C++?

Daniel Thayer
July 11, 2014, 2:28 PM

The current implementation can be compiled with a C++ compiler (and it works), so I guess it's already C++.
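For readers unfamiliar with the pattern being discussed, an array with local resize code might look roughly like the sketch below in plain C. This is only an illustration: `col_list`, `col_list_push`, and `col_list_free` are hypothetical names, not identifiers from the bro-cut source. The explicit cast on `realloc` is there so the snippet also builds with a C++ compiler.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable list of column indexes (illustrative only,
   not taken from the bro-cut source). */
struct col_list {
    int *items;   /* heap-allocated storage, NULL when empty */
    size_t len;   /* number of entries in use */
    size_t cap;   /* number of entries allocated */
};

/* Append one index, doubling the capacity when the array is full.
   Returns 0 on success, -1 if realloc fails (the list is left unchanged). */
static int col_list_push(struct col_list *l, int idx)
{
    assert(l != NULL);
    if (l->len == l->cap) {
        size_t new_cap = l->cap ? l->cap * 2 : 8;
        /* Cast keeps this valid when compiled as C++ as well as C. */
        int *p = (int *)realloc(l->items, new_cap * sizeof(int));
        if (p == NULL)
            return -1;
        l->items = p;
        l->cap = new_cap;
    }
    l->items[l->len++] = idx;
    return 0;
}

/* Release the storage and reset the list to its empty state. */
static void col_list_free(struct col_list *l)
{
    assert(l != NULL);
    free(l->items);
    l->items = NULL;
    l->len = 0;
    l->cap = 0;
}
```

A caller would start from `struct col_list l = {NULL, 0, 0};`, call `col_list_push(&l, i)` once per column, and finish with `col_list_free(&l)`. In C++, `std::vector<int>` provides the same growth behavior without the manual bookkeeping.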

Robin Sommer
July 23, 2014, 12:40 AM

I noticed a regression compared to the awk version: the C bro-cut cannot handle more than one time column when converting to readable output. The branch topic/robin/ticket1215-merge has a test case in bro-cut/multiple-times.test. Might be a bit painful to fix, but I think we should ...

Daniel Thayer
July 30, 2014, 4:57 PM

In branch topic/dnthayer/ticket1215, I've made the following changes:

1) bro-cut now handles time conversion for multiple time columns in a log file (and there is a new test case).
2) bro-cut no longer has a hard-coded limit on the number of columns that it can handle.
3) All tests now pass on OS X (previously, some were failing due to strftime("%z") behavior on OS X).
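To illustrate what converting several time columns in one line involves, here is a minimal sketch, not the code from the branch. It assumes a tab-separated line with no trailing newline and timestamps written as epoch seconds. `convert_times` and `is_time_col` are hypothetical names. The sketch formats in UTC with a literal `Z` in the format string, so its output does not rely on `%z` at all.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Returns 1 if column index col appears in time_cols, else 0. */
static int is_time_col(const int *time_cols, size_t n, int col)
{
    for (size_t i = 0; i < n; i++)
        if (time_cols[i] == col)
            return 1;
    return 0;
}

/* Copy a tab-separated line into out, rewriting every column listed in
   time_cols from epoch seconds to the strftime format fmt, in UTC.
   Fields that do not parse as a number (for example "-") are copied
   unchanged. Returns 0 on success, -1 if out is too small. */
static int convert_times(const char *line, const int *time_cols,
                         size_t n_time_cols, const char *fmt,
                         char *out, size_t outsz)
{
    size_t used = 0;
    int col = 0;
    const char *p = line;

    assert(line != NULL && fmt != NULL && out != NULL);

    for (;;) {
        size_t flen = strcspn(p, "\t");  /* length of the current field */
        char buf[64];                    /* holds the formatted time */
        const char *src = p;             /* default: copy field as-is */
        size_t slen = flen;

        if (is_time_col(time_cols, n_time_cols, col) && flen < sizeof buf) {
            char num[64];
            char *end;
            double secs;

            memcpy(num, p, flen);
            num[flen] = '\0';
            secs = strtod(num, &end);
            if (end != num && *end == '\0') {
                time_t t = (time_t)secs;      /* drops fractional seconds */
                struct tm *tm = gmtime(&t);   /* not thread-safe */
                size_t w = tm ? strftime(buf, sizeof buf, fmt, tm) : 0;
                if (w > 0) {
                    src = buf;
                    slen = w;
                }
            }
        }

        /* Reserve one extra byte for the tab or the terminating NUL. */
        if (used + slen + 1 > outsz)
            return -1;
        memcpy(out + used, src, slen);
        used += slen;

        if (p[flen] == '\0')
            break;
        out[used++] = '\t';
        p += flen + 1;
        col++;
    }
    out[used] = '\0';
    return 0;
}
```

For example, with time columns `{0, 2}` and the format `"%Y-%m-%dT%H:%M:%SZ"`, the input line `0.000000<TAB>Cabc<TAB>86400.500000` becomes `1970-01-01T00:00:00Z<TAB>Cabc<TAB>1970-01-02T00:00:00Z`. Because the set of time columns is passed in as a list, nothing in the loop assumes there is only one.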

Jon Siwek
August 4, 2014, 8:56 PM

Just an FYI: I've added a job to Jenkins to run the bro-aux test suite, so bro-cut is now being regression tested automatically.

Assignee

Robin Sommer

Reporter

Daniel Thayer

Labels

None

External issue ID

None

Components

Fix versions

Priority

Normal