It seems that after running "broctl stop", not all bro processes are killed immediately. On our cluster, one of the processes keeps running; it seems to terminate eventually, once all log compression is done. Is that on purpose or is that a bug?
ps output (on the node running the manager; the bro process is in the first line, and the running compression jobs are included for completeness):
I wonder if that process is just left over from when bro calls system() to run the child process...
I'm not sure what to do about this. Killing that process is not the best idea, but there may be a way to wait for it.
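To illustrate what "waiting for it" could look like, here is a minimal sketch that polls a PID until the process is gone or a timeout expires. It is not what broctl does today; the function names, the timeout, and the poll interval are all hypothetical, and it assumes a POSIX system where signal 0 can be used as an existence check.

```python
import errno
import os
import time


def pid_alive(pid):
    """Return True if a process with this PID currently exists.

    Note: a child that has exited but has not been reaped by its parent
    (a zombie) still counts as existing.
    """
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the PID
    except OSError as err:
        if err.errno == errno.ESRCH:
            return False  # no such process
        if err.errno == errno.EPERM:
            return True   # process exists but belongs to another user
        raise
    return True


def wait_for_exit(pid, timeout=60.0, poll_interval=0.5):
    """Poll until the process exits. Return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while pid_alive(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

A stop command could call wait_for_exit() on the leftover process and only report a problem if it returns False, instead of killing a process that is still compressing logs.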
I think there is a larger issue here in that log rotation has a number of problems:
All logs get rotated and compressed at the same time, causing a CPU/IO storm.
Logs are compressed on the fly to their destination, then the originals are removed.
If compression is not in use, logs are copied and then removed (rather than moved).
If using something like the sftp handler and sftp fails, nothing is retried.
Bro is the parent process to all of this.
If bro crashes, logs often end up in a crash directory rather than the proper location.
I think that the only thing bro should be doing is atomically moving the current logs to an archive directory or an archive staging directory. The compression, moving, copying, and uploading would be done by an external tool. There are a number of benefits to this:
If bro crashes, recovering the logs is easy: on startup, just move any existing log files to the staging dir. A bro crash could never result in a partially compressed/rotated log file.
Compression can be done serially or with limited parallelism rather than all at once.
You could even delay the compression to idle periods.
Bugs like this would not occur, since stopping bro would just require the logs to be moved, not compressed.
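A rough sketch of the split proposed above, with the move step and the external compression step as two separate functions. None of this is existing bro or broctl code: the function names, the directory layout, and the ".log" / ".gz" / ".tmp" suffixes are assumptions made for illustration. It also assumes the live and staging directories are on the same filesystem, so that rename is atomic, and it ignores file name collisions, which a real tool would avoid by adding timestamps.

```python
import gzip
import os
import shutil


def stage_logs(live_dir, staging_dir):
    """Move every *.log file from live_dir into staging_dir.

    This is the only step the bro side would perform, at rotation,
    shutdown, or on startup after a crash.
    """
    os.makedirs(staging_dir, exist_ok=True)
    staged = []
    for name in sorted(os.listdir(live_dir)):
        if not name.endswith(".log"):
            continue
        dst = os.path.join(staging_dir, name)
        # rename() is atomic on POSIX when both paths share a filesystem;
        # it silently replaces an existing file with the same name.
        os.rename(os.path.join(live_dir, name), dst)
        staged.append(dst)
    return staged


def compress_staged(staging_dir, archive_dir):
    """External-tool step: compress staged logs one at a time."""
    os.makedirs(archive_dir, exist_ok=True)
    archived = []
    for name in sorted(os.listdir(staging_dir)):
        if not name.endswith(".log"):
            continue
        src = os.path.join(staging_dir, name)
        final = os.path.join(archive_dir, name + ".gz")
        partial = final + ".tmp"
        with open(src, "rb") as fin, gzip.open(partial, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        os.rename(partial, final)  # only complete archives get the final name
        os.remove(src)             # drop the original after the archive exists
        archived.append(final)
    return archived
```

Because compress_staged() handles one file at a time and removes the original only after the archive has its final name, an interrupted run leaves the source log in the staging directory and can simply be rerun later, for example from a cron job during idle periods.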
This may be related to BIT-1306; let's wait for that.
Can somebody see if 0620bc97 helps?
I've tested this, and 0620bc97 fixed the problem for me.