After running a large Bro cluster for a few days on a FreeBSD system (FreeBSD 10.1, 28 physical nodes, 81 worker processes), broctl actions that interact with all nodes seem to take an excessive amount of time (more than 2 minutes for a broctl status). This was not the case right after starting up the cluster.
If there is any way I can help with more information, please let me know what to do.
I looked into this a bit more, and it seems that two nodes were very slow to reply and potentially ran into a timeout. That is not obvious from the status output at the moment (unless I completely missed it); perhaps we should add an indication of it.
And in even more detail: the cause of this was hardware problems on two nodes. The Bro instances on these nodes were still kind of running, but I don't think they were communicating with the master anymore, and they were unkillable (even with kill -9), probably hanging while waiting for disk I/O (hard drive problems). Since you could still ssh into the nodes, and they worked normally unless you tried certain file system accesses, broctl apparently listed them as online without giving any indication of problems with the nodes, apart from the fact that "status" takes a long time.
I'm not seeing a problem. As a test, I simulated a slow node by adding a "sleep"
command to one of the scripts that broctl runs on the remote host.
If the sleep is long enough to exceed the timeout, then I see "???" in the status
output (in the "Running", "Peers", and "Started" columns).
Otherwise, broctl status simply gathers information reported by Bro.
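For anyone who wants to reproduce the behaviour outside of broctl, here is a rough sketch of the pattern described above: run a per-node command with a timeout and show "???" when it does not answer in time. This is not broctl's actual code. The function name query_node and the local child processes standing in for remote hosts are assumptions made for illustration only.

```python
import subprocess
import sys

UNKNOWN = "???"  # placeholder shown when a node does not answer in time


def query_node(cmd, timeout_s):
    """Run cmd and return its stripped stdout, or UNKNOWN on timeout or failure.

    Hypothetical helper, not part of broctl.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception is raised.
        return UNKNOWN
    if result.returncode != 0:
        return UNKNOWN
    return result.stdout.strip()


# Local child processes simulate a healthy node and a slow node.
fast = [sys.executable, "-c", "print('running')"]
slow = [sys.executable, "-c", "import time; time.sleep(5); print('running')"]

fast_status = query_node(fast, timeout_s=10.0)
slow_status = query_node(slow, timeout_s=0.5)
print(fast_status, slow_status)
```

Note that a process stuck in uninterruptible disk I/O, as in the original report, may not respond to being killed, so a real implementation would also have to bound how long it waits for the cleanup itself.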
Set the timeout to 30s and make it configurable; revisit later once Broker is available.
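As an illustration of what "configurable with a 30s default" could look like, here is a minimal sketch. The option name status_timeout and the [settings] section are hypothetical and are not taken from broctl's configuration. Only the 30 second default comes from the note above.

```python
import configparser

DEFAULT_TIMEOUT_S = 30  # default value taken from the note above


def load_timeout(cfg_text, option="status_timeout"):
    """Return the timeout in seconds from INI-style text, or the default.

    The section and option names are hypothetical, chosen for this sketch.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # fallback covers both a missing section and a missing option
    return parser.getint("settings", option, fallback=DEFAULT_TIMEOUT_S)


default_timeout = load_timeout("[settings]\n")
custom_timeout = load_timeout("[settings]\nstatus_timeout = 120\n")
print(default_timeout, custom_timeout)
```

Keeping the default in one place means an operator with a large or slow cluster can raise the value without touching the code.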
Branch topic/dnthayer/ticket1353 in the broctl repo contains the fix for this issue.
This has been merged already.