Investigate further improvements to file analysis performance

Description

Some further ideas for measuring and improving the performance of maintaining the handles were floating around.

Environment

None

Activity

Show:
Jon Siwek
April 24, 2014, 5:50 PM

I have some performance improvement changes in topic/jsiwek/faf-perf in the bro repo w/ minor updates to test baselines in bro-testing and bro-testing-private repos.

The degree to which the changes improve performance are, of course, sensitive to the sampled traffic. I think traffic that contains a lot of files, largish files that Bro only sees in small chunks at a time, and/or many HTTP partial content messages would see the most improvement from the changes. Summary of "hot spot" findings (using 2009-M57-day11-18.trace from bro-testing as "representative" sample):

~8% of total time spent on MIME magic signature matching.

This was also the slowest part of file analysis when it relied on libmagic. Using Bro's signatures engine instead is, theoretically, an improvement because the amount of time spent on matching shouldn't increase w/ number of signatures. However, I didn't actually measure much of a performance increase after the switch. No easy optimizations for the RE/signature matching stood out to me, code-wise, that would significantly help – most time spend appears to be in just having to iterate over each byte of input and the number of bytes being so large that the smallish amount of work each iteration adds up. Possibly, we could be strict about only allowing file-magic signatures that are anchored close to the beginning of the file so that the matching code can bail out earlier when it detects no further matches are possible.

~4.4% of total time spent on requesting file handles from the script-layer.

This includes the cost of creating new Vals to pass in as event arguments, the execution of the event handler, and the MD5 hashing of the file handle in to a file ID. The most costly aspect seems to be the first, creating new Vals, while the later two, hashing and execution of the event handler, weren't significant. For the moment, the best way I can think to reduce the cost of file handle requests is just to not make them as often (or at all if it's possible). Being more aggressive about caching file IDs or just generating file handle+ID internally, I was able to reduce this from 4.4% to 1.7%.

~2% of total time spent on updating fa_file record field values (e.g. seen bytes, missed bytes, last active time)

The fa_file Val is always kept up-to-date as new info is available, but this results in a lot of new Val creations and Val destructions. Similar to connection records, it may be possible to only update the fa_file record "on-demand" when it's needed for an event and thus reduce this cost. I don't like that much as I've always found that behavior confusing and it makes it more difficult to "poll" those values at the script-layer (ConnPolling currently uses a hack to force updates, and there's no equivalent hack at the moment for fa_file records). How ugly would it be (or would it even work) to have an interface to change Val's internal values (e.g. Val::val.uint_val) so a new Val doesn't have to be created to update fields in a fa_file record? Or is there another option?

~0.5% of total time spent on internal file ID string to File* lookups.

This seems like a reasonable cost. Don't think it would be that beneficial for file_analysis::Manager to have an interface that exposes File* to callers in order to avoid the lookup.

Robin Sommer
April 24, 2014, 10:59 PM
Edited

Two questions:

(1)

I'm actually wondering about performance here as set/map can potentially
be expensive in particular for small sizes (compared to using a vector
for example), and these will be instantiated and manipulated quite often.

Put differently: I wouldn't be sure that using a set here is necessarily faster overall than a list as long as there's just a few elements in there. Were you able to confirm that?

(2)

Baseline/tests.m57-long/http.log: some MIME types change from
text/html to text/plain, is that expected? (Update: Ah, is that the bof_buffer_size change?)

Robin Sommer
April 24, 2014, 11:08 PM

How ugly would it be (or would it even work) to have an interface to change Val's internal values (e.g. Val::val.uint_val) so a new Val doesn't have to be created to update fields in a fa_file record?

Not sure that would be safe. I believe the Val can end up being shared with other locations which wouldn't expect it to change. Say if somebody assigned the current value to somewhere else, then when it would later be changed under the hood it would show up there too.

Robin Sommer
April 24, 2014, 11:11 PM

Here are the performance improvements I'm seeing:

Pretty neat (though as you say, also clearly traffic dependent).

Jon Siwek
April 28, 2014, 2:18 PM

I'm actually wondering about performance here as set/map can potentially
be expensive in particular for small sizes (compared to using a vector
for example), and these will be instantiated and manipulated quite often.
Put differently: I wouldn't be sure that using a set here is necessarily faster overall than a list as long as there's just a few elements in there. Were you able to confirm that?

It can be questionable – in other places I've tried replacing lists with sets/maps and have measured some performance decrease. But in this case, the difference seemed negligible... I think it was a slight improvement possibly because file signatures will now more commonly have multiple matches where before only a single protocol signature would match. Code-wise, it did simplify things, though I guess that's only a minor/weak argument for the change.

Baseline/tests.m57-long/http.log: some MIME types change from
text/html to text/plain, is that expected? (Update: Ah, is that the bof_buffer_size change?)

Yes, that was from the change to restrict how much data may be fed in the the file MIME signature matching stuff to be no greater than the bof_buffer_size field – as that's the original intent and also the way it's documented.

Robin Sommer
May 2, 2014, 3:33 AM

Sounds good, thanks.

Assignee

Jon Siwek

Reporter

Robin Sommer

Labels

None

External issue ID

None

Components

Fix versions

Priority

Normal
Configure