topic/jsiwek/http-file-id-caching

Description

This branch is in the bro and bro-testing repos. It adds a file ID caching / "fast path" mechanism to the file analysis API and adapts HTTP to use it to improve performance.

Environment

None

Activity

Robin Sommer
January 31, 2014, 12:12 AM

For the case that the core can compute the file id itself without needing the script-land, is the idea that it then just passes it in as the cached_id?

Seth Hall
January 31, 2014, 1:50 AM

I've been thinking about this and I'm not sure how I feel about analyzers computing their own identifiers. That actually causes inconsistent behavior because a user would have to know that a certain analyzer does that or that it does that in certain cases. i.e. the user would have no control over how file chunks are tied together to form complete files. Is this something that is already implemented?

Jon Siwek
January 31, 2014, 3:41 PM

> For the case that the core can compute the file id itself without needing the script-land, is the idea that it then just passes it in as the cached_id?

Yes, and it can ignore the return value from those methods and just always supply its own file ID if that's what it wants to do.

> I've been thinking about this and I'm not sure how I feel about analyzers computing their own identifiers. That actually causes inconsistent behavior because a user would have to know that a certain analyzer does that or that it does that in certain cases. i.e. the user would have no control over how file chunks are tied together to form complete files.

Probably few users are going to want to change how file IDs are calculated in the first place, and the cases where an analyzer directly calculates a file ID are probably going to be the ones where there's not really any other sane way to do it. I do agree it's somewhat inconsistent, though.

> Is this something that is already implemented?

Yes, it comes for free with the new support for caching a file ID returned from script-land, due to the way the code is structured. In this case, the return value from the file analysis API functions is simply whatever was passed in instead of something calculated in script-land.
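As a rough illustration of the behavior described above, here is a minimal sketch in Python rather than the actual core code. All names in it (`FileAnalysisManager`, `data_in`, `script_land_handle`, `precomputed_id`) are assumptions made up for this example and are not the real file analysis API. It only shows the idea: when the caller passes in an ID, script-land is skipped and that same ID is returned; otherwise script-land is asked once and the caller can cache the result.

```python
from typing import Callable, Dict, Optional


class FileAnalysisManager:
    """Hypothetical stand-in for the file analysis API (not the real one)."""

    def __init__(self, script_land_handle: Callable[[str], str]) -> None:
        # Slow path: a callback standing in for the script-land lookup.
        self._script_land_handle = script_land_handle
        self.script_calls = 0
        self.delivered: Dict[str, bytearray] = {}

    def data_in(self, data: bytes, conn_key: str,
                precomputed_id: Optional[str] = None) -> str:
        """Deliver a chunk and return the file ID that was used for it."""
        if precomputed_id is not None:
            # Fast path: trust the ID the caller passed in, whether it was
            # cached from an earlier call or computed by the analyzer itself.
            file_id = precomputed_id
        else:
            # Slow path: ask script-land how to identify this file.
            self.script_calls += 1
            file_id = self._script_land_handle(conn_key)
        self.delivered.setdefault(file_id, bytearray()).extend(data)
        return file_id


mgr = FileAnalysisManager(lambda conn_key: "F" + conn_key)

# Case 1: the analyzer caches the ID returned by the first call.
cached = None
for chunk in (b"GET", b" /", b" HTTP"):
    cached = mgr.data_in(chunk, "conn1", cached)

# Case 2: the analyzer computes its own ID and never consults script-land.
core_id = mgr.data_in(b"payload", "conn2", precomputed_id="core-id")
```

In the first case script-land is consulted only for the first chunk; in the second it is never consulted and the return value is just the ID that was passed in.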

Seth Hall
January 31, 2014, 4:16 PM

True. I think the cases where there is really only one way to do it are pretty limited. Maybe just the old "File" analyzer that is used for FTP and IRC transfers?

Ah, ok. Thanks.

Robin Sommer
January 31, 2014, 4:17 PM

Agree with Jon. I think we want the option; it just feels unnecessary to pass through script-land in cases where there's really no question about how to compute the handle. I don't think that's actually different from other low-level decisions analyzers sometimes make on how to process something without asking script-land for its opinion.

Also, analyzers can document whether they offer any customization.

I think I'll rename cached_id to precomputed_id, then, so that it covers both cases.

(And I would like to eventually have a document that summarizes the options an analyzer has.)

Robin Sommer
January 31, 2014, 4:26 PM

> I think the cases where there is really only one way to do it are pretty limited.

Also recursive content inspection of container formats.

Assignee

Unassigned

Reporter

Jon Siwek

Labels

None

External issue ID

None

Components

Fix versions

Affects versions

Priority

Normal