Same file id generated for potentially different files

Description

The attached sample contains two HTTP downloads of the same URL from the same client, but there is no guarantee that the files are actually the same (no ETags etc.; in this case they actually are the same, but let's pretend they were different...). However, the file analysis framework seems to give the same file ID in file_name and file_chunk for both downloads.

I think this has something to do with Range requests, as it doesn't happen with "normal" HTTP requests.

Environment

CentOS 6

Activity

Seth Hall
September 25, 2014, 1:13 PM

I suspect your changes break some of our tests. Have you run with your changes over the full test suite?

Jimmy Jones
September 26, 2014, 1:33 PM

http.log resp_fuids different, as expected

http/206_example_b.pcap: multiple requests were being incorrectly merged into one, but I don't quite understand the output before (or after!) the change.

http/206_example_a.pcap: multiple requests were being incorrectly merged into one; they are now split into two files with the correct number of bytes. I now get FILE_TIMEOUT events, and I'm not sure why.

Seth Hall
September 26, 2014, 2:03 PM

The tests that are merging multiple files into one are actually working exactly like they're supposed to. With the change you made, you will end up with two chunks of the file if you enable extraction, but if you leave it as it is, you will end up with one file that just happened to be transferred over two connections and was reassembled back into the single original file.
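To make the two outcomes concrete, here is a minimal Python sketch of range reassembly keyed by file ID. It is not Bro's implementation: the `reassemble` function, the tuple layout, and the zero-filled gap are all assumptions made for illustration.

```python
import re

# Matches a Content-Range value such as "bytes 0-4/10".
CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def reassemble(responses):
    """Merge 206 bodies that share a file ID into one buffer per ID.

    `responses` is an iterable of (file_id, content_range, body) tuples.
    This layout is hypothetical and only serves the illustration.
    """
    files = {}
    for file_id, content_range, body in responses:
        match = CONTENT_RANGE.fullmatch(content_range)
        if match is None:
            raise ValueError("unsupported Content-Range: %r" % content_range)
        first, last = int(match.group(1)), int(match.group(2))
        if last - first + 1 != len(body):
            raise ValueError("body length does not match the declared range")
        buf = files.setdefault(file_id, bytearray())
        # Grow the buffer so the range fits; unseen bytes stay zero-filled
        # (an assumption of this sketch, not a statement about Bro).
        if len(buf) < last + 1:
            buf.extend(b"\x00" * (last + 1 - len(buf)))
        buf[first:last + 1] = body
    return {fid: bytes(buf) for fid, buf in files.items()}


# Same file ID for both connections: one reassembled file.
merged = reassemble([
    ("F1", "bytes 0-4/10", b"hello"),
    ("F1", "bytes 5-9/10", b"world"),
])

# Different file IDs: two partial files, each holding only its own range.
split = reassemble([
    ("F1", "bytes 0-4/10", b"hello"),
    ("F2", "bytes 5-9/10", b"world"),
])
```

With a shared ID the two ranges come back as a single ten-byte file; with distinct IDs each file contains only the bytes seen on its own connection.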

This is definitely an area where there isn't a right answer, so we just have to go based on experience of what's happening in real traffic, and we definitely see this sort of stuff in real traffic. Also, if you don't like Bro's behavior, you can run your own script (without modifying any of the shipped scripts) that gives you the behavior you're looking for. Did you understand my suggestion at the beginning of this ticket about writing your own get_file_handle function and registering it?

Jimmy Jones
September 29, 2014, 9:18 AM

Sorry, I've not been as clear as I could have been here. I've changed my own Bro instance, but I'm concerned that, out of the box, Bro's behaviour, while convenient for the majority of cases, isn't correct and will result in irrecoverably corrupted files in some instances (unless you’re lucky enough to keep full captures).

I've researched this further and I would argue there is a right answer and the spec is clear, see RFC2616, 10.2.7:

A cache MUST NOT combine a 206 response with other previously cached content if the ETag or Last-Modified headers do not match exactly, see 13.5.4.
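As a rough illustration of that rule, the Python sketch below decides whether a 206 response may be merged with previously stored content. The `may_combine` name and the dictionaries of lower-cased header names are assumptions. Refusing to combine when neither side carries a validator, and rejecting weak ETags (the `W/` prefix), reflect a conservative reading of section 13.5.4 rather than anything taken from Bro.

```python
VALIDATORS = ("etag", "last-modified")


def may_combine(cached, partial):
    """Return True only if the validators of both responses match exactly.

    `cached` and `partial` map lower-cased header names to their values
    (a hypothetical representation chosen for this sketch).
    """
    # Without any validator there is nothing to prove both responses
    # describe the same entity, so this sketch refuses to combine.
    if not any(name in cached or name in partial for name in VALIDATORS):
        return False
    for name in VALIDATORS:
        # A header present on one side only counts as a mismatch.
        if cached.get(name) != partial.get(name):
            return False
    # Weak ETags are not suitable for byte-range combination.
    etag = cached.get("etag")
    if etag is not None and etag.startswith("W/"):
        return False
    return True


same = may_combine({"etag": '"abc"'}, {"etag": '"abc"'})
changed = may_combine({"etag": '"abc"'}, {"etag": '"def"'})
unknown = may_combine({}, {})
```

Here `same` is true, while `changed` and `unknown` are false, so only the first pair would be reassembled into one file.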

I'd say Bro is acting as a cache in this instance. Clients such as IE follow this behavior, for example, and Adobe Reader uses the If-Range conditional to ensure the URL still refers to the same document.

I agree my change is over-conservative. Would you accept something that includes ETag and Last-Modified in the hash? Or is the (small) chance of corruption not a concern (which is fine, as long as someone has actively decided not to follow the RFC)?
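A minimal Python sketch of that proposal follows. It is not the actual get_file_handle logic: the `file_handle` function, its parameters, and the choice of SHA-1 over a NUL-joined string are all assumptions made for illustration.

```python
import hashlib


def file_handle(orig_h, resp_h, host, uri, etag=None, last_modified=None):
    """Derive a file handle that changes whenever a validator changes.

    All parameter names are hypothetical. Missing validators are hashed
    as empty strings.
    """
    parts = [
        "HTTP",
        orig_h,               # client address
        resp_h,               # server address
        host + uri,           # requested resource
        etag or "",           # ETag header, if present
        last_modified or "",  # Last-Modified header, if present
    ]
    # NUL separators keep adjacent fields from running into each other.
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()


first = file_handle("10.0.0.1", "192.0.2.5", "example.com", "/a.pdf", etag='"v1"')
again = file_handle("10.0.0.1", "192.0.2.5", "example.com", "/a.pdf", etag='"v1"')
other = file_handle("10.0.0.1", "192.0.2.5", "example.com", "/a.pdf", etag='"v2"')
```

Ranges of an unchanged document still share a handle and can be reassembled, while a changed ETag yields a different handle. Responses that carry neither header would still hash to the same handle in this sketch, so that case would need a separate decision.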

Seth Hall
September 30, 2014, 1:35 PM

Ah! I think it's perfectly reasonable to make our default behavior a bit closer to RFC2616. I'll take a look into it soon.

At the very least, if someone does want the more liberal file combining they can add it back with a separate script (which I'll probably include with Bro somewhere). I'll take this ticket to make sure I deal with this soon.

Assignee

Seth Hall

Reporter

Jimmy Jones

Labels

None

External issue ID

None

Components

Affects versions

Priority

Normal