The attached sample contains two HTTP downloads of the same URL from the same client, but there is no guarantee that the files are actually the same (no ETags etc.; in this case they actually are the same, but let's pretend they were different...). However, the file analysis framework seems to give the same file ID in file_name and file_chunk for both downloads.
I think this has something to do with Range requests, as it doesn't happen with "normal" HTTP requests.
I suspect your changes break some of our tests. Have you run the full test suite with your changes?
http.log: resp_fuids are different, as expected.
http/206_example_b.pcap: multiple requests were being incorrectly merged into one, but I don't quite understand the output before (or after!).
http/206_example_a.pcap: multiple requests that were being incorrectly merged into one are now split into two files with the correct number of bytes. I now get FILE_TIMEOUT events, not sure why.
The tests that are merging multiple files into one are actually working exactly like they're supposed to. With the change you made, you will end up with two chunks of the file if you enable extraction, but if you leave it as it is you will end up with one file that just happened to be transferred over two connections and was reassembled back into the single original file.
This is definitely an area where there isn't a right answer, so we just have to go based on experience of what's happening in real traffic, and we definitely see this sort of stuff in real traffic. Also, if you don't like Bro's behavior, you can run your own script (without modifying any of the shipped scripts) that gives you the behavior you're looking for. Did you understand my suggestion at the beginning of this ticket about writing your own get_file_handle function and registering it?
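To make the difference between the two outcomes concrete, here is a minimal Python sketch (not Bro code). The `reassemble` helper and the sample byte ranges are hypothetical and only illustrate the idea of stitching 206 ranges together by offset versus keeping them as separate chunks.

```python
def reassemble(ranges):
    """Merge (offset, data) pieces into one byte string; gaps are zero-filled."""
    size = max(offset + len(data) for offset, data in ranges)
    buf = bytearray(size)
    for offset, data in ranges:
        # Place each piece at the offset given by its Content-Range.
        buf[offset:offset + len(data)] = data
    return bytes(buf)


# Two hypothetical 206 responses carrying bytes 0-4 and 5-9 of one resource.
first = (0, b"hello")
second = (5, b"world")

merged = reassemble([first, second])   # merging behaviour: one file
separate = [first[1], second[1]]       # split behaviour: two chunks
```

The merged result is only meaningful if both ranges really came from the same version of the resource, which is the point under discussion in this ticket.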
Sorry, I've not been as clear as I could have been here. I've changed my own Bro instance, but I'm concerned that out of the box, Bro's behaviour, while convenient for the majority of cases, isn't correct and will result in irrecoverably corrupted files in some instances (unless you’re lucky enough to keep full captures).
I've researched this further and I would argue there is a right answer and the spec is clear; see RFC 2616, section 10.2.7:
A cache MUST NOT combine a 206 response with other previously cached content if the ETag or Last-Modified headers do not match exactly, see 13.5.4.
I'd say Bro is a cache in this instance. For example, clients like IE follow this behavior, and Adobe Reader uses the If-Range conditional to ensure the URL still refers to the same document.
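As a rough illustration of the rule quoted above, the check could look something like this in Python (not Bro code). The `may_combine` name and the plain-dict header representation are assumptions made for the example, and treating weak ETags and missing validators as "do not combine" is a deliberately conservative simplification of the strong-validator requirement in section 13.5.4.

```python
def may_combine(cached_headers, new_headers):
    """Return True only if both responses carry matching validators.

    Assumption: headers are plain dicts keyed by exact header name.
    """
    # Any mismatch in ETag or Last-Modified forbids combining.
    for name in ("ETag", "Last-Modified"):
        if cached_headers.get(name) != new_headers.get(name):
            return False
    etag = cached_headers.get("ETag")
    if etag is not None:
        # Weak ETags (W/ prefix) are treated as not combinable here.
        return not etag.startswith("W/")
    # Without an ETag, require at least a Last-Modified value.
    return cached_headers.get("Last-Modified") is not None
```

With this predicate, two responses that carry no validators at all (as in the attached sample) would not be combined.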
I agree my change is over-conservative. Would you accept something that includes ETag and Last-Modified in the hash? Or is the (small) chance of corruption not a concern (which is fine, as long as someone has actively decided not to follow the RFC)?
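A sketch of what including the validators in the hash might look like, written in Python rather than Bro script. The `file_handle` function, its parameters, and the choice of fields (client, server, URI) are hypothetical and are not taken from Bro's actual handle construction.

```python
import hashlib


def file_handle(orig_h, resp_h, uri, etag=None, last_modified=None):
    """Derive a file handle that changes whenever a validator changes."""
    parts = [orig_h, resp_h, uri, etag or "", last_modified or ""]
    # Length-prefix each field so field boundaries stay unambiguous.
    joined = "".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()
```

Note that two responses with no ETag and no Last-Modified would still share a handle under this scheme, so it only separates downloads when the server actually sends validators.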
Ah! I think it's perfectly reasonable to make our default behavior a bit closer to RFC 2616. I'll look into it soon.
At the very least, if someone does want the more liberal file combining they can add it back with a separate script (which I'll probably include with Bro somewhere). I'll take this ticket to make sure I deal with this soon.