File extracted by ANALYZER_EXTRACT has a different hash than the original file

Description

Hello, everyone.
I'm new to bro and am using the FAF (File Analysis Framework) to
extract files of certain types from traffic to disk for further analysis.
I have now run into a problem that I find difficult to understand:

  • the file extracted by bro is one byte bigger than my original file

  • or the file extracted by bro has the same size as my original file, but
    its MD5 value is different

Below are my test env, test steps and test results:

  1. my test env

    bro version:
      • bro version 2.5-156
    OS (32C 64G):
      • CentOS Linux release 7.3.1611 (Core)
    CPU model:
      • Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
      • CPU(s): 32
      • CPU MHz: 2334.445
    NIC:
      • 03:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network

  2. my test bro scripts
    ```
    event file_sniff(f: fa_file, meta: fa_metadata)
        {
        print "file sniff event by Myth";
        if ( meta?$mime_type ) #&& hook FileExtraction::extract(f, meta) )
            {
            if ( meta$mime_type in mime_to_ext )
                {
                local fext = mime_to_ext[meta$mime_type];
                if ( fext == "txt" )
                    {
                    #print "txt";
                    if ( f$source != "SMTP" )
                        {
                        #print "NOT SMTP";
                        return;
                        }
                    }
                }
            else
                return;
            #fext = split_string(meta$mime_type, /\//)[1];

            local fname = fmt("%s%s-%s.%s", path, f$source, f$id, fext);
            # file path
            #print fname;
            Files::add_analyzer(f, Files::ANALYZER_MD5);
            Files::add_analyzer(f, Files::ANALYZER_SHA1);
            Files::add_analyzer(f, Files::ANALYZER_SHA256);
            Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
            }
        }
    ```

  3. my test steps

1. generate test file

>>> [root@sensor ~]# dd if=/dev/urandom of=test.for.bro.txt bs=1024 count=512
>>> [root@sensor ~]# tar -cvzf test.for.bro.tar.gz test.for.bro.txt

1.1 original file size and MD5 value

>>> [root@sensor ~]# ls -lt test.for.bro.tar.gz
-rw-r--r-- 1 root root 524608 8月 7 13:59 test.for.bro.tar.gz
>>> [root@sensor ~]# md5sum test.for.bro.tar.gz
6e755b5c0a7754c7066ca6db5f0f90ba test.for.bro.tar.gz
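
The two reference values above can be captured in one step so that later checks compare against exactly the same numbers. This is only a minimal sketch, assuming a POSIX shell with `wc`, `md5sum` and `awk` available; `ref_stats` is a hypothetical helper name, not something bro provides.

```shell
# ref_stats FILE
# Prints "<size-in-bytes> <md5>" for FILE on one line.
# (hypothetical helper, for illustration only)
ref_stats() {
    local size md5
    # wc -c on stdin prints only the byte count; strip any padding
    size=$(wc -c < "$1" | tr -d ' ')
    # md5sum prints "<hash>  <name>"; keep only the hash
    md5=$(md5sum "$1" | awk '{print $1}')
    printf '%s %s\n' "$size" "$md5"
}
```

For example, `ref_stats test.for.bro.tar.gz` could be stored in a variable and reused in the later steps.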

2. start test web server using Python
>>> [root@sensor ~]# python -m SimpleHTTPServer 8998 > ws.log 2>&1

3. start bro
>>> [root@sensor myth]# /usr/local/bro/bin/bro -i eno1 -C bro-scripts/tophant.entrypoint.bro > myth.log 2>&1

4. use `ab` to make lots of HTTP requests for the test file from another machine
>>> [root@localhost ~]# ab -n 2000 -c 4 'http://10.0.81.54:8998/test.for.bro.tar.gz'

5. result (after all requests are done)

5.1 number of requests processed by the web server
>>> [root@sensor ~]# cat ws.log | grep test.for.bro | wc -l
2000

5.2 bro `file_sniff` event count
>>> [root@sensor myth]# cat myth.log | grep "file sniff event by Myth" | wc -l
976

5.3 downloaded (extracted) file count
>>> [root@sensor sensor_files_by_myth]# ls | wc -l
973

5.4 number of files with a different file size:
>>> [root@sensor sensor_files_by_myth]# ls -lt | grep -v 524608 | wc -l
193

5.5 number of files with the same file size:
>>> [root@sensor sensor_files_by_myth]# ls -lt | grep 524608 | wc -l
780

5.6 number of files with the same MD5 value:
>>> [root@sensor sensor_files_by_myth]# ls -lt | awk '{print $NF}' | xargs md5sum | grep 6e755b5c0a7754c7066ca6db5f0f90ba | wc -l
19
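
The counts in 5.4 to 5.6 come from `ls | grep` pipelines, which also see the `total` line of `ls -l` and match the size anywhere in a line. As a cross-check, the same three buckets could be computed per file. This is a hedged sketch, assuming a POSIX shell with `wc`, `md5sum` and `awk`; `classify_extracts` is a hypothetical helper name.

```shell
# classify_extracts DIR REF_FILE
# Compares every regular file in DIR against REF_FILE and prints how many
# are identical, how many have the same size but another MD5, and how many
# have another size. (hypothetical helper, for illustration only)
classify_extracts() {
    local dir="$1" ref="$2" f
    local ref_size ref_md5 size md5
    local same=0 same_size=0 diff_size=0
    ref_size=$(wc -c < "$ref" | tr -d ' ')
    ref_md5=$(md5sum "$ref" | awk '{print $1}')
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip non-files and an empty glob
        size=$(wc -c < "$f" | tr -d ' ')
        if [ "$size" != "$ref_size" ]; then
            diff_size=$((diff_size + 1))
            continue
        fi
        # only hash files whose size already matches
        md5=$(md5sum "$f" | awk '{print $1}')
        if [ "$md5" = "$ref_md5" ]; then
            same=$((same + 1))
        else
            same_size=$((same_size + 1))
        fi
    done
    printf 'identical=%d same_size_diff_md5=%d diff_size=%d\n' \
        "$same" "$same_size" "$diff_size"
}
```

In this test it would be called with the extraction directory and the original archive as its two arguments.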

5.7 number of files with the same file size but a different MD5 (!!! NOTICE: every one of them has its own distinct MD5)
>>> [root@sensor sensor_files_by_myth]# ls -lt | grep 524608 | awk '{print $NF}' | xargs md5sum | grep -v 6e755b5c0a7754c7066ca6db5f0f90ba | awk '{print $1}' | sort | uniq -c | wc -l
761
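
For the files that have the right size but the wrong MD5, it may help to see where and how much they differ from the original. The sketch below relies on `cmp -l`, which lists one line per differing byte with its 1-based offset; `first_diff` and `diff_bytes` are hypothetical helper names, and the approach is only meaningful for files of equal length.

```shell
# first_diff ORIGINAL EXTRACTED
# Prints the 1-based offset of the first differing byte, or "none" if the
# two files agree over their common length. (hypothetical helper)
first_diff() {
    local off
    # cmp exits non-zero when the files differ; that is expected here
    off=$(cmp -l "$1" "$2" 2>/dev/null | awk 'NR == 1 {print $1; exit}') || true
    echo "${off:-none}"
}

# diff_bytes ORIGINAL EXTRACTED
# Prints how many byte positions differ over the common length.
diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l | tr -d ' '
}
```

A large `diff_bytes` value starting at one offset would point to a different kind of corruption than a handful of scattered bytes.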

5.8 downloaded file size distribution:
>>> [root@sensor sensor_files_by_myth]# ls -lt | awk '{print $5}' | sort -rn | uniq -c

    136 524609 <<<<<<<<<<<<<<< this is one byte bigger than my original test file !!!
    780 524608
    3 523990
    3 522542
    8 521094
    1 520208
    1 519646
    2 518198
    1 515302
    1 513854
    1 512968
    1 512406
    1 510958
    1 509510
    2 503718
    1 502176
    1 501384
    1 497926
    1 490296
    1 488808
    1 487040
    1 486342
    1 480550
    1 473310
    1 467518
    1 464622
    1 458830
    1 453038
    1 442902
    1 441454
    1 396566
    1 382408
    1 377742
    1 358918
    1 354574
    1 318240
    1 283312
    1 263350
    1 256110
    1 250318
    1 234952
    1 189502
    1 164886
    1 79454
    2 2710
    1
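
The distribution above is easier to read as a difference from the original size (524608 bytes in this test), where `1` is the one-byte-too-big case and negative values are truncated files. A minimal sketch, assuming a POSIX shell with `wc`, `sort` and `uniq`; `size_deltas` is a hypothetical helper name.

```shell
# size_deltas DIR REF_SIZE
# Prints a histogram "<count> <delta>" where delta is the size of each
# regular file in DIR minus REF_SIZE, in bytes. (hypothetical helper)
size_deltas() {
    local dir="$1" ref="$2" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip non-files and an empty glob
        echo "$(( $(wc -c < "$f") - ref ))"
    done | sort -n | uniq -c
}
```

Unlike the `ls -lt` pipeline, this iterates over the files directly, so no `total` line ends up in the histogram.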

Thanks for reading this far. I hope someone can help me with this.

Myth

Assignee

Unassigned

Reporter

Myth Ren

External issue ID

None

Components

Affects versions

Priority

High