to_json mishandles strings with both \ and unprintable characters

Description

The to_json function (implemented in base/utils/json.bro) escapes unprintable characters using the clean primitive, and then manually escapes \ and " with regexes. This is wrong on two levels: the transformation performed by clean is irreversible, and \x escapes are not part of the JSON standard. For instance, consider a test program, test.bro, that prints the to_json encoding of the three-byte string consisting of a double quote, a backslash, and the byte 0x81.

Running this test program with bro -b -C test.bro will produce the output "\"\\\\x81". Because the clean transformation is irreversible, this output could correspond either to the original three-byte string (hexdumped, 22 5C 81) or to a six-byte string consisting of a double quote, two backslashes, and the literal text x81 (hexdumped, 22 5C 5C 78 38 31). And because \x escapes are not part of the JSON standard, it is not enough to replace clean and the inner gsub with escape_string; that would produce "\\\x81", which is unambiguous but also unparseable.
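The ambiguity can be reproduced outside Bro with a small Python model. This is only a sketch: model_to_json is a hypothetical stand-in for the behaviour described above (a clean-like step that rewrites non-printable bytes as \xHH without touching backslashes, followed by escaping \ and "), not the actual json.bro code.

```python
import json

def model_to_json(data: bytes) -> str:
    """Hypothetical model of the string encoding described in this report."""
    # Stand-in for clean: keep printable ASCII, rewrite other bytes as \xHH.
    # Existing backslashes are NOT escaped here, which is what loses information.
    cleaned = "".join(
        chr(b) if 0x20 <= b < 0x7F else "\\x%02x" % b for b in data
    )
    # Stand-in for the gsub calls: escape backslashes first, then double quotes.
    escaped = cleaned.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

three_bytes = b'"\\\x81'   # 22 5C 81
six_bytes = b'"\\\\x81'    # 22 5C 5C 78 38 31

# Two different inputs collapse to the same encoded text.
same_output = model_to_json(three_bytes) == model_to_json(six_bytes)

# An escape_string-style result is unambiguous, but \x is not a JSON escape,
# so a strict parser rejects it.
try:
    json.loads(r'"\"\\\x81"')
    x_escape_parses = True
except json.JSONDecodeError:
    x_escape_parses = False
```

Under this model, same_output is true and x_escape_parses is false, matching the two problems described above.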

The ideal output would be "\"\\\u0081". I'm not sure how to accomplish this, considering that gsub does not appear to implement any way of referring to capture groups from the replacement string. For now I'm going to change json.bro to read

and postprocess the JSON with a more powerful regex engine before trying to parse it.
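A minimal sketch of such a postprocessing step, in Python. It assumes the emitted text is already unambiguous (every literal backslash appears as \\, every quote as \", and every unprintable byte as \xHH), and it maps byte HH to the code point U+00HH, as in the ideal output above. The function name x_escapes_to_json is made up for illustration.

```python
import json
import re

# Consume one backslash escape at a time: either \xHH or a backslash followed
# by any single character. Scanning left to right this way means an escaped
# backslash (\\) is swallowed as a unit, so a literal "x81" after it is left alone.
_ESCAPE = re.compile(r"\\(?:x([0-9A-Fa-f]{2})|.)", re.DOTALL)

def x_escapes_to_json(text: str) -> str:
    """Rewrite \\xHH escapes as \\u00HH; leave every other escape untouched."""
    def fix(match: "re.Match[str]") -> str:
        hex_digits = match.group(1)
        if hex_digits is not None:
            return "\\u00" + hex_digits
        return match.group(0)
    return _ESCAPE.sub(fix, text)

# The unparseable escape_string-style text from the report becomes valid JSON.
fixed = x_escapes_to_json(r'"\"\\\x81"')
decoded = json.loads(fixed)

# Text that contains an escaped backslash followed by the letters x81 is unchanged.
untouched = x_escapes_to_json(r'"\"\\\\x81"')
```

Because the replacement is driven by a callback rather than a replacement template, it does not rely on capture-group references in the replacement string, which is the feature gsub appears to lack.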

(There is also the headache of dealing with strings with U+0000 in them, but I think it would be fair to declare that Not Your Problem.)

Environment

any

Status

Assignee

Unassigned

Reporter

Zack Weinberg

Labels

None

External issue ID

None

Components

Affects versions

2.5.4

Priority

Normal