Hello and thank you in advance for taking a look at my issue!
Summary:
- Problem: Data is unexpectedly truncated when parsing strings with multi-byte characters.
- Suspected problem line: slither/section.rb:63
- Slither Definition:
Slither.define :my_definition, by_bytes: false do |d|
  d.body do |body|
    body.trap { |line| line[0, 4] =~ /[^(HEAD|FOOT)]/ }
    body.column :foo, 4
  end
end
# pry console in slither/section.rb:63
line = "J�HN"
line.encoding #=> Encoding:UTF-8
line.length #=> 4
line.bytesize #=> 6
unpacker #=> "A4"
line_data = line.unpack(unpacker) #=> ["J\xEF\xBF\xBD"]
line_data[0].encoding #=> Encoding:ASCII-8BIT
line_data[0].length #=> 4
line_data[0].bytesize #=> 4
line_data[0].force_encoding('UTF-8') #=> "J�"
The long version:
My current understanding of what Slither is doing: Section.parse(line) calls unpack(unpacker) from the standard library, with unpacker defined in the Column class. The unpacker takes the column length from the Slither definition and prefixes it with A; in this case it is A4. line.unpack(unpacker) takes a four-character line like JOHN and returns the expected array, ["JOHN"]. However, the same definition unpacks another four-character line containing a multi-byte character, such as J�HN, into the ASCII-8BIT encoded ["J\xEF\xBF\xBD"], which is equivalent to "J�" in UTF-8. "J\xEF\xBF\xBD".length and bytesize are both four, so A4 appears to count bytes rather than characters, and the trailing HN is silently lost.
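For reference, the behavior seems reproducible without Slither at all. The snippet below is only a minimal sketch using plain Ruby: the variable names are mine, and "\uFFFD" stands in for the three-byte character shown as � in the console output.

```ruby
# Stand-in for the parsed line; "\uFFFD" is a 3-byte character in UTF-8.
line = "J\uFFFDHN"

line.length   # 4 characters
line.bytesize # 6 bytes

# The "A4" directive consumes 4 bytes, not 4 characters,
# so it stops right after the multi-byte character.
by_bytes = line.unpack("A4").first
by_bytes.bytes # [74, 239, 191, 189], i.e. "J" plus the 3 bytes of U+FFFD

# Reinterpreting those bytes as UTF-8 gives only two characters.
truncated = by_bytes.dup.force_encoding("UTF-8")
truncated.length # 2

# Character-based slicing keeps the whole four-character field.
by_chars = line[0, 4]
```

Here by_chars still equals the original line, while truncated has lost the trailing HN.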
I would expect that setting by_bytes: false in the definition would make column widths depend on character count only, treating each multi-byte character as a single character. I hope I am not missing something here, and I would greatly appreciate your help interfacing with the gem, as right now this appears to be a bug to my eyes. Thanks!
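To illustrate the behavior I expected, here is a rough sketch of character-based column splitting. split_by_chars is a hypothetical helper of my own, not part of Slither's API, and it only approximates how the A directive strips trailing padding.

```ruby
# Hypothetical helper (not part of Slither): split a line into fixed-width
# columns by character count instead of byte count.
def split_by_chars(line, widths)
  offset = 0
  widths.map do |width|
    # String#[] with (start, length) counts characters for UTF-8 strings.
    field = line[offset, width].to_s
    offset += width
    # Roughly mimic the "A" directive, which drops trailing spaces and nulls.
    field.rstrip
  end
end

single = split_by_chars("J\uFFFDHN", [4])
multi  = split_by_chars("AB  CD\u00E9", [4, 3])
```

With this approach the four-character field comes back whole, and a second column that ends in a multi-byte character is not cut short.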