Unexpected Truncation for Strings with Multi-Byte Characters #33

@nickdbmiller

Description

Hello and thank you in advance for taking a look at my issue!

Summary:

  • Problem: Data unexpectedly truncated when parsing strings with multi-byte characters.
  • Suspected problem line: slither/section.rb:63
  • Slither Definition:
Slither.define :my_definition, by_bytes: false do |d|
  d.body do |body|
    body.trap { |line| line[0, 4] =~ /[^(HEAD|FOOT)]/ }
    body.column :foo, 4
  end
end
  • General flow
# pry console in slither/section.rb:63
line = "J�HN"
line.encoding #=> Encoding:UTF-8
line.length #=> 4
line.bytesize #=> 6
unpacker #=> "A4"
line_data = line.unpack(unpacker) #=> ["J\xEF\xBF\xBD"]
line_data[0].encoding #=> Encoding:ASCII-8BIT
line_data[0].length #=> 4
line_data[0].bytesize #=> 4
line_data[0].force_encoding('UTF-8') #=> "J�"

The long version:
My current understanding of what Slither is doing: Section.parse(line) calls Ruby's String#unpack with an unpacker string built in the Column class. The unpacker takes the column length from the Slither definition and prefixes it with the A directive, giving A4 in this case. For a four-character ASCII line like JOHN, line.unpack(unpacker) returns ["JOHN"] as expected.

However, the count in A4 is a number of bytes, not characters. The same definition therefore unpacks a four-character line containing a multi-byte character, like J�HN (six bytes), into the ASCII-8BIT string ["J\xEF\xBF\xBD"]. Those four bytes decode to "J�" in UTF-8: the three-byte character survives, but the trailing HN is silently dropped.
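For anyone who wants to reproduce this without Slither, here is a minimal sketch that uses only Ruby's core String#unpack and String#[]. The variable names by_bytes and by_chars are mine, for illustration only, and are not Slither internals.

```ruby
# Standalone reproduction; no Slither code is involved.
line = "J\u{FFFD}HN"                 # 4 characters, 6 bytes in UTF-8

# "A4" consumes 4 bytes, so the 3-byte character uses up most of the budget
# and the trailing "HN" never makes it into the result.
by_bytes = line.unpack("A4").first

# String#[] with (start, length) counts characters for a UTF-8 string,
# so the whole 4-character field is returned.
by_chars = line[0, 4]

puts by_bytes.bytesize               # byte length of the unpacked field
puts by_chars.length                 # character length of the sliced field
```

The difference between the two results is exactly the data loss described above.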

I would expect that setting by_bytes: false in the definition would make column widths count characters, so that a multi-byte character is treated as a single character. I hope I am not missing something here and would greatly appreciate your help, as right now this looks like a bug to me. Thanks!
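To make the expectation concrete, this is roughly the behavior I had in mind for by_bytes: false. It is only a sketch: split_by_chars is a hypothetical helper I wrote for this issue, not part of Slither, and I am assuming the input line is already a valid UTF-8 string.

```ruby
# Hypothetical helper, not part of Slither: split a line into fixed-width
# columns where each width is measured in characters rather than bytes.
def split_by_chars(line, widths)
  offset = 0
  widths.map do |width|
    # String#[] returns nil when offset is past the end, hence the to_s.
    field = line[offset, width].to_s
    offset += width
    # The "A" directive drops trailing spaces and nulls; rstrip approximates that.
    field.rstrip
  end
end

fields = split_by_chars("J\u{FFFD}HN42  ", [4, 4])
puts fields.inspect
```

With this approach the first column keeps all four characters, and a line that is shorter than the declared widths yields empty strings for the missing columns.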
