Unexpected Truncation for Strings with Multi-Byte Characters

Hello and thank you in advance for taking a look at my issue!

Summary:
- Problem: Data unexpectedly truncated when parsing strings with multi-byte characters.
- Suspected problem line: `slither/section.rb:63`
- Slither Definition:
```
Slither.define :my_definition, by_bytes: false do |d|
  d.body do |body|
    body.trap { |line| line[0, 4] =~ /[^(HEAD|FOOT)]/ }
    body.column :foo, 4
  end
end
```

- General flow
```
# pry console in slither/section.rb:63
line = "J�HN"
line.encoding #=> Encoding:UTF-8
line.length #=> 4
line.bytesize #=> 6
unpacker #=> "A4"
line_data = line.unpack(unpacker) #=> ["J\xEF\xBF\xBD"]
line_data[0].encoding #=> Encoding:ASCII-8BIT
line_data[0].length #=> 4
line_data[0].bytesize #=> 4
line_data[0].force_encoding('UTF-8') #=> "J�"
```

The long version:
My current understanding of what `Slither` is doing: `Section.parse(line)` calls Ruby's `String#unpack` with an `unpacker` format string built in the `Column` class. The unpacker takes the column length from the Slither definition and prefixes it with `A`, which gives `A4` in this case. For a four-character ASCII line like `JOHN`, `line.unpack(unpacker)` returns `["JOHN"]` as expected. However, the `A` directive counts bytes, not characters, so the same definition unpacks a four-character line containing a multi-byte character, such as `J�HN` (six bytes), into the `ASCII-8BIT` encoded `["J\xEF\xBF\xBD"]`, which is `"J�"` when read as UTF-8. In `ASCII-8BIT`, `"J\xEF\xBF\xBD".length` and `bytesize` are both four, so the column looks full, but the trailing `HN` has been silently dropped.
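For reference, here is a small script that shows the same behavior without Slither or pry. It only uses `String#unpack` and `Array#pack`; the variable names are my own, and the `U` directive is included purely as a contrast to `A`, not as something Slither uses as far as I know.

```ruby
ascii = "JOHN"
multi = "J\uFFFDHN" # same string as in the pry session: 4 characters, 6 bytes

# "A4" takes the first four BYTES of the string.
bytes_ascii = ascii.unpack("A4").first
bytes_multi = multi.unpack("A4").first

# "U4" takes the first four CODEPOINTS; pack("U*") turns them back into UTF-8.
chars_multi = multi.unpack("U4").pack("U*")

puts bytes_ascii          # JOHN
puts bytes_multi.bytesize # 4
puts chars_multi.length   # 4
```

With `A4` the multi-byte line is cut after the three-byte character, while the codepoint-based round trip keeps all four characters.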

I would expect that setting `by_bytes: false` in the definition would make column widths depend on character count only, treating each multi-byte character as a single character. I hope I am not missing something here, and I would greatly appreciate your help with using the gem, as right now this looks like a bug to me. Thanks!
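To make the expectation concrete, here is a rough sketch of the character-based slicing I had in mind. `slice_by_chars` is a hypothetical helper of my own, not part of Slither's API, and it assumes the line is a valid UTF-8 string and that trailing padding should be stripped the way the `A` directive does.

```ruby
# Hypothetical helper, NOT part of Slither: splits a line into fixed-width
# columns, counting characters rather than bytes.
def slice_by_chars(line, widths)
  offset = 0
  widths.map do |width|
    # String#[] with (start, length) counts characters for a UTF-8 string.
    field = line[offset, width].to_s
    offset += width
    field.rstrip # drop trailing padding, roughly like the "A" directive
  end
end

line = "J\uFFFDHN"
foo = slice_by_chars(line, [4]).first

puts foo.length   # 4
puts foo.bytesize # 6
```

This keeps the whole four-character value for the `foo` column, which is what I expected from `by_bytes: false`.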

Unexpected Truncation for Strings with Multi-Byte Characters #33
