# Fix URLPattern tokenizer to prevent DoS on malformed UTF-8 #1125

metsw24-max wants to merge 2 commits into `ada-url:main`
Conversation
**Codecov Report**

❌ Your patch check has failed because the patch coverage (52.17%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1125      +/-   ##
==========================================
- Coverage   59.71%   59.71%   -0.01%
==========================================
  Files          37       37
  Lines        5958     5980      +22
  Branches     2907     2911       +4
==========================================
+ Hits         3558     3571      +13
- Misses        593      601       +8
- Partials     1807     1808      +1
```
Would you mind rerunning the linter/formatter?
Pull request overview
This PR hardens the URLPattern tokenizer against a DoS condition where malformed UTF-8 could prevent forward progress, by ensuring malformed sequences consume input and are surfaced through strict/lenient tokenization policies.
Changes:
- Track malformed UTF-8 decodes in `Tokenizer` and ensure `get_next_code_point()` always advances at least one byte on malformed input.
- Integrate malformed-decode handling into the main tokenization loop (strict: TypeError; lenient: INVALID_CHAR token + continue).
- Add regression tests covering strict/lenient behavior and forward progress on malformed payloads.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/wpt_urlpattern_tests.cpp` | Adds regression tests for malformed UTF-8 progress and strict/lenient behavior. |
| `src/url_pattern_helpers.cpp` | Hooks malformed-decode detection into the main tokenizer loop's error handling. |
| `include/ada/url_pattern_helpers.h` | Exposes a `had_invalid_code_point()` accessor and stores invalid-decode state in `Tokenizer`. |
| `include/ada/url_pattern_helpers-inl.h` | Implements malformed UTF-8 detection with guaranteed cursor advancement. |
```cpp
// Malformed UTF-8 must not stall tokenization: report an invalid-char
// token (lenient) or fail fast (strict), while always making progress.
if (tokenizer.had_invalid_code_point()) {
  if (auto error = tokenizer.process_tokenizing_error(tokenizer.next_index,
                                                      tokenizer.index)) {
    return tl::unexpected(*error);
  }
  continue;
}
```
The malformed UTF-8 handling added here only triggers for the initial decode at tokenizer.index. There are other code paths that consume additional code points via get_next_code_point()/seek_and_get_next_code_point() (e.g., after '\' for ESCAPED_CHAR, and within the regexp-scanning loop) without checking had_invalid_code_point(). As a result, malformed bytes can be silently accepted under strict (and can be emitted as ESCAPED_CHAR/REGEXP rather than INVALID_CHAR under lenient). Consider routing all decode sites through the same invalid-code-point handling (fail fast in strict, emit INVALID_CHAR in lenient) to keep the policy consistent and avoid skipping malformed bytes.
```diff
@@ -350,8 +350,10 @@ constexpr void Tokenizer::get_next_code_point() {
   ada_log("Tokenizer::get_next_code_point called with index=", next_index);
   ADA_ASSERT_TRUE(next_index < input.size());
   // this assumes that we have a valid, non-truncated UTF-8 stream.
```
This comment says the decoder assumes a valid/non-truncated UTF-8 stream, but the implementation now explicitly supports malformed and truncated sequences by setting invalid_code_point and consuming a byte. Updating the comment would avoid misleading future maintainers about the contract of get_next_code_point().
```diff
- // this assumes that we have a valid, non-truncated UTF-8 stream.
+ // Decode the next UTF-8 code point when possible. If the sequence is
+ // malformed or truncated, mark it as invalid, return the offending first
+ // byte in `code_point`, and consume a single byte so tokenization can
+ // continue.
```
```cpp
  next_index = initial_index + 1;
  return;
}
```
number_bytes == 0 only catches bytes that don't match any of the 2/3/4-byte prefixes, but some invalid UTF-8 leading bytes still match those prefixes (e.g., 0xC0/0xC1 overlong starts, and 0xF5–0xF7 which would decode > U+10FFFF). If strict is meant to reject malformed UTF-8 broadly, consider adding explicit leading-byte validation after determining number_bytes so these cases set invalid_code_point and consume one byte instead of being treated as a valid multi-byte sequence.
```cpp
// Invalid UTF-8 leading bytes that still match a multi-byte prefix.
// 0xC0 and 0xC1 are overlong 2-byte starts; 0xF5-0xF7 decode past U+10FFFF.
if ((number_bytes == 2 && first_byte < 0xC2) ||
    (number_bytes == 4 && first_byte > 0xF4)) {
  invalid_code_point = true;
  code_point = first_byte;
  next_index = initial_index + 1;
  return;
}
```
```cpp
// Attacker-controlled malformed bytes: each byte must consume one step.
constexpr size_t payload_size = 100000;
std::string payload(payload_size, '\x80');

auto result = ada::url_pattern_helpers::tokenize(
    payload, ada::url_pattern_helpers::token_policy::lenient);
ASSERT_TRUE(result);

// One INVALID_CHAR token per byte, plus one END token.
ASSERT_EQ(result->size(), payload_size + 1);
EXPECT_EQ((*result)[0].type, token_type::INVALID_CHAR);
EXPECT_EQ((*result)[payload_size - 1].type, token_type::INVALID_CHAR);
EXPECT_EQ((*result)[payload_size - 1].index, payload_size - 1);
EXPECT_EQ((*result)[payload_size - 1].value.size(), 1u);
EXPECT_EQ(result->back().type, token_type::END);
```
This test asserts an exact token count of payload_size + 1 (i.e., one INVALID_CHAR token per byte). That’s stricter than the stated forward-progress guarantee and may make future tokenizer optimizations (e.g., coalescing consecutive malformed bytes into fewer INVALID_CHAR tokens) unnecessarily hard. Consider relaxing this to assertions that prove progress/coverage (e.g., END token present, last consumed index == payload_size, and/or that invalid tokens cover the whole payload) without requiring a specific tokenization granularity.
```diff
- // Attacker-controlled malformed bytes: each byte must consume one step.
- constexpr size_t payload_size = 100000;
- std::string payload(payload_size, '\x80');
- auto result = ada::url_pattern_helpers::tokenize(
-     payload, ada::url_pattern_helpers::token_policy::lenient);
- ASSERT_TRUE(result);
- // One INVALID_CHAR token per byte, plus one END token.
- ASSERT_EQ(result->size(), payload_size + 1);
- EXPECT_EQ((*result)[0].type, token_type::INVALID_CHAR);
- EXPECT_EQ((*result)[payload_size - 1].type, token_type::INVALID_CHAR);
- EXPECT_EQ((*result)[payload_size - 1].index, payload_size - 1);
- EXPECT_EQ((*result)[payload_size - 1].value.size(), 1u);
- EXPECT_EQ(result->back().type, token_type::END);
+ // Attacker-controlled malformed bytes: tokenization must make forward
+ // progress across the entire payload and terminate, regardless of how
+ // malformed bytes are grouped into INVALID_CHAR tokens.
+ constexpr size_t payload_size = 100000;
+ std::string payload(payload_size, '\x80');
+ auto result = ada::url_pattern_helpers::tokenize(
+     payload, ada::url_pattern_helpers::token_policy::lenient);
+ ASSERT_TRUE(result);
+ ASSERT_FALSE(result->empty());
+ ASSERT_EQ(result->back().type, token_type::END);
+ ASSERT_GE(result->size(), 2u);
+ size_t consumed = 0;
+ for (size_t i = 0; i + 1 < result->size(); ++i) {
+   const auto& token = (*result)[i];
+   EXPECT_EQ(token.type, token_type::INVALID_CHAR);
+   EXPECT_EQ(token.index, consumed);
+   EXPECT_GT(token.value.size(), 0u);
+   consumed += token.value.size();
+ }
+ EXPECT_EQ(consumed, payload_size);
```
@anonrig I cannot speak to the URLPattern API, which I don't know well, but our main parsing function requires valid UTF-8 (it says so very specifically; it is a requirement). To be clear, if you pass invalid UTF-8 to our main parsing function, you violate our API and things may go bad. Your fault.

Generally, we either require and assume valid UTF-8 at the entry point, or we have to check everywhere. Currently we do not check everywhere. For performance reasons, it makes more sense to either check at the entry point or to tell the user that they are responsible for checking.

Now, if we expect invalid Unicode and want to protect against it, the easiest place to do so is when we first receive the string from the user. Given that most strings are going to be ASCII, you can use a fast path to check that the string is valid UTF-8: advance to the first non-ASCII character and so forth.
@lemire I agree that validating UTF-8 once at the entry point is the better approach, both for consistency and performance. I’m thinking of updating the PR to:
Does that direction look good to you?
@anonrig I have rerun clang-format and updated the PR. The linter/formatter checks should now be passing. |
@lemire any update on this?
This fixes a security bug in URLPattern tokenization where malformed UTF-8 could leave the tokenizer cursor unchanged, preventing forward progress (a DoS vector).

- Hardened the UTF-8 decode logic to guarantee forward progress on malformed input by consuming at least one byte.
- Added explicit malformed-sequence handling for:
- Integrated malformed decode state into tokenizer loop error handling: