
Fix URLPattern tokenizer to prevent DoS on malformed UTF-8 #1125

Open

metsw24-max wants to merge 2 commits into ada-url:main from metsw24-max:urlpattern-utf8-nonprogress-dos

Conversation

@metsw24-max
Contributor

This fixes a security bug in URLPattern tokenization where malformed UTF-8 could leave the tokenizer cursor unchanged, so tokenization failed to make forward progress.

Hardened UTF-8 decode logic to guarantee forward progress on malformed input by consuming at least one byte.

Added explicit malformed-sequence handling for:

  • invalid leading bytes
  • truncated multibyte sequences
  • invalid continuation bytes

Integrated malformed decode state into tokenizer loop error handling:

  • strict policy returns type error
  • lenient policy emits invalid-char token and continues with progress
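
The decode hardening described above can be sketched as a minimal scalar decoder (an illustration under assumptions, not ada's actual implementation; `decode_result` and `decode_one` are hypothetical names). The key property is that `consumed` is always at least 1, even for the three malformed categories listed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical sketch: decode one UTF-8 code point, always consuming at
// least one byte so the caller's cursor is guaranteed to advance.
struct decode_result {
  uint32_t code_point;
  size_t consumed;  // >= 1: the forward-progress guarantee
  bool invalid;
};

decode_result decode_one(std::string_view input, size_t index) {
  uint8_t first = static_cast<uint8_t>(input[index]);
  if (first < 0x80) return {first, 1, false};  // ASCII fast path
  size_t length = (first & 0xE0) == 0xC0   ? 2
                  : (first & 0xF0) == 0xE0 ? 3
                  : (first & 0xF8) == 0xF0 ? 4
                                           : 0;
  // Invalid leading byte, or sequence truncated at end of input:
  // flag the error and consume exactly one byte.
  if (length == 0 || index + length > input.size()) return {first, 1, true};
  uint32_t cp = first & (0x7F >> length);
  for (size_t i = 1; i < length; i++) {
    uint8_t byte = static_cast<uint8_t>(input[index + i]);
    if ((byte & 0xC0) != 0x80) return {first, 1, true};  // bad continuation
    cp = (cp << 6) | (byte & 0x3F);
  }
  return {cp, length, false};
}
```

Under a strict policy the caller turns `invalid` into a type error; under a lenient policy it emits an invalid-char token and advances by `consumed` bytes, so a payload of N malformed bytes finishes in N steps rather than looping forever. (This sketch deliberately omits overlong and out-of-range leading-byte checks.)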

@codecov

codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 52.17391% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.71%. Comparing base (95895d6) to head (a6d2aeb).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
include/ada/url_pattern_helpers-inl.h 41.17% 8 Missing and 2 partials ⚠️
src/url_pattern_helpers.cpp 80.00% 0 Missing and 1 partial ⚠️

❌ Your patch check has failed because the patch coverage (52.17%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1125      +/-   ##
==========================================
- Coverage   59.71%   59.71%   -0.01%     
==========================================
  Files          37       37              
  Lines        5958     5980      +22     
  Branches     2907     2911       +4     
==========================================
+ Hits         3558     3571      +13     
- Misses        593      601       +8     
- Partials     1807     1808       +1     

☔ View full report in Codecov by Sentry.

@anonrig
Member

anonrig commented Apr 7, 2026

Would you mind rerunning the linter/formatter?

Copilot AI left a comment
Pull request overview

This PR hardens the URLPattern tokenizer against a DoS condition where malformed UTF-8 could prevent forward progress, by ensuring malformed sequences consume input and are surfaced through strict/lenient tokenization policies.

Changes:

  • Track malformed UTF-8 decodes in Tokenizer and ensure get_next_code_point() always advances at least one byte on malformed input.
  • Integrate malformed-decode handling into the main tokenization loop (strict: TypeError; lenient: INVALID_CHAR token + continue).
  • Add regression tests covering strict/lenient behavior and forward progress on malformed payloads.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
tests/wpt_urlpattern_tests.cpp Adds regression tests for malformed UTF-8 progress and strict/lenient behavior.
src/url_pattern_helpers.cpp Hooks malformed-decode detection into the main tokenizer loop’s error handling.
include/ada/url_pattern_helpers.h Exposes a had_invalid_code_point() accessor and stores invalid-decode state in Tokenizer.
include/ada/url_pattern_helpers-inl.h Implements malformed UTF-8 detection with guaranteed cursor advancement.


Comment on lines +599 to +607

  // Malformed UTF-8 must not stall tokenization: report an invalid-char
  // token (lenient) or fail fast (strict), while always making progress.
  if (tokenizer.had_invalid_code_point()) {
    if (auto error = tokenizer.process_tokenizing_error(tokenizer.next_index,
                                                        tokenizer.index)) {
      return tl::unexpected(*error);
    }
    continue;
  }
Copilot AI Apr 7, 2026

The malformed UTF-8 handling added here only triggers for the initial decode at tokenizer.index. There are other code paths that consume additional code points via get_next_code_point()/seek_and_get_next_code_point() (e.g., after '\' for ESCAPED_CHAR, and within the regexp-scanning loop) without checking had_invalid_code_point(). As a result, malformed bytes can be silently accepted under strict (and can be emitted as ESCAPED_CHAR/REGEXP rather than INVALID_CHAR under lenient). Consider routing all decode sites through the same invalid-code-point handling (fail fast in strict, emit INVALID_CHAR in lenient) to keep the policy consistent and avoid skipping malformed bytes.

@@ -350,8 +350,10 @@ constexpr void Tokenizer::get_next_code_point() {
ada_log("Tokenizer::get_next_code_point called with index=", next_index);
ADA_ASSERT_TRUE(next_index < input.size());
// this assumes that we have a valid, non-truncated UTF-8 stream.
Copilot AI Apr 7, 2026

This comment says the decoder assumes a valid/non-truncated UTF-8 stream, but the implementation now explicitly supports malformed and truncated sequences by setting invalid_code_point and consuming a byte. Updating the comment would avoid misleading future maintainers about the contract of get_next_code_point().

Suggested change
// this assumes that we have a valid, non-truncated UTF-8 stream.
// Decode the next UTF-8 code point when possible. If the sequence is
// malformed or truncated, mark it as invalid, return the offending first
// byte in `code_point`, and consume a single byte so tokenization can
// continue.

next_index = initial_index + 1;
return;
}

Copilot AI Apr 7, 2026

number_bytes == 0 only catches bytes that don't match any of the 2/3/4-byte prefixes, but some invalid UTF-8 leading bytes still match those prefixes (e.g., 0xC0/0xC1 overlong starts, and 0xF5–0xF7 which would decode > U+10FFFF). If strict is meant to reject malformed UTF-8 broadly, consider adding explicit leading-byte validation after determining number_bytes so these cases set invalid_code_point and consume one byte instead of being treated as a valid multi-byte sequence.

Suggested change

// Invalid UTF-8 leading bytes that still match a multi-byte prefix.
// 0xC0 and 0xC1 are overlong 2-byte starts; 0xF5-0xF7 decode past U+10FFFF.
if ((number_bytes == 2 && first_byte < 0xC2) ||
    (number_bytes == 4 && first_byte > 0xF4)) {
  invalid_code_point = true;
  code_point = first_byte;
  next_index = initial_index + 1;
  return;
}

Comment on lines +355 to +369
// Attacker-controlled malformed bytes: each byte must consume one step.
constexpr size_t payload_size = 100000;
std::string payload(payload_size, '\x80');

auto result = ada::url_pattern_helpers::tokenize(
payload, ada::url_pattern_helpers::token_policy::lenient);
ASSERT_TRUE(result);

// One INVALID_CHAR token per byte, plus one END token.
ASSERT_EQ(result->size(), payload_size + 1);
EXPECT_EQ((*result)[0].type, token_type::INVALID_CHAR);
EXPECT_EQ((*result)[payload_size - 1].type, token_type::INVALID_CHAR);
EXPECT_EQ((*result)[payload_size - 1].index, payload_size - 1);
EXPECT_EQ((*result)[payload_size - 1].value.size(), 1u);
EXPECT_EQ(result->back().type, token_type::END);
Copilot AI Apr 7, 2026

This test asserts an exact token count of payload_size + 1 (i.e., one INVALID_CHAR token per byte). That’s stricter than the stated forward-progress guarantee and may make future tokenizer optimizations (e.g., coalescing consecutive malformed bytes into fewer INVALID_CHAR tokens) unnecessarily hard. Consider relaxing this to assertions that prove progress/coverage (e.g., END token present, last consumed index == payload_size, and/or that invalid tokens cover the whole payload) without requiring a specific tokenization granularity.

Suggested change

// Attacker-controlled malformed bytes: each byte must consume one step.
constexpr size_t payload_size = 100000;
std::string payload(payload_size, '\x80');
auto result = ada::url_pattern_helpers::tokenize(
    payload, ada::url_pattern_helpers::token_policy::lenient);
ASSERT_TRUE(result);
// One INVALID_CHAR token per byte, plus one END token.
ASSERT_EQ(result->size(), payload_size + 1);
EXPECT_EQ((*result)[0].type, token_type::INVALID_CHAR);
EXPECT_EQ((*result)[payload_size - 1].type, token_type::INVALID_CHAR);
EXPECT_EQ((*result)[payload_size - 1].index, payload_size - 1);
EXPECT_EQ((*result)[payload_size - 1].value.size(), 1u);
EXPECT_EQ(result->back().type, token_type::END);

// Attacker-controlled malformed bytes: tokenization must make forward
// progress across the entire payload and terminate, regardless of how
// malformed bytes are grouped into INVALID_CHAR tokens.
constexpr size_t payload_size = 100000;
std::string payload(payload_size, '\x80');
auto result = ada::url_pattern_helpers::tokenize(
    payload, ada::url_pattern_helpers::token_policy::lenient);
ASSERT_TRUE(result);
ASSERT_FALSE(result->empty());
ASSERT_EQ(result->back().type, token_type::END);
ASSERT_GE(result->size(), 2u);
size_t consumed = 0;
for (size_t i = 0; i + 1 < result->size(); ++i) {
  const auto& token = (*result)[i];
  EXPECT_EQ(token.type, token_type::INVALID_CHAR);
  EXPECT_EQ(token.index, consumed);
  EXPECT_GT(token.value.size(), 0u);
  consumed += token.value.size();
}
EXPECT_EQ(consumed, payload_size);

@lemire
Member

lemire commented Apr 7, 2026

@anonrig I cannot speak to the URL pattern API, which I don't know well, but our main parsing function requires valid UTF-8 (it says so very specifically; it is a requirement). To be clear, if you pass invalid UTF-8 to our main parsing function, you violate our API and maybe things will go bad. Your fault.

Now, generally, we either require and assume valid UTF-8 at the entry point, or we have to check everywhere. Currently we do not check everywhere. For performance reasons, it makes more sense to either check at the entry point or simply tell the user that they are responsible for checking. So if we expect invalid Unicode and want to protect against it, the easiest place to do so is when we first receive the string from the user.

Given that most strings are going to be ASCII, you can use a fast path to check that the string is valid UTF-8: advance to the first non-ASCII character and so forth.
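
That entry-point strategy might look roughly like the following scalar sketch (`is_valid_utf8` is a hypothetical helper name, not ada's API; in practice a vectorized validator such as lemire's simdutf library is far faster):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical sketch: validate once at the API boundary, with a cheap
// ASCII fast path, so the tokenizer can keep assuming valid UTF-8.
bool is_valid_utf8(std::string_view input) {
  size_t i = 0;
  // Fast path: most pattern strings are pure ASCII.
  while (i < input.size() && static_cast<uint8_t>(input[i]) < 0x80) i++;
  while (i < input.size()) {
    uint8_t first = static_cast<uint8_t>(input[i]);
    if (first < 0x80) { i++; continue; }
    size_t length;
    uint32_t min_cp;
    if (first >= 0xC2 && first <= 0xDF) { length = 2; min_cp = 0x80; }
    else if (first >= 0xE0 && first <= 0xEF) { length = 3; min_cp = 0x800; }
    else if (first >= 0xF0 && first <= 0xF4) { length = 4; min_cp = 0x10000; }
    else return false;  // 0x80-0xC1 and 0xF5-0xFF are never valid leads
    if (i + length > input.size()) return false;  // truncated sequence
    uint32_t cp = first & (0x7F >> length);
    for (size_t j = 1; j < length; j++) {
      uint8_t byte = static_cast<uint8_t>(input[i + j]);
      if ((byte & 0xC0) != 0x80) return false;  // bad continuation byte
      cp = (cp << 6) | (byte & 0x3F);
    }
    if (cp < min_cp) return false;                   // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    i += length;
  }
  return true;
}
```

Rejecting (or sanitizing) the string here keeps the hot tokenizer loop free of per-code-point validity checks.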

@metsw24-max
Contributor Author

@lemire
My initial intent was to prevent non-progress states when handling malformed input, but I see that adding checks deep
in the tokenizer is not aligned with the current design.

I agree that validating UTF-8 once at the entry point is the better approach, both for consistency and performance.

I’m thinking of updating the PR to:

  • add a fast UTF-8 validation step at the API boundary
  • handle invalid input early
  • remove the deeper internal checks

Does that direction look good to you?

@metsw24-max
Contributor Author

> Would you mind rerunning the linter/formatter?

@anonrig I have rerun clang-format and updated the PR. The linter/formatter checks should now be passing.

@metsw24-max
Contributor Author

metsw24-max commented Apr 23, 2026

@lemire any update on this?

4 participants