# Fix URLPattern tokenizer to prevent DoS on malformed UTF-8 #1125

metsw24-max wants to merge 2 commits into `ada-url:main`
Conversation
**Codecov Report**

❌ Your patch check has failed because the patch coverage (52.17%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1125      +/-   ##
==========================================
- Coverage   59.71%   59.71%   -0.01%
==========================================
  Files          37       37
  Lines        5958     5980      +22
  Branches     2907     2911       +4
==========================================
+ Hits         3558     3571      +13
- Misses        593      601       +8
- Partials     1807     1808      +1
```
Would you mind rerunning the linter/formatter?
Pull request overview
This PR hardens the URLPattern tokenizer against a DoS condition where malformed UTF-8 could prevent forward progress, by ensuring malformed sequences consume input and are surfaced through strict/lenient tokenization policies.
Changes:
- Track malformed UTF-8 decodes in `Tokenizer` and ensure `get_next_code_point()` always advances at least one byte on malformed input.
- Integrate malformed-decode handling into the main tokenization loop (strict: TypeError; lenient: INVALID_CHAR token + continue).
- Add regression tests covering strict/lenient behavior and forward progress on malformed payloads.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/wpt_urlpattern_tests.cpp` | Adds regression tests for malformed UTF-8 progress and strict/lenient behavior. |
| `src/url_pattern_helpers.cpp` | Hooks malformed-decode detection into the main tokenizer loop's error handling. |
| `include/ada/url_pattern_helpers.h` | Exposes a `had_invalid_code_point()` accessor and stores invalid-decode state in `Tokenizer`. |
| `include/ada/url_pattern_helpers-inl.h` | Implements malformed UTF-8 detection with guaranteed cursor advancement. |
```cpp
// Malformed UTF-8 must not stall tokenization: report an invalid-char
// token (lenient) or fail fast (strict), while always making progress.
if (tokenizer.had_invalid_code_point()) {
  if (auto error = tokenizer.process_tokenizing_error(tokenizer.next_index,
                                                      tokenizer.index)) {
    return tl::unexpected(*error);
  }
  continue;
}
```
The malformed UTF-8 handling added here only triggers for the initial decode at tokenizer.index. There are other code paths that consume additional code points via get_next_code_point()/seek_and_get_next_code_point() (e.g., after '\' for ESCAPED_CHAR, and within the regexp-scanning loop) without checking had_invalid_code_point(). As a result, malformed bytes can be silently accepted under strict (and can be emitted as ESCAPED_CHAR/REGEXP rather than INVALID_CHAR under lenient). Consider routing all decode sites through the same invalid-code-point handling (fail fast in strict, emit INVALID_CHAR in lenient) to keep the policy consistent and avoid skipping malformed bytes.
```diff
@@ -350,8 +350,10 @@ constexpr void Tokenizer::get_next_code_point() {
   ada_log("Tokenizer::get_next_code_point called with index=", next_index);
   ADA_ASSERT_TRUE(next_index < input.size());
   // this assumes that we have a valid, non-truncated UTF-8 stream.
```
This comment says the decoder assumes a valid/non-truncated UTF-8 stream, but the implementation now explicitly supports malformed and truncated sequences by setting invalid_code_point and consuming a byte. Updating the comment would avoid misleading future maintainers about the contract of get_next_code_point().
```diff
- // this assumes that we have a valid, non-truncated UTF-8 stream.
+ // Decode the next UTF-8 code point when possible. If the sequence is
+ // malformed or truncated, mark it as invalid, return the offending first
+ // byte in `code_point`, and consume a single byte so tokenization can
+ // continue.
```
```cpp
  next_index = initial_index + 1;
  return;
}
```
number_bytes == 0 only catches bytes that don't match any of the 2/3/4-byte prefixes, but some invalid UTF-8 leading bytes still match those prefixes (e.g., 0xC0/0xC1 overlong starts, and 0xF5–0xF7 which would decode > U+10FFFF). If strict is meant to reject malformed UTF-8 broadly, consider adding explicit leading-byte validation after determining number_bytes so these cases set invalid_code_point and consume one byte instead of being treated as a valid multi-byte sequence.
```cpp
// Invalid UTF-8 leading bytes that still match a multi-byte prefix.
// 0xC0 and 0xC1 are overlong 2-byte starts; 0xF5-0xF7 decode past U+10FFFF.
if ((number_bytes == 2 && first_byte < 0xC2) ||
    (number_bytes == 4 && first_byte > 0xF4)) {
  invalid_code_point = true;
  code_point = first_byte;
  next_index = initial_index + 1;
  return;
}
```
```cpp
// Attacker-controlled malformed bytes: each byte must consume one step.
constexpr size_t payload_size = 100000;
std::string payload(payload_size, '\x80');

auto result = ada::url_pattern_helpers::tokenize(
    payload, ada::url_pattern_helpers::token_policy::lenient);
ASSERT_TRUE(result);

// One INVALID_CHAR token per byte, plus one END token.
ASSERT_EQ(result->size(), payload_size + 1);
EXPECT_EQ((*result)[0].type, token_type::INVALID_CHAR);
EXPECT_EQ((*result)[payload_size - 1].type, token_type::INVALID_CHAR);
EXPECT_EQ((*result)[payload_size - 1].index, payload_size - 1);
EXPECT_EQ((*result)[payload_size - 1].value.size(), 1u);
EXPECT_EQ(result->back().type, token_type::END);
```
This test asserts an exact token count of payload_size + 1 (i.e., one INVALID_CHAR token per byte). That’s stricter than the stated forward-progress guarantee and may make future tokenizer optimizations (e.g., coalescing consecutive malformed bytes into fewer INVALID_CHAR tokens) unnecessarily hard. Consider relaxing this to assertions that prove progress/coverage (e.g., END token present, last consumed index == payload_size, and/or that invalid tokens cover the whole payload) without requiring a specific tokenization granularity.
```diff
- // Attacker-controlled malformed bytes: each byte must consume one step.
- constexpr size_t payload_size = 100000;
- std::string payload(payload_size, '\x80');
- auto result = ada::url_pattern_helpers::tokenize(
-     payload, ada::url_pattern_helpers::token_policy::lenient);
- ASSERT_TRUE(result);
- // One INVALID_CHAR token per byte, plus one END token.
- ASSERT_EQ(result->size(), payload_size + 1);
- EXPECT_EQ((*result)[0].type, token_type::INVALID_CHAR);
- EXPECT_EQ((*result)[payload_size - 1].type, token_type::INVALID_CHAR);
- EXPECT_EQ((*result)[payload_size - 1].index, payload_size - 1);
- EXPECT_EQ((*result)[payload_size - 1].value.size(), 1u);
- EXPECT_EQ(result->back().type, token_type::END);
+ // Attacker-controlled malformed bytes: tokenization must make forward
+ // progress across the entire payload and terminate, regardless of how
+ // malformed bytes are grouped into INVALID_CHAR tokens.
+ constexpr size_t payload_size = 100000;
+ std::string payload(payload_size, '\x80');
+ auto result = ada::url_pattern_helpers::tokenize(
+     payload, ada::url_pattern_helpers::token_policy::lenient);
+ ASSERT_TRUE(result);
+ ASSERT_FALSE(result->empty());
+ ASSERT_EQ(result->back().type, token_type::END);
+ ASSERT_GE(result->size(), 2u);
+ size_t consumed = 0;
+ for (size_t i = 0; i + 1 < result->size(); ++i) {
+   const auto& token = (*result)[i];
+   EXPECT_EQ(token.type, token_type::INVALID_CHAR);
+   EXPECT_EQ(token.index, consumed);
+   EXPECT_GT(token.value.size(), 0u);
+   consumed += token.value.size();
+ }
+ EXPECT_EQ(consumed, payload_size);
```
@anonrig I cannot speak to the URLPattern API, which I don't know well, but our main parsing function requires valid UTF-8 (it says so very specifically; it is a requirement). To be clear, if you pass invalid UTF-8 to our main parsing function, you violate our API and things may go bad. Your fault.

Generally, we either require and assume valid UTF-8 at the entry point, or we have to check everywhere. Currently we do not check everywhere. For performance reasons, it makes more sense to either check at the entry point or to tell the user that they are responsible for checking.

Now, if we expect invalid Unicode and want to protect against it, the easiest place to do so is when we first receive the string from the user. Given that most strings are going to be ASCII, you can use a fast path to check that the string is valid UTF-8: advance to the first non-ASCII character and so forth.
@lemire I agree that validating UTF-8 once at the entry point is the better approach, both for consistency and performance. I’m thinking of updating the PR to:
Does that direction look good to you?
@anonrig I have rerun clang-format and updated the PR. The linter/formatter checks should now be passing. |
@lemire any update on this?
This fixes a security bug in URLPattern tokenization where malformed UTF-8 could leave the tokenizer cursor unchanged, preventing forward progress (a DoS vector).

- Hardened the UTF-8 decode logic to guarantee forward progress on malformed input by consuming at least one byte.
- Added explicit malformed-sequence handling for:
- Integrated malformed decode state into tokenizer loop error handling: