
Optionally keep original title headers for main content extraction accuracy #1006

Open
mcPear wants to merge 1 commit into mozilla:main from surferseo:mg/title


mcPear commented Apr 17, 2026

Summary

Reader extraction currently rewrites all in-article h1 elements to h2 so that the article title remains the sole top-level heading in the reader UI. It also removes the first heading after the title that closely resembles it. These normalizations improve the classic “reader mode” presentation but weaken the semantic outline of the page: crawlers, SEO tooling, and systems that infer structure from HTML (including retrieval and “reverse engineering” of how a page is organized) rely on stable heading levels that match the publisher’s markup.
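To make the two normalizations concrete, here is a minimal sketch on a simplified model in which each heading is a `{ level, text }` object rather than a DOM node. The names `looksLikeTitle` and `normalizeHeadings` are hypothetical, and the exact-match comparison is an assumption standing in for whatever similarity check the extractor really uses.

```javascript
// Simplified model: a heading is { level, text }. Real extraction works on DOM nodes.

// Hypothetical stand-in for the extractor's title-similarity check:
// compares case-insensitively after collapsing whitespace.
function looksLikeTitle(headingText, title) {
  const norm = (s) => s.toLowerCase().replace(/\s+/g, " ").trim();
  return norm(headingText) === norm(title);
}

// Hypothetical helper that mimics the two normalizations described above.
function normalizeHeadings(headings, title) {
  let titleHeadingRemoved = false;
  const result = [];
  for (const h of headings) {
    // Drop only the first heading that repeats the article title.
    if (!titleHeadingRemoved && looksLikeTitle(h.text, title)) {
      titleHeadingRemoved = true;
      continue;
    }
    // Demote every remaining h1 to h2; other levels are left alone.
    result.push({ level: h.level === 1 ? 2 : h.level, text: h.text });
  }
  return result;
}

const source = [
  { level: 1, text: "Heading Levels Matter" },
  { level: 2, text: "Background" },
  { level: 1, text: "Second Top-Level Section" },
];
const normalized = normalizeHeadings(source, "Heading Levels Matter");
console.log(normalized);
```

In this model the publisher's two h1 sections can no longer be told apart from the h2 between them after normalization, which is the loss of outline information described above.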

This change preserves the original heading tag names and levels in the extracted content wherever we are not explicitly removing noise, so the serialized article HTML stays closer to the source document’s hierarchy. The new behavior is gated behind an option, so the existing normalizations remain the default.

What changes (high level)

  • Stop blanket h1 → h2 replacement in article content
  • Stop duplicate-title header removal
  • Add unit tests
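
As a rough illustration of the gating, the sketch below applies the h1 → h2 rewrite to an HTML string unless an option is set. The option name `keepOriginalHeadings` and the function `rewriteTopLevelHeadings` are assumptions made for this example, since the description above does not name the real option. The regex rewrite is only a stand-in for DOM-based tag replacement. The duplicate-title removal would presumably be skipped under the same flag.

```javascript
// Hypothetical option name: the PR description does not say what the real option is called.
function rewriteTopLevelHeadings(html, { keepOriginalHeadings = false } = {}) {
  if (keepOriginalHeadings) {
    // Option enabled: leave the publisher's heading levels untouched.
    return html;
  }
  // Default behavior: rename every <h1 ...> and </h1> tag to h2, keeping attributes.
  // A regex is used only to keep the sketch dependency-free; it is not a robust HTML parser.
  return html.replace(
    /<(\/?)h1(\s[^>]*)?>/gi,
    (_match, slash, attrs) => `<${slash}h2${attrs || ""}>`
  );
}

const input = '<h1 id="intro">Intro</h1><h2>Details</h2>';

const demoted = rewriteTopLevelHeadings(input);
const preserved = rewriteTopLevelHeadings(input, { keepOriginalHeadings: true });

console.log(demoted);
console.log(preserved);
```

A unit test for the option could assert both paths: that the default output contains no h1 tags, and that the opt-in output is identical to the input.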

mcPear changed the title from "chore: optionally keep original title headers" to "Optionally keep original title headers for main content extraction accuracy" Apr 17, 2026