3 changes: 2 additions & 1 deletion src/wp-includes/widgets.php
@@ -1644,7 +1644,8 @@ function wp_widget_rss_output( $rss, $args = array() ) {
 		}
 		$link = esc_url( strip_tags( $link ) );

-		$title = esc_html( trim( strip_tags( $item->get_title() ) ) );
+		$title = esc_html( trim( strip_tags( html_entity_decode( $item->get_title() ) ) ) );

@dmsnell dmsnell Nov 4, 2025


sigh. we have a hard time separating XML and HTML. RSS makes it even more complicated because a feed is not required to be a valid XML file, and feeds usually don’t indicate how the characters inside the different elements should be interpreted.

here is what comes from the linked feed. it’s obvious they have sent encoded content inside the title element.

<dc:title>Oral administration of &lt;em&gt;Lactiplantibacillus plantarum&lt;/em&gt; GKK1 ameliorates atopic dermatitis in a mouse model</dc:title>

this brings us to an odd spot. if we follow this logic, we are missing a second html_entity_decode(), and the first one is wrong. the first level decodes XML, where HTML named character references are not valid; that produces the HTML of the item’s title. then we want to decode that HTML, revealing plaintext. the title itself could contain something like "drug A is < drug B", in which case its HTML would be "drug A is &lt; drug B", and the XML wrapping should escape that a second time into "drug A is &amp;lt; drug B".

we have decoded the XML into HTML and then removed tags, potentially leaving those same character references undecoded from the HTML side.
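To make the two layers concrete, here is a minimal sketch in plain PHP (no WordPress required). It encodes the hypothetical title from the comment through both layers, then undoes only the XML layer and strips tags. The variable names are illustrative, and htmlspecialchars() is used here only as an assumed stand-in for whatever the feed producer does.

```php
<?php
// Hypothetical plaintext title, taken from the example in the comment above.
$plaintext = 'drug A is < drug B';

// Layer 1: serialize the plaintext as HTML.
$as_html = htmlspecialchars( $plaintext, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// Layer 2: serialize that HTML as XML character data (the "&" is escaped again).
$as_xml = htmlspecialchars( $as_html, ENT_QUOTES | ENT_XML1, 'UTF-8' );

// Undo only the XML layer, then remove tags.
$decoded_once = html_entity_decode( $as_xml, ENT_QUOTES | ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8' );
$tags_removed = strip_tags( $decoded_once );

echo $as_html, "\n";
echo $as_xml, "\n";
// The character reference from the HTML layer is still present at this point.
echo $tags_removed, "\n";
```

After one decode and tag removal the value is still HTML, not plaintext, which is the gap described above.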


it may help to use additional variables, and I recommend keeping html_entity_decode() on the XML side only and switching to the HTML API on the HTML side. why this function works for XML and breaks for HTML is beyond my imagination, but alas, that’s the way it is.

$item_title_xml  = $item->get_title();
$item_title_html = html_entity_decode( $item_title_xml, ENT_XML1 | ENT_SUBSTITUTE );

$processor  = new WP_HTML_Tag_Processor( $item_title_html );
$item_title = '';
while ( $processor->next_token() ) {
	if ( '#text' === $processor->get_token_name() ) {
		$item_title .= $processor->get_modifiable_text();
	}
}
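As a rough illustration of why the flags matter on the XML side, the sketch below decodes one made-up input string in three ways. It relies only on the documented behavior of html_entity_decode(): with ENT_XML1 only XML's predefined entities and numeric references are decoded, and the quote entities are decoded only when ENT_COMPAT or ENT_QUOTES is also set. The input string is an assumption chosen for the demonstration.

```php
<?php
// Made-up input mixing an HTML-only named reference, an XML entity,
// a numeric reference, and quote entities.
$input = 'caf&eacute; &amp; &#x41; &quot;tea&quot;';

// XML mode: &eacute; is not an XML entity, so it is left alone.
$xml_mode = html_entity_decode( $input, ENT_QUOTES | ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8' );

// HTML 4.01 mode: HTML named references such as &eacute; are decoded too.
$html_mode = html_entity_decode( $input, ENT_QUOTES | ENT_HTML401 | ENT_SUBSTITUTE, 'UTF-8' );

// XML mode without a quote flag: &quot; stays encoded.
$no_quote_flag = html_entity_decode( $input, ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8' );

echo $xml_mode, "\n";
echo $html_mode, "\n";
echo $no_quote_flag, "\n";
```

In the snippet above, the later HTML step should still decode any leftover quote references, so the final text is likely unaffected; adding ENT_QUOTES would only matter if the intermediate HTML string were used on its own.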

This function is old and could use a lot of love.


Note that it’s very likely we will encounter other encodings of the title element. Fixing this case could break others, because RSS is not explicit about the content type of the contained values.
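A small sketch of that risk, in plain PHP with a made-up feed value: the same raw XML text gives different titles depending on whether the element is assumed to hold plain text or encoded HTML. Neither branch is claimed to be what any particular feed intends.

```php
<?php
// Hypothetical raw value of a title element, as it appears in the XML.
$raw_xml = 'Use the &lt;b&gt; tag sparingly';

$decoded = html_entity_decode( $raw_xml, ENT_QUOTES | ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8' );

// Assumption A: the element holds plain text, so decoding the XML yields the title.
$as_plain_text = $decoded;

// Assumption B: the element holds encoded HTML, so tags are removed afterwards.
$as_encoded_html = strip_tags( $decoded );

echo $as_plain_text, "\n";
echo $as_encoded_html, "\n";
```

Under assumption B the literal "<b>" in a plain-text title is silently dropped, which is the kind of breakage to watch for.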

The above code could become a function whose name states explicitly what it assumes, and be called here. Should we decide to make this more robust later, we would have an easy way to find the existing code and swap it out as appropriate.

/**
 * Returns the plaintext content of encoded HTML content serialized in XML.
 *
 * When an RSS tag contains “encoded content” then the decoded XML
 * represents HTML. After decoding the XML into HTML, this returns
 * the plaintext content of that decoded HTML.
 *
 * Example:
 *
 *     echo wp_rss_xml_to_html( '&lt;p&gt;&amp;#x1f63c;&lt;/p&gt;' );
 *     // <p>&#x1f63c;</p>
 *
 *     echo wp_rss_xml_to_html_to_text( '&lt;p&gt;&amp;#x1f63c;&lt;/p&gt;' );
 *     // 😼
 *
 */
function wp_rss_xml_to_html_to_text( string $raw_xml ): string {
	// XML only defines five named entities: &amp; &gt; &lt; &apos; &quot;
	$html = html_entity_decode( $raw_xml, ENT_XML1 | ENT_SUBSTITUTE );

	$plaintext = '';
	$processor = new WP_HTML_Tag_Processor( $html );
	while ( $processor->next_token() ) {
		if ( '#text' === $processor->get_token_name() ) {
			$plaintext .= $processor->get_modifiable_text();
		}
	}

	return $plaintext;
}

and then call it

$escaped_title = esc_html( trim( wp_rss_xml_to_html_to_text( $item->get_title() ) ) );
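For trying the idea outside WordPress, here is a hedged stand-in for the helper. The function name rss_xml_to_text_sketch() is hypothetical, and strip_tags() plus a second html_entity_decode() only approximate the HTML step; they are not equivalent to WP_HTML_Tag_Processor, which is what the proposal actually uses.

```php
<?php
/**
 * Hypothetical stand-in for the proposed helper, for use outside WordPress.
 *
 * Decodes the XML layer, removes tags, then decodes HTML character references.
 * The last two steps approximate the HTML API and may differ on edge cases.
 */
function rss_xml_to_text_sketch( string $raw_xml ): string {
	$html = html_entity_decode( $raw_xml, ENT_QUOTES | ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8' );
	$text = strip_tags( $html );

	return html_entity_decode( $text, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE, 'UTF-8' );
}

// The title from the linked feed, as quoted earlier in this thread.
$feed_title = 'Oral administration of &lt;em&gt;Lactiplantibacillus plantarum&lt;/em&gt; GKK1 ameliorates atopic dermatitis in a mouse model';

echo rss_xml_to_text_sketch( $feed_title ), "\n";
echo rss_xml_to_text_sketch( '&lt;p&gt;&amp;#x1f63c;&lt;/p&gt;' ), "\n";
```

The result is plaintext, so it still needs esc_html() before being printed into the widget markup, as in the call above.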

The same should likely apply to the description below as well. Why do we call esc_attr( esc_html( $desc ) ) 🤦‍♂️ 😭.


 		if ( empty( $title ) ) {
 			$title = __( 'Untitled' );
 		}
43 changes: 43 additions & 0 deletions tests/phpunit/tests/widgets/wpWidgetRss.php
@@ -116,4 +116,47 @@ public function mocked_rss_response() {
			'filename' => null,
		);
	}

	/**
	 * @ticket 63611
	 * @covers wp_widget_rss_output
	 */
	public function test_rss_title_html_entities_decoded() {
		$mock_item = $this->getMockBuilder( 'SimplePie_Item' )
			->disableOriginalConstructor()
			->getMock();

		$mock_item->method( 'get_title' )
			->willReturn( 'Title with &lt;em&gt;HTML entities&lt;/em&gt;' );

		$mock_item->method( 'get_link' )
			->willReturn( 'https://example.com' );

		$mock_item->method( 'get_description' )
			->willReturn( 'Description' );

		$mock_item->method( 'get_date' )
			->willReturn( false );

		$mock_item->method( 'get_author' )
			->willReturn( false );

		$mock_rss = $this->getMockBuilder( 'SimplePie' )
			->disableOriginalConstructor()
			->getMock();

		$mock_rss->method( 'get_item_quantity' )
			->willReturn( 1 );

		$mock_rss->method( 'get_items' )
			->willReturn( array( $mock_item ) );

		ob_start();
		wp_widget_rss_output( $mock_rss );
		$output = ob_get_clean();

		$this->assertStringContainsString( 'Title with HTML entities', $output );
		$this->assertStringNotContainsString( '&lt;em&gt;', $output );
		$this->assertStringNotContainsString( '&lt;/em&gt;', $output );
	}
}