feat: add js_trim() and mb_trim() compat #9519
USERSATOSHI wants to merge 8 commits into WordPress:trunk

    }

    if ( 'UTF-8' !== $encoding ) {
        $characters = mb_convert_encoding( $characters, 'UTF-8', $encoding );

I believe this will intentionally corrupt the list of characters in every case that the code runs. is the $characters string not already UTF-8 by construction in the PHP source code?
so if we convert it from anything else we’ll be telling PHP to misunderstand the string and double-convert it?
I would imagine that if the $encoding is ISO-8859-1, for instance, that we would get something like � instead of NARROW NO-BREAK SPACE U+202F.
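
To make the concern concrete, here is a minimal sketch of the double conversion, assuming the mbstring extension is loaded. The variable values are illustrative and not taken from the PR; the exact garbage produced depends on which source encoding is claimed, so the result may be mojibake rather than a literal replacement character.

```php
<?php
// "\u{...}" escapes always produce UTF-8 bytes, so this string is UTF-8 by construction.
$characters = "\u{202F}"; // NARROW NO-BREAK SPACE, bytes E2 80 AF

// Hypothetical caller-supplied encoding that does not describe the string above.
$encoding = 'ISO-8859-1';

// Each of the three UTF-8 bytes is read as one ISO-8859-1 character and re-encoded.
$converted = mb_convert_encoding( $characters, 'UTF-8', $encoding );

echo bin2hex( $characters ), "\n"; // e280af
echo bin2hex( $converted ), "\n";  // c3a2c280c2af (U+00E2, U+0080, U+00AF)
```

After the conversion the list no longer contains U+202F at all, so that character would not be trimmed.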
dmsnell left a comment
@USERSATOSHI although this looks sound from the function-call arguments, I would like to hear your thoughts on some of the ways it could interact with actual site data and the encodings of strings coming into it.
there could be an argument for requiring that all incoming strings be converted into UTF-8 before reaching this function.

    if ( 'UTF-8' !== $encoding ) {
        $characters = mb_convert_encoding( $characters, 'UTF-8', $encoding );
        $str = mb_convert_encoding( $str, 'UTF-8', $encoding );

this line is a heavy lifter, and I generally encourage folks to disregard content if it’s not UTF-8 because the conversion here is more than likely to introduce corruption.
it may be less risky to check if the string is valid in its own encoding first…

    if (
        ! is_utf8_charset( $encoding ) &&
        mb_check_encoding( $str, $encoding )
    ) {
        $str = mb_convert_encoding( $str, 'UTF-8', $encoding );
    } else {
        // REJECT!
    }

but even in this case we run a large risk because most strings will validate as any of the single-byte encodings likely to be set on a real site, if not UTF-8.
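
As a small illustration of why validation offers little protection here, the sketch below (assuming the mbstring extension; the sample byte strings are made up for the example) checks the same bytes against UTF-8 and ISO-8859-1. ISO-8859-1 assigns a character to every byte value, so the check cannot distinguish genuine Latin-1 text from mislabeled UTF-8 or even from a truncated sequence.

```php
<?php
// UTF-8 bytes for NARROW NO-BREAK SPACE (E2 80 AF).
$utf8_bytes = "\u{202F}";

// A truncated UTF-8 sequence: not valid UTF-8 on its own.
$truncated = "\xE2\x80";

$utf8_as_latin1      = mb_check_encoding( $utf8_bytes, 'ISO-8859-1' );
$truncated_as_utf8   = mb_check_encoding( $truncated, 'UTF-8' );
$truncated_as_latin1 = mb_check_encoding( $truncated, 'ISO-8859-1' );

var_dump( $utf8_as_latin1 );      // bool(true)
var_dump( $truncated_as_utf8 );   // bool(false)
var_dump( $truncated_as_latin1 ); // bool(true)
```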
the primary source of non-UTF-8 is from legacy database tables, and it’s best to convert encodings at the point of demarcation when reading from the database. any other string sent here is almost certainly going to be in a different encoding than what is set for $encoding
also, I would guess that there is an extremely low likelihood that mb_internal_encoding() matches a site’s blog_charset or the encoding of the incoming text unless they are all UTF-8.
PHP’s trim() function, by default, only strips a limited set of ASCII whitespace characters, and mb_trim(), introduced in PHP 8.4, does not behave identically to JavaScript’s String.prototype.trim().
This PR implements js_trim(), a PHP function that replicates JavaScript’s String.prototype.trim() behavior.

It works by defining a set of $js_trimmables characters, which are passed to mb_trim() with UTF-8 encoding.

In addition, this PR adds a polyfill for mb_trim() in compat.php to support PHP versions below 8.4, with unit tests for both js_trim() and mb_trim().

Trac ticket: https://core.trac.wordpress.org/ticket/63804
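
For readers who want to see the behavioral gap, here is a rough sketch of JavaScript-style trimming. It is not the PR's implementation: the function name sketch_js_trim() is hypothetical, and it swaps in a PCRE character class instead of passing a character list to mb_trim(). It assumes the input is UTF-8 and strips the ECMAScript WhiteSpace and LineTerminator code points.

```php
<?php
/**
 * Hypothetical sketch, not the PR's js_trim(): trims the code points that
 * String.prototype.trim() removes, assuming $str is UTF-8.
 */
function sketch_js_trim( string $str ): string {
	// ECMAScript WhiteSpace and LineTerminator code points.
	$class = '[\x{0009}-\x{000D}\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}'
		. '\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}\x{FEFF}]';

	// \A and \z anchor to the true start and end; the u flag treats input as UTF-8.
	$trimmed = preg_replace( '/\A' . $class . '+|' . $class . '+\z/u', '', $str );

	// preg_replace() returns null when the input is not valid UTF-8; leave it untouched.
	return $trimmed ?? $str;
}

// PHP's trim() leaves a NO-BREAK SPACE in place, while the sketch removes it.
var_dump( trim( "\u{00A0}hi\u{00A0}" ) === "\u{00A0}hi\u{00A0}" ); // bool(true)
var_dump( sketch_js_trim( "\u{00A0}hi\u{00A0}" ) );                // string(2) "hi"
```

Returning invalid UTF-8 unchanged is a design choice made for this sketch only; it sidesteps the encoding-conversion questions raised in the review rather than answering them.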
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.