Fun with mb_strlen
I noticed the fallback implementation for mb_strlen() that we had in GlobalFunctions.php sucked:
function mb_strlen( $str, $enc = "" ) {
    preg_match_all( '/./us', $str, $matches );
    return count($matches);
}
There are two things to note about this code:
- It doesn’t actually work: count($matches) counts the outer result array, which holds just one entry (the list of full-pattern matches), so it always returns 1
- Even if you fix it to count the matched characters, it’s extremely slow and will eat lots of memory by creating a giant array of every character in the (potentially quite long) string
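To make the first point concrete, here is a small illustrative check of what preg_match_all() hands back. The sample string and variable names are made up for this sketch and are not part of the original code:

```php
<?php
// Illustrative only: inspect the shape of the preg_match_all() result.
$str = "h\xc3\xa9llo"; // "héllo" in UTF-8: 6 bytes, 5 characters

$found = preg_match_all( '/./us', $str, $matches );

// $matches has one entry per capture group. With no groups there is only
// index 0, which holds the array of full-pattern matches.
$outer = count( $matches );    // 1, whatever the input
$inner = count( $matches[0] ); // 5, one entry per character

// $found is also 5: preg_match_all() returns the number of full matches.
echo "$outer $inner $found\n";
```

Counting $matches[0], or simply using the return value, gives the right answer, but the per-character array is still built either way, which is the cost the second point describes.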
I’m replacing this with a new version which uses PHP’s count_chars() function to count up the ASCII-compatible bytes and multibyte sequence head bytes. It’s still a smidge slower than mb_strlen but it’s… much better than the old one.
/**
 * Fallback implementation of mb_strlen, hardcoded to UTF-8.
 * @param string $str
 * @param string $enc optional encoding; ignored
 * @return int
 */
function new_mb_strlen( $str, $enc="" ) {
    // Frequency of each byte value, indexed 0..255
    $counts = count_chars( $str );
    $total = 0;

    // Count ASCII bytes
    for( $i = 0; $i < 0x80; $i++ ) {
        $total += $counts[$i];
    }

    // Count multibyte sequence heads; continuation bytes (0x80..0xbf)
    // are skipped, and 0xff never appears in valid UTF-8
    for( $i = 0xc0; $i < 0xff; $i++ ) {
        $total += $counts[$i];
    }
    return $total;
}
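As a quick sanity check of the head-byte idea, the sketch below applies the same counting rule to a few hand-encoded UTF-8 strings. The helper name utf8_length_by_head_bytes() and the sample strings are hypothetical, chosen for illustration. It assumes well-formed UTF-8 input; results on malformed input may differ from mb_strlen().

```php
<?php
// Hypothetical helper name, picked to avoid clashing with the functions above.
function utf8_length_by_head_bytes( $str ) {
    $counts = count_chars( $str ); // mode 0: 256 entries keyed by byte value

    // 0x00-0x7f are ASCII; 0xc0 and up can only start a multibyte sequence.
    // 0x80-0xbf are continuation bytes and are deliberately left out.
    return array_sum( array_slice( $counts, 0, 0x80 ) )
         + array_sum( array_slice( $counts, 0xc0 ) );
}

$samples = array(
    array( "hello", 5 ),                                // pure ASCII
    array( "h\xc3\xa9llo", 5 ),                         // one 2-byte character
    array( "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e", 3 ), // three 3-byte characters
);

foreach ( $samples as $sample ) {
    list( $text, $expected ) = $sample;
    printf(
        "%d bytes, %d chars (expected %d)\n",
        strlen( $text ),
        utf8_length_by_head_bytes( $text ),
        $expected
    );
}
```

Every character contributes exactly one byte outside the 0x80-0xbf range, so the number of such bytes equals the number of characters.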
Some quick benchmarks using the UTF-8 normalization benchmark pages (code):
Testing washington.txt:
strlen 31526 chars 0.007ms
mb_strlen 31526 chars 0.114ms
old_mb_strlen 31526 chars 4813.686ms
new_mb_strlen 31526 chars 0.132ms
Testing berlin.txt:
strlen 36320 chars 0.001ms
mb_strlen 35899 chars 0.129ms
old_mb_strlen 35899 chars 6328.748ms
new_mb_strlen 35899 chars 0.127ms
Testing bulgakov.txt:
strlen 36849 chars 0.001ms
mb_strlen 20418 chars 0.076ms
old_mb_strlen 20418 chars 3003.042ms
new_mb_strlen 20418 chars 0.133ms
Testing tokyo.txt:
strlen 36244 chars 0.001ms
mb_strlen 19936 chars 0.071ms
old_mb_strlen 19936 chars 2623.109ms
new_mb_strlen 19936 chars 0.131ms
Testing young.txt:
strlen 36694 chars 0.001ms
mb_strlen 16676 chars 0.063ms
old_mb_strlen 16676 chars 2246.179ms
new_mb_strlen 16676 chars 0.125ms
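
For anyone who wants to collect similar numbers, a rough timing harness might look like the sketch below. This is not the benchmark code used for the figures above: time_call(), the iteration count, and the sample text are all assumptions made for illustration.

```php
<?php
// Hypothetical timing helper: runs $func on $text repeatedly and returns
// array( last result, average milliseconds per call ).
function time_call( $func, $text, $iterations = 100 ) {
    $result = null;
    $start = microtime( true );
    for ( $i = 0; $i < $iterations; $i++ ) {
        $result = call_user_func( $func, $text );
    }
    $elapsedMs = ( microtime( true ) - $start ) * 1000 / $iterations;
    return array( $result, $elapsedMs );
}

// Made-up sample text: 20 bytes and 18 characters per repetition.
$text = str_repeat( "Gr\xc3\xbc\xc3\x9fe aus Berlin. ", 2000 );

$candidates = array( 'strlen' );
if ( function_exists( 'mb_strlen' ) ) {
    // Only present when the mbstring extension is loaded.
    $candidates[] = 'mb_strlen';
}

foreach ( $candidates as $name ) {
    list( $length, $ms ) = time_call( $name, $text );
    printf( "%-14s %d chars %.3fms\n", $name, $length, $ms );
}
```

Averaging over several iterations smooths out one-off costs such as the first call being slower than later ones.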

March 10th, 2007 at 6:13 pm
The usual hack is to use strlen(utf8_decode($str)); and rely on anything outside ISO 8859-1 being output as a single question mark.
March 12th, 2007 at 11:42 am
Hm, that’s clever too.
Turns out it’s actually slower than my count_chars() method, though, on article-size strings. (By about a factor of four for primarily-ASCII text, or about three and two for the 2-byte and 3-byte-per-character ranges.)
Your method is faster for short strings… but all are well under a millisecond on my 2.33 GHz Core Duo test box for long strings, and under a tenth of a ms for the short strings, so it perhaps gets into splitting hairs.
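
For reference, the utf8_decode() trick discussed in these comments can be sketched as follows. The wrapper name utf8_length_via_decode() is hypothetical, and the sketch assumes valid UTF-8 input. Note that utf8_decode() is deprecated as of PHP 8.2, so this is shown to illustrate the idea rather than as a recommendation.

```php
<?php
// Hypothetical wrapper around the utf8_decode() hack from the comment above.
function utf8_length_via_decode( $str ) {
    // utf8_decode() converts to ISO 8859-1: each character becomes exactly
    // one byte, either its Latin-1 value or "?" when it has none, so the
    // byte length of the result equals the character count.
    // The @ hides the deprecation notice emitted on PHP 8.2 and later.
    return strlen( @utf8_decode( $str ) );
}

// Guarded in case a future PHP build no longer ships utf8_decode().
if ( function_exists( 'utf8_decode' ) ) {
    $latin = "h\xc3\xa9llo";                             // 5 characters, all in Latin-1
    $kanji = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e";     // 3 characters, none in Latin-1
    printf( "%d %d\n", utf8_length_via_decode( $latin ), utf8_length_via_decode( $kanji ) );
}
```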