The Porter Stemming Algorithm was developed by Martin Porter for reducing English words to their word stems. For example, the word “connections” would be reduced to its stem form “connect.”
This PHP class is a fairly faithful implementation of the algorithm (the web page of which can be found here). The primary use of stemming is keyword indexing, for example when you’re building a search application.
Thanks to Mike Boone for finding a fatal error in the is_consonant() function when dealing with short word stems beginning with “Y”. I’ve updated the class accordingly.
Additional thanks to Mark Plumbley for finding another problem with short words beginning with “Y” (the word “yves”, for example). I fixed the _o() and is_consonant() functions to sanity-check the values being passed around. (And thanks to Jason for pointing this out to me again, which reminded me that I had updated the class but had forgotten to upload it here.)
Thanks to Andrew Jeffries for discovering a bug with words beginning with “yy”: the is_consonant() method, when checking either of these first “y”s, would fall into infinite recursion and crash the program.
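As a minimal sketch of how the “y” check can avoid recursion altogether, the function below classifies letters in a single left-to-right pass, so each “y” only consults the flag already computed for the letter before it. The name consonant_flags() is hypothetical and this is not the code in the class. It follows the published Porter definition (a “y” is a consonant unless preceded by a consonant), does not include the leading-“Y” special case described elsewhere in this post, and assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper for illustration; not the class's is_consonant().
// Returns one boolean per character: true = consonant, false = vowel.
function consonant_flags(string $word): array
{
    $word   = strtolower($word);
    $length = strlen($word);
    $flags  = [];

    for ($i = 0; $i < $length; $i++) {
        $ch = $word[$i];
        if (strpos('aeiou', $ch) !== false) {
            // Plain vowels are never consonants.
            $flags[$i] = false;
        } elseif ($ch === 'y') {
            // A "y" is a consonant at the start of the word or after a
            // vowel, and a vowel after a consonant. The previous flag is
            // already known, so no recursive call is needed.
            $flags[$i] = ($i === 0) ? true : !$flags[$i - 1];
        } else {
            // Every other character is treated as a consonant.
            $flags[$i] = true;
        }
    }

    return $flags;
}
```

Because the loop never looks ahead, a word such as “yy” is handled in two steps: the first “y” is flagged as a consonant and the second as a vowel.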
Big update, 11/9/2005: Prompted by an email from Richard Shelquist, I went back over the class and fixed some errors in the algorithm; in particular, I made sure to conform exactly to the written algorithm found at the Stemmer website. This class now takes the test vocabulary file here and stems every single word exactly as shown in the output file here, with two exceptions: “ycleped” and “ycliped”. I believe my version stems these correctly, because it assumes that a “Y” at the beginning of a word followed by a consonant (as in “Yvette”) is to be treated as a vowel and not a consonant.
Yeah, that’s arrogant; allow me some, okay? 🙂
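If you want to repeat that kind of check yourself, a comparison along the following lines would do it. This is only a sketch: find_stem_mismatches() is a hypothetical name, the stemmer is passed in as a callable because the class’s exact method names aren’t shown here, and the two word lists are assumed to be parallel arrays (one input word and one expected stem per index). It assumes PHP 7 or later.

```php
<?php
// Hypothetical test helper; not part of the Stemmer class.
// $words and $expected are parallel arrays; $stem is any callable that
// takes a word and returns its stem.
function find_stem_mismatches(array $words, array $expected, callable $stem): array
{
    $mismatches = [];

    foreach ($words as $i => $word) {
        $actual = $stem($word);
        if (!isset($expected[$i]) || $actual !== $expected[$i]) {
            // Record what was expected and what the stemmer produced.
            $mismatches[$word] = [
                'expected' => $expected[$i] ?? null,
                'actual'   => $actual,
            ];
        }
    }

    return $mismatches;
}
```

Each list can be loaded from a text file with file() and the FILE_IGNORE_NEW_LINES flag. An empty result means every word stemmed as expected.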
Also, Richard Heyes, a real PHP guru, has a PHP 5 version of the Stemmer class available online. You can grab it here. It’s especially notable because he credits me for certain elements, even though he notes my version was broken. Also, I’ve heard (but have not tested) that his version is faster than mine, though I don’t know if that’s because his is PHP 5 and mine is PHP 4. Either way, Richard has way more RegEx Kung Fu than me, and I bow to his skillz.
Thanks to Damon Sauve for suggesting a better fix to the handling of hyphenated words (in his case, multi-hyphenated words). His fix used a regular expression to extract the final part of the hyphenated word, while mine does a substr() split instead. Also, his version allows dots and apostrophes in words, as found in URLs and contractions. That is a real-world scenario I hadn’t accounted for, so it’s been incorporated.
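To make the difference between the two approaches concrete, here is a rough sketch of each. Both function names are hypothetical and neither is copied from Damon’s fix or from the class; the regular expression in particular is an assumption about what “allowing dots and apostrophes” might look like. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helpers for illustration only.

// Regex approach: take the trailing run of letters, dots, and apostrophes,
// which is whatever follows the last hyphen (or the whole word if none).
function last_segment_regex(string $word): string
{
    if (preg_match("/[a-z.']+$/i", $word, $matches)) {
        return $matches[0];
    }
    return $word; // Nothing matched at the end; leave the word untouched.
}

// substr() approach: find the last hyphen and keep everything after it.
function last_segment_substr(string $word): string
{
    $pos = strrpos($word, '-');
    return $pos === false ? $word : substr($word, $pos + 1);
}
```

For “mother-in-law” both return “law”, and a contraction such as “don't” passes through unchanged. The regular expression is stricter about which characters count as part of a word, while the substr() version simply keeps whatever follows the last hyphen.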
You can view the direct source of the class here, or download it in one of the following formats: