Search Patch

While waiting to find out if my hosting provider will change the minimum fulltext word length for MySQL, here’s what I’ve done in the meantime to deal with viable three-character search terms.

First, I split the search string into the component words (an array). I subtract any stopwords (I’ve got a big list) and for any remaining words that are under four characters long, I add to the SQL query I’m running.

Here’s the basic form of the query that I’m running, say searching for “porter”:

SELECT *,
MATCH(body) AGAINST('porter') AS relevance
FROM content
WHERE MATCH(body) AGAINST('porter')
AND [additional conditions]
ORDER BY relevance DESC
LIMIT 10

This uses fulltext indexing to search for “porter” with weighted relevance, and returns the appropriate content and its relevance score. Pretty straightforward, and it works really well.

Here’s what the modified query looks like, if there’s short words present, for the search “porter php”:

SELECT *,
MATCH(body) AGAINST('porter') +
  (1 / INSTR(body, 'php') + 1 / 2[position of word in string])
AS relevance
FROM content
WHERE ( MATCH(body) AGAINST('porter')
  OR body REGEXP '[^a-zA-Z]php[^a-zA-Z]'
  )
AND [additional conditions]
ORDER BY relevance DESC
LIMIT 10

Two new things are happening. First, in the WHERE clause, I’m using both the fulltext system to find “porter” and using a regular expression search for “php.” Why REGEXP and not LIKE? Because if I write LIKE '%cow%' for instance, I’ll not only get “cow” but also “coworker” and other wrong matches. A regular expression lets me filter those scenarios out.

That takes care of finding the words, but I also wanted to tie them into relevance, somehow. The solution I hit upon in the above SQL is relatively simple, and does the trick well enough for my tastes. Basically, the sooner the word appears in the content, the higher its relevance, which is reflected in the inverse of the number of characters “deep” in the content it appears. And I wanted to fudge the number a bit more by weighting the position of the keyword in the search string; the sooner the keyword appears, the higher the relative score it gets.

It’s not perfect, and I definitely wouldn’t recommend using this method on a sufficiently large dataset, but for my short-term needs it works just fine. The only thing really missing in the relevance factoring is how many times the keyword appeared in the content, but I can live without that for now.

2 comments

  1. I was surfing around looking for ideas on how to parse boolean search strings, and happened across this dated article:

    http://evolt.org/article/Boolean_Fulltext_Searching_with_PHP_and_MySQL/18/15665/

    It takes a similar approach to yours in handling words < 4 characters in length, though he’s using a different scoring algorithm.

    BTW, it turns out that Serendipity uses MySQL’s fulltext search, and is hard coded to give an error if the search string is < 4 characters (even if MySQL is set to handle smaller word lengths).

  2. Looks interesting, but fortunately my host’s MySQL is already version 4, so IN BOOLEAN MODE is supported. And they may reduce the word length minimum to 3 (that’s what they said in response to my query to them), so there’s hope! 🙂

    I find it a bit odd that Serendipity is hard-coded with the less-than-4 error… it’s not that hard to work around. Weird.

Comments are closed.