BendSearch

Check this out: BendSearch.com. Alpine, the company I used to work for, has finally resurrected it and updated it. Good for them.

I was responsible for a lot of the work that went into that site back in the day (this would be in 2000-2001). Oh, I’m not bragging; that was, of course, early in my career as a PHP developer, so there was a lot of ugly, bad code floating around in there, courtesy of yours truly.

Hopefully they’ll be able to get somebody to give it the attention it deserves; it’s looking a little sparse right now but I rather like what I’m seeing.

Referrers, search engines, trends

Going through my site’s logfiles, I figured it’s about time for one of those navel-gazing site-analyzing posts. I’ve noticed some trends along the way, I think.

By far, the most search engine hits I get are from Google; over the past 30 days, I clocked 2,617 hits from Google, nearly four times more than Yahoo at 763 hits. In fact, the top ten search engines are:

1 Google 2,617
2 Yahoo 763
3 MSN 188
4 Altavista 82
5 AskJeeves 61
6 AOL Search 35
7 Netscape 20
8 AllTheWeb 16
9 Mamma 4
10 Lycos 4

I’m a little surprised by the amount of variation there.

The trends I’ve noticed are in the breakdown of what people are searching for from each site. Most of the Google searches are for free Palm ebooks, Matrix names, and variations on those themes; it seems that people are using Google to find specific types of information, knowing the parameters of what they’re looking for—targeted. The other search engines, on the other hand, seem to better reflect pop cultural references and more general searching. Among Yahoo searches, for instance, I see such phrases as, “boba fett” (number one), “kermit the frog,” “dell dude,” “a-team movie,” and so on. Same for the others.

So I’d guess that in Google searches, when they find me I’m near the top of the lists for what they’re searching for and the users are looking for specific things. On Yahoo and the others, though, it looks like people are more into browsing on vaguer searches, and clicking through on links that look interesting, but may not be relevant. The conclusion I’d draw from this (not surprisingly) is that Google users are power users, and the search engine people go to who want to really find something and get the job done, whereas Yahoo users are more casual, not so worried about the results, but they’ll do in a pinch.

And of course, the best part of this whole entry: listing some of the best/worst search phrases people have actually typed to get here. All verbatim.

  • thongs in public
  • what’s your name
  • purple flowers
  • jones green bean casserole soda
  • van helsing absinthe
  • donner party cannibalism
  • heroin
  • green bean soda
  • white trash sex
  • pong is a violent game
  • twas the night bush
  • green bean casserole soda
  • ugliest picture
  • topless rotten
  • skinsuits
  • donkey brew
  • if you had a male tiger what would you name it
  • snoop dog fir shizzle
  • frog master
  • fett ass
  • cracker ingredients
  • beer mugs carved in pumpkins
  • what is the proper way to charge cell and cordless phones
  • on the sierra nevada summerfest beer label what mountains are featured
  • is there a formula for figuring out when thanksgiving day will be
  • how do i clean vomit from couch
  • check out my wife
  • turkey soda
  • where is it snowing in the united states november 11, 2004
  • donner party beer
  • emachine turns it’s self on
  • halloween hooch drink

Search Patch

While waiting to find out if my hosting provider will change the minimum fulltext word length for MySQL, here’s what I’ve done in the meantime to deal with viable three-character search terms.

First, I split the search string into the component words (an array). I subtract any stopwords (I’ve got a big list) and for any remaining words that are under four characters long, I add to the SQL query I’m running.

Here’s the basic form of the query that I’m running, say searching for “porter”:

SELECT *,
MATCH(body) AGAINST('porter') AS relevance
FROM content
WHERE MATCH(body) AGAINST('porter')
AND [additional conditions]
ORDER BY relevance DESC
LIMIT 10

This uses fulltext indexing to search for “porter” with weighted relevance, and returns the appropriate content and its relevance score. Pretty straightforward, and it works really well.

Here’s what the modified query looks like, if there’s short words present, for the search “porter php”:

SELECT *,
MATCH(body) AGAINST('porter') +
  (1 / INSTR(body, 'php') + 1 / 2[position of word in string])
AS relevance
FROM content
WHERE ( MATCH(body) AGAINST('porter')
  OR body REGEXP '[^a-zA-Z]php[^a-zA-Z]'
  )
AND [additional conditions]
ORDER BY relevance DESC
LIMIT 10

Two new things are happening. First, in the WHERE clause, I’m using both the fulltext system to find “porter” and using a regular expression search for “php.” Why REGEXP and not LIKE? Because if I write LIKE '%cow%' for instance, I’ll not only get “cow” but also “coworker” and other wrong matches. A regular expression lets me filter those scenarios out.

That takes care of finding the words, but I also wanted to tie them into relevance, somehow. The solution I hit upon in the above SQL is relatively simple, and does the trick well enough for my tastes. Basically, the sooner the word appears in the content, the higher its relevance, which is reflected in the inverse of the number of characters “deep” in the content it appears. And I wanted to fudge the number a bit more by weighting the position of the keyword in the search string; the sooner the keyword appears, the higher the relative score it gets.

It’s not perfect, and I definitely wouldn’t recommend using this method on a sufficiently large dataset, but for my short-term needs it works just fine. The only thing really missing in the relevance factoring is how many times the keyword appeared in the content, but I can live without that for now.

Searching and Minimum Word Length

Mike Boone, in the comments section of yesterday’s entry on searching (“Updated Search“), correctly points out that searching my site for a word that is less than four characters in length (like “php” or “cow”) does not work—no results are returned. Obviously, since I write about PHP on occasion, this is untenable.

The problem is that MySQL‘s fulltext indexing, by default, only indexes words greater than three characters long, and I don’t think I have any way to change this, despite my initial reply to Mike’s comment. This site is running on a shared server setup on pair.com, and I have absolutely zero control over the MySQL server configuration. I might post a question to their tech support, but I’m not overly optimistic about the response. So, what to do?

Short term, here’s my solution (though it’s not implemented yet): examine each word in the search string, throwing out stopwords (like “the,” “and,” “so,” etc.), and for any word shorter than four characters long, do a LIKE search against the content for them. No, it’s not ideal, but it’s a patch. Comments?

Updated Search

I’ve been vastly updating the search functionality on my site. I’m still using MySQL‘s built-in FULLTEXT indexing to perform searches, but I’ve made the results page look a lot more (okay, almost exactly like) Google‘s. The main differences are that I’m not paginating search results (yet)—all searches limit to 10 results—and that I’m showing a relevance percentage, the first result being arbitrarily determined to be a 100% relevant.

To determine relevance, I’m relying on MySQL: a fulltext MATCH(field) AGAINST('search string') directive will return the relevance number that MySQL computes when used in the SELECT part of a query. (See MySQL Full-text Search in the online manual for detailed info on this.)

Further plans for searching that I haven’t implemented yet: utilizing MySQL’s IN BOOLEAN MODE parameter with searching to allow advanced things like phrase searches (with quotes), required word matching (using the plus sign), and subexpressions using parentheses. It’s pretty cool stuff. Oh, and I want to be smarter about presenting excerpts: Google tries to show you content excerpts with your search terms in them, I want to be able to do the same; currently I’m just showing the first 250 or so characters of the text with HTML stripped out of it.

And since I’m developing my whole Personal Publishing System in an open process, I’ll write up a detailed technical article soon on how to effectively use MySQL fulltext searching and show Google-like results. All real-world; the code will be cribbed right out of my search.php file.

Search Snafu

This article on Gadgetopia links to my content management post I made yesterday (er, today?) and brings up a drawback to my system that I forgot to include: searching.

Within the relational database world, you can do precise, structured queries against specific fields in your tables. In a properly normalized database, this is all-powerful.

However, when you bundle a bunch of content up in an XML package, and stuff that into a single field, you lose this functionality of doing atomic searches against those fields. In the example I wrote up—a geocaching XML record with latitude, longitude, etc.—there would be no way to this type of query:

SELECT * FROM content WHERE longitute BETWEEN -122.5 AND -120.5;

So, a problem. A big problem, since searching data is a pretty fundamental concept in content management—hell, in any application. I have some ideas that address this, but they’re still percolating. More to come.