Search Patch

While waiting to find out if my hosting provider will change the minimum fulltext word length for MySQL, here’s what I’ve done in the meantime to deal with viable three-character search terms.

First, I split the search string into its component words (an array). I remove any stopwords (I’ve got a big list), and for any remaining words under four characters long, I add extra conditions to the SQL query I’m running.
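A rough sketch of that splitting step, written in Python rather than the PHP the site runs. The stopword set here is a tiny hypothetical stand-in for the real list, and `split_terms` is an illustrative name, not the actual function.

```python
import re

# Hypothetical stopword list; the real one is much bigger.
STOPWORDS = {"the", "and", "so", "a", "of", "to", "in"}
MIN_FULLTEXT_LEN = 4  # MySQL's default minimum indexed word length


def split_terms(search: str) -> tuple[list[str], list[str]]:
    """Return (fulltext_words, short_words) for a raw search string."""
    words = [w for w in re.findall(r"[a-z0-9]+", search.lower())
             if w not in STOPWORDS]
    fulltext_words = [w for w in words if len(w) >= MIN_FULLTEXT_LEN]
    short_words = [w for w in words if len(w) < MIN_FULLTEXT_LEN]
    return fulltext_words, short_words
```

The first list would go to the fulltext MATCH ... AGAINST part of the query, and the second would drive the extra conditions for short words.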

Here’s the basic form of the query that I’m running, say searching for “porter”:

SELECT *,
MATCH(body) AGAINST('porter') AS relevance
FROM content
WHERE MATCH(body) AGAINST('porter')
AND [additional conditions]
ORDER BY relevance DESC
LIMIT 10

This uses fulltext indexing to search for “porter” with weighted relevance, and returns the appropriate content and its relevance score. Pretty straightforward, and it works really well.

Here’s what the modified query looks like when short words are present, for the search “porter php”:

SELECT *,
MATCH(body) AGAINST('porter') +
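  -- INSTR() returns 0 when 'php' is absent; the IF() guard keeps that from
  -- turning relevance into NULL. The 2 is the position of 'php' in the search string.
  IF(INSTR(body, 'php') > 0, 1 / INSTR(body, 'php') + 1 / 2, 0)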
AS relevance
FROM content
WHERE ( MATCH(body) AGAINST('porter')
  -- (^|...) and (...|$) also match 'php' at the very start or end of body
  OR body REGEXP '(^|[^a-zA-Z])php([^a-zA-Z]|$)'
  )
AND [additional conditions]
ORDER BY relevance DESC
LIMIT 10

Two new things are happening. First, in the WHERE clause, I’m using both the fulltext system to find “porter” and using a regular expression search for “php.” Why REGEXP and not LIKE? Because if I write LIKE '%cow%' for instance, I’ll not only get “cow” but also “coworker” and other wrong matches. A regular expression lets me filter those scenarios out.
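To see the difference between the two kinds of match outside the database, here is a small Python sketch. Both function names are hypothetical, and the pattern in this sketch allows the word to sit at the very start or end of the text as well as between non-letters.

```python
import re


def like_match(word: str, body: str) -> bool:
    """Rough equivalent of body LIKE '%word%': a plain substring test."""
    return word.lower() in body.lower()


def word_match(word: str, body: str) -> bool:
    """Match the word only when it is not surrounded by other letters."""
    pattern = r"(^|[^a-zA-Z])" + re.escape(word) + r"([^a-zA-Z]|$)"
    return re.search(pattern, body, re.IGNORECASE) is not None
```

With these, `like_match("cow", "my coworker")` is true while `word_match("cow", "my coworker")` is false, which is exactly the kind of wrong match the regular expression filters out.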

That takes care of finding the words, but I also wanted to tie them into relevance, somehow. The solution I hit upon in the above SQL is relatively simple, and does the trick well enough for my tastes. Basically, the sooner the word appears in the content, the higher its relevance, which is reflected in the inverse of the number of characters “deep” in the content it appears. And I wanted to fudge the number a bit more by weighting the position of the keyword in the search string; the sooner the keyword appears, the higher the relative score it gets.
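The arithmetic behind that bonus can be sketched in a few lines of Python. This mirrors the 1 / INSTR(...) + 1 / position idea; returning 0 when the word is absent is an assumption of this sketch, and `short_word_bonus` is a hypothetical name.

```python
def short_word_bonus(body: str, word: str, query_position: int) -> float:
    """Relevance bonus for one short word.

    query_position is the 1-based position of the word in the search string.
    """
    # 1-based character offset like INSTR(); 0 means the word was not found.
    offset = body.lower().find(word.lower()) + 1
    if offset == 0:
        return 0.0
    return 1.0 / offset + 1.0 / query_position
```

A word that opens the content and comes first in the search string scores 2.0; the score shrinks as the word moves deeper into either.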

It’s not perfect, and I definitely wouldn’t recommend using this method on a sufficiently large dataset, but for my short-term needs it works just fine. The only thing really missing in the relevance factoring is how many times the keyword appeared in the content, but I can live without that for now.

Searching and Minimum Word Length

Mike Boone, in the comments section of yesterday’s entry on searching (“Updated Search“), correctly points out that searching my site for a word that is less than four characters in length (like “php” or “cow”) does not work—no results are returned. Obviously, since I write about PHP on occasion, this is untenable.

The problem is that MySQL’s fulltext indexing, by default, only indexes words greater than three characters long, and I don’t think I have any way to change this, despite my initial reply to Mike’s comment. This site is running on a shared server setup on pair.com, and I have absolutely zero control over the MySQL server configuration. I might post a question to their tech support, but I’m not overly optimistic about the response. So, what to do?

Short term, here’s my solution (though it’s not implemented yet): examine each word in the search string, throwing out stopwords (like “the,” “and,” “so,” etc.), and for any word shorter than four characters long, do a LIKE search against the content for them. No, it’s not ideal, but it’s a patch. Comments?

Updated Search

I’ve been heavily updating the search functionality on my site. I’m still using MySQL’s built-in FULLTEXT indexing to perform searches, but I’ve made the results page look a lot more like (okay, almost exactly like) Google’s. The main differences are that I’m not paginating search results yet (all searches limit to 10 results) and that I’m showing a relevance percentage, with the first result arbitrarily set to 100% relevant.

To determine relevance, I’m relying on MySQL: a fulltext MATCH(field) AGAINST('search string') directive will return the relevance number that MySQL computes when used in the SELECT part of a query. (See MySQL Full-text Search in the online manual for detailed info on this.)
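Turning those raw relevance numbers into percentages is a simple scaling against the first result. A minimal Python sketch, assuming the scores arrive already sorted in descending order (as ORDER BY relevance DESC would give) and using the hypothetical name `relevance_percentages`:

```python
def relevance_percentages(scores: list[float]) -> list[int]:
    """Scale scores so that the first (highest) one becomes 100."""
    if not scores or scores[0] <= 0:
        return [0 for _ in scores]
    top = scores[0]
    return [round(100 * s / top) for s in scores]
```

So raw scores of 2.0, 1.0 and 0.5 would display as 100%, 50% and 25%.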

Further plans for searching that I haven’t implemented yet: using MySQL’s IN BOOLEAN MODE modifier to allow advanced things like phrase searches (with quotes), required word matching (using the plus sign), and subexpressions using parentheses. It’s pretty cool stuff. Oh, and I want to be smarter about presenting excerpts: Google tries to show you content excerpts with your search terms in them, and I want to be able to do the same. Currently I’m just showing the first 250 or so characters of the text with HTML stripped out of it.
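One way the smarter excerpts might work, sketched in Python: strip the HTML, find the earliest search term, and cut a window around it. The regex-based tag stripping is deliberately crude, the function name and `width` parameter are hypothetical, and none of this is the site's actual code.

```python
import re
from html import unescape


def excerpt(html_text: str, terms: list[str], width: int = 250) -> str:
    """Return about `width` characters of plain text around the first term."""
    # Crude tag stripping; fine for a sketch, not for hostile markup.
    text = unescape(re.sub(r"<[^>]+>", " ", html_text))
    text = re.sub(r"\s+", " ", text).strip()
    lowered = text.lower()
    hits = [i for i in (lowered.find(t.lower()) for t in terms) if i >= 0]
    if not hits:
        # No term found: fall back to the opening of the text.
        return text[:width]
    start = max(0, min(hits) - width // 2)
    return text[start:start + width]
```

When none of the terms appear, this falls back to the current behavior of showing the opening characters.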

And since I’m developing my whole Personal Publishing System in an open process, I’ll write up a detailed technical article soon on how to effectively use MySQL fulltext searching and show Google-like results. All real-world; the code will be cribbed right out of my search.php file.

Thoughts on Content Management

I’ve been thinking a long time about content management systems (which isn’t surprising considering developing various types of website CMSes is what I do for a living), how they pertain to weblogs and similar types of content, how to implement them in PHP and MySQL, and what type of system I would really like to have. Now, content management is a big topic, so let me clarify and narrow down what I’m talking about before I go on.

Some definitions
A piece of content can be anything—a blog entry, a fragment of text, a photo, an MP3 file, a recipe for carrot cake, a Palm Reader ebook, a scrap of a note written on a yellow sticky pad. A lot of what defines and contextualizes the content is the metadata that goes along with it—the date it was created, the size of the file, the author, the image format, where it was created, etc. Now, granted, different types of content can have vastly different types of metadata; for instance, a JPEG image taken with a digital camera will have attributes attached to it describing its resolution, compression quality, file size, camera specs, and date and time it was taken, while a piece of GIS data will have, say, latitude and longitude attributes, elevation, and place name information (which could be any or all of street name, city name, county name, etc.).

Some requirements
After using and extending my own homebrewed blog software for over a year and a half, examining other systems like Movable Type, and getting lots of ideas from other blogs and smart folks online, I’ve decided that what I’m thinking about is what I call a Personal Publishing System (PPS?), which could be considered a subset of a CMS. The PPS should have some features of a CMS, but certainly doesn’t need all of them. Allowing multiple users to manage content is okay, for instance, but a comprehensive workflow system is unnecessary: just being able to flag a content item as a draft or final version, and perhaps an approval tag, is all that’s needed. Here’s a list of some requirements I’d like to see in my PPS:

  • Web based.
  • Any type of content and its metadata can be handled.
  • Each piece of content has a globally unique identifier (“guid”) of some kind.
  • Each piece of content can be accessed/retrieved via a URL (probably incorporating the guid).
  • Content can be published in any format: HTML (browsers), RSS (syndication/aggregators), PDF, etc.
  • Content can be categorized based on a hierarchical tree of categories. In fact, content can be assigned to multiple categories.

My general philosophy here is that I want to challenge my own notions about what constitutes a blog and see how far I can take it. Hubris, probably.

Database theory
A well-formed and normalized database would rightly split different types of content into their own properly modeled tables, which is the sane, efficient and right thing to do. I love data normalization, and I take a particular joy in modeling a data structure to a relational database and normalizing the hell out of its elements.

In fact, as any Web application developer using a relational database will tell you, this is critical; the database is one of the biggest bottlenecks in the entire system, and it can be Web suicide for even a moderately-loaded site to have unoptimized tables behind your code.

On the other hand, there is a drawback in trying to run a content management system this way: for every new type of content you want the system to handle, you have to create a new table (or several, depending on how normalized you want to get) and then add code into your system for handling the new table(s). (Okay, astute PHP programmers will realize you could create a master table that contains information and metadata about the new tables, and have PHP code that automagically handles the new tables based on this master table info—so you would only have to create the new tables and the system auto-populates the master table info and knows how to deal with that content in a general way. You wouldn’t have to recode for new additions. I’ve done this. It works reasonably well, considering.) Pretty soon, you’ve got so many tables handling every different case you can think of, that database performance degrades regardless of how optimized each table is. And managing potentially hundreds of tables becomes a nightmare in logistics.

Left field
So of course, in imagining a theoretical structure for my PPS, I went slightly insane and threw this stuff out the window. Here’s the gist of it:

Treat every piece of content as the same as every other, and store it all in a single table. Preposterous? Probably. But bear in mind that there will be a common set of metadata attributes that every piece of content will have (at least in this context): a unique name or identifier (the guid), a date it was created, a title, a description. And of course, there would have to be a “body” field for the content itself. Roll those into the table structure.

What about different types of content, say text versus images? Easy: include a MIME type field in the table that defines the content type (“text/html” or “image/jpeg,” for instance). (You could store the actual binary data of an image in a file somewhere, linked to by the guid stored in the name field.)

Let’s look at this real quick in the context of a MySQL table:

   content_id -> Primary key
   name -> varchar (unique key)
   title -> varchar
   description -> text (probably will be >255 characters)
   date_created -> datetime
   mime_type -> varchar (possibly enum?)
   body -> mediumtext (large data sets, up to 16MB)
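
To make that layout concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for MySQL (an assumption made purely so the example runs anywhere). SQLite has no mediumtext, datetime, or enum types, so plain TEXT is used, and `make_db` is a hypothetical helper.

```python
import sqlite3

# SQLite approximation of the single content table described above.
SCHEMA = """
CREATE TABLE content (
    content_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,  -- the guid
    title        TEXT,
    description  TEXT,
    date_created TEXT,                  -- datetime in MySQL
    mime_type    TEXT,                  -- e.g. 'text/html', 'image/jpeg'
    body         TEXT                   -- mediumtext in MySQL
);
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database holding the content table."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db
```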

That handles the basic metadata, and could be sufficient for something like a weblog. But what if I want to add some content that has additional metadata that the table doesn’t account for—like a geocaching record, and I want to track latitude and longitude coordinates somewhere? I can’t add more fields to the table—that’s a loser’s game for (I hope) obvious reasons. Once I had settled on the idea of a MIME type field, the answer seemed clear: XML. Bake XML into the database structure as content.

To be clearer: set the MIME type of that piece of content to “text/xml” and then populate the body field with XML data of the content in question, with the extra metadata fields rolled into it as part of its XML definition. So, you might populate the body field with something like:

   <content type="geocache">
      <latitude>45.6684776</latitude>
      <longitude>-121.3394771</longitude>
      <dateHidden>2003-12-05</dateHidden>
      <cache type="traditional" name="coffee can">
         <item>Spiral-bound logbook</item>
         <item>Yo-yo</item>
         <item>Deck of cards</item>
      </cache>
   </content>

What I like about this idea is its object-oriented analogy: start with a basic definition for content—a “class”—and each instance of content inherits from the base class and, via XML, can extend the base class for itself.
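
The analogy can be sketched in Python: a base record carries the common fields, and the XML body supplies whatever extra fields a given item needs. The `Content` class and its `extra_metadata` method are hypothetical names for illustration, and only simple leaf elements are read here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Content:
    """The common fields every piece of content shares."""
    name: str
    title: str
    mime_type: str
    body: str

    def extra_metadata(self) -> dict:
        """Return the simple child elements of an XML body as a dict."""
        if self.mime_type != "text/xml":
            return {}
        root = ET.fromstring(self.body)
        # Keep leaf elements only; nested ones such as <cache> are skipped.
        return {child.tag: child.text
                for child in root
                if len(child) == 0 and child.text}
```

For the geocache record above, this would yield the latitude, longitude, and dateHidden values as strings, leaving the nested cache element for more specific code to interpret.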

There are limitations to account for, as well. Not all types of data can be easily shoehorned into this model, and those shouldn’t be forced into it. For instance, a voting system: you need a table to store the poll topics, one to store each option/answer, and at least one more for storing user votes. There would be no sense in trying to hack this into the content table, and the system would suffer if it were. So there’s always room for specialized functionality.

And, I’ve modeled some compromises. Rather than trying to manage the category system as just another type of content (so that you’d end up with parent-child content relationships), I pulled the categories out into another table. It’s cleaner and there’s more benefit to the system this way—I can add a many-to-many lookup table to allow for multiple categorization. (Incidentally, in my PPS, I call these channels, because they might fulfill a purpose beyond that of a traditional category system.)
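
A many-to-many lookup of this kind might look roughly like the following, again using Python’s sqlite3 as a stand-in for MySQL. The table and column names (`channel`, `content_channel`) and both helper functions are assumptions for illustration, not the actual schema.

```python
import sqlite3


def make_channel_db() -> sqlite3.Connection:
    """Create the channel table and the content-to-channel lookup table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE channel (
            channel_id INTEGER PRIMARY KEY,
            title      TEXT NOT NULL
        );
        CREATE TABLE content_channel (
            content_id INTEGER NOT NULL,
            channel_id INTEGER NOT NULL REFERENCES channel(channel_id),
            PRIMARY KEY (content_id, channel_id)
        );
    """)
    return db


def channels_for(db: sqlite3.Connection, content_id: int) -> list[str]:
    """Return the titles of every channel a content item belongs to."""
    rows = db.execute(
        "SELECT c.title FROM channel c "
        "JOIN content_channel cc ON cc.channel_id = c.channel_id "
        "WHERE cc.content_id = ? ORDER BY c.title",
        (content_id,),
    )
    return [row[0] for row in rows]
```

Each row in the lookup table pairs one content item with one channel, so an item can appear in as many channels as needed.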

Another compromise is the concept of content nodes. A content node is basically a grouping that content can be classified into—another lookup table. All the content I write for my blog would be assigned to the “chuggnutt.com blog” node, for instance.

Oops, and don’t forget about a commenting system—user comments (and perhaps ratings?) are a valuable source of metadata for any given piece of content. So I’ve allowed for another table to store comments, rather than making them another type of content, because I want to stay away from the parent-child relationship situation I alluded to above.

Will it all work? I don’t know. The proof is in the pudding, though—I’m working to convert my own blog to this system, so I’ll find out firsthand just how good (or bad) my ideas are. I really don’t think this system is viable to run as a large-scale, enterprise-style content management solution—hence the reason I’m calling this a Personal Publishing System. Incidentally, the working name (or code name, if you will) in the back of my mind for this system is “Spokane.”

I’m making this an open process, too, to solicit comments on my ideas, and hopefully to give ideas to any other people out there looking to write their own systems in PHP. To that end, the next article I’ll post on this topic will move from theory to practice, and I’ll publish the MySQL database schema I’ve been developing (with comments). Exciting stuff!

On Blogs

This is a bit about the blog software I wrote for this site. If you’re into the technical aspects of blogs, or PHP and MySQL, you’ll be interested in this. If not, you can safely skip it and not really miss out on anything.

I’ve taken to calling my home-grown blog software blognutt (“blog + chuggnutt,” very clever, ha-ha), and I’ve noticed recently that at least one person found this out by viewing the HTML source of the site and searching for “blognutt” to see what they could find out. They didn’t find out much. Not for any reason of secrecy or anything like that; I just haven’t talked about it, no real mystery. It’s written in PHP 4 with MySQL on the backend, and that’s about it.

I use the Template and DB classes from the immensely helpful PHP Base Library, though I’ve modified them extensively for my own purposes. The reasons I use phplib instead of another package like PEAR, for instance, are simple: I’ve been using phplib forever so I’m very quick and comfortable with it, it’s easier to use with a much lower-overhead code base, and I’ve already hacked the code to do things I want to do. PEAR is a fine project and I hope to contribute some classes (like my Stemmer class) there someday, but for coding purposes I haven’t used it much. Yet.

Why did I write my own blog software, rather than using one of the many blogging tools already available? I looked at several PHP blog packages, and looked at what other systems like Movable Type offer, but it boils down to the same approach I take to a lot of programming projects: I wanted to hack it out myself, because that’s the best way to learn. So I did. (Plus, I wanted to have absolute control of the software. I’m anal that way.) I started with what I determined were the core elements of a weblog, and made it work.

When you get down to it, a weblog is misleadingly simple: it’s a data retrieval and presentation system. Retrieve the top X most recent items and display them; offer the ability to browse and search past entries, and there you are. The trick is in the execution.

MySQL was the logical choice for the data store. I wanted to be able to sort and group by date, search entries, and make changes to the data structure on the fly. For me, the additional overhead introduced by adding a relational database like MySQL is worth it for the benefits I get, and since I’m doing all the programming I can make it do things that might not be available to users using other packages that they can’t control.

I can do anything any of the other blog software packages can do. (I think. I may be missing something somewhere.) Here’s a list of some common weblog features, and some commentary:

  • Entries: Full HTML, since I control the format and storage. Each entry is tied back to the user (just me so far), date- and time-stamped, and can be flagged as a draft (and therefore not displayed to the public).
  • Permalinks: I just re-implemented these in a bot-friendly and more human-intuition-friendly manner. Now they look like /2003/07/27/name.html rather than the (less friendly but just as workable/legal) /blog_entry.php?content_id=27 style of links.
  • Comments: I’ve got the code in place to handle comments, but I haven’t turned it on yet.
  • Archive: I’ve got archive links, sorted by year and month. I can control the sort and display of archive links by changing a single line of code. You can view the archive by year, month, day, or entry.
  • Search: Another advantage of using MySQL: its fulltext indexing capability, which allows you to do natural language queries against text and returns results by relevance.
  • Categories: Easy. I’ve been thinking about categorizing my entries, but it’ll be a pain in the ass to go back through 90+ entries.
  • Calendar: Just recently implemented the calendar, showing which days have entries.
  • Last X entries: I haven’t implemented this because it seems redundant as I keep the last 10 entries on the front page anyway. It’d be easy to do, though.
  • Blogroll: Fancy name for a list of links to other blog sites. I just put some up last night.
  • Syndication: Using RSS for aggregators. I’ve written the code to produce the XML files for this, which turned out to be extraordinarily easy, but I haven’t turned it on yet largely because I’m nervous about bandwidth issues.
  • Trackback: A way for bloggers to link to other bloggers’ entries such that the blog they’re linking to knows they’re being linked to. Clever. I don’t know if I’ll support it or not, since I’m the only one running blognutt software :-). Plus, I can already find out who’s linking to me from the server log files.
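
The date-based entry URLs in the list above are easy to take apart again on the server side. A minimal Python sketch, assuming the /YYYY/MM/DD/name.html shape shown there; `parse_permalink` is a hypothetical name and the real routing lives in PHP.

```python
import re

# Matches paths shaped like /2003/07/27/name.html
PERMALINK = re.compile(r"^/(\d{4})/(\d{2})/(\d{2})/([^/]+)\.html$")


def parse_permalink(path: str):
    """Return (year, month, day, name), or None if the path doesn't match."""
    match = PERMALINK.match(path)
    if not match:
        return None
    year, month, day, name = match.groups()
    return int(year), int(month), int(day), name
```

The date parts and the name together are enough to look up a single entry, and a shorter path could drive the year, month, and day archive views in the same way.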

There are more issues. One of the selling points of the bigger blogging systems is that you can update your blog from anywhere on the web, using XML-RPC. Well, the admin interface I wrote for my blog lets me update from “anywhere on the web” too: from any computer connected to the web, no special software required, just a browser. Nothing fancy. Seems to me that XML-RPC will require some tool or client utility to use, or some interface somewhere, and I guess I don’t see the appeal in this, except possibly to save you time from opening your admin in another browser window. It’s entirely possible I’m missing something here. Having an XML-RPC interface to a system is cool, I admit, but is it necessary? Maybe someone could enlighten me, here.

(Of course, I’m not developing software for use by a general audience, so my way of doing things may not be appropriate for a large user base, and XML-RPC might make perfect sense in that situation.)

One thing I am interested in using XML-RPC for is pinging sites like Weblogs.com to notify them when this site is updated. It would be easy to do; just include a checkbox on the entry add/edit screen that lets me decide when any changes should be pinged, and have PHP send the XML-RPC packet when the form is submitted. In fact, that will probably be the next change I make to the system.
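
For a sense of what such a ping involves, here is a Python sketch that only builds the XML-RPC request body and does not send anything. The method name `weblogUpdates.ping` is the one commonly used for these update pings, but treat it and the (site name, site URL) parameter order as assumptions to check against the service’s own documentation; `build_ping` is a hypothetical helper.

```python
import xmlrpc.client


def build_ping(site_name: str, site_url: str) -> str:
    """Return the XML-RPC request body for an update ping (nothing is sent)."""
    return xmlrpc.client.dumps(
        (site_name, site_url),
        methodname="weblogUpdates.ping",  # assumed method name, see lead-in
    )
```

Posting that body to the service’s XML-RPC endpoint when the entry form is submitted would complete the ping.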

I have done something cool that I haven’t seen elsewhere (on blog/personal sites, at least): if a user comes to a “top-level” page from a search engine, searching for something specific, I helpfully list up to 3 entries that might be related to what they were searching for. For example, a user searches on Google for “dealing with a strong willed child” (this was an actual search on my site) and follows the results to my site. If they don’t get to a specific entry, and instead come to the home page, or to all the listings for 2002 for instance, then that’s too vague, so I search the database and show the top 3 results for what they might be looking for. Hopefully, this leads to the user exploring the site a little more than just hitting the home page, not finding what they searched for, and leaving.
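
The first half of that trick is pulling the search phrase out of the referring URL. A small Python sketch, assuming the referrer is a Google results URL carrying the phrase in its `q` parameter; the function name is hypothetical and other search engines would need their own parameter names.

```python
from urllib.parse import parse_qs, urlparse


def search_terms_from_referrer(referrer: str):
    """Return the search phrase from a Google referrer URL, or None."""
    parsed = urlparse(referrer)
    if "google." not in parsed.netloc:
        return None
    values = parse_qs(parsed.query).get("q")
    return values[0] if values else None
```

The phrase that comes back can then be fed to the same fulltext search the site already runs, limited to the top 3 results.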

I’ve considered releasing my blognutt software as open source, but that raises an issue I’m not sure I want to tackle yet: support. The other issue is competition; there’s already a lot of weblogging software out there, some of it very good. Do I really want to play keeping up with the Joneses with everyone else, or should I just keep myself happy tinkering with my own system?

All the same, I’ve been going through and cleaning up a lot of the code and modularizing it better in anticipation of a possible release, and there’s still more to finish. If you’re interested in chatting about blogs, or seeing my code, drop me a line and let’s chat.