Data Mining the Web

An interesting article today on MSNBC titled “Online search engines lift cover of privacy”, and the “InfoPorn” section of February’s Wired (can’t find a link, sorry) highlighting identity theft motivated me to write about a topic I’ve been thinking about for a while now: data mining the Web.

The article talks about the absurd amount of information that is freely available on the Web, and how much of it is accessible through Google—and then calls using Google to find this data “Google hacking.” I think a more accurate term would be Google mining—there’s really no mad hacker v00d00 ski11z involved, and let’s face it, being able to run a realtime query against a massive database containing billions of pieces of information is really the essence of data mining.

What got me thinking about mining the Web? Most recently, social networking software, and the data such software collects from its users. As I’ve written before, what a useful social networking system will do (among other things) is let you crawl the relationships among people and drill down by varying degrees into their data/life/online platform. But you know, you can already essentially do this with nothing more than a Web browser; it all goes back to the fact that there is an absurd amount of information freely and publicly available on the Web, much of it cheerfully self-published by people who should know better.

Example? Resumes. You’ve all seen them; half the personal sites out there have an online resume page, and you can find at least 45,300 more by searching Google for “resume.doc”. On average, they contain a shocking amount of personal information: what schools you went to, and when; who employed you, and when; your address and phone number; your skills; sometimes your Social Security number. Tip of the iceberg.
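
To underline how little "hacking" is involved, here is a rough sketch of what a search like that boils down to: a plain URL with the search terms in the query string. The endpoint and the `q` parameter name are my assumptions about how the search URL is put together, and `build_search_url` is just a hypothetical helper for illustration.

```python
from urllib.parse import urlencode

# Assumption: the search endpoint and the "q" parameter name are illustrative,
# not something spelled out in this post.
SEARCH_ENDPOINT = "https://www.google.com/search"


def build_search_url(terms: str) -> str:
    """Return a search URL with the given terms URL-encoded into the query string."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': terms})}"


if __name__ == "__main__":
    # The same search you would type into the browser's search box.
    print(build_search_url("resume.doc"))
```

Nothing here goes beyond what a Web browser already does when you type into the search box, which is rather the point.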

You can find out a lot about someone simply by reading their blog. My own is no exception, I’m sure, but sometimes even I’m amazed at how much personal detail people will reveal online.

And did you know you can search for wishlists at and often a user’s wishlist will also contain their birthday and the city and state in which they live? If that doesn’t work, try finding someone’s birthday on—they boast having over 130 million entries gleaned from public records.

Here’s where it gets tricky. The MSNBC article takes an alarmist tone, and in part it’s right to do so: companies and people that leave sensitive documents published on a crawler-accessible Web page are in danger of having their privacy violated. However, a lot of the information that’s out there is already public information, or information that’s freely volunteered by people and becomes public. Google is merely a tool that aggregates this information into one source. And me? Hell, I love Google, I frankly think it’s amazing. And I’m an information junkie, I salivate over the data mining possibilities—and I’ve got ideas rolling around my head on what could be done with this data, ways it can be manipulated, and linked, and so on.

We’ve barely scratched the surface when it comes to mining the Web—I think the untapped possibilities we’re sitting on are enormous, potentially dwarfing anything we’ve previously encountered. Google is a first step.

What’s next?