The Google Platform

I’ve already seen several links to this today (the first from UtterlyBoring), and it’s too interesting not to point to.

The post in question posits this: Google is a platform. Not a “platform” in the same sense that Amazon and eBay are platforms (custom Web applications that allow some programmatic user interfaces), but an actual computer/operating system/development platform. It’s something I had suspected for some time, but I’ve never managed to coalesce my thoughts this succinctly.

What is this platform that Google is building? It’s a distributed computing platform that can manage web-scale datasets on 100,000 node server clusters. It includes a petabyte, distributed, fault tolerant filesystem, distributed RPC code, probably network shared memory and process migration. And a datacenter management system which lets a handful of ops engineers effectively run 100,000 servers….

Google is a company that has built a single very large, custom computer. It’s running their own cluster operating system. They make their big computer even bigger and faster each month, while lowering the cost of CPU cycles. It’s looking more like a general purpose platform than a cluster optimized for a single application.

While competitors are targeting the individual applications Google has deployed, Google is building a massive, general purpose computing platform for web-scale programming.

It’s one of the better tech reads I’ve seen in a while. Very eye-opening.

Now, of course, my curiosity is taking hold, and I’d love to take a crack at developing for that platform!

Conspiracies in Web Tracking

Despite my headline, I’m not really going to go all Mulder on you and start ranting about Big Brother and privacy issues and all that. Instead it’s just some thoughts I’ve been entertaining lately on technology and tracking people and habits on the Web. Some people may choose to see the things I’m writing about as conspiratorial, and that’s fine for them; they may not want to read on, though :)

Google Image Search

Playing around with Google’s image search, I’ve thought of some advanced search features they need to implement. Hopefully someone at Google is reading this and will get right on it ;)

You need to be able to search by specific image dimensions (in pixels); for example, I’d like to be able to type “width:80 height:15” or maybe “dimensions:80x15” and have Google return all the images that are 80 by 15 pixels (yes, this idea is directly related to my last post on the 80×15 images). This can’t be hard; Google’s already caching the size of the image and displaying that on the search results pages, so why not be able to search them?
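To make the idea concrete, here’s a rough sketch in Python of how those operators could be parsed and applied. To be clear, this is purely my own illustration: the operator syntax is just the one I suggested above, and the `Image` record, `parse_query`, and `filter_by_size` are hypothetical names that have nothing to do with how Google actually stores or searches its image data.

```python
import re
from dataclasses import dataclass


@dataclass
class Image:
    """Hypothetical record for a crawled image (not Google's real schema)."""
    url: str
    width: int
    height: int


# Assumed operator syntax from the post: width:N, height:N, dimensions:WxH
_OPERATOR = re.compile(r"\b(width|height|dimensions):(\S+)", re.IGNORECASE)


def parse_query(query):
    """Split a query into free-text terms and a dict of size constraints."""
    size = {}
    for name, value in _OPERATOR.findall(query):
        name = name.lower()
        if name == "dimensions":
            # "80x15" -> ("80", "x", "15")
            w, _, h = value.lower().partition("x")
            if w.isdecimal() and h.isdecimal():
                size["width"], size["height"] = int(w), int(h)
        elif value.isdecimal():
            size[name] = int(value)
    # Whatever is left after removing the operators is ordinary search text.
    terms = _OPERATOR.sub("", query).split()
    return terms, size


def filter_by_size(images, size):
    """Keep only the images that match every size constraint exactly."""
    return [
        img for img in images
        if all(getattr(img, key) == wanted for key, wanted in size.items())
    ]
```

Calling `parse_query("button width:80 height:15")` gives the terms `["button"]` plus a width/height constraint, and `filter_by_size` then narrows a list of `Image` records down to the exact matches. Since the dimensions are already being stored, the filtering part really is the easy bit.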

Is Google Broken?

Elsewhere on this site I’ve stated that I love Google. That still mostly holds true, but there have been some things about Google lately that are giving me pause.

The first concerns Google’s apparent abandonment of RSS for (exclusively) the still-incubating Atom syndication format/API. I won’t bother rehashing the situation here; if you want all the gory details, check out this wonderfully recursive-ironic Google search for “google atom.” To me this seems like a highly questionable, even irresponsible, move for Google to make, and frankly a rather surprising one. Hopefully they’ll come to their senses over there.

The other thing deals with their AdWords program. I think it’s broken. Here’s the deal: We’ve been toying with AdWords to run ads on a new project we’re working on, to see how the system worked and if it would be worth it to ramp it up. (Side note: very cool. You can get a nice in-depth look at Google’s internal keyword rankings without ever putting any money down.) Well, it worked for a while, we were very impressed, but then suddenly, over the weekend sometime (I think), it stopped working.

Completely. Our ad no longer shows up on the exact same searches it was showing up under before. In fact (and here’s the biggest clue that something is seriously broken), as you page through the results, the exact same ads that appeared on the first page of results appear on every subsequent page.

WTF?

This did not happen before and should not be happening now. Something is broken. Period. For at least a week. Could it have something to do with Google doubling their index to over 6 billion items (4 billion web pages)? Maybe.

Ideas?

Data Mining the Web

An interesting article today on MSNBC titled “Online search engines lift cover of privacy,” along with the “InfoPorn” section of February’s Wired (can’t find a link, sorry) highlighting identity theft, motivated me to write about a topic I’ve been thinking about for a while now: data mining the Web.

The article talks about the absurd amount of information that is freely available on the Web, and how much of it is accessible through Google—and then calls using Google to find this data “Google hacking.” I think a more accurate term would be Google mining—there’s really no mad hacker v00d00 ski11z involved, and let’s face it, being able to run a realtime query against a massive database containing billions of pieces of information is really the essence of data mining.

What got me thinking about mining the Web? Most recently, social networking software, and the data such software collects from its users. As I’ve written before, what a useful social networking system will do (among other things) is allow you to crawl the relationships among people and drill down by varying degrees into their data/life/online platform. But you know, you can already essentially do this with nothing more than a Web browser; it all goes back to the fact that there is an absurd amount of information freely and publicly available on the Web, much of it cheerfully self-published by people who should know better.

Example? Resumes. You’ve all seen them; half the personal sites out there have an online resume page, and you can find at least 45,300 more by searching Google for “resume.doc”. On average, they contain a shocking amount of personal information: what schools you went to, and when; who employed you, and when; your address and phone number; your skills; sometimes your Social Security number. Tip of the iceberg.

You can find out a lot about someone simply by reading their blog. My own is no exception, I’m sure, but sometimes even I’m amazed at how much personal detail people will reveal online.

And did you know you can search for wishlists at Amazon.com and often a user’s wishlist will also contain their birthday and the city and state in which they live? If that doesn’t work, try finding someone’s birthday on Anybirthday.com—they boast having over 130 million entries gleaned from public records.

Here’s where it gets tricky. The MSNBC article takes an alarmist tone, and in part it’s right to do so: companies and people that leave sensitive documents published on a crawler-accessible Web page are in danger of having their privacy violated. However, a lot of the information that’s out there is already public information, or information that’s freely volunteered by people and becomes public. Google is merely a tool that aggregates this information into one source. And me? Hell, I love Google, I frankly think it’s amazing. And I’m an information junkie, I salivate over the data mining possibilities—and I’ve got ideas rolling around my head on what could be done with this data, ways it can be manipulated, and linked, and so on.

We’ve barely scratched the surface when it comes to mining the Web—I think the untapped possibilities we’re sitting on are enormous, potentially dwarfing anything we’ve previously encountered. Google is a first step.

What’s next?