Bots and JavaScript

Here’s something to think about: do any search engine bots and crawlers recognize and parse JavaScript? I haven’t heard of any (and I’m really too lazy right now to do any real research :) ), but I got to thinking about this today, and there’s really no reason that they shouldn’t be able to handle it.

Sure, there’s a lot of cruft and dross in JavaScript code that isn’t relevant in a searchable context, but what about something like what I’ve been working on recently: dynamic menus? Each menu item points to a valid page with some contextual link text, but since the menus are generated in JavaScript, the search engine process parsing the content out of the code might easily pass them up and miss the links. In my case those same links are repeated in the actual content of the page, so they’ll be picked up for sure, but what about next time?
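
To make the problem concrete, here is a minimal sketch of a script-generated menu. The item list, the URLs, and the buildMenu name are all made up for illustration and are not taken from any real site. The point is that the links exist only as string data until the script runs, so a crawler that reads just the HTML never sees an anchor tag.

```javascript
// Hypothetical menu data: in a script-driven menu the links live here,
// not in the page's HTML.
const menuItems = [
  { text: "Home", href: "/index.html" },
  { text: "Articles", href: "/articles/" },
  { text: "Contact", href: "/contact.html" },
];

// Build the menu markup as a list of anchors, returned as one string.
function buildMenu(items) {
  const links = items.map(
    (item) => '<li><a href="' + item.href + '">' + item.text + "</a></li>"
  );
  return "<ul>" + links.join("") + "</ul>";
}

// In a browser the markup is written into the page while it loads.
// Outside a browser there is no document, so nothing is written.
if (typeof document !== "undefined") {
  document.write(buildMenu(menuItems));
}
```

A bot that does not execute the script sees only the array and the string concatenation, never the finished list of links.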

Of course, then it would be easy to abuse search engine rankings, by stuffing JavaScript full of hidden and obfuscated content. Perfect for the snake oil of Search Engine Optimization. Even so, though, there might be a lot of content or linkage going unnoticed…

10 comments

Dennis Pallett says:

April 13, 2004 at 3:22 am

It’s rumoured that Google is beginning to parse JavaScript. Have a look at http://www.markcarey.com/googleguy-says/archives/google-superbots-coming.html and http://www.markcarey.com/googleguy-says/archives/google-crawls-javascript.html
Jon says:

April 13, 2004 at 11:41 pm

Very interesting. I wonder, as an experiment, if someone were to put content ONLY somewhere inside JavaScript code, if Google would pick it up. Or would they consider that SEO spam? Hmmm…
Jesse Thompson says:

April 14, 2004 at 12:25 am

Well I haven’t checked out Dennis’ links yet, but I’d have to imagine that bots won’t be able to parse out URLs in JavaScript code with any real certitude. For instance:

var url="http://www.abc.com/";
document.write("");

maybe, but then things like..

function link(domain, uri)
{
document.write("");
}

get a little more complex. Of course you could just run a page through a JavaScript engine and obtain the links that get written with document.write or created through the DOM by farming the resulting HTML, but a lot of JavaScript is based on interaction with the user, and without interaction important links may never get created. And then finally they still wouldn’t be able to properly spider HomestarRunner.com until they begin delving into Flash and its embedded ActionScript goodness 😉

All of these reverse-engineering based approaches are mighty inefficient to boot. The ideal would be a nifty way to discourage Search Engine Spamming coupled with simply trusting website authors to provide an auxiliary list of links in the metadata somewhere (in meta headers, in RSS feeds, there are lots of possibilities), combined with the normal kind of spidering just to nab any links the author forgets to explicitly publish. Sure, discouraging Search Engine Spam would be difficult, but it’s already difficult. Anyone can put a link in a div and set that div’s display style to "none" in JS just after the div displays, and Google really doesn’t have a way to know it happened. So if you solve that problem you can trust the author, and if you can trust the author you don’t have to reverse engineer his site. 🙂

– – Jesse
Jesse Thompson says:

April 14, 2004 at 12:27 am

😛 In the above code examples I was writing an anchor tag which used the mentioned variables. It looked like an HTML tag and got baleeted. <grin>
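
Since the tags were stripped, the original snippets are lost. What follows is only a guess at their shape, assuming each one wrote an anchor tag built from the variables shown. The link text, the exact markup, and the linkMarkup and simpleLinkMarkup helpers are invented here so the strings can be inspected without a browser.

```javascript
// Case 1: the URL is a literal in the source, so a bot could find it
// just by scanning the script text.
var url = "http://www.abc.com/";

function simpleLinkMarkup() {
  return '<a href="' + url + '">abc</a>';
}

// Case 2: the URL is assembled from arguments at call time, so the full
// address never appears anywhere in the source.
function linkMarkup(domain, uri) {
  return '<a href="http://' + domain + uri + '">' + domain + "</a>";
}

function link(domain, uri) {
  document.write(linkMarkup(domain, uri));
}

// Only write into the page when there is a page to write into.
if (typeof document !== "undefined") {
  document.write(simpleLinkMarkup());
  link("www.abc.com", "/about.html");
}
```

In the second case a bot would have to evaluate the call to link() to learn the final address, which is the complication the comment above describes.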
Jake Ortman says:

April 14, 2004 at 11:35 am

I would actually prefer that they don’t index javascript stuff for a couple reasons:

1) I’ll sometimes use a JavaScript-based link as added protection for keeping a bot out of a particular location (on top of my usual robots.txt file). I want users to go there, but I don’t want the location cached by bots.

2) More times than not, there’s really no need to force-feed a javascript link, and designers/programmers need to know how to do things right (on the server end) with a better language like PHP. I’ve seen so many folks who have written up horrendously complicated javascript that has to be processed in the browser for stuff that is stupidly simple in PHP. Don’t pawn the processing off on the browser because you don’t know how to code, folks 🙂
Jon says:

April 14, 2004 at 2:05 pm

Jesse-

Yeah, sorry about that. The code is stripping tags, so sometimes things look wonky as a result.

By and large, I’m not worried about interactive JavaScript, or JavaScript that needs to be executed, but let’s call it "passive": variables, hard-coded URLs, comments, things like that. And even in the dynamic examples you give, there’s still an instance of the URL coming from *somewhere* (unless it’s the user) in the page, so a bot could find and extract those. (Mostly what I had in mind was navigational JavaScript.)
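
A rough sketch of that "passive" approach: scan the script text for literal URLs without executing anything. The extractUrls name, the regular expression, and the sample script are assumptions made for illustration, and the pattern is deliberately simple (absolute http and https URLs only).

```javascript
// Find absolute http/https URLs that appear literally in script text:
// string literals, variable initializers, comments. Nothing is executed.
function extractUrls(scriptSource) {
  const pattern = /https?:\/\/[^\s"'<>)]+/g;
  const matches = scriptSource.match(pattern) || [];
  // Drop duplicates while keeping first-seen order.
  return Array.from(new Set(matches));
}

// Hypothetical script: one hard-coded URL, one URL in a comment, and one
// URL that is only assembled at run time.
const sample = [
  'var url = "http://www.abc.com/";',
  "// docs: http://www.example.com/menu-help.html",
  'function link(domain, uri) { return "http://" + domain + uri; }',
].join("\n");

console.log(extractUrls(sample));
```

On this sample the scan finds the hard-coded URL and the one in the comment, but not the address assembled inside link(), because that string never appears whole in the source. It also shows how URLs in library comments would get swept up along with real navigation.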

As far as trusting the author of a site, all well and good, but what about when the author is under design constraints? The design calls for JavaScript navigation/dynamic menus, and we all know the customer is always right 😉 So the design requires JavaScript, and there are no "contingency" links accounted for in the design… it gets messy.

And what about this? In some (many) available JavaScript libraries, the authors embed their info and documentation in the comments of the code. Imagine how *that* could wreak havoc with site indexing…
Jon says:

April 14, 2004 at 2:12 pm

Jake-

1) What guarantee is there that the bot doesn’t already know JavaScript and isn’t caching the "hidden" location already? I know you can check the logfiles and see whether or not a bot actually hit that location, but my point is that there are no guarantees 🙂

The only way to ensure a bot doesn’t get there is to require authentication of some sort.

2) Agreed. Simply write 2 or 3 lines of PHP code that looks at HTTP_USER_AGENT, and if it’s not a known bot, display the link.
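
The check in point 2 is described for PHP, where the value comes from HTTP_USER_AGENT. To keep the examples on this page in one language, the same idea is sketched here in JavaScript instead. The isKnownBot and protectedLink names are made up, and the marker list is a tiny illustrative sample, not a complete or maintained list of crawlers.

```javascript
// Illustrative substrings only; real bot lists are much longer and change.
const KNOWN_BOT_MARKERS = ["googlebot", "slurp", "msnbot"];

// True when the user agent string contains one of the known markers.
function isKnownBot(userAgent) {
  const ua = (userAgent || "").toLowerCase();
  return KNOWN_BOT_MARKERS.some((marker) => ua.includes(marker));
}

// Return the link markup for ordinary visitors and nothing for known bots.
function protectedLink(userAgent, href, text) {
  return isKnownBot(userAgent) ? "" : '<a href="' + href + '">' + text + "</a>";
}
```

This only filters bots that identify themselves honestly. A user agent string is easy to fake, so the earlier point still stands: authentication is the only real guarantee.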
Jake Ortman says:

April 14, 2004 at 2:26 pm

I’m not super concerned if the bot does follow the link (I know there are no guarantees, and I’m sure there are bots out there that read robots.txt just to find and index the pages listed in it); I would just prefer them not to. I also use document.write when I need the person to have JavaScript enabled on the next page. For example, a JavaScript document.write writes the button on the "Book This Home" form on this page:

http://www.sunrayinc.com/propview2.php?view=168

Pages beyond that form reside on a server that relies heavily on Javascript (not my doing, outside provider wrote the code, and if I f**k with it too much, I void our support agreement — damn asp code).
Jon says:

April 14, 2004 at 2:40 pm

Yeah, I use the document.write trick to check for JavaScript, too. Usually I have it write a hidden element named "has_js" or something into a form.
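
A minimal sketch of that trick, assuming the script sits inside the form element so the field lands in the right place. Only the "has_js" name comes from the comment; the hasJsField helper and the value "1" are arbitrary choices for illustration.

```javascript
// Markup for the hidden field. The server can look for "has_js" in the
// submitted form data to tell whether the script ran.
function hasJsField() {
  return '<input type="hidden" name="has_js" value="1">';
}

// Browsers with JavaScript disabled never run this, so the field is simply
// absent from the submission. Outside a browser there is no document.
if (typeof document !== "undefined") {
  document.write(hasJsField());
}
```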

Fortunately, most people leave JavaScript enabled in their browsers these days.
Jon says:

May 4, 2005 at 1:27 pm

It is a good thing that Google ignores JavaScript. I don’t think it is such a spectacular thing to use. If there is some barrier that means you cannot do something in _any_ other way, it is probably OK, but it is not OK to juice up your page with all sorts of stuff just for the sake of making it look cooler. That is what CSS, graphics and HTML are for.

Comments are closed.
