October 16, 2007

Google, the Primadonna

A few days ago I tried to be a nice guy to the search engines. My web server (Tomcat) adds a parameter to URLs if a client has no cookies enabled, so that session tracking still works. Usually this does not matter, but the robots of search engines like Google generally don’t support cookies, and as a result they index a lot of superfluous pages with the jsessionid in them. So I wrote some code for my application to remove those session ids, and even added a 301 redirect for old URLs that still included the jsessionid. This should tell the search engines to remove the old URLs from their database and replace them with the new URLs. As an example, this is how the “ugly” URLs look to Google: http://mondhandy.de/mondkalender.html;jsessionid=6F330BA76565A5100BE997DACD7D27BA?monat=9&jahr=2012&tag=27. It should simply be replaced with http://mondhandy.de/mondkalender.html?monat=9&jahr=2012&tag=27
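To make the rewrite concrete, here is a minimal sketch of the URL cleanup described above. It is not the code running on the site; the class and method names are made up for illustration, and it only covers the string handling, not the servlet plumbing.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: names and structure are illustrative, not the original code. */
final class SessionIdStripper {

    // Matches ";jsessionid=<value>" up to the next '?', '#', ';' or the end of the string.
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[^?#;]*", Pattern.CASE_INSENSITIVE);

    private SessionIdStripper() {}

    /** True if the URL still carries a session id and should be redirected. */
    static boolean hasSessionId(String url) {
        return JSESSIONID.matcher(url).find();
    }

    /** Returns the URL with the session id removed; everything else is left untouched. */
    static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String ugly = "http://mondhandy.de/mondkalender.html"
                + ";jsessionid=6F330BA76565A5100BE997DACD7D27BA?monat=9&jahr=2012&tag=27";
        if (hasSessionId(ugly)) {
            // In a web application this would become the Location header of a 301 response.
            System.out.println("301 -> " + strip(ugly));
        }
    }
}
```

In a servlet filter the cleaned URL would go into the Location header together with status 301. Note that HttpServletResponse.sendRedirect() answers with a temporary 302, so the status and the header have to be set by hand to get a permanent redirect.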

Today I did the usual search for “Mondkalender” in Google.de. Usually Mondhandy.de is found around page four of the search results, which is bad. Today, however, it appeared nowhere. Ooops… I had changed a few other details recently, and I had started AdWords two days ago, but I think the most likely cause must have been the URL redirection thing. Indeed, Google Webmaster Tools listed a few pages (15) with “Network unreachable” errors, all with the old jsessionid style. So my guess was that Google stopped crawling after getting a few errors. Naturally, I frantically started looking for a cause.

There were some confusing aspects to this: for one thing, Tomcat doesn’t log the jsessionid in URLs, so if a request like the one above comes in (with the jsessionid parameter), in the log file it just looks like a request for the same URL without jsessionid. It took me a while to figure that out (and it annoys the hell out of me) – still, I have lots of requests that were answered with a 301 and then followed by a 200 for the “same” URL (presumably without jsessionid), so that seemed to do the right thing.

Another confusing thing was that Firefox apparently also filters the jsessionid parameter if cookies are disabled – at least the parameter doesn’t show up in HTTPLiveHeaders (brilliant Firefox plugin, btw.). That was really confusing for a while when trying to understand the behaviour of my redirection code: sometimes it would redirect, sometimes it wouldn’t.

At last I experimented with command line tools like curl and wget and finally arrived at the (tentative) conclusion that my code works fine. Most likely it was just an accident that led to the errors for the Googlebot. It could be that I restarted the server just when the Googlebot came looking. By Murphy’s law, that seems quite likely. Still, I find it rather extreme that Google punishes my whole site because of a few errors (hence the “Primadonna” title). But I guess I will just wait and see if the situation normalizes after a few days.
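For anyone who prefers to script this kind of check, here is a rough Java equivalent of looking at the response headers with curl -I. It is only a sketch under my own assumptions: the class name and the output format are invented, and it reports nothing but the status code and the Location header.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical probe: shows the status and Location header without following redirects. */
final class RedirectProbe {

    private RedirectProbe() {}

    /** Returns e.g. "301 http://example.org/page", or "200 -" if there is no Location header. */
    static String probe(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setInstanceFollowRedirects(false); // we want to see the 301 itself
            conn.setRequestMethod("HEAD");          // headers are enough for this check
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            return status + " " + (location == null ? "-" : location);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the URLs to check on the command line, with and without ;jsessionid=...
        for (String url : args) {
            System.out.println(url + " => " + probe(url));
        }
    }
}
```

Unlike a browser, this sends exactly the URL it is given, so a jsessionid in the path is not silently filtered out before the request leaves the machine.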

Meanwhile, I found a web site that lets you check your site stats and position in the search results directly, thanks to a pointer in the Xing internet marketing group: www.ranking-check.de seems to still have access to the legendary Google API (lucky bastards). They also have a handy list of web directories with their page rank. I guess I should add my web site to more of those catalogues, but something inside of me is reluctant to do so. It just doesn’t feel right, Pagerank algorithm schmalgorithm. I have heard rumors that Pagerank is becoming less and less important for Google, too. I just resent having to twist and betray my principles just to look good for Google.

Another web site made me think today, too: What if Google had to design their user interface for Google? (found on news.ycombinator) – I think I will at least remove those bookmarklet icons (“Social-Bookmark-Spam Facilitators”) from my front page again. I have also checked, and so far nobody seems to have used them anyway. Just as I had expected, though some people had told me that their users were clicking on them. Anyway, I liked that page; it really is crazy to what lengths people go to make their pages look good for Google.

Actually, for a while now I have been wondering about a “new” search algorithm that I have termed “girl-friend search”: most dating advice goes along the lines that if you are interested in someone, you should act as if you are especially uninterested, and that will get their attention the fastest way. So if Google has already implemented that algorithm, not doing anything to get into Google’s index might be the best strategy. Is that strategy (only get interested if the other is not interested) a stupid strategy? I suspect that since it has survived many human generations, there is probably some advantage to using it (some game theory might be able to show it). So eventually Google or another search engine might pick it up.

Last Google anecdote: tonight there is an open house event for Google in Munich. They have been in Munich for a while with a division focussed on mobile services, but apparently they have something new going on that they want to tell the world about. In a typical Google way, they sent out invitations with a twist: if you got an invitation, you could register for the event at their web site and add another person you would also like to invite. Google would then send the same invitation to that person, who could invite another person and so on. Actually I suspect the “invite another person” thing was completely unnecessary, as the link in the invitation email has no personalization. So anybody could just have invited themselves. But doing it the Google way, Google now has a nice social network graph of the developer (or tech?) scene of Munich for free. They are ever so clever!

By now they have also revealed some details about the event: apparently some Bavarian politicians (economics department) are also going to be there, presumably giving a speech about how they proudly wasted some tax money. No offence to our politicians, but I would be a lot more interested in what Google has to say… Hopefully that will be interesting, anyway.


