07.07.07
Just basic things: sitemap.xml and robots.txt
It’s a small world: yesterday I went to a party, and a friend of mine introduced me to one of his friends who has also programmed a moon calendar. I didn’t think it was that common.
Apparently I got the URL wrong, but I hope he’ll mail it to me.
–
Another thing I learned: it is probably a good idea to link to other pages you like, because people tend to look at who has linked to them, and check back on your site.
–
When I began working on eliminating the session IDs that my web server encodes in the URLs on my page, I realized that Sage, my news reader of choice, also causes my browser to pick up some cookies on each startup. I have configured Firefox to allow only session cookies, which means all cookies are deleted every time I close Firefox (basic internet hygiene). However, Sage checks the feeds of the blogs I have subscribed to on every startup. And apparently the feeds can set cookies, too.
Cookies are a way for web pages to track you. They also have “legitimate” uses, but tracking is a common application. While it does no visible harm for the time being, I simply don’t like it. Nobody knows for sure what kind of inferences can be made from my movements on the web. Just from Sage, they probably don’t see my movements, only the times when I start Firefox. Maybe they can diagnose me with ADD just from that, or detect that I have irregular sleeping patterns. They know that I am probably an office worker, because if I were, say, a lumberjack, I wouldn’t be able to surf the internet all the time. I don’t want strangers to infer anything about me, if possible…
I have now configured Sage to check the feeds only when I tell it to do so manually. It’s probably all futile anyway, as there are other ways web pages might be able to track you, but still.
I haven’t managed to eliminate the session IDs from my URLs yet (it’s just a small configuration setting, but it’s hidden somewhere in the Servlet specification). The Googlebot seems to have had some problems with them, but I hope that will improve now that I have submitted a sitemap. There seems to be a school of thought that recommends against the use of sitemaps, but I am not convinced. At least I am not convinced by the arguments from that article: what the search engines consider to be “good link juice” might be different from what is good structure for human visitors. Google was visiting my pages in a weird way, and if the sitemap can fix that, why not give it a try? I am pleased that I don’t even need a link from my web page; the sitemap is a separate file just for the search engines.
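For illustration, this is roughly what the session ID problem looks like. Servlet containers that fall back to URL rewriting append the session to the path as a ;jsessionid=... parameter. The sketch below is not the server-side fix (that remains a container setting); it only strips the parameter from a URL, for example when collecting clean links for a sitemap. The function name and the example.com URLs are made up for this example.

```python
import re

# Matches the ";jsessionid=..." path parameter that servlet containers
# append when they rewrite URLs. It ends at the query string or fragment.
_JSESSIONID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def strip_session_id(url):
    """Return the URL without a servlet-style ;jsessionid path parameter."""
    return _JSESSIONID.sub("", url)


# Hypothetical URLs; the real page layout is not known.
dirty = "http://www.example.com/calendar;jsessionid=1A2B3C?day=2007-07-07"
clean = strip_session_id(dirty)
print(clean)
```

A URL without a session ID passes through unchanged, so the function can be applied to every link without checking first.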
Also, there are pages I would like to exclude from crawling: since I have a calendar on my page that currently reaches ten years into the future, in theory a bot could be tempted to crawl over all days, which amounts to thousands of pages with only minor differences. While most crawlers are smart enough not to do that, I’d rather have them visit the right pages. If the sitemap helps with that, great.
The robots.txt exclusion standard is not powerful enough to prevent the crawling of the calendar, as it can only filter on directories (URL path prefixes), not on request parameters. Still, since robots.txt can be used to direct the bots to the sitemap (see the bottom of the sitemaps page for instructions), I now have a robots.txt, too. Besides, I think it is good to do all these standard things. Who knows, it might earn some bonus points with the search engines, as a basic mark of quality.
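To make the limitation concrete, here is a small sketch using the robots.txt parser from Python's standard library. The robots.txt content, the /calendar path, the day and month parameters, and the example.com host are all assumptions for illustration, not my actual setup. It shows that a Disallow rule is a plain prefix match, so it blocks every calendar URL or none, and that the Sitemap line is picked up separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; paths and host are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The rule is a prefix match on the URL path, so both request
# parameters are treated the same way: everything under /calendar is blocked.
day_allowed = parser.can_fetch("*", "http://www.example.com/calendar?day=2007-07-07")
month_allowed = parser.can_fetch("*", "http://www.example.com/calendar?month=2007-07")
home_allowed = parser.can_fetch("*", "http://www.example.com/index.html")

# site_maps() needs Python 3.8 or newer; it returns the Sitemap entries.
sitemaps = parser.site_maps()

print(day_allowed, month_allowed, home_allowed, sitemaps)
```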
I experimented with wget to extract the links for the sitemap, but unsuccessfully. The tool Google provides is weird, too: a Python script you are supposed to run on your web server. In the end I just wrote the sitemap by hand. I guess it is not important enough to automate, as the links don’t change that often.
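For a handful of pages, generating the file from a plain list of URLs would also have been an option. The following is only a minimal sketch of that idea, not the Google tool and not what I actually did: the page list and the build_sitemap name are invented, and only the required loc element of the sitemaps.org format is written.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical page list; replace with the real pages of the site.
PAGES = [
    "http://www.example.com/",
    "http://www.example.com/about.html",
]

sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

Writing sitemap_xml to a file named sitemap.xml in the web root would be the remaining step.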
I also tried the sitemap submission method described on sitemaps.org: according to the specification, it should be possible to submit a sitemap to the search engines by simply requesting the URL
<searchengine_URL>/ping?sitemap=sitemap_url
Too bad it didn’t work – it would have made it so easy to (re-)submit a web page to search engines. Can anybody confirm that it doesn’t work, or did I make a mistake? Google doesn’t even mention the mechanism, though, so perhaps the search engines really abandoned it.
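One detail that is easy to get wrong is that the sitemap address after /ping?sitemap= has to be URL-encoded. The sketch below only builds such a ping URL and does not send any request. The searchengine.example host is a placeholder, since I don't know which search engines actually accept the ping, and the ping_url name is made up.

```python
from urllib.parse import quote


def ping_url(search_engine_url, sitemap_url):
    """Build <searchengine_URL>/ping?sitemap=<URL-encoded sitemap_url>."""
    base = search_engine_url.rstrip("/")
    # safe="" makes quote() also encode ":" and "/" in the sitemap address.
    return base + "/ping?sitemap=" + quote(sitemap_url, safe="")


# Placeholder host; not a real submission endpoint.
url = ping_url("http://searchengine.example", "http://www.example.com/sitemap.xml")
print(url)
```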
–
Found on the web: an interesting post about how somebody managed to test his business idea and have 4000 users waiting for his launch. Usually I would also be too scared to publicly announce my idea before I have made enough progress to be sure that I have enough of a head start over any competitor who might pick it up. But many people say that it doesn’t matter, and that it’s unlikely that anybody but you yourself would really have the drive to pursue the idea anyway. I think they are probably right.
Andreas said,
July 8, 2007 at 12:35 pm
On sitemaps, see also c’t 15/07, p. 196
mondhandy said,
July 8, 2007 at 7:19 pm
Damn, that is the current issue – I should go back to the habit of reading c’t immediately, might have saved some time. Although the c’t article doesn’t really give a lot of additional information. The only new thing I learned is this URL: http://code.google.com/sm_thirdparty.html which lists a lot of available sitemap generators (online and executables).
My impression so far is that the sitemaps did what I wanted, and Google now has a cleaner view of my web site. At least all the major pages were in the index when I looked today.