[IMC-docs] Further reducing crawler traffic
Alster
alster at indymedia.org
Mon Apr 18 16:31:55 PDT 2005
Hi,
looking at the logs, I realized that although Intrigeri managed to reduce
the traffic caused by crawlers considerably, the traffic that crawlers
regularly generate is still higher than it needs to be.
Googlebot, for example, which generates the most traffic (>1 GB per
visit), can be told not to download the full content of a page that has
not been updated since its last visit.
The third hint in the 'Technical Guidelines' paragraph on
http://www.google.com/intl/en/webmasters/guidelines.html
says:
"Make sure your web server supports the If-Modified-Since HTTP header.
This feature allows your web server to tell Google whether your content
has changed since we last crawled your site. Supporting this feature
saves you bandwidth and overhead."
I do not know whether this feature works, or can be made to work, with
TWiki-generated pages. It probably cannot. But as I don't know for sure,
I thought mentioning it wouldn't hurt.
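To make the idea concrete, here is a minimal sketch of the decision a
server makes when a crawler sends If-Modified-Since. It is not TWiki code:
the function name conditional_status is hypothetical, and it assumes the
server can find out when a page last changed, which is exactly the open
question for TWiki.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return (status, headers) for a GET request.

    if_modified_since: raw request header value, or None if absent.
    last_modified: timezone-aware datetime of the page's last change
                   (assumption: the server can look this up).
    """
    # HTTP dates have one-second resolution and are expressed in GMT.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable header: ignore it and send the full page
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            # Unchanged since the crawler's last visit: headers only, no body.
            return 304, headers

    return 200, headers
```

A 304 reply carries no page body, which is where the bandwidth saving
would come from.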
It is also interesting that the number of pages MSN indexes is actually
higher than the number indexed by Google, yet Google causes twice the
traffic.
https://boum.org/stats/awstats/docs.indymedia.org/awstats.docs.indymedia.org.lastrobots.html
This might indicate that Google does not interpret the robots.txt file
as intended, i.e. that it still indexes URLs containing parameters.
More information on how Googlebot treats the robots.txt is available at
http://www.google.com/intl/en/webmasters/faq.html#nocrawl and in the
second paragraph of
http://www.google.com/intl/en/webmasters/faq.html#robots
The special tag $ was new to me, as was the priority handling of longer
Dis-/Allow specifications vs. shorter ones.
Instead of what we use in robots.txt to exclude search engines from
hitting dynamic pages
Disallow: /*?*
Google proposes to use
Disallow: /*?
in its example on
http://www.google.com/intl/en/webmasters/faq.html#12
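To compare the two rules, here is a rough sketch of wildcard matching. It
assumes the semantics Google describes for its own crawler: a rule is
matched from the start of the path, * stands for any run of characters,
and a trailing $ anchors the end of the URL. Other crawlers may treat
these characters differently. The helper names and the ?rev=1.2 query
string are made up for illustration, and the priority of longer rules
over shorter ones is not modelled.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule into a regex (assumed Google-style semantics)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "any run of characters".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, rule):
    # re.match anchors at the start only, so an unanchored rule is a prefix match.
    return rule_to_regex(rule).match(path) is not None


if __name__ == "__main__":
    for path in ("/view/Main/AlsteR", "/view/Main/AlsteR?rev=1.2"):
        for rule in ("/*?*", "/*?"):
            print(path, rule, is_disallowed(path, rule))
```

Under these assumptions both rules block exactly the same URLs, since the
trailing * adds nothing to a prefix match. If Googlebot still fetches
URLs with parameters, the difference between the two forms would have to
come from how the crawler parses the rule rather than from what the rule
means.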
More on other things in a couple of hours or days, once I have skimmed
through all the emails...
Alster
--
Info & GPG key at http://docs.indymedia.org/view/Main/AlsteR