
blocking spiders from crawling sites

GeorgeB

Chairman/CEO TMCG
NLC
I had an issue with this on my forum site. Got that fixed, but I just thought about something. How do you stop certain spiders from crawling your site?
 
http://www.delphifaq.com/faq/web_publishing/f760.shtml said:
; block Google's image crawler completely
User-agent: Googlebot-Image
Disallow: /

; block all spiders and bots from those 2 directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /pictures/

; allow Googlebot to access everything except /cgi-bin
; and all other bots can access nothing
; finally allow ia_archive (alexa.com) to access everything!
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
User-agent: ia_archiver
Allow: /

I would think you'd use this in robots.txt to block everything:
User-agent: *
Disallow: /
 
That's right, but you could also add this to your meta tags:

Code:
<meta name="googlebot" content="noarchive,noindex,nofollow,nosnippet" />
<meta name="robots" content="noarchive,noindex,nofollow" />
 
Either way, remember that bots and spiders don't always respect your meta tags or your robots.txt. Most legitimate crawlers, like the Google, MSN, and Yahoo search indexers, will obey, but less honest ones will often ignore your restrictions completely unless they're backed up by a hard limit such as an .htaccess rule or some kind of request rate limiter.
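As a concrete example of such a "hard limit": on an Apache server with mod_rewrite enabled, you can refuse requests based on the User-Agent header from .htaccess. This is just a sketch; "BadBot" and "EvilScraper" are made-up names, and you'd substitute the agent strings you actually see misbehaving in your access logs:

```apache
# .htaccess sketch -- assumes Apache with mod_rewrite enabled.
# "BadBot" and "EvilScraper" are placeholder names; replace them with
# the user-agent strings you actually want to block.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
</IfModule>
```

Keep in mind a scraper can trivially forge its User-Agent string, so for really stubborn clients you'd combine this with rate limiting or IP-based rules.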

This can be worthwhile for certain kinds of content, such as protecting an image gallery or file host from having its content leeched to death by hotlinking after it gets indexed. It's also worth looking into if you run a forum with relatively personal content that doesn't really need to turn up in search results. I use the method myself to keep certain files on my site out of search engines: they're special-purpose scripts that hook into other APIs, so although they're public-facing, they don't need to be indexed, since only their corresponding APIs should ever access them.
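For the hotlinking case specifically, the usual .htaccess approach (again assuming Apache with mod_rewrite; example.com stands in for your own domain) is to serve images only when the referer is empty or points at your own site:

```apache
# Hotlink-protection sketch -- example.com is a placeholder domain.
RewriteEngine On
# Allow empty referers (direct visits; some proxies strip the header)...
RewriteCond %{HTTP_REFERER} !^$
# ...and requests coming from your own pages.
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Everything else asking for an image gets a 403 Forbidden.
RewriteRule \.(gif|jpe?g|png)$ - [F]
```

Allowing empty referers is a judgment call: it lets through privacy tools that strip the header, but also lets through some leechers.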

Here's a piece of one of my robots.txt files. It blocks every client that respects robots.txt from accessing the named folders and the named PHP file, which contains code I'm developing for later use in my billing system.
Code:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /source.php
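If you want to sanity-check a robots.txt file before deploying it, Python's standard-library `urllib.robotparser` (my suggestion, not something from this thread) can parse the rules and tell you what a compliant bot would be allowed to fetch. A quick sketch using rules like the ones above; note that robots.txt paths need a leading slash, so the file rule is written as `/source.php`:

```python
from urllib import robotparser

# The rules from the snippet above, as a compliant crawler would read them.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /source.php
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch(user_agent, path) reports whether a rule-following
# bot with that user-agent may crawl the given path.
print(rp.can_fetch("*", "/index.html"))    # True: no rule matches
print(rp.can_fetch("*", "/cgi-bin/form"))  # False
print(rp.can_fetch("*", "/source.php"))    # False
```

Of course this only tells you what polite bots will do; as noted above, the impolite ones ignore the file entirely.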
 