parsing robots.txt in PHP

the web robots FAQ defines a web robot as a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. i think the part about retrieving all referenced documents is really beside the point. a web robot is any application that acts as a web browser, but has no human controlling it. if a human is controlling a web browser, a server can send a "go away" message and the human (if they are well-behaved) will go away. if there is no human, we need a standardized system for sending "go away" messages that the application can understand (and if the application is well-behaved, it will also go away).

robots.txt is that system. i've built quite a few web applications that load pages from other servers, and i've done a bit of worrying that one of these web applications would bother the owner of one of the loaded pages. the solution, of course, is to follow the robots.txt standard. by following this standard, i allow the owner of the pages my scripts are loading to say "go away" to my applications whenever they choose to do so, and i no longer need to worry about that cease-and-desist letter arriving.

i started looking for something coded in PHP to check robots.txt files, but i found nothing. so i wrote my own. robots.inc contains the function ok_for_robots which will take a URL and tell you if it's okay for a robot to load this URL. it will also take an optional name for the robot, but this part isn't as widely useful yet because there is no quick way in PHP to specify the name of the application while loading remote content. so the next step is making this easier to do in PHP. meanwhile if you - like me - have PHP scripts that are loading remote pages and you - like me - are concerned about being a good neighbor, you might consider using robots.inc.
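
to make the idea concrete, here is a minimal sketch of the kind of check described above. it is not the code from robots.inc: the function name robots_txt_allows, its parameters, and the example robots.txt text are all made up for illustration. it assumes PHP 7.1 or later, takes the robots.txt contents as a string (so fetching the file is left out), and only handles plain User-agent and Disallow lines with prefix matching. Allow lines, wildcards, and Crawl-delay are ignored.

```php
<?php
// hypothetical sketch, NOT the code from robots.inc.
// returns true if $agent may fetch $path under the rules in $robots_txt.
// simplified: Disallow prefix matching only; no Allow lines or wildcards.
function robots_txt_allows(string $robots_txt, string $path, string $agent = '*'): bool
{
    $rules = [];        // lowercased agent name => list of disallowed path prefixes
    $current = [];      // agents named by the record currently being read
    $in_rules = false;  // true once the current record has reached its rule lines

    foreach (preg_split('/\r\n|\r|\n/', $robots_txt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));   // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($in_rules) {            // a User-agent line after rules starts a new record
                $current = [];
                $in_rules = false;
            }
            $name = strtolower($value);
            $current[] = $name;
            $rules[$name] = $rules[$name] ?? [];
        } elseif ($field === 'disallow') {
            $in_rules = true;
            if ($value !== '') {        // an empty Disallow means nothing is disallowed
                foreach ($current as $name) {
                    $rules[$name][] = $value;
                }
            }
        }
    }

    // prefer a record that names this agent; otherwise fall back to the "*" record
    $agent = strtolower($agent);
    $prefixes = $rules['*'] ?? [];
    foreach ($rules as $name => $list) {
        if ($name !== '*' && $name !== '' && strpos($agent, $name) !== false) {
            $prefixes = $list;
            break;
        }
    }

    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {   // path starts with a disallowed prefix
            return false;
        }
    }
    return true;
}
```

with a robots.txt that disallows /private/ for every robot, robots_txt_allows($text, '/private/page.html') comes back false and robots_txt_allows($text, '/index.html') comes back true. a real checker would also need to fetch /robots.txt from the URL's host and decide what to do when the file is missing, which this sketch leaves out.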
