View Full Version : phpDig ignores robots.txt

09-09-2003, 06:53 AM
Hi, everyone,
While searching for a suitable alternative to the PostNuke search engine (which can't be used in a multisite setup), I stumbled across yours.
So far it works nicely; there are just some things I can't resolve:

I told the machine to index http://www.subdomain.domain.com/html/ and put a robots.txt in the html directory, but phpDig keeps ignoring it... even when it states

User-agent: PhpDig
Disallow: /

it continues to spider into the subdirectories...
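One thing I'm not sure about: as far as I understand the robots standard, the file is only ever looked up at the server root, so maybe a robots.txt inside /html/ is never fetched at all? A root-level version would look like:

```
# must be reachable as http://www.subdomain.domain.com/robots.txt
# (server root); per the robots standard, a copy inside /html/ is never requested
User-agent: PhpDig
Disallow: /html/
```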

Is there any other way to exclude single directories? The update form says "Warning ! Erase is permanent", but it isn't: I can erase all the unwanted pages there, yet as soon as I start reindexing the rest, the spider picks up the just-erased pages again. It would be neat if the erase actually stuck. Adding the exclude tag to a single file didn't work either; that page still gets indexed.
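For reference, the exclude tag I added to that single file was along these lines (taken from what I understood of the phpDig docs; I may have the marker names wrong):

```html
<head>
  <!-- standard robots meta tag, asking compliant spiders not to index or follow -->
  <meta name="robots" content="noindex,nofollow">
</head>
<!-- phpdigExclude -->
<p>Content between these markers should be skipped by the spider.</p>
<!-- phpdigInclude -->
```

Since PostNuke builds the page dynamically, I assume the tag has to end up in the rendered output, not just the template source.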

Maybe this is due to the PostNuke CMS, no idea... it's a modular system, and I wanted to limit access to some of the modules; otherwise the spider would index without limit, so I need to restrict access to the directories.

Another problem is that each spidering run damages the PostNuke MySQL tables, and I have to reinstall all the tables of the site. This is weird; maybe it's due to the server config (Apache 2.0) rather than phpDig.

Any ideas on how to get this tool under control?

Thanks for your input!


09-12-2003, 06:54 AM
Hi. The "Warning ! Erase is permanent" message is being produced because there is no lock, i.e., $locked = 0.

As for robots.txt: if you have access to the raw log files, check whether the URL to robots.txt is correct. Otherwise, robot_functions.php contains a function called phpdigReadRobotsTxt; in that function, you might try echoing $site.'robots.txt' to see if the URL is correct.

Not sure why the PostNuke tables get damaged. I didn't see any conflicting tables, even when the PhpDig prefix is set to nuke. What kind of damage is done to the PostNuke tables?
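For example, a temporary debug line near the top of that function might look like this (the placement is a sketch; adapt it to your phpDig version):

```php
function phpdigReadRobotsTxt($site) {
    // temporary debug: show the exact URL the spider will request
    echo 'robots.txt URL: '.$site.'robots.txt'."\n";
    // ... rest of the original function unchanged ...
}
```

If the echoed URL is not http://www.subdomain.domain.com/robots.txt, that would explain why the file is never honored.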