PhpDig.net

kevinz 12-13-2004 09:16 AM

1.8.5 won't spider
 
I just downloaded and installed phpDig 1.8.5, and I'm having some trouble getting it to spider my site.

I had ver 1.6.2 installed and working well before, but I decided to start over as if it were a new install, as my databases are pretty small. I downloaded 1.8.5 into a directory on my server at /var/www/coreinitiative/htdocs/search. I created a new phpDig DB in MySQL called 'phpdig-ci' and a MySQL user 'phpdig-ci-user', and gave it a password and privileges over the DB. I ran chown -R on the directories text_content/, include/ and admin/temp/ to give them to the Apache user and group. I edited include/config.php to change the admin password and set ABSOLUTE_SCRIPT_PATH to '/var/www/coreinitiative/htdocs/search'.

Running admin/install.php seems to work fine, and doesn't give me any errors. The tables are created.

I can access admin/index.php and enter the URI of my site, http://www.coreinitiative.org. It seems to correctly find and read the index page, but doesn't find any links. It only finds one page.

Running the spider from the command line gives:
Code:

www:/var/www/coreinitiative/htdocs/search# php4 -f admin/spider.php http://www.coreinitiative.org
3499: old priority 0, new priority 18
Spidering in progress...
-----------------------------
SITE : http://www.coreinitiative.org/
Exclude paths :
- @NONE@
XDuplicate of an existing document
1:http://www.coreinitiative.org/
(time : 00:00:06)
No link in temporary table
links found : 1
Optimizing tables...
Indexing complete !
www:/var/www/coreinitiative/htdocs/search#

Any suggestions on what I'm doing wrong? My problem seems similar to other postings here about phpDig not spidering all of a site.

Thanks for your help and suggestions.

-Kevin

vinyl-junkie 12-13-2004 06:13 PM

Hi, Kevin. PhpDig works differently now than it did with 1.6.2. If you want it to spider more than just the one link, set a non-zero value for "links per" and "search depth." That should take care of your problem.
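
To illustrate why zero values leave the spider with only the start page, here is a minimal sketch of a crawl bounded by depth and by links per page. It is not phpDig's code: the function `crawl`, the in-memory `site` graph, and the reading of "links per = 0" as "no cap" (which matches the advice given later in this thread) are all assumptions made for the example.

```python
from collections import deque


def crawl(links, start, depth, links_per):
    """Breadth-first walk over an in-memory link graph (hypothetical model).

    links     : dict mapping a page to the list of pages it links to
    depth     : how many link hops away from the start page to follow
    links_per : max links followed from each page; 0 is treated as
                "no cap" here, which is an assumption
    """
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        page, level = queue.popleft()
        if level >= depth:
            continue  # depth used up: the page is kept, its links are not followed
        outgoing = links.get(page, [])
        if links_per > 0:
            outgoing = outgoing[:links_per]
        for target in outgoing:
            if target not in seen:
                seen.append(target)
                queue.append((target, level + 1))
    return seen


# Made-up site structure, for illustration only.
site = {
    "/": ["/a.php", "/b.php", "/c.php"],
    "/a.php": ["/a1.php"],
}

print(crawl(site, "/", depth=0, links_per=0))  # ['/']
print(crawl(site, "/", depth=2, links_per=2))  # ['/', '/a.php', '/b.php', '/a1.php']
```

With a depth of 0 the loop never follows a link, so only the start page is reported, which is the symptom described above.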

Xavi 12-14-2004 12:56 AM

Very similar, but with a small difference in the exclusion list
 
I'm having nearly the same trouble as Kevinz. I was also running phpDig 1.6.2 before. I tried an update (new files over the old ones, updated templates, connect.php, etc.) on my localhost (wampp on Win2k Pro, Apache 1.3.26, PHP 4.3.1, MySQL 3.23, I think) before uploading it to the production server, a Linux server where phpDig 1.6.2 had always spidered my website, but it didn't work. It finds only one link, and I tried many combinations of search depth and links per page (0, 0; 3, 3; 10, 10; 0, 10; 10, 0), but no luck.

By the way, it says something about excluding .*.php and .*.php3. My whole site is built from php3 files, but that was also the case with phpDig 1.6.2.
I couldn't find where to set up the exclusion list (to delete .*.php and .*.php3, just in case). I saw a table in the database, but it was empty.

Just in case, I also tried a fresh installation (with a new database, in case the update process had gone wrong with the files or the database), but the results were the same.

So far, I've asked the sysadmin to completely delete our phpDig 1.6.2, but I'd like to be able to put a new search engine there. And I like phpDig a lot.

The URL of my site:

http://estel.bib.ub.es/ecolo/

(search disabled :-( until I see phpDig 1.8.5 or higher working as well as it always did on this site I administer)

Hints welcome (I'm not a computer scientist)

And thanks for all your hard and nice work with phpdig! ;-)

vinyl-junkie 12-14-2004 03:16 AM

Xavi, did you read my post in this thread? Did you try what I said to do?

kevinz 12-14-2004 05:42 AM

THANKS!
 
vinyl-junkie, thanks, that worked. Note that it took two tries, however. First, I just reindexed the site and increased the links and depth both to 20. This didn't work; I got the same result. I then deleted the site entirely and re-input the URL, with the depth and links set to 20. Now, it's going to town indexing the pages. It's been running 20 minutes now, and is on the 127th page.

Note that most of my content is .php files, just like Xavi. I, too, thought that this was the problem and didn't know where to set the inclusion, or turn off the exclusion. However, it doesn't seem to be necessary; I'm indexing the php files just fine, it seems.

So, '0' is no longer a code for 'unlimited' depth or links? Is there a code? How do I index all the links on pages with more than 20 links on them?

Thanks, again, so much for your help with my problem.

-Kevin

Xavi 12-14-2004 11:37 AM

Hi Vinyl-Junkie:

Yes, I had read your message, tried what you suggested, and reported what I got in my previous post (did you read my whole message?).

I've tried again, with the same results. This time I tried the combination of 20 for "search depth" and 20 for "links per". The strings are in Catalan, but the structure of the output is the same as in English.
I tell phpDig to dig this:

http://estel.bib.ub.es/ecolo

or

http://estel.bib.ub.es/ecolo/index.php3?lg=en
(because I added the var lg in the code, to make it compatible with the language var in the whole site)

And the results:

---
SITE : http://estel.bib.ub.es/
Exclou les rutes :
- cgi-bin/
- .*.php
- .*.php3
1:http://estel.bib.ub.es/ecolo/
(temps : 00:00:11)

No existeix l'enllaç a la taula temporal


enllaços trobats : 1

http://estel.bib.ub.es/ecolo/
Optimizing tables...
Indexat complet!
[Enrere] a la pàgina d'administració.
---

Any ideas of what can be wrong?

And by the way, where can I define or reset the exclusion paths, just in case?

Thanks, Xavier

Charter 12-14-2004 11:52 AM

Look in the config and set LIMIT_TO_DIRECTORY to false. When LIMIT_TO_DIRECTORY is set to true, only links in that directory get indexed.
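
As a rough picture of what a directory limit means, the sketch below keeps a link only if it lives under the directory of the start URL. This is an illustration, not phpDig's implementation: the function name `in_start_directory` and the exact matching rule are assumptions.

```python
from urllib.parse import urlsplit


def in_start_directory(start_url, link_url):
    """Return True if link_url is on the same host and under the
    directory of start_url (hypothetical rule, for illustration)."""
    start = urlsplit(start_url)
    link = urlsplit(link_url)
    if (start.scheme, start.netloc) != (link.scheme, link.netloc):
        return False
    start_path = start.path or "/"
    # Directory part of the start URL: everything up to the last slash.
    directory = start_path[: start_path.rfind("/") + 1]
    return (link.path or "/").startswith(directory)


print(in_start_directory("http://estel.bib.ub.es/ecolo/",
                         "http://estel.bib.ub.es/ecolo/index.php3?lg=en"))
print(in_start_directory("http://estel.bib.ub.es/ecolo/",
                         "http://estel.bib.ub.es/other/page.php3"))
```

Under this rule the first link would be kept and the second dropped, so a limit like this can hide pages that sit outside the start directory.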

Xavi 12-15-2004 08:30 AM

I did, but no change, so far...
 
Hi Charter. I changed LIMIT_TO_DIRECTORY to true, but same results while trying to spider http://estel.bib.ub.es/ecolo

I've rechecked the pages on my site to see if there was some exclude tag, but there are none.

What does "no link in temporary table" mean? Is it a clue?
Can somebody try to spider my site, to see if there is a problem with the content of my site? (It worked fine when spidered by phpDig 1.6.2.)

Thanks for your nice software, and for your support.

Xavi
---

SITE : http://estel.bib.ub.es/
Exclude paths :
- cgi-bin/
- .*.php
- .*.php3
1:http://estel.bib.ub.es/ecolo/
(time : 00:00:09)

No link in temporary table
links found : 1
http://estel.bib.ub.es/ecolo/
Optimizing tables...
Indexing complete ! [Back] to admin interface.

Charter 12-15-2004 08:42 AM

First, go to http://estel.bib.ub.es/robots.txt and edit the robots.txt file:
Code:

# remove these two lines
Disallow: *.php
Disallow: *.php3

Next, set search depth to a large number, links per to zero, and LIMIT_TO_DIRECTORY to false, and try an index.
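
The following sketch shows why those two Disallow lines can leave the spider with a single page. It assumes, based only on the ".*.php" entries in the "Exclude paths" output above, that each `*` in a robots.txt rule is treated as a wildcard and the rule is matched anywhere in the path. The helper names `rule_to_regex` and `is_excluded` are made up for the example and are not phpDig functions.

```python
import re


def rule_to_regex(rule):
    """Turn a robots.txt Disallow value into a regex: '*' becomes '.*',
    everything else is matched literally (an assumed interpretation)."""
    return re.compile(".*".join(re.escape(part) for part in rule.split("*")))


def is_excluded(path, rules):
    """True if any rule matches somewhere in the path."""
    return any(rule_to_regex(rule).search(path) for rule in rules)


rules = ["cgi-bin/", "*.php", "*.php3"]

print(is_excluded("ecolo/", rules))                  # False
print(is_excluded("ecolo/index.php3?lg=en", rules))  # True
print(is_excluded("ecolo/index.php3?lg=en", ["cgi-bin/"]))  # False
```

Under that assumption the start page "ecolo/" passes, but every .php3 link found on it is rejected, which would be consistent with "links found : 1". With the two lines removed, the same link is no longer excluded.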

Xavi 12-17-2004 02:43 AM

It worked finally! :-)
 
Thanks, Charter, that was it!
Shortly, my sysadmin will have phpDig (1.8.6) back again as our search engine.
Cheers, thanks again for the support, and Merry Christmas.
Xavier


All times are GMT -8.
