Indexing problem: PhpDig will not spider all of the site


mih
03-23-2004, 01:18 PM
I installed PhpDig version 1.8.0 successfully.

-------------------------------------
in the config.php

define('SPIDER_MAX_LIMIT',900);     // maximum selectable search depth
define('SPIDER_DEFAULT_LIMIT',900); // default search depth
define('RESPIDER_LIMIT',900);       // search depth when respidering

define('LIMIT_DAYS',0);             // days before a page is respidered


// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','/usr/ports/textproc/catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/ports/print/pstotext');
define('PHPDIG_OPTION_PDF','-cork');
----------------------------------------------------------
Server information as follows:

Platform: FreeBSD 4.8-RELEASE #0
Web Server version: Apache/1.3.29 (Unix)
PHP 4.3.4
MySQL 4.0.13
PERL v5.8.0 built for i386-freebsd
-----------------------------------------
Only 1 TLD.
----------------------
I have tried to re-create the index more than once and I get a very similar result every time.
----------------------------------------------------------
I have created a page that includes links to all the pages/files that I want to index, and I cannot get it to spider the whole site.
------------------------------------------------------------
It will not spider the whole site. In some directories it will only do the first 13 files, while in others it did the first 27. It also indexes only HTML files, even though it is supposed to index DOC and PDF files as well.

What am I doing wrong?

The more urgent problem is that it does not spider the whole site. Some directories contain more than 100 files, and some files are very large (over 1 MB); some of the PDF files contain only graphics and are as big as 40 MB.

Please help and thank you in advance.
mh

Charter
03-24-2004, 06:16 PM
Hi. Is it only the DOC and PDF files that don't get indexed, or is it that the process stops after encountering a large file? Perhaps the issue is related to the one in this (http://www.phpdig.net/showthread.php?threadid=534) thread.
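Also, double-check the parser paths in config.php. The values '/usr/ports/textproc/catdoc' and '/usr/ports/print/pstotext' look like FreeBSD ports source directories rather than the binaries themselves, and the comment in config.php says the full path to the external binary is required. If the tools were installed from the ports collection, the binaries usually land under /usr/local/bin, so something like the following may work (the paths below are an assumption; verify them on your server, e.g. with 'which catdoc'):

// Assumed install locations for binaries built from the
// FreeBSD ports collection; confirm before relying on them.
define('PHPDIG_PARSE_MSWORD','/usr/local/bin/catdoc');
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');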

mih
03-24-2004, 09:43 PM
Thank you kindly for your reply.

Some stats:
I have 161 files that are over 1 MB; the biggest is 36 MB.

Again, these are PDF files with nothing but graphics and no text.


In one directory, which has 41 HTM files (the largest is 697 KB), it froze (locked up) after completing only 16 files.


I do not know whether it gets stuck on individual large files or on the cumulative effect of large files over time. I looked over the link you provided; do you recommend that I make those changes?

I am not sure if I have shell access to the server, but I do have FTP access.

Thank you again.

Charter
03-24-2004, 10:00 PM
Hi. If you have access to your error logs, check to see if the "allowed memory size of X bytes exhausted" error is there. With these larger files, it seems that memory may get exhausted so perhaps try the code in that other thread.
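If you cannot reach the logs, a quick check is to print the configured limit and current usage from a small test script. Note that memory_get_usage() is only available when PHP was compiled with --enable-memory-limit:

// Quick diagnostic: show the memory ceiling and current usage.
echo 'memory_limit: ' . ini_get('memory_limit') . "\n";
echo 'in use: ' . memory_get_usage() . " bytes\n";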

mih
03-24-2004, 11:06 PM
Thank you once more.

I have tried both suggestions.

// Skip retrieval when adding ~2 MB to current usage would
// exceed an 8 MB ceiling, instead of exhausting memory.
if (memory_get_usage() + 2000000 > 8000000) {
    return array('tempfile' => 0, 'tempfilesize' => 0);
}


and


// Write the fetched content to a temporary file, free the
// in-memory copy, then take the size from the file on disk.
$f_handler = fopen($tempfile1, 'wb');
if (is_array($file_content)) {
    fwrite($f_handler, implode('', $file_content));
}
fclose($f_handler);
unset($file_content);
$tempfilesize = filesize($tempfile1);


----------

This only locks up the spider faster!

I am at a loss.

Charter
03-24-2004, 11:54 PM
Hi. Perhaps try lowering the numbers in the code below, or try making a list of smaller files to index.

if (memory_get_usage() + 2000000 > 8000000) {
    return array('tempfile' => 0, 'tempfilesize' => 0);
}
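For example, with a lower ceiling (the figures here are only illustrative, not tested values):

// Same check, but give up on a file once adding ~1 MB to
// current usage would pass 4 MB.
if (memory_get_usage() + 1000000 > 4000000) {
    return array('tempfile' => 0, 'tempfilesize' => 0);
}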

Also, maybe try changing the 900's to a much lower number. There is a search depth example in this (http://www.phpdig.net/showthread.php?postid=1919#post1919) post.
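For instance, something like this in config.php (illustrative starting values only):

define('SPIDER_MAX_LIMIT',20);     // cap on selectable search depth
define('SPIDER_DEFAULT_LIMIT',3);  // default search depth
define('RESPIDER_LIMIT',5);        // depth when respidering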