
View Full Version : running out of memory


tomas
02-16-2004, 12:22 PM
hello list,

Spidering a bunch of PDF files (about 250) in one directory,
spider.php runs out of memory (8k in php.ini) after file 50.
Setting php.ini to 32k, it errors after file 110;
setting it to 128k, it errors after file 220.

I think there is a bug in spider.php with freeing memory?

Any ideas?

tomas

Charter
02-16-2004, 01:42 PM
Hi. Is it kb or mb? Maybe try breaking the list into smaller lists and/or index from shell.

tomas
02-16-2004, 02:47 PM
sorry charter,

of course 8MB -> 32MB -> 128MB

Why a smaller list? Is spider.php eating my memory :-)
when it's called from the browser?

tomas

Charter
02-16-2004, 03:35 PM
Hi. Using shell bypasses the web server. What version of PHP are you using and what's your OS? Maybe this is a timeout issue? What are the actual errors that you are receiving?

tomas
02-16-2004, 03:47 PM
hi charter,

php-4.3.3
fedora core_1
apache 2

By the way, maybe this trick is helpful for others whose PDF files are opened via JavaScript, which spider.php does not recognize:

1) on one of the website's pages, make a dummy link, e.g. <a href="pdf.php"></a>
2) set up pdf.php in the website root:

<?php
// List every PDF under the current directory as an (empty) link
// so that spider.php can find and index the files.
$files = explode("\n", `find . | sort`);
foreach ($files as $file) {
    // Note: the original test strpos($file, ".pdf", "0") != "" is buggy;
    // strpos() returns 0 or false here, both of which loosely equal "".
    if (!is_dir($file) && substr($file, -4) === ".pdf") {
        printf("<a href=\"%s\"></a><br>\n", $file);
    }
}
?>
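If shelling out to `find` is not an option, a pure-PHP variant of the same idea might look like the sketch below. The function name phpdigListPdfs is invented for illustration; the recursion uses opendir()/readdir(), which also work on PHP 4.

```php
<?php
// Recursively collect all .pdf paths under $dir (sketch, not part of PhpDig).
function phpdigListPdfs($dir) {
    $pdfs = array();
    $handle = opendir($dir);
    while (($entry = readdir($handle)) !== false) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        $path = $dir . '/' . $entry;
        if (is_dir($path)) {
            $pdfs = array_merge($pdfs, phpdigListPdfs($path));
        } elseif (substr($path, -4) === '.pdf') {
            $pdfs[] = $path;
        }
    }
    closedir($handle);
    sort($pdfs);
    return $pdfs;
}

// Emit one empty link per PDF, as in the original pdf.php.
foreach (phpdigListPdfs('.') as $file) {
    printf("<a href=\"%s\"></a><br>\n", $file);
}
```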

regards
tomas

Charter
02-16-2004, 08:40 PM
Hi. I'm not sure if the issue is related to pdftotext and/or PhpDig. Maybe try memory_get_usage (http://www.php.net/manual/en/function.memory-get-usage.php) and get_defined_vars (http://www.php.net/manual/en/function.get-defined-vars.php) within the spider.php file to see if anything unusual shows.
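To make that concrete, a throwaway helper dropped into spider.php's indexing loop could record the numbers per file. This is only a sketch: the function name phpdigLogMemory, the log path, and the call site are all assumptions, and memory_get_usage() requires a PHP build with memory-limit support.

```php
<?php
// Hypothetical helper: append current memory usage to a log after each
// file is indexed, to see whether usage grows file by file.
function phpdigLogMemory($url) {
    $fp = fopen('/tmp/phpdig_memory.log', 'ab'); // append mode, PHP 4 safe
    fwrite($fp, sprintf("%s\t%d bytes\n", $url, memory_get_usage()));
    fclose($fp);
}

// Example call, placed wherever spider.php finishes one file:
phpdigLogMemory('http://localhost/files/sample.pdf');
```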

tomas
02-18-2004, 02:22 PM
hello charter,

setting php.ini back to 8mb and running spider.php with bash/cron:

spider dies, and its last words were :-)

Fatal error: Allowed memory size of 8388608 bytes exhausted (tried to allocate 653 bytes) in /var/www/html/search/admin/robot_functions.php on line 707


?

Charter
02-19-2004, 06:04 AM
Hi. In the phpdigTempFile function of robot_functions.php, perhaps replace the following:

$f_handler = fopen($tempfile1,'wb');
if (is_array($file_content)) {
    fwrite($f_handler,implode('',$file_content));
}
fclose($f_handler);
$tempfilesize = filesize($tempfile1);

with the following:

$f_handler = fopen($tempfile1,'wb');
if (is_array($file_content)) {
    fwrite($f_handler,implode('',$file_content));
}
fclose($f_handler);
unset($file_content);
$tempfilesize = filesize($tempfile1);

tomas
02-19-2004, 07:52 AM
hi charter,

i tried and tested a bit -
Now I'm sure the cause is PDFs larger than 2 or 3 MB
with lots of vector graphics inside.

So, how could we set up spider.php to go on
spidering the next files even if one or more files
are too big for the allowed memory setting in php.ini?

thanks
tomas

Charter
02-19-2004, 09:10 AM
Hi. If you are asking to do something like "if fatal error, no more memory, so skip this file and go to the next file", I doubt this can be done, because by the time PHP encounters the fatal out-of-memory error, there isn't enough memory left to do anything else.

Untested, but what you might try is the following. In the phpdigTempFile function, add:

if (memory_get_usage() + 2000000 > 8000000) {
    return array('tempfile'=>0,'tempfilesize'=>0);
}

right before the following line:

$f_handler = fopen($tempfile1,'wb');

That way, if the current memory in use (in bytes) plus 2MB exceeds 8MB, the function ends early, the file isn't indexed, and the indexing process should continue with the next file.
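One caveat with the guard above: the 8000000 figure hardcodes the php.ini setting, so it silently drifts if memory_limit changes. A sketch of deriving the threshold from ini_get('memory_limit') instead is below; phpdigParseMemoryLimit is an invented name, and it handles the K/M/G shorthand that memory_limit values may carry.

```php
<?php
// Convert a memory_limit value such as "8M", "512K" or "1G" to bytes.
// "-1" (no limit) comes back unchanged as -1.
function phpdigParseMemoryLimit($limit) {
    $limit = trim($limit);
    $bytes = (int)$limit;          // numeric prefix, e.g. "8M" -> 8
    switch (strtoupper(substr($limit, -1))) {
        case 'G': $bytes *= 1024;  // fall through
        case 'M': $bytes *= 1024;  // fall through
        case 'K': $bytes *= 1024;
    }
    return $bytes;
}

// The guard in phpdigTempFile could then read:
// if (memory_get_usage() + 2000000 > phpdigParseMemoryLimit(ini_get('memory_limit'))) {
//     return array('tempfile'=>0,'tempfilesize'=>0);
// }
```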

tomas
02-19-2004, 10:52 AM
sorry charter - to bother you again and again,
but nothing works.

tomas

Charter
02-19-2004, 11:12 AM
Hi. Try changing the numbers as in the code below, or just make a list of PDFs that excludes the 2 or 3 MB ones that are using so much memory.

if (memory_get_usage() + 1000000 > 3000000) {
    return array('tempfile'=>0,'tempfilesize'=>0);
}