Problem with PDF indexing [Archive]

Phantom

07-24-2005, 03:47 AM

Hi, I'm using PhpDig v.1.8.7

Indexing PDFs via document-specific URLs in the admin command-line interface works fine.

The problem is very similar to this one (http://www.phpdig.net/forum/showthread.php?t=1860)

I checked your external binaries checklist (http://www.phpdig.net/forum/showthread.php?t=799)

Everything is as you suggest, except that I'm running PHP 4.2.3.
For PHP 4.2.3, you link to a post on this topic (http://www.phpdig.net/showthread.php?threadid=570), but the link doesn't work.

I've added the debug changes you suggested to robot_functions.php and spider.php, and have included below a section of the output for a page that links to many PDF documents. It's as if the crawler doesn't find the PDF files that each page links to.

phpdigTestUrl(http://www.nhs.vic.edu.au/system/style.css) Parse content-type header : text : css

phpdigTestUrl(http://www.nhs.vic.edu.au/system/printer.php?id=38) Parse content-type header : text : html
+

phpdigTestUrl(http://www.nhs.vic.edu.au/index.php?id=40) Parse content-type header : text : html

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: c:\newnhsweb\system\cms\phpdig\pdftotext\pdftotext.exe
Does parse pdf exist: 1
Is parse pdf executable: 1
42:http://www.nhs.vic.edu.au/index.php?id=40
(time : 00:04:52)

thanks.

Charter

07-24-2005, 04:13 AM

Hi. For broken links like http://www.phpdig.net/showthread.php?threadid=570, try adding /forum after the domain, so that the link becomes http://www.phpdig.net/forum/showthread.php?threadid=570. The forum moved from the main directory to the forum subdirectory, but not all links got updated.

To try to index the PDFs at http://www.nhs.vic.edu.au/index.php?id=40, open the config file and set LIMIT_TO_DIRECTORY to false and PHPDIG_IN_DOMAIN to true. Then stick the link in the PhpDig admin panel text box, set search depth to a large number, set links per to zero, and choose the no option.

Phantom

07-25-2005, 02:26 AM

Problem fixed...

It turns out that the hyperlinks to the PDFs embedded in the pages I was attempting to index were invalid. However, every browser known to man seemed to compensate for the invalid path, so I never picked up the error (until now).

The PhpDig crawler didn't compensate for the error (no surprise really).

The incorrect relative path from the root level was:
../content/docs/newsletter/newsletter501.pdf

The relative path from the root level should have been either:
./content/docs/newsletter/newsletter501.pdf
or
content/docs/newsletter/newsletter501.pdf
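
For anyone hitting the same thing, here is a rough sketch in Python (not PhpDig code) of why the bad link went unnoticed. It assumes Python 3.5 or later, where urljoin follows RFC 3986 and silently drops ".." steps that would climb above the site root, which is presumably what the browsers were doing. The function names browser_style and climbs_above_root are made up for illustration, and the check only handles plain relative hrefs; it says nothing about how PhpDig itself resolves links.

```python
from urllib.parse import urljoin, urlsplit

# Page that contains the links (taken from the thread).
PAGE = "http://www.nhs.vic.edu.au/index.php?id=40"


def browser_style(base, href):
    """Resolve href against base per RFC 3986; surplus '..' is dropped at the root."""
    return urljoin(base, href)


def climbs_above_root(base, href):
    """Return True if a relative href has more '..' steps than base has directories."""
    # Directories in the base path, excluding the leading '' and the final file name.
    depth = len(urlsplit(base).path.split("/")[1:-1])
    for segment in href.split("/"):
        if segment == "..":
            depth -= 1
            if depth < 0:
                return True
        elif segment not in ("", "."):
            depth += 1
    return False


if __name__ == "__main__":
    for href in (
        "../content/docs/newsletter/newsletter501.pdf",
        "./content/docs/newsletter/newsletter501.pdf",
        "content/docs/newsletter/newsletter501.pdf",
    ):
        print(href, "->", browser_style(PAGE, href),
              "| climbs above root:", climbs_above_root(PAGE, href))
```

All three hrefs resolve to the same URL under the lenient rules, but only the first one is flagged as climbing above the root. Running a check like this over a page's links before crawling would have caught the invalid path.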

Thanks for your help. Now on to MS Word Documents.....

:)