PhpDig.net

07-24-2005, 03:47 AM   #1
Phantom
Join Date: Jul 2005
Location: Melbourne, Australia
Posts: 2
Problem with PDF indexing

Hi, I'm using PhpDig v.1.8.7

Indexing PDFs via document-specific URLs in the admin command-line interface works fine.

The problem is very similar to this one.

I checked your external binaries checklist.

Everything is as you suggest, except that I'm running PHP 4.2.3.
For PHP 4.2.3 you link to a post on this topic, but the link doesn't work.

I've added the source code debug changes you suggested to robot_functions.php and spider.php, and have included a section of the output below for a page that links to many PDF documents. It's as if the crawler doesn't find the PDF files linked from each page.

Quote:
phpdigTestUrl(http://www.nhs.vic.edu.au/system/style.css) Parse content-type header : text : css

phpdigTestUrl(http://www.nhs.vic.edu.au/system/printer.php?id=38) Parse content-type header : text : html
+

phpdigTestUrl(http://www.nhs.vic.edu.au/index.php?id=40) Parse content-type header : text : html


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: c:\newnhsweb\system\cms\phpdig\pdftotext\pdftotext.exe
Does parse pdf exist: 1
Is parse pdf executable: 1
42:http://www.nhs.vic.edu.au/index.php?id=40
(time : 00:04:52)

thanks.
07-24-2005, 04:13 AM   #2
Charter
Join Date: May 2003
Posts: 2,539
Hi. For broken links like http://www.phpdig.net/showthread.php?threadid=570, try adding forum/ after the domain, as in http://www.phpdig.net/forum/showthread.php?threadid=570. The forum moved from the main directory to the forum subdirectory, but not all links were updated.

To try to index the PDFs at http://www.nhs.vic.edu.au/index.php?id=40, open the config file and set LIMIT_TO_DIRECTORY to false and PHPDIG_IN_DOMAIN to true. Then enter the link in the PhpDig admin panel text box, set search depth to a large number, set links per to zero, and choose the no option.
07-25-2005, 02:26 AM   #3
Phantom
Fixed via correct href path

Problem fixed...

It turns out that the hyperlinks to the PDFs embedded in the pages I was attempting to index were invalid. However, every browser known to man seemed to compensate for the invalid path, so I never picked up the error (until now).

The PhpDig crawler didn't compensate for the error (no surprise, really).

The incorrect relative path from the root level was:
../content/docs/newsletter/newsletter501.pdf

The relative path from the root level should have been either:
./content/docs/newsletter/newsletter501.pdf
or
content/docs/newsletter/newsletter501.pdf
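
For anyone hitting the same thing, here is a minimal Python sketch (not PhpDig code) of how the three paths above resolve against the page URL from the debug output. It assumes a Python version whose `urllib.parse.urljoin` follows RFC 3986, which silently drops a `..` that would climb above the site root, much as the browsers did. The helpers `base_depth` and `escapes_root` are hypothetical names made up for this illustration; they only flag hrefs whose leading `..` segments go above the root, which is the case a stricter crawler may refuse to follow.

```python
from urllib.parse import urljoin, urlsplit

# Page that contains the links (taken from the debug output in this thread).
BASE = "http://www.nhs.vic.edu.au/index.php?id=40"

# The three relative hrefs discussed above.
HREFS = [
    "../content/docs/newsletter/newsletter501.pdf",  # invalid: climbs above the root
    "./content/docs/newsletter/newsletter501.pdf",
    "content/docs/newsletter/newsletter501.pdf",
]


def base_depth(base_url):
    """Number of directories between the site root and the page (hypothetical helper)."""
    path = urlsplit(base_url).path
    return max(path.count("/") - 1, 0)


def escapes_root(base_url, href):
    """True if the leading '..' segments of href climb above the site root.

    Only leading '.' and '..' segments are inspected; this is a simple
    lint for page links, not a full URL resolver.
    """
    level = base_depth(base_url)
    for segment in href.split("/"):
        if segment == ".":
            continue
        if segment != "..":
            break
        level -= 1
        if level < 0:
            return True
    return False


if __name__ == "__main__":
    for href in HREFS:
        flag = "INVALID" if escapes_root(BASE, href) else "ok"
        print(flag, href, "->", urljoin(BASE, href))
```

Running a check like this over the hrefs of a page before crawling should point out links that only work because the browser is forgiving.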

Thanks for your help. Now on to MS Word Documents.....
