not indexing with pdftotext [Archive]

davideyre

03-30-2004, 10:29 AM

I am having problems getting phpdig to index PDF files. pdftotext is installed and works fine from the command line.

I have read several of the other posts and have tried the error reporting code suggested. What seems to be happening is that my PDF file is not recognised as such and is instead recognised and indexed as HTML. If I look in the MySQL spider table, I can see the beginning of the raw PDF file, stripped only of anything that appears inside <>, e.g.

%PDF-1.2
%Çì¢
7 0 obj
<</Length 8 0 R/Filter /FlateDecode>>
stream
xœ3Ð3T0

becomes....

%PDF-1.2 %Çì¢ 7 0 obj > stream xœ3Ð3T0

This is for the simple hello world test file that comes with pdftotext.

I have included a sample of the spider output below:
HTTP/1.1 200 OK
Date: Tue, 30 Mar 2004 20:29:36 GMT
Server: Apache/1.3.27 (Unix) (Red-Hat/Linux) mod_ssl/2.8.12 OpenSSL/0.9.6 PHP/4.3.4 mod_perl/1.27
Last-Modified: Tue, 30 Mar 2004 19:10:13 GMT
ETag: "34ac212-395-4069c615"
Accept-Ranges: bytes
Content-Length: 917
Connection: close
Content-Type: application/pdf

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

13:http://www.tist.org/tist/docs/welcomepack/test/hello1.pdf

Can you please suggest what I need to do to get the spider to recognise PDF files as PDF files rather than HTML? I am using phpdig 1.8, xpdf 3.00, and PHP 4.3.4.

Thanks for your help, David

Charter

03-30-2004, 12:19 PM

Hi. It looks like you added the following code to the robot_functions.php file. That code is meant only for the situation where no content type is returned at all, which generally does not happen, so just take the code out of robot_functions.php.

elseif (!eregi("Content-Type: *([a-z]+)/([a-z.-]+)",$answer,$regs)) {
    $status = 'HTML'; // no content-type so set to html
}
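
To illustrate why that branch can override a correct Content-Type, here is a minimal sketch. It assumes the response header is examined one line at a time, which this thread does not confirm. The function name status_from_header_lines, the $withFallbackBranch switch, and the 'NOFILE' default are hypothetical and not part of phpdig. It uses preg_match with the i flag in place of eregi, which was removed in PHP 7.

```php
<?php
// Hypothetical helper for illustration only; this is not phpdig's actual code.
// Assumption: the HTTP response header is checked line by line.
function status_from_header_lines(array $lines, bool $withFallbackBranch): string
{
    $status = 'NOFILE'; // placeholder default, an assumption for this sketch

    foreach ($lines as $answer) {
        // Case-insensitive match, standing in for eregi().
        if (preg_match('~Content-Type: *([a-z]+)/([a-z.-]+)~i', $answer, $regs)) {
            $type = strtolower($regs[1] . '/' . $regs[2]);
            $status = ($type === 'application/pdf') ? 'PDF' : 'HTML';
        } elseif ($withFallbackBranch) {
            // The fallback from the post: it fires on every line that is
            // not a Content-Type line, including lines after the match.
            $status = 'HTML';
        }
    }

    return $status;
}

// A subset of the header lines from the spider output above,
// ending with the blank line that terminates the header.
$header = [
    'HTTP/1.1 200 OK',
    'Content-Length: 917',
    'Connection: close',
    'Content-Type: application/pdf',
    '',
];

echo status_from_header_lines($header, true), "\n";  // with the fallback branch
echo status_from_header_lines($header, false), "\n"; // with the branch removed
```

Under that assumption, the blank line that ends the header is enough to reset the status to HTML after the PDF content type has already been seen. That would match the "status: HTML" lines in the debug output even though the server sent Content-Type: application/pdf.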