Not indexing pdf files


jayhawk
02-12-2004, 11:23 AM
I am using pdftotext to index my pdf files. It converts the pdf to a txt file. I can do this successfully from the command prompt. However, when I try to index my site with phpdig it does not index the txt file. I have the following set in my config.php file:

define('PHPDIG_PDF_EXTENSION','.txt');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','F:/internet/search/pdftotext/pdftotext');
define('PHPDIG_OPTION_PDF','');

Any suggestions?

jayhawk
02-13-2004, 08:13 AM
Any ideas? Anyone?

:bang:

tomas
02-13-2004, 10:57 AM
hi jayhawk,

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/path/to/your/pdftotext');
define('PHPDIG_OPTION_PDF','');

//---------EXTERNAL TOOLS EXTENSIONS
define('PHPDIG_PDF_EXTENSION','.txt');

these settings should work - please make sure that here:
define('PHPDIG_OPTION_PDF','');
after the comma there are two single quotes!

hope this helps :-)
tomas

jayhawk
02-13-2004, 11:50 AM
They are two single quotes. I have been trying to track down the problem. One thing to note is that when I index the site it lists the url for the pdf, but it does not have a green checkmark next to it. Does that provide any clues to the problem I am having?

Charter
02-14-2004, 12:43 PM
Hi. Perhaps check that the permissions are 755 for the directories to pdftotext and also for the pdftotext file.

jayhawk
02-16-2004, 01:58 PM
Permissions are full control (just to see if I can get it to work). Still no luck.

Charter
02-16-2004, 02:48 PM
Hi. What version of PHP are you running? Perhaps you are experiencing the same problem as in this (http://www.phpdig.net/showthread.php?threadid=522) thread.

jayhawk
02-16-2004, 03:09 PM
I'm running PHP version 4.3.4.

Charter
02-16-2004, 03:18 PM
Hi. Try echoing out the statements as was done in this (http://www.phpdig.net/showthread.php?threadid=522) thread. What do you get?

jayhawk
02-16-2004, 03:43 PM
Here is what I get:


Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\Ghostgum\pstotxt\pstotxt3
Does parse pdf exist:
1:http://dhi-internet/
(time : 00:00:08)
+ + + + +
2: http://dhi-internet/ Was recently indexed
(time : 00:00:14)

3: http://dhi-internet/ Was recently indexed
(time : 00:00:19)

4: http://dhi-internet/ Was recently indexed
(time : 00:00:24)

level 1...


Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\Ghostgum\pstotxt\pstotxt3
Does parse pdf exist:
5:http://dhi-internet/index.php?=PHPB8B5F2A0-3C92-11d3-A3A9-4C7B08C10000
(time : 00:00:35)


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\Ghostgum\pstotxt\pstotxt3
Does parse pdf exist:
6:http://dhi-internet/test/acobook.pdf
(time : 00:00:40)


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\Ghostgum\pstotxt\pstotxt3
Does parse pdf exist:
7:http://dhi-internet/docs/seanresume0204.pdf
(time : 00:00:45)


Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\Ghostgum\pstotxt\pstotxt3
Does parse pdf exist:
8:http://dhi-internet/test/regs.html
(time : 00:00:53)



Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\Ghostgum\pstotxt\pstotxt3
Does parse pdf exist:
9:http://dhi-internet/test/dhi.html
(time : 00:01:00)

No link in temporary table

Charter
02-16-2004, 04:16 PM
Hi. It looks like the following is not returning a value:

echo "Does parse pdf exist: " . file_exists(PHPDIG_PARSE_PDF) . "<br>";

Try setting different paths in the following code, run it from the browser, and then use the path that produces "Does parse pdf exist: 1" onscreen.

<?php
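// In a double-quoted PHP string, "\\" produces one literal backslash.
$filename = "F:\\dhi-internet\\search\\Ghostgum\\pstotxt\\pstotxt3";
// file_exists() returns false for a wrong path, and false echoes as an empty string.
echo "Does parse pdf exist: " . file_exists($filename);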
?>

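The trial-and-error step above can also be done in one pass. The following is a minimal sketch, not part of PhpDig: the helper name first_existing_path() is hypothetical, and the candidate spellings are assumptions based on the paths quoted in this thread. A blank after "Does parse pdf exist:" means file_exists() returned false, because PHP prints false as an empty string and true as "1".

```php
<?php
// Hypothetical helper (not part of PhpDig): return the first candidate path
// that file_exists() accepts, or null if none of them match.
function first_existing_path($candidates)
{
    foreach ($candidates as $path) {
        if (file_exists($path)) {
            return $path;
        }
    }
    return null;
}

// Candidate spellings are assumptions; substitute your own install location.
// In single-quoted strings, '\\' is one literal backslash.
$candidates = array(
    'F:\\dhi-internet\\search\\pdftotext\\pdftotext',
    'F:\\dhi-internet\\search\\pdftotext\\pdftotext.exe',
    'F:/dhi-internet/search/pdftotext/pdftotext.exe',
);

$found = first_existing_path($candidates);
echo "Does parse pdf exist: " . ($found === null ? "no match" : $found) . "\n";
```

Whichever spelling is printed is the one to try in PHPDIG_PARSE_PDF.
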
jayhawk
02-17-2004, 07:23 AM
I fixed the path problem, but it is still not indexing the pdfs. Here is what it displays:

SITE : http://dhi-internet/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\pdftotext\pdftotext.exe
Does parse pdf exist: 1
1:http://dhi-internet/
(time : 00:00:08)
+ + + + + Error: Couldn't open file '.txt' Error: Couldn't open file '.txt'
level 1...


Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\pdftotext\pdftotext.exe
Does parse pdf exist: 1
2:http://dhi-internet/index.php?=PHPB8B5F2A0-3C92-11d3-A3A9-4C7B08C10000
(time : 00:00:20)


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\pdftotext\pdftotext.exe
Does parse pdf exist: 1
Hello PDFRValue 13:http://dhi-internet/docs/seanresume0204.pdf
(time : 00:00:25)


Is result test http an array: 1
What is result test http status: PDF



Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\pdftotext\pdftotext.exe
Does parse pdf exist: 1
Hello PDFRValue 14:http://dhi-internet/test/acobook.pdf
(time : 00:00:59)


Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\pdftotext\pdftotext.exe
Does parse pdf exist: 1
5:http://dhi-internet/test/regs.html
(time : 00:01:07)



Is result test http an array: 1
What is result test http status: HTML



Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: F:\dhi-internet\search\pdftotext\pdftotext.exe
Does parse pdf exist: 1
6:http://dhi-internet/test/dhi.html
(time : 00:01:14)

No link in temporary table
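
The "Couldn't open file '.txt'" lines in the log above suggest that pdftotext was handed a file name consisting only of the extension. One way to separate a path problem from a PhpDig problem is to run the converter from PHP by hand. This is only a sketch under stated assumptions: build_pdftotext_command() and convert_pdf() are hypothetical helpers, not PhpDig functions, and how cmd.exe treats the quoting added by escapeshellarg() can vary between PHP versions on Windows.

```php
<?php
// Hypothetical helper (not PhpDig's code): build the command line for one PDF.
// pdftotext takes the input PDF followed by the output text file.
function build_pdftotext_command($tool, $options, $pdf, $txt)
{
    $cmd = escapeshellarg($tool);
    if ($options !== '') {
        $cmd .= ' ' . $options;
    }
    return $cmd . ' ' . escapeshellarg($pdf) . ' ' . escapeshellarg($txt);
}

// Hypothetical helper: run the conversion and report whether a non-empty
// text file was produced.
function convert_pdf($tool, $options, $pdf, $txt)
{
    // Refuse an empty or missing input name before calling the tool.
    if ($pdf === '' || !file_exists($pdf)) {
        return false;
    }
    $output = array();
    $status = 1;
    exec(build_pdftotext_command($tool, $options, $pdf, $txt), $output, $status);
    return $status === 0 && file_exists($txt) && filesize($txt) > 0;
}
```

Calling convert_pdf() with the PHPDIG_PARSE_PDF path, an empty options string, a local copy of one of the PDFs, and an output name ending in '.txt' should return true if PHP itself is able to run the tool.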

Charter
02-17-2004, 12:10 PM
Hi. Do you have the following in the config file?

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','F:\\dhi-internet\\search\\pdftotext\\pdftotext');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');

jayhawk
02-18-2004, 06:13 AM
I got this to work finally. Thanks for all of your help! :D