PDF indexing blocked


pascalp
07-22-2005, 09:10 AM
Hi,
I'm trying to index a PDF file that I know for certain exists, at a URL like http://..../foo.pdf

The console prints this:
Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: c:/bin/pdftotext.exe
Does parse pdf exist:

And then it stays blocked.

What is happening?
Thanks in advance.

pascalp
07-22-2005, 09:19 AM
Sorry, the output is actually this:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: c:/bin/pdftotext.exe
Does parse pdf exist: 1

Charter
07-23-2005, 11:57 AM
In robot_functions.php, find the appropriate $command variable:

// it can have _PDF or _MSWORD or _MSEXCEL depending on binary
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2;

Change to the following:

// it can have _PDF or _MSWORD or _MSEXCEL depending on binary
$command = PHPDIG_PARSE_PDF.' '.PHPDIG_OPTION_PDF.' '.$tempfile2.' 2>&1';

Then reindex and copy-paste whatever is displayed, if anything.
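
Appending 2>&1 sends the binary's error stream to the same place as its normal output, so anything pdftotext complains about is captured instead of being lost. Below is a minimal sketch of the same idea outside PhpDig, assuming PHP's exec() is enabled on the server. The helper name run_and_capture is hypothetical and is not a PhpDig function.

```php
<?php
// Hypothetical helper, not part of PhpDig: runs a command and returns
// its exit status together with stdout and stderr merged into one string.
function run_and_capture($command)
{
    $output = array();
    $status = 0;
    // "2>&1" redirects stderr to stdout so error messages are captured too.
    exec($command . ' 2>&1', $output, $status);
    return array(
        'status' => $status,
        'output' => implode("\n", $output),
    );
}
```

A non-zero status, or any text in output, for the pdftotext command would point at the binary or its arguments rather than at the spider.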

pascalp
07-24-2005, 03:31 AM
I had already changed that line, but nothing more is displayed...

Charter
07-24-2005, 04:00 AM
Is it just one PDF that won't index, or all of them? If just one, how large is the file? You may be running out of memory. Try uncommenting error_reporting(E_ALL); in the config file, and see whether a memory error occurs when you reindex the PDF file.

pascalp
07-24-2005, 05:18 AM
No PDF file will index.
The PDF file I tried is 796 KB, but I also tried another one that is 350 KB, and it won't index either.
I'll try the error_reporting setting...

pascalp
07-24-2005, 05:23 AM
I tried the "error_reporting" setting... nothing more is displayed.
Besides, why isn't the line "Is parse pdf executable:" displayed?

Charter
07-25-2005, 08:09 AM
If only the following prints out, then check the code edits again to make sure that echo "Is parse pdf executable: " . is_executable(PHPDIG_PARSE_PDF) . "<br>"; is in there:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: c:/bin/pdftotext.exe
Does parse pdf exist: 1
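
One detail when reading these diagnostics: PHP converts true to "1" and false to an empty string when a boolean is concatenated into a string, so a blank after the colon means false, not missing output. Below is a minimal sketch that spells the booleans out with var_export. The helper name describe_binary is hypothetical, and the path is the one quoted in this thread.

```php
<?php
// Hypothetical helper, not part of PhpDig: reports whether a binary
// exists and is executable, with booleans written as true/false
// instead of "1" and an empty string.
function describe_binary($path)
{
    return 'exists: ' . var_export(file_exists($path), true)
         . ', executable: ' . var_export(is_executable($path), true);
}

// Path taken from the posts above; adjust to the local install.
echo describe_binary('c:/bin/pdftotext.exe'), "\n";
```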

pascalp
07-25-2005, 11:47 AM
I finally got the PDF indexing working...

BUT it only indexes a PDF when I give it the full URL to the PDF.
I have this page:
http://www.ville-magny-les-hameaux.fr/actualite/com_public/main_public.htm
It contains 2 simple PDF links. The spider doesn't find any PDF link in it, even though there are at least two obvious ones.
I use the "no" option, with depth 20 and links 20 as parameters.

Any ideas?

Charter
07-25-2005, 12:48 PM
When you use the code in this (http://www.phpdig.net/forum/showpost.php?p=8538&postcount=3) post, does any error message print onscreen?

pascalp
07-25-2005, 10:43 PM
No, no error message at all...

Charter
07-29-2005, 08:51 AM
I ran a test on your site with search depth 1 and links per 4, and got the output below. Try using...

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','c:/bin/pdftotext.exe');
define('PHPDIG_OPTION_PDF','');
define('PHPDIG_PDF_EXTENSION','.txt');

And see if this gets it to index. Also, if you run pdftotext from the command line on a PDF file, does it create a TXT file?


Spidering in progress... [Stop spider]
SITE : http://www.ville-magny-les-hameaux.fr/
Exclude paths :
- library
- moteur
- Pics
- plan_site
- x_element_base
- a_mieux_connaitre/jpg
- a_mieux_connaitre/geo/jpg
- a_mieux_connaitre/histoire/jpg
- a_mieux_connaitre/magny_chiffres/jpg
- a_mieux_connaitre/patrimoine/jpg
- a_mieux_connaitre/vie_municipale/jpg
- actualite/jpg
- b_vie_pratique/jpg
- b_vie_pratique/se_deplacer/jpg
- b_vie_pratique/serv_public/jpg
- c_vie_eco/jpg
- d_vie_cult_sport/jpg
- e_vie_associative/jpg

Wait...
1:http://www.ville-magny-les-hameaux.fr/actualite/com_public/main_public.htm
(time : 00:00:10)
+ + + + + + +
level 1...

Wait...
2:http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.pdf
(time : 00:00:36)
+ + +

Wait...
3:http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.doc
(time : 00:00:52)


Wait...
4:http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.doc
(time : 00:01:12)


Wait...
5:http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.pdf
(time : 00:01:28)

level 2...
links found : 5
http://www.ville-magny-les-hameaux.fr/actualite/com_public/main_public.htm
http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.pdf
http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.doc
http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.doc
http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.pdf
Optimizing tables...
Indexing complete ! [Back] to admin interface.

pascalp
07-31-2005, 02:13 PM
I already use the code you gave here.
The result is the message "the page has been recently indexed", so it doesn't index it again.
As you saw, there are 2 PDFs on that page. My spider found no PDF.
It only found dc5.pdf because I gave it that URL explicitly...

Any ideas?

Charter
07-31-2005, 02:26 PM
Do you have shell or command line access? If so, then "touch" the PDFs to give them a new file date. Otherwise, if you are making the PDFs yourself, resave them and FTP a new version over, so that the files appear updated. See if this lets you reindex the PDFs. Of course, if the content of the PDFs hasn't changed, no reindex is necessary, and PhpDig does look for a "Last-Modified" date. One other thing: you should be able to delete a page/document from the PhpDig admin panel, so if you want to reindex without touching the file, try a delete and then a reindex, both from the admin panel.
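
Without shell access, the same effect as "touch" can be sketched in PHP. This assumes the script runs on the machine that stores the PDFs and that the web server takes the Last-Modified header from the file's modification time, which is common for static files but is an assumption here. The helper name refresh_mtime is hypothetical.

```php
<?php
// Hypothetical helper: gives an existing local file a fresh modification
// time, the same effect as the shell "touch" command.
function refresh_mtime($path)
{
    if (!file_exists($path)) {
        return false;       // do not create missing files by accident
    }
    $ok = touch($path);     // sets the modification time to the current time
    clearstatcache();       // drop PHP's cached stat data so filemtime() is fresh
    return $ok;
}
```

Calling refresh_mtime() on each PDF before a reindex should make the files look updated to anything that compares modification dates.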

pascalp
08-11-2005, 01:36 AM
I deleted the page from the PhpDig admin panel and tried to reindex. It indexes the HTML file itself but doesn't index the PDFs linked from it. As I said earlier, when I index the PDF URL directly, it works...
For information, there is no timeout problem.

pascalp
08-11-2005, 01:41 AM
Finally, I made it work with the following values:
- no
- depth : 1
- links : 20
How can that be? I mean, why does it work with depth 1 but not with depth 20?

Charter
08-11-2005, 04:20 AM
Here's another test using search depth 20, links per 20, and the no option:

Spidering in progress... [Stop spider]
SITE : http://www.ville-magny-les-hameaux.fr/
Exclude paths :
- library
- moteur
- Pics
- plan_site
- x_element_base
- a_mieux_connaitre/jpg
- a_mieux_connaitre/geo/jpg
- a_mieux_connaitre/histoire/jpg
- a_mieux_connaitre/magny_chiffres/jpg
- a_mieux_connaitre/patrimoine/jpg
- a_mieux_connaitre/vie_municipale/jpg
- actualite/jpg
- b_vie_pratique/jpg
- b_vie_pratique/se_deplacer/jpg
- b_vie_pratique/serv_public/jpg
- c_vie_eco/jpg
- d_vie_cult_sport/jpg
- e_vie_associative/jpg

Wait...
1:http://www.ville-magny-les-hameaux.fr/actualite/com_public/main_public.htm
(time : 00:00:10)
+ + + + + + +
level 1...

Wait...
2:http://www.ville-magny-les-hameaux.fr/actualite/com_public/boucher.htm
(time : 00:00:31)
+ +

Wait...
3:http://www.ville-magny-les-hameaux.fr/actualite/com_public/assurance.htm
(time : 00:00:45)

Wait...
4:http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.doc
(time : 00:01:04)

Wait...
5:http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.pdf
(time : 00:01:20)

Wait...
6:http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.pdf
(time : 00:01:37)

Wait...
7:http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.doc
(time : 00:01:50)

Wait...
8:http://www.ville-magny-les-hameaux.fr/actualite/com_public/fleur2005.htm
(time : 00:02:00)
+ +
level 2...

Wait...
9:http://www.ville-magny-les-hameaux.fr/actualite/com_public/ccboucher.DOC
(time : 00:02:29)

Wait...
10:http://www.ville-magny-les-hameaux.fr/actualite/com_public/AEboucher.doc
(time : 00:02:45)

Wait...
11:http://www.ville-magny-les-hameaux.fr/actualite/com_public/fleur2005.doc
(time : 00:02:56)

Wait...
12:http://www.ville-magny-les-hameaux.fr/actualite/com_public/AEfleur.doc
(time : 00:03:10)
No link in temporary table
links found : 12
http://www.ville-magny-les-hameaux.fr/actualite/com_public/main_public.htm
http://www.ville-magny-les-hameaux.fr/actualite/com_public/boucher.htm
http://www.ville-magny-les-hameaux.fr/actualite/com_public/assurance.htm
http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.doc
http://www.ville-magny-les-hameaux.fr/actualite/com_public/dc5.pdf
http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.pdf
http://www.ville-magny-les-hameaux.fr/actualite/com_public/ae.doc
http://www.ville-magny-les-hameaux.fr/actualite/com_public/fleur2005.htm
http://www.ville-magny-les-hameaux.fr/actualite/com_public/ccboucher.DOC
http://www.ville-magny-les-hameaux.fr/actualite/com_public/AEboucher.doc
http://www.ville-magny-les-hameaux.fr/actualite/com_public/fleur2005.doc
http://www.ville-magny-les-hameaux.fr/actualite/com_public/AEfleur.doc
Optimizing tables...
Indexing complete ! [Back] to admin interface.