pstotext issue


killer27
04-28-2004, 08:00 AM
For me, it only indexes the title of the PDF file, the time of indexing, and the size of the PDF file. These are stored in the keywords table of the database, but none of the PDF content is in the database.
This is strange, because when I index a site with PDF files the spider seems to index them; see below:


Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 1
Index the pdf is set to: 1
Parse the pdf is set to: /usr/bin/pstotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/bin/pstotext -cork ../admin/temp/13874292.tmp
Result contains: Array ( [0] => Hébergement [1] => Facture [2] => partners -- 5 Sq de tuile_ 78000 Versailles -- Tél. / Fax : 0666666666 -- Email : contact@partners.com [3] => SARL au capital de 3000# -- Siret545454445RCS Versailles -- APE 222Z -- Web : www.partners.com [4] => [5] => FACTURE [6] => partners CLIENT [7] => 5 Sq de tuile Adzd MAdzNdzAS [8] => 78000 Versailles [9] => Tél./fax. : 01 3226222626 [10] => Prestation : Hébergement [11] => Facture du: 01/04/2004 au 31/06/2004 [12] => N° de Facture: 12122/66 [13] => Article Objet Quantité [14] => / [15] => Slots [16] => Prix [17] => unitaire / [18] => Trimestre [19] => Montant TVA [20] => Hébergement Serveur [21] => Total HT 122.36 [22] => Total TVA 23.61 [23] => Total TTC 122.00 [24] => A payer 122.00 EUROS [25] => Mode de paiement : A réception de facture [26] => )
Return value is: 0

5:http://monsiteweb.fr/pdf/01123SOC2004013.PDF
(temps : 00:01:49)

Pas de liens dans la table temporaire

Charter
04-28-2004, 08:59 AM
Hi. That all looks like it's working. What do you get when you run the following query?

select first_words from spider where file like '%.pdf%';

Also, look in the keywords table for words from the PDF file:

select keyword from keywords where keyword like '%word%';

If you have both define('CONTENT_TEXT',1); and define('DISPLAY_SNIPPETS',true); set in the config file, then there should be a text file in the text_content directory with the PDF content.

If you have define('CONTENT_TEXT',0); set in the config file, then a keyword search shows only $text, taken from list($title,$text) = explode("\n",$first_words);, regardless of the keyword.
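A minimal sketch of what that split does, assuming first_words holds the title on its first line and the leading text on the second. The sample value is made up for illustration and is not taken from a real spider row:

```php
<?php
// Hypothetical first_words value: title, a newline, then the start of the text.
$first_words = "Facture\nHébergement Serveur Total HT";

// Split on the newline: the first piece becomes the title, the second the text.
list($title, $text) = explode("\n", $first_words);

echo "title: $title\n";
echo "text: $text\n";
```

With CONTENT_TEXT set to 0 there is no stored text file to pull snippets from, so the same $text would be shown for every keyword that matches the page.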

killer27
04-28-2004, 09:29 AM
Hi,

When I run the following query:

select first_words from spider where file like '%.pdf%';

I get nothing.

When I run the following query:
select keyword from keywords where keyword like '%word%';

I get:
key_id=61365
twoletters=en
keyword=entrymainbodyfirstwords

But I think this is because my index.html contains the tag {{entrymainbodyfirstwords 25}}, which has been indexed as keyword=entrymainbodyfirstwords,
so it has nothing to do with the PDF (I think).


Here is the configuration in config.php :


define('SNIPPET_DISPLAY_LENGTH',150);
define('DISPLAY_SNIPPETS',true);
define('DISPLAY_SNIPPETS_NUM',4);
define('DISPLAY_SUMMARY',true);


define('TEXT_CONTENT_PATH','text_content/');
define('CONTENT_TEXT',1);



"...then there should be a text file in the text_content directory with the PDF content."
Yes, I have text files, but the PDF text file only shows:

Index of /pdf Name Last modified Size Description Parent Directory
28-Apr-2004 16:35 - 01123SOC2004013.PDF 28-Apr-2004 18:30 69k pdf.html
28-Apr-2004 18:18 1k test.doc 28-Apr-2004 17:24 19k zyz.xls 28-Apr-2004
17:24 14k Apache1.3.29 - ProXad [Apr 1 2004 16:04:22] Server at
monsiteweb.fr Port 80 Index of /pdf Index of /pdf Index of /pdf


"...then when searching on a keyword just $text from list($title,$text) = explode("\n",$first_words); will be shown regardless of keyword."
I don't understand this last part of your message.

Thanks for your great work and your quick answer.

Paul

Charter
04-28-2004, 09:56 AM
Hi. When you try the following query, change word to some word that could only be in the PDF file:

select keyword from keywords where keyword like '%word%';

The file in the text_content directory contains the following:

Index of /pdf Name Last modified Size Description Parent Directory
28-Apr-2004 16:35 - 01123SOC2004013.PDF 28-Apr-2004 18:30 69k pdf.html
28-Apr-2004 18:18 1k test.doc 28-Apr-2004 17:24 19k zyz.xls 28-Apr-2004
17:24 14k Apache1.3.29 - ProXad [Apr 1 2004 16:04:22] Server at
monsiteweb.fr Port 80 Index of /pdf Index of /pdf Index of /pdf

That looks like a directory listing rather than the content of the actual PDF file. The $result array contains the following:

Result contains: Array ( [0] => Hébergement [1] => Facture [2] => partners -- 5 Sq de tuile_ 78000 Versailles -- Tél. / Fax : 0666666666 -- Email : contact@partners.com [3] => SARL au capital de 3000# -- Siret545454445RCS Versailles -- APE 222Z -- Web : www.partners.com [4] => [5] => FACTURE [6] => partners CLIENT [7] => 5 Sq de tuile Adzd MAdzNdzAS [8] => 78000 Versailles [9] => Tél./fax. : 01 3226222626 [10] => Prestation : Hébergement [11] => Facture du: 01/04/2004 au 31/06/2004 [12] => N° de Facture: 12122/66 [13] => Article Objet Quantité [14] => / [15] => Slots [16] => Prix [17] => unitaire / [18] => Trimestre [19] => Montant TVA [20] => Hébergement Serveur [21] => Total HT 122.36 [22] => Total TVA 23.61 [23] => Total TTC 122.00 [24] => A payer 122.00 EUROS [25] => Mode de paiement : A réception de facture [26] => )

And with $retval being zero, the following code should make a temp file containing the stuff from the $result array:

if (!$retval) {
    // the replacement of š is for non-breaking spaces
    // returned by catdoc parsing msword files
    // and '0xAD' "tiret quadratin" returned by pstotext
    // in iso-8859-1
    // Adjust with your encoding and/or your tools
    if ((is_array($result)) && (count($result) > 0)) {
        $f_handler = fopen($tempfile1,'wb');
        fwrite($f_handler,str_replace('š',' ',str_replace(chr(0xad),'-',implode(' ',$result))));
        fclose($f_handler);
    }
}
else {
    return array('tempfile'=>0,'tempfilesize'=>0);
}

Also, what do you get with the following query:

select file,first_words from spider where file like '%01123SOC2004013%';

And are the admin/temp and text_content directories set to 777 permissions?
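One way to check this from PHP itself is sketched below. The helper name check_writable_dirs is hypothetical (it is not part of PhpDig), and the relative paths assume the script is run from the PhpDig root directory:

```php
<?php
// Hypothetical helper: report whether each directory exists and is
// writable by the user the PHP process runs as.
function check_writable_dirs(array $dirs) {
    $report = array();
    foreach ($dirs as $dir) {
        $report[$dir] = is_dir($dir) && is_writable($dir);
    }
    return $report;
}

// Paths are relative to wherever the script runs; adjust to your install.
foreach (check_writable_dirs(array('admin/temp', 'text_content')) as $dir => $ok) {
    echo $dir . ': ' . ($ok ? 'writable' : 'missing or not writable') . "\n";
}
```

Running it through the web server rather than the command line gives the answer for the same user the spider runs as, which is the one that matters here.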

killer27
04-29-2004, 02:08 PM
Hi,

When I execute the two queries:
select keyword from keywords where keyword like '%Facture%';
and
select file,first_words from spider where file like '%01123SOC2004013%';

I get no results from MySQL, so I am sure it is not indexing the PDF.

I also opened all the files in the admin/temp directory and found the content of the PDF file in 95593951.tmp:
Hébergement Facture partners -- 9 Sq de Bgdgg - 79699 paris -- Tél. / Fax : 565995465559 -- Email : contact@-partners.com SARL au capital de 3000# -- Siret +6++++RCS Versailles -- APE 698Z -- Web : www.partners.com FACTURE partners CLIENT 9 Sq ghffg Antoine gdgd 75995 paris Tél./fax. : 065965659959 Prestation : Hébergement Facture du: 01/04/2004 au 31/06/2004 N° de Facture: 0899999 Article Objet Quantité / Slots Prix unitaire / Trimestre Montant TVA Hébergement Serveur Vietcong 6+6+5488484 Total HT 120.39 Total TVA 23.61 Total TTC 144.00 A payer 144.00 EUROS Mode de paiement : A réception de facture
When I open the files in the text_content directory, none of them contains the PDF content.

All my permissions are correct, and I am able to index doc and xls files.

I have PHP 4.2.2, but I have installed this patch:

http://www.phpdig.net/showthread.php?threadid=570

and checked everything described in this thread:
http://www.phpdig.net/showthread.php?s=&threadid=799
(I also added the code included in that thread.)


I am attaching my config.php, spider.php, and robot_functions.php files in a zip file; maybe they can help you to help me.

Thanks a lot.
Paul

Charter
05-01-2004, 06:13 AM
Hi. Change define('PHPDIG_PDF_EXTENSION','.txt'); to define('PHPDIG_PDF_EXTENSION',''); in the config file (two single quotes, no space between).

The '.txt' is for when an external PDF binary writes its output to a TXT file, as pdftotext does. However, pstotext (like catdoc) writes to STDOUT, so no '.txt' is needed.
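The difference can be sketched as follows. This is an illustration only: expected_text_file is a hypothetical helper, not PhpDig's actual code, and it assumes the configured extension is simply appended to the temp file name to find the extracted text:

```php
<?php
// Hypothetical helper, not PhpDig's actual code: returns the file in which
// the indexer would expect to find the extracted text (assumed behaviour).
function expected_text_file($tempfile, $extension) {
    // A tool that writes its own .txt file needs the suffix; a tool that
    // prints to STDOUT needs none, because the captured output is saved
    // by the caller under the temp file name itself.
    return $tempfile . $extension;
}

$tempfile = '../admin/temp/13874292.tmp';
echo expected_text_file($tempfile, '.txt') . "\n"; // tool that writes a .txt file
echo expected_text_file($tempfile, '') . "\n";     // STDOUT tool such as pstotext
```

Under that assumption, a '.txt' setting combined with a STDOUT tool would point the indexer at a file name different from the one the extracted text was saved under, which would match the symptom in this thread: the text is present in admin/temp but never reaches the index.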

killer27
05-03-2004, 07:31 AM
Thanks a lot, it works fine now.

If someone has the same issue and is using PHP 4.2.2 on Red Hat Linux 7.3, I can share my files.

Only two more issues:

First: large PDF files (around 5 MB) cannot be indexed, while small ones (around 200 KB or 500 KB) work.

Second: when I index doc files, the spider transforms é, à, è into special characters, like é=é or être=ètre. Do you have an explanation for this?

Thanks again and again...

Paul

Pulsar-san
05-12-2004, 01:28 PM
Hi!

For the é=é issue,
it looks like the é is being converted to UTF-8 (as in Google).
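To see why a UTF-8 é can turn into two characters, here is a small sketch. It assumes the text is stored as UTF-8 but displayed as ISO-8859-1; latin1_view is a hypothetical helper written for this illustration:

```php
<?php
// Show each byte of a string as the character an ISO-8859-1 reader would see.
// ISO-8859-1 byte values equal Unicode code points, so one numeric HTML
// entity per byte is enough; no mbstring or iconv extension is required.
function latin1_view($bytes) {
    $out = '';
    for ($i = 0; $i < strlen($bytes); $i++) {
        $out .= '&#' . ord($bytes[$i]) . ';';
    }
    return $out;
}

$e_acute_utf8 = "\xC3\xA9"; // the two bytes that encode é in UTF-8
echo latin1_view($e_acute_utf8) . "\n"; // entities for Ã (195) and © (169)
```

A browser renders those two entities as Ã followed by ©, which is the pair that appears when a single UTF-8 é is displayed as ISO-8859-1.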