Problem with .pdf and .doc files


mleray
10-01-2004, 02:30 AM
Hi,

As my English is not very good, I'm a little lost in this forum.
I've seen many topics about problems indexing PDFs, but I can't find a solution. I'm sure it is on the forum somewhere...

So, my problem is that my PDF files seem to be indexed, but when I search for a keyword or the filename of one of them, I can't find it.
I've looked in the database and have never seen any .pdf file (nor any .doc file; .xls files seem to be OK).

I use PHP 4.3.3, MySQL 4.0.15 on Windows XP
The PHPDig version is 1.8.3
The site I'm trying to index is our intranet site, so I can't give you a link to look at.

//---------EXTERNAL TOOLS SETUP
// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND','1'); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);//*** false

define('PHPDIG_PARSE_MSWORD','C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true); //*** false
define('PHPDIG_PARSE_PDF','C:/Stage_Manuella/moteur/PHPDIG_DIR/Ghostgum/pstotext');
define('PHPDIG_OPTION_PDF','-cork');

define('PHPDIG_INDEX_MSEXCEL',true);//*** false
define('PHPDIG_PARSE_MSEXCEL','C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3');
define('PHPDIG_OPTION_MSEXCEL','');

define('PHPDIG_INDEX_MSPOWERPOINT',false);
define('PHPDIG_PARSE_MSPOWERPOINT','/usr/local/bin/ppt2text');
define('PHPDIG_OPTION_MSPOWERPOINT','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');
define('PHPDIG_MSPOWERPOINT_EXTENSION','');

Examples of what I get in my browser after indexing :
niveau 2...
4:http://10.37.1.240/dossier_presse/dp_2004_a.pdf (not checked)
(temps : 00:01:22)

5:http://10.37.1.240/arrete_100903.pdf (not checked)
(temps : 00:01:30)

6:http://10.37.1.240/Ressources-Humaines/annuaire_telephonique.htm (checked)
(temps : 00:01:51)
+ + + + + +

And in the summary :
http://10.37.1.240/dossier_presse/dp_2004_a.pdf

mleray
10-01-2004, 05:05 AM
I tried what is written in the readme topic, and this is what I get:

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:/Stage_Manuella/moteur/PHPDIG_DIR/Ghostgum/pstotext
Does parse pdf exist: 1

Fatal error: Call to undefined function: is_executable() in c:\stage_manuella\moteur\phpdig_dir\phpdig-1.8.3\admin\robot_functions.php on line 963

Charter
10-01-2004, 05:55 AM
Set USE_IS_EXECUTABLE_COMMAND to zero in the config file.

// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND','1'); //use is_executable for external binaries
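
As a minimal sketch of that change, assuming the rest of the config file stays as posted above, the line would read as follows. The '0' value is the one suggested by the comment in the config file itself.

```php
// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND','0'); // '0' because is_executable() is reported as undefined on this setup
```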

mleray
10-01-2004, 06:23 AM
I've done it.
Use is executable is set to: 0

But nothing changes. I still get the error message.

Should the path to the executable include the name of the file (pstotxt3.exe) or not?

Like this:
define('PHPDIG_PARSE_PDF','C:\Stage_Manuella\moteur\PHPDIG_DIR\Ghostgum\pstotext');
or like this:
define('PHPDIG_PARSE_PDF','C:\Stage_Manuella\moteur\PHPDIG_DIR\Ghostgum\pstotext\pstotxt3');
Or something else? Should I use a relative path or an absolute one?
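
For what it's worth, the comment in the config file says the full path to the external binary is required, and the diagnostic output posted later in this thread passes the existence check with a path that points at the executable itself. A sketch in that form follows. The pstotxt3.exe file name and its folder are taken from the question above and are assumptions; nothing in this thread confirms that this exact setting works.

```php
// Absolute path that includes the executable's file name.
// Assumption: pstotxt3.exe sits in the Ghostgum/pstotext folder named in the question.
define('PHPDIG_PARSE_PDF','C:/Stage_Manuella/moteur/PHPDIG_DIR/Ghostgum/pstotext/pstotxt3.exe');
```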

mleray
10-01-2004, 07:19 AM
I tried with pdftotext. It seems to be better, but not perfect...

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\Stage_Manuella\moteur\PHPDIG_DIR\xpdf-3.00-win32\pdftotext.exe
Does parse pdf exist: 1

Command is: C:\Stage_Manuella\moteur\PHPDIG_DIR\xpdf-3.00-win32\pdftotext.exe ../admin/temp/95662532.tmp 2>&1
Result contains: Array ( [0] => Error: Copying of text from this document is not allowed. )
Return value is: 3

What does this error mean?
:what:

mleray
10-04-2004, 11:22 PM
Any more help?
Are there any Frenchies here?

Charter
10-06-2004, 03:29 AM
>> Result contains: Array ( [0] => Error: Copying of text from this document is not allowed. )

The issue is with the PDF, not PhpDig. The PDF permissions are set such that "copying of text from this document is not allowed."

mleray
10-07-2004, 11:59 PM
Seems to be ok now. Thanks.

But now I've got a new problem with catdoc and xls2csv :(

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
******************************************************
Does parse xls exist: 1
Index the xls is set to: 1
Parse the xls is set to: C:\Stage_Manuella\moteur\PHPDIG_DIR\catdoc-0.93.4\xls2csv.exe
******************************************************
Command is: C:\Stage_Manuella\moteur\PHPDIG_DIR\catdoc-0.93.4\xls2csv.exe -s 8859-1 ../admin/temp/64971482.tmp 2>&1
Result contains: Array ( [0] => Le système ne peut exécuter le programme spécifié. )
Return value is: 1
In English: The system cannot execute the specified program.

It's the same with catdoc.exe.

But if I launch the program from an MS-DOS prompt, like this:

Microsoft Windows XP [version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Administrateur.EFSTSE>cd../..

C:\>cd Stage_Manuella\moteur\PHPDIG_DIR\catdoc-0.93.4

C:\Stage_Manuella\moteur\PHPDIG_DIR\catdoc-0.93.4>xls2csv test.xls
"NOM","PRENOM","AGE"
"Leray","Manuella","27"
"Leray","Sylvain","24"
"Rauturier","Myriam","52"


You can see that it works... :confused:

mleray
10-12-2004, 06:57 AM
I've found a solution to my problem with these external binaries. :banana:
EasyPHP had installed PHP as a module, but it needs to be installed in CGI mode; otherwise the exec() function did not work (error: "The system cannot execute the specified program").
So I renamed robot_functions.php to robot_functions.cgi and spider.php to spider.cgi, and updated the links to these files wherever necessary.
And it works! Now I just have a small problem converting accented characters, but that's a lesser evil.

I hope that helps. (You can also leave a post on developpez.com just in case; I'm there often.)

Manuella

mleray
10-13-2004, 01:14 AM
Clarification:
I use
PHP 4.3.3
MySQL 4.0.15
Apache 1.3.27
on Windows XP installed with EasyPHP 1.7
My PHPDig version is 1.8.3

Charter
12-09-2004, 02:27 AM
bump for xperienss...

xperienss
12-09-2004, 10:26 PM
Ohhh, thanks a lot, Charter, for bumping this post.

----

This message is for Mleray.
Apparently we have the same configuration (WinXP, EasyPHP 1.7, ...).
So far I have managed to get PDF indexing working with Xpdf/pdftotext.exe v3.
But with catdoc and xls2csv, I still can't index the files.

You said you had found the solution... so please help me if you can, because I've been struggling for a week, trying every possible configuration.
Thanks in advance (if you get this message).

----

Well, as soon as I get everything working, I'll post a topic explaining how to install PhpDig/catdoc/xpdf-pdftotext on WinXP/EasyPHP 1.7...

I am sure this would help lots of people.

Xperienss