10-01-2004, 02:30 AM   #1
mleray
Orange Mole
 
Join Date: Sep 2004
Location: Nantes (44) FRANCE
Posts: 31
problem with .pdf and .doc files

Hi,

As I'm not very good at English, I'm a little lost in this forum.
I've seen many topics about problems with indexing PDF files, but I can't find a solution. I'm sure it is somewhere on the forum...

So, my problem is that my PDF files seem to be indexed, but when I search for a keyword or the filename of one of them, I can't find it.
I've looked in the database and never found any .pdf file (nor any .doc file, although .xls files seem to be OK).

I use PHP 4.3.3 and MySQL 4.0.15 on Windows XP.
The PhpDig version is 1.8.3.
The site I'm trying to index is an intranet site, so I can't give you a link to look at.

PHP Code:
//---------EXTERNAL TOOLS SETUP
// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND','1'); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);//*** false

define('PHPDIG_PARSE_MSWORD','C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true); //*** false
define('PHPDIG_PARSE_PDF','C:/Stage_Manuella/moteur/PHPDIG_DIR/Ghostgum/pstotext');
define('PHPDIG_OPTION_PDF','-cork');

define('PHPDIG_INDEX_MSEXCEL',true);//*** false
define('PHPDIG_PARSE_MSEXCEL','C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3');
define('PHPDIG_OPTION_MSEXCEL','');

define('PHPDIG_INDEX_MSPOWERPOINT',false);
define('PHPDIG_PARSE_MSPOWERPOINT','/usr/local/bin/ppt2text');
define('PHPDIG_OPTION_MSPOWERPOINT','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');
define('PHPDIG_MSPOWERPOINT_EXTENSION',''); 
Examples of what I get in my browser after indexing:
niveau 2...
4:http://10.37.1.240/dossier_presse/dp_2004_a.pdf (not checked)
(temps : 00:01:22)

5:http://10.37.1.240/arrete_100903.pdf (not checked)
(temps : 00:01:30)

6:http://10.37.1.240/Ressources-Humain...lephonique.htm (checked)
(temps : 00:01:51)
+ + + + + +

And in the summary :
http://10.37.1.240/dossier_presse/dp_2004_a.pdf
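
The config comment above says that the full path to the external binary is required, so one thing worth checking is what each configured path really points to. The following is only a minimal sketch: `describe_tool_path` is a hypothetical helper, not part of PhpDig, and the paths are copied from the configuration above. It also reports whether `is_executable` exists in this PHP build, since the config comment says it may be undefined.

```php
<?php
// Hypothetical helper (not a PhpDig function): reports whether a configured
// tool path is missing, a directory, or a regular file.
function describe_tool_path($path)
{
    if (!file_exists($path)) {
        return 'missing';
    }
    if (is_dir($path)) {
        return 'directory';
    }
    return 'file';
}

// Paths copied from the PHPDIG_PARSE_* settings in the configuration above.
$tools = array(
    'PHPDIG_PARSE_MSWORD'  => 'C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3',
    'PHPDIG_PARSE_PDF'     => 'C:/Stage_Manuella/moteur/PHPDIG_DIR/Ghostgum/pstotext',
    'PHPDIG_PARSE_MSEXCEL' => 'C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3',
);

foreach ($tools as $name => $path) {
    echo $name . ': ' . describe_tool_path($path) . "\n";
}

// USE_IS_EXECUTABLE_COMMAND only makes sense if this function is defined.
echo 'is_executable available: '
    . (function_exists('is_executable') ? 'yes' : 'no') . "\n";
```

If a path is reported as `directory` or `missing`, the setting probably needs to name the executable file itself. If `is_executable` is reported as unavailable, setting `USE_IS_EXECUTABLE_COMMAND` to `'0'`, as the config comment suggests, may be worth trying.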