mleray
10-01-2004, 02:30 AM
Hi,
My English is not very good, so I'm a little lost in this forum.
I've seen many topics about problems with indexing PDF files, but I can't find a solution. I'm sure it is somewhere on the forum...
My problem is that my PDF files seem to be indexed, but when I search for a keyword or for the filename of one of them, I can't find it.
I've looked in the database and have never seen any PDF file there (nor any .doc file, although .xls files seem to be OK).
I use PHP 4.3.3 and MySQL 4.0.15 on Windows XP.
The PHPDig version is 1.8.3.
The site I'm trying to index is an intranet site, so I can't give you a link to look at.
//---------EXTERNAL TOOLS SETUP
// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND','1'); //use is_executable for external binaries
// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);//*** false
define('PHPDIG_PARSE_MSWORD','C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');
define('PHPDIG_INDEX_PDF',true); //*** false
define('PHPDIG_PARSE_PDF','C:/Stage_Manuella/moteur/PHPDIG_DIR/Ghostgum/pstotext');
define('PHPDIG_OPTION_PDF','-cork');
define('PHPDIG_INDEX_MSEXCEL',true);//*** false
define('PHPDIG_PARSE_MSEXCEL','C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3');
define('PHPDIG_OPTION_MSEXCEL','');
define('PHPDIG_INDEX_MSPOWERPOINT',false);
define('PHPDIG_PARSE_MSPOWERPOINT','/usr/local/bin/ppt2text');
define('PHPDIG_OPTION_MSPOWERPOINT','');
//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');
define('PHPDIG_MSPOWERPOINT_EXTENSION','');
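To rule out a path problem, a small standalone check along these lines could be run on the same machine. It is only a sketch: `describe_path` is a hypothetical helper, not part of PHPDig, and the paths are simply copied from the settings above. It reports whether each configured parser path is a regular file, a directory, or missing, since the config comment asks for the full path to the binary.

```php
<?php
// Hypothetical diagnostic, not part of PHPDig.
// Paths are copied from the define() lines above.
$parsers = array(
    'MSWORD'  => 'C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3',
    'PDF'     => 'C:/Stage_Manuella/moteur/PHPDIG_DIR/Ghostgum/pstotext',
    'MSEXCEL' => 'C:/Stage_Manuella/moteur/PHPDIG_DIR/catdoc-0.93.3',
);

// Classify a path as 'file', 'directory' or 'missing'.
function describe_path($path)
{
    if (is_dir($path)) {
        return 'directory';
    }
    if (is_file($path)) {
        return 'file';
    }
    return 'missing';
}

// Print one line per configured parser.
foreach ($parsers as $name => $path) {
    echo $name . ': ' . $path . ' -> ' . describe_path($path) . "\n";
}
```

If a path is reported as a directory or as missing, the corresponding setting probably does not point at the executable itself.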
Example of what I get in my browser after indexing:
niveau 2...
4:http://10.37.1.240/dossier_presse/dp_2004_a.pdf (not checked)
(temps : 00:01:22)
5:http://10.37.1.240/arrete_100903.pdf (not checked)
(temps : 00:01:30)
6:http://10.37.1.240/Ressources-Humaines/annuaire_telephonique.htm (checked)
(temps : 00:01:51)
+ + + + + +
And in the summary:
http://10.37.1.240/dossier_presse/dp_2004_a.pdf