Word and Excel converted but not indexed!


Topaz
10-12-2004, 08:47 PM
Hello

Now we come to my last problem (at least I hope so ;) ).

It seems that the spider processes my Word and Excel files, but they cannot be searched and do not appear in my list of indexed documents. If I parse the documents on the command line with

/usr/local/bin/catdoc -s 8859-1 test.doc

I have no problems.
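
A quick way to double-check where a converter actually sends its text is sketched below. It is only a sketch: `convert` is a hypothetical stand-in so that the snippet runs without catdoc installed, `where_is_output` is an illustrative helper name, and the "input name plus .txt" sidecar file is an assumption about how a file-writing tool might name its output.

```shell
#!/bin/sh
# Hypothetical stand-in for a stdout-style converter.
# Swap in the real command, e.g.: /usr/local/bin/catdoc -s 8859-1 "$1"
convert() {
    printf 'BESTELL-FORMULAR\n'
}

# Report where the converted text for the given input file ends up:
# on stdout, in a sidecar "<input>.txt" file (assumed naming), or nowhere.
where_is_output() {
    input=$1
    out=$(convert "$input")
    if [ -n "$out" ]; then
        echo "stdout"
    elif [ -s "${input}.txt" ]; then
        echo "file: ${input}.txt"
    else
        echo "none"
    fi
}

where_is_output test.doc
```

With the stand-in converter this prints `stdout`, which matches what the command-line test above suggests for catdoc.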

And the spider itself creates a file in /admin/temp/ with the correct content. So it parses the document flawlessly, but it seems to write nothing into the database. I searched the 'spider' table without success. Indexing PDFs is not a problem.

I tried different MIME settings in 'robot_functions.php' (application.msword - according to 'mime.conf' from Apache) but with no luck.

I use the latest version of PHPDig (1.8.3), PHP 4.3.0, MySQL 3.23.49 and Apache 1.3.24 on Red Hat 7.2 (Enigma).

Thank you very much for your kind help

Regards

Topaz

mleray
10-13-2004, 01:10 AM
Take a look at the External Binaries Forum...
I hope you'll find a solution here.

Topaz
10-14-2004, 03:28 AM
Unfortunately, that doesn't work.

I tried everything. I followed the instructions at http://www.phpdig.net/forum/showthread.php?t=799. My php.ini settings are fine. I also copied in all the debugging code and got the following:


SITE : http://www.vips.ch/
Ausgeschlossene Pfade :
- administration/
- cgi-bin/
- css/
- db/
- flash/
- icongraphics/
- images/
- images_nav/
- scripts/
- search/
- stuff/
- de/login/
- fr/login/


Is result test http an array: 1
What is result test http status: HTML
Relative Path: ../admin/temp/

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1
1:http://www.vips.ch/test.php
(Zeit : 00:00:04)
+ + +
2: <http://www.vips.ch/test.php> Wurde gerade indiziert
(Zeit : 00:00:07)

Level 1...


Is result test http an array: 1
What is result test http status: MSWORD
Relative Path: ../admin/temp/

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/local/bin/catdoc -s 8859-1 ../admin/temp/66114322.tmp
Result contains: Array ( [0] => BESTELL-FORMULAR [1] => [2] => Medikamentenpackung/Broschüre "Behandlungserfolge" [3] => [4] => Die Broschüre ist ab 5. Mai 2003 lieferbar. [5] => [6] => Lieferung bis spätestens: [7] => ... )
Return value is: 0

3:http://www.vips.ch/test.doc
(Zeit : 00:00:13)



Is result test http an array: 1
What is result test http status: PDF
Relative Path: ../admin/temp/

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/local/bin/pdftotext ../admin/temp/38538542.tmp
Result contains: Array ( )
Return value is: 0

4:http://www.vips.ch/test.pdf
(Zeit : 00:00:16)


Is result test http an array: 1
What is result test http status: MSEXCEL
Relative Path: ../admin/temp/

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pdftotext
Does parse pdf exist: 1
Is parse pdf executable: 1

Command is: /usr/local/bin/xls2csv ../admin/temp/57661852.tmp
Result contains: Array ( [0] => "Schritte","Beschreibung" [1] => , [2] => "1","Produktname eingeben" [3] => "2","Darreichungsformen und Packungen eingeben" [4] => "3","BAG Nummer ein... )
Return value is: 0

5:http://www.vips.ch/test.xls
(Zeit : 00:00:20)
Kein Link in der temporäreren Tabelle

I snipped the contents of the documents, but as you can see, the documents get converted but nothing is put into the database! How come?

Thanks for any further help.

Topaz

mleray
10-14-2004, 06:21 AM
What are your options in config file here :
//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','.txt');
define('PHPDIG_MSEXCEL_EXTENSION','');
define('PHPDIG_MSPOWERPOINT_EXTENSION','');
I had set '.txt' for all tools, and the files were converted fine but not indexed. The extension is only necessary for PDF (using pdftotext).
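
To illustrate why a stray extension can leave converted files unindexed, here is a minimal shell sketch. It assumes, based only on the config comment above, that the indexer takes the converter's stdout when the extension is empty and otherwise looks for a file named after the temp file plus that extension. The names `to_stdout` and `read_text` are hypothetical, and this is not PhpDig's actual code.

```shell
#!/bin/sh
# Sketch only: models an indexer that runs a converter on a temp file and
# then takes the text from stdout (empty extension) or from the file
# "<tempfile><extension>" (non-empty extension).

tmpdir=$(mktemp -d)
tempfile="$tmpdir/66114322.tmp"
printf 'binary document bytes' > "$tempfile"

# Hypothetical stdout-style converter (stand-in for catdoc or xls2csv).
to_stdout() {
    printf 'BESTELL-FORMULAR\n'
}

# read_text EXTENSION: print the text the modelled indexer ends up with.
read_text() {
    ext=$1
    out=$(to_stdout "$tempfile")
    if [ -z "$ext" ]; then
        printf '%s\n' "$out"
    elif [ -f "${tempfile}${ext}" ]; then
        cat "${tempfile}${ext}"
    fi
    # Non-empty extension but no such file: nothing is printed.
}

echo "extension '':     [$(read_text '')]"
echo "extension '.txt': [$(read_text '.txt')]"
rm -rf "$tmpdir"
```

In this model the first line shows the converted text and the second shows empty brackets: the conversion succeeded, but the text was looked for in a file that the stdout-style tool never wrote.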

Topaz
10-15-2004, 02:54 AM
AHHHHHHHHH, it's unbelievable.

It's true. I just had to remove this stupid suffix! Now it works flawlessly. Life can be cruel to fools like me.
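
For anyone checking their own setup, the following sketch lists which extension settings are non-empty. The sample config contents are illustrative values, not the real file (its path is not given in this thread), and `nonempty_extensions` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list the *_EXTENSION constants that have a non-empty value.
# The sample below stands in for the real config file.
config=$(mktemp)
cat > "$config" <<'EOF'
define('PHPDIG_MSWORD_EXTENSION','.txt');
define('PHPDIG_PDF_EXTENSION','.txt');
define('PHPDIG_MSEXCEL_EXTENSION','');
define('PHPDIG_MSPOWERPOINT_EXTENSION','');
EOF

# Read config lines on stdin; print the names of the constants whose
# value is not the empty string.
nonempty_extensions() {
    sed -n "s/^define('\(PHPDIG_[A-Z]*_EXTENSION\)','..*');.*/\1/p"
}

nonempty_extensions < "$config"
rm -f "$config"
```

For the sample above this prints PHPDIG_MSWORD_EXTENSION and PHPDIG_PDF_EXTENSION. Going by this thread, only the PDF entry should be non-empty when catdoc and xls2csv write to stdout.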

Thank you very much for the tip. If you are ever in Switzerland, I'll treat you to a fondue :-).

I would suggest adding that to the external binaries README.

Topaz

vinyl-junkie
10-15-2004, 03:24 AM
What?! Read the directions FIRST? What a novel concept! :D ;)

Topaz
10-15-2004, 03:36 AM
Well, is it written somewhere? I probably read through all the manuals, readmes and threads I could find. If it can be found somewhere I'll definitely need a vacation :-).

Topaz

vinyl-junkie
10-15-2004, 04:18 AM
Oops! You're right. I thought I was just being funny, and that you meant that the info was in the manual and you didn't read it. Sorry about that. I agree, that should definitely be in the documentation.

mleray
10-15-2004, 04:49 AM
I'm delighted to have been able to help someone :)

Topaz
10-15-2004, 01:40 PM
No problem, I was concerned about myself :-).