PhpDig.net


jguert 08-17-2006 07:39 AM

spider documents without extensions
 
I have some problems with correct MIME type detection on our Linux server. The documents are PDF and Word (doc) files, uploaded through a form and saved without a file extension. Normally PhpDig should read the header and spider the file with the correct external binary. The files are named like 22_upload, 23_upload ...

I'm using catdoc and pstotext with PhpDig version 1.8.5. The binaries should be installed correctly, because
catdoc /path to file/file and
pstotext -cork /path to file/file
both return the text content.

file -i /path to file/file shows the MIME type:
application/pdf or application/msword
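
The manual check above can be written down as a small shell sketch that picks the converter from the detected MIME type instead of the file extension. The function names converter_for_mime and extract_text are made up for illustration and are not part of PhpDig; the sketch only assumes that file, catdoc and pstotext are on the PATH and behave as described in this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of PhpDig): map a MIME type to the
# converter command mentioned in this thread.
converter_for_mime() {
    case "$1" in
        application/pdf)    echo "pstotext -cork" ;;
        application/msword) echo "catdoc" ;;
        *)                  return 1 ;;   # unknown type: no converter
    esac
}

# Hypothetical helper: print the text content of one extensionless file.
extract_text() {
    # "file -b -i" prints the MIME type without the file name; some
    # versions append "; charset=...", so keep only the part before ";".
    mime=$(file -b -i "$1" | cut -d';' -f1)
    cmd=$(converter_for_mime "$mime") || {
        echo "no converter for $mime: $1" >&2
        return 1
    }
    # $cmd is left unquoted on purpose so that "pstotext -cork"
    # splits into the command and its flag.
    $cmd "$1"
}
```

Calling extract_text on one of the uploaded files (for example 22_upload) should print the same text as running catdoc or pstotext by hand. If it does, detection with file works on the server and the problem is more likely in how the spider chooses the external binary.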

The spider runs, but the files in text_content (*.txt) and the first_words column in the database contain the binary content of the files instead of the extracted text. I start the spider with:
# php -f /path/spider.php http://path/documents/ >> /var/log/phpdig.log

So it seems that robot_functions.php does not recognise the MIME type of the documents and therefore does not know which external binary to use. As a result, binary data is written to the database.

Thanks for any suggestions,
Joe

