spider documents without extensions
I have some problems with correct mime type detection on our Linux server. The documents are PDF and Word (doc) files, uploaded with a form and saved without a file extension. Normally PhpDig should read the header and spider the file with the correct external binary. The files are named like 22_upload, 23_upload ...
I'm using catdoc and pstotext with PhpDig version 1.8.5. The binary installation should be correct, because
catdoc /path to file/file
and
pstotext -cork /path to file/file
both return the text content, and
file -i /path to file/file
shows the mime type: application/pdf or application/msword.

The spider is running, but the files in text_content (*.txt) and the first_words column in the database contain the binary code of the files, not the text content. I'm running it with
# php -f /path/spider.php http://path/documents/ >> /var/log/phpdig.log

So it seems that robot_functions.php does not recognise the mime type of the documents and does not know which external binary to use. Therefore binary code is written into the database.

Thanks for any suggestions,
Joe
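
P.S. To make clear what I mean by "reading the header", here is a minimal Python sketch of the kind of check I would expect. It is not PhpDig code and I am not claiming robot_functions.php works this way. The names sniff_mime and converter_for are made up for illustration, and treating every OLE2 file as a Word document is an assumption that happens to hold for our uploads.

```python
# PDF files begin with "%PDF-"; legacy Word .doc files are OLE2 compound
# documents that begin with the bytes D0 CF 11 E0 A1 B1 1A E1.
PDF_MAGIC = b"%PDF-"
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def sniff_mime(header: bytes) -> str:
    """Guess a mime type from the first bytes of a file, ignoring its name."""
    if header.startswith(PDF_MAGIC):
        return "application/pdf"
    if header.startswith(OLE2_MAGIC):
        # Assumption: any OLE2 container is a Word document.
        return "application/msword"
    return "application/octet-stream"


def converter_for(path: str):
    """Return (mime type, external command) for a file without an extension."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    mime = sniff_mime(header)
    # The same external binaries and options I call by hand on the shell.
    commands = {
        "application/pdf": ["pstotext", "-cork", path],
        "application/msword": ["catdoc", path],
    }
    # None means: no converter known, so the file should be skipped
    # instead of being indexed as raw binary.
    return mime, commands.get(mime)
```

Called on one of our uploads, converter_for("22_upload") would give back the detected mime type together with the command to run, or None as the command when the header matches neither format.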