spider documents without extensions


jguert
08-17-2006, 08:39 AM
I have a problem with MIME type detection on our Linux server. The documents are PDF and Word (doc) files, uploaded through a form and saved without a file extension. Normally PhpDig should read the file header and spider each file with the correct external binary. The files are named like 22_upload, 23_upload ...

I'm using catdoc and pstotext with PhpDig version 1.8.5. The binaries appear to be installed correctly, because
catdoc /path to file/file and
pstotext -cork /path to file/file
both return the text content.

file -i /path to file/file reports the MIME type:
application/pdf or application/msword

The spider runs, but the files in text_content (*.txt) and the first_words column in the database contain the raw binary data of the files, not the text content. I start the spider with:
# php -f /path/spider.php http://path/documents/ >> /var/log/phpdig.log

So it seems that robot_functions.php does not recognise the MIME type of the documents and therefore does not know which external binary to call. As a result, the binary data is written to the database.
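
To make clear what behaviour I expect, here is a rough shell sketch of content-based dispatch. It is only an illustration and not PhpDig's actual code: the function names pick_converter and extract_text are made up for this example, and it assumes that file, catdoc and pstotext are on the PATH.

```shell
# Hypothetical helper: map a MIME type to the external converter command.
pick_converter() {
    case "$1" in
        application/pdf)    echo "pstotext -cork" ;;
        application/msword) echo "catdoc" ;;
        *)                  return 1 ;;   # unknown type: no converter
    esac
}

# Hypothetical helper: detect the MIME type from the file content and run
# the matching converter. The file name and extension are never inspected.
extract_text() {
    # "file -b -i" prints e.g. "application/pdf; charset=binary";
    # cut keeps only the part before the first semicolon.
    mime=$(file -b -i "$1" | cut -d';' -f1)
    converter=$(pick_converter "$mime") || {
        echo "no converter for $mime: $1" >&2
        return 1
    }
    # $converter is left unquoted on purpose so that "pstotext -cork"
    # is split into the command and its option.
    $converter "$1"
}
```

Called as extract_text 22_upload, this should print the extracted text for a PDF or Word upload and fail with a message for any other type, which is roughly what I would expect the spider to do internally.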

Thanks for any suggestions,
Joe