PhpDig.net

08-17-2006, 07:39 AM   #1
jguert
Spider documents without extensions

I have a problem with MIME type detection on our Linux server. The documents are PDF and Word (doc) files, uploaded through a form and saved without a file extension. Normally PhpDig should read the file header and spider the file with the correct external binary. The files are named like 22_upload, 23_upload ...

I'm using catdoc and pstotext with PhpDig version 1.8.5. The binaries appear to be installed correctly, because
catdoc /path to file/file and
pstotext -cork /path to file/file
both return the text content.

file -i /path to file/file shows the correct MIME type:
application/pdf or application/msword
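
For anyone who wants to script the same check over a whole upload directory, here is a minimal sketch in Python. It is not part of PhpDig. The function names are hypothetical, and it assumes that `file -i` prints lines of the form `path: type; charset=...`, as it does on a typical Linux system.

```python
import subprocess


def parse_file_output(line: str) -> str:
    """Extract the MIME type from one line of `file -i` output.

    Assumed input shape: "/path/22_upload: application/pdf; charset=binary"
    """
    # Everything after the last ": " is "type; charset=..."
    _, _, rest = line.rpartition(": ")
    # Drop the "; charset=..." suffix if it is present
    return rest.split(";", 1)[0].strip()


def mime_type_of(path: str) -> str:
    """Run `file -i` on a path and return the MIME type it reports."""
    result = subprocess.run(
        ["file", "-i", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_file_output(result.stdout.strip())
```

Calling `mime_type_of()` on one of the uploaded files should report the same type as the manual command above, provided the `file` utility is installed.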

The spider runs, but the files in text_content (*.txt) and the first_words column in the database contain the raw binary data of the files instead of the extracted text. I'm running it with # php -f /path/spider.php http://path/documents/ >> /var/log/phpdig.log

So it seems that robot_functions.php does not recognise the MIME type of the documents and therefore does not know which external binary to use. As a result, the binary data is written into the database.
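
To illustrate the behaviour I would expect, here is a rough sketch in Python of choosing the converter from the file header instead of the extension. This is only an assumption about how such a check could work, not what robot_functions.php actually does, and all names in it are hypothetical. PDF files start with `%PDF-`, and old-style Word .doc files start with the OLE2 compound document signature. That signature is shared by other legacy Office formats such as .xls and .ppt, so the sketch cannot tell those apart from Word files.

```python
from typing import List, Optional

# Leading bytes ("magic numbers") of the two formats in question
PDF_MAGIC = b"%PDF-"
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # also used by .xls, .ppt

# Converter commands taken from the post; the mapping itself is an assumption
CONVERTERS = {
    "application/pdf": ["pstotext", "-cork"],
    "application/msword": ["catdoc"],
}


def sniff_mime(header: bytes) -> Optional[str]:
    """Guess the MIME type from the first bytes of a file."""
    if header.startswith(PDF_MAGIC):
        return "application/pdf"
    if header.startswith(OLE2_MAGIC):
        return "application/msword"
    return None


def command_for_header(header: bytes, path: str) -> Optional[List[str]]:
    """Return the converter command for a file, or None if it is unknown."""
    mime = sniff_mime(header)
    if mime is None:
        return None
    return CONVERTERS[mime] + [path]


def command_for_file(path: str) -> Optional[List[str]]:
    """Read the first 8 bytes of a file and pick the converter command."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    return command_for_header(header, path)
```

With a check like this, 22_upload would be passed to pstotext or catdoc according to its content, and a file with an unknown header would be skipped instead of being indexed as binary data.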

Thanks for any suggestions,
Joe

Last edited by jguert; 08-17-2006 at 07:43 AM.