Catdoc garbage
I use catdoc ('newest' version 0.93) to index MS Word documents. The Word 97 format works fine, but the newer Word formats (2000 or 2002) produce a lot of garbage in the search results. Is it possible to prevent this?
|
Hi. From http://www.45.free.net/~vitus/ice/catdoc/ ...
"Current development version of catdoc is 0.93.3. It is finally able to autodetect unicode/non-unicode Word files and also recognizes (and hopefully parses) MS-Write files and rtf. It also eliminates garbage which troubled previous versions of catdoc. Note that footnotes and fastsaves are still not handled."
Maybe try version 0.93.3 instead.
|
We are already using version 0.93.3. Sometimes we get garbage when indexing short .doc files, because those particular Word documents contain a lot of driver info.
|
Hi. Perhaps, if the garbage always follows the same pattern, you could strip it out with a regex in the phpdigCleanHtml function, or maybe the author of catdoc can give you some ideas for preventing the garbage.
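To give a rough idea of what such a filter might look like, here is a minimal sketch in Python, used only for illustration. The function names, the 0.6 threshold, and the definition of "garbage" (control characters, plus lines made up mostly of symbols) are all assumptions, not anything taken from catdoc or PhpDig. You would need to adapt the patterns to the garbage you actually see, and the same idea could be ported to preg_replace calls inside phpdigCleanHtml.

```python
import re

# Control characters other than tab, newline and carriage return.
# Assumption: the garbage contains bytes like these; adjust to what you see.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def looks_like_garbage(line, min_ratio=0.6):
    """Heuristic: treat a line as garbage when fewer than min_ratio of its
    characters are letters, digits or whitespace. The threshold is a guess."""
    stripped = line.strip()
    if not stripped:
        return False  # keep blank lines, they separate paragraphs
    wordish = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return wordish / len(stripped) < min_ratio


def strip_garbage(text):
    """Remove control characters, then drop lines that look like garbage."""
    text = CONTROL_CHARS.sub("", text)
    kept = [line for line in text.splitlines() if not looks_like_garbage(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "Hello, world.\n\x01\x02#$%^&*@!~\nSecond line"
    print(strip_garbage(sample))
```

A ratio-based heuristic like this can also throw away legitimate lines (for example, lines that are mostly punctuation or numbers with symbols), so it would be worth testing on a few real documents before using it for indexing.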
|