Catdoc garbage
I use catdoc (the 'newest' version, 0.93) to index MS Word documents. The Word 97 format works fine, but newer Word formats (2000 or 2002) produce a lot of garbage in the search results. Is it possible to prevent this?
Charter
02-16-2004, 01:22 PM
Hi. From http://www.45.free.net/~vitus/ice/catdoc/ ...
Current development version of catdoc is 0.93.3. It finally is able to autodetect unicode/non-unicode Word files and also recognizes (and hopefully parses) MS-Write files and rtf. It also eliminates garbage which troubled previous version of catdoc. Note that footnotes and fastsaves are still not handled.
Maybe try version 0.93.3 instead.
We are already using version 0.93.3. Sometimes you still get garbage when indexing short .doc files, because of the large amount of driver info embedded in those particular Word documents.
Charter
02-23-2004, 01:57 PM
Hi. Perhaps, if the garbage always has the same format, you could strip it out with a regex in the phpdigCleanHtml function, or maybe the author of catdoc can suggest a way to prevent the garbage.
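To make the regex idea concrete, here is a rough sketch in Python rather than PhpDig's own PHP. It assumes the "garbage" consists of control characters and long runs of punctuation-like symbols, which may not match what catdoc actually emits for your files. The name strip_garbage and the three patterns are hypothetical, not part of PhpDig or catdoc; the same patterns could probably be ported to PHP's preg_replace.

```python
import re

# Assumption: garbage = control characters plus runs of 4 or more
# characters that are neither word characters nor whitespace.
# Adjust the patterns after looking at real catdoc output.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]+")
SYMBOL_RUNS = re.compile(r"[^\w\s]{4,}")
WHITESPACE = re.compile(r"\s+")


def strip_garbage(text: str) -> str:
    """Remove likely binary debris from extracted text (hypothetical helper)."""
    text = CONTROL_CHARS.sub(" ", text)   # drop control characters
    text = SYMBOL_RUNS.sub(" ", text)     # drop long symbol-only runs
    return WHITESPACE.sub(" ", text).strip()  # collapse leftover whitespace


if __name__ == "__main__":
    print(strip_garbage("Report\x01\x02 ####@@@@ 2004"))
```

Short punctuation such as "..." is left alone because only runs of four or more symbols are removed; tighten or loosen that threshold depending on how much real text gets caught.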