Catdoc garbage


Hoek
02-16-2004, 10:23 AM
I use catdoc (the 'newest' version, 0.93) to index MS Word documents. The Word 97 format works fine, but newer Word formats (2000 or 2002) produce a lot of garbage in the search results. Is it possible to prevent this?

Charter
02-16-2004, 02:22 PM
Hi. From http://www.45.free.net/~vitus/ice/catdoc/ ...

"Current development version of catdoc is 0.93.3. It finally is able to autodetect unicode/non-unicode Word files and also recognizes (and hopefully parses) MS-Write files and rtf. It also eliminates garbage which troubled previous version of catdoc. Note that footnotes and fastsaves still not handled."

Maybe try version 0.93.3 instead.

Hoek
02-23-2004, 01:12 AM
We are already using version 0.93.3. We still sometimes get garbage when indexing short .doc files, because those particular Word documents contain a lot of driver information.

Charter
02-23-2004, 02:57 PM
Hi. Perhaps, if the garbage always follows the same pattern, you could strip it out with a regex in the phpdigCleanHtml function. Otherwise, the author of catdoc may be able to suggest a way to prevent the garbage in the first place.
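
As a rough illustration of the regex idea, here is a minimal sketch in Python (not PHP, and not taken from phpdig's code). The function name `strip_catdoc_garbage` and all three patterns are assumptions made for the example. What the garbage actually looks like would have to be checked in the real catdoc output first, and the patterns adjusted to match.

```python
import re

# All patterns below are hypothetical; inspect real catdoc output before reusing them.

# Control characters other than tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Standalone tokens of four or more characters with no letters, digits,
# or underscores (for example "#$%&@!"). Punctuation attached to a word is kept.
SYMBOL_RUNS = re.compile(r"(?<!\S)[^\w\s]{4,}(?!\S)")

# Runs of spaces or tabs left behind after the removals above.
EXTRA_SPACES = re.compile(r"[ \t]{2,}")


def strip_catdoc_garbage(text: str) -> str:
    """Remove control characters and symbol-only tokens from extracted text."""
    text = CONTROL_CHARS.sub(" ", text)
    text = SYMBOL_RUNS.sub(" ", text)
    text = EXTRA_SPACES.sub(" ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "Quarterly report\x01\x02 #$%&@! for the sales team"
    print(strip_catdoc_garbage(sample))
```

The same three substitutions could be ported to `preg_replace` calls inside phpdigCleanHtml, assuming the extracted text passes through that function before it is indexed.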