PhpDig.net

Hoek 02-16-2004 09:23 AM

Catdoc garbage
 
I use catdoc (the 'newest' version, 0.93) to index MS Word documents. The Word 97 format works fine, but newer Word formats (2000 or 2002) produce a lot of garbage in the search results. Is it possible to prevent this?

Charter 02-16-2004 01:22 PM

Hi. From http://www.45.free.net/~vitus/ice/catdoc/ ...

Current development version of catdoc is 0.93.3. It finally is able to autodetect unicode/non-unicode Word files and also recognizes (and hopefully parses) MS-Write files and rtf. It also eliminates garbage which troubled previous version of catdoc. Note that footnotes and fastsaves still not handled.

Maybe try version 0.93.3 instead.

Hoek 02-23-2004 12:12 AM

We are already using version 0.93.3. Sometimes you still get garbage when indexing short .doc files, because those particular Word documents contain a lot of driver info.

Charter 02-23-2004 01:57 PM

Hi. Perhaps, if the garbage always has the same format, you could strip it out with a regex in the phpdigCleanHtml function. Alternatively, the author of catdoc may have some ideas for preventing the garbage.
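
As a rough illustration of the regex idea, here is a minimal sketch. The helper name stripCatdocGarbage and its $extraPatterns parameter are hypothetical and not part of PhpDig or catdoc, and the control-character pattern is only an assumption about what the garbage looks like; check it against the actual catdoc output before relying on it.

```php
<?php
// Hypothetical helper, not part of PhpDig or catdoc.
function stripCatdocGarbage($text, array $extraPatterns = array())
{
    // Assumption: some garbage shows up as runs of ASCII control characters.
    // Tab (\x09), newline (\x0A) and carriage return (\x0D) are kept.
    $text = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/', ' ', $text);

    // Site-specific patterns for garbage that always has the same format.
    foreach ($extraPatterns as $pattern) {
        $text = preg_replace($pattern, ' ', $text);
    }

    // Collapse the runs of spaces left behind by the replacements.
    return trim(preg_replace('/ {2,}/', ' ', $text));
}
```

Something like this could be called on the text inside phpdigCleanHtml, with any recurring driver strings passed in as extra patterns. The patterns use no /u modifier, so they operate on bytes and do not require the catdoc output to be valid UTF-8.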


All times are GMT -8.