Indexing MS Word docs under Windows


phil_ballard
12-07-2003, 02:35 AM
I've had no success trying to index MS Word (.DOC) documents under Windows. I have:

define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\\Program Files\\EasyPHP1-7\\www\\k3\\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

Can anyone comment as to why phpdig makes no attempt to index Word docs?

Any help appreciated

Phil

Charter
12-07-2003, 11:06 AM
Hi. Is USE_IS_EXECUTABLE_COMMAND set to true (one) or false (zero) in the config file?

phil_ballard
12-08-2003, 06:05 AM
USE_IS_EXECUTABLE_COMMAND is set at the default value of 1. But things have got worse ... :(
I decided to drop 1.6.2 and try 1.6.5, so I removed all the code and the DB tables and installed 1.6.5 from scratch. The install seemed to go OK, but now I can't get past here:

Spidering in progress...

--------------------------------------------------------------------------------
SITE : http://[whatever]
Exclude paths :
- @NONE@

Fatal error: Call to undefined function: is_executable() in
c:\program files\easyphp1-7\www\k3\phpdig\admin\robot_functions.php on line 633


Any offers/advice gratefully received.

Phil

Rolandks
12-08-2003, 07:59 AM
Take a look here: http://www.phpdig.net/showthread.php?s=&threadid=272

-Roland-

phil_ballard
12-08-2003, 08:58 AM
That seemed to work - thanks!
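
For anyone landing here with the same fatal error: the message means this PHP build on Windows has no is_executable() function at all (PHP only gained it on Windows in version 5.0.0), so any code that calls it unconditionally stops the spider. A minimal sketch of a guard follows. The helper name parser_usable() is made up for illustration and is not part of phpdig, and the fallback to is_file() is only an assumption about what is "good enough" on such a build.

```php
<?php
// Hypothetical helper, NOT part of phpdig: check whether an external parser
// can be used, without calling is_executable() on builds that lack it.
function parser_usable($path)
{
    if (function_exists('is_executable')) {
        // Normal case: let PHP decide whether the file can be executed.
        return is_executable($path);
    }
    // Fallback (assumption): if the file exists, trust that it can be run.
    return is_file($path);
}
```

Something like parser_usable(PHPDIG_PARSE_MSWORD) could then be used wherever the parser path needs checking, though how phpdig itself performs that check may well differ.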

phil_ballard
12-08-2003, 10:14 AM
We-ell, we're improving... It now spiders OK without giving errors, but it still isn't indexing the contents of the .doc files. I tried pointing the spider directly at the URL of a .doc file I knew existed:


Spidering in progress...

--------------------------------------------------------------------------------
SITE : http://[mysite IP]/
Exclude paths :
- @NONE@
1:http://[mysite IP]/k3/CVs/4.doc
(time : 00:00:03)
No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://[mysite IP]/k3/CVs/4.doc
Optimizing tables...
Indexing complete !
--------------------------------------------------------------------------------
[Back] to admin interface.


... but there are still no keywords indexed.

Any more help welcome.

Charter
12-08-2003, 02:07 PM
Hi. From the command line what does the following produce?


"C:\Program Files\EasyPHP1-7\www\k3\catdoc" -s 8859-1 change-me-4.doc

phil_ballard
12-09-2003, 01:44 AM
"Cannot load charset cp1251 - file not found"

phil_ballard
12-09-2003, 02:50 AM
OK, sorted out the charset paths, now seems to extract text OK from the command line, but still not via the web interface...:(

phil_ballard
12-09-2003, 04:02 AM
OK, all working; it seems that it didn't like the path name having a space in it (C:\Program Files\.......).
Once I moved catdoc (and its config subdirectories) to a path without a space (C:\ for instance), all was well.
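
To illustrate why the space matters: if the parser path, the options and the file name are simply joined with spaces and handed to the shell, "C:\Program" is taken as the program name and the rest as arguments. The sketch below quotes any part containing a space. It is a guess at the general pattern, not phpdig's actual code; quote_part() and build_parser_command() are hypothetical names, and the helper assumes the parts contain no embedded double quotes.

```php
<?php
// Hypothetical helpers, NOT taken from phpdig: build the command line for an
// external parser, quoting any part that contains a space.
function quote_part($part)
{
    // Wrap in double quotes only when needed; assumes no embedded quotes.
    return (strpos($part, ' ') !== false) ? '"' . $part . '"' : $part;
}

function build_parser_command($parser, $options, $file)
{
    // $options is passed through as-is because it holds several flags.
    return quote_part($parser) . ' ' . $options . ' ' . quote_part($file);
}

$cmd = build_parser_command(
    'C:\\Program Files\\EasyPHP1-7\\www\\k3\\catdoc',
    '-s 8859-1',
    '4.doc'
);
echo $cmd, "\n";
// prints: "C:\Program Files\EasyPHP1-7\www\k3\catdoc" -s 8859-1 4.doc
```

Quoting rules on Windows can still be quirky when a command goes through the shell, so moving catdoc to a space-free path remains the simpler fix.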
Many thanks for your help, guys. (Though I'm sure I'll be back with more dopey questions :)
BTW my own requirement is for index searching on just one, local directory full of MS Word files. To facilitate this I have a file index.php which provides a link for the spider to all Word files in the directory:

<HTML>
<HEAD></HEAD>
<BODY>
<?php
// Return the file extension in lower case ('' if the name has none).
function gfext($filename)
{
    $pathinfo = pathinfo($filename);
    $ext = isset($pathinfo['extension']) ? $pathinfo['extension'] : '';
    return strtolower($ext);
}

// Read this directory and link to each Word file so the spider can find it.
if ($handle = opendir('.')) {
    while (false !== ($file = readdir($handle))) {
        if (gfext($file) == "doc") { // we only want the Word files
            // Encode the name for the URL and escape it for display.
            echo "<a href=\"".rawurlencode($file)."\">".htmlspecialchars($file)."</a><br>\n";
        }
    }
    closedir($handle);
}
?>
</BODY>
</HTML>

On this page the spider finds a list of href links, one to each Word document. Simple stuff, I know, but it may help someone.

All the best

Phil