PhpDig.net


phil_ballard 12-07-2003 01:35 AM

Indexing MS Word docs under Windows
 
I've had no success trying to index MS Word (.DOC) documents under Windows. I have:
Code:

define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\\Program Files\\EasyPHP1-7\\www\\k3\\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

Can anyone comment as to why phpdig makes no attempt to index Word docs?

Any help appreciated

Phil

Charter 12-07-2003 10:06 AM

Hi. Is USE_IS_EXECUTABLE_COMMAND set to true (one) or false (zero) in the config file?

phil_ballard 12-08-2003 05:05 AM

USE_IS_EXECUTABLE_COMMAND is set at the default value of 1. But things have got worse ... :(
I decided to can 1.6.2 and try 1.6.5, so I removed all code and the DB tables and re-installed 1.6.5 - install seemed to go OK, but now I can't get past here:

Code:

Spidering in progress...

--------------------------------------------------------------------------------
SITE : http://[whatever]
Exclude paths :
- @NONE@

Fatal error: Call to undefined function: is_executable() in
c:\program files\easyphp1-7\www\k3\phpdig\admin\robot_functions.php on line 633

Any offers/advice gratefully received.

Phil

Rolandks 12-08-2003 06:59 AM

Take a look here: http://www.phpdig.net/showthread.php?s=&threadid=272

-Roland-

phil_ballard 12-08-2003 07:58 AM

That seemed to work - thanks!
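
For anyone who can't follow the link: the fatal error above appears because is_executable() was not available on Windows before PHP 5.0.0. The linked thread's fix isn't quoted here, so the following is only a minimal sketch of one way to guard such a call. The helper name binary_looks_usable() is made up for illustration and is not part of PhpDig.

```php
<?php
// Hypothetical helper, not PhpDig code: check a binary path without
// assuming that is_executable() exists on this platform.
function binary_looks_usable($path)
{
    if (function_exists('is_executable')) {
        return is_executable($path);
    }
    // No is_executable() here: the best we can do is confirm the file exists.
    return is_file($path);
}
```

On a setup like the one in this thread, the fallback branch would run, so the check only confirms that the file exists.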

phil_ballard 12-08-2003 09:14 AM

we-ell, we're improving .... now it spiders OK without giving errors, but it still isn't indexing the contents of the .doc files ... I tried spidering directly to the URL of a .doc file I knew existed:

Code:

Spidering in progress...

--------------------------------------------------------------------------------
SITE : http://[mysite IP]/
Exclude paths :
- @NONE@
1:http://[mysite IP]/k3/CVs/4.doc
(time : 00:00:03)
No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://[mysite IP]/k3/CVs/4.doc
Optimizing tables...
Indexing complete !
--------------------------------------------------------------------------------
 [Back] to admin interface.

... but there are still no keywords indexed ...

Any more help welcome.

Charter 12-08-2003 01:07 PM

Hi. From the command line what does the following produce?


"C:\Program Files\EasyPHP1-7\www\k3\catdoc" -s 8859-1 change-me-4.doc

phil_ballard 12-09-2003 12:44 AM

"Cannot load charset cp1251 - file not found"

phil_ballard 12-09-2003 01:50 AM

OK, sorted out the charset paths, now seems to extract text OK from the command line, but still not via the web interface...:(
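
When a command works from the prompt but not through the web interface, it can help to run the same string from a small PHP script and look at the exit status and any error text. The sketch below assumes nothing about how PhpDig builds its command; run_command() is a hypothetical helper written for this check only.

```php
<?php
// Hypothetical helper, not PhpDig code: run a command the way a PHP script
// would, and return its exit status plus everything it printed.
function run_command($command)
{
    $lines = array();
    $status = -1;
    // 2>&1 folds error messages into the captured output.
    exec($command . ' 2>&1', $lines, $status);
    return array($status, $lines);
}

// Example call (substitute the exact command string you want to test):
// list($status, $lines) = run_command('catdoc -s 8859-1 change-me-4.doc');
// echo "exit status: $status\n" . implode("\n", $lines) . "\n";
```

A non-zero exit status, or an error message in the captured lines, suggests the command is failing before any text is extracted.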

phil_ballard 12-09-2003 03:02 AM

OK, all working; it seems it didn't like the path name having a space in it (C:\Program Files\.......).
Once I moved catdoc (and its config subdirectories) to a path with no spaces (C:\ for instance), all was well.
Many thanks for your help, guys. (Though I'm sure I'll be back with more dopey questions :)
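
For reference, the working settings would look roughly like this. The C:\catdoc directory is only an assumed example of a space-free location; the other values are the ones from the first post.

```php
<?php
// Assumption: catdoc and its charset files were copied to C:\catdoc
// (hypothetical path; any location without spaces should serve the same purpose).
define('PHPDIG_INDEX_MSWORD', true);
define('PHPDIG_PARSE_MSWORD', 'C:\\catdoc\\catdoc');
define('PHPDIG_OPTION_MSWORD', '-s 8859-1');
```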
BTW my own requirement is for index searching on just one, local directory full of MS Word files. To facilitate this I have a file index.php which provides a link for the spider to all Word files in the directory:
Code:

<HTML>
<HEAD></HEAD>
<BODY>
<?php
// Return the file extension in lower case ('' if the name has none).
function gfext($filename)
{
    return strtolower(pathinfo($filename, PATHINFO_EXTENSION));
}

// Read this directory and link to every Word file in it.
if ($handle = opendir('.')) {
    while (false !== ($file = readdir($handle))) {
        if (gfext($file) == 'doc') {  // we only want the Word files
            // Encode the name for the URL and escape it for display.
            echo '<a href="' . rawurlencode($file) . '">' . htmlspecialchars($file) . "</a><br>\n";
        }
    }
    closedir($handle);
}
?>
</BODY>
</HTML>

At this page the spider encounters a list of href links, one to each word document. Simple stuff, I know, but may help someone?

All the best

Phil

