catdoc and xls2csv not indexing


greener_02445
04-07-2004, 07:39 AM
Can anyone help me? I have been trying to get Word documents and Excel files to index. I am using Apache on a Windows XP system. Indexing works for text files only. This is how my config settings look:


define('USE_IS_EXECUTABLE_COMMAND','0'); //use is_executable for external binaries
// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\catdoc\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','-cork');
define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','C:\catdoc\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');

I have tried the xls2csv and catdoc programs from the MS-DOS prompt and they work fine. When I submit a URI with a .doc or .xls file, this is what I get:

SITE : http://localhost/
Exclude paths :
- @NONE@
No link in temporary table

--------------------------------------------------------------------------------

links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !

Any advice is much appreciated.

-Rich

maza
04-08-2004, 03:56 AM
My configuration :

PhpDig 1.8.0, EasyPHP 1.7, Windows XP

Config File :

define('USE_IS_EXECUTABLE_COMMAND','0'); //use is_executable for external binaries

// if set to true, full path to external binary required
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\\Ghostgum\\pstotext\\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','C:\\Ghostgum\\pstotext\\pstotxt3');
define('PHPDIG_OPTION_PDF','');

define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','C:\\Ghostgum\\pstotext\\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');

//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');

I cannot index .pdf, .doc or .xls files, only HTML or text files. Catdoc, pstotxt3 and xls2csv all work in a Windows shell.

I've read nearly all of the external binaries topics without finding a clue.

If you have even the beginning of an idea, please share it. I'm desperate.

Thanks. Axel

greener_02445
04-08-2004, 06:55 AM
I see people posting who claim they can get catdoc, xls2csv and pdftotext to work on Windows systems. I have read through all the posts on this topic and still cannot get PhpDig to see these documents. The programs work from the DOS prompt only. All I have changed is the config, which I showed before. Are there other things that need to be altered, in the spider.php file perhaps?
Can anyone direct me toward some additional online documentation, or show me what they have done to solve this problem?
Help, help, somebody!

Charter
04-09-2004, 08:18 AM
Hi. What version of PHP? Perhaps try this (http://www.phpdig.net/showthread.php?threadid=570) thread.

greener_02445
04-09-2004, 09:12 AM
Thank you for responding. I'm using PHP 4.3.3 and I tried the suggestion you listed, but I'm still getting:

No link in temporary table
links found : 0

Charter
04-09-2004, 10:35 AM
Hi. Just posted this (http://www.phpdig.net/showthread.php?threadid=799) thread.

greener_02445
04-11-2004, 01:05 PM
Hi. I followed all of your instructions and inserted those echo statements. It doesn't seem to work yet, but I will keep trying. What does everyone think? All the programs:

catdoc, pdftotext and xls2csv all work from the command line.

This is the output that I receive:

Spidering in progress...

SITE : http://localhost/
Exclude paths :
- @NONE@

Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
1:http://localhost/grants/
(time : 00:00:05)
+ + + + + +
level 1...

Is result test http an array: 1
What is result test http status: MSEXCEL

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
2:http://localhost/grants/test.xls
(time : 00:00:15)

Is result test http an array: 1
What is result test http status: PLAINTEXT

Is result test an array: 1
What is result test status: PLAINTEXT
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
3:http://localhost/grants/Solutions.txt
(time : 00:00:21)

Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
4:http://localhost/grants/Outline%20of%20my%20last%20presentation2.doc.doc
(time : 00:00:26)


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
5:http://localhost/grants/MidtermSolutions.doc.doc
(time : 00:00:31)



Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
6:http://localhost/grants/Debate.doc
(time : 00:00:36)



Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:
7:http://localhost/
(time : 00:00:41)

No link in temporary table

links found : 7
http://localhost/grants/
http://localhost/grants/test.xls
http://localhost/grants/Solutions.txt
http://localhost/grants/Outline of my last presentation2.doc.doc
http://localhost/grants/MidtermSolutions.doc.doc
http://localhost/grants/Debate.doc
http://localhost/
Optimizing tables...
Indexing complete !

Any advice please send it my way!
-Rich

Charter
04-11-2004, 01:49 PM
Hi. For the external binaries, there aren't any PDF files in your post, just Word and Excel files, and it looks like three Word documents and one Excel file were indexed. What happens when you try a search on a word in one of those DOC/XLS files?

Also, in this (http://www.phpdig.net/showthread.php?threadid=799) thread, I've added a comment and some extra code to echo more stuff. The comment shows where to change _PDF to either _MSWORD or _MSEXCEL in the posted code in order to echo stuff specific for those binaries.

greener_02445
04-11-2004, 04:06 PM
My apologies, Charter, for replying to your previous thread.
I have made some progress. I can get PhpDig to see and identify my files and read their titles, but not read their contents: there is no green check mark.
My config:
define('USE_IS_EXECUTABLE_COMMAND','0');
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','C:\\catdoc\\catdoc');
define('PHPDIG_OPTION_MSWORD','-s 8859-1');
define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','C:\\pdftotext\\pdftext');
define('PHPDIG_OPTION_PDF','-cork');
define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','C:\\catdoc\\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','');
define('PHPDIG_PDF_EXTENSION','.txt');

Here is the output when I try to index an Excel, a Word, and a PDF file:

Is result test http an array: 1
What is result test http status: MSEXCEL

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc
Does parse msword exist:

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv
Does parse msexcel exist:
3:http://localhost/testfiles/Book1.xls
(time : 00:00:21)

Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc
Does parse msword exist:

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv
Does parse msexcel exist:
4:http://localhost/testfiles/GFP.doc
(time : 00:00:26)

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext
Does parse pdf exist:

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc
Does parse msword exist:

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv
Does parse msexcel exist:
5:http://localhost/testfiles/GeneChips.pdf
(time : 00:00:31)
No link in temporary table

If anyone can tell me what I am missing please drop me a reply
-Rich
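
An aside on the path spelling used in the configs so far, offered as a minimal sketch rather than anything taken from PhpDig: in single-quoted PHP strings, 'C:\catdoc\catdoc' and 'C:\\catdoc\\catdoc' are the same string, so switching between them should not change what the spider sees. The variable names below are illustrative only.

<?php
// In single quotes, "\\" is an escaped backslash and "\c" is left untouched,
// so both spellings produce the same 16-character string.
$single  = 'C:\catdoc\catdoc';
$doubled = 'C:\\catdoc\\catdoc';
var_dump($single === $doubled); // bool(true)

// Double quotes are where Windows paths can go wrong: "\t" becomes a tab
// and "\n" becomes a newline.
$risky = "C:\temp\new";
var_dump(strpos($risky, "\t") !== false); // bool(true)
?>

Single quotes, or doubled backslashes inside double quotes, both avoid that trap.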

Charter
04-11-2004, 04:27 PM
Hi. All of the following are coming up blank, likely meaning false, so the external binary isn't applied to the file.

Does parse pdf exist:
Does parse msword exist:
Does parse msexcel exist:

Try the following script, and keep changing the $filename variable until you get a 'file exists' for each binary, and use those paths. If the paths that you are using are actually correct, the blank results may be coming from cache, so running the script below will also clear that.

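<?php
// Path to test; change it until "file exists" is printed.
$filename = "C:\\catdoc\\catdoc";
// Drop any cached result from an earlier file_exists() call.
clearstatcache();
if (file_exists($filename)) {
    echo "file exists";
} else {
    echo "try again";
}
?>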
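
A slightly extended variant of the check above, as a sketch only: it tries each path both as given and with .exe appended, on the assumption that the binaries carry that extension on disk even though the Windows shell lets you omit it. The function name find_binary and the candidate list are hypothetical and not part of PhpDig.

<?php
// Hypothetical helper (not part of PhpDig): return the first candidate path
// that is a regular file, trying each one as given and with ".exe" appended.
function find_binary($candidates)
{
    clearstatcache(); // drop cached file status results
    foreach ($candidates as $path) {
        foreach (array($path, $path . '.exe') as $try) {
            if (is_file($try)) {
                return $try;
            }
        }
    }
    return false; // echo prints false as an empty string, hence blank output
}

// Example paths taken from this thread; adjust them to your own install.
$found = find_binary(array('C:\\catdoc\\catdoc', 'C:\\catdoc\\xls2csv'));
var_export($found); // var_export shows false explicitly instead of a blank
echo "\n";
?>

Using var_export instead of echo makes a failed check visible, which is why the "Does parse ... exist:" lines earlier in the thread looked empty rather than saying false.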

greener_02445
04-11-2004, 05:25 PM
OK, now it seems to be reading things in.
I changed the locations in the config to:

define('PHPDIG_PARSE_MSWORD','C:\\catdoc\\catdoc.exe');
define('PHPDIG_PARSE_PDF','C:\\pdftotext\\pdftext.exe');
define('PHPDIG_PARSE_MSEXCEL','C:\\catdoc\\xls2csv.exe');

It seems to be reading them in and it's indexing, but I still don't get the green check mark. This is some of the output for the Excel and PDF files:

Is result test http an array: 1
What is result test http status: PDF

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext.exe
Does parse pdf exist: 1

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc.exe
Does parse msword exist: 1

Is result test an array: 1
What is result test status: PDF
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv.exe
Does parse msexcel exist: 1

Command is: C:\pdftotext\pdftext.exe -cork ../admin/temp/44266892.tmp
Result contains: Array ( )
Return value is: 1

3:http://localhost/testfiles/GeneChips.pdf
(time : 00:00:21)

Is result test http an array: 1
What is result test http status: MSEXCEL

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: C:\pdftotext\pdftext.exe
Does parse pdf exist: 1

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msword is set to: 1
Parse the msword is set to: C:\catdoc\catdoc.exe
Does parse msword exist: 1

Is result test an array: 1
What is result test status: MSEXCEL
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv.exe
Does parse msexcel exist: 1

Command is: C:\catdoc\xls2csv.exe ../admin/temp/72672312.tmp
Result contains: Array ( )
Return value is: 1

4:http://localhost/testfiles/Book1.xls
(time : 00:00:26)

No link in temporary table

greener_02445
04-11-2004, 06:15 PM
I just noticed you posted two responses. Thank you for getting back to me, and for the script. That worked great. The files seem to be getting read:
Use is executable is set to: 0
Index the msexcel is set to: 1
Parse the msexcel is set to: C:\catdoc\xls2csv.exe
Does parse msexcel exist: 1

Command is: C:\catdoc\xls2csv.exe ../admin/temp/72672312.tmp
Result contains: Array ( )
Return value is: 1
The path check now passes, and it looks like the programs are being run and the files are sent to ../admin/temp/72672312.tmp.
When I index, I still do not get the green check mark, and when
I search for terms in the documents I get nothing. It seems like I'm so close to getting it to run. What else could I be missing?
-Rich

Charter
04-11-2004, 08:27 PM
Hi. Try removing the .exe extension from the paths and check this (http://www.phpbuilder.com/lists/php-general/2003051/0640.php) page, and also search this (http://www.php.net/function.exec) page for IUSR and see if that fixes it.

greener_02445
04-12-2004, 09:40 AM
Thanks for the info. So is the problem Apache, or the catdoc and pdftotext programs? In either case, I changed the config (got rid of the .exe) and made sure the folders (catdoc/ and pdftotext/) were shared with read/write permissions. It is still not indexing. The links you sent, Charter, were very helpful, thank you. I am still having a bit of a problem when I go into services.msc to change my permissions: I don't see Apache listed. I am using EasyPHP. Does anyone know what service EasyPHP is listed as in Windows 2000 or XP? Again, thank you for reading, and please send any advice this way.

-Rich

Charter
04-13-2004, 08:33 PM
Command is: C:\catdoc\xls2csv.exe ../admin/temp/72672312.tmp
Result contains: Array ( )
Return value is: 1

Hi. The above output means that, when PhpDig tried to do the exec, an error occurred. I'm not familiar with EasyPHP, but perhaps the user comments on this (http://www.php.net/manual/en/ref.exec.php) page might help.
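
To see why the exec fails, one option is to capture the error text along with the exit status. The following is a minimal sketch, assuming the binary writes a diagnostic to stderr; the helper name run_and_report is hypothetical and not PhpDig code, and the command is the one from the output above.

<?php
// Hypothetical helper (not PhpDig code): run a command, collect its output
// lines and exit status, and redirect stderr into the captured output.
function run_and_report($command)
{
    $output = array();
    $status = -1;
    exec($command . ' 2>&1', $output, $status);
    return array($output, $status);
}

// Command copied from the thread; the .tmp name is whatever PhpDig generated.
list($lines, $status) = run_and_report('C:\\catdoc\\xls2csv.exe ../admin/temp/72672312.tmp');
echo "Return value is: $status\n";
print_r($lines); // any error message from the shell or binary shows up here
?>

A non-zero return value with a message such as a missing file or denied access would point at paths or permissions of the web server account rather than at PhpDig itself.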