
Index MSWORD But No search result


wessam
08-20-2004, 09:02 AM
Hi all,
I'm trying to index MSWORD files, but when I search for the content of such a file I get nothing.
My config file looks like this:
define('PHPDIG_INDEX_MSWORD',true);
define('PHPDIG_PARSE_MSWORD','c:\appserv\www\catdoc\catdoc');
define('PHPDIG_OPTION_MSWORD','');

define('PHPDIG_INDEX_PDF',true);
define('PHPDIG_PARSE_PDF','/usr/local/bin/pstotext');
define('PHPDIG_OPTION_PDF','-cork');

define('PHPDIG_INDEX_MSEXCEL',true);
define('PHPDIG_PARSE_MSEXCEL','c:\appserv\www\catdoc\xls2csv');
define('PHPDIG_OPTION_MSEXCEL','-s 8859-1');



//---------EXTERNAL TOOLS EXTENSIONS
// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt
define('PHPDIG_MSWORD_EXTENSION','');
define('PHPDIG_PDF_EXTENSION','');
define('PHPDIG_MSEXCEL_EXTENSION','');
define('PHPDIG_MSPOWERPOINT_EXTENSION','');

and I added this line of code to robot_functions.php:
$command = PHPDIG_PARSE_MSWORD.' '.PHPDIG_OPTION_MSWORD.' '.$tempfile2.' 2>&1';


When I try catdoc on the command line it works and shows my MSWORD document:
c:\Appserv\www\catdoc\catdoc w.doc

I checked this information (http://www.phpdig.net/showthread.php?threadid=799) but I still can't search my Word document files.
Please, any help?

Charter
08-20-2004, 09:26 AM
Did you try it with .exe added on to catdoc?

wessam
08-20-2004, 01:02 PM
Yes, and I got the same thing.

Charter
08-20-2004, 01:09 PM
Like this?

define('PHPDIG_PARSE_MSWORD','C:\\appserv\\www\\catdoc\\catdoc.exe');
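
A side note on the backslashes in these paths. The following is only a small sketch of how PHP string quoting treats them; the variable names are made up for illustration. In a single-quoted PHP string only `\\` and `\'` are escape sequences, so a path written with single or with doubled backslashes should end up as the same string, while double quotes can silently turn sequences such as `\t` into control characters.

```php
<?php
// In single-quoted strings PHP only treats \\ and \' as escapes;
// a backslash before any other character is kept literally.
$plain   = 'c:\appserv\www\catdoc\catdoc.exe';
$doubled = 'c:\\appserv\\www\\catdoc\\catdoc.exe';

var_dump($plain === $doubled); // bool(true): both spell the same path

// Double quotes are riskier for Windows paths: "\t" becomes a tab character.
var_dump(strlen('c:\temp')); // int(7)
var_dump(strlen("c:\temp")); // int(6), because \t collapsed into one character
```

So for a config value like PHPDIG_PARSE_MSWORD, single quotes with either one or two backslashes per separator are equivalent; the one thing to avoid is a single backslash directly before the closing quote.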

wessam
08-20-2004, 01:12 PM
Thanks for your fast answers.

And yes, I tried this one and also 'c:\appserv\........'

Charter
08-20-2004, 01:22 PM
Hi. Go back to this (http://www.phpdig.net/showthread.php?threadid=799) thread and add the code, and then reindex, and let me know what it says when it encounters the Word document.

wessam
08-20-2004, 01:32 PM
Hi.
This is the output:
--------------------------------------------------------------------------------
SITE : http://localhost/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:
1:http://localhost/test/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to: 1
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:
2:http://localhost/test/w.doc
(time : 00:00:15)

No link in temporary table

--------------------------------------------------------------------------------

links found : 2
http://localhost:10/test/
http://localhost:10/test/w.doc
Optimizing tables...
Indexing complete !
--------------------------------------------------------------------------------
[Back] to admin interface.

Charter
08-20-2004, 01:42 PM
Set the following and do another reindex:

define('PHPDIG_INDEX_PDF',false);

wessam
08-20-2004, 01:52 PM
Hi, I did, but I still can't search my Word document.
SITE : http://localhost/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:
1:http://localhost/test/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:
2:http://localhost/test/w.doc
(time : 00:00:15)

No link in temporary table

--------------------------------------------------------------------------------

links found : 2
http://localhost:10/test/
http://localhost:10/test/w.doc
Optimizing tables...
Indexing complete !

Charter
08-20-2004, 01:57 PM
Oh, you need to edit the code you added so that it is for Word documents, not for PDFs. For example...

// it can have _PDF or _MSWORD or _MSEXCEL depending on binary
$command = PHPDIG_PARSE_MSWORD.' '.PHPDIG_OPTION_MSWORD.' '.$tempfile2.' 2>&1';

wessam
08-20-2004, 02:03 PM
I'm sorry to bother you.
I did, but nothing new :((
SITE : http://localhost/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:
1:http://localhost/test/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:
2:http://localhost/test/w.doc
(time : 00:00:15)

No link in temporary table

Charter
08-20-2004, 02:18 PM
I mean throughout, including for these things...

// in the next four lines change _PDF to either _MSWORD or _MSEXCEL for those binaries
echo "Index the pdf is set to: " . PHPDIG_INDEX_PDF . "<br>";
echo "Parse the pdf is set to: " . PHPDIG_PARSE_PDF . "<br>";
echo "Does parse pdf exist: " . file_exists(PHPDIG_PARSE_PDF) . "<br>";
echo "Is parse pdf executable: " . is_executable(PHPDIG_PARSE_PDF) . "<br>";

It's still using _PDF because "/usr/local/bin/pstotext" is getting printed.
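
Rather than editing `_PDF` to `_MSWORD` by hand in each echo line, the four checks could be driven by the type name. This is only a sketch: `describe_parser()` is a hypothetical helper, not something PhpDig ships, and the two `define()` calls simply repeat values from the config posted above. It uses `var_export()` so that a false result prints as `false` instead of the empty string that `echo` produces (which is why the "Does parse pdf exist:" lines in the logs look blank).

```php
<?php
// Hypothetical helper (not part of PhpDig): builds the same four diagnostic
// lines for whichever parser type is named, e.g. 'PDF', 'MSWORD', 'MSEXCEL'.
function describe_parser(string $type): array
{
    $index = "PHPDIG_INDEX_$type";
    $parse = "PHPDIG_PARSE_$type";
    $path  = defined($parse) ? constant($parse) : '';

    return [
        "Index the $type is set to: "
            . (defined($index) ? var_export(constant($index), true) : 'undefined'),
        "Parse the $type is set to: " . $path,
        // var_export() prints true/false explicitly; echo would print '' for false
        "Does parse $type exist: " . var_export(file_exists($path), true),
        "Is parse $type executable: " . var_export(is_executable($path), true),
    ];
}

// Example values copied from the config in this thread.
define('PHPDIG_INDEX_MSWORD', true);
define('PHPDIG_PARSE_MSWORD', 'c:\appserv\www\catdoc\catdoc');

echo implode("<br>\n", describe_parser('MSWORD')), "<br>\n";
```

Calling `describe_parser('PDF')` or `describe_parser('MSEXCEL')` would report on those binaries instead, so the wrong constant cannot be left behind in one of the four lines.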

wessam
08-20-2004, 02:19 PM
Hi, this is what I get now:

SITE : http://localhost/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:
1:http://localhost/test/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:

Command is: c:\appserv\www\catdoc\catdoc.exe -s 8859-1 ../admin/temp/75689462.tmp 2>&1
Result contains: Array ( [0] => The system cannot execute the specified program. )
Return value is: 1

2:http://localhost/test/w.doc
(time : 00:00:16)

No link in temporary table

wessam
08-20-2004, 02:25 PM
After that I removed the .exe from the path and got:
SITE : http://localhost/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:
1:http://localhost/test/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the pdf is set to:
Parse the pdf is set to: /usr/local/bin/pstotext
Does parse pdf exist:
Is parse pdf executable:
2:http://localhost/test/w.doc
(time : 00:00:15)

No link in temporary table

--------------------------------------------------------------------------------

links found : 2
http://localhost:10/test/
http://localhost:10/test/w.doc
Optimizing tables...
Indexing complete !

Charter
08-20-2004, 02:31 PM
Why is "Parse the pdf is set to: /usr/local/bin/pstotext" still printing?

It should be the following code...

// in the next four lines change _PDF to either _MSWORD or _MSEXCEL for those binaries
echo "Index the doc is set to: " . PHPDIG_INDEX_MSWORD . "<br>";
echo "Parse the doc is set to: " . PHPDIG_PARSE_MSWORD . "<br>";
echo "Does parse doc exist: " . file_exists(PHPDIG_PARSE_MSWORD) . "<br>";
echo "Is parse doc executable: " . is_executable(PHPDIG_PARSE_MSWORD) . "<br>";

Try that and also keep the following:

define('PHPDIG_OPTION_MSWORD',''); // two single quotes, no space between

wessam
08-20-2004, 02:42 PM
Hi :)
Now I got this ....
--------------------------------------------------------------------------------
SITE : http://localhost/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the doc is set to: 1
Parse the doc is set to: c:\appserv\www\catdoc\catdoc
Does parse doc exist:
Is parse doc executable:
1:http://localhost/test/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the doc is set to: 1
Parse the doc is set to: c:\appserv\www\catdoc\catdoc
Does parse doc exist:
Is parse doc executable:
2:http://localhost/test/w.doc
(time : 00:00:15)

No link in temporary table

--------------------------------------------------------------------------------

links found : 2
http://localhost:10/test/
http://localhost:10/test/w.doc
Optimizing tables...
Indexing complete !

Charter
08-20-2004, 02:48 PM
Okay, now use this:

define('PHPDIG_PARSE_MSWORD','C:\\appserv\\www\\catdoc\\catdoc.exe');

and stick this back in:

// it can have _PDF or _MSWORD or _MSEXCEL depending on binary
$command = PHPDIG_PARSE_MSWORD.' '.PHPDIG_OPTION_MSWORD.' '.$tempfile2.' 2>&1';

and keep the code that is there, and reindex.
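
For readers without access to the linked thread, the "Command is / Result contains / Return value is" lines in the logs can be produced with something along these lines. It is a minimal sketch, assuming PHP's `exec()` is what runs the converter; `run_parser()` is a hypothetical name and PhpDig's actual code may differ.

```php
<?php
// Sketch only: run an external converter and report what happened.
// run_parser() is a hypothetical helper, not PhpDig's own function.
function run_parser(string $binary, string $options, string $file): array
{
    // array_filter() drops an empty option string, so no stray double space
    // ends up in the command when the options constant is ''.
    $parts   = array_filter([$binary, $options, escapeshellarg($file)], 'strlen');
    $command = implode(' ', $parts) . ' 2>&1'; // 2>&1 also captures error messages

    $output = [];
    $status = 0;
    exec($command, $output, $status); // $output: printed lines, $status: exit code

    return ['command' => $command, 'output' => $output, 'status' => $status];
}

// Values taken from this thread; on a machine without catdoc the status
// will simply be non-zero and the output will hold the shell's error text.
$result = run_parser('C:\appserv\www\catdoc\catdoc.exe', '', '../admin/temp/w.doc');

echo 'Command is: ', $result['command'], "\n";
echo 'Result contains: ', print_r($result['output'], true), "\n";
echo 'Return value is: ', $result['status'], "\n";
```

A return value of 0 with converted text in the output would mean the binary ran; anything else points at the command itself rather than at the indexer.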

wessam
08-20-2004, 03:04 PM
Hi ..
I got this:

--------------------------------------------------------------------------------
SITE : http://localhost/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the doc is set to: 1
Parse the doc is set to: C:\appserv\www\catdoc\catdoc.exe
Does parse doc exist: 1
Is parse doc executable: 1
1:http://localhost/test/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the doc is set to: 1
Parse the doc is set to: C:\appserv\www\catdoc\catdoc.exe
Does parse doc exist: 1
Is parse doc executable: 1

Command is: C:\appserv\www\catdoc\catdoc.exe ../admin/temp/19578262.tmp 2>&1
Result contains: Array ( [0] => The system cannot execute the specified program. )
Return value is: 1

2:http://localhost/test/w.doc
(time : 00:00:15)

No link in temporary table

--------------------------------------------------------------------------------

links found : 2
http://localhost:10/test/
http://localhost:10/test/w.doc
Optimizing tables...
Indexing complete !

Charter
08-20-2004, 03:10 PM
Okay, now go check your PHP info page and see if you are in safe mode.

<?php
phpinfo();
?>

Search the PHP info page for safe_mode and see if it says on or off.
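
The same check can be scripted instead of read off the phpinfo() page. The snippet below is a sketch, not part of PhpDig; note that safe_mode was removed in PHP 5.4, so on newer versions `ini_get('safe_mode')` returns false and the setting counts as off. It also looks at `disable_functions`, which is an additional assumption about what might block `exec()`.

```php
<?php
// ini_get() returns the configured value as a string, or false if the
// directive does not exist (as with safe_mode on PHP 5.4 and later).
$raw        = ini_get('safe_mode');
$safeModeOn = in_array(strtolower((string) $raw), ['1', 'on', 'true', 'yes'], true);
echo 'safe_mode is ', $safeModeOn ? 'on' : 'off', "\n";

// exec() can also be switched off through the disable_functions directive.
$disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
echo 'exec() is ', in_array('exec', $disabled, true) ? 'disabled' : 'not disabled', "\n";
```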

wessam
08-20-2004, 03:15 PM
Safe_mode off

Charter
08-20-2004, 03:48 PM
How did you FTP catdoc.exe to your server? It is a binary file and needs to be FTPed in binary mode, not ASCII. Go and FTP catdoc.exe again, in binary mode, to be sure.

wessam
08-20-2004, 04:02 PM
Sorry, but how can I FTP catdoc?
I have the catdoc.exe file, downloaded with the catdoc folder.

Charter
08-20-2004, 04:11 PM
Oh wait, you can execute catdoc from command line, right? If so, forget about the FTP in binary thing.

Maybe there is a local versus network drive issue? What happens if you copy catdoc.exe and stick it somewhere else? Remember to change PHPDIG_PARSE_MSWORD in the config so that it has the new location.

Also, is catdoc.exe set to rwxr-xr-x permission?

wessam
08-20-2004, 04:24 PM
Yes, I can run it from the command line, and I can see my Word document when I run it from there.

I tried copying it to another place and reindexed after changing the PHPDIG_PARSE_MSWORD path, but I still get the same error:
SITE : http://localhost/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the doc is set to: 1
Parse the doc is set to: C:\appserv\www\phpdig\catdoc\catdoc.exe
Does parse doc exist: 1
Is parse doc executable: 1
1:http://localhost/test/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the doc is set to: 1
Parse the doc is set to: C:\appserv\www\phpdig\catdoc\catdoc.exe
Does parse doc exist: 1
Is parse doc executable: 1

Command is: C:\appserv\www\phpdig\catdoc\catdoc.exe ../admin/temp/51727552.tmp 2>&1
Result contains: Array ( [0] => The system cannot execute the specified program. )
Return value is: 1

2:http://localhost/test/w.doc
(time : 00:00:15)

No link in temporary table

--------------------------------------------------------------------------------

links found : 2
http://localhost:10/test/
http://localhost:10/test/w.doc
Optimizing tables...
Indexing complete !

And the catdoc folder has read and write permission.

Charter
08-20-2004, 04:31 PM
Okay, try this. In robot_functions.php find:

|| $result_test['status'] == 'MSWORD' && PHPDIG_INDEX_MSWORD == true && file_exists(PHPDIG_PARSE_MSWORD) && $is_exec_command_msword

and change it to:

|| $result_test['status'] == 'MSWORD' && PHPDIG_INDEX_MSWORD == true && $is_exec_command_msword

and also change:

define('PHPDIG_PARSE_MSWORD','C:\\appserv\\www\\catdoc\\catdoc.exe');

back to:

define('PHPDIG_PARSE_MSWORD','C:\\appserv\\www\\catdoc\\catdoc');

Work now?
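
Why dropping `file_exists()` matters can be seen in a reduced version of that condition. This is a sketch under assumptions: the variable names and the leading PDF clause are invented stand-ins for the real chain in robot_functions.php. Because `&&` binds tighter than `||`, each `|| ... && ...` clause is evaluated as its own group, and a false `file_exists()` result (as the logs show for the path without `.exe`) switches off the whole MSWORD clause.

```php
<?php
// Stand-in values: the document is a Word file, indexing is enabled,
// but file_exists() reported false for the configured binary path.
$status      = 'MSWORD';
$indexWord   = true;
$binaryFound = false; // what file_exists(PHPDIG_PARSE_MSWORD) returned
$canExec     = true;  // stand-in for $is_exec_command_msword

// && has higher precedence than ||, so this reads as
// (status is PDF) || (status is MSWORD && ... && ...)
$withCheck    = $status == 'PDF' || $status == 'MSWORD' && $indexWord && $binaryFound && $canExec;
$withoutCheck = $status == 'PDF' || $status == 'MSWORD' && $indexWord && $canExec;

var_dump($withCheck);    // bool(false): the Word branch is skipped
var_dump($withoutCheck); // bool(true): the Word branch runs
```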

wessam
08-20-2004, 04:47 PM
I got this:
SITE : http://localhost/
Exclude paths :
- @NONE@


Is result test http an array: 1
What is result test http status: HTML

Is result test an array: 1
What is result test status: HTML
Use is executable is set to: 0
Index the doc is set to: 1
Parse the doc is set to: C:\appserv\www\catdoc\catdoc
Does parse doc exist:
Is parse doc executable:
1:http://localhost/test/
(time : 00:00:05)
+
level 1...


Is result test http an array: 1
What is result test http status: MSWORD

Is result test an array: 1
What is result test status: MSWORD
Use is executable is set to: 0
Index the doc is set to: 1
Parse the doc is set to: C:\appserv\www\catdoc\catdoc
Does parse doc exist:
Is parse doc executable:

Command is: C:\appserv\www\catdoc\catdoc ../admin/temp/48711482.tmp 2>&1
Result contains: Array ( [0] => The system cannot execute the specified program. )
Return value is: 1

2:http://localhost/test/w.doc
(time : 00:00:15)

No link in temporary table

--------------------------------------------------------------------------------

links found : 2
http://localhost:10/test/
http://localhost:10/test/w.doc
Optimizing tables...
Indexing complete !

Charter
08-20-2004, 04:53 PM
Try copying the catdoc.exe file to the PhpDig includes directory and change PHPDIG_PARSE_MSWORD.

Also try the following from shell (copy w.doc to the PhpDig admin/temp directory):

C:\appserv\www\catdoc\catdoc ../admin/temp/w.doc 2>&1

Try changing this command until you get something that works from shell.

wessam
08-20-2004, 05:10 PM
c:\appserv\www\phpdig\catdoc\catdoc ../admin/temp/67513632.tmp 2>&1

It gives me:

catdoc: no such file or dirctory

Charter
08-20-2004, 05:29 PM
>> when im try catdoc in command line its work and got my MSWORD
c:\Appserv\www\catdoc\catdoc w.doc

There was no 'phpdig' in the path before.

Also, you need to copy w.doc over to the PhpDig admin/temp directory and use w.doc in the command:

C:\appserv\www\catdoc\catdoc ../admin/temp/w.doc 2>&1

Call this command from the PhpDig admin directory, as that is where the spider.php file resides.
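
The reason the directory matters is that `../admin/temp/w.doc` is a relative path, resolved against whatever the current working directory is. One way to take the guesswork out, sketched below, is to turn it into an absolute path before building the command. `absolute_temp_path()` is a hypothetical helper, and the demo uses the system temp directory as a stand-in for the PhpDig admin directory.

```php
<?php
// Hypothetical helper: resolve a path relative to a known directory so the
// resulting command no longer depends on the current working directory.
function absolute_temp_path(string $baseDir, string $relative): string
{
    $resolved = realpath($baseDir . DIRECTORY_SEPARATOR . $relative);
    if ($resolved === false) {
        // realpath() returns false when the file does not exist
        throw new RuntimeException("File not found: $relative (relative to $baseDir)");
    }
    return $resolved;
}

// Demo with a throwaway file; in PhpDig the base would be the admin directory.
$demoFile = tempnam(sys_get_temp_dir(), 'doc');
echo 'Working directory: ', getcwd(), "\n";
echo 'Absolute path:     ', absolute_temp_path(dirname($demoFile), basename($demoFile)), "\n";
unlink($demoFile);
```

If the helper throws, the file is not where the command expects it, which matches the "no such file" message catdoc prints when it is run from the wrong directory.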

wessam
08-22-2004, 03:29 PM
Hi all,
I still can't search the content of MSWORD files; catdoc didn't create a .txt file for these documents.
I tried it on a Solaris machine and it works, but I can't get it to work on a Windows server machine. Please help me.