Search PDF files


chazter
09-29-2003, 12:32 PM
I'm a newbie at this and I am following the instructions in the documentation, but there is one part that I am not clear on:
3.3. File types which can be indexed
PhpDig indexes HTML and text files by itself. PhpDig could index PDF, MS-Word and MS-Excel files if you install external binaries on the spidering machines to this purpose.

I don't understand the sentence "PhpDig could index PDF, MS-Word and MS-Excel files if you install external binaries on the spidering machines to this purpose."

I have access to my files when I FTP to the directory that my web host gives me, but as for adding external binaries, I am not sure how.

All my PDF files are in a specific directory, but how does somebody search a particular PDF file?

If anyone can give me clarification or instructions on how to do this, I would really appreciate it.

Thanks in advance.

Charter
09-29-2003, 05:54 PM
Hi. External binaries are certain programs that your host may, or may not, have to convert PDF/DOC/XLS files to text files.

Here is a short list of such external binaries and their uses:


name        purpose
-----------------------------------
catdoc      convert DOC to TXT
pstotext    convert PS/PDF to TXT
pdftotext   convert PDF to TXT
xls2csv     convert XLS to CSV

If you know, or can find, the path to such external binaries on your host, then just use that path in the appropriate definition in the config file.

If your host doesn't have such external binaries, or you cannot find the path, then you could FTP them to one of your directories, and then use that path in the appropriate definition in the config file.
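If you have shell access and Python on the host (an assumption, since many shared hosts offer neither), a minimal sketch like the following could report where each binary lives. The function name `find_converters` is made up for illustration; `shutil.which` only searches the directories on the PATH, so a binary installed elsewhere would still show as not found.

```python
import shutil

# Converter names taken from the list above.
CONVERTERS = ("catdoc", "pstotext", "pdftotext", "xls2csv")


def find_converters(names=CONVERTERS):
    """Map each converter name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Print one line per converter so the paths can be copied into the config file.
    for name, path in find_converters().items():
        print(f"{name:<10} {path or 'not found on PATH'}")
```

Any path it prints is a candidate for the matching definition in the config file; a "not found" line means you would need to ask the host or upload the binary yourself.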

Depending on the type of output that the external binaries produce, you may find this thread (http://www.phpdig.net/showthread.php?threadid=68) useful. Also, this thread (http://www.phpdig.net/showthread.php?threadid=95) may be useful.

chazter
09-29-2003, 07:08 PM
Thanks for the reply. A couple of follow-up questions:

1. I am having a hard time contacting and getting answers from my ISP. Where do I get the binary "pdftotext"?

2. Once I get it, what do I do with it? Do I create a directory called PDFTOTEXT in my website root directory and put the file there?

3. Once I put it there, do I run anything? I assume I would have to edit my config file to point to that path.

Sorry for asking these questions if they seem obvious.

Thanks again in advance.

Charter
10-01-2003, 07:44 PM
1. To download the binary pdftotext, just find the build you need (the one that matches your host's operating system) from Google (http://www.google.com/search?q=pdftotext).

2. You can place the binary pdftotext file wherever you'd like.

3. If you download the binary pdftotext, then it's ready to use, so just put the path to it in the config file.
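Before putting the path in the config file, it may be worth confirming that the binary actually runs, since a file uploaded by FTP can lose its execute permission. The sketch below is one way to check, assuming Python is available on the host; the function names are hypothetical, and it relies on pdftotext accepting `-` as the output file to mean standard output.

```python
import os
import subprocess
import sys


def is_usable_binary(path):
    """True if path is an existing file with the execute permission set."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def pdftotext_command(binary, pdf_path):
    """Build the command line; the trailing "-" sends the text to stdout."""
    return [binary, pdf_path, "-"]


def extract_text(binary, pdf_path):
    """Run the converter on one PDF and return the extracted text."""
    if not is_usable_binary(binary):
        raise FileNotFoundError(f"{binary} is missing or not executable")
    result = subprocess.run(
        pdftotext_command(binary, pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the converter fails
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_pdf.py /path/to/pdftotext /path/to/sample.pdf
    print(extract_text(sys.argv[1], sys.argv[2])[:500])
```

If the first few hundred characters of the PDF print as readable text, the same path should be safe to use in the config file.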

This thread (http://www.phpdig.net/showthread.php?threadid=68) may also be useful.

chazter
10-02-2003, 07:47 AM
Thanks Charter,

I talked to my ISP and found out that they had the external binaries installed. I did configure the path as you suggested and the links were helpful too.

Thanks for the other suggestion regarding Google as others may encounter the problem in the future.

Have a great day.