API function indexpage(URL, words)


renehaentjens
08-25-2004, 05:17 AM
I've been away from PhpDig and from this forum for a while. Today I've started again, and I installed a fresh 1.8.3 from scratch.

The site that I have to index is on my own PC, together with PhpDig. It's a dynamic site, with PHP script that I write myself. The script works with a database which defines the links and the indexable words that have to appear in the generated HTML pages.

So here's my question, a "How to" question, or, if not currently possible, a mod request: Can I shortcut the spider, is there an API function that I can call from my script, telling PhpDig: for URL such-and-so, please put this list of words in your tables?
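To make the request concrete, here is a minimal sketch of what such a call could do: take a URL and a word list and write them straight into index tables, with no spidering. Everything in it is an assumption for illustration. The function name index_page, the tables pages, keywords and refs, and the use of SQLite are hypothetical and are not PhpDig's actual schema or API.

```python
import sqlite3

def make_schema(conn):
    # Hypothetical three-table layout: pages, keywords, and the links between them.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS pages    (id INTEGER PRIMARY KEY, url  TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS refs     (page_id INTEGER, key_id INTEGER, weight INTEGER);
    """)

def index_page(conn, url, words):
    """Store `words` for `url`, replacing whatever was indexed for that URL before."""
    cur = conn.cursor()
    cur.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    page_id = cur.execute("SELECT id FROM pages WHERE url = ?", (url,)).fetchone()[0]

    # Drop old references so that re-indexing a changed page does not duplicate rows.
    cur.execute("DELETE FROM refs WHERE page_id = ?", (page_id,))

    # Count occurrences per lower-cased word; the count is stored as the weight.
    counts = {}
    for word in words:
        word = word.lower()
        counts[word] = counts.get(word, 0) + 1

    for word, weight in counts.items():
        cur.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        key_id = cur.execute("SELECT id FROM keywords WHERE word = ?", (word,)).fetchone()[0]
        cur.execute("INSERT INTO refs (page_id, key_id, weight) VALUES (?, ?, ?)",
                    (page_id, key_id, weight))
    conn.commit()
    return page_id
```

A script that generates a page from the database could call index_page(conn, url, words) at the moment it knows the page's words, which is the "direct channel" being asked for.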

And a question on the side: I have a new PC with lots of memory (1 GB) and yet it takes the spider 30 minutes to index 90 relatively short pages, even after I've commented out the "sleep(5)". Are there other admins in similar situations who have comparable experiences? (After spidering there is 1 site with 90 pages, 2050 keywords and 8880 references in engine, so really peanuts! With the browser, I can visit all the pages in about 1 minute...)

Charter
08-25-2004, 08:44 AM
For 1: http://www.phpdig.net/forum/showthread.php?t=454

For 2: my guess is that some servers may not like the way I dealt with chunk encoding

renehaentjens
08-26-2004, 01:47 AM
Thanks, Head Mole!

1. Topic 454 is not quite about an API call; to me it looks like it makes spidering even slightly more complex. I am looking for a more direct channel for feeding the database with URL+keywords...

2. Can you be more specific? What is chunk encoding and where and how do you deal with it? I found the term in a couple of earlier notes and in the 1.8.3 CHANGELOG but without further explanations.

Charter
08-26-2004, 07:33 AM
1. It doesn't exist like you want so it's a mod request.

2. Chunked encoding is when content is sent as byte count, content, byte count, content, etcetera.
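As an illustration of that "byte count, content" pattern, here is a small sketch that decodes an HTTP/1.1 chunked body. It is not PhpDig's code; the function name decode_chunked is made up for this example, and it assumes the whole body is already in memory and ignores any trailer headers after the last chunk.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble a chunked HTTP body: hex size line, CRLF, data, CRLF, repeated."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry ";extension" parameters.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # a zero-sized chunk marks the end of the body
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and the CRLF that follows it
    return bytes(out)
```

A client that reads such a response without decoding it ends up with the hexadecimal size lines mixed into the page text.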

renehaentjens
09-02-2004, 12:19 AM
I've started to implement the requested API myself. If I can make it work with a reasonable amount of effort, I'll report back here.

renehaentjens
09-08-2004, 03:52 AM
See http://www.phpdig.net/forum/showthread.php?p=5644

renehaentjens
09-08-2004, 10:58 PM
The reason why it took the spider 30 minutes to index 90 relatively short pages is probably not chunk encoding. I have now discovered that Apache wrote about 10 MB of access log and 9 MB of error log during that half hour. The spider is making zillions of requests for "funny" kinds of URLs the whole time (see below).

Any idea what happened, Charter?

157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl092&thumb=pptsl092_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl093&thumb=pptsl093_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg">/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg"> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001_t.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001_t.jpg"%20/> HTTP/1.1" 404 0
...
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl008&thumb=pptsl008_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "GET /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl008&thumb=pptsl008_t.jpg HTTP/1.1" 200 8087
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/css/default.css HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/auth/courses.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/auth/profile.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/claroline/document/document.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/claroline/scorm/scormdocument.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:48 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg">/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:48 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg"> HTTP/1.1" 404 0
...
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "POST /dokeos/VELODLA/183phpdig/admin/spider.php HTTP/1.1" 200 48484
157.193.197.26 - - [25/Aug/2004:12:01:58 +0200] "GET /dokeos/VELODLA/183phpdig/admin/index.php HTTP/1.1" 200 4427

Error log:
[Wed Aug 25 11:31:32 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/robots.txt
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg">/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg">
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001_t.jpg"/>/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001_t.jpg"/>
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg"/>/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg"/>
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl002.jpg">/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl002.jpg">
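With a 10 MB access log, a tally by request method and status code shows quickly how much of the traffic is made up of failed HEAD requests. The sketch below assumes the log is in the format of the excerpt above; the names LINE_RE and summarize are made up for this example. The request field is matched greedily because the "funny" URLs themselves contain double quotes.

```python
import re
from collections import Counter

# host ident user [timestamp] "request" status size
# The greedy (.*) lets the request itself contain double quotes.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\S+)$')

def summarize(lines):
    """Count log lines per (method, status) pair, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        request, status = match.group(3), match.group(4)
        method = request.split(" ", 1)[0]
        counts[(method, status)] += 1
    return counts
```

Called as summarize(open("access.log")), where the file name is only an example, it returns a Counter whose most_common() list puts the dominant kind of request at the top.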

Charter
09-11-2004, 06:29 AM
It could be JavaScript. Back whenever, spaces and parentheses were allowed. See this (http://www.phpdig.net/forum/showthread.php?p=2141#post2141) post for where to remove such characters, but note that the post is no longer slashed correctly due to the vB upgrade.
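The 404 entries in the log all contain quotes or angle brackets, which suggests that fragments of markup were picked up as links. As a rough sketch of the kind of filtering meant here (not the code from the linked post, and not PhpDig's own link handling), a spider could discard link candidates containing such characters before requesting them. The names SUSPECT, plausible_link and filter_links are made up for this example, and the character set is an assumption that would need tuning for a real site.

```python
import re

# Characters that should not appear unescaped in a link taken from an href:
# quotes, angle brackets, and whitespace.
SUSPECT = re.compile(r'["\'<>\s]')

def plausible_link(candidate: str) -> bool:
    """Return True if the candidate is non-empty and has no suspect characters."""
    return bool(candidate) and SUSPECT.search(candidate) is None

def filter_links(candidates):
    """Keep only the candidates that look like real links, preserving order."""
    return [c for c in candidates if plausible_link(c)]
```

Applied to the candidates behind the log above, this keeps index.php?sid=pptsl092&thumb=pptsl092_t.jpg and drops "pptsl001.jpg"> and "pptsl001_t.jpg" />, so neither would trigger a HEAD request.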