PhpDig.net (http://www.phpdig.net/forum/index.php)
-   Mod Requests (http://www.phpdig.net/forum/forumdisplay.php?f=23)
-   -   API function indexpage(URL, words) (http://www.phpdig.net/forum/showthread.php?t=1214)

renehaentjens 08-25-2004 05:17 AM

API function indexpage(URL, words)
 
I've been away from PhpDig and from this forum for a while. Today I've started again, and I installed a fresh 1.8.3 from scratch.

The site that I have to index is on my own PC, together with PhpDig. It's a dynamic site, driven by PHP scripts that I write myself. The scripts work with a database that defines the links and the indexable words that have to appear in the generated HTML pages.

So here's my question, a "How to" question, or, if not currently possible, a mod request: Can I shortcut the spider, is there an API function that I can call from my script, telling PhpDig: for URL such-and-so, please put this list of words in your tables?

And a question on the side: I have a new PC with lots of memory (1 GB) and yet it takes the spider 30 minutes to index 90 relatively short pages, even after I've commented out the "sleep(5)". Are there other admins in similar situations who have comparable experiences? (After spidering there is 1 site with 90 pages, 2050 keywords and 8880 references in engine, so really peanuts! With the browser, I can visit all the pages in about 1 minute...)
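
To make the first question concrete, here is a minimal sketch of what a direct "index this URL with these words" call could look like. It is not PhpDig code and does not use PhpDig's tables: the function name `index_page`, the helper `make_schema`, and the three-table SQLite layout (`pages`, `keywords`, `refs`) are all assumptions made purely for illustration.

```python
import sqlite3


def make_schema(conn):
    """Create a hypothetical schema; these are NOT PhpDig's real tables."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS refs (page_id INTEGER, key_id INTEGER, weight INTEGER);
    """)


def index_page(conn, url, words):
    """Store `words` as the keywords of `url`, replacing any earlier entry."""
    cur = conn.cursor()
    cur.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    page_id = cur.execute(
        "SELECT id FROM pages WHERE url = ?", (url,)).fetchone()[0]

    # Drop old references so that re-indexing a page does not duplicate them.
    cur.execute("DELETE FROM refs WHERE page_id = ?", (page_id,))

    # Count occurrences per lower-cased word; the count is stored as a weight.
    counts = {}
    for word in words:
        word = word.lower()
        counts[word] = counts.get(word, 0) + 1

    for word, weight in counts.items():
        cur.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        key_id = cur.execute(
            "SELECT id FROM keywords WHERE word = ?", (word,)).fetchone()[0]
        cur.execute(
            "INSERT INTO refs (page_id, key_id, weight) VALUES (?, ?, ?)",
            (page_id, key_id, weight))

    conn.commit()
    return page_id
```

A script that generates a page could then call `index_page(conn, url, words)` right after writing the page, without any HTTP requests at all. A real mod would have to map this onto whatever tables the spider itself fills.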

Charter 08-25-2004 08:44 AM

For 1: http://www.phpdig.net/forum/showthread.php?t=454

For 2: my guess is that some servers may not like the way I dealt with chunk encoding

renehaentjens 08-26-2004 01:47 AM

Thanks, Head Mole!

1. Topic 454 is not quite about an API call; it looks to me like it makes spidering even slightly more complex. I am searching for a more direct channel for feeding the database with URL+keywords...

2. Can you be more specific? What is chunk encoding and where and how do you deal with it? I found the term in a couple of earlier notes and in the 1.8.3 CHANGELOG but without further explanations.

Charter 08-26-2004 07:33 AM

1. It doesn't exist in the form you want, so it's a mod request.

2. Chunk encoding is when content is sent as size in bytes, content, size in bytes, content, etcetera.
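
As an illustration of the "size, content, size, content" layout, here is a small sketch of a decoder for an HTTP/1.1 chunked body. It is not how PhpDig handles chunks; the function name `decode_chunked` is made up, and the sketch assumes the whole body is already in memory and ignores any trailer headers after the last chunk.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble a chunked HTTP/1.1 body that is fully held in memory."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk starts with a line giving its size in hexadecimal.
        eol = body.index(b"\r\n", pos)
        size_field = body[pos:eol].split(b";", 1)[0]  # drop chunk extensions
        size = int(size_field.decode("ascii").strip(), 16)
        pos = eol + 2
        if size == 0:
            break  # a zero-size chunk marks the end of the content
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)
```

For example, the body `4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n` carries two chunks of 4 and 5 bytes followed by the terminating zero-size chunk.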

renehaentjens 09-02-2004 12:19 AM

I've started to implement the requested API myself. If I can make it work with a reasonable amount of effort, I'll report back here.

renehaentjens 09-08-2004 03:52 AM

See http://www.phpdig.net/forum/showthread.php?p=5644

renehaentjens 09-08-2004 10:58 PM

The reason it took the spider 30 minutes to index 90 relatively short pages is probably not chunk encoding. I have now discovered that Apache wrote about 10 MB of access log and 9 MB of error log during that half hour. The whole time, the spider was making zillions of requests for "funny" kinds of URLs (see below).

Any idea what happened, Charter?

Code:

157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl092&thumb=pptsl092_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl093&thumb=pptsl093_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg">/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg"> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001_t.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:31:51 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001_t.jpg"%20/> HTTP/1.1" 404 0
...
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl008&thumb=pptsl008_t.jpg HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "GET /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/index.php?sid=pptsl008&thumb=pptsl008_t.jpg HTTP/1.1" 200 8087
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/css/default.css HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/auth/courses.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/claroline/auth/profile.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:46 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/VELODLA/index.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/claroline/document/document.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:47 +0200] "HEAD /dokeos/claroline/scorm/scormdocument.php HTTP/1.1" 200 0
157.193.197.26 - - [25/Aug/2004:11:33:48 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg">/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:11:33:48 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl001.jpg"> HTTP/1.1" 404 0
...
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/>/ HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "HEAD /dokeos/VELODLA/scorm/Lectures/Algal_blooms_1of2/"pptsl093.jpg"%20/> HTTP/1.1" 404 0
157.193.197.26 - - [25/Aug/2004:12:00:43 +0200] "POST /dokeos/VELODLA/183phpdig/admin/spider.php HTTP/1.1" 200 48484
157.193.197.26 - - [25/Aug/2004:12:01:58 +0200] "GET /dokeos/VELODLA/183phpdig/admin/index.php HTTP/1.1" 200 4427

Error log:
[Wed Aug 25 11:31:32 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/robots.txt
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg">/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg">
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001_t.jpg"/>/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001_t.jpg"/>
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg"/>/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl001.jpg"/>
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl002.jpg">/
[Wed Aug 25 11:31:51 2004] [error] [client 157.193.197.26] File does not exist: h:/easyphp1-7/www/dokeos/velodla/scorm/lectures/algal_blooms_1of2/"pptsl002.jpg">
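
To get a feel for how much of the traffic is wasted on such requests, the access log lines above can be tallied by method and status code. The following is only a sketch: the names `LINE` and `tally` are made up, and the pattern assumes the log format shown in the excerpt, where the request line may itself contain quote characters.

```python
import re
from collections import Counter

# Greedy ".*" lets the path contain embedded quotes, as in the 404 lines above.
LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>.*) HTTP/[\d.]+" (?P<status>\d{3}) \S+$')


def tally(lines):
    """Count access log lines per (method, status) pair."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line.strip())
        if match:
            counts[(match["method"], match["status"])] += 1
    return counts
```

Run over the whole log, for instance with `tally(open("access.log"))` where `access.log` is a placeholder path, a large `("HEAD", "404")` count compared with `("GET", "200")` would confirm that most of the half hour went into probing malformed URLs.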


Charter 09-11-2004 06:29 AM

It could be JavaScript. Back whenever, spaces and parentheses were allowed. See this post for where to remove such characters, but note that the post is no longer slashed correctly due to the vB upgrade.
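
The idea of disallowing such characters can be sketched as a simple check on each link candidate before it is requested. This is not PhpDig's own filter: the names `SUSPECT` and `plausible_link` and the exact character set (quotes, angle brackets, whitespace, parentheses) are assumptions chosen to match the malformed URLs in the log above.

```python
import re

# Characters that suggest a fragment of markup or script was taken for a URL.
SUSPECT = re.compile(r'["\'<>\s()]')


def plausible_link(href: str) -> bool:
    """Return True if `href` is non-empty and has no suspect characters."""
    return bool(href) and SUSPECT.search(href) is None
```

With this check, `index.php?sid=pptsl092&thumb=pptsl092_t.jpg` passes, while a candidate such as `"pptsl001.jpg">/` is rejected before any HEAD request is made.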


All times are GMT -8.

Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.