Indexing problems


ashyra
09-08-2005, 05:42 PM
I've got similar problems. I'm trying to index several sites that each have literally thousands of links.

One in particular I was able to index with a different search script (which I didn't like nearly as much), and it picked up over 20,000 links before I stopped it, having realized I liked this one better.

Anyway, the most I can get out of that same site with PhpDig is 357 pages.

ashyra
09-09-2005, 06:09 AM
Last night I manually submitted 4445 pages to PhpDig (containing 90,000+ links in total).

The script placed all 4445 links in tempspider, but then deleted them one by one without indexing them. First it would change the value in "index" to '1', then it would delete the page altogether (in tempspider).

What am I doing wrong?

Charter
09-09-2005, 06:15 AM
Try setting LIMIT_TO_DIRECTORY to false and PHPDIG_IN_DOMAIN to true (both in the config file) and then, from the admin panel, use a large search depth, set links per to zero, and use the no option.
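
For reference, the two config changes might look roughly like the fragment below. This is only a sketch: it assumes both settings are ordinary define() constants in the config file, and the comments are my reading of the option names rather than documented behavior.

```php
<?php
// Sketch only: assumes these are define() constants in PhpDig's config file.
// The constant names come from the advice above; the comments are a guess
// at their effect based on the names.
define('LIMIT_TO_DIRECTORY', false); // do not confine the spider to the starting directory
define('PHPDIG_IN_DOMAIN', true);    // let the spider follow links within the same domain
```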

ashyra
09-09-2005, 06:26 AM
thanks for the reply.

trying that now.

ashyra
09-09-2005, 06:28 AM
BTW. This is a fantastic script. I'm sure I'm going to love it once I get it working!

ashyra
09-09-2005, 07:03 AM
I must be missing something huge.

It found 312 more links (I tested at a low level). Then said:

Optimizing tables...
Indexing complete !

but there's still only 311 links for that particular site.

in other words, it didn't index the new pages (although some were dupes).

Charter
09-09-2005, 08:31 AM
If any of the links are buried in heavy JavaScript, PhpDig won't follow them. Try setting a larger search depth, use links per of zero, and the no option. You can increase search depth beyond twenty by changing SPIDER_MAX_LIMIT in the config file. Also, if there are any META revisit-after HTML tags, PhpDig attempts to obey those times. Tip: start indexing the site from the sitemap if present and check out this (http://www.phpdig.net/forum/showthread.php?t=1139) thread.
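
As a rough illustration of the SPIDER_MAX_LIMIT change, assuming it is a define() constant in the config file like the other settings, the edit could look like this. The value 50 is arbitrary and purely hypothetical; pick whatever depth the site needs.

```php
<?php
// Sketch only: assumes SPIDER_MAX_LIMIT is a define() constant in the config file.
// 50 is an arbitrary example value, chosen only to be above the limit of twenty
// mentioned above.
define('SPIDER_MAX_LIMIT', 50);
```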

ashyra
09-09-2005, 09:57 AM
ok, I'll check that out.

at first the links were a problem as they weren't visible to the spider script. But then I dug them out myself to index.

sooo... now I've deleted the site and I'm attempting to start over...

ashyra
09-09-2005, 10:41 AM
Yep, one of my major problems here is the JavaScript links.

no way around it, eh?

ashyra
09-10-2005, 04:48 AM
DOH!

I first set PhpDig up in a /test/ dir, then moved it to the main dir and forgot to set my permissions.

I think this is probably my problem (I hope).

As for the JavaScript links, though, it'd be great if there were a way around that issue. Some of the sites I'm indexing are going to be a real problem in that respect.

Thanks again...

Charter
09-11-2005, 07:02 AM
PhpDig tries to follow links in window.open() calls or window.location assignments in JavaScript, but nothing more complex. If you want PhpDig to try to follow other JavaScript-type links, such as window.navigate(), try editing the following line in the robot_functions.php file. Note that edits to this line won't make PhpDig parse heavy JavaScript the way a browser can:

while (eregi("(<frame[^>]*src[[:blank:]]*=|href[[:blank:]]*=|http-equiv=['\"]refresh['\"] *content=['\"][0-9]+;[[:blank:]]*url[[:blank:]]*=|window[.]location[[:blank:]]*=|window[.]open[[:blank:]]*[(])[[:blank:]]*[\'\"]?((([a-z]{3,5}://)+(([.a-zA-Z0-9-])+(:[0-9]+)*))*($allowed_link_chars\[?$allowed_link_chars\]?$allowed_link_chars))(#[.a-zA-Z0-9-]*)?[\'\" ]?",$eval,$regs)) {
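
To illustrate the idea, here is a standalone sketch that is not PhpDig's code. It uses a much simplified preg pattern instead of the eregi() expression above, and the function name extract_links and the sample HTML are made up for the example. In the real line, the analogous change would presumably be to add an alternative such as window[.]navigate[[:blank:]]*[(] next to the window[.]open one, but that is untested here.

```php
<?php
// Simplified, hypothetical link pattern: NOT the regex PhpDig uses.
// It accepts href=, window.location=, window.open( and the added
// window.navigate( alternative, followed by an optionally quoted URL.
$pattern = '~(href\s*=|window\.location\s*=|window\.open\s*\(|window\.navigate\s*\()\s*[\'"]?([^\'"\s>)]+)~i';

// Hypothetical helper: returns every URL the pattern finds in $html.
function extract_links($html, $pattern)
{
    preg_match_all($pattern, $html, $matches);
    return $matches[2]; // the second capture group holds the URL
}

// Made-up sample markup with one link of each kind.
$html = '<a href="page1.html">one</a>'
      . '<span onclick="window.navigate(\'page2.html\')">two</span>'
      . '<script>window.open("page3.html");</script>';

print_r(extract_links($html, $pattern));
```

Without the window.navigate alternative, page2.html would not be picked up. As with the original line, a pattern like this only catches URLs written literally in the markup; anything assembled at runtime by script stays invisible to the spider.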