PhpDig.net - View Single Post

mark · 11-26-2003, 01:29 PM

Thanks Charter, that makes sense. I must have had a crawl that didn't get stopped when I thought it did.

Again this is a great tool, especially for the price, thanks!

Suggestions:

There is a config variable that holds a single session ID tag to remove from URLs. It would be nice if this could be an array, because there are so many different ones: SID, SESSID, etc. I tried adding this myself, but my PHP isn't too good, so it didn't work.
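
A rough sketch of the array idea, assuming a recent PHP version. The function name strip_session_ids() and the parameter names in the usage line are hypothetical and are not part of PhpDig's config or code:

```php
<?php
// Hypothetical helper, not part of PhpDig: removes any of several
// session-id parameters from a URL's query string.
function strip_session_ids(string $url, array $names): string
{
    foreach ($names as $name) {
        // Match "name=value" only when it directly follows ? or &,
        // so "SID" does not also match inside "PHPSESSID".
        $pattern = '/([?&])' . preg_quote($name, '/') . '=[^&#]*&?/i';
        $url = preg_replace($pattern, '$1', $url) ?? $url;
    }
    // Drop a dangling ? or & left at the end of the query string.
    return preg_replace('/[?&](?=#|$)/', '', $url) ?? $url;
}
```

Usage would look something like strip_session_ids($url, ['SID', 'SESSID', 'PHPSESSID']), with the list of names kept in one config array.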

When the crawl is done and the links found are shown, there could be an option to select a subset of these links with checkboxes, and a button to start the spidering of these links.

Is it possible to create a simplified PageRank feature (like Google), that skips all the fancy calculations, but does determine the number of links to a given page (from the PhpDig database pages) and factors this into the search results?
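
Something along these lines is what I have in mind. This is only a sketch: it works on a plain array of [source, target] URL pairs rather than PhpDig's actual tables, and the function names and the log-based boost formula are my own assumptions, not anything PhpDig does today:

```php
<?php
// Count distinct inbound links per page from [source, target] pairs.
// Self-links and repeated links from the same source are ignored.
function count_inbound_links(array $links): array
{
    $seen = [];
    $counts = [];
    foreach ($links as [$from, $to]) {
        if ($from === $to) {
            continue; // a page linking to itself is not a vote
        }
        $key = $from . "\n" . $to;
        if (isset($seen[$key])) {
            continue; // count each source page only once per target
        }
        $seen[$key] = true;
        $counts[$to] = ($counts[$to] ?? 0) + 1;
    }
    return $counts;
}

// Assumed formula: scale the keyword score by a damped link count,
// so a page with zero inbound links keeps its original score.
function boosted_score(float $score, int $inbound): float
{
    return $score * (1 + log(1 + $inbound));
}
```

The logarithm is there so that a page with thousands of inbound links does not completely drown out keyword relevance.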

I'm having a little problem: I enter a search with a single keyword that is known to appear on Page X, and the results show Page X, but sometimes neither the snippet nor the page title contains the keyword. Why might the snippet shown not be one that contains the keyword?

I want to allow PhpDig to jump to different domains and have my configuration set this way:
define('PHPDIG_IN_DOMAIN',true);
Is this correct for what I want?

If so, that leads to another question. I spidered a site whose home page links directly to dozens of different sites (no frames), using a depth of 2, but only pages from the starting domain were added. (If my config above is wrong for this, then never mind.)

When I use a spidering depth of 1, it grabs the target URL plus the links directly on that page. But what if I wanted to grab only the home page of each domain? There doesn't seem to be an easy way to do this. Could there be a search depth option of 0, which grabs only that page?

Some spiders take into account the load that they might put on the servers they crawl and space the individual downloads out. Is this possible to integrate into PhpDig, in order to keep the webmasters out there from all banning PhpDig?
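
A minimal sketch of the kind of spacing I mean, assuming a fixed minimum delay per host. Both function names and the $fetch callback are hypothetical stand-ins, not PhpDig functions:

```php
<?php
// How many seconds to wait before hitting the same host again.
// $lastFetch is null when the host has not been fetched yet.
function seconds_to_wait(?float $lastFetch, float $now, float $minDelay): float
{
    if ($lastFetch === null) {
        return 0.0;
    }
    return max(0.0, $minDelay - ($now - $lastFetch));
}

// Fetch each URL via $fetch, pausing so that requests to the same
// host are at least $minDelay seconds apart.
function polite_fetch_all(array $urls, float $minDelay, callable $fetch): array
{
    $lastByHost = [];
    $results = [];
    foreach ($urls as $url) {
        $host = parse_url($url, PHP_URL_HOST) ?: '';
        $wait = seconds_to_wait($lastByHost[$host] ?? null, microtime(true), $minDelay);
        if ($wait > 0) {
            usleep((int) round($wait * 1000000));
        }
        $results[$url] = $fetch($url);
        $lastByHost[$host] = microtime(true);
    }
    return $results;
}
```

Tracking the last fetch time per host means a crawl that jumps between domains is only slowed down when it returns to a server it has just hit.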

I read the thread here, about PhpDig indexing the meta tags and comments, things the user would never see. I tried all the suggestions posted there for regular expressions to zap that, but couldn't get that to work. Maybe this could be worked out correctly and added as an option.
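
For reference, this is roughly the filtering I was trying to get working, written as a standalone sketch. The function name strip_hidden_markup() is hypothetical, and where it would hook into the spider is an open question; any comment-based markers the spider relies on would have to be handled before this step:

```php
<?php
// Remove markup the visitor never sees before the text is indexed.
function strip_hidden_markup(string $html): string
{
    // HTML comments; the /s flag lets "." span line breaks.
    $html = preg_replace('/<!--.*?-->/s', '', $html) ?? $html;
    // <meta ...> tags such as keywords and description.
    $html = preg_replace('/<meta\b[^>]*>/i', '', $html) ?? $html;
    return $html;
}
```

The non-greedy `.*?` matters here: a greedy match would swallow everything between the first comment opener and the last comment closer on the page.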

I was getting results where my keyword was not in the page title for result 1 but was in the page title for result 2, so I tried changing the TITLE_WEIGHT config variable from 10 to 10000 and then to -1000, but never saw any change in the results. Is this setting applied only at spider time, or can it be changed at any time and take effect globally?

Thanks again!

Post #10 · mark · Green Mole · Join Date: Nov 2003 · Posts: 5
Last edited by mark; 11-26-2003 at 02:00 PM.