PhpDig Version 1.6.4 Released [Archive]

Charter

11-16-2003, 11:07 PM

Hi. PhpDig version 1.6.4 has been released as a minor release. The changes can be found in the Changelog (http://www.phpdig.net/info/changelog.txt) file. Assuming the bugs from version 1.6.3 have been fixed in version 1.6.4, a future release will likely be a major release.

Charter

11-17-2003, 05:14 PM

Hi. Please, if you've installed PhpDig version 1.6.4, I'd like to hear from you. What I am looking for is feedback, good or bad, on how version 1.6.4 is working for you. This will let me know where the code shines and where the code needs improvement, especially before releasing a major release. Thanks for helping.

David J Harmon

11-19-2003, 05:02 PM

I'm loading it up tonight to see how it works...

Wish me luck and I'll give feedback on it.

sid

11-19-2003, 08:28 PM

Works excellent... Just what I wanted...... Well done and good work, keep it up...

There is an idea in the "Mod requests" forum about making an "Image Search", which I have classified as "Quite Possible" for you, being such a great PHP developer. Can it be done? And don't forget to name it "PhpDig Image Search, a possible idea by sid :)"

David J Harmon

11-19-2003, 08:38 PM

Charter, take note of this: I would like to have PhpDig Image Search. It would work great with my site...

David J Harmon

11-20-2003, 05:48 PM

Well, I just added 100 more hosts to my database (out of 525 hosts, 16,304 pages, 4,648,468 indexes, 264,792 keywords) and it's still working great. I've been spidering all day, although I did take a break to watch The Screen Savers on TechTV. So what is on the burner for the next major upgrade? I like the idea of an image search, but I would also like to see some more options on the admin page. Other than that, I think it's a strong program.

mark

11-26-2003, 11:44 AM

Hello, I'm using 1.6.4. I really like this package, it has been fun playing with it.

I installed PhpDig and spidered a few of my sites, then was busy with other things for a few days. When I came back, I was pleasantly surprised to see that many new domains had been spidered while I was away, even though I didn't recognize any of the new sites (in other words, I'm not sure where it found the links to them...?). After installing PhpDig, it seemed that all spidering had to be manually initiated. Is the spider actually running automatically, and if so, what algorithm does it use to branch out? Is this what I'm seeing when I see the "locked" sites in my domain list in the admin panel? Another question: what happens to all the links that the spider finds when spidering a site? Does it save them all and eventually come back and spider them as well?

Charter

11-26-2003, 11:57 AM

Hi. Depending on the level used, PhpDig will go and index sites from the links it finds. Locked sites are sites that are currently being crawled. Sometimes, if a crawl terminates prematurely, a site can remain locked, but you can unlock the site from the admin panel. The crawl process can take some time, especially with a lot of links and a high level. PhpDig will not start by itself unless you set a cron job. Link information from the sites is stored in the database tables, and text from the pages is stored in flat files.
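
To illustrate how a level (depth) limit decides which pages a crawl reaches, here is a minimal sketch in Python. It is not PhpDig code: the `crawl` function, the `LINKS` table, and the example hosts are all hypothetical, and the link graph is held in memory rather than fetched over the network. A limit of 1 matches the behavior described in this thread (the start page plus the pages it links to); a limit of 0 is the "only this page" option requested later in the thread, shown here as an assumption about how it could behave.

```python
from collections import deque

def crawl(start, links, max_depth):
    """Return the pages reachable from `start` within `max_depth` link hops.

    `links` is a hypothetical in-memory map of page -> pages it links to,
    standing in for what a real spider would discover by fetching pages.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are indexed but not followed
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order

# Hypothetical link graph: one site linking out to two other hosts.
LINKS = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/archive"],
    "b.example/": ["c.example/"],
}

print(crawl("a.example/", LINKS, 0))
print(crawl("a.example/", LINKS, 1))
print(crawl("a.example/", LINKS, 2))
```

Each extra level can pull in hosts that were never entered by hand, which is one way unfamiliar domains could end up in the site list.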

David J Harmon

11-26-2003, 01:27 PM

I never had it start working by itself, which I wouldn't want because I like to see which sites are being added. I have a Gaming Search Site with visitors of all different ages looking for sites, and I don't want any adult sites or other garbage to come up.

mark

11-26-2003, 01:29 PM

Thanks Charter, that makes sense. I must have had a crawl that didn't get stopped when I thought it did.

Again this is a great tool, especially for the price, thanks!

Suggestions:

There is a config variable for holding a particular SessionId tag to remove from the URLs, it would be nice if this could be an array, because there are so many different ones... SID, SESSID, etc. I tried adding it, but my PHP isn't too good, so it didn't work.

When the crawl is done and the links found are shown, there could be an option to select a subset of these links with checkboxes, and a button to start the spidering of these links.

Is it possible to create a simplified PageRank feature (like Google), that skips all the fancy calculations, but does determine the number of links to a given page (from the PhpDig database pages) and factors this into the search results?

I'm having a little problem where I enter a search with a single keyword that is known to appear on Page X, and the results show Page X, but sometimes the snippet doesn't contain the keyword and neither does the page title. Why might the snippet not be the one that contains the keyword?

I want to allow PhpDig to jump to different domains and have my configuration set this way:
define('PHPDIG_IN_DOMAIN',true);
Is this correct for what I want?

If so, that leads to another question. I spidered a site which links to dozens of different sites right on the home page (no frames) with a depth of 2, but only pages from this domain were added. (If my config above is wrong for this, then never mind.)

When I use a spidering depth of 1, it grabs the target URL plus the links directly from that page. But what if I wanted to grab only the home page of each domain? There doesn't seem to be an easy way to do this. Could there be a search depth option of 0, which only grabs that page?

Some spiders take into account the load that they might put on the servers they crawl and space the individual downloads out. Is this possible to integrate into PhpDig, in order to keep the webmasters out there from all banning PhpDig?

I read the thread here, about PhpDig indexing the meta tags and comments, things the user would never see. I tried all the suggestions posted there for regular expressions to zap that, but couldn't get that to work. Maybe this could be worked out correctly and added as an option.

I was getting results where my keyword was not in the page title for result 1, and the keyword was in the page title for result 2, so I tried changing the TITLE_WEIGHT config variable from 10 to 10000 and then to -1000, but never saw any change in the results. Is this setting only applicable at spider time, or can it be changed globally at any time?

Thanks again!
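
As a rough illustration of the simplified link-count idea suggested above, the following Python sketch counts how many distinct indexed pages link to each page and adds that count to a base relevance score. It is not PhpDig code: `inbound_link_counts`, `rerank`, `link_bonus`, and the scoring formula are all hypothetical.

```python
from collections import Counter

def inbound_link_counts(links):
    """Count, for each page, how many distinct other pages link to it.

    `links` is a hypothetical map of page -> list of pages it links to.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # repeated links from one page count once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts

def rerank(results, counts, link_bonus=1.0):
    """Sort (url, base_score) pairs by base score plus a per-link bonus."""
    return sorted(
        results,
        key=lambda r: r[1] + link_bonus * counts[r[0]],
        reverse=True,
    )

links = {"p1": ["p3", "p3", "p2"], "p2": ["p3"], "p3": ["p3"]}
counts = inbound_link_counts(links)
print(rerank([("p2", 5.0), ("p3", 4.5)], counts))
```

The count only reflects pages already in the index, so it says little when a crawl covers just a handful of sites.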

Charter

11-26-2003, 02:16 PM

Hi. Thanks for the suggestions. Here are some answers.

If you go to the admin panel, click a site, and click a blue arrow, you'll see the links in that (sub)tree. If you click a green check mark, PhpDig should reindex that (sub)tree using the setup in the config file.

For jumping domains, this (http://www.phpdig.net/showthread.php?threadid=177) thread might help with what you want.

Adding an array for session ids is probably easy. Adding PageRank is probably hard, and I'm not sure it makes sense for limited crawling. I will look into the snippet issue (does the highlighted word show up if you increase the snippet length?). Adding a 'wait' variable is probably easy.
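
For the session id suggestion, a minimal Python sketch of stripping several session parameters from a URL might look like the following. It is not PhpDig code: `strip_session_ids` and the `SESSION_PARAMS` set are hypothetical, and the parameter names in it are only examples of the "SID, SESSID, etc." variety mentioned above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of session parameter names, compared case-insensitively.
SESSION_PARAMS = {"sid", "sessid", "phpsessid"}

def strip_session_ids(url, names=SESSION_PARAMS):
    """Return `url` with any query parameter named in `names` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_session_ids("http://example.com/page.php?id=7&PHPSESSID=abc123"))
```

Normalizing URLs this way before storing them keeps the same page from being indexed once per session.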

With regard to tags, do you mean that you see HTML comments in the search results (can you post an example), or is it that you want to be rid of META tag description and keywords text in the search results? If the latter, comment out the code in post seven of this (http://www.phpdig.net/showthread.php?threadid=139) thread.

mark

11-26-2003, 03:20 PM

>> For jumping domains, this (http://www.phpdig.net/showthread.php?threadid=177) thread might help with what you want.

That is great, I'll certainly try that. But the thing that this really makes me wonder about is how domain Z got into my database if I never specifically spidered it in the admin panel?

>> With regard to tags, do you mean that you see HTML comments in the search results (can you post an example), or is it that you want to be rid of META tag description and keywords text in the search results? If the latter, comment out the code in post seven of this (http://www.phpdig.net/showthread.php?threadid=139) thread.

Not comments, but meta keywords. I tried commenting that section out, then deleted my test case domain and respidered, but the page still came up for the meta keyword, so I haven't found a solution for that.

What about the keyword in title weighting? I guess that should just work...?

Charter

11-26-2003, 03:46 PM

Keeping that code section commented out, try deleting the test case domain and also deleting the test case domain files in the text_content directory and then do a new index. Weights are stored in a database table so a new index should change the order.
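
The following Python sketch illustrates why a weight that is stored at index time only takes effect after a new index. It is a toy model, not PhpDig's actual scoring: `index_page`, `build_index`, `search`, the `PAGES` data, and the formula (one point per occurrence in the body, `title_weight` points per occurrence in the title) are all assumptions made for illustration.

```python
def index_page(title, body, title_weight):
    """Return {keyword: weight} as it would be stored when the page is indexed."""
    weights = {}
    for word in body.lower().split():
        weights[word] = weights.get(word, 0) + 1
    for word in title.lower().split():
        weights[word] = weights.get(word, 0) + title_weight
    return weights

# Hypothetical pages: url -> (title, body text).
PAGES = {
    "page1": ("Home", "spider tips and spider tricks"),
    "page2": ("Spider guide", "a short guide"),
}

def build_index(title_weight):
    """Index every page using the title weight in force at index time."""
    return {url: index_page(t, b, title_weight) for url, (t, b) in PAGES.items()}

def search(index, keyword):
    """Return matching URLs ordered by stored weight, highest first."""
    hits = [(url, w[keyword]) for url, w in index.items() if keyword in w]
    return [url for url, _ in sorted(hits, key=lambda h: h[1], reverse=True)]

old_index = build_index(title_weight=10)
print(search(old_index, "spider"))     # order under the original weight
new_index = build_index(title_weight=-1000)
print(search(new_index, "spider"))     # order after re-indexing
```

In this model, changing the setting does nothing to `old_index`; the order changes only once the pages are indexed again.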

mark

11-26-2003, 04:04 PM

Yes, I see now that when I respidered with a negative title weighting, those pages with the keyword are buried at the end. Thanks.

Charter

11-26-2003, 04:11 PM

Hi. Do you mean that after you did the thing in two posts above, the meta keywords and description still show up?

mark

11-26-2003, 04:21 PM

No, sorry, was just talking about the title keyword weighting, which I get now.

One more thing, it seems that when I hit the stop button on a large spider run, that the spidering is actually still running in the background, because I decided to clear out my database and deleted all the sites, but new ones kept appearing from the last spidering that I had run (and stopped). Is there a way to really stop the spidering once started?

Charter

11-26-2003, 04:27 PM

So the meta tag text is now gone, right?

How many new sites appear after you click stop? There may be a lag between the time you click stop and when the 'signal' is received by the server. If you have access to shell and want to kill a process, just type kill pid at the shell prompt, where pid is the process id.
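
For completeness, the same thing can be done from a script. This is a small Python sketch, assuming a POSIX system and that you already know the process id; `stop_process` is a hypothetical helper name. Plain `kill pid` sends SIGTERM by default, which is what the sketch sends.

```python
import os
import signal

def stop_process(pid):
    """Ask the process with this id to terminate, as `kill pid` does by default."""
    os.kill(pid, signal.SIGTERM)
```

Calling `stop_process(12345)` would signal process 12345; the number is only a placeholder.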