Indexing the Internet


sufehmi
06-06-2004, 09:39 AM
I'm a bit concerned about Google's domination of search, so like many others I signed up for Grub.org (http://grub.org).

Unfortunately, their server software is not open-sourced, so I looked around again to find another similar project.

***** (http://www.*****.org/docs/en/) is a good search engine and it's open-source; however, they're not interested in implementing a distributed crawler (http://www.*****.org/docs/en/faq.html) like Grub. And I don't know Java :P

So I was looking for a good PHP-based search engine, and found PhpDig. I just installed it, and it looks quite good.

I'm very interested in starting a project to index the Internet using PhpDig.
I think we can scale PhpDig for this. For example, we can separate the various components (indexer, search front-end, database, etc.) onto their own physical servers, and MySQL has a clustering feature now.

If anyone else is interested, feel free to join in.

This is the to-do list for this project:

# Purchase a dedicated server for the project
# Get the domain name lists by signing up [ here (http://www.verisign.com/nds/naming/tld/) ] and [ here (http://www.pir.org/registrars/zone_file_access) ]
(read [ this (http://forums.devshed.com/t139891/s.html) ] and [ this (http://www.webhostingtalk.com/showthread.php?threadid=52404) ] for details)
# Code a job allocator, which will allocate job packages to users. It will assign several domain names (from the list above) to be deep-crawled by users.
# Code a job manager, which will receive submissions from users and merge them into the main index.
# Modify spider.php so it can request job packages (with user authentication), crawl the domains, and submit the results back securely (running as PHP CGI).
# Create a simple website with basic stats, user management, and a search front-end.
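
To make the job allocator step a bit more concrete, here is a rough sketch of how domains could be handed out in packages. It is written in Python purely for illustration (PhpDig itself is PHP), and every name in it (JobPackage, JobAllocator, package_size, requeue) is made up for this example rather than taken from PhpDig or Grub. It also assumes the domain list fits in memory, which a real zone file would not.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, List, Optional


@dataclass
class JobPackage:
    """A batch of domains assigned to one volunteer (hypothetical structure)."""
    job_id: int
    user: str
    domains: List[str]


class JobAllocator:
    """Hands out batches of domains to users running the spider."""

    def __init__(self, domains: Iterable[str], package_size: int = 3) -> None:
        # Normalise and deduplicate while keeping the original order,
        # so that no domain is handed out twice.
        cleaned = (d.strip().lower() for d in domains if d.strip())
        self._pending: Deque[str] = deque(dict.fromkeys(cleaned))
        self._package_size = package_size
        self._next_id = 1
        self._outstanding: Dict[int, JobPackage] = {}

    def allocate(self, user: str) -> Optional[JobPackage]:
        """Return the next package for this user, or None if nothing is left."""
        if not self._pending:
            return None
        size = min(self._package_size, len(self._pending))
        batch = [self._pending.popleft() for _ in range(size)]
        job = JobPackage(self._next_id, user, batch)
        self._outstanding[job.job_id] = job
        self._next_id += 1
        return job

    def complete(self, job_id: int) -> None:
        """Forget a package once its results have been accepted."""
        self._outstanding.pop(job_id, None)

    def requeue(self, job_id: int) -> None:
        """Put an abandoned package's domains back at the end of the queue."""
        job = self._outstanding.pop(job_id, None)
        if job is not None:
            self._pending.extend(job.domains)
```

A real allocator would also need timeouts for packages that never come back, plus the user authentication mentioned in the list above; both are left out here.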

That should be enough to get this project off the ground.

This project will be fully open and strictly non-profit.

Thanks to the PhpDig developers, and here's hoping that this will be useful for everyone as well.



Thanks,
Harry

bloodjelly
06-06-2004, 09:58 AM
Sounds like an ambitious task, sufehmi, to put it mildly--especially considering that as of now phpDig can only spider one site at a time per database. Also, and no offense, but why would you want to do this? Google's "domination" on search, most people agree, provides relevant information quickly and easily, giving useful results in a fair manner. Do you plan on out-Googling Google? Everywhere you look some search engine is trying to top them, and they're spending millions and millions of dollars to do it. If they do, great, just as long as we still get relevant information. That's the only thing people want. So I guess my question is why would you want to compete with these big businesses, and why would you want to do something that many many other people have already almost done? I'm personally happy with my search results thus far.

sufehmi
06-06-2004, 01:59 PM
Originally posted by bloodjelly
Sounds like an ambitious task, sufehmi, to put it mildly

You're absolutely correct. And I'm not the best PHP coder either.

But if I can get people interested in this project, I think it has a good chance of succeeding.
Making it very easy to contribute is one of the tricks (by enabling people to run the spider themselves).

And I think I can get the project off the ground on my own, after which hopefully it'll be interesting enough for others to join in.


especially considering that as of now phpDig can only spider one site at a time per database.

Yes, you're correct, it needs an additional module that's able to accept results from multiple spiders and incorporate them into the main index.
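
Just to illustrate what such a merge module might do, here is a minimal sketch. It is in Python for illustration only, not PhpDig code, and it does not reflect PhpDig's actual database schema. It assumes each spider submits a mapping of URL to word counts, and that a resubmitted URL should replace its old entries; MainIndex, merge and search are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Set


class MainIndex:
    """A toy in-memory inverted index that accepts spider submissions."""

    def __init__(self) -> None:
        # word -> {url: number of occurrences on that page}
        self._postings: Dict[str, Dict[str, int]] = defaultdict(dict)
        # url -> words currently indexed for that page
        self._page_words: Dict[str, Set[str]] = {}

    def merge(self, submission: Dict[str, Dict[str, int]]) -> None:
        """Merge one spider's results, given as url -> {word: count}."""
        for url, counts in submission.items():
            # If the page was indexed before, drop its stale postings first.
            for word in self._page_words.get(url, ()):
                self._postings[word].pop(url, None)
                if not self._postings[word]:
                    del self._postings[word]
            # Record the fresh word counts for this page.
            for word, count in counts.items():
                self._postings[word.lower()][url] = count
            self._page_words[url] = {w.lower() for w in counts}

    def search(self, word: str) -> List[str]:
        """Return matching URLs, highest count first, ties broken by URL."""
        hits = self._postings.get(word.lower(), {})
        return sorted(hits, key=lambda u: (-hits[u], u))
```

With this shape, two spiders can crawl different sites and submit independently, since each submission only touches the postings for its own URLs. A real version would sit on top of the database and would need to validate submissions before trusting them.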


Also, and no offense, but why would you want to do this? Google's "domination" on search, most people agree, provides relevant information quickly and easily, giving useful results in a fair manner.

# One thing that everyone agrees on is that Google is among the most powerful entities on the Internet at the moment.

# At the moment they're doing a great job playing it fair (for most people), but there's no guarantee for the future.

# Google is excellent, but there are a few things that I (and no doubt others) would like to improve.
(link farms, anyone? Google spammers? etc.)

# It will be one mighty interesting project ;)


Do you plan on out-Googling Google? Everywhere you look some search engine is trying to top them, and they're spending millions and millions of dollars to do it. If they do, great, just as long as we still get relevant information. That's the only thing people want. So I guess my question is why would you want to compete with these big businesses, and why would you want to do something that many many other people have already almost done? I'm personally happy with my search results thus far.

I'd be a total idiot if I thought that I could beat Google by myself.

But when people are working together, I think nothing is impossible.


Thanks,
Harry

bloodjelly
06-06-2004, 05:28 PM
Well, good luck, let's hear how your project progresses. :)

MySQLwebmaster
08-03-2004, 05:18 PM
I like how you're thinking. Sounds like a great project. Best of luck. Any progress report?

sufehmi
08-04-2004, 01:56 PM
Nope, unfortunately I'm still busy coding for phpBB and phpOpenChat, among other things.

Well anyway, this gives me the opportunity to look for a better server within my budget :) I can't believe how cheap dedicated servers are nowadays (as long as you don't host anything business-critical).

In the meantime, if anyone is interested in joining, just drop me an email or post in this thread.


cheers, HS