Some sites won't index


jalerta
10-05-2003, 03:25 PM
Hi All,

I have installed PHPDig-1.6.2 on a Redhat Linux 8.1 server running Apache 2.0 and MySQL version 3.23.56 with PHP 4.2.2.

I am having problems with some sites not indexing and just giving me the following message.

SITE : http://www.somedomain.com/
Exclude paths :
- @NONE@
No link in temporary table

--------------------------------------------------------------------------------

links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !


I am sure that there are more than 10 links on the index.html page of this site, but still nothing.

On other domains on this server PHPDig works correctly.

Can anyone give me any idea as to what is happening?

Thanks in advance.


Jeff

Charter
10-05-2003, 03:41 PM
Hi. Did you index the sites recently, or are the sites like http://www.domain.com/dirone/index.php and http://www.domain.com/dirtwo/index.php? You can change the reindex timeframe with define('LIMIT_DAYS',7); in the config file.

jalerta
10-05-2003, 06:36 PM
Thanks for the reply.

I have been trying to get it to work with that specific domain and have read other posts here about problems with reindexing a recently indexed site.

So, I have repeatedly deleted the MySQL database and re-installed it using the install.php script.

I am only indexing from the top level directory using "www.domainname1.com" and "www.domainname2.com".

I have also tried "www.domainname.com/index.html" without any success.

I have tried indexing 3 domains on the same server. Only one indexed. The other 2, including the domain that I really want to index, did not.

Both domains gave the same message listed in the post above.


Jeff

Charter
10-05-2003, 06:49 PM
Hi. To start over and index from scratch, do the following:

empty all the PhpDig database tables
delete all files that may be in the temp dir
delete all files in the text_content dir except keepalive.txt
run spider.php from a browser or command prompt

Before running spider.php from the command prompt, set the following to 1 in the config file if only one level is wanted:

define('SPIDER_MAX_LIMIT',1);
define('SPIDER_DEFAULT_LIMIT',1);
define('RESPIDER_LIMIT',1);

Also, in the config file, set the following to 1 if more frequent reindexing is wanted:

define('LIMIT_DAYS',1);

Emptying the database tables is part of the process to restart from scratch. The files in the text_content directory also need to be deleted, except for the keepalive.txt file.
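
The file cleanup part of the steps above can be sketched as a small script. This is only a rough sketch in Python, not part of PhpDig: the function name clear_phpdig_files is made up for illustration, and it assumes the temp dir is admin/temp under the install directory. It does not touch the database tables, which still have to be emptied separately.

```python
from pathlib import Path


def clear_phpdig_files(install_dir, keep=("keepalive.txt",)):
    """Delete regular files in text_content (except `keep`) and in admin/temp.

    Hypothetical helper: the directory names are assumptions about the
    install layout. Returns the removed paths, relative to install_dir.
    """
    root = Path(install_dir)
    removed = []
    # text_content keeps keepalive.txt; the temp dir is emptied completely.
    for sub, kept in (("text_content", set(keep)), ("admin/temp", set())):
        folder = root / sub
        if not folder.is_dir():
            continue  # nothing to clean if the directory is missing
        for entry in sorted(folder.iterdir()):
            if entry.is_file() and entry.name not in kept:
                entry.unlink()
                removed.append(str(entry.relative_to(root)))
    return removed
```

Calling clear_phpdig_files("/path/to/search") returns the list of files it removed, so you can check what was deleted before running spider.php again.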

jalerta
10-05-2003, 08:04 PM
I followed your instructions but still nothing.

The message this time was:

2935: old priority 0, new priority 18
Spidering in progress...
-----------------------------
SITE : http://www.somedomain.com/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !

Just to recap the installation instructions so I am sure that I got everything right ...

I unTARed the phpdig files into a temp directory and then copied all the files into the www.somedomain.com/search directory.

I changed the permissions on the admin/temp, includes and text_content directories to 777 to allow write access to everyone. ( Security issue that I will worry about when I get PHPDig running )

I copied the _connect.php file to connect.php and edited it to add the MySQL hostname, username, password and database name. I cleared the PHPDIG_DB_PREFIX field.

I then ran the install.php file from a web browser ( although at first it complained about not finding the init_db.sql file, which I then copied to the admin directory).

Once the database was created and the tables were installed I tried to index www.somedomain.com with no success.

Was there anything else that I was supposed to do? Am I missing any permissions or something?

Any other suggestions?


Thanks for the help.


Jeff

Charter
10-06-2003, 02:53 PM
Hi. That sounds correct. What type of files are you trying to index: *.asp, *.shtml, etcetera? Do you notice if indexing works on some file types but not others?

jalerta
10-06-2003, 03:58 PM
I am trying to index plain .html files.

I have done some more tests and I have tried to index 10 different virtual domain sites that reside on my server.

I have discovered that of the 10 sites I tried to index only 1 site worked. 9 sites would not index.

Looking further, I discovered that the only site that would index was a site that had moved to another provider.

The directory structure and files for the web site still resided on my server but the DNS now points to another server.

All the other virtual domains that I tried to index had DNS entries that pointed to my server IP address.

Does this tell you anything?

Jeff

Charter
10-06-2003, 04:36 PM
Can you try lynx from command line instead? An example is in this (http://www.phpdig.net/showthread.php?threadid=78) thread.

jalerta
10-06-2003, 06:41 PM
I tried using Lynx, with no success.

Lynx would just sit there saying "Making HTTP connection to www.somedomain.com".

I was wondering if the issue in this case could be that the web server is behind a NAT'ed firewall?

Also, the web sites are on the same machine as the DNS service.

So, on the internal network the server has an IP address, for example, of 10.1.1.100. However, in the DNS the domain has an IP address of 123.123.123.1.

In this case, Lynx is trying to open the web site that DNS says is at 123.123.123.1, while the server that the web site is really on is at 10.1.1.100. So no connection can be established.
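
One way to check this theory from the server itself is sketched below in Python. It is only a rough diagnostic under the assumptions described above: the helper names are made up, and the addresses are the example ones from this post. resolve and can_connect touch the network, while looks_like_nat_loopback just compares two addresses.

```python
import ipaddress
import socket


def resolve(hostname):
    """Return the IPv4 address the local resolver gives for hostname."""
    return socket.gethostbyname(hostname)


def can_connect(address, port=80, timeout=5.0):
    """Try to open a TCP connection; True if it opens within the timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS errors
        return False


def looks_like_nat_loopback(resolved_ip, internal_ip):
    """True when the name resolves to a public address while the web
    server itself sits on a different, private address."""
    resolved = ipaddress.ip_address(resolved_ip)
    internal = ipaddress.ip_address(internal_ip)
    return resolved != internal and internal.is_private and not resolved.is_private


if __name__ == "__main__":
    # Example values from this post; replace with the real ones.
    print(looks_like_nat_loopback("123.123.123.1", "10.1.1.100"))
```

If resolve("www.somedomain.com") returns the public address and can_connect on that address fails from the server while can_connect("10.1.1.100") succeeds, that would point at the NAT setup rather than at PHPDig.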

Is this a possible explanation for the problem?

Has anyone run into this problem before?

Any and all help is greatly appreciated.

Thanks,

Jeff

rayvd
10-08-2003, 10:04 AM
This is definitely a NAT problem. I am experiencing the same thing and am trying to figure out a rule to get around it. What I want is for the webserver to reply on the same interface the request came in on, instead of doing NAT on the packet.

If your setup isn't too complex, you may just be able to set up a rule specifying that outbound packets to a given IP should not be NAT'd, or in some specific way only. I am hoping to find a way to tell the system to not do NAT on packets with a certain flag marked ... I'm using ipf on FreeBSD, but I would guess iptables would have this functionality as well...

rayvd
10-08-2003, 10:31 AM
Well, fixed my problem by adjusting the routing table on the machine with the webserver. :)

In your case, why not add an explicit entry to your /etc/hosts file pointing to the internal address instead of the external one?
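
For illustration, here is a rough Python sketch of what such an entry looks like and how to check which address a hosts file maps a name to. The function names are made up, and the address and hostname are just the example values from earlier in the thread; adjust them to your setup.

```python
def hosts_line(internal_ip, *hostnames):
    """Format one /etc/hosts entry: the address, then the names it serves."""
    return f"{internal_ip}\t{' '.join(hostnames)}"


def lookup_in_hosts(hosts_text, hostname):
    """Return the first address mapped to hostname in hosts-file text, or None."""
    wanted = hostname.lower()  # host names are case-insensitive
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if wanted in (name.lower() for name in fields[1:]):
            return fields[0]
    return None


if __name__ == "__main__":
    entry = hosts_line("10.1.1.100", "www.somedomain.com")
    print(entry)
    print(lookup_in_hosts("127.0.0.1 localhost\n" + entry, "www.somedomain.com"))
```

The line produced by hosts_line is what would be appended to /etc/hosts (as root), so that the spider on the server resolves the site to the internal address.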

jalerta
10-08-2003, 10:46 AM
Rayvd,

Yep, that worked.

Thanks for the help.

I hope PHPDig will eventually have the ability to directly index a site based on the location of files in the file system instead of only by FQDN/IP address.

Again, thanks for the help.


Jeff

vvvvv
10-12-2003, 08:29 PM
I have the same problem:

SITE : http://www.blah-blah-blah.com/
Exclude paths :
- @NONE@
No link in temporary table


>Well, fixed my problem by adjusting the routing table on the machine with the webserver.

I can't do that cause I have a simple hosting account. :(
Any suggestions? And thanks in advance for any help.

Charter
10-13-2003, 05:01 PM
Perhaps in config.php change PHPDIG_DEFAULT_INDEX to false?

vvvvv
10-13-2003, 05:17 PM
thanks Charter but still the same:

--------------------------------------------------------------------------------
SITE : http://www.somesite.com/
Exclude paths :
- @NONE@
No link in temporary table

--------------------------------------------------------------------------------

links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !

--------------------------------------------------------------------------------

Any other ideas? Much appreciate the help.

Charter
10-13-2003, 05:23 PM
>...Was recently indexed

Did it index the first time? Are you trying to reindex?

vvvvv
10-13-2003, 05:30 PM
>Did it index the first time?

Nope. :( This is all I ever got and I've tried it on several different URLs.

SITE : http://www.mysite.com/
Exclude paths :
- @NONE@
No link in temporary table

mike221
10-13-2003, 05:42 PM
I had the same problem but it went away after performing the mods published here (http://www.phpdig.net/showthread.php?s=&threadid=80&highlight=2003).

My server (where phpdig is): Apache/1.3.28 (Unix) mod_auth_passthrough/1.8 mod_gzip/1.3.26.1a mod_log_bytes/1.2 mod_bwlimited/1.0 PHP/4.3.3 FrontPage/5.0.2.2634 mod_ssl/2.8.15 OpenSSL/0.9.7a on Linux.

I still have problems indexing a couple of servers running Netscape out of 265 servers with all kind of configurations.

Good Luck

vvvvv
10-13-2003, 05:58 PM
OK I'll give that a spin. Thanks for the suggestion mike221

Looks like a late night cup of coffee for me. :)

vvvvv
10-13-2003, 06:33 PM
OK I did the mods but still the same. :(

Any other ideas? Again much appreciate the help.

rayvd
10-14-2003, 07:32 AM
You've probably already checked this... there's no robots.txt file on your server preventing the crawling is there? :)
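
If you want to test a robots.txt without running the spider, something like the following Python sketch would do it using the standard library's robots.txt parser. The function name is made up and the user agent string "PhpDig" is only an assumption; check which name the spider actually sends.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="PhpDig"):
    """Return True if the given robots.txt text lets user_agent fetch url.

    The default user_agent is a guess, not necessarily what the spider sends.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "http://www.somedomain.com/index.html"))
    print(allowed_by_robots(rules, "http://www.somedomain.com/private/a.html"))
```

Paste the contents of your own robots.txt into rules; if the start URL comes back as not allowed, that would explain a crawl that finds no links.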

Tanasja
10-17-2003, 03:40 AM
Hi,

I think I have the same problem. The first time, the indexing went fine. Then I changed some filenames. When reindexing, the old filenames were picked up and the new ones skipped. Also, when I indexed the new filename directly, the indexer couldn't find it.

I read the posts on this item and tried the following things:
- delete and reindex site (several ways)
- delete and reinstall database
- empty dir text_content (not keepalive) and dir admin/temp (which stayed empty when reindexing)
- change config: LIMIT_DAYS=1 and PHPDIG_DEFAULT_INDEX=false
- run spider.php from a browser

I also see suggestions like:
- lynx from command line
- adjusting the routing table on the machine with the webserver

I don't understand these suggestions. Can anybody explain them? Or are there other options left? Maybe useful information: I host my sites at a provider.

Can anybody help?
Greetings from Amsterdam,
Tanasja

Charter
10-17-2003, 03:52 PM
Hi vvvvv. Maybe this is a JavaScript (http://www.phpdig.net/showthread.php?threadid=143) issue? Does setting PHPDIG_DEFAULT_INDEX to false have any effect?

Hi Tanasja. Just to be sure, when you say "the old filenames were taken and the new ones skipped" are the new links in the files you are trying to index?

web newsroom
05-19-2004, 12:35 PM
I'm sure that it's something quite simple.

I have been able to successfully spider certain sites and currently show the following stats.

Last Run : May 19, 2004
Pages : 5025 Entries
Index : 1397195 Entries
Keywords : 230416 Entries
Temporary : 110440 Entries


However, I still cannot seem to spider our own site.

Will a fully qualified subdomain.mydomain.com work? I have changed the robots.txt and the .htaccess file and am still stumped.