PhpDig.net

Go Back   PhpDig.net > PhpDig Forums > Troubleshooting

Old 02-23-2004, 11:40 AM   #16
Charter
Head Mole
 
 
Join Date: May 2003
Posts: 2,539
>> Oh yes, I actually tried to enter the URL into the PhpDig interface just like it would redirect it, but it still hangs with a NOFILE status.

>> Warning: fsockopen(): php_network_getaddresses: getaddrinfo failed: No address associated with nodename (is your IPV6 configuration correct? If this error happens all the time, try reconfiguring PHP using --disable-ipv6 option to configure) in /Library/.../admin/robot_functions.php on line 337

Hi. The page here may help.
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
Old 02-23-2004, 11:55 AM   #17
mlerch@mac.com
Green Mole
 
Join Date: Feb 2004
Location: North Las Vegas, Nevada
Posts: 18
Hello Charter,

Here is what I did. I used a very simple .php document/script and named it test.php. All that this file does is print the words "Welcome to my site." into an HTML document.

I tried spidering this document and it actually picked up the words, no errors, nothing wrong. It worked perfectly on the very same domain that I am having problems with.

I also tried spidering a page that has a variable get string in the URL like: http://www.mydomain.com/testpage.php?cpath=5. It worked perfectly for that page as well.

The only thing that I do have problems with seems to be the URL get string that has commas in it:
http://www.mydomain.com/index.php?na...,1,1,1,1,1,0,0

Could it be that PhpDig does not like the comma values for the dis variable? At this point I am rather clueless about what it could be. There are different types of pages on that very same URL: the pages that have the dis variable with the comma values won't work, and any other page does work.

I really would like to get this to work. By the way, the previously described error has never happened again. I also double-checked that the DNS entries are correct in ns1 and ns2, and so forth, and it all seems to work correctly. The pages all work when I access them with a web browser. Please advise.

Thank you,

Mr. L
Old 02-23-2004, 12:15 PM   #18
mlerch@mac.com
Green Mole
 
Join Date: Feb 2004
Location: North Las Vegas, Nevada
Posts: 18
Ok... here is some food for thought. I took that little script from the link that you wanted me to check out and created a page like so:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html>
<head>
<title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>

<body>
<?php
$fd = fopen("http://www.mydomain.com/index.php/", "r");
if (!$fd) {
    echo "Cannot open URL";
} else {
    while (!feof($fd)) {
        $buffer = fgets($fd, 4096);
        echo $buffer;
    }
    fclose($fd);
}
?>
</body>
</html>

I called it testing.php. Then I went to a web browser and called testing.php, and it printed the desired page fully intact into the browser window. Totally awesome. Then I went to PhpDig and typed in that very same URL for testing.php, and guess what: it spidered it, no problem.

So... what do you think it could be? If I try spidering the page with its actual URL it doesn't work, but when I spider testing.php, which essentially opens and reads the page that I want to spider and echoes it back, then the spidering works. It only works on that one level, because other links that are generated on that particular page also use that dis=1,0,0,....... Now I am really lost here. Could it be that the code needs to be changed or adjusted so that the dis variable with the string of numbers with the commas won't throw a fit?

Please help

Mr. L
Old 02-23-2004, 12:42 PM   #19
Charter
Head Mole
 
 
Join Date: May 2003
Posts: 2,539
Hi. The type of links that PhpDig follows are those that match the regex in this thread. The comma is already allowed in the query string, so I don't think that is the problem, but I am not exactly sure.

However, you had mentioned crawling secure links, and one thing came to mind. The fsockopen error and/or NOFILE status would make sense if PhpDig was looking for the file at http rather than at https, or vice versa.

Try another test using a query string containing 1,1,1,1,1,1,0,0, but have everything use http instead of https, and use links, not fopen. Does it index then?

If it does, then in the above linked thread, look at the code in the phpdigIndexFile function and change http to [a-z]{3,5} in the while line.
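To illustrate what that change buys you, here is a rough sketch in plain Python (this is NOT PhpDig's actual regex, just a made-up pattern demonstrating the idea): a scheme matched as [a-z]{3,5} accepts both http and https links, and a pattern that permits commas lets query strings like dis=1,1,1,1,1,1,0,0 through untouched.

```python
import re

# Illustrative pattern only, not PhpDig's real link regex: [a-z]{3,5}
# matches "http" (4 chars) as well as "https" (5 chars), and the body
# class allows commas so the dis=... query string survives.
link_pattern = re.compile(r'^[a-z]{3,5}://[^\s"]+$')

links = [
    "http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0",
    "https://secure.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0",
]

for link in links:
    # Both the http and the https form match.
    print(bool(link_pattern.match(link)))
```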

Also, set PHPDIG_IN_DOMAIN to true in the config file and apply the code change in this thread.
Old 02-23-2004, 01:49 PM   #20
mlerch@mac.com
Green Mole
 
Join Date: Feb 2004
Location: North Las Vegas, Nevada
Posts: 18
Ok... I think we isolated the problem. The commas in the query string must be causing it. I did two separate tests. First, I rigged the testing.php document to access http://www.mydomain.com/index.php (index.php contains a script that reformulates the URL, supplies the initial parameters for that page, and redirects with the header directive to the URL that has all the comma stuff behind it). When I try to spider testing.php it works. It is spidering the page correctly.

Then I did another test and I replaced the URL that I want to fopen with the actual URL of index.php, the one with the query string and commas behind it. The page seems to load just fine.

When I try to spider testing.php that way it indexes in 42 seconds. Isn't that interesting? If I set up a page, testing.php, that fopens the URL with the dis=1,1,1,1,0.... and try to spider testing.php, it works; however, when I try to spider the URL directly with PhpDig, it stalls.

I did apply all the other changes per your instruction. Still no luck.

Mr. L

Last edited by mlerch@mac.com; 02-23-2004 at 02:14 PM.
Old 02-23-2004, 02:11 PM   #21
Charter
Head Mole
 
 
Join Date: May 2003
Posts: 2,539
Hi. Try indexing http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0 directly from the admin panel and then watch the raw logs. What link does PhpDig try to fetch? I notice a slash at the end of the link in the fopen(http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0/) warning.
Old 02-23-2004, 02:17 PM   #22
mlerch@mac.com
Green Mole
 
Join Date: Feb 2004
Location: North Las Vegas, Nevada
Posts: 18
What do you mean by admin panel? By the way, I did correct my prior post and made some changes, but I saw that you had replied already. Apparently it did work with the little fopen demo file that I wrote. It is actually spidering the fopen demo file that points to http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0
As I mentioned before, it will spider the page when called via the fopen demo file, but it will not spider that page when called directly.
Old 02-23-2004, 02:29 PM   #23
mlerch@mac.com
Green Mole
 
Join Date: Feb 2004
Location: North Las Vegas, Nevada
Posts: 18
Ok... I took a look at the logs, the error log that is. Here is what I get when I try to access that URL directly:

[Mon Feb 23 15:25:53 2004] [error] [client xxx.xxx.xxx.xxx] File does not exist: /Path/to/Web/Server/dir/html/robots.txt

Nothing in the access logs.

Why is it looking for robots.txt?


The access log gives me:

xxx.xxx.xxx.xxx - - [23/Feb/2004:15:32:10 -0800] "HEAD /index.php HTTP/1.1" 302 - "-" "PhpDig/1.8.0 (+http://www.phpdig.net/robot.php)"
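That combined-log line can be decoded mechanically. A rough sketch follows (the regex below is a generic Apache combined-log parser I wrote for illustration, not anything from PhpDig, and the IP is a placeholder): the telling parts are the HEAD method, the 302 status, and the PhpDig user agent, which together say the server answered the crawler's request with a redirect instead of serving the page.

```python
import re

# Generic combined-log-format parser (illustrative, not from PhpDig).
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Sample line adapted from the post above; the IP is a placeholder.
line = ('192.0.2.1 - - [23/Feb/2004:15:32:10 -0800] '
        '"HEAD /index.php HTTP/1.1" 302 - "-" '
        '"PhpDig/1.8.0 (+http://www.phpdig.net/robot.php)"')

m = LOG_RE.match(line)
# The crawler sent a HEAD request and got redirected (302).
print(m.group("method"), m.group("status"), m.group("agent"))
```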

Last edited by mlerch@mac.com; 02-23-2004 at 02:37 PM.
Old 02-23-2004, 02:44 PM   #24
Charter
Head Mole
 
 
Join Date: May 2003
Posts: 2,539
Hi. PhpDig looks for a robots.txt file to follow. PhpDig does not need a robots.txt file so that error means nothing. The admin panel is the admin/index.php file. What are all the entries seen in the access log when you put http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0 in the text box and click the dig button?
Old 02-23-2004, 03:02 PM   #25
mlerch@mac.com
Green Mole
 
Join Date: Feb 2004
Location: North Las Vegas, Nevada
Posts: 18
Ok... here is what I am getting, and it gets repeated over and over again. I actually have to restart Apache or it won't stop writing to the log, at what seems like a rate of one line per second.

xxx.xxx.xxx.xxx - - [23/Feb/2004:15:59:59 -0800] "HEAD /index.php HTTP/1.1" 302 - "-" "PhpDig/1.8.0 (+http://www.phpdig.net/robot.php)"


Isn't that weird? It's almost as if it is in a loop or something. Not even clicking the browser's stop button will stop that loop. That's a good way to crash the whole thing.
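That repeating "HEAD ... 302" line is what a redirect loop looks like from the server side: index.php keeps answering with a redirect, and the crawler keeps following it. A toy sketch of the failure mode and the obvious guard (tracking already-visited URLs), using a made-up redirect map rather than PhpDig's code:

```python
# Toy redirect map (made up for illustration): the two URLs redirect
# to each other, so a crawler that never remembers where it has been
# would request them forever -- one log line per request.
REDIRECTS = {
    "/index.php": "/index.php?navID=1&dis=1,1,1,1,1,1,0,0",
    "/index.php?navID=1&dis=1,1,1,1,1,1,0,0": "/index.php",
}

def follow(url, redirects, max_hops=25):
    """Follow redirects, stopping if a URL repeats or hops run out."""
    seen = set()
    while url in redirects:
        if url in seen or len(seen) >= max_hops:
            return url, True   # loop detected
        seen.add(url)
        url = redirects[url]
    return url, False          # reached a real page

final, looped = follow("/index.php", REDIRECTS)
print(looped)
```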

Last edited by mlerch@mac.com; 02-23-2004 at 03:05 PM.
Old 02-24-2004, 08:02 AM   #26
Charter
Head Mole
 
 
Join Date: May 2003
Posts: 2,539
Hi. Use the original spider.php file and use the attached file in place of the robot_functions.php file. From the admin panel, delete the site, click the delete button again with no site selected, click the clean dictionary link, and once cleaned type http://www.domain.com into the text box, select a search depth of two, click the dig button, and wait like ten minutes. What happens?
Attached Files
File Type: zip robot_functions.zip (12.4 KB, 8 views)
Old 02-24-2004, 09:41 AM   #27
mlerch@mac.com
Green Mole
 
Join Date: Feb 2004
Location: North Las Vegas, Nevada
Posts: 18
Hello Charter,

Thank you for all your help. I think I got it to work. It still won't jump to the https://secure.mydomain.com host/pages for indexing, but the pages on http://www.mydomain.com are indexing correctly now. You are the man.

I have not tried your new robot_functions file at this point, though. I will try it as soon as I find a free moment. At this point the non-indexing/stalling when it comes to an https://secure link is not that important, but I do want to revisit this issue, because someone else may need the solution.

Catch up with you later.

Mr. L
Old 02-25-2004, 02:54 PM   #28
webcat
Green Mole
 
Join Date: Feb 2004
Posts: 2
.htaccess

I am using .htaccess, and I can't get PhpDig to spider the protected folders, even if I put in a user and pass as described in this thread.

Is there some trick that has since been discovered?

Thanks for any tips!
Old 02-25-2004, 03:13 PM   #29
mlerch@mac.com
Green Mole
 
Join Date: Feb 2004
Location: North Las Vegas, Nevada
Posts: 18
Hello webcat.

As it turns out, the reason I was unable to spider the .htaccess-protected directory was not the .htaccess protection at all. As long as you write your URL to be spidered like:

http://username:password@www.mydomain.com/....
or
https://username:password@secure.mydomain.com/....

it should work. The problem that I had was that the server would go into a loop it couldn't get out of. I am using PHP pages that use different checking mechanisms and redirects, and it worked perfectly in a browser, but it sent PhpDig into a never-ending loop. So, if you are using PHP pages like described above, you may want to try the fix that Charter posted as a download a bit higher up in the thread.
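For what it's worth, the user:pass@host form works because the credentials ride along inside the URL itself, where standard URL parsing can pick them apart before making the authenticated request. A small illustration using Python's standard library (the hostname and credentials below are placeholders):

```python
from urllib.parse import urlsplit

# Placeholder URL in the user:pass@host form described above.
url = "https://username:password@secure.mydomain.com/protected/page.php"
parts = urlsplit(url)

# The userinfo portion decomposes into separate credential fields,
# and the hostname comes back without the credentials attached.
print(parts.username)   # username
print(parts.password)   # password
print(parts.hostname)   # secure.mydomain.com
```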

Hope that this will work for you.

Mr. L




Powered by vBulletin® Version 3.7.3
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright © 2001 - 2005, ThinkDing LLC. All Rights Reserved.