Can't get PHPDig to index an htaccess protected site [Archive]

mlerch@mac.com

02-19-2004, 05:20 PM

Hello,

I installed PHPdig. I am getting the admin/index.php page. I set the ../admin/temp directory, the ../includes directory and the ../text_content directory to chmod 777 (don't know if this is how it is supposed to be).

I am not getting any database errors.

So I go ahead and enter a URL to be spidered into the text box. For example:

http://username:password@www.mydomain.com

Nothing happens. It just hangs.. doesn't even go to the next page.

However, if I write:
http://www.mydomain.com

It goes to the next page, but doesn't find any links on that page. Totally weird.

When I go ahead and remove the htaccess username/password protection from that website and try it again with:
http://www.mydomain.com

It does find the links and seems to spider it correctly.

So this is the first of my problems.

The second one is completely different:

I would like to split that admin directory out of the phpdig directory. I would like to stick that admin directory into an existing admin directory on an https server and rename that phpdig admin directory to "search_tools" or something like that. So, I want to access the phpdig admin directory with:

https://secure.mydomain.com/admin/search_tools/index.php

and the regular site search form with:

http://www.mydomain.com/search.php

Is that possible? Or would I have to change pretty much every single link in the admin directory and so forth?

Please advise.

Thank you. Great software, and the price is right. :) Better than the commercial license for Atomz for $15K per year !!!! Hope that I can get it working for us.

Mr. L

Charter

02-20-2004, 08:54 AM

>> I set the ../admin/temp directory, the ../includes directory and the ../text_content directory to chmod 777 (don't know if this is how it is supposed to be).

Hi. Yes, those are the correct directories to set to 777 permissions.

>> So I go ahead and enter a URL to be spidered into the text box. For example: http://username:password@www.mydomain.com
Nothing happens. It just hangs.. doesn't even go to the next page.

Are you able to access http://username:password@www.mydomain.com from the browser window without using PhpDig? What OS/setup are you using?

>> However, if I write: http://www.mydomain.com
It goes to the next page, but doesn't find any links on that page. Totally weird.

As the directory is username/password protected, PhpDig doesn't have access so it doesn't find any links.

>> When I go ahead and remove the htaccess username/password protection from that website and try it again with: http://www.mydomain.com
It does find the links and seems to spider it correctly.

Without the username/password protection, PhpDig has access and can find links.

>> I would like to split that admin directory out of the phpdig directory...

Try installing everything in the secure search_tools directory and then move the search.php file where wanted and make the following edits:

In search.php edit $relative_script_path = '.'; to reflect the directory of the PhpDig install, something like $relative_script_path = '../search_tools'; or $relative_script_path = './secure/admin/search_tools'; depending on your setup.

In config.php edit the first line of code (the code checking the $relative_script_path variable) so that it contains && ($relative_script_path != "fill_in") where fill_in matches what $relative_script_path = '.'; gets set to in the search.php file.

mlerch@mac.com

02-20-2004, 10:07 AM

Hi Charter,

Thanks for your pointers. Here is what I did. I opened a new browser and typed:

http://username:password@www.mydomain.com

Worked like a charm. Tried it again in PhpDig and it hung itself.

My Server configuration is as follows:

Mac OS X 10.2.8
iTools Apache 2
PHP -latest (safe_mode disabled)
MySQL - latest

I have not tried your instructions regarding moving the admin portion on a secure https server and leaving the search.php outside. I'll check it out this weekend.

Also, I need to integrate the search box in a very simplified version (just a text field and a button) into the site template system, and have an "Advanced Search" link that will go to its own search.php (or better, advanced_search.php) page. I also need all of the results to come up customized on the "search_results.php" page on my site (I don't want them to pop up in a _blank page).

Are there any instructions on how to customize PhpDig that way. Please let me know, and thank you so much for your help. It's a wonderful tool.

Sincerely,

Mr. L

Charter

02-20-2004, 05:21 PM

>> ...PhpDig and it hung itself...

Hi. How long before it hangs? What happens if you wait like say ten minutes or so? Does it still seem to hang?

>> ...any instructions on how to customize PhpDig...

Most of your customization questions have been answered in one way or another somewhere in the forums. ;)

mlerch@mac.com

02-20-2004, 05:53 PM

Hi Charter,

Thanks for your answer. I will look for the customization stuff in the forum. Not a problem.

Regarding the spidering of the htaccess protected site. I created a demo user and password for the spidering. This user can access the site perfectly when accessing it through a browser. When I let it run in PhpDig it won't even go to the spider.php page... the browser says.... "... loading page" but that's it. It just hangs there. I let it run for like 20-30 minutes, but the spider.php page never loaded.

Don't know if this helps in any way. It's really strange.

Mr. L

Charter

02-20-2004, 05:56 PM

Hi. It is strange. Perhaps it's an OS/setup issue? Not sure. What happens if you go to the admin panel, click the site, click the update button, set the username and password there, and then try a reindex?

mlerch@mac.com

02-20-2004, 06:16 PM

Tried that already. It simply doesn't want to do it. I also tried a different htaccess protected site on my server. Same deal. I even tried different username:password combinations that I created. They all work in the web browser, but they don't work in the PhpDig spider.

Mr. L

Charter

02-20-2004, 06:23 PM

Hi. Basically the username:password combo is split by parse_url (http://www.php.net/manual/en/function.parse-url.php) so I'm wondering if there is something that is making the parse_url username:password combo not match what is in the .htaccess file.
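For anyone following along, here is a minimal sketch of what that split looks like, using the placeholder URL from this thread. It only shows how PHP's parse_url separates the credentials and how an HTTP Basic Authorization header is conventionally built from them. Whether PhpDig builds its header in exactly this way is an assumption, and the variable names are made up for illustration.

```php
<?php
// Placeholder URL taken from this thread; not a real site.
$url = 'http://username:password@www.mydomain.com';

// parse_url() returns the credentials as separate 'user' and 'pass' entries.
$parts = parse_url($url);

// HTTP Basic authentication sends base64("user:pass") in an Authorization header.
// Assumption: this mirrors the general scheme, not necessarily PhpDig's own code.
$authHeader = 'Authorization: Basic '
    . base64_encode($parts['user'] . ':' . $parts['pass']);

echo 'host: ', $parts['host'], "\n";
echo 'user: ', $parts['user'], "\n";
echo 'pass: ', $parts['pass'], "\n";
echo $authHeader, "\n";
```

If the user and pass printed here differ from what the server expects (for example because of special characters in the password), that mismatch would be one thing to rule out.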

mlerch@mac.com

02-20-2004, 08:09 PM

Hello Charter,

My server is not using an .htaccess file. It's done with the user/pass database authentication method. (not MySQL though.) I forget the name of it, but all the username:password combos are stored in a database. :)

I am sorry if I sound ignorant.. I just can't think of the name of that database file right now :)

Maybe that's the problem. But then again, I read some of your other answers to other .htaccess related posts, and I think I will go ahead and turn .htaccess off, spider it, then turn it back on. If this works all is well.

Thank you so much for all your help.

Mr. L

Charter

02-20-2004, 09:25 PM

>> ...server is not using an .htaccess file. It's done with the user/pass database authentication method...

Hi. Unmodified PhpDig cannot validate against a username/password DB or cookie/browser authentication method, and AFAIK no published mods exist that do. I assumed from your thread title that you were trying to index a .htaccess protected site. :eek:

mlerch@mac.com

02-21-2004, 06:33 AM

Ok... good that we have figured that one out. Still there is a problem as I have just found out. I really don't know what I am supposed to expect when spidering a site. What exactly should happen when I click the Dig This ! button. Is it supposed to hang there at the page indicating that a new page (spider.php) is loading, or is it supposed to load spider.php and basically print the spidering process back into the page, entry by entry, by entry? For me I click the button and it just hangs. I am going to let it run now for an hour or two and see if it is going to pop over to the spider.php page and start the indexing process. Is there any way to change the code so that I can watch the progress of the indexing? Please advise. Thank you.

Mr. L

Maybe it has nothing to do with the htaccess protection at all, but it has something to do with the vHost itself? Could that be it? Have there been any other reports that the PhpDig form pages simply "hangs" and does not proceed to the spider.php page after pressing the Dig This ! button? Please advise.

mlerch@mac.com

02-21-2004, 07:46 AM

Charter... do you think that this line of code on top of all pages of the site that is not indexing is causing PhpDig to croak?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

I checked all the other sites and I don't think they have it on there.

Mr. L

It's not indexing the site. I really don't know anymore what to do. The other sites are indexing perfectly, but this one isn't. When I try to test index another site on my server it works. The spider.php page loads and I can see the progress. Just when I do the site that formerly was htaccess protected it doesn't work. It stalls. I removed the access controls and all. So I can access it now without a username and password.

mlerch@mac.com

02-21-2004, 11:20 AM

Hi Charter,

So I did some more detailed looking into the problem. Here is what I found.

when spidering the URL that doesn't work (stalls):

I have traced it to:

In robot_functions.php

1. function phpdigDetectDir

In this function it parses the URL into the variable $test, then it goes through an if { then } else { then } statement. In my case it takes the else path because apparently $test['query'] is set.

Since it is taking the else { then } path, in the very first line robot_functions.php tries to define the following variable:

$status = phpdigTestUrl($link['url'].$link['path'].$link['file'],'date',$cookies);

This is where it seems to stall, so I checked into this function.

2. function phpdigTestUrl

It runs all the way through the "while" routine and it ends up where:
$status = "NOFILE";

at the very end of that function $mode does not seem to be 'date', so it is supposed to:

return $status;

I guess that is where it hangs.

Here are some details about the URL/website that I am trying to spider:

http://www.mydomain.com/index.php

index.php actually has in the very beginning a piece of script that checks if there is a variable string appended to index.php, and if it is formatted correctly.

If the script finds out that there is a formatting problem, or that there is no variable string at the end of .../index.php then it will grab the correct string and do a redirect to an URL like this:

http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0

Essentially when you were to go and type in the URL http://www.mydomain.com, or http://www.mydomain.com/index.php it will redirect you to:

http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0

Do you think that this is causing the problem? Please advise.

Oh yes, I actually tried to enter the URL into the PhpDig interface just like it would redirect it, but it still hangs with a NOFILE status.

Oh yes, why is $path always /robots.txt?
I don't really understand it enough I guess.

Thank you very much,

Mr. L

vinyl-junkie

02-21-2004, 11:27 AM

Originally posted by mlerch@mac.com
Have there been any other reports that the PhpDig form pages simply "hangs" and does not proceed to the spider.php page after pressing the Dig This ! button? Please advise.

I don't remember what I may have posted (and I'm too lazy to go look), but I was having some problems spidering my site that is on a Windows server. Mine was going to the spider.php page though. Perhaps what you're experiencing is a server related issue similar to mine? Just something you might want to explore.

mlerch@mac.com

02-21-2004, 05:42 PM

Ok... something very interesting happened. I let it run and run and run and finally I got this:

Spidering in progress...

Warning: fsockopen(): php_network_getaddresses: getaddrinfo failed: No address associated with nodename (is your IPV6 configuration correct? If this error happens all the time, try reconfiguring PHP using --disable-ipv6 option to configure) in /Library/.../admin/robot_functions.php on line 337

Warning: fsockopen(): unable to connect to www.mydomain.com:80 in /Library/.../admin/robot_functions.php on line 337
SITE : http://www.mydomain.com/
Exclude paths :
- @NONE@
No link in temporary table
links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete !

Line 337 is the following:
// this is part of function phpdigTestUrl($url,$mode='simple',$cookies=array()) {

if (isset($req1) && $req1) {
//close, and open a new connection
//on the new location
fclose($fp);
$fp = fsockopen($host,$port); // this is line 337

Any Idea what this is supposed to mean? As I mentioned before I am dealing with a script that checks and redirects if necessary with the correct string appended to the URL. See prior post.

Mr. L

Seems like the .htaccess is not the problem after all.

Charter

02-23-2004, 11:40 AM

>> Oh yes, I actually tried to enter the URL into the PhpDig interface just like it would redirect it, but it still hangs with a NOFILE status.

>> Warning: fsockopen(): php_network_getaddresses: getaddrinfo failed: No address associated with nodename (is your IPV6 configuration correct? If this error happens all the time, try reconfiguring PHP using --disable-ipv6 option to configure) in /Library/.../admin/robot_functions.php on line 337

Hi. The page here (http://bugs.php.net/bug.php?id=11058) may help.

mlerch@mac.com

02-23-2004, 11:55 AM

Hello Charter,

Here is what I did. I used a very simple .php document/script and named it test.php. All that this file does is print the words "Welcome to my site." into an html document.

I tried spidering this document and it actually picked up the words, no errors... nothing wrong. Worked perfectly on that very same domain that I am having problems with.

I also tried spidering a page that has a variable get string in the URL like: http://www.mydomain.com/testpage.php?cpath=5. It worked perfectly for that page as well.

The only thing that I do have problems with seems to be the URL get string that has commas in it:
http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0

Could it be that PhpDig does not like the comma values for the dis variable? At this point I am rather clueless on what it could be. There are different types of pages on that very same URL, those pages that are accessed and have the dis variable with the comma values won't work, and any other page does work.
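One quick way to rule out PHP itself is to check how its built-in URL functions treat the comma values. The sketch below uses the placeholder URL from this thread and only exercises parse_url and parse_str. It says nothing about PhpDig's own link-matching regex, which is a separate question.

```php
<?php
// Placeholder URL from this thread; not a real site.
$url = 'http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0';

// Split the URL, then decode the query string into an array.
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $params);

// The comma-separated value survives as a single string.
$flags = explode(',', $params['dis']);

echo 'navID: ', $params['navID'], "\n";
echo 'dis:   ', $params['dis'], "\n";
echo 'count: ', count($flags), "\n";
```

If these values come through intact, the commas are at least not a problem at the PHP parsing level.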

I really would like to get this to work. By the way, the prior described error has never happened again. I also double checked that the dns entries are correct in ns1 and ns2, and so forth, and it all seems to work correctly. The pages all work when I access them with a web browser. Please advise.

Thank you,

Mr. L

mlerch@mac.com

02-23-2004, 12:15 PM

Ok.. Here is some food for thought. I took that little script from the link that you wanted me to check out. and created a page like so:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html>
<head>
<title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>

<body>
<?php
$fd = fopen( "http://www.mydomain.com/index.php/", "r" );
if( !$fd )
{
echo "Cannot open URL";
} else {
while ( !feof( $fd ) ) {
$buffer = fgets( $fd, 4096 );
echo $buffer;
}
fclose ( $fd );
}
?>
</body>
</html>

I called it testing.php. Then I went to a web browser and called testing.php, and it printed the desired page fully intact into the browser window. Totally awesome. Then I went to PhpDig and typed in that very same URL for testing.php, and guess what. It spidered it no problem. So.... what do you think it could be? If I try spidering the page with its actual URL it doesn't work, but when I spider testing.php, which essentially opens and reads the page that I want to spider and echoes it back, then the spidering works. It only works on that one level, because other links that are generated on that particular page also use that dis=1,0,0,....... Now I am really lost here. Could it be that the code needs to be changed or adjusted so that the dis variable with the string of numbers with the commas won't throw a fit?

Please help :)

Mr. L

Charter

02-23-2004, 12:42 PM

Hi. The type of links that PhpDig follows are those that match the regex in this (http://www.phpdig.net/showthread.php?threadid=476) thread. The comma is already allowed in the query string so I don't think that is the problem but am not exactly sure.

However, you had mentioned crawling secure links. One thing came to mind. The fsockopen error and/or NOFILE would make sense I guess if PhpDig was looking for a file at http rather than at https or vice versa.

Try another test using a query string containing 1,1,1,1,1,1,0,0 but have everything using http instead of https, and use links not fopen. Does it index then?

If it does, then in the above linked thread, look at the code in the phpdigIndexFile function and change http to [a-z]{3,5} in the while line.

Also, set PHPDIG_IN_DOMAIN to true in the config file and apply the code change in this (http://www.phpdig.net/showthread.php?threadid=177) thread.

mlerch@mac.com

02-23-2004, 01:49 PM

Ok.. I think we isolated the problem. The commas in the query string must cause the problem. I did 2 separate tests. I rigged the testing.php document to access http://www.mydomain.com/index.php (in index.php is a script that will reformulate the URL, give the initial parameters for that page, and redirect with the header directive to the URL that has all the comma stuff behind it). When I try to spider testing.php it works. It is spidering the page correctly.

Then I did another test and I replaced the URL that I want to fopen with the actual URL of index.php, the one with the query string and commas behind it. The page seems to load just fine.

When I try to spider testing.php that way it indexes in 42 seconds. Isn't that interesting. If I set up a page ... testing.php, that fopens the url with the dis=1,1,1,1,0.... and try to spider testing.php it works, however when I try to spider the URL directly with PhpDig it stalls.

I did apply all the other changes per your instruction. Still no luck.

Mr. L

Charter

02-23-2004, 02:11 PM

Hi. Try indexing http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0 directly from the admin panel and then watch the raw logs. What link does PhpDig try to fetch? I notice a slash at the end of the link in the fopen(http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0/) warning.

mlerch@mac.com

02-23-2004, 02:17 PM

What do you mean by admin panel? By the way, I did correct my prior post and made some changes, but I saw that you had replied already. Apparently it did work with the little fopen demo file that I wrote. It is actually spidering the fopen demo file that points to http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0
As I mentioned before, it will spider it when called via the fopen demo file, but it will not spider that page when called directly.

mlerch@mac.com

02-23-2004, 02:29 PM

Ok... I took a look at the logs. The error log that is. Here is what I get when I try to access that URL directly:

[Mon Feb 23 15:25:53 2004] [error] [client xxx.xxx.xxx.xxx] File does not exist: /Path/to/Web/Server/dir/html/robots.txt

Nothing in the access logs.

Why is it looking for robots.txt?

The access log gives me:

xxx.xxx.xxx.xxx - - [23/Feb/2004:15:32:10 -0800] "HEAD /index.php HTTP/1.1" 302 - "-" "PhpDig/1.8.0 (+http://www.phpdig.net/robot.php)"

Charter

02-23-2004, 02:44 PM

Hi. PhpDig looks for a robots.txt file to follow. PhpDig does not need a robots.txt file so that error means nothing. The admin panel is the admin/index.php file. What are all the entries seen in the access log when you put http://www.mydomain.com/index.php?navID=1&dis=1,1,1,1,1,1,0,0 in the text box and click the dig button?

mlerch@mac.com

02-23-2004, 03:02 PM

Ok... here is what I am getting, and it is getting repeated and repeated over and over again. I actually have to restart the apache or it won't stop writing to the log at a rate of one line per second it seems.

xxx.xxx.xxx.xxx - - [23/Feb/2004:15:59:59 -0800] "HEAD /index.php HTTP/1.1" 302 - "-" "PhpDig/1.8.0 (+http://www.phpdig.net/robot.php)"

Isn't that weird? It's almost as if it is in a loop or something. Not even clicking stop on the browser button will stop that loop. That's a good way to crash the whole thing :)
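The same HEAD /index.php request answered with a 302 over and over is what a client looks like when it keeps following a redirect that never resolves. A common guard against that is a hop limit plus a record of URLs already visited. The sketch below is not PhpDig's code: followRedirects and the $redirects lookup table are hypothetical, and the table stands in for real HTTP responses so the logic can be run without a server.

```php
<?php
/**
 * Hypothetical helper: follow redirects starting at $start.
 * $redirects maps a URL to the URL it redirects to (a stand-in for 302 responses).
 * Returns the final URL, or null if a loop is detected or $maxHops is exceeded.
 */
function followRedirects(string $start, array $redirects, int $maxHops = 5): ?string
{
    $seen = [];
    $url = $start;
    $hops = 0;

    while (isset($redirects[$url])) {
        // Give up if this URL was already visited or too many hops were made.
        if (isset($seen[$url]) || $hops >= $maxHops) {
            return null;
        }
        $seen[$url] = true;
        $url = $redirects[$url];
        $hops++;
    }

    return $url; // no further redirect: this is the page to fetch
}

// What a browser sees: one redirect, then a normal page.
$browserView = ['/index.php' => '/index.php?navID=1&dis=1,1,1,1,1,1,0,0'];

// What the access log suggests the spider sees: /index.php answering 302 every time.
$spiderView = ['/index.php' => '/index.php'];

var_dump(followRedirects('/index.php', $browserView));
var_dump(followRedirects('/index.php', $spiderView));
```

With a guard like this the second case returns null instead of spinning until Apache is restarted.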

Charter

02-24-2004, 08:02 AM

Hi. Use the original spider.php file and use the attached file in place of the robot_functions.php file. From the admin panel, delete the site, click the delete button again with no site selected, click the clean dictionary link, and once cleaned type http://www.domain.com into the text box, select a search depth of two, click the dig button, and wait like ten minutes. What happens?

mlerch@mac.com

02-24-2004, 09:41 AM

Hello Charter,

Thank you for all your help. I think I got it to work. It still won't jump to the https://secure.mydomain.com host/pages for indexing. However the pages in http://www.mydomain.com are indexing correctly now. Got it to work. You are the man.

I have not tried your new robot_functions file at this point though. I will try it as soon as I find a free moment. At this point in time the non-indexing/stalling when it comes to a https://secure link is not that important, but I do want to revisit this issue, because maybe someone else may need the solution.

Catch up with you later.

Mr. L

webcat

02-25-2004, 02:54 PM

I am using .htaccess and I can't get phpdig to spider the protected folders, even if I put in a user and pass as described in this thread.

is there some trick that has since been discovered?

thanks for any tips!

mlerch@mac.com

02-25-2004, 03:13 PM

Hello webcat.

As it turns out, the reason why I was unable to spider the .htaccess protected directory was not the .htaccess protection at all. As long as you write your url to be spidered like:

http://username:password@www.mydomain.com/....
or
https://username:password@secure.mydomain.com/....

it should work. The problem that I had was that the server would go into a loop it couldn't get out of. I am using php pages that use different checking mechanisms and redirects, and it worked perfectly in a browser, but it sent PhpDig into a never-ending loop. So, if you are using php pages like the ones described above, you may want to try the fix that Charter posted as a download a bit higher up in the thread.

Hope that this will work for you.

Mr. L