problem indexing password-protected directories


alexp
10-21-2003, 06:56 AM
Hi all,

I am not able to spider a directory protected by .htaccess.

I have set up a test here:

http://testt:testt@www.php-web-development.com/testphpdig/main.php

...but the script just shows:


SITE : http://www.php-web-development.com/
Exclude paths :
- @NONE@
1:http://www.php-web-development.com/testphpdig/main.php
(time : 00:00:00)
No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://www.php-web-development.com/testphpdig/main.php
Optimizing tables...
Indexing complete !


Indexing the same content with the .htaccess removed is no problem at all...

I'm using 1.6.2 vanilla and have tried setting PHPDIG_DEFAULT_INDEX to both true and false

I'd be grateful for any suggestions.

TIA

Alex

Charter
10-21-2003, 08:11 AM
Hi. Hmm. Perhaps instead of passing the username and password via the URL, it might work to look at the sites table in say phpMyAdmin, and for the protected site, add the username and password to that row of the table.

alexp
10-21-2003, 08:58 AM
Hi Charter,

Thanks for your reply. I checked in phpMyAdmin and the user:pass combination had already been correctly parsed by the script and entered into the DB.

Any other ideas? If you try to index:

http://testt:testt@www.php-web-development.com/testphpdig/main.php


..on your installation, do you get any links?

Thanks again,
Alex

Charter
10-21-2003, 07:50 PM
Hi. Perhaps this is related to the problem posted here (http://www.phpdig.net/showthread.php?threadid=86). Can you try and set self.parent.location to the absolute URL instead of the relative URL and see if that works?

alexp
10-22-2003, 03:04 AM
Hi Charter,

I think I've worked out the problem....

It's not related to relative META and JS links - the same "site" spiders fine without the .htaccess

In fact, this is now spidering fine:

http://testt:testt@www.php-web-development.com/testphpdig/main.php

BUT this isn't:
http://test%40domain.com:test@www.php-web-development.com/testphpdig/main.php

and nor is this:

http://test@domain.com:test@www.php-web-development.com/testphpdig/main.php

The first version sends the username with the "%40" still escaped, so the server answers "access denied" because that is not a valid user. The second version is parsed as if the host were "domain.com".

....so is there no way of sending an @ sign as part of a username?

Thanks for all your help...

Alex

Charter
10-22-2003, 04:09 AM
Ooh, do tell what you did to get it to work. :)

I can understand the %40 not working, but the @ sounds like a regex issue. With http://test@domain.com:test@www.php-web-development.com/testphpdig/main.php is the username and password in the sites table now blank?

alexp
10-22-2003, 04:24 AM
Hi,

Originally posted by Charter
Ooh, do tell what you did to get it to work.

Hmm, I wish I knew. I tried it again and it worked. Sorry.

Trying this in the spider box:

http://test@domain.com:test@php-web-development.com/testphpdig/main.php

...attempts to spider http://domain.com/ :)

phpMyAdmin shows this:

15 http://www.php-web-development.com/ 20031022121122 test%40domain.com test 0 0

16 http://domain.com/ 20031022132030 test 0 0


(so "test" is interpreted as the username, with the pw blank)

Cheers,
Alex
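
The two rows above suggest that the address is being split at the first "@". Here is a minimal sketch of one possible workaround, assuming PHP 7 or later: split the authority part at the last "@" and percent-decode the username before it is stored. The function name split_credentials is hypothetical and is not part of PhpDig.

```php
<?php
// Hypothetical helper, NOT part of PhpDig: separate "user:pass@" from a URL.
// The authority is split at the LAST "@", so an "@" inside the username survives.
function split_credentials(string $url): array
{
    // scheme, authority (everything up to the first "/"), optional path
    if (!preg_match('~^(https?://)([^/]*)(/.*)?$~i', $url, $m)) {
        return ['url' => $url, 'user' => '', 'pass' => ''];
    }
    $scheme    = $m[1];
    $authority = $m[2];
    $path      = $m[3] ?? '/';

    $at = strrpos($authority, '@');
    if ($at === false) {
        // No credentials in the URL at all.
        return ['url' => $scheme . $authority . $path, 'user' => '', 'pass' => ''];
    }

    $userinfo = substr($authority, 0, $at);
    $host     = substr($authority, $at + 1);

    // Username and password are separated by the first ":".
    $parts = explode(':', $userinfo, 2);

    return [
        'url'  => $scheme . $host . $path,
        // Decode escapes such as %40 so the server sees the real username.
        'user' => rawurldecode($parts[0]),
        'pass' => rawurldecode($parts[1] ?? ''),
    ];
}

print_r(split_credentials('http://test%40domain.com:test@www.php-web-development.com/testphpdig/main.php'));
```

With either the raw "@" or the "%40" form of the address, this yields the username test@domain.com, the password test, and the site URL http://www.php-web-development.com/testphpdig/main.php.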

Charter
10-22-2003, 04:38 AM
Ah, okay, that'll help me track it down. I'll keep you posted.

alexp
10-22-2003, 04:41 AM
I appreciate it...


You're welcome to use my test site to test with if you want.

http://www.php-web-development.com/testphpdig/main.php

the two valid user/pass combos are:

testt:testt

and

test@domain.com:test


The site is identical to the root domain, except for the .htaccess.

Thanks again,
Alex

Loewenherz
11-25-2003, 11:56 PM
Hi,

I have a problem with protected sites too. Maybe I don't understand the tips above (my English is not the best). PhpDig 1.6.4 says:

Warning: file( http://...@www.vdoh.de/robots.txt): failed to open stream: No such file or directory in /is/htdocs/xyz/www.vdoh.de/inc/search/admin/robot_functions.php on line 553

Warning: Variable passed to each() is not an array or object in /is/htdocs/30981/www.vdoh.de/inc/search/admin/robot_functions.php on line 554
SITE : http://www.vdoh.de/
Exclude paths :
- @NONE@
(time : 00:00:00)
No link in temporary table
links found : 0
...Was recently indexed
Optimizing tables...
Indexing complete ! [Back] to admin interface.

The URL to index is like:
http://username:password@www.vdoh.de/index.php

username and password are in the .htaccess

Charter
11-26-2003, 08:27 AM
Hi. What do the .htaccess and .htpasswd files look like?

The .htaccess file should have something in it like so:

AuthUserFile /full/path/to/.htpasswd
AuthGroupFile /dev/null
AuthName "Restricted Area"
AuthType Basic

require user Username

The .htpasswd file should have something in it like so:

Username:a1b2c3d4e5f6g

Loewenherz
11-27-2003, 12:08 PM
username and password are in the .htaccess
Oh sorry, the username and password are in the .htpasswd, of course.

Okay, what could the problem be?

Charter
11-27-2003, 12:25 PM
Hi. What HTML source output do you get when you run the following script?

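<?php
// Fetch robots.txt and print it line by line.
$site = 'http://www.vdoh.de/';
$robots = file($site.'robots.txt');
if ($robots === false) {
    // file() returns false when the request fails
    die('Could not read robots.txt');
}
for ($i = 0; $i < count($robots); $i++) {
    echo $robots[$i].'<br>';
}
?>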

I get the following:

User-agent:*
<br>
<br>Allow: /
<br>
<br>
<br>

Loewenherz
11-27-2003, 01:15 PM
Yes, that was a test today.

The real content of robots.txt is:
User-agent: *
Disallow:

Charter
11-27-2003, 01:56 PM
Hi. Please run the following. It may help me determine the problem.

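<?php
// Same test, this time with the credentials passed in the URL.
$site = 'http://user:pass@www.vdoh.de/';
$robots = file($site.'robots.txt');
if ($robots === false) {
    // file() returns false when the request fails, e.g. when authentication is refused
    die('Could not read robots.txt');
}
for ($i = 0; $i < count($robots); $i++) {
    echo $robots[$i].'<br>';
}
?>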

What do you get? I get the following when viewing the HTML source:

User-agent: *
<br>Disallow:<br>

Also, do the username and password that you are using in the URL match those that are in the sites table for this site?
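
The test script above passes the credentials inside the URL. As a cross-check, here is a minimal sketch that sends them in an HTTP Basic Authorization header instead, assuming PHP 7 or later with allow_url_fopen enabled. The function names basic_auth_header and fetch_protected are hypothetical and are not PhpDig code.

```php
<?php
// Hypothetical helpers, NOT PhpDig code.

// Build the HTTP Basic header from the plain-text username and password.
function basic_auth_header(string $user, string $pass): string
{
    return 'Authorization: Basic ' . base64_encode($user . ':' . $pass);
}

// Fetch a protected URL, sending the credentials in a header rather than in the URL.
// Returns the response body as a string, or false if the request fails.
function fetch_protected(string $url, string $user, string $pass)
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => basic_auth_header($user, $pass) . "\r\n",
            'timeout' => 10,
        ],
    ]);
    // "@" suppresses the PHP warning; the caller checks for false instead.
    return @file_get_contents($url, false, $context);
}
```

A call such as fetch_protected('http://www.vdoh.de/robots.txt', 'user', 'pass') (with the real username and password in place of the placeholders) should return the same robots.txt content if the credentials are accepted. Because the username goes into the header unescaped, a character such as "@" in it needs no special handling.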

Loewenherz
11-27-2003, 02:23 PM
Hi Charter,

Thanks for your help. Sorry, it is not so easy for me to post information in English :-(

Originally posted by Charter

'http://user:pass@www.vdoh.de/';

Sorry, I cannot post the real username and password here.
Originally posted by Charter

What do you get? I get the following when viewing the HTML source:

User-agent: *
<br>Disallow:<br>


Yes, I get the same.
Originally posted by Charter

Also, do the username and password that you are using in the URL match those that are in the sites table for this site?
This is the part I don't understand. Is it necessary to make changes in the database? If so, can you tell me the MySQL code?

Loewenherz
11-27-2003, 11:44 PM
Okay, back to the beginning:

There is a complete website. Part of it is open to all visitors, but there is a restricted area only for members of this association. The members want a search engine that can be found in the member area and indexes all pages of this project.

PhpDig indexes the public pages without problems. But what do I have to do to index the pages in the member area? There is an entry in the .htpasswd with a name and password only for the search engine, but how do I let the program know about it?

Loewenherz
11-28-2003, 12:21 AM
Maybe I found the relevant entry in the table phpdig_sites:

2 http://www.vdoh.de/docs/verband/ 20031128101633 user passwd 0 0

Charter
11-28-2003, 03:13 AM
Hi. Thanks for your responses. I don't want you to post the true username and password. Yes, that is the correct row from the sites table. Are the username and password in that row correct? Do you get errors when you try to crawl http://user:passwd@www.vdoh.de/docs/verband/ rather than http://user:passwd@www.vdoh.de/index.php?

Loewenherz
11-28-2003, 06:02 AM
Are the username and password in that row correct? Do you get errors when you try to crawl http://user:passwd@www.vdoh.de/docs/verband/ rather than http://user:passwd@www.vdoh.de/index.php?
The username and password are correct, and I no longer get any errors.
BUT:

Spidering in progress...
SITE : http://www.vdoh.de/
Exclude paths :
- @NONE@
1:http://www.vdoh.de/docs/verband/
(time : 00:00:00)
No link in temporary table
links found : 1
http://www.vdoh.de/docs/verband/
Optimizing tables...
Indexing complete ! [Back] to admin interface.

PhpDig is not indexing all the pages in the member area.

Charter
11-28-2003, 10:08 AM
Hi. Does the http://www.vdoh.de/docs/verband/ page do a JavaScript or META redirect once logged in? If so, try using the full URL in the redirect. If this doesn't work or isn't applicable, can you setup a password protected test area on your site that I could try and index?
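
To check a page for such a redirect, a minimal sketch follows, assuming PHP 7.1 or later. It only looks for a META refresh tag and reports whether its target is a full URL. The function names meta_refresh_target and is_absolute_url are hypothetical and not part of PhpDig, and the pattern is a simplification that expects http-equiv before content.

```php
<?php
// Hypothetical helpers, NOT part of PhpDig.

// Return the target of a <meta http-equiv="refresh"> tag, or null if none is found.
function meta_refresh_target(string $html): ?string
{
    $pattern = '~<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*content=["\']\s*\d+\s*;\s*url=([^"\'>]+)~i';
    return preg_match($pattern, $html, $m) ? trim($m[1]) : null;
}

// True when the URL starts with http:// or https://.
function is_absolute_url(string $url): bool
{
    return (bool) preg_match('~^https?://~i', $url);
}

$html   = '<meta http-equiv="refresh" content="0; url=main.php">';
$target = meta_refresh_target($html);
if ($target !== null && !is_absolute_url($target)) {
    echo "Relative redirect found: $target\n";
}
```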

Loewenherz
11-29-2003, 02:27 AM
Hi Charter,
Originally posted by Charter
Does the http://www.vdoh.de/docs/verband/ page do a JavaScript or META redirect once logged in?
No.

I can send you the password and username for the member area by private message.

Thanks for your help.

Another problem: I installed PhpDig in /inc/search/, but this directory is outside the member area. So I copied the PhpDig files to /docs/verband/suche/. PhpDig works in this location, but I cannot get into the PhpDig administration (the browser waits without ever getting an answer). Is there anything to change in the database?

Charter
11-29-2003, 09:22 AM
Hi. Does a username and password box pop up when you try to enter the administration? There should be nothing to change in the database tables.

Loewenherz
12-01-2003, 11:37 AM
Originally posted by Charter
Does a username and password box pop up when you try to enter the administration?
No. Nothing happens. No pop-up, no error, just endless waiting.

Charter
12-01-2003, 12:03 PM
Hi. When I go to http://www.vdoh.de/docs/verband/suche/admin/ I get a popup box asking for a username and password. Is this the same link that you go to and get endless waiting?

Loewenherz
12-01-2003, 11:43 PM
Yes. I tried this page with another browser and I do see the pop-up, but I cannot log in. Okay, forget it. I'll use PhpDig in the restricted area for searching and in the other area for administration. I think this is possible.

The only important problem is indexing the pages in the member area.

Loewenherz
12-03-2003, 12:10 AM
After re-indexing the root directory, PhpDig seems to have indexed part of the member area:

Level 1...
16:http://www.vdoh.de/inc/files/beitrittsformular.pdf
(time : 00:00:03)
17:http://www.vdoh.de/docs/verband/
(time : 00:00:03)
18:http://www.vdoh.de/docs/verband/recht.php
(time : 00:00:03)
19:http://www.vdoh.de/docs/verband/suche/search.php
(time : 00:00:03)
20:http://www.vdoh.de/docs/verband/ins.php
(time : 00:00:03)
21:http://www.vdoh.de/docs/verband/press.php
(time : 00:00:03)
22:http://www.vdoh.de/docs/verband/veran.php
(time : 00:00:03)
23:http://www.vdoh.de/docs/verband/aktuelles.php
(time : 00:00:03)
24:http://www.vdoh.de/docs/verband/grem.php
(time : 00:00:03)
No link in temporary table
links found : 24
http://www.vdoh.de/docs/dienstleistungen.php
http://www.vdoh.de/docs/geschaeftsstelle.php
http://www.vdoh.de/docs/satzung.php
http://www.vdoh.de/docs/beitrag.php
http://www.vdoh.de/docs/links.php
http://www.vdoh.de/inc/html/navi.php
http://www.vdoh.de/docs/anfahrt1.php
http://www.vdoh.de/docs/kontakt.php
http://www.vdoh.de/docs/passw.php
http://www.vdoh.de/docs/impressum.php
http://www.vdoh.de/docs/anfahrt.php
http://www.vdoh.de/docs/mitglied.php
http://www.vdoh.de/docs/personalia.php
http://www.vdoh.de/docs/
http://www.vdoh.de/
http://www.vdoh.de/inc/files/beitrittsformular.pdf
http://www.vdoh.de/docs/verband/
http://www.vdoh.de/docs/verband/recht.php
http://www.vdoh.de/docs/verband/suche/search.php
http://www.vdoh.de/docs/verband/ins.php
http://www.vdoh.de/docs/verband/press.php
http://www.vdoh.de/docs/verband/veran.php
http://www.vdoh.de/docs/verband/aktuelles.php
http://www.vdoh.de/docs/verband/grem.php
Optimizing tables...


But I cannot find these pages in the index. The member area also has many more pages, based on a PHP news system. And the second host, /docs/verband/, with the password entry in the database, is gone.

Loewenherz
12-03-2003, 12:35 AM
And the second host /docs/verband/ with the password-entry in the database is gone
Okay, I changed the database entry of the root URL and added the username and password. Now PhpDig found 150 links! Wow!
In the database I found only:
DataBase status
Hosts : 1 Entries
Pages : 69 Entries

???

But it works!!! Yippie.

renehaentjens
12-05-2003, 01:30 AM
I have the same problem as Alex in the original post, but as there are no funny characters in my URL, I cannot solve it Alex's way.

Username/password are in the DB, I have no meta-refresh stuff (I don't even know what it is), yet, with .htaccess, like Alex, I never get beyond the first page, and I always get:

links found : 1
http://myserver.../subdir/mystart.php
Optimizing tables...
Indexing complete !

Without .htaccess, all generated pages are indexed...

Any fresh ideas on this?

Charter
12-05-2003, 08:34 AM
Hi. It's not a fresh idea, but the username and password stored in the database for the site need to match the username and the plain-text (unencrypted) password behind the entry in the .htpasswd file. Another option would be to crawl the directory without the .htaccess file and then reinstate the .htaccess file, so that users who click a protected search result link are prompted for their username and password. Other than what Alex wants to do, I haven't been able to replicate the other issues listed in this thread.
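
To check whether a plain-text password really belongs to an .htpasswd entry, here is a minimal sketch, assuming PHP 7 or later and an entry in a format that PHP's crypt() understands (for example traditional DES crypt, like the 13-character sample earlier in the thread). Entries in Apache's $apr1$ MD5 format are not handled. The function name htpasswd_line_matches is hypothetical and not part of PhpDig.

```php
<?php
// Hypothetical helper, NOT part of PhpDig: check a plain-text password
// against a single "Username:hash" line from an .htpasswd file.
// Assumes the hash is in a format crypt() supports; $apr1$ entries will not match.
function htpasswd_line_matches(string $line, string $user, string $plain): bool
{
    $parts = explode(':', trim($line), 2);
    if (count($parts) !== 2 || $parts[0] !== $user) {
        return false;
    }
    $stored = $parts[1];
    // Re-hash the candidate password using the stored hash as the salt,
    // then compare in constant time.
    return hash_equals($stored, (string) crypt($plain, $stored));
}

// Example with a locally generated DES-crypt entry (salt "a1" chosen arbitrarily).
$line = 'Username:' . crypt('secret', 'a1');
var_dump(htpasswd_line_matches($line, 'Username', 'secret'));
var_dump(htpasswd_line_matches($line, 'Username', 'wrong'));
```

If the check fails for the username and password held in the sites table, the crawler will be refused in the same way a browser would be.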