PhpDig.net

03-09-2004, 06:33 AM   #1
ragaller
Green Mole
 
Join Date: Mar 2004
Posts: 5
Problem indexing site (uses mod_rewrite)

Hi there

I installed phpdig 1.8.0 on the site

www.personalsite.ch

After setting up the db and changing permissions I get the following (already described) output, when indexing:

Quote:
SITE : http://www.personalsite.ch/
Exclude paths :
- @NONE@
1:http://www.personalsite.ch/
(time : 00:00:06)
No link in temporary table


links found : 1
http://www.personalsite.ch/
Optimizing tables...
Indexing complete !
My website relies heavily on mod_rewrite for URL rewriting. Could this affect the behaviour of PhpDig? I switched off the RewriteEngine for the PhpDig root folder.

Thank you for your help.
03-09-2004, 09:52 AM   #2
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi ragaller, and welcome to PhpDig.net!

Perhaps try the mod attached in this thread.

Below is output with the mod and a search depth of one:

SITE : http://www.personalsite.ch/
Exclude paths :
- @NONE@
1:http://www.personalsite.ch/
(time : 00:00:10)
+ + + + + + +
level 1...
2:http://www.personalsite.ch/portfolio/
(time : 00:00:27)

3:http://www.personalsite.ch/info/
(time : 00:00:34)

4:http://www.personalsite.ch/kontakt/
(time : 00:00:41)

5:http://www.personalsite.ch/kontakt/oisjdfoijdf
(time : 00:00:48)

6:http://www.personalsite.ch/webdesign/
(time : 00:00:55)

7:http://www.personalsite.ch/web-it/
(time : 00:01:02)

8:http://www.personalsite.ch/grafik/
(time : 00:01:09)

No link in temporary table

--------------------------------------------------------------------------------

links found : 8
http://www.personalsite.ch/
http://www.personalsite.ch/portfolio/
http://www.personalsite.ch/info/
http://www.personalsite.ch/kontakt/
http://www.personalsite.ch/kontakt/oisjdfoijdf
http://www.personalsite.ch/webdesign/
http://www.personalsite.ch/web-it/
http://www.personalsite.ch/grafik/
Optimizing tables...
Indexing complete !
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
03-09-2004, 11:48 PM   #3
ragaller
Green Mole
 
Join Date: Mar 2004
Posts: 5
Hi Charter!

Thank you for the answer!

I got the engine working, and it produces exactly the output you posted for indexing at depth one from the root level.

There is a problem related to mod_rewrite: a website using mod_rewrite needs to use absolute links (or root-relative ones). In the header, the base part of the URL is set:

Quote:
<base href="http://www.personalsite.ch" />
The browser (and hopefully Google) builds the correct absolute URL by appending each href entry to the base URL.

Is it possible that PhpDig does not read the base URL and treats the links as relative to the current page? If so, a search with depth one at the root level works, but digging deeper breaks.

The following is part of the search at depth 2, showing the problem for grafik:


Jürgen
03-10-2004, 08:12 AM   #4
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. For base href tags, perhaps try the code in this thread.
03-10-2004, 11:56 PM   #5
ragaller
Green Mole
 
Join Date: Mar 2004
Posts: 5
Hi Charter!

I tried indexing with the code for <base> tag parsing (your link in the previous post), both with and without the rewrite patch.

The result for me is still the same: The links are treated as relative ones.

I spidered www.personalsite.ch/grafik/

depth:1

phpdig found links like:

www.personalsite.ch/grafik/grafik/portfolio/
...

--> should be:

www.personalsite.ch/grafik/portfolio/
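
The doubled directory can be reproduced outside PhpDig. The snippet below is a minimal Python sketch, not PhpDig code: it assumes the link on the page is written as `grafik/portfolio/` (the thread does not show the actual href) and uses the standard library's `urljoin` to resolve it once against the page URL and once against the `<base href>` value.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.personalsite.ch/grafik/"  # page being spidered
BASE_HREF = "http://www.personalsite.ch"         # value of the <base href> tag
HREF = "grafik/portfolio/"                       # assumed href on that page

# Ignoring <base>: the href is resolved against the page's own directory.
without_base = urljoin(PAGE_URL, HREF)

# Honouring <base>: the href is resolved against the declared base URL.
with_base = urljoin(BASE_HREF, HREF)

print(without_base)  # http://www.personalsite.ch/grafik/grafik/portfolio/
print(with_base)     # http://www.personalsite.ch/grafik/portfolio/
```

The first result matches the bogus link reported above, which is consistent with the spider not applying the base URL.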

Any further ideas on this one? Maybe I set up something else the wrong way?

Thank You, Jürgen

P.S. personalsite.ch was down yesterday. It works now, in case you'd like to try spidering.
03-11-2004, 01:59 PM   #6
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. The code in that link won't work when the base href tag is something like <base href="http://www.personalsite.ch" />, because the regex in that code doesn't match it, so something else will have to be coded. In the meantime, to get rid of the name/name directories/files, click the site, click the update button, and then click the red circle "no way" symbol next to the bogus directories to delete and exclude them.
03-13-2004, 09:16 AM   #7
ragaller
Green Mole
 
Join Date: Mar 2004
Posts: 5
Hi Charter

I found a quick solution that seems to work for a website with root-relative links (like mine).

In robot_functions

after:

$file_content = @file($tempfile);

I added:

$path = '';

I know this is just a quick and dirty workaround for my exotic case...
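
Why emptying the path helps can be pictured with a toy model. This is a Python sketch under a loud assumption: it supposes the spider builds each link as site root + current directory path + href. The function `build_link` is hypothetical and is not taken from PhpDig's source.

```python
def build_link(site: str, path: str, href: str) -> str:
    """Hypothetical model of link composition: site root + current path + href."""
    return site + path + href


SITE = "http://www.personalsite.ch/"
HREF = "grafik/portfolio/"  # assumed href, written relative to the site root

# With the current directory as the path, the directory name is doubled.
doubled = build_link(SITE, "grafik/", HREF)

# With the path forced to an empty string, the href resolves from the root.
fixed = build_link(SITE, "", HREF)

print(doubled)  # http://www.personalsite.ch/grafik/grafik/portfolio/
print(fixed)    # http://www.personalsite.ch/grafik/portfolio/
```

Under the same assumption, forcing an empty path would break any site whose links really are relative to the current directory, which is why it only suits this particular setup.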

Jürgen
03-13-2004, 09:51 AM   #8
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Ah, I see the problem. The regex wasn't matching the base href tag. Using the code in the other thread, if you change the following:
PHP Code:
if (eregi("<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*>",$base_regs1,$base_regs2)) { 
to the following:
PHP Code:
if (eregi("<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*[[:space:]]*[/]?>",$base_regs1,$base_regs2)) { 
then that code should work.

Remember to remove any "word" wrapping in the above code.
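
The difference between the two patterns can be checked in isolation. The following is a Python sketch, not the PhpDig code itself: it translates the POSIX expressions into Python's `re` syntax (`[[:space:]]` becomes `\s`, `[/]?` becomes `/?`, and the case-insensitivity of `eregi` becomes `re.IGNORECASE`). The helper `base_href` and the variable names are made up for illustration.

```python
import re

# Shared part of both patterns: "<base href = " plus the captured URL and quotes.
PREFIX = r"""<base href\s*=\s*['"]*([a-z]{3,5}://[.a-z0-9-]+[^'"]*)['"]*"""

OLD = re.compile(PREFIX + r">", re.IGNORECASE)        # requires ">" right after the quote
NEW = re.compile(PREFIX + r"\s*/?>", re.IGNORECASE)   # also allows optional space and "/"


def base_href(pattern, html):
    """Return the captured base URL, or None if the pattern does not match."""
    match = pattern.search(html)
    return match.group(1) if match else None


xhtml_tag = '<base href="http://www.personalsite.ch" />'
html_tag = '<base href="http://www.personalsite.ch">'

print(base_href(OLD, xhtml_tag))  # None
print(base_href(NEW, xhtml_tag))  # http://www.personalsite.ch
print(base_href(OLD, html_tag))   # http://www.personalsite.ch
print(base_href(NEW, html_tag))   # http://www.personalsite.ch
```

The old pattern fails only on the self-closing form because it expects `>` immediately after the closing quote, while the tag continues with ` />`.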
03-16-2004, 10:22 PM   #9
ragaller
Green Mole
 
Join Date: Mar 2004
Posts: 5
Hi Charter!


This works perfectly now for my site!


Thank you so much!