
Problem indexing site (uses mod_rewrite)


ragaller
03-09-2004, 07:33 AM
Hi there

I installed phpdig 1.8.0 on the site

www.personalsite.ch

After setting up the db and changing permissions I get the following (already described) output, when indexing:

SITE : http://www.personalsite.ch/
Exclude paths :
- @NONE@
1:http://www.personalsite.ch/
(time : 00:00:06)
No link in temporary table


links found : 1
http://www.personalsite.ch/
Optimizing tables...
Indexing complete !

My website relies heavily on mod_rewrite for URL rewriting. Could this affect the behaviour of phpdig? I switched off the RewriteEngine for the phpdig root folder.

Thank You for help.

Charter
03-09-2004, 10:52 AM
Hi ragaller, and welcome to PhpDig.net!

Perhaps try the mod attached in this (http://www.phpdig.net/showthread.php?threadid=573) thread.

Below is output with the mod and a search depth of one:

SITE : http://www.personalsite.ch/
Exclude paths :
- @NONE@
1:http://www.personalsite.ch/
(time : 00:00:10)
+ + + + + + +
level 1...
2:http://www.personalsite.ch/portfolio/
(time : 00:00:27)

3:http://www.personalsite.ch/info/
(time : 00:00:34)

4:http://www.personalsite.ch/kontakt/
(time : 00:00:41)

5:http://www.personalsite.ch/kontakt/oisjdfoijdf
(time : 00:00:48)

6:http://www.personalsite.ch/webdesign/
(time : 00:00:55)

7:http://www.personalsite.ch/web-it/
(time : 00:01:02)

8:http://www.personalsite.ch/grafik/
(time : 00:01:09)

No link in temporary table

--------------------------------------------------------------------------------

links found : 8
http://www.personalsite.ch/
http://www.personalsite.ch/portfolio/
http://www.personalsite.ch/info/
http://www.personalsite.ch/kontakt/
http://www.personalsite.ch/kontakt/oisjdfoijdf
http://www.personalsite.ch/webdesign/
http://www.personalsite.ch/web-it/
http://www.personalsite.ch/grafik/
Optimizing tables...
Indexing complete !

ragaller
03-10-2004, 12:48 AM
Hi Charter!

Thank You for the answer!

I got the engine working and producing the exact output You wrote in Your post for an indexing at depth one on the root level.

There is a problem related to mod_rewrite: a website using mod_rewrite needs to use absolute links (or root-relative ones). In the header the base part of the URL is set:

<base href="http://www.personalsite.ch" />

The browser (and hopefully Google) resolves every href entry against the base URL, resulting in a correct absolute URL.

Is it possible that phpdig does not read the base URL and treats the links as relative to the current page? If so, a search with depth one at the root level works, but digging deeper breaks.

The following is a part of the search at depth 2, showing the problem for grafik:


http://www.personalsite.ch/grafik/grafik/
http://www.personalsite.ch/grafik/webdesign/
...
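
To illustrate the difference described above, here is a minimal sketch of how the same link resolves against the page URL versus the base URL. The helper resolve_link() is hypothetical and not part of PhpDig, the href value 'webdesign/' is an assumption about how the links are written on the site, and the helper ignores "../" segments and query strings.

```php
<?php
// Hypothetical helper (not PhpDig code): resolve $href against $context.
function resolve_link(string $context, string $href): string
{
    // Already absolute (has a scheme): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }
    $parts = parse_url($context);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    // Root-relative link: attach to scheme and host only.
    if (strpos($href, '/') === 0) {
        return $root . $href;
    }
    // Document-relative link: attach to the directory of the context URL.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

$page = 'http://www.personalsite.ch/grafik/';
$base = 'http://www.personalsite.ch';   // value of the <base href> tag
$href = 'webdesign/';                   // assumed link as written in the page

$without_base = resolve_link($page, $href); // base tag ignored
$with_base    = resolve_link($base, $href); // base tag honoured

echo $without_base, "\n"; // http://www.personalsite.ch/grafik/webdesign/
echo $with_base, "\n";    // http://www.personalsite.ch/webdesign/
```

At depth one from the root, the page URL and the base URL point to the same directory, so both results coincide. That would explain why the error only shows up one level deeper.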



Jürgen

Charter
03-10-2004, 09:12 AM
Hi. For base href tags, perhaps try the code in this (http://www.phpdig.net/showthread.php?threadid=364) thread.

ragaller
03-11-2004, 12:56 AM
Hi Charter!

I tried indexing with the code for <base> Tag parsing (Your link in the previous post) - with or without the rewrite patch.

The result for me is still the same: The links are treated as relative ones.

I spidered www.personalsite.ch/grafik/

depth:1

phpdig found links like:

www.personalsite.ch/grafik/grafik/portfolio/
...

--> should be:

www.personalsite.ch/grafik/portfolio/

Any further ideas on this one? Maybe I set up something else the wrong way?

Thank You, Jürgen

p.s. personalsite.ch was off yesterday - it works now, just in case You'd like to try spidering.

Charter
03-11-2004, 02:59 PM
Hi. The code in that link won't work when the base href tag is something like <base href="http://www.personalsite.ch" /> because the regex in that code doesn't match it, so something else will have to be coded. In the meantime, to get rid of the name/name directories/files, just click the site, click the update button, and click the red circle noway symbol next to the bogus directories to delete and exclude them.

ragaller
03-13-2004, 10:16 AM
Hi Charter

I found a quick solution that seems to work for a website with root-relative links (like mine).

in robot_functions

after:

$file_content = @file($tempfile);

I added:

$path = '';

I know, this is just a quick and dirty workaround for my exotic case...

Jürgen

Charter
03-13-2004, 10:51 AM
Hi. Ah, I see the problem. The regex wasn't matching the base href tag. Using the code in the other thread, if you change the following:

if (eregi("<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*>",$base_regs1,$base_regs2)) {

to the following:

if (eregi("<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*[[:space:]]*[/]?>",$base_regs1,$base_regs2)) {

then that code should work.

Remember to remove any "word" wrapping in the above code.
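
For anyone who wants to check the two patterns outside PhpDig, here is a small sketch comparing them on both styles of base tag. It is only an approximation of the code above: eregi() was removed in PHP 7, so the sketch uses preg_match() with the same pattern text wrapped in ~ delimiters and the i flag. The variable names are made up for the example.

```php
<?php
// Same pattern text as the eregi() calls above, adapted for preg_match().
$old = "~<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*>~i";
$new = "~<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*[[:space:]]*[/]?>~i";

$html_tag  = '<base href="http://www.personalsite.ch">';
$xhtml_tag = '<base href="http://www.personalsite.ch" />';

// The old pattern expects '>' right after the closing quote.
$old_on_html  = preg_match($old, $html_tag);   // 1 (match)
$old_on_xhtml = preg_match($old, $xhtml_tag);  // 0 (the " />" ending is not matched)

// The new pattern also allows optional whitespace and an optional '/'.
$new_on_html  = preg_match($new, $html_tag);          // 1
$new_on_xhtml = preg_match($new, $xhtml_tag, $regs);  // 1

echo "old/html: $old_on_html, old/xhtml: $old_on_xhtml\n";
echo "new/html: $new_on_html, new/xhtml: $new_on_xhtml\n";
echo "captured base: {$regs[1]}\n"; // http://www.personalsite.ch
```

The only difference between the two patterns is the trailing [[:space:]]*[/]? before the closing >, which is what lets the self-closing form of the tag match.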

ragaller
03-16-2004, 11:22 PM
Hi Charter!


This works perfectly now for my site!


Thank You so much!