Problem indexing site (uses mod_rewrite)
ragaller
03-09-2004, 06:33 AM
Hi there,
I installed phpdig 1.8.0 on the site
www.personalsite.ch
After setting up the database and changing permissions, I get the following (already described) output when indexing:
SITE : http://www.personalsite.ch/
Exclude paths :
- @NONE@
1:http://www.personalsite.ch/
(time : 00:00:06)
No link in temporary table
links found : 1
http://www.personalsite.ch/
Optimizing tables...
Indexing complete !
My website relies heavily on mod_rewrite for URL rewriting. Could this affect the behaviour of phpdig? I switched off the RewriteEngine for the phpdig root folder.
Thank you for your help.
Charter
03-09-2004, 09:52 AM
Hi ragaller, and welcome to PhpDig.net!
Perhaps try the mod attached in this thread (http://www.phpdig.net/showthread.php?threadid=573).
Below is output with the mod and a search depth of one:
SITE : http://www.personalsite.ch/
Exclude paths :
- @NONE@
1:http://www.personalsite.ch/
(time : 00:00:10)
+ + + + + + +
level 1...
2:http://www.personalsite.ch/portfolio/
(time : 00:00:27)
3:http://www.personalsite.ch/info/
(time : 00:00:34)
4:http://www.personalsite.ch/kontakt/
(time : 00:00:41)
5:http://www.personalsite.ch/kontakt/oisjdfoijdf
(time : 00:00:48)
6:http://www.personalsite.ch/webdesign/
(time : 00:00:55)
7:http://www.personalsite.ch/web-it/
(time : 00:01:02)
8:http://www.personalsite.ch/grafik/
(time : 00:01:09)
No link in temporary table
--------------------------------------------------------------------------------
links found : 8
http://www.personalsite.ch/
http://www.personalsite.ch/portfolio/
http://www.personalsite.ch/info/
http://www.personalsite.ch/kontakt/
http://www.personalsite.ch/kontakt/oisjdfoijdf
http://www.personalsite.ch/webdesign/
http://www.personalsite.ch/web-it/
http://www.personalsite.ch/grafik/
Optimizing tables...
Indexing complete !
ragaller
03-09-2004, 11:48 PM
Hi Charter!
Thank you for the answer!
I got the engine working, and it produces exactly the output you posted for indexing at depth one from the root level.
There is a problem related to mod_rewrite: a website using mod_rewrite needs to use absolute (or root-relative) links. In the header, the base part of the URL is set:
<base href="http://www.personalsite.ch" />
The browser (and hopefully Google) resolves each href entry against the base URL, producing a correct absolute URL.
Is it possible that phpdig does not read the base URL and instead treats the links as relative to the current page? If so, a search at depth one from the root level works, but digging deeper breaks.
The following is a part of the search at depth 2, showing the problem for grafik:
http://www.personalsite.ch/grafik/grafik/
http://www.personalsite.ch/grafik/webdesign/
...
Jürgen
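The difference described above can be reproduced with a small sketch. It is written in Python rather than PhpDig's PHP and is not PhpDig's code: it only uses the standard library's `urljoin` to show what happens when a relative link is resolved against the page URL instead of the declared base. The link value `grafik/` and the helper name `resolve` are assumptions made for illustration; the thread does not show the site's actual href values.

```python
from urllib.parse import urljoin

# Hypothetical values for illustration: the page being crawled, the
# <base href> it declares, and one relative link found on that page.
PAGE_URL = "http://www.personalsite.ch/grafik/"
BASE_HREF = "http://www.personalsite.ch"
LINK = "grafik/"


def resolve(link, page_url, base_href=None):
    """Resolve a link against <base href> if one is given, else the page URL."""
    return urljoin(base_href or page_url, link)


# Ignoring <base href>: the link is resolved against the current page.
print(resolve(LINK, PAGE_URL))             # http://www.personalsite.ch/grafik/grafik/

# Honouring <base href>: the link is resolved against the site root.
print(resolve(LINK, PAGE_URL, BASE_HREF))  # http://www.personalsite.ch/grafik/
```

At the root level both calls give the same result, which would explain why a depth-one crawl from the root looks correct while deeper levels produce doubled directories.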
Charter
03-10-2004, 08:12 AM
Hi. For base href tags, perhaps try the code in this thread (http://www.phpdig.net/showthread.php?threadid=364).
ragaller
03-10-2004, 11:56 PM
Hi Charter!
I tried indexing with the code for <base> tag parsing (from the link in your previous post), both with and without the rewrite patch.
The result is still the same for me: the links are treated as relative ones.
I spidered www.personalsite.ch/grafik/
depth:1
phpdig found links like:
www.personalsite.ch/grafik/grafik/portfolio/
...
--> should be:
www.personalsite.ch/grafik/portfolio/
Any further ideas on this one? Maybe I set up something else the wrong way?
Thank you, Jürgen
P.S. personalsite.ch was down yesterday. It works now, in case you'd like to try spidering.
Charter
03-11-2004, 01:59 PM
Hi. The code in that link won't work when the base href tag is something like <base href="http://www.personalsite.ch" />, because the regex in that code doesn't match it, so something else will have to be coded. In the meantime, to get rid of the name/name directories/files, just click the site, click the update button, and click the red circle "no way" symbol next to the bogus directories to delete and exclude them.
ragaller
03-13-2004, 09:16 AM
Hi Charter
I found a quick solution that seems to work for a website with root-relative links (like mine).
in robot_functions
after:
$file_content = @file($tempfile);
I added:
$path = '';
I know this is just a quick-and-dirty workaround for my exotic case...
Jürgen
Charter
03-13-2004, 09:51 AM
Hi. Ah, I see the problem. The regex wasn't matching the base href tag. Using the code in the other thread, if you change the following:
if (eregi("<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*>",$base_regs1,$base_regs2)) {
to the following:
if (eregi("<base href[[:space:]]*=[[:space:]]*['\"]*([a-z]{3,5}://[.a-z0-9-]+[^'\"]*)['\"]*[[:space:]]*[/]?>",$base_regs1,$base_regs2)) {
then that code should work.
Remember to remove any "word" wrapping in the above code.
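To see why the extra `[[:space:]]*[/]?` matters, here is a rough Python translation of the two patterns. It is only a sketch: Python's `re` module stands in for PHP's `eregi`, `\s` stands in for `[[:space:]]`, and the names `OLD`, `NEW` and `base_href` are made up for illustration.

```python
import re

# Python translations of the two POSIX patterns quoted above. [[:space:]]
# becomes \s, and re.IGNORECASE stands in for eregi's case-insensitivity.
OLD = re.compile(
    r"""<base href\s*=\s*['"]*([a-z]{3,5}://[.a-z0-9-]+[^'"]*)['"]*>""",
    re.IGNORECASE,
)
NEW = re.compile(
    r"""<base href\s*=\s*['"]*([a-z]{3,5}://[.a-z0-9-]+[^'"]*)['"]*\s*[/]?>""",
    re.IGNORECASE,
)


def base_href(pattern, html):
    """Return the URL captured from a <base href> tag, or None if no match."""
    match = pattern.search(html)
    return match.group(1) if match else None


xhtml_tag = '<base href="http://www.personalsite.ch" />'

# The old pattern expects ">" right after the closing quote, so the
# XHTML-style " />" ending prevents a match.
print(base_href(OLD, xhtml_tag))  # None

# The new pattern allows optional whitespace and an optional "/" before ">".
print(base_href(NEW, xhtml_tag))  # http://www.personalsite.ch
```

A plain HTML-style tag such as `<base href="http://www.personalsite.ch/">` is matched by both versions, so the change only widens what is accepted.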
ragaller
03-16-2004, 10:22 PM
Hi Charter!
This works perfectly now for my site!
Thank you so much!