Problems with URL parsing


apdejong
11-17-2003, 07:02 AM
Hi,

I'm trying to use phpDig, and it looks very good. However, when I try to index my site, URLs like:

http://www.webthings.nl/archive/2003/11/14/cyberterrorisme_blijkt_gewoon_een_hype(2)#body

will be rewritten to:

http://www.webthings.nl/archive/2003/11/14/cyberterrorisme_blijkt_gewoon_een_hype

The rewritten URL doesn't work. How can I fix this?

Thanks to the programmer! I searched the web, and phpDig was one of the best I could find.

Greets,
Arjan

Charter
11-17-2003, 09:57 AM
Hi. I'm not sure I understand the problem. When I index http://www.webthings.nl/archive/200...gewoon_een_hype(2)#body using one level I get the following results:

--------------------------------------------------------------------------------
SITE : http://www.webthings.nl/
Exclude paths :
- @NONE@
1:http://www.webthings.nl/archive/200...gewoon_een_hype(2)
(time : 00:00:04)
+
level 1...
Duplicate of an existing document
2:http://www.webthings.nl/archive/webthings_stylesheet.css
(time : 00:00:06)

No link in temporary table
--------------------------------------------------------------------------------

links found : 2
http://www.webthings.nl/archive/200...gewoon_een_hype(2)
http://www.webthings.nl/archive/webthings_stylesheet.css
Optimizing tables...
Indexing complete !

Then when I search on realhosting I see the following results:

1. [100.00 %] webthings/webdesign/webdesign nieuws
limit to http://www.webthings.nl/, this path : archive/

...2003 - Eduvision BV en Van Duuren Media - Hosting by Realhosting webthings/webdesign/webdesign nieuws webthings/webdesign/webdesign nieuws...

When I click the link, it links me to http://www.webthings.nl/archive/200...gewoon_een_hype(2) and I see your page.

When you do the above things, what do you see?

apdejong
11-18-2003, 03:05 AM
Hi Ruud,

I get the following:

Warning: is_executable() [function.is-executable]: open_basedir restriction in effect. File(/usr/local/bin/pstotext) is not within the allowed path(s): (/vhost/webthings.nl/home) in /vhost/webthings.nl/home/www/html/zoek/admin/robot_functions.php on line 635
Duplicate of an existing document
6:http://www.webthings.nl/archive/2003/11/14/nieuwe_worm_doet_zich_voor_als_paypalmail
(time : 00:00:03)

(see last line)

PhpDig has found the following:

links found : 9
http://www.webthings.nl/
http://www.webthings.nl/archive/2003/11/14/cyberterrorisme_blijkt_gewoon_een_hype
http://www.webthings.nl/pivot/submit.php?vote=good&piv_code=2&piv_weblog=webthings&group=k_
http://www.webthings.nl/pivot/submit.php?vote=bad&piv_code=2&piv_weblog=webthings&group=k_
http://www.webthings.nl/webthings/archives/archive_2003-m11.php
http://www.webthings.nl/archive/2003/11/14/nieuwe_worm_doet_zich_voor_als_paypalmail
http://www.webthings.nl/pivot/submit.php?vote=good&piv_code=1&piv_weblog=webthings&group=k_
http://www.webthings.nl/pivot/submit.php?vote=bad&piv_code=1&piv_weblog=webthings&group=k_
http://www.webthings.nl/pivot/kortnieuws.php?wtk=selected


As you can see, it will not index the trailing (5) etc. Strange that it works in your configuration.
I use the standard config (with Apache 1.3.27 and PHP 4.3.1).

Any ideas? What am I doing wrong?

Greets,
Arjan

Charter
11-18-2003, 09:59 AM
Hi. Try installing PhpDig inside the directory that open_basedir is set to. You can find this directory by looking at your PHP info ( <?php phpinfo(); ?> ) or by asking your host. Also, try changing the path to pstotext. If you have shell access and can use the locate command, you can find the correct path to pstotext ( locate pstotext ), or try asking your host. Otherwise, grab a copy of pstotext, place it in the open_basedir directory, and use that path. If is_executable continues to give you problems, you can set USE_IS_EXECUTABLE_COMMAND to zero in the config file.
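
To illustrate why the warning above appears, here is a minimal sketch that checks whether a path falls under an open_basedir value. The helper name isWithinOpenBasedir is hypothetical and not part of PhpDig, and the prefix comparison is a simplification of PHP's real open_basedir matching rules.

```php
<?php
// Hypothetical helper, not part of PhpDig.
// Simplified check: treats every open_basedir entry as a plain path prefix.
function isWithinOpenBasedir(string $path, string $openBasedir): bool
{
    if ($openBasedir === '') {
        return true; // no restriction configured
    }
    foreach (explode(PATH_SEPARATOR, $openBasedir) as $allowed) {
        if ($allowed === '') {
            continue; // skip empty entries
        }
        if (strpos($path, rtrim($allowed, '/')) === 0) {
            return true; // path starts with an allowed prefix
        }
    }
    return false;
}

// Values taken from the warning message in this thread.
$allowedDir  = '/vhost/webthings.nl/home';
$pstotext    = isWithinOpenBasedir('/usr/local/bin/pstotext', $allowedDir);
$robotScript = isWithinOpenBasedir('/vhost/webthings.nl/home/www/html/zoek/admin/robot_functions.php', $allowedDir);
```

On a live server, the second argument would presumably come from (string) ini_get('open_basedir') rather than a hard-coded value.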

apdejong
11-19-2003, 03:29 AM
Hi,

Thanks for the answer. However, the problem is not that the executables don't work; I don't want to index PDFs etc. anyway. The problem is that the system indexes URLs like

http://www.webthings.nl/archive/2003/11/14/nieuwe_worm_doet_zich_voor_als_paypalmail(1)

as

6:http://www.webthings.nl/archive/2003/11/14/nieuwe_worm_doet_zich_voor_als_paypalmail
(time : 00:00:02)

The (1) is dropped. I saw in your results that it gets indexed on your server, but it isn't indexed here, and I have no idea why. Do I need to change something in my config file?

Greets,
Arjan

Charter
11-19-2003, 08:05 AM
Hi. Apache 1.3.27 and PHP 4.3.1 under what OS?

What do you see when you run the following:

<?php
// remember to remove any "word" wrapping so the URL stays on one line
$url = "http://www.webthings.nl/archive/2003/11/14/nieuwe_worm_doet_zich_voor_als_paypalmail(1)";
// print the URL components as PHP parses them
print_r(parse_url($url));
?>

When viewing the HTML source, I get the following:

Array
(
[scheme] => http
[host] => www.webthings.nl
[path] => /archive/2003/11/14/nieuwe_worm_doet_zich_voor_als_paypalmail(1)
)
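
As a further check along the same lines, the sketch below separates a URL from its fragment. It assumes the indexer only needs to drop the part after "#", so "#body" may disappear while "(2)" should remain in the path. The helper name splitFragment is hypothetical and does not come from PhpDig.

```php
<?php
// Illustrative helper (hypothetical name, not part of PhpDig):
// split a URL into the part that is requested from the server and its fragment.
function splitFragment(string $url): array
{
    $pos = strpos($url, '#');
    if ($pos === false) {
        return ['url' => $url, 'fragment' => null];
    }
    return [
        'url'      => substr($url, 0, $pos),
        'fragment' => substr($url, $pos + 1),
    ];
}

// URL from the first post in this thread.
$original = 'http://www.webthings.nl/archive/2003/11/14/cyberterrorisme_blijkt_gewoon_een_hype(2)#body';
$parts = splitFragment($original);

// parse_url keeps the parentheses in the path component.
$path = parse_url($original, PHP_URL_PATH);
```

If the "(2)" survives here but is missing from the indexed link, the loss probably happens somewhere other than plain URL parsing.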

apdejong
11-20-2003, 02:35 AM
Hi. I'm afraid I see the same output, so that doesn't seem to be the problem. Any other ideas?

Greets,
Arjan