
unable to parse url


marb
03-27-2004, 12:32 AM
Hi,
I'm spidering a page and get the error notice below. What can I do about it?
I use the loop option and have had no trouble spidering other pages before.
The spider indexes a page, and the message shows up when another URL is found on it, not for the page that is being spidered at that moment.



[quote]
+ + + + + + + + + + + + + + + + + +
63:http://www.wetcanvas.com/MediaKit/
(time : 00:54:20)

Warning: parse_url(http://www.heritageglass.com?amp;zoneid=0&source=&dest=http://www.heritageglass.com) [function.parse-url]: Unable to parse url in /opt/guide/www.artrefer.com/HTML/web/s3/admin/robot_functions.php on line 372
+ + + + + + + +
64:http://www.wetcanvas.com/web/
(time : 00:54:51)
+ +
65:http://www.wetcanvas.com/colormixer/
(time : 00:55:21)
+ + + +
[/quote]

Marten :)

Charter
03-27-2004, 03:06 AM
Hi. There is a 1.8.0 fix in this post (http://www.phpdig.net/showthread.php?postid=3084#post3084) that should be applied.

However, even with the fix, I'm not sure parse_url will handle a URL in the query string. See below.

<?php
$url="http://www.heritageglass.com?amp;zoneid=0&source=&dest=http://www.heritageglass.com";
print_r(parse_url($url)); // without fix and with url
echo "\n<br>\n";
$url="http://www.heritageglass.com?zoneid=0&source=&dest=http://www.heritageglass.com";
print_r(parse_url($url)); // with fix and with url
echo "\n<br>\n";
$url="http://www.heritageglass.com?zoneid=0&source=&dest=";
print_r(parse_url($url)); // with fix and without url
?>

The output is as follows:

Array
(
[scheme] => http
[host] => www.heritageglass.com?amp;zoneid=0&source=&dest=http
[path] => //www.heritageglass.com
)

Array
(
[scheme] => http
[host] => www.heritageglass.com?zoneid=0&source=&dest=http
[path] => //www.heritageglass.com
)

Array
(
[scheme] => http
[host] => www.heritageglass.com
[query] => zoneid=0&source=&dest=
)


Untested, but in robot_functions.php you might try the following code:

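$newurl = parse_url($newpath);

// add this chunk of code here
// If parse_url() left the query string inside the host, move it back out.
// strpos() is used instead of eregi(), which was removed in PHP 7.
if (isset($newurl["host"]) && strpos($newurl["host"], "?") !== false) {
    if (!isset($newurl["path"])) { $newurl["path"] = ""; }
    if (!isset($newurl["query"])) { $newurl["query"] = ""; }
    // everything after the "?" in the host, plus the stray path, is really the query
    $newurl["query"] = substr(strstr($newurl["host"], "?"), 1).$newurl["path"].$newurl["query"];
    unset($newurl["path"]);
    // keep only the part of the host before the "?"
    $newurl["host"] = substr($newurl["host"], 0, strpos($newurl["host"], "?"));
}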

//search if relocation is absolute or relative

Remember to remove any "word" wrapping in the above code.
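
If you want to try the idea outside of robot_functions.php first, here is a minimal, standalone sketch of the same logic. The function name repair_parsed_url is made up for illustration and is not part of PhpDig. The test array is copied by hand from the second result shown above, so the sketch does not depend on how your PHP version's parse_url behaves.

```php
<?php
// Hypothetical helper (not part of PhpDig): if a query string ended up inside
// the "host" element of a parse_url() result, move it back into "query".
function repair_parsed_url(array $parts)
{
    if (!isset($parts['host'])) {
        return $parts;
    }
    $pos = strpos($parts['host'], '?');
    if ($pos === false) {
        return $parts; // host is clean, nothing to do
    }
    $path  = isset($parts['path'])  ? $parts['path']  : '';
    $query = isset($parts['query']) ? $parts['query'] : '';

    // Everything after the "?" in the host, plus the stray path, is the query.
    $parts['query'] = substr($parts['host'], $pos + 1) . $path . $query;
    unset($parts['path']);

    // Keep only the part of the host before the "?".
    $parts['host'] = substr($parts['host'], 0, $pos);
    return $parts;
}

// Hand-built array mirroring the second parse_url() result in this thread.
$broken = array(
    'scheme' => 'http',
    'host'   => 'www.heritageglass.com?zoneid=0&source=&dest=http',
    'path'   => '//www.heritageglass.com',
);
print_r(repair_parsed_url($broken));
```

One thing to watch: in the results above, the ":" after the nested "http" is already missing from the parsed pieces, so the rebuilt query ends in dest=http//www.heritageglass.com rather than dest=http://www.heritageglass.com. That may or may not matter for spidering, but it is worth checking.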