includes & excludes


Andreas_Wien
04-23-2004, 03:44 AM
After a quick setup and easy integration I have difficulties spidering the page http://444.docs4you.at correctly.

1.) the path portalnode/ is excluded in the database AND in the robots.txt - nevertheless it is found somehow, and spidered over and over again.

2.) OTOH links on the page are not followed in general. This behavior is different every time, in the worst case 2 pages are spidered and indexed, nothing else, and phpdig hangs spidering portalnode/.

Maybe my understanding of the <!-- phpdigExclude --> / include tags is wrong. Can I assume that the parser reads the page top down, switching indexing off and on again each time it sees a line with a phpdig tag?
And, regardless of include/exclude tags, should each and every link on the page be spidered?
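
To make the first assumption concrete, here is a minimal Python sketch of a top-down reader that toggles indexing whenever it meets one of the tags. It only illustrates the behavior being asked about; it is not phpDig's actual code, and the function name `indexable_lines` and the `<!-- phpdigInclude -->` spelling of the include tag are assumptions made for the example.

```python
import re

# Marker patterns; the include spelling is assumed by analogy with the exclude tag.
EXCLUDE = re.compile(r"<!--\s*phpdigExclude\s*-->")
INCLUDE = re.compile(r"<!--\s*phpdigInclude\s*-->")


def indexable_lines(html):
    """Return the lines a purely top-down reader would keep for indexing."""
    indexing = True  # assumption: indexing is on until the first exclude tag
    kept = []
    for line in html.splitlines():
        if EXCLUDE.search(line):
            indexing = False  # switch off from this line onward
            continue
        if INCLUDE.search(line):
            indexing = True  # switch back on
            continue
        if indexing:
            kept.append(line)
    return kept


page = """<p>intro</p>
<!-- phpdigExclude -->
<a href="/menu.php">menu</a>
<!-- phpdigInclude -->
<p>body</p>"""

print(indexable_lines(page))
```

Under this model the link to /menu.php is dropped from the indexed text; whether the spider still follows it is exactly the second question.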

Any help would be greatly appreciated!

Best Regards, Andreas

vinyl-junkie
04-23-2004, 04:26 AM
Welcome to the forums, Andreas. We're glad you could join us! :)

Check out this thread (http://www.phpdig.net/showthread.php?s=&threadid=637&highlight=directories) for a solution to your problem.

Andreas_Wien
04-23-2004, 04:44 AM
Hi Pat, thank you for the reply!

My case is in fact much easier, since the /portalnode/ path should be excluded altogether from ANY robot indexing, so my robots.txt looks like:

User-agent: *
Disallow: /portalnode/

Nevertheless, phpdig hangs in exactly that directory.
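
For reference, one way to double-check that these two robots.txt lines cover a given URL is Python's standard `urllib.robotparser`. This is only a sketch of how a rule-abiding robot would read the file, not a statement about what phpDig does internally; the agent name `ExampleSpider` and the helper `allowed` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted above.
rules = """User-agent: *
Disallow: /portalnode/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(url, agent="ExampleSpider"):
    """True if the rules permit the (hypothetical) agent to fetch the URL."""
    return parser.can_fetch(agent, url)


# Disallow rules are matched as path prefixes.
print(allowed("http://444.docs4you.at/portalnode/uups.php"))
print(allowed("http://444.docs4you.at/Content.Node/Veranstaltungen/index.php"))
```

Note that matching is done on the URL path as written, so a path spelled differently from the Disallow line would not be covered by it.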

did I miss something?
Andreas

vinyl-junkie
04-23-2004, 05:29 AM
Seems like I read somewhere in the forums that if you list the complete path for your starting file; e.g.,
http://444.docs4you.at/portalnode/index.php
or whatever the file is called, indexing will work for you.

You most likely have indexing locked right now, so you'll have to unlock it. Search the forum for how to do that if you don't know how. Also, make sure your LIMIT_DAYS parameter in config.php is set in a way that will let you re-spider your site now.

Good luck, and let us know how it goes.

vinyl-junkie
04-23-2004, 05:31 AM
Oops! That's what I get for being in a hurry. I'm getting ready to leave for work right now. ;) You wanted to exclude that directory, not include it.

Are you trying to index the whole site, and it's hanging? I'm not sure what is happening here.

Andreas_Wien
04-24-2004, 02:56 AM
yep, the whole site. I don't understand a few points here:

- how it gets a link to /portalnode/uups.php
- why it keeps indexing that file against all the exclude-rules
- why it ignores the rest of the site
- why it hangs

any idea?
Andreas

vinyl-junkie
04-24-2004, 07:31 AM
I downloaded your zip file and took a look at your screen shot (nothing amiss there as far as I can see) and your spidering log. Just before the link that was being spidered multiple times is this link:

http://444.docs4you.at/Content.Node/Veranstaltungen/index.php?S=navi

When I tried to bring up that page in my browser, it seemed to fail. I don't speak German, so I don't have a clue what the whole page says, but there is definitely an error in the first line:

Warning: Cannot modify header information - headers already sent by (output started at /Node/node/portal.node/uups.php:2) in /Node/node/portal.node/uups.php on line 9

Maybe fixing that will get your problem cleared up.

vinyl-junkie
04-24-2004, 07:45 AM
I meant to include this in my last post and forgot. I am somewhat concerned about that 404 error that you got first thing in your spider log.

This doesn't have anything to do with phpDig per se, but I found some free link checker software that you might want to use on your site. Just a word of caution though. It consumes bandwidth on your site about like phpDig does, so you wouldn't want to run it every single day. The free version is called REL Checker Lite, and you can download it here (http://www.relsoftware.com/).

Andreas_Wien
04-24-2004, 01:33 PM
Yes, right, sorry about that. The site is in some respects not finished yet; nevertheless, it should be searchable already.

I hope that doesn't affect phpdig in any way. I'm not troubled if phpdig doesn't index a page that doesn't exist. It seems difficult enough to get the existing pages indexed ;-) !

Some additional info about that site:
Every page has several modes of appearance, controlled by the S-parameter. I intended to hide these apparent duplicate pages from phpdig by dynamically adding a line:
<meta name="robots" content="noindex,nofollow,none">
if and only if an S-parameter is passed to the page. So only the simple pages (without any S-parameter) should be indexed; they carry the line:
<meta name="robots" content="index,follow">
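
The rule just described can be sketched in a few lines of Python. The site itself is PHP, so this only illustrates the decision logic; the function name `robots_meta` is made up for the example, and treating an empty `S=` as "parameter passed" is an assumption.

```python
from urllib.parse import urlparse, parse_qs


def robots_meta(url):
    """Pick the robots meta line for a page URL, following the rule above."""
    # keep_blank_values=True so that "?S=" still counts as an S-parameter
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    if "S" in query:
        return '<meta name="robots" content="noindex,nofollow,none">'
    return '<meta name="robots" content="index,follow">'


print(robots_meta("http://444.docs4you.at/Content.Node/Veranstaltungen/index.php?S=navi"))
print(robots_meta("http://444.docs4you.at/Content.Node/Veranstaltungen/index.php"))
```

A spider can only see this tag after it has fetched the page, so every S-variant still costs one request even though it is never indexed.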

And even if phpdig hangs in one branch, why doesn't it finish spidering the other branches of the site? And why does it change its behavior (number of pages successfully indexed) every time I dig the site?

Still confused ... are my assumptions in the initial posting correct?

And the main point is: portal.node/ is EXCLUDED in the DB and in robots.txt. The URL of uups.php lies on that path. Which precautions do I have to take on such pages so that phpdig spiders the rest of the site that is not explicitly excluded?

Greets from Vienna, Andreas

vinyl-junkie
04-25-2004, 09:41 AM
OK, here's my take on what you're saying, and it's not based on my knowledge of the phpDig code itself. Rather, it's based on what I see in the spider logs. My assumptions may or may not be correct.

phpDig obeys robots.txt - that much we know - but it still has to visit the page to find out if there is a robots exclusion, assuming that it didn't find that already in the robots.txt file. If a page has some kind of problem, like the one I pointed out, that could cause phpDig to go into some kind of loop. Exactly how or why that happens, I wouldn't know.

I hope you understand where I'm coming from with this. What I'm saying is basically this: If phpDig has to visit a page, there had better not be any errors in it. If there are, it could throw phpDig into a tailspin and cause it not to spider everything you think it should.

My suggestion would be to either fix the page or use the include/exclude comments in the page(s) that link to the problem document, so that phpDig will not attempt to spider it.