PDA

View Full Version : Need Solution Please


rwh
12-24-2003, 09:29 AM
I have installed the new version 1.6.5 and set the permissions on the folders named in the instructions to 777. Inside the admin panel, if I place the IP of my account in the field, the spider program works great; however, if I place my domain name in the field, it does not. I have tried an empty robots.txt file and also a file protecting only the cgi-bin.
Still I get no pages spidered using the domain name. Here is what I get when running the domain name:

SITE : http://www.yourdomain.com/
Exclude paths :
- @NONE@
1:http://www.yourdomain.com/
(time : 00:00:00)

No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://www.yourdomain.com/
Optimizing tables...
Indexing complete !

We are currently running PHP 4.3.4, Apache on Linux 9.0, and MySQL 4.0.15.

My current path is /home/username/public_html/search/

search being the directory where PhpDig is installed

Any ideas anyone?

Charter
12-24-2003, 10:49 AM
Hi. Perhaps something in this (http://www.phpdig.net/showthread.php?threadid=310) thread might help.

rwh
12-24-2003, 10:57 AM
No we are not behind any firewall.

rwh
12-24-2003, 12:04 PM
OK, looking at his DNS zone, we have this in place for FTP:
ftp CNAME domain.com.


rwh
12-24-2003, 01:04 PM
Some more information for you. I edited the search/includes/config.php file and changed these settings as follows:


//---------FTP SETTINGS
define('FTP_ENABLE',1);//enable ftp content for distant PhpDig
define('FTP_HOST','mydomainname.com'); //if distant PhpDig, ftp host;
define('FTP_PORT',21); //ftp port
define('FTP_PASV',1); //passive mode
define('FTP_PATH','/home/username/public_html/'); //distant path from the ftp root
define('FTP_TEXT_PATH','text_content');//ftp path to text-content directory
define('FTP_USER','username'); //ftp user

And when we try the domain name now, we get this:

Warning: ftp_chdir(): Can't change directory to /home/username/public_html/: No such file or directory in /home/username/public_html/search/admin/robot_functions.php on line 1204
Error : Ftp connect failed !
Warning: Cannot modify header information - headers already sent by (output started at /home/username/public_html/search/admin/robot_functions.php:1204) in /home/username/public_html/search/admin/update_frame.php on line 69



Hoping this will help someone to help us.

rwh
12-24-2003, 01:30 PM
OK, one step further. In the FTP settings, I changed this one line to:
define('FTP_PATH',''); //distant path from the ftp root


Now when I run, say, http://www.domain.com, I still do not get any links, but if I run something like this:


http://www.domain.com/testfolder/

It will grab and record everything under that directory.

So we're getting closer; I think it is just a setting somewhere. Hopefully someone has run into this before and can help out.

Charter
12-24-2003, 02:48 PM
Hi. Are text files showing up in the text_content/ directory? What files are in the / directory? Also, how many text_content directories do you have?

rwh
12-24-2003, 03:35 PM
OK, right now I have one text folder called text_content, and it has one file in it called keepalive.txt.
Now I go to admin and run the domain name by itself, with / or a sub dir in the path, and select depth 5.

I get this

SITE : http://www.mansfield-tx.gov/
Exclude paths :
- @NONE@
1:http://www.mansfield-tx.gov/
(time : 00:00:00)

No link in temporary table
--------------------------------------------------------------------------------
links found : 1
http://www.mansfield-tx.gov/
Optimizing tables...
Indexing complete !


Now I look in the text_content folder and I have one file called 387.txt, and it is empty. Back in the admin interface panel, I now show 1 host and 1 page.

Now if I redo this and add a / and a sub dir to the domain name, it works: it adds files, keywords, and such. It just will not run the domain like this: http://www.domain.com

It works fine with http://www.domain.com/subdir or the IP, http://ip

rwh
12-24-2003, 03:49 PM
OK, found one problem: the client had an index.html file in his dir that was nothing more than a very short redirect file. So for a test I uploaded an index.html file with a lot of text in it. I reran the test and it picked up the keywords for that index.html page and stored them. It only got the one page, though; it did not attempt to get the other files or sub dir files.

Charter
12-24-2003, 03:52 PM
Hi. Is http://www.mansfield-tx.gov/ the domain you are trying to crawl? If so, the redirect has JavaScript in the middle and end of the HTML so try changing define('CHUNK_SIZE',2048); to define('CHUNK_SIZE',200); in the config file. Does this change make it pick up the links?
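For context, CHUNK_SIZE controls how many bytes of a page the spider reads at a time. A rough Python illustration of chunked reading (an analogy only, not PhpDig's actual code; the page bytes are made up for the example) shows that a short redirect page fits entirely in the first 200-byte chunk, so the meta tag is seen before any trailing JavaScript:

```python
import io

def read_in_chunks(stream, chunk_size):
    """Yield successive pieces of at most chunk_size bytes from a stream."""
    while True:
        piece = stream.read(chunk_size)
        if not piece:
            break
        yield piece

# A short redirect page like the one discussed in this thread.
page = (b"<html><head>"
        b"<meta http-equiv='refresh' content=\"0;url=http://www.ci.mansfield.tx.us\">"
        b"</head></html>")

first_chunk = next(read_in_chunks(io.BytesIO(page), 200))
print(b"refresh" in first_chunk)  # True: the meta tag is inside the first 200 bytes
```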

rwh
12-24-2003, 03:58 PM
Same thing, no change, and yes, that is the site; it only gets index.html.

Charter
12-24-2003, 04:12 PM
Hi. What happens if you crawl http://www.ci.mansfield.tx.us directly?

rwh
12-24-2003, 04:44 PM
It crawled it without any problems. I placed realwebhost.net in there and it did not crawl that one either, same as the other.

rwh
12-24-2003, 04:51 PM
What should be in the robots file to make this work?

Charter
12-24-2003, 05:29 PM
Hi. Temporarily remove the robots.txt file from realwebhost.net and then PhpDig should crawl it. Some tweaks need to be done when PhpDig reads the robots.txt file, as it's too restrictive now, but there isn't a list of tweaks ready.

The deal with the Mansfield site is that PhpDig won't follow the redirect. To fix this, change define('PHPDIG_IN_DOMAIN',false); to define('PHPDIG_IN_DOMAIN',true); in the config.php file, and also make the change listed in this (http://www.phpdig.net/showthread.php?threadid=177) thread.

rwh
12-24-2003, 05:42 PM
First thanks for all your help.
For Real Web Host, I can't remove that, because I have files and directories that must not be crawled.

On the other one, it is crawling now, but even though it has a redirect in it, there are still directories in there for that domain.
It is still not looking at them at all; it just jumped over that domain and went to the others.
So maybe I just have to do those sub directories manually, like I did before, I guess.

Charter
12-24-2003, 06:32 PM
Hi. Add the following to the top of the robots.txt file and then make the code change listed in this (http://www.phpdig.net/showthread.php?threadid=269) thread.

User-agent: PhpDig
Disallow:
# whatever else below this

This should let PhpDig follow all the links it finds, and then you can go and delete/exclude certain links/directories from the admin panel.
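The effect of a rule like that can be checked offline with Python's standard robots.txt parser. This is a sketch of the general Robots Exclusion semantics, not of PhpDig's own parser, and the URLs are placeholders:

```python
import urllib.robotparser

# robots.txt as suggested above: PhpDig gets a blanket allow,
# while all other agents keep the existing cgi-bin restriction.
robots_txt = """\
User-agent: PhpDig
Disallow:

User-agent: *
Disallow: /cgi-bin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("PhpDig", "http://realwebhost.net/cgi-bin/test.cgi"))        # True
print(rp.can_fetch("SomeOtherBot", "http://realwebhost.net/cgi-bin/test.cgi"))  # False
```

An empty `Disallow:` means "nothing is disallowed," which is why the PhpDig entry opens the whole site to that one crawler only.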

rwh
12-24-2003, 06:57 PM
Is there any way around the other problem, with that domain not being read because it gets redirected?

Charter
12-24-2003, 07:00 PM
Hi. Post fifteen on the first page of this thread should deal with the redirect.

rwh
12-24-2003, 07:06 PM
OK, I made that change and put in the main domain, and it looks like this. Notice it does not even try to get sub directories under the main domain; it does not get anything, then goes to number 2, which is the redirect domain name, so it gets nothing from the main domain name.

SITE : http://www.mansfield-tx.gov/
Exclude paths :
- @NONE@
1:http://www.mansfield-tx.gov/
(time : 00:00:00)
Ok for http://www.ci.mansfield.tx.us/ (site_id:49)

No link in temporary table

--------------------------------------------------------------------------------

links found : 1
http://www.mansfield-tx.gov/

--------------------------------------------------------------------------------
SITE : http://www.ci.mansfield.tx.us/
Exclude paths :
- @NONE@
2:http://www.ci.mansfield.tx.us/

It is still running as we speak, over 50 minutes now, and it is on number 41.

Charter
12-24-2003, 07:20 PM
Hi. PhpDig can't index subdirectories/files if there are no links to them. The only thing PhpDig sees at http://www.mansfield-tx.gov/ is the below, so with the changes made in this thread, the only place PhpDig can go is http://www.ci.mansfield.tx.us, and then it follows the links from there.

<html>
<head>
<meta http-equiv="refresh" content="0;url=http://www.ci.mansfield.tx.us">
</head>
</html>
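A crawler that wants to follow that kind of redirect has to pull the target URL out of the meta tag itself. A minimal Python sketch of that extraction (a regex illustration, not PhpDig's actual logic):

```python
import re

html = """<html>
<head>
<meta http-equiv="refresh" content="0;url=http://www.ci.mansfield.tx.us">
</head>
</html>"""

def meta_refresh_target(page):
    """Return the URL from a <meta http-equiv="refresh"> tag, or None."""
    m = re.search(
        r'<meta\s+http-equiv=["\']refresh["\']\s+'
        r'content=["\']\s*\d+\s*;\s*url=([^"\']+)["\']',
        page,
        re.IGNORECASE,
    )
    return m.group(1) if m else None

print(meta_refresh_target(html))  # http://www.ci.mansfield.tx.us
```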

You could set up a temp page with links to the subdirectories/files that you want indexed. After the index is done, go to the admin panel, click the site holding the temp page, click the update button, click a blue arrow if needed, and then delete the temp page.

rwh
12-24-2003, 08:04 PM
OK, I understand that.

rwh
12-24-2003, 08:26 PM
Just to follow up: I made the temp index.html file and it works, getting pages now. For some reason, when it got done with the domain name, it got the same pages again using the IP.

Charter
12-26-2003, 01:04 PM
Hi. Are you crawling shell or from the browser interface, with FTP on or off? Is there a link somewhere that uses the IP instead of the domain name?

rwh
12-26-2003, 01:16 PM
From the IE browser, with FTP on.
I figure there are links with the IP in his files; not sure, we'll let him look at them.