PhpDig.net

07-11-2004, 10:12 AM   #1
Nad
Green Mole
 
Join Date: Jul 2004
Location: Paris
Posts: 5
Numbers everywhere...

Hi,
I'm running into a problem while indexing a website with PhpDig.
The indexing itself works, but the text stored in the txt files of the text_content directory is wrong.
All the text files contain short number/letter combinations (e.g. 19b) scattered almost everywhere.
After the first indexing run there are only a few, but on re-indexing these alphanumeric "bugs" begin to invade all the text, especially at the beginning.
Here's an example after the third indexing run:

"b3 46 19b

198 19b ee
119 66 6e 10 Le livre du Mois 15 2 1c Miró, un feu dans les ruines 1a 1d5 Sans doute êtes vous déjà nombreux à avoir vu ou à revoir la très importante exposition consacrée"


This is the text shown on the results page, so it is really annoying.

It's as if some ereg_replace/eregi call didn't do its job properly.

If somebody can tell me what's wrong, I'll be grateful.

Thx.
07-11-2004, 10:22 AM   #2
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
I believe what you need to do is replace this statement in config.php:
PHP Code:
$phpdig_language = "en";
with this:
PHP Code:
$phpdig_language = "fr";

BTW, welcome to the forum!
07-11-2004, 11:23 AM   #3
Nad
Green Mole
 
Join Date: Jul 2004
Location: Paris
Posts: 5
Thank you for the welcome and the reply.
When I saw your reply I thought "damn, I'm so dumb..."
But no, I'm not... setting the language parameter to "fr" didn't change anything.
I also made a fresh installation of PhpDig (with a new database) to rule out anything left over from the "old" index.

thx anyway.


By the way, my local setup: Apache, PHP 4.2, Windows 2000.
Server setup: Apache, PHP 4.2.1, SunOS... and the same problem.

Last edited by Nad; 07-11-2004 at 11:31 AM.
07-11-2004, 11:35 AM   #4
vinyl-junkie
Purple Mole
 
Join Date: Jan 2004
Posts: 694
Actually, you may be onto something with starting afresh. I had a test database for phpdig that seemed to mess things up for me when I upgraded to phpdig 1.8.1. I was still pointing to that, and didn't realize that's what the problem was when I was getting some screwy search results.

Let us know if you still have problems. We'll be glad to help.
07-11-2004, 02:21 PM   #5
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Are you getting external binary output like in this thread, or are you getting character encoding output like in this thread? The results that have these number/letter combos, are they coming from pages that have a different encoding than that set in the config file?
__________________
Responses are offered on a voluntary if/as time is available basis, no guarantees. Double posting or bumping threads will not get your question answered any faster. No support via PM or email, responses not guaranteed. Thank you for your comprehension.
07-11-2004, 11:59 PM   #6
Nad
Green Mole
 
Join Date: Jul 2004
Location: Paris
Posts: 5
Hi again,

Charter,

The text in the txt files comes from "simple" HTML or PHP pages. Nothing comes from doc or pdf files.
The encoding used is charset=iso-8859-1, the same one set in the PhpDig config file.

On the first indexing run, these number/letter combos appear like a "flag" in the txt files; sometimes 6 or 7 files begin with the same combo (e.g. e7e, e5f, or 980...).
As the sample text in my first post shows, these number/letter combos are scattered throughout the text.

thx.
07-12-2004, 04:06 AM   #7
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. What website was it that these number/letter combos came from?
07-13-2004, 10:15 AM   #8
Nad
Green Mole
 
Join Date: Jul 2004
Location: Paris
Posts: 5
Sorry, network problems for two days...

Charter,

These are all websites that I built myself (so I must be the guilty one here... ;-) )
They can be on a Unix or Windows server, local or not, and the problem is still the same.
Tell me if you need a URL; I'll try to give you one (not a local one, of course).

Thx
07-13-2004, 07:57 PM   #9
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. Looks like it might be related to this report. Search that webpage for "a57" (without quotes) and read from there. Also, this may be of interest too.

If these number/letter combos are in fact chunk-size markers from HTTP chunked transfer encoding, they may sit on their own lines, so try the following to remove them.

In robot_functions.php, in the phpdigGetUrl function, find:
PHP Code:
else {
    $lines[] = $answer;

and replace with the following:
PHP Code:
else {
    if (!eregi("^[a-z0-9]{1,3}[[:space:]]*$", $answer)) {
        $lines[] = $answer;
    }
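
For anyone on a newer PHP version: the ereg family, including eregi, was removed in PHP 7, so the same check could be sketched with preg_match. The helper name looksLikeChunkMarker and the sample $answers array are made up for illustration and are not part of PhpDig; the pattern itself is the one from the patch above.

```php
<?php
// Hypothetical helper, not part of PhpDig: returns true when a line consists
// only of 1 to 3 letters/digits followed by optional whitespace.
function looksLikeChunkMarker(string $line): bool
{
    // Same pattern as the eregi() call; the "i" flag makes it case-insensitive.
    return preg_match('/^[a-z0-9]{1,3}[[:space:]]*$/i', $line) === 1;
}

// Sample lines (an assumption), loosely based on the example in the first post.
$answers = [
    "19b\r\n",
    "Le livre du Mois\r\n",
    "1c\r\n",
    "Miró, un feu dans les ruines\r\n",
];

// Keep only the lines that do not look like a bare marker.
$lines = [];
foreach ($answers as $answer) {
    if (!looksLikeChunkMarker($answer)) {
        $lines[] = $answer;
    }
}

print_r($lines);
```

Note that this pattern also drops any short word that happens to sit alone on a line (for example "the"), so it is a quick filter rather than a real decoder.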

07-14-2004, 12:41 AM   #10
Nad
Green Mole
 
Join Date: Jul 2004
Location: Paris
Posts: 5
Hello Charter,
I inserted your code and tried spidering one website, and for now it is working just fine!
It's great!
Thanks a lot!

I did not fully understand the chunk encoding stuff. I'm French, so it will take me a bit longer to read and understand all the information you pointed me to.
It seems to be a problem between the web pages I've made and how the server returns them...

Anyway, thank you again !
07-14-2004, 01:43 AM   #11
Charter
Head Mole
 
Join Date: May 2003
Posts: 2,539
Hi. See how "10 Le livre du Mois" and "1c Miró, un feu dans les ruines" were getting indexed...
  • 10 in hexadecimal is 16 in decimal: "Le livre du Mois" without the quotes is 16 characters.
  • 1c in hexadecimal is 28 in decimal: "Miró, un feu dans les ruines" without the quotes is 28 characters.
These are chunks, and each hexadecimal number is the length of the chunk that follows it. The {1,3} in the regex assumes those hexadecimal numbers are never longer than three digits, so replacing {1,3} with a + might be better.
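
The arithmetic above can be double-checked with a few lines of PHP. This is only a sketch: the $pairs array is typed in by hand from the example in the first post, and the "ó" is written as the single byte \xF3 on the assumption that the page is served as ISO-8859-1, as stated earlier in the thread.

```php
<?php
// Pairs of (hex marker, text that followed it), copied from the example.
// Assumption: ISO-8859-1 bytes, so "ó" is the single byte \xF3.
$pairs = [
    ['10', 'Le livre du Mois'],
    ['1c', "Mir\xF3, un feu dans les ruines"],
];

$allMatch = true;
foreach ($pairs as [$hex, $text]) {
    $declared = hexdec($hex);  // size announced by the marker
    $actual   = strlen($text); // number of bytes in the text
    printf("%s -> declared %d, actual %d\n", $hex, $declared, $actual);
    if ($declared !== $actual) {
        $allMatch = false;
    }
}

var_dump($allMatch);
```

If the same text were stored as UTF-8, strlen would count "ó" as two bytes and the second pair would no longer match, which is why the byte escape is used here.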

Speaking of better, I suppose a routine could be written to loop and parse and convert between hex and dec and find string positions and all that, but there probably won't be any header fields in the trailer, so just avoiding the hexadecimals should do as a quick patch.
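
For reference, a routine of that kind might look roughly like the sketch below. It is not PhpDig code: the function name decodeChunkedBody is hypothetical, it assumes the whole response body is already in one string (PhpDig reads $answer line by line, so wiring this in would need more work), and it simply stops at the zero-size chunk without reading any trailer.

```php
<?php
// Hypothetical helper, not part of PhpDig: turns a body sent with
// "Transfer-Encoding: chunked" back into the plain payload.
function decodeChunkedBody(string $body): string
{
    $decoded = '';
    $offset  = 0;
    $length  = strlen($body);

    while ($offset < $length) {
        // Each chunk starts with a size line terminated by CRLF.
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            break; // malformed input: size line is not terminated
        }

        // Drop any chunk extension after ";" and keep the hex digits.
        $sizeLine = substr($body, $offset, $lineEnd - $offset);
        $sizeHex  = trim(explode(';', $sizeLine, 2)[0]);
        if (preg_match('/^[0-9a-f]+$/i', $sizeHex) !== 1) {
            break; // malformed input: not a hexadecimal size
        }

        $size = (int) hexdec($sizeHex);
        if ($size === 0) {
            break; // last chunk; any trailer is ignored in this sketch
        }

        // Copy exactly $size bytes of data, then skip the CRLF after them.
        $decoded .= substr($body, $lineEnd + 2, $size);
        $offset   = $lineEnd + 2 + $size + 2;
    }

    return $decoded;
}

// Sample input (an assumption) built from the two chunks discussed above,
// with "ó" as the ISO-8859-1 byte \xF3.
$raw   = "10\r\nLe livre du Mois\r\n1c\r\nMir\xF3, un feu dans les ruines\r\n0\r\n\r\n";
$plain = decodeChunkedBody($raw);

var_dump(strlen($plain));
```

Unlike the regex patch, this reads each size and skips exactly that many bytes, so short words on their own lines are left alone.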