Numbers everywhere...


Nad
07-11-2004, 10:12 AM
Hi,
I'm having a problem indexing a website with phpdig.
The indexing itself works, but the text stored in the txt files of the text_content directory is wrong.
All the text files contain short combinations of numbers and letters (e.g. 19b) scattered almost everywhere.
After the first indexing there are only a few, but on re-indexing these alphanumeric "bugs" start to invade all the text, especially at the beginning.
Here's an example after the third indexing:

"b3 46 19b

198 19b ee
119 66 6e 10 Le livre du Mois 15 2 1c Miró, un feu dans les ruines 1a 1d5 Sans doute êtes vous déjà nombreux à avoir vu ou à revoir la très importante exposition consacrée"

This is the text shown on the results page, so it is really annoying.

It's as if some ereg_replace/eregi call didn't do its job properly.

If somebody can tell me what's wrong, I'll be grateful.

Thx.

vinyl-junkie
07-11-2004, 10:22 AM
I believe what you need to do is replace this statement in config.php: $phpdig_language = "en"; with this: $phpdig_language = "fr";

BTW, welcome to the forum! :D

Nad
07-11-2004, 11:23 AM
Thank you for your welcome and reply.
When I saw your reply I thought "damn, I'm so dumb..."
But no, I'm not... setting the language parameter to "fr" didn't change anything.
I also made a fresh installation of phpdig (with a new database) to rule out anything left over from the "old" index.

Thanks anyway.


By the way, my config: Apache server, PHP 4.2, Windows 2k
Server config: Apache, PHP 4.2.1, Sun OS... and the same problem.

vinyl-junkie
07-11-2004, 11:35 AM
Actually, you may be onto something with starting afresh. I had a test database for phpdig that seemed to mess things up for me when I upgraded to phpdig 1.8.1. I was still pointing to that, and didn't realize that's what the problem was when I was getting some screwy search results.

Let us know if you still have problems. We'll be glad to help. :)

Charter
07-11-2004, 02:21 PM
Hi. Are you getting external binary output like in this (http://www.phpdig.net/showthread.php?threadid=532) thread, or are you getting character encoding output like in this (http://www.phpdig.net/showthread.php?threadid=1027) thread? The results that have these number/letter combos, are they coming from pages that have a different encoding than that set in the config file?

Nad
07-11-2004, 11:59 PM
Hi again,

Charter,

The text in the txt files comes from "simple" HTML or PHP pages. Nothing from doc or pdf files.
The encoding used is charset=iso-8859-1, the same as the one set in the phpdig config file.

On the first indexing, these number/letter combos appear like a "flag" in the txt files; sometimes 6-7 files begin with the same combo (e.g. e7e, e5f, or 980...).
As the example text in my first post shows, these number/letter combos are scattered throughout the text.

thx.

Charter
07-12-2004, 04:06 AM
Hi. What website was it that these number/letter combos came from?

Nad
07-13-2004, 10:15 AM
Sorry, network problems for 2 days...

Charter,

The websites are all sites that I made (so I must be the guilty one here... ;-) )
These websites can be on a Unix or Windows server, local or not, and the problem is the same.
Tell me if you need a URL; I'll try to give you one (not a local one, of course).

Thx

Charter
07-13-2004, 07:57 PM
Hi. Looks like it might be related to this (http://bugzilla.ximian.com/show_bug.cgi?id=21236) report. Search that webpage for "a57" (without quotes) and read from there. Also, this (http://www.mail-archive.com/java-apache@list.working-dogs.com/msg00205.html) may be of interest too.

If these number/letter combos are in fact chunked-encoding size markers, then they may be on their own lines, so try the following to remove them.

In robot_functions.php, in the phpdigGetUrl function, find:

    else {
        $lines[] = $answer;
    }

and replace with the following:

    else {
        // Skip lines holding only 1-3 alphanumerics plus optional whitespace (likely chunk-size markers)
        if (!eregi("^[a-z0-9]{1,3}[[:space:]]*$",$answer)) {
            $lines[] = $answer;
        }
    }
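
To see what this filter keeps and drops without running a spider, here is a minimal sketch in Python rather than PHP (so it is not phpdig code). The names MARKER and drop_marker_lines are made up for illustration, and `\s` stands in for the POSIX class `[[:space:]]`.

```python
import re

# Assumed Python equivalent of the eregi pattern: eregi is case-insensitive,
# hence re.IGNORECASE; \s replaces the POSIX class [[:space:]].
MARKER = re.compile(r"^[a-z0-9]{1,3}\s*$", re.IGNORECASE)


def drop_marker_lines(lines):
    """Return only the lines that do not look like a bare chunk-size marker."""
    return [line for line in lines if not MARKER.match(line)]


# Lines as the robot might read them from the socket (hypothetical sample).
sample = ["b3\r\n", "19b\r\n", "Le livre du Mois\r\n", "1d5\r\n"]
kept = drop_marker_lines(sample)
```

Note that the pattern is not limited to hexadecimal digits, so a genuine line holding only a short word such as "Oui" would be dropped as well. That is a trade-off of the quick patch.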

Nad
07-14-2004, 12:41 AM
Hello Charter,
I inserted your code and tried spidering one website, and for now it is working just fine!
It's great!
Thanks a lot!

I did not fully understand the chunk encoding stuff. I'm still French, so it will take me a bit longer to read and understand all the information you gave me :D
It seems to be a problem between the web pages I've made and how the server returns them...

Anyway, thank you again!

Charter
07-14-2004, 01:43 AM
Hi. See how "10 Le livre du Mois" and "1c Miró, un feu dans les ruines" were getting indexed...

10 in hexadecimal is 16 in decimal: "Le livre du Mois" without the quotes is 16 characters.
1c in hexadecimal is 28 in decimal: "Miró, un feu dans les ruines" without the quotes is 28 characters.
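
The arithmetic above can be checked with a small sketch. It is written in Python for illustration (not phpdig code), marker_matches is a made-up helper name, and it assumes the page was sent as iso-8859-1, as stated earlier in the thread, so that each character is one byte.

```python
def marker_matches(hex_size, text, charset="iso-8859-1"):
    """True if the hexadecimal marker equals the byte length of the text."""
    return int(hex_size, 16) == len(text.encode(charset))


# The two marker/text pairs taken from the indexed sample.
pairs = [
    ("10", "Le livre du Mois"),
    ("1c", "Miró, un feu dans les ruines"),
]
results = [marker_matches(size, text) for size, text in pairs]
```

Chunk sizes count bytes, not characters. The two agree here only because iso-8859-1 stores "ó" in a single byte.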

These are chunks, where the hexadecimal number is the length of the chunk. The {1,3} in the regex assumes those hexadecimal numbers are at most three digits long, but replacing {1,3} with a + might be better.

Speaking of better, I suppose a routine could be written to loop and parse and convert between hex and dec and find string positions and all that, but there probably won't be any header fields in the trailer, so just avoiding the hexadecimals should do as a quick patch.
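
For reference, the fuller routine mentioned above might look roughly like the following. This is only a minimal sketch in Python, not phpdig code. The name decode_chunked is hypothetical, and the sketch assumes the whole raw body is already in memory with CRLF line endings. It stops at the zero-size chunk and ignores any trailer fields.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body; anything after the last chunk is ignored."""
    out = bytearray()
    pos = 0
    while True:
        eol = raw.find(b"\r\n", pos)
        if eol == -1:
            raise ValueError("missing chunk-size line")
        # The size is hexadecimal; drop any chunk extension after ';'.
        size_field = raw[pos:eol].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            break  # last chunk reached; trailer fields are not parsed
        chunk = raw[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        out += chunk
        pos += size
        if raw[pos:pos + 2] != b"\r\n":
            raise ValueError("chunk not terminated by CRLF")
        pos += 2
    return bytes(out)


# Rebuild a body from the two chunks discussed above (iso-8859-1 bytes).
body = (
    b"10\r\nLe livre du Mois\r\n"
    + b"1c\r\n" + "Miró, un feu dans les ruines".encode("iso-8859-1") + b"\r\n"
    + b"0\r\n\r\n"
)
decoded = decode_chunked(body).decode("iso-8859-1")
```

Unlike the regex patch, this removes the size markers by position, so short genuine lines of text are left alone.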