Numbers everywhere...
Hi,
I'm running into a problem while indexing a website with phpdig. The indexing itself works, but the text stored in the txt files of the text_content directory is wrong. All the text files contain short strings of numbers and letters (e.g. 19b) scattered almost everywhere. After the first indexing run there are only a few, but on re-indexing these alphanumeric "bugs" start to invade the whole text, especially at the beginning. Here's an example after the third indexing run: "b3 46 19b 198 19b ee 119 66 6e 10 Le livre du Mois 15 2 1c Miró, un feu dans les ruines 1a 1d5 Sans doute êtes vous déjà nombreux à avoir vu ou à revoir la très importante exposition consacrée" This text is what gets shown on the results page, so it is really annoying. It looks as if some ereg_replace/eregi call didn't do its job properly. If somebody can tell me what's wrong, I'll be grateful. Thx. |
I believe what you need to do is replace this statement in config.php:
PHP Code:
PHP Code:
BTW, welcome to the forum! :D |
Thank you for your welcome and reply.
When I saw your reply I felt like "damn, I'm so dumb..." But no, I'm not... Setting the language parameter to "fr" didn't change anything. I made a fresh installation of phpdig (with a new database) to check that nothing was left over from the "old" indexing. Thx anyway. BTW, my config: Apache server, PHP 4.2, Windows 2k. Server config: Apache, PHP 4.2.1, Sun OS... and the same problem. |
Actually, you may be onto something with starting afresh. I had a test database for phpdig that seemed to mess things up for me when I upgraded to phpdig 1.8.1. I was still pointing to that, and didn't realize that's what the problem was when I was getting some screwy search results.
Let us know if you still have problems. We'll be glad to help. :) |
Hi again,
Charter, the text stored in the txt files comes from "simple" HTML or PHP pages. Nothing from doc or pdf files. The encoding used is charset=iso-8859-1, the same as in the phpdig config file. On the first indexing run, these numbers/letters appear like a "flag" in the txt files; sometimes you can find 6-7 files beginning with the same combo (e.g. e7e, or e5f, or 980 ...). As shown in the example text in my first post, these numbers/letters are placed everywhere in the text. Thx. |
Hi. What website was it that these number/letter combos came from?
|
Sorry, network problems for 2 days...
Charter, the websites are all websites that I made (so I must be the guilty one here... ;-) ). These websites can be on a Unix or Windows server, local or not, and the problem is still the same. Tell me if you need a URL; I'll try to give you one (not a local one, of course). Thx |
Hi. Looks like it might be related to this report. Search that webpage for "a57" (without quotes) and read from there. Also, this may be of interest too.
If these number/letter combos are in fact chunk encoding size markers, then they may be on their own lines so try the following to remove them. In robot_functions.php, in the phpdigGetUrl function, find: PHP Code:
PHP Code:
|
Hello Charter,
I inserted your code and tried spidering one website, and for now it is working just fine! It's great! Thanks a lot! I did not fully understand the chunk encoding stuff. I'm still French, so it will take me a bit longer to read and understand all the information you pointed me to :D It seems to be a problem between the web pages I've made and how the server returns them... Anyway, thank you again! |
Hi. See how "10 Le livre du Mois" and "1c Miró, un feu dans les ruines" were getting indexed...
Speaking of better, I suppose a routine could be written to loop and parse and convert between hex and dec and find string positions and all that, but there probably won't be any header fields in the trailer, so just avoiding the hexadecimals should do as a quick patch. |
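For anyone curious what the fuller routine mentioned above might look like, here is a minimal sketch of a chunked-body decoder. It is not the patch from this thread and not part of phpdig: the function name decodeChunkedBody is hypothetical, it assumes the raw response body still has its CRLF line endings, it ignores any trailer, and it is written for a modern PHP rather than the PHP 4.2 discussed here. The sample data reuses the two fragments quoted earlier: "10" is hexadecimal for 16, the byte length of "Le livre du Mois", and "1c" is 28, the ISO-8859-1 byte length of "Miró, un feu dans les ruines".

```php
<?php
// Hypothetical helper (not part of phpdig): decodes an HTTP/1.1
// "Transfer-Encoding: chunked" body into the plain payload.
function decodeChunkedBody(string $body): string
{
    $decoded = '';
    $pos = 0;
    $len = strlen($body);

    while ($pos < $len) {
        // The chunk-size line ends at the next CRLF.
        $eol = strpos($body, "\r\n", $pos);
        if ($eol === false) {
            break; // truncated or malformed input: keep what we have
        }

        // Drop optional chunk extensions (";name=value") before parsing.
        $sizeLine = substr($body, $pos, $eol - $pos);
        $sizeField = trim(explode(';', $sizeLine)[0]);
        if ($sizeField === '' || !ctype_xdigit($sizeField)) {
            break; // not a hexadecimal size marker
        }

        $size = (int) hexdec($sizeField);
        if ($size === 0) {
            break; // last chunk reached; any trailer is ignored
        }

        // Copy exactly $size bytes of chunk data.
        $decoded .= substr($body, $eol + 2, $size);

        // Move past the chunk data and the CRLF that follows it.
        $pos = $eol + 2 + $size + 2;
    }

    return $decoded;
}

// Sample body built from the fragments quoted in the thread
// ("\xF3" is the ISO-8859-1 byte for "ó").
$raw = "10\r\nLe livre du Mois\r\n"
     . "1c\r\nMir\xF3, un feu dans les ruines\r\n"
     . "0\r\n\r\n";

$text = decodeChunkedBody($raw);
```

With this approach the size markers never reach the indexed text at all, whereas the quick patch only filters them out afterwards. The quick patch is still the smaller change if the markers reliably sit on their own lines.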