\3 at the right of the searched keyword


dawn
01-13-2005, 11:31 PM
Hello everyone!

First, I'd like to say that PhpDig is excellent. I haven't had any configuration problems, except that on my results page I always get a "\3" to the right of the searched keyword.
Has anyone had this kind of problem? What can I do to get rid of that \3?
I am on Windows XP Home, running PhpDig locally with EasyPHP 1.6 and Apache.

I'd also like an idea of how many pages can be crawled before MySQL gets overloaded. I have 25,000 pages crawled for a 90 MB database. Some types of search are starting to get slow (7 or 8 seconds). Am I reaching the maximum capabilities of MySQL and/or PhpDig?

Thanks in advance for all your answers.

Charter
01-14-2005, 12:33 AM
It looks like an encoding issue. For example, search on "lien concerné" (without quotes) and see the first two results:

Palestine
... l’aborde déjà , cela répond à une question qui a été posée sur le lien\3 av... ...: « médiatisation, » d’une part et « parti-pris » de l’autre cela concerne\3 bien « Vivre ensemble et violence »… ... ...sse se présenter à chacune des 2 parties de façon impartiale. En ce qui concerne\3 la Palestine, çà ... ...aéliens. Donc, les Etats-Unis ne sont pas les mieux placés, et en ce qui concerne\3 l’ONU, on n’arrête pas de constater que l’ONU...
http://www.pingouins.com/Temoignages/Palestine/body_palestine.html 45.1 k

www.infomer.fr
...bus en ce qui concerne\3 la littérature maritime le marin, 7 mars 2003 . br 14/02/03 Merveilles des fonds sous-marins - De la mer Rouge aux trois océans... ...Paris. Tél : 01 44 32 10 70. Fax : 01 40 51 73 16 . 280 pages; 25 euros br Lien\3 en relation : www.oceano.org 10/01/03 Thaiti et ses archipels - Ce ... ...Tél : 01 43 94 92 88. Fax : 01 43 94 02 45 . 160 pages; 35 euros. br br br Lien\3 en relation : www.anako.com 3/01/03 Lumières d'Oman - Ce livre est le ... ...tionale et de la Recherche, 1 rue Descartes, 75005 Paris. Prix : 40 euros. Lien\3 en relation : www.cths.fr 6/09/02 `Guide de la pêche Ã* pied` ...
http://www.lemarin.fr/2-Pagemarin/PG-livrebord.html 862.8 k

If you go to the http://www.pingouins.com/Temoignages/Palestine/body_palestine.html page and look at the HTML source, you will see that it is encoded as utf-8. However, if you look at the HTML source of the search results page, it is encoded as iso-8859-1.
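A minimal Python sketch of the mismatch described above, assuming nothing about PhpDig's internals: text stored as utf-8 but read as iso-8859-1 turns each accented character into two characters. The sample word is only an illustration.

```python
# Bytes as a utf-8 encoded page would store them.
original = "concerné"
raw = original.encode("utf-8")

# Reading the same bytes as iso-8859-1 yields one character per byte,
# so the two-byte "é" becomes two separate characters.
garbled = raw.decode("iso-8859-1")
print(garbled)

# Reversing the wrong decoding recovers the original text.
restored = garbled.encode("iso-8859-1").decode("utf-8")
print(restored)
```

The round trip only works while no bytes have been lost or replaced along the way, which is one reason to keep a single encoding throughout.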

PhpDig does not support multiple or multi-byte encodings. The chosen encoding applies to all indexed documents and the admin interface. Choose one encoding per installation and stick with it.

Reinstall PhpDig in a test location, and only index documents that are encoded with the same encoding as PHPDIG_ENCODING in the config file, and see if that makes the \3s go away.
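One rough way to pre-check pages before indexing them, sketched in Python. The helper names `declared_charset` and `matches_config` are hypothetical and not part of PhpDig; the sketch only looks at the charset a page declares in its HTML, which may differ from the bytes actually served.

```python
import re

# Mirrors the PHPDIG_ENCODING value from the config file (assumed here).
CONFIGURED_ENCODING = "iso-8859-1"

def declared_charset(html):
    """Return the charset declared in the HTML, lowercased, or None."""
    match = re.search(r'charset\s*=\s*["\']?([\w-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None

def matches_config(html, configured=CONFIGURED_ENCODING):
    """True only when the page declares the configured encoding."""
    return declared_charset(html) == configured.lower()

utf8_page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
latin_page = '<meta charset="ISO-8859-1">'

print(matches_config(utf8_page))
print(matches_config(latin_page))
```

Pages that declare no charset at all return None and are treated as mismatches here, which is a deliberately cautious choice.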

dawn
01-14-2005, 01:48 AM
Thanks for your quick answer!
I installed a second copy of 1.8.6 in a new directory, without changing anything in the files.
As a test, I crawled the PhpDig index page, which is in ISO-8859-1.
The config file is also set to: define('PHPDIG_ENCODING','iso-8859-1');

But I still get those annoying \3s, and also a \1 in the title of the found page; see the attachment for details.

:bang:

I don't see what else to do to fix it; waiting for some help...
Thanks.

Charter
01-14-2005, 09:41 AM
Look in phpdig_functions.php for the phpdigHighlight function and replace the two instances of "\\1<^#_>\\2</_#^>\\3" with '\1<^#_>\2</_#^>\3' and do another search. BTW, please don't use PhpDig on this site. PhpDig is free, but my bandwidth is not free. Thanks.
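For readers unfamiliar with the replacement string, here is a small Python sketch of three-group highlighting. The pattern is hypothetical (PhpDig's real `$ereg` is not shown in this thread), and Python's `re` module stands in for PHP's `eregi_replace`, so this shows the general mechanism only and is not a diagnosis of the problem above.

```python
import re

# Hypothetical pattern: (leading boundary)(keyword)(trailing boundary).
pattern = re.compile(r"(^|\W)(concerne)(\W|$)", re.IGNORECASE)

text = "en ce qui concerne la Palestine"

# \1, \2 and \3 expand to the captured groups, so the characters
# around the keyword are put back on either side of the markers.
highlighted = pattern.sub(r"\1<^#_>\2</_#^>\3", text)
print(highlighted)

# If the backslash before the 3 is itself escaped, the engine emits a
# literal "\3" instead of the third group (in Python's re, at least).
literal = pattern.sub(r"\1<^#_>\2</_#^>\\3", text)
print(literal)
```

In the second call the trailing space captured by group 3 is lost and a visible \3 takes its place, which is one way such a token can reach the output in a regex engine.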

dawn
01-14-2005, 12:13 PM
You almost got it!
Here is what I have on line 154, which gives a results page without the \3:
$string = @eregi_replace($ereg,"\\1<^#_>\\2</_#^>\\3",@eregi_replace($ereg,"\\1<^#_>\\2</_#^>",$string));

My problem is fixed. Thanks a lot. I hope this helps some other users!

All the best.

Charter
01-14-2005, 01:55 PM
Yes, I've seen that "fix" before. But if the last \\3 were a global problem, \3s would show up in the online demo, and they don't. That leads me to believe something else is causing the problem, and this "fix" may or may not really be a fix.