windows-1251 encoding


jvalej
12-08-2003, 12:15 AM
Hello all!

I would like to configure PhpDig, so that it can search pages with windows-1251 (cyrillic) encoding.

I have read two similar threads on encoding in this forum:

ISO-8859-5 (http://www.phpdig.net/showthread.php?s=&threadid=93)
ISO-8859-7 (http://www.phpdig.net/showthread.php?s=&threadid=135)

The ISO-8859-7 thread is quite extensive, but to tell the truth, I still have little idea how to add windows-1251 encoding support to PhpDig... :bang:

I have found a couple of pages on windows-1251 on these sites:

http://www.sensi.org/~alec/locale/other/win1251.html
http://www.cs.susu.ac.ru/RS6000/tbcp1251.html

Could anyone please tell me which characters to add, and how, to:

$phpdig_string_subst['windows-1251']

and

$phpdig_words_chars['windows-1251']


Thank you very much!!!

Charter
12-08-2003, 02:48 PM
Windows-1251 Characters (note A0 is a space):

A0-AF _ Ў ў Ј ¤ Ґ ¦ § Ё © Є « ¬ _ ® Ї
B0-BF ° ± І і ґ µ ¶ · ё № є » ј Ѕ ѕ ї
C0-CF А Б В Г Д Е Ж З И Й К Л М Н О П
D0-DF Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
E0-EF а б в г д е ж з и й к л м н о п
F0-FF р с т у ф х ц ч ш щ ъ ы ь э ю я

The same byte values displayed as ISO-8859-1 (Latin-1) characters (note A0 is a space):

A0-AF _ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ _ ® ¯
B0-BF ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C0-CF À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D0-DF Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E0-EF à á â ã ä å æ ç è é ê ë ì í î ï
F0-FF ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Then complete this table (I had to 'code' it to keep spacing):

Latin Cyrillic Hex Latin-1
----- -------- --- -------
A A C0 À
B Б C1 Á
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
a а E0 à
b б E1 á
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z

I'm not very familiar with Cyrillic so I'm not sure if the entries in the above table are correct Latin to Cyrillic mappings. There might be characters that don't map, and I'm not sure what to do with those characters.
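
To see why the two tables above line up, it may help to decode a few raw byte values under both encodings. The sketch below is only an illustration, assuming PHP 7 or later with the iconv extension available; show_byte is a made-up helper name, not part of PhpDig.

```php
<?php
// Decode one raw byte under two single-byte encodings and return both
// results as UTF-8 strings. Illustrative helper only, not PhpDig code.
function show_byte(int $byte): array {
    $raw = chr($byte);
    return [
        'cp1251' => iconv('WINDOWS-1251', 'UTF-8', $raw),
        'latin1' => iconv('ISO-8859-1', 'UTF-8', $raw),
    ];
}

// Print hex value, windows-1251 character, Latin-1 character.
foreach ([0xC0, 0xC1, 0xE0, 0xFF] as $b) {
    $r = show_byte($b);
    printf("%02X  %s  %s\n", $b, $r['cp1251'], $r['latin1']);
}
```

The point is that the byte stays the same in both cases; only the label shown for it changes with the encoding used to display it.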

Charter
12-30-2003, 04:32 AM
Hi. Here's a new approach/workaround for use with PhpDig 1.6.5.

In the config.php file set the following:

define('PHPDIG_ENCODING','windows-1251');

// give functions something trivial to do
$phpdig_string_subst['windows-1251'] = 'Q:Q,q:q';

// remove word wrapping in the below line
$phpdig_words_chars['windows-1251'] = '[:alnum:]ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';

In addition, the robot_functions.php file contains a phpdigIndexFile function.

In the phpdigIndexFile function replace:

global $common_words,$relative_script_path,$s_yes,$s_no,$br;

with the following:

global $phpdig_words_chars,$common_words,$relative_script_path,$s_yes,$s_no,$br;

Also, in the phpdigIndexFile function replace:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^[0-9a-zßðþ]',$key))

with the following:

if (strlen($key) > SMALL_WORDS_SIZE and strlen($key) <= MAX_WORDS_SIZE and !isset($common_words[$key]) and ereg('^['.$phpdig_words_chars[PHPDIG_ENCODING].']',$key))

Remember to remove any "word" wrapping in the above code and use PhpDig 1.6.5 if not used already.

You will also need to reindex from scratch for the changes to take effect.

Please let me know how this method works for you.
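
For reference, here is roughly what the modified condition checks, written as a standalone sketch. It is not PhpDig's code: is_indexable is a hypothetical name, the two size constants are assumed values, the character list is written as a byte range instead of literal characters, and preg_match stands in for ereg (which was removed in PHP 7).

```php
<?php
// Assumed values for illustration; the real ones live in config.php.
const SMALL_WORDS_SIZE = 2;
const MAX_WORDS_SIZE   = 30;

// Bytes C0-FF cover the windows-1251 letters from the tables earlier in the thread.
// Single quotes keep \xC0-\xFF literal so the regex engine reads it as a byte range.
$words_chars = '[:alnum:]\xC0-\xFF';

// Hypothetical helper mirroring the four tests in the replaced "if" line.
function is_indexable(string $key, string $words_chars, array $common_words): bool {
    $len = strlen($key); // byte length, which equals character count in windows-1251
    return $len > SMALL_WORDS_SIZE
        && $len <= MAX_WORDS_SIZE
        && !isset($common_words[$key])
        && preg_match('/^[' . $words_chars . ']/', $key) === 1;
}

$common = ['the' => 1];
var_dump(is_indexable("\xF2\xE0\xED\xF2\xF0\xE0", $words_chars, $common)); // six Cyrillic bytes: true
var_dump(is_indexable("\xEE\xF8", $words_chars, $common));                 // only two bytes: false
var_dump(is_indexable("the", $words_chars, $common));                      // common word: false
var_dump(is_indexable("-abc", $words_chars, $common));                     // bad first character: false
```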

jvalej
01-02-2004, 07:18 AM
Happy New Year! :)

Thank you very much for your help! The windows-1251 solution that you posted works!

But now I have 2 other questions:


For example, if I search for this 3-letter word (search link below), it is not highlighted in the results:

http://oasiswithin.net/search/search.php?browse=1&query_string=%EE%F8%EE&limite=10&option=start&lim_start=0

My "config.php" contains the following settings:

define('SMALL_WORDS_SIZE',2);


And if I search for this word:

http://oasiswithin.net/search/search.php?browse=1&query_string=%F2%E0%ED%F2%F0%E0&limite=10&option=start&lim_start=0

result number 2 also contains words from the navigation menu, even though that section should be excluded, since I have the following settings in "config.php":

define('PHPDIG_EXCLUDE_COMMENT','<!-- *************************************');
define('PHPDIG_INCLUDE_COMMENT','************************************** -->');

The content that the search engine should index is located between these 2 comment lines. Or maybe I have misunderstood?

Also, the navigation section is not part of the content file being indexed; it is pulled into that file by a PHP include call from an external .HTML file. Maybe that has something to do with it?


Thank you! :)

Charter
01-02-2004, 08:51 AM
Hi. For the first part, it looks like the highlighting isn't matching because of letter case. For this, try the following:

In the config.php file, replace:

// give functions something trivial to do
$phpdig_string_subst['windows-1251'] = 'Q:Q,q:q';

// remove word wrapping in the below line
$phpdig_words_chars['windows-1251'] = '[:alnum:]ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';

with the following:

// remove word wrapping in the below line
$phpdig_string_subst['windows-1251'] = 'À:à,Á:á,Â:â,Ã:ã,Ä:ä,Å:å,Æ:æ,Ç:ç,È:è,É:é,Ê:ê,Ë:ë,Ì:ì,Í:í,Î:î,Ï:ï,Ð:ð,Ñ:ñ,Ò:ò,Ó:ó,Ô:ô,Õ:õ,Ö:ö,×:÷,Ø:ø,Ù:ù,Ú:ú,Û:û,Ü:ü,Ý:ý,Þ:þ,ß:ÿ';

// remove word wrapping in the below line
$phpdig_words_chars['windows-1251'] = '[:alnum:]ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ';
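
Since long strings like these are easy to damage when pasted through a forum, one option might be to build them from byte values instead. This is only a sketch: build_cp1251_subst and build_cp1251_words_chars are hypothetical helper names, not PhpDig functions, and it assumes PhpDig compares these settings as raw windows-1251 bytes.

```php
<?php
// Build "UPPER:lower" pairs for bytes C0-DF, each paired with the byte 0x20 higher
// (E0-FF), which is how upper and lower case line up in the tables earlier in the thread.
function build_cp1251_subst(): string {
    $pairs = [];
    for ($upper = 0xC0; $upper <= 0xDF; $upper++) {
        $pairs[] = chr($upper) . ':' . chr($upper + 0x20);
    }
    return implode(',', $pairs);
}

// Build the word-character list: [:alnum:] followed by every byte from C0 to FF.
function build_cp1251_words_chars(): string {
    $chars = '';
    for ($byte = 0xC0; $byte <= 0xFF; $byte++) {
        $chars .= chr($byte);
    }
    return '[:alnum:]' . $chars;
}

// Possible use in config.php (untested against PhpDig itself):
// $phpdig_string_subst['windows-1251'] = build_cp1251_subst();
// $phpdig_words_chars['windows-1251']  = build_cp1251_words_chars();
echo bin2hex(substr(build_cp1251_subst(), 0, 3)), "\n"; // hex dump of the first pair
```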

For the second part, with the definitions given, the PhpDig exclude/include comments work as shown below. Each comment must be on its own line:

define('PHPDIG_EXCLUDE_COMMENT','<!-- *************************************');
define('PHPDIG_INCLUDE_COMMENT','************************************** -->');


<!-- *************************************
the content
to exclude
goes here
************************************** -->

If you look at the HTML source for result two (http://oasiswithin.net/lib/osho_tntreie.php), the PhpDig exclude/include comments do not surround the navigational menu so that change will need to be made and then a reindex done.
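
As a rough picture of the behavior described above, here is a small line-by-line filter. It is a sketch under assumptions, not PhpDig's actual implementation: strip_excluded is a hypothetical name, and the rule that each marker must fill a whole line is taken from the explanation above.

```php
<?php
// Drop every line between the exclude marker and the include marker.
// A marker only counts when it is the whole line (ignoring surrounding whitespace).
function strip_excluded(string $html, string $exclude, string $include): string {
    $keep = true;
    $out = [];
    foreach (preg_split('/\r\n|\r|\n/', $html) as $line) {
        $trimmed = trim($line);
        if ($trimmed === $exclude) { $keep = false; continue; }
        if ($trimmed === $include) { $keep = true;  continue; }
        if ($keep) {
            $out[] = $line;
        }
    }
    return implode("\n", $out);
}

$exclude = '<!-- *************************************';
$include = '************************************** -->';

// The menu line sits between the two markers, so it is left out of the result.
$page = "indexed text\n" . $exclude . "\nmenu item\n" . $include . "\nmore indexed text";
echo strip_excluded($page, $exclude, $include), "\n";
```

In this sketch a marker that shares a line with other text is ignored, which matches the "own line" requirement stated above.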