PhpDig.net

UTF-8 Question (http://www.phpdig.net/forum/showthread.php?t=2621)

jackpod 09-21-2006 02:17 PM

UTF-8 Question
 
If my HTML pages are UTF-8, what are the consequences of using PhpDig? It seems to work OK. Am I missing something? Also, is there a plan to support UTF-8 in an upcoming release? It seems to me that this is crucial, as UTF-8 is quite common now.

Thanks, any help would be appreciated.

Dave A 09-22-2006 02:11 PM

UTF-8 has the following properties:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD (0xC2 to 0xF4 under the current four-byte definition), and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
All possible 2^31 UCS codes can be encoded.
UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are only up to three bytes long.
The sorting order of big-endian UCS-4 byte strings is preserved.
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
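
The properties above can be checked with a short Python sketch. It is only an illustration, not anything PhpDig does: the helper names sequence_length and is_continuation are made up for this example, and the lead-byte ranges assume the current definition of UTF-8 (at most four bytes per character) rather than the older six-byte scheme.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes the sequence starting with lead_byte occupies.

    Hypothetical helper; ranges follow the four-byte definition of UTF-8.
    """
    if lead_byte <= 0x7F:
        return 1  # ASCII is encoded as itself
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


def is_continuation(byte: int) -> bool:
    """True for the bytes that may follow a lead byte (0x80 to 0xBF)."""
    return 0x80 <= byte <= 0xBF


# "A", e-acute, euro sign, and one character outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # The lead byte announces the length; the rest are continuation bytes.
    assert sequence_length(encoded[0]) == len(encoded)
    assert all(is_continuation(b) for b in encoded[1:])
    print(f"U+{ord(ch):04X} ->", " ".join(f"{b:02X}" for b in encoded))

# ASCII compatibility: pure ASCII text is byte-identical in both encodings.
ascii_text = "PhpDig"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Sorting by code point and sorting by UTF-8 bytes give the same order.
by_code_point = sorted(samples)
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
assert by_code_point == by_utf8_bytes
```

Because no ASCII byte can appear inside a multibyte character, code that only looks for ASCII delimiters such as spaces or angle brackets tends to keep working on UTF-8 input, which may be why indexing appears to work.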
==============================================
In addition to all that, UTF-8 was introduced to provide an ASCII backwards-compatible multi-byte encoding. The definitions of UTF-8 in UCS and Unicode originally differed slightly: in UCS, UTF-8 sequences of up to 6 bytes were possible, representing characters up to U-7FFFFFFF, while in Unicode only UTF-8 sequences of up to 4 bytes are defined, representing characters up to U-0010FFFF. (The difference was in essence the same as between UCS-4 and UTF-32.)
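
As a rough illustration of the two limits, the following Python sketch compares them. The constant names are invented for this example; Python's own str type follows the Unicode limit, so it cannot represent the larger UCS range at all.

```python
# Hypothetical names for the two upper bounds mentioned in the post.
UCS_LIMIT = 0x7FFFFFFF      # original UCS range, up to 6 bytes in UTF-8
UNICODE_LIMIT = 0x10FFFF    # Unicode range, up to 4 bytes in UTF-8

# The UCS range holds exactly 2**31 code values (0 through 0x7FFFFFFF).
assert UCS_LIMIT + 1 == 2**31

# The highest Unicode code point needs the full four bytes.
last_encoded = chr(UNICODE_LIMIT).encode("utf-8")
print(len(last_encoded), last_encoded.hex())

# Anything above the Unicode limit is rejected by Python itself.
try:
    chr(UNICODE_LIMIT + 1)
    beyond_limit_allowed = True
except ValueError:
    beyond_limit_allowed = False
print("code points above U+10FFFF allowed:", beyond_limit_allowed)
```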

jackpod 09-22-2006 02:20 PM

I am sorry, but that info is a little over my head. Mainly I just wanted answers to my specific questions. Maybe I should revise them slightly. What are the consequences of using PhpDig with UTF-8 files? And is there a plan to support UTF-8 in an upcoming release?

Dave A 09-22-2006 05:03 PM

The only problem you may see is that, in some results, a few characters (such as letters with accents above them) may be displayed as odd letters.
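
One common way such odd letters arise is when UTF-8 bytes are read as a single-byte encoding. The Python sketch below shows the effect under the assumption that the text is treated as ISO-8859-1 somewhere along the way; the thread does not confirm that this is what PhpDig does, and the sample word is made up.

```python
# A word with one accented letter, stored as UTF-8 by the page author.
original = "caf\u00e9"
utf8_bytes = original.encode("utf-8")

# Assumption: the bytes are later interpreted as ISO-8859-1 (Latin-1).
# Each byte of the two-byte sequence becomes its own character.
misread = utf8_bytes.decode("latin-1")
print(repr(original), "is shown as", repr(misread))

# ASCII letters survive unchanged; only the accented letter is affected.
assert misread.startswith("caf")
assert len(misread) == len(original) + 1

# The damage is reversible as long as no bytes were dropped or altered.
repaired = misread.encode("latin-1").decode("utf-8")
assert repaired == original
```

This matches the description above: plain ASCII text comes through intact, while each accented letter turns into two unrelated characters.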

jackpod 09-22-2006 05:10 PM

Thank you so much for your help.


All times are GMT -8.
