unicode(7) Miscellaneous Information Manual unicode(7)
NAME
unicode - universal character set
DESCRIPTION
The international standard ISO/IEC 10646 defines the Universal
Character Set (UCS). UCS contains all characters of all other
character set standards. It also guarantees "round-trip
compatibility"; in other words, conversion tables can be built such
that no information is lost when a string is converted from any other
encoding to UCS and back.
UCS contains the characters required to represent practically all known
languages. This includes not only many languages that use extensions of
the Latin script, but also the following languages and scripts: Greek,
Cyrillic, Hebrew, Arabic, Armenian, Georgian, Chinese, Japanese, and
Korean Han ideographs, as well as the scripts Hiragana, Katakana,
Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic,
Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar,
Sinhala, Thaana, Yi, and many others. Work is under way to add further
scripts such as hieroglyphs and various historic Indo-European
languages, and some artificial scripts such as Tengwar, Cirth, and
Klingon might eventually be included. In addition to the characters for
these languages, UCS contains graphical, typographical, mathematical,
and scientific symbols, such as those used in TeX, PostScript, APL,
MS-DOS, MS-Windows, Macintosh, and OCR, as well as in many text
processing and publishing systems, and more are being added.
The UCS standard (ISO/IEC 10646) describes a 31-bit character set
architecture consisting of 128 24-bit groups, each divided into 256
16-bit planes made up of 256 8-bit rows with 256 column positions, one
for each character. Part 1 of the standard (ISO/IEC 10646-1) defines
the first 65534 code positions (0x0000 to 0xfffd), which form the Basic
Multilingual Plane (BMP), that is plane 0 in group 0. Part 2 of the
standard (ISO/IEC 10646-2) adds characters to group 0 outside the BMP
in several supplementary planes in the range 0x10000 to 0x10ffff.
There are no plans to add characters beyond 0x10ffff to the standard,
therefore of the entire code space, only a small fraction of group 0
will ever be actually used in the foreseeable future. The BMP contains
all characters found in the commonly used other character sets. The
supplemental planes added by ISO/IEC 10646-2 cover only more exotic
characters for special scientific, dictionary printing, publishing
industry, higher-level protocol and enthusiast needs.
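The group/plane/row/column layout described above can be illustrated
with plain bit operations. The following is a minimal sketch, not part
of any library: the type struct ucs_position and the functions
ucs_split() and ucs_in_bmp() are hypothetical names chosen for this
example.

```c
#include <stdint.h>

/* Hypothetical helper type: the four coordinates of a UCS code position. */
struct ucs_position {
    uint8_t group;  /* bits 24-30, values 0..127 */
    uint8_t plane;  /* bits 16-23 */
    uint8_t row;    /* bits 8-15 */
    uint8_t cell;   /* bits 0-7, the column position within the row */
};

/* Split a 31-bit UCS code position into group, plane, row, and cell. */
struct ucs_position ucs_split(uint32_t code)
{
    struct ucs_position p;

    p.group = (uint8_t)((code >> 24) & 0x7f);
    p.plane = (uint8_t)((code >> 16) & 0xff);
    p.row   = (uint8_t)((code >> 8) & 0xff);
    p.cell  = (uint8_t)(code & 0xff);
    return p;
}

/* Return nonzero if the code position lies in plane 0 of group 0 (the BMP). */
int ucs_in_bmp(uint32_t code)
{
    return (code >> 16) == 0;
}
```

For example, ucs_split(0x10fffd) yields group 0, plane 0x10, row 0xff,
and cell 0xfd, while ucs_in_bmp(0x00c4) is nonzero.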
Representing each UCS character as a 2-byte word is referred to as the
UCS-2 form (only for BMP characters), whereas UCS-4 is the
representation of each character by a 4-byte word. In addition, there
exist two encoding forms: UTF-8 for backward compatibility with
ASCII-processing software, and UTF-16 for the backward-compatible
handling of non-BMP characters up to 0x10ffff by UCS-2 software.
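To make the two encoding forms concrete, the following sketch encodes a
single code position as UTF-16 and as UTF-8. The function names
utf16_encode() and utf8_encode() are hypothetical and not part of the C
library. The sketch assumes that values in the surrogate range 0xd800
to 0xdfff and values above 0x10ffff are rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Return nonzero if cp can be encoded (not a surrogate, not above 0x10ffff). */
static int ucs_encodable(uint32_t cp)
{
    return cp <= 0x10ffff && !(cp >= 0xd800 && cp <= 0xdfff);
}

/* Encode cp as UTF-16. Returns the number of 16-bit units (1 or 2),
 * or 0 if cp cannot be encoded. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (!ucs_encodable(cp))
        return 0;
    if (cp < 0x10000) {                 /* BMP: same value as UCS-2 */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xd800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xdc00 | (cp & 0x3ff));  /* low surrogate */
    return 2;
}

/* Encode cp as UTF-8. Returns the number of bytes (1 to 4),
 * or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (!ucs_encodable(cp))
        return 0;
    if (cp < 0x80) {                    /* ASCII is encoded unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    out[0] = (unsigned char)(0xf0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[3] = (unsigned char)(0x80 | (cp & 0x3f));
    return 4;
}
```

With these helpers, 0x00c4 becomes the single UTF-16 unit 0x00c4 and
the UTF-8 bytes 0xc3 0x84, while 0x10000 becomes the UTF-16 surrogate
pair 0xd800 0xdc00.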
The UCS characters 0x0000 to 0x007f are identical to those of the
classic US-ASCII character set and the characters in the range 0x0000
to 0x00ff are identical to those in ISO/IEC 8859-1 (Latin-1).
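One practical consequence is that converting ISO/IEC 8859-1 text to UCS
needs no table: every byte value is already the UCS code value. A
minimal sketch, where latin1_to_ucs4() and ucs_is_ascii() are
hypothetical names used only for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n Latin-1 bytes to UCS-4 code values by zero-extension.
 * The caller must provide room for n values in out. */
void latin1_to_ucs4(const unsigned char *in, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
}

/* Return nonzero if the UCS code value is also a US-ASCII character. */
int ucs_is_ascii(uint32_t code)
{
    return code <= 0x7f;
}
```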
Combining characters
Some code points in UCS have been assigned to combining characters.
These are similar to the nonspacing accent keys on a typewriter. A
combining character modifies the preceding character. The most
important accented characters have codes of their own in UCS; however,
the combining character mechanism allows adding any diacritical mark to
any character. A combining character always follows the character that
it modifies. For example, the German character Umlaut-A ("Latin capital
letter A with diaeresis") can be represented either by the UCS code
0x00c4 or, alternatively, as the combination of a normal capital A
followed by a combining diaeresis: 0x0041 0x0308.
Combining characters are essential, for instance, for encoding the Thai
script, for mathematical typesetting, and for users of the
International Phonetic Alphabet.
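The following sketch shows why the two representations of Umlaut-A
differ in length but not in the number of visible characters. It is a
deliberate simplification: it treats only the range 0x0300 to 0x036f
(the Combining Diacritical Marks block, which contains 0x0308) as
combining, whereas UCS defines combining characters in other ranges as
well. Both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified test: nonzero only for the Combining Diacritical Marks
 * block. Real code needs the full Unicode character database. */
int is_combining_diacritic(uint32_t code)
{
    return code >= 0x0300 && code <= 0x036f;
}

/* Count the code values in s[0..n-1] that are not combining marks,
 * that is, the characters that start a new combined character. */
size_t count_base_characters(const uint32_t *s, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (!is_combining_diacritic(s[i]))
            count++;
    return count;
}
```

Under this assumption, both the one-element sequence 0x00c4 and the
two-element sequence 0x0041 0x0308 contain exactly one base character.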
Implementation levels
As not all systems are expected to support advanced mechanisms like
combining characters, ISO/IEC 10646-1 specifies the following three
implementation levels of UCS:
Level 1 Combining characters and Hangul Jamo characters (a special,
more complicated encoding of the Korean script, where individual
symbols are given as sequences of two or three characters) are
not supported.
Level 2 Like level 1, except that some combining characters are allowed
(for example, for Thai, Lao, Hebrew, Arabic, Devanagari, and
Malayalam).
Level 3 All UCS characters are supported.
The Unicode 3.0 Standard published by the Unicode Consortium contains
exactly the UCS Basic Multilingual Plane at implementation level 3, as
described in ISO/IEC 10646-1:2000. Unicode 3.1 added the supplemental
planes of ISO/IEC 10646-2. The Unicode standard and technical reports
published by the Unicode Consortium provide much additional information
on the semantics and recommended usages of various characters. They
provide guidelines and algorithms for editing, sorting, comparing,
normalizing, converting, and displaying Unicode strings.
Unicode under Linux
Under GNU/Linux, the C type wchar_t is defined as a 32-bit integer. The
C library always interprets its values as UCS code values (in all
locales), a convention that the GNU C library signals to applications
by defining the constant __STDC_ISO_10646__, as specified in the ISO
C99 standard.
UCS/Unicode can be used just like ASCII in input/output streams,
terminal communication, plain text files, filenames, and environment
variables by means of the ASCII-compatible multibyte encoding UTF-8. To
use UTF-8 as the character encoding for all applications, a suitable
locale has to be selected via environment variables (for example,
"LANG=en_GB.UTF-8").
The function nl_langinfo(CODESET) returns the name of the selected
encoding. Library functions such as wctomb(3) and mbsrtowcs(3) can be
used to convert between the internal wchar_t type and the encoding used
by the system. The function wcwidth(3) tells by how many positions
(0-2) the cursor advances when a character is output.
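A short sketch of how an application might query these facilities
follows. The helper names selected_codeset(), locale_is_utf8(),
wide_to_multibyte(), and wchar_holds_ucs() are hypothetical; the library
calls they wrap (setlocale(3), nl_langinfo(3), wctomb(3)) are the
standard ones. The comparison against the string "UTF-8" assumes the
codeset name reported by the GNU C library.

```c
#include <langinfo.h>
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Select the locale from the environment and return its codeset name.
 * If the environment names an unavailable locale, the "C" locale stays
 * in effect and its codeset is reported instead. */
const char *selected_codeset(void)
{
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}

/* Return nonzero if the selected codeset is named "UTF-8". */
int locale_is_utf8(void)
{
    return strcmp(selected_codeset(), "UTF-8") == 0;
}

/* Convert one wide character to the multibyte encoding of the current
 * locale. Returns the number of bytes written to buf, or -1 if wc has
 * no representation in that encoding. */
int wide_to_multibyte(wchar_t wc, char buf[MB_LEN_MAX])
{
    wctomb(NULL, 0);            /* reset the conversion state */
    return wctomb(buf, wc);
}

/* Return nonzero if the C library promises that wchar_t holds UCS values. */
int wchar_holds_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

In a UTF-8 locale, wide_to_multibyte() would be expected to produce two
bytes for the wide character 0x00c4; in the "C" locale the same call
fails with -1 because that character is not representable there.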
Private Use Areas (PUA)
In the Basic Multilingual Plane, the range 0xe000 to 0xf8ff will never
be assigned to any characters by the standard and is reserved for
private usage. For the Linux community, this private area has been
subdivided further into the range 0xe000 to 0xefff which can be used
individually by any end-user and the Linux zone in the range 0xf000 to
0xf8ff where extensions are coordinated among all Linux users. The
registry of the characters assigned to the Linux zone is maintained by
LANANA and the registry itself is Documentation/admin-guide/unicode.rst
in the Linux kernel sources (or Documentation/unicode.txt before Linux
4.10).
Two other planes are reserved for private usage, plane 15
(Supplementary Private Use Area-A, range 0xf0000 to 0xffffd) and plane
16 (Supplementary Private Use Area-B, range 0x100000 to 0x10fffd).
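The ranges listed above can be checked with a simple classifier. This is
a sketch only; the enumeration pua_zone and the function pua_classify()
are hypothetical names, and the range boundaries are taken directly from
the text above.

```c
#include <stdint.h>

/* Hypothetical labels for the private use ranges described above. */
enum pua_zone {
    PUA_NONE,       /* not a private use code position */
    PUA_END_USER,   /* 0xe000 to 0xefff, free for any end-user */
    PUA_LINUX,      /* 0xf000 to 0xf8ff, the coordinated Linux zone */
    PUA_PLANE_15,   /* Supplementary Private Use Area-A */
    PUA_PLANE_16    /* Supplementary Private Use Area-B */
};

/* Classify a UCS code position by the private use range it falls in. */
enum pua_zone pua_classify(uint32_t code)
{
    if (code >= 0xe000 && code <= 0xefff)
        return PUA_END_USER;
    if (code >= 0xf000 && code <= 0xf8ff)
        return PUA_LINUX;
    if (code >= 0xf0000 && code <= 0xffffd)
        return PUA_PLANE_15;
    if (code >= 0x100000 && code <= 0x10fffd)
        return PUA_PLANE_16;
    return PUA_NONE;
}
```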
Literature
o Information technology -- Universal Multiple-Octet Coded Character
Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane.
International Standard ISO/IEC 10646-1, International Organization
for Standardization, Geneva, 2000.
This is the official specification of UCS. Available from
.
o The Unicode Standard, Version 3.0. The Unicode Consortium,
Addison-Wesley, Reading, MA, 2000, ISBN 0-201-61633-5.
o S. Harbison, G. Steele. C: A Reference Manual. Fourth edition,
Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
A good reference book about the C programming language. The fourth
edition also covers the 1994 Amendment 1 to the ISO C90 standard,
which adds a large number of library functions for handling wide and
multibyte character encodings, but it does not yet cover ISO C99,
which further improved support for these encodings.
o Unicode Technical Reports.
o Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux.
o Bruno Haible: Unicode HOWTO.
SEE ALSO
locale(1), setlocale(3), charsets(7), utf-8(7)
TRANSLATION
The Czech translation of this manual page was created by Jiri Pavlovsky
and Pavel Heimlich
This translation is free documentation; see the GNU General Public
License Version 3 or
later for the copyright conditions. There is NO LIABILITY.
If you find any errors in the translation of this manual page, send an
e-mail to .
Linux man-pages 6.9.1 May 2, 2024 unicode(7)