unicode(7)             Miscellaneous Information Manual             unicode(7)

NAME
       unicode - universal character set

DESCRIPTION
       The international standard ISO/IEC 10646 defines the Universal
       Character Set (UCS).  UCS contains all characters of all other
       character set standards.  It also guarantees "round-trip
       compatibility"; in other words, conversion tables can be built such
       that no information is lost when a string is converted from any other
       encoding to UCS and back.

       UCS contains the characters required to represent practically all
       known languages.  This includes many languages that use extensions of
       the Latin script, as well as the following languages and scripts:
       Greek, Cyrillic, Hebrew, Arabic, Armenian, Georgian, the Japanese,
       Chinese, and Korean Han ideographs, Hiragana, Katakana, Hangul,
       Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
       Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic,
       Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar,
       Sinhala, Thaana, Yi, and many others.  Work is in progress to include
       further scripts such as hieroglyphs and various historic
       Indo-European languages, and some artificial scripts such as Tengwar,
       Cirth, and Klingon might eventually be included as well.  In addition
       to the characters for these languages, UCS contains graphical,
       typographical, mathematical, and scientific symbols, such as those
       used in TeX, PostScript, APL, MS-DOS, MS-Windows, Macintosh, and OCR,
       as well as in a steadily growing number of word processing and
       publishing systems.

       The UCS standard (ISO/IEC 10646) describes a 31-bit character set
       architecture consisting of 128 24-bit groups, each divided into 256
       16-bit planes made up of 256 8-bit rows with 256 column positions,
       one for each character.  Part 1 of the standard (ISO/IEC 10646-1)
       defines the first 65534 code positions (0x0000 to 0xfffd), which form
       the Basic Multilingual Plane (BMP), that is plane 0 in group 0.
       Part 2 of the standard (ISO/IEC 10646-2) adds characters to group 0
       outside the BMP in several supplementary planes in the range 0x10000
       to 0x10ffff.  There are no plans to add characters beyond 0x10ffff to
       the standard, so only a small fraction of group 0 of the entire code
       space will actually be used in the foreseeable future.

       The BMP contains all characters found in the other commonly used
       character sets.  The supplementary planes added by ISO/IEC 10646-2
       cover only more exotic characters for special scientific, dictionary
       printing, publishing industry, higher-level protocol, and enthusiast
       needs.

       The representation of each UCS character as a 2-byte word is referred
       to as the UCS-2 form (only for BMP characters), whereas UCS-4 is the
       representation of each character by a 4-byte word.  In addition,
       there are two encoding forms: UTF-8 for backward compatibility with
       ASCII processing software, and UTF-16 for the backward-compatible
       handling of non-BMP characters up to 0x10ffff by UCS-2 software.

       The UCS characters 0x0000 to 0x007f are identical to those of the
       classic US-ASCII character set, and the characters in the range
       0x0000 to 0x00ff are identical to those in ISO/IEC 8859-1 (Latin-1).

   Combining characters
       Some code points in UCS have been assigned to combining characters.
       These are similar to the nonspacing accent keys on a typewriter.  A
       combining character modifies the preceding character.  The most
       important accented characters have codes of their own in UCS, but the
       combining character mechanism allows any diacritical mark to be added
       to any character.  A combining character always follows the character
       that it modifies.  For example, the German character Umlaut-A ("Latin
       capital letter A with diaeresis") can be represented either by the
       precomposed UCS code 0x00c4 or, alternatively, as the combination of
       a normal capital A followed by a combining diaeresis: 0x0041 0x0308.

       Combining characters are essential, for instance, for encoding the
       Thai script, for mathematical typesetting, and for users of the
       International Phonetic Alphabet.
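
       The Umlaut-A example can be made concrete with a minimal sketch in C
       that serializes both spellings as UTF-8.  The helper names
       ucs_to_utf8() and ucs_string_to_utf8() are hypothetical and belong to
       no library.  The bit layout used is the usual UTF-8 scheme for code
       points up to 0x10ffff, and rejecting the surrogate range is an
       assumption about the desired behavior, not something this page
       prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one UCS code point as UTF-8 (hypothetical helper).
 * Writes up to 4 bytes to out and returns the number of bytes written,
 * or 0 if cp is a surrogate (0xd800..0xdfff) or lies beyond 0x10ffff.
 */
size_t ucs_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* US-ASCII range: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp >= 0xd800 && cp <= 0xdfff)       /* reserved for UTF-16 */
        return 0;
    if (cp < 0x10000) {                     /* rest of the BMP: 3 bytes */
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {                   /* supplementary planes: 4 bytes */
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;
}

/*
 * Encode n code points into out, which must have room for 4 * n bytes.
 * Code points rejected by ucs_to_utf8() are skipped.
 * Returns the total number of bytes written.
 */
size_t ucs_string_to_utf8(const uint32_t *cps, size_t n, unsigned char *out)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++)
        len += ucs_to_utf8(cps[i], out + len);
    return len;
}
```

       Under these assumptions, the precomposed form 0x00c4 encodes to the
       two bytes c3 84, while the combining sequence 0x0041 0x0308 encodes
       to the three bytes 41 cc 88.  The two strings display alike but
       compare unequal byte for byte, which is one reason the Unicode
       normalization algorithms exist.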

   Implementation levels
       As not all systems are expected to support advanced mechanisms like
       combining characters, ISO/IEC 10646-1 specifies the following three
       implementation levels of UCS:

       Level 1  Combining characters and Hangul Jamo (a special, more
                complicated encoding of the Korean script, in which
                individual symbols are coded as a sequence of two or three
                characters) are not supported.

       Level 2  Like level 1, except that some combining characters are
                allowed (for example, for Thai, Lao, Hebrew, Arabic,
                Devanagari, and Malayalam).

       Level 3  All UCS characters are allowed.

       The Unicode 3.0 Standard published by the Unicode Consortium contains
       exactly the UCS Basic Multilingual Plane at implementation level 3,
       as described in ISO/IEC 10646-1:2000.  Unicode 3.1 added the
       supplementary planes of ISO/IEC 10646-2.  The Unicode standard and
       technical reports published by the Unicode Consortium provide much
       additional information on the semantics and recommended usages of
       various characters.  They provide guidelines and algorithms for
       editing, sorting, comparing, normalizing, converting, and displaying
       Unicode strings.

   Unicode under Linux
       Under GNU/Linux, the C type wchar_t is defined as a 32-bit integer.
       The C library always interprets its values as UCS code values (in all
       locales), a convention that the GNU C library signals to applications
       by defining the constant __STDC_ISO_10646__, as specified in the ISO
       C99 standard.

       UCS/Unicode can be used just like ASCII in input/output streams,
       terminal communication, plain text files, filenames, and environment
       variables by means of the ASCII-compatible UTF-8 multibyte encoding.
       To use UTF-8 as the character encoding for all applications, a
       suitable locale has to be selected via environment variables (for
       example, "LANG=en_GB.UTF-8").

       The nl_langinfo(CODESET) function returns the name of the selected
       encoding.
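
       The locale selection and the __STDC_ISO_10646__ check can be sketched
       in C as follows.  This is only an illustration: the function names
       wchar_is_ucs() and selected_codeset() are hypothetical, and the result
       depends on which locales are installed on the system.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Returns 1 if the C library promises that wchar_t holds UCS code values. */
int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/*
 * Select the locale locale_name for the whole program and return the name
 * of its character encoding.  An empty string selects the locale from the
 * environment (LANG, LC_ALL, ...).  Returns NULL if the locale cannot be
 * set.  The returned string may be overwritten by later calls to
 * setlocale() or nl_langinfo().
 */
const char *selected_codeset(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

       With LANG=en_GB.UTF-8 in the environment and that locale installed,
       selected_codeset("") would be expected to return "UTF-8".  A program
       can compare the result against "UTF-8" before it emits multibyte
       output.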

       Library functions such as wctomb(3) and mbsrtowcs(3) can be used to
       convert the internal wchar_t type into the encoding used by the
       system and back.  The wcwidth(3) function tells how many positions
       (0-2) the cursor is advanced by the output of a character.

   Private Use Areas (PUA)
       In the Basic Multilingual Plane, the range 0xe000 to 0xf8ff will
       never be assigned to any characters by the standard and is reserved
       for private usage.  For the Linux community, this private area has
       been subdivided further into the range 0xe000 to 0xefff, which can be
       used individually by any end user, and the Linux zone in the range
       0xf000 to 0xf8ff, where extensions are coordinated among all Linux
       users.  The registry of the characters assigned to the Linux zone is
       maintained by LANANA, and the registry itself is
       Documentation/admin-guide/unicode.rst in the Linux kernel sources (or
       Documentation/unicode.txt before Linux 4.10).

       Two other planes are reserved for private usage: plane 15
       (Supplementary Private Use Area-A, range 0xf0000 to 0xffffd) and
       plane 16 (Supplementary Private Use Area-B, range 0x100000 to
       0x10fffd).

   Literature
       o  Information technology -- Universal Multiple-Octet Coded Character
          Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane.
          International Standard ISO/IEC 10646-1, International Organization
          for Standardization, Geneva, 2000.

          This is the official specification of UCS.  Available from .

       o  The Unicode Standard, Version 3.0.  The Unicode Consortium,
          Addison-Wesley, Reading, MA, 2000, ISBN 0-201-61633-5.

       o  S. Harbison, G. Steele.  C: A Reference Manual.  Fourth edition,
          Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.

          A good reference book about the C programming language.  The
          fourth edition also covers the 1994 Amendment 1 to the ISO C90
          standard, which adds many library functions for handling wide and
          multibyte character encodings, but it does not yet cover ISO C99,
          which improved the support for these encodings even further.

       o  Unicode Technical Reports.

       o  Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux.

       o  Bruno Haible: Unicode HOWTO.

SEE ALSO
       locale(1), setlocale(3), charsets(7), utf-8(7)

TRANSLATION
       The Czech translation of this manual page was created by Jiri
       Pavlovsky and Pavel Heimlich.

       This translation is free documentation; see the GNU General Public
       License Version 3 or later for the copyright conditions.  There is NO
       LIABILITY.

       If you find any errors in the translation of this manual page, send
       an e-mail to .

Linux man-pages 6.9.1              2024-05-02                       unicode(7)