^^Unicode and UTF-8.

cap-lore.com/cgibin/h.cgi?0 a series of code pages. If you type or paste an exotic character into the box, it will take you to the code page.
2015/dark-corners-of-unicode
vo: Grapheme, phoneme, glyph, sound. To be precise.

Unicode Universal Coded Character Set

is a big table that assigns numbers (codepoints) to characters

unicode character

is a fairly fuzzy concept.

Letters and numbers and punctuation are characters.
But so are Braille and frogs and halves of flags.

UTF-8 Unicode Transformation Format – 8-bit

extends the ASCII code: ASCII characters have the same encoding in UTF-8, a single byte with bit 7 = 0

Although Unicode typically assigns characters to code points to express the graphemes within a system of writing, the Unicode Standard (section 3.4 D7) does caution:

An abstract character does not necessarily correspond to what a user thinks of as a "character" and should not be confused with a grapheme.

wp/Unicode	Universal Coded Character Set
ASCII	rob: ASCII code, American Standard Code for Information Interchange
wp/UTF-16	0.01% of web pages; incompatible with ASCII
wp/UTF-8	95% of web pages; "the mandatory encoding for all text" (wp/WHATWG Web Hypertext Application Technology Working Group) Unicode Transformation Format – 8-bit. variable-width character encoding. johndcook/how-utf-8-works
windows-1252	the NON-ASCII characters of the character set.

Does length(x) = length(toUpper(x)) hold for Unicode x? No

since Unicode has, among other things, ligature characters such as ﬁ, whose uppercase form is the two-character sequence FI.

So, it is probably better not to assume anything about lengths of Strings after any operation.
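A minimal sketch of the length question above, in Python. The helper name `length_change` is made up for illustration, and the sharp s (U+00DF) is an extra example that is not in these notes. Lengths are counted in code points, which is what Python's `len` does for `str`.

```python
def length_change(s: str) -> tuple[int, int]:
    """Return (length of s, length of s.upper()), both counted in code points."""
    return len(s), len(s.upper())

# "\ufb01" is the fi ligature from the note above; "\u00df" is the German sharp s.
for sample in ["abc", "\ufb01", "\u00df"]:
    before, after = length_change(sample)
    print(repr(sample), before, "->", repr(sample.upper()), after)
```

Plain ASCII keeps its length, while the ligature goes from one code point to two.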

UTF-8 was designed for backward compatibility with ASCII

UTF-8 extends ASCII

One code extends another if any text written in the smaller code is decoded correctly by the larger one.

In particular, single characters must be decoded correctly, so

the codes of the smaller one must be contained in the larger one.
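The "extends" property can be checked by brute force over the 128 ASCII characters. This is only a sketch: `same_in_utf8` is a hypothetical helper name, and the check relies on Python's built-in `ascii` and `utf-8` codecs.

```python
def same_in_utf8(ch: str) -> bool:
    """True if the ASCII byte of ch is also its UTF-8 encoding and decodes back to ch."""
    raw = ch.encode("ascii")  # raises UnicodeEncodeError if ch is not ASCII
    return raw == ch.encode("utf-8") and raw.decode("utf-8") == ch

# Every ASCII code (0..127) must survive unchanged in the larger code.
print(all(same_in_utf8(chr(code)) for code in range(128)))
```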

UTF-8 encodes every Unicode code point except the surrogates (U+D800 to U+DFFF)

Unicode to UTF-8

First code point	Last code point	Byte 1 76543210	Byte 2 76543210	Byte 3 76543210	Byte 4 76543210
U+0000	U+007F	0xxxxxxx
U+0080	U+07FF	110xxxxx	10xxxxxx
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
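The table can be turned into a small encoder. This is a sketch for study purposes, not a replacement for `str.encode`: the function name `utf8_encode` is made up, and rejecting the surrogate range U+D800 to U+DFFF is an addition that the table itself does not show.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the four rows of the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# One code point per table row, compared against Python's own encoder.
for cp in (0x41, 0xB0, 0x20AC, 0x1F438):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "), utf8_encode(cp) == chr(cp).encode("utf-8"))
```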

UTF-8

ASCII characters have the same encoding in UTF-8: a single byte with bit 7 = 0
non-ASCII characters are encoded as 2, 3 or 4 bytes, none of which is an ASCII byte: they all have bit 7 = 1
it encodes more than 1 million code points: 1,112,064
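Two of the points above can be checked with a few lines. This sketch assumes the usual explanation of the 1,112,064 figure (all code points U+0000 to U+10FFFF minus the 2,048 surrogates reserved for UTF-16), which is not spelled out in these notes. The helper name `has_no_ascii_bytes` is made up.

```python
# Code points U+0000..U+10FFFF, minus the surrogate range U+D800..U+DFFF.
total_code_points = 0x10FFFF + 1
surrogates = 0xDFFF - 0xD800 + 1
print(total_code_points - surrogates)

def has_no_ascii_bytes(ch: str) -> bool:
    """True if every byte of the UTF-8 encoding of ch has bit 7 set."""
    return all(b & 0x80 for b in ch.encode("utf-8"))

# A 2-byte, a 3-byte and a 4-byte character.
print([has_no_ascii_bytes(ch) for ch in ("\u00e9", "\u20ac", "\U0001F438")])
```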

Synchronizing the reading of a UTF-8 stream.

UTF-8 is a variable-length code, i.e. characters are encoded as a tuple of bytes of variable size, from 1 to 4.

Q: If we start reading a sequence of bytes at an arbitrary point, how can we tell where we are inside a code tuple?

A: UTF-8 has a recognizable start byte, i.e. the first byte of the tuple that encodes a character can be told apart from the others. Each byte falls into one of 3 cases:

bit 7 = 0: the tuple is an ASCII character, made of 1 single byte
bit 7 = 1 and bit 6 = 1: start byte of a multi-byte tuple
bit 7 = 1 and bit 6 = 0: continuation byte

The code also provides further information that makes it robust:

the start byte also encodes the length in bytes of the UTF-8 character,
given by its most significant bits:
- 110: 2 bytes for 1 character,
- 1110: 3 bytes for 1 character,
- 11110: 4 bytes for 1 character.
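The byte classes and the length prefix above are enough to resynchronize from any offset. The sketch below uses made-up helper names (`classify`, `sequence_length`, `resync`) and only finds character boundaries. It does not validate that the bytes form well-formed UTF-8.

```python
def classify(byte: int) -> str:
    """Classify one byte by its top bits, as in the three cases above."""
    if byte & 0x80 == 0:
        return "ascii"
    if byte & 0xC0 == 0x80:
        return "continuation"
    return "start"  # bit 7 and bit 6 set (0xF8..0xFF land here too, though never valid)

def sequence_length(first: int) -> int:
    """Length in bytes of the tuple that begins with byte `first`."""
    if first & 0x80 == 0x00:
        return 1
    if first & 0xE0 == 0xC0:
        return 2
    if first & 0xF0 == 0xE0:
        return 3
    if first & 0xF8 == 0xF0:
        return 4
    raise ValueError(f"not a start byte: {first:#04x}")

def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte."""
    while pos < len(data) and classify(data[pos]) == "continuation":
        pos += 1
    return pos

data = "a\u20acb".encode("utf-8")  # 61 | e2 82 ac | 62
start = resync(data, 2)            # index 2 falls in the middle of the 3-byte tuple
print(start, data[start:].decode("utf-8"))
```

Since a tuple has at most 3 continuation bytes, `resync` skips at most 3 bytes of well-formed input before reaching a boundary.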

Characters inside an HTML document. >>>
HTML source text format. HTML characterset, HTML Encoding.

Unicode

The first 256 code points of Unicode (U+0000 to U+00FF) are the same as the 256 characters of Latin-1 (ISO 8859-1).
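A quick way to see this correspondence, assuming Python's built-in `latin-1` codec: each byte value decodes to the code point with the same number. The variable names are illustrative only.

```python
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Byte value n becomes code point U+00nn for every n in 0..255.
print([ord(ch) for ch in decoded] == list(range(256)))
```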

Warning: ASCII ≠ win-1252. UTF-8 does not extend win-1252

e.g. "this ° is the degree symbol" is valid win-1252 text but not valid UTF-8: it is all ASCII except the degree symbol, whose byte code is 0xb0. That byte is a continuation byte, not a start byte, yet a start byte is required there because it follows a complete character.
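The degree-sign case can be reproduced with Python's `cp1252` codec, its name for windows-1252. This is a sketch: the sample sentence is pure ASCII except for the degree sign, and `is_valid_utf8` is a made-up helper.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if raw decodes as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "this \u00b0 is the degree symbol"
raw = text.encode("cp1252")        # the degree sign becomes the single byte 0xb0

print(hex(raw[5]))                 # the byte right after "this "
print(raw.decode("cp1252") == text)
print(is_valid_utf8(raw))
print(text.encode("utf-8")[5:7].hex(" "))  # in UTF-8 the same character takes two bytes
```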