cap-lore.com/cgibin/h.cgi?0: a series of code pages. If you type or paste an exotic character into the box, it takes you to the code page for that character.



vo: Grapheme, phoneme, glyph, sound.

Unicode    Universal Coded Character Set
is a big table that assigns numbers (code points) to characters.
Unicode character
is a fairly fuzzy concept.

Although Unicode typically assigns characters to code points to express the graphemes within a system of writing, the Unicode Standard (section 3.4 D7) does caution:

An abstract character does not necessarily correspond to what a user thinks of as a "character" and should not be confused with a grapheme.


wp/Unicode Universal Coded Character Set
ASCII rob: ASCII code; American Standard Code for Information Interchange
wp/UTF-16 0.01% of web pages; incompatible with ASCII
wp/UTF-8 95% of web pages; "the mandatory encoding for all text" (wp/WHATWG Web Hypertext Application Technology Working Group)
Unicode Transformation Format 8-bit.

variable-width character encoding.


windows-1252: the non-ASCII characters of the character set.


Does length(x) = length(toUpper(x)) hold for every Unicode string x? No,

since Unicode has, among other things, ligature characters such as ﬁ (U+FB01), which doubles in length when uppercased to FI.

So, it is probably better not to assume anything about lengths of Strings after any operation.
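
A small sketch of the length question above, assuming Python's built-in `str.upper()` stands in for toUpper and `len()` counts code points. The helper name `length_preserved` is made up for this note.

```python
def length_preserved(s: str) -> bool:
    """Return True if uppercasing leaves the number of code points unchanged."""
    return len(s) == len(s.upper())


# U+FB01 is the single-code-point "fi" ligature; U+00DF is the German sharp s.
samples = ["abc", "\ufb01", "\u00df"]

for s in samples:
    upper = s.upper()
    # Show the string, its length, and the same for its uppercase form.
    print(repr(s), len(s), "->", repr(upper), len(upper), length_preserved(s))
```

Both the ligature and the sharp s grow from one code point to two here, so the equality fails for them while it holds for plain ASCII.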



UTF-8 was designed for backward compatibility with ASCII

One code extends another if any text in the smaller code is decoded correctly by the larger one.

In particular, the individual characters must be decoded correctly, so

the codes of the smaller one must be contained in the larger one.
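
The containment condition can be checked byte by byte. This is only a sketch: the function name `extends` is made up for this note, it tests single-byte codes only (enough for ASCII, where every character is one byte), and it relies on Python's `ascii`, `utf-8` and `cp1252` codecs.

```python
def extends(smaller: str, larger: str, byte_values) -> bool:
    """True if every single-byte code decodes to the same character under both codecs."""
    for b in byte_values:
        raw = bytes([b])
        try:
            if raw.decode(smaller) != raw.decode(larger):
                return False
        except UnicodeDecodeError:
            # The byte is not a complete valid code in one of the two codecs.
            return False
    return True


ASCII_RANGE = range(0x80)      # the 128 ASCII codes
HIGH_RANGE = range(0x80, 0x100)  # bytes with bit 7 set

print(extends("ascii", "utf-8", ASCII_RANGE))    # UTF-8 extends ASCII
print(extends("ascii", "cp1252", ASCII_RANGE))   # so does windows-1252
print(extends("cp1252", "utf-8", HIGH_RANGE))    # but UTF-8 does not extend windows-1252
```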


e.g.: "this is the degree symbol" is a text in windows-1252 but not in UTF-8, because it is made up entirely of ASCII except for the degree symbol, whose byte code is 0xB0. That byte is not a start byte, but it would have to be one, since it follows a valid character.
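
The degree-symbol case can be reproduced as follows. The sample string (with the degree sign appended) and the helper `is_valid_utf8` are assumptions made for this sketch; `cp1252` is Python's name for windows-1252.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "this is the degree symbol \u00b0"   # U+00B0 is the degree sign

as_cp1252 = text.encode("cp1252")  # degree sign becomes the single byte 0xB0
as_utf8 = text.encode("utf-8")     # degree sign becomes the two bytes 0xC2 0xB0

print(hex(as_cp1252[-1]))          # last byte of the windows-1252 form
print(is_valid_utf8(as_cp1252))    # a lone 0xB0 is a continuation byte, so this fails
print(is_valid_utf8(as_utf8))
```

In the UTF-8 form the same 0xB0 byte does appear, but only after the start byte 0xC2, which is what makes it legal there.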


UTF-8 is a variable-width character encoding: each character takes one or more bytes.

Suppose we start reading an encoded text at an arbitrary byte position: how can we know whether we are reading the first byte of a code tuple? A: the first byte must be distinguishable from the bytes that follow it.

The first byte of a multi-byte tuple has bit 7 and bit 6 both set to 1, and it is the only one that does; the following bytes have bit 7 = 1 and bit 6 = 0. A single-byte (ASCII) character has bit 7 = 0.
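
The bit test above is enough to resynchronise from any byte position. A minimal sketch, assuming the input is valid UTF-8 (it does not reject bytes such as 0xF8..0xFF, which have both top bits set but are never legal start bytes); the names `byte_kind` and `next_boundary` are made up for this note.

```python
def byte_kind(b: int) -> str:
    """Classify a byte by its two top bits."""
    if b & 0b1000_0000 == 0:
        return "single"        # 0xxxxxxx: a one-byte (ASCII) character
    if b & 0b1100_0000 == 0b1000_0000:
        return "continuation"  # 10xxxxxx: second or later byte of a tuple
    return "start"             # 11xxxxxx: first byte of a multi-byte tuple


def next_boundary(data: bytes, pos: int) -> int:
    """Skip continuation bytes and return the index where the next character starts."""
    while pos < len(data) and byte_kind(data[pos]) == "continuation":
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")   # 'a', euro sign (3 bytes: E2 82 AC), 'b'
print([byte_kind(b) for b in data])
print(next_boundary(data, 2))       # landing inside the euro sign, skip to 'b'
```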


Characters inside an HTML document. >>>
HTML source text format. HTML characterset, HTML Encoding.



The first 256 code points of Unicode are the same as the 256 characters of Latin-1 (ISO 8859-1).
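
This can be checked directly with Python's `latin-1` codec. The comparison with `cp1252` (windows-1252) is added here as an aside and is not part of the statement above: the two agree on most bytes but differ in the 0x80..0x9F range.

```python
all_bytes = bytes(range(256))

# Decoding byte N as Latin-1 should give the Unicode character with code point N.
decoded = all_bytes.decode("latin-1")
latin1_matches_unicode = [ord(ch) for ch in decoded] == list(range(256))
print(latin1_matches_unicode)

# windows-1252 is not the same table: byte 0x80 is the euro sign, U+20AC.
euro = b"\x80".decode("cp1252")
print(hex(ord(euro)))
```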


<meta http-equiv="Content-Language" content="it">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">



