What Unicode Is
Unicode is a universal character encoding standard that assigns a unique number (a codepoint) to each character it covers, across writing systems such as Latin, Cyrillic, Arabic, CJK, Devanagari, and dozens more, plus symbols, emoji, mathematical notation, and historical scripts. As of Unicode 16.0 (2024), the standard defines 154,998 characters across 168 scripts. UTF-8, the dominant encoding on the web, represents each Unicode codepoint in 1 to 4 bytes.
How Unicode Is Organized
Unicode divides the codepoint space (U+0000–U+10FFFF) into 17 planes, each with 65,536 codepoints:
- Plane 0 (BMP — Basic Multilingual Plane): U+0000 to U+FFFF. Contains almost all modern writing systems. Most text fits entirely in the BMP.
- Plane 1 (SMP — Supplementary Multilingual Plane): U+10000 to U+1FFFF. Historic scripts, music notation, emoji.
- Plane 2 (SIP — Supplementary Ideographic Plane): U+20000 to U+2FFFF. Rare and historic CJK characters.
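- Plane 3 (TIP, Tertiary Ideographic Plane): U+30000 to U+3FFFF. Additional rare CJK ideographs (Extensions G and H).
- Planes 4–13: Unassigned (reserved for future use).
- Plane 14 (SSP, Supplementary Special-purpose Plane): U+E0000 to U+EFFFF. Tag characters and supplementary variation selectors.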
- Planes 15–16: Private use areas — guaranteed never to be assigned by Unicode, free for application-specific use.
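Because every plane spans exactly 0x10000 codepoints, a character's plane number is simply its codepoint shifted right by 16 bits. The following is a minimal Python sketch of that arithmetic; the helper names `plane_of` and `describe` and the `PLANE_NAMES` table are illustrative, not part of any standard API.

```python
# Short labels for the planes named in the list above (illustrative table).
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP",
               15: "private use", 16: "private use"}


def plane_of(ch: str) -> int:
    """Return the plane number (0-16) of a single character."""
    # Each plane holds 0x10000 (65,536) codepoints, so the plane
    # number is everything above the low 16 bits.
    return ord(ch) >> 16


def describe(ch: str) -> str:
    """Format a character's codepoint together with its plane."""
    cp = ord(ch)
    plane = plane_of(ch)
    label = PLANE_NAMES.get(plane, "other")
    return f"U+{cp:04X} is in plane {plane} ({label})"


print(describe("A"))           # U+0041 is in plane 0 (BMP)
print(describe("\U0001F600"))  # U+1F600 is in plane 1 (SMP)
print(describe("\U00020000"))  # U+20000 is in plane 2 (SIP)
```

The same shift works for any codepoint up to U+10FFFF, which lands in plane 16.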
Major Unicode Blocks
- Basic Latin (U+0000–U+007F): ASCII — the first 128 characters.
- Latin-1 Supplement (U+0080–U+00FF): Accented letters, punctuation — used by Western European languages.
- Cyrillic (U+0400–U+04FF): Russian, Ukrainian, Bulgarian, Serbian, and other languages written in Cyrillic script, including non-Slavic ones such as Kazakh and Mongolian.
- CJK Unified Ideographs (U+4E00–U+9FFF): Chinese, Japanese, and Korean characters — the largest block in the BMP.
- Arrows (U+2190–U+21FF): The most common arrow directions and styles: → ← ↑ ↓ ↔ ⇒ ⇐. Less common arrows live in the Supplemental Arrows blocks.
- Mathematical Operators (U+2200–U+22FF): ∀ ∃ ∈ ∉ ∑ ∏ ∫ ∮.
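- Emoticons (U+1F600–U+1F64F): Face and gesture emoji, 😀 through 🙏. Many other emoji live in separate blocks, such as Miscellaneous Symbols and Pictographs (U+1F300–U+1F5FF).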
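Python's standard library does not expose block names, so finding a character's block comes down to a range lookup. The sketch below hard-codes only the blocks listed above, under their official Unicode block names; the `BLOCKS` table and the `block_of` helper are hypothetical, and a complete tool would load every range from the Unicode Character Database's Blocks.txt instead.

```python
from typing import Optional

# (first codepoint, last codepoint, block name): only the blocks listed above.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x2190, 0x21FF, "Arrows"),
    (0x2200, 0x22FF, "Mathematical Operators"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]


def block_of(ch: str) -> Optional[str]:
    """Return the block name for ch, or None if it is not in the table."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


for ch in "Aé→∑世😀":
    print(f"U+{ord(ch):04X} {ch} -> {block_of(ch)}")
```

A linear scan is fine for seven ranges; with the full block list, a binary search over the sorted start codepoints would be the usual choice.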
UTF-8 Encoding: How Codepoints Become Bytes
U+0041 ('A') → 0x41 (1 byte)
U+00E9 ('é') → 0xC3 0xA9 (2 bytes)
U+4E16 ('世') → 0xE4 0xB8 0x96 (3 bytes)
U+1F600 ('😀') → 0xF0 0x9F 0x98 0x80 (4 bytes)
BMP characters (U+0000–U+FFFF) encode in 1–3 bytes. Supplementary characters (U+10000 and above) need 4 bytes. The ASCII range (U+0000–U+007F) encodes as a single byte with the same value it has in ASCII, which is why UTF-8 is backwards-compatible with ASCII.
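The byte counts above follow from UTF-8's bit layout: the lead byte's high bits announce the sequence length, and each continuation byte carries 6 payload bits behind a `10` prefix. The hand-written encoder below is a sketch for illustration only (the function name `utf8_encode` is made up here); in practice Python's built-in `str.encode("utf-8")` does this work, and the loop at the end checks the sketch against it.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogates (U+D800-U+DFFF) are not valid in UTF-8.
        raise ValueError(f"not an encodable codepoint: {cp:#x}")
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "Aé世😀":
    encoded = utf8_encode(ord(ch))
    # Cross-check against Python's built-in encoder.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> " + " ".join(f"0x{b:02X}" for b in encoded))
```

The loop prints the same four byte sequences shown in the examples above.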
Browse Unicode Now
Use ToolsVito's Unicode Character Table to browse Unicode by block — Latin, Cyrillic, CJK, symbols, arrows, math, emoji. See codepoint, UTF-8 and UTF-16 encoding, and HTML entity for every character. All in your browser.
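For readers who want the same per-character details in a script, here is a rough Python sketch using only built-ins. It is not how the ToolsVito table is implemented; the `char_info` and `hex_bytes` helpers are hypothetical, the UTF-16 bytes are shown in big-endian order, and the HTML form is a numeric character reference rather than a named entity.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def char_info(ch: str) -> dict:
    """Collect codepoint, UTF-8, UTF-16 and HTML reference for one character."""
    cp = ord(ch)
    return {
        "codepoint": f"U+{cp:04X}",
        "utf8": hex_bytes(ch.encode("utf-8")),
        # Big-endian, no byte order mark; supplementary characters
        # become a surrogate pair (4 bytes).
        "utf16": hex_bytes(ch.encode("utf-16-be")),
        # Numeric character reference, valid for any codepoint.
        "html": f"&#x{cp:X};",
    }


print(char_info("é"))
# {'codepoint': 'U+00E9', 'utf8': 'C3 A9', 'utf16': '00 E9', 'html': '&#xE9;'}
print(char_info("😀"))
# {'codepoint': 'U+1F600', 'utf8': 'F0 9F 98 80', 'utf16': 'D8 3D DE 00', 'html': '&#x1F600;'}
```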