NE #29: Programming with emojis

Part 29 of the Nerd Encyclopedia brings some color into the grey everyday life of programming.

Nicky Reinert
3 min read · Feb 6, 2024

Texts consist of sentences, sentences consist of words, and words consist of letters or, more precisely, characters. We all know them: the Latin alphabet, Arabic numerals, but also Cyrillic letters or the sinograms of Chinese script. The computer understands all these symbols thanks to a large table, the “character set”. Unicode has established itself as the quasi-standard in recent years.

Nerd Encyclopedia #29

Notebooks out, time for a lesson!

A character set (German: “Zeichensatz”) describes the set of all available characters. A very small character set, for example, could consist of nothing but the capital letters of the Latin alphabet:

[A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z]

If a numeric position, the so-called “codepoint”, is assigned to each character, one speaks of a coded character set (German: “kodierter Zeichensatz”). Our small example then looks like this:

1 -> A
2 -> B
3 -> C
...
26 -> Z
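
As a small sketch, this is how such a toy coded character set could be implemented in JavaScript (the encode and decode helper names are made up for illustration):

// A toy coded character set: each capital letter gets a codepoint.
const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// encode: character -> codepoint (1-based, as in the table above)
function encode(ch) {
  return alphabet.indexOf(ch) + 1;
}

// decode: codepoint -> character
function decode(cp) {
  return alphabet[cp - 1];
}

console.log(encode("A")); // 1
console.log(encode("Z")); // 26
console.log(decode(3));   // "C"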

Very widespread is the character set UCS (Universal Coded Character Set), better known as Unicode, which is defined in ISO 10646. Theoretically, Unicode comprises a range of 1,114,112 codepoints. These are divided into 17 planes, each spanning a 16-bit range, i.e. 65,536 codepoints per plane. Due to various technical restrictions, only 1,111,998 codepoints can be used effectively. Unicode not only contains the letters we know from A to Z, numbers and the characters of other languages, but nowadays also emojis:

😆🫠😇
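
You can inspect these codepoints directly in JavaScript, where codePointAt returns the Unicode codepoint of the character at a given position:

// Codepoints of a letter and an emoji, printed as hex
console.log("e".codePointAt(0).toString(16));  // "65"    -> U+0065
console.log("😆".codePointAt(0).toString(16)); // "1f606" -> U+1F606

// Emojis live outside the first plane: plane = codepoint >> 16
console.log("😆".codePointAt(0) >> 16); // 1 -> Supplementary Multilingual Plane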

To be able to address each of the more than one million characters, UTF-32 (Unicode Transformation Format) can be used. UTF-32 spends 32 bits (4 bytes) on every character, enough to encode any codepoint in Unicode. This is simple, but also an insane waste of space. The most common German letter, “e”, is encoded in UTF-32 (big-endian, in hex) as follows:

00 00 00 65

In binary:

00000000 00000000 00000000 01100101
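
JavaScript has no built-in UTF-32 encoder, but the fixed-width scheme is simple enough to sketch by hand (utf32be is a hypothetical helper name, not a standard API):

// Encode a single character as 4 bytes, big-endian (UTF-32BE)
function utf32be(ch) {
  const cp = ch.codePointAt(0);
  return [
    (cp >> 24) & 0xff,
    (cp >> 16) & 0xff,
    (cp >> 8) & 0xff,
    cp & 0xff,
  ].map(b => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf32be("e"));  // "00 00 00 65"
console.log(utf32be("😆")); // "00 01 f6 06"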

An address range of 4 bytes to map a character for which 1 byte is enough? To save space, encodings were developed that are slightly more complex to compute but consume less space. Very widely used is UTF-8, a “dynamic” encoding, if you like.

UTF-8 was developed in 1992 by Ken Thompson and Rob Pike, two programmers behind the operating system Plan 9 (named after Ed Wood’s film “Plan 9 from Outer Space”, allegedly the “worst science fiction film of all time”) [WIKI14].

UTF-8 encodes the first range of Unicode with 7 bits; the first, i.e. highest-value, bit is always 0. The “e” is thus encoded as follows:

65

In binary:

01100101

So you occupy only 1 byte instead of 4. If you want to encode more exotic, i.e. higher-numbered characters from Unicode, UTF-8 appends further bytes, in which the highest-value bits are also set. The euro sign, for example, is represented in UTF-8 with 3 bytes:

E2 82 AC

In binary:

11100010 10000010 10101100
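
You can verify both encodings with the standard TextEncoder API, which always produces UTF-8 (toHex is just a small helper for readable output):

const enc = new TextEncoder(); // TextEncoder always encodes to UTF-8

const toHex = bytes =>
  [...bytes].map(b => b.toString(16).padStart(2, "0")).join(" ");

console.log(toHex(enc.encode("e")));  // "65"          -> 1 byte
console.log(toHex(enc.encode("€")));  // "e2 82 ac"    -> 3 bytes
console.log(toHex(enc.encode("😆"))); // "f0 9f 98 86" -> 4 bytes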

Back to the topic

As you can see, to the computer letters are just particular positions in a large table. Since the Unicode table also includes emojis, shouldn’t it be possible to use emojis as identifiers for functions and variables?

Unfortunately, it is not quite that easy. Common programming languages define a limited range of characters that are permitted for such declarations. One way out are emoticons, i.e. characters that can be read as faces. Non-Latin scripts in particular offer plenty of possibilities. In JavaScript, for example, the following is possible:

var ツ = "smile";
var ൠ = "alien";
function ಠ_ಠ() { console.log("Have fun refactoring!"); }
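
Real emojis, on the other hand, are rejected as identifiers, but nothing stops you from using them as string keys, since object property names may contain any character:

// var 😀 = "grin"; // SyntaxError: emojis are not valid identifiers

// As object keys, however, anything goes:
const moods = {
  "😀": () => console.log("grinning"),
  "🫠": () => console.log("melting"),
};

moods["😀"](); // "grinning"
ಠ_ಠ();         // "Have fun refactoring!" (declared above)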

However, there is also a programming language based solely on emojis: Emojicode [EMOJI1]. And this is what “Hello World” looks like in Emojicode:

🏁 🍇
😀 🔤Hello World!🔤❗️
🍉

--

Nicky Reinert

generalist with many interests, developer, photographer, author, gamer