Unicode, buffering and printf

One weekend I was writing some code for my Mojibake library when I saw strange output in my CLI for U+10C0, a codepoint chosen at random.

./mojibake.sh char $'\U10C0'
Codepoint: U+10C0
Name: GEORGIAN CAPITAL LETTER HAE
Character: Ⴠ
Hex UTF-8: E1 83 80
NFD: Ⴠ
NFD normalization: U+10C0
NFC: Ⴠ
NFC normalization: U+10C0
NFKD: Ⴠ
NFKD normalization: U+10C0
NFKC: Ⴠ
NFKC normalization: U+10C0
Category: [0] Letter, uppercase
Combining: [0] Not reordered
Bidirectional: [1] Left-to-right
Plane: [0] Basic Multilingual Plane
Block: [0] Georgian
Decomposition: [0] None
Decimal: N/A
Digit: N/A
Numeric: N/A
Mirrored: N
Uppercase: N/A
Lowercase: U+2D20
Titlecase: N/A

Uh? The Unicode Georgian block (MJB_BLOCK_GEORGIAN) has ID 36, not 0. Even stranger, the name of the block was correct. Only the ID was wrong. I ran a quick database query and the data was OK:

select id from blocks where name = 'Georgian';
36

Mojibake is still in alpha. Had I overwritten some memory, or done something wrong in the mjb_character_block function?

block->id = (mjb_block)raw_id;

No data truncation or anything similar here. An enum is an integer in C, and sqlite3_column_int returns an int. Did I overwrite something with the strncpy? No. So I added an innocent printf.

printf("ID: %d", block.id);
ID: 0

Nothing changed. Or at least I thought nothing had changed. Then I printed the data from the CLI code:

printf("ID: %d", block.id);
print_value("Block", 1, "[%d] %s", block.id, block.name);
ID: 0
[36] Georgian

Now it was 36? Only because I had added a printf before it? I commented out all the other code and kept just those two output statements.

mjb_codepoint_block block = {0};
bool valid_block = mjb_character_block(character->codepoint, &block);
printf("ID: %d\n", block.id);
printf("ID: %d\n", block.id);
ID: 0
ID: 36

mjb_codepoint_block block = {0};
bool valid_block = mjb_character_block(character->codepoint, &block);
// printf("ID: %d\n", block.id);
printf("ID: %d\n", block.id);
ID: 0

🙂. The first time I called printf the value was zero; the second time it was the right one.

Welcome to the Unicode world

How is it even possible that printf prints one thing and the same printf one line below prints something else? printf must not change the values it receives. But what is printf, really?

printf, like the other output functions of the stdio.h family, writes a string to a FILE*, in this case stdout. That stream has a buffer, which on a terminal is normally flushed whenever you output a newline. Could this buffer be the problem? Well… yes? Forcing an fflush made the problem go away.
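To make the buffering part concrete, here is a minimal sketch of what "the buffer is flushed" means. It does not reproduce the Mojibake bug, and it writes to a scratch file instead of stdout so that the effect can be measured. The names buffering_demo and file_size and the file path passed in are my own assumptions for illustration; only setvbuf, fprintf and fflush are the real stdio.h calls.

```c
#include <stdio.h>

/* Hypothetical helper: bytes currently stored in the file at `path`, or -1. */
long file_size(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        return -1;
    }
    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0) {
        size = ftell(f);
    }
    fclose(f);
    return size;
}

/*
 * Hypothetical demo: write one short line through a fully buffered stream
 * and record how many bytes reached the file before and after fflush.
 * Returns 0 on success, -1 on failure.
 */
int buffering_demo(const char *path, long *before, long *after) {
    /* Binary mode, so the byte count is the same on every platform. */
    FILE *out = fopen(path, "wb");
    if (out == NULL) {
        return -1;
    }
    /* Full buffering: what stdout typically uses when it is not a terminal. */
    if (setvbuf(out, NULL, _IOFBF, BUFSIZ) != 0) {
        fclose(out);
        return -1;
    }
    fprintf(out, "ID: %d\n", 36); /* lands in the buffer, not in the file */
    *before = file_size(path);    /* nothing has been written out yet */
    fflush(out);                  /* force the buffer out */
    *after = file_size(path);     /* now the line is really there */
    fclose(out);
    remove(path);                 /* clean up the scratch file */
    return 0;
}
```

Calling buffering_demo("scratch.txt", &before, &after) from a main function should leave before at 0 and after at 7 on the usual C libraries: the line only becomes visible outside the process once the stream is flushed.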

I dug into the open source macOS internals, but after a while I gave up. I went back to working on my library to see if I could fix some of those mangled Latin-1 èéò characters.

This is just a little story, nothing more.

Welcome to the Unicode world
