Behold!

My name, written in Hindi, written in Unicode:

ऐलन डेिवडसन

Yeah, that’s right: real programmers code in binary (or hexadecimal, if they get lazy). The coolest thing about this is that if I had been more confident, I could have done it without getting help from the Internet. But I wasn’t, so I double-checked stuff online. I’m still not entirely sure I got it right, so if you or someone you know is familiar with the Devanagari alphabet, please double-check my spelling. I have written this so that people who don’t have Hindi vowel-rendering turned on (which I suspect is the majority of my readers) will see this correctly, while anyone who actually has a computer set up to read Hindi/Sanskrit/&c will think the ि and व should be swapped. I’m aware of the problem, but can’t fix it for everyone.
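For anyone who wants to try the "write it in hexadecimal" trick, here is a minimal Python sketch. It assumes the first word of the name is the three code points U+0910, U+0932, U+0928; the list `code_points` and the helper `describe` are names made up for this illustration, not anything from a library.

```python
import unicodedata

# Assumed code points for the first word of the name (hypothetical input,
# typed as hexadecimal instead of with a Hindi keyboard).
code_points = [0x0910, 0x0932, 0x0928]

# Turn each number into its character and join them into one string.
word = "".join(chr(cp) for cp in code_points)


def describe(text):
    """Return (hex code point, official Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for cp, name in describe(word):
    print(cp, name)

# The same text as raw UTF-8 bytes, shown in hexadecimal.
print(word.encode("utf-8").hex(" "))
```

Looking up the official character names this way is a cheap sanity check on spelling: if a code point is off by one, the name printed next to it will not be the letter you meant.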

Unicode is surprisingly intricate: like x86 machine code, UTF-8 (the most common encoding of Unicode, partly because it’s backwards compatible with ASCII) and UTF-16 use variable-length encodings for characters, so that in UTF-8 common character sets like ASCII take up less room than uncommon ones like Braille (which is not as widespread on the Internet as it is elsewhere). Unicode text files may start off with a Byte-Order Mark, which identifies the basic unit size of the encoding along with the byte order in which the text was written; the BOM is optional, and UTF-8 files often leave it out. Unicode actually raises some pretty challenging questions in terms of “alphabetical” sorting and accent placement, and even presents some security problems by opening the way for homograph phishing attacks (for instance, see this Shmoo article on IDN attacks, which mentions that www.pаypal.com can be registered with a Cyrillic first ‘а’ and could be full of scams. Yes, I have written both the URL and the ‘а’ with the actual Cyrillic letter).
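A rough Python sketch of the three ideas in that paragraph: variable-length encoding, Byte-Order Marks, and homograph lookalikes. The helper names (`encoded_sizes`, `scripts`) and the sample strings are invented for this example, and the script check is a crude heuristic based on character names, not how real browsers or registrars screen domain names.

```python
import codecs
import unicodedata


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = {
    "ASCII letter a": "a",
    "Devanagari letter AI": "\u0910",
    "Braille pattern blank": "\u2800",
    "Emoji outside the 16-bit range": "\U0001F600",
}
for label, ch in samples.items():
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label}: {utf8} byte(s) in UTF-8, {utf16} in UTF-16")

# The Byte-Order Marks that the standard library knows about.
print("UTF-8 BOM:    ", codecs.BOM_UTF8.hex(" "))
print("UTF-16 LE BOM:", codecs.BOM_UTF16_LE.hex(" "))
print("UTF-16 BE BOM:", codecs.BOM_UTF16_BE.hex(" "))


def scripts(text):
    """Crude script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


honest = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(scripts(honest))
print(scripts(spoofed))
print(honest == spoofed)  # the strings differ even though they look alike
```

A label that mixes more than one script is not automatically malicious, but it is the kind of thing worth flagging before trusting a link.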

Yes, it’s totally dorky to learn about Unicode, but it’s actually kinda cool at the same time.

3 Comments

  1. So, there’s some cool cursive-y stuff going on in your Hindi; is that because my Unicode engine is smart, and knows that certain glyphs get turned into ligatures [0] for me, or is the font just set up to make it look about right? I know Arabic has the same behavior (words are all mushed together into one glyph-like thing), so it seems like it could be a common problem.

    [0] -> I think of “ligatures” as being when two letters get mushed together into one glyph; do you know if the same term applies?

    • Alan says:

      My example doesn’t use ligatures, since you can still highlight individual characters (except for the “accents,” which isn’t the correct term but is close enough). What you’re seeing is probably caused by kerning (the TeX renderer also uses kerning, fyi; that’s how the (La)TeX logo was created). Nonetheless, in theory a Unicode-aware renderer can create ligatures for digraphs, such as turning a German ‘ſs’ into a ‘ß’. I don’t know whether your computer supports this, however.

    • Alan says:

      Also, in true programmer form I should point out that what you called a “common problem” is actually a feature. :-D These languages are supposed to look like that.
