Elixir unicode
Posted on: November 7, 2016

This post was adapted from a talk called "String Theory", which I co-presented with James Edward Gray II at Elixir & Phoenix Conf 2016. I originally posted it on the Big Nerd Ranch blog. My posts on Elixir and IO Lists (here and here) were also part of that talk.

You may have heard that Elixir has great Unicode support. This makes it a great language for distributed, concurrent, fault-tolerant apps that send poo emoji! 💩

Specifically, Elixir passes all the checks suggested in The String Type is Broken. The article says that most languages fail at least some of its tests, and mentions C#, C++, Java, JavaScript and Perl as falling short (it doesn't specify which versions). But here I'll compare the languages I use most: Elixir (version 1.3.2), Ruby (version 2.4.0-preview1) and JavaScript (run in V8 version 4.6.85.31). (By the way, the test descriptions use terms like "codepoints" and "normalized". I'll explain those later.)

The checks include:

Reverse of "noël" (e with accent is two codepoints) is "lëon"
Substring after the first character of "□□" is "□"
"baffle" ("baffle" with ligature - "ffl" as a single code point) upcased should be "BAFFLE"
"noël" (this time the e with accent is one codepoint) should equal "noël" if normalized

For the normalization check, both Elixir and Ruby pass:

Elixir: String.equivalent?("noël", "noël") returns true
Ruby: "noël".unicode_normalize == "noël".unicode_normalize returns true

OK, but how does Elixir support Unicode so well? I'm glad you asked! (Ssssh, pretend you asked.) To find out, we need to explore the concepts behind Unicode.

Unicode is pretty awesome, but unfortunately, my first exposure to it was "broken characters on the web."

(Image from Zazzle)

To understand Unicode, let's talk first about ASCII, which is what English-speaking Americans like me might think of as "plain old text." Running man ascii on my machine prints the full ASCII table.

ASCII is just a mapping from characters to numbers. It's an agreement that capital A can be represented by the number 65, and so on. (Why 65? There are reasons for the numeric choices.) The number assigned to a character is called its "codepoint."

To "encode" ASCII (that is, to represent it in a way that can be stored or transmitted) is simple. You just convert the codepoint to base 2 and pad it with zeros up to a full 8-bit byte. Here's how to do that in Elixir:

    base_2 = fn(i) -> Integer.to_string(i, 2) end

    # A ? gives us the codepoint
    ?a == 97

    # Convert to base 2, then pad with zeros to a full 8-bit byte
    ?a |> base_2.() |> String.pad_leading(8, "0")
    # => "01100001"

Since there are only 128 ASCII characters, their actual data is never more than 7 bits long, hence the leading 0 when we encode 'a'.

And that's fine as far as it goes. But we want to be able to type more than just these characters. And this Han character that means "to castrate a fowl." And sometimes we want to type more than just words. We want to type emoji for laughing, and crying, and kissing, and being upside down, and having dollars in our mouths.

Unicode lets us type anything in all of human language. In practice, Unicode is made by a standards body, so it's a political process, and some people say that their language isn't getting a fair shake.

For example, in an article called I Can Text You A Pile of Poo, But I Can’t Write My Name, Aditya Mukerjee explains that Bengali, with about 200 million native speakers (more than Russian), can't always be properly typed on a computer.

Similarly, through something called the "Han unification," people who type Chinese, Japanese and Korean have been told, "hey listen, y'all are gonna have to share some of your characters to save space." Or at least, that's how some of them interpret it.

(Image from The Sorry State of Japanese on the Internet)

There is an article on the Unicode site explaining the linguistic, historical and technical rationale.
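
A few of the checks discussed in this post, plus the ASCII encoding step, can be tried out in Elixir. This is only a minimal sketch, not code from the original talk: the module name UnicodeChecks and the function ascii_bits are hypothetical, and the \u escapes spell out the two forms of "noël" explicitly so the example does not depend on how the text was typed.

```elixir
defmodule UnicodeChecks do
  # Hypothetical helper module for illustration; not part of the original post.

  # Encode an ASCII codepoint as a string of 8 bits:
  # convert to base 2, then left-pad with zeros to a full byte.
  def ascii_bits(codepoint) when codepoint in 0..127 do
    codepoint
    |> Integer.to_string(2)
    |> String.pad_leading(8, "0")
  end
end

# "noël" written as e + combining diaeresis (two codepoints for the ë)
decomposed = "noe\u0308l"
# "noël" written with the precomposed ë (one codepoint)
composed = "no\u00EBl"

# ?a is the codepoint 97
IO.inspect(UnicodeChecks.ascii_bits(?a))
# => "01100001"

# String.reverse works on graphemes, so the accent stays on the e
IO.inspect(String.reverse(decomposed) == "le\u0308on")
# => true

# U+FB04 is the "ffl" ligature as a single codepoint
IO.inspect(String.upcase("ba\uFB04e"))
# => "BAFFLE"

# The two spellings compare as equal once normalized
IO.inspect(String.equivalent?(decomposed, composed))
# => true
```

The guard on ascii_bits restricts it to the 128 ASCII codepoints, which is why the first bit of the result is always 0.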