
You Have To Know: An Unfair Unicode Rant

I would not be the first person to be grumpy that you can't use UTF8 on Windows, and I must admit that my desire for that option comes from pure laziness.  Perhaps it comes down to Microsoft not wanting to change their huge code base and me not wanting to change mine, but I must admit that X-Plane's code is still slightly smaller than, um, all of Windows. :-)

It turns out that the problem of dealing with Unicode isn't as bad as it seems - for reasons of pure luck (probably partly coming from X-Plane not being very file-system or text-centric) most of our text-based contact with other systems happens in only three or four translation units, so it's pretty easy to find and trap the bottlenecks.

The ugliest problem turns out to be the C runtime.  For historical reasons X-Plane on Windows is built using Metrowerks CodeWarrior 8.  The bad news is that the C runtime API uses single-byte file paths, and does not support UTF8.  The good news is that they give you the source.

So it looks to me like what we're going to have to do is put converter code in the bottom of the C library to convert 8-bit UTF8 paths (passed in by X-Plane) into UTF16 paths that we can then pass to the "W" variants of the file routines.  Fortunately, since this happens "just in time", we only have to look for Win32 file system calls within the C runtime source, which are already partitioned into a "Win32 specific" section.
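To make that conversion step concrete, here is a minimal, portable sketch of decoding UTF8 and re-encoding it as UTF16. The function name utf8_to_utf16 is made up for illustration; this is not the CodeWarrior runtime's code or what X-Plane actually ships. On Windows you would more likely call MultiByteToWideChar with CP_UTF8 and hand the result to a "W" routine such as _wfopen or CreateFileW.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (an illustration, not actual runtime code):
// decode a UTF-8 byte string and re-encode it as UTF-16.
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;    // decoded code point
        std::size_t extra = 0;   // continuation bytes expected
        std::uint32_t min = 0;   // smallest code point legal for this length

        if (lead < 0x80)                { cp = lead;        extra = 0; min = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }
        else throw std::invalid_argument("bad UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("bad UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, surrogates, and out-of-range values.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8");

        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: UTF-16 needs a surrogate pair (two units).
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Note the last branch: a character outside the BMP becomes two UTF16 units, so the wide string is still a variable-length encoding.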

We're using UTF8 internally because it allows us to not change the size of any internal memory buffers or data structures - if we did, we would have to trace every single case to find which ones get written to a binary file format and convert there (and I can tell you offhand that we have a number of these cases).
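As a rough illustration of the buffer-size point, consider a fixed-size record that gets written straight to disk. The struct names, fields, and sizes below are invented for this sketch and are not X-Plane's real data structures; the offsets in the comments assume a typical platform with a 4-byte int.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical on-disk record (names and sizes are made up).
struct NavRecord {
    char name[32];   // UTF-8 bytes, NUL-terminated; still 32 bytes on disk
    int  id;
};

// The same record after a switch to wide characters: the name field
// grows to 32 * sizeof(wchar_t) bytes, so every binary file that
// embeds this struct would change layout.
struct WideNavRecord {
    wchar_t name[32];
    int     id;
};

// Copy a UTF-8 string into the fixed buffer.
// Returns false (and leaves the record alone) if it does not fit.
bool set_name(NavRecord& r, const char* utf8)
{
    const std::size_t n = std::strlen(utf8);
    if (n >= sizeof r.name)
        return false;
    std::memcpy(r.name, utf8, n + 1);   // include the terminating NUL
    return true;
}
```

With UTF8 the only change is that one character may occupy more than one byte of the existing buffer, so a name with accented letters uses up its 32 bytes faster than an ASCII one.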

Converting the string path to wide characters seems like a loss - trading an annoying but survivable bug (incorrect strings) for a set of worse ones (data loss, file corruption, and crashes).

And that brings me to the grumpy rant part: a lot of the documentation and comments I found online were focused only on the "Microsoft" process, e.g. use TCHAR for all your strings, #define UNICODE to 1, build, clean up the wreckage of having changed the size of a key data type in your entire app, go home happy.

The truth is, the rant isn't about Microsoft and their "churn the whole code base, redefine the world" approach to Unicode (which I am sure was a lot more compelling when the BMP was the entire character set...with wide characters, you get the joy of debugging your entire app combined with the joy of dealing with variable-length characters!) - rather this rant is about the state of documentation and how programmers use it.  And I don't think the documentation writers are to blame.  Besides their having to approach a very hard problem without ever having enough time and budget, I think they're just giving the market what it wants.

Documentation today is built around a combination of lowest-level reference (e.g. dictionary-styled definitions of what a single function's parameters are) and recipes (that is, examples of how to do common tasks).  Low-level reference is of course mandatory, but it's not enough. The gap is in conceptual documentation.  The "why" of programming is under-documented, and from what I can tell, even when the why is documented (in a blog post, jammed into the remarks of a function reference, in an overview to a module), it is often ignored by programmers who are either too busy, too lazy, or too dumb to care.

The idea of programming like this is absurd - to use a library whose design you don't understand is like taking a speech, running it through Google language services, and then giving that speech in a language you don't speak in front of foreign dignitaries.  Sure, they might look at you like they understand most of it, but key things could be going wrong because you don't understand what you mean* and you wouldn't even know.

Something I have learned via the school of hard knocks is the rule of "no squirrelly behavior". That is to say, if your application does something that is surprising and innocuous, it always pays to take the time to understand what is going on (thus rendering the behavior unsurprising).

The alternative is to let the misunderstood behavior be, as "harmless", except that:
  1. Since you don't really know what your app is doing, you won't be understanding what you say for all the code you write from that point on, and
  2. more importantly, usually the harmless behavior is the tip of an iceberg that will sink your app in a way that is both much worse (crash, data corruption) and much harder to debug.
So in an attempt to end my rant and get back to the real work of drinking coffee and cursing:
  • Always take the time to understand why a library is the way it is before using it.
  • Always make sure you understand what your own code really implies.
  • If you have unexpected behavior, fix it early while it's easy to do so.
And my plea to documentation writers is: please tell us why, even if most programmers don't know why they need to know!

* This is part of Scott Meyers' 50th rule from "Effective C++": say what you mean and understand what you say.  It could be the best one sentence of programming advice you will ever get.