Unicode » History » Version 4

Gregg -, 09/09/2009 03:30 AM

h1. Unicode Support

Background:

* See "UTR 17, Unicode Character Encoding Model":http://unicode.org/reports/tr17/ - if you're brave enough to tackle the mysteries of CCSs, CEFs, CESs, etc.
* See also "Sections 3.8, 3.9, and 3.10":http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf of Unicode 5.0 for more punishment.
* See also the "ICU":http://site.icu-project.org/ page for lots of detailed documentation on how Unicode is supposed to work in running software, including discussions of what can possibly go wrong.
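* There are three Unicode encoding forms: UTF-8, UTF-16, and UTF-32. ISO/IEC 10646 also defines UCS-2 (fixed-width 16-bit, so limited to the Basic Multilingual Plane) and UCS-4 (32-bit, equivalent in practice to UTF-32).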
* JSON text must be encoded in Unicode.
* The default encoding form of JSON is UTF-8, which effectively means UTF-8 must be supported, but JSON data can also be delivered in the other two forms (UTF-16 and UTF-32).
* SPARQL syntax is UTF-8 Unicode: "The encoding is always UTF-8 [RFC3629]. Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-F]". In other words, a SPARQL parser must detect and reject input that is not UTF-8. But it isn't clear if a conformant SPARQL parser _must_ accept Unicode expressed with escapes (an ASCII-safe escape notation, loosely comparable to UTF-7).
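The SPARQL point can be sketched in Python as follows. This is only an illustration, not a conformant parser: the name @decode_query@ is hypothetical, and the choices to accept lowercase hex digits and to reject surrogate code points are assumptions rather than anything stated above.

```python
import re

# Matches the two codepoint escape forms quoted above: \uXXXX and \UXXXXXXXX.
# Accepting lowercase hex digits is an assumption of this sketch.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_query(raw: bytes) -> str:
    """Decode query bytes as strict UTF-8, then expand codepoint escapes.

    Raises UnicodeDecodeError if the input is not valid UTF-8, and
    ValueError if an escape names something that is not a usable code point.
    """
    # Step 1: reject anything that is not well-formed UTF-8.
    text = raw.decode("utf-8", errors="strict")

    def expand(match):
        code_point = int(match.group(1) or match.group(2), 16)
        # Assumption: out-of-range values and surrogates are treated as errors.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"invalid code point U+{code_point:X}")
        return chr(code_point)

    # Step 2: replace each escape with the character it denotes.
    return _ESCAPE.sub(expand, text)
```

With this sketch, @decode_query(b"caf\\u00E9")@ and @decode_query("café".encode("utf-8"))@ yield the same string, while a Latin-1 byte such as @b"\xe9"@ is rejected at the decoding step.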
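For the JSON bullets, RFC 4627 describes how a receiver can tell the three encoding forms apart: the first two characters of a JSON text are always ASCII, so the pattern of zero octets in the first four bytes identifies the form. A minimal sketch, assuming no byte order mark is present; the function names are hypothetical:

```python
import json


def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding form of a JSON text from its first four octets."""
    head = data[:4]
    if len(head) < 4:
        return "utf-8"  # too short for the pattern test; fall back to the default
    nulls = tuple(b == 0 for b in head)
    if nulls == (True, True, True, False):
        return "utf-32-be"  # 00 00 00 xx
    if nulls == (False, True, True, True):
        return "utf-32-le"  # xx 00 00 00
    if nulls == (True, False, True, False):
        return "utf-16-be"  # 00 xx 00 xx
    if nulls == (False, True, False, True):
        return "utf-16-le"  # xx 00 xx 00
    return "utf-8"          # xx xx xx xx


def load_json(data: bytes):
    """Decode the bytes with the detected encoding form, then parse."""
    return json.loads(data.decode(detect_json_encoding(data)))
```

Encoding the same document as UTF-8, UTF-16, or UTF-32 (either byte order) and passing the bytes to @load_json@ should give the same parsed value each time.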

Requirements:

* The XML declaration of a result should always explicitly declare the encoding
* Content negotiation (Accept-Charset, Content-Type 'charset' parameter, etc.) should be used to specify encodings and forms
* A SPARQL request whose Accept header specifies JSON must always be answered with results in UTF-8 if no other charset is requested
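One way these requirements could fit together is sketched below. It is an assumption-laden illustration, not a description of any existing server: @SUPPORTED@, @choose_charset@, @json_content_type@, and @xml_declaration@ are hypothetical names, and falling back to UTF-8 when nothing acceptable is offered is a design choice (a stricter server might answer 406 instead).

```python
from typing import Optional

# Hypothetical set of charsets the server is able to produce.
SUPPORTED = ("utf-8", "utf-16", "utf-32")


def choose_charset(accept_charset: Optional[str], default: str = "utf-8") -> str:
    """Pick a response charset from an Accept-Charset header value."""
    if not accept_charset:
        return default  # no preference stated: use UTF-8
    candidates = []
    for item in accept_charset.split(","):
        name, _, params = item.strip().partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q-value as "not acceptable"
        candidates.append((quality, name.strip().lower()))
    # Highest q-value first; sorted() is stable, so ties keep header order.
    for quality, name in sorted(candidates, key=lambda c: -c[0]):
        if quality <= 0:
            continue
        if name == "*":
            return default
        if name in SUPPORTED:
            return name
    return default


def json_content_type(accept_charset: Optional[str]) -> str:
    """Content-Type header value for a JSON result, with an explicit charset."""
    charset = choose_charset(accept_charset)
    return f"application/sparql-results+json; charset={charset}"


def xml_declaration(charset: str) -> str:
    """XML declaration that states the encoding explicitly."""
    return f'<?xml version="1.0" encoding="{charset.upper()}"?>'
```

A request with no Accept-Charset header gets @charset=utf-8@, while @Accept-Charset: utf-16;q=0.9, utf-8;q=0.5@ selects UTF-16.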
 
Other:

* Acceptance and conversion of other encodings for incoming data?
* Collations?
* Date comparisons?
* Other locale-specific logic?