Internationalization Support

Background

  • See UTR 17, Unicode Character Encoding Model - if you're brave enough to tackle the mysteries of CCSs, CEFs, CESs, etc.
  • See also Sections 3.8, 9, 10 of Unicode 5 for more punishment.
  • See also the ICU page for lots of detailed documentation on how Unicode is supposed to work in running software, including discussions of what can possibly go wrong.
  • There are three "encoding" forms, UTF-8, UTF-16, and UTF-32; there are also UCS-2 and UCS-4.
  • JSON text must be Unicode.
  • The default encoding form of JSON is UTF-8, which effectively means UTF-8 must be supported, but JSON data can also be delivered in the other two forms (UTF-16 and UTF-32).
  • SPARQL syntax is UTF-8 Unicode: "The encoding is always UTF-8 [RFC3629]. Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-F]". In other words, a SPARQL processor must detect and reject input that is not UTF-8. It is less clear whether a conformant SPARQL parser must accept Unicode expressed with escapes (an ASCII-only notation that serves a purpose similar to UTF-7, but is not UTF-7 itself).
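
The SPARQL bullet above implies two steps on incoming query text: reject anything that is not valid UTF-8, then expand any \uXXXX or \UXXXXXXXX escapes. The following is a minimal sketch of those steps, not taken from any particular SPARQL implementation. The function name decode_query is hypothetical, and accepting lowercase hex digits is a choice made for this sketch.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
# Lowercase hex digits are accepted here as an assumption of this sketch.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_query(raw: bytes) -> str:
    """Decode query bytes as strict UTF-8, then expand codepoint escapes."""
    # errors="strict" raises UnicodeDecodeError on any invalid byte sequence,
    # which is how non-UTF-8 input is detected and rejected.
    text = raw.decode("utf-8", errors="strict")

    def expand(match):
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(expand, text)
```

For example, decode_query(b"SELECT \\u0041") yields "SELECT A", while Latin-1 bytes such as b"caf\xe9" raise UnicodeDecodeError instead of being silently misread.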

Requirements

  • The XML declaration of a result document should always state the encoding explicitly
  • Content negotiation (Accept-Charset, Content-Type 'charset' parameter, etc.) should be used to specify encodings and forms
  • When the Accept header of a SPARQL request specifies JSON, the results must be returned in UTF-8 unless another charset is explicitly requested
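
The requirements above can be sketched as a small helper that picks a charset from an Accept-Charset header, falls back to UTF-8, and emits an XML body whose declaration matches the Content-Type header. This is a simplified illustration only: the function names choose_charset and xml_result are hypothetical, the header parsing handles just the q parameter, and any charset Python has a codec for is treated as supported.

```python
import codecs
import xml.etree.ElementTree as ET


def choose_charset(accept_charset, default="utf-8"):
    """Return the preferred charset Python can encode, else the default."""
    if not accept_charset:
        return default
    candidates = []
    for item in accept_charset.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        params = params.strip()
        quality = 1.0  # a missing q parameter means full preference
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q value as not acceptable
        candidates.append((quality, name))
    # Stable sort: highest q first, header order preserved among ties.
    candidates.sort(key=lambda pair: -pair[0])
    for quality, name in candidates:
        if quality <= 0 or not name:
            continue
        if name == "*":
            return default
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # unknown charset, try the next candidate
        return name
    return default


def xml_result(root, accept_charset=None):
    """Serialize an element with an explicit XML declaration and matching header."""
    charset = choose_charset(accept_charset)
    # xml_declaration=True (Python 3.8+) writes the encoding into the declaration.
    body = ET.tostring(root, encoding=charset, xml_declaration=True)
    headers = {"Content-Type": f"application/sparql-results+xml; charset={charset}"}
    return headers, body
```

Keeping the charset choice in one place means the XML declaration and the Content-Type 'charset' parameter cannot drift apart.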

Calendar support

Resources:

  • Joda-Time: "Joda-Time provides a quality replacement for the Java date and time classes."

Collation support

UTS 10 Unicode Collation Algorithm

ICU Collation documentation

XQuery/XPath collation documentation

SPARQL

See the ORDER BY clause, which uses the "<" operator. The semantics of "<" on strings are defined by the XQuery/XPath fn:compare function.

SPARQL does not define a collation order, though an implementation could apply one to any comparison the specification leaves undefined. Examples of undefined comparisons include:
  • "a" and "a"@en-gb (a simple literal and a literal with a language tag)
  • "a"@en-gb and "b"@en-gb (two literals with language tags)
  • "a" and "a"^^xsd:string (a simple literal and an xsd:string)
Collation is only possible in SPARQL as an extension. The two possible extension points are:
  • Comparisons between language-coded literals.
  • User access (through FILTER, LET, or HAVING) to an implementation of fn:compare.
    Currently, we don't know of any projects that implement fn:compare.
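
To make the second extension point concrete, here is a rough sketch of what an fn:compare-style function could look like. It follows the fn:compare convention of returning -1, 0, or 1 and defaults to codepoint order, which is the XPath default collation. The optional key parameter is a hypothetical stand-in for a real collation (for example ICU sort keys); str.casefold is used below only to show that a different ordering changes the result, and is not an implementation of UTS 10.

```python
from functools import cmp_to_key


def fn_compare(a, b, key=None):
    """Return -1, 0, or 1, mirroring the result convention of fn:compare.

    With key=None, strings are compared by Unicode code point, which is
    how Python compares str values. A key function maps each string to a
    sort key first (a hypothetical hook for a real collation).
    """
    if key is not None:
        a, b = key(a), key(b)
    return (a > b) - (a < b)


# Codepoint order puts uppercase before lowercase and accented letters last.
words = ["z", "\u00e9", "a", "Zebra", "apple"]
codepoint_sorted = sorted(words, key=cmp_to_key(fn_compare))
```

In codepoint order "Zebra" sorts before "apple", whereas fn_compare("Zebra", "apple", key=str.casefold) returns 1. Differences like this are why a locale-aware collation would have to be supplied as an extension.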

Other

  • Acceptance and conversion of other encodings for incoming data?
  • Collations?
  • Date comparisons?
  • Other locale-specific logic?