Gregg -, 09/09/2009 03:26 AM


= Unicode Support =

Background:

  • See [http://unicode.org/reports/tr17/ UTR 17, Unicode Character Encoding Model] - if you're brave enough to tackle the mysteries of CCSs, CEFs, CESs, etc.
  • See also the [http://site.icu-project.org/ ICU] page for lots of detailed documentation on how Unicode is supposed to work in running software
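  • Unicode defines three encoding forms: UTF-8, UTF-16, and UTF-32. ISO/IEC 10646 also defines UCS-2 (a fixed 16-bit form that cannot represent code points above U+FFFF) and UCS-4 (equivalent to UTF-32 in practice).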
  • JSON text must be encoded in Unicode
  • The default encoding of JSON is UTF-8, which effectively means UTF-8 must be supported, but JSON data can also be delivered in UTF-16 or UTF-32
  • SPARQL syntax is UTF-8 Unicode: "The encoding is always UTF-8 [RFC3629]. Unicode code points may also be expressed using an \uXXXX (U+0 to U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a hexadecimal digit [0-9A-F]". In other words, a SPARQL processor must detect and reject input that is not UTF-8. But it isn't clear whether a conformant SPARQL parser ''must'' accept Unicode expressed with escapes (which let non-ASCII characters travel in pure ASCII text, much as UTF-7 does, although the notation is not UTF-7 itself).
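
To make the points above concrete, here is a minimal Python sketch. It is an illustration only, not part of any SPARQL implementation: the function names (`encoded_lengths`, `decode_query_bytes`, `expand_codepoint_escapes`) are hypothetical, and the escape pattern follows the quoted text literally (uppercase hex digits only). It shows how one code point occupies a different number of bytes in each encoding form, how a strict decoder rejects non-UTF-8 input, and how the \uXXXX and \UXXXXXXXX escapes could be expanded.

```python
import re

# Pattern taken from the quoted SPARQL text: \uXXXX or \UXXXXXXXX, X in [0-9A-F].
_ESCAPE = re.compile(r"\\u([0-9A-F]{4})|\\U([0-9A-F]{8})")


def encoded_lengths(ch):
    """Return the byte length of one character in each Unicode encoding form."""
    # The "-be" variants are used so that no byte order mark is counted.
    return {form: len(ch.encode(form)) for form in ("utf-8", "utf-16-be", "utf-32-be")}


def decode_query_bytes(raw):
    """Decode query bytes as strict UTF-8.

    Raises UnicodeDecodeError for any byte sequence that is not valid UTF-8,
    which is one way to "detect and reject" other encodings.
    """
    return raw.decode("utf-8", errors="strict")


def expand_codepoint_escapes(text):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the code points they name."""
    def _sub(match):
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(_sub, text)


if __name__ == "__main__":
    print(encoded_lengths("\u00e9"))            # LATIN SMALL LETTER E WITH ACUTE
    print(encoded_lengths("\U0001D11E"))        # MUSICAL SYMBOL G CLEF
    print(expand_codepoint_escapes(r"caf\u00E9"))
    try:
        decode_query_bytes(b"caf\xe9")          # Latin-1 bytes, not UTF-8
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```

Whether escapes should be expanded before or after parsing is exactly the open question raised above; the sketch only shows the mechanics of the expansion.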

Requirements:

  • The XML declaration of a result document should always explicitly declare the encoding
  • Content negotiation (Accept-Charset, Content-Type 'charset' parameter, etc.) should be used to specify encodings and forms
  • A SPARQL request whose Accept header specifies JSON must return results in UTF-8 unless another charset is requested
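
The requirements above could be sketched roughly as follows. This is a hedged illustration, not a description of any existing server: all function names are hypothetical, the Accept-Charset parsing is deliberately simplified, and falling back to UTF-8 when no requested charset is usable is an assumption (a stricter server might answer 406 instead). The media type `application/sparql-results+json` is used only as an example value.

```python
import codecs

DEFAULT_CHARSET = "utf-8"  # required default for JSON results


def choose_charset(accept_charset):
    """Pick a response charset from an Accept-Charset header value (or None)."""
    if not accept_charset:
        return DEFAULT_CHARSET
    candidates = []
    for position, item in enumerate(accept_charset.split(",")):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        quality = 1.0
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        if name and quality > 0:
            # Sort by descending quality, then by order of appearance.
            candidates.append((-quality, position, name))
    for _, _, name in sorted(candidates):
        if name == "*":
            return DEFAULT_CHARSET
        try:
            codecs.lookup(name)  # only offer charsets Python can actually encode
        except LookupError:
            continue
        return name
    return DEFAULT_CHARSET  # assumption: fall back rather than refuse


def xml_declaration(charset):
    """Build an XML declaration that states the encoding explicitly."""
    return '<?xml version="1.0" encoding="{}"?>'.format(charset)


def render_json_response(body_text, accept_charset=None):
    """Return (headers, encoded body) for a JSON result."""
    charset = choose_charset(accept_charset)
    headers = {
        "Content-Type": "application/sparql-results+json; charset={}".format(charset)
    }
    # encode() raises UnicodeEncodeError if the charset cannot represent the text.
    return headers, body_text.encode(charset)
```

For example, `choose_charset("utf-16;q=0.5, utf-8;q=0.9")` prefers UTF-8 because of its higher q-value, and `render_json_response('{"head": {}}')` with no Accept-Charset labels and encodes the body as UTF-8.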

Other:

  • Acceptance and conversion of other encodings for incoming data?
  • Collations?
  • Date comparisons?
  • Other locale-specific logic?
