|
|
Message-ID: <20260523223429.GC27423@brightrain.aerifal.cx> Date: Sat, 23 May 2026 18:34:30 -0400 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Results of analysis of requirements for collation data representation While the specifics of best representation of the collation weights data for runtime use, and how to generate such a representation from the CLDR root data and CLDR-format locale-specific tailorings, is something that will continue to evolve as the locale project continues, the actual requirements for the data representation are something that can be pinned down completely right now. This is part of the locale support overhaul project, funded by NLnet and the NGI Zero Core Fund. At a very high level, the collation data object is a greedy mapping from an input sequence of NFD codepoints, optionally contextual*, to sequences of collation elements. At this abstract level, it doesn't even matter what collation elements are/look-like. But in practice, each collation element consists of 3 weights, where each weight is a sequence of zero or more non-null bytes. There are further constraints on the structure of these weights, but they are not important at this level. (*) The mapping may be context-dependent, in that it may assert a prefix from previously-consumed codepoints. This is not a strictly required feature, and it was not present in the original UCA; prefix rules can be transformed to rules that don't use prefix matching. The documented motivation for doing prefixes is performance. This seems dubious, at least with the implementation I'm going with for multi-character sequence matching, but the ability to use rules written with prefixes, which appear in real CLDR data, without having to transform them, seems worthwhile, so I am including it in the model. Per TR#35 https://unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings prefixes are matched before contraction suffixes (any part of the forward match beyond the first codepoint), and are greedy in the same way the main contraction suffix matches are. So. the keys in the mapping lookup can be represented as: base/prefix-1/prefix-2/.../prefix-K/suffix+1/suffix+2/.../suffix+N By requirements of the UCA, whenever a key of the form A/B/C/D exists, keys of the form A/B/C, A/B, and A must also exist. These can be represented as A/B/C/D, A/B/C/0, A/B/0, and A/0. so that a particular level is never both final and non-final. At each level of such a greedy mapping up until the first suffix, the next component can be either a[nother] prefix codepoint or a first suffix codepoint. This requires one additional bit of data in the representation beyond the codepoint value. Since 0 is not a possible codepoint value, using sign is an obvious choice: representing prefixes as the negated codepoint value. Use of collation data for non-strcoll/strxfrm use: The regex engine also exposes collation with equivalence classes [=ce=] and collating symbols [.ce.]. While it might seem there is a need to represent classes explicitly, the regex implementation can evaluate equivalence by either (1) saving the primary weight for the characters between the = signs at regcomp time and looking up the primary weight of the input at regexec time, and comparing them to evaluate the match, or (2) constructing strings to pass to strcoll as a black box, which assess the primary equivalence. In particular, it is not necessary to be able to convert a representative of the class into a set of characters. In fact the set of matches in general will not be single characters but sequences consisting of a base character and combining marks. Likewise collating symbol matches can be asserted by doing the equivalent of strxfrm on the characters between the .'s at regcomp time, and likewise for the input at regexec time, and comparing them byte-for-byte to evaluate the match. Note: The TRE regex engine we use now does not support collation elements or equivalence classes, and they are unlikely to be added in the short-term future. What is important now however is that the data model not preclude future support.
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.