NAME

Unicode::Collate::Locale - Linguistic tailoring for DUCET via Unicode::Collate

SYNOPSIS

use Unicode::Collate::Locale;

#construct
$Collator = Unicode::Collate::Locale->
    new(locale => $locale_name, %tailoring);

#sort
@sorted = $Collator->sort(@not_sorted);

#compare
$result = $Collator->cmp($a, $b); # returns 1, 0, or -1.

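Note: Strings in @not_sorted, $a and $b are interpreted according to Perl's Unicode support. See perlunicode, perluniintro, perlunitut, perlunifaq, utf8. If your strings are not already decoded, you can use preprocess (see Unicode::Collate) or decode them beforehand.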
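As an illustration of the note on decoding, the following minimal sketch decodes UTF-8 byte strings with the core Encode module before handing them to the collator. The sample byte strings and variable names are assumptions made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Collate::Locale;

# Hypothetical input: UTF-8 encoded byte strings, such as lines read from
# a file handle that has no encoding layer. "\xC3\xB1" is n with tilde.
my @raw = ("oso", "\xC3\xB1u", "nube");

# Decode the bytes into Perl character strings before collating.
my @decoded = map { decode('UTF-8', $_) } @raw;

my $collator = Unicode::Collate::Locale->new(locale => 'es');
my @sorted   = $collator->sort(@decoded);

# Encode again for output.
print encode('UTF-8', $_), "\n" for @sorted;
```

With the Spanish tailoring, n with tilde is expected to sort after n and before o, so the decoded list should come out as nube, ñu, oso.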

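DESCRIPTION

This module provides linguistic tailoring for collation, building on Unicode::Collate.

Constructor

The new method returns a collator object.

The parameter list for the constructor is a hash, which can include a special key locale whose value (case-insensitive) is a Unicode base language code (two or three letters). For example, Unicode::Collate::Locale->new(locale => 'ES') returns a collator tailored for Spanish.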
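A small sketch of what the Spanish tailoring changes in practice, comparing it with a collator that uses no tailoring. The sample words are assumptions chosen for illustration; 'en' is used here only because English follows the default rules.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Sample words: \x{F1} is n with tilde, \x{FA} is u with acute.
my @words = ("nube", "\x{F1}and\x{FA}", "nota");

my $es      = Unicode::Collate::Locale->new(locale => 'ES');
my $default = Unicode::Collate::Locale->new(locale => 'en');

my @spanish_order = $es->sort(@words);
my @default_order = $default->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "es:      @spanish_order\n";
print "default: @default_order\n";
```

Under the Spanish tailoring, n with tilde is a separate letter after n, so the word starting with it moves to the end. Under the default rules it differs from n only at a lower level and sorts first here.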

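$locale_name may be suffixed with a Unicode script code (four letters), a Unicode region (territory) code, or a Unicode language variant code. These codes are case-insensitive and are separated with '_' or '-'. For example, en_US stands for English in the USA, az_Cyrl for Azerbaijani in the Cyrillic script, and es_ES_traditional for Spanish in Spain (traditional).

If $locale_name is not available, a fallback is selected in the following order: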

1. language with a variant code
2. language with a script code
3. language with a region code
4. language
5. default
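The fallback order above can be observed with getlocale, which reports the locale actually used. This is a minimal sketch; the locale names passed in are examples only, and the %accepted hash is a helper introduced for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Requested locale names (examples); the values are filled in with
# whatever locale the collator actually accepted.
my %accepted;
for my $name (qw(ES es-MX es_ES_traditional en_US)) {
    my $collator = Unicode::Collate::Locale->new(locale => $name);
    $accepted{$name} = $collator->getlocale;
    printf "%-20s => %s\n", $name, $accepted{$name};
}
```

Here es-MX has no tailoring of its own and is expected to fall back to the language es, es_ES_traditional to the variant es__traditional, and en_US to default, since English follows the default rules.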

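Tailoring tags provided by Unicode::Collate are allowed as long as they are not used for locale support. In particular, the table tag can never be tailored, because it is reserved for DUCET.

However, entry is allowed, even if it is used for locale support, in order to add or override mappings.

For example, a collator for Spanish that ignores diacritics and case differences (that is, level 1), with reversed case ordering and no normalization: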

Unicode::Collate::Locale->new(
    level => 1,
    locale => 'es',
    upper_before_lower => 1,
    normalization => undef
)

If such a tailoring is passed to new(), it may not override behavior that is already tailored by the locale.

Unicode::Collate::Locale->new(
    locale => 'da',
    upper_before_lower => 0, # causes error as reserved by 'da'
)
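Because a conflicting tag makes the constructor fail, a caller may want to trap the error. The sketch below wraps the call in eval; the exact wording of the error message is not assumed, only that construction fails.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# upper_before_lower is reserved by the 'da' locale, so passing it to
# new() is expected to raise an error rather than return a collator.
my $collator = eval {
    Unicode::Collate::Locale->new(
        locale             => 'da',
        upper_before_lower => 0,
    );
};
my $error = $@;

if (defined $collator) {
    print "collator constructed\n";
} else {
    print "construction rejected: $error";
}
```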

However, change(), inherited from Unicode::Collate, does allow changing a tailoring that is reserved by the locale. Examples:

Unicode::Collate::Locale->new(locale => 'fr_ca')->change(backwards => undef);
Unicode::Collate::Locale->new(locale => 'da')->change(upper_before_lower => 0);
Unicode::Collate::Locale->new(locale => 'ja')->change(overrideCJK => undef);
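A short sketch of the effect of change() on the Danish collator, comparing an uppercase and a lowercase letter before and after the call. The variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $da = Unicode::Collate::Locale->new(locale => 'da');

# The Danish tailoring sorts uppercase before lowercase.
my $before = $da->cmp('A', 'a');

# change() may alter a tag that new() would refuse to override.
$da->change(upper_before_lower => 0);
my $after = $da->cmp('A', 'a');

print "before change: $before\n";
print "after change:  $after\n";
```

Before the change, 'A' is expected to compare as less than 'a' (-1). After the change the default order applies, in which lowercase comes first (1).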

METHODS

Unicode::Collate::Locale is a subclass of Unicode::Collate, and all methods other than new are inherited from Unicode::Collate.

The additional methods are listed below.

$Collator->getlocale

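Returns the language code actually accepted and used for collation. If no linguistic tailoring is provided for the language code you passed (intentionally for some languages, or because the implementation is incomplete), this method returns the string 'default', meaning no special tailoring.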

$Collator->locale_version

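(Since Unicode::Collate::Locale 0.87) Returns the version number (perhaps /\d\.\d\d/) of the locale, as given in Locale/*.pl.

Note: The Locale/*.pl file that a collator uses should be identifiable from the combination of the return values of getlocale and locale_version.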
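The two methods can be combined to report which tailoring a collator is using. This is a minimal sketch; sv_SE is just an example locale name, and no particular version number is assumed.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(locale => 'sv_SE');

my $locale  = $collator->getlocale;       # locale actually used
my $version = $collator->locale_version;  # version of that locale's data

printf "locale: %s, version: %s\n",
    $locale, defined $version ? $version : '(none)';
```

Since sv_SE has no entry of its own, getlocale is expected to report the fallback sv.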

A list of tailorable locales

  locale name       description
--------------------------------------------------------------
  af                Afrikaans
  ar                Arabic
  as                Assamese
  az                Azerbaijani (Azeri)
  be                Belarusian
  bn                Bengali
  bs                Bosnian (tailored as Croatian)
  bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
  ca                Catalan
  cs                Czech
  cu                Church Slavic
  cy                Welsh
  da                Danish
  de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
  de_AT_phonebook   Austrian German (umlaut primary greater)
  dsb               Lower Sorbian
  ee                Ewe
  eo                Esperanto
  es                Spanish
  es__traditional   Spanish ('ch' and 'll' as a grapheme)
  et                Estonian
  fa                Persian
  fi                Finnish (v and w are primary equal)
  fi__phonebook     Finnish (v and w as separate characters)
  fil               Filipino
  fo                Faroese
  fr_CA             Canadian French
  gu                Gujarati
  ha                Hausa
  haw               Hawaiian
  he                Hebrew
  hi                Hindi
  hr                Croatian
  hu                Hungarian
  hy                Armenian
  ig                Igbo
  is                Icelandic
  ja                Japanese [1]
  kk                Kazakh
  kl                Kalaallisut
  kn                Kannada
  ko                Korean [2]
  kok               Konkani
  lkt               Lakota
  ln                Lingala
  lt                Lithuanian
  lv                Latvian
  mk                Macedonian
  ml                Malayalam
  mr                Marathi
  mt                Maltese
  nb                Norwegian Bokmal
  nn                Norwegian Nynorsk
  nso               Northern Sotho
  om                Oromo
  or                Oriya
  pa                Punjabi
  pl                Polish
  ro                Romanian
  sa                Sanskrit
  se                Northern Sami
  si                Sinhala
  si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
  sk                Slovak
  sl                Slovenian
  sq                Albanian
  sr                Serbian
  sr_Latn           Serbian in Latin (tailored as Croatian)
  sv                Swedish (v and w are primary equal)
  sv__reformed      Swedish (v and w as separate characters)
  ta                Tamil
  te                Telugu
  th                Thai
  tn                Tswana
  to                Tonga
  tr                Turkish
  ug_Cyrl           Uyghur in Cyrillic
  uk                Ukrainian
  ur                Urdu
  vi                Vietnamese
  vo                Volapük
  wae               Walser
  wo                Wolof
  yo                Yoruba
  zh                Chinese
  zh__big5han       Chinese (ideographs: big5 order)
  zh__gb2312han     Chinese (ideographs: GB-2312 order)
  zh__pinyin        Chinese (ideographs: pinyin order) [3]
  zh__stroke        Chinese (ideographs: stroke order) [3]
  zh__zhuyin        Chinese (ideographs: zhuyin order) [3]
--------------------------------------------------------------
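To show how a variant from the table changes the ordering, the following sketch sorts the same words with es and with es__traditional, where 'ch' is treated as a single letter. The sample words are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my @words = qw(cuna chico dama);

my $standard    = Unicode::Collate::Locale->new(locale => 'es');
my $traditional = Unicode::Collate::Locale->new(locale => 'es__traditional');

my @standard_order    = $standard->sort(@words);
my @traditional_order = $traditional->sort(@words);

print "es:              @standard_order\n";
print "es__traditional: @traditional_order\n";
```

With the standard tailoring, 'chico' sorts among the other words starting with c. With the traditional tailoring, 'ch' follows every other c, so 'chico' is expected after 'cuna'.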

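Locales that follow the default UCA rules include am (Amharic) without [reorder Ethi], bg (Bulgarian) without [reorder Cyrl], chr (Cherokee) without [reorder Cher], de (German), en (English), fr (French), ga (Irish), id (Indonesian), it (Italian), ka (Georgian) without [reorder Geor], mn (Mongolian) without [reorder Cyrl Mong], ms (Malay), nl (Dutch), pt (Portuguese), ru (Russian) without [reorder Cyrl], sw (Swahili), zu (Zulu).

Note

[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular forms. The difference between hiragana and katakana is at the 4th level; the comparison also requires (variable => 'Non-ignorable'), and then katakana_before_hiragana has no effect.

[2] ko: Many ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary (level 2) greater than, the corresponding hangul syllable.

[3] zh__pinyin, zh__stroke and zh__zhuyin: these implement alt='short', in which a smaller number of ideographs are tailored.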

A list of variant codes and their aliases

  variant code       alias
------------------------------------------
  dictionary         dict
  phonebook          phone     phonebk
  reformed           reform
  traditional        trad
------------------------------------------
  big5han            big5
  gb2312han          gb2312
  pinyin
  stroke
  zhuyin
------------------------------------------

Note: 'pinyin' is the Latin-script transcription of Han characters, and 'zhuyin' is their Bopomofo transcription.
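The aliases in the table can be used in place of the full variant code. A minimal sketch, assuming the alias is resolved when the locale name is parsed; the %resolved hash is a helper introduced for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

# Locale names written with alias variant codes (examples).
my %resolved;
for my $name (qw(de__phone es_trad sv_reform)) {
    my $collator = Unicode::Collate::Locale->new(locale => $name);
    $resolved{$name} = $collator->getlocale;
    printf "%-12s => %s\n", $name, $resolved{$name};
}
```

Each alias is expected to resolve to the locale with the full variant code, for example de__phone to de__phonebook.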

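INSTALL

Installation of Unicode::Collate::Locale requires Collate/Locale.pm, Collate/Locale/*.pm, Collate/CJK/*.pm and Collate/allkeys.txt. At build time, Unicode::Collate::Locale does not require any of data/*.txt, gendata/*, or mklocale. Tests for Unicode::Collate::Locale are named t/loc_*.t.

CAVEAT

Tailoring is not maximum

Even if a certain letter is tailored, its equivalents are not always tailored in the same way. For example, even though W is tailored, fullwidth W (U+FF37), W with acute (U+1E82), and so on are not tailored. The result may depend on whether the source strings are normalized, and on whether they are decomposed or composed. For this reason, (normalization => undef) is less preferred.

Collation reordering is not supported

The order of groups, including scripts, is never changed.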

Reference

  locale            based CLDR or other reference
--------------------------------------------------------------------
  af                30 = 1.8.1
  ar                30 = 28 ("compat" wo [reorder Arab]) = 1.9.0
  as                30 = 28 (without [reorder Beng..]) = 23
  az                30 = 24 ("standard" wo [reorder Latn Cyrl])
  be                30 = 28 (without [reorder Cyrl])
  bn                30 = 28 ("standard" wo [reorder Beng..]) = 2.0.1
  bs                30 = 28 (type="standard": [import hr])
  bs_Cyrl           30 = 28 (type="standard": [import sr])
  ca                30 = 23 (alt="proposed" type="standard")
  cs                30 = 1.8.1 (type="standard")
  cu                34 = 30 (without [reorder Cyrl])
  cy                30 = 1.8.1
  da                22.1 = 1.8.1 (type="standard")
  de__phonebook     30 = 2.0 (type="phonebook")
  de_AT_phonebook   30 = 27 (type="phonebook")
  dsb               30 = 26
  ee                30 = 21
  eo                30 = 1.8.1
  es                30 = 1.9.0 (type="standard")
  es__traditional   30 = 1.8.1 (type="traditional")
  et                30 = 26
  fa                22.1 = 1.8.1
  fi                22.1 = 1.8.1 (type="standard" alt="proposed")
  fi__phonebook     22.1 = 1.8.1 (type="phonebook")
  fil               30 = 1.9.0 (type="standard") = 1.8.1
  fo                22.1 = 1.8.1 (alt="proposed" type="standard")
  fr_CA             30 = 1.9.0
  gu                30 = 28 ("standard" wo [reorder Gujr..]) = 1.9.0
  ha                30 = 1.9.0
  haw               30 = 24
  he                30 = 28 (without [reorder Hebr]) = 23
  hi                30 = 28 (without [reorder Deva..]) = 1.9.0
  hr                30 = 28 ("standard" wo [reorder Latn Cyrl]) = 1.9.0
  hu                22.1 = 1.8.1 (alt="proposed" type="standard")
  hy                30 = 28 (without [reorder Armn]) = 1.8.1
  ig                30 = 1.8.1
  is                22.1 = 1.8.1 (type="standard")
  ja                22.1 = 1.8.1 (type="standard")
  kk                30 = 28 (without [reorder Cyrl])
  kl                22.1 = 1.8.1 (type="standard")
  kn                30 = 28 ("standard" wo [reorder Knda..]) = 1.9.0
  ko                22.1 = 1.8.1 (type="standard")
  kok               30 = 28 (without [reorder Deva..]) = 1.8.1
  lkt               30 = 25
  ln                30 = 2.0 (type="standard") = 1.8.1
  lt                22.1 = 1.9.0
  lv                22.1 = 1.9.0 (type="standard") = 1.8.1
  mk                30 = 28 (without [reorder Cyrl])
  ml                22.1 = 1.9.0
  mr                30 = 28 (without [reorder Deva..]) = 1.8.1
  mt                22.1 = 1.9.0
  nb                22.1 = 2.0   (type="standard")
  nn                22.1 = 2.0   (type="standard")
  nso           [*] 26 = 1.8.1
  om                22.1 = 1.8.1
  or                30 = 28 (without [reorder Orya..]) = 1.9.0
  pa                22.1 = 1.8.1
  pl                30 = 1.8.1
  ro                30 = 1.9.0 (type="standard")
  sa            [*] 1.9.1 = 1.8.1 (type="standard" alt="proposed")
  se                22.1 = 1.8.1 (type="standard")
  si                30 = 28 ("standard" wo [reorder Sinh..]) = 1.9.0
  si__dictionary    30 = 28 ("dictionary" wo [reorder Sinh..]) = 1.9.0
  sk                22.1 = 1.9.0 (type="standard")
  sl                22.1 = 1.8.1 (type="standard" alt="proposed")
  sq                22.1 = 1.8.1 (alt="proposed" type="standard")
  sr                30 = 28 (without [reorder Cyrl])
  sr_Latn           30 = 28 (type="standard": [import hr])
  sv                22.1 = 1.9.0 (type="standard")
  sv__reformed      22.1 = 1.8.1 (type="reformed")
  ta                22.1 = 1.9.0
  te                30 = 28 (without [reorder Telu..]) = 1.9.0
  th                22.1 = 22
  tn            [*] 26 = 1.8.1
  to                22.1 = 22
  tr                22.1 = 1.8.1 (type="standard")
  uk                30 = 28 (without [reorder Cyrl])
  ug_Cyrl           https://en.wikipedia.org/wiki/Uyghur_Cyrillic_alphabet
  ur                22.1 = 1.9.0
  vi                22.1 = 1.8.1
  vo                30 = 25
  wae               30 = 2.0
  wo            [*] 1.9.1 = 1.8.1
  yo                30 = 1.8.1
  zh                22.1 = 1.8.1 (type="standard")
  zh__big5han       22.1 = 1.8.1 (type="big5han")
  zh__gb2312han     22.1 = 1.8.1 (type="gb2312han")
  zh__pinyin        22.1 = 2.0   (type='pinyin' alt='short')
  zh__stroke        22.1 = 1.9.1 (type='stroke' alt='short')
  zh__zhuyin        22.1 = 22    (type='zhuyin' alt='short')
--------------------------------------------------------------------

[*] http://www.unicode.org/repos/cldr/tags/latest/seed/collation/

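AUTHOR

The Unicode::Collate::Locale module for perl was written by SADAHIRO Tomoyuki, <SADAHIRO@cpan.org>. This module is Copyright © 2004-2020, SADAHIRO Tomoyuki, Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Unicode Collation Algorithm - UTS #10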

http://www.unicode.org/reports/tr10/

The Default Unicode Collation Element Table (DUCET)

http://www.unicode.org/Public/UCA/latest/allkeys.txt

Unicode Locale Data Markup Language (LDML) - UTS #35

http://www.unicode.org/reports/tr35/

CLDR - Unicode Common Locale Data Repository

http://cldr.unicode.org/

Unicode::Collate
Unicode::Normalize