Chinese character order, also called Chinese character indexing, Chinese character collation or Chinese character sorting, is the way in which a Chinese character set is sorted into a sequence for the convenience of information retrieval. It may also refer to the sequence so produced. English dictionaries and indexes are normally arranged in alphabetical order for quick lookup, but Chinese is written in tens of thousands of different characters, not just dozens of letters in an alphabet, which makes sorting much more challenging.
The orders or sorting methods of Chinese dictionaries are traditionally divided into three categories: form-based orders, sound-based orders and meaning-based orders.
In modern Chinese, people also use frequency orders, where words or characters are sorted by their frequencies of use in a text corpus. There is also computer-based sorting and lookup.
Chinese dictionaries include character dictionaries and word dictionaries. Chinese word orders are based on character orders. Single-character words are arranged by character sorting directly, and multi-character words can be sorted character by character in a similar way. The following sections give a general introduction to the orders and sorting methods currently in use, focusing on those which are more popular and effective.
In this category of orders, Chinese characters are sorted according to various features of their forms or shapes. There are two subcategories of form-based orders: stroke-based orders and component-based orders.
See main article: Stroke-based sorting. In stroke-based orders, Chinese characters are sorted by different features of strokes, including stroke counts, stroke forms, stroke orders, stroke combinations, stroke positions, etc.
In this order, Chinese characters are sorted by stroke count in ascending order. A character with fewer strokes is put before those with more strokes. For example, the different characters in "漢字筆畫, 汉字笔画" (Chinese character strokes) are sorted into "汉(5)字(6)画(8)笔(10)[筆(12)畫(12)]漢(14)", where stroke counts are put in brackets. (Note that 筆 and 畫 both have 12 strokes, so stroke-count order alone cannot determine their relative order.)
This is a combination of stroke-count sorting and stroke-order sorting. Characters are first arranged by stroke count in ascending order. Stroke-order sorting is then employed to sort characters with the same number of strokes. The characters are first arranged by their first strokes according to an order of stroke groups (such as “heng (横), shu (竖), pie (撇), dian (点), zhe (折)”, or “dian (点), heng (横), shu (竖), pie (撇), zhe (折)”). If the first strokes belong to the same group, the characters are sorted by their second strokes in a similar way, and so on. In the example of the previous section, 筆 and 畫 both have 12 strokes. 筆 starts with stroke ㇓ of the pie (撇) group, and 畫 starts with ㇕ of the zhe (折) group. Pie comes before zhe in the group order, so 筆 comes before 畫. Hence the different characters in "汉字笔画, 漢字筆畫" are finally sorted into "汉(5)字(6)画(8)笔(10)筆(12)畫(12)漢(14)", where each character has a unique position.
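The two-level comparison can be sketched in Python as follows. This is a minimal illustration, not a dictionary-grade implementation: the stroke counts come from the example above, while the `FIRST_STROKE` table is an assumption supplied for the sketch and records only the first stroke group of each character (a real table would hold the full stroke sequence).

```python
# Stroke counts as given in the example in the text.
STROKE_COUNT = {"汉": 5, "字": 6, "画": 8, "笔": 10, "筆": 12, "畫": 12, "漢": 14}

# One of the group orders quoted in the text: heng, shu, pie, dian, zhe.
GROUP_RANK = {"heng": 0, "shu": 1, "pie": 2, "dian": 3, "zhe": 4}

# Assumption for illustration: only the FIRST stroke group is recorded.
FIRST_STROKE = {
    "汉": "dian", "字": "dian", "画": "heng", "笔": "pie",
    "筆": "pie", "畫": "zhe", "漢": "dian",
}

def stroke_count_key(ch):
    """Sort key for plain stroke-count order."""
    return STROKE_COUNT[ch]

def stroke_count_order_key(ch):
    """Sort key for stroke-count-stroke-order: count first, then stroke group."""
    return (STROKE_COUNT[ch], GROUP_RANK[FIRST_STROKE[ch]])

# Distinct characters of the example, in order of first appearance.
unique = list(dict.fromkeys("漢字筆畫汉字笔画"))

# With the count alone, 筆 and 畫 tie, so their order depends on the input order.
by_count_forward = "".join(sorted(unique, key=stroke_count_key))
by_count_reversed = "".join(sorted(reversed(unique), key=stroke_count_key))

# With the stroke-order tie-break, the result no longer depends on input order.
full_forward = "".join(sorted(unique, key=stroke_count_order_key))
full_reversed = "".join(sorted(reversed(unique), key=stroke_count_order_key))

print(by_count_forward, by_count_reversed)
print(full_forward, full_reversed)
```

Because Python's sort is stable, the count-only key leaves tied characters in input order, which is exactly the indeterminacy that the stroke-order step removes.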
See main article: GB stroke-based order. GB Stroke-Based Order, in full GB13000.1 Character Set Chinese Character Order (Stroke-Based Order) (GB13000.1字符集汉字字序(笔画序)规范), is a standard released by the National Language Commission of China in 1999. It is an enhanced version of stroke-count-stroke-order sorting. According to this standard, characters are first sorted by stroke count, then by stroke order (using the five families of heng, shu, pie, dian and zhe). Characters with the same stroke count and stroke order are then sorted by primary-secondary stroke order. For example, 子 and 孑 both have 3 strokes and the same five-group stroke order (㇐ and ㇀ both belong to the heng family), but under the primary-secondary rule the primary stroke ㇐ precedes the secondary stroke ㇀, so 子 comes before 孑. If two characters still tie on stroke count, stroke order and primary-secondary strokes, they are sorted by their mode of stroke combination: separation precedes connection, and connection precedes intersection. For example, 八 comes before 人, which comes before 乂. The standard contains further rules to resolve remaining ties.
See main article: YES stroke alphabetical order. YES is a simplified stroke-based sorting method free from stroke counting and grouping, without compromise in accuracy. It has been successfully applied to the indexing of all the characters in the Xinhua Zidian and Xiandai Hanyu Cidian. In this joint index you can look up a Chinese character to find its pinyin and Unicode, in addition to the page numbers in the two popular dictionaries.
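The cascade of tie-breakers can be pictured as a tuple sort key. The sketch below covers only the 八, 人, 乂 example and is not the standard's actual data or algorithm: the two-stroke sequence ("pie", "dian") assigned to all three characters is an assumption made for illustration, and the primary-secondary level is left out because these characters do not need it.

```python
# Five stroke families in the order named in the text.
GROUP_RANK = {"heng": 0, "shu": 1, "pie": 2, "dian": 3, "zhe": 4}

# Combination modes: separation < connection < intersection.
MODE_RANK = {"separation": 0, "connection": 1, "intersection": 2}

# (stroke count, five-group stroke order, combination mode).
# The stroke sequences are illustrative assumptions; the text only states
# that the three characters tie on stroke count and stroke order.
ENTRIES = {
    "乂": (2, ("pie", "dian"), "intersection"),
    "人": (2, ("pie", "dian"), "connection"),
    "八": (2, ("pie", "dian"), "separation"),
}

def gb_like_key(ch):
    """Tuple key: earlier fields dominate, later fields only break ties."""
    count, strokes, mode = ENTRIES[ch]
    return (count, tuple(GROUP_RANK[s] for s in strokes), MODE_RANK[mode])

ordered = "".join(sorted(ENTRIES, key=gb_like_key))
print(ordered)  # 八人乂
```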
In this category, characters are sorted by one or more components.
A radical is a common component shared by a group of characters. The radical usually lies in the upper part or on the left side of a character and helps to express its meaning. For example, 花 (flower), 草 (grass), 菜 (vegetable) all have the radical 艹 (艸, 草, grass/plant), which indicates that they are related to plants; 推 (push), 拉 (pull), 打 (beat) share the radical 扌 (手, hand), and are actions normally involving hands. In radical-based order, all the characters sharing a radical are put under that radical to form a radical family or section. Different families are arranged by their leading radicals in stroke-based order, and characters inside a family are also sorted by their strokes.
In many contemporary dictionaries, including Xinhua Zidian, Xiandai Hanyu Cidian and Oxford Chinese Dictionary, the radical-based character lookup system consists of three indexes or tables: a radical index, a character lookup index, and an index of characters with radicals difficult to find, all sorted in stroke-based order. To look up a character (such as 家, home) in a dictionary (e.g., Xinhua Zidian, 12th edition), first identify its radical (the component 宀 at the top of 家). Count the radical's strokes (3 strokes in 宀) and find it in the radical index, which is in stroke-based order. The page number to its right (p. 49) points to the radical family in the character lookup table, also in stroke-based order. Then count the strokes in the rest of the character (excluding the radical 宀, 家 has 7 strokes) and find the target character within the family. The page number to its right (217) is the page in the main body of the dictionary where the character's entry appears (character entries in the main body of Xinhua Zidian are sorted by Pinyin). Characters whose radicals are difficult to identify can be looked up in the Index of Characters with Radicals Difficult to Find, which is in stroke-based order.
The first radical system in history was created by the Chinese scholar Xu Shen in his Shuowen Jiezi dictionary almost two thousand years ago, in the Eastern Han dynasty. This dictionary is still available today, with a total of 540 radicals. Another milestone is the Kangxi radical system employed in the Kangxi Dictionary in 1716 in the era of Emperor Kangxi, with the number of radicals reduced to 214. The Kangxi radical sorting method is still in use in China, Japan and Korea. It is also used by the Unicode collation algorithm to sort CJK Unified Ideographs. The latest standard radical table of the Chinese mainland is the Table of Indexing Chinese Character Components, with a list of 201 radicals.
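The two-step lookup can be modelled with nested tables. The sketch below holds only the single 家 example from the text; the names `RADICAL_INDEX`, `CHARACTER_LOOKUP` and `look_up` are hypothetical, and a real dictionary index would of course contain thousands of entries.

```python
# Step 1 table: radical -> its stroke count and the page of its family
# in the character lookup table (values from the 家 example in the text).
RADICAL_INDEX = {
    "宀": {"strokes": 3, "lookup_page": 49},
}

# Step 2 table: radical family -> residual stroke count -> character -> entry page.
CHARACTER_LOOKUP = {
    "宀": {
        7: {"家": 217},
    },
}

def look_up(char, radical, residual_strokes):
    """Return (lookup-table page, main-body page) for a character.

    The reader supplies the radical and the number of strokes left
    after removing the radical, as in a manual lookup.
    """
    lookup_page = RADICAL_INDEX[radical]["lookup_page"]
    entry_page = CHARACTER_LOOKUP[radical][residual_strokes][char]
    return lookup_page, entry_page

pages = look_up("家", radical="宀", residual_strokes=7)
print(pages)  # (49, 217)
```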
Chinese characters are written in the form of a square block. The Four-Corner Method assigns a 4-digit code to a character, each digit representing one corner of the block. The four corner digits appear in the sequence of "upper-left, upper-right, lower-left and lower-right". For example, the code of character 顏 (meaning "face") is 0128, where the first digit 0 represents the upper-left component 亠, 1 for the upper right 一, 2 for the lower-left ㇓, and 8 represents the lower-right 八.
A fifth digit can be added to represent an extra part above the lower-right corner to gain higher sorting accuracy. For example, the extended code of character 佳 is 24214, where the fifth digit 4 represents component 十 above the final 一 in the lower-right corner.
When a set of characters is encoded in four-corner codes, the characters are sorted in ascending order by the first four digits, and then by the fifth digit if it exists.
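A sort key for this order might look like the sketch below. It uses only the two codes quoted above (顏 as 0128 and 佳 as 24214) and assumes that a missing fifth digit sorts before any fifth digit; the text does not state that tie rule, so treat it as an assumption.

```python
# Four-corner codes quoted in the text; 佳 carries an optional fifth digit.
CODES = {"佳": "24214", "顏": "0128"}

def four_corner_key(ch):
    """Compare the four corner digits first, then the optional fifth digit.

    An absent fifth digit becomes "", which sorts before "0".."9"
    (an assumption made for this sketch).
    """
    code = CODES[ch]
    return (code[:4], code[4:])

ordered = sorted(CODES, key=four_corner_key)
print(ordered)  # ['顏', '佳']
```

Comparing the digits as fixed-width strings gives the same result as comparing them as numbers and keeps leading zeros intact.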
In this method, Chinese characters are arranged alphabetically by their codes used in Cangjie input method. The Cangjie code of a character is a string of English letters each representing a selected Cangjie component in the character. For example, the Cangjie codes of the characters in 漢字排檢法 (Methods for Chinese character sorting and retrieving) are 漢(ETLO)字(JND)排(QLMY)檢(DOMO)法(EGI), and can be sorted into a Cangjie-code order of 檢(DOMO)法(EGI)漢(ETLO)字(JND)排(QLMY).[1]
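Since Cangjie codes are plain strings of Latin letters, this order reduces to ordinary string sorting. The short sketch below uses the five codes listed above; the variable names are illustrative.

```python
# Cangjie codes for the characters of 漢字排檢法, as listed in the text.
CANGJIE = {"漢": "ETLO", "字": "JND", "排": "QLMY", "檢": "DOMO", "法": "EGI"}

# Sort the characters alphabetically by their codes.
ordered = sorted(CANGJIE, key=CANGJIE.get)
print("".join(f"{ch}({CANGJIE[ch]})" for ch in ordered))
# 檢(DOMO)法(EGI)漢(ETLO)字(JND)排(QLMY)
```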
Compared with sound-based orders, form-based orders are usually more complicated, but have the advantages of (a) allowing character and word lookup without knowing its pronunciation, and (b) effective collation of large character sets without support from other sorting methods.
There are two sound representation systems currently in use for Standard Chinese, i.e., pinyin and bopomofo. Accordingly, we have two methods of sound-based sorting for Standard Chinese.
See main article: Pinyin alphabetical order.
In this method, Chinese characters are sorted alphabetically by their Pinyin. For example, 汉字拼音排序法 (the Pinyin sorting method of Chinese characters) is sorted into "法(fǎ)汉(hàn)排(pái)拼(pīn)序(xù)音(yīn)字(zì)", with Pinyin in brackets. Pinyin expressions with the same letters are ordered by their tones in the order of "tone 1, tone 2, tone 3, tone 4 and tone 5 (light tone)", such as "妈(mā), 麻(má), 马(mǎ), 骂(mà), 吗(ma)". Characters of the same sound, i.e., the same Pinyin letters and tone, are normally sorted by a stroke-based method.
Words of multiple characters can be sorted in two different ways. One is to sort character by character: if the first characters are the same, sort by the second character, and so on. For example, "归并 (guībìng), 归还 (guīhuán), 规划 (guīhuà), 鬼话 (guǐhuà), 桂花 (guìhuā)". This method is used in Xiandai Hanyu Cidian. The other is to sort by the Pinyin letters of the whole word, and then by tones when the letters are the same. For example, "归并 (guībìng), 规划 (guīhuà), 鬼话 (guǐhuà), 桂花 (guìhuā), 归还 (guīhuán)". This method is used in the ABC Chinese–English Dictionary.
Pinyin-based sorting is very convenient for looking up characters or words whose pronunciation and Pinyin spelling you know. But you cannot look up words whose sound you do not know.
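One way to derive such a sort key is to split each Pinyin syllable into its letters and its tone. The sketch below does this with Unicode decomposition from the standard `unicodedata` module; the function name `pinyin_key` is hypothetical, and the final stroke-based tie-break for exact homophones is not modelled.

```python
import unicodedata

# Combining marks used for Pinyin tones 1 to 4; no mark means tone 5 (light tone).
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}

def pinyin_key(syllable):
    """Return (letters, tone) so that letters compare first, tones second."""
    letters = []
    tone = 5
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)  # the diaeresis of ü stays with its letter
    return ("".join(letters), tone)

chars = {"汉": "hàn", "字": "zì", "拼": "pīn", "音": "yīn",
         "排": "pái", "序": "xù", "法": "fǎ"}
by_pinyin = "".join(sorted(chars, key=lambda c: pinyin_key(chars[c])))

tones = {"吗": "ma", "骂": "mà", "马": "mǎ", "麻": "má", "妈": "mā"}
by_tone = "".join(sorted(tones, key=lambda c: pinyin_key(tones[c])))

print(by_pinyin)  # 法汉排拼序音字
print(by_tone)    # 妈麻马骂吗
```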
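The difference between the two approaches comes down to how the sort key is built. In the sketch below, each word is stored with pre-split syllables as (letters, tone) pairs. As a simplification, homophonous characters are kept apart by comparing the characters themselves (by code point), standing in for the stroke-based tie-break a real dictionary would use; the function names are illustrative.

```python
# Each word maps to one (pinyin letters, tone) pair per character.
WORDS = {
    "归还": [("gui", 1), ("huan", 2)],
    "桂花": [("gui", 4), ("hua", 1)],
    "规划": [("gui", 1), ("hua", 4)],
    "归并": [("gui", 1), ("bing", 4)],
    "鬼话": [("gui", 3), ("hua", 4)],
}

def char_by_char_key(word):
    """Compare the first characters completely before looking at the second."""
    return [(letters, tone, ch) for ch, (letters, tone) in zip(word, WORDS[word])]

def whole_word_key(word):
    """Compare the letters of the whole word first, then the tone sequence."""
    syllables = WORDS[word]
    letters = "".join(s for s, _ in syllables)
    tone_seq = tuple(t for _, t in syllables)
    return (letters, tone_seq)

char_order = sorted(WORDS, key=char_by_char_key)
word_order = sorted(WORDS, key=whole_word_key)

print(char_order)  # ['归并', '归还', '规划', '鬼话', '桂花']
print(word_order)  # ['归并', '规划', '鬼话', '桂花', '归还']
```

In the character-by-character key, 归还 stays next to 归并 because both begin with 归. In the whole-word key, "guihuan" is compared as one string and so falls after every "guihua".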
Bopomofo is a Chinese phonetic system created by the Commission on the Unification of Pronunciation in 1913, and formally issued by the Ministry of Education of the Chinese Government in 1918. It consists of a table (or alphabet) of letters or symbols in the order of ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩ and 5 tone diacritics of “ˉ, ˊ, ˇ, ˋ, ˙”.
Chinese characters are sorted according to the Bopomofo expressions of their sounds, following the order of the alphabet table: first by letters, then by tones in the order of "first tone, second tone, third tone, fourth tone, and fifth tone (also called neutral tone or light tone)". For example, the Bopomofo order for the characters in “注音字母排序法 (Bopomofo-based sorting)” is “排(ㄆㄞˊ)母(ㄇㄨˇ)法(ㄈㄚˇ)序(ㄒㄩˋ)注(ㄓㄨˋ)字(ㄗˋ)音(ㄧㄣ)”. Characters of the same sound are normally sorted by a stroke-based method.
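The same idea can be sketched for Bopomofo by ranking each symbol by its position in the alphabet table. The code below is an illustrative sketch: it treats a reading without a tone mark as first tone, which matches the 音 (ㄧㄣ) example, and it does not model the stroke-based tie-break for homophones.

```python
# Alphabet and tone marks in the order given in the text.
ALPHABET = "ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩ"
LETTER_RANK = {symbol: i for i, symbol in enumerate(ALPHABET)}
TONE_RANK = {"ˉ": 1, "ˊ": 2, "ˇ": 3, "ˋ": 4, "˙": 5}

def bopomofo_key(reading):
    """Return (letter ranks, tone): letters compare first, tones second."""
    letters = [LETTER_RANK[s] for s in reading if s in LETTER_RANK]
    # An unmarked reading is taken as first tone.
    tone = next((TONE_RANK[s] for s in reading if s in TONE_RANK), 1)
    return (letters, tone)

READINGS = {"注": "ㄓㄨˋ", "音": "ㄧㄣ", "字": "ㄗˋ", "母": "ㄇㄨˇ",
            "排": "ㄆㄞˊ", "序": "ㄒㄩˋ", "法": "ㄈㄚˇ"}

ordered = "".join(sorted(READINGS, key=lambda c: bopomofo_key(READINGS[c])))
print(ordered)  # 排母法序注字音
```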
The first dictionary sorted in Bopomofo is 國語辭典 (Guoyu Dictionary) published in 1937, followed by many other dictionaries. Bopomofo is more popular in Taiwan than in Chinese Mainland, where Pinyin is predominant.
In addition to the sounds of standard Chinese, Chinese characters can be sorted alphabetically by the sounds of dialects as well.
In Jyutping, a romanization of Cantonese, which is widely spoken in Hong Kong, the sound of a character is represented by a string of English letters followed by a number (1 through 6) representing the tone. For instance, the Jyutping order of the characters in “粵拼排檢法 (Jyutping-based sorting and retrieving)” is “法[faat3]檢[gim2]粵[jyut6]排[paai4]拼[ping3]”, where Jyutping expressions are in square brackets.
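Because the tone is a trailing digit, a Jyutping sort key only needs to split that digit off. A minimal sketch using the five readings above follows; sorting by letters first and tone second is assumed by analogy with the Pinyin and Bopomofo methods.

```python
# Jyutping readings quoted in the text: letters followed by a tone digit (1-6).
JYUTPING = {"粵": "jyut6", "拼": "ping3", "排": "paai4", "檢": "gim2", "法": "faat3"}

def jyutping_key(reading):
    """Split 'faat3' into ('faat', 3): letters compare first, tone second."""
    return (reading[:-1], int(reading[-1]))

ordered = "".join(sorted(JYUTPING, key=lambda c: jyutping_key(JYUTPING[c])))
print(ordered)  # 法檢粵排拼
```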
The most serious limitation of sound-based orders is their lack of support to look up words with unknown pronunciation. Hence dictionaries collated by sounds often provide some indexes in form-based orders.
Meaning-based orders, also called semantics-based orders, arrange characters and words in a hierarchical structure of semantic categories. The first surviving Chinese dictionary, Erya (dating from the 3rd century BC), is arranged by semantic classification. The words were divided into 19 categories, each with a large number of entries. An entry is a list of synonyms, which are explained by a commonly used word. For instance, in the entry 林、烝、天、地、皇、王、後、辟、公、侯,君也。 the ending "君也" means that the preceding words are synonyms of "君 (king)".
Modern semantically sorted dictionaries include "同义词词林" and "实用广州话分类词典". Their classification systems are much more accurate and detailed than the ancient dictionaries, but still need indexes of radicals or strokes. That means meaning-based sorting is not powerful enough to function as an independent sorting method.
Semantics-based sorting involves these questions: What are the categories and subcategories to use? How to put a word into its category and subcategory? How to arrange the categories and subcategories in order? How to arrange the words in the lowest subcategories in order? And the answers to these questions may vary between the user and compiler of the dictionary, and that will lead to difficulties in word lookup.
In fact, radical-based sorting is meaning-based to a certain degree, because in many cases the radical represents the semantic category of a character, e.g., radical 氵(water) in character 江(river), 扌(hand) in 推 (push) and 艹 (grass) in 花草 (flowers and grasses).
In this category of orders, Chinese characters are sorted by their frequencies of use, normally in descending order, so that the most frequently used character is at the top of the list. A frequency list is created from a text corpus. In corpus linguistics, the frequency of a character is the ratio, usually expressed as a percentage, of its number of occurrences in the corpus to the total number of characters in the corpus.
The first frequency list of Chinese characters based on a corpus was created by Chen Heqin (陳鶴琴). In the 1920s, he and his assistants spent two years manually counting the characters in a corpus of 554,478 characters, and obtained 4,261 different characters with frequency information. The top 10 characters in their frequency list are (in descending order): 的 (of), 不 (no, not), 一 (one, a/an), 了 (had, done), 是 (be), 我 (I, me), 上 (on, up), 他 (he, him), 有 (have, has), 人 (person, people).
In 2001, the Chinese University of Hong Kong published a number of frequency lists on the Web,[2] entitled "Hong Kong, Mainland China and Taiwan Chinese Frequency: a trans-regional diachronic survey". The frequency data came from a large corpus with a number of sub-corpora representing the Chinese language in the three regions of Hong Kong, mainland China and Taiwan, and in the two time periods of the 1960s and the 1980s/1990s. Each sub-corpus includes about 5,000 different characters, as shown by their frequency lists. The top 10 characters in the frequency lists for the three regions in the 1980s/1990s are Hong Kong: 的,一,是,不,人,有,在,了,我,中; Taiwan: 的,一,是,不,人,在,有,我,了,中; Mainland: 的,一,是,了,不,在,有,人,我,他. Both meaning-based sorting and frequency-based sorting are employed in other languages as well, though often at the word level rather than at the character or letter level.
A Chinese word consists of one or more characters. Single-character words can be sorted by a character order, and multi-character words can be sorted character by character in a similar way. For example, according to the Pinyin, Radical and Stroke-based orders used in the Xiandai Hanyu Cidian (7th edition), the five words of [爱, 好, 好事, 好人, 好人家] are arranged in the following orders:
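Building such a list is a straightforward counting exercise. The sketch below uses Python's `collections.Counter` on a tiny made-up sentence rather than a real corpus. As a simplifying assumption, it counts only code points in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so punctuation and characters in the extension blocks are ignored.

```python
from collections import Counter

def frequency_list(corpus):
    """Return [(character, count, percentage)] in descending order of count."""
    # Simplification: keep only the basic CJK Unified Ideographs block.
    chars = [ch for ch in corpus if "\u4e00" <= ch <= "\u9fff"]
    total = len(chars)
    counts = Counter(chars)
    return [(ch, n, 100 * n / total) for ch, n in counts.most_common()]

# Hypothetical six-character "corpus", for illustration only.
freq = frequency_list("我的书,是我的。")
for ch, n, pct in freq:
    print(f"{ch}\t{n}\t{pct:.1f}%")
```

Characters with equal counts keep the order in which they first appear, so a real frequency list would need a secondary key, such as a stroke-based order, to make ties deterministic.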
There are software applications to support sorting and lookup of Chinese characters on the computer.
Chinese characters can be automatically sorted on the computer. For example, on Microsoft Windows and Office,[3] users can sort their Chinese texts in the orders of:
Here are two examples.
Unihan Radical-Stroke Index[4] uses the Kangxi radical system. It allows the user to look up a character from the Unihan Database of more than 75,000 CJK characters,[5] by the procedure of
The latest (i.e., 12th) edition of Xinhua Zidian has an accompanying app which supports 3 lookup methods: