Chinese Dasher wiki
Chinese "Ruby" Corpus
I have found a Chinese corpus that gives both pinyin and Chinese character strings
together. I used this corpus to make our pinyin corpus,
download/training/training_pinyin_CN.txt,
and a "Ruby" corpus,
download/training/training_chineseRuby_CN.txt.
[Ruby is our name for mixed phonetic text and Chinese or Japanese characters;
in Japanese, we call Ruby furigana.]
The original corpus is in
/home/mackay/dasher/incoming/chinese/pinyin
and
/home/mackay/dasher/incoming/chinese/character.
My Perl program that creates the Ruby output is
/home/mackay/dasher/incoming/chinese/pinyin/CONVERTP.p.
The associated alphabet file is
alphabet.chineseRuby.xml.
My Perl program that creates the pure pinyin output is
/home/mackay/dasher/incoming/chinese/pinyin/CONVERT3.p.
The associated alphabet file is
alphabet.pinyin.xml.
On Fri 5/8/05 I fixed an error in my conversion program, with the help of Chunlin Ji.
Here are his notes.
Rules for placing the tone mark in Pinyin:
- If there is more than one vowel and the first one is 'i', 'u' or 'ü', then
the second vowel takes the mark.
- Otherwise, the first vowel takes the mark.
(The vowels in Pinyin are 'a', 'e', 'i', 'o', 'u', 'ü'.)
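The two rules above can be sketched in Python as follows. This is only an illustration of the rules as stated in the notes, not the logic of CONVERTP.p or CONVERT3.p. The function name place_tone, the numeric tone convention (1 to 4, anything else treated as neutral) and the restriction to lowercase input are assumptions made for the example.

```python
import unicodedata

# Pinyin vowels as listed in the notes; "\u00fc" is the precomposed 'ü'.
VOWELS = "aeiou\u00fc"
# Vowels that pass the mark on to the following vowel (rule 1).
LEADING_VOWELS = "iu\u00fc"

# Combining diacritics for tones 1-4 (macron, acute, caron, grave).
# Assumption: tones arrive as integers; the notes do not say how they are encoded.
COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}


def place_tone(syllable, tone):
    """Return a lowercase pinyin syllable with its tone mark added."""
    syllable = unicodedata.normalize("NFC", syllable)
    if tone not in COMBINING:
        return syllable  # neutral tone: leave unmarked
    positions = [i for i, ch in enumerate(syllable) if ch in VOWELS]
    if not positions:
        return syllable
    target = positions[0]
    # Rule 1: more than one vowel and the first is i, u or ü -> mark the second.
    if len(positions) > 1 and syllable[target] in LEADING_VOWELS:
        target = positions[1]
    # Rule 2 (otherwise): the first vowel takes the mark.
    marked = unicodedata.normalize("NFC", syllable[target] + COMBINING[tone])
    return syllable[:target] + marked + syllable[target + 1:]


for syl, tone in [("hao", 3), ("xue", 2), ("liu", 2), ("zhong", 1)]:
    print(syl, tone, place_tone(syl, tone))
```

Under these rules "hao" with tone 3 becomes "hǎo" and "liu" with tone 2 becomes "liú". The marked vowel is built by Unicode composition rather than a lookup table, so 'ü' with a tone (for example 'ǜ') needs no special case.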
By the way, there are several small tricks in writing Pinyin. For example, Hanyu
Pinyin simplifies the spelling of syllables with 'ü' by using the 'u' form
instead in cases where no ambiguity could result, for example when 'ü'
comes after 'j', 'q', 'x' or 'y'. This is merely a spelling convention;
the 'u's here are still pronounced 'ü'.
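As a small sketch of this spelling convention, the following Python functions convert between the written form (with 'u') and the pronounced form (with 'ü'). The function names are hypothetical, and the code assumes lowercase syllables without tone marks, one syllable per call.

```python
# Initials after which Hanyu Pinyin writes 'ü' as plain 'u'.
U_FOR_UMLAUT_INITIALS = ("j", "q", "x", "y")


def to_written_form(syllable):
    """Apply the convention: 'ü' directly after j, q, x or y is written 'u'."""
    if syllable[:1] in U_FOR_UMLAUT_INITIALS and syllable[1:2] == "\u00fc":
        return syllable[0] + "u" + syllable[2:]
    return syllable


def to_spoken_form(syllable):
    """Undo the convention, restoring 'ü' where a written 'u' stands for it."""
    if syllable[:1] in U_FOR_UMLAUT_INITIALS and syllable[1:2] == "u":
        return syllable[0] + "\u00fc" + syllable[2:]
    return syllable


print(to_written_form("q\u00fce"))  # que
print(to_spoken_form("xuan"))       # xüan
print(to_spoken_form("lu"))         # lu (unchanged: 'l' can precede both u and ü)
```

Syllables such as "lü" and "nü" are left alone in both directions, because after 'l' and 'n' the two vowels contrast and the dots are needed to tell them apart.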
For a detailed guide to the rules of Pinyin, please refer to the following
webpages (in English):
- Combinations of initials and finals
(http://www.pinyin.info/rules/initials_finals.html)
- Where do the tone marks go?
(http://www.pinyin.info/rules/where.html)
- Basic Rules of Hanyu Pinyin Orthography
(http://www.pinyin.info/readings/zyg/rules.html)
Software: Here are some free and popular input methods for Linux. I guess
they may contain source code for converting Pinyin to Chinese characters.
1. SCIM: http://www.scim-im.org/ (input methods include
Simplified/Traditional Chinese, Japanese, Korean and many European
languages)
2. Fcitx: http://www.fcitx.org/main/ (in English:
http://www.fcitx.org/main/?q=node/10)
3. XCIN: http://xcin.linux.org.tw/intro.En.html (widely used in Taiwan)
4. Chinput: http://www.opencjk.org/~yumj/project-chinput-e.html
5. XSIM: http://developer.berlios.de/projects/xsim/
Would software that can convert Chinese characters
to Pinyin be useful for creating training data? If so, the following
software may help. (Webpage is in Chinese)
The bopomofo alphabet is here.