Corpus construction, and the use of corpus technology to assist in compiling various types of dictionaries, have a long history in Anglo-American countries. However, unlike languages such as English, in which spaces serve as word delimiters, Chinese is written without explicit word boundaries; word segmentation is therefore a fundamental preliminary step in Chinese corpus processing. A review of existing Traditional Chinese word segmentation tools shows that an open-source, domain-specific segmentation tool is needed to serve as a robust foundation for the integrated research project. This project therefore aims to use the hand-crafted segmentation corpus and the domain terminology knowledge base as training data to develop a machine learning algorithm that learns both general Chinese word segmentation and the lexical characteristics of the domain. The algorithm is expected to achieve high precision on domain-specific word segmentation. For practical application, a segmentation tool and an online system based on the proposed algorithm will be implemented, enabling other projects to perform vocabulary analysis and to preprocess dictionary-editing materials, thereby improving the efficiency of building foundational knowledge resources. To ensure the project's results benefit the research community, the tool will be released as open source and made available to all researchers.
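The abstract does not fix a particular learning algorithm, so the following is only a minimal illustrative sketch of how such training data are typically used: Chinese word segmentation is commonly cast as character-level sequence labeling with B/M/E/S tags derived from a hand-segmented corpus, while a domain terminology lexicon can supply features or a simple matching baseline. The sentence, lexicon entries, and function names below are hypothetical and are not taken from the project itself.

```python
from typing import List

def words_to_bmes(words: List[str]) -> List[str]:
    """Convert a gold-segmented sentence (a list of words) into per-character
    B/M/E/S tags, the usual training targets for a segmentation model."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def fmm_segment(text: str, lexicon: set, max_len: int = 6) -> List[str]:
    """Forward maximum matching against a domain lexicon: at each position,
    take the longest matching lexicon entry, else emit a single character."""
    result, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + length] in lexicon:
                result.append(text[i:i + length])
                i += length
                break
        else:
            result.append(text[i])
            i += 1
    return result

if __name__ == "__main__":
    # Hypothetical gold segmentation and domain lexicon, for illustration only.
    gold = ["詞彙", "分析", "是", "基礎"]
    print(list(zip("".join(gold), words_to_bmes(gold))))   # character/tag pairs
    lexicon = {"詞彙", "分析", "基礎"}
    print(fmm_segment("詞彙分析是基礎", lexicon))             # dictionary baseline
```

In this formulation, a statistical or neural tagger trained on the B/M/E/S labels captures general segmentation behavior, while lexicon-based features or matching handle domain-specific terms.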
In conclusion, the results of this project can serve not only as the basis for other projects, but also support a wide range of corpus-based research, including dictionary compilation, language analysis, and the design of teaching materials and curricula, ultimately improving the quality of education.