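Research on Word Segmentation Algorithm of Chinese Agriculture Web Page Based on Dictionary and Omni-word Segmentation
Cite this article: 白涛, 张太红, 吴乃宁. Research on Word Segmentation Algorithm of Chinese Agriculture Web Page Based on Dictionary and Omni-word Segmentation [J]. 新疆农业大学学报 (Journal of Xinjiang Agricultural University), 2014, 0(2): 168-172
Authors: 白涛, 张太红, 吴乃宁
Affiliation: College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052
Funding: Xinjiang Uygur Autonomous Region Science and Technology Research Project (200931103)
Abstract: To address the particular requirements that agricultural vertical search places on Chinese word segmentation, this paper proposes a Chinese word segmentation algorithm based on a dictionary and omni-word (full) segmentation. The algorithm first applies dictionary-based mechanical segmentation to preprocessed web pages. For character strings that remain unrecognized, it then computes full-segmentation probabilities with a Bayesian method, selects the most reasonable segmentation by finding the maximum segmentation credibility of the string, and updates the dictionary. For the experiments, 14 groups were randomly sampled from 1.2 million Chinese agricultural web pages to form the test set. The results show that the algorithm achieves higher recall than the forward maximum matching (FMM) and reverse maximum matching (RMM) algorithms, with an average F1 measure of 88%.

Keywords: Chinese word segmentation; unknown word identification; Bayes; omni-word segmentation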
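
The abstract names three segmentation strategies: FMM, RMM, and omni-word (full) segmentation scored by probability. The sketch below illustrates all three on a classic ambiguous string. It is not the authors' implementation: the toy dictionary, the probabilities, the `floor` value for unknown pieces, and the simple product-of-word-probabilities score are all assumptions made for illustration, and the paper's Bayesian credibility measure may differ.

```python
from math import log

# Hypothetical toy dictionary with made-up word probabilities.
PROBS = {"研究": 0.3, "研究生": 0.1, "生命": 0.3, "命": 0.01, "起源": 0.29}
MAX_LEN = max(len(w) for w in PROBS)

def fmm(text, words=PROBS, max_len=MAX_LEN):
    """Forward maximum matching: take the longest dictionary word from the left."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            # Fall back to a single character when nothing matches.
            if text[i:i + n] in words or n == 1:
                out.append(text[i:i + n])
                i += n
                break
    return out

def rmm(text, words=PROBS, max_len=MAX_LEN):
    """Reverse maximum matching: take the longest dictionary word from the right."""
    out, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if text[j - n:j] in words or n == 1:
                out.append(text[j - n:j])
                j -= n
                break
    return out[::-1]

def all_segmentations(text):
    """Enumerate every split of text into contiguous pieces (2**(n-1) of them)."""
    if not text:
        return [[]]
    return [[text[:k]] + rest
            for k in range(1, len(text) + 1)
            for rest in all_segmentations(text[k:])]

def best_segmentation(text, probs=PROBS, floor=1e-6):
    """Pick the split with the highest summed log-probability (assumed score)."""
    def score(seg):
        return sum(log(probs.get(w, floor)) for w in seg)
    return max(all_segmentations(text), key=score)

text = "研究生命起源"
print(fmm(text))                # ['研究生', '命', '起源']
print(rmm(text))                # ['研究', '生命', '起源']
print(best_segmentation(text))  # ['研究', '生命', '起源']
```

Exhaustive enumeration grows exponentially with string length, which is presumably why the abstract applies full segmentation only to the short strings left unrecognized by the dictionary pass.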

Research on Word Segmentation Algorithm of Chinese Agriculture Web Page Based on Dictionary and Omni-word Segmentation
BAI Tao, ZHANG Tai-hong, WU Nai-ning. Research on Word Segmentation Algorithm of Chinese Agriculture Web Page Based on Dictionary and Omni-word Segmentation [J]. Journal of Xinjiang Agricultural University, 2014, 0(2): 168-172
Authors: BAI Tao, ZHANG Tai-hong, WU Nai-ning
Affiliation: College of Computer & Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China
Abstract: This paper proposes a Chinese word segmentation algorithm based on a dictionary and omni-word segmentation, designed for the special requirements of agricultural vertical search. The algorithm first segments preprocessed web pages by dictionary-based mechanical matching. It then uses a Bayesian method to calculate the omni-word segmentation probabilities of each unrecognized string, selects the most reasonable segmentation by finding the string's maximum segmentation credibility, and updates the dictionary. Fourteen groups selected randomly from 1.2 million Chinese agricultural web pages constituted the test set. The results show that the proposed algorithm achieves a higher recall than forward maximum matching (FMM) and reverse maximum matching (RMM), and that its average F1 measure reaches 88%.
Keywords: Chinese word segmentation; unknown word identification; Bayes; omni-word segmentation
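
The abstract reports recall and an F1 measure but does not spell out how they were computed. A minimal sketch of the standard word-level definitions follows, assuming a word counts as correct only when both of its boundaries match the reference segmentation. The function names `spans` and `prf` and the example segmentations are hypothetical, not taken from the paper.

```python
def spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(gold, pred):
    """Return (precision, recall, F1) of a predicted segmentation against a reference."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)  # words whose two boundaries both agree
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

gold = ["研究", "生命", "起源"]    # reference segmentation (illustrative)
pred = ["研究生", "命", "起源"]    # e.g. the output of a greedy forward match
print(prf(gold, pred))
```

In this example only one of the three predicted words matches the reference, so precision, recall, and F1 all come out to one third.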
This article is indexed in databases including 维普 (VIP).