
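Named Entity Recognition of Chinese Crop Diseases and Insect Pests Based on Multi-Source Information Fusion
Cite this article: LI Lin, ZHOU Han, GUO Xuchao, LIU Chengqi, SU Jie, TANG Zhan. Named Entity Recognition of Chinese Crop Diseases and Insect Pests Based on Multi-Source Information Fusion[J]. 农业机械学报 (Transactions of the Chinese Society of Agricultural Machinery), 2021, 52(12): 253-263.
Authors: LI Lin  ZHOU Han  GUO Xuchao  LIU Chengqi  SU Jie  TANG Zhan
Affiliation: China Agricultural University
Funding: National Key Research and Development Program of China (2016YFD0300710)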
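Abstract: As the research literature on crop diseases and insect pests grows rapidly, text mining of that literature is becoming increasingly important. An effective and accurate named entity recognition system for crop diseases and insect pests helps extract findings from related research reports and provides useful suggestions for disease and pest control. To address the lack of a Chinese dataset for crop diseases and insect pests, this paper proposes a stop-wait algorithm based on semi-remote supervision and uses it to build a Chinese corpus for the domain, substantially reducing the labor and time cost of annotation. It also proposes a Chinese named entity recognition model for crop diseases and insect pests (Agricultural information extraction, Agr-IE). The model builds on BERT-BILSTM-CRF and enriches the character vectors through multi-source information fusion (multi-source word segmentation information and global lexical embedding information), so that they combine character-level and word-level information and the model captures context better. Experiments show that the model effectively recognizes disease, pest, pharmaceutical, and crop entities, with F1 values of 96.56%, 95.12%, 94.48%, and 95.54%, respectively. It also performs well on pathogen entities, which are harder to recognize, with an F1 value of 81.48%, higher than the corresponding values of BERT-BILSTM-CRF, BERT, and other models. On datasets from other domains, MSRA and Weibo, the model was compared with CAN-NER, Lattice-LSTM-CRF, and other models and achieved the best recognition results, with F1 values of 95.80% and 94.57%, respectively, indicating that the algorithm has a degree of generalization ability.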

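Keywords: named entity recognition  crop diseases and insect pests  agricultural natural language processing  Chinese word segmentation  stop-wait algorithm
Received: 2020/12/5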
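
The abstract names multi-source word segmentation information as one input to the character vectors but does not say how it is encoded. The sketch below shows one common way such information could be attached to characters: each segmentation is turned into per-character B/M/E/S position tags, and the tags from all segmenters are collected per character. The function names, the BMES scheme, and the two example segmentations are assumptions made for illustration, not the encoding used by Agr-IE.

```python
from typing import List


def bmes_tags(words: List[str]) -> List[str]:
    """Map one word segmentation to a B/M/E/S tag for every character."""
    tags: List[str] = []
    for word in words:
        if not word:
            raise ValueError("empty word in segmentation")
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle tags, End
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def fuse_segmentations(sentence: str, segmentations: List[List[str]]) -> List[List[str]]:
    """Return, for each character, its tag under every segmentation source."""
    per_source = []
    for words in segmentations:
        if "".join(words) != sentence:
            raise ValueError("segmentation does not reproduce the sentence")
        per_source.append(bmes_tags(words))
    # Transpose: one row per character, one column per segmenter.
    return [list(column) for column in zip(*per_source)]


# Hypothetical outputs of two different segmenters for the same phrase.
sentence = "水稻稻瘟病"
seg_a = ["水稻", "稻瘟病"]
seg_b = ["水稻", "稻瘟", "病"]

features = fuse_segmentations(sentence, [seg_a, seg_b])
for char, tags in zip(sentence, features):
    print(char, tags)
```

In a full model, each tag would typically be mapped to a small learned embedding and concatenated with the character representation before the BiLSTM layer; that step is omitted here.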

Named Entity Recognition of Diseases and Insect Pests Based on Multi Source Information Fusion
LI Lin,ZHOU Han,GUO Xuchao,LIU Chengqi,SU Jie,TANG Zhan.Named Entity Recognition of Diseases and Insect Pests Based on Multi Source Information Fusion[J].Transactions of the Chinese Society of Agricultural Machinery,2021,52(12):253-263.
Authors:LI Lin  ZHOU Han  GUO Xuchao  LIU Chengqi  SU Jie  TANG Zhan
Institution:China Agricultural University
Abstract: Text mining of crop disease and insect pest literature is becoming increasingly important as the number of such documents grows rapidly. Effective and accurate named entity recognition (NER) systems for crop diseases and insect pests help extract findings from related research reports and provide useful suggestions for disease and pest control. A stop-wait algorithm based on semi-remote supervision was proposed to construct a corpus of Chinese crop diseases and insect pests, addressing the lack of such a corpus. In addition, an agricultural information extraction (Agr-IE) model was proposed. The model was based on BERT-BILSTM-CRF and used multi-source word segmentation information and global lexical embeddings to enrich the character vectors, so that they combined character-level and word-level information. Experiments with Agr-IE on the crop disease and insect pest dataset showed that the model effectively recognized four types of entities: the F1 scores for diseases, pests, pharmaceuticals, and crops were 96.56%, 95.12%, 94.48%, and 95.54%, respectively. The model also performed well on pathogen entities (81.48% F1 score), exceeding the corresponding values of BERT-BILSTM-CRF, BERT, and other models. Furthermore, the proposed model was compared with CAN-NER, Lattice-LSTM-CRF, and other models on the MSRA and Weibo datasets and obtained the best recognition results, with F1 scores of 95.80% and 94.57%, respectively, which indicated that the algorithm had a degree of generalization ability.
Keywords: named entity recognition  crop diseases and insect pests  agricultural natural language processing  Chinese word segmentation  stop-wait algorithm
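
The F1 scores quoted in the abstract are the harmonic mean of precision and recall. NER work conventionally counts an entity as correct only when both its boundaries and its type match exactly; the abstract does not state the matching rule, so the sketch below assumes that convention. The BIO tag scheme and the `DIS` and `CROP` labels are hypothetical and chosen only for the example.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match, entity-level precision, recall and F1."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: the predicted crop entity is one character too short,
# so it does not count as a match under the exact-match rule.
gold = ["B-DIS", "I-DIS", "I-DIS", "O", "B-CROP", "I-CROP"]
pred = ["B-DIS", "I-DIS", "I-DIS", "O", "B-CROP", "O"]

precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Per-type scores such as those reported for diseases or pathogens would follow from filtering the spans by entity type before counting.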