Automatic acquisition and classification system for agricultural network information based on Web data

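Cite this article:	Duan Qingling, Wei Fangfang, Zhang Lei, Xiao Xiaoyan. Automatic acquisition and classification system for agricultural network information based on Web data[J]. Transactions of the Chinese Society of Agricultural Engineering, 2016, 32(12): 172-178. DOI: 10.11975/j.issn.1002-6819.2016.12.025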

Authors:	Duan Qingling, Wei Fangfang, Zhang Lei, Xiao Xiaoyan

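Affiliations:	1. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China; 2. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China; Beijing Agricultural Networking Engineering Technology Research Center, Beijing 100083, China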

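Funding:	National High Technology Research and Development Program of China (863 Program) (2013AA102306); Shandong Province Independent Innovation Project (2014XGA13054); Fundamental Research Funds for the Central Universities (2015XD001).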

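Abstract:	To obtain agricultural Web information quickly and efficiently and to address information islands and information asymmetry, this study focuses on automatic acquisition and extraction of agricultural Web data, text classification based on SVM (support vector machine), and heterogeneous data collection for the Internet of things, and uses the unified modeling language (UML) to describe an automatic acquisition and classification system for agricultural network information. The system automatically crawls and shares data from agricultural websites and the Internet of things, and provides users with online query of agricultural news, agricultural product market prices, and supply and demand information, together with real-time monitoring of environmental data and personalized information services. Application results show that the system achieves an information crawling accuracy of 98.2% on the sample set of websites and a news classification accuracy of 92.5%. It features strong real-time data acquisition, good user participation, and high generality, and provides a reference for agricultural information integration and services.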
Keywords:	agriculture; text processing; acquisition systems; information; Internet of things
Received:	2015-12-11
Revised:	2016-04-24
Automatic acquisition and classification system for agricultural network information based on Web data

Duan Qingling, Wei Fangfang, Zhang Lei and Xiao Xiaoyan. Automatic acquisition and classification system for agricultural network information based on Web data[J]. Transactions of the Chinese Society of Agricultural Engineering, 2016, 32(12): 172-178. DOI: 10.11975/j.issn.1002-6819.2016.12.025

Authors:	Duan Qingling, Wei Fangfang, Zhang Lei, Xiao Xiaoyan

Affiliation:	1. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China; 2. Beijing Agricultural Networking Engineering Technology Research Center, Beijing 100083, China

Abstract:	The purpose of this study is to obtain agricultural web information efficiently and to provide users with personalized services by integrating agricultural resources scattered across different sites and fusing heterogeneous environmental data. The paper improves several key information technologies: agricultural web data acquisition and extraction, text classification based on the support vector machine (SVM), and heterogeneous data collection based on the Internet of things (IOT). First, high-quality target seed sites are added to the system, and their website URLs (uniform resource locators) and category information are obtained. The web crawler saves the original pages. De-noised web pages are obtained with an HTML parser and regular expressions, which are used to create custom Node Filter objects. The system then builds a document object model (DOM) tree and mines the data areas. According to filtering rules, the target data area is identified from multiple data regions with repeated patterns, and structured data are extracted after property segmentation. Second, a linear SVM classification model is constructed to classify agricultural texts automatically. The procedure consists of 4 steps: the segmentation tool ICTCLAS performs word segmentation and part-of-speech (POS) tagging; feature words are selected by combining an agricultural keyword dictionary with a document frequency adjustment rule; a feature vector is built and the inverse document frequency (IDF) weight of each feature word is calculated; and an adaptive SVM classifier is designed. Finally, perception data in different formats collected by sensors are transmitted over the wireless sensor network to a designated server as source data. The source data are then written to a relational database at the specified acquisition frequency through data conversion and data filtering.
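The IDF weighting step named in the classification procedure above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the plain form log(N / df) is assumed here, and the function names `idf_weights` and `tfidf_vector`, as well as the sample tokens in the usage note, are hypothetical. Documents are assumed to be already segmented into token lists (for example by a tool such as ICTCLAS).

```python
import math
from collections import Counter


def idf_weights(docs, feature_words):
    """Return an IDF weight for each feature word.

    docs: list of token lists (documents that are already segmented).
    feature_words: list of words selected as features.
    Assumption: idf = log(N / df); words that never occur get weight 0.0.
    """
    n_docs = len(docs)
    features = set(feature_words)
    df = Counter()
    for tokens in docs:
        # Count each feature word at most once per document.
        df.update(set(tokens) & features)
    return {w: math.log(n_docs / df[w]) if df[w] else 0.0 for w in feature_words}


def tfidf_vector(tokens, feature_words, idf):
    """Build a feature vector of term frequency times IDF weight."""
    tf = Counter(tokens)
    return [tf[w] * idf[w] for w in feature_words]
```

For instance, with four toy documents `[["wheat", "price"], ["wheat", "pest"], ["corn", "price"], ["wheat", "yield"]]` and the features `["wheat", "price", "pest"]`, the word "pest" appears in one document out of four and therefore receives the largest weight, log 4. The resulting vectors are the kind of input a linear SVM classifier would be trained on.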
The key step of data conversion is implemented on the basis of mapping rules between source data and target data. There are 3 kinds of mapping rules: the first maps source data directly to the target data; the second creates a temporary table that corresponds to the target table when the two have the same field names; and the third converts perception data in XML (extensible markup language) format into the relational database. In addition, data filtering is required to handle abnormal measured values that fall outside the sensor range. Unified modeling language (UML) is used to describe the agricultural network information automatic acquisition and classification system: the use case diagram describes the user requirement analysis, and the activity diagram describes the web data extraction process. These support the implementation of the system's key function of automatic information acquisition from the Internet. The IOT data sharing module is implemented based on the proposed data conversion and filtering rules. The system provides timely agricultural news, agricultural product prices, browsing and querying of supply and demand information, real-time agricultural environment monitoring, and personalized information statistics. The preliminary application shows that the system improves the accuracy of information extraction and text classification. The information acquisition accuracy for the sample website set is 98.2%, and the accuracy of text classification with rules is 92.5%. Compared with sequential minimal optimization (SMO), Bayesian, C4.5 decision tree and radial basis function (RBF) based SVM algorithms, the linear SVM is more suitable for agricultural news classification.
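The third mapping rule (XML perception data to a relational table) and the range-based filtering described above might look roughly like the following sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the XML layout, the sensor names and ranges in `SENSOR_RANGES`, the table name `env_data`, and the helper names.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical valid ranges per sensor type (assumed values).
SENSOR_RANGES = {"air_temperature": (-40.0, 80.0), "air_humidity": (0.0, 100.0)}

# Hypothetical XML payload as it might arrive from a sensor gateway.
SAMPLE_XML = """<readings>
  <reading sensor="air_temperature" time="2016-01-01T08:00:00">21.5</reading>
  <reading sensor="air_temperature" time="2016-01-01T08:05:00">999.0</reading>
  <reading sensor="air_humidity" time="2016-01-01T08:00:00">63.0</reading>
</readings>"""


def xml_to_rows(xml_text):
    """Map each <reading> element to a (sensor, time, value) tuple."""
    root = ET.fromstring(xml_text)
    return [(r.get("sensor"), r.get("time"), float(r.text)) for r in root.iter("reading")]


def in_range(row):
    """Keep a row only if its value lies inside the known sensor range."""
    sensor, _, value = row
    bounds = SENSOR_RANGES.get(sensor)
    if bounds is None:
        return False  # unknown sensor type: treated as abnormal in this sketch
    low, high = bounds
    return low <= value <= high


def load(conn, rows):
    """Filter abnormal rows and insert the rest; return the number kept."""
    conn.execute("CREATE TABLE IF NOT EXISTS env_data (sensor TEXT, time TEXT, value REAL)")
    kept = [row for row in rows if in_range(row)]
    conn.executemany("INSERT INTO env_data VALUES (?, ?, ?)", kept)
    return len(kept)
```

Calling `load(sqlite3.connect(":memory:"), xml_to_rows(SAMPLE_XML))` stores two of the three sample readings; the 999.0 temperature is dropped because it lies outside the assumed sensor range.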
The system has high real-time performance and good user participation for IOT applications, and is expected to be applied to agricultural information integration and intelligent processing.

Keywords:	agriculture; text processing; information systems; information; the Internet of things
This article is indexed in CNKI, Wanfang Data and other databases.