
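Research on Incremental Crawler of Agricultural News Data Source
Cite this article: Yang Guangzhao. Research on Incremental Crawler of Agricultural News Data Source[J]. 现代农业科技 (Modern Agricultural Science and Technology), 2021, 2(2).
Author: Yang Guangzhao (杨广召)
Affiliation: Tarim University (塔里木大学)
Abstract: As the volume of agricultural news data keeps growing, agriculture-focused incremental crawlers have become a suitable means of collecting agricultural information. An incremental crawler follows updates to agricultural news sources, fetches only the newly updated content, and discards content it has already crawled. Based on the characteristics of agricultural news data, this paper proposes an incremental deduplication method for agricultural news that uses a Redis-based Bloom filter, avoiding the problem of oversized persistence files exhausting memory. Experiments show that, as the amount of crawled agricultural information grows, the method keeps memory from being exhausted while effectively improving the efficiency of incremental crawling, and that it has good application value in incremental information crawling.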

Keywords: incremental crawler; agricultural news; deduplication
Received: 2020/8/2
Revised: 2020/8/2

Research on Incremental Crawler of Agricultural News Data Source
Abstract: As agricultural news data continues to expand, incremental crawlers focused on agriculture have become a suitable means of crawling agricultural information. An incremental crawler tracks updates to agricultural news data, crawls only the updated content, and removes duplicate content that has already been crawled. Drawing on the characteristics of agricultural news data, this article proposes an incremental deduplication method for agricultural news information built on a Redis-based Bloom filter, which avoids the problem of very large persistence files exhausting memory. Experiments show that, as the volume of crawled agricultural information increases, the method effectively improves the efficiency of incremental crawling while keeping memory from being exhausted. It has good application value in incremental information crawling.
Keywords: incremental crawler; agricultural news; deduplication
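
The abstract does not show the implementation, so the following is only a minimal sketch of the general idea it names: a Bloom filter whose bits live in a Redis-style bitmap, used to skip URLs that were already crawled. The class names, the key `news:seen`, the bitmap size, the number of hash functions, and the example URLs are all assumptions made for illustration and are not taken from the paper. An in-memory stand-in that mimics the `setbit`/`getbit` calls of a Redis client is used so the sketch runs without a server.

```python
import hashlib


class InMemoryBits:
    """Hypothetical stand-in for a Redis client; only setbit/getbit are mimicked."""

    def __init__(self):
        self._bits = {}

    def setbit(self, name, offset, value):
        # Store the bit and return the previous value, as Redis SETBIT does.
        old = self._bits.get((name, offset), 0)
        self._bits[(name, offset)] = value
        return old

    def getbit(self, name, offset):
        return self._bits.get((name, offset), 0)


class BloomFilter:
    """Bloom filter over a bitmap reachable through setbit/getbit."""

    def __init__(self, client, key="news:seen", size_bits=1 << 20, num_hashes=5):
        self.client = client          # anything exposing setbit/getbit
        self.key = key                # assumed bitmap key name
        self.size_bits = size_bits    # assumed bitmap size
        self.num_hashes = num_hashes  # assumed number of hash functions

    def _offsets(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for offset in self._offsets(item):
            self.client.setbit(self.key, offset, 1)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.client.getbit(self.key, o) for o in self._offsets(item))


def filter_new(urls, seen):
    """Return only URLs not yet recorded in the filter, recording them as we go."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    seen = BloomFilter(InMemoryBits())
    print(filter_new(["https://example.com/a", "https://example.com/b"], seen))
    print(filter_new(["https://example.com/b", "https://example.com/c"], seen))
```

Because the filter only needs `setbit` and `getbit`, a redis-py client should be usable in place of `InMemoryBits`, which would keep the bitmap in Redis across crawl runs. A Bloom filter can report false positives, so a small share of genuinely new pages may be skipped; the bitmap size and hash count trade memory against that error rate.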