A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features

Cite this article:	赵婉婧, 刘敏娟, 刘洪冰, 王新, 段飞虎. 基于PDF版式特征的文献篇章结构细粒度抽取方法研究[J]. 农业图书情报学刊, 2021, 33(9): 93-103.

Authors:	赵婉婧, 刘敏娟, 刘洪冰, 王新, 段飞虎

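Affiliations:	1. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081; 2. Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081; 3. Tongfang Knowledge Network Digital Publishing Technology Co., Ltd., Beijing 100192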

Funding:	Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII)

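Abstract:	[Purpose/Significance] To organize literature resources at a fine granularity and meet users' increasingly precise information-service needs, this study proposes a method for automatically extracting the fine-grained chapter structure of documents based on PDF layout features. [Method/Process] The method exploits the strengths of machine learning in classification: for unstructured PDF documents, it automatically analyzes, recognizes, and extracts chapter titles from their layout features. Using the coordinates of the chapter titles, it then assigns the body text, with the paragraph as the smallest unit, to the position beneath the title it belongs to, which yields a fine-grained extraction and reorganization of the full-text structure. [Results/Conclusions] In tests, automatic extraction reached an average accuracy of 80%. Fine-grained extraction from unstructured PDF documents has practical value and good application prospects; a data processing system built on the method is already in use and substantially reduces the manual work of fine-grained chapter structure extraction.
Keywords:	layout features; chapter structure; chapter title; fine-grained extraction; machine learning
Received:	2021-04-01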
A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features

ZHAO Wanjing, LIU Minjuan, LIU Hongbing, WANG Xin, DUAN Feihu. A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features[J]. Journal of Library and Information Sciences in Agriculture, 2021, 33(9): 93-103.

Authors:	ZHAO Wanjing, LIU Minjuan, LIU Hongbing, WANG Xin, DUAN Feihu

Institution:	1. Agricultural Information Institute of CAAS, Beijing 100081; 2. Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081; 3. Tongfang Knowledge Network Digital Publishing Technology Co., Ltd., Beijing 100192

Abstract:	[Purpose/Significance] This paper proposes a method for automatically extracting the fine-grained chapter structure of documents based on PDF layout features, in order to organize literature resources at a fine granularity and meet users' growing need for accurate information services. [Method/Process] The method takes advantage of machine learning for classification to automatically analyze, identify, and extract the chapter titles of unstructured PDF documents based on their layout features. Using the coordinates of the chapter titles, the body content is then automatically matched to its position under the corresponding title, with the paragraph as the minimum granularity, which completes the fine-grained extraction and reorganization of the full-text structure. [Results/Conclusions] Tests show that automatic extraction reaches an average accuracy of 80%. The proposed method for fine-grained extraction from unstructured PDF documents has practical significance and good application prospects, and a data processing system built on it has been put into practical use, greatly reducing the manual work of chapter structure extraction.

Keywords:	layout features; chapter structure; chapter title; fine-grained extraction; machine learning
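
The second step described in the abstract, matching body paragraphs to the chapter title they fall under by position, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that an upstream classifier has already labeled each text block as a heading (with a level) or as body text, that the layout is single-column, and that `y_top` is measured downward from the top of the page. The names `Block`, `Section`, and `build_outline` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """A text block taken from a PDF page (hypothetical structure)."""
    page: int                     # zero-based page index
    y_top: float                  # distance from the top of the page (assumed)
    text: str
    level: Optional[int] = None   # heading level (1, 2, ...); None for body text


@dataclass
class Section:
    title: str
    level: int
    paragraphs: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


def build_outline(blocks: List[Block]) -> Section:
    """Attach each paragraph to the nearest preceding heading in reading order."""
    root = Section(title="(front matter)", level=0)
    stack = [root]  # path from the root to the currently open section
    # Reading order for a single-column layout: page first, then top to bottom.
    for block in sorted(blocks, key=lambda b: (b.page, b.y_top)):
        if block.level is None:
            # Body paragraph: it belongs to the innermost open section.
            stack[-1].paragraphs.append(block.text)
            continue
        # Heading (level >= 1 assumed): close sections at the same or deeper level.
        while stack[-1].level >= block.level:
            stack.pop()
        section = Section(title=block.text, level=block.level)
        stack[-1].children.append(section)
        stack.append(section)
    return root
```

Sorting by page and vertical position stands in for the coordinate-based positioning the abstract mentions; a multi-column layout would need a column index in the sort key as well.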
