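A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features
Cite this article:ZHAO Wanjing, LIU Minjuan, LIU Hongbing, WANG Xin, DUAN Feihu. A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features[J]. Journal of Library and Information Sciences in Agriculture, 2021, 33(9): 93-103.
Authors:ZHAO Wanjing  LIU Minjuan  LIU Hongbing  WANG Xin  DUAN Feihu
Institution:1. Agricultural Information Institute of CAAS, Beijing 100081;
2. Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081;
3. Tongfang Knowledge Network Digital Publishing Technology Co., Ltd., Beijing 100192
Funding:Chinese Academy of Agricultural Sciences Science and Technology Innovation Program (CAAS-ASTIP-2016-AII)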
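Abstract:[Purpose/Significance] To organize literature resources at a fine granularity and meet users' increasingly precise information service needs, this study proposes a method for automatic fine-grained extraction of document chapter structure based on PDF layout features. [Method/Process] The method exploits the strengths of machine learning in classification: for unstructured PDF documents, it automatically analyzes, identifies, and extracts chapter headings based on their layout features. Using the coordinates of each chapter heading, it then automatically assigns the body text, with the paragraph as the smallest unit, to the position beneath the heading it belongs to, and thereby extracts and reorganizes the full-text structure of the document at a fine granularity. [Results/Conclusions] In testing, automatic extraction reached an average accuracy of 80%. Fine-grained extraction from unstructured PDF documents has practical significance and good application prospects. A data processing system built on the underlying method is already in practical use and greatly reduces the manual work of fine-grained chapter structure extraction.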

Keywords:layout features  chapter structure  chapter title  fine-grained extraction  machine learning
Received:2021-04-01

A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features
ZHAO Wanjing, LIU Minjuan, LIU Hongbing, WANG Xin, DUAN Feihu. A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features[J]. Journal of Library and Information Sciences in Agriculture, 2021, 33(9): 93-103.
Authors:ZHAO Wanjing  LIU Minjuan  LIU Hongbing  WANG Xin  DUAN Feihu
Institution:1. Agricultural Information Institute of CAAS, Beijing 100081;
2. Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081;
3. Tongfang Knowledge Network Digital Publishing Technology Co., Ltd., Beijing 100192
Abstract:[Purpose/Significance] This paper proposes a method for automatic fine-grained extraction of document structure based on PDF layout features, in order to organize literature resources at a fine granularity and meet users' growing need for precise information services. [Method/Process] The method takes advantage of the strengths of machine learning in classification: for unstructured PDF documents, it automatically analyzes, identifies, and extracts chapter titles based on their layout features. Using the coordinates of each chapter title, the body content is then automatically matched to the position beneath the title it belongs to, with the paragraph as the minimum granularity, so that the full-text structure of the document is extracted and reorganized at a fine granularity. [Results/Conclusions] Tests show that the average accuracy of automatic extraction reaches 80%. The proposed method for fine-grained extraction from unstructured PDF documents has practical significance and good application prospects. A data processing system built on the underlying method has been put into practical use and greatly reduces the manual work of chapter structure extraction.
Keywords:layout features  chapter structure  chapter title  fine-grained extraction  machine learning  
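
The abstract's second step, attaching body paragraphs to the heading they fall under by coordinate, can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration and not the authors' implementation: the `Block` and `Section` types, the `top` coordinate measured downward from the page top, the single-column reading order, and `is_heading`, which is a simple font-size rule standing in for the paper's trained machine learning classifier.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A text block already extracted from a PDF (hypothetical structure)."""
    page: int         # zero-based page index
    top: float        # distance from the top of the page; smaller is higher
    font_size: float  # layout feature used by the stand-in classifier
    text: str


@dataclass
class Section:
    title: str
    paragraphs: list = field(default_factory=list)


def is_heading(block, body_font_size):
    # Stand-in rule, NOT the paper's classifier: a short block set in a
    # font larger than the body text is treated as a chapter heading.
    return block.font_size > body_font_size and len(block.text) < 80


def build_sections(blocks, body_font_size):
    """Group paragraphs under the nearest preceding heading.

    Assumes a single-column layout, so reading order is (page, top).
    """
    # Text that appears before the first heading goes into a placeholder.
    sections = [Section(title="(front matter)")]
    for block in sorted(blocks, key=lambda b: (b.page, b.top)):
        if is_heading(block, body_font_size):
            sections.append(Section(title=block.text))
        else:
            sections[-1].paragraphs.append(block.text)
    return sections


if __name__ == "__main__":
    demo = [
        Block(0, 100.0, 14.0, "1 Introduction"),
        Block(0, 300.0, 10.0, "Body of the introduction."),
        Block(1, 50.0, 14.0, "2 Method"),
        Block(1, 120.0, 10.0, "Body of the method."),
    ]
    for section in build_sections(demo, body_font_size=10.0):
        print(section.title, section.paragraphs)
```

Sorting by page and vertical position before grouping means the input blocks may arrive in any order. A multi-column layout would need a column-aware sort key, which this sketch does not attempt.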