
An Algorithm for Chinese Plagiarism Detection Based on Left Align Word Frequency Vector Space Model
Cite as: XIE Song-shan, TANG Yan. An Algorithm for Chinese Plagiarism Detection Based on Left Align Word Frequency Vector Space Model[J]. Journal of Southwest Agricultural University, 2015, 37(5): 158-161.
Authors: XIE Song-shan, TANG Yan
Affiliation: College of Computer and Information Science, Southwest University, Chongqing 400715
Funding: Supported by the "Chunhui Plan" of the Ministry of Education
Abstract: This paper proposes a plagiarism detection algorithm based on a left-aligned word frequency vector space model. First, left-align processing resolves the referring expressions in the plagiarized text back to their referents. Next, synonym chains are used to unify synonyms: every synonym is left-aligned to, that is, replaced by, the first word of its synonym chain. Word frequencies are then counted directly to build the word frequency features of each text, dropping the multi-step TF-IDF computation of relative word frequency used in other word-frequency-based plagiarism detection algorithms. Finally, a vector space model is built from the word frequency features, and text similarity is computed as cosine similarity. Experimental results show that, on data sets covering various types of plagiarism, the algorithm achieves better overall performance, better stability, and higher efficiency.

Keywords: plagiarism detection; similarity; vector space model; left align
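
The steps named in the abstract (synonym left-alignment, raw word frequency counting, cosine similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: the synonym chains, the function names, and the sample tokens are all hypothetical, and the sketch assumes the texts have already been segmented into words and that reference resolution has already been applied.

```python
import math
from collections import Counter

# Hypothetical synonym chains; the first word of each chain is its head.
SYNONYM_CHAINS = [
    ["computer", "pc", "machine"],
    ["fast", "quick", "rapid"],
]


def build_head_map(chains):
    """Map every word in a chain to the chain's first word (left alignment)."""
    head_map = {}
    for chain in chains:
        for word in chain:
            head_map[word] = chain[0]
    return head_map


def term_frequencies(tokens, head_map):
    """Count raw word frequencies after replacing synonyms by their chain head.

    No TF-IDF weighting is applied; the counts are used directly.
    """
    return Counter(head_map.get(token, token) for token in tokens)


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity of two sparse word frequency vectors."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


head_map = build_head_map(SYNONYM_CHAINS)

# Toy, pre-segmented texts: the second one swaps in synonyms.
source_tokens = ["the", "computer", "is", "fast"]
suspect_tokens = ["the", "pc", "is", "quick"]

# Similarity without and with synonym left-alignment.
plain = cosine_similarity(Counter(source_tokens), Counter(suspect_tokens))
aligned = cosine_similarity(
    term_frequencies(source_tokens, head_map),
    term_frequencies(suspect_tokens, head_map),
)
print(plain, aligned)
```

In this toy case the synonym-substituted text shares only two of four words with the source before alignment, while after alignment the two frequency vectors coincide. Real inputs would need a Chinese word segmenter and a synonym lexicon, neither of which is specified in the abstract.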
This article is indexed in CNKI and other databases.