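基于左归词频向量空间模型的中文文本抄袭检测算法 (A Chinese Text Plagiarism Detection Algorithm Based on the Left-Aligned Word Frequency Vector Space Model)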

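Cite this article:	Xie Songshan, Tang Yan. A Chinese text plagiarism detection algorithm based on the left-aligned word frequency vector space model [J]. Journal of Southwest Agricultural University (西南农业大学学报), 2015, 37(5): 158-161.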

Authors:	Xie Songshan (谢松山), Tang Yan (唐雁)

Affiliation:	College of Computer and Information Science, Southwest University, Chongqing 400715

Funding:	Supported by the Ministry of Education "Chunhui Plan" (春晖计划)

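Abstract (translated from the Chinese):	A plagiarism detection algorithm based on a left-aligned word frequency vector space model is proposed. Left-alignment processing first restores the references in the plagiarized text to their referents. Then, using synonym chains, every synonym is uniformly left-aligned to the first word of its chain. Word frequency features of the text are built by counting word frequencies directly, discarding the multi-step TF-IDF computation of relative word frequency used in other word-frequency plagiarism detection algorithms. Finally, a vector space model is built from the word frequency features, and text similarity is computed with cosine similarity. Experiments show that, on data sets covering various plagiarism types, the algorithm has better overall performance, better stability, and higher efficiency.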
Keywords (translated from the Chinese):	plagiarism detection; similarity; vector space model; left alignment
An Algorithm for Chinese Plagiarism Detection Based on Left Align Word Frequency Vector Space Model

XIE Song-shan, TANG Yan. An Algorithm for Chinese Plagiarism Detection Based on Left Align Word Frequency Vector Space Model [J]. Journal of Southwest Agricultural University, 2015, 37(5): 158-161.

Authors:	XIE Song-shan, TANG Yan

Abstract:	This paper proposes a plagiarism detection algorithm based on a left-align word frequency vector space model. First, left-align processing restores the references in the copied text. Next, all synonyms are unified by left-aligning each one to the first word of its synonym chain. Then, text word frequency features are constructed directly by counting word frequencies, abandoning the multi-step TF-IDF computation of relative word frequency used in other word-frequency plagiarism detection algorithms. Finally, a vector space model is constructed from the word frequency features, and text similarity is calculated using cosine similarity. Experimental results show that, on data sets covering various types of plagiarism, the algorithm has better overall performance, better stability, and greater efficiency.
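
The pipeline described in the abstract (synonym left-alignment, raw word counts, cosine similarity) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the synonym chains are hypothetical toy data, the input is assumed to be already segmented into words, and the reference-restoration step is omitted because the abstract does not say how it is done.

```python
import math
from collections import Counter

# Hypothetical synonym chains (toy data, not from the paper).
# The first word of each chain acts as the canonical "head" word.
SYNONYM_CHAINS = [
    ["computer", "pc", "machine"],
    ["fast", "quick", "rapid"],
]


def build_head_map(chains):
    """Map every word in a chain to the first word of that chain."""
    head_of = {}
    for chain in chains:
        head = chain[0]
        for word in chain:
            # Keep the first mapping if a word appears in several chains.
            head_of.setdefault(word, head)
    return head_of


def left_align(tokens, head_of):
    """Replace each token by the head of its synonym chain, if it has one."""
    return [head_of.get(token, token) for token in tokens]


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two raw word-count vectors (Counters)."""
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def similarity(tokens_a, tokens_b, chains=SYNONYM_CHAINS):
    """Left-align synonyms, count raw word frequencies, compare by cosine."""
    head_of = build_head_map(chains)
    freq_a = Counter(left_align(tokens_a, head_of))
    freq_b = Counter(left_align(tokens_b, head_of))
    return cosine_similarity(freq_a, freq_b)


if __name__ == "__main__":
    original = ["the", "computer", "is", "fast"]
    rewritten = ["the", "pc", "is", "quick"]
    print(similarity(original, rewritten))
```

In this toy example the rewritten sentence swaps two words for synonyms, so the raw count vectors differ, but after left-alignment both sentences map to the same counts. Using raw counts instead of TF-IDF means no corpus-wide document frequencies are needed, which is consistent with the efficiency claim in the abstract.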

Keywords:	plagiarism detection; similarity; vector space model; left align
This article is indexed in CNKI and other databases.