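Gene Data Clustering Based on Minimum Coding Length (基于最小编码长度的基因数据聚类)
Cite this article: WANG Xue-hong, JIAO Qing-ju, CHANG Pan-pan, HUANG Ji-feng. Gene Data Clustering Based on Minimum Coding Length[J]. Journal of Anhui Agricultural Sciences (安徽农业科学), 2012(19): 10003-10005, 10072.
Authors: WANG Xue-hong (汪雪红)  JIAO Qing-ju (焦清局)  CHANG Pan-pan (常盼盼)  HUANG Ji-feng (黄继风)
Affiliations: 1. College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234
2. Department of Automation, Shanghai Jiao Tong University, State Key Laboratory of System Control and Information Processing, Shanghai 200240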
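Abstract: [Objective] To analyze the clustering performance of a gene data clustering algorithm based on minimum coding length, with a view to providing a new method for gene data clustering. [Method] Gene data clustering was treated as the clustering of high-dimensional mixed data. The gene data were preprocessed and then reduced in dimension by principal component analysis; after reduction the data followed a Gaussian-like distribution. Data with this distribution can be clustered effectively by a simple clustering algorithm based on lossy data compression, which groups the genes so that their total coding length after clustering is minimized. In the experiments, the new algorithm and traditional clustering algorithms were applied to yeast and Arabidopsis gene data, and the effectiveness of the new algorithm was verified through internal evaluation and functional evaluation of the gene clusters. [Result] Validation experiments on yeast and Arabidopsis gene data showed that the new algorithm clustered better than the traditional clustering algorithms and avoided problems such as having to choose the number of clusters subjectively and sensitivity to the initial cluster centers. [Conclusion] The results provide a new method for gene data clustering.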

Keywords: gene clustering  lossy compression  Gaussian distribution  minimum coding length

Genetic Data Clustering Based on Minimum Coding Length
Institution: WANG Xue-hong et al. (College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234)
Abstract: [Objective] This paper aimed to provide a new method for genetic data clustering by analyzing the clustering performance of a genetic data clustering algorithm based on minimum coding length. [Method] Genetic data clustering was regarded as the clustering of high-dimensional mixed data. After preprocessing, the dimensionality of the genetic data was reduced by principal component analysis, after which the data presented a Gaussian-like distribution. Data with this distribution could be clustered effectively by a simple clustering algorithm based on lossy data compression, which groups the genes so that the total coding length of the clustered genes is minimized. This algorithm and traditional clustering algorithms were applied to yeast and Arabidopsis genetic data, and the effectiveness of the new algorithm was verified through internal evaluation and functional evaluation of the gene clusters. [Result] The clustering results of the new algorithm were superior to those of the traditional clustering algorithms, and the new algorithm avoided the problems of subjectively determining the number of clusters and of sensitivity to the initial cluster centers. [Conclusion] This study provides a new clustering method for genetic data.
Keywords: Genetic clustering  Lossy compression  Gaussian distribution  Minimum coding length
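
The pipeline outlined in the abstract (PCA reduction followed by clustering that minimizes total coding length) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything here is an assumption: the coding-length function is the Gaussian rate-distortion estimate commonly used in lossy-coding segmentation, the greedy pairwise merging is one plausible way to minimize it, and the names `pca_reduce`, `coding_length`, `cluster_min_coding_length` and the distortion parameter `eps` are hypothetical, not taken from the paper.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def coding_length(W, eps):
    """Assumed estimate of the bits needed to code the rows of W
    up to distortion eps**2 under a Gaussian model."""
    m, n = W.shape
    mu = W.mean(axis=0)
    Wc = W - mu
    M = np.eye(n) + (n / (eps ** 2 * m)) * (Wc.T @ Wc)
    _, logdet = np.linalg.slogdet(M)
    bits = 0.5 * (m + n) * logdet / np.log(2)            # spread around the mean
    bits += 0.5 * n * np.log2(1.0 + (mu @ mu) / eps ** 2)  # the mean itself
    return bits


def cluster_min_coding_length(X, eps):
    """Greedy bottom-up merging: start with one group per sample and
    merge the pair that shortens the total coding length the most;
    stop when no merge shortens it. The number of clusters is not preset."""
    N = X.shape[0]
    groups = [[i] for i in range(N)]

    def cost(idx):
        m = len(idx)
        # Data bits plus the bits needed to record group membership.
        return coding_length(X[idx], eps) - m * np.log2(m / N)

    while len(groups) > 1:
        best_gain, best_pair = 0.0, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                merged = groups[a] + groups[b]
                gain = cost(groups[a]) + cost(groups[b]) - cost(merged)
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break
        a, b = best_pair
        groups[a] = groups[a] + groups[b]
        del groups[b]

    labels = np.empty(N, dtype=int)
    for k, idx in enumerate(groups):
        labels[idx] = k
    return labels
```

A typical call would be `labels = cluster_min_coding_length(pca_reduce(X, k), eps)`, with `X` holding one preprocessed gene expression profile per row. In this sketch `eps` controls how coarse the grouping is, and the search recomputes all pair costs at every step, so it suits only small data sets.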
This article is indexed in CNKI, Wanfang Data, and other databases.