Abstract
We assessed the impact of 13 data transformation methods on the performance of four types of clustering methods (partitioning (K-means), hierarchical distance (Average Linkage), multivariate normal mixture, and non-parametric kernel density) and four cluster number determination statistics (CNDS) (Pseudo F, Pseudo t², Cubic Clustering Criterion (CCC), and Bayesian Information Criterion (BIC)), using both simulated and real gene expression profile data. We found that the Square Root, Cubic Root, and Spacing transformations have mostly positive impacts on the performance of the four types of clustering methods, whereas Tukey's Bisquare and Interquantile Range have mostly negative impacts. The impacts of the other transformation methods depend on the clustering method and the data type. The performance of the CNDS improves with appropriately transformed data. Multivariate Mixture Clustering and Kernel Density Clustering perform better than K-means and Average Linkage in grouping both simulated and real gene expression profile data.
Keywords
cluster analysis, gene expression profile, data transformation, data normalization, cluster number determination statistics, robustness, Pseudo F, Pseudo t2, cubic clustering criterion, Bayesian information criterion, Average linkage, k-mean, multivariate mixture-model, kernel density clustering, nonparametric clustering
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Recommended Citation
Shu, Guoping; Zeng, Beiyan; Wright, Deanne; and Smith, Oscar (2002). "IMPACT OF DATA TRANSFORMATION ON THE PERFORMANCE OF DIFFERENT CLUSTERING METHODS AND CLUSTER NUMBER DETERMINATION STATISTICS FOR ANALYZING GENE EXPRESSION PROFILE DATA," Conference on Applied Statistics in Agriculture. https://doi.org/10.4148/2475-7772.1203