Biotechnological Communication
Biosci. Biotech. Res. Comm. 10(2): 25-31 (2017)
An ef cient density-based algorithm for clustering
gene expressions
Prateek A. Meshram and Pradeep Singh*
Department of Information Technology and *Department of Computer Science and Technology, NIT Raipur, Raipur, Chhattisgarh, India
ABSTRACT
Clustering gene expression data according to expression patterns is very important. One of the major tasks of gene clustering is to discover co-regulated gene groups whose expression profiles are similar across the samples used. In this regard, an efficient modified clustering algorithm is proposed for gene expression clustering. The algorithm takes the samples of the data and groups them into a number of clusters according to their density. In order to achieve better clustering, we employ parameter tuning using differential evolution, which optimizes the clustering parameters to find efficient clusters. The performance of the algorithm is evaluated on gene expression datasets using validity measures. The proposed clustering algorithm outperforms the average linkage method, the Self-Organizing Map (SOM) and a recently developed weighted Chinese restaurant-based clustering method (CRC) for clustering gene expression data.
KEY WORDS: CLUSTERING, GENE CLUSTERING, MUTUAL INFORMATION, VALIDITY MEASURES
ARTICLE INFORMATION:
*Corresponding Author: psingh.cs@nitrr.ac.in
Received 17th April, 2017
Accepted after revision 27th June, 2017
BBRC Print ISSN: 0974-6455
Online ISSN: 2321-4007 CODEN: USA BBRCBA
Thomson Reuters ISI ESC and Crossref Indexed Journal
NAAS Journal Score 2017: 4.31 Cosmos IF: 4.006
© A Society of Science and Nature Publication, 2017. All rights reserved.
Online Contents Available at: http://www.bbrc.in/
INTRODUCTION
The advancement of biological data analysis has made the study of gene expression data faster and more efficient. Biotechnology and intelligent informatics have helped in extracting knowledge from collections of a very large number of gene expressions measured at different times and under different conditions (Meagher 2002), (Searls 2000). Extracting patterns from large amounts of data is always a great problem, and clustering is the primary solution. Clustering is an unsupervised learning process which divides the data into N regions {C1, C2, …, CN} with respect to a specific similarity or dissimilarity measure, where N is unknown (Bandyopadhyay et al. 2007). The main aim in partitional clustering is to obtain an N×m partition matrix U(Y) of a given dataset Y having
m objects, Y = {y1, y2, …, ym}. Clustering is extensively used on gene expression data to identify groups of co-regulated genes. Different unsupervised clustering techniques such as hierarchical clustering (Herrero et al. 2001) (Wang et al. 2007), the k-means algorithm (Heyer et al. 1999), self-organizing maps (Wang et al. 2007) (Tamayo et al. 1999), and principal component analysis (Yeung & Ruzzo 2001) have been widely used to find patterns of genes in microarray data. Hierarchical clustering identifies sets of correlated genes with similar behavior across the samples, but arranges the clusters in a tree-like structure, which makes the identification of functional groups very difficult (Herrero et al. 2001) (Wang et al. 2007).
In contrast, the self-organizing map (Heyer et al. 1999 and Tamayo et al. 1999) and the k-means algorithm (Heyer et al. 1999) require a prespecified number of clusters and an initial spatial structure, which may be hard to come up with in real problems. Moreover, these algorithms usually fail to reveal functional groups of genes that are of special interest in sample classification, because the genes are clustered by similarity only, without using any information about the sample categories or class labels (Dettling & Bühlmann 2002) (Ben-Dor et al. 1999). Clustering groups similar input patterns into distinct subsets, and such mutually similar subsets are termed clusters (Xiao et al. 2003). In genome clustering, genes with similar changes in expression are placed in the same cluster. By considering a mapping scheme, we assign a label to every gene, and the aim of clustering under this mapping scheme is to assign each gene to a specific group.
There has been significant growth in clustering research over the last decades, and various clustering techniques have been developed (Li & Tian 2007, Nakamura & Kehtarnavaz 1998, Frossyniotis et al. 2004). Due to the special characteristics of gene data, a dedicated body of work addresses genome data clustering. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the most used algorithm for density clustering: given a certain density threshold, the connected regions that satisfy the threshold are identified as clusters. A density-based hierarchical clustering algorithm (DHC) for time-series gene expression data was proposed by Jiang in 2003 (Jiang & Zhang 2003). A new version of k-means, an incremental algorithm for gene clustering, was developed by Bagirov and Mardaneh (Bagirov & Mardaneh 2006). Another k-means-based technique, a genetic algorithm with gene rearrangement named GAGR, was proposed in (Chang et al. 2009). Zhihua in 2008 proposed a variant of particle swarm optimization (PSO) called PK-means, combining k-means and the particle-pair optimizer in order to cluster gene expression data (Bagirov & Mardaneh 2006). GO Fuzzy, a semi-supervised technique, was proposed by Tari in 2009 (Bagirov & Mardaneh 2006). Juan A. Nepomuceno proposed a new evolutionary biclustering technique in 2011 to cluster gene expression data with the help of correlation-based scatter search (Nepomuceno et al. 2011). A hybridized fuzzy algorithm named HFCEF, a framework for tumor clustering from biomolecular data, was proposed by Zhiwen Yu in 2013 (Yu et al. 2013).
Yiyi Liu proposed a new technique in 2014 to identify cancer subtypes from expression datasets, which uses gene network information to assist the clustering (Liu et al. 2014). Shudong Huang proposed a spectral co-clustering ensemble algorithm named SCCE in 2015, which applies multiple co-clustering algorithms to various datasets on a single platform (Huang et al. 2015). Recently, Xianxue Yu proposed a new technique in 2017 to cluster cancerous gene expression datasets with the help of a projective clustering ensemble named PCE (Yu et al. 2017).
There are many clustering techniques, but DBSCAN is the most used clustering algorithm for large data compared with other algorithms (Ester et al. 1996). Most of the techniques mentioned above do not meet the requirements of efficient clustering of gene data, due to the constantly changing characteristics of microarray data. Considering all these issues, we propose here an efficient DBSCAN method which provides better clustering of gene groups. The rest of the paper is arranged as follows. Section 2 elaborates the materials and methods, Section 3 presents the results and discussion, and Section 4 concludes the paper.
MATERIAL AND METHODS
In this section, we describe the two gene datasets used and their clustering with the proposed algorithm, implemented in MATLAB 2016a. The experiments were carried out on an Intel Core i3 processor with 4 GB RAM and 500 GB hard disk space, running on the Windows 10 platform. The results obtained from the proposed algorithm proved to be efficient in terms of validity indexes. The Adjusted Rand Index (ARI) and the Jaccard Index (JI) are used as the main parameters for checking the efficiency of the algorithm. The value of the average of compactness varies in the range 0 to 1; the closer the value is to 0, the better the clustering, and the closer it is to 1, the worse the clustering, which is corroborated by the values of the other measures.
GENE DATASETS
The gene datasets used in this paper are real-life datasets, namely the yeast sporulation data and the ratCNS (rat central nervous system) dataset.
Yeast sporulation data: The dataset consists of the sporulation-process genes of the budding yeast, comprising 6118 genes. The dataset is available at http://cmgm.stanford.edu/pbrown/sporulation. From the total of 6118 genes, the genes that did not change during the sporulation process are not considered. The final set consists of 474 genes for clustering.
RatCNS: The dataset consists of the expression levels of 112 genes over 9 time points during the development of the rat central nervous system. The data are available at http://staff.washington.edu/kayee/cluster. To obtain good data for clustering, the dataset is normalized in such a manner that each row has mean 0 and variance 1 (Yeung et al. n.d.).
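As an illustration, a minimal sketch of this per-gene normalization (assuming the expression matrix is stored with genes as rows; the use of NumPy and the variable names are ours, not part of the original MATLAB implementation):

```python
import numpy as np

def normalize_rows(expr):
    """Standardize each gene (row) to mean 0 and variance 1."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # guard against constant genes
    return (expr - mean) / std

# Example shaped like the ratCNS data: 112 genes over 9 time points
expr = np.random.rand(112, 9)
normalized = normalize_rows(expr)
```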
DBSCAN is one of the most commonly used clustering algorithms. Based on a density-based notion of clusters, DBSCAN is able to generate clusters of arbitrary shape efficiently. DBSCAN accomplishes the clustering process by starting from an arbitrary unvisited point X and retrieving all points in the eps neighbourhood of X. It then finds all points that are density-reachable from X with respect to eps and minpts. If the selected point is a core point, a cluster is formed according to the input parameters; otherwise, the point is treated as a border or noise point, a new unvisited point is selected, and the whole process starts again. Even a slight change in the values of the input parameters can lead to great changes in the clustering results. Since, over the complete run of the DBSCAN algorithm, a single point may be visited multiple times, its time complexity is O(N²). The working of DBSCAN is determined by density connectivity and density reachability, both of which depend on the two input parameters eps and minpts, where eps is the radius defining the neighbourhood of a point and minpts is the minimum number of points required to form a cluster. Finding appropriate values of eps and minpts is a challenging task. We therefore introduce a new approach for obtaining optimal values of eps and minpts in an efficient way: differential-evolution-based parameter optimization for improving cluster compactness. The details of the DE-based parameter optimization are discussed below.
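For reference, a minimal sketch of a single DBSCAN run for fixed eps and minpts, using scikit-learn rather than the authors' MATLAB implementation (the data matrix and parameter values below are placeholders):

```python
import numpy as np
from sklearn.cluster import DBSCAN

expr = np.random.rand(474, 7)   # placeholder for a normalized expression matrix
eps, minpts = 0.5, 4            # example values; the paper tunes these with DE

labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(expr)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
print(n_clusters)
```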
DIFFERENTIAL EVOLUTION

Differential Evolution (DE) is an evolutionary computing algorithm proposed by Storn and Price in 1997 for global optimization over continuous spaces (Storn & Price 1997). Its overall structure resembles that of a genetic algorithm (Zou et al. 2013) (Locatelli et al. 2014). Three basic operations, namely mutation, crossover, and selection, are performed in DE. Mutation is the most important operation in DE; its role is to generate new population elements in order to obtain an optimal solution with respect to the objective function (Lou et al. 2010) (Chi 2009). In our scenario, we use the average of compactness as the objective function, as it is a suitable cluster validity measure for density-based clustering.
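A minimal sketch of this DE-based parameter tuning, assuming SciPy's differential_evolution and scikit-learn's DBSCAN, with the mean within-cluster distance to the centroid standing in for the average-of-compactness objective (the bounds, penalty value and function names are illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.cluster import DBSCAN

def average_compactness(data, labels):
    """Mean distance of clustered points to their cluster centroid (lower is better)."""
    clusters = [c for c in set(labels) if c != -1]   # ignore noise points
    if not clusters:
        return float("inf")                          # penalize parameters yielding no clusters
    dists = []
    for c in clusters:
        members = data[labels == c]
        centroid = members.mean(axis=0)
        dists.append(np.linalg.norm(members - centroid, axis=1).mean())
    return float(np.mean(dists))

def objective(params, data):
    eps, minpts = params[0], max(int(round(params[1])), 2)
    labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(data)
    return average_compactness(data, labels)

data = np.random.rand(474, 7)    # placeholder expression matrix
result = differential_evolution(objective, bounds=[(0.05, 30.0), (2, 10)],
                                args=(data,), maxiter=30, seed=0)
best_eps, best_minpts = result.x[0], int(round(result.x[1]))
```

Once best_eps and best_minpts are found, a final DBSCAN run with these values yields the clustering that is then evaluated with ARI and JI.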
Time Complexity: The time complexity of the proposed algorithm is O(Iteration × Population × n²), where Iteration is the number of DE iterations, Population is the number of individuals in the DE population, O(n²) is the complexity of the DBSCAN algorithm, and n is the number of data objects in the dataset.
For calculating the efficiency of the algorithm we use validity indexes as quality measures of the proposed algorithm, namely the Adjusted Rand Index, the Jaccard Index, and the Average of Compactness.
Average of Compactness
The average of compactness is an internal measure based on the compactness of all clusters, used to measure the quality of a clustering (Zaït & Messatfa 1997). A lower value of average compactness indicates a higher quality of clustering.
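The paper cites Zaït & Messatfa (1997) for this measure without giving the expression; the formulation below is one common way to average within-cluster dispersion over the N clusters and is an assumption on our part:

C_{avg} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C_i|} \sum_{y \in C_i} d(y, \mu_i)

where \mu_i is the centroid of cluster C_i and d is the chosen distance measure.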
Adjusted Rand Index (ARI)
The adjusted Rand index is the corrected-for-chance version of the Rand index (Rand 1971). Though the Rand index can only take values between 0 and +1, the adjusted Rand index can produce negative values when the index is less than its expected value (Wagner & Wagner 2007). Given the result of a clustering algorithm R and the known classes L, the ARI can be calculated as follows. Let x, y, z and u be the numbers of gene pairs that are, respectively, in the same cluster in both R and L, in the same cluster in R but not in L, in the same cluster in L but not in R, and in different clusters in both R and L. The Adjusted Rand Index is

ARI = \frac{2(xu - yz)}{(x + y)(y + u) + (x + z)(z + u)}    (5)
Jaccard Index (JI)
The Jaccard index is a similarity measure through which we check the agreement between the obtained clustering and the actual grouping. For two sets A and B it is defined as

J(A, B) = \frac{|A \cap B|}{|A \cup B|}    (6)

(If A and B are both empty, we define J(A, B) = 1.) The corresponding Jaccard distance is a metric on the collection of all finite sets (Lipkus 1999).
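As an illustrative check (not part of the original pipeline), both external indices can be computed from two label vectors; the sketch below uses scikit-learn's adjusted_rand_score together with its pair_confusion_matrix to obtain the pair-count form of the Jaccard index:

```python
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import pair_confusion_matrix

def jaccard_index(labels_true, labels_pred):
    """Pair-count Jaccard index: pairs co-clustered in both / pairs co-clustered in either."""
    C = pair_confusion_matrix(labels_true, labels_pred)
    tp, fp, fn = C[1, 1], C[0, 1], C[1, 0]
    return tp / (tp + fp + fn) if (tp + fp + fn) else 1.0

labels_true = [0, 0, 1, 1, 2, 2]   # known classes (toy example)
labels_pred = [0, 0, 1, 2, 2, 2]   # clustering result
print(adjusted_rand_score(labels_true, labels_pred))
print(jaccard_index(labels_true, labels_pred))
```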
RESULTS AND DISCUSSION
FIGURE 1. Variations in RatCNS dataset at various iterations.
FIGURE 2. Variations in Yeast dataset at various iterations.
Table 1. Number of clusters obtained by the proposed algorithm
                     Yeast    RatCNS
Original clusters    10       6
Obtained clusters    10       6
Table 2. Comparison of ARI values obtained by the proposed and existing algorithms on the two gene datasets
Algorithm       Yeast     RatCNS
Our             0.9800    0.5560
SiMM-TS         0.6353    0.5147
IFCM            0.4717    0.4032
VGA             0.5800    0.4542
Average Link    0.5007    0.3684
SOM             0.5842    0.4134
CRC             0.5675    0.4455
In this section, the two datasets are clustered with the proposed algorithm by optimizing eps and minpts against the objective function, the average of compactness (Cavg). After obtaining the optimized values of minpts and eps, we use those parameters to compute the external measures ARI and JI as quality measures and to check the efficiency of the proposed system. The number of clusters formed by the proposed algorithm is given in Table 1, and it can be seen that the proposed algorithm identifies the same number of clusters as the original grouping in each dataset. The results of the proposed algorithm over the DE iterations, in terms of average compactness, ARI and JI, are shown in Table 3. Figure 1 and Figure 2 show the variations in the formation of clusters in both datasets at various iterations.

We compared our method with the previous algorithms and the results reported by Bandyopadhyay (Bandyopadhyay et al. 2007) and found that the proposed algorithm clusters more effectively than the previously reported algorithms in terms of ARI. The observed ARI values for our algorithm are 0.9800 on the yeast gene expression data and 0.5560 on the ratCNS data.
Table 3. Results of the proposed algorithm on the two gene datasets for different numbers of DE iterations (Itr), showing the tuned Eps and Minpts together with Cavg, ARI and JI
Dataset    Eps      Minpts    Itr    Cavg     ARI      JI
RatCNS     12.58    3         1      0.291    0.41     0.398
RatCNS     2.901    10        10     0.268    0.503    0.497
RatCNS     0.84     2         20     0.257    0.535    0.535
RatCNS     0.89     2         30     0.228    0.556    0.532
Yeast      29.46    5         1      0.200    0.222    0.222
Yeast      0.198    4         10     0.123    0.775    0.775
Yeast      0.19     4         20     0.113    0.918    0.905
Yeast      0.20     2         30     0.10     0.98     0.915
FIGURE 3. Variations in RatCNS dataset at different numbers of iterations.
The comparative results with SiMM-TS, IFCM, VGA, Average Link, SOM and CRC are shown in Table 2. Figure 3 and Figure 4 show the graphical outputs of the validity indices ARI, JI and Cavg for the two datasets with respect to the number of iterations. Figure 5 shows the comparative graph of the ARI values obtained by the six algorithms and the proposed method.
CONCLUSION
A modi ed DBSACN which works without providing
the parameter eps and minpts is proposed in order to
reduce the complexity of the DBSCAN algorithm and
make it ef cient. The algorithm proposed is tested on
real life gene expression data sets. The results obtained
with the objective of minimization of average compact-
ness are compared to the existing algorithm in terms
of the ARI and JI validity measures. In general it is
found that the MDBSCAN clustering algorithm out-
performs all the other clustering methods signi cantly.
This algorithm provides the parameter optimization to
remove the major drawback of conventional DBSCAN.
Furthermore, we will study the selection of the auto-
matic parameter estimation model from the theoretical
aspect.
FIGURE 4. Variations in Yeast dataset at different intervals.
FIGURE 5. Comparative graph for the variations in the datasets with respect to ARI for all the algorithms compared.
REFERENCES
Bagirov, A. & Mardaneh, K., 2006. Modified global k-means
algorithm for clustering in gene expression data sets. Proceed-
ings of the 2006 workshop on, (4), pp.1–6. Available at: http://
portal.acm.org/citation.cfm?id=1274172.1274176.
Bandyopadhyay, S., Mukhopadhyay, A. & Maulik, U., 2007.
An improved algorithm for clustering gene expression data.
Bioinformatics, 23(21), pp.2859–2865.
Ben-Dor, A., Shamir, R. & Yakhini, Z., 1999. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4), pp.281-297.
Chang, D.-X., Zhang, X.-D. & Zheng, C.-W., 2009. A genetic
algorithm with gene rearrangement for K-means clustering.
Pattern Recognition, 42, pp.1210–1222.
Chi, Y. et al., 2009. Experimental study on differential evolution strategies. In: 2009 WRI Global Congress on Intelligent Systems (GCIS '09), IEEE.
Dettling, M. & Bühlmann, P., 2002. Supervised clustering of
genes. Genome biology, 3(12), p.RESEARCH0069.
Ester, M. et al., 1996. A Density-Based Algorithm for Discover-
ing Clusters in Large Spatial Databases with Noise. Proceed-
ings of the 2nd International Conference on Knowledge Dis-
covery and Data Mining, pp.226–231.
Frossyniotis, D., Likas, A. & Stafylopatis, A., 2004. A clustering
method based on boosting. Pattern Recognition Letters, 25(6),
pp.641–654.
Herrero, J., Valencia, A. & Dopazo, J., 2001. A hierarchical unsu-
pervised growing neural network for clustering gene expression
patterns. Bioinformatics (Oxford, England), 17(2), pp.126–136.
Heyer, L.J., Kruglyak, S. & Yooseph, S., 1999. Exploring expression data: identification and analysis of coexpressed genes. Genome Research, 9, pp.1106-1115.
Huang, S. et al., 2015. Spectral co-clustering ensemble. Knowl-
edge-Based Systems, 84, pp.46–55. Available at: http://dx.doi.
org/10.1016/j.knosys.2015.03.027.
Jiang, D. & Zhang, A., 2003. DHC: a density-based hierarchical
clustering method for time series gene expression data. Third
IEEE Symposium on Bioinformatics and Bioengineering, 2003.
Proceedings., pp.393–400. Available at: http://ieeexplore.ieee.
org/lpdocs/epic03/wrapper.htm?arnumber=1188978.
Yeung, K.Y., Haynor, D.R. & Ruzzo, W.L., n.d. Validating clustering for gene expression data. Available at: http://staff.washington.edu/kayee/cluster.
Li, X. & Tian, Z., 2007. Optimum cut-based clustering. Signal
Processing, 87(11), pp.2491–2502.
Lipkus, A.H., 1999. A proof of the triangle inequality for the Tanimoto distance. Journal of Mathematical Chemistry, 26, pp.263-265.
Liu, Y. et al., 2014. A network-assisted co-clustering algorithm
to discover cancer subtypes based on gene expression. BMC
bioinformatics, 15(1), p.37. Available at: http://www.biomed-
central.com/1471-2105/15/37.
Locatelli, M., Maischberger, M. & Schoen, F., 2014. Differential evolution methods based on local searches. Computers & Operations Research, 43, pp.169-180.
Meagher, R.B., 2002. Post-genomics Networking of Biotech-
nology for Interpreting Gene Function. Current Opinion in
Plant Biology, 5(2), pp.135–140. Available at: http://dx.doi.
org/10.1016/S1369-5266(02)00245-5.
Nakamura, E. & Kehtarnavaz, N., 1998. Determining num-
ber of clusters and prototype locations via multi-scale clus-
tering. Pattern Recognition Letters, 19(14), pp.1265–1283.
Available at: http://www.sciencedirect.com/science/article/pii/
S0167865598000993.
Nepomuceno, J.A., Troncoso, A. & Aguilar-Ruiz, J.S., 2011.
Biclustering of gene expression data by correlation-based scat-
ter search. BioData mining, 4(1), p.3. Available at: http://www.
pubmedcentral.nih.gov/articlerender.fcgi?artid=3037342&tool
=pmcentrez&rendertype=abstract.
Rand, W.M., 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), pp.846-850.
Searls, D.B., 2000. Using Bioinformatics in Gene and Drug Dis-
covery. Drug Discovery Today, 5(4), pp.135–143.
Storn, R. & Price, K., 1997. Differential Evolution – A Simple
and Efficient Heuristic for global Optimization over Continu-
ous Spaces. Journal of Global Optimization, 11(4), pp.341–359.
Available at: http://dx.doi.org/10.1023/A:1008202821328.
Tamayo, P. et al., 1999. Interpreting patterns of gene expres-
sion with self-organizing maps: methods and application to
hematopoietic differentiation. Proceedings of the National
Academy of Sciences of the United States of America, 96(6),
pp.2907–2912.
Wagner, S. & Wagner, D., 2007. Comparing Clusterings - An Overview. pp.1-19.
Wang, H., Zheng, H. & Azuaje, F., 2007. Poisson-based self-
organizing feature maps and hierarchical clustering for serial
analysis of gene expression data. IEEE/ACM transactions on
computational biology and bioinformatics / IEEE, ACM, 4(2),
pp.163–75. Available at: http://www.ncbi.nlm.nih.gov/pub-
med/17473311.
Xiao, X. et al., 2003. Gene clustering using self-organizing
maps and particle swarm optimization. Parallel and Distributed
Processing Symposium, 2003. Proceedings. International, 0(C),
p.10–pp. Available at: http://ieeexplore.ieee.org/xpls/abs_all.
jsp?arnumber=1213290.
Lou, Y., Li, J. & Y.W., 2010. A Binary-Differential Evolution algorithm based on ordering of individuals. In: 2010 Sixth International Conference on Natural Computation (ICNC), IEEE.
Yeung, K.Y. & Ruzzo, W.L., 2001. Principal component analy-
sis for clustering gene expression data. Bioinformatics, 17(9),
pp.763–774. Available at: http://www.cs.washington.
Yu, X., Yu, G. & Wang, J., 2017. Clustering cancer gene expres-
sion data by projective clustering ensemble. Plos One, 12(2),
p.e0171429. Available at: http://dx.plos.org/10.1371/journal.
pone.0171429.
Yu, Z. et al., 2013. Hybrid fuzzy cluster ensemble framework for tumor clustering from biomolecular data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(3), pp.657-670.
Zaït, M. & Messatfa, H., 1997. A comparative study of clustering methods. Future Generation Computer Systems, 13(2-3), pp.149-159.
Zou, D. et al., 2013. A modified differential evolution algo-
rithm for unconstrained optimization problems. Neurocomput-
ing, 120, pp.469–481.