Table 1: Sample SAGE data output

Tag           Frequency
CCAAAACCCA           27
ACAAGATTCC            8
ACCAATTCTA           56
GCCCTCTGAA           90
ACCCTAGGAG          389
INTRODUCTION
Machine learning is the process of learning structure from data, and various machine learning techniques are used to learn from data. Classification is a data mining technique that predicts group membership for data instances described by a set of attributes and a class label; data mining remains the hope for revealing the patterns that underlie such data (Witten et al., 2011; Li et al., 2016; Kumar et al., 2016). Basic data mining techniques include classification, clustering and association rule mining (Pizzuti et al., 2003; Marr, 1981; Wong et al., 2008). Various state-of-the-art classification techniques such as Naïve Bayes (Becker et al., 2001), LDA (Quinlan, 1993), SVM (Cortes et al., 1995; Burges et al., 1998; Han et al., 2012; Cunningham et al., 2007), KNN (Han et al., 2012) and Decision Table (DT) have been used for the analysis of data.
This paper focuses on the study of SAGE data of human brain tissues, which is based on gene expression techniques for the analysis of genes. SAGE data sets were collected from SAGE libraries at http://www.ncbi.nlm.nih.gov/projects/SAGE. The data were classified into one of the predefined classes; hence, from the machine learning perspective, this is a supervised learning task. Gene expression data present a large number of features (genes), most of which are irrelevant to the definition of the problem and can consequently degrade the classification process significantly during analysis (Banka et al., 2015). This paper primarily focuses on experimentally evaluating different methods for classifying cancerous and non-cancerous tissues.
DATASET PREPARATION
The dataset contains 10 cancerous and 4 normal libraries, each represented in the form of Table 1 as tag-frequency pairs. These libraries were combined into records of the form Tag, frequency1, frequency2, …, frequency14.
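Combining the libraries amounts to an outer join on the tag. A minimal Python sketch of this step is shown below; the pandas dependency, the file names and the two-column tag/frequency file layout are assumptions for illustration, not the authors' pipeline.

import pandas as pd

# Hypothetical per-library files, each a two-column "tag <tab> frequency" listing
# (file names and layout are assumptions, not the authors' actual sources).
library_files = ["library_%02d.txt" % i for i in range(1, 15)]  # 10 cancerous + 4 normal

frames = []
for i, path in enumerate(library_files, start=1):
    df = pd.read_csv(path, sep="\t", names=["Tag", "frequency%d" % i])
    frames.append(df.set_index("Tag"))

# Outer join on the tag so every tag appears once; tags absent from a library get 0.
combined = pd.concat(frames, axis=1).fillna(0).astype(int)
print(combined.shape)  # on this dataset: (65454, 14), i.e. the 14 x 65454 matrix transposed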
ALGORITHM FOR PREPROCESSING
Step 1: The maximum frequency (maxf) and minimum frequency (minf) of each gene in the normal libraries were calculated.
Step 2: The frequency of each gene in the cancerous libraries was compared with the maximum and minimum frequencies of the normal libraries.
Step 3: Let a_ij be the frequency of gene j in library i.
        1. If (a_ij > maxf) or (a_ij < minf),
        2. change the frequency value to 1,
        3. and to 0 otherwise.
Step 4: A value of 1 indicates a differently expressed gene in the tumor tissue and 0 means no change in the expression level.
Step 5: Records corresponding to ambiguous tags (genes that are over-expressed in some cancer tissues and under-expressed in other cancer tissues) are removed.
The above steps were applied to the dataset matrix of size 14 × 65454, reducing it to a matrix of size 14 × 1898.
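The following NumPy sketch mirrors Steps 1-5 under stated assumptions: the combined matrix has libraries as rows and genes as columns, the normal and cancerous row indices are supplied by the caller, and "ambiguous" is read as a gene that is over-expressed in some cancerous libraries and under-expressed in others. It illustrates the procedure above rather than reproducing the authors' exact implementation.

import numpy as np

def preprocess(freq, normal_rows, cancer_rows):
    """Binarize gene frequencies against the normal-library range (Steps 1-5)."""
    # Step 1: per-gene maximum and minimum frequency over the normal libraries.
    maxf = freq[normal_rows].max(axis=0)
    minf = freq[normal_rows].min(axis=0)

    # Steps 2-4: a value outside [minf, maxf] marks a differently expressed
    # gene (1); anything inside the range means no change in expression (0).
    binary = ((freq > maxf) | (freq < minf)).astype(int)

    # Step 5: drop ambiguous tags -- genes over-expressed in some cancerous
    # libraries and under-expressed in others (this reading is an assumption).
    cancer = freq[cancer_rows]
    ambiguous = (cancer > maxf).any(axis=0) & (cancer < minf).any(axis=0)
    return binary[:, ~ambiguous]

# Usage on the 14 x 65454 matrix (the library ordering here is hypothetical):
# reduced = preprocess(freq_matrix, normal_rows=slice(10, 14), cancer_rows=slice(0, 10))
# reduced.shape would then be (14, number_of_kept_genes).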
RESULTS AND DISCUSSION
The comparison was conducted using WEKA (the Waikato Environment for Knowledge Analysis), an open-source software suite that provides a collection of machine learning algorithms for data mining. The different classifiers used for the evaluation of cancerous and non-cancerous tissues are listed in Table 2, and the performance of each classifier is reported in Table 3.
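For reference, a rough scikit-learn analogue of such a comparison is sketched below; it is an illustration under assumed settings, not the WEKA configuration used in the paper. WEKA's Decision Table learner has no direct scikit-learn counterpart, so a decision tree stands in for it, and the hyperparameters shown are arbitrary.

from sklearn.naive_bayes import BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut

# X: 14 x 1898 binary matrix from the preprocessing step; y: 1 = cancerous, 0 = normal.
def compare_classifiers(X, y):
    classifiers = {
        "Naive Bayes": BernoulliNB(),
        "LDA": LinearDiscriminantAnalysis(),
        "SVM (linear)": SVC(kernel="linear"),
        "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
        "Decision Tree (stand-in for Decision Table)": DecisionTreeClassifier(random_state=0),
    }
    # With only 14 samples, leave-one-out cross-validation keeps every library in use.
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
        print("%s: accuracy = %.3f" % (name, scores.mean()))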
Different classification measures were calculated and compared for cancerous and non-cancerous tissues of the human brain: True Positive (TP) rate, False Positive (FP) rate, Precision, Recall, F-Measure, Matthews Correlation Coefficient (MCC), Receiver Operating Characteristic (ROC) Area and Precision-Recall Curve (PRC) Area. All the classifiers performed well after the number of genes was reduced from 65454 to 1898, a significant reduction in the number of features on which the analysis was performed. The results further reveal that K-Nearest Neighbor (KNN) and the Linear Discriminant Analyzer (LDA) outperformed the other classifiers on most of the performance measures.
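As an illustration only (again a scikit-learn sketch rather than the WEKA report), these measures can be computed from cross-validated predictions of a single classifier as follows; the choice of KNN with k = 3 and leave-one-out cross-validation is an assumption.

from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score,
                             average_precision_score)
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def report_measures(X, y):
    clf = KNeighborsClassifier(n_neighbors=3)          # assumed classifier and k
    y_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
    y_score = cross_val_predict(clf, X, y, cv=LeaveOneOut(),
                                method="predict_proba")[:, 1]

    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    print("TP rate   :", tp / (tp + fn))
    print("FP rate   :", fp / (fp + tn))
    print("Precision :", precision_score(y, y_pred))
    print("Recall    :", recall_score(y, y_pred))
    print("F-Measure :", f1_score(y, y_pred))
    print("MCC       :", matthews_corrcoef(y, y_pred))
    print("ROC Area  :", roc_auc_score(y, y_score))
    print("PRC Area  :", average_precision_score(y, y_score))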
A discriminant analysis technique was proposed by Li et al. (2016) to enhance classification accuracy. The nearest neighbor classifier requires a large amount of memory and time (Kumar et al., 2016), but with our preprocessing algorithm the dataset has been significantly reduced for analysis. A variant of LDA was introduced by Bacchus et al. (2013), in which LDA performed better than SVM and KNN.