Table 1: Sample SAGE data output

Tag           Frequency
CCAAAACCCA           27
ACAAGATTCC            8
ACCAATTCTA           56
GCCCTCTGAA           90
ACCCTAGGAG          389
INTRODUCTION
Machine learning is the process of learning structure from data, and various machine learning techniques are used to learn from data. Classification is a data mining technique that predicts group membership for data instances described by a set of attributes and a class label; data mining remains the hope for revealing the patterns that underlie such data (Witten et al., 2011; Li et al., 2016; Kumar et al., 2016). Basic data mining techniques include classification, clustering and association rule mining (Pizzuti et al., 2003; Marr, 1981; Wong et al., 2008). Various state-of-the-art classification techniques such as Naïve Bayes (Becker et al., 2001), LDA (Quinlan, 1993), SVM (Cortes et al., 1995; Burges et al., 1998; Han et al., 2012; Cunningham et al., 2007), KNN (Han et al., 2012) and Decision Table (DT) have been used for the analysis of data.
This paper focuses on the study of SAGE data of human brain tissues, which is based on gene expression techniques for the analysis of genes. SAGE data sets were collected from SAGE libraries at http://www.ncbi.nlm.nih.gov/projects/SAGE. The data were classified into one of the predefined classes; hence, from the machine learning perspective, this is a supervised learning task. Gene expression data present a large number of features (genes), most of which are irrelevant to the definition of the problem and can consequently degrade the classification process significantly during analysis (Banka et al., 2015). This paper primarily focuses on experimentally evaluating different methods for classifying cancerous and non-cancerous tissues.
DATASET PREPARATION
The dataset contains 10 cancerous and 4 normal libraries, each represented in the form of Table 1 as tag-frequency pairs. These libraries were combined into records of the form Tag, frequency1, frequency2, …, frequency14.
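Combining the libraries amounts to an outer join on the tag. A minimal Python sketch of this step is shown below; the pandas dependency, the file names and the two-column tag/frequency file layout are assumptions for illustration, not the authors' pipeline.

import pandas as pd

# Hypothetical per-library files, each a two-column "tag <tab> frequency" listing
# (file names and layout are assumptions, not the authors' actual sources).
library_files = ["library_%02d.txt" % i for i in range(1, 15)]  # 10 cancerous + 4 normal

frames = []
for i, path in enumerate(library_files, start=1):
    df = pd.read_csv(path, sep="\t", names=["Tag", "frequency%d" % i])
    frames.append(df.set_index("Tag"))

# Outer join on the tag so every tag appears once; tags absent from a library get 0.
combined = pd.concat(frames, axis=1).fillna(0).astype(int)
print(combined.shape)  # on this dataset: (65454, 14), i.e. the 14 x 65454 matrix transposed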
ALGORITHM FOR PREPROCESSING
Step 1: The maximum frequency (maxf) and minimum frequency (minf) of each gene in the normal libraries were calculated.
Step 2: The frequency of each gene in the cancerous libraries was compared with the maximum and minimum frequencies of the normal libraries.
Step 3: Let a_ij be the frequency of gene j in library i.
        1. If (a_ij > maxf) or (a_ij < minf),
        2. change the frequency value to 1,
        3. and to 0 otherwise.
Step 4: A value of 1 indicates a differently expressed gene in the tumor tissue and 0 means no change in the expression level.
Step 5: Records corresponding to ambiguous tags (genes that are over-expressed in some cancer tissues and under-expressed in other cancer tissues) are removed.
The above steps were applied to the dataset matrix of size 14 × 65454, reducing it to a matrix of size 14 × 1898.
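The following NumPy sketch mirrors Steps 1-5 under stated assumptions: the combined matrix has libraries as rows and genes as columns, the normal and cancerous row indices are supplied by the caller, and "ambiguous" is read as a gene that is over-expressed in some cancerous libraries and under-expressed in others. It illustrates the procedure above rather than reproducing the authors' exact implementation.

import numpy as np

def preprocess(freq, normal_rows, cancer_rows):
    """Binarize gene frequencies against the normal-library range (Steps 1-5)."""
    # Step 1: per-gene maximum and minimum frequency over the normal libraries.
    maxf = freq[normal_rows].max(axis=0)
    minf = freq[normal_rows].min(axis=0)

    # Steps 2-4: a value outside [minf, maxf] marks a differently expressed
    # gene (1); anything inside the range means no change in expression (0).
    binary = ((freq > maxf) | (freq < minf)).astype(int)

    # Step 5: drop ambiguous tags -- genes over-expressed in some cancerous
    # libraries and under-expressed in others (this reading is an assumption).
    cancer = freq[cancer_rows]
    ambiguous = (cancer > maxf).any(axis=0) & (cancer < minf).any(axis=0)
    return binary[:, ~ambiguous]

# Usage on the 14 x 65454 matrix (the library ordering here is hypothetical):
# reduced = preprocess(freq_matrix, normal_rows=slice(10, 14), cancer_rows=slice(0, 10))
# reduced.shape would then be (14, number_of_kept_genes).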
RESULTS AND DISCUSSION
The comparison was conducted using WEKA (the Waikato Environment for Knowledge Analysis), an open-source software suite that provides a collection of machine learning algorithms for data mining. The different classifiers used for the evaluation of cancerous and non-cancerous tissues are listed in Table 2, and the performance of each classifier is reported in Table 3.
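For reference, a rough scikit-learn analogue of such a comparison is sketched below; it is an illustration under assumed settings, not the WEKA configuration used in the paper. WEKA's Decision Table learner has no direct scikit-learn counterpart, so a decision tree stands in for it, and the hyperparameters shown are arbitrary.

from sklearn.naive_bayes import BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, LeaveOneOut

# X: 14 x 1898 binary matrix from the preprocessing step; y: 1 = cancerous, 0 = normal.
def compare_classifiers(X, y):
    classifiers = {
        "Naive Bayes": BernoulliNB(),
        "LDA": LinearDiscriminantAnalysis(),
        "SVM (linear)": SVC(kernel="linear"),
        "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
        "Decision Tree (stand-in for Decision Table)": DecisionTreeClassifier(random_state=0),
    }
    # With only 14 samples, leave-one-out cross-validation keeps every library in use.
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
        print("%s: accuracy = %.3f" % (name, scores.mean()))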
Different classification measures were calculated and compared for cancerous and non-cancerous tissues of the human brain: True Positive (TP) rate, False Positive (FP) rate, Precision, Recall, F-Measure, Matthews Correlation Coefficient (MCC), Receiver Operating Characteristic (ROC) Area and Precision-Recall Curve (PRC) Area. All the classifiers performed well after the number of genes was reduced from 65454 to 1898, a significant reduction in the number of features on which the analysis was performed. The results further reveal that K-Nearest Neighbor (KNN) and the Linear Discriminant Analyzer (LDA) outperformed the other classifiers on most of the performance measures.
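As an illustration only (again a scikit-learn sketch rather than the WEKA report), these measures can be computed from cross-validated predictions of a single classifier as follows; the choice of KNN with k = 3 and leave-one-out cross-validation is an assumption.

from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score,
                             average_precision_score)
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def report_measures(X, y):
    clf = KNeighborsClassifier(n_neighbors=3)          # assumed classifier and k
    y_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
    y_score = cross_val_predict(clf, X, y, cv=LeaveOneOut(),
                                method="predict_proba")[:, 1]

    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    print("TP rate   :", tp / (tp + fn))
    print("FP rate   :", fp / (fp + tn))
    print("Precision :", precision_score(y, y_pred))
    print("Recall    :", recall_score(y, y_pred))
    print("F-Measure :", f1_score(y, y_pred))
    print("MCC       :", matthews_corrcoef(y, y_pred))
    print("ROC Area  :", roc_auc_score(y, y_score))
    print("PRC Area  :", average_precision_score(y, y_score))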
A discriminant analysis technique was proposed by Li et al. (2016) to enhance classification accuracy. The nearest neighbor classifier requires a large amount of memory and time (Kumar et al., 2016), but with our preprocessing algorithm the dataset has been significantly reduced for analysis. A variant of LDA was introduced by Bacchus et al. (2013), in which LDA performed better than SVM and KNN.