ABSTRACT – Early disease prediction is
a core concern of the biomedical and
healthcare communities, since it improves
the quality of early diagnosis for fatal
diseases such as Congenital Heart Disease
and cancer. Advanced data mining
techniques can help in such situations.
Applying data mining concepts such as
classifiers and Association Rule Mining
(ARM) to structured medical data helps
detect the occurrence of a particular
disease. A medical data set obtained from
an open data source in the United Kingdom
is processed and analysed for heart
disease prediction, after which the system
suggests a hospital for further treatment.
An accuracy comparison between the
classifier algorithms used is generated in
R Studio. These prediction results pave the
way for proper diagnosis and early
treatment of chronic diseases, and can help
reduce deaths caused by fatal diseases
being detected only at a critical stage.
Key Words – Data mining, Congenital
Heart Disease, ARM.
I. INTRODUCTION
The healthcare industry collects huge
amounts of reliable healthcare data which,
unfortunately, are not “mined” to discover
hidden information for effective decision
making. Clinical decisions are often made
based on doctors’ intuition and experience
rather than on the knowledge-rich data
hidden in the database.
I.1 Classification
Two forms of data analysis can be used to
extract models that describe important
classes or predict future data trends:
• Classification
• Prediction
Classification models predict categorical
class labels, whereas prediction models
predict continuous-valued functions. The
two main classifiers implemented here are
the Decision Tree and Naïve Bayes
classification algorithms.
I.2 Association Rule Mining
Association means finding relationships
between different data items in the same
transaction, which is used to discover
hidden patterns. For instance, suppose
that someone who buys a desktop (A) also
purchases a speaker (B) in 55% of cases,
and that both items appear together in
8.2% of all transactions. An association
rule for this condition can be written
A ⇒ B (A implies B), where 55% is the CF
(confidence factor) and 8.2% is the SF
(support factor). The Apriori algorithm,
Pincer-search and AprioriDP are among the
efficient ARM algorithms in data mining.
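As a minimal sketch of how the support and
confidence factors of a rule A ⇒ B can be
computed, the Python snippet below counts
itemset occurrences in a small list of
transactions. The transactions and the
function names are hypothetical and chosen
only for illustration; they are not taken
from the data set used in this work.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and B) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical toy transactions (assumption, for illustration only).
transactions = [
    {"desktop", "speaker"},
    {"desktop"},
    {"speaker", "mouse"},
    {"desktop", "speaker", "mouse"},
    {"mouse"},
]

# Support factor of {desktop, speaker}: 2 of the 5 transactions.
sf = support(transactions, {"desktop", "speaker"})
# Confidence factor of desktop => speaker: 2 of the 3 desktop purchases.
cf = confidence(transactions, {"desktop"}, {"speaker"})
print(f"support = {sf:.2f}, confidence = {cf:.2f}")
```

Note that support is measured against all
transactions, whereas confidence is
measured only against the transactions
that contain the antecedent A.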
Figure-1 Data Set Preview
I.3 Data Set Collection
For this work, a data set of 750 entries
with 10 parameters was taken from
data.gov.uk [11]. The data set, called the
Congenital Heart Disease (CHD) data, is
published by the Healthcare Quality
Improvement Partnership under the Open
Government Licence (OGL). The data
contain 30-day outcomes (alive or dead) for
congenital heart disease treatment in
England, although the audit covers all of the
UK and the Republic of Ireland. All data are
available in the National Institute for
Cardiovascular Outcomes Research
(NICOR) Congenital Public Portal. A
preview of the structured data set, with
explanations of all ten attributes, is given
above in Figure-1, which is taken from the
resource
“CongenitalDataFieldDescription20102011”.
The experimental procedure adopted in this
work follows the steps shown in Figure-2.
Data preprocessing removes duplicate
records and cleans the data to make it fit
for mining.
Figure-2 Methodology Workflow
II. LITERATURE SURVEY
A considerable amount of research has been
carried out in recent decades using data
mining techniques on medical data. Ajad
Patel et al. [7] developed a system that
indicates whether or not a patient is at
risk of heart disease. The work describes
a prototype using Naïve Bayes and
WAC (Weighted Associative Classifier).
The work of Ms. M.C.S. Geetha et al. [1]
aims to answer more complex queries in
forecasting heart attack. The predictive
accuracy obtained with the REPTree, J48
and BayesNet algorithms suggests that the
parameters used are consistent indicators
for predicting heart disease.
Ilham KADI et al. [3] constructed a
cardiovascular dysautonomias prediction
system using a decision-tree-based
classifier built with the C4.5 algorithm and
reported it to be accurate and efficient.
Swaroopa Shastri et al. [4] analysed a
data set using the Apriori algorithm to
produce a detailed correlation between
diabetes and kidney disease.
The work of Jagdeep Singh et al. [9]
focused on heart disease prediction using
associative classification methods. The
proposed hybrid associative classification
is implemented in the WEKA environment.
Similarly, Dao-I Lin et al. [8] presented a
novel algorithm called Pincer-search
that can efficiently discover the maximum
frequent set without explicitly examining
every frequent itemset.
III. CLASSIFICATION TECHNIQUES
Classification consists of predicting a
certain outcome from given input data. To
do so, the algorithm processes a training
set containing a set of attributes and the
corresponding outcome, usually called the
prediction attribute. Data classification is
the process of organizing data into
categories for its most effective and
efficient use. The classification
algorithms used to analyse our data are
the decision tree and Naïve Bayes.
A. Decision Tree
A decision tree is a predictive model that
goes from observations about an item's
attributes to conclusions about the item's
target value, represented in the leaf
nodes. This model is used in statistics,
data mining and machine learning. Decision
trees where the target variable can take
continuous values (typically real numbers)
are called regression trees. J48 is the
implementation of the C4.5 algorithm, the
successor to ID3 (Iterative Dichotomiser
3), developed by the WEKA project team.
J48 allows classification via either
decision trees or rules generated from
them [1].
C4.5 is a standard algorithm for inducing
classification rules in the form of a
decision tree. It was introduced by Quinlan
as an extension of the basic ID3 algorithm
to overcome its disadvantages. C4.5 has
been reported to achieve a higher accuracy
rate than KNN and Naïve Bayes. Some of the
issues it addresses are [3]:
1. Choosing an appropriate attribute
selection measure.
2. Pruning the decision tree after its
creation.
3. Handling continuous attributes.
It uses a divide-and-conquer approach to
form a binary tree, as shown in Figure-3,
when applied to the given data set [11].
Figure-3 Obtained Decision Tree
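To illustrate the first point in the list
above, the following sketch computes the
gain ratio, the attribute selection
measure commonly associated with C4.5, for
each attribute of a tiny table. The rows,
attribute names and values are
hypothetical and serve only as an
illustration; this is not the procedure
used to build the tree in Figure-3.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(rows, attribute, target):
    """Information gain of splitting on attribute, divided by split info."""
    labels = [r[target] for r in rows]
    # Group the target labels by the value of the candidate attribute.
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    total = len(rows)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy([r[attribute] for r in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical rows (assumption): attribute names are illustrative only.
rows = [
    {"age_group": "infant", "procedure": "surgery", "outcome": "dead"},
    {"age_group": "infant", "procedure": "catheter", "outcome": "dead"},
    {"age_group": "child", "procedure": "surgery", "outcome": "alive"},
    {"age_group": "child", "procedure": "catheter", "outcome": "alive"},
]

# The attribute with the highest gain ratio becomes the split at this node.
best = max(["age_group", "procedure"],
           key=lambda a: gain_ratio(rows, a, "outcome"))
print("best split attribute:", best)
```

In this toy table, age_group separates the
two outcomes perfectly while procedure
carries no information, so age_group is
chosen as the split.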
B. Naïve Bayes Algorithm
Naïve Bayes is a family of simple
probabilistic classifiers based on applying
Bayes’ theorem with strong independence
assumptions between the features.
Maximum-likelihood training can be done
by evaluating a closed-form expression,
which takes linear time, rather than by
the expensive iterative approximation used
for many other types of classifiers. Bayes’
theorem gives the probability of an event
occurring given that another event has
already occurred. The mathematical
equation is
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
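As a small worked example of Bayes’
theorem, the sketch below computes
P(A|B) where A is "patient has the
disease" and B is "test is positive". All
three input probabilities are hypothetical
numbers chosen for illustration, and P(B)
is expanded with the law of total
probability.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) = P(B|A) P(A) / P(B).

    P(B) is expanded as P(B|A) P(A) + P(B|not A) P(not A).
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs (assumption): P(A) = 0.01, P(B|A) = 0.90,
# P(B|not A) = 0.05.
p = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(f"P(disease | positive) = {p:.3f}")
```

Even with a fairly accurate test, the
posterior stays low here because the prior
P(A) is small.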
Naïve Bayes can answer diagnostic and
predictive problems. It is particularly
suited to cases where the dimensionality of
the inputs is high [1]. The model is based
on the conditional independence of each
predictor given the target class [7]. The
work in [3] achieved high accuracy rates of
up to 98.4% on the training set and 97.76%
on the testing set.
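The conditional independence assumption
can be sketched as follows: the score of
each class is its prior multiplied by the
per-attribute conditional probabilities,
all estimated by counting. The attribute
names, the rows and the use of Laplace
(add-one) smoothing are assumptions made
for this illustration and are not taken
from the data set [11] or from the R
implementation used in this work.

```python
from collections import Counter, defaultdict


def train(rows, target):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values = defaultdict(set)            # attribute -> distinct values seen
    for r in rows:
        for attr, val in r.items():
            if attr == target:
                continue
            value_counts[(attr, r[target])][val] += 1
            values[attr].add(val)
    return class_counts, value_counts, values


def predict(model, sample):
    """Return normalised posterior probabilities for each class."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior P(class)
        for attr, val in sample.items():
            # Laplace-smoothed estimate of P(value | class).
            count = value_counts[(attr, cls)][val]
            score *= (count + 1) / (n + len(values[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


# Hypothetical training rows (assumption, for illustration only).
rows = [
    {"chest_pain": "yes", "smoker": "yes", "disease": "yes"},
    {"chest_pain": "yes", "smoker": "no", "disease": "yes"},
    {"chest_pain": "no", "smoker": "yes", "disease": "no"},
    {"chest_pain": "no", "smoker": "no", "disease": "no"},
]

model = train(rows, target="disease")
post = predict(model, {"chest_pain": "yes", "smoker": "yes"})
print(post)
```

Training is a single counting pass over
the rows, which is the closed-form,
linear-time estimation mentioned above.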
The bar charts obtained after processing
the training data set [11] with the Naïve
Bayes classifier are shown in Figure-4, and
the threshold value for disease prediction
is fixed at 9 in Figure-5.
Figure-4 Naïve Bayes’ Classifier
Figure-5 Threshold line for disease
prediction
IV. ASSOCIATION RULE MINING
Association rule mining was proposed by
Agrawal et al. in 1993. It is an important
data mining model studied extensively by
the database and data mining community. It
was initially used for Market Basket
Analysis to find how items purchased by
customers are related. The method uses the
support and confidence factors.
A. Apriori Algorithm
The Apriori algorithm is an influential
algorithm for mining frequent itemsets for
Boolean association rules. It uses a
“bottom-up” approach, where frequent
subsets are extended one item at a time (a
step known as candidate generation), and
groups of candidates are tested against the
data. Hence, we have used this algorithm to
predict disease occurrence with the
available factors obtained from the data
set [11]. Its main advantages are: (i) it
uses the large itemset property, (ii) it is
easily parallelized, and (iii) it is easy
to implement.
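The bottom-up candidate generation
described above can be sketched in a few
lines of Python. This is a minimal,
unoptimised illustration under stated
assumptions: the transactions, the item
names and the minimum support value are
hypothetical, and the code is not the
implementation used to obtain the results
in this work.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Test each candidate against the data (one scan per level).
        out = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                out[c] = s
        return out

    # Level 1: frequent single items.
    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


# Hypothetical transactions over abstract items A, B, C (assumption).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

result = apriori(transactions, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

Each iteration of the loop requires one
pass over the transactions, which is why
Apriori needs as many database scans as
the size of the largest frequent itemset
plus one.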
Apriori is used to mine the most frequently
occurring sets of items in a transactional
database (for example, collections of items
bought by customers or details of website
visits): it identifies the frequent
individual items and extends them to larger
and larger itemsets as long as those
itemsets appear sufficiently often in the
database [4]. The authors of [4]
implemented it to predict
diabetes-influenced kidney disease.
Algorithms such as Apriori, FP-Growth,
Naïve Bayes and ZeroR have also been
applied in a related study for the
prediction of heart diseases.
B. Pincer Search algorithm
To overcome the disadvantage of the
Apriori algorithm, which requires many
database scans, the Pincer-search
algorithm can be used, which can mine the
frequent candidate set in just two passes.
Dao-I Lin et al. [8] combined both
“bottom-up” and “top-down” searches. The
top-down search is used only for
maintaining and updating a new data
structure, the maximum frequent candidate
set. The work relies mainly on two
closure properties:
1. If an itemset is infrequent, all its
supersets must be infrequent and
they need not be examined further.
2. If an itemset is frequent, all its
subsets must be frequent and they
need not be examined further.
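The two closure properties can be checked
by brute force on a small example. The
sketch below is not the Pincer-search
algorithm itself: it enumerates every
itemset, verifies both properties, and
then lists the maximal frequent itemsets
that Pincer-search aims to find without
such an enumeration. The transactions and
the minimum count are hypothetical.

```python
from itertools import combinations


def classify_itemsets(transactions, min_count):
    """Split all non-empty itemsets into frequent and infrequent sets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    all_itemsets = [frozenset(c)
                    for k in range(1, len(items) + 1)
                    for c in combinations(items, k)]
    frequent = {s for s in all_itemsets
                if sum(1 for t in transactions if s <= t) >= min_count}
    infrequent = set(all_itemsets) - frequent
    return frequent, infrequent


# Hypothetical transactions over abstract items A-D (assumption).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]
frequent, infrequent = classify_itemsets(transactions, min_count=3)
everything = frequent | infrequent

# Property 1: every proper superset of an infrequent itemset is infrequent.
prop1 = all(sup in infrequent
            for s in infrequent for sup in everything if s < sup)
# Property 2: every proper subset of a frequent itemset is frequent.
prop2 = all(sub in frequent
            for s in frequent for sub in everything if sub < s)

# Maximal frequent itemsets: frequent itemsets with no frequent superset.
maximal = {s for s in frequent if not any(s < other for other in frequent)}
print(prop1, prop2, sorted(sorted(m) for m in maximal))
```

Because of property 2, the maximal
frequent itemsets alone determine every
frequent itemset, which is what makes the
maximum frequent set a compact target for
the search.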
Building on the above work, we have
applied this search algorithm to predict
the disease that has the highest frequency
among the common attributes (IF this, THEN
that). Association rule mining thus also
paves the way for disease prediction
through the generated candidate set.
V. CONCLUSION
In this paper, we proposed a system for
the prediction of diseases using data
mining concepts such as the Decision Tree,
the Naïve Bayes classifier, the Apriori
algorithm and the Pincer-search algorithm
on structured data [11]. To the best of our
knowledge, existing papers have not
focused on applying the Pincer-search
algorithm to disease prediction, for which
it shows considerably good results. Once a
disease is predicted, we can suggest a
particular hospital to the patient
(according to the data available) for
further diagnosis and surgical treatment.
In future, this work can be extended to
offer dietary suggestions and
precautionary advice. The parameters used
are consistent indicators for heart
disease; more parameters, including
geographical variations, can be used for
better prediction results. Aspects such as
processing time, resources and memory
usage can be improved in future to make
the system an important aid for the
medical and healthcare communities.