ABSTRACT – Early disease prediction is a core concern of the biomedical and healthcare communities, because it improves the quality of early diagnosis for fatal diseases such as Congenital Heart Disease and cancer. Advanced data mining techniques can help remedy such situations. Applying data mining concepts such as classifiers and Association Rule Mining (ARM) to structured medical data helps detect the occurrence of a particular disease. A medical data set obtained from an open data source in the United Kingdom is processed and analysed for heart disease prediction, after which the system suggests a hospital for further treatment. An accuracy comparison between the classifier algorithms used is generated in R Studio. These prediction results pave the way for proper diagnosis and early treatment of chronic diseases, and can help reduce the deaths caused by fatal diseases being detected only at a critical stage.

Key Words – Data mining, Congenital

Heart Disease, ARM.

I. INTRODUCTION

The healthcare industry collects huge amounts of reliable healthcare data which, unfortunately, are not “mined” to discover hidden information for effective decision making. Clinical decisions are often made based on doctors’ intuition and experience rather than on the knowledge-rich data hidden in the database.

I.1 Classification

There are two forms of data analysis that can be used for extracting models describing important classes or to predict future data trends. These two forms are as follows:

1. Classification

2. Prediction

Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. The two main classifiers implemented here are the Decision Tree and Naïve Bayes classification algorithms.

I.2 Association Rule Mining

Association means finding relationships between different data items in the same data transaction, which is used to discover various hidden patterns. For instance, suppose that someone who buys a desktop (A) also purchases a speaker (B) in 55% of the occurrences, and that the two are bought together in 8.2% of all transactions. An association rule in this case can be written as A ⇒ B (A implies B), where 55% is the CF (confidence factor) and 8.2% is the SF (support factor). The Apriori algorithm, Pincer-search and AprioriDP are efficient ARM algorithms in data mining.
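To make the two factors concrete, the following is a minimal Python sketch (for illustration only; it is not the R code used in this work). The `baskets` list and the helper names `support` and `confidence` are hypothetical, and the toy numbers are not the 55% and 8.2% of the example above.

```python
from fractions import Fraction

def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return Fraction(hits, len(transactions))

def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent together) divided by support of antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical toy baskets, invented for this example.
baskets = [
    {"desktop", "speaker"},
    {"desktop", "speaker", "mouse"},
    {"desktop"},
    {"mouse"},
    {"speaker", "mouse"},
]

# Rule A => B with A = desktop, B = speaker.
rule_support = support(baskets, {"desktop", "speaker"})
rule_confidence = confidence(baskets, {"desktop"}, {"speaker"})
print(f"support = {float(rule_support):.2f}, confidence = {float(rule_confidence):.2f}")
```

Here the support is measured against all baskets, while the confidence is measured only against the baskets that contain the antecedent.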

Figure-1 Data Set Preview

I.3 Data Set Collection

For this proposed work, a dataset with 750 entries for 10 parameters was taken from data.gov.uk [11]. The dataset is the Congenital Heart Disease (CHD) data published by the Healthcare Quality Improvement Partnership, licensed under the Open Government Licence (OGL). The data contain 30-day outcomes (alive or dead) for congenital heart disease treatment in England, although the audit covers all of the UK and the Republic of Ireland. All data are available in the National Institute for Cardiovascular Outcomes Research (NICOR) Congenital Public Portal. A preview of the structured dataset, with an explanation of all ten attributes, is given above in Figure-1, which is drawn from the resource “CongenitalDataFieldDescription20102011”.

The experimental procedure adopted in this work involves the steps shown in Figure-2. Data preprocessing is done to remove duplicates and to clean the data so that it is fit for mining.
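The preprocessing step can be sketched as follows in Python. This is only an assumed, simplified version of the cleaning described above: the field names (`hospital`, `procedure`, `outcome`) and the rule of discarding rows with empty required fields are hypothetical and are not taken from the actual dataset description.

```python
def preprocess(records, required_fields):
    """Drop rows with missing required values, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        # Normalise whitespace and case so trivially different duplicates match.
        norm = {key: str(value).strip().lower() for key, value in row.items()}
        # Cleaning: skip rows where any required field is empty.
        if any(norm.get(field, "") == "" for field in required_fields):
            continue
        # De-duplication: a row is identified by all of its (field, value) pairs.
        fingerprint = tuple(sorted(norm.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(norm)
    return cleaned

# Hypothetical raw rows, invented for this example.
raw = [
    {"hospital": "A", "procedure": "P1", "outcome": "Alive"},
    {"hospital": "A ", "procedure": "P1", "outcome": "alive"},  # duplicate after normalisation
    {"hospital": "B", "procedure": "", "outcome": "dead"},      # missing procedure
    {"hospital": "C", "procedure": "P2", "outcome": "dead"},
]

clean = preprocess(raw, ["hospital", "procedure", "outcome"])
print(len(raw), "rows in,", len(clean), "rows out")
```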

Figure-2 Methodology Workflow

II. LITERATURE SURVEY

A considerable amount of research has been carried out in recent decades using data mining techniques on medical data. Ajad Patel et al. [7] developed a system that indicates whether or not a patient is at risk of heart disease. The work describes a prototype using Naïve Bayes and WAC (Weighted Associative Classifier).

The work of Ms. M.C.S. Geetha et al. [1] is able to respond to more difficult queries in forecasting heart attack disease. The predictive accuracy determined by the REPTREE, J48 and BayesNet algorithms suggests that the parameters used are consistent indicators for predicting heart disease.

Ilham KADI et al. [3] constructed a cardiovascular dysautonomias prediction system using a decision-tree-based classifier developed with the C4.5 tree algorithm and showed it to be accurate and efficient.

Swaroopa Shastri et al. [4] analysed a dataset using the Apriori algorithm to produce a detailed correlation involving diabetes and kidney disease.

The work of Jagdeep Singh et al. [9] focused on heart disease prediction using associative classification methods. The proposed hybrid associative classification is implemented in the WEKA environment.

Similarly, Dao-I Lin et al. [8] presented a novel algorithm called Pincer-search that can efficiently discover the maximum frequent set. It does not require the explicit examination of every frequent itemset.

III. CLASSIFICATION TECHNIQUES

Classification consists of predicting a certain result based on given input training data. In order to predict the result, the algorithm processes a training set containing a set of attributes and the respective outcome, usually called the prediction attribute. Data classification is the process of organizing data into categories for its most effective and efficient use. The classification algorithms used in our analysis are the decision tree and Naïve Bayes.

A. Decision Tree

A decision tree is a predictive model that goes from observations about an item's attributes to conclusions about the item's target value, represented in the leaf nodes. This model is used in statistics, data mining and machine learning. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. J48 is the implementation of the C4.5 algorithm, a successor of ID3 (Iterative Dichotomiser 3), developed by the WEKA project team. J48 allows classification via either decision trees or rules generated from them [1].

C4.5 is a standard algorithm for inducing classification rules in the form of a decision tree. It was introduced by Quinlan. The C4.5 algorithm has the highest accuracy rate when compared with KNN and Naïve Bayes. It is an extension of the basic ID3 algorithm, designed to overcome its disadvantages. Some of these extensions are [3]:

1. Choosing an appropriate attribute selection measure.

2. Pruning the decision tree after its creation.

3. Handling continuous attributes.

It uses a divide-and-conquer approach to form a binary tree, as shown in Figure-3, when applied to the given data set [11].
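The attribute selection measure can be illustrated with a short Python sketch. It computes the gain ratio, the measure usually associated with C4.5, on a handful of hypothetical records; the attribute names (`age_group`, `procedure`, `outcome`) are invented for the example and this is not the tree-building code used to produce Figure-3.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, attribute, target):
    """Information gain of `attribute` divided by its split information."""
    labels = [row[target] for row in rows]
    # Partition the class labels by the value of the candidate attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical records, invented for this example.
rows = [
    {"age_group": "neonate", "procedure": "complex", "outcome": "dead"},
    {"age_group": "neonate", "procedure": "simple", "outcome": "alive"},
    {"age_group": "infant", "procedure": "complex", "outcome": "dead"},
    {"age_group": "infant", "procedure": "simple", "outcome": "alive"},
]

# The attribute with the highest gain ratio becomes the split at this node.
best = max(["age_group", "procedure"], key=lambda a: gain_ratio(rows, a, "outcome"))
print("split on:", best)
```

In a divide-and-conquer tree builder, the same selection would be repeated on each partition until the leaves are pure or no attributes remain.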

Figure-3 Obtained Decision Tree

B. Naïve Bayes Algorithm

Naïve Bayes is a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions between the features. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers. Bayes’ theorem gives the probability of an event occurring given the probability of another event that has already occurred. The mathematical equation is:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Naïve Bayes can answer diagnostic and predictive problems. It is particularly suited to cases where the dimensionality of the inputs is high [1]. The model is based on the conditional independence of each predictor given the target class [7]. The work in [3] achieved high accuracy rates of up to 98.4% for the training set and 97.76% for the testing set.
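As a rough illustration of how the conditional independence assumption turns Bayes’ theorem into a classifier, here is a minimal categorical Naïve Bayes sketch in Python. The records and attribute names are hypothetical, and the Laplace (add-one) smoothing is an assumption of this sketch rather than something stated in this work.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value -> count
    values = defaultdict(set)            # attribute -> set of observed values
    for row in rows:
        for attr, val in row.items():
            if attr != target:
                value_counts[(attr, row[target])][val] += 1
                values[attr].add(val)
    return class_counts, value_counts, values

def predict(model, sample):
    """Return the normalised posterior P(class | sample) for every class."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for attr, val in sample.items():
            # Laplace-smoothed likelihood P(value | class).
            score *= (value_counts[(attr, cls)][val] + 1) / (n_cls + len(values[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: score / norm for cls, score in scores.items()}

# Hypothetical training records, invented for this example.
rows = [
    {"procedure": "complex", "age_group": "neonate", "outcome": "dead"},
    {"procedure": "complex", "age_group": "infant", "outcome": "dead"},
    {"procedure": "simple", "age_group": "neonate", "outcome": "alive"},
    {"procedure": "simple", "age_group": "infant", "outcome": "alive"},
    {"procedure": "simple", "age_group": "child", "outcome": "alive"},
]

model = train_naive_bayes(rows, "outcome")
posterior = predict(model, {"procedure": "complex", "age_group": "neonate"})
print(posterior)
```

Training is a single counting pass over the data, which is the closed-form, linear-time property mentioned above.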

The bar charts obtained after processing the training data set [11] with the Naïve Bayes classifier are shown in Figure-4, and the threshold value for disease prediction is fixed at 9 in Figure-5.

Figure-4 Naïve Bayes’ Classifier

Figure-5 Threshold line for disease

prediction

IV. ASSOCIATION RULE MINING

Association rule mining was proposed by Agrawal et al. in 1993. It is an important data mining model studied extensively by the database and data mining community. It was initially used for Market Basket Analysis to find how items purchased by customers are related. The method relies on the support and confidence factors.

A. Apriori Algorithm

The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. It uses a “bottom-up” approach, where frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. Hence, we have used this algorithm to predict disease occurrence with the available factors obtained from the dataset [11]. Its main advantages are: (i) it uses the large itemset property, (ii) it is easily parallelized, and (iii) it is easy to implement.
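The bottom-up candidate generation described above can be sketched in Python as follows. This is a compact, unoptimised illustration under assumed inputs: the `records` are hypothetical item sets built from invented attribute values, and `min_count` is an absolute support threshold chosen for the example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # One pass over the data counts every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets
        # and keep only those whose k-subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical records, each written as a set of items.
records = [
    {"neonate", "complex", "dead"},
    {"neonate", "complex", "dead"},
    {"infant", "simple", "alive"},
    {"infant", "complex", "alive"},
    {"neonate", "simple", "alive"},
]

frequent = apriori(records, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each iteration of the loop corresponds to one scan of the database, which is why the number of scans grows with the size of the longest frequent itemset.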

Apriori is used to mine the most frequently occurring sets of items in a transactional database (a collection of items bought by customers, or details of visits to a website): frequent itemsets are identified and extended step by step until no further frequent itemsets can be found [4]. The authors of [4] implemented this to predict diabetes-influenced kidney disease. Algorithms such as Apriori, FP-Growth, Naïve Bayes and ZeroR have also been applied in related work for the prediction of heart diseases.

B. Pincer Search algorithm

To overcome the disadvantage of the Apriori algorithm, which requires many database scans, the Pincer-search algorithm can be used, which can mine the frequent candidate set in just two passes.

Dao-I Lin et al. [8] combined both “bottom-up” and “top-down” searches. The top-down search is used only for maintaining and updating a new data structure, the maximum frequent candidate set. The work mainly relies on two closure properties:

1. If an itemset is infrequent, all its supersets must be infrequent and they need not be examined further.

2. If an itemset is frequent, all its subsets must be frequent and they need not be examined further.
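The two closure properties can be sketched in Python as a pruning check. This is not an implementation of Pincer-search itself; it only shows, under assumed inputs, how a known maximal frequent set and a known infrequent set let a candidate be decided without counting it. The item names and helper names are hypothetical.

```python
def maximal_frequent(frequent_itemsets):
    """Keep only the itemsets that have no frequent proper superset."""
    sets = [frozenset(s) for s in frequent_itemsets]
    return {s for s in sets if not any(s < other for other in sets)}

def classify_candidate(candidate, maximal_sets, known_infrequent):
    """Decide a candidate from the closure properties where possible."""
    candidate = frozenset(candidate)
    # Property 2: every subset of a frequent itemset is frequent.
    if any(candidate <= m for m in maximal_sets):
        return "frequent"
    # Property 1: every superset of an infrequent itemset is infrequent.
    if any(frozenset(bad) <= candidate for bad in known_infrequent):
        return "infrequent"
    # Neither property applies, so the database must be scanned.
    return "needs counting"

# Hypothetical itemsets, invented for this example.
frequent_sets = [{"a"}, {"b"}, {"c"}, {"d"},
                 {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
infrequent_sets = [{"a", "d"}]

maximal = maximal_frequent(frequent_sets)
print(classify_candidate({"a", "b"}, maximal, infrequent_sets))
print(classify_candidate({"a", "b", "d"}, maximal, infrequent_sets))
print(classify_candidate({"c", "d"}, maximal, infrequent_sets))
```

Storing only the maximal sets is enough to answer the first check, since every frequent itemset is a subset of some maximal frequent itemset.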

Building on the above, we have applied this search algorithm to predict the disease from the attribute combinations that occur with high frequency among the common attributes, expressed as rules of the form IF this, THEN that. Association rule mining thus also paves the way for disease prediction using the generated candidate set.
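A minimal Python sketch of how IF-THEN rules with the outcome as the consequent might be derived is given below. It enumerates attribute combinations by brute force rather than using Apriori or Pincer-search, and the records, attribute names and thresholds are all hypothetical.

```python
from itertools import combinations

def outcome_rules(records, target, min_count, min_confidence):
    """Return (antecedent, outcome, support count, confidence) tuples."""
    rules = []
    attributes = [a for a in records[0] if a != target]
    for size in range(1, len(attributes) + 1):
        for chosen in combinations(attributes, size):
            # Every distinct value combination seen for the chosen attributes.
            seen = {tuple((a, r[a]) for a in chosen) for r in records}
            for antecedent in sorted(seen):
                matching = [r for r in records
                            if all(r[a] == v for a, v in antecedent)]
                for outcome in sorted({r[target] for r in matching}):
                    hits = sum(1 for r in matching if r[target] == outcome)
                    conf = hits / len(matching)
                    if hits >= min_count and conf >= min_confidence:
                        rules.append((dict(antecedent), outcome, hits, conf))
    return rules

# Hypothetical records, invented for this example.
records = [
    {"age_group": "neonate", "procedure": "complex", "outcome": "dead"},
    {"age_group": "neonate", "procedure": "complex", "outcome": "dead"},
    {"age_group": "infant", "procedure": "simple", "outcome": "alive"},
    {"age_group": "infant", "procedure": "complex", "outcome": "alive"},
    {"age_group": "neonate", "procedure": "simple", "outcome": "alive"},
]

rules = outcome_rules(records, "outcome", min_count=2, min_confidence=0.8)
for antecedent, outcome, hits, conf in rules:
    conditions = " AND ".join(f"{a} = {v}" for a, v in antecedent.items())
    print(f"IF {conditions} THEN outcome = {outcome} (count {hits}, confidence {conf:.2f})")
```

On a real dataset the brute-force enumeration would be replaced by the frequent itemsets produced by the mining algorithm, with the same support and confidence filter applied afterwards.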

V. CONCLUSION

In this paper, we proposed a system for the prediction of diseases using data mining concepts such as the Decision Tree, the Naïve Bayes classifier, the Apriori algorithm and the Pincer-search algorithm on the structured data [11]. To the best of our knowledge, existing papers have not focused on applying the Pincer-search algorithm to disease prediction, where it shows considerably good prediction results. Once a disease is predicted, we can suggest a particular hospital to the patient (according to the data available) for further diagnosis and surgical treatment.

In future, this work can be extended by offering dietary suggestions and precautionary advice. The parameters used are consistent indicators for heart disease, so more parameters with geographical variations can be used for better prediction results. Aspects such as processing time, resource usage and memory consumption can be improved in future to make the system an important aid for the medical and healthcare communities.
