Data analysis and visualization in Python (Pima Indians diabetes data set)

Today I am going to perform a data analysis of a very common data set, the Pima Indians Diabetes data set. You can download the data from here. I will not describe the meta information in detail because it is given here. Anyone who wants to follow the complete analysis is encouraged to read it first, to understand what kind of data we are working with. In our analysis we will use two major Python libraries: pandas for data processing, cleaning and analysis, and Matplotlib for visualizing the data.

Data Analysis

We start by calculating the descriptive statistics, which give us a summary of the data. One of the reasons these initial descriptives are important is that they tell us whether more preprocessing is needed: we can deal with any potential outliers, and normalize the data if there is a significant difference in scale between the variables. Normalization makes our analysis easier, especially when we try to visualize the data.

If you are using pandas, there is a very simple way of calculating the descriptive statistics:

	import pandas as pd

	filename = 'pima-indians-diabetes.csv'
	df = pd.read_csv(filename, sep=',', encoding='utf-8', header=None)

	# Assigning column names and mapping the meta information for each attribute
	cols = df.columns = ['n_pregnant', 'glu_conc', 'bp', 'tst', 'insulin', 'bmi', 'dpf', 'age', 'diabetes?']
	metainfo = {'n_pregnant': 'Number of times pregnant',
	'glu_conc': 'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
	'bp': 'Diastolic blood pressure (mm Hg)',
	'tst': 'Triceps skin fold thickness (mm)',
	'insulin': '2-Hour serum insulin (mu U/ml)',
	'bmi': 'Body mass index (weight in kg/(height in m)^2)',
	'dpf': 'Diabetes pedigree function',
	'age': 'Age (years)',
	'diabetes?': 'Class variable (0 or 1)'}


	df_temp = df.drop(labels = 'diabetes?', axis=1)
	descriptives = df_temp.describe()


We see that 'df_temp.describe()' does all the calculations. We drop the binary variable 'diabetes?' in df_temp because it is a 0/1 class variable whose statistics follow from the binomial distribution formulas, so the descriptives that pandas calculates for it would not give any insight. The rest of the code just loads the data and maps the column names and meta information. The df_temp.describe() call returns the following output.

Here I am not going to spend much time interpreting the results because they are very basic, and you can find various sources that explain how to interpret these metrics. The idea here is to get hands-on with these basic statistics in pandas in a simple way.
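As a small illustration of the step above, here is a minimal sketch that runs describe() on the feature columns and summarizes the binary class column separately. The `toy` DataFrame and its values are made up for this example (they are not taken from the CSV), and the variable names are my own.

```python
import pandas as pd

# Made-up values for illustration only; column names follow the post
toy = pd.DataFrame({
    "glu_conc": [100, 90, 180, 95, 140, 115],
    "bmi": [30.0, 25.0, 24.0, 28.0, 42.0, 26.0],
    "diabetes?": [1, 0, 1, 0, 1, 0],
})

# Descriptive statistics for the continuous attributes only
features = toy.drop(columns="diabetes?")
summary = features.describe()
print(summary)

# For a 0/1 variable the proportion of ones says it all:
# its mean is p and its (Bernoulli) variance is p * (1 - p)
class_share = toy["diabetes?"].value_counts(normalize=True).sort_index()
p = toy["diabetes?"].mean()
bernoulli_var = p * (1 - p)
print(class_share)
print(p, bernoulli_var)
```

The class column is better described by its class proportions than by quartiles, which is why it is left out of describe() here.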

We see from the results that our data is not uniformly scaled: it has a few large-scale variables like glu_conc (glucose concentration) and some very small-scale variables like dpf (diabetes pedigree function). So we need to normalize our data in order to visualize it better. Let us look at an example where normalization really helps us understand the data.

Box plot without normalization

Box plot with data normalization
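A pair of box plots like these could be produced roughly as follows. This is only a sketch on randomly generated stand-in data, not the code used for the figures; the `toy` DataFrame, its distributions and the Agg backend choice are assumptions made so the snippet runs on its own. It uses the mean normalization that this post applies to the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data with very different scales (not the real data set)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "glu_conc": rng.normal(120, 30, 200),
    "bmi": rng.normal(32, 7, 200),
    "dpf": rng.exponential(0.4, 200),
})

# Mean normalization: (x - mean) / (max - min), column by column
toy_norm = (toy - toy.mean()) / (toy.max() - toy.min())

fig, (ax_raw, ax_norm) = plt.subplots(2, 1, figsize=(8, 8))
toy.boxplot(ax=ax_raw)
ax_raw.set_title("Box plot without normalization")
toy_norm.boxplot(ax=ax_norm)
ax_norm.set_title("Box plot with data normalization")
fig.tight_layout()
# Call fig.savefig(...) or plt.show() here to look at the result
```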

You can see that both box plots are from the same data: the upper one shows the original data and the lower one shows the normalized data. With the normalized data we can read the box plot features, and therefore the descriptive statistics, for every variable, whereas with the original data the boxes of the small-scale variables are too compressed to analyze. So let us normalize our data using pandas in a very simple and intuitive way, i.e. $(x - \mathrm{mean}) / (\max - \min)$:

	df_norm = (df - df.mean()) / (df.max() - df.min())
	df_norm.describe()


The above code returns the following table.
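To see what this normalization does, here is a tiny check on made-up numbers (the `toy` values are assumptions for illustration, not rows from the data set). After the transformation every column should have a mean of about 0 and a range (max minus min) of exactly 1, whatever its original scale.

```python
import pandas as pd

# Two made-up columns on very different scales
toy = pd.DataFrame({
    "glu_conc": [100.0, 150.0, 200.0],
    "dpf": [0.2, 0.5, 0.8],
})

# Same formula as in the post: (x - mean) / (max - min)
toy_norm = (toy - toy.mean()) / (toy.max() - toy.min())

print(toy_norm)
print(toy_norm.mean())                   # close to 0 for every column
print(toy_norm.max() - toy_norm.min())   # 1 for every column
```

Note that this is mean normalization, not standardization: the columns end up with a range of 1 rather than a standard deviation of 1.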

Let us look at the data a bit more deeply and start visualizing. Let's see whether the two classes, i.e. women with a positive diabetes test and women with a negative test, have similar mean values for the attributes. For this we need a bar graph like the following.

Bar graph plot

From the bar graph we can see that insulin and glucose concentration (glu_conc) are noticeably higher in women whose diabetes test is positive. The other attributes might still affect diabetes, but the bar graph does not show a clear difference between the two classes for them.
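A comparison of class means like this can be sketched with a groupby followed by a bar plot. The snippet below is an assumption about how such a graph might be built, not the code from the repository, and the `toy` values are made up so that the example runs without the CSV.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Made-up values for illustration only, not rows from the real data set
toy = pd.DataFrame({
    "glu_conc": [90, 100, 110, 150, 160, 170],
    "insulin": [50, 60, 70, 120, 130, 140],
    "bp": [70, 72, 74, 71, 73, 75],
    "diabetes?": [0, 0, 0, 1, 1, 1],
})

# Normalize the attributes so that their means are comparable on one axis
features = toy.drop(columns="diabetes?")
features_norm = (features - features.mean()) / (features.max() - features.min())

# Mean of each attribute per class; transpose so attributes sit on the x-axis
class_means = features_norm.groupby(toy["diabetes?"]).mean().T
class_means = class_means.rename(columns={0: "negative test", 1: "positive test"})

ax = class_means.plot.bar(color=["blue", "green"])
ax.set_ylabel("Mean of normalized attribute")
print(class_means)
```

In this toy data glu_conc and insulin differ clearly between the classes while bp barely moves, which is the kind of pattern the bar graph is meant to expose.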

After determining the features that differ between the two classes, we'll split each feature into a histogram and visualize the levels at which we observe differences between the classes for each attribute. For this we need a stacked histogram, which is shown below.

Stacked histogram all attributes

The stacked histogram shows a clearer picture of the data. Green bars show the women with a positive diabetes test and blue bars show the women with a negative test. In glu_conc we see that diabetes is diagnosed in women with high glucose concentration levels. Similarly, the bp (diastolic blood pressure) histogram shows that positive tests were found in women with high bp. There were samples where women with high bp had a negative test, but positive tests were found only with high bp. The insulin histogram shows that women with lower insulin levels have a positive diabetes test. There are also women with low insulin who have a negative test, and the higher insulin levels follow a roughly exponential distribution, but women with lower insulin levels have a higher chance of a positive test. The other attributes do not show any clear relationship with diabetes across their levels, so they might not have a significant impact.
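For one attribute, a stacked histogram with this color scheme could be drawn as in the sketch below. The two samples are randomly generated stand-ins (their means and sizes are assumptions), so the shape of the plot says nothing about the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in values for one attribute, split by class
rng = np.random.default_rng(0)
glu_negative = rng.normal(110, 20, 130)
glu_positive = rng.normal(145, 25, 70)

fig, ax = plt.subplots()
# Passing a list of arrays with stacked=True piles the classes on top of each other
counts, bin_edges, _ = ax.hist(
    [glu_negative, glu_positive],
    bins=10,
    stacked=True,
    color=["blue", "green"],
    label=["negative test", "positive test"],
)
ax.set_xlabel("glu_conc")
ax.set_ylabel("Number of women")
ax.legend()
```

Repeating this for every attribute, for example in a grid of subplots, gives a figure like the one above.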

Finally we'll try to visualize the bivariate relationship between each pair of variables using scatter plots. We have 8 independent variables and one class variable, so to see the scatter plot for every pair of variables we need a scatter matrix like the following, which shows all the bivariate scatter plots at the same time.

Bivariate scatter matrix plot

We used the normalized data for plotting the scatter matrix. We see that the two classes overlap with each other in the bivariate scatter plots. The distributions shown on the diagonal appear to follow either an exponential distribution or a slightly skewed normal distribution.
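One way to draw such a matrix is pandas' own `scatter_matrix` helper, sketched below on random stand-in data. Whether the original figure was made this way is an assumption on my part; the `toy` DataFrame, the three chosen columns and the blue/green color mapping are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random stand-in data (not the real data set)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "glu_conc": rng.normal(120, 30, n),
    "bmi": rng.normal(32, 7, n),
    "dpf": rng.exponential(0.4, n),
    "diabetes?": rng.integers(0, 2, n),
})

# Normalize the attributes and color each point by its class
features = toy.drop(columns="diabetes?")
features_norm = (features - features.mean()) / (features.max() - features.min())
colors = toy["diabetes?"].map({0: "blue", 1: "green"}).tolist()

# Off-diagonal cells are pairwise scatter plots, diagonal cells are histograms
axes = scatter_matrix(features_norm, c=colors, diagonal="hist",
                      alpha=0.6, figsize=(8, 8))
```

With all 8 attributes the same call produces an 8 by 8 grid, so a larger figsize helps readability.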

Note: All the Python code for plotting the above graphs and calculating the statistics is available on GitHub here.
