Love for Data Science: data-visualization | Your Blog Description Here

ABC analysis for inventory management in MS Excel and Python from scratch

ABC analysis as the name shows that it is a technique in inventory management to categorize the overall catalogue of products into 3 classes "A","B" and "C". Categorization "A" relates to the class of most valuable products, "B" being less valuable, and "C" beaing the least valuable.

The criteria of how much a product is valuable is generally determined on the basis of Pareto principle (or 80/20 rule). This rule says that 80% of the effects (in our case 80% of total sales) comes from 20% of the causes (in our case 20% of the products (SKUs)). It is a very simple and effective method that allows inventory managers to take decisions about which SKUs are important to produce first and which SKUs are least important. It has to be noted that in a normal business scenario all orders should have equal importance but when a business reaches a level where it cannot produce according to market demand then managers have to decide which SKUs are generating more sales to the business. The first focus of the managers is to produce the products which generates more sales and then go for less important products with respect to sales.

Implementation in MS Excel
The calculation for the segmentation of ABC products is pretty straight forward. It generally requires quantity and volumes of each SKU sold over a specific period of time. In inventory management, managers usually consider 1 year data but it not a benchmark. In our example, we'll generate some random data for quantity and volumes based on different SKUs. The data looks like the following

We take the above data and make pivots in Excel based on 3 factors. The first factor is the total number of lines. We'll use the =COUNTIF function to count the number of lines of a particular SKU in the data set. This will show that how many times a particular SKU has been ordered over the period of time. Then, for the second and third pivot tables we'll use the =SUMIF function to calculate total quantity for each SKU ordered and total sales it generates over the period of time. See the below implementation

Next we,ll calculate the respective percentage of each SKU with respect to all SKU in all 3 pivots and then calculate the cumulative sum in the next column like below. We'll then see if each SKU (presented in a descending order i.e. sorted by highest to lowest) is greater then our threshold value of 80% or not by making a Decision column. If the SKU value is greater than 80% it lies in the top 80% SKU with respect to either frequency, quantity or volumes, we'll mark the Decision column as 1 otherwise 0.

We'll then merge all the 3 tables by applying =VLOOKUP function on SKU and sum each of the 3 attributes. If the the sum for a specific SKU is equal to 3 then it will be categorized as "A" SKU. Similarly, if the sum is 2 them "B" and if the sum is 3 then "C". We also segmented SKUs as category "Q" whose sum of all 3 factors equals 0 or in other words, those SKUs are not important either in terms of frequency, quantity or value.

Python Implementation
The above process for ABC analysis in Excel is a very easy to implement and understandable to beginners as well. We also have Python implementation below for the same process explained above. We start by loading the data as a dataframe.

After that we simply make 3 pivots by using Pandas and then in the end merge all the 3 tables. Once the merging is done, we'll add the results from each column in the and does the segmentation like we did above by using segmentation method.

data-visualization October 24, 2017 21 comments

Portfolio optimization of financial assets in Python from scratch

Portfolio optimization is a technique in finance which allow investors to select different proportions of different assets in such a way that there is no way to make a better portfolio under the given criterion. The criteria for optimization is generally to minimize the risk for a certain level of expected returns used as a constraint or maximize the returns while a certain level of risk used a constraint.

We write the portfolio optimization problem as

$minimize \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \sigma_{ij}$
subject to
$\sum_{i=1}^{N} w_i R_i = \mu$
$\sum_{i=1}^{N} w_i=1$

Here $i$ are the number of assets in the portfolio, $w_i$ are the weights (proportions) assigned to each $i_{th}$ asset, $R_i$ are the returns from each asset, $\mu$ are the average returns and $\sigma_{ij}$ is the variance-covariance matrix. In simple words the above model can be understood as minimization of weighted variance-covariance matrix (or financial risk) with a certain level of returns $\mu$ kept under constraint. The last constraint makes sure that all the weights are add up to 1.

Data collection and sources
I am going to use 6 assets in this example namely stocks, gold, foreign exchange (FOREX). bitcoins, real estate, and bonds. Bitcoins as assets are relatively new and less than a decade old but if you have not heard about them you can find an introduction of bitcoins here. The time period we are considering is between January 2015 till December 2016. I collected stocks data from yahoo finance API and bitcoin historical data from here.The data for remaining assets was collected using Quandl API from here. The data was collected using different APIs including Yahoo Finance, Quandl and was cleaned and merged such that it looks like the following

We see that different assets data has different scales so it must be normalized in order to make each asset comparable with each other. There is another way by which we can make this data comparable i.e. to calculate the daily returns for each asset which will make the range of our data from -1 to +1. In Python this task is quite straight forward

Financial data is mostly compared using this method. After calculating the returns our data looks like the following and ready to be processed for optimization.

Portfolio optimization implementation in Python
We start optimizing our portfolio by doing some visualization so we have a general idea that how our data looks like. For this we'll simply plot our returns against the time and the following code will do that

We'll get the following graph as our output

We see that bitcoins show extremely high variations followed by bonds but on the other hand the returns are also extremely high verifying the assumption behind the modern portfolio theory. In the next stage we need to setup the returns and variance methods which will be used in optimization method. The idea behind building these methods is that we'll initialize our optimization with some random weights whose sum is 1. Then we'll update the weights by using some optimization method, say "SLSQP" in our example and will keep updating them until convergence. Following code will do that

The methods calc_exp_returns() and var_cov_matrix() have been implemented from the objective function above. In the objective function we have minimization of risk for which we implemented var_cov_matrix() and for the constraints we have calc_exp_returns() and sum(weights) = 1 in the optimize() method. These constraints are set in the minimization problem under contraints parameter. For the purpose of performing analysis, we have used different expected returns values and determining how risk behaves if we vary expected returns. The list named exp_return_constraint contains different level of expected returns and we run the optimization as many times as we have elements in the list such that we'll get different risk levels for different expected returns.
The results will allow us to plot the efficient frontier for all feasible portfolios. Below is the method which plots the efficient frontier

From the graph we can see that higher expected returns leads to higher risk. In fact we see that as our expected returns are getting close to zero, our risk is also getting close to zero. Furthermore, the trajectory formed by the scatter plot is called "Efficient Frontier" in Finance and all the solutions or point lie on that trajectory are feasible for investors.
Finally, we need to see how weights are changing while we increase our expected returns. For this purpose we need to plot bar charts of weights for each level of expected returns. Following code will do that

The legend of each bar chart shows the percentage risk and the optimal weights for assets on that level of risk. We can analyze from the graph that for higher level of returns the weights are biased towards bitcoins because its prices have increased significantly in the last few years. On contrary, when we decrease the level of expected returns, we see different variations in the weights but foreign exchange (forex) is the optimal due to its least risky nature. Furthermore, if we expect very low returns, say 0.01% , we see that the weights are biased towards forex. So if someone is willing to play very safe, then investing in forex is a good option but if someone is interested in earning more, he/she has to go with higher risk but then bitcoin would be a better investment choice.

Conclusion
In this tutorial I have tried to explain how we can find efficient portfolios and how to analyze them. With the current data for US and the given time period, we analyze that bitcoin is a good investment option for risk taking investors while forex is a good option for risk averse investors. This analysis is considered as a basic analysis in finance so a lot other factors can be incorporated in our model which might change our decision variable values to an extent.

The complete code for this example is available here on github.

data-visualization October 20, 2017 16 comments

Data analysis and visualization in Python (Pima Indians diabetes data set)

Today I am going to perform data analysis for a very common data set i.e. Pima Indians Diabetes data set. You can download the data from here. I'll not give the meta information here in detail because it is given exclusively here. So it is recommended for all who want to understand the complete data analysis that what kind of data we are working with. In our analysis we'll be using two major Python libraries to do analysis and visualization. Pandas is for data processing, cleaning and analysis whereas Matplotlib is for visualization of our data.

Data Analysis

We start by calculating the descriptives which allows us to see the data summary. One of the reaasons why initial descriptives are important because we see the data summary and do preprocessing again if we find any potential outliers and do normalization if there is a significant difference of scales between the variables. Normalization makes our analysis easier specially when we try to visualize data.

If you are using pandas, there is a very simple of calculating the descriptive statistics

We see that 'df_temp.describe()' does all the calculations. We drop the binary variable 'diabetes?' in df_temp because its descriptive statistics are calculated by binomial distribution formula and the way pandas caluclated the descriptives will not give any insights. Rest of the other work is just to load the data and mapping the columns and meta information. The df.describe() method will return the following output.

Here I am not going to spend much time in interprating the the results because it is very basic and you can find various sources to see the interpratation of these metrics. The idea here is to get our hands on with these basic statistics with pandas in a simple way.

Now we see from the results that our data is not scaled uniformly rather it has few high scale variables like glu_conc (glucose concentration) and some very low scaled variables like dpf (diabetes pedigree function). So we need to normalize our data in order to visualize it better. Let us see an example where normalization will really help us in understanding the data better.

Box plot without normalization

Box plot with data normalization

You can see that the box plots are from the same data but above one is the original data and below one is the normalized data. With below box plot we can visualize the box plot features effectively i.e. one can visualize all the descriptive statistics effectively in the box plot with the normalized data whereas with the original data it is difficult to analyze. So let us normalize our data using pandas in a very simple and intuitive way i.e. $(x-mean)/(max-min)$

The above code return the following table

Let us look at the data a bit deeply and start visualizing. Lets see if the classes i.e. one with postivie diabetes test and the other with with non-postivie test have similar mean values of the attributes or not. For this we need a bar graph plot like the following

Bar graph plot

From the bar graph we can analyze that insulin and glucose concentrations (glu_conc) are significantly higher in women whose diabetes test is diagnosed positive. Other attributes might be impacting the diabetes significantly but from our bar graph plot we didn't find any significant difference in the other attributes.

After determining the features which have significant differences in the two classes we'll split each feature into a histogram and visualize the levels at which we observes differences in the class values for each attribute. For this we need a stacked histogram which is shown below.

Stacked histogram all attributes

The stacked histogram shows a more clearer picture of the data. Green bars shows the women with positive diabetes test and blue bars shows the women with negative diabetes test. In glu_conc we see that diabetes are diagnosed to those women having high glucose concentration levels. Similarly, bp (blood pressure diastolic) histogram shows that positive diabetes was found in women with high bp. Although there were samples where high bp women had a negative test but positive diabetes was found only with high bp. Insulin histogram shows that women with lower insulin levels have a positive diabetes test. There are also women with low insulin which have negative diabetes test so higher insulin levels are following exponential distribution but women having lower insulin levels have higher chances for having a positive diabetes test. The other attributes are not showing any correlation with the diabetes across various levels so they might not be impacting significantly.

Finally we'll try to visualize the bivariate relationship between each variable using a scatter plot. We know that we have 8 independent variables and one class variable so in order to to visualize the scatter plot between each of the variables we need a scatter matrix like the following which allows us to visualize all the bivariate scatter plots at the same time.

Bivariate scattered matrix plot

We used normalized data for plotting the scattered matrix. We see that both the classes are overlapping with each other in a bivariate scatter plot. The distributions shown in the diagonal are following either exponential distribution or slightly skewed normal distribution.

Note*: All the python code for plotting the above graphs and calculating the statistics is available on github here.

data-visualization October 14, 2017 54 comments

ABC analysis for inventory management in MS Excel and Python from scratch

Portfolio optimization of financial assets in Python from scratch

Data analysis and visualization in Python (Pima Indians diabetes data set)

Categories

Recent Comments

Popular Posts

Blog Archive

Text Widget

Text Widget

Advertisement

Follow us

Video of the day

Pages

About Me

ADs

POPULAR POSTS

Categories

Contact Form

Popular Posts

Label

Labels

About