ABC analysis for inventory management in MS Excel and Python from scratch

ABC analysis, as the name suggests, is a technique in inventory management for categorizing the overall catalogue of products into three classes: "A", "B" and "C". Class "A" contains the most valuable products, "B" the less valuable ones, and "C" the least valuable.

How valuable a product is, is generally determined on the basis of the Pareto principle (or 80/20 rule). This rule says that 80% of the effects (in our case, 80% of total sales) come from 20% of the causes (in our case, 20% of the products, or SKUs). It is a very simple and effective method that allows inventory managers to decide which SKUs are most important to produce first and which are least important. It has to be noted that in a normal business scenario all orders should have equal importance, but when a business reaches a point where it cannot produce according to market demand, managers have to identify which SKUs generate the most sales. Their first focus is then to produce the products that generate more sales before moving on to the products that matter less in terms of sales.

Implementation in MS Excel
The calculation for segmenting products into A, B and C classes is pretty straightforward. It generally requires the quantity and sales volume of each SKU sold over a specific period of time. In inventory management, managers usually consider one year of data, but that is not a strict rule. In our example, we'll generate some random data for quantities and volumes of different SKUs. The data looks like the following



We take the above data and build three pivots in Excel, one for each factor. The first factor is the total number of order lines. We'll use the =COUNTIF function to count the number of lines for a particular SKU in the data set, which shows how many times that SKU has been ordered over the period. Then, for the second and third pivot tables, we'll use the =SUMIF function to calculate the total quantity ordered for each SKU and the total sales it generated over the period. See the implementation below

Next we'll calculate each SKU's percentage with respect to all SKUs in all three pivots and then compute the cumulative sum in the next column, as shown below. With the SKUs sorted in descending order (highest to lowest), we then compare the cumulative value against our 80% threshold in a Decision column. If an SKU falls within the top 80% with respect to frequency, quantity or volume, we mark the Decision column as 1, otherwise 0.


We'll then merge the three tables by applying the =VLOOKUP function on SKU and sum the three Decision flags. If the sum for a specific SKU equals 3, it is categorized as an "A" SKU. Similarly, if the sum is 2 it is a "B", and if the sum is 1 it is a "C". We also created a category "Q" for SKUs whose sum across all three factors equals 0, or in other words, SKUs that are not important in terms of frequency, quantity or value.




Python Implementation
The above process for ABC analysis in Excel is very easy to implement and understandable for beginners as well. Below is a Python implementation of the same process. We start by loading the data as a dataframe.


After that we simply build the three pivots using pandas and then merge the three tables. Once the merging is done, we add up the Decision flags from each column and perform the segmentation like we did above, using a segmentation method.
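The full notebook is linked in the repository; as a rough illustration, a pandas version of the whole process might look like the sketch below. The column names ('SKU', 'Quantity', 'Sales'), the file name and the handling of the 80% threshold are assumptions for illustration, not the exact code from the repository.

```python
import pandas as pd

# Assumed column names ('SKU', 'Quantity', 'Sales') and a hypothetical file name
df = pd.read_csv('abc_data.csv')

# Three pivots: number of order lines, total quantity and total sales per SKU
pivots = df.groupby('SKU').agg(lines=('SKU', 'count'),
                               quantity=('Quantity', 'sum'),
                               sales=('Sales', 'sum'))

def decision(col, threshold=0.80):
    """Flag an SKU with 1 if it falls inside the cumulative top-80% band, else 0."""
    ordered = col.sort_values(ascending=False)
    cum_share = ordered.cumsum() / ordered.sum()
    return (cum_share <= threshold).astype(int).reindex(col.index)

flags = pivots.apply(decision)      # one Decision column per factor
score = flags.sum(axis=1)           # 0, 1, 2 or 3

pivots['class'] = score.map({3: 'A', 2: 'B', 1: 'C', 0: 'Q'})
print(pivots.sort_values('sales', ascending=False).head())
```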

Portfolio optimization of financial assets in Python from scratch

Portfolio optimization is a technique in finance which allows investors to select the proportions of different assets in such a way that no better portfolio can be formed under the given criterion. The criterion is generally either to minimize risk for a certain level of expected return used as a constraint, or to maximize return with a certain level of risk used as a constraint.

We write the portfolio optimization problem as

\[\min_{w}\; w^{T}\,\Sigma\,w\]

subject to

\[\bar{r}^{T}w = R\]
\[\sum_{i=1}^{n} w_i = 1\]

Here $n$ is the number of assets in the portfolio, $w_i$ are the weights (proportions) assigned to each asset, $r_i$ are the returns from each asset, $\bar{r}$ are the average returns, $\Sigma$ is the variance-covariance matrix and $R$ is the required level of expected return. In simple words, the above model minimizes the weighted variance-covariance matrix (or financial risk) while keeping a certain level of returns $R$ as a constraint. The last constraint makes sure that all the weights add up to 1.

Data collection and sources
I am going to use six assets in this example, namely stocks, gold, foreign exchange (forex), bitcoin, real estate, and bonds. Bitcoin as an asset is relatively new, less than a decade old, and if you have not heard about it you can find an introduction to bitcoin here. The time period we are considering is January 2015 to December 2016. I collected stock data from the Yahoo Finance API and bitcoin historical data from here. The data for the remaining assets was collected using the Quandl API from here. The data from these different sources was cleaned and merged so that it looks like the following

We see that the different assets' data have different scales, so they must be normalized to make the assets comparable with each other. Another way to make the data comparable is to calculate the daily returns for each asset, which brings the range of our data roughly between -1 and +1. In Python this task is quite straightforward
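As a minimal sketch (the price file name is hypothetical and the actual loading code is in the linked repository), the daily returns can be computed with pandas like this:

```python
import pandas as pd

# Hypothetical file name; 'prices' holds daily closing prices, one column per asset
prices = pd.read_csv('asset_prices.csv', index_col=0, parse_dates=True)

# Daily returns: (P_t - P_{t-1}) / P_{t-1}
returns = prices.pct_change().dropna()
print(returns.head())
```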


Financial data is mostly compared using this method. After calculating the returns, our data looks like the following and is ready to be processed for optimization.




Portfolio optimization implementation in Python
We start optimizing our portfolio by doing some visualization so that we have a general idea of what our data looks like. For this we'll simply plot our returns against time, and the following code will do that
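The plotting code is in the repository; a minimal equivalent, assuming the `returns` dataframe from the sketch above, could be:

```python
import matplotlib.pyplot as plt

# 'returns' is the daily-returns DataFrame computed earlier
returns.plot(figsize=(12, 6))
plt.xlabel('Date')
plt.ylabel('Daily return')
plt.title('Daily returns of the six assets')
plt.show()
```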

We'll get the following graph as our output
We see that bitcoin shows extremely high variation, followed by bonds, but on the other hand its returns are also extremely high, verifying the assumption behind modern portfolio theory. In the next stage we need to set up the returns and variance methods which will be used in the optimization. The idea behind these methods is that we'll initialize the optimization with some random weights whose sum is 1, then update the weights using an optimization method, "SLSQP" in our example, and keep updating them until convergence. The following code will do that
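The exact implementation is on github; the sketch below shows one way the calc_exp_returns(), var_cov_matrix() and optimize() methods might be written with SciPy's SLSQP solver. The target-return values, bounds and random initialization are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

avg_returns = returns.mean().values     # average daily return of each asset
cov_matrix = returns.cov().values       # sample variance-covariance matrix
n_assets = len(avg_returns)

def calc_exp_returns(weights, avg_returns):
    """Expected portfolio return for a given weight vector."""
    return np.dot(weights, avg_returns)

def var_cov_matrix(weights, cov_matrix):
    """Portfolio risk: w^T * Sigma * w."""
    return np.dot(weights, np.dot(cov_matrix, weights))

def optimize(target_return):
    """Minimize risk subject to a target expected return and weights summing to 1."""
    init_weights = np.random.dirichlet(np.ones(n_assets))   # random weights that sum to 1
    constraints = (
        {'type': 'eq', 'fun': lambda w: calc_exp_returns(w, avg_returns) - target_return},
        {'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},
    )
    bounds = [(0.0, 1.0)] * n_assets                         # long-only portfolio
    return minimize(var_cov_matrix, init_weights, args=(cov_matrix,),
                    method='SLSQP', bounds=bounds, constraints=constraints)

# Assumed levels of expected (daily) return; one optimization run per level
exp_return_constraint = [0.0002, 0.0005, 0.001, 0.002]
results = [optimize(r) for r in exp_return_constraint]
risks = [res.fun for res in results]
```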

The methods calc_exp_returns() and var_cov_matrix() implement the objective function above. In the objective function we minimize risk, for which we implemented var_cov_matrix(), while the constraints, calc_exp_returns() and sum(weights) = 1, are handled in the optimize() method. These constraints are set in the minimization problem under the constraints parameter. For the purpose of the analysis, we use different expected return values and determine how risk behaves as we vary the expected return. The list named exp_return_constraint contains different levels of expected returns, and we run the optimization as many times as there are elements in the list, so that we get a different risk level for each expected return.
The results will allow us to plot the efficient frontier for all feasible portfolios. Below is the method which plots the efficient frontier
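A minimal version of such a plotting method, reusing `risks` and `exp_return_constraint` from the sketch above, might be:

```python
import matplotlib.pyplot as plt

def plot_efficient_frontier(risks, target_returns):
    """Scatter the minimized risk against each target expected return."""
    plt.figure(figsize=(8, 5))
    plt.scatter(risks, target_returns)
    plt.xlabel('Risk (portfolio variance)')
    plt.ylabel('Expected return')
    plt.title('Efficient frontier')
    plt.show()

plot_efficient_frontier(risks, exp_return_constraint)
```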

From the graph we can see that higher expected returns lead to higher risk. In fact, as our expected return gets close to zero, our risk also gets close to zero. The trajectory formed by the scatter plot is called the "efficient frontier" in finance, and all solutions or points lying on that trajectory are feasible for investors.
Finally, we need to see how the weights change as we increase our expected returns. For this purpose we plot bar charts of the weights for each level of expected return. The following code will do that
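Again as a sketch, assuming the `results` list from the optimization above, one way to draw the bar charts is:

```python
import numpy as np
import matplotlib.pyplot as plt

asset_names = returns.columns

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, res, target in zip(axes.ravel(), results, exp_return_constraint):
    ax.bar(np.arange(len(asset_names)), res.x)          # optimal weights for this run
    ax.set_xticks(np.arange(len(asset_names)))
    ax.set_xticklabels(asset_names, rotation=45)
    ax.set_ylabel('Weight')
    ax.set_title('Target return {:.4f}, risk {:.6f}'.format(target, res.fun))
plt.tight_layout()
plt.show()
```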

The legend of each bar chart shows the percentage risk and the optimal weights for the assets at that level of risk. We can see from the graph that for higher levels of return the weights are biased towards bitcoin, because its price has increased significantly in the last few years. On the contrary, when we decrease the level of expected return, we see different variations in the weights, but foreign exchange (forex) dominates due to its low-risk nature. Furthermore, if we expect very low returns, say 0.01%, we see that the weights are biased towards forex. So if someone is willing to play very safe, then investing in forex is a good option, but if someone is interested in earning more, he or she has to accept higher risk, and then bitcoin would be a better investment choice.

Conclusion
In this tutorial I have tried to explain how we can find efficient portfolios and how to analyze them. With the given US data and time period, we find that bitcoin is a good investment option for risk-taking investors while forex is a good option for risk-averse investors. This is a basic analysis in finance, so a lot of other factors can be incorporated into the model, which might change the decision variable values to some extent.

The complete code for this example is available here on github

Data analysis and visualization in Python (Pima Indians diabetes data set)



Today I am going to perform data analysis on a very common data set, the Pima Indians Diabetes data set. You can download the data from here. I'll not give the meta information in detail here because it is covered extensively here, and it is recommended reading for anyone who wants to understand what kind of data we are working with. In our analysis we'll be using two major Python libraries: pandas for data processing, cleaning and analysis, and Matplotlib for visualization.

Data Analysis

We start by calculating the descriptive statistics, which give us a summary of the data. One of the reasons initial descriptives are important is that they let us spot potential outliers, reprocess the data if necessary, and normalize it if there is a significant difference in scale between the variables. Normalization makes our analysis easier, especially when we try to visualize the data.

If you are using pandas, there is a very simple way of calculating the descriptive statistics
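The exact code is on github; a minimal sketch, with the file name and most column names assumed (only glu_conc, bp, insulin, dpf and 'diabetes?' appear in this post), is:

```python
import pandas as pd

# Assumed column names for the Pima data set (the raw file has no header row)
cols = ['times_pregnant', 'glu_conc', 'bp', 'skin_fold', 'insulin',
        'bmi', 'dpf', 'age', 'diabetes?']
df = pd.read_csv('pima-indians-diabetes.csv', names=cols)   # hypothetical file name

# Drop the binary class column before computing the descriptives
df_temp = df.drop('diabetes?', axis=1)
print(df_temp.describe())
```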


We see that 'df_temp.describe()' does all the calculations. We drop the binary variable 'diabetes?' from df_temp because its descriptive statistics follow a binomial distribution, and the way pandas calculates the descriptives would not give any insight for it. The rest of the work is just loading the data and mapping the columns to the meta information. The describe() method returns the following output.


I am not going to spend much time interpreting the results here because they are very basic and you can find various sources explaining these metrics. The idea is simply to get our hands on these basic statistics with pandas in a simple way.

Now we see from the results that our data is not uniformly scaled; it has a few high-scale variables like glu_conc (glucose concentration) and some very low-scale variables like dpf (diabetes pedigree function). So we need to normalize our data in order to visualize it better. Let us see an example where normalization really helps us understand the data better.

Box plot without normalization
Box plot with data normalization
Both box plots are drawn from the same data: the one above uses the original data and the one below uses the normalized data. With the normalized data one can read all the descriptive statistics from the box plot effectively, whereas with the original data it is difficult to analyze. So let us normalize our data using pandas in a very simple and intuitive way, i.e.
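One simple option, assumed here, is min-max scaling of every attribute to the [0, 1] range while leaving the class column untouched:

```python
# Min-max scaling of the attributes; the class column 'diabetes?' is re-attached unchanged
features = df.drop('diabetes?', axis=1)
df_norm = (features - features.min()) / (features.max() - features.min())
df_norm['diabetes?'] = df['diabetes?']
print(df_norm.head())
```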


The above code returns the following table

Let us look at the data a bit more deeply and start visualizing. Let's see whether the two classes, one with a positive diabetes test and the other with a non-positive test, have similar mean values for the attributes or not. For this we need a bar graph plot like the following
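A minimal way to produce such a plot, using the normalized dataframe from above, is to group by the class column and plot the means:

```python
import matplotlib.pyplot as plt

# Mean of every attribute, grouped by test outcome (0 = negative, 1 = positive)
class_means = df_norm.groupby('diabetes?').mean()
class_means.T.plot(kind='bar', figsize=(10, 5))
plt.ylabel('Mean (normalized)')
plt.title('Attribute means by diabetes test outcome')
plt.show()
```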

Bar graph plot
From the bar graph we can see that insulin and glucose concentration (glu_conc) are significantly higher in women whose diabetes test is positive. Other attributes might also affect diabetes significantly, but the bar graph does not show any significant difference for them.

After determining the features which show significant differences between the two classes, we'll plot a histogram for each feature to visualize the levels at which we observe differences in the class values. For this we need a stacked histogram, which is shown below.
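A rough sketch of the stacked histograms, again assuming the df_norm dataframe from above (green for positive tests, blue for negative, matching the discussion below):

```python
import matplotlib.pyplot as plt

attributes = [c for c in df_norm.columns if c != 'diabetes?']
pos = df_norm[df_norm['diabetes?'] == 1]
neg = df_norm[df_norm['diabetes?'] == 0]

fig, axes = plt.subplots(2, 4, figsize=(14, 7))
for ax, col in zip(axes.ravel(), attributes):
    # Stack the negative and positive classes in one histogram per attribute
    ax.hist([neg[col], pos[col]], bins=15, stacked=True,
            color=['blue', 'green'], label=['negative', 'positive'])
    ax.set_title(col)
axes[0, 0].legend()
plt.tight_layout()
plt.show()
```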

Stacked histogram all attributes
The stacked histogram shows a clearer picture of the data. Green bars show the women with a positive diabetes test and blue bars show the women with a negative test. In glu_conc we see that diabetes is diagnosed mostly in women with high glucose concentration levels. Similarly, the bp (diastolic blood pressure) histogram shows that positive tests occur in women with high bp; although some women with high bp had a negative test, the positive tests were found only in the high bp range. The insulin histogram shows that women with lower insulin levels tend to have a positive diabetes test. There are also women with low insulin who have a negative test, and the higher insulin levels roughly follow an exponential distribution, but overall women with lower insulin levels have a higher chance of a positive diabetes test. The other attributes do not show any clear relationship with diabetes across their levels, so they might not have a significant impact.

Finally, we'll visualize the bivariate relationship between each pair of variables using scatter plots. We have 8 independent variables and one class variable, so in order to visualize the scatter plot between each pair of variables we need a scatter matrix like the following, which lets us see all the bivariate scatter plots at the same time.
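pandas ships a scatter_matrix helper that does this directly; a sketch, coloring points by class, could look like this:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Blue = negative test, green = positive test
colors = df_norm['diabetes?'].map({0: 'blue', 1: 'green'}).tolist()
scatter_matrix(df_norm.drop('diabetes?', axis=1), figsize=(12, 12),
               diagonal='hist', color=colors)
plt.show()
```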
Bivariate scattered matrix plot
We used the normalized data for plotting the scatter matrix. We see that the two classes overlap with each other in the bivariate scatter plots. The distributions shown on the diagonal follow either an exponential distribution or a slightly skewed normal distribution.

Note: All the Python code for plotting the above graphs and calculating the statistics is available on github here.

An introduction to Artificial Neural Networks and their detailed implementation in Python and Excel


Artificial Neural Networks (ANNs) are classification algorithms in machine learning inspired by the biological neural networks through which our brain works. Let us consider a very intuitive example to understand the concept. Suppose you are sleeping in your apartment and someone rings your doorbell. You go and open the door and see that the person who has visited you is your ex-girlfriend :) But have you ever wondered how you could correctly recognize your ex-girlfriend in a millisecond? A layman would answer: because I knew her and we have met so many times before. Intuitively this is not an incorrect answer at all, and it is roughly how our brain works. When you look at a person (or someone who looks similar), your interconnected brain neurons receive an input, the image of your ex-girlfriend's face, through their dendrites and synapses, and based on this input they send an output signal through an axon to another neuron. This firing from one neuron to another performs some seemingly magical calculations in our brain, which we'll discuss later, and suddenly you classify your ex-girlfriend correctly.

Neural nets have many applications in various industries, including speech recognition and image recognition, to name a few. This blog will first cover the overview and mathematical concepts behind ANNs, then I'll walk through one iteration of the ANN process in Excel. Finally, I'll implement a very basic 3-layer ANN in Python from scratch using the Pima-Indians-Diabetes-Dataset.

Activation Functions

In this blog you'll frequently see the term activation function, so let me explain it here and then I'll just use the term throughout the blog. In a very general way, activation functions are functions which take any input $x_i$ and return an output between 0 and 1. In other words, an activation function returns a probabilistic score for a given input $x$. From the probabilistic score returned by the activation function we can set a threshold and output 1 if the score is greater than that threshold, else 0. Below you can see various activation functions used in different machine learning algorithms; the table is copied from here.


There are various activation functions used in ANNs, including the Sigmoid function, hyperbolic tangent, ReLU function etc., but in our example we shall be using the Sigmoid (logistic) function. The Sigmoid function is written as \[f(x) = \frac{1}{1+e^{-x}}\]

Mathematical Background for ANNs

We start by looking at the image below, from which we can see how we mimic a biological neuron with an artificial neuron. We use an input matrix, just as synapses provide inputs in the biological neuron; the hidden layer and activation function process that input, as dendrites do in a biological neuron; and then we fire an output based on a probability score, similar to how an axon works. The overview is quite simple and intuitive, and so is the structure. You only need to understand exactly what is going on during the whole process, and I'll help you understand that process as easily as possible.



Let $W_h$ be the weights and $b_h$ the bias at the hidden layer, and $W_o$ be the weights and $b_o$ the bias at the output layer. We apply the process shown in the process diagram below to estimate $Z$.



Once we estimate $Z$, we can determine how well we are doing by calculating the performance function $P$, which we define as \[P=\frac{(Y-Z)^2}{2} \tag{1}\] The intuition is quite clear: $P$ simply measures the difference between the estimated output ($Z$) and the original output ($Y$). The squared difference makes $P$ continuous and differentiable, so we can optimize it easily using gradient descent. This whole process is called Forward Propagation.


Now, in order to determine how $P$ changes as we change the weights $W_o$ and $W_h$, we have to calculate the partial derivatives of $P$ with respect to $W_o$ and $W_h$, or in pure math terms we need $\frac{\partial P}{\partial W_o}$ and $\frac{\partial P}{\partial W_h}$. By looking at the process diagram we can easily compute these partial derivatives using the chain rule.
\[\frac{\partial P}{\partial W_o}=\frac{\partial P}{\partial Z}\cdot\frac{\partial Z}{\partial S_o}\cdot\frac{\partial S_o}{\partial W_o} \tag{2}\]
Where $S_o = A\,{W_o}^T + b_o$ is the weighted input to the output layer and $A$ is the matrix of hidden-layer activations. From (1) we calculate
\[\frac{\partial P}{\partial Z}=-(Y-Z) \tag{3}\]

Also, if we look at the process diagram carefully,
\[\frac{\partial S_o}{\partial W_o}=A \tag{4}\]

Now we are left with $\frac{\partial Z}{\partial S_o}$. From the process diagram, we see that $S_o$ is converted to $Z$ using the activation function, so we have to calculate the derivative of our activation function
\[\frac{\partial Z}{\partial S_o}=f'(S_o)=f(S_o)\bigl(1-f(S_o)\bigr)=Z(1-Z)\]
If this form of the derivative looks weird to you, don't worry; it can be obtained with a few simple algebra tricks. So, the final version of (2) is
\[\frac{\partial P}{\partial W_o}=-(Y-Z)\,Z(1-Z)\,A\]

Let us now calculate $\frac{\partial P}{\partial W_h}$. Below we expand it using the chain rule.
\[\frac{\partial P}{\partial W_h}=\frac{\partial P}{\partial Z}\cdot\frac{\partial Z}{\partial S_o}\cdot\frac{\partial S_o}{\partial A}\cdot\frac{\partial A}{\partial S_h}\cdot\frac{\partial S_h}{\partial W_h} \tag{5}\]
Here $S_h = X\,{W_h}^T + b_h$ is the weighted input to the hidden layer and $A = f(S_h)$.

We have already computed $\frac{\partial P}{\partial Z}$ and $\frac{\partial Z}{\partial S_o}$ previously. So let us calculate the remaining partial derivatives.
\[\frac{\partial S_o}{\partial A}=W_o\]
\[\frac{\partial A}{\partial S_h}=A(1-A)\]
\[\frac{\partial S_h}{\partial W_h}=X\]

Substituting all these values into (5) we get the following
\[\frac{\partial P}{\partial W_h}=-(Y-Z)\,Z(1-Z)\,W_o\,A(1-A)\,X\]

Since we have calculated the partial derivatives, we can use gradient descent to update the weights in the following way
\[W_o \leftarrow W_o-\eta\,\frac{\partial P}{\partial W_o},\qquad W_h \leftarrow W_h-\eta\,\frac{\partial P}{\partial W_h}\]
where $\eta$ is the learning rate. The biases $b_o$ and $b_h$ are updated in the same way using their respective gradients.

Implementation in MS Excel

This implementation is intended for those who have less programming knowledge and want to see step by step what exactly is going on inside a neural network. While explaining the example in Excel, I cannot work with the complete data or iterate the process until convergence to optimal weights and biases. The intention is just to show how the matrices are calculated, how activation functions work, and how back propagation works using gradient descent. So I am taking a toy example of 10 input samples, where each input is a bit vector of 5 dimensions, which makes the input matrix 10 x 5; the output vector also contains binary values, making it a 10 x 1 matrix.


I have divided this section into three parts: forward propagation, backward propagation, and updating the weights and biases. Let us see each part in detail.

1. Forward Propagation

We start forward propagation by introducing random weights to initialize our network. In ANNs we work with several nodes (depending on the data dimensions, 5 in our case) and layers (depending on our design), so the hidden layer weight matrix is a 3 x 5 matrix which we initialize with random weights using the =RAND() function in Excel. Similarly, the bias matrix at the hidden layer is initialized with zeros and has dimension 3 x 1, one bias for each hidden node. After initializing the hidden layer's weights and biases, we do the same for the output layer. Below you can see all the initializations with =RAND() or with zeros.


After initialization, we need to do some matrix multiplication, just like sending signals to other parts of the brain through an axon. We multiply the input data matrix with the weights and add the bias at the hidden layer, i.e. $X.{W_h}^T + b_h$. Once this matrix is calculated, we apply the activation (Sigmoid) function. Below you can see the calculations using basic matrix multiplication, which can be done with =MMULT(array1, array2). The usage of this function is a bit different, which you can see here.

   


After applying the activation function, we generate a matrix w_output by multiplying the activation matrix with the weight matrix at the output layer and adding the bias which we initialized earlier. Once that is done, we apply the activation function again at the output layer to estimate our output matrix as shown below, and then we calculate the error against the true values of Y.
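For readers who prefer code to spreadsheets, the same toy forward pass can be sketched in NumPy. The random data, seed and the 3 hidden nodes mirror the Excel example above; this is an illustration, not the spreadsheet itself.

```python
import numpy as np

np.random.seed(42)  # reproducible random weights

# Toy data: 10 samples, each a 5-dimensional bit vector, with a binary target
X = np.random.randint(0, 2, size=(10, 5)).astype(float)
Y = np.random.randint(0, 2, size=(10, 1)).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hidden layer: 3 nodes -> W_h is 3 x 5; the bias is stored as a row vector for broadcasting
W_h = np.random.rand(3, 5)
b_h = np.zeros((1, 3))
# Output layer: 1 node -> W_o is 1 x 3
W_o = np.random.rand(1, 3)
b_o = np.zeros((1, 1))

# Forward propagation
hidden_activations = sigmoid(X.dot(W_h.T) + b_h)   # 10 x 3
Z = sigmoid(hidden_activations.dot(W_o.T) + b_o)   # 10 x 1, estimated output
E = Y - Z                                          # error against the true values
```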


2. Backward Propagation

Once we have calculated the total change at the hidden layer and the output layer, we need to update the weights and biases at each layer. To update the weights at the hidden layer we multiply the input matrix with the delta_hidden_activations matrix and then multiply the result by the learning rate. Similarly, the weights at the output layer are updated by multiplying the delta_output matrix with the hidden activation matrix and then multiplying by the learning rate. The bias matrices are updated simply by summing the deltas at each node in the network and multiplying by the learning rate.
So far we have worked through the matrices and activation functions and obtained our estimated output. But the output, or rather the weights we determined, are not optimal; in other words, these weights do not minimize the loss function. For that we would need to run this process many times, calculating the weights and biases at each iteration and updating the old ones. As discussed earlier, we'll cover only one iteration, so our objective now is to update our initialized weights and biases with the new values we determined.

Weight updates are done using gradient descent, where we calculate the derivatives at each layer and sequentially update all the weights. We start by calculating the slope of the estimated output. We know that the derivative of the sigmoid function is $x(1-x)$, so we can calculate the slope matrix using =X*(1-X) in Excel. After determining the slope matrix we need the total change at the output layer, delta_output. It is calculated by multiplying the error matrix E, which was calculated during forward propagation, by the slope_output matrix.


After calculating the total change at the output layer, it is time to calculate the total change at the hidden activation layer. For this we calculate the slope of the hidden activation matrix using =X*(1-X) again, then we calculate the error matrix at the hidden activation layer by multiplying the delta_output matrix with the weight matrix at the output layer. Once the error matrix is calculated, we do the same as before and multiply the error matrix by the slope matrix to get the delta_hidden_activations matrix, in other words the total change at the hidden layer.
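Continuing the NumPy sketch from the forward-propagation section, the same delta calculations look like this:

```python
# Slope (sigmoid derivative) at the output and hidden layers: x * (1 - x)
slope_output = Z * (1 - Z)                                     # 10 x 1
slope_hidden = hidden_activations * (1 - hidden_activations)   # 10 x 3

# Total change at the output layer
delta_output = E * slope_output                                # 10 x 1

# Error propagated back to the hidden layer, then the total change there
error_hidden = delta_output.dot(W_o)                           # 10 x 3
delta_hidden_activations = error_hidden * slope_hidden         # 10 x 3
```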



3. Updating weights
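Continuing the same NumPy sketch, the updates described above can be written as follows (the learning rate value is an assumption):

```python
learning_rate = 0.1  # assumed value

# Gradient-descent updates for the weights and biases at both layers
W_o = W_o + hidden_activations.T.dot(delta_output).T * learning_rate             # 1 x 3
b_o = b_o + delta_output.sum(axis=0, keepdims=True) * learning_rate              # 1 x 1
W_h = W_h + X.T.dot(delta_hidden_activations).T * learning_rate                  # 3 x 5
b_h = b_h + delta_hidden_activations.sum(axis=0, keepdims=True) * learning_rate  # 1 x 3
```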



These updated weights are now used in the second iteration of forward propagation.

Note: A complete example in Excel with formulas is available here so you can either use the same data or even change it to see different results.


Implementation in Python

I have explained the complete process in detail in the Excel implementation above, so here I'll just explain what each method does; you can find the complete, ready-to-run code here on github. We start by loading our data and splitting it into train and test sets. Our data set has continuous variables, so we'll also standardize the data using $(x-\mu) / \sigma$. Note that these methods apply directly only to data sets with the same structure as our Pima data set; otherwise you will have to tweak them a little.

We then implement our activation (Sigmoid) function and its slope function to calculate values at the hidden and output layer during each iteration.


Finally we implement the two most important methods in the algorithm, fit() and predict(). The first method trains our model using the whole process defined in the Excel implementation, so there is not much to add about that. The second method, predict(), takes a matrix of test inputs, applies the optimal weights and biases calculated during model training, and returns a matrix which classifies each input as 1 or 0.
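The actual fit() and predict() are in the github repository; purely as a rough skeleton of that structure (hidden size, learning rate and number of epochs are assumptions), they might look like this:

```python
import numpy as np

class SimpleANN:
    """Rough skeleton of a 3-layer network; the real implementation is on github."""

    def __init__(self, n_inputs, n_hidden=8, learning_rate=0.1, epochs=5000):
        self.lr = learning_rate
        self.epochs = epochs
        self.W_h = np.random.rand(n_hidden, n_inputs)
        self.b_h = np.zeros((1, n_hidden))
        self.W_o = np.random.rand(1, n_hidden)
        self.b_o = np.zeros((1, 1))

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fit(self, X, Y):
        for _ in range(self.epochs):
            # forward propagation
            hidden = self._sigmoid(X.dot(self.W_h.T) + self.b_h)
            Z = self._sigmoid(hidden.dot(self.W_o.T) + self.b_o)
            # backward propagation (deltas at the output and hidden layers)
            delta_out = (Y - Z) * Z * (1 - Z)
            delta_hidden = delta_out.dot(self.W_o) * hidden * (1 - hidden)
            # gradient-descent updates
            self.W_o += hidden.T.dot(delta_out).T * self.lr
            self.b_o += delta_out.sum(axis=0, keepdims=True) * self.lr
            self.W_h += X.T.dot(delta_hidden).T * self.lr
            self.b_h += delta_hidden.sum(axis=0, keepdims=True) * self.lr
        return self

    def predict(self, X):
        hidden = self._sigmoid(X.dot(self.W_h.T) + self.b_h)
        Z = self._sigmoid(hidden.dot(self.W_o.T) + self.b_o)
        return (Z > 0.5).astype(int)   # classify each input as 1 or 0
```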

In order to run the code successfully, you need to clone it from github here. In this blog I have just explained the methods used to train our neural network. After cloning the code from github, you just need to download the Pima data set from the link mentioned above and you are good to go.