For one project I was tasked with predicting the sales of stores for a retail company. To do this I was given two years of sales data at a daily level. Normally one would try to find out what drives the sales, but after some discussion it was decided to look for different types of stores with different sales trends. For example, think of stores in popular vacation spots with higher sales in the summer than other stores, or stores that peak around certain holidays. While we saw a very clear trend in our data, we could not tell whether there were truly different groups, since the clusters were very close to each other. For the groups to be truly different, they would need to be consistent over time. This is also necessary in order to use the groups for predicting sales. This led to the hypothesis that I will test in this blog:
If clusters remain the same over time, we can use clustering to make predictions.
So what is clustering? According to Wikipedia, cluster analysis is “the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).” There are several clustering algorithms for dividing the data into groups of similar stores, but I went with K-means because it is the simplest and because, in my case, it was quite handy to hand-pick the number of clusters. Usually clustering is done to learn about your data before doing the main analysis on it. In this case we use the clusters as a sort of prediction, because we assume that stores within a cluster are similar. This is effective because we don’t have many years of data to work with, and this way we can use the data of similar stores to predict the sales of a single store.
Training sets and test sets
When predicting, analysts often split their data into a training set and a test set. The training data is used to train the prediction model, and the test data is used to check how well the model performs on new data. This guards against overfitting. Overfitting happens when we tailor our model so closely to the training dataset that our predictions are based too much on the variance of the data. This is illustrated in the following picture. The dots are our data points, the green line is the true distribution, and the yellow line is the prediction based on the training data. One can see that the yellow line is not smooth like the green line: it tries to adjust to the outliers, which is not what we want.
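The effect is easy to reproduce. The sketch below (my own minimal example, not the project's actual data) fits a low-degree and a high-degree polynomial to noisy samples of a smooth curve: the flexible model fits the training points better but tends to do worse on the held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples around a smooth "true" curve.
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)

# Hold out every third point as a test set.
train = np.ones(x.size, dtype=bool)
train[::3] = False
test = ~train

def fit_and_score(degree):
    # Fit a polynomial on the training points only,
    # then measure mean squared error on both splits.
    coefs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coefs, x)
    train_err = np.mean((pred[train] - y[train]) ** 2)
    test_err = np.mean((pred[test] - y[test]) ** 2)
    return train_err, test_err

low = fit_and_score(3)    # smooth model
high = fit_and_score(9)   # flexible model: lower train error, usually worse test error
print("degree 3 (train, test):", low)
print("degree 9 (train, test):", high)
```

The high-degree fit plays the role of the yellow line in the picture: it chases the noise in the training points instead of the underlying curve.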
Indeed, a model that works very well on training data may not give good results on newly encountered data. For clustering it is not very intuitive to find out whether the stores truly belong to a cluster, or whether variance causes a store to land far from where it is actually supposed to be. Checking this requires at least two instances of every observation (here: two years of sales per store), which is not always available. My business case was well suited for this approach, so I constructed an experiment to tackle the problem.
To test this hypothesis I generated two datasets, one per year, each containing weekly sales data for a whole year. Within each dataset there were three groups:
- Group with higher sales in the summer.
- Group with higher sales in the winter.
- Group with constant sales throughout the year.
All sales data is normally distributed with a standard deviation of 1. The means of the sales are shown in the graph above (shifted by +10, since we can’t have negative sales). Finally, each store’s weekly sales are divided by that store’s total sales for the year, giving the percentage of yearly revenue the store generates in a particular week. This standardizes the sales across stores. The sales then look something like this:
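A dataset like this can be generated with a short script. The seasonal shapes (a sine around a mean of 10) and the 30 stores per group are my assumptions, chosen to match the description above; the original graph may have used different profiles.

```python
import numpy as np

rng = np.random.default_rng(42)
weeks = np.arange(52)

# Assumed seasonal mean profiles, shifted by +10 so sales stay positive.
summer_mean = 10 + 2 * np.sin(2 * np.pi * weeks / 52)   # peaks mid-year
winter_mean = 10 - 2 * np.sin(2 * np.pi * weeks / 52)   # peaks at year ends
flat_mean   = np.full(52, 10.0)                         # constant sales

def make_group(mean_profile, n_stores=30):
    # Weekly sales: group mean plus unit-standard-deviation noise.
    sales = rng.normal(mean_profile, 1.0, size=(n_stores, 52))
    # Normalize each store to fractions of its own yearly total.
    return sales / sales.sum(axis=1, keepdims=True)

data = np.vstack([make_group(m) for m in (summer_mean, winter_mean, flat_mean)])
print(data.shape)  # (90, 52): 90 stores, 52 weekly revenue shares each
```

After normalization every row sums to 1, so stores are compared on the shape of their year, not their absolute size.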
Now let’s cluster this data. I used K-means clustering with Euclidean distance. The algorithm did a great job of finding the different groups of stores: it divides the stores almost perfectly into three even groups, just as I constructed the data:
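In practice one would reach for a library implementation such as scikit-learn's KMeans, but to keep the sketch dependency-free, here is a minimal Lloyd's-algorithm version run on dummy data generated as described above (the seasonal profiles, group sizes, and seed-point initialization are my simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(52)
# Assumed seasonal mean profiles for the three constructed groups.
profiles = [10 + 2 * np.sin(2 * np.pi * weeks / 52),   # summer peak
            10 - 2 * np.sin(2 * np.pi * weeks / 52),   # winter peak
            np.full(52, 10.0)]                         # constant
sales = np.vstack([rng.normal(p, 1.0, size=(30, 52)) for p in profiles])
shares = sales / sales.sum(axis=1, keepdims=True)      # yearly revenue shares

def kmeans(X, centers, n_iter=100):
    # Minimal Lloyd's algorithm with Euclidean distance.
    for _ in range(n_iter):
        # Assign each store to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster empties.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Initialize from one seed store per constructed group (a simplification;
# real K-means restarts from several random initializations).
labels, _ = kmeans(shares, shares[[0, 30, 60]].copy())
print(np.bincount(labels))  # three groups of roughly 30 stores each
```

With well-separated profiles the recovered groups line up with the constructed ones almost store for store.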
We now know that each individual dataset can be divided into three groups. However, we do not know whether this holds for the stores in general: the algorithm could have found completely different clusters for each year. This is why we cross-validate, checking whether the groups are consistent over time, under the assumption that the sales of each store follow the same random distribution each year (which is the case for my data). If this assumption did not hold, we couldn’t infer next year’s sales from previous years’ sales, which would make the analysis pointless, or at least very hard. We will now check whether the groups are consistent over time:
It would seem that the clusters are very consistent over time. Note that cluster numbering can differ each time one runs the analysis, so we shouldn’t expect cluster 0 of 2017 to be the same one as cluster 0 of 2018. Instead we should look at this graph and see that all of the stores in cluster 0 of 2017 ended up in cluster 1 of 2018. This is a very good result, even though the differences between the groups were numerically clear by construction. Only one out of the 90 stores moved to a different cluster over the years, so we can conclude that the clusters we found are consistent over time. This means we can make predictions based on the groups the stores belong to. This experiment was done with dummy data; in practice I aim to have at least 90% of the stores stay in the same cluster over the years. Otherwise we should conclude that the clustering is based more on variance and overfitting than on the true distribution of the sales.
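The consistency check itself can be sketched as a contingency table between the two years' labels. The label arrays below are hypothetical stand-ins for the output of clustering each year separately; matching each 2017 cluster to its most common 2018 counterpart is a simple heuristic (an exact version would use optimal assignment, e.g. Hungarian matching):

```python
import numpy as np

# Hypothetical cluster labels for the same 90 stores in two years.
# The numbering is arbitrary, so the label values need not match.
labels_2017 = np.repeat([0, 1, 2], 30)
labels_2018 = np.repeat([1, 2, 0], 30)   # same grouping, renumbered
labels_2018[5] = 2                       # one store switches cluster

# Contingency table: rows = 2017 clusters, columns = 2018 clusters.
table = np.zeros((3, 3), dtype=int)
for a, b in zip(labels_2017, labels_2018):
    table[a, b] += 1

# Match each 2017 cluster to its most common 2018 counterpart and
# count how many stores stayed with their group.
stayed = table.max(axis=1).sum()
print(table)
print(f"{stayed}/{len(labels_2017)} stores stayed in the same group")
# prints "89/90 stores stayed in the same group"
```

Here 89 of 90 stores stay with their group, which clears the 90% threshold; a much lower number would point to variance-driven clusters.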
One major problem with this analysis, at least in its current form, is the curse of dimensionality: with many variables it becomes difficult for the algorithm to pick out the information that matters. Here we have 52 weeks of sales data. Suppose the distributions of all clusters were the same except for one week. With so many weeks to base the clusters on, the best Euclidean distance might be achieved by defining clusters based on slight, variance-driven differences across all the weeks, instead of on the one week where it matters. I addressed this by working with monthly data instead of weekly data; the downside is that aggregating the data can lose information. One should therefore closely inspect the results of a clustering before applying it everywhere.
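The aggregation step is a one-liner. For simplicity the sketch below collapses the 52 weeks into 13 four-week "months" (an assumption on my part; real calendar months would need a proper week-to-month mapping):

```python
import numpy as np

rng = np.random.default_rng(1)
weekly = rng.normal(10, 1, size=(90, 52))   # 90 stores, 52 weeks of sales

# Collapse 52 weeks into 13 four-week periods, cutting the number
# of features per store from 52 to 13 before clustering.
monthly = weekly.reshape(90, 13, 4).sum(axis=2)
print(monthly.shape)  # (90, 13)
```

Summing (rather than averaging) keeps each store's yearly total unchanged, so the normalization to revenue shares works exactly as before.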