Instacart market basket analysis: Part 3

Sachin Kumar
5 min read · Jun 22, 2021

Training models to predict reorders

Solving the Instacart market basket analysis challenge with machine learning to enhance the customer's shopping experience.

In our previous section we saw how to come up with new features using feature engineering techniques.

If you have not read my previous blogs on this problem, please refer to Part 1 and Part 2 first.

In this section we will focus on various machine learning modelling techniques to solve our problem of recommending products to a user based on their purchase history.

Table of contents:

  1. Machine learning approach to our problem
  2. Discussion on Loss or Cost function
  3. Modelling techniques
  4. Submission and final score
  5. Further improvements

A machine learning approach to the problem:

I was initially confused about how to approach this problem from a modelling point of view, i.e. what type of problem is it?

Is it binary classification, regression, or multi-class classification? This was the part that confused me the most. At first I thought of it as a multi-label classification problem: given an order id or user id, we have to predict some set of products that could be reordered by the user based on past orders.

But the problem with this approach is that there is a huge number of products (around 50k), and the number of predicted products can range from 0 to n for each order id. Creating such a huge one-hot target vector is not feasible.

Therefore, we transformed the problem from multi-label classification into binary classification: given an order id, we take the user of that order and, for each candidate product, calculate the probability of it being in the user's next order.

In short, our problem statement is now: given a (user, product) pair, predict reordered (0 or 1).
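To make this concrete, here is a minimal sketch of how such (user, product) training rows can be built from the competition CSVs. The dataframe and variable names are illustrative; the actual pipeline in my repo may differ.

```python
import pandas as pd

# Load orders and the prior order contents (competition CSVs).
orders = pd.read_csv("orders.csv")
prior = pd.read_csv("order_products__prior.csv").merge(
    orders[orders["eval_set"] == "prior"], on="order_id"
)

# Candidate pairs: every product each user has ever purchased.
candidates = prior[["user_id", "product_id"]].drop_duplicates()

# Label: 1 if the product appears in the user's "train" order, else 0.
train_orders = pd.read_csv("order_products__train.csv").merge(
    orders[orders["eval_set"] == "train"], on="order_id"
)
labels = train_orders[["user_id", "product_id"]].assign(reordered=1)

data = candidates.merge(labels, on=["user_id", "product_id"], how="left")
data["reordered"] = data["reordered"].fillna(0).astype(int)
# (For training we would keep only users whose final order is in the
# train split; the "test" users are what we predict on later.)
```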

Loss or Cost function:

As we have converted our problem to binary classification, we will use log loss. We want the model to assign as high a probability as possible to the correct label, which log loss directly rewards.
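For reference, the binary log loss over N training pairs, where y_i ∈ {0, 1} is the true reordered label and p_i is the predicted probability, is:

```latex
\mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \,\big]
```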

Threshold for probabilities:

The general rule of thumb is to set the classification threshold at 0.5: anything with predicted probability greater than 0.5 is classified as the positive class, and anything lower as the negative class.

Here it is not so simple. As we already discussed, we are using the mean F1 score as the performance metric to evaluate the model, so we should set the threshold such that it maximizes that score. After several experiments I fixed the threshold at 0.22: a product with predicted probability greater than this threshold is classified as positive, i.e. predicted to be in the user's next order.
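As a rough sketch of how such a threshold can be chosen (assuming validation-set probabilities probs_val and labels y_val from a held-out split; plain binary F1 is used here as a proxy for the competition's per-order mean F1):

```python
import numpy as np
from sklearn.metrics import f1_score

# Sweep candidate thresholds and keep the one with the best F1.
best_t, best_f1 = 0.5, 0.0
for t in np.arange(0.05, 0.50, 0.01):
    f1 = f1_score(y_val, (probs_val > t).astype(int))
    if f1 > best_f1:
        best_t, best_f1 = t, f1

print(f"best threshold = {best_t:.2f}, F1 = {best_f1:.3f}")
```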

Modelling:

Now that we have all the concepts clear and the data ready for training, we will start the modelling process. For each (user, product) pair, the classifier is fed both the features that were already available and the newly engineered features from the previous part.

Note that we will be doing hyperparameter tuning using RandomizedSearchCV.

Model 1: SGDClassifier

We used the SGDClassifier model from sklearn, trying both log loss and hinge loss.

Using the best parameters found for SGDClassifier, we got a very low F1 score: 0.055.
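For illustration, the tuning-plus-training pattern used for each model might look like this with SGDClassifier (the parameter grid here is a placeholder, not the exact one from the project):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# loss="log_loss" gives a logistic model with predict_proba, which we
# need for thresholding (older sklearn versions call this loss "log").
sgd = SGDClassifier(loss="log_loss", random_state=42)

param_dist = {
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "penalty": ["l1", "l2"],
}

search = RandomizedSearchCV(
    sgd, param_distributions=param_dist,
    n_iter=10, scoring="f1", cv=3, n_jobs=-1,
)
search.fit(X_train, y_train)  # X_train, y_train from the pairs above
print(search.best_params_)
```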

Model 2: Decision Trees

Using a decision tree with the parameters obtained from hyperparameter tuning, we got an F1 score of 0.221.

Model 3: Random forest

Using a random forest, we got an F1 score of 0.365.

Model 4: XGBoost Classifier

Here we got the highest F1 score so far among the models we tried: 0.413.
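A minimal sketch of this model, assuming a train/validation split of the candidate pairs (the parameter values are illustrative, not the tuned ones):

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    eval_metric="logloss",
)
xgb.fit(X_train, y_train)

# Predict probabilities and apply the 0.22 threshold chosen earlier.
val_probs = xgb.predict_proba(X_val)[:, 1]
val_preds = (val_probs > 0.22).astype(int)
```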

Model 5: AdaBoost Classifier

Here we got an F1 score of 0.178.

Model 6: LightGBM classifier

Using the LightGBM model we got a good F1 score: 0.4106.
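A similar sketch for LightGBM (again with placeholder parameters; the tuned values came from RandomizedSearchCV):

```python
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    n_estimators=500,
    num_leaves=64,
    learning_rate=0.05,
)
lgbm.fit(X_train, y_train)
val_probs = lgbm.predict_proba(X_val)[:, 1]
val_preds = (val_probs > 0.22).astype(int)
```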

Deep learning models: Multi Layer Perceptron

In my experiments, deep learning models did not give good results. This may be because of the type of features we used to train the model, so I decided to stick to classical machine learning algorithms.

Model performances and comparison:

F1 scores for the different models:

Model              F1 score
SGDClassifier      0.055
Decision Tree      0.221
Random Forest      0.365
AdaBoost           0.178
XGBoost            0.413
LightGBM           0.4106

Submission and final score:

The submission file will contain the predicted reordered products for each order id. This applies only to the rows in orders.csv where the eval_set column is equal to 'test'.

Prediction of product being in a user’s next order:

We will use the LightGBM model for our final predictions on the test data, as it gave the best results.

Finally, after dropping unnecessary columns, our dataframe contains exactly what we want: for each unique (user, product) pair, a prediction of whether the product will be ordered by the user in their next order.

Now our final dataframe for submission is ready. Notice that we removed the user id column, as it is no longer required to create the final submission file.
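As a sketch of this last step, assuming a dataframe preds_df with columns order_id, product_id and the binary prediction reordered (names are illustrative), the submission file can be assembled like this. Note that orders with no predicted products must still appear, with the string "None":

```python
import pandas as pd

# Space-separated product ids for each order with at least one positive.
sub = (
    preds_df[preds_df["reordered"] == 1]
    .groupby("order_id")["product_id"]
    .apply(lambda ids: " ".join(map(str, ids)))
    .reset_index()
    .rename(columns={"product_id": "products"})
)

# Re-attach orders where nothing was predicted and fill with "None".
all_orders = preds_df[["order_id"]].drop_duplicates()
sub = all_orders.merge(sub, on="order_id", how="left")
sub["products"] = sub["products"].fillna("None")

sub.to_csv("submission.csv", index=False)
```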

Final Results:

After submitting the trained model's predictions to Kaggle, I got an F1 score of 0.35446 on the leaderboard with the LightGBM model.

Further Improvements:

The F1 score for this challenge can be pushed up to around 0.40 using methods and techniques available online. As the main focus of this project was learning to analyze data properly and exploring various feature engineering techniques, we can pick that up later.

One such technique is the F1 optimization (maximization) approach discussed by the 2nd place winner of the challenge here.

You can find the code on my GitHub profile.

Link to my LinkedIn profile

