Instacart market basket analysis: Part 1

Sachin Kumar
8 min read · Jun 21, 2021

In-depth EDA and problem formulation

Solving the Instacart market basket analysis challenge with machine learning to enhance the customer's shopping experience.


This is Part 1 of my three-part series on Instacart market basket analysis. The remaining parts of the series can be found here:

In this part we will understand the problem statement and perform in-depth EDA on the data to explore techniques for solving it.

Table of Contents:

  1. Description
  2. Business problem and constraints
  3. Data Overview
  4. Machine learning problem
  5. Performance metric
  6. In-depth EDA
  7. References

Description

Here we are trying to solve the Kaggle challenge "Instacart Market Basket Analysis".
Instacart is an online platform/app that lets people shop for groceries online and have them delivered to their doorstep. After we select the items we wish to purchase, a delivery person reviews the order, physically shops for us at a nearby store, and delivers the items to us.

This saves customers a lot of time on mundane daily shopping, time that can be spent on something more productive.

Business problem and constraints

Instacart wants to enhance the customer's shopping experience by suggesting products the customer may order again in a 'Buy these again' section. This saves the customer a lot of time by letting them skip the search-and-select process.

It will also help Instacart support local vendors' retail businesses by reminding them to restock items that are bought frequently.

Problem statement

Based on a customer's past orders/purchases, we have to predict whether an item will be in the customer's next order or not. So, given a new order, we have to predict the products that he/she is likely to order again. The results are evaluated using the mean F1 score.

Data Overview

The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 users. For each user, between 4 and 100 of their orders are provided, along with the sequence of products purchased in each order. The day of the week and hour of day each order was placed, and a relative measure of time between orders, are also provided.

Data

The dataset is divided into three parts to make the analysis easier.

  • Prior data - past orders of users. Between 4 and 100 orders per user are provided, which we will use for our analysis and predictions.
  • Train data - the current order of a user. Using this as the current order, we check which products have been reordered.
  • Test data - orders that will be placed in the future, for which we have to predict the products that might be reordered.
  1. orders.csv : This file contains details of an order such as order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day and days_since_prior_order. Here eval_set tells us which category the order belongs to: prior, train or test.
  2. products.csv : This file contains a product's information: product_id, product_name, aisle_id and department_id.
  3. order_products__prior.csv : details of previous orders and the products in them (which products were bought in each order): order_id, product_id, add_to_cart_order, reordered.
  4. order_products__train.csv : contains the last (current) order of some customers only.
  5. departments.csv : details of departments.
  6. aisles.csv : details of aisles.

Let’s have a peek into these tables :
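A minimal loading sketch, assuming the competition CSVs sit in the working directory (file names follow the Kaggle dataset):

```python
import pandas as pd

# Load the six files provided in the competition dataset
orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')
order_products_prior = pd.read_csv('order_products__prior.csv')
order_products_train = pd.read_csv('order_products__train.csv')
departments = pd.read_csv('departments.csv')
aisles = pd.read_csv('aisles.csv')

# Peek at the shape and first few rows of each table
for name, df in [('orders', orders), ('products', products),
                 ('departments', departments), ('aisles', aisles)]:
    print(name, df.shape)
    print(df.head(), '\n')
```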

orders.csv :

sample data in orders.csv

In the above table, the important thing to notice is the 'eval_set' column. It tells whether a specific order belongs to the prior, train or test set. Using this, we can separate the prior order history from the data we train on.

products.csv :

sample data in products.csv
  • Here, for each product we have a unique product id along with its name and the aisle id and department id it belongs to.

order_products__prior.csv :

data in order_products__prior.csv
  • Every row in this table tells us, for a given order id, which product id was ordered in it and whether that item is a reorder for that user, via the reordered column: 1 means the item was reordered, 0 means it was ordered for the first time.

order_products__train.csv :

sample data in order_products__train.csv
  • Same structure as the order_products__prior table, except it holds the current order we train on.

departments.csv:

sample data in departments.csv
  • This table contains the mapping of each department id to its name.

aisles.csv :

  • This table contains the mapping of each aisle id to its name.

Machine learning problem

We can pose this as a classification problem: given a customer and their previous orders, predict whether a product will be in their next order or not.

It could also have been posed as a multi-label classification problem, which involves predicting zero or more non-mutually-exclusive class labels. But since we have a huge number of products, we will stick to the binary classification approach. So, essentially we have to predict the reordered column of the data as the target variable.

So, for every order id we will classify each product against it as reordered or not.
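A minimal, hypothetical sketch of how this binary target could be constructed, assuming the tables have been loaded as pandas dataframes named orders, order_products_prior and order_products_train (this is only an illustration of the framing, not the final pipeline):

```python
# Attach user_id to each prior order-product row
prior = order_products_prior.merge(orders[['order_id', 'user_id']], on='order_id')

# Candidate pairs: every (user, product) the user has bought before
candidates = prior[['user_id', 'product_id']].drop_duplicates()

# Products that were actually reordered in the user's train (current) order
train = order_products_train.merge(orders[['order_id', 'user_id']], on='order_id')
reordered_now = train.loc[train['reordered'] == 1, ['user_id', 'product_id']]
reordered_now['target'] = 1

# Left join: candidate pairs missing from the train order get target = 0
data = candidates.merge(reordered_now, on=['user_id', 'product_id'], how='left')
data['target'] = data['target'].fillna(0).astype(int)
```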

Performance metric

We will be using the mean F1 score (the F1 score computed per order and averaged over all orders) as the performance metric.

First, we take all the products predicted as reordered for an order and combine them into a space-separated string of product_ids. The F1 score is then calculated per order between the predicted and actual sets of reordered products, and finally the mean F1 score is taken over all orders.
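A small sketch of how the per-order F1 and its mean could be computed (the toy truth/prediction data below is made up purely for illustration):

```python
def order_f1(true_products, pred_products):
    """F1 score between the true and predicted sets of reordered products for one order."""
    true_set, pred_set = set(true_products), set(pred_products)
    if not true_set and not pred_set:
        return 1.0                      # both empty: count as a perfect prediction
    tp = len(true_set & pred_set)       # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)

# Toy example: per-order truth and predictions as lists of product_ids
truth = {1: ['196', '12427'], 2: ['39276']}
preds = {1: ['196'],          2: ['39276', '47209']}
mean_f1 = sum(order_f1(truth[o], preds[o]) for o in truth) / len(truth)
print(mean_f1)   # ~0.667
```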

reference — https://www.kaggle.com/c/instacart-market-basket-analysis/data

EDA - Exploratory data analysis

EDA is an important aspect of any machine learning or data science problem. It helps us gain insights into the data through careful examination, and it matters because it exposes trends, patterns and relationships that are not readily apparent.

Check for null/missing values in data :

columns containing null values

We can see that we have null/missing values in the days_since_prior_order column of the orders table.

percentage of null values :

We will handle the null values, as they are not helpful in our classification task. After analysis, I came to the conclusion that all nulls in the 'days_since_prior_order' column occur because those rows are a user's first order. We will impute these with the value 0, as shown below.
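A minimal sketch of the check and the imputation (assuming the orders dataframe loaded earlier):

```python
# Count missing values per column in the orders table
print(orders.isnull().sum())

# The nulls appear only for a user's first order (order_number == 1)
print(orders.loc[orders['days_since_prior_order'].isnull(), 'order_number'].unique())

# Impute those first-order gaps with 0
orders['days_since_prior_order'] = orders['days_since_prior_order'].fillna(0)
```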

Analyzing orders table :

orders by day of the week -

The above bar plot tells us how many orders were placed on each day of the week. From the distribution we can assume that 0 represents Saturday and 1 represents Sunday, as most orders generally happen on the weekend.
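A minimal sketch for producing this kind of bar plot with seaborn (assuming the orders dataframe from above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Number of orders placed on each day of the week (0-6)
plt.figure(figsize=(8, 4))
sns.countplot(x='order_dow', data=orders)
plt.xlabel('Day of week')
plt.ylabel('Number of orders')
plt.title('Orders by day of the week')
plt.show()
```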

orders by hour of day:

The above visualization tells us at which hours customers order the most. We can notice that the order count starts rising at 8 am and stays high until about 9 pm.
From 9 am to 8 pm we see a good number of orders compared to other hours.

days since prior order:

In the above plot we can see that most reorders happen on the 7th day, i.e. after a week's gap. Days 1 to 6 and 8 also have a fair number of reorders. One thing to notice is the large spike at day 30, i.e. after a month. Since 30 is the highest value in the column, we can safely assume it is a cap: any gap of 30 days or more since the prior order is recorded as 30, which explains the huge number of orders on that day.

Merge data:

Now we will merge order_products_prior with the products, aisles and departments tables for univariate analysis.
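A minimal merge sketch (assuming the dataframes loaded earlier and the standard key columns product_id, aisle_id and department_id):

```python
# Enrich each prior order-product row with product, aisle and department names
merged = (order_products_prior
          .merge(products, on='product_id', how='left')
          .merge(aisles, on='aisle_id', how='left')
          .merge(departments, on='department_id', how='left'))
print(merged.head())
```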

We will now do univariate analysis for each component w.r.t. orders placed:

  1. Total number of orders by department-

The Produce department (which contains products like vegetables and fruits) has the highest number of orders, which makes sense as these are essentials. Dairy & eggs is the next most ordered-from department.

2. Top 5 aisles by number of orders -

Fresh fruits, fresh vegetables, packaged vegetables & fruits, yogurt and packaged cheese are the top 5 aisles by number of orders. This also explains why the Produce and Dairy & eggs departments had the most orders.

3. Total orders by products:

Banana is the most ordered product, followed by Bag of Organic Bananas, which is also bananas. The other top products are also from the produce category.
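These counts can be reproduced with simple value_counts calls on the merged dataframe (assuming it carries the department, aisle and product_name columns from the merge above):

```python
# Total number of order-product rows per department, aisle and product
print(merged['department'].value_counts().head(10))
print(merged['aisle'].value_counts().head(5))
print(merged['product_name'].value_counts().head(10))
```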

Univariate analysis for each component w.r.t. reorders:

  1. Reorders w.r.t products (which products are reordered most):

The most reordered products are similar to the most ordered products, which suggests that reorders make up a huge chunk of the orders data. We will verify this below as well.

2. Reorders with respect to departments:

Here too, the Produce department has the highest number of reorders, followed by Dairy & eggs, similar to the distribution of orders.

3. Reorders w.r.t day of week feature:

We cannot see any meaningful difference in reorder status across this feature.

  4. A user's total number of orders vs. whether a product is reordered or ordered for the first time:

In the number-of-orders feature we can definitely see a difference between reordered and non-reordered products. The IQR (interquartile range) of num_orders where the product is not reordered is roughly 10 to 35, whereas the IQR for reordered products is roughly 19 to 47. Therefore, we can infer that when a user's total number of orders is high, the chances of reorders are also high.
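A rough sketch of how this comparison can be made. Here num_orders is a derived feature, each user's total number of orders taken from the orders table; this is an illustrative sketch, not the exact notebook code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Each user's total number of orders (max order_number per user)
num_orders = (orders.groupby('user_id')['order_number'].max()
              .rename('num_orders').reset_index())

# Attach user_id and num_orders to the prior order-product rows
prior_users = (order_products_prior
               .merge(orders[['order_id', 'user_id']], on='order_id')
               .merge(num_orders, on='user_id'))

# Distribution of num_orders for reordered (1) vs first-time (0) items
sns.boxplot(x='reordered', y='num_orders', data=prior_users)
plt.show()
```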

We will also do some more EDA after generating new features using feature engineering techniques.

In the next part we will discuss:

  • Feature engineering and some analysis on the new features generated.
  • Merging dataframes to create the final data to train on.
  • Modelling using different machine learning and deep learning algorithms.
  • Optimizing results and selecting the best model.

You can find the next part here.

You can find the code on my GitHub profile.

Link to my LinkedIn profile.

References :

https://www.kaggle.com/c/instacart-market-basket-analysis/data
