Instacart market basket analysis: Part 2

Sachin Kumar
6 min read · Jun 22, 2021

Feature engineering and analysis

Solving the Instacart market basket analysis challenge with machine learning to enhance customers' shopping experience.


As you have seen in my last blog, we discussed the problem statement and also did an in-depth EDA of the data provided by Instacart. If you haven't seen my last blog, please refer to it here.

This is Part 2 of the three part series of my blog on Instacart market basket analysis.

Other parts of this series can be found here:

In this part we will look into some feature engineering techniques to generate new features from our data, which will help us solve our problem better.

Table of contents:

  1. Description
  2. Generating new features
  3. Univariate analysis of new features
  4. Creating final data for model training
  5. References

Description:

Feature engineering is the heart of the solution to this problem, as we can derive many new features based on users, products and orders. These new features will definitely help us analyze the data and discover more insights from it.

Generating new features:

  1. Number of orders for each user:

This feature tells us how many orders each user has placed in total so far.
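A minimal pandas sketch on toy data (the column names `user_id`, `order_id`, `order_number` follow the Instacart schema; the author's exact code isn't shown here):

```python
import pandas as pd

# Toy orders table in the style of Instacart's orders.csv
orders = pd.DataFrame({
    "order_id":     [10, 11, 12, 20, 21],
    "user_id":      [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
})

# num_orders: total number of orders placed by each user so far
orders["num_orders"] = orders.groupby("user_id")["order_id"].transform("count")
```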

2. Order length:

This feature tells us the order length for each order id, i.e. how many products are in that order's cart. It may help us determine which products to suggest based on the products already selected.
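One way to compute this, sketched on toy data with the `order_products` schema (assumed, since the original code isn't reproduced here):

```python
import pandas as pd

# Toy order_products table: one row per product added to an order
order_products = pd.DataFrame({
    "order_id":          [10, 10, 10, 11, 11],
    "product_id":        [101, 102, 103, 101, 104],
    "add_to_cart_order": [1, 2, 3, 1, 2],
})

# order_length: how many products each order's cart contains
order_products["order_length"] = (
    order_products.groupby("order_id")["product_id"].transform("count")
)
```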

3. Order importance:

Order importance is defined here as the ratio of the order number (e.g. a customer's 1st, 2nd, ... order) to the total number of orders made by the user in the past. This gives more weight to recent orders, thus giving more importance to recent purchasing behavior: order_importance = order_number / num_orders.
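The formula above translates directly into pandas (toy data; in the Instacart data `order_number` runs 1..n per user, so the per-user maximum equals the user's total order count):

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id":      [1, 1, 1],
    "order_number": [1, 2, 3],
})

# num_orders from the per-user maximum order_number
orders["num_orders"] = orders.groupby("user_id")["order_number"].transform("max")

# order_importance = order_number / num_orders: recent orders score closer to 1
orders["order_importance"] = orders["order_number"] / orders["num_orders"]
```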

4. Product importance:

This feature assigns more importance to the products that are added earlier to the cart in each order. It is defined as (order_length - add_to_cart_order + 1) / order_length. This defines a product's importance within each order.
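Applying the stated formula on toy data: the first product in the cart scores 1, and the last scores 1/order_length.

```python
import pandas as pd

order_products = pd.DataFrame({
    "order_id":          [10, 10, 10],
    "add_to_cart_order": [1, 2, 3],
})
order_products["order_length"] = (
    order_products.groupby("order_id")["add_to_cart_order"].transform("count")
)

# (order_length - add_to_cart_order + 1) / order_length:
# earlier cart positions get values closer to 1
order_products["product_importance"] = (
    order_products["order_length"] - order_products["add_to_cart_order"] + 1
) / order_products["order_length"]
```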

5. Importance score:

This feature is defined for each unique product & order pair as product_importance multiplied by order_importance.

6. Importance score Sum:

Now, using the above features, we will generate a new feature called importance score sum. For each unique user & product pair we will have an importance score sum value, which sums all the importance scores for that user with that product.
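Both features can be sketched together (toy data; the dataframe `df` below is an assumed stand-in for per-(order, product) rows that already carry the two importance features):

```python
import pandas as pd

# Per (order, product) rows with the two importance features computed earlier
df = pd.DataFrame({
    "user_id":            [1, 1, 1, 1],
    "product_id":         [101, 102, 101, 103],
    "order_importance":   [0.5, 0.5, 1.0, 1.0],
    "product_importance": [1.0, 0.5, 1.0, 0.5],
})

# importance_score: one value per (order, product) pair
df["importance_score"] = df["order_importance"] * df["product_importance"]

# importance_score_sum: one value per (user, product) pair
score_sum = (
    df.groupby(["user_id", "product_id"])["importance_score"]
      .sum()
      .rename("importance_score_sum")
      .reset_index()
)
```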

Reorder Features:

7. Reorder count:

This is the number of reorders done by a user. We can sum the reordered column for each user to obtain it.

8. Reorder ratio:

This feature can be defined as the mean of the reordered column for a user, i.e. the ratio of reordered products to the total products ordered by that user.
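Both reorder features come from one groupby over the reordered flag; a minimal sketch on toy data (schema assumed from the Instacart files):

```python
import pandas as pd

order_products = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2],
    "reordered": [0, 1, 1, 0, 0, 1],
})

# reorder_count: total reorders per user
# reorder_ratio: mean of the reordered flag per user
user_reorders = (
    order_products.groupby("user_id")["reordered"]
    .agg(reorder_count="sum", reorder_ratio="mean")
    .reset_index()
)
```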

Univariate analysis of newly generated features

Num_orders feature- Number of orders for a user.

In the num_orders feature we can clearly see values that favor the reordered target variable. The interquartile range (IQR) of num_orders where the product is not reordered is 10 to 35, whereas the IQR for reordered products is 19 to 47. Therefore, we can infer that when a user's total number of orders is high, the chances of reorders are also high.

Order Importance:

In the above plots we can also observe that higher order_importance values have more reorder density. Also, the IQR (interquartile range, 25th to 75th percentile) of order_importance is higher for reordered products. Thus, this feature will also help us determine reordered products.

Product Importance:

violin plot for product importance

In the box plot we can observe that the IQR for reorders is high when the product importance feature's value is high. Also, in the violin plot we can see that the density and mean of product importance values are higher when the product is reordered.

Importance score:

Here too we can see in the PDF that 'not reordered' products have a high density when importance_score is low.

The IQR is on the higher side, i.e. 0.18 to 0.5, for reordered products, whereas the IQR for 'not reordered' products is on the lower side, i.e. 0.02 to 0.3.

In the violin plot we can also see that the density of points is higher when the importance_score value is near 0.

Importance score sum:

The above box plot of importance_score_sum gives a very nice explanation of whether a product is reordered or not. We can see that the IQR of the feature for 'not reordered' is in a very narrow range (0–1), whereas the IQR for 'reordered' is 0 to 6. The 100th percentile is also in the range 0 to 0.2 for 'not reordered' and 0 to 10 for reordered products.

Our final dataframe:

After cleaning, removing redundant features and incorporating newly generated features we will have our dataframe ready for training as shown below.

After creating the dataframe I stumbled upon a thought: instead of having the order length feature, why not have a mean order length? So I created a new feature called mean order length, which is the mean order length over a user's orders.
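A minimal sketch of that replacement feature (toy data; assumes `order_length` was already computed per order):

```python
import pandas as pd

# One row per order, with the order_length feature computed earlier
orders = pd.DataFrame({
    "user_id":      [1, 1, 2],
    "order_id":     [10, 11, 20],
    "order_length": [4, 2, 5],
})

# mean_order_length: average cart size across each user's orders
mean_order_length = (
    orders.groupby("user_id")["order_length"].mean().rename("mean_order_length")
)
```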

So, our final dataframe will look like :

Create Training and Test data

We will create a dataframe containing future orders, which are exactly the orders from the train and test sets.
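This step can be sketched as a filter on the `eval_set` column of orders.csv (toy data; the author's actual code isn't reproduced here):

```python
import pandas as pd

# Toy orders table; eval_set tags each order as prior history or a future order
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 20, 21],
    "user_id":  [1, 1, 1, 2, 2],
    "eval_set": ["prior", "prior", "train", "prior", "test"],
})

# Future orders are exactly the rows belonging to the train and test sets
future_orders = orders[orders["eval_set"].isin(["train", "test"])]
```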

Then we will merge this with our orders_with_order_products dataframe using a left join. To know more about join operations, refer to this link.

Next we will filter the rows where the eval_set column is train.

Now, we will left-join order_products__train.csv to the above data_train dataframe on the unique pair of product id and order id. We are doing this because we want the target variable, i.e. the reordered column, for each order and product pair.
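A sketch of that join on toy data (candidate pairs and a stand-in for order_products__train.csv; dataframe names are assumptions):

```python
import pandas as pd

# Candidate (order, product) pairs for train orders, built from prior history
data_train = pd.DataFrame({
    "order_id":   [12, 12],
    "product_id": [101, 102],
})

# Rows in the style of order_products__train.csv
order_products_train = pd.DataFrame({
    "order_id":   [12],
    "product_id": [101],
    "reordered":  [1],
})

# Left join on the unique (order_id, product_id) pair to attach the target
data_train = data_train.merge(
    order_products_train, on=["order_id", "product_id"], how="left"
)
```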

data_train output

We can see some NaN values in the reordered column. This happens because the train data contains some orders in which a product is being ordered for the first time. We will simply impute them with the value 0.
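The imputation itself is a one-liner (toy data continuing the join sketch above; names are assumptions):

```python
import numpy as np
import pandas as pd

data_train = pd.DataFrame({
    "order_id":   [12, 12],
    "product_id": [101, 102],
    "reordered":  [1.0, np.nan],  # NaN: the pair was a first-time order
})

# First-time orders are, by definition, not reorders, so impute 0
data_train["reordered"] = data_train["reordered"].fillna(0).astype(int)
```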

After dropping unnecessary columns our train data frame looks like this.

Similarly create test data:
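The test split follows the same pattern, keeping the rows tagged test and dropping the eval_set column since test orders carry no reordered target (toy data; column names are assumptions):

```python
import pandas as pd

# Feature rows for the future orders
data = pd.DataFrame({
    "order_id":   [12, 21],
    "user_id":    [1, 2],
    "eval_set":   ["train", "test"],
    "num_orders": [3, 2],
})

# Keep only test orders and drop the split tag
data_test = data[data["eval_set"] == "test"].drop(columns=["eval_set"])
```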

So now our data will look something like this.

test data

Finally after a lot of data analysis and feature engineering we are ready with our final dataframes to train our model on.

In the next part of this series we will look at the modelling strategies. We will discuss:

  • Modelling using different machine learning and deep learning algorithms.
  • Optimizing results and selecting the best model.
  • Creating submission file in the format required.

Prediction using various models can be found here.

You can find the code on my GitHub profile.

Link to my LinkedIn profile

References:
