niladri18/Purchase-Analytics

Purchase-Analytics

Table of Contents

  1. Problem
  2. Input Datasets
  3. Solution

Problem

Instacart has published a dataset containing 3 million Instacart orders.

In this challenge, I calculate, for each department, the number of times a product was requested, the number of times a product was requested for the first time, and the ratio of those two numbers.

Input Datasets

For this challenge, we have two separate input data sources, order_products.csv and products.csv.

We assume each line of the file order_products.csv holds data on one request. The file contains data of the form

order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
3,17668,1,1
3,46667,2,1
3,17461,4,1
3,32665,3,1
4,46842,1,0

where

  • order_id: unique identifier of order
  • product_id: unique identifier of product
  • add_to_cart_order: sequence order in which each product was added to shopping cart
  • reordered: flag indicating if the product has been ordered by this user at some point in the past. The field is 1 if the user has ordered it in the past and 0 if the user has not. While data engineers should validate their data, for the purposes of this challenge, you can take the reordered flag at face value and assume it accurately reflects whether the product has been ordered by the user before.
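As a rough illustration of the format above, the rows can be read with Python's standard `csv` module. This is only a sketch, not the code in `src/purchase_analytics.py`; the helper name `read_requests` and the inline `SAMPLE_ORDERS` string are hypothetical and exist just for this example.

```python
import csv
import io

# The sample rows shown above, inlined so the example is self-contained.
SAMPLE_ORDERS = """order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
3,17668,1,1
3,46667,2,1
3,17461,4,1
3,32665,3,1
4,46842,1,0
"""


def read_requests(handle):
    """Yield (product_id, is_first_order) for each request row.

    Hypothetical helper: a request counts as a first-time order
    when the reordered flag is "0".
    """
    for row in csv.DictReader(handle):
        yield row["product_id"], row["reordered"] == "0"


requests = list(read_requests(io.StringIO(SAMPLE_ORDERS)))
first_time = [product_id for product_id, is_first in requests if is_first]
print(len(requests), first_time)
```

For a real file, `open("input/order_products.csv", newline="")` could be passed in place of the `io.StringIO` object.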

The file products.csv holds data on every product, and looks something like this:

product_id,product_name,aisle_id,department_id
9327,Garlic Powder,104,13
17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
28985,Michigan Organic Kale,83,4
32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
33120,Organic Egg Whites,86,16
45918,Coconut Butter,19,13
46667,Organic Ginger Root,83,4
46842,Plain Pre-Sliced Bagels,93,3

where

  • product_id: unique identifier of the product
  • product_name: name of the product
  • aisle_id: identifier of aisle in which product is located
  • department_id: identifier of department
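Since only `product_id` and `department_id` are needed to group requests by department, one possible way to load this file is into a dictionary keyed by product. The sketch below is an assumption about how that could look, not the repository's implementation; `load_departments` and `SAMPLE_PRODUCTS` are hypothetical names.

```python
import csv
import io

# The sample rows shown above, inlined so the example is self-contained.
SAMPLE_PRODUCTS = """product_id,product_name,aisle_id,department_id
9327,Garlic Powder,104,13
17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
28985,Michigan Organic Kale,83,4
32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
33120,Organic Egg Whites,86,16
45918,Coconut Butter,19,13
46667,Organic Ginger Root,83,4
46842,Plain Pre-Sliced Bagels,93,3
"""


def load_departments(handle):
    """Return a dict mapping product_id to department_id (both as strings)."""
    return {
        row["product_id"]: row["department_id"]
        for row in csv.DictReader(handle)
    }


department_of = load_departments(io.StringIO(SAMPLE_PRODUCTS))
print(department_of["9327"])
```

Using `csv.DictReader` rather than splitting on commas means that a quoted product name containing a comma would still be parsed as a single field.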

Solution

I treat each product as an object and store these objects in an array. I use a hash table to accumulate the counts for each department.
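The per-department aggregation can be sketched with a plain dictionary as the hash table. This is a minimal illustration under stated assumptions, not the actual contents of `src/purchase_analytics.py`: the names `summarize`, `PRODUCT_DEPARTMENT`, and `REQUESTS` are hypothetical, the ratio is assumed to be first-time requests divided by total requests, and requests for unknown products are assumed to be skipped.

```python
from collections import defaultdict

# product_id -> department_id, taken from the products.csv sample.
PRODUCT_DEPARTMENT = {
    9327: 13, 17461: 12, 17668: 16, 28985: 4, 32665: 3,
    33120: 16, 45918: 13, 46667: 4, 46842: 3,
}

# (product_id, reordered) pairs, taken from the order_products.csv sample.
REQUESTS = [
    (33120, 1), (28985, 1), (9327, 0), (45918, 1), (17668, 1),
    (46667, 1), (17461, 1), (32665, 1), (46842, 0),
]


def summarize(requests, department_of):
    """Return {department_id: (total, first_time, ratio)} sorted by department.

    Assumption: ratio = first_time / total.
    """
    counts = defaultdict(lambda: [0, 0])  # department -> [total, first_time]
    for product_id, reordered in requests:
        department = department_of.get(product_id)
        if department is None:
            continue  # assumption: ignore products missing from products.csv
        counts[department][0] += 1
        if reordered == 0:
            counts[department][1] += 1
    return {
        department: (total, first, first / total)
        for department, (total, first) in sorted(counts.items())
    }


report = summarize(REQUESTS, PRODUCT_DEPARTMENT)
for department, (total, first, ratio) in report.items():
    print(department, total, first, ratio)
```

A department only appears in the result once it has at least one request, so the division never sees a zero total.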

Repo directory structure

The directory structure for the repo looks like this:

├── README.md
├── run.sh
├── src
│   └── purchase_analytics.py
├── input
│   ├── products.csv
│   └── order_products.csv
├── output
│   └── report.csv
└── insight_testsuite
    ├── run_tests.sh
    └── tests
        ├── test_1
        │   ├── input
        │   │   ├── products.csv
        │   │   └── order_products.csv
        │   └── output
        │       └── report.csv
        └── your-own-test_1
            ├── input
            │   ├── your-own-products.csv
            │   └── your-own-order_products.csv
            └── output
                └── report.csv
