Instacart has published a dataset containing 3 million Instacart orders.
In this challenge, I calculate, for each department, the number of times a product was requested, number of times a product was requested for the first time and a ratio of those two numbers.
For this challenge, we have two separate input data sources, order_products.csv
and products.csv
.
We assume each line of the file order_products.csv
holds data on one request. The file contains data of the form
order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
3,17668,1,1
3,46667,2,1
3,17461,4,1
3,32665,3,1
4,46842,1,0
where
order_id
: unique identifier of orderproduct_id
: unique identifier of productadd_to_cart_order
: sequence order in which each product was added to shopping cartreordered
: flag indicating if the product has been ordered by this user at some point in the past. The field is1
if the user has ordered it in the past and0
if the user has not. While data engineers should validate their data, for the purposes of this challenge, you can take thereordered
flag at face value and assume it accurately reflects whether the product has been ordered by the user before.
The file products.csv
holds data on every product, and looks something like this:
product_id,product_name,aisle_id,department_id
9327,Garlic Powder,104,13
17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
28985,Michigan Organic Kale,83,4
32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
33120,Organic Egg Whites,86,16
45918,Coconut Butter,19,13
46667,Organic Ginger Root,83,4
46842,Plain Pre-Sliced Bagels,93,3
where
product_id
: unique identifier of the productproduct_name
: name of the productaisle_id
: identifier of aisle in which product is locateddepartment_id
: identifier of department
I treat each product as an object and create an array to store the list of the objects. I use hashing to store data for each departments.
The directory structure for the repo looks like this:
├── README.md
├── run.sh
├── src
│ └── purchase_analytics.py
├── input
│ └── products.csv
| └── order_products.csv
├── output
| └── report.csv
├── insight_testsuite
└── run_tests.sh
└── tests
└── test_1
| ├── input
| │ └── products.csv
| │ └── order_products.csv
| |__ output
| │ └── report.csv
├── your-own-test_1
├── input
│ └── your-own-products.csv
| └── your-own-order_products.csv
|── output
└── report.csv