There are 2 datasets, A and B. Set A contains a list of 3 mln. unique ids. Set B contains, for a subset of these ids, 5 variables: date1, date2, product, cost and quantity. There are about 10,000 distinct products. Each row in set B represents a sale of a product to a customer (identified by the id), sold on date1 and paid on date2 (with date2 >= date1). There can be up to 100 million rows in set B.
The function I am looking for should transform set A and B into two matrices, C and D. Both matrices should have:
- a total number of rows equal to the total number of unique (customer) ids, i.e. equal to the number of rows in set A.
- a total number of columns equal to the total number of unique products (i.e. about 10,000)
Each cell in matrix C should contain the total quantity of a product sold to a customer, and each cell in matrix D should contain the total cost of a product.
Matrix C and D should have a CCS (compressed column storage) sparse matrix format (see e.g. [login to view URL]).
You can use any free publicly available library or code as part of your program.
## Deliverables
There are 2 datasets, A and B. Set A contains a list of 3 mln. unique ids. Set B contains, for a subset of these ids, 5 variables: date1, date2, product, cost and quantity. There are about 10,000 distinct products. Each row in set B represents a sale of a product to a customer (identified by the id), sold on date1 and paid on date2 (with date2 >= date1). There are about 100 million rows in set B.
The function I am looking for should transform set A and B into two matrices, C and D. Both matrices should have:
- a total number of rows equal to the total number of unique (customer) ids, i.e. equal to the number of rows in set A.
- a total number of columns equal to the total number of unique products (i.e. about 10,000)
Each cell in matrix C should contain the total quantity of a product sold to a customer, and each cell in matrix D should contain the total cost of a product.
The function will require four additional inputs:
- mindate1
- maxdate1
- mindate2
- maxdate2
To build matrix C and D, only rows from Set B should be processed that satisfy both of the following conditions:
- mindate1 <= date1 <= maxdate1
- mindate2 <= date2 <= maxdate2
It is expected that only about 0.1% of the cells in matrix C and D are non-zero. Matrix C and D therefore should have a CCS (compressed column storage) sparse matrix format (see e.g. [login to view URL]).
The function should not require more than 15GB RAM when executed on the data as specified above (cost and quantity variables both have double, i.e. 8 byte, storage format).
Example (in this example we omit the mindate1...maxdate2 restrictions):
inputs:
Set A:
id
15
1
2
100
Set B:
id, date1, date2, prod, cost, quantity,
100, '17/02/2008', '19/02/2008', C, 79, 30,
15, '11/01/2008', '11/01/2009', A, 100.51, 2,
100, '17/02/2008', '19/02/2008', A, 79, 7,
1, '15/03/2008', '11/01/2009', B, 3.71, 13,
15, '11/10/2008', '17/01/2009', A, 58, 1,
matrix C (column names would be: id, prod_A, prod_B, prod_C)
1, 0, 13, 0,
2, 0, 0, 0,
15, 3, 0, 0,
100, 7, 0, 30,
output in CCS sparse format:
row_ind = {1, 2, 3, 4, 3, 4, 1, 4}
col_ptr = {1, 5, 7, 8}
val = {1, 2, 15, 100, 3, 7, 13, 30}
matrix D (column names would be: id, prod_A, prod_B, prod_C)
1, 0, 3.71, 0,
2, 0, 0, 0,
15, 158.51, 0, 0,
100, 79, 0, 79,
output in CCS sparse format:
row_ind = {1, 2, 3, 4, 3, 4, 1, 4}
col_ptr = {1, 5, 7, 8}
val = {1, 2, 15, 100, 158.51, 79, 3.71, 79}
I will provide you with sample datasets A and B and the corresponding matrices C and D for testing purposes.