C or fortran function to transform .csv file to sparse matrix

$30-5000 USD

Closed

Posted

about 13 years ago

$30-5000 USD

Paid on delivery

There are 2 datasets, A and B. Set A contains a list of 3 mln. unique ids. Set B contains, for a subset of these ids, 5 variables: date1, date2, product, cost and quantity. There are about 10,000 distinct products. Each row in set B represents a sale of a product to a customer (identified by the id), sold on date1 and paid on date2 (with date2 >= date1). There can be up to 100 million rows in set B. The function I am looking for should transform set A and B into two matrices, C and D. Both matrices should have: - a total number of rows equal to the total number of unique (customer) ids, i.e. equal to the number of rows in set A. - a total number of columns equal to the total number of unique products (i.e. about 10,000) Each cell in matrix C should contain the total quantity of a product sold to a customer, and each cell in matrix D should contain the total cost of a product. Matrix C and D should have a CCS (compressed column storage) sparse matrix format (see e.g. [login to view URL]). You can use any free publicly available library or code as part of your program. ## Deliverables There are 2 datasets, A and B. Set A contains a list of 3 mln. unique ids. Set B contains, for a subset of these ids, 5 variables: date1, date2, product, cost and quantity. There are about 10,000 distinct products. Each row in set B represents a sale of a product to a customer (identified by the id), sold on date1 and paid on date2 (with date2 >= date1). There are about 100 million rows in set B. The function I am looking for should transform set A and B into two matrices, C and D. Both matrices should have: - a total number of rows equal to the total number of unique (customer) ids, i.e. equal to the number of rows in set A. - a total number of columns equal to the total number of unique products (i.e. about 10,000) Each cell in matrix C should contain the total quantity of a product sold to a customer, and each cell in matrix D should contain the total cost of a product. The function will require four additional inputs: - mindate1 - maxdate1 - mindate2 - maxdate2 To build matrix C and D, only rows from Set B should be processed that satisfy both of the following conditions: - mindate1 <= date1 <= maxdate1 - mindate2 <= date2 <= maxdate2 It is expected that only about 0.1% of the cells in matrix C and D are non-zero. Matrix C and D therefore should have a CCS (compressed column storage) sparse matrix format (see e.g. [login to view URL]). The function should not require more than 15GB RAM when executed on the data as specified above (cost and quantity variables both have double, i.e. 8 byte, storage format). Example (in this example we omit the mindate1...maxdate2 restrictions): inputs: Set A: id 15 1 2 100 Set B: id, date1, date2, prod, cost, quantity, 100, '17/02/2008', '19/02/2008', C, 79, 30, 15, '11/01/2008', '11/01/2009', A, 100.51, 2, 100, '17/02/2008', '19/02/2008', A, 79, 7, 1, '15/03/2008', '11/01/2009', B, 3.71, 13, 15, '11/10/2008', '17/01/2009', A, 58, 1, matrix C (column names would be: id, prod_A, prod_B, prod_C) 1, 0, 13, 0, 2, 0, 0, 0, 15, 3, 0, 0, 100, 7, 0, 30, output in CCS sparse format: row_ind = {1, 2, 3, 4, 3, 4, 1, 4} col_ptr = {1, 5, 7, 8} val = {1, 2, 15, 100, 3, 7, 13, 30} matrix D (column names would be: id, prod_A, prod_B, prod_C) 1, 0, 3.71, 0, 2, 0, 0, 0, 15, 158.51, 0, 0, 100, 79, 0, 79, output in CCS sparse format: row_ind = {1, 2, 3, 4, 3, 4, 1, 4} col_ptr = {1, 5, 7, 8} val = {1, 2, 15, 100, 158.51, 79, 3.71, 79} I will provide you with sample datasets A and B and the corresponding matrices C and D for testing purposes.

Software Architecture

Project ID: 3250970