A scikit-learn script that performs non-negative PCA decomposition
£10-20 GBP
Closed
Posted almost 7 years ago
Paid on delivery
I'd like a Python 3 script that produces an approximate PCA decomposition of a set of vectors into N components, which can then be linearly combined to approximate the samples.
First, read the Wikipedia article: Linear combination > Affine, conical, and convex combinations
However, we can't just use IncrementalPCA in scikit-learn because, instead of an ordinary linear combination, we'd like restricted coefficients, as in a conical combination. Better still, I'd like to restrict the set of possible coefficients to just two real numbers, or to some other small number of values, such as 4, 8, or 16. With two values, it's more like a 'Boolean combination' than a linear combination.
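To make the restricted-coefficient idea concrete, here is a minimal sketch of one way to choose coefficients from a fixed set for a single sample, assuming the components are already known. It uses a greedy coordinate search, which is my own illustrative choice and not something the spec prescribes; the function name `fit_restricted_coefficients` and the `passes` parameter are hypothetical, and the search can stop at a local optimum.

```python
import numpy as np

def fit_restricted_coefficients(sample, components, allowed, passes=5):
    """Pick one coefficient per component from `allowed` (hypothetical helper).

    Greedy coordinate search: change one coefficient at a time and keep the
    change only if the sum of absolute differences goes down.
    """
    allowed = np.asarray(allowed, dtype=float)
    coeffs = np.full(components.shape[0], allowed[0])
    approx = coeffs @ components
    best = np.abs(sample - approx).sum()
    for _ in range(passes):
        changed = False
        for j in range(components.shape[0]):
            for value in allowed:
                if value == coeffs[j]:
                    continue
                # Update the approximation incrementally for this one change.
                trial = approx + (value - coeffs[j]) * components[j]
                err = np.abs(sample - trial).sum()
                if err < best:
                    coeffs[j], approx, best = value, trial, err
                    changed = True
        if not changed:
            break  # no single change improves the fit any further
    return coeffs, best

# Toy data (assumed): three components with four features each.
components = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 2.0, 0.0, 0.0],
                       [0.0, 0.0, 3.0, 0.0]])
sample = 1.6 * components[0] + 1.6 * components[2]
coeffs, err = fit_restricted_coefficients(sample, components, allowed=(0.0, 1.6))
```

With two allowed values this behaves like the 'Boolean combination' case; passing 4, 8, or 16 values uses the same loop. It is written for clarity, not for the speed the full data set would need.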
Input data:
1. samples > 100000 (it has to work on at least this many, and I only have 24GB of RAM, or 3GB on my GPU, so you'll probably need incremental batching; total run time should be hours, not days)
2. features per sample = 4000 32-bit integers
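As a back-of-envelope check on the sizes above, the following sketch estimates the memory footprint and how many rows fit in a given budget. The helper names and the 1 GiB per-batch budget are assumptions chosen for illustration only.

```python
def dataset_bytes(n_samples, n_features, bytes_per_value):
    """Raw size of a dense samples-by-features array."""
    return n_samples * n_features * bytes_per_value

def max_batch_rows(budget_bytes, n_features, bytes_per_value):
    """Largest number of rows whose dense array fits in the budget."""
    return budget_bytes // (n_features * bytes_per_value)

raw_int32 = dataset_bytes(100_000, 4000, 4)    # 32-bit integers as stored
as_float64 = dataset_bytes(100_000, 4000, 8)   # after conversion to float64
rows_per_gib = max_batch_rows(1024 ** 3, 4000, 8)  # assumed 1 GiB batch budget

print(raw_int32, as_float64, rows_per_gib)
# prints: 1600000000 3200000000 33554
```

So 100000 samples take about 1.6 GB as stored and twice that as float64, before counting any intermediate copies, which is where batching starts to matter on the smaller GPU memory.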
Output data:
1. N components of size 4000 that can be used in a linear combination to produce each input sample. The component values can be floats rather than integers; approximation is fine.
2. A list of the max, min and mean accuracy of the representation for the sample batch.
3. The linear combination for each sample and its accuracy.
Input parameters:
1. The number of components, N < X, where X is specified. This will obviously affect accuracy.
2. The size of the set of coefficients, S. S in {2, 4, 8, 16... or R} where R is no restriction and allows all real numbers.
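One simple way to read the S parameter is as a post-processing step: take unrestricted real coefficients and snap each one to the nearest value in a shared set. The sketch below assumes that reading; `snap_coefficients` is a hypothetical name, `allowed=None` stands in for the unrestricted case R, and snapping is generally not the best possible choice of restricted coefficients.

```python
import numpy as np

def snap_coefficients(real_coeffs, allowed=None):
    """Round each coefficient to the nearest value in `allowed`.

    `allowed=None` models the unrestricted case (all real numbers).
    """
    real_coeffs = np.asarray(real_coeffs, dtype=float)
    if allowed is None:
        return real_coeffs
    allowed = np.asarray(allowed, dtype=float)
    # Distance from every coefficient to every allowed value, then pick the closest.
    nearest = np.abs(real_coeffs[..., None] - allowed).argmin(axis=-1)
    return allowed[nearest]

snapped = snap_coefficients([0.1, 1.5, 0.9, -3.0], allowed=(0.0, 1.6))
unrestricted = snap_coefficients([0.1, 1.5, 0.9, -3.0])
```

Because the same `allowed` array is used for every sample, the coefficient values are shared across all combinations, whatever S is.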
Constraints:
It doesn't matter what the values of the coefficients are in the linear combinations, but the same set of coefficient values has to be re-used across the combinations for all samples. If they were {0, 1.6}, for example, you'd get this kind of output:
batch average accuracy: 32242
batch min accuracy: 1232
batch max accuracy: 272291
sample number: 37221
linear combination: 0v1 + 1.6v2 + 0v3 + 1.6v4 + 1.6v5 + 0v6 ..... 1.6vN
accuracy: 23423
sample number: 15718
linear combination: 1.6v1 + 0v2 + 1.6v3 + 1.6v4 + 0v5 + 1.6v6 ..... 0vN
accuracy: 12383
Accuracy is to be calculated by summing the absolute differences between each feature in the sample and the corresponding feature in the linear representation, so zero would be a perfect representation and 1000000 would be a bad one.
Freelancers: Please just put the words "white rabbits" at the top of your proposal and in your first message to me so I know you have read this spec and understand it. This saves us both time.
There will be two milestones: milestone 1 will be a token amount for a short script that isn't complete garbage. Milestone 2 will be the full project amount for the finished script.
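The accuracy measure described above (sum of absolute differences per sample, lower is better) and the batch max, min and mean can be sketched as follows. The function names and the tiny arrays are made up for illustration; samples and coefficients are assumed to be stored one row per sample.

```python
import numpy as np

def accuracy(samples, coefficients, components):
    """Sum of absolute differences between each sample and its representation."""
    approx = coefficients @ components          # (n_samples, n_features)
    return np.abs(samples - approx).sum(axis=1)

def batch_summary(acc):
    """Max, min and mean accuracy over one batch."""
    return {"max": float(acc.max()), "min": float(acc.min()), "mean": float(acc.mean())}

# Toy batch (assumed): two samples, two components, three features.
samples = np.array([[7.0, 272.0, 0.0],
                    [1.0, 2.0, 3.0]])
components = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
coefficients = np.array([[7.0, 270.0],
                         [1.0, 2.0]])

acc = accuracy(samples, coefficients, components)
summary = batch_summary(acc)
print(acc.tolist(), summary)
# prints: [2.0, 3.0] {'max': 3.0, 'min': 2.0, 'mean': 2.5}
```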
Examples:
One of the millions of samples will look like {7, 272, 0, 345373, 23, 2325.... all the way to 4000 features}
One of the N (e.g. 1000) output components might look like {2E06, 1.7E03, 0.2828323, -322, 0, 2.7E-10.... to 4000 values}
and a given sample might be represented by:
[login to view URL] + 1component2 + 0component3 + 1component4 + 1component5 + 0component6 ..... 1component1000
with accuracy 27324
Hi,
Can we discuss the project further?
I have done a similar project and understand the project outline. Please give me a chance; a trial will convince you. Looking forward to working with you.