Is there an algorithm for weighted reservoir sampling? [closed] - language-agnostic

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Is there an algorithm for how to perform reservoir sampling when the points in the data stream have associated weights?

The algorithm by Pavlos Efraimidis and Paul Spirakis solves exactly this problem. The original paper with complete proofs is published with the title "Weighted random sampling with a reservoir" in Information Processing Letters 2006, but you can find a simple summary here.
The algorithm works as follows. First, observe that another way to solve unweighted reservoir sampling is to assign each element a random id R between 0 and 1 and incrementally (say, with a heap) keep track of the top k ids. Now consider the weighted version, where the i-th element has weight w_i. We modify the algorithm by choosing the id of the i-th element to be R^(1/w_i), where R is again uniformly distributed in (0,1). A larger weight pushes the id toward 1, so heavier elements are more likely to stay among the top k.
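The description above can be sketched in Python roughly as follows. This is a minimal illustration, not the paper's reference code: the function name `weighted_reservoir_sample`, the `(item, weight)` stream format, and the choice to skip non-positive weights are assumptions made for the example.

```python
import heapq
import random


def weighted_reservoir_sample(stream, k, rng=random):
    """Keep k items from a stream of (item, weight) pairs in one pass.

    Each item gets the key R ** (1 / weight) with R uniform in [0, 1),
    and the k largest keys are kept in a min-heap.
    """
    heap = []  # entries are (key, arrival_index, item); heap[0] has the smallest key
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # assumption: items with non-positive weight are never sampled
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, index, item))
        elif key > heap[0][0]:
            # The new key beats the smallest kept key, so swap it in.
            heapq.heapreplace(heap, (key, index, item))
    return [item for _, _, item in heap]
```

For example, `weighted_reservoir_sample([("a", 1000.0), ("b", 1.0), ("c", 1.0)], 1)` should return `["a"]` in the vast majority of runs, since "a" carries almost all of the weight.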
Another article talking about this algorithm is this one by the Cloudera folks.

You can try the A-ES algorithm from this paper of S. Efraimidis. It's quite simple to code and very efficient.
Hope this helps,
Benoit

Related

How can we understand DP (Dynamic Programming)? Listing types of problems [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 days ago.
I am doing some programming practice and going through Dynamic Programming theory. We always come across two points:
Optimal Substructure (OSS)
Overlapping Subproblem (OSP)
Any optimization problem with these two characteristics can be solved by using DP techniques (Memoization or Tabulation).
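As a small illustration of the two techniques named above, here is a sketch that solves the same problem (fewest coins that sum to an amount) once with memoization and once with tabulation. The problem choice and the function names `min_coins_memo` and `min_coins_table` are assumptions picked for the example. The problem has optimal substructure (the best answer for an amount builds on the best answer for a smaller amount) and overlapping subproblems (the same smaller amounts come up repeatedly).

```python
import math
from functools import lru_cache


def min_coins_memo(coins, amount):
    """Top-down: recurse on smaller amounts and cache each result."""
    coins = tuple(coins)

    @lru_cache(maxsize=None)
    def best(remaining):
        if remaining == 0:
            return 0
        options = [best(remaining - c) for c in coins if c <= remaining]
        return min(options, default=math.inf) + 1

    result = best(amount)
    return None if math.isinf(result) else result


def min_coins_table(coins, amount):
    """Bottom-up: fill a table from amount 0 up to the target amount."""
    table = [0] + [math.inf] * amount
    for value in range(1, amount + 1):
        for c in coins:
            if c <= value and table[value - c] + 1 < table[value]:
                table[value] = table[value - c] + 1
    return None if math.isinf(table[amount]) else table[amount]
```

Both functions return `None` when the amount cannot be formed from the given coins.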
But we know it takes a lot of practice to identify this kind of problem.
Let's say we have 4 types:
TYPE 1  | TYPE 2  | PROBLEMS
OSS     | OSP     | All DP problems
NON-OSS | OSP     | ?
OSS     | NON-OSP | ?
NON-OSS | NON-OSP | ?
E.g., is there any problem that has Optimal Substructure but NON-Overlapping Subproblems?
I need your help in listing problems of each type. This will help me, and whoever is reading this, get better at identifying and then solving such problems.
If you have come across any problem (LeetCode, CodeChef, SPOJ, etc.) that you think fits into a '?' category, please comment.
Also, please share any link or source for learning more about types based on OSS/OSP.

How to improve Random forest regression prediction result [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I am working on parking occupancy prediction using random forest regression. I have 6 features and have tried to implement the random forest model, but the results are not good. As I am very new to this, I do not know what kind of model is suitable for this kind of problem. My dataset is huge: I have 47 million rows. I have also used random search CV, but I cannot improve the model. Kindly have a look at the code below and help me improve it or suggest another model.
Random forest regression
The features used are extracted with the help of the location data of the parking lots with a buffer. Kindly help me to improve.
So, the variables you used are:
['restaurants_pts','population','res_percent','com_percent','supermarkt_pts', 'bank_pts']
What I see is that, for a given parking lot, those variables never change, so the regression will just predict the "average" occupancy of that lot. One key part of your problem seems to be that the occupancy is not the same at 5pm and at 4am...
I'd suggest you work on a time variable (e.g. arrival) so that it becomes usable.
On its own, the raw variable cannot be understood by the model, but you can derive categories from it. For example, preprocess it to keep only the HOUR, then make categories from it (either each hour as its own category, or larger categories like ['midnight - 6am', '6am - 10am', '10am - 2pm', '2pm - 6pm', '6pm - midnight']).
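The hour-bucketing idea above could look something like this in plain Python. It is only a sketch: the bucket boundaries and labels, and the names `HOUR_BUCKETS` and `hour_bucket`, are assumptions for illustration, and on a dataset of this size you would apply the same logic column-wise in your dataframe library rather than row by row.

```python
from datetime import datetime

# Hypothetical buckets: (first hour inclusive, last hour exclusive, label).
HOUR_BUCKETS = [
    (0, 6, "midnight-6am"),
    (6, 10, "6am-10am"),
    (10, 14, "10am-2pm"),
    (14, 18, "2pm-6pm"),
    (18, 24, "6pm-midnight"),
]


def hour_bucket(timestamp):
    """Map a datetime (e.g. an arrival time) to a coarse time-of-day category."""
    hour = timestamp.hour
    for start, end, label in HOUR_BUCKETS:
        if start <= hour < end:
            return label
    raise ValueError(f"hour out of range: {hour}")
```

The resulting category can then be encoded (for instance one-hot) and added as a feature next to the location-based ones.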

what is the time complexity of iterator increments and decrements for stl::map [duplicate]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
What's the complexity of the iterator++ operation for the STL RB-Tree (set or map)?
I always thought they would use indices, so the answer should be O(1), but recently I read the VC10 implementation and was shocked to find that they do not.
To find the next element in an ordered RB-Tree, it takes time to search for the smallest element in the right subtree, or, if the node is a left child and has no right child, the smallest element in the right sibling. This introduces a recursive process, and I believe the ++ operator takes O(lg n) time.
Am I right? And is this the case for all STL implementations or just Visual C++?
Is it really difficult to maintain indices for an RB-Tree? As far as I can see, by holding two extra pointers in the node structure we can maintain a doubly linked list alongside the RB-Tree. Why don't they do that?
The amortized complexity when incrementing the iterator over the whole container is O(1) per increment, which is all that's required by the standard. You're right that a single increment can take O(log n) in the worst case, since the depth of the tree has that complexity class.
It seems likely to me that other RB-tree implementations of map will be similar. As you've said, the worst-case complexity for operator++ could be improved, but the cost isn't trivial.
It's quite possible that the total time to iterate the whole container would be improved by the linked list, but it's not certain, since bigger node structures tend to result in more cache misses.
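To make the amortized argument above concrete, here is a sketch of in-order successor on a binary search tree with parent pointers. It is not the VC10 code: the tree is a plain unbalanced BST (no red-black rebalancing) and the names `Node`, `insert`, `successor` and `walk` are made up for the example. The point is that a full traversal crosses each edge at most once downward and once upward, so n increments cost at most 2(n - 1) pointer steps in total, even though a single increment may climb the whole height of the tree.

```python
class Node:
    def __init__(self, key, parent=None):
        self.key = key
        self.parent = parent
        self.left = None
        self.right = None


def insert(root, key):
    """Plain BST insert (no rebalancing); returns the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key, node)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key, node)
                return root
            node = node.right


def successor(node):
    """Return (next node in order or None, number of pointer steps taken)."""
    steps = 0
    if node.right is not None:
        # Go right once, then all the way left.
        node = node.right
        steps += 1
        while node.left is not None:
            node = node.left
            steps += 1
        return node, steps
    # Climb while we are a right child; the next parent is the successor.
    while node.parent is not None and node is node.parent.right:
        node = node.parent
        steps += 1
    if node.parent is not None:
        steps += 1
    return node.parent, steps


def walk(root):
    """Iterate the whole tree via successor(); return (keys, total steps)."""
    node = root
    while node is not None and node.left is not None:
        node = node.left
    keys, total = [], 0
    while node is not None:
        keys.append(node.key)
        node, steps = successor(node)
        total += steps
    return keys, total
```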

Technical implications of FFT spectral analysis over custom defined frequency bands [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
First of all, I should mention that I'm not an expert in signal processing, but I know some of the very basics. So I apologize if this question doesn't make any sense.
Basically I want to be able to run a spectral analysis over a specific set of user-defined discrete frequency bands. Ideally I would want to capture around 50-100 different bands simultaneously. For example: the frequencies of each key on an 88-key grand piano.
Also I should probably mention that I plan to run this in a CUDA environment with about 200 cores at my disposal (Jetson TK1).
My question is: What acquisition time, sample rate, sampling frequency, etc should I use to get a high enough resolution to line up with the desired results? I don't want to choose a crazy high number like 10000 samples, so are there any tricks to minimize the number of samples while getting spectral lines within the desired bands?
Thanks!
The FFT result does not depend on its initialization, only on the sample rate, length, and signal input. You don't need to use a whole FFT if you only want one frequency result. A bandpass filter (perhaps 1 per core) for each frequency band would allow customizing each filter for the bandwidth and response desired for that frequency.
Also, for music, note pitch is very often different from spectral frequency peak.
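On the point that a whole FFT is not needed for a single frequency: one well-known option, which the answer above does not name, is the Goertzel algorithm. It evaluates the signal's spectral power at one chosen frequency with a simple second-order recurrence. The sketch below is illustrative only, and the function name `goertzel_power` and its parameters are assumptions for the example. It is a different tool from a tuned bandpass filter, since its frequency resolution is still set by the block length and sample rate.

```python
import math


def goertzel_power(samples, sample_rate, freq):
    """Squared magnitude of the signal's spectrum at `freq` (in Hz).

    Runs one pass over `samples` and does not compute a full FFT.
    """
    omega = 2.0 * math.pi * freq / sample_rate
    coeff = 2.0 * math.cos(omega)
    s_prev = 0.0   # s[n-1]
    s_prev2 = 0.0  # s[n-2]
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2
```

One such detector per band (for example one per piano key) maps naturally onto independent cores, in the same spirit as the one-filter-per-core suggestion above.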

Machine learning of word structure [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am working on a system that can create made-up fantasy words based on a variety of user input, such as syllable templates or a modified Backus-Naur Form. One new mode, though, is planned to be machine learning. Here, the user does not explicitly define any rules, but pastes some text, and the system learns the structure of the given words and creates similar words.
My current naïve approach would be to create a table of letter neighborhood probabilities (including a special end-of-word "letter") and fill it by scanning the input by letter pairs (using whitespace and punctuation as word boundaries). Creating a word would mean looking up the probabilities for every letter to follow the current letter, randomly choosing one according to those probabilities, appending it, and repeating until end-of-word is encountered.
But I am looking for more sophisticated approaches that (probably?) provide better results. I do not know much about machine learning, so pointers to topics, techniques or algorithms are appreciated.
I think that for independent words (and especially names), a simple Markov chain system (which you seem to describe when talking about using letter pairs) can perform really well. Feed it a lexicon and give it a seed to generate a new name based on what it learned. You may want to tweak the prefix length of the Markov chain to get nice-sounding results (as pointed out in a comment on your question, 2 letters are much better than one).
I once tried it with elvish and orcish names dictionaries and got very satisfying results.
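A minimal sketch of such a Markov chain generator, with the prefix length as a parameter, might look like this. The names `train` and `generate`, the `^` and `$` boundary markers, and the `max_len` cutoff are assumptions for the example. It expects `order` to be at least 1 and words that do not contain the marker characters.

```python
import random
from collections import defaultdict

START = "^"  # padding so the first letters also have a prefix
END = "$"    # special end-of-word "letter"


def train(words, order=2):
    """Map each `order`-letter prefix to the list of letters seen after it."""
    table = defaultdict(list)
    for word in words:
        padded = START * order + word.lower() + END
        for i in range(len(padded) - order):
            table[padded[i:i + order]].append(padded[i + order])
    return table


def generate(table, order=2, rng=random, max_len=20):
    """Build one word letter by letter until END is drawn or max_len is hit."""
    prefix = START * order
    letters = []
    while len(letters) < max_len:
        # Repeated letters in the list make frequent followers more likely.
        nxt = rng.choice(table[prefix])
        if nxt == END:
            break
        letters.append(nxt)
        prefix = (prefix + nxt)[-order:]
    return "".join(letters)
```

Usage would be along the lines of `table = train(lexicon, order=2)` followed by repeated calls to `generate(table, order=2)`. Passing a seeded `random.Random` as `rng` makes the output reproducible.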