How long does it take to find the maximum element in a descending sorted array?

What is the worst case for finding the maximum element in a descending sorted array?
I guess the best, average, and worst cases will all be the same, because the maximum is going to be the first element. Is that right?
Will all cases be O(1)? Kindly correct my speculation.

If you are searching by iterating from the first element to the last, then yes.
Otherwise it depends a lot on which search algorithm you are going to use.

It depends on the search algorithm and on whether the algorithm knows the array is already sorted. If the algorithm is unaware, chances are it will be O(N) for all cases, since each element must be examined to prove you have the maximum. For an aware algorithm it will probably be O(1) or O(log N).
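To make the aware case concrete, here is a minimal sketch (the method name is illustrative): when the array is known to be sorted in descending order, the maximum is simply the first element, a constant-time lookup.
static double MaxOfDescendingSorted(double[] values)
{
    if (values.Length == 0)
        throw new ArgumentException("Array must not be empty.");
    // Descending order means the largest value sits at index 0: O(1) in every case.
    return values[0];
}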

Related

Percentile of array (in CUDA) without sort?

I have a 2560x2048 array of float values (5,242,880 elements, treated as a 1D vector) for which I need the 25th and 75th percentile values. My first thought was to use a bitonic sort and fetch the values at the 25% and 75% positions. But the bitonic sort I have only works for power-of-two array sizes, and I don't want to pad to a larger array with dummy values.
This got me thinking that perhaps someone has a way of getting percentiles without the overhead of a full sort?
I know you're asking about non-sorting methods, but Thrust does offer a sorting function. I haven't tried it, but if it's anything like cuFFT, I'd expect it to be highly optimized.
You can also sort using CUB, which is apparently faster than Thrust, according to this link.
Another option would be finding percentiles from a histogram, though that might not be what you want for floating point values, unless you have a good way of partitioning the expected values into a series of bins.
Tae-Sung Shin is correct. Percentile from Histogram is the best way to do this.
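For a rough illustration of the histogram idea (this is a single-threaded sketch, not CUDA code, and the bin count and value range are assumptions): bin the values once, then walk the cumulative counts until you pass the target rank. A GPU version would parallelize the binning step.
static float ApproxPercentile(float[] data, double percentile, float min, float max, int numBins = 1024)
{
    // Bin the data once: O(n).
    var counts = new int[numBins];
    double binWidth = (max - min) / numBins;
    foreach (float v in data)
    {
        int bin = (int)((v - min) / binWidth);
        if (bin < 0) bin = 0;
        if (bin >= numBins) bin = numBins - 1;
        counts[bin]++;
    }
    // Walk the cumulative counts until the target rank is reached: O(numBins).
    long targetRank = (long)(percentile / 100.0 * data.Length);
    long cumulative = 0;
    for (int i = 0; i < numBins; i++)
    {
        cumulative += counts[i];
        if (cumulative >= targetRank)
            return (float)(min + (i + 0.5) * binWidth);   // bin midpoint as the estimate
    }
    return max;
}
The result is only as precise as the bin width, which is usually fine for quartiles; if not, you can re-histogram just the single bin that contains the target rank.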

Given a set of N values that are all 0 or X, except for one Y, what is the most efficient way to find X?

Basically I have a set of redundant data that could have an error in one (bonus points: one or more) of the values. Some values could also be 0, which means ignore/invalid. What would be the most efficient way to return the "good" value?
The dumb solution would be a for loop that iterates over the set and returns once it finds the same non-zero value twice. But I feel like there might be some logical/bit-hacking expression that would be better.
The 'dumb' solution might be the best one to use, especially if there aren't many zeros in the data set. You would break from the loop very early in most cases.
In the case where there are a lot of zeros, your speed can be optimized if your hardware is able to quickly scan for non-zero entries. I imagine that searching for non-zeros is very easy on FPGA hardware, but I don't have personal experience with this.
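Here is a minimal sketch of that "dumb" loop (the names are illustrative, and HashSet comes from System.Collections.Generic): since at most one entry is the bad value Y, the first non-zero value seen twice must be X.
static int FindGoodValue(int[] data)
{
    var seen = new HashSet<int>();
    foreach (int v in data)
    {
        if (v == 0) continue;          // 0 means ignore/invalid
        if (!seen.Add(v)) return v;    // second sighting of a non-zero value: that's X
    }
    throw new InvalidOperationException("No repeated non-zero value found.");
}
In the common case this returns after the first few non-zero entries, which matches the observation above that the simple loop usually breaks early.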

Hashing Function Vs Loop search

I have an array of structures, ~100 unique elements, and the structure is not large. Due to legacy code, to find an element in this array I use a hash function to find a likely starting point and then loop from there until I find the element I want.
My question is this: Is the hash function (and resulting hash table) overkill?
I know for large tables hashing is essential for good response time, but for a table this size?
More succinctly, is there a table size below which writing a hash function is unnecessary ?
Language agnostic answers please.
Thanks,
A hash lookup trades a bigger up-front computation cost for better scalability. There is no inherent threshold table size, as it depends on the cost of your hash function. Roughly speaking, if calculating your hash function costs about the same as one hundred equality comparisons, then you could only theoretically benefit from the hash map at some point above one hundred items. The only way to get specific answers for your case is to measure the performance.
My guess though, is that a hash map for 100 items for performance reasons is overkill.
The standard, obvious answer would be/is to write the simplest code that can do the job. Ensure that your interface to that code is as clean as possible so you can replace it when/if needed. Later, if you find that code takes an unacceptable amount of time, replace it with something that improves performance.
On a theoretical basis, however, it's impossible to guess at the upper limit on the number of items for which a linear search will provide acceptable performance for your task. It's also impossible to guess at the number of items for which a hash table will provide better performance than a linear search.
The main point, however, is that it's rarely necessary to try to figure out (especially on a poorly-defined theoretical basis) what data structure would be best for a given situation. In most cases, you just need to make an acceptable decision, and implement it so you can change your mind later if it turns out to be unacceptable after all.
When creating the array (or after it's created), sort your array of unique elements by their key value. Then use binary search rather than a hash or linear search. You get a simple implementation, no extra memory usage, and good performance.
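A minimal sketch of that suggestion, assuming a hypothetical Item struct with an integer Key field: sort once by key, then binary search on each lookup, which is roughly 7 comparisons for 100 elements.
struct Item { public int Key; /* ...the rest of the small structure... */ }

// Sort once, after the table is built.
static void SortByKey(Item[] table)
{
    Array.Sort(table, (a, b) => a.Key.CompareTo(b.Key));
}

// O(log n) lookup; returns the index of the matching element, or -1 if not found.
static int FindByKey(Item[] table, int key)
{
    int lo = 0, hi = table.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (table[mid].Key == key) return mid;
        if (table[mid].Key < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}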

Randomly sorting an array

Does there exist an algorithm which, given an ordered list of symbols {a1, a2, a3, ..., ak}, produces in O(n) time a new list of the same symbols in a random order without bias? "Without bias" means the probability that any symbol s will end up in some position p in the list is 1/k.
Assume it is possible to generate a non-biased integer from 1-k inclusive in O(1) time. Also assume that O(1) element access/mutation is possible, and that it is possible to create a new list of size k in O(k) time.
In particular, I would be interested in a 'generative' algorithm. That is, I would be interested in an algorithm that has O(1) initial overhead, and then produces a new element for each slot in the list, taking O(1) time per slot.
If no solution exists to the problem as described, I would still like to know about solutions that do not meet my constraints in one or more of the following ways (and/or in other ways if necessary):
the time complexity is worse than O(n).
the algorithm is biased with regards to the final positions of the symbols.
the algorithm is not generative.
I should add that this problem appears to be the same as the problem of randomly sorting the integers from 1 to k, since we can randomly sort the list of integers from 1 to k and then, for each integer i in the new list, produce the symbol ai.
Yes - the Knuth Shuffle.
The Fisher-Yates Shuffle (Knuth Shuffle) is what you are looking for.
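For reference, here is a minimal sketch of the Fisher-Yates (Knuth) shuffle, assuming a uniform random integer generator: it does O(1) work per slot, so the whole shuffle is O(k), and every ordering is equally likely.
static void Shuffle<T>(T[] items, Random rng)
{
    // Walk from the end; swap each slot with a uniformly chosen slot at or below it.
    for (int i = items.Length - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);   // uniform in [0, i]
        T tmp = items[i];
        items[i] = items[j];
        items[j] = tmp;
    }
}
This shuffles in place; to satisfy the "produce a new list" requirement, copy the input in O(k) first and shuffle the copy. Each pass of the loop fixes slot i for good, so it effectively produces one final element per O(1) step, though it needs the whole input copied up front rather than a strictly O(1) setup.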

What statistics can be maintained for a set of numerical data without iterating?

Update
Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning):
Obvious
Count
Sum
Mean
Max*
Min*
Median**
Less Obvious
Variance
Standard Deviation
Skewness
Kurtosis
Mode***
Weighted Average
Weighted Moving Average****
OK, so to put it more accurately: these are not "all" of the statistics I'm aware of. They're just the ones that I can remember off the top of my head right now.
*Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections.
**Recalculated in O(1) for a sorted, indexed collection only.
***Requires a fairly complex data structure to recalculate in O(1).
****This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion (see the sketch just below this list). In other scenarios, I'm not sure.
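Regarding that last footnote, here is a minimal sketch of a rolling weighted moving average with linearly descending weights over a fixed window (the class name and window handling are illustrative, and Queue comes from System.Collections.Generic). The trick is to keep both the plain sum S and the weighted numerator N of the window: when a new value arrives, every existing weight drops by one, which is the same as subtracting S from N, and the new value enters with the top weight.
class RollingLinearWma
{
    private readonly Queue<double> window = new Queue<double>();
    private readonly int n;        // window size
    private double sum;            // plain sum of the current window
    private double numerator;      // n*newest + (n-1)*next + ... + 1*oldest

    public RollingLinearWma(int windowSize) { n = windowSize; }

    public void Add(double value)
    {
        if (window.Count == n)
        {
            double dropped = window.Dequeue();
            numerator = numerator - sum + n * value;   // shift all weights down, add the new head
            sum = sum - dropped + value;
        }
        else
        {
            numerator += (window.Count + 1) * value;   // window still filling: next weight up
            sum += value;
        }
        window.Enqueue(value);
    }

    public double Value
    {
        get
        {
            int m = window.Count;
            if (m == 0) return 0.0;
            return numerator / (m * (m + 1) / 2.0);    // weights sum to 1 + 2 + ... + m
        }
    }
}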
Original Question
Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could...
Option 1: Iterate through the collection, adding all the values:
double sum = 0.0;
for (int i = 0; i < values.Count; i++) sum += values[i];
Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum:
void Add(double value) {
    values.Add(value);
    sum += value;
}

void Remove(double value) {
    values.Remove(value);
    sum -= value;
}
EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation:
Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I say, "What's the sum?", you'd have to think to yourself, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said, "11" you thought "11", when I said "16", you thought, "27," and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," if you've been keeping track of the sum inside your head you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and then summing up 11 + 13 + 12.
So my question is, what other calculations, other than the obvious ones like sum and average, are like this?
SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again.
So I think this gets to the heart of my question: if we can divide kinds of statistics into these categories, those that are maintainable (my own term, maybe there's a more official one somewhere) versus those that require iteration to compute any time a collection is changed, what are all the maintainable ones?
What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having even seen all of the input data; the maintainable statistics I am seeking will certainly have seen all the data, they just don't need to reiterate through it over and over again whenever it changes.
First, the term that you want here is online algorithm. All moments (mean, standard deviation, skew, etc.) can be calculated online. Others include the minimum and maximum. Note that the median and mode cannot be calculated online.
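As an illustration of online moments, here is a minimal sketch of Welford's method for the mean and variance, which updates in O(1) per added value (additions only; removing a value is a separate problem):
class OnlineMoments
{
    private long count;
    private double mean;
    private double m2;   // running sum of squared deviations from the mean

    public void Add(double value)
    {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);   // note: uses the updated mean
    }

    public long Count => count;
    public double Mean => mean;
    public double Variance => count > 1 ? m2 / (count - 1) : 0.0;   // sample variance
}
Skewness and kurtosis can be maintained the same way by carrying the third and fourth central moments alongside m2.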
To consistently maintain the high/low, you store your data in sorted order. There are algorithms for maintaining data structures which preserve ordering.
Median is trivial if the data is ordered.
If the data is reduced slightly to a frequency table, you can maintain mode. If you keep your data as a random, flat list of values, you can't easily compute mode in the presence of change.
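A minimal sketch of the frequency-table approach for maintaining the mode under additions (the names are illustrative; as the footnote in the update above suggests, handling removals in O(1) takes a more elaborate structure):
class ModeTracker
{
    private readonly Dictionary<double, int> counts = new Dictionary<double, int>();
    private double mode;
    private int modeCount;

    public void Add(double value)
    {
        counts.TryGetValue(value, out int c);
        counts[value] = ++c;
        if (c > modeCount) { modeCount = c; mode = value; }   // O(1) per addition
    }

    public double Mode => mode;   // meaningful only after at least one Add
}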
The answers to this question on online algorithms might be useful. Regarding the usability for your needs, I'd say that while some online algorithms can be used for estimating summary statistics with partial data, others may be used to maintain them from a data flow just as you like.
You might also want to look at complex event processing (or CEP), which is used for tracking and analysing real time data, for example in finance or web commerce. The only free CEP product I know of is Esper.
As Jason says, you are indeed describing an online algorithm. I've also seen this type of computation referred to as the Accumulator Pattern, whether the loop is implemented explicitly or by recursion.
Not really a direct answer to your question, but for many statistics that are not online statistics you can usually find some rules to calculate by iteration only part of the time, and cache the correct value the rest of the time. Is this possibly good enough for you?
For the high value, for example:
public void Add(double value) {
    values.Add(value);
    if (value > highValue)
        highValue = value;
}

public void Remove(double value) {
    values.Remove(value);
    // WithinTolerance stands in for a floating-point equality check: only when the
    // removed value was the cached high do we pay for a full O(n) rescan.
    if (value.WithinTolerance(highValue))
        highValue = RecalculateHighValueByIteration();
}
It's not possible to maintain high or low with constant-time add and remove operations because that would give you a linear-time sorting algorithm. You can use a search tree to maintain the data in sorted order, which gives you logarithmic-time minimum and maximum. If you also keep subtree sizes and the count, it's simple to find the median too.
And if you just want to maintain the high or low in the presence of additions and removals, look into priority queues, which are more efficient for that purpose than search trees.
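A minimal sketch of the search-tree idea from the previous paragraph, using .NET's SortedSet as the balanced tree (assuming distinct values for brevity; duplicates would need a count kept alongside each value):
class MinMaxTracker
{
    private readonly SortedSet<double> set = new SortedSet<double>();   // red-black tree

    public bool Add(double value) => set.Add(value);        // O(log n)
    public bool Remove(double value) => set.Remove(value);  // O(log n)

    public double Min => set.Min;   // O(log n): leftmost node (0 if the set is empty)
    public double Max => set.Max;   // O(log n): rightmost node (0 if the set is empty)
}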
If you don't know the exact size of the dataset in advance, or if it is potentially unlimited, or you just want some ideas, you should definitely look into the techniques used in streaming algorithms.
It does sound (even after your second edit) like you are describing online algorithms, with the additional requirement that you want to allow "delete" operations. An example of this is the "sketch algorithms" used for finding frequent items in a stream.
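For a flavor of what those sketch structures look like, here is a minimal count-min sketch that supports both increments and decrements (the width, depth, and hash mixing are arbitrary choices for illustration, not taken from any particular library). It answers approximate frequency queries in constant time using memory that does not grow with the number of distinct items, which is the trade-off these structures make:
class CountMinSketch
{
    private readonly long[,] table;
    private readonly int width, depth;
    private readonly int[] seeds;

    public CountMinSketch(int width = 1024, int depth = 4)
    {
        this.width = width;
        this.depth = depth;
        table = new long[depth, width];
        seeds = new int[depth];
        var rng = new Random(12345);
        for (int i = 0; i < depth; i++) seeds[i] = rng.Next();
    }

    private int Bucket(int row, int item)
    {
        // Simple hash mixing; any independent hash per row would do.
        unchecked
        {
            uint h = (uint)(item * 2654435761u) ^ (uint)seeds[row];
            h ^= h >> 16;
            return (int)(h % (uint)width);
        }
    }

    public void Add(int item, long delta = 1)
    {
        for (int row = 0; row < depth; row++)
            table[row, Bucket(row, item)] += delta;   // pass a negative delta to delete
    }

    public long Estimate(int item)
    {
        long best = long.MaxValue;
        for (int row = 0; row < depth; row++)
            best = Math.Min(best, table[row, Bucket(row, item)]);
        return best;   // an overestimate, as long as true counts never go negative
    }
}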