What statistics can be maintained for a set of numerical data without iterating? - language-agnostic

Update
Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning):
Obvious
Count
Sum
Mean
Max*
Min*
Median**
Less Obvious
Variance
Standard Deviation
Skewness
Kurtosis
Mode***
Weighted Average
Weighted Moving Average****
OK, so to put it more accurately: these are not "all" of the statistics I'm aware of. They're just the ones that I can remember off the top of my head right now.
*Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections.
**Recalculated in O(1) for a sorted, indexed collection only.
***Requires a fairly complex data structure to recalculate in O(1).
****This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion. In other scenarios, I'm not sure.
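To illustrate the less obvious entries: count, sum, mean, variance, and standard deviation can all be maintained together. Below is a minimal sketch in Python based on Welford's online algorithm, extended with the standard reverse update for removals (the class and method names are mine, and remove assumes the value really was added earlier):

import math

class RunningStats:
    def __init__(self):
        self.n = 0        # count
        self.mean = 0.0   # running mean (the sum is just n * mean)
        self.m2 = 0.0     # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def remove(self, x):
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        old_mean = (self.n * self.mean - x) / (self.n - 1)
        self.m2 -= (x - self.mean) * (x - old_mean)
        self.n -= 1
        self.mean = old_mean

    def variance(self):   # population variance
        return self.m2 / self.n if self.n else 0.0

    def stddev(self):
        return math.sqrt(self.variance())

Skewness and kurtosis can be maintained the same way by carrying the third and fourth central moments alongside m2.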
Original Question
Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could...
Option 1: Iterate through the collection, adding all the values:
double sum = 0.0;
for (int i = 0; i < values.Count; i++) sum += values[i];
Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum:
void Add(double value) {
    values.Add(value);
    sum += value;
}

void Remove(double value) {
    values.Remove(value);
    sum -= value;
}
EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation:
Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I say, "What's the sum?", you'd have to think to yourself, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said, "11" you thought "11", when I said "16", you thought, "27," and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," if you've been keeping track of the sum inside your head you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and then summing up 11 + 13 + 12.
So my question is, what other calculations, other than the obvious ones like sum and average, are like this?
SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again.
So I think this gets to the heart of my question: if we can divide kinds of statistics into these categories, those that are maintainable (my own term, maybe there's a more official one somewhere) versus those that require iteration to compute any time a collection is changed, what are all the maintainable ones?
What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having even seen all of the input data; the maintainable statistics I am seeking will certainly have seen all the data, they just don't need to iterate through it all over again whenever it changes.

First, the term that you want here is online algorithm. All moments (mean, standard deviation, skew, etc.) can be calculated online. Others include the minimum and maximum. Note that the median and mode cannot be calculated online.

To maintain the high/low consistently, you store your data in sorted order. There are algorithms for maintaining data structures which preserve ordering.
Median is trivial if the data is ordered.
If the data is reduced slightly to a frequency table, you can maintain the mode. If you keep your data as a random, flat list of values, you can't easily compute the mode in the presence of change.
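As a sketch of that frequency-table idea, here is one way to keep the mode updatable in O(1) per change, in Python (the count-of-counts buckets and all names are mine; this is the "fairly complex data structure" hinted at in the update above):

from collections import defaultdict

class ModeTracker:
    def __init__(self):
        self.freq = defaultdict(int)     # value -> occurrence count
        self.buckets = defaultdict(set)  # count -> values occurring that many times
        self.max_count = 0

    def add(self, x):
        c = self.freq[x]
        if c:
            self.buckets[c].discard(x)
        self.freq[x] = c + 1
        self.buckets[c + 1].add(x)
        if c + 1 > self.max_count:
            self.max_count = c + 1

    def remove(self, x):
        c = self.freq[x]
        self.buckets[c].discard(x)
        if c > 1:
            self.freq[x] = c - 1
            self.buckets[c - 1].add(x)
        else:
            del self.freq[x]
        # the maximum count can only drop by one per removal, so this is O(1)
        if self.max_count == c and not self.buckets[c]:
            self.max_count -= 1

    def mode(self):
        if self.max_count == 0:
            return None
        return next(iter(self.buckets[self.max_count]))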

The answers to this question on online algorithms might be useful. As for how they fit your needs: some online algorithms only estimate summary statistics from partial data, while others maintain exact statistics over a data stream, which is just what you're after.
You might also want to look at complex event processing (or CEP), which is used for tracking and analysing real time data, for example in finance or web commerce. The only free CEP product I know of is Esper.

As Jason says, you are indeed describing an online algorithm. I've also seen this type of computation referred to as the Accumulator Pattern, whether the loop is implemented explicitly or by recursion.

Not really a direct answer to your question, but for many statistics that are not online statistics, you can usually find rules that let you recalculate by iteration only some of the time and serve a cached value the rest of the time. Is this possibly good enough for you?
For the high value, for example:
public void Add(double value) {
    values.Add(value);
    if (value > highValue)
        highValue = value;
}

public void Remove(double value) {
    values.Remove(value);
    // WithinTolerance stands in for an approximate floating-point
    // equality test; only removing the current high value forces
    // the O(n) recalculation.
    if (value.WithinTolerance(highValue))
        highValue = RecalculateHighValueByIteration();
}

It's not possible to maintain high or low with constant-time add and remove operations because that would give you a linear-time sorting algorithm. You can use a search tree to maintain the data in sorted order, which gives you logarithmic-time minimum and maximum. If you also keep subtree sizes and the count, it's simple to find the median too.
And if you just want to maintain the high or low in the presence of additions and removals, look into priority queues, which are more efficient for that purpose than search trees.
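For the median under additions only, the classic two-heap trick is a lighter-weight alternative to a full search tree: O(log n) per addition and O(1) to read the median. A Python sketch (removals are deliberately omitted; they would need lazy deletion or the order-statistic tree described above):

import heapq

class RunningMedian:
    def __init__(self):
        self.lo = []  # max-heap (values negated) holding the smaller half
        self.hi = []  # min-heap holding the larger half

    def add(self, x):
        if not self.lo or x <= -self.lo[0]:
            heapq.heappush(self.lo, -x)
        else:
            heapq.heappush(self.hi, x)
        # rebalance so that len(lo) equals len(hi) or len(hi) + 1
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) == len(self.hi):
            return (-self.lo[0] + self.hi[0]) / 2.0
        return -self.lo[0]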

If you don't know the exact size of the dataset in advance, or if it is potentially unlimited, or you just want some ideas, you should definitely look into the techniques used in Streaming Algorithms.

It does sound (even after your second edit) like you are describing online algorithms, with the additional requirement that you want to allow "delete" operations. An example of this is the "sketch algorithms" used for finding frequent items in a stream.

Related

Data structure/Algorithm to manage non-overlapping ranges of values?

I'm working on a system that has tens of thousands of flags for a user. The flags are all sequential in number, 0 through X, whatever X ends up being. X is expected to grow over time. And we're expecting to have lots and lots of users as well.
Our primary concerns are:
Being able to quickly test whether the user has set any given flag.
Being able to quickly set a flag.
Being able to optimize the data storage to as small a size as possible.
With 10k flags, we're looking at around 1k of data per user, in memory, if we use a bit vector. Which might be too much. And to make matters worse, this is in Javascript, being stored in a document database serialized as JSON, which means that we have several storage options, none of which I particularly like.
Store flags as the JSON output of the Uint32Array object. Looks like: '{"0":10,"1":4294967295}'. Unfortunately this needs an average of 17 bytes per 4 bytes stored as the flags approach their filled state, which is over 4x the memory, and leads to about 5k of memory when serialized. This is not ideal.
Perform our own serialization of the JSON, using base64 to avoid the bloated sizes of the numbers-as-strings approach. Unfortunately that adds an extra processing step to the JSON input/output phases, which complicates things because now we have to modify our data during the process, and it will slow everything down.
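For a sense of scale, the base64 route is only a few lines; here is a sketch in Python rather than JavaScript, just to show the arithmetic (function names are mine):

import base64

def pack_flags(flag_indices, num_flags):
    # one bit per flag: 10,000 flags -> 1,250 bytes
    buf = bytearray((num_flags + 7) // 8)
    for i in flag_indices:
        buf[i >> 3] |= 1 << (i & 7)
    return buf

def serialize(buf):
    # 1,250 bytes -> roughly 1.7k of base64, vs. ~5k for numbers-as-strings
    return base64.b64encode(bytes(buf)).decode('ascii')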
So... putting aside the bitvector idea for a bit. I was wondering if there's possibly a better approach. I considered using an "array of ranges", something like:
[{"m":0,"x":100},{"m":102},{"m":108,"x":204}]
We can make a few assumptions about the data in this system, which is what led me to this approach:
Flags are never un-set. Once it's set, it will remain set.
Flags are generally clustered. If flag X is set, there's a huge probability that X-1 and X+1 will be set as well.
Flags will generally be set at increasing index values. If Flag X is being set, then X-1 is more likely to be set than X+1, and X+1 is likely to be set fairly soon afterwards.
So because of these conditions, I think storing an array of range objects might be the optimal solution. That way, over time, the user's flags eventually condense down into one large range entry. The optimal case is of course:
[{"m":0,"x":10000}]
The worst case scenario, of course, is if they somehow manage to find themselves in a state where they set every other flag.
[{"m":0},{"m":2},{"m":4},{"m":6},{"m":8},{"m":10}...{"m":10000}]
That would be bad. Far worse than the bitvector solution, I think. But we're pretty confident that won't happen.
As for the ability to quickly decide whether a flag is set: that's simply an O(log n) binary search (since the array will be sorted). Just find the range object closest to your number, check whether your number is in that range, and return.
Insertions are trickier. It'll still be a binary search, but now we're modifying the array.
One adjacent sibling insert: the optimal scenario. We find a range whose min or max is one off from the number we're inserting, and simply decrement or increment the value on the current range. O(1).
No adjacent siblings insert: simply insert a new node with the min set. O(n), because we'll be moving everything after it in the array downwards.
Two adjacent siblings insert: change the max to the max value of the right sibling range, delete the right sibling range from the array, and shift everything after it to the left. O(n). (A sketch of all three cases follows below.)
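Here is a sketch of the lookup and the three insertion cases, in Python for brevity (ranges is kept as a sorted list of (min, max) pairs; has_flag and set_flag are illustrative names):

import bisect

def has_flag(ranges, n):
    # O(log n): find the last range whose min is <= n, then test membership
    i = bisect.bisect_right(ranges, (n, float('inf'))) - 1
    return i >= 0 and ranges[i][0] <= n <= ranges[i][1]

def set_flag(ranges, n):
    if has_flag(ranges, n):
        return
    i = bisect.bisect_left(ranges, (n, n))
    left = i > 0 and ranges[i - 1][1] == n - 1
    right = i < len(ranges) and ranges[i][0] == n + 1
    if left and right:   # two adjacent siblings: merge, O(n) shift
        ranges[i - 1] = (ranges[i - 1][0], ranges[i][1])
        del ranges[i]
    elif left:           # one adjacent sibling: extend in place, O(1)
        ranges[i - 1] = (ranges[i - 1][0], n)
    elif right:          # one adjacent sibling on the right: O(1)
        ranges[i] = (n, ranges[i][1])
    else:                # no adjacent siblings: insert a new range, O(n)
        ranges.insert(i, (n, n))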
So cases 2+3 have me wondering if I shouldn't try to use some sort of balanced binary search tree for this. A Red-Black tree, for example.
Is that worth the bother? Am I overthinking this?

Will alpha-beta pruning remove randomness in my solution with minimax?

Existing implementation:
In my implementation of Tic-Tac-Toe with minimax, I look for all boxes where I can get the best result and choose one of them randomly, so that the same solution isn't displayed each time.
For example, if the returned list is [1, 0, 1, -1], at some point I will randomly choose between the two highest values.
Question about Alpha-Beta Pruning:
Based on what I understood, when the algorithm finds that it is winning from one path, it no longer needs to look at other paths that might or might not lead to a winning case.
So will this, as I suspect, cause the earliest possible box that leads to the best solution to be displayed as the result, and look the same each time? For example, at the time of the first move, all moves lead to a draw. So will the first box be selected each time?
How can I bring randomness to the solution like with the minimax solution? One way I can think of would be to pass the indices to the alpha-beta algorithm in random order. The result would then be the first best solution in that randomly ordered list of positions.
Thanks in advance. If there is some literature on this, I'd be glad to read it.
If someone could post a good reference for alpha-beta pruning, that would be excellent, as I had a hard time understanding how to apply it.
To randomly pick among multiple best solutions (all equal) in alpha-beta pruning, you can modify your evaluation function to add a very small random number whenever you evaluate a game state. You should just make sure that the magnitude of that random number is never greater than the true difference between the evaluations of two states.
For example, if the true evaluation function for your game state can only return values -1, 0, and 1, you could add a randomly generated number in the range [0.0, 0.01] to the evaluation of every game state.
Without this, alpha-beta pruning doesn't necessarily find only one solution. Consider this example from Wikipedia. In the middle, you see that two solutions with an evaluation of 6 were found, so it can find more than one. I do actually think it will still find all moves leading to optimal solutions at the root node, but not actually find all solutions deep down in the tree. Suppose, in the example image, that the pruned node with a score of 9 in the middle actually had a score of 6. It would still get pruned there, so that particular solution wouldn't be found, but the move from the root node leading to it (the middle move at the root) would still be found. So, eventually, you would be able to reach it.
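A minimal sketch of that tie-breaking idea, in Python for illustration (the 0.01 magnitude assumes true scores of -1, 0, and 1 as in the example above; jittered is a made-up helper you would apply wherever your leaf evaluation runs):

import random

JITTER = 0.01  # must stay strictly below the smallest gap between true scores (here 1)

def jittered(true_score):
    # equally good moves now differ by a tiny random amount, so alpha-beta
    # ends up preferring a random one of them on each run
    return true_score + random.uniform(0.0, JITTER)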
Some interesting notes:
This implementation would also work in minimax, and avoid the need to store a list of multiple (equally good) solutions
In more complex games than Tic Tac Toe, where you cannot search the complete state space, adding a small random number for the max player and deducting a small random number for the min player like this may actually slightly improve your heuristic evaluation function. The reason for this is as follows. Suppose in state A you have 5 moves available, and in state B you have 10 moves available, which all result in the same heuristic evaluation score. Intuitively, the successors of state B may be slightly better, because you had more moves available; in many games, having more moves available means that you are in a better position. Because you generated 10 random numbers for the 10 successors of state B, it is also a bit more likely that the highest generated random number is among those 10 (instead of the 5 numbers generated for successors of A)

Optimal DB query for prefix search

I have a dataset which is a list of prefix ranges, and the prefixes aren't all the same size. Here are a few examples:
low: 54661601 high: 54661679 "bin": a
low: 526219100 high: 526219199 "bin": b
low: 4305870404 high: 4305870404 "bin": c
I want to look up which "bin" corresponds to a particular value with the corresponding prefix. For example, value 5466160179125211 would correspond to "bin" a. In the case of overlaps (of which there are few), we could return either the longest prefix or all prefixes.
The optimal algorithm is clearly some sort of tree into which the bin objects could be inserted, where each successive level of the tree represents more and more of the prefix.
The question is: how do we implement this (in one query) in a database? It is permissible to alter/add to the data set. What would be the best data & query design for this? An answer using mongo or MySQL would be best.
If you make a mild assumption about the number of overlaps in your prefix ranges, it is possible to do what you want optimally using either MongoDB or MySQL. In my answer below, I'll illustrate with MongoDB, but it should be easy enough to port this answer to MySQL.
First, let's rephrase the problem a bit. When you talk about matching a "prefix range", I believe what you're actually talking about is finding the correct range under a lexicographic ordering (intuitively, this is just the natural alphabetic ordering of strings). For instance, the set of numbers whose prefix matches 54661601 to 54661679 is exactly the set of numbers which, when written as strings, are lexicographically greater than or equal to "54661601", but lexicographically less than "54661680". So the first thing you should do is add 1 to all your high bounds, so that you can express your queries this way. In mongo, your documents would look something like
{low: "54661601", high: "54661680", bin: "a"}
{low: "526219100", high: "526219200", bin: "b"}
{low: "4305870404", high: "4305870405", bin: "c"}
Now the problem becomes: given a set of one-dimensional intervals of the form [low, high), how can we quickly find which interval(s) contain a given point? The easiest way to do this is with an index on either the low or high field. Let's use the high field. In the mongo shell:
db.coll.ensureIndex({high : 1})
For now, let's assume that the intervals don't overlap at all. If this is the case, then for a given query point "x", the only possible interval containing "x" is the one with the smallest high value greater than "x". So we can query for that document and check if its low value is also less than "x". For instance, this will print out the matching interval, if there is one:
db.coll.find({high : {'$gt' : "5466160179125211"}}).sort({high : 1}).limit(1).forEach(
    function(doc){ if (doc.low <= "5466160179125211") printjson(doc) }
)
Suppose now that instead of assuming the intervals don't overlap at all, you assume that every interval overlaps with less than k neighboring intervals (I don't know what value of k would make this true for you, but hopefully it's a small one). In that case, you can just replace 1 with k in the "limit" above, i.e.
db.coll.find({high : {'$gt' : "5466160179125211"}}).sort({high : 1}).limit(k).forEach(
    function(doc){ if (doc.low <= "5466160179125211") printjson(doc) }
)
What's the running time of this algorithm? The indexes are stored using B-trees, so if there are n intervals in your data set, it takes O(log n) time to lookup the first matching document by high value, then O(k) time to iterate over the next k documents, for a total of O(log n + k) time. If k is constant, or in fact anything less than O(log n), then this is asymptotically optimal (this is in the standard model of computation; I'm not counting number of external memory transfers or anything fancy).
The only case where this breaks down is when k is large, for instance if some large interval contains nearly all the other intervals. In this case, the running time is O(n). If your data is structured like this, then you'll probably want to use a different method. One approach is to use mongo's "2d" indexing, with your low and high values codifying x and y coordinates. Then your queries would correspond to querying for points in a given region of the x - y plane. This might do well in practice, although with the current implementation of 2d indexing, the worst case is still O(n).
There are a number of theoretical results that achieve O(log n) performance for all values of k. They go by names such as Priority Search Trees, Segment trees, Interval Trees, etc. However, these are special-purpose data structures that you would have to implement yourself. As far as I know, no popular database currently implements them.
"Optimal" can mean different things to different people. It seems that you could do something like save your low and high values as varchars. Then all you have to do is
select bin from datatable where '5466160179125211' between low and high
Or if you had some reason to keep the values as integers in the table, you could do the CASTing in the query.
I have no idea whether this would give you terrible performance with a large dataset. And I hope I understand what you want to do.
With MySQL you may have to use a stored procedure, which you call to map a value to its bin. The procedure would query the list of buckets for each row and do arithmetic or string ops to find the matching bucket. You could improve this design by using fixed-length prefixes arranged in a fixed number of layers: assign a fixed depth to your tree and give each layer its own table. You won't get tree-like performance with either of these approaches.
If you want to do something more sophisticated, I suspect you have to use a different platform.
SQL Server has a Hierarchy data type:
http://technet.microsoft.com/en-us/library/bb677173.aspx
PostgreSQL has a cidr data type. I'm not familiar with the level of query support it has, but in theory you could build a routing table inside of your db and use that to assign buckets:
http://www.postgresql.org/docs/7.4/static/datatype-net-types.html#DATATYPE-CIDR
Peyton! :)
If you need to keep everything as integers, and want it to work with a single query, this should work:
select bin from datatable where 5466160179125211 between
low*pow(10, floor(log10(5466160179125211))-floor(log10(low)))
and ((high+1)*pow(10, floor(log10(5466160179125211))-floor(log10(high)))-1);
In this case, it would search between the numbers 5466160100000000 (the lowest number with the low prefix and the same number of digits as the number to find) and 5466167999999999 (the highest number with the high prefix and the same number of digits as the number to find). This should still work in cases where the high prefix has more digits than the low prefix. It should also work (I think) in cases where the number is shorter than the length of the prefixes, where the varchar code in the previous solution can give incorrect results.
You'll want to experiment to compare the performance of having a lot of inline math in the query (as in this solution) vs. the performance of using varchars.
Edit: Performance seems to be really good either way even on big tables with no indexes; if you can use varchars then you might be able to further boost performance by indexing the low and high columns. Note that you'd definitely want to use varchars if any of the prefixes have initial zeroes. Here's a fix to allow for the case where the number is shorter than the prefix when using varchars:
select * from datatable2 where '5466' between low and high
and length('5466') >= length(high);

Is TimeSpan unnecessary?

EDIT 2009-Nov-04
OK, so it's been a little while since I first posted this question. It seems to me that many of the initial responders failed to really get what I was saying--a common response was some variation on "What you're saying doesn't make any sense"--and so I've made some handy diagrams to really illustrate my point.
When we speak of numbers, we are generally referring to points on what grade school children learn is called the Number Line:
Now, when we learn arithmetic, our minds learn to perform a very interesting transformation of this concept. Evaluating the expression 1 + 0.5, for example, if we simply applied our "number line thinking", would require us to somehow make sense of this:
It's difficult to really illustrate that, because it's difficult to think about that: "adding" two points. This is where a lot of responders struggled with the idea of adding dates (or simply dismissed it as absurd), because they were thinking of dates as points.
However, the expression 1 + 0.5 does make sense to us, because when we think of it, we're really imagining this:
That is, the number (or point) 1, plus the vector 0.5, resulting in point 1.5.
Alternately, we may be imagining this:
That is, the vector 1, plus the vector 0.5, resulting in the vector 1.5.
In other words, when dealing with numbers, we treat points and vectors interchangeably. But what about dates? Dates are, after all, basically numbers. If you don't believe me, compare this line to the number line above:
Notice the correspondence between the timeline and the number line? This was my point: if we perform the transformation above with numbers, we ought to be able to do it with dates as well. So, applying "timeline thinking", the expression 0001-Jan-02 00:00:00 + 0001-Jan-01 12:00:00 doesn't make a lot of sense, as plenty of responders pointed out:
But, if we do the same conceptual transformation in our head that we perform every time we add or subtract numbers, we can easily "rethink" the above as this:
So clearly, the difference between a DateTime and a TimeSpan is the same difference that exists between a point and a vector. What I think caused a lot of people to respond negatively to my suggestion is that it just feels so unnatural to think of dates as magnitudes in this way. But I don't buy the argument that there's no obvious reference point to use as zero. There is an obvious reference point, and I'll give you a hint where it is: about 2010 years ago.
Don't get me wrong: I'm not questioning the usefulness of drawing a conceptual divide between the notion of a DateTime and a TimeSpan. Really, my question all along should have been (as ChrisW indirectly suggested), why do we treat numbers and vectors interchangeably when dealing with regular numeric types? (Or: why do we have just one int type, instead of int and intspan?) There's a big difference, and yet we don't ever really think about it until sometime in junior high or high school, when we begin geometry. And then it's treated as this new mathematical concept, when in reality it's something we've been utilizing ever since we learned to add numbers by counting with our fingers.
In the end, the best answer came from Strilanc, who pointed out that the use of DateTime and TimeSpan is really an implementation of an affine space, which has the convenient property of not needing a reference point to treat as the origin. So thanks, Strilanc. I'm giving the accepted answer to ChrisW, however, for being the first one to bring up the concept of vectors and points, which really got to the crux of the matter.
ORIGINAL QUESTION (for posterity)
I am certainly no programming jack of all trades, but I know both PHP and .NET have a TimeSpan class in addition to a DateTime class (or structure in .NET), and I am guessing this is the case in a variety of other languages and frameworks as well (though I am writing this primarily with reference to the .NET structures). This might seem a strange question, but isn't TimeSpan redundant?
In case you think the answer is obvious ("A DateTime is an absolute point in time, while a TimeSpan is a range of time -- simple as that!"), consider this: an integer can be conceptualized as either an absolute value (the point on the number line) or a distance between values--and we don't need two separate data types for these different conceptualizations. I can still write 5 + 6 without any ambiguity as to what I mean.
As long as there is a consistent zero-point reference, it seems to me there should be no reason why one would need a TimeSpan object to perform arithmetic operations on DateTime objects, or to get the distance between them.
What am I missing? Why can't the unique methods and properties of the TimeSpan structure simply be folded into DateTime?
(Disclaimer: It isn't like I'm passionate about this or anything; I'm fine using DateTime and TimeSpan objects as they're intended all the time. I'm just asking a question.)
EDIT: Okay, over-simplified example to illustrate my point:
Consider the equation 10 - 5 = 5. One could read this as "Start at 10 (value), move 5 to the left (span), and you end up at 5 (value)."
Suppose, just to make things easy, we let January 1 1900 be point zero and we define TimeSpan objects in terms of days only.
Then 10 - 5 = 5 could be understood, in DateTime terms, as January 11 1900 - January 6 1900 = January 6 1900. This is fine, because January 11 is just "10" by our definition and January 6 is "5". The fact that we are viewing the 10 as a value, the first 5 as a span, and the last 5 as a value again is merely for our own conceptual benefit. My point is just this: that the only difference is in how you think of the number, not in what it actually is. This is why we don't have separate structures for, say, integer values and integer spans -- a plain old integer covers all our bases.
Am I making any sense?
consider this: an integer can be conceptualized as either an absolute value (the point on the number line) or a distance between values
By your logic, it isn't TimeSpan that's unnecessary: rather it's DateTime that's unnecessary, and could be replaced by TimeSpan (duration since zero).
Plus, there's the fact that integers have an obvious zero, whereas dates don't; and an obvious zero is necessary if you want to replace "place on the number line" with "distance/span from the zero/origin".
Edit:
A point (location on a plane) isn't the same as a vector.
They seem similar ...
A vector (distance from origin) can represent a point
A point (relative to the origin) can represent a vector
... however the value of the vector that's required to represent a given point will change if the origin changes.
It always makes sense to add two (relative) vectors; but, it makes no sense to add two points, except by converting those points to vectors and then adding the vectors.
The sum of two vectors is unaffected by a change in the origin, but the sum of two points would be affected by a change in the origin if you summed them by converting them to vectors and adding the vectors (because changing the origin would affect the values of those vectors).
[Replace 'point' with DateTime and 'vector' with TimeSpan in the argument above.]
I think there is a genuine difference between absolute and relative values. I don't know why that difference isn't more apparent in arithmetic, i.e., why "numbers" are used seemingly interchangeably to represent both absolute and relative values.
(Speaking as a mathematician) It's because arithmetic operations on a "date" aren't closed or well defined, necessitating an additional structure.
For example, January 1, 2000 - December 1, 1999 = ...? We know there are 31 days between them, but if this were interpreted as a date, then the answer is Epoch (i.e., zero) + 31 days. This is not a valid "date" anymore.
Similarly, not all arithmetic operations on integers are well defined (1 / 2 has no answer in the integers; integer math returns zero here, but 0 * 2 = 0, not 1 as you would expect). This necessitates an additional structure that we call fractions.
Just because you can define an operation doesn't mean you should. For example, one of the reasons division by zero is undefined is because defining it would require sacrificing some very useful properties of arithmetic (e.g., associativity).
The distinction between a timespan and a date comes down to addition. It makes sense to add two timespans, but it doesn't make sense to add two dates unless you have an arbitrary reference date. By not allowing addition of dates, you abstract away that arbitrary reference date. I don't know what date '0' is in .Net, and I've never needed to know. Isn't that nice?
Adding two dates is almost always a bug (seriously, try to think of where this makes sense outside of numerology). By introducing timespans (creating an Affine Space) you eliminate a whole class of bugs.
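Python's standard library happens to draw exactly this point/vector split with datetime and timedelta, which makes the affine-space behavior easy to see (an illustration of the concept, not .NET code):

from datetime import datetime, timedelta

here = datetime(2009, 11, 4)
party = here + timedelta(days=7)  # point + vector -> point: fine
gap = party - here                # point - point -> vector: fine
assert gap == timedelta(days=7)

try:
    here + party                  # point + point: rejected by the type system
except TypeError as e:
    print(e)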
One reason is that splitting the types prevents a class of bugs where you think you have a relative time but really have an absolute time, and vice versa. For example, addition of two absolute times can be flagged as a compiler error if the two types are separate.
Also, IntelliSense (and discovery for newbies) works better when the number of members is smaller-- by splitting methods between the two types, working with each gets easier.
Asked the other way round: what would the benefit of weakening the type system in that regard be?
It's all a question of cost vs. benefit, and DateTime has the great benefit of reducing bugs due to illogical date/time calculations by forbidding such actions. DateTime exists for very much the same reasons that a strict type-checking system exists in the first place: to make semantic errors in the code produce compile-time messages that notify programmers of errors in their code.
Conversely, there's the cost of having DateTime: zilch.
Now consider dropping DateTime. What would we gain?
To answer your question directly: "isn't TimeSpan redundant?" Absolutely not: it reduces bugs. It definitely has, for me.
Think about it conceptually. If I tell you that I'm having a party 7 days from now, is "7 days" in that sentence a date? Could I just say my party is on "7 days"? Of course not, because 7 days isn't a date. One of the key ideas of object-oriented programming is to represent concepts like this in the system as types. It's true that we could represent everything as an integer (and in fact, many people have and do), but in object-oriented programming we have the notion of types of items, and their behaviors and properties, and in that sense it makes sense to have an object that expresses this.
I think you could make the opposite argument that DateTime is redundant, and we should only have TimeSpan :)
Seriously, all dates really are just time spans. They are all relative to some starting point. Technically, there is no "year zero" in the Christian calendar (since you can't really have a "zeroth year of our lord"), but if we assign 12:00 A.M. January 1, 0001 B.C. as the "zero point", then every date that comes after (or before) can be thought of as relative to that date. So, 12:00 A.M. on September 19, 2009 would have a TimeSpan of 734033 days.
So, mathematically, DateTime and TimeSpan are redundant. But when we write code, we are attempting to communicate much more than just abstract mathematical constructs. Any given DateTime instance may in fact just be a time span relative to some arbitrary zero point, but to most people reading your code, it will imply a particular point on the calendar. Similarly, a TimeSpan implies the gap between two points on the calendar.
In this case, Microsoft has chosen to be clear rather than parsimonious. I can't say I disagree with the decision.
There are a lot of complications in dates, for example:
leap years
leap seconds
the 1582 change to the Gregorian calendar
the fact that there is no year zero
differences in the lengths of months
Treating Dates and TimeSpans as different things means that these kinds of issues are much less likely to confuse you in practice.
It's sugar, no more and no less.

Contains Test for a Constant Set

The problem statement:
Given a set of integers that is known in advance, generate code to test if a single integer is in the set. The domain of the testing function is the integers in some consecutive range.
Nothing in particular is known up front about the range or the set to be tested. The range could be small or huge (a solution may reject problems that are too big, but higher limits are better). It could be that very few of the values in the allowed range are in the set, or most of them are, or anything in between. The set may be uniformly distributed or clustered. There may be large sections of only contained/not-contained values, or there may be at least a few of each type of value in most swaths (sort of like the assumptions made about items to be sorted when analyzing sorting algorithms).
The objective is a procedure for generating effective code for running the test.
Partial solutions that come to mind include
perfect hash function (costly for large sets)
range tests: foreach(b in ranges) if(b.l <= v && v <= b.h) return true;
trees/indexes (more costly than others in some cases)
table lookup (costly for large sets)
the inverse of any of these (kudos to Jason S)
It seems that an ideal solution would be able to pick whichever option is best or, if none works well, use a tree to break the full range down into sections and then switch to other options for subsections that are better suited to them.
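As a sketch of the range-test option driven by a binary search (Python, with illustrative names): condense the sorted set into maximal runs once, and each membership test is then O(log r) in the number of runs:

import bisect

def to_ranges(values):
    # collapse a set of integers into maximal consecutive runs
    ranges = []
    for v in sorted(set(values)):
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)
        else:
            ranges.append((v, v))
    return ranges

def make_contains(values):
    ranges = to_ranges(values)
    lows = [lo for lo, _ in ranges]
    def contains(x):
        i = bisect.bisect_right(lows, x) - 1
        return i >= 0 and ranges[i][0] <= x <= ranges[i][1]
    return contains

test = make_contains({1, 2, 3, 10, 11, 25})
assert test(11) and not test(12)

A code generator would emit the equivalent comparisons directly (or as a nested if/else tree), so the generated test can be at least this fast with no interpretive overhead.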
Topic(s) that might be useful include:
Huffman coding
Note: this is not homework. If it were issued as homework below the doctoral level, the prof should be shot with a Nerf gun (if you don't get that, re-read the problem; it is very much nontrivial).
Note: This is a problem that occurred to me a few days ago and I've been puzzling over it off and on. I have no direct use for this but thought it would be a cool problem to attack. The reason that I want to generate the code is that generated code will be no slower than general code (it can be the same thing if needed) and might be faster in some/many cases.
I'm posting this question as much to clarify my thoughts as anything. If I can come up with any reasonable or cool solutions, I plan on implementing them as a template metaprogram (the other reason for generated code).
Some people have noted that the problem is very general. That is the point. I'm hoping to generate a system that would work on a very general domain: sets of integers in some range.
A previous question on dictionary/spellchecking had a number of responses that mentioned Bloom filters; maybe that would help.
I would think that testing for large sets is going to be expensive no matter what.
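For reference, a Bloom filter answers "possibly in the set" or "definitely not in the set": no false negatives, but a tunable false-positive rate, in very little space. A minimal Python sketch with arbitrarily chosen parameters:

import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits >> 3)

    def _probes(self, item):
        # carve k 32-bit probe positions out of one 256-bit digest
        h = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(h[4 * i:4 * i + 4], 'big') % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        # false-positive rate depends on m, k, and how many items were added
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._probes(item))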
Let's pretend, for a moment, that this is a real question:
there are no limits on the size of the base set or the input set
This makes the "problem" unrealistic, underspecified, and unsolvable in any practical sense.
If someone wants to posit a solution, here are some unit test cases:
unit test 1:
the base set is all integers between -1,000,000,000,000 and +1,000,000,000,000 except for 100,000,000,000 randomly-removed values
the input set is 100,000,000,000 randomly-generated integers in the same range
unit test 2:
the base set is the Fibonacci series
the input set is 1T randomly-generated integers in the range 0..infinity
There's also boost::dynamic_bitset; I'm not sure how it scales in time, or in space with respect to the distribution of the original numbers (e.g., if the bits are stored in chunks of 8/16/32/64, then sparse bitsets are inefficient).
Or perhaps this (compressed bit set) or this (bit vector) webpage (I googled for "large sparse bit sets" and "compressed bit sets").