Is TimeSpan unnecessary? - language-agnostic

EDIT 2009-Nov-04
OK, so it's been a little while since I first posted this question. It seems to me that many of the initial responders failed to really get what I was saying--a common response was some variation on "What you're saying doesn't make any sense"--and so I've made some handy diagrams to really illustrate my point.
When we speak of numbers, we are generally referring to points on what grade school children learn is called the Number Line:
Now, when we learn arithmetic, our minds learn to perform a very interesting transformation of this concept. Evaluating the expression 1 + 0.5, for example, if we simply applied our "number line thinking", would require us to somehow make sense of this:
It's difficult to really illustrate that, because it's difficult to think about that: "adding" two points. This is where a lot of responders struggled with the idea of adding dates (or simply dismissed it as absurd), because they were thinking of dates as points.
However, the expression 1 + 0.5 does make sense to us, because when we think of it, we're really imagining this:
That is, the number (or point) 1, plus the vector 0.5, resulting in point 1.5.
Alternately, we may be imagining this:
That is, the vector 1, plus the vector 0.5, resulting in the vector 1.5.
In other words, when dealing with numbers, we treat points and vectors interchangeably. But what about dates? Dates are, after all, basically numbers. If you don't believe me, compare this line to the number line above:
Notice the correspondence between the timeline and the number line? This was my point: if we perform the transformation above with numbers, we ought to be able to do it with dates as well. So, applying "timeline thinking", the expression 0001-Jan-02 00:00:00 + 0001-Jan-01 12:00:00 doesn't make a lot of sense, as plenty of responders pointed out:
But, if we do the same conceptual transformation in our head that we perform every time we add or subtract numbers, we can easily "rethink" the above as this:
So clearly, the difference between a DateTime and a TimeSpan is the same difference that exists between a point and a vector. What I think caused a lot of people to respond negatively to my suggestion is that it just feels so unnatural to think of dates as magnitudes in this way. But I don't buy the argument that there's no obvious reference point to use as zero. There is an obvious reference point, and I'll give you a hint where it is: about 2010 years ago.
Don't get me wrong: I'm not questioning the usefulness of drawing a conceptual divide between the notion of a DateTime and a TimeSpan. Really, my question all along should have been (as ChrisW indirectly suggested), why do we treat numbers and vectors interchangeably when dealing with regular numeric types? (Or: why do we have just one int type, instead of int and intspan?) There's a big difference, and yet we don't ever really think about it until sometime in junior high or high school, when we begin geometry. And then it's treated as this new mathematical concept, when in reality it's something we've been utilizing ever since we learned to add numbers by counting with our fingers.
In the end, the best answer came from Strilanc, who pointed out that the use of DateTime and TimeSpan is really an implementation of an affine space, which has the convenient property of not needing a reference point to treat as the origin. So thanks, Strilanc. I'm giving the accepted answer to ChrisW, however, for being the first one to bring up the concept of vectors and points, which really got to the crux of the matter.
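For what it's worth, this point/vector split is exactly what the operator overloads on the .NET types encode; here is a small sketch (the dates are arbitrary):
using System;

class AffineDemo
{
    static void Main()
    {
        DateTime point = new DateTime(2009, 11, 4);   // a point on the timeline
        TimeSpan vector = TimeSpan.FromHours(36);     // a displacement along it

        DateTime later = point + vector;              // point + vector = point
        TimeSpan gap   = later - point;               // point - point = vector
        TimeSpan twice = vector + vector;             // vector + vector = vector

        // DateTime nonsense = point + later;         // point + point: does not compile

        Console.WriteLine(later);                     // Nov 5, 2009, 12:00 (format is culture-dependent)
        Console.WriteLine(gap);                       // 1.12:00:00
        Console.WriteLine(twice);                     // 3.00:00:00
    }
}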
ORIGINAL QUESTION (for posterity)
I am certainly no programming jack of all trades, but I know both PHP and .NET have a TimeSpan class in addition to a DateTime class (or structure in .NET), and I am guessing this is the case in a variety of other languages and frameworks as well (though I am writing this primarily with reference to the .NET structures). This might seem a strange question, but isn't TimeSpan redundant?
In case you think the answer is obvious ("A DateTime is an absolute point in time, while a TimeSpan is a range of time -- simple as that!"), consider this: an integer can be conceptualized as either an absolute value (the point on the number line) or a distance between values--and we don't need two separate data types for these different conceptualizations. I can still write 5 + 6 without any ambiguity as to what I mean.
As long as there is a consistent zero-point reference, it seems to me there should be no reason why one would need a TimeSpan object to perform arithmetic operations on DateTime objects, or to get the distance between them.
What am I missing? Why can't the unique methods and properties of the TimeSpan structure simply be folded into DateTime?
(Disclaimer: It isn't like I'm passionate about this or anything; I'm fine using DateTime and TimeSpan objects as they're intended all the time. I'm just asking a question.)
EDIT: Okay, over-simplified example to illustrate my point:
Consider the equation 10 - 5 = 5. One could read this as "Start at 10 (value), move 5 to the left (span), and you end up at 5 (value)."
Suppose, just to make things easy, we let January 1 1900 be point zero and we define TimeSpan objects in terms of days only.
Then 10 - 5 = 5 could be understood, in DateTime terms, as January 11 1900 - January 6 1900 = January 6 1900. This is fine, because January 11 is just "10" by our definition and January 6 is "5". The fact that we are viewing the 10 as a value, the first 5 as a span, and the last 5 as a value again is merely for our own conceptual benefit. My point is just this: that the only difference is in how you think of the number, not in what it actually is. This is why we don't have separate structures for, say, integer values and integer spans -- a plain old integer covers all our bases.
Am I making any sense?

consider this: an integer can be conceptualized as either an absolute value (the point on the number line) or a distance between values
By your logic, it isn't TimeSpan that's unnecessary: rather it's DateTime that's unnecessary, and could be replaced by TimeSpan (duration since zero).
Plus there's the fact that integers have an obvious zero, whereas dates don't; and an obvious zero is necessary if you want to replace "place on the number line" with "distance/span from the zero/origin".
Edit:
A point (location on a plane) isn't the same as a vector.
They seem similar ...
A vector (distance from origin) can represent a point
A point (relative to the origin) can represent a vector
... however the value of the vector that's required to represent a given point will change if the origin changes.
It always makes sense to add two (relative) vectors; but, it makes no sense to add two points, except by converting those points to vectors and then adding the vectors.
The sum of two vectors is unaffected by a change in the origin, but the sum of two points would be affected by a change in the origin if you summed them by converting them to vectors and adding the vectors (because changing the origin would affect the values of those vectors).
[Replace 'point' with DateTime and 'vector' with TimeSpan in the argument above.]
I think there is a genuine difference between absolute and relative values. I don't know why that difference isn't more apparent in arithmetic, i.e. why 'numbers' are used seemingly interchangeably to represent both absolute and relative values.

(Speaking as a mathematician) It's because arithmetic operations on a "date" aren't closed or well defined, which necessitates an additional structure.
For example, January 1, 2000 - December 1, 1999 = ... ? We know there are 31 days between them, but if this were interpreted as a date, then the answer is Epoch (i.e., zero) + 31 days. This is not a valid "date" anymore.
Similarly, not all arithmetic operations on integers are well defined (1 / 2 has no answer in the integers -- integer math returns zero here, but 0 * 2 = 0, not 1 as you would expect). This necessitates an additional structure that we call fractions.

Just because you can define an operation doesn't mean you should. For example, one of the reasons division by zero is undefined is that defining it would require sacrificing some very useful properties of arithmetic (e.g. associativity).
The distinction between a timespan and a date comes down to addition. It makes sense to add two timespans, but it doesn't make sense to add two dates unless you have an arbitrary reference date. By not allowing addition of dates, you abstract away that arbitrary reference date. I don't know what date '0' is in .Net, and I've never needed to know. Isn't that nice?
Adding two dates is almost always a bug (seriously, try to think of where this makes sense outside of numerology). By introducing timespans (creating an Affine Space) you eliminate a whole class of bugs.

One reason is that splitting the types prevents a class of bugs where you think you have a relative time but really have an absolute time, and vice versa. For example, addition of two absolute times can be flagged as a compiler error if the two types are separate.
Also, IntelliSense (and discovery for newbies) works better when the number of members is smaller-- by splitting methods between the two types, working with each gets easier.

Asked the other way round: what would the benefit of weakening the type system in that regard be?
It's all a question of cost vs. benefit, and DateTime has the great benefit of reducing bugs due to illogical date/time calculations by forbidding such actions. DateTime exists for very much the same reasons that a strict type-checking system exists in the first place: to make semantic errors in the code produce compile-time messages that notify the programmers of errors in their code.
Conversely, there’s the cost of having DateTime: zilch.
Now consider dropping DateTime. What would we gain?
To answer your question directly: "isn't TimeSpan redundant?" Absolutely not -- it reduces bugs. It definitely has for me.

Think about it conceptually. If I tell you that I'm having a party 7 days from now, is "7 days" in that sentence a date? Could I just say my party is on "7 days"? Of course not, because "7 days" isn't a date. One of the key ideas of object oriented programming is to represent concepts like this in the system as types. It's true that we could represent everything as an integer (and in fact, many people have and do), but in object oriented programming, we have the notion of types of items, and their behaviors and properties, and in that sense, it makes sense to have an object that expresses this.

I think you could make the opposite argument that DateTime is redundant, and we should only have TimeSpan :)
Seriously, all dates really are just time spans. They are all relative to some starting point. Technically, there is no "year zero" in the Christian calendar (since you can't really have a "zeroth year of our lord"), but if we assign 12:00 A.M. January 1, 0001 B.C. as the "zero point", then every date that comes after (or before) can be thought of as relative to that date. So, 12:00 A.M. on September 19, 2009 would have a TimeSpan of 734033 days.
So, mathematically, DateTime and TimeSpan are redundant. But when we write code, we are attempting to communicate much more than just abstract mathematical constructs. Any given DateTime instance may in fact just be a time span relative to some arbitrary zero point, but to most people reading your code, it will imply a particular point on the calendar. Similarly, a TimeSpan implies the gap between two points on the calendar.
In this case, Microsoft has chosen to be clear rather than parsimonious. I can't say I disagree with the decision.
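The "every date is really a span from some zero point" idea is easy to see in .NET itself. Note that .NET's own zero point is midnight on January 1, 0001 CE (DateTime.MinValue), a year later than the origin chosen above, so its day count comes out slightly different from the figure quoted; the principle is the same:
using System;

class SpanFromZero
{
    static void Main()
    {
        DateTime date = new DateTime(2009, 9, 19);
        TimeSpan sinceZero = date - DateTime.MinValue;     // the date, viewed as a span from the origin
        Console.WriteLine(sinceZero.Days);

        // Going back the other way: origin + span reproduces the point.
        Console.WriteLine(DateTime.MinValue + sinceZero);  // September 19, 2009
    }
}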

There are a lot of complications in dates, for example:
leap years
leap seconds
the 1582 change to the Gregorian calendar
the fact that there is no such thing as 0 years
differences in the lengths of months
Treating Dates and TimeSpans as different things means that these kinds of issues are much less likely to confuse you in practice.
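As a small .NET illustration of the month-length point (the dates are arbitrary):
using System;

class CalendarQuirks
{
    static void Main()
    {
        // "One month later" depends on which month you start in...
        Console.WriteLine(new DateTime(2009, 1, 31).AddMonths(1));   // Feb 28, 2009
        Console.WriteLine(new DateTime(2009, 3, 31).AddMonths(1));   // Apr 30, 2009

        // ...so a calendar month cannot be expressed as one fixed TimeSpan,
        // whereas a TimeSpan of 30 days always means exactly 30 * 24 hours.
        Console.WriteLine(new DateTime(2009, 1, 31) + TimeSpan.FromDays(30));  // Mar 2, 2009
    }
}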

It's just sugar, nothing more, nothing less.


What data type should I use for a 'duration' attribute?

I'm using phpmyadmin/MySQL to make a database.
It's for a plane/bus/train booking system.
I have a 'depart_time' attribute which is a time data type. In the same table I have a 'duration' attribute. Later on I will need to do multiplication on this duration attribute (depending on if it is train/bus/plane).
My question is - what would be the best data type for this duration attribute?
I thought about using a decimal type - but then the values in it won't represent the time exactly (e.g. 1.30 won't represent 1 and a half hrs, it would need to be 1.50 - if that makes sense).
I also thought about using the time data type for this field as well, but I wasn't sure if multiplication would be possible on that?
I couldn't find any help after googling about multiplication on the time data type.
Hopefully this made sense; if you need any more information then feel free to ask in the comments!
Thanks in advance!
Use an int and record durations in the smallest unit you're interested in. For example, if you need minute accuracy, store one and a half hours as 90 minutes. Formatting that value for display purposes is presentation logic, not the business of the database.
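A minimal sketch of that storage/presentation split (the Format helper is made up for illustration; the idea is language-agnostic):
using System;

class DurationFormatting
{
    // The database column holds plain integer minutes (e.g. 90 for an hour and a half);
    // turning that into something human-readable happens only at display time.
    static string Format(int durationMinutes)
    {
        TimeSpan t = TimeSpan.FromMinutes(durationMinutes);
        return $"{(int)t.TotalHours} h {t.Minutes:D2} min";
    }

    static void Main()
    {
        Console.WriteLine(Format(90));    // 1 h 30 min
        Console.WriteLine(Format(150));   // 2 h 30 min
    }
}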
If I were in that position I would probably store it in one of two ways:
In seconds. It's unlikely that you need more precision than that.
In a string such as P1D for 1 day or P1W2DT3H for 1 week, 2 days, 3 hours. This is a standard format used by many libraries and deals better with situations where something really takes 1 day, but it's a day with a leap hour.
For most cases just using seconds will be fine though.
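If you go the string route, check what your libraries can actually parse. As a hedged example, .NET's XmlConvert only understands the XSD duration syntax, which has no week designator, so the week form above would have to be spelled out in days (P9DT3H):
using System;
using System.Xml;

class DurationStrings
{
    static void Main()
    {
        // Parse an XSD/ISO 8601 duration string into a TimeSpan...
        TimeSpan t = XmlConvert.ToTimeSpan("P1DT3H");
        Console.WriteLine(t.TotalHours);                                 // 27

        // ...and serialize a TimeSpan back into the same notation.
        Console.WriteLine(XmlConvert.ToString(TimeSpan.FromHours(27)));  // e.g. P1DT3H
    }
}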
I would represent it in the database as seconds or minutes (the minimum precision you want). Showing it to the user should be done dynamically in the frontend, e.g. in minutes (1 min, 30 min, 180 min), hours (0.1 h, 1 h, 3 h, 3.0 h), days (0.5 d, 1 d) or minimally packed (1 d 5 h 42 min).
You should keep this separate, so I suggest using seconds.
I've solved how I'm going to do this.
Instead of doing it within the database, I am going to do the multiplication using Python.
I took the information from the table, turned the data into int/datetime.timedelta, and the multiplication worked.
I then just needed to return that data depending on whether it's bus/train/plane.
For a multiplier not involved with money, simply use FLOAT.
Then work in seconds (or minutes if you prefer). That can be in INT UNSIGNED.
Use appropriate DATETIME functions to convert seconds to hh:mm or whatever output you desire. Note: The internal format need not be the same as the display format.
A duration could be represented in an open standard manner using ISO 8601 duration format.
See https://www.digi.com/resources/documentation/digidocs/90001437-13/reference/r_iso_8601_duration_format.htm

Project Euler 298 - there must be a correct answer? (only pastebinned code)

Project Euler has a paging file problem (though it's disguised in other words).
I tested my code (pastebinned so as not to spoil it for anyone) against the sample data and got the same memory contents and score as the problem. However, my scores are nowhere near consistently grouped. It asks for the expected difference in scores after 50 turns. A random sampling of scores:
1.50000000
1.78000000
1.64000000
1.64000000
1.80000000
2.02000000
2.06000000
1.56000000
1.66000000
2.04000000
I've tried a few of those as answers, but none of them have been accepted... I know some people have succeeded, so I'm really confused - what the heck am I missing?
Your problem is likely that you don't know the definition of Expected Value.
You will have to run the simulation multiple times and, for each score difference, maintain the frequency of that occurrence; then take the weighted mean to get the expected value.
Of course, given that it is Project Euler problem, there is probably a mathematical formula which can be used readily.
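For concreteness, here is a small sketch of what "take the weighted mean" amounts to; the tallied values below are made up, not real data for this problem:
using System;
using System.Collections.Generic;
using System.Linq;

class ExpectedValue
{
    // Expected value = sum over outcomes of (outcome * probability), with the
    // probabilities estimated here from observed frequencies.
    static double FromFrequencies(Dictionary<double, int> frequency)
    {
        long total = frequency.Values.Sum(v => (long)v);
        return frequency.Sum(kv => kv.Key * kv.Value / (double)total);
    }

    static void Main()
    {
        // Hypothetical tally of |L - R| score differences over many simulated games.
        var tally = new Dictionary<double, int> { { 1.50, 120 }, { 1.64, 310 }, { 2.02, 70 } };
        Console.WriteLine(FromFrequencies(tally));   // roughly 1.66 for this made-up tally
    }
}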
Yep, there is a correct answer. To be honest, Monte Carlo can theoretically close in on the expected value, given the law of large numbers. However, you won't want to try it here, because in practice each run of the simulation gives you a slightly different result rounded to eight decimal places (and I think this setting is exactly what deprives anybody of any realistic chance of even thinking of using Monte Carlo). If you are lucky, one run will happen to deliver the answer after lots of trials, given that you have submitted all the previous ones and failed. I think the captcha is the second way Project Euler gets you to give up on any brute-force approach.
Well, I agree with Moron: you have to figure out "expected value" first. The principle of this problem is that you have to find a way to enumerate every possible "essential" outcome after 50 rounds. Each outcome has its own |L - R|, so sum them up, weighted by their probabilities, and you will have the answer. Needless to say, a brute-force approach fails in most cases, and especially in this one. Fortunately, we have dynamic programming (dp), which is fast!
Basically, dp saves the computation results in each round as states and uses them in the next. Thus it avoids repeating the same computation over and over again. The difficult part of this problem is to find a way to represent a state, that is to say, how you would like to save your temp results. If you have solved problem 290 in dp, you can get some hints there about how to understand the problem and formulate a state.
Actually, that isn't the most difficult part. The hardest mental leap is realizing that some memory states of the two players are numerically different but essentially equivalent -- for example, L:12345 R:12345 vs L:23456 R:23456, or even vs L:98765 R:98765. That is due to the fact that the numbers called are random. That is also why I wrote possible "essential" outcomes: you can collapse several such states into one, and only by doing so can your program finish in reasonable time.
I would run your simulation a whole bunch of times and then do a weighted average of the | L- R | value over all the runs. That should get you closer to the expected value.
Just submitting one run as the answer is really unlikely to work. Imagine it was the expected value of a die roll: roll one die, score a 6, and submit that as the expected value.

Does MySQL support historical date (like 1200)?

I can't see any info about that. Where can I find the oldest date MySQL can support?
For the specific example you used on your question (year 1200), technically things will work.
In general, however, timestamps are inadvisable for this use.
First, the range limitation is arbitrary: in MySQL it's Jan 1st, 1000. If you are working with 12-13th century stuff, things go fine... but if at some moment you need to add something older (10th century or earlier), the date will miserably break, and fixing the issue will require re-formatting all your historic dates into something more adequate.
Timestamps are normally represented as raw integers, with a given "tick interval" and "epoch point", so the number is indeed the number of ticks elapsed since the epoch to the represented date (or vice versa for negative dates). This means that, as with any fixed-width integer data type, the set of representable values is finite. Most timestamp formats I know about sacrifice range in favor of precision, mostly because applications that need to perform time arithmetic often need to do so with decent precision, while applications that need to work with historical dates very rarely need to perform serious arithmetic.
In other words, timestamps are meant for precise representation of dates. Second (or even fraction-of-a-second) precision makes no sense for historical dates: could you tell me, down to the millisecond, when Henry the 8th was crowned King of England?
In the case of MySQL, the format is inherently defined as "4-digit years", so any related optimization can rely on the assumption that the year will have 4 digits, or that the entire string will have exactly 10 chars ("yyyy-mm-dd"), etc. It's just a matter of luck that the date you mentioned in your title still fits, but even relying on that is dangerous: besides what the DB itself can store, you need to be aware of what the rest of your server stack can manipulate. For example, if you are using PHP to interact with your database, trying to handle historical dates is very likely to crash at some point or another (on a 32-bit environment, the range for UNIX-style timestamps is December 13, 1901 through January 19, 2038).
In summary: MySQL will store properly any date with a 4-digit year; but in general using timestamps for historical dates is almost guaranteed to trigger issues and headaches more often than not. I strongly advise against such usage.
Hope this helps.
Edit/addition:
Thank you for this very interesting answer. Should I create my own algo for historical dates or choose another db, but which one? – user284523
I don't think any DB has much support for this kind of date: applications using them most often get by with a string/text representation. Actually, for dates in year 1 and later, a textual representation will even yield correct sorting / comparisons (as long as the date is represented in order of magnitude: y,m,d order). Comparisons will break, however, if "negative" dates are also involved (they would still compare as earlier than any positive one, but comparing two negative dates would yield a reversed result).
If you only need Year 1 and later dates, or if you don't need sorting, then you can make your life a lot easier by using strings.
Otherwise, the best approach is to use some kind of number, and define your own "tick interval" and "epoch point". A good interval could be days (unless you really need further precision, but even then you can rely on "real" (floating-point) numbers instead of integers); and a reasonable epoch could be Jan 1, 1. The main problem will be turning these values into their text representation, and vice versa. You need to keep in mind the following details:
Leap years have one extra day.
The rule for leap years was "any multiple of 4" until 1582, when it changed from the Julian to the Gregorian calendar and became "multiple of 4 except those that are multiples of 100 unless they are also multiples of 400".
The last day of the Julian calendar was Oct 4th, 1582. The next day, first of the Gregorian calendar, was Oct 15th, 1582. 10 days were skipped to make the new calendar match again with the seasons.
As stated in the comments, the two rules above vary by country: Papal states and some catholic countries did adopt the new calendar on the stated dates, but many other countries took longer to do so (the last being Turkey in 1926). This means that any date between the papal bull in 1582 and the last adoption in 1926 will be ambiguous without geographical context, and even more complex to process.
There is no "year 0": the year before year 1 was year -1, or year 1 BCE.
All of this requires quite elaborate parser and formatter functions, but beyond the many case-by-case branches there isn't really too much complexity (it'd be tedious to code, but quite straightforward). The use of numbers as the underlying representation ensures correct sorting/comparison for any pair of values.
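As a hedged fragment of what those parser/formatter functions would have to encode (this sketch assumes the papal October 1582 switch and a proleptic Julian rule before it; as noted above, real adoption dates varied by country):
using System;

static class HistoricalCalendar
{
    public static bool IsLeapYear(int year)
    {
        if (year < 0) year += 1;             // there is no year 0: ..., -2, -1, 1, 2, ...
        if (year <= 1582)                    // Julian rule: every multiple of 4
            return year % 4 == 0;
        return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;   // Gregorian rule
    }

    public static int DaysInMonth(int year, int month)
    {
        if (year == 1582 && month == 10) return 21;   // Oct 5-14, 1582 never existed
        int[] days = { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
        return (month == 2 && IsLeapYear(year)) ? 29 : days[month - 1];
    }

    static void Main()
    {
        Console.WriteLine(IsLeapYear(1500));      // True  (Julian rule applies)
        Console.WriteLine(IsLeapYear(1900));      // False (Gregorian rule applies)
        Console.WriteLine(DaysInMonth(1582, 10)); // 21
    }
}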
Knowing this, now it's your choice to take the approach that better fits your needs.
From the documentation:
DATE
A date. The supported range is '1000-01-01' to '9999-12-31'.
Yes. MySQL dates start in year 1000.
For whatever it's worth, I found that the MySQL DATE field does support dates < 1000 in practice, though the documentation says otherwise. E.g., I was able to enter 325 and it stores as 0325-00-00. A search WHERE table.date < 1000 also gave correct results.
But I am hesitant to rely on the < 1000 dates when they are not officially supported, plus I sometimes need BCE years with more than 4 digits anyway (e.g. 10000 BCE). So separate INT fields for year, month and day (as suggested above) do seem the only choice.
I do wish the DATE type (or perhaps a new HISTDATE type) supported a full range of historical dates - it would be nice to combine three fields into one and simply sort by date instead of having to sort by year, month, day.
Use SMALLINT for the year, so the year will accept values from -32768 (BC) to 32767 (AD).
As for months and days, use TINYINT UNSIGNED.
Most historical events don't have months and days, so you could query like this:
SELECT events FROM history WHERE year='-4990'
Result: 'Noah Ark'
Or: SELECT events FROM history WHERE year='570' AND month='4' AND day='20'
Result: "Muhammad pbuh was born"
Depending on requirements, you could also add a DATETIME column and make it NULL for dates before 1000 and vice versa (thus saving some bytes).
This is an important and interesting problem which has another solution.
Instead of relying on the database platform to support a potentially infinite number of dates with millisecond precision, rely on an object-oriented programming language compiler and runtime to correctly handle date and time arithmetic.
It is possible to do this using the Java Virtual Machine (JVM), where time is measured in milliseconds relative to midnight, January 1, 1970 UTC (Epoch), by persisting the required value as a long in the database (including negative values), and performing the required conversion/calculation in the component layer after retrieval.
For example:
// Long.MIN_VALUE milliseconds before the 1970-01-01 epoch: the earliest java.util.Date
Date d = new Date(Long.MIN_VALUE);
DateFormat df = new SimpleDateFormat("EEE, d MMM yyyy G HH:mm:ss Z");
System.out.println(df.format(d));
Should show:
Sun, 2 Dec 292269055 BC 16:47:04 +0000
This also enables independence of database versions and platforms as it abstracts all date and time arithmetic to the JVM runtime, i.e. changes in database versions and platforms will be much less likely to require re-implementation, if at all.
I had a similar problem, and I wanted to keep relying on date fields in the DB so I could run date-range searches with up-to-a-day accuracy for historic values.
(My DB includes dates of birth and dates of Roman emperors...)
The solution was to add a constant number of years (for example: 3000) to all the dates before adding them to the DB, and to subtract the same number before displaying the query results to the users.
If your DB already has some date values in it, remember to update the existing values with the new constant.

Should I store a field PRICE as an int or as a float in the database?

In a previous project, I noticed that the price field was being stored as an int, rather than as a float. This is done by multiplying the actual value by 100, the reason being to avoid running into floating-point problems.
Is this a good practice that I should follow or is it unnecessary and only makes the data less transparent?
Interesting question.
I wouldn't actually choose float in the MySQL environment; I've had too many problems with that datatype's precision in the past.
To me, the choice would be between int and decimal(18,4).
I've seen real-world examples of integers used to represent floating-point values. The internals of the JD Edwards data tables all do this: quantities are typically divided by 10000. While I'm sure it's faster and smaller in-table, it just means that we're always having to CAST the ints to a decimal value if we want to do anything with them, especially division.
From a programming perspective, I'd always prefer to work with decimal for price ( or money in RDBMSs that support it ).
Floating-point errors could cause you problems if you are multiplying large numbers. In general, financial calculations should never be done with floating-point numbers if it can be avoided.
I think Decimal is good for this use.
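To make the floating-point concern concrete, a quick sketch (the printed values are what a current .NET runtime shows; older runtimes round the display differently):
using System;

class PriceTypes
{
    static void Main()
    {
        // Binary floating point cannot represent most decimal fractions exactly.
        double d = 0.1 + 0.2;
        Console.WriteLine(d == 0.3);            // False
        Console.WriteLine(d);                   // 0.30000000000000004

        // A decimal type keeps exact decimal digits, which is what prices need.
        decimal m = 0.1m + 0.2m;
        Console.WriteLine(m == 0.3m);           // True

        // The "store cents as an integer" trick from the question achieves the same thing.
        long cents = 1999;                      // $19.99
        Console.WriteLine(cents / 100m);        // 19.99
    }
}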
While it would save you float-related issues, having prices saved as integers might lead to a problem where you end up charging 100 times the price to a customer. It could also confuse other programmers.
I have seen both solutions used successfully on medium-size e-commerce websites, but my preference goes to using floats.

What statistics can be maintained for a set of numerical data without iterating?

Update
Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning):
Obvious
Count
Sum
Mean
Max*
Min*
Median**
Less Obvious
Variance
Standard Deviation
Skewness
Kurtosis
Mode***
Weighted Average
Weighted Moving Average****
OK, so to put it more accurately: these are not "all" of the statistics I'm aware of. They're just the ones that I can remember off the top of my head right now.
*Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections.
**Recalculated in O(1) for a sorted, indexed collection only.
***Requires a fairly complex data structure to recalculate in O(1).
****This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion. In other scenarios, I'm not sure.
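For reference, here is a sketch of the kind of O(1) bookkeeping described above, keeping mean and (population) variance current through a running sum and sum of squares; the class is made up for illustration, and the sum-of-squares formulation trades some numerical stability for the ability to handle removals:
using System;
using System.Collections.Generic;

class RollingStats
{
    private readonly List<double> values = new List<double>();
    private double sum, sumOfSquares;

    public void Add(double value)
    {
        values.Add(value);
        sum += value;                        // O(1) update of the running totals
        sumOfSquares += value * value;
    }

    public void Remove(double value)
    {
        if (values.Remove(value))            // the list removal itself is O(n), but the
        {                                    // statistics update stays O(1)
            sum -= value;
            sumOfSquares -= value * value;
        }
    }

    public int Count => values.Count;
    public double Sum => sum;
    public double Mean => sum / values.Count;
    public double Variance => sumOfSquares / values.Count - Mean * Mean;   // population variance
    public double StdDev => Math.Sqrt(Variance);
}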
Original Question
Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could...
Option 1: Iterate through the collection, adding all the values:
double sum = 0.0;
for (int i = 0; i < values.Count; i++) sum += values[i];
Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum:
void Add(double value) {
    values.Add(value);
    sum += value;
}

void Remove(double value) {
    values.Remove(value);
    sum -= value;
}
EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation:
Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I say, "What's the sum?", you'd have to think to yourself, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said, "11" you thought "11", when I said "16", you thought, "27," and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," if you've been keeping track of the sum inside your head you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and then summing up 11 + 13 + 12.
So my question is, what other calculations, other than the obvious ones like sum and average, are like this?
SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again.
So I think this gets to the heart of my question: if we can divide kinds of statistics into these categories, those that are maintainable (my own term, maybe there's a more official one somewhere) versus those that require iteration to compute any time a collection is changed, what are all the maintainable ones?
What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having even seen all of the input data; the maintainable statistics I am seeking will certainly have seen all the data, they just don't need to reiterate through it over and over again whenever it changes.
First, the term that you want here is online algorithm. All moments (mean, standard deviation, skew, etc.) can be calculated online. Others include the minimum and maximum. Note that median and mode can not be calculated online.
To consistently maintain the high/low you store your data in sorted order. There are algorithms for maintaining data structures which preserve ordering.
Median is trivial if the data is ordered.
If the data is reduced slightly to a frequency table, you can maintain mode. If you keep your data as a random, flat list of values, you can't easily compute mode in the presence of change.
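Here is a sketch of that frequency-table idea, additions only; removing the current mode is the part that needs a cleverer structure, as the question's footnotes suggest:
using System;
using System.Collections.Generic;

class ModeTracker
{
    private readonly Dictionary<double, int> counts = new Dictionary<double, int>();
    private double mode;
    private int modeCount;

    public void Add(double value)
    {
        counts.TryGetValue(value, out int c);   // c is 0 if the value is new
        counts[value] = c + 1;
        if (c + 1 > modeCount)                  // O(1) update of the current mode
        {
            mode = value;
            modeCount = c + 1;
        }
    }

    public double Mode => mode;
}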
The answers to this question on online algorithms might be useful. Regarding the usability for your needs, I'd say that while some online algorithms can be used for estimating summary statistics with partial data, others may be used to maintain them from a data flow just as you like.
You might also want to look at complex event processing (or CEP), which is used for tracking and analysing real time data, for example in finance or web commerce. The only free CEP product I know of is Esper.
As Jason says, you are indeed describing an online algorithm. I've also seen this type of computation referred to as the Accumulator Pattern, whether the loop is implemented explicitly or by recursion.
Not really a direct answer to your question, but for many statistics that are not online statistics you can usually find some rules to calculate by iteration only part of the time, and cache the correct value the rest of the time. Is this possibly good enough for you?
For high value for example:
public void Add(double value) {
    values.Add(value);
    // Adding a value can only raise the maximum, so this update is O(1).
    if (value > highValue)
        highValue = value;
}

public void Remove(double value) {
    values.Remove(value);
    // Only when the removed value was (at or near) the current maximum do we
    // pay for a full O(n) rescan; a tolerance is used because of floating point.
    if (Math.Abs(value - highValue) < 1e-9)
        highValue = RecalculateHighValueByIteration();
}
It's not possible to maintain high or low with constant-time add and remove operations because that would give you a linear-time sorting algorithm. You can use a search tree to maintain the data in sorted order, which gives you logarithmic-time minimum and maximum. If you also keep subtree sizes and the count, it's simple to find the median too.
And if you just want to maintain the high or low in the presence of additions and removals, look into priority queues, which are more efficient for that purpose than search trees.
If you don't know the exact size of the dataset in advance, or if it is potentially unlimited, or you just want some ideas, you should definitely look into techniques used in Streaming Algorithms.
It does sound (even after your 2nd edit) like you are describing on-line algorithms, with the additional requirement that you want to allow "delete" operations. Examples of this are the "sketch algorithms" used for finding frequent items in a stream.