When working with data values, should I create a single table storing the hourly values, and also the aggregated daily/monthly values, or should I create separate tables for these?
I'd imagine multiple tables would be the way to go, but I'm a complete amateur here. It sounds like something that would improve performance and possibly maintenance, but I'd also like to know if this even makes a difference. In the end, having 3-4 tables vs 1 could also cause some maintenance issues I would imagine.
So basically, a values_table containing:
id value datetime range
1 33 2022-05-13 11:00:00 hourly
2 54 2022-05-13 12:00:00 hourly
3 840 2022-05-13 daily
...
vs
hourly_values_table containing:
id value datetime
1 33 2022-05-13 11:00:00
2 54 2022-05-13 12:00:00
...
And a daily_values_table containing:
id value datetime
1 840 2022-05-13
...
What would be the proper way to handle this?
Your hourly data is a Data Warehouse "Fact" table. It is, I assume, written 'continually' and never updated.
"Summary Table(s)" are useful for performance. Usually only 1 is needed. For example a "daily" table gives you about a 24x reduction. From that table you can fetch weekly, monthly, or any arbitrary date range reasonably efficiently. (I need more metrics and a better feel for what type of data you are storing to be surer of what I am saying.)
I discuss using MySQL for DW and Summary tables
Sure, purists debate the storing of "redundant" data. But when you get a billion rows, you really need summary tables to avoid performance bottlenecks.
As for how long to hold onto the data in the Fact table or the Summary table, I often suggest:
Use partitioning for speedy purging of old data (after, say, a month), thereby saving disk space;
Keep the summary tables 'forever', since they are 'small'.
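As a rough illustration of the summary-table idea, here is a minimal sketch that rolls an hourly fact table up into one row per day. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL, and the table and column names (hourly_values, daily_values, ts, total, n) are assumptions made for the example. Storing the sum and the row count, rather than only an average, lets weekly or monthly figures be derived later from the daily rows.

```python
import sqlite3

def build_daily_summary(conn):
    """Roll the hourly fact table up into one summary row per day."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_values ("
        " day TEXT PRIMARY KEY, total REAL, n INTEGER)"
    )
    # Re-running this replaces existing days instead of duplicating them.
    conn.execute(
        "INSERT OR REPLACE INTO daily_values (day, total, n) "
        "SELECT date(ts), SUM(value), COUNT(*) "
        "FROM hourly_values GROUP BY date(ts)"
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hourly_values (ts TEXT PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO hourly_values VALUES (?, ?)",
    [
        ("2022-05-13 11:00:00", 33),
        ("2022-05-13 12:00:00", 54),
        ("2022-05-14 09:00:00", 10),  # made-up extra row for a second day
    ],
)
build_daily_summary(conn)
rows = conn.execute(
    "SELECT day, total, n FROM daily_values ORDER BY day"
).fetchall()
```

In a real system the roll-up would more likely run incrementally (for example once per day for the day just finished) rather than over the whole fact table.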
I don't understand your purpose or your approach.
You have to start with the purpose of the database: what data are you trying to store, and why?
From reading your description I can't tell if the data is supposed to be connected to a person, or is it for an accounting purpose? There's no context.
Start with the purpose of the database and this will identify the tables/names, which will then reveal the structure and relationships. And go to my post here for clarification, which could help conceptually. Link
There is some value, x, which I am recording every 30 seconds, currently into a database with three fields:
ID
Time
Value
I am then creating a mobile app which will use that data to plot charts in views of:
Last hour
Last 24 hours.
7 Day
30 Day
Year
Obviously, saving every 30 seconds for the last year and then sending that data to a mobile device will be too much (it would mean sending 1051200 values).
My second thought was perhaps I could use the average function in MySQL, for example, collect all of the averages for every 7 days (creating 52 points for a year), and send those points. This would work, but still MySQL would be trawling through creating averages and if many users connect, it's going to be bad.
So simply put, if these are my views, then I do not need to keep track of all that data. Nobody should care what x was a year ago to the precision of every 30 seconds, this is fine. I should be able to use "triggers" to create some averages.
I'm looking for someone to check what I have below is reasonable:
Store values every 30s in a table (this will be used for the hour view, 120 points)
When there are 120 rows in the 30s table (120 * 30s = 60 mins = 1 hour), use a trigger to store the first half hour in a "half hour average" table and remove the first 60 entries from the 30s table. This new table will need to have an id, start time, end time and value. This half hour average will be used for the 24 hour view (48 data points).
When the half hour table has more than 24 entries (12 hours), store the first 6 as an average in a 6 hour average table and then remove from the table. This 6 hour average will be used for the 7 day view (28 data points).
When there are 8 entries in the 6 hour table, remove the first 4 and store this as an average day, to be used in the 30 day view (30 data points).
When there are 14 entries in the day view, remove the first 7 and store in a week table, this will be used for the year view.
This doesn't seem like the best way to me, as it seems to be more complicated than I would imagine it should be.
The alternative is to keep all of the data and let MySQL find averages as and when needed. This will create a monstrously huge database. I have no idea about the performance yet. The id is an int, time is a datetime and value is a float. Is 1051200 records too many? Now is a good time to add that I would like to run this on a Raspberry Pi, but if that is not feasible, I do have my main machine which I could use.
Your proposed design looks OK. Perhaps there are more elegant ways of doing this, but your proposal should work too.
RRD (http://en.wikipedia.org/wiki/Round-Robin_Database) is a specialised database designed to do all of this automatically, and it should be much more performant than MySQL for this specialised purpose.
An alternative is the following: keep only the original table (1051200 records), but have a trigger that generates the last hour/day/year etc views every time a new record is added (e.g. every 30 seconds) and store/cache the result somewhere. Then your number-crunching workload is independent of the number of requests/clients you have to serve.
1051200 records may or may not be too many. Test on your Raspberry Pi to find out.
Let me give a suggestion on the physical layout of your table, regardless of whether you decide to keep all data or "prune" it from time to time...
Since you generate a new row "every 30 seconds", Time can serve as a natural key without fear of exceeding the resolution of the underlying data type and causing duplicated keys. You don't need ID in this scenario [1], so your table is simply:
Time (PK)
Value
And since InnoDB tables are clustered, not having secondary indexes [2] means the whole table is stored in a single B-Tree, which is as efficient as it gets from a storage and querying perspective. On top of that, Value is automatically covered, which may not have been the case in your original design unless you specifically designed your index(es) for that.
Using time as a key can be tricky in general, but I think it may be worth it in this particular case.
[1] Unless there are other tables that reference it through FOREIGN KEYs, or you have already written too much code that depends on it.
[2] Which would be necessary in the original design to support efficient aggregation.
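A minimal sketch of this layout, assuming Python's built-in sqlite3 as a stand-in for MySQL/InnoDB (in SQLite, a table with an INTEGER PRIMARY KEY is likewise stored in a B-tree keyed on that column). The table name samples, the use of epoch seconds for the time column, and the helper bucket_averages are assumptions for the example. It also shows how the averaged views can be computed on demand from a key range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# time is the only key: seconds since the epoch, one row per 30-second sample
conn.execute(
    "CREATE TABLE samples (time INTEGER PRIMARY KEY, value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO samples (time, value) VALUES (?, ?)",
    [(i * 30, float(i)) for i in range(120)],  # one hour of made-up data
)

def bucket_averages(conn, start, end, bucket_seconds):
    """Average the samples in [start, end) into fixed-width time buckets."""
    return conn.execute(
        "SELECT (time / ?) * ? AS bucket, AVG(value) FROM samples "
        "WHERE time >= ? AND time < ? GROUP BY bucket ORDER BY bucket",
        (bucket_seconds, bucket_seconds, start, end),
    ).fetchall()

# Two half-hour averages covering the hour of sample data
half_hours = bucket_averages(conn, 0, 3600, 1800)
```

Because the range condition is on the primary key, the query only reads the rows inside the requested window.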
I have some data on fatalities which I'm trying to store, and I'm trying to come up with a reasonable scheme for storing the age of the person when they died.
I don't have DoB data for any of them, but I do have date of death generally (although not always very precisely) and I have data of varying accuracy for their age at death.
Some typical source data might be:
between 20 and 29 years old (or "in their 20s")
5 years old
2 months old
40 days old
adult
child
elderly
I have typically been storing this in three fields...
age_min (integer years)
age_max (integer years)
age_category (enum - baby, child, adult, elderly)
...but clearly this doesn't capture the 2 months old or 40 days old very well, both of which would simply end up as 0 years in my current schema, which is needlessly throwing away information.
It is very important that the database is honest about the precision to which information is known. So converting 2 months into 60 days, for example, would be a bad thing, because it implies a level of precision the source data didn't provide - converting it into 60-90 days might be ok.
I also considered adding a units field so I'd have...
age_min (integer)
age_min_unit (enum - days, months, years)
but the problem with this is it makes comparisons annoying. 24 months == 2 years, but dealing with that just makes a lot of code much more complex than I suspect it needs to be.
I could store all ages in days, with a min and a max, but then the complexity becomes converting that back into something human readable which isn't clunky and doesn't express a greater degree of precision than I actually have.
So for example, 40 days might end up being rendered at 1 month, 10 days which is actually a little less precise than saying 40 days.
OK, just adding this as an answer for future reference.
Could you store age_min and age_max in days, and also carry one more field, "human_readable_age_text", which holds the original wording, say "40 days"?
Been there, done that. The least ambiguous and easiest to process is to convert everything to days and add a +/- tolerance. That way everything can be stored in 2 fields and all situations are covered. Obviously you have to convert to human readable format before display.
If you have date of birth and date of death the tolerance becomes 0.
Thus the following input values will yield the indicated stored values.
5 years: 2007 183 (ie. 5.5 x 365 = 2007 days. 365/2 = +/-183 days.)
2 months: 75 15
9 years 7 months: 3512 15
child: First value is midpoint of your preferred "child" age range in days. (1-12?, 3-18?). Tolerance is half that.
baby: Same again. Decide on what constitutes a "baby" (0-2?) and generate the values accordingly.
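The conversion described above could be sketched as follows. This is only an illustration under stated assumptions: a nominal 365-day year and 30-day month, and an age given as a whole number of units is read as "at least that many, but less than one more". The names DAYS_PER and age_to_days are made up for the example.

```python
# Nominal unit lengths in days (an assumption; real months and years vary)
DAYS_PER = {"days": 1, "months": 30, "years": 365}

def age_to_days(amount, unit):
    """Return (midpoint_days, tolerance_days) for an age of `amount` whole units."""
    lo = amount * DAYS_PER[unit]        # youngest age consistent with the statement
    hi = (amount + 1) * DAYS_PER[unit]  # oldest age (exclusive upper bound)
    midpoint = (lo + hi) // 2
    tolerance = -(-(hi - lo) // 2)      # half the span, rounded up
    return midpoint, tolerance

five_years = age_to_days(5, "years")    # (2007, 183)
two_months = age_to_days(2, "months")   # (75, 15)
```

Categories such as "child" or "baby" would need their own agreed day ranges before they can be converted the same way.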
Store the value as min+max+unit. 'adult','child'... etc can be represented as a unit of age for which the min and max would be ignored.
Then you need to find the answer to philosophical questions like "Who is older: a child or a person between 5 and 12 years old?".
When you have the answer to those for all of the possible combinations of age types, you will be able to tell if it's possible to use a canonical representation of the age (e.g. days) for comparing.
If it's possible, you can add an additional field with the age in days (or seconds, or something...) to use for comparing/sorting. The compare field can be calculated with a trigger, or in the app.
If it's not possible, you will need a custom comparator for sorting. AFAIK that can't be done in MySQL, so you will probably have to do all sorting and comparing in the app.
There are extensive resources on the internet discussing the famous Birthday Paradox. It is clear to me how you calculate the probability of two people sharing a birthday i.e. P(same) = 1 - P(different). However if I ask myself something apparently more simple I stall: firstly, let's say I generate two random birthdays. Getting the same birthday is like tossing a coin. Either the two persons share a birthday (Heads) or they don't share a birthday (Tail). Run this 500 times and the end result (#Heads/500) will somehow be close to 0.5
Q1) But how do I think about this if I generate three random birthdays? How can I estimate the probability then? Obviously my coin analogy won't be applicable.
Q2) once I have figured out the above I will need to scale it up and generate 30 or 50 birthdays. Is there a recommended technique or algorithm to isolate identical birthdays from a large set? Should I put them into arrays and loop through them?
Here's what I think I need:
Q1)
r = 25 i.e. each trial run generates 25 birthdays
Trial 1 >
3 duplicates: 0
Trial 2 >
3 duplicates: 0
Trial 3 >
3 duplicates: 2
Trial 4 >
3 duplicates: 1
...
T100 >
3 duplicates: 2
estimated probability of 3 persons sharing a birthday in a room of 25 = (0+0+2+1+...+2)/100
Q2)
Create an array for 2 duplicates, an array for 3 duplicates and one for more than 3 duplicates
add each generated birthday one by one into the first array. But before doing so, loop through the array to see if it's in there already. If so, add it to the second array, but before doing so repeat the above process and so on
It doesn't seem to be a very efficient algorithm though :) suggestions to improve the Big O here?
Create an integer array of length 365, initialized to 0. Then generate N (in your case 25) random numbers between 0 and 364 (or 1-365 if your arrays are 1-based) and increment the matching array element (i.e. bdays[random_value]++). Since you are only interested in whether a collision happens, check right after each increment whether the count is greater than 2 (if it is, there has been a second collision on that day, which means 3 people share the same birthday). Count the trials in which this happens, and repeat the trial as many times as you wish (say 1000).
In the end, the ratio of collisions/1000 will be your requested value.
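A minimal sketch of that counting approach, assuming 365 equally likely birthdays and ignoring leap days. The function name estimate_shared_birthday and its parameters are made up for the example; k is the group size of interest (2 for the classic problem, 3 for the question asked here).

```python
import random

def estimate_shared_birthday(n_people, k, trials, seed=None):
    """Estimate P(at least k of n_people share a birthday) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = [0] * 365              # one counter per day of the year
        for _ in range(n_people):
            day = rng.randrange(365)    # 0..364
            counts[day] += 1
            if counts[day] >= k:        # k people now share this day
                hits += 1
                break                   # one hit per trial is enough
    return hits / trials

# Estimated chance that 3 people in a room of 25 share a birthday
p_triple = estimate_shared_birthday(25, 3, 20000, seed=1)
```

Each trial costs O(N) time, since a birthday is checked against the counter for its own day rather than against every earlier birthday.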
And no, the coin-tossing analogy is wrong.
Check this similar question and its answers on CrossValidated, but I think it is really worth thinking about the classic Birthday problem again to get the basics.
To the second part of your question: it depends on the language you use. I definitely suggest using R to solve a problem like that, as checking for identical birthdays in a list/vector/data frame can easily be done with a simple unique call. R is also really handy for running such a simple MC simulation; check the second answer on the link above.
Sounds like your first task will be to create a method that will generate random birthdays. To keep things simple, you can use the numbers 1-365 to denote unique birthdays.
Store however many random birthdays you need (2 in the first case, more later) in an ArrayList as Strings. You will want to use a loop to call the random number function and store each value in your list.
Then make a function to search the ArrayList for duplicates. If there are any duplicates (no matter how many) then that's a Heads result. If there are no matches then it's a Tails.
Your probabilities will be far different from 50/50 until you get to 20 or so.
We're building an app that stores "hours of operation" for various businesses. What is the easiest way to represent this data so you can easily check if an item is open?
Some options:
Segment out blocks (every 15 minutes) that you can mark "open/closed". Checking involves seeing if the "open" bit is set for the desired time (a bit like a train schedule).
Storing a list of time ranges (11am-2pm, 5-7pm, etc.) and checking whether the current time falls in any specified range (this is what our brain does when parsing the strings above).
Does anyone have experience in storing and querying timetable information and any advice to give?
(There's all sorts of crazy corner cases like "closed the first Tuesday of the month", but we'll leave that for another day).
store each contiguous block of time as a start time and a duration; this makes it easier to check when the hours cross date boundaries
if you're certain that hours of operation will never cross date boundaries (i.e. there will never be an open-all-night sale or 72-hour marathon event et al) then start/end times will suffice
The most flexible solution might be to use the bitset approach. There are 168 hours in a week, so there are 672 15-minute periods. That's only 84 bytes worth of space, which should be tolerable.
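A minimal sketch of such a weekly bitset, using a plain bytearray. The helper names and the choice of weekday 0 as the first day of the week are assumptions for the example.

```python
SLOTS_PER_WEEK = 7 * 24 * 4  # 672 fifteen-minute slots

def slot_index(weekday, hour, minute):
    """Index of the 15-minute slot containing the given time (weekday 0..6)."""
    return (weekday * 24 + hour) * 4 + minute // 15

def set_open(bits, start_slot, end_slot):
    """Mark slots start_slot..end_slot-1 as open."""
    for s in range(start_slot, end_slot):
        bits[s // 8] |= 1 << (s % 8)

def is_open(bits, weekday, hour, minute):
    s = slot_index(weekday, hour, minute)
    return bool(bits[s // 8] & (1 << (s % 8)))

week = bytearray(SLOTS_PER_WEEK // 8)  # 84 bytes, all closed
# Open on weekday 0 from 09:00 up to (not including) 17:00
set_open(week, slot_index(0, 9, 0), slot_index(0, 17, 0))
```

Hours that cross midnight need no special handling here, because the slots simply continue into the next day's range.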
I'd use a table like this:
BusinessID | weekDay | OpenTime | CloseTime
---------------------------------------------
1          | 1       | 9        | 13
1          | 2       | 5        | 18
1          | 3       | 5        | 18
1          | 4       | 5        | 18
1          | 5       | 5        | 18
1          | 6       | 5        | 18
1          | 7       | 5        | 18
Here, we have a business that has regular hours of 5 AM to 6 PM (5 to 18), but shorter hours on Sunday (weekDay 1).
A query to check whether it is open would be (pseudo-SQL):
SELECT @isOpen = CAST(
    CASE WHEN EXISTS (
        SELECT 1 FROM tblHours
        WHERE BusinessID = @id AND weekDay = @Day
          -- @CurrentHour: the current time converted to a 24-hour value
          AND @CurrentHour >= OpenTime AND @CurrentHour < CloseTime
    ) THEN 1 ELSE 0 END AS BIT);
If you need to store edge cases, then just have 365 entries, one per day. It's really not that much in the grand scheme of things; place an index on the day column and the businessId column.
Don't forget to store the business's time zone in a separate table (normalize!), and convert between your time and theirs before making these comparisons.
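For anyone who wants to try the table and lookup above, here is a runnable sketch using Python's built-in sqlite3 as a stand-in database. The function name is_open is made up, the closing hour is treated as exclusive (an assumption), and time zone conversion is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblHours ("
    " BusinessID INTEGER, weekDay INTEGER, OpenTime INTEGER, CloseTime INTEGER)"
)
# Same data as the example table: short hours on day 1, 5 to 18 otherwise
conn.executemany(
    "INSERT INTO tblHours VALUES (?, ?, ?, ?)",
    [(1, 1, 9, 13)] + [(1, day, 5, 18) for day in range(2, 8)],
)

def is_open(conn, business_id, week_day, hour):
    """True if the business has a row covering this hour on this weekday."""
    row = conn.execute(
        "SELECT EXISTS ("
        " SELECT 1 FROM tblHours"
        " WHERE BusinessID = ? AND weekDay = ?"
        " AND OpenTime <= ? AND ? < CloseTime)",
        (business_id, week_day, hour, hour),
    ).fetchone()
    return bool(row[0])
```

For example, is_open(conn, 1, 1, 10) is True for the sample data, while is_open(conn, 1, 1, 14) is False.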
OK, I'll throw in on this for what it's worth.
I need to handle quite a few things.
Fast / Performant Query
Any increments of time, 9:01 PM, 12:14, etc.
International (?) - not sure if this is an issue even with timezones, at least in my case but someone more versed here feel free to chime in
Open - Close spanning to the next day (open at noon, close at 2:00 AM)
Multiple timespans / day
Ability to override specific days (holidays, whatever)
Ability for overrides to be recurring
Ability to query for any point in time and get businesses open (now, future time, past time)
Ability to easily exclude results of businesses closing soon (filter out businesses closing in 30 minutes; you don't want to make your users 'that guy' who shows up 5 minutes before closing in the food/beverage industry)
I like a lot of the approaches presented and I'm borrowing from a few of them. In my website, project, whatever I need to take into consideration I may have millions of businesses and a few of the approaches here don't seem to scale well to me personally.
Here's what I propose for an algorithm and structure.
We have to make some concrete assumptions, across the globe, anywhere, any time:
There are 7 days in a week.
There are 1440 minutes in one day.
There are a finite number of permutations of minutes of open / closed that are possible.
Not concrete but decent assumptions:
Many permutations of open/closed minutes will be shared across businesses reducing total permutations actually stored.
There was a time in my life I could easily calculate the actual possible combinations to this approach but if someone could assist/thinks it would be useful, that would be great.
I propose 3 tables:
Before you stop reading, consider that in the real world 2 of these tables will be small enough to cache neatly. This approach isn't going to be for everyone either, due to the sheer complexity of the code required to translate between a UI and the data model and back again if needed. Your mileage and needs may vary. This is an attempt at a reasonable 'enterprise' level solution, whatever that means.
HoursOfOperations Table
ID | OPEN (minute of day) | CLOSE (minute of day)
1 | 540 | 1020 (example: 9 AM - 5 PM)
2 | 545 | 1021 (example: edge-case 9:05 AM - 5:01 PM (weirdos) )
etc.
HoursOfOperations doesn't care about what days, just open and close and uniqueness. There can be only a single entry per open/close combination. Now, depending on your environment either this entire table can be cached or it could be cached for the current hour of the day, etc. At any rate, you shouldn't need to query this table for every operation. Depending on your storage solution I envision every column in this table as indexed for performance. As time progresses, this table likely has an exponentially inverse likelihood of INSERT(s). Really though, dealing with this table should mostly be an in-process operation (RAM).
Business2HoursMap
Note: In my example I'm storing "Day" as a bit-flag field/column. This is largely due to my needs and the advancement of LINQ / Flags Enums in C#. There's nothing stopping you from expanding this to 7 bit fields. Both approaches should be relatively similar in both storage logic and query approach.
Another Note: I'm not entering into a semantics argument on "every table needs a PK ID column", please find another forum for that.
BusinessID | HoursID | Day (or, if you prefer split into: BIT Monday, BIT Tuesday, ...)
1 | 1 | 1111111 (this business is open 9-5 every day of the week)
2 | 2 | 1111110 (this business is open 9:05 - 5:01 Mon-Sat (Monday = day 1))
The reason this is easy to query is that we can always determine quite easily the MOTD (Minute of the Day) that we're after. If I want to know what's open at 5 PM tomorrow I grab all HoursOfOperations IDs WHERE Close >= 1020. Unless I'm looking for a time range, Open becomes insignificant. If you don't want to show businesses closing in the next half-hour, just adjust your incoming time accordingly (search for 5:30 PM (1050), not 5:00 PM (1020)).
The second query would naturally be 'give me all businesses with HoursID IN (1, 2, 3, 4, 5)', etc. This should probably raise a red flag as there are limitations to this approach. However, if someone can answer the actual permutations question above we may be able to pull the red flag down. Consider that we only need the possible permutations on any one side of the equation at one time, either open or close.
Considering we've got our first table cached, that's a quick operation. Second operation is querying this potentially large-row table but we're searching very small (SMALLINT) hopefully indexed columns.
Now, you may be seeing the complexity on the code side of things. I'm targeting mostly bars in my particular project so it's going to be very safe to assume that I will have a considerable number of businesses with hours such as "11:00 AM - 2:00 AM (the next day)". That would indeed be 2 entries into both the HoursOfOperations table as well as the Business2HoursMap table. E.g. a bar that is open from 11:00 AM - 2:00 AM will have 2 references to the HoursOfOperations table 660 - 1440 (11:00 AM - Midnight) and 0 - 120 (Midnight - 2:00 AM). Those references would be reflected into the actual days in the Business2HoursMap table as 2 entries in our simplistic case, 1 entry = all days Hours reference #1, another all days reference #2. Hope that makes sense, it's been a long day.
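The midnight split described above could be sketched like this. The function name split_at_midnight is made up for the example, and the caller is assumed to attach the second piece to the following day.

```python
MINUTES_PER_DAY = 1440

def split_at_midnight(open_minute, close_minute):
    """Split an open/close pair (minutes of the day) into same-day ranges.

    A close time at or before the open time is taken to mean "the next day".
    """
    if close_minute > open_minute:
        return [(open_minute, close_minute)]
    parts = [(open_minute, MINUTES_PER_DAY)]  # opening time up to midnight
    if close_minute > 0:
        parts.append((0, close_minute))       # midnight up to closing time
    return parts

bar_hours = split_at_midnight(660, 120)  # 11:00 AM - 2:00 AM
# bar_hours == [(660, 1440), (0, 120)]
```

Each returned pair would map to one row in HoursOfOperations, so a bar with these hours references two rows, as in the description.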
Overriding on special days / holidays / whatever.
Overrides are by nature, date based, not day of week based. I think this is where some of the approaches try to shove the proverbial round peg into a square hole. We need another table.
HoursID | BusinessID | Day | Month | Year
1 | 2 | 1 | 1 | NULL
This can certainly get more complex if you needed something like "on every second Tuesday, this company goes fishing for 4 hours". However, what this will allow us to do quite easily is allow 1 - overrides, 2 - reasonable recurring overrides. E.g. if Year IS NULL, then every year on New Year's Day this weirdo bar is open from 9:00 AM to 5:00 PM, keeping in line with our above data examples. If Year were set to 2013, the override would apply only in 2013. If Month is NULL, it applies on the first day of every month. Again, this won't handle every scheduling scenario by NULL columns alone, but theoretically, you could handle just about anything by relying on a long sequence of absolute dates if needed.
Again, I would cache this table on a rolling day basis. I just can't realistically see the rows for this table in a single-day snapshot being very large, at least for my needs. I would check this table first as it is, well, an override, and doing so would save a query against the much larger Business2HoursMap table on the storage side.
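One way to read the NULL-as-wildcard rule is the small matcher below, where Python's None plays the role of SQL NULL. The function name override_matches and the tuple layout (day, month, year) are assumptions for the example.

```python
from datetime import date

def override_matches(rule, on):
    """True if the override rule (day, month, year) applies on date `on`.

    A None field matches any value, mirroring a NULL column.
    """
    day, month, year = rule
    return (
        (day is None or day == on.day)
        and (month is None or month == on.month)
        and (year is None or year == on.year)
    )

new_years_rule = (1, 1, None)  # the example row: every January 1st
applies = override_matches(new_years_rule, date(2014, 1, 1))
```

The same test can be written in SQL as a chain of (column IS NULL OR column = value) conditions.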
Interesting problem. I'm really surprised this is the first time I've really needed to think this through. As always, very keen on different insights, approaches or flaws in my approach.
I think I'd personally go for a start + end time, as it would make everything more flexible. A good question would be: what's the chance that the block size would change at a certain point? Then pick the solution that best fits your situation (if it's liable to change I'd definitely go for the timespans).
You could store them as a timespan, and use segments in your application. That way you have the easy input using blocks, while keeping the flexibility to change in your datastore.
To add to what Johnathan Holland said, I would allow for multiple entries for the same day.
I would also allow for decimal time, or another column for minutes.
Why? Many restaurants, and many other businesses around the world, have lunch and/or afternoon breaks. Also, many restaurants close at odd times that are not multiples of 15 minutes. Of two that I know of near my house, one closes at 9:40 PM on Sundays, and one closes at 1:40 AM.
There is also the issue of holiday hours, such as stores closing early on Thanksgiving Day, so you need to have a calendar-based override.
Perhaps what can be done is a date/time open, date-time close, such as this:
businessID | datetime | type
==========================================
1 10/1/2008 10:30:00 AM 1
1 10/1/2008 02:45:00 PM 0
1 10/1/2008 05:15:00 PM 1
1 10/2/2008 02:00:00 AM 0
1 10/2/2008 10:30:00 AM 1
etc. (type: 1 being open and 0 closed)
And have all the days precalculated 1-2 years in advance. Note that you would only have 3 columns (int, date/time, bit), so the data consumption should be minimal.
This will also allow you to modify specific dates for odd hours for special days, as they become known.
It also takes care of crossing over midnight, as well as 12/24 hour conversions.
It is also timezone agnostic. If you store start time and duration, when you calculate the end time, is your machine going to give you the TZ adjusted time? Is that what you want? More code.
As far as querying for open/closed status: query the most recent entry at or before the date-time in question,
select top 1 type from thehours where datetimefield <= somedatetime and businessID = somebusinessid order by datetimefield desc
then look at "type". If it is 1, it's open; if 0, it's closed.
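The same "latest event at or before the time in question" lookup can be sketched in memory with a binary search. The events below are the sample rows for business 1 from the table above; the names events, times and is_open_at are made up for the example.

```python
from bisect import bisect_right
from datetime import datetime

# (timestamp, type) pairs sorted by timestamp; 1 = opens, 0 = closes
events = [
    (datetime(2008, 10, 1, 10, 30), 1),
    (datetime(2008, 10, 1, 14, 45), 0),
    (datetime(2008, 10, 1, 17, 15), 1),
    (datetime(2008, 10, 2, 2, 0), 0),
    (datetime(2008, 10, 2, 10, 30), 1),
]
times = [t for t, _ in events]

def is_open_at(when):
    """True if the most recent event at or before `when` is an opening."""
    i = bisect_right(times, when)  # number of events with time <= when
    return i > 0 and events[i - 1][1] == 1
```

A time before the first recorded event is treated as closed, which is an assumption; the SQL query would return no row in that case.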
PS: I was in retail for 10 years. So I am familiar with the small business crazy-hours problems.
The segment blocks are better, just make sure you give the user an easy way to set them. Click and drag is good.
Any other system (like ranges) is going to be really annoying when you cross the midnight boundary.
As for how you store them, in C++ bitfields would probably be best. In most other languages, an array might be better (lots of wasted space, but it would run faster and be easier to comprehend).
I would think a little about those edge-cases right now, because they are going to inform whether you have a base configuration plus overlay or complete static storage of opening times or whatever.
There are so many exceptions - and on a regular basis (like snow days, irregular holidays like Easter, Good Friday), that if this is expected to be a reliable representation of reality (as opposed to a good guess), you'll need to address it pretty soon in the architecture.
How about something like this:
Store Hours Table
Business_id (int)
Start_Time (time)
End_Time (time)
Condition varchar/string
Open bit
'Condition' is a lambda expression (text for a 'where' clause). Build the query dynamically. So for a particular business you select all of the open/close times
Let Query1 = select count(*) from store_hours where @t between start_time and end_time and open = true and business_id = @id and (.. dynamically built expression)
Let Query2 = select count(*) from store_hours where @t between start_time and end_time and open = false and business_id = @id and (.. dynamically built expression)
So in the end you want something like:
select cast(Query1 as bit) & ~cast(Query2 as bit)
If the result of the last query is 1 then the store is open at time t, otherwise it is closed.
Now you just need a friendly interface that can generate your where clauses (lambda expressions) for you.
The only other corner case that I can think of is what happens if a store is open from say 7am to 2am on one date but closes at 11pm on the following date. Your system should be able to handle that as well by smartly splitting up the times between the two days.
There is surely no need to conserve memory here, but perhaps a need for clean and comprehensible code. "Bit twiddling" is not, IMHO, the way to go.
We need a set container here, which holds any number of unique items and can determine quickly and easily whether an item is a member or not. The setup requires care, but in routine use a single line of simply understood code determines if you are open or closed.
Concept:
Assign index number to every 15 min block, starting at, say, midnight sunday.
Initialize:
Insert into a set the index number of every 15 min block when you are open. ( Assuming you are open fewer hours than you are closed. )
Use:
Subtract from interesting time, in minutes, midnight the previous sunday and divide by 15. If this number is present in the set, you are open.
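A minimal sketch of the set-based scheme, using Python's built-in set. The sample opening hours (Monday 09:00 to 17:00) and the names block_index, open_blocks and is_open are made up for the example.

```python
from datetime import datetime

def block_index(when):
    """Index of the 15-minute block containing `when`, counted from Sunday midnight."""
    # datetime.weekday(): Monday = 0 ... Sunday = 6
    days_since_sunday = (when.weekday() + 1) % 7
    minutes = days_since_sunday * 1440 + when.hour * 60 + when.minute
    return minutes // 15

# Initialize: insert every block during which the business is open.
open_blocks = set()
monday = 1 * 1440  # minutes from Sunday midnight to Monday midnight
open_blocks.update(range((monday + 9 * 60) // 15, (monday + 17 * 60) // 15))

def is_open(when):
    return block_index(when) in open_blocks
```

For example, is_open(datetime(2024, 1, 1, 9, 0)) is True, since 1 January 2024 was a Monday.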