I have a database in which each entry is a business, and some of the data is opening and closing times. Originally, to support multiple opening and closing times in a day, I was going to store them as a string such as '09:00-15:00,17:00-22:00' and then split it up and convert it to TIMESTAMPs server-side. I now understand that it is "bad form" to store times as strings, but the solution requires a second table.
What exactly is the issue with using a string instead of DATE or TIMESTAMP? If I kept it all in one table it would get pulled with my current query and then just processed and converted into a TIMESTAMP. Using multiple tables causes more queries. Is the string manipulation and conversion really that much more taxing than another query searching through an entire table that will end up with thousands of entries?
There are three separate issues here. One is whether it is better to combine two elementary data items into a list, and store the list in one place, or to keep the elementary items separate, and store them in two separate places. For these purposes, I'm considering the intersection of a single row and a single column to be one place.
The second issue is whether it is better to identify a time interval as a single elementary item or better to describe a time interval as two elementary items, side by side.
The third issue is whether it is better to store points in time as strings or use the DBMS timestamp datatype.
For the first issue, it's better to store separate elementary items in separate places. In this case, in two separate rows of the same table.
For the second issue, it's better to describe a time interval as two timestamps side by side than to combine them into an interval, for the DBMS systems I have used. But there are DBMS systems that have a time interval datatype that is distinct from the timestamp datatype. Your mileage may vary.
For the third issue, it's better to use the DBMS timestamp datatype to describe a point in time than a character string. Different DBMS products have different facilities for storing a time without any date associated with it, and in your case, you may want to make use of this. It depends on how you will be searching the data. If you are going to want to find all the rows that contain a 9:00 to 15:00 time range regardless of date, you will want to make use of this. If you are going to want to find all the rows that contain a 9:00 to 15:00 range on any Tuesday, I suggest you look into data warehousing techniques.
Why?
For the first two answers, it has to do with normalization, indexing, searching strategies, and query optimization. These concepts are all related, and each of them will help you understand the other three.
For the third answer, it has to do with using the power of the DBMS to do the detailed work for you. Any good DBMS will have tools for subtracting timestamps to yield a time interval, and for ordering timestamps correctly. Getting the DBMS to do this work for you will save you a lot of work in your application programming, and generally use less computer resources in production.
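For instance, subtracting two timestamps to get an interval is a one-liner in most engines (MySQL has TIMESTAMPDIFF; PostgreSQL allows plain subtraction). A minimal sketch using SQLite via Python, where julianday() does the job:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Subtract two timestamps to get an interval in hours. SQLite exposes this
# via julianday(); MySQL would use TIMESTAMPDIFF(HOUR, open_ts, close_ts).
hours = conn.execute(
    "SELECT (julianday('2024-01-01 15:00:00')"
    "      - julianday('2024-01-01 09:00:00')) * 24"
).fetchone()[0]
# The DBMS did the interval arithmetic (6 hours), not application code.
```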
Having said that, I don't know MySQL, so I don't know how good it is at date manipulation or query optimization.
By '09:00-15:00,17:00-22:00', you mean that it is open for two periods of time? Then 2 rows in one table. If the open times vary by day of week, then another column to specify that. If you want holidays, another column.
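A minimal sketch of that two-rows-plus-a-weekday-column layout (SQLite via Python for illustration; the table and column names are assumptions, and a real MySQL schema would use the native TIME type instead of text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE business_hours (
        business_id INTEGER NOT NULL,
        day_of_week INTEGER NOT NULL,   -- 0 = Monday ... 6 = Sunday
        open_time   TEXT NOT NULL,      -- 'HH:MM'; use a native TIME type where available
        close_time  TEXT NOT NULL
    )
""")

# The '09:00-15:00,17:00-22:00' string becomes two ordinary rows:
conn.executemany(
    "INSERT INTO business_hours VALUES (?, ?, ?, ?)",
    [(1, 1, "09:00", "15:00"),
     (1, 1, "17:00", "22:00")],
)

# Which businesses are open at 18:30 on a Tuesday? No string parsing needed;
# zero-padded 'HH:MM' values compare correctly even as text.
open_now = conn.execute(
    "SELECT DISTINCT business_id FROM business_hours "
    "WHERE day_of_week = 1 AND open_time <= '18:30' AND close_time > '18:30'"
).fetchall()
```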
But... What will you be doing with the data?...
If you are going to write SQL to use the times, then what I described may work best.
If the processing will be in your application code, then a CSV file that you load every time, parse every time, etc, may be best.
Or, you could put that CSV file into a table as one big string. (However, this is not much different than having a file.)
So I've designed a basic SQL database that imports data outputted by machines through SSIS into SQL, does a few transforms, and ends up with how many things we are producing every 15 minutes.
Now we want to be able to report on this per-operator. So I have another table with operators and operator numbers, and am trying to figure out how to track this, with the eventual goal of giving my boss charts and graphs of how his employees are doing.
Now the question:
I was going to format a table with the date, machine number, operator number and then a column for each 15m segment in a day, but that ended up being a million+ datapoints a year, which will clearly get out of control.
Then I was thinking: date, machine number, operator number, start and stop time. But I couldn't figure out how to get it to roll over into the next day if a shift goes past midnight, or how to query against a time between the start/stop times. Simple stuff, I'm sure, but I'm new here. I need to use time instead of just a "shift" since that may change: people go home early, etc. Stuff happens.
So the question is: What would be best practice on how to format a table for a work schedule, and how can I query off of it as above?
First, a million rows a year isn't a lot. SQL databases regularly get into the billions of rows. The storage requirements compared to modern drive sizes are nothing. Properly indexed, performance won't be a problem.
In fact, I'd say to consider not even bothering with the time periods. Record each data point with a timestamp instead. Use SQL operators such as BETWEEN to get whatever periods you like. It's simpler. It's more flexible. It takes more space, but space isn't really an issue. And with proper indexing it won't be a performance issue. Use the money saved on developer time to buy better hardware for your database, like more RAM or an SSD. Or move to a cloud database.
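A sketch of that timestamp-per-datapoint approach (SQLite via Python; all names are illustrative). Any 15-minute rollup falls out of a GROUP BY, and BETWEEN carves out arbitrary windows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE production (
        machine_id  INTEGER NOT NULL,
        operator_id INTEGER NOT NULL,
        produced_at TEXT NOT NULL       -- one row per item, full timestamp
    )
""")
conn.executemany(
    "INSERT INTO production VALUES (?, ?, ?)",
    [(1, 7, "2024-03-01 08:02:00"),
     (1, 7, "2024-03-01 08:09:00"),
     (1, 7, "2024-03-01 08:21:00")],
)

# Per-operator counts in 15-minute (900-second) buckets, derived on the fly:
buckets = conn.execute("""
    SELECT operator_id,
           (CAST(strftime('%s', produced_at) AS INTEGER) / 900) * 900 AS bucket,
           COUNT(*) AS items
    FROM production
    WHERE produced_at BETWEEN '2024-03-01 00:00:00' AND '2024-03-01 23:59:59'
    GROUP BY operator_id, bucket
    ORDER BY bucket
""").fetchall()
# Two items land in the 08:00 bucket, one in the 08:15 bucket.
```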
Just make sure you architect your system to encapsulate the details of the schema, probably by using a model, and ensure that you have a way to safely change your schema, like by using migrations. Then if you need to re-architect your schema later you can do so without having to hunt down every piece of code that might use that table.
That said, there are a few simple things you could do to reduce the number of rows.
There's probably going to be a lot of periods when a thing doesn't produce anything. If nothing is produced during that period, don't store a row. If you just store the timestamp for each thing produced, these gaps appear normally.
You could save a small amount of space and performance by putting the periods in their own table and referencing them. So instead of every table having redundant start and end datetime columns, they'd have a single period column which referenced a period table that had start and end columns. While this would reduce some duplication, I'm not so sure this is worth the complexity.
In the end, before you add a bunch of complexity over hypothetical performance issues, do the simplest thing and benchmark it. Load up your database with a bunch of test data, see how it performs, and optimize from there.
I need help regarding how to structure overlapping date ranges in my data warehouse. My objective is to model the data in a way that allows date-level filtering on the reports.
I have dimensions — DimEmployee, DimDate and a fact called FactAttendance. The records in this fact are stored as follows —
To represent this graphically —
A report needs to be created out of this data, that will allow the end-user to filter it by making a selection of a date range. Let's assume user selects date range D1 to D20. On making this selection, the user should see the value for how many days at least one of the employees was on leave. In this particular example, I should see the addition of light-blue segments in the bottom i.e. 11 days.
An approach that I am considering is to store one row per employee per date for each of the leaves. The only problem with this approach is that it will exponentially increase the number of records in the fact table. Besides, there are other columns in the fact that will have redundant data.
How are such overlapping date/time problems usually handled in a warehouse? Is there a better way that does not involve inserting numerous rows?
Consider modelling your fact like this:
fact_attendance (date_id, employee_id, hours, ...)
This will enable you to answer your original question by simply filtering on the Date dimension, but you will also be able to handle issues like leave credits, and fractional day leave usage.
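With one row per employee per leave date, the "days at least one employee was on leave" figure reduces to a COUNT(DISTINCT ...) over the filtered range. A small sketch (SQLite via Python; column names follow the fact above, sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_attendance (date_id TEXT, employee_id INTEGER, hours REAL)"
)
conn.executemany(
    "INSERT INTO fact_attendance VALUES (?, ?, ?)",
    [("2024-01-03", 1, 8.0),   # employee 1 on leave Jan 3-4
     ("2024-01-04", 1, 8.0),
     ("2024-01-04", 2, 8.0),   # overlapping leave: employee 2 also off Jan 4
     ("2024-01-10", 2, 4.0)],  # fractional-day leave still counts as a day
)

# Days in the user-selected range on which at least one employee was on leave.
# The overlap on Jan 4 is handled by DISTINCT, not by application code.
days = conn.execute(
    "SELECT COUNT(DISTINCT date_id) FROM fact_attendance "
    "WHERE date_id BETWEEN '2024-01-01' AND '2024-01-20'"
).fetchone()[0]
```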
Yes, it might use a little more storage than your first proposal, but it is a better dimensional representation, and will satisfy more (potential) requirements.
If you are really worried about storage - probably not a real worry - use a DBMS with columnar compression, and you'll see large savings in disk space.
The reason I say "not a real worry" about storage is that your savings are meaningless in today's world of storage. 1,000 employees with 20 days leave each per year, over five years would mean a total of 100,000 rows. Your DBMS would probably execute the entire star join in RAM. Even one million employees would require less than one terabyte before compression.
In order to analyze dates and times I am creating a MySQL table where I want to keep the time information. Some example analyses will be stuff like:
Items per day/week/month/year
Items per weekday
Items per hour
etc.
Now in regards to performance, what way should I record in my datatable:
date type: Unix timestamp?
date type: datetime?
or keep each date component in a separate field, e.g. year, month, and day in their own columns?
The last one, for example, would be handy if I'm analysing by weekday; I wouldn't have to perform WEEKDAY(item.date) on MySQL but could simply use WHERE item.weekday = :w.
Based on your usage, you want to use the native datetime format. Unix formats are most useful when the major operations are (1) ordering; (2) taking differences in seconds/minutes/hours/days; and (3) adding seconds/minutes/hours/days. They need to be converted to internal date time formats to get the month or week day, for instance.
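To illustrate the difference: with a native datetime column, the weekday and hour come straight out of the date functions, whereas a Unix integer would first need converting (FROM_UNIXTIME in MySQL). A sketch using SQLite via Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (created_at TEXT)")  # ISO datetime, SQLite's native form
conn.execute("INSERT INTO items VALUES ('2024-03-05 14:30:00')")  # a Tuesday

# Weekday (0 = Sunday ... 6 = Saturday) and hour are extracted directly from
# the stored datetime; no conversion step from an epoch integer is needed.
weekday, hour = conn.execute(
    "SELECT CAST(strftime('%w', created_at) AS INTEGER), "
    "       CAST(strftime('%H', created_at) AS INTEGER) FROM items"
).fetchone()
```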
You also have a potential indexing issue. If you want to select ranges of days, hours, months and so on for your results, then you want an index on the column. For this purpose an index on a datetime is probably sufficient.
If the summaries are by hour, you might find it helpful to store the date component in a date field and the hour in a separate column. That would be particularly helpful if you are combining hours from different days.
Whether you break out other components of the date, such as weekday and month, for indexing purposes would depend on the volume of data in the table, performance requirements, and the queries you are planning on running. I would not be inclined to do this, except as a later optimization.
The rule of thumb is: store things as they should be stored, and don't do performance tweaks until you're hitting the bottleneck. If you store your date as separate fields, you'll eventually stumble upon a situation where you need the date as a whole inside your database (e.g. an update query for a particular range of time), and this will be like hell: a condition covering 3 April 2015 through 15 May 2015 would be as giant as possible.
You should keep your dates as a date type. This will grant you maximum flexibility and (most probably) query readability, and will keep open all of your opportunities to work with them. The only thing I can really recommend is storing the same date divided into year/month/day in additional columns; of course, this will bloat your database and require extreme caution in update scenarios, but it will let you use either form of the source data in your queries.
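To make the range-condition point concrete: against a whole date column the 3 April to 15 May filter is a single BETWEEN, while against split year/month/day fields it becomes a compound predicate. A sketch (SQLite via Python, illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (d TEXT)")  # whole date, ISO format
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("2015-03-20",), ("2015-04-10",), ("2015-05-01",), ("2015-06-01",)],
)

# 3 April 2015 through 15 May 2015, against the whole-date column:
n = conn.execute(
    "SELECT COUNT(*) FROM events WHERE d BETWEEN '2015-04-03' AND '2015-05-15'"
).fetchone()[0]

# The same range against split year/month/day columns needs something like:
#   y = 2015 AND ((m = 4 AND day >= 3) OR (m = 5 AND day <= 15))
# which only gets worse when the range spans a year boundary.
```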
I am creating a database to store data from a monitoring system that I have created. The system takes a bunch of data points (~4000) a couple of times every minute and stores them in my database. I need to be able to downsample based on the timestamp. Right now I am planning on using one table with three columns:
results:
1. point_id
2. timestamp
3. value
so the query I'd like to run would be:
SELECT point_id,
MAX(value) AS value
FROM results
WHERE timestamp BETWEEN date1 AND date2
GROUP BY point_id;
The problem I am running into is this seems super inefficient with respect to memory. Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me. The only solutions I thought of that reduce the memory footprint of my database requires me to either use separate tables (which to my understanding is super bad practice) or storing the data in CSV files which would require me to write my own code to search through the data (which to my understanding requires me not to be a bum... and probably search substantially slower). Is there a database structure that I could implement that doesn't require me to store so much duplicate data?
A database with your data structure is going to be less efficient than custom code. Guess what? That is not unusual.
First, though, I think you should wait until this is actually a performance problem. A timestamp with no fractional seconds requires 4 bytes (see here). So, a record would have, say 4+4+8=16 bytes (assuming a double floating point representation for value). By removing the timestamp you would get 12 bytes -- savings of 25%. I'm not saying that is unimportant. I am saying that other considerations -- such as getting the code to work -- might be more important.
Based on your data, the difference is between 184 Mbytes/day and 138 Mbytes/day, or 67 Gbytes/year and 50 Gbytes/year. You know, you are going to have to deal with biggish data issues regardless of how you store the timestamp.
Keeping the timestamp in the data will allow you other optimizations, notably the use of partitions to store each day in a separate file. This should be a big benefit for your queries, assuming the where conditions are partition-compatible. (Learn about partitioning here.) You may also need indexes, although partitions should be sufficient for your particular query example.
The point of SQL is not that it is the most optimal way to solve any given problem. Instead, it offers a reasonable solution to a very wide range of problems, and it offers many different capabilities that would be difficult to implement individually. So, the time to a reasonable solution is much, much less than developing bespoke code.
Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me.
Not really. Date values are not that big and storing the same value for each row is perfectly reasonable.
...use separate tables (which to my understanding is super bad practice)
Who told you that?! Normalising data (splitting it into separate, linked data structures) is actually good practice - so long as you don't overdo it - and SQL is designed to perform well with relational tables. It would be perfectly fine to create a "time" table and link to the data in the other table. It would use a little more memory, but that really shouldn't concern you unless you are working in a very limited memory environment.
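A sketch of that normalised variant: a small "times" table holds each distinct timestamp once, and the results rows reference it by id (SQLite via Python; all names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE times   (time_id INTEGER PRIMARY KEY, ts TEXT UNIQUE);
    CREATE TABLE results (point_id INTEGER,
                          time_id  INTEGER REFERENCES times(time_id),
                          value    REAL);
""")

# One shared timestamp row serves every point sampled at that instant:
cur = conn.execute("INSERT INTO times (ts) VALUES ('2024-03-01 08:00:00')")
time_id = cur.lastrowid
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [(p, time_id, float(p)) for p in range(3)],
)

# Querying joins the two tables back together:
rows = conn.execute("""
    SELECT r.point_id, MAX(r.value) AS value
    FROM results r JOIN times t ON t.time_id = r.time_id
    WHERE t.ts BETWEEN '2024-03-01 00:00:00' AND '2024-03-01 23:59:59'
    GROUP BY r.point_id
    ORDER BY r.point_id
""").fetchall()
```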
We are refactoring our database; we will be adding about 100,000 rows to the db a day.
Each row now contains a date field (yy-mm-dd), so today we entered the date "2013-11-29" 100,000 times into a table.
Would it make sense to break the dates out into a separate table and use its id instead, since we don't store the time?
There is a trade-off there: if we break it out, we will be adding yet another JOIN to the query used when we want to view the records later. We already join 3 tables when we query for information, and the database consists of about 10 million entries.
Any thoughts? The database is growing huge, so we need to think about disk space but also performance.
The driving consideration for breaking out the dates into a separate table should not be storage. This is a question about the data model. As Gareth points out, the size of the built-in date and the size of the foreign key are about the same.
The real question is whether you need additional information about dates that you cannot readily get from the built-in date functions. MySQL has a rich set of date functions, where you can get the day of the week, format the dates into strings, extract components, and so on. In addition, the built-in functions also handle comparisons, differences, and adding "intervals" to dates.
If these are sufficient, then use the built-in functions. On the other hand, if you have "business rules" surrounding dates, then consider another table. Here are examples of such business rules:
A special set of dates that might be holidays. Or worse, country-dependent holidays.
Special business definitions for "quarter", "year", "week-of-year" and so on.
The need to support multiple date formats for internationalization purposes.
Special rules about date types, such as "workday" versus "weekend".
A date table is only one possible solution for handling these issues. But if you need to support these, then having a separate date table starts to make sense.
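When such rules apply, the date dimension is just another table keyed by the date, carrying the business attributes as columns. A sketch (SQLite via Python; the holiday/workday/fiscal-quarter columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_date (
        d              TEXT PRIMARY KEY,  -- the calendar date itself is the key
        is_holiday     INTEGER NOT NULL,
        is_workday     INTEGER NOT NULL,
        fiscal_quarter INTEGER NOT NULL   -- business definition, not calendar quarter
    )
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)", [
    ("2024-12-24", 0, 1, 2),
    ("2024-12-25", 1, 0, 2),   # Christmas: a holiday, not a workday
    ("2024-12-26", 0, 1, 2),
])

# Business rules that no built-in date function can answer, e.g. "how many
# workdays fell in this range?", become plain filters on the date table:
workdays = conn.execute(
    "SELECT COUNT(*) FROM dim_date WHERE is_workday = 1 "
    "AND d BETWEEN '2024-12-24' AND '2024-12-26'"
).fetchone()[0]
```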
It probably doesn't make any sense to do this since the storage requirements increase just a bit, and if you ever want to start storing the time, you're sort of stuck.