We are refactoring our database and will be adding about 100,000 rows to it per day.
Each row now contains a date field (yyyy-mm-dd), so today we entered the date "2013-11-29" 100,000 times into a table.
Would it make sense to break out the dates into a separate table and use its id instead, since we don't store the time?
There is a trade-off there. If we break it out, we will be adding yet another JOIN to the query used when we want to view the records later. We already join 3 tables when we query for information. The database consists of about 10 million entries.
Any thoughts? The database is growing huge, so we need to think about disk space but also performance.
The driving consideration for breaking out the dates into a separate table should not be storage. This is a question about the data model. As Gareth points out, the size of the built-in date and the size of the foreign key are about the same.
The real question is whether you need additional information about dates that you cannot readily get from the built-in date functions. MySQL has a rich set of date functions, where you can get the day of the week, format the dates into strings, extract components, and so on. In addition, the built-in functions also handle comparisons, differences, and adding "intervals" to dates.
If these are sufficient, then use the built-in functions. On the other hand, if you have "business rules" surrounding dates, then consider another table. Here are examples of such business rules:
A special set of dates that might be holidays. Or worse, country-dependent holidays.
Special business definitions for "quarter", "year", "week-of-year" and so on.
The need to support multiple date formats for internationalization purposes.
Special rules about date types, such as "workday" versus "weekend".
A date table is only one possible solution for handling these issues. But if you need to support these, then having a separate date table starts to make sense.
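To make the distinction concrete, here is a minimal sketch of a date table that carries business rules. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names (calendar, records), the columns (is_holiday, fiscal_quarter) and the sample rows are all assumptions for illustration, not something from the question. The date itself is the key, so plain date queries on the main table still need no JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (               -- hypothetical date table
    cal_date       TEXT PRIMARY KEY,  -- ISO date, e.g. '2013-11-29'
    is_holiday     INTEGER NOT NULL DEFAULT 0,
    fiscal_quarter TEXT NOT NULL      -- a business-defined quarter
);
CREATE TABLE records (                -- hypothetical main table
    id         INTEGER PRIMARY KEY,
    entry_date TEXT NOT NULL REFERENCES calendar (cal_date)
);
""")

# Made-up sample data: which dates count as holidays is a business rule.
conn.executemany("INSERT INTO calendar VALUES (?, ?, ?)", [
    ("2013-11-28", 1, "FY14-Q2"),
    ("2013-11-29", 0, "FY14-Q2"),
])
conn.executemany("INSERT INTO records (entry_date) VALUES (?)",
                 [("2013-11-28",), ("2013-11-29",), ("2013-11-29",)])

# Business rule: needs the date table.
holiday_count = conn.execute("""
    SELECT COUNT(*)
    FROM records r
    JOIN calendar c ON c.cal_date = r.entry_date
    WHERE c.is_holiday = 1
""").fetchone()[0]

# Plain calendar fact: a built-in function is enough, no JOIN.
# SQLite's strftime('%w') returns '0' (Sunday) through '6' (Saturday).
weekday_counts = conn.execute("""
    SELECT strftime('%w', entry_date), COUNT(*)
    FROM records
    GROUP BY strftime('%w', entry_date)
    ORDER BY strftime('%w', entry_date)
""").fetchall()

print(holiday_count)
print(weekday_counts)
```

In MySQL the columns would be of type DATE and the weekday would come from a function such as DAYOFWEEK() rather than strftime().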
It probably doesn't make any sense to do this since the storage requirements increase just a bit, and if you ever want to start storing the time, you're sort of stuck.
Suppose I have a site with 10,000 members and I want to collect one particular metric from those users every day. These metric points would be used to form points on a graph. So as you can figure, the data points are very much related to the day on which they were collected. To keep things relevant, any data older than 30 days would be deleted.
What would be the best way to store this data? I'm currently using MariaDB. I would like to have a table where the first column is for the member id, and the other columns are for each of the days where the data was collected. With each day, a new column would be created on the end of the table to store that day's data, and the first column of the table would be deleted. This seems like the most computationally efficient approach.
This would mean the titles for those columns would be the date. Similar questions have been posed on here before and people are quick to condemn the database design for having dates as column titles. I'm not necessarily disagreeing but I would like to know what a better alternative is.
The other big problem here is that trying to get MariaDB / MySQL to automatically add a column with a unique title every day and delete the first column is quite a bit more complicated than I think it ought to be. I wonder if a different database service such as PostgreSQL would be better suited for these sorts of things.
I have a database in which each entry is a business; some of the data is opening and closing times. Originally, to support multiple opening and closing times in a day, I was going to store them as a string such as '09:00-15:00,17:00-22:00' and then split it up and convert it to TIMESTAMPs server-side. I now understand that it is "bad form" to store times as strings. But the solution requires a second table.
What exactly is the issue with using a string instead of DATE or TIMESTAMP? If I kept it all in one table it would get pulled with my current query and then just processed and converted into a TIMESTAMP. Using multiple tables causes more queries. Is the string manipulation and conversion really that much more taxing than another query searching through an entire table that will end up with thousands of entries?
There are three separate issues here. One is whether it is better to combine two elementary data items into a list, and store the list in one place, or to keep the elementary items separate, and store them in two separate places. For these purposes, I'm considering the intersection of a single row and a single column to be one place.
The second issue is whether it is better to identify a time interval as a single elementary item or better to describe a time interval as two elementary items, side by side.
The third issue is whether it is better to store points in time as strings or use the DBMS timestamp datatype.
For the first issue, it's better to store separate elementary items in separate places. In this case, in two separate rows of the same table.
For the second issue, it's better to describe a time interval as two timestamps side by side than to combine them into an interval, for the DBMS systems I have used. But there are DBMS systems that have a time interval datatype that is distinct from the timestamp datatype. Your mileage may vary.
For the third issue, it's better to use the DBMS timestamp datatype to describe a point in time than a character string. Different DBMS products have different facilities for storing a time without any date associated with it, and in your case, you may want to make use of this. It depends on how you will be searching the data. If you are going to want to find all the rows that contain a 9:00 to 15:00 time range regardless of date, you will want to make use of this. If you are going to want to find all the rows that contain a 9:00 to 15:00 range on any Tuesday, I suggest you look into data warehousing techniques.
Why?
For the first two answers, it has to do with normalization, indexing, searching strategies, and query optimization. These concepts are all related, and each of them will help you understand the other three.
For the third answer, it has to do with using the power of the DBMS to do the detailed work for you. Any good DBMS will have tools for subtracting timestamps to yield a time interval, and for ordering timestamps correctly. Getting the DBMS to do this work for you will save you a lot of work in your application programming, and generally use less computer resources in production.
Having said that, I don't know MySQL, so I don't know how good it is for date manipulations or for query optimization.
By '09:00-15:00,17:00-22:00', you mean that it is open for two periods of time? Then 2 rows in one table. If the open times vary by day of week, then another column to specify that. If you want holidays, another column.
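A rough sketch of that layout, using Python's built-in sqlite3 module as a stand-in for MySQL. The table name opening_hours, its column names, the day numbering and the sample businesses are assumptions for illustration. SQLite has no TIME type, so the sketch stores zero-padded 'HH:MM' text, which compares in the right order; in MySQL the two columns would be TIME.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE opening_hours (           -- hypothetical table
    business_id INTEGER NOT NULL,
    day_of_week INTEGER NOT NULL,      -- assumed: 0 = Sunday ... 6 = Saturday
    opens_at    TEXT    NOT NULL,      -- 'HH:MM' (TIME in MySQL)
    closes_at   TEXT    NOT NULL
)
""")

# '09:00-15:00,17:00-22:00' becomes two rows for the same business and day.
conn.executemany("INSERT INTO opening_hours VALUES (?, ?, ?, ?)", [
    (1, 2, "09:00", "15:00"),
    (1, 2, "17:00", "22:00"),
    (2, 2, "10:00", "18:00"),
])


def open_businesses(day_of_week, at):
    """Return ids of businesses open on the given day at time 'HH:MM'."""
    cur = conn.execute("""
        SELECT DISTINCT business_id
        FROM opening_hours
        WHERE day_of_week = ? AND opens_at <= ? AND ? < closes_at
        ORDER BY business_id
    """, (day_of_week, at, at))
    return [row[0] for row in cur]


print(open_businesses(2, "12:00"))
print(open_businesses(2, "16:00"))
print(open_businesses(2, "18:30"))
```

The point of the design is that "who is open at 16:00?" becomes a WHERE clause the database can index, instead of string splitting in application code.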
But... What will you be doing with the data?...
If you are going to write SQL to use the times, then what I described may work best.
If the processing will be in your application code, then a CSV file that you load every time, parse every time, etc, may be best.
Or, you could put that CSV file into a table as one big string. (However, this is not much different than having a file.)
I need help regarding how to structure overlapping date ranges in my data warehouse. My objective is to model the data in a way that allows date-level filtering on the reports.
I have dimensions — DimEmployee, DimDate and a fact called FactAttendance. The records in this fact are stored as follows —
To represent this graphically —
A report needs to be created out of this data, that will allow the end-user to filter it by making a selection of a date range. Let's assume user selects date range D1 to D20. On making this selection, the user should see the value for how many days at least one of the employees was on leave. In this particular example, I should see the addition of light-blue segments in the bottom i.e. 11 days.
An approach that I am considering is to store one row per employee per date for each of the leaves. The only problem with this approach is that it will greatly increase the number of records in the fact table. Besides, there are other columns in the fact that will have redundant data.
How are such overlapping date/time problems usually handled in a warehouse? Is there a better way that does not involve inserting numerous rows?
Consider modelling your fact like this:
fact_attendance (date_id,employee_id,hours,...)
This will enable you to answer your original question by simply filtering on the Date dimension, but you will also be able to handle issues like leave credits, and fractional day leave usage.
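As a minimal sketch of why this works, here is the one-row-per-employee-per-date fact in Python's built-in sqlite3 module (a stand-in for your warehouse DBMS). The leave ranges are invented sample data, not the ones from the question, date_id is simplified to a plain day number (D1 = 1, D2 = 2, ...) instead of a real DimDate key, and the helper names add_leave and days_with_leave are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_attendance (
    date_id     INTEGER NOT NULL,   -- simplified: day number, not a DimDate key
    employee_id INTEGER NOT NULL,
    hours       REAL    NOT NULL,   -- leave hours taken on that day
    PRIMARY KEY (date_id, employee_id)
)
""")


def add_leave(employee_id, first_day, last_day, hours=8.0):
    """Expand one leave range into one fact row per day."""
    conn.executemany(
        "INSERT INTO fact_attendance VALUES (?, ?, ?)",
        [(day, employee_id, hours) for day in range(first_day, last_day + 1)],
    )


# Invented, overlapping leave ranges.
add_leave(1, 1, 5)
add_leave(2, 4, 8)
add_leave(3, 15, 16)


def days_with_leave(start, end):
    """Count days in [start, end] on which at least one employee was on leave."""
    return conn.execute(
        "SELECT COUNT(DISTINCT date_id) FROM fact_attendance "
        "WHERE date_id BETWEEN ? AND ?",
        (start, end),
    ).fetchone()[0]


print(days_with_leave(1, 20))
print(days_with_leave(5, 15))
```

The overlap between employees 1 and 2 needs no special handling: COUNT(DISTINCT date_id) counts each day once, however many employees were away.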
Yes, it might use a little more storage than your first proposal, but it is a better dimensional representation, and will satisfy more (potential) requirements.
If you are really worried about storage - probably not a real worry - use a DBMS with columnar compression, and you'll see large savings in disk.
The reason I say "not a real worry" about storage is that your savings are meaningless in today's world of storage. 1,000 employees with 20 days leave each per year, over five years would mean a total of 100,000 rows. Your DBMS would probably execute the entire star join in RAM. Even one million employees would require less than one terabyte before compression.
In order to analyze dates and times I am creating a MySQL table where I want to keep the time information. Some example analyses will be stuff like:
Items per day/week/month/year
Items per weekday
Items per hour
etc.
Now, with regard to performance, how should I store the date in my table:
date type: Unix timestamp?
date type: datetime?
or keep each part of the date in its own field, e.g. year, month, day in separate fields?
The last one, for example, would be handy if I'm analysing by weekday; I wouldn't have to perform WEEKDAY(item.date) on MySQL but could simply use WHERE item.weekday = :w.
Based on your usage, you want to use the native datetime format. Unix formats are most useful when the major operations are (1) ordering; (2) taking differences in seconds/minutes/hours/days; and (3) adding seconds/minutes/hours/days. They need to be converted to internal date time formats to get the month or week day, for instance.
You also have a potential indexing issue. If you want to select ranges of days, hours, months and so on for your results, then you want an index on the column. For this purpose an index on a datetime is probably sufficient.
If the summaries are by hour, you might find it helpful to store the date component in a date field and the hour in a separate column. That would be particularly helpful if you are combining hours from different days.
Whether you break out other components of the date, such as weekday and month, for indexing purposes would depend on the volume of data in the table, performance requirements, and the queries you are planning on running. I would not be inclined to do this, except as a later optimization.
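A minimal sketch of the "single indexed datetime column" approach, using Python's built-in sqlite3 module as a stand-in for MySQL. The table name item, the column created_at and the sample timestamps are assumptions. SQLite stores the values as ISO text and extracts parts with strftime(); in MySQL the column would be DATETIME and you would use functions such as HOUR() and WEEKDAY().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id         INTEGER PRIMARY KEY,
    created_at TEXT NOT NULL           -- 'YYYY-MM-DD HH:MM:SS' (DATETIME in MySQL)
);
CREATE INDEX idx_item_created_at ON item (created_at);  -- helps range filters
""")
conn.executemany("INSERT INTO item (created_at) VALUES (?)", [
    ("2015-04-03 09:15:00",),
    ("2015-04-03 09:45:00",),
    ("2015-04-04 14:00:00",),
    ("2015-05-15 09:05:00",),
])

# Items per hour of day, combining hours from different days.
per_hour = conn.execute("""
    SELECT strftime('%H', created_at), COUNT(*)
    FROM item
    GROUP BY strftime('%H', created_at)
    ORDER BY strftime('%H', created_at)
""").fetchall()

# Items per weekday; SQLite's '%w' is '0' (Sunday) through '6' (Saturday).
per_weekday = conn.execute("""
    SELECT strftime('%w', created_at), COUNT(*)
    FROM item
    GROUP BY strftime('%w', created_at)
    ORDER BY strftime('%w', created_at)
""").fetchall()

# Range filter on the indexed column: half-open interval [start, end).
in_range = conn.execute(
    "SELECT COUNT(*) FROM item WHERE created_at >= ? AND created_at < ?",
    ("2015-04-04", "2015-05-16"),
).fetchone()[0]

print(per_hour, per_weekday, in_range)
```

Note that the per-hour and per-weekday groupings apply a function to the column, so they scan the rows rather than use the index; that is the case where a separate hour or weekday column might pay off later.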
The rule of thumb is: store things as they should be stored, and don't make performance tweaks until you actually hit a bottleneck. If you store your date as separate fields, you'll eventually run into a situation where you need the date as a whole inside your database (e.g. an update query for a particular range of time), and that will be painful: a condition covering 3 April 2015 to 15 May 2015 becomes long and awkward to write.
You should keep your dates as a date type. This will give you maximum flexibility and (most probably) more readable queries, and it keeps all your options for working with them open. The only thing I can really recommend is storing the same date split into year/month/day in additional columns. Of course, this will bloat your database and require extreme caution in update scenarios, but it will let you use either form of the data in your queries.
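To illustrate the range problem, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table and column names (item, item_date, y, m, d) are assumptions; the table holds one row per day from 1 March to 30 June 2015, stored both as a whole date and as separate parts.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE item (
    item_date TEXT    NOT NULL,   -- whole date, 'YYYY-MM-DD' (DATE in MySQL)
    y         INTEGER NOT NULL,   -- the same date split into parts
    m         INTEGER NOT NULL,
    d         INTEGER NOT NULL
)
""")

# One row per day, 1 March 2015 through 30 June 2015 (122 days).
first = date(2015, 3, 1)
days = [first + timedelta(days=i) for i in range(122)]
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?, ?)",
    [(day.isoformat(), day.year, day.month, day.day) for day in days],
)

# Whole date: one readable condition.
by_date = conn.execute(
    "SELECT COUNT(*) FROM item "
    "WHERE item_date BETWEEN '2015-04-03' AND '2015-05-15'"
).fetchone()[0]

# Separate fields: the same range needs one branch per boundary month.
by_parts = conn.execute("""
    SELECT COUNT(*) FROM item
    WHERE y = 2015
      AND ((m = 4 AND d >= 3) OR (m = 5 AND d <= 15))
""").fetchone()[0]

print(by_date, by_parts)
```

Both queries select the same 43 days, but the second one only stays this short because the range covers two adjacent months in one year; add whole months in between, or cross a year boundary, and it grows further branches.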
Good day everyone.
I spent a whole day looking for an answer to my question but unfortunately I did not find the answer I'm looking for.
I am trying to automate some processes that we are currently using by creating a web application in PHP and a MySQL database.
I would like to prepare the staff schedule in Excel and then upload it to MySQL using PHP. Most of the database fields are straightforward: an employee ID that is INT, and a date that is in DATE format.
The issue I am facing is with the time. When an employee is working, I would like to upload their shift start time and end time as TIME. However, if they are not working, I would like the database to store their status in letters: OFF for a day off, AL for annual leave, ML for medical leave, EL for emergency leave and so on.
I do not know what data type will support this, i.e. accept either a time or an entry out of a specific list of outcome codes.
Any assistance will be appreciated.
Mohamed.
Side Note:
I asked this question at a time when I was beginning to learn how to code. One of the mistakes I have since learned to avoid is mixing business logic and application logic. This question may be useful to someone in the future, and my advice to any new developer is to keep in mind that the business model logic may not go hand in hand with your application logic, but there is always a way to make it work.
Mohamed.
Here is an idea: add another column for status. One of the statuses will be a working status, and every other status will be as you described. Only the working status will have data in the "start time" and "end time" columns. In my opinion this is the best solution: it allows for better search capabilities, a cleaner database and better readability.
However, if you absolutely want to, or have a reason why you can't have an additional column, you can always store your time as text.
PS: Another tip for your database is to drop the date column and store both times in DATETIME format. Depending on what job shifts you are storing, it may range from unlikely to nearly impossible, but a shift can start on one day and end on the next. Even if you think you won't ever need it, it is good practice and makes the database more resilient. If you had to change it in the future, it would be a pain to do so.
There is no such datatype in MySQL.
A VARCHAR field would accept both times and your custom codes, but it would be a pain writing the queries: you would always have to check the data before converting it to times for reports etc.
I would just create three nullable fields: startTime, endTime and absenceReason and fill absenceReason if and only if both times are NULL.
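A minimal sketch of that layout, with the "if and only if" rule enforced by the database. It uses Python's built-in sqlite3 module as a stand-in for MySQL; the table name schedule, the work_date column, the sample rows and the helper try_insert are assumptions, and the code list contains only the four codes named in the question. As far as I know, MySQL only enforces CHECK constraints from version 8.0.16 on, so on older versions the same rule would have to live in the application or in a trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE schedule (                      -- hypothetical table
    employee_id   INTEGER NOT NULL,
    work_date     TEXT    NOT NULL,          -- DATE in MySQL
    startTime     TEXT,                      -- TIME in MySQL, NULL when absent
    endTime       TEXT,
    absenceReason TEXT
        CHECK (absenceReason IN ('OFF', 'AL', 'ML', 'EL')),
    -- Either both times and no reason, or a reason and no times.
    CHECK (
        (absenceReason IS NULL
             AND startTime IS NOT NULL AND endTime IS NOT NULL)
        OR
        (absenceReason IS NOT NULL
             AND startTime IS NULL AND endTime IS NULL)
    )
)
""")


def try_insert(row):
    """Insert one row; return False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO schedule VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


rows = [
    (1, "2024-01-01", "09:00", "17:00", None),   # working shift
    (1, "2024-01-02", None, None, "AL"),         # annual leave
    (1, "2024-01-03", "09:00", "17:00", "AL"),   # times and a reason: rejected
    (1, "2024-01-04", None, None, "XX"),         # unknown code: rejected
    (1, "2024-01-05", None, None, None),         # nothing filled in: rejected
]
results = [try_insert(row) for row in rows]
print(results)
```

With this in place, reports can filter on absenceReason IS NULL to get working shifts and never have to guess whether a value is a time or a code.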
You could use a VarChar or Char field to store either a time or text values. However, I would recommend using different fields for different things: a start time and an end time of data type time, and a separate field of type VarChar or Char for the type of leave, etc.