I would like to discuss the "best" way to store date periods in a database. Let's talk about SQL/MySQL, but this question could apply to any database. I have the feeling I have been doing something wrong for years...
In English, the information I have is:
-In year 2014, the value is 1000
-In year 2015, the value is 2000
-In year 2016, there is no value
-In year 2017 (and onward), the value is 3000
Some might store it as:
BeginDate EndDate Value
2014-01-01 2014-12-31 1000
2015-01-01 2015-12-31 2000
2017-01-01 NULL 3000
Others might store it as:
Date Value
2014-01-01 1000
2015-01-01 2000
2016-01-01 NULL
2017-01-01 3000
With the first method, the validation rules needed to avoid holes and overlaps look like a nightmare to develop.
With the second method, the problem seems to be filtering a single point-in-time date that falls inside a period.
Which do my colleagues prefer? Any other suggestions?
EDIT: I used full years only as an example; my data usually changes with day granularity.
EDIT 2: I thought about treating the stored "Date" as "BeginDate", ordering the rows by Date, and then deriving the "EndDate" from the next (or previous) row. Storing "BeginDate" and "Interval" would lead to the same hole/overlap problem as method one, which I would need a complex validation rule to avoid.
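For illustration, here is a minimal sketch of that "EndDate from the next row" idea using a window function. This assumes MySQL 8.0+ (or MariaDB 10.2+) and a hypothetical table my_values(`Date`, Value), neither of which is from the question:
SELECT
    `Date`                                               AS BeginDate,
    LEAD(`Date`) OVER (ORDER BY `Date`) - INTERVAL 1 DAY AS EndDate,  -- NULL for the open-ended last row
    Value
FROM my_values
ORDER BY `Date`;

-- point-in-time lookup: which value is effective on a given day?
SELECT Value
FROM my_values
WHERE `Date` <= '2016-06-15'
ORDER BY `Date` DESC
LIMIT 1;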
It mostly depends on the way you will be using this information - I'm assuming you do more than just store values for a year in your database.
I'm guessing here, but I assume you have other tables with time-bounded data, and that you need to compare dates across them to find matches.
For instance, in your current schema:
select *
from other_table ot
inner join year_table yt on ot.transaction_date between yt.year_start and yt.year_end
That should be an easy query to optimize - it's a straight data comparison, and if the table is big enough, you can add indexes to speed it up.
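For example, something like this (just a sketch; the index names are made up, and the tables are the ones assumed in the query above):
CREATE INDEX idx_year_table_range  ON year_table (year_start, year_end);
CREATE INDEX idx_other_tx_date     ON other_table (transaction_date);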
In your second schema suggestion, it's not as easy:
select *
from other_table ot
inner join year_table yt
on ot.transaction_date between yt.year_start
and yt.year_start + INTERVAL 1 YEAR
Crucially - this is harder to optimize, as every comparison needs to evaluate an expression rather than a plain column value. It might not matter, but with a large table or a more complex query it could become a bottleneck.
You can also store the year as an integer (as some of the commenters recommend).
select *
from other_table ot
inner join year_table yt on year(ot.transaction_date) = yt.year
Again - this is likely to have a performance impact, as every comparison requires a function to execute.
The purist in me doesn't like to store this as an integer - so you could also use MySQL's YEAR datatype.
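A minimal sketch of that variant (the table and column names here are assumptions, not from the question):
CREATE TABLE year_table (
    `year` YEAR NOT NULL,   -- MySQL YEAR type, range 1901-2155
    value  INT  NOT NULL,
    PRIMARY KEY (`year`)
);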
So, assuming data size isn't an issue you're optimizing for, the solution really would lie in the way your data in this table relates to the rest of your schema.
Related
I started an HR management project and I want to count the days between two dates without counting holidays and weekends, so HR can count an employee's days off.
Here's the case: I want to count the days between 2018-02-14 and 2018-02-20, where there is an office holiday on 2018-02-16. The result should be 3 days.
I have already created a table called tbl_holiday where I put all the weekends and holidays for the year.
I found this post, and I tried it on my MariaDB
Here's my query:
SELECT 5 * (DATEDIFF('2018-02-20', '2018-02-14') DIV 7) +
MID('0123444401233334012222340111123400012345001234550', 7 *
WEEKDAY('2018-02-14') + WEEKDAY('2018-02-20') + 1, 1) -
(SELECT COUNT(dates) FROM tbl_holiday WHERE dates NOT IN (SELECT dates FROM tbl_holiday)) as Days
The query works, but the result is 4 days, not 3. It seems the query only excludes the weekends, not the holiday.
What is wrong with my query? Am I missing something? Thank you for helping me.
@RichardDoe, from the question comments.
In a reasonable implementation of a date table, you create a list of all days (covering a sufficient range to cope with any query you may run against it - 15 years each way from today is probably a useful minimum), and alongside each day you store a variety of derived attributes.
I wrote a Q&A recently with basic tools that would get you started in SQL Server: https://stackoverflow.com/a/48611348/9129668
Unfortunately I don't have a MySQL environment or intimate familiarity with it to allow me to write or adapt queries off the top of my head (as I'm doing here), but I hope this will illustrate the structure of a solution for you in SQL Server syntax.
In terms of the answer I link to (which generates a date table on the fly), extending it with your holiday table (and making some inferences about how you've defined that table), and noting that a working day is any day Mon-Fri that isn't a holiday, you'd write a query like this to get the number of working days between any two dates:
WITH
dynamic_date_table AS
(
    SELECT *
    FROM generate_series_datetime2('2000-01-01', '2030-12-31', 1)
    CROSS APPLY datetime2_params_fxn(datetime2_value)
)
,date_table_ext1 AS
(
    SELECT
        ddt.*
        ,IIF(hol.dates IS NOT NULL, 1, 0) AS is_company_holiday
    FROM
        dynamic_date_table AS ddt
        LEFT JOIN
        tbl_holiday AS hol
            ON (hol.dates = ddt.datetime2_value)
)
,date_table_ext2 AS
(
    SELECT
        *
        ,IIF(is_weekend = 1 OR is_company_holiday = 1, 0, 1) AS is_company_work_day
    FROM date_table_ext1
)
SELECT
    COUNT(datetime2_value)
FROM
    date_table_ext2
WHERE
    (datetime2_value BETWEEN '2018-02-14' AND '2018-02-20')
    AND
    (is_company_work_day = 1)
Obviously, the idea for a well-factored solution is that these intermediate calculations (being general in nature to the entire company) get rolled into the datetime2_params_fxn, so that any query run against the database gains access to the pre-defined list of company workdays. Queries run against it then start to resemble plain English (rather than the approach you linked to and adapted in your question, which is ingenious but far from clear).
If you want top performance (which will matter if you are hitting these calculations heavily), then you define appropriate parameters, save the lot into a stored date table, and index that table appropriately. Your query then becomes as simple as the final part of the query here, but referencing the stored date table instead of the with-block.
The sequentially-numbered workdays I referred to in my comment on your question are another step again for the efficiency and indexability of certain types of queries against a date table, but I won't complicate this answer any further for now. If any further clarification is required, please feel free to ask.
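Since the question is MariaDB, here is a rough sketch of the same date-table idea in MariaDB 10.2+ / MySQL 8.0+ syntax. This is my own translation, not part of the linked answer, and it assumes tbl_holiday(dates) as described in the question:
WITH RECURSIVE calendar AS (
    SELECT DATE('2018-01-01') AS d
    UNION ALL
    SELECT d + INTERVAL 1 DAY FROM calendar WHERE d < '2018-12-31'
)
SELECT COUNT(*) AS working_days        -- counts both endpoints inclusively
FROM calendar c
LEFT JOIN tbl_holiday h ON h.dates = c.d
WHERE c.d BETWEEN '2018-02-14' AND '2018-02-20'
  AND WEEKDAY(c.d) < 5                 -- Monday-Friday only
  AND h.dates IS NULL;                 -- and not in the holiday table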
I found the answer to this problem.
It turns out I just need simple arithmetic: take the raw DATEDIFF between the two dates and subtract every row of tbl_holiday (which already contains the weekends as well as the holidays) that falls inside the range.
SELECT (SELECT DATEDIFF('2018-02-20', '2018-02-14')) - (SELECT COUNT(id) FROM tbl_holiday WHERE dates BETWEEN '2018-02-14' AND '2018-02-20');
I am trying to run a query to find the devices that did not have a 01: run in the past 7 days.
I have tried "WHERE column NOT LIKE '%01:%'", but it just filters out the 01: rows and still shows the machines that had a 01: run in the past 7 days.
I have a table called devices. Each location has a unique ID number. Each device runs a job at 1am and 7pm. Devices should have 1 entry for 01:00:00 per week, then 3 entries for 19:00:00 per week. An example of the cell data is 2017-10-23 19:00:02.
So I begin with
Select * From devices
Where locationid=##
AND jobdate < DATE_SUB(NOW(), INTERVAL 7 DAY)
AND jobdate not like '%01:%'
What I get in the result is a machine that did run at 01:00 two days ago. The job date shown is 19:00, so it looks like the condition just stripped out the 01: rows.
I am thinking of grouping the job data and then listing the computers that did not have a 2017-10-23 01:00:02 entry.
There is a good deal of intuition in the following suggestion, more on that later.
Most databases don't actually store date/time information in a WYSIWYG fashion. Indeed, if you think about it long enough, you will see that date/times are really "sets of numbers"; that is why we can do things like calculate the number of days from date1 to date2, etc. So, IF the data is stored as a datetime data type, don't attempt to use LIKE (which is for text) against a datetime column. Instead, look for date- and time-related functions that may apply to your situation. Here you are looking for "not equal to a specific time of day" (I think). So, to remove "date" from consideration, convert it to "time", and then you can filter on that.
So below, I introduce a new column jobtime which is the time portion of jobdate, and then I look for any times not equal to a given value.
SQL Fiddle
MySQL 5.6 Schema Setup:
CREATE TABLE Devices
(`locationid` varchar(2), `jobdate` datetime)
;
INSERT INTO Devices
(`locationid`, `jobdate`)
VALUES
('##', '2017-10-23 01:00:00'),
('##', '2017-10-23 19:00:02')
;
Query 1:
select
*
from (
select locationid, cast(jobdate as time) jobtime, jobdate
from devices
) d
where locationid = '##'
and jobtime <> '01:00:00'
;
Results:
| locationid | jobtime | jobdate |
|------------|----------|----------------------|
| ## | 19:00:02 | 2017-10-23T19:00:02Z |
...
Why is there "intuition" above? (the "more on that later")
It is remarkably frustrating not to know which database is in use, because the syntax differs so much between vendors. It is also essential to know the EXACT data type of the jobdate column - because if it is varchar, for example, I have just made a complete fool of myself in the query above. In other words, we are not likely to answer because key facts are missing.
Finally, you have data! It's in your table(s) already. Why not make it easy on everyone by sharing a few bits of it? Provide "sample data" with your question, and the "expected result" too (i.e. provide both, not one without the other, and do not use images of data!). Hopefully you can see from the example above how useful sample data and an expected result are. For example, if my intuition is way off, you can tell in an instant that it is - even if you don't read the SQL.
Rant over, not all points raised here apply to this question.
I generally use a DATETIME field to store the created/updated time of data within an application.
But now I have come across a database table where they have kept date and time as separate fields.
So in what kinds of schema should the two be kept separate, and why?
What are the pros and cons of each approach?
There is a huge difference in performance between a DATE field and a DATETIME field. I have a table with more than 4,000,000 records, and for testing purposes I added two fields, each with its own index: one using DATETIME and the other using DATE.
I disabled the MySQL query cache to be able to test properly and ran the same query 1000 times:
SELECT * FROM `logs` WHERE `dt` BETWEEN '2015-04-01' AND '2015-05-01' LIMIT 10000,10;
DATETIME INDEX:
197.564 seconds.
SELECT * FROM `logs` WHERE `d` BETWEEN '2015-04-01' AND '2015-05-01' LIMIT 10000,10;
DATE INDEX:
107.577 seconds.
Using the DATE-indexed field gives a performance improvement of 45.55%!
So I would say: if you are expecting a lot of data in your table, please consider separating the date from the time, each with its own index.
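If you go that route on MySQL 5.7+ (or MariaDB 10.2+), a generated column keeps the DATE field from ever drifting out of sync with the DATETIME one; a sketch assuming a logs table with a dt DATETIME column, as in the benchmark above:
ALTER TABLE `logs`
    ADD COLUMN `d` DATE GENERATED ALWAYS AS (DATE(`dt`)) STORED,
    ADD INDEX `idx_logs_d` (`d`);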
I tend to think there are basically no advantages to storing the date and time in separate fields. MySQL offers very convenient functions for extracting the date and time parts of a datetime value.
Okay, there can be some efficiency reasons. In MySQL, you can put separate indexes on the fields. So, if you want to search for particular times, a query that counts by hours of the day (for instance) can use an index on the time field. An index on a datetime field would not be used in this case. A separate date field might make it easier to write a query that will use the date index, but, strictly speaking, a datetime should also work.
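A small sketch of that kind of query (the events table and its columns here are hypothetical, not from the question):
CREATE TABLE events (
    event_date DATE NOT NULL,
    event_time TIME NOT NULL,
    INDEX idx_events_time (event_time)
);

-- a range scan on the TIME index: everything between 09:00 and 10:00,
-- regardless of the date
SELECT COUNT(*)
FROM events
WHERE event_time >= '09:00:00'
  AND event_time <  '10:00:00';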
The one time where I've seen dates and times stored separately is in a trading system. In this case, the trade has a valuation date. The valuation time is something like "NY Open" or "London Close" -- this is not a real time value. It is a description of the time of day used for valuation.
The tricky part is when you have to do date arithmetic on a time value and you do not want a date portion coming into the mix. For example, with a single datetime column this is straightforward:
myapptdate = 2014-01-02 09:00:00
Select such and such where myapptdate between 2014-01-02 07:00:00 and 2014-01-02 13:00:00
but if the time is stored on its own and gets converted back to a datetime for the comparison, you can end up comparing against a default-date value like 1900-01-02 07:00:00 instead of the intended 2014-01-02 07:00:00.
One difference I found is using BETWEEN for dates with a non-zero time.
Imagine a search with a "between dates" filter. The standard user expectation is that it will return records from the end day as well, so with DATETIME you always have to add an extra day for the BETWEEN to work as expected, while with DATE you only pass what the user entered, with no extra logic needed.
So query
SELECT * FROM mytable WHERE mydate BETWEEN '2020-06-24' AND '2020-06-25'
will return a record for 2020-06-25 16:30:00, while query:
SELECT * FROM mytable WHERE mydatetime BETWEEN '2020-06-24' AND '2020-06-25'
won't - you'd have to add an extra day:
SELECT * FROM mytable WHERE mydatetime BETWEEN '2020-06-24' AND '2020-06-26'
But as victor diaz mentioned, doing datetime calculations with separate date+time columns would be a super inefficient nightmare, far worse than just adding a day to the second datetime. Therefore I'd only use DATE if the time is irrelevant, or as a "cache" for speeding up date lookups (see Elwin's answer).
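For completeness (my own note, not part of the answer above): one common way to add that extra day while keeping the DATETIME column index-friendly is a half-open range built from the dates the user entered:
SELECT *
FROM mytable
WHERE mydatetime >= '2020-06-24'
  AND mydatetime <  '2020-06-25' + INTERVAL 1 DAY;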
I have the following table in MySQL that records event counts of stuff happening each day
event_date event_count
2011-05-03 21
2011-05-04 12
2011-05-05 12
I want to be able to query this efficiently by date range AND by day of week. For example - "What is the event_count on Tuesdays in May?"
Currently the event_date field is a date type. Are there any functions in MySQL that let me query this column by day of week, or should I add another column to the table to store the day of week?
The table will hold hundreds of thousands of rows, so given a choice I'll choose the most efficient solution (as opposed to most simple).
Use DAYOFWEEK in your query, something like:
SELECT * FROM mytable WHERE MONTH(event_date) = 5 AND DAYOFWEEK(event_date) = 7;
This will find all info for Saturdays in May.
To get the fastest reads store a denormalized field that is the day of the week (and whatever else you need). That way you can index columns and avoid full table scans.
Just try the above first to see if it suits your needs and if it doesn't, add some extra columns and store the data on write. Just watch out for update anomalies (make sure you update the day_of_week column if you change event_date).
Note that the denormalized fields will increase the time taken to do writes, increase calculations on write, and take up more space. Make sure you really need the benefit and can measure that it helps you.
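If you are on MySQL 5.7+, a generated column sidesteps that update anomaly because the server maintains it for you; a sketch (the table name events is an assumption, event_date is the column from the question):
ALTER TABLE events
    ADD COLUMN day_of_week TINYINT
        GENERATED ALWAYS AS (DAYOFWEEK(event_date)) STORED,
    ADD INDEX idx_events_dow (day_of_week);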
Check the DAYOFWEEK() function.
If you want a textual representation of the day of week, use the DAYNAME() function.
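For example:
SELECT DAYOFWEEK('2011-05-03') AS dow,       -- 3  (1 = Sunday ... 7 = Saturday)
       DAYNAME('2011-05-03')   AS day_name;  -- 'Tuesday'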
I've been hammering my head against my desk for the past few days on this, and so I turn to you, Stack Overflow.
The software I'm working on has time-sensitive data. The usual solution for this is effective and expiration dates.
EFF_DT XPIR_DT VALUE
2000-05-01 2000-10-31 100
2000-11-01 (null) 90
This would be easy. Unfortunately, we require data that repeats on a yearly basis arbitrarily far into the future. In other words, each May 1 (starting in 2000) we may want the effective value to be 100, and each November 1 we may want to change it to 90.
This may be true for a long time (>50 years), and so I don't want to just create a hundred records. I.e., I don't want to do this:
EFF_DT XPIR_DT VALUE
2000-05-01 2000-10-31 100
2000-11-01 2001-04-30 90
2001-05-01 2001-10-31 100
2001-11-01 2002-04-30 90
2002-05-01 2002-10-31 100
2002-11-01 2003-04-30 90
...
2049-05-01 2049-10-31 100
2049-11-01 2050-04-30 90
2050-05-01 2050-10-31 100
2050-11-01 2051-04-30 90
These values may also change with time. Values before 2000 might have been constant (no flip-flopping) and values for the coming decade may be different than the values for the last:
EFF_DT XPIR_DT REPEATABLE VALUE
1995-01-01 2000-04-30 false 85
2000-05-01 2010-04-30 true 100
2000-11-01 2010-10-31 true 90
2010-05-01 (null) true 120
2010-11-01 (null) true 115
We already have a text file (from a legacy app) that stores data in a form very close to this, so there are benefits to adhering to this type of structure as closely as possible.
The question then comes on retrieval: which value would apply to today, 2010-03-09?
It seems that the best way to do this would be to find the most recent instance of each effective date (of all the active rows), then see which is the greatest.
EFF_DT MOST_RECENT XPIR_DT VALUE
2000-05-01 2009-05-01 2010-04-30 100
2000-11-01 2009-11-01 2010-10-31 90
The value for today would be 90, since 2009-11-01 is later than 2009-05-01.
On, say, 2007-06-20:
EFF_DT MOST_RECENT XPIR_DT VALUE
2000-05-01 2007-05-01 2010-04-30 100
2000-11-01 2006-11-01 2010-10-31 90
The value would be 100 since 2007-05-01 is later than 2006-11-01.
Using the MySQL date functions, what's the most efficient way to calculate the MOST_RECENT field?
Or, can anyone think of a better way to do this?
The language is Java, if it matters. Thanks all!
Suppose the date you want is '2007-06-20'.
You need to combine the non-repeating elements with the repeating ones, so you could do something like this (untested and probably needs some tinkering, but it should give you the general idea):
select * from (
    select * from mytable
    where
        repeatable = false
        -- SQL has no chained comparisons, so spell out both halves
        and EFF_DT <= '2007-06-20'
        and '2007-06-20' < XPIR_DT
    union all
    select * from mytable
    where
        repeatable = true
        -- map the recurring EFF_DT onto the wanted year before comparing
        and EFF_DT <= str_to_date(concat('2007', '-', month(EFF_DT), '-', day(EFF_DT)), '%Y-%m-%d')
        and str_to_date(concat('2007', '-', month(EFF_DT), '-', day(EFF_DT)), '%Y-%m-%d') < XPIR_DT
) as candidates
order by EFF_DT desc limit 1
I've had to do similar things with recurring appointments & events, and you might find that MySQL will be a lot happier with the "static" date style that you don't want - each recurring instance spelled out in hundreds of rows.
If possible, I'd consider creating a separate table to store them flattened out, while keeping the effective/expires dates where they are (to match legacy data & act as a parent), and a 1:many relation between the two tables (i.e. an "event_id" on the flattened data referencing the original's PK). Writing all those records will obviously take longer, but it's directly lightening the load from reading them (where things generally need to be faster).
Creating a stored procedure or external program to handle recalculating a flat start_date / end_date / value table should be fairly basic, given a common interval. Querying the data could then be as simple as WHERE #somedate BETWEEN start_date AND end_date, instead of increasingly complex conversions & date math.
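A sketch of what that flattened table and its lookup might look like (all names here are illustrative, not from the question):
CREATE TABLE value_periods (
    rule_id    INT  NOT NULL,   -- references the original effective/expires "rule" row
    start_date DATE NOT NULL,
    end_date   DATE NOT NULL,
    value      INT  NOT NULL,
    PRIMARY KEY (rule_id, start_date),
    INDEX idx_value_periods_range (start_date, end_date)
);

-- retrieval is then a plain range check
SELECT value
FROM value_periods
WHERE '2010-03-09' BETWEEN start_date AND end_date;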
Again, INSERTs and UPDATEs will be slower, but "hundreds of rows" isn't even scratching the surface of what MySQL's capable of. If it's just 2 dates, an int, and some sort of int key, writing a few hundred records shouldn't take but a couple seconds on a sub-par server. If we were talking millions of records then maybe something could be tweaked (do you really need to track 50 years ahead or just the next 5? can recalculation be moved to off-peak times via cron? etc), but even then MySQL will just be that much more effective compared to calculating the difference every time.
Also maybe of interest: What's the best way to model recurring events in a calendar application? & Data structure for storing recurring events?
Here is a query that you can use to calculate the most recent EFF_DT for a data set. You will have to fill in the WHERE clause, because I'm not sure how this data is organized.
select EFF_DT from date_table where 1 order by EFF_DT desc limit 1
The flip-flop of 90 and 100 is more complex, but you should be able to take care of it using the MySQL date and time functions. This is a tricky one, and I'm not 100% sure what you are trying to do. But this query checks whether the month of XPIR_DT is on or after May (the 5th month) and before November (the 11th month). If that is true, the query will return 90; if it's false, you'll get 100.
select if((month(XPIR_DT)>=5) and (month(XPIR_DT)<11),90,100) from date_table where id=1