I have a table with entries that have a start_date and end_date (both indexed, DATE format). I want to return a list of all entries where today is between these 2 dates. Here are 2 options I've considered:
1) Direct query:
MySQL query (where 2014-02-28 would of course be a variable):
SELECT * FROM mytable WHERE '2014-02-28' BETWEEN start_date AND end_date
2) Daily cronjob to go through all entries and update a field is_valid (boolean format) to be true when today is between both dates, and false otherwise (the performance is less important here as it's not customer-facing). Then the MySQL query to select entries would be:
SELECT * FROM mytable WHERE is_valid = 1
The end goal is to have the fastest query (will be used in search results which would be a prominent page of the site) when entries could reach 100,000 or even millions in the future. I'm not sure if indexing dates would be good enough, or if the cronjob is just overkill - or if there is an even better way to do this!
Thanks in advance for your advice on which option to choose!
EDIT: thanks for the replies - is this index structure good?
If you want the faster query between these two options, then there is nothing like a cron job to set the flag appropriately. You should then index the resulting column, because otherwise you have to do a full-table scan. Without the index, this approach is probably slower than using the dates with an index.
For most purposes, a composite index on start_date and end_date is the preferred solution and should be quite fast enough.
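A minimal sketch of that composite index and the matching query, assuming the table is called mytable as in the question. The index name idx_start_end is made up for illustration, and CURDATE() stands in for the variable date.

```sql
-- Hypothetical index name; the column names are taken from the question.
ALTER TABLE mytable ADD INDEX idx_start_end (start_date, end_date);

-- Same logic as "today BETWEEN start_date AND end_date", written as two
-- comparisons against the bare columns.
SELECT *
FROM mytable
WHERE start_date <= CURDATE()
  AND end_date   >= CURDATE();

-- Check which index (if any) the optimizer actually picks.
EXPLAIN
SELECT *
FROM mytable
WHERE start_date <= CURDATE()
  AND end_date   >= CURDATE();
```

Because start_date is compared with a range, only that first column narrows the index scan; the end_date condition is then checked against the entries in that range. It is worth confirming the plan against realistic data.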
I suspect that you are succumbing to the demon of premature optimization. The fastest approach is to run a cron job and load today's data into a new table, properly indexed and structured for your analysis. Barring that, a composite index is a very reasonable approach. Although updating a flag does solve the problem, it would be neither the fastest nor the cleanest method.
I have used this same schema before. A query with the dates was fast enough, if you have the right indexes.
This is my data from my table:
I have exactly one million rows, so this is just a snippet.
I would like to make this query faster:
It basically groups the values by time (ev represents the year, honap represents the month, and so on). The one problem is that it takes a lot of time. I tried to apply indexes, as you can see here:
but it does absolutely nothing.
Here is my index:
I have also tried adding perc (which represents the minute) because of its cardinality, but MySQL doesn't want to use it. Could you give me any suggestions?
Is the data realistic? If so, why run the query -- it essentially delivers exactly what was in the table.
If, on the other hand, you had several rows per minute, then the GROUP BY makes sense.
The index you have is not worth using. However, the Optimizer seemed to like it. That's a bug.
In that case, I would simplify it to this:
SELECT AVG(konyha1) AS 'avg',
LEFT(time, 16) AS 'time'
FROM onemilliondata
GROUP BY LEFT(time, 16)
A DATE or TIME or DATETIME can be treated as such a datatype or as a VARCHAR. I'm asking for it to be a string.
Even in this case, no index is useful. However, this would make it a little faster:
PRIMARY KEY(time)
and the table would have only 2 columns: time, konyha1.
It is rarely beneficial to break a date and/or time into components and put them into columns.
A million points will probably choke a graphing program, and the screen has a resolution of only a few thousand pixels.
Perhaps you should group by hour? And use LEFT(time, 13)? Performance would probably be slightly faster -- but only because less data is being sent to the client.
If you are collecting this data "forever", consider building and maintaining a "summary table" of the averages for each unit of time. Then the incremental effort is, say, aggregating yesterday's data each morning.
You might find MIN(konyha1) and MAX(konyha1) interesting to keep on an hourly or daily basis. Note that daily or weekly aggregates can be derived from hourly values.
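One possible shape for such a summary table, as a sketch only. The table name konyha1_hourly and its column names are invented here, and it assumes time is a DATETIME column in onemilliondata. Storing the row count and the sum (rather than the average) lets daily or weekly averages be derived exactly from the hourly rows.

```sql
-- Hypothetical summary table: one row per hour.
CREATE TABLE konyha1_hourly (
  hr          DATETIME NOT NULL,
  row_count   INT UNSIGNED NOT NULL,
  sum_konyha1 DOUBLE,
  min_konyha1 DOUBLE,
  max_konyha1 DOUBLE,
  PRIMARY KEY (hr)
);

-- Run once each morning (for example from cron): aggregate yesterday's rows.
INSERT INTO konyha1_hourly (hr, row_count, sum_konyha1, min_konyha1, max_konyha1)
SELECT DATE_FORMAT(`time`, '%Y-%m-%d %H:00:00'),
       COUNT(*),
       SUM(konyha1),
       MIN(konyha1),
       MAX(konyha1)
FROM onemilliondata
WHERE `time` >= CURDATE() - INTERVAL 1 DAY
  AND `time` <  CURDATE()
GROUP BY DATE_FORMAT(`time`, '%Y-%m-%d %H:00:00');

-- Daily figures derived from the hourly rows, without touching the raw data.
SELECT DATE(hr)                          AS the_day,
       SUM(sum_konyha1) / SUM(row_count) AS avg_konyha1,
       MIN(min_konyha1)                  AS day_min,
       MAX(max_konyha1)                  AS day_max
FROM konyha1_hourly
GROUP BY DATE(hr);
```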
I am new to SQL, and certainly to MySQL.
I have created a table from streaming market data named trade that looks like
date | time |instrument|price |quantity
----------|-----------------------|----------|-------|--------
2017-09-08|2017-09-08 13:16:30.919|12899586 |54.15 |8000
2017-09-08|2017-09-08 13:16:30.919|13793026 |1177.75|750
2017-09-08|2017-09-08 13:16:30.919|1346049 |1690.8 |1
2017-09-08|2017-09-08 13:16:30.919|261889 |110.85 |50
This table is huge (150 million rows per date).
To retrieve data efficiently, I have created an index date_time_inst (date,time,instrument) because most of my queries will select a specific date or date range and then a time range.
But that does not help speed up a query like:
select * from trade where date="2017-09-08", instrument=261889
So, I am considering creating another index date_inst_time (date, instrument, time). Will that help speed up queries where I wish to get the time-series of one or a few instruments out of the thousands?
Should I worry much about the additional database write time due to index updates?
I get data every second, and take about 100 ms to process it and store in a database. As long as I continue to take less than 1 sec I am fine.
To get the most efficient query you need to query on a clustered index. According to the documentation, this is automatically set on the primary key and cannot be set on any other columns.
I would suggest ditching the date column and creating a composite primary key on time and instrument.
A couple of recommendations:
There is no need to store date and time separately if time falls on that same date. You can instead have one datetime column and store timestamps in it.
You can then have one index on the datetime and instrument columns, which will make the queries run faster.
With so many inserts and a fixed format of SELECT query (i.e. always by date first, followed by instrument), I would suggest looking into other databases, such as wide-column stores (like Cassandra). You may get faster writes and reads for such a structure.
First, your use case sounds like two indexes would be useful: (date, instrument) and (date, time).
Given your volume of data, you may want to consider partitioning the data. This involves storing different "shards" of data in different files. One place to start is with the documentation.
From your description, you would want to partition by date, although instrument is another candidate.
Another approach would be a clustered index with date as the first column in the index. This assumes that the data is inserted "in order", to reduce movement of the data on inserts.
You are dealing with a large quantity of data. MySQL should be able to handle the volume. But, you may need to dive into more advanced functionality, such as partitioning and clustered indexes to get the functionality you need.
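A sketch of what partitioning by date could look like. Everything about the table definition below is an assumption (the name trade_partitioned, the column types, the partition names and the index names); only the column names come from the question. RANGE COLUMNS partitioning on a DATE column needs MySQL 5.5 or later.

```sql
-- Hypothetical partitioned copy of the trade table; types are guesses.
CREATE TABLE trade_partitioned (
  `date`     DATE           NOT NULL,
  `time`     DATETIME(3)    NOT NULL,
  instrument INT UNSIGNED   NOT NULL,
  price      DECIMAL(12, 2) NOT NULL,
  quantity   INT UNSIGNED   NOT NULL,
  INDEX date_inst (`date`, instrument),
  INDEX date_time (`date`, `time`)
)
PARTITION BY RANGE COLUMNS (`date`) (
  PARTITION p20170908 VALUES LESS THAN ('2017-09-09'),
  PARTITION p20170909 VALUES LESS THAN ('2017-09-10'),
  PARTITION pmax      VALUES LESS THAN (MAXVALUE)
);

-- With an equality on the partitioning column, the partitions column of
-- EXPLAIN (EXPLAIN PARTITIONS on older versions) shows which partitions
-- the query will touch.
EXPLAIN
SELECT *
FROM trade_partitioned
WHERE `date` = '2017-09-08'
  AND instrument = 261889;
```

New partitions would have to be added as new dates arrive, for example by reorganizing pmax; that maintenance is not shown here.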
Typo?
I assume you meant
select * from trade where date="2017-09-08" AND instrument=261889
^^^
The optimal index for such a query is
INDEX(instrument, date)
And, contrary to other Comments/Answers, it is better to have the date last, especially if you want more than one day.
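As a sketch of why the column order matters, assuming the trade table from the question (the index name inst_date and the second date range are made up for illustration):

```sql
-- Hypothetical index name; columns are from the question's trade table.
ALTER TABLE trade ADD INDEX inst_date (instrument, `date`);

-- One instrument, one day: both conditions are equalities on index columns.
SELECT *
FROM trade
WHERE instrument = 261889
  AND `date` = '2017-09-08';

-- One instrument, several days: the equality on instrument comes first in
-- the index, so the date range maps to one contiguous slice of it.
SELECT *
FROM trade
WHERE instrument = 261889
  AND `date` >= '2017-09-04'
  AND `date` <  '2017-09-09';
```

With the columns the other way round, (date, instrument), the multi-day query would use the index for the date range only and check instrument on every entry in that range.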
Splitting date and time
It is usually a bad idea to split date and time. It is also usually a bad idea to have redundant data; in this case, the date is repeated. Instead, use
WHERE `time` >= "2017-09-08"
AND `time` < "2017-09-08" + INTERVAL 1 DAY
and get rid of the date column. Note: This pattern works for DATE, DATETIME, DATETIME(3), etc, without messing up with the midnight at the end of the range.
Data volume?
150M rows? 10 new rows per second? That means you have about 5 years' data? A steady 10/sec insertion rate is rarely a problem.
Need to see SHOW CREATE TABLE. If there are a lot of indexes, then there could be a problem. Need to see the datatypes to look for shrinking the size.
Will you be purging 'old' data? If so, we need to talk about partitioning for that specific purpose.
How many "instruments"? How much RAM? Need to discuss the ramifications of an index starting with instrument.
The query
Is that the main SELECT you use? Is it always 1 day? One instrument? How many rows are typically returned?
Depending on the PRIMARY KEY and whatever index is used, fetching 100 rows could take anywhere from 10ms to 1000ms. Is this issue important?
Millisecond resolution
It is usually folly to think that any time resolution is not going to have duplicates.
Is there an AUTO_INCREMENT already?
SPACE IS CHEAP. Indexes cost time on creation/insertion (once), but shave time off retrieval (many, many times).
My practice is to create indexes on all the relevant fields, in all the useful orders. This way, MySQL can choose the best index for your query.
So if you have 3 relevant fields:
INDEX 1 (field1,field2,field3)
INDEX 2 (field1,field3)
INDEX 3 (field2,field3)
INDEX 4 (field3)
The first index will be used when all fields are present. The others are for shorter WHERE conditions.
Unless you know that some combinations will never be used, this will give MySQL the best chance to optimize your query. I'm also assuming that field1 is the biggest driver of the data.
I have a table with millions of rows where one of the columns is a TIMESTAMP and against which I frequently select for date ranges. Would it improve performance any to index that column, or would that not furnish any notable improvement?
EDIT:
So, I've indexed the TIMESTAMP column. The following query
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
Takes 3.1 seconds.
There are just over 3 million records in the interactions table.
The above query produces a result of ~976k
Does this seem like a reasonable amount of time to perform this task?
If you want improvement on the efficiency of queries, you need 2 things:
First, index the column.
Second, and this is more important, make sure the conditions on your queries are sargable, i.e. that indexes can be used. In particular, functions should not be used on the columns. In your example, one way to write the condition would be:
WHERE interaction_time >= '2013-10-10'
AND interaction_time < (CURRENT_DATE + INTERVAL 1 DAY)
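A rough way to see the difference with EXPLAIN. The index name idx_interaction_time is made up, and the ALTER can be skipped if the column is already indexed. The exact EXPLAIN output depends on the MySQL version and the data, so the comments describe what one would typically expect rather than guaranteed output.

```sql
-- Skip this if interaction_time already has an index.
ALTER TABLE interactions ADD INDEX idx_interaction_time (interaction_time);

-- Function wrapped around the column: the optimizer cannot seek into the
-- index by value, so expect a scan of every row or every index entry.
EXPLAIN
SELECT COUNT(*)
FROM interactions
WHERE DATE(interaction_time) BETWEEN DATE('2013-10-10') AND DATE(NOW());

-- Bare column compared with constants: eligible for a range scan on the index.
EXPLAIN
SELECT COUNT(*)
FROM interactions
WHERE interaction_time >= '2013-10-10'
  AND interaction_time <  (CURRENT_DATE + INTERVAL 1 DAY);
```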
The general rule with indexes is they speed retrieval of data with large data sets, but SLOW the insertion and update of records.
If you have millions of rows, and need to select a small subset of them, then an index most likely will improve performance when doing a SELECT. (If you need most or all of them it will make little or no difference.)
Without an index, a table scan (i.e. a read of every record to locate the required ones) will occur, which can be slow.
With tables with only a few records, a table scan can actually be faster than an index, but this is not your situation.
Another consideration is how many discrete values you have. If you only have a handful of different dates, indexing probably won't help much, if at all. However, if you have a wide range of dates, the index will most likely help.
One caveat: if the index is very big and won't fit in memory, you may not get the performance benefits you might hope for.
You also need to consider what other fields you are retrieving, joins etc., as they all have an impact.
A good way to check how performance is impacted is to use the EXPLAIN statement to see how MySQL will execute the query.
It would improve performance if:
there are at least "several" different values
your query uses a date range that would select less than "most" of the rows
To find out for sure, use EXPLAIN to show which index is being used. Run EXPLAIN before creating the index and again after; you should see whether the new index is being used. If it is, you can be fairly confident performance is better.
You can also simply compare query timings.
For
select count(*) from interactions where date(interaction_time) between date('2013-10-10') and date(now())
query to be optimized you need to do the following:
1) Use just interaction_time instead of date(interaction_time)
2) Create an index that covers the interaction_time column
3) (optional) Use just '2013-10-10', not date('2013-10-10')
You need #1 because indexes are only used if the columns are used in comparisons as-is, not as arguments in other expressions.
Adding an index on a date column definitely increases performance.
My table has 11 million rows, and a query to fetch rows which were updated on a particular date took the following time according to conditions:
Without index: ~2.5s
With index: ~5ms
I have a table with the following structure:
ID, SourceID, EventId, Starttime, Stoptime
All of the ID columns are char(36) and the times are dates.
The problem is that querying the table is really slow. I have 7 million rows, and about 60-70 threads that are writing (insert or update) to the table all the time.
On the other side I have the GUI that needs to read from this table, and this is where it gets slow. If I want to select all the events where SourceID = something, it takes almost 300 seconds. SourceID has an index. When I run the same query with the EXPLAIN keyword in front, I get this:
select_type = simple
type = ref
possible_keys = sourceidnevent,sourceid
key = sourceid
key_len = 109
ref = const
rows = 84148
And the query
SELECT * FROM tabel where sourceid='28B791C7-D519-4F0C-BC03-EFB1D4AC9CEB'
However, I started to think about what I really need from the table. I want to know which event occurred on which server, and also which events occurred on servers, sorted by date. I have added indexes for every combination of columns used in WHERE and ORDER BY.
I need all the rows because I want to do some calculations on them: grouping, averages and so on. But I'm doing that in the .NET environment instead of issuing many queries.
However, if I add a LIMIT to the select it goes faster. So is the bottleneck the amount of data that is transferred and not actually the finding/selecting part? If so, I can rebuild my application to do the calculation for only one day and save the result into another table, and later aggregate all of it.
How can I speed up the process? Would it be better to switch to MongoDB? I currently use MySQL and InnoDB.
There's a lot of information you've not provided here - some of which I've mentioned in my comment elsewhere.
NoSQL is unlikely to be much faster than MySQL on a single node. I'd be very surprised if it were faster than using the handler API on MySQL along with appropriate indexes.
You've provided part of an explain plan (but not the query being explained) - but you haven't provided any interpretation of this:
rows = 84148
Does it really need to process that many rows to provide the result you need? If so and the result is not aggregated then maybe you need to think about why you need to ship 80k rows of data to the front end. If it's only having to return a few non-aggregated rows then you really need to analyse your indexes.
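If the rows are only fetched in order to group and average them, the aggregation could be pushed into the query so that a handful of rows is shipped instead of 80k. A sketch under assumptions: the table is named tabel as in the question's query, the grouping is per EventId, and the duration calculation is only a guess at the kind of statistic wanted.

```sql
-- Assumption: Starttime and Stoptime hold values TIMESTAMPDIFF can subtract.
SELECT EventId,
       COUNT(*)                                        AS event_count,
       MIN(Starttime)                                  AS first_start,
       MAX(Stoptime)                                   AS last_stop,
       AVG(TIMESTAMPDIFF(SECOND, Starttime, Stoptime)) AS avg_duration_seconds
FROM tabel
WHERE SourceID = '28B791C7-D519-4F0C-BC03-EFB1D4AC9CEB'
GROUP BY EventId;
```

The server still has to read the matching rows, but only one row per event type crosses the network.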
I have added index for all combination
Too many indexes is just as bad for performance as too few.
I am currently part of a team designing a site that will potentially have thousands of users who will be doing a number of date related searches. During the design phase we have been trying to determine which makes more sense for performance optimization.
Should we store the datetime field as a MySQL datetime? Or should we break it up into a number of fields (year, month, day, hour, minute, ...)?
The question is: with a large data set and a potentially large set of users, would we gain performance-wise by breaking the datetime into multiple fields and not relying on MySQL date functions? Or is MySQL already optimized for this?
Have a look at the MySQL Date & Time Functions documentation, because you can pull specific information from a date using existing functions like YEAR, MONTH, etc. But while these exist, if you have an index on the date column(s), using these functions means those indexes can not be used...
The problem with storing a date as separate components is the work needed to reconstruct them into a date when you want to do range comparisons or date operations.
Ultimately, choose what works best with your application. If there's seldom need for the date to be split out, consider using a VIEW to expose the date components without writing possibly redundant information into your tables.
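A sketch of the VIEW idea. The table events, its columns, and the view name are all hypothetical, chosen only to illustrate keeping a single DATETIME column while still exposing the components.

```sql
-- Hypothetical table: one DATETIME column, indexed for range searches.
CREATE TABLE events (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  created_at DATETIME NOT NULL,
  INDEX idx_created_at (created_at)
);

-- The view derives the components on read; nothing extra is stored.
CREATE VIEW events_with_parts AS
SELECT id,
       created_at,
       YEAR(created_at)   AS created_year,
       MONTH(created_at)  AS created_month,
       DAY(created_at)    AS created_day,
       HOUR(created_at)   AS created_hour,
       MINUTE(created_at) AS created_minute
FROM events;

-- Display code reads the parts from the view, while the search filters on
-- the bare column so that idx_created_at remains usable.
SELECT id, created_year, created_month
FROM events_with_parts
WHERE created_at >= '2010-01-01'
  AND created_at <  '2011-01-01';
```

Filtering on created_year instead would put a function between the condition and the column, which is the index problem described above.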
Use a regular datetime field. You can always switch over to the separated components down the line if performance becomes an issue. Try to avoid premature optimization - in many cases, YAGNI. You may wind up employing both the datetime field and the separated component methodology, since they both have their strengths.
If you know ahead of time some key criteria that all searches will have, MySQL (>= v5.1) table partitioning might help.
For example, if you have a table like this:
create table Books(pubDate dateTime, title varchar(50));
And you know all searches must at least include a year, you could partition it on the date field, along these lines:
create table Books(pubDate dateTime,title varchar(50))
partition by hash(year(pubDate)) partitions 10;
Then, when you run a select against the table, if your where clause includes criteria that limit the partition the results can exist on, the search will only scan that partition, rather than a full table scan. You can see this in action with:
-- scans entire table
explain partitions select * from Books where title like '%title%';
versus something like:
-- scans just one partition
explain partitions select * from Books
where year(pubDate)=2010
and title like '%title%';
The MySQL documentation on this is quite good, and you can choose from multiple partitioning algorithms.
Even if you opt to break up the date, a table partition on, say, year (int) (assuming searches will always specify a year) could help.
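A sketch of an alternative layout using RANGE instead of HASH. The table name BooksByYear and the partition names are made up. The MySQL manual describes partition pruning for tables partitioned on YEAR() of a DATE or DATETIME column when the WHERE clause puts a range on the column itself, so this form may be easier to verify. Whichever layout is used, the EXPLAIN output is the thing to check.

```sql
-- Hypothetical table, partitioned by publication year.
CREATE TABLE BooksByYear (
  pubDate DATETIME NOT NULL,
  title   VARCHAR(50)
)
PARTITION BY RANGE (YEAR(pubDate)) (
  PARTITION p2009 VALUES LESS THAN (2010),
  PARTITION p2010 VALUES LESS THAN (2011),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- The condition is on pubDate itself, not on year(pubDate). Check the
-- partitions column (EXPLAIN PARTITIONS on older versions) to see whether
-- only the 2010 partition is listed.
EXPLAIN
SELECT *
FROM BooksByYear
WHERE pubDate >= '2010-01-01'
  AND pubDate <  '2011-01-01'
  AND title LIKE '%title%';
```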