Date Index - Long Range Search All Rows, Small Not - mysql

Why does MySQL scan all rows when I switch to a 1-year range?
--Table dates
id (int)
date (timestamp)
value (varchar)
PRIMARY(id), date_index(date)
1750 rows
Executing
EXPLAIN SELECT * FROM dates WHERE date BETWEEN '2011-04-27' AND '2011-04-28'
The rows column displays 18 rows.
If I widen the BETWEEN range in either direction, to 1 year for example, the rows column displays 1750 rows.
EXPLAIN SELECT * FROM dates WHERE date BETWEEN '2011-04-27' AND '2012-04-28'
EXPLAIN SELECT * FROM dates WHERE date BETWEEN '2010-04-27' AND '2011-04-28'

The optimizer builds the query plan based on several things, including the amount and distribution of the data. My best guess is that either you don't have much more than a year's worth of data, or using the index for the one-year range wouldn't read many fewer rows than scanning the whole table.
If that doesn't sound right, can you post the output of:
SELECT MIN(date), MAX(date) FROM dates;
SELECT COUNT(*) FROM dates WHERE date BETWEEN '2011-04-27' AND '2012-04-28';
This article I wrote shows some examples of how the optimizer works too: What makes a good MySQL index? Part 2: Cardinality

Related

How to generate faster mysql query with 1.6M rows

I have a table that has 1.6M rows. Whenever I run the query below, it takes an average of 7.5 seconds.
select * from table
where pid = 170
and cdate between '2017-01-01 0:00:00' and '2017-12-31 23:59:59';
I tried adding LIMIT 1000 or LIMIT 10000, and narrowing the date filter to 1 month, but it still takes about 7.5 s on average. I also tried adding a composite index on pid and cdate, but that made it 1 second slower.
Here is the INDEX list
https://gist.github.com/primerg/3e2470fcd9b21a748af84746554309bc
Can I still make it faster? Is this an acceptable performance considering the amount of data?
It looks like the index is missing. Create this index and see if it helps.
CREATE INDEX cid_date_index ON table_name (pid, cdate);
Also change your query to the following.
select * from table
where pid = 170
and cdate between CAST('2017-01-01 0:00:00' AS DATETIME) and CAST('2017-12-31 23:59:59' AS DATETIME);
Please provide SHOW CREATE TABLE clicks.
How many rows are returned? If it is 100K rows, the effort to shovel that many rows is significant. And what will you do with that many rows? If you then summarize them, consider summarizing in SQL!
Do have cdate as DATETIME.
Do you use id for anything? Perhaps this would be better:
PRIMARY KEY (pid, cdate, id) -- to get benefit from clustering
INDEX(id) -- if still needed (and to keep AUTO_INCREMENT happy)
This smells like Data Warehousing. DW benefits significantly from building and maintaining Summary table(s), such as one that has the daily click count (etc), from which you could very rapidly sum up 365 counts to get the answer.
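To make the summary-table idea concrete, here is a rough sketch in Python rather than SQL. The click rows and the names `clicks`, `daily` and `yearly_total` are made up for illustration; the point is only that the yearly answer is computed from at most 365 small daily counts per pid instead of from the raw rows.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical raw rows: (pid, cdate). In the real table there would be 1.6M.
clicks = [
    (170, datetime(2017, 1, 1, 0, 0, 5)),
    (170, datetime(2017, 1, 1, 13, 30, 0)),
    (170, datetime(2017, 6, 15, 8, 0, 0)),
    (171, datetime(2017, 6, 15, 9, 0, 0)),
    (170, datetime(2018, 1, 1, 0, 0, 0)),
]

# The "summary table": one count per (pid, day), maintained as rows arrive.
daily = Counter((pid, ts.date()) for pid, ts in clicks)

def yearly_total(pid, year):
    """Sum the daily counts for one pid and one calendar year."""
    return sum(n for (p, d), n in daily.items() if p == pid and d.year == year)

total_2017 = yearly_total(170, 2017)
```

In MySQL the equivalent would be a separate table keyed on (pid, day) that is updated as clicks are inserted or by a periodic job.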
CAST is unnecessary. Furthermore 0:00:00 is optional -- it can be included or excluded for either DATE or DATETIME. I prefer
cdate >= '2017-01-01'
AND cdate < '2017-01-01' + INTERVAL 1 YEAR
to avoid leap year, midnight, date arithmetic, etc.
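A small Python sketch of why the half-open form is safer. It assumes, purely for illustration, that cdate can carry fractional seconds; the function names are made up.

```python
from datetime import datetime

def in_year_between(ts):
    # Inclusive upper bound written to the second, as in the question.
    return datetime(2017, 1, 1) <= ts <= datetime(2017, 12, 31, 23, 59, 59)

def in_year_half_open(ts):
    # Half-open range: start <= ts < start of the next year.
    return datetime(2017, 1, 1) <= ts < datetime(2018, 1, 1)

# A row stamped half a second before midnight on New Year's Eve.
late = datetime(2017, 12, 31, 23, 59, 59, 500000)

missed_by_between = not in_year_between(late)
caught_by_half_open = in_year_half_open(late)
```

The inclusive form silently drops the last fraction of a second of the year, while the half-open form needs no knowledge of how precise the column is.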

MySQL - group by interval query optimisation

Some background first. We have a MySQL database with a "live currency" table. We use an API to pull the latest currency values for different currencies, every 5 seconds. The table currently has over 8 million rows.
Structure of the table is as follows:
id (INT 11 PK)
currency (VARCHAR 8)
value (DECIMAL)
timestamp (TIMESTAMP)
Now we are trying to use this table to plot the data on a graph. We are going to have various different graphs, e.g: Live, Hourly, Daily, Weekly, Monthly.
I'm having a bit of trouble with the query. Using the Weekly graph as an example, I want to output data from the last 7 days, in 15 minute intervals. So here is how I have attempted it:
SELECT *
FROM currency_data
WHERE ((currency = 'GBP')) AND (timestamp > '2017-09-20 12:29:09')
GROUP BY UNIX_TIMESTAMP(timestamp) DIV (15 * 60)
ORDER BY id DESC
This outputs the data I want, but the query is extremely slow. I have a feeling the GROUP BY clause is the cause.
Also BTW I have switched off the sql mode 'ONLY_FULL_GROUP_BY' as it was forcing me to group by id as well, which was returning incorrect results.
Does anyone know of a better way of doing this query which will reduce the time taken to run the query?
You may want to create summary tables for each of the graphs you want to do.
If your data really is coming every 5 seconds, you can attempt something like:
SELECT *
FROM currency_data cd
WHERE currency = 'GBP' AND
timestamp > '2017-09-20 12:29:09' AND
UNIX_TIMESTAMP(timestamp) MOD (15 * 60) BETWEEN 0 AND 4
ORDER BY id DESC;
For both this query and your original query, you want an index on currency_data(currency, timestamp, id).
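For what it's worth, the arithmetic behind both queries can be sketched in Python. This assumes timestamps in UTC and one sample every 5 seconds; the sample values and function names are made up. `bucket_id` mirrors `UNIX_TIMESTAMP(timestamp) DIV (15 * 60)` and `is_bucket_sample` mirrors the `MOD (15 * 60) BETWEEN 0 AND 4` filter.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 15 * 60

def bucket_id(ts):
    """15-minute bucket number, like UNIX_TIMESTAMP(ts) DIV 900."""
    return int(ts.timestamp()) // BUCKET_SECONDS

def is_bucket_sample(ts):
    """True for a sample in the first 5 seconds of its bucket."""
    return int(ts.timestamp()) % BUCKET_SECONDS <= 4

# Hypothetical samples (UTC).
a = datetime(2017, 9, 20, 12, 30, 2, tzinfo=timezone.utc)
b = datetime(2017, 9, 20, 12, 44, 57, tzinfo=timezone.utc)
c = datetime(2017, 9, 20, 12, 45, 1, tzinfo=timezone.utc)

kept = [ts for ts in (a, b, c) if is_bucket_sample(ts)]
```

Here `a` and `b` fall in the same bucket, but only `a` and `c` pass the filter, so each bucket contributes one row without any GROUP BY. If the feed skips a beat, a bucket can end up with no row at all, which is the price of this shortcut.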

how can I calculate the SUM in 4days buckets over all dates

I have a MySQL DB where one column is the DATE and the other column is the SIGNAL. Now I would like to calculate the SUM over Signal for 4 days each.
e.g.
SUM(signal over DATE1,DATE2,DATE3,DATE4)
SUM(signal over DATE5,DATE6,DATE7,DATE8)
...
where DATE_N is the successor of DATE_N-1, which need not be the day before.
More or less, the algorithm should allow a variable group size; 4 is just an example.
Can anyone here give me advice on how to do this in MySQL?
I found this question, "group by with count"; maybe it could help with my issue?
Thanks
Edit: One important note: my date ranges have gaps in them. You can see this in the picture below, in the column count(DISTINCT(TradeDate)). It would always be 4 if there were no gaps, but I do have gaps. When I sort the dates descending, I would like to always group 4 dates together, e.g. Group1: 2017-08-22 + 2017-08-21 + 2017-08-20 + 2017-08-19, Group2: 2017-08-18 + 2017-08-17 + 2017-08-15 + 2017-08-14, ...
Maybe I could map the descending dates onto a descending auto-increment integer, so that I would have a number without gaps: number1="2017-08-17", number2="2017-08-15", and so on.
Edit2:
Looking at the result of this query on my table, I may have duplicate entries for the same date. How can I reduce these duplicates to a single representative?
SELECT SUM(CondN1), count(id), count(DISTINCT(TradeDate)), min(TradeDate), max(TradeDate), min(TO_DAYS(DATE(TradeDate))), id
FROM marketstat
WHERE Stockplace like '%'
GROUP BY TO_DAYS(DATE(TradeDate)) DIV 4
ORDER BY TO_DAYS(DATE(TradeDate))
SUM() is an aggregate function, so you need to GROUP BY something. That something should change only every four days. Let's start by grouping by one day:
SELECT SUM(signal)
FROM tableName
GROUP BY date
date should really be of type DATE, like you mentioned, not DATETIME or anything else. You could use DATE(date) to convert other date types to dates. Now we need to group by four dates:
SELECT SUM(signal)
FROM tableName
GROUP BY TO_DAYS(date) DIV 4
Note that this creates arbitrary groups of four days. If you want control over where the groups start, you can add an offset like this:
SELECT SUM(signal)
FROM tableName
GROUP BY (TO_DAYS(date)+2) DIV 4
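To see where those bucket boundaries fall, here is a small Python sketch. It relies on `TO_DAYS(d)` being equal to Python's `d.toordinal() + 365`, which matches the documented value `TO_DAYS('2007-10-07') = 733321`; the function names are made up.

```python
from datetime import date

def to_days(d):
    """Python equivalent of MySQL TO_DAYS for dates in the Gregorian calendar."""
    return d.toordinal() + 365

def bucket(d, offset=0, size=4):
    """Bucket number, like (TO_DAYS(date) + offset) DIV size."""
    return (to_days(d) + offset) // size

aug = lambda day: date(2017, 8, day)

# Without an offset, 14..17 August share a bucket and the 18th starts the next.
plain = [bucket(aug(day)) for day in (14, 15, 16, 17, 18)]

# With +2, the boundary moves two days earlier: 16..19 August share a bucket.
shifted = [bucket(aug(day), offset=2) for day in (14, 15, 16, 17, 18, 19)]
```

These are fixed calendar windows. If a date such as 2017-08-16 has no rows, its bucket simply contains three dates instead of four.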
In the meantime, with KIKO's help, I have found the solution:
I create a temp table with
CREATE TEMPORARY TABLE if not EXISTS tradedatemaptmp (id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY) SELECT Tradedate AS Tradedate, CondN1, CondN2 FROM marketstat WHERE marketstat.Stockplace like 'US' GROUP BY TradeDate ORDER BY TradeDate asc;
and use the id created in the temp table instead of the original tradedate. That way, even when there are gaps in the tradedate range, the id in the temp table has no gaps. With this I can DIV 4 and always get the corresponding 4 dates together.
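The same idea in a short Python sketch, with made-up rows that include a gap (no 2017-08-16) and a duplicate date. Two choices here are mine rather than part of the solution above: positions start at 0, so the first group really has four dates (with an id starting at 1 you would use (id - 1) DIV 4), and duplicate dates are summed instead of being reduced to one representative row.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (TradeDate, signal) rows.
rows = [
    (date(2017, 8, 14), 1),
    (date(2017, 8, 15), 2),
    (date(2017, 8, 17), 3),
    (date(2017, 8, 18), 4),
    (date(2017, 8, 18), 10),  # duplicate entry for the same date
    (date(2017, 8, 19), 5),
    (date(2017, 8, 20), 6),
    (date(2017, 8, 21), 7),
    (date(2017, 8, 22), 8),
]
GROUP_SIZE = 4

# Gap-free numbering of the distinct dates, like the temp table's id.
distinct_dates = sorted({d for d, _ in rows})
position = {d: i for i, d in enumerate(distinct_dates)}

# Sum the signal per group of GROUP_SIZE consecutive existing dates.
sums = defaultdict(int)
for d, signal in rows:
    sums[position[d] // GROUP_SIZE] += signal
```

Group 0 covers 14, 15, 17 and 18 August and group 1 covers 19 to 22 August, so every group holds four existing dates regardless of the gap.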

MySQL Date in where clause

I have a table which contains date (Field Type: Date and Date Format: %Y-%m-%d) as a field. I need to select all the rows from the table for all the years whose date is not between Dec 3rd and Dec 24th.
The table also contains month and day as separate fields.
The result can be obtained by using the following query:
select * from mytable where date not in (select date from mytable where month=12 and day between 3 and 24);
I'm trying to get the result with a single query like the one below, but it returned no rows:
select * from mytable where date not between '%Y-12-03' and '%Y-12-24';
Can it be done in a single query like the above one?
SELECT *
FROM mytable
WHERE MONTH(`date`) <> 12
OR DAY(`date`) NOT BETWEEN 3 AND 24
;
This will give you every row that meets the requirements. I'm sure someone has a faster way of doing this, since this will ignore all indexes and will likely be slow on a large dataset, but it does work and return the data you require, so if no-one can suggest an improvement this will answer your question.
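If the OR in that WHERE clause looks suspicious, here is the same predicate as a quick Python check (the function name is made up). A row is kept when it is not in December at all, or when it is in December but outside days 3 to 24.

```python
from datetime import date

def outside_dec_window(d):
    """Mirror of: MONTH(date) <> 12 OR DAY(date) NOT BETWEEN 3 AND 24."""
    return d.month != 12 or not (3 <= d.day <= 24)

samples = [
    date(2011, 6, 10),   # kept: not December
    date(2011, 12, 2),   # kept: December, before the window
    date(2011, 12, 3),   # dropped: first day of the window
    date(2011, 12, 24),  # dropped: last day of the window
    date(2012, 12, 25),  # kept: December, after the window
]
kept = [d for d in samples if outside_dec_window(d)]
```

Because the year never enters the test, it applies to every year in the table at once.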

Counting all rows with specific columns and grouping by week

I've been trying for some time to create a query that counts all rows per day from a table where a column has a certain id, and then groups them into weekly values based on the UNIX timestamp column. I have a medium-sized dataset with 37 million rows, and have been trying to run the following kind of query:
SELECT DATE(timestamp), COUNT(*) FROM `table` WHERE ( date(timestamp)
between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X )
group by week(date(startdate))
However, I'm getting weird results: the query doesn't group the counts correctly and shows values that are too large in the resulting count column (I verified the errors by querying very small, specific datasets).
If I group by date(startdate) instead, the row counts match on a per-day basis, but I'd like to combine these daily counts into weekly counts. How can this be done? The data is needed in this format:
2006-01-01 | 5
2006-01-08 | 10
so that the day timestamp is the first column and second is the amount of rows per week.
Your query is non-deterministic, so it is not surprising that you are getting unexpected results. By this I mean you could run this query on the same data 5 times and get 5 different result sets. This is because you are selecting DATE(timestamp) but grouping by WEEK(DATE(startdate)); the query therefore returns the timestamp of the first row it comes across per startdate week, in ANY order.
Consider the following 2 rows (with timestamp in date format for ease of reading):
TimeStamp StartDate
20120601 20120601
20120701 20120601
Your query is grouping by WEEK(StartDate) which is 23, since both rows evaluate to the same value you would expect your results to have 1 row with a count of 2.
HOWEVER, DATE(Timestamp) is also in the select list, and since there is no ORDER BY clause the query has no idea which Timestamp to return, '20120601' or '20120701'. So even on this small result set you have a 50:50 chance of getting:
TimeStamp COUNT
20120601 2
and a 50:50 chance of getting
TimeStamp COUNT
20120701 2
If you add more data to the dataset, like so:
TimeStamp StartDate
20120601 20120601
20120701 20120601
20120701 20120701
You could get
TimeStamp COUNT
20120601 2
20120701 1
or
TimeStamp COUNT
20120701 2
20120701 1
You can see how with 37,000,000 rows you will soon get results that you do not expect and cannot predict!
EDIT
Since it looks like you are trying to get the week start in your results while grouping by week, you could use the following to get the week start (replacing CURRENT_TIMESTAMP with whichever column you want):
SELECT DATE_ADD(CURRENT_TIMESTAMP, INTERVAL 1 - DAYOFWEEK(CURRENT_TIMESTAMP) DAY) AS WeekStart
You can then group by this date too to get weekly results and avoid the trouble of having things in your select list that aren't in your group by.
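As a sanity check on that expression, here is a Python sketch of the same calculation with a few made-up dates. `mysql_dayofweek` reproduces MySQL's DAYOFWEEK numbering (1 = Sunday ... 7 = Saturday), and `week_start` mirrors `DATE_ADD(d, INTERVAL 1 - DAYOFWEEK(d) DAY)`; both names are mine.

```python
from collections import Counter
from datetime import date, timedelta

def mysql_dayofweek(d):
    """DAYOFWEEK numbering: 1 = Sunday, 2 = Monday, ..., 7 = Saturday."""
    return (d.weekday() + 1) % 7 + 1

def week_start(d):
    """The Sunday on or before d, like DATE_ADD(d, INTERVAL 1 - DAYOFWEEK(d) DAY)."""
    return d + timedelta(days=1 - mysql_dayofweek(d))

# Hypothetical row dates.
days = [
    date(2006, 1, 1),   # Sunday
    date(2006, 1, 3),
    date(2006, 1, 7),   # Saturday, still the week of 1 January
    date(2006, 1, 8),   # Sunday, starts the next week
    date(2006, 1, 14),
]

# Group and select by the same expression, so the result is deterministic.
weekly = Counter(week_start(d) for d in days)
```

Grouping on the week start also keeps weeks from different years apart, which grouping on the bare week number does not.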
Try this
SELECT DATE(timestamp), COUNT(week(date(startdate))) FROM `table` WHERE ( date(timestamp)
between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X )
group by week(date(startdate))