mysql / matlab: optimize query - removing dates from a list - mysql

I have a table with ~3M rows. The rows are date, time, msec, and some other columns with int data. Some unknown fraction of these rows are considered 'invalid' based on their existence in a separate table outages (based on date ranges).
Currently the query does a select * and then uses a huge WHERE to remove the invalid date ranges ( lots of 'and not ( RecordDate >'2008-08-05' and RecordDate < '2008-08-10' )') and so on. This blows away any chance of using an index.
Im looking for a better way to limit the results. As it stands now the query takes several minutes to run.

DELETE b FROM bigtable b
INNER JOIN outages o ON (b.`date` BETWEEN o.datestart AND o.dateend)
WHERE (1=1) //In some modes MySQL demands a `where` clause or it will not run.
Make sure you have an index on all fields involved in the query.

Related

Make mysql query faster

I have this query which takes around 29 second to perform and need to make it faster. I have created index on aggregate_date column and still no real improvement. Each aggregate_date has almost 26k rows within the table.
One more thing the query will run starting from 1/1/2018 till yesterday date
select MAX(os.aggregate_date) as lastMonthDay,
os.totalYTD
from (
SELECT aggregate_date,
Sum(YTD) AS totalYTD
FROM tbl_aggregated_tables
WHERE subscription_type = 'Subcription Income'
GROUP BY aggregate_date
) as os
GROUP by MONTH(os.aggregate_date), YEAR(os.aggregate_date);
I used Explain Select ... and received the following
update
The most of query time is consumed by the inner Query, so as scaisEdge suggested bellow i have tested the query and the time is reduced to almost 8s.
the Inner Query will look like:
select agt.aggregate_date,SUM(YTD)
from tbl_aggregated_tables as agt
FORCE INDEX(idx_aggregatedate_subtype_YTD)
WHERE agt.subscription_type = 'Subcription Income'
GROUP by agt.aggregate_date
I have noticed that this comparison "WHERE agt.subscription_type = 'Subcription Income'" takes the most of time. So is there any way to change that and to be mentioned the column of subscription_type only have 2 values which is 'Subcription Income' and 'Subcription Unit'
The index on aggregate_date column is not useful for performance because in not involved in where condition
looking to your code an useful index should be on column subscription_type
you could try using a redundant index adding also the column involved in select clause (for try to obtain all the data in query from index avoiding access to table)so you index could be
create idx1 on tbl_aggregated_tables (subscription_type, aggregate_date, YTD )
the meaning of the last group by seems not coherent with the select clause

Fastest way to do MySQL query select with WHERE clauses on an unindexed table

I have a very big unindexed table called table with rows like this:
IP entrypoint timestamp
171.128.123.179 /page-title/?kw=abc 2016-04-14 11:59:52
170.45.121.111 /another-page/?kw=123 2016-04-12 04:13:20
169.70.121.101 /a-third-page/ 2016-05-12 09:43:30
I want to make the fastest query that, given 30 IPs and one date, will search rows as far back a week before that date and return the most recent row that contains "?kw=" for each IP. So I want DISTINCT entrypoints but only the most recent one.
I'm stuck by this I know it's a relatively simple INNER JOIN but I don't know the fastest way to do it.
By the way: I can't add the index right now because it's very big and on a db that serves a website. I'm going to replace it with an indexed table don't worry.
Rows from the table
SELECT ...
FROM very_big_unindexed_table t
only within the past week...
WHERE t.timestamp >= NOW() + INTERVAL - 1 WEEK
that contains '?kw=' in the entry point
AND t.entrypoint LIKE '%?kw=%'
only the latest row for each IP. There's a couple of approaches to that. A correlated subquery on a very big unindexed table is going to eat your lunch and your lunch box. And without an index, there's no getting around a full scan of the table and a "Using filesort" operation.
Given the unfortunate circumstances, our best bet for performance is likely going to be getting the set whittled down as small as we can, and then perform the sort, and avoid any join operations (back to that table) and avoid correlated subqueries.
So, let's start with something like this, to return all of the rows from the past week with '?kw=' in entry point. This is going to be full scan of the table, and a sort operation...
SELECT t.ip
, t.timestamp
, t.entry_point
FROM very_big_unindexed_table t
WHERE t.timestamp >= NOW() + INTERVAL -1 WEEK
AND t.entrypoint LIKE '%?kw=%'
ORDER BY t.ip DESC, t.timestamp DESC
We can use an unsupported trick with user-defined variables. (The MySQL Reference Manual specifically warns against using a pattern like this, because the behavior is (officially) undefined. Unofficially, the optimizer in MySQL 5.1 and 5.5 (at least) is very predictable.
I think this is going to be about as good as you are going to get, if the number of rows from the past week are significant subset of the entire table. This is going to create a sizable intermediate resultset (derived table), if there are lot of rows that satisfy the predicates.
SELECT q.ip
, q.entrypoint
, q.timestamp
FROM (
SELECT IF(t.ip = #prev_ip, 0, 1) AS new_ip
, #prev_ip := t.ip AS ip
, t.timestamp AS timestamp
, t.entrypoint AS entrypoint
FROM (SELECT #prev_ip := NULL) i
CROSS
JOIN very_big_unindexed_table t
WHERE t.timestamp >= NOW() + INTERVAL -1 WEEK
AND t.entrypoint LIKE '%?kw=%'
ORDER BY t.ip DESC, t.timestamp DESC
) q
WHERE q.new_ip
Execution of that query will require (in terms of what's going to take the time)
a full scan of the table (there's no way to get around that)
a sort operation (again, there's no way around that)
materializing a derived table containing all of the rows that satisfy the predicates
a pass through the derived table to pull out the "latest" row for each IP

What SQL indexes to put for big table

I have two big tables from which I mostly select but complex queries with 2 joins are extremely slow.
First table is GameHistory in which I store records for every finished game (I have 15 games in separate table).
Fields: id, date_end, game_id, ..
Second table is GameHistoryParticipants in which I store records for every player participated in certain game.
Fields: player_id, history_id, is_winner
Query to get top players today is very slow (20+ seconds).
Query:
SELECT p.nickname, count(ghp.player_id) as num_games_today
FROM `GameHistory` as gh
INNER JOIN GameHistoryParticipants as ghp ON gh.id=ghp.history_id
INNER JOIN Players as p ON p.id=ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW())=0 AND gh.game_id='scrabble'
GROUP BY ghp.player_id ORDER BY count(ghp.player_id) DESC LIMIT 10
First table has 1.5 million records and the second one 3.5 million.
What indexes should I put ? (I tried some and it was all slow)
You are only interested in today's records. However, you search the whole GameHistory table with TIMESTAMPDIFF to detect those records. Even if you have an index on that column, it cannot be used, due to the fact that you use a function on the field.
You should have an index on both fields game_id and date_end. Then ask for the date_end value directly:
WHERE gh.date_end >= DATE(NOW())
AND gh.date_end < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
AND gh.game_id = 'scrabble'
It would even be better to have an index on date_end's date part rather then on the whole time carrying date_end. This is not possible in MySQL however. So consider adding another column trunc_date_end for the date part alone which you'd fill with a before-insert trigger. Then you'd have an index on trunc_date_end and game_id, which should help you find the desired records in no time.
WHERE gh.trunc_date_end = DATE(NOW())
AND gh.game_id = 'scrabble'
add 'EXPLAIN' command at the beginning of your query then run it in a database viewer(ex: sqlyog) and you will see the details about the query, look for the 'rows' column and you will see different integer values. Now, index the table columns indicated in the EXPLAIN command result that contain large rows.
-i think my explanation is kinda messy, you can ask for clarification

Maximize efficiency of SQL SELECT statement

Assum that we have a vary large table. for example - 3000 rows of data.
And we need to select all the rows that thire field status < 4.
We know that the relevance rows will be maximum from 2 months ago (of curse that each row has a date column).
does this query is the most efficient ??
SELECT * FROM database.tableName WHERE status<4
AND DATE< '".date()-5259486."' ;
(date() - php , 5259486 - two months.)...
Assuming you're storing dates as DATETIME, you could try this:
SELECT * FROM database.tableName
WHERE status < 4
AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTHS)
Also, for optimizing search queries you could use EXPLAIN ( http://dev.mysql.com/doc/refman/5.6/en/explain.html ) like this:
EXPLAIN [your SELECT statement]
Another point where you can tweak response times is by carefully placing appropriate indexes.
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data.
Here are some explanations & tutorials on MySQL indexes:
http://www.tutorialspoint.com/mysql/mysql-indexes.htm
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
However, keep in mind that using TIMESTAMP instead of DATETIME is more efficient; the former is 4 bytes; the latter is 8. They hold equivalent information (except for timezone issues).
3,000 rows of data is not large for a database. In fact, it is on the small side.
The query:
SELECT *
FROM database.tableName
WHERE status < 4 ;
Should run pretty quickly on 3,000 rows, unless each row is very, very large (say 10k). You can always put an index on status to make it run faster.
The query suggested by cassi.iup makes more sense:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTHS);
It will perform better with a composite index on status, date. My question is: do you want all rows with a status of 4 or do you want all rows with a status of 4 in the past two months? In the first case, you would have to continually change the query. You would be better off with:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < date('2013-06-19');
(as of the date when I am writing this.)

Why does a MySQL query take anywhere from 1 millisecond to 7 seconds?

I have an SQL query(see below) that returns exactly what I need but when ran through phpMyAdmin takes anywhere from 0.0009 seconds to 0.1149 seconds and occasionally all the way up to 7.4983 seconds.
Query:
SELECT
e.id,
e.title,
e.special_flag,
CASE WHEN a.date >= '2013-03-29' THEN a.date ELSE '9999-99-99' END as date
CASE WHEN a.date >= '2013-03-29' THEN a.time ELSE '99-99-99' END as time,
cat.lastname,
FROM e_table as e
LEFT JOIN a_table as a ON (a.e_id=e.id)
LEFT JOIN c_table as c ON (e.c_id=c.id)
LEFT JOIN cat_table as cat ON (cat.id=e.cat_id)
LEFT JOIN m_table as m ON (cat.name=m.name AND cat.lastname=m.lastname)
JOIN (
SELECT DISTINCT innere.id
FROM e_table as innere
LEFT JOIN a_table as innera ON (innera.e_id=innere.id AND
innera.date >= '2013-03-29')
LEFT JOIN c_table as innerc ON (innere.c_id=innerc.id)
WHERE (
(
innera.date >= '2013-03-29' AND
innera.flag_two=1
) OR
innere.special_flag=1
) AND
innere.flag_three=1 AND
innere.flag_four=1
ORDER BY COALESCE(innera.date, '9999-99-99') ASC,
innera.time ASC,
innere.id DESC LIMIT 0, 10
) AS elist ON (e.id=elist.id)
WHERE (a.flag_two=1 OR e.special_flag) AND e.flag_three=1 AND e.flag_four=1
ORDER BY a.date ASC, a.time ASC, e.id DESC
Explain Plan:
The question is:
Which part of this query could be causing the wide range of difference in performance?
To specifically answer your question: it's not a specific part of the query that's causing the wide range of performance. That's MySQL doing what it's supposed to do - being a Relational Database Management System (RDBMS), not just a dumb SQL wrapper around comma separated files.
When you execute a query, the following things happen:
The query is compiled to a 'parametrized' query, eliminating all variables down to the pure structural SQL.
The compilation cache is checked to find whether a recent usable execution plan is found for the query.
The query is compiled into an execution plan if needed (this is what the 'EXPLAIN' shows)
For each execution plan element, the memory caches are checked whether they contain fresh and usable data, otherwise the intermediate data is assembled from master table data.
The final result is assembled by putting all the intermediate data together.
What you are seeing is that when the query costs 0.0009 seconds, the cache was fresh enough to supply all data together, and when it peaks at 7.5 seconds either something was changed in the queried tables, or other queries 'pushed' the in-memory cache data out, or the DBMS has other reasons to suspect it needs to recompile the query or fetch all data again. Probably some of the other variations have to do with used indexes still being cached freshly enough in memory or not.
Concluding this, the query is ridiculously slow, you're just sometimes lucky that caching makes it appear fast.
To solve this, I'd recommend looking into 2 things:
First and foremost - a query this size should not have a single line in its execution plan reading "No possible keys". Research how indexes work, make sure you realize the impact of MySQL's limitation of using a single index per joined table, and tweak your database so that each line of the plan has an entry under 'key'.
Secondly, review the query in itself. DBMS's are at their fastest when all they have to do is combine raw data. Using programmatic elements like CASE and COALESCE are by all means often useful, but they do force the database to evaluate more things at runtime than just take raw table data. Try to eliminate such statements, or move them to the business logic as post-processing with the retrieved data.
Finally, never forget that MySQL is actually a rather stupid DBMS. It is optimized for performance in simple data fetching queries such as most websites require. As such it is much faster than SQL Server and Oracle for most generic problems. Once you start complicating things with functions, cases, huge join or matching conditions etc., the competitors are frequently much better optimized, and have better optimization in their query compilers. As such, when MySQL starts becoming slow in a specific query, consider splitting it up in 2 or more smaller queries just so it doesn't become confused, and do some postprocessing in PHP or whatever language you are calling with. I've seen many cases where this increased performance a LOT, just by not confusing MySQL, especially in cases where subqueries were involved (as in your case). Especially the fact that your subquery is a derived table, and not just a subquery, is known to complicate stuff for MySQL beyond what it can cope with.
Lets start that both your outer and inner query are working with the "e" table WITH a minimum requirement of flag_three = 1 AND flag_four = 1 (regardless of your inner query's (( x and y ) or z) condition. Also, your outer WHERE clause has explicit reference to the a.Flag_two, but no NULL which forces your LEFT JOIN to actually become an (INNER) JOIN. Also, it appears every "e" record MUST have a category as you are looking for the "cat.lastname" and no coalesce() if none found. This makes sense at it appears to be a "lookup" table reference. As for the "m_table" and "c_table", you are not getting or doing anything with it, so they can be removed completely.
Would the following query get you the same results?
select
e1.id,
e1.Title,
e1.Special_Flag,
e1.cat_id,
coalesce( a1.date, '9999-99-99' ) ADate,
coalesce( a1.time, '99-99-99' ) ATime
cat.LastName
from
e_table e1
LEFT JOIN a_table as a1
ON e1.id = a1.e_id
AND a1.flag_two = 1
AND a1.date >= '2013-03-29'
JOIN cat_table as cat
ON e1.cat_id = cat.id
where
e1.flag_three = 1
and e1.flag_four = 1
and ( e1.special_flag = 1
OR a1.id IS NOT NULL )
order by
IF( a1.id is null, 2, 1 ),
ADate,
ATime,
e1.ID Desc
limit
0, 10
The Main WHERE clause qualifies for ONLY those that have the "three and four" flags set to 1 PLUS EITHER the ( special flag exists OR there is a valid "a" record that is on/after the given date in question).
From that, simple order by and limit.
As for getting the date and time, it appears that you only want records on/after the date to be included, otherwise ignore them (such as they are old and not applicable, you don't want to see them).
The order by, I am testing FIRST for a NULL value for the "a" ID. If so, we know they will all be forced to a date of '9999-99-99' and time of '99-99-99' and want them pushed to the bottom (hence 2), otherwise, there IS an "a" record and you want those first (hence 1). Then, sort by the date/time respectively and then the ID descending in case many within the same date/time.
Finally, to help on the indexes, I would ensure your "e" table has an index on
( id, flag_three, flag_four, special_flag ).
For the "a" table, index on
(e_id, flag_two, date)