Slow update query despite index - mysql

I have a query updating a large table (4.1 million rows) and using an even larger origin table on which I do aggregation (63 million rows):
update
    table1 t1,
    (select
         user_id,
         count(distinct date(started_at)) as count_s
     from sand s
     where started_at >= (DATE(NOW()) - INTERVAL 7 DAY)
     group by user_id) t2
set m.distinct_days_1week = t2.count_s
where m.user_id = t2.user_id
Both tables are indexed on user_id.
sand table is also indexed on started_at
On other update queries I usually complete a full destination-table update in under 5 minutes, but I guess that since the origin table is large, this one takes much longer.
The subquery, if run alone, finishes in under 4 s (I didn't measure exactly, though).
Explain shows that indexes are used and that the where clause filters a large part of the big sand table.
id select_type table      type   possible_keys              key         key_len ref        rows     filtered Extra
1  PRIMARY     <derived2> ALL    NULL                       NULL        NULL    NULL       12786201 100.0    Using where
1  UPDATE      m          eq_ref PRIMARY,user_id_IDX        PRIMARY     8       t2.user_id 1        100.0    Using where
2  DERIVED     s          index  user_id_IDX,started_at_IDX user_id_IDX 9       NULL       63784993 20.05    Using where
What am I missing to optimize that query?

For sand, replace your single-column INDEX(started_at) and INDEX(user_id) with the following. Both are "composite" and "covering". (I don't know which order is better.)
INDEX(started_at, user_id)
INDEX(user_id, started_at)
DATE(NOW()) --> CURDATE() (a trivial simplification)
Are t1 and m the same??? Assuming so, it needs some kind of index starting with user_id.
Please provide SHOW CREATE TABLE for each table.
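Putting those suggestions together, the rewritten statement might look like this (a sketch only: it assumes t1 and m refer to the same table, and the index names are taken from the EXPLAIN output above):

ALTER TABLE sand
    DROP INDEX started_at_IDX,
    DROP INDEX user_id_IDX,
    ADD INDEX started_user_IDX (started_at, user_id);

update
    table1 m,
    (select user_id,
            count(distinct date(started_at)) as count_s
     from sand
     where started_at >= CURDATE() - INTERVAL 7 DAY
     group by user_id) t2
set m.distinct_days_1week = t2.count_s
where m.user_id = t2.user_id;

With (started_at, user_id), the derived table can be computed from a range scan of the index alone, without touching the base rows of sand.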

Related

SQL query optimization on MySQL

I have a SQL query that is taking too much time to execute: around 620 sec, i.e. over 10 minutes. How can I optimize it?
| 190543 | root | localhost | ischolar | Query | 620 | Copying to tmp table
SELECT a.article_id, count(a.article_id) AS views
FROM timed_views_log a
INNER JOIN published_articles pa
ON (a.article_id = pa.article_id)
WHERE
a.date BETWEEN date_format(curdate() - interval 1 month,'%Y-%m-01 00:00:00') AND
date_format(last_day(curdate()-interval 1 month),'%Y-%m-%d 23:59:59')
GROUP BY a.article_id
ORDER BY
views desc
LIMIT 6, 5;
You may try adding indices which target the join and where conditions:
CREATE INDEX idx1 ON timed_views_log (date, article_id);
CREATE INDEX idx2 ON published_articles (article_id);
The first index, if used, should speed up the WHERE clause by allowing MySQL to use only the index to satisfy your filters on the date. The second index should allow MySQL to do the lookup for the join faster.
If you are using SQL Server, you can use the SQL Server query execution plan and the optimizations it suggests.
reference article - https://www.sqlshack.com/using-the-sql-execution-plan-for-query-performance-tuning/
Your query is a join with a WHERE clause, so most likely the data in the tables themselves is large; try adding an index.

How to optimize better mysql query?

I have written sql query:
select `b`.`id` as `id`, `b`.`name`, count(a.institution_id) as total
from `premises` as `a`
left join `institutions` as `b` on `b`.`id` = `a`.`institution_id`
where exists (select id from rental_schedules as c where a.id = c.premises_id and public = 1 and public_enterprise = 0 and rental_schedule_status = 1 and date >= CURDATE())
group by `a`.`institution_id`
I have a very large table (over 1,000,000 rows) and this query takes 8-10 sec. Is there any way to optimize this query further?
Thanks in advance for your answers!
The join to the institutions table can somewhat benefit from the following index:
CREATE INDEX inst_idx (id, name);
This index will cover the join and the select clause on this table. The biggest improvement would come from the following index on the rental_schedules table:
CREATE INDEX rental_idx (premises_id, public, public_enterprise, rental_schedule_status, date);
This index would allow the EXISTS clause to be evaluated rapidly for each row joined from the first two tables.
Also, I would rewrite your query to make it ANSI compliant, with the column in the GROUP BY clause matching the SELECT clause:
SELECT
b.id AS id,
b.name, -- allowed, assuming that id be the primary key column of institutions
COUNT(a.institution_id) AS total
FROM premises AS a
LEFT JOIN institutions AS b ON b.id = a.institution_id
WHERE EXISTS (SELECT 1 FROM rental_schedules AS c
WHERE a.id = c.premises_id AND public = 1 AND
public_enterprise = 0 AND rental_schedule_status = 1 AND
date >= CURDATE())
GROUP BY
b.id;
Try to keep the subquery processing in memory as much as possible. When memory runs short, an on-disk temporary table is created and a lot of time is wasted.
As the MySQL documentation describes:
The optimizer uses materialization to enable more efficient subquery processing. Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory. The first time MySQL needs the subquery result, it materializes that result into a temporary table. Any subsequent time the result is needed, MySQL refers again to the temporary table. The optimizer may index the table with a hash index to make lookups fast and inexpensive. The index contains unique values to eliminate duplicates and make the table smaller.
Subquery materialization uses an in-memory temporary table when possible, falling back to on-disk storage if the table becomes too large.
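If the materialized result is spilling to disk, one knob to try is raising the session limits on in-memory temporary tables (the values below are illustrative, not recommendations):

SET SESSION tmp_table_size       = 256 * 1024 * 1024;
SET SESSION max_heap_table_size  = 256 * 1024 * 1024;

MySQL uses the smaller of the two as the effective in-memory limit, so the two variables are usually raised together.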

Will adding an index to a column improve the select query (without where) performance in SQL?

I have a MySQL table that contains 20,000,000 rows, with columns such as user_id, registered_timestamp, etc. I wrote the query below to get a count of users registered per day. The query was taking a long time to execute. Will adding an index to the registered_timestamp column improve the execution time?
select date(registered_timestamp), count(userid) from table group by 1
Consider using this query to get a list of dates and the number of registrations on each date.
SELECT date(registered_timestamp) date, COUNT(*)
FROM table
GROUP BY date(registered_timestamp)
Then an index on table(registered_timestamp) will help a little because it's a covering index.
If you adapt your query to return dates from a limited range, for example.
SELECT date(registered_timestamp) date, COUNT(*)
FROM table
WHERE registered_timestamp >= CURDATE() - INTERVAL 8 DAY
AND registered_timestamp < CURDATE()
GROUP BY date(registered_timestamp)
the index will help. (This query returns results for the week ending yesterday.) However, the index will not help this query.
SELECT date(registered_timestamp) date, COUNT(*)
FROM table
WHERE DATE(registered_timestamp) >= CURDATE() - INTERVAL 8 DAY /* slow! */
GROUP BY date(registered_timestamp)
because the function applied to the column makes the predicate non-sargable.
You can probably address this performance issue with a MySQL generated column. This command:
ALTER TABLE `table`
ADD registered_date DATE
GENERATED ALWAYS AS (DATE(registered_timestamp))
STORED;
Then you can add an index on the generated column
CREATE INDEX regdate ON `table` ( registered_date );
Then you can use that generated (derived) column in your query, and get a lot of help from that index.
SELECT registered_date, COUNT(*)
FROM table
GROUP BY registered_date;
But beware, creating the generated column and its index will take a while.
select date(registered_timestamp), count(userid) from table group by 1
Would benefit from INDEX(registered_timestamp, userid) but only because such an index is "covering". The query will still need to read every row of the index, and do a filesort.
If userid is the PRIMARY KEY, then this would give you the same answers without bothering to check each userid for being NOT NULL.
select date(registered_timestamp), count(*) from table group by 1
And INDEX(registered_timestamp) would be equivalent to the above suggestion. (This is because InnoDB implicitly tacks on the PK.)
If this query is common, then you could build and maintain a "summary table", which collects the count every night for the day's registrations. Then the query would be a much faster fetch from that smaller table.
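A minimal sketch of such a summary table (table and column names here are illustrative):

CREATE TABLE registrations_daily (
    reg_date DATE NOT NULL PRIMARY KEY,
    cnt      INT UNSIGNED NOT NULL
);

-- Nightly job: roll up yesterday's registrations.
INSERT INTO registrations_daily (reg_date, cnt)
SELECT DATE(registered_timestamp), COUNT(*)
  FROM `table`
 WHERE registered_timestamp >= CURDATE() - INTERVAL 1 DAY
   AND registered_timestamp <  CURDATE()
 GROUP BY DATE(registered_timestamp);

The day-wise report then becomes a plain SELECT against the much smaller registrations_daily.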

MySQL performance issue on a simple two tables joined Query

I'm facing a performance issue on MySQL and I'm unable to understand where I'm wrong. The machine runs MySQL Server 5.7.15 with two 64-bit Xeon processors and 8 GB of RAM.
I've got two tables:
Table data_raw contains several fields (see VRMS0,VRMS1,VRMS2,PWRA0,PWRA1,PWRA2)
describing the voltages and active powers acquired from complicated instrumentation every 30 seconds from several probes on the field, each probe is uniquely identified by its DEVICE_ID.
Table data_timeslot contains few fields and is used to keep trace of when the single data_raw record was sent (see SRV_TIMESTAMP field)
and from which device (see DEVICE_ID field).
Each table contains about 7,800,000 records.
The two tables are joined using the PK on ID (auto-increment) of data_timeslots and the PK on TIMESLOT_ID (auto-increment) of data_raw.
Here is the query:
SELECT D.VRMS0,D.VRMS1,D.VRMS2,D.PWRA0,D.PWRA1,D.PWRA2,T.DEVICE_ID, T.SRV_TIMESTAMP
FROM data_raw AS D FORCE INDEX(PRIMARY)
INNER JOIN data_timeslots AS T ON T.ID=D.TIMESLOT_ID
WHERE T.DEVICE_ID='CEC02'
ORDER BY T.ID DESC LIMIT 1
The query always takes 10 seconds, while the same query on a single table takes a few milliseconds.
In other words, the query
SELECT * FROM `data_raw` ORDER BY TIMESLOT_ID DESC LIMIT 1
takes just 0.0071 sec, and the query
SELECT * FROM `data_timeslots` ORDER BY ID DESC LIMIT 1
takes just 0.0042 sec, so I'm wondering why the join takes so long.
Where is the bottleneck?
P.S. EXPLAIN shows that the DB is properly using the PK for the operation.
Below is the EXPLAIN output:
EXPLAIN SELECT D.VRMS0,D.VRMS1,D.VRMS2,D.PWRA0,D.PWRA1,D.PWRA2,T.DEVICE_ID, T.SRV_TIMESTAMP FROM data_raw AS D INNER JOIN data_timeslots AS T ON T.ID=D.TIMESLOT_ID WHERE T.DEVICE_ID='XXXXX' ORDER BY T.ID ASC LIMIT 1
id select_type table type   possible_keys                  key     key_len ref                rows filtered Extra
1  SIMPLE      T     index  PRIMARY,PK_CLUSTER_T,DEVICE_ID PRIMARY 8       NULL               30   3.23     Using where
1  SIMPLE      D     eq_ref PRIMARY                        PRIMARY 8       splc_smartpwr.T.ID 1    100.00   NULL
UPDATE (suggested by #Alberto_Delgado_Roda): if I use ASC LIMIT 1, the query takes just 0.0261 sec.
Reply to "why":
data_timeslots has a clustered index that suits the ascending order.
How the Clustered Index Speeds Up Queries
Accessing a row through the clustered index is fast because the index search leads directly to the page with all the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)
See https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html
Try this:
1: What happens if you replace INNER JOIN with STRAIGHT_JOIN?
SELECT D.VRMS0,D.VRMS1,D.VRMS2,D.PWRA0,D.PWRA1,D.PWRA2,T.DEVICE_ID, T.SRV_TIMESTAMP
FROM data_raw AS D FORCE INDEX(PRIMARY)
STRAIGHT_JOIN data_timeslots AS T ON T.ID=D.TIMESLOT_ID
WHERE T.DEVICE_ID='CEC02'
ORDER BY T.ID DESC LIMIT 1
2: What happens if you replace DESC LIMIT 1 with ASC LIMIT 1?
I just figured out that the query:
SELECT T.ID,T.DEVICE_ID, T.SRV_TIMESTAMP, D.VRMS0,D.VRMS1,D.VRMS2,D.PWRA0,D.PWRA1,D.PWRA2 FROM data_timeslots as T INNER JOIN data_raw AS D ON D.TIMESLOT_ID=T.ID ORDER BY T.ID DESC LIMIT 1
runs in just 0.0174 sec, as expected. I just reversed the order in the SELECT statement and the result changed dramatically. The question now is: why?

Large SQL database - solving efficiency

I have the following SQL query which, when I originally coded it, was exceptionally fast; it now takes over 1 second to complete:
SELECT counted/scount as ratio, [etc]
FROM
playlists
LEFT JOIN (
select AID, PLID FROM (SELECT AID, PLID FROM p_s ORDER BY `order` asc, PLSID desc)as g GROUP BY PLID
) as t USING(PLID)
INNER JOIN (
SELECT PLID, count(PLID) as scount from p_s LEFT JOIN audio USING(AID) WHERE removed='0' and verified='1' GROUP BY PLID
) as g USING(PLID)
LEFT JOIN (
select AID, count(AID) as counted FROM a_p_all WHERE ".time()." - playtime < 2678400 GROUP BY AID
) as r USING(AID)
LEFT JOIN audio USING (AID)
LEFT JOIN members USING (UID)
WHERE scount > 4 ORDER BY ratio desc
LIMIT 0, 20
I have identified the problem: the a_p_all table has over 500k rows. This is slowing down the query. I have come up with a solution:
Create a smaller temporary table, that only stores the data necessary, and deletes anything older than is needed.
However, is there a better method to use? Optimally I wouldn't need a temporary table; what do sites such as YouTube/Facebook do for large tables to keep query times fast?
edit
This is the EXPLAIN table for the query in the answer from #spencer7593
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived3> ALL NULL NULL NULL NULL 20
1 PRIMARY u eq_ref PRIMARY PRIMARY 8 q.AID 1 Using index
1 PRIMARY m eq_ref PRIMARY PRIMARY 8 q.UID 1 Using index
3 DERIVED <derived6> ALL NULL NULL NULL NULL 20
6 DERIVED t ALL NULL NULL NULL NULL 21
5 DEPENDENT SUBQUERY s ALL NULL NULL NULL NULL 49 Using where; Using filesort
4 DEPENDENT SUBQUERY c ALL NULL NULL NULL NULL 49 Using where
4 DEPENDENT SUBQUERY o eq_ref PRIMARY PRIMARY 8 database.c.AID 1 Using where
2 DEPENDENT SUBQUERY a ALL NULL NULL NULL NULL 510594 Using where
Two "big rock" issues stand out to me.
Firstly, this predicate
WHERE ".time()." - playtime < 2678400
(I'm assuming that this isn't the actual SQL being submitted to the database, but that what's being sent is something like this...
WHERE 1409192073 - playtime < 2678400
such that we want only rows where playtime is within the past 31 days, i.e. within 31*24*60*60 seconds of the integer value returned by time().)
This predicate can't make use of a range scan operation on a suitable index on playtime. MySQL evaluates the expression on the left side for every row in the table (every row that isn't excluded by some other predicate), and the result of that expression is compared to the literal on the right.
To improve performance, rewrite the predicate so that the comparison is made on the bare column. Compare the value stored in the playtime column to an expression that needs to be evaluated only once, for example:
WHERE playtime > 1409192073 - 2678400
With a suitable index available, MySQL can perform a "range" scan operation, and efficiently eliminate a boatload of rows that don't need to be evaluated.
The second "big rock" is the inline views, or "derived tables" in MySQL parlance. MySQL is much different from other databases in how inline views are processed: MySQL actually runs the innermost query, stores the result set as a temporary MyISAM table, and then runs the outer query against that MyISAM table. (The name MySQL uses, "derived table", makes sense once we understand how MySQL processes the inline view.)
Also, MySQL does not "push" predicates down from an outer query into the view queries, and no indexes are created on the derived table. (I believe MySQL 5.7 is changing that, and does sometimes create indexes to improve performance.) Large "derived tables" can therefore have a significant performance impact.
Also, the LIMIT clause gets applied last in the statement processing; that's after all the rows in the resultset are prepared and sorted. Even if you are returning only 20 rows, MySQL still prepares the entire resultset; it just doesn't transfer them to the client.
Lots of the column references are not qualified with the table name or alias, so we don't know, for example, which table (p_s or audio) contains the removed and verified columns.
(We know it can't be both, since MySQL isn't throwing an "ambiguous column" error. But MySQL has access to the table definitions, where we don't. MySQL also knows something about the cardinality of the columns: in particular, which columns (or combinations of columns) are UNIQUE, which columns can contain NULL values, etc.)
Best practice is to qualify ALL column references with the table name or (preferably) a table alias. (This makes it much easier on the human reading the SQL, and it also keeps a query from breaking when a new column is added to a table.)
Also, the query has a LIMIT clause but no ORDER BY clause (or implied ORDER BY), which makes the resultset indeterminate: there is no guarantee which rows will be returned "first".
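As a small illustration (column names assumed from the query above), pinning LIMIT to an explicit ORDER BY makes the page of rows deterministic:

-- Without ORDER BY, any 20 rows may come back; with it, the same 20 every time.
SELECT p.PLID, p.UID
FROM playlists p
ORDER BY p.PLID DESC
LIMIT 0, 20;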
EDIT
To return only 20 rows from playlists (out of thousands or more), I might try using correlated subqueries in the SELECT list; using a LIMIT clause in an inline view to winnow down the number of rows that I'd need to run the subqueries for. Correlated subqueries can eat your lunch (and your lunchbox too) in terms of performance with large sets, due to the number of times those need to be run.
From what I can gather, you are attempting to return 20 rows from playlists, picking up the related row from member (by the foreign key in playlists), finding the "first" song in the playlist; getting a count of times that "song" has been played in the past 31 days (from any playlist); getting the number of times a song appears on that playlist (as long as it's been verified and hasn't been removed... the outerness of that LEFT JOIN is negated by the predicates on the removed and verified columns, if either of those columns is from the audio table...).
I'd take a shot with something like this, to compare performance:
SELECT q.*
, ( SELECT COUNT(1)
FROM a_p_all a
WHERE a.playtime > 1409192073 - 2678400
AND a.AID = q.AID
) AS counted
FROM ( SELECT p.PLID
, p.UID
, p.[etc]
, ( SELECT COUNT(1)
FROM p_s c
JOIN audio o
ON o.AID = c.AID
AND o.removed='0'
AND o.verified='1'
WHERE c.PLID = p.PLID
) AS scount
, ( SELECT s.AID
FROM p_s s
WHERE s.PLID = p.PLID
ORDER BY s.order ASC, s.PLSID DESC
LIMIT 1
) AS AID
FROM ( SELECT t.PLID
, t.[etc]
FROM playlists t
ORDER BY NULL
LIMIT 20
) p
) q
LEFT JOIN audio u ON u.AID = q.AID
LEFT JOIN members m ON m.UID = q.UID
LIMIT 0, 20
UPDATE
Dude, the EXPLAIN output is showing that you don't have suitable indexes available. To get any decent chance at performance with the correlated subqueries, you're going to want to add some indexes, e.g.
... ON a_p_all (AID, playtime)
... ON p_s (PLID, `order`, PLSID, AID)