MySQL: long running LEFT JOIN query performance

A MySQL database contains two tables: customer and customer_orders.
The customer table contains 80 million rows and 80 columns. The ones I am interested in:
Id (PK, int(10))
Location (varchar(255), nullable)
Registration_Date (DateTime, nullable, indexed)
The customer_orders table contains 40 million rows and only 3 columns:
Id (PK, int(10))
Customer_Id (int(10), FK to customer table)
Order_Date (DateTime, nullable)
When I run the following query, it takes ~800 seconds to execute and returns 40 million rows:
SELECT o.*
FROM customer_orders o
LEFT JOIN customer c ON (c.Id = o.Customer_Id)
WHERE NOT (ISNULL(c.Location)) AND c.Registration_Date < '2018-01-01 00:00:00';
The machine running the MySQL server has 32GB of RAM, 28GB of which is assigned to MySQL.
MySQL version: 5.6.39.
Is it normal for MySQL to take this long to execute such a query on tables of this size?
How can I improve the performance?
Update:
The customer_orders table does not contain any vital data we need to keep. It is a copy table holding the orders made within the last 10 days.
Every day we run a stored procedure that deletes orders older than 10 days within a transaction.
At some point this stored procedure started timing out because of an unoptimized query, and the number of orders grew every day.
The previous query also contained a COUNT, which, I suppose, is what exceeded the timeout.
Nevertheless, it surprised me that it can take up to 15 minutes for MySQL to fetch 40 million records with additional conditions.

I think it's normal. It would be helpful if you shared what EXPLAIN returns for that query.
To optimize the query, it might not be a good idea to start with customer_orders, as you are not filtering it in any way (so it performs a full table scan over 40M records). Also, as pointed out in the comments, a LEFT JOIN is not needed here.
I would write your query like this:
SELECT o.*
FROM customer c, customer_orders o
WHERE c.Id = o.Customer_Id
AND c.Location IS NOT NULL
AND c.Registration_Date < '2018-01-01';
This will (depending on how many records satisfy Registration_Date < '2018-01-01') filter the customer table first and then join with the customer_orders table, which has an index on Customer_Id.
Also, maybe unrelated, but is it expected that the query returns 40M records? That is essentially the whole customer_orders table. If I am right, it means that all orders are from customers registered before '2018-01-01'.

This is too long for a comment...
The first thing to note about your query is that it is not actually performing a LEFT JOIN, since it has conditions in the WHERE clause that refer to the LEFT JOINed table.
It could be rewritten as:
SELECT o.*
FROM customer_orders o
INNER JOIN customer c
ON c.Id = o.Customer_Id
AND c.Location is NOT NULL
AND c.Registration_Date < '2018-01-01 00:00:00';
Being explicit about the join type is better for readability and may help MySQL to find a better execution path for the query.
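To see why the WHERE clause cancels the LEFT JOIN, here is a minimal sketch on toy data. The tables demo_customer and demo_orders are hypothetical miniatures invented for this illustration (they are not the real tables), holding only the columns the query touches.

```sql
-- Hypothetical miniature tables, named differently from the real ones on purpose.
CREATE TABLE demo_customer (
  Id INT PRIMARY KEY,
  Location VARCHAR(255) NULL,
  Registration_Date DATETIME NULL
);
CREATE TABLE demo_orders (
  Id INT PRIMARY KEY,
  Customer_Id INT NULL,
  Order_Date DATETIME NULL
);

INSERT INTO demo_customer VALUES
  (1, 'Berlin', '2017-05-01 00:00:00'),  -- passes both filters
  (2, NULL,     '2017-06-01 00:00:00'),  -- fails: Location IS NULL
  (3, 'Paris',  '2018-03-01 00:00:00');  -- fails: registered too late

INSERT INTO demo_orders VALUES
  (10, 1,    '2018-04-01 00:00:00'),
  (11, 2,    '2018-04-02 00:00:00'),
  (12, 3,    '2018-04-03 00:00:00'),
  (13, NULL, '2018-04-04 00:00:00');     -- no matching customer

-- The LEFT JOIN keeps order 13 with every c.* column set to NULL, but the
-- WHERE clause then discards that row, because c.Location IS NOT NULL is false.
SELECT o.Id
FROM demo_orders o
LEFT JOIN demo_customer c ON c.Id = o.Customer_Id
WHERE c.Location IS NOT NULL
  AND c.Registration_Date < '2018-01-01 00:00:00';

-- The INNER JOIN form returns exactly the same rows.
SELECT o.Id
FROM demo_orders o
INNER JOIN demo_customer c
        ON c.Id = o.Customer_Id
       AND c.Location IS NOT NULL
       AND c.Registration_Date < '2018-01-01 00:00:00';
```

Both statements return only order 10. Any row that the LEFT JOIN preserves without a matching customer is removed again by the WHERE conditions, so the two forms are equivalent here.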
When it comes to performance, the basic advice for this query is a compound index on all three columns being searched, in the same sequence as they are used in the query (usually you want to put the most restrictive condition first, so you might want to adjust this):
ALTER TABLE customer ADD INDEX (Id, Location, Registration_Date);
For more advice on performance, you might want to update your question with the CREATE TABLE statements of your tables and the execution plan of your query.

If my comment and GMB's answer don't end up helping performance much, you can always try writing the query with a different approach. I usually prefer joins to subqueries, but occasionally subqueries turn out to be the best option for the data being handled.
Since you've said the customer table is relatively large compared to the orders table, this could be one of those situations.
SELECT o.*
FROM customer_orders AS o
WHERE o.Customer_Id IN (
SELECT Id
FROM customer
WHERE Location IS NOT NULL
AND Registration_Date < '2018-01-01 00:00:00'
);

I wanted to post a comment, but changed my mind and went with an answer, because the main issue is the question itself.
I don't know how many columns your customer_orders table has, but if you are getting "40 million entries" back, I would say you are doing something wrong.
Probably it is not the query itself that is slow, but fetching the data.
To prove that, try to execute EXPLAIN against your query:
EXPLAIN SELECT ...your query here... ;
Then execute
EXPLAIN SELECT ...your query here... LIMIT 1;
Then try to LIMIT your results, to 1000 rows for example:
SELECT ...your query here... LIMIT 1000;
When you have the answers, outputs, and stats for these queries, we can discuss your next steps.

Related

Optimize MySQL Index Multiple Table JOIN

I have 5 tables in MySQL, and my query takes too long to execute.
Here are my tables:
reciept (row count: 23799640)
reciept_goods (row count: 39398989)
good (row count: 17514)
good_categories (row count: 121)
retail_category (row count: 10)
My indexes:
Date --> reciept.date #1
reciept_goods_index --> reciept_goods.recieptId #1, reciept_goods.shopId #2, reciept_goods.goodId #3
category_id --> good.category_id #1
I have the following SQL query:
SELECT
R.shopId,
sales,
sum(Amount) as sum_amount,
count(distinct R.id) as count_reciept,
RC.id,
RC.name
FROM
reciept R
JOIN reciept_goods RG
ON R.id = RG.RecieptId
AND R.ShopID = RG.ShopId
JOIN good G
ON RG.GoodId = G.id
JOIN good_categories GC
ON G.category_id = GC.id
JOIN retail_category RC
ON GC.retail_category_id = RC.id
WHERE
R.date >= '2018-01-01 10:00:00'
GROUP BY
R.shopId,
R.sales,
RC.id
EXPLAIN for this query gives the following result:
[image: EXPLAIN output]
and the execution time is 236 sec.
If I use STRAIGHT_JOIN good ON (good.id = reciept_goods.GoodId), EXPLAIN gives:
[image: EXPLAIN output]
and the execution time is 31 sec.
SELECT STRAIGHT_JOIN ... rest of query
I think the problem is in the indexes of my tables, but I don't understand how to fix them. Can someone help me?

With about 2% of the rows in reciept having a matching date, the second execution plan (with straight_join) seems to be the right execution order. You should be able to optimize it by adding the following covering indexes:
reciept(date, sales)
reciept_goods(recieptId, shopId, goodId, amount)
I assume that the column order in your primary key for reciept_goods currently is (goodId, recieptId, shopId) (or (goodId, shopId, recieptId)). You could change that to (recieptId, shopId, goodId) (and, judging by e.g. the table name, you may have wanted to do this anyway); in that case, you do not need the second index (at least for this query). I would assume that this primary key made MySQL take the slower execution plan (assuming, of course, that it expected it to be faster), although sometimes it's just bad statistics, especially on a test server.
With those covering indexes, MySQL should take the faster execution plan even without straight_join; if it doesn't, just add it again (although I would like to look at both execution plans then). Also check that the two new indexes are used in the execution plan; otherwise I may have missed a column.
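As a sketch, the two covering indexes suggested above could be created as follows. The index names are hypothetical; the column names are taken from the question's query and index list, so adjust them if the real schema differs.

```sql
-- Index names are made up for illustration; `date` is backtick-quoted
-- only to keep it visually distinct from the DATE type keyword.
ALTER TABLE reciept
  ADD INDEX idx_reciept_date_sales (`date`, sales);

ALTER TABLE reciept_goods
  ADD INDEX idx_rg_covering (recieptId, shopId, goodId, amount);
```

After adding them, re-run EXPLAIN on the original query. If the key column names the new indexes and the Extra column shows "Using index" for those tables, the indexes are being used as covering indexes.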

It looks like you are depending on walking through a couple of many:many tables? Many people design them inefficiently.
Here I have compiled a list of 7 tips on making mapping tables more efficient. The most important is use of composite indexes.

Fast to query slow to create table

I have an issue creating a table from a SELECT (it runs very slowly). The query should take only the details of the animal with the latest entry date. That query will then be inner joined with another query.
SELECT *
FROM amusementPart a
INNER JOIN (
SELECT DISTINCT name, type, cageID, dateOfEntry
FROM bigRegistrations
GROUP BY cageID
) r ON a.type = r.cageID
But because of the slow performance, someone suggested steps to improve it: 1) use a temporary table, 2) store the result and join it with the other statement.
USE myzoo;
CREATE TABLE animalRegistrations AS
SELECT DISTINCT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Unfortunately, it is still slow. If I run only the SELECT statement, the result is shown in 1-2 seconds. But if I add the CREATE TABLE, the query takes ages (approx. 25 minutes).
Any good approach to improve the query time?
Edit: the bigRegistrations table has around 3.5 million rows.

Please try the query below. The query you are using does not fetch the records you describe (only the details of the animal with the latest entry date); this version does, and it should also be faster:
SELECT a.*, b.name, b.type, b.cageID, b.dateOfEntry
FROM amusementPart a
INNER JOIN bigRegistrations b ON a.type = b.cageID
INNER JOIN (SELECT c.cageID, max(c.dateOfEntry) dateofEntry
FROM bigRegistrations c
GROUP BY c.cageID) t ON t.cageID = b.cageID AND t.dateofEntry = b.dateofEntry
I suggest indexing cageID and dateOfEntry.
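A minimal sketch of that index, assuming the column names used in the query above; the index name is hypothetical.

```sql
-- Composite index: the grouping column first, then the column passed to MAX().
CREATE INDEX idx_bigreg_cage_entry
    ON bigRegistrations (cageID, dateOfEntry);

-- Check how the inner aggregate is executed once the index exists.
EXPLAIN
SELECT cageID, MAX(dateOfEntry) AS dateOfEntry
FROM bigRegistrations
GROUP BY cageID;
```

With this column order, MySQL may be able to answer the GROUP BY ... MAX() subquery from the index alone, in which case EXPLAIN shows "Using index for group-by" in the Extra column. Whether the optimizer picks that plan depends on the data and statistics, so check the output rather than assuming it.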

This is a multipart question.
Use Temporary Table
Don't use DISTINCT; group by all columns to make the rows distinct (don't forget to check for an index)
Check the SQL Execution plans

Here you are not creating a temporary table. Try the following...
CREATE TEMPORARY TABLE IF NOT EXISTS animalRegistrations AS
SELECT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID

Have you tried running EXPLAIN to see how the plan differs from one execution to the next?
Also, I have found that there can be locking issues in some databases when doing INSERT ... SELECT and when creating a table from a SELECT. I ran this in MySQL, and it solved some deadlock issues I was having.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

The reason the query runs so slowly is probably that it creates the table from all 3.5 million rows, when really you only need a subset of those, i.e. the bigRegistrations rows that match your join to amusementPart. The first single SELECT statement is faster because the SQL engine is smart enough to know it only needs to calculate the bigRegistrations rows where a.type = r.cageID.
I'd suggest that you don't need a temp table; your first query is quite simple. Rather, you may just need an index. You can determine this manually by studying the estimated execution plan, or by running your query in the database tuning advisor. My guess is that you need an index similar to the one below. Notice that I index by cageId first, since that is what you join to amusementPart, so that would help the engine narrow the results down the quickest. But I'm guessing a bit; view the query plan or the tuning advisor to be sure.
CREATE NONCLUSTERED INDEX IX_bigRegistrations ON bigRegistrations
(cageId, name, type, dateOfEntry)
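Note that CREATE NONCLUSTERED INDEX is SQL Server syntax. If the database is MySQL, which the USE statement in the question suggests but does not confirm, the NONCLUSTERED keyword is not accepted. A sketch of the same index in MySQL syntax, assuming the column names from the question:

```sql
-- In InnoDB every secondary index is already non-clustered,
-- so no keyword is needed to say so.
CREATE INDEX IX_bigRegistrations
    ON bigRegistrations (cageID, name, type, dateOfEntry);
```

The column types are not given in the question. If name or type are long string columns, MySQL may reject the index for exceeding the maximum key length, in which case a prefix length on those columns or a narrower index would be needed.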
Also, if you want the animal with the latest entry date, I think you want this query instead of the one you're using. I'm assuming the PK is all 4 columns.
SELECT name, type, cageID, dateOfEntry
FROM bigRegistrations BR
WHERE BR.dateOfEntry =
(SELECT MAX(BR1.dateOfEntry)
FROM bigRegistrations BR1
WHERE BR1.name = BR.name
AND BR1.type = BR.type
AND BR1.cageID = BR.cageID)

Why does the query take a long time in mysql even with a LIMIT clause?

Say I have an Order table that has 100+ columns and 1 million rows. It has a PK on OrderID and FK constraint StoreID --> Store.StoreID.
1) select * from `Order` order by OrderID desc limit 10;
The above takes a few milliseconds.
2) select * from `Order` o join `Store` s on s.StoreID = o.StoreID order by OrderID desc limit 10;
This can somehow take many seconds. The more inner joins I add, the more it slows down.
3) select OrderID, column1 from `Order` o join `Store` s on s.StoreID = o.StoreID order by OrderID desc limit 10;
This seems to speed up the execution, by limiting the columns we select.
There are a few points that I don't understand here, and I would really appreciate it if anyone more knowledgeable about MySQL (or RDBMS query execution in general) could enlighten me.
Query 1 is fast since it's just a reverse lookup by PK and DB only needs to return the first 10 rows it encountered.
I don't see why Query 2 should take forever. Shouldn't the operation be the same, i.e. get the first 10 rows by PK and then join with the other tables? Since there's a FK constraint, it is guaranteed that the relationship will be satisfied, so the DB doesn't need to join more rows than necessary and then trim the result, right? Unless the FK constraint allows a null FK? In which case I guess a left join would make this much faster than an inner join?
Lastly, I'm guessing Query 3 is simply faster because fewer columns are used in those unnecessary joins? But why would the query execution need the other columns while joining? Shouldn't it just join using PKs first, and then get the columns for just the 10 rows?
Thanks!

My understanding is that the MySQL engine applies LIMIT after any joins happen.
From http://dev.mysql.com/doc/refman/5.0/en/select.html, The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization. (LIMIT is applied after HAVING.)
EDIT: You could try using this query to take advantage of the PK speed.
select * from (select * from `Order` order by OrderID desc limit 10) o
join `Store` s on s.StoreID = o.StoreID;
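One caveat with the derived-table approach: the ORDER BY inside the subquery only decides which 10 rows are picked, and the order of the joined result is not guaranteed unless the outer query sorts again. A sketch that repeats the sort, assuming the Order and Store tables described in the question:

```sql
SELECT o.*, s.*
FROM (
    -- Pick the 10 newest orders first, walking the primary key backwards.
    SELECT *
    FROM `Order`
    ORDER BY OrderID DESC
    LIMIT 10
) AS o
JOIN `Store` AS s ON s.StoreID = o.StoreID
-- Re-sort the (at most 10) joined rows; this is cheap at that size.
ORDER BY o.OrderID DESC;
```

If StoreID can be NULL in Order, the inner join can return fewer than 10 rows; a LEFT JOIN would keep all 10.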

All of your examples are asking for table scans of the existing tables, so none of them will be more or less performant than the degree to which MySQL can cache the data or results. Some of your queries have ORDER BY or join criteria, which can take advantage of indexes purely to make the joining process more efficient; however, that is still not the same as having a set of criteria that will trigger the use of indexes.
LIMIT is not a criterion; it can be thought of as a filter applied once a result set is determined. You save time on the client, once the result set is prepared, but not on the server.
Really, the only way to get the answers you are seeking is to become familiar with:
EXPLAIN EXTENDED your_sql_statement
The output of EXPLAIN will show you how many rows are being looked at by mysql, as well as whether or not any indexes are being used.

Improving mysql query?

My attempt was to join the customer and order tables and to join the lineitem and order tables. I have also indexed the c_mktsegment field. My resulting query is below. Is there anything I can do to improve it?
select
o_shippriority,
l_orderkey,
o_orderdate,
sum(l_extendedprice * (1 - l_discount)) as revenue
from
cust As c
join ord As o on c.c_custkey = o.o_custkey
join line As l on o.o_orderkey = l.l_orderkey
where
c_mktsegment = ':1'
and o_orderdate < date ':2'
and l_shipdate > date ':2'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate;

I don't see anything obviously wrong with this query. For good performance, you probably should have indexes on orders.o_custkey and lineitem.l_orderkey. The index on c_mktsegment will let the DB find customer records quickly, but from there you need to be able to find order and lineitem records.
You should run EXPLAIN to see how the DB is processing the query. This depends on many factors, including the number of records in each table and the distribution of keys, so I can't say what the plan is just by looking at the query. But if you run EXPLAIN and see that it is doing a full-file read (a full table scan) of a table, you should add an index to prevent that. That's pretty much rule #1 for query optimization.
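A sketch of those two steps, using the table names that appear in the question's query (cust, ord, line). The index names and the literal values substituted for the :1 and :2 placeholders are made up for illustration.

```sql
-- Hypothetical index names; these support the two join conditions.
CREATE INDEX idx_ord_custkey   ON ord  (o_custkey);
CREATE INDEX idx_line_orderkey ON line (l_orderkey);

-- 'BUILDING' and the date are placeholder values, not taken from the question.
EXPLAIN
SELECT l_orderkey,
       o_orderdate,
       o_shippriority,
       SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM cust AS c
JOIN ord  AS o ON c.c_custkey = o.o_custkey
JOIN line AS l ON o.o_orderkey = l.l_orderkey
WHERE c_mktsegment = 'BUILDING'
  AND o_orderdate < DATE '1995-03-15'
  AND l_shipdate  > DATE '1995-03-15'
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate;
```

In the output, a row with type = ALL means that table is read with a full scan, and the key column shows which index, if any, was chosen for each table.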

How to optimize a JOIN and AVG statement for a ratings table

I basically have two tables, a 'server' table and a 'server_ratings' table. I need to optimize the current query that I have (It works but it takes around 4 seconds). Is there any way I can do this better?
SELECT ROUND(AVG(server_ratings.rating), 0), server.id, server.name
FROM server LEFT JOIN server_ratings ON server.id = server_ratings.server_id
GROUP BY server.id;

The query looks OK, but make sure you have proper indexes:
on the id column in the server table (probably the primary key),
on the server_id column in the server_ratings table.
If that does not help, then add a rating column to the server table and recalculate it on a regular basis (see this answer about Cron jobs). This way you save the time spent on calculations at query time. They can be run separately, e.g. every minute, but less frequent calculations are probably enough (depending on how dynamic your data is).
Also make sure you query the proper table: in the question you mention a servers table, but the code references a server table. Probably a typo :)
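The precomputed-rating idea could be sketched as below. The column name avg_rating is hypothetical (it is not part of the original schema), and the refresh statement is meant to be run on a schedule, for example from cron.

```sql
-- avg_rating is a hypothetical column added to hold the precomputed value.
ALTER TABLE server ADD COLUMN avg_rating INT NULL;

-- Periodic refresh: recompute the rounded average for every server.
-- Servers without any ratings end up with NULL, matching the LEFT JOIN result.
UPDATE server s
LEFT JOIN (
    SELECT server_id, ROUND(AVG(rating), 0) AS avg_rating
    FROM server_ratings
    GROUP BY server_id
) r ON r.server_id = s.id
SET s.avg_rating = r.avg_rating;

-- Reads then need neither the join nor the aggregate.
SELECT id, name, avg_rating
FROM server;
```

The trade-off is staleness: the stored value is only as fresh as the last refresh, which is why the refresh interval should follow how quickly the ratings change.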

This should be slightly faster, because the aggregate function is executed first, resulting in fewer JOIN operations.
SELECT s.id, s.name, r.avg_rating
FROM server s
LEFT JOIN (
SELECT server_id, ROUND(AVG(rating), 0) AS avg_rating
FROM server_ratings
GROUP BY server_id
) r ON r.server_id = s.id
But the major point is matching indexes. Primary keys are indexed automatically. Make sure you have one on server_ratings.server_id, too.
