1k entries query with multiple JOINs takes up to 10 seconds - mysql

Here's a simplified version of the structure (left out some regular varchar cols):
CREATE TABLE `car` (
`reg_plate` varchar(16) NOT NULL default '',
`type` text NOT NULL,
`client` int(11) default NULL,
PRIMARY KEY (`reg_plate`)
)
And here's the query I'm trying to run:
SELECT * FROM (
SELECT
car.*,
tire.id as tire,
client.name as client_name
FROM
car
LEFT JOIN client ON car.client = client.id
LEFT JOIN tire ON tire.reg_plate = car.reg_plate
GROUP BY car.reg_plate
) t1
The nested query is necessary due to the framework sometimes adding WHERE / SORT clauses (which assume there are columns named client_name or tire).
Both the car and the tire tables have approx. 1.5K entries. client has no more than 500, and for some reason the query still takes up to 10 seconds to complete (worse, the framework runs it twice: first to check how many rows there are, then to actually limit to the requested page).
I'm getting a feeling that this query is very inefficient, I just don't know how to optimize it.
Thanks in advance.

First, read up on MySQL's EXPLAIN syntax.
You probably need indexes on every column in the join clauses, and on every column that your framework uses in WHERE and SORT clauses. Sometimes multi-column indexes are better than single-column indexes.
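For the schema shown, the join columns most likely to need indexes are car.client and tire.reg_plate (client.id is presumably already the primary key). A sketch with made-up index names; extend it with whatever columns your framework actually filters and sorts on:
ALTER TABLE car ADD INDEX idx_car_client (client);          -- supports car.client = client.id
ALTER TABLE tire ADD INDEX idx_tire_reg_plate (reg_plate);  -- supports tire.reg_plate = car.reg_plate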
Your framework probably doesn't require nested queries. Unnesting and creating a view or passing parameters to a stored procedure might give you better performance.
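For example, if the framework can be pointed at a view instead of the nested query, a minimal sketch (the view name is made up) could be:
CREATE VIEW car_overview AS
SELECT car.*, tire.id AS tire, client.name AS client_name
FROM car
LEFT JOIN client ON car.client = client.id
LEFT JOIN tire ON tire.reg_plate = car.reg_plate
GROUP BY car.reg_plate;
-- note: under ONLY_FULL_GROUP_BY (default in MySQL 5.7+) tire.id and client.name
-- would need ANY_VALUE() or an aggregate, since they aren't grouped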
For better suggestions on SO, always include DDL and sample data (as INSERT statements) in your questions. You should probably include EXPLAIN output on performance questions, too.

Related

MySQL 8 - Slow select when order by combined with limit

I'm having trouble understanding my options for how to optimize this specific query. Looking online, I find various resources, but all for queries that don't match my particular one. From what I could gather, it's very hard to optimize a query when you have an order by combined with a limit.
My use case is that I would like to have a paginated datatable that displays the latest records first.
The query in question is the following (to fetch 10 latest records):
select
`xyz`.*
from
xyz
where
`xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by
`registration_id` desc
limit 10 offset 0
& table DDL:
CREATE TABLE `xyz` (
`registration_id` int NOT NULL AUTO_INCREMENT,
`fk_campaign_id` int DEFAULT NULL,
`fk_customer_id` int DEFAULT NULL,
... other fields ...
`voided` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`registration_id`),
.... ~12 other indexes ...
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
) ENGINE=InnoDB AUTO_INCREMENT=280614594 DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;
The explain on the query mentioned gives me the following:
"id","select_type","table","partitions","type","possible_keys","key","key_len","ref","rows","filtered","Extra"
1,SIMPLE,db_campaign_registration,,index,"getTop5,winners,findByPage,foreignKeyExistingCheck,limitReachedIp,byCampaign,emailExistingCheck,getAll,getAllDated,activityOverview",PRIMARY,"4",,1626,0.65,Using where; Backward index scan
As you can see, it says it only hits 1626 rows. But when I execute it, it takes 200+ seconds to run.
I'm doing this to fetch data for a datatable that displays the latest 10 records. I also have pagination that allows one to navigate pages (only able to go to the next page, not to the last or make any big jumps).
To further help with getting the full picture I've put together a dbfiddle: https://dbfiddle.uk/Jc_K68rj - this fiddle does not show the same results as my table, but I suspect this is because of the data size I'm working with.
The table in question has 120GB of data and 39,000,000 active records. I already have an index in place that should cover the query and allow it to fetch the data fast. Am I completely missing something here?
Another solution goes something like this:
SELECT b.*
FROM ( SELECT registration_id
FROM xyz
where `xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by `registration_id` desc
limit 10 offset 0 ) AS a
JOIN xyz AS b USING (registration_id)
order by `registration_id` desc;
Explanation:
The derived table (subquery) will use the 'best' query without any extra prompting -- since it is "covering".
That will deliver 10 ids
Then 10 JOINs to the table to get xyz.*
A derived table is unordered, so the ORDER BY does need repeating.
That's tricking the Optimizer into doing what it should have done anyway.
(Again, I encourage getting rid of any indexes that are prefixes of the 3-column optimal index discussed.)
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
is optimal. (Nearly as good is the same index, but without the DESC).
Let's see the other indexes. I strongly suspect that there is at least one index that is a prefix of that index. Remove it/them. The Optimizer sometimes gets confused and picks the "smaller" index instead of the "better" index.
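To check, something like this (byCampaign is a guess based on the possible_keys list in your EXPLAIN; verify its columns before dropping anything):
SHOW INDEX FROM xyz;                   -- list all indexes and their columns
-- if e.g. byCampaign turns out to be just (fk_campaign_id), a prefix of activityOverview:
ALTER TABLE xyz DROP INDEX byCampaign;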
Here's a technique for seeing whether it manages to read only 10 rows instead of most of the table: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts
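In short, that technique boils down to this (using the query from the question):
FLUSH STATUS;                          -- reset this session's Handler_* counters
SELECT `xyz`.* FROM xyz
WHERE `xyz`.`fk_campaign_id` = 95870 AND `xyz`.`voided` = 0
ORDER BY `registration_id` DESC LIMIT 10 OFFSET 0;
SHOW SESSION STATUS LIKE 'Handler%';   -- Handler_read_* shows how many rows were actually touched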

Understanding why group by query slows down when there are lots of text columns

I have a query which runs slowly. I've come up with a much faster alternative, but I'd like some help understanding why the original query is so slow.
A simplified version of my problem uses two tables. A simplified version of the first table, called profiles, is
`profiles` (
`id` int(11),
`title` char(255),
`body` text,
`pin` int(11),
PRIMARY KEY (`id`),
UNIQUE KEY `pin` (`pin`)
)
The simplified version of my second table, calls, is
`calls` (
`id` int(11),
`pin` int(11),
`duration` int(11),
PRIMARY KEY (`id`),
KEY `ivr_id` (`pin`)
)
My query is supposed to get the full profiles, with the addition of the number of calls received by each profile. The query I was using was
SELECT profiles.*, COUNT(*) AS num_calls
FROM profiles
LEFT JOIN calls
ON profiles.pin = calls.pin
GROUP BY profiles.pin
With ~100 profiles and ~250,000 calls, this query takes about 10 seconds, which is slow.
If I modify the query to select just the title from profiles, not all columns, the query is much faster. If I remove the GROUP BY, it's also much faster. If I just select everything from the profiles table, that's also a fast query.
My actual profiles table has many more text and char fields. The query gets slower the more text fields are selected. Why are the text fields making the query so slow, when they are not involved in the JOIN or the GROUP BY?
I came up with a slightly different query which is much faster, less than half a second. This query is:
SELECT profiles.*, temp.readings
FROM profiles
LEFT JOIN (
SELECT pin, COUNT(*) AS readings
FROM calls
GROUP BY pin
) AS temp
ON temp.pin=profiles.pin
Whilst I think I've solved my speed problem, I'd like to understand what is causing the issue in the first query.
======== Update ========
I've just profiled both queries, and the entire speed difference is in the 'sending data' stage: about 10 seconds for the slow query versus about 0.1 seconds for the fast one.
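(For reference, assuming the classic session profiler was used, the per-stage numbers come from something like:)
SET profiling = 1;            -- deprecated, but still available through MySQL 5.x
-- ...run the slow and the fast query here...
SHOW PROFILES;                -- durations of the recent queries
SHOW PROFILE FOR QUERY 1;     -- per-stage breakdown, including 'sending data'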
======== Update 2 ========
After discussing with @scaisEdge, I think I can rephrase my question. Consider a table T1 that has ~40 columns, of which 8 are of type TEXT, and ~100 rows, and a table T2 which has 5 columns of type INT and VARCHAR with ~250,000 rows. Why is it that:
SELECT T1.* FROM T1 is fast
SELECT T1.* FROM T1 JOIN T2 GROUP BY T1.joinfield is slow
SELECT T1.selectfield FROM T1 JOIN T2 GROUP BY T1.joinfield is fast if selectfield is an INT or VARCHAR
This happens because:
The first query joins 100 profiles with 250,000 calls and only then reduces the returned rows with the GROUP BY. SELECT profiles.* means that for every matching row, the full profiles row (including all the TEXT columns) must be carried through the join and the grouping.
The second query joins 100 profiles with the rows returned by the temp subquery (probably far fewer than 250,000), greatly reducing the number of accesses to the profiles table data.
Instead of profiles.*, try accessing only the pin column:
SELECT profiles.pin, COUNT(*) AS num_calls
FROM profiles
LEFT JOIN calls ON profiles.pin = calls.pin
GROUP BY profiles.pin
As a side note: the GROUP BY in the first query is only accepted by MySQL versions earlier than 5.7. From 5.7 onward, ONLY_FULL_GROUP_BY is enabled by default, and selecting columns that are neither named in the GROUP BY clause nor wrapped in an aggregate function produces an error.
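A quick way to check which behavior you will get:
SELECT @@sql_mode LIKE '%ONLY_FULL_GROUP_BY%';  -- 1 means the first query is rejected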

Partition a very large INNER JOIN SQL query

The SQL query is a fairly standard inner-join type.
For example, comparing n tables to see which customerIds exist in all n tables would be a basic WHERE ... AND type query.
The problem is that the tables each hold > 10 million records. The database is denormalized. Normalization is not an option.
The query either takes too long to complete or never completes.
I'm not sure if it's relevant but we are using spring xd job modules for other types of queries.
I'm not sure how to partition this sort of job so that it can run in parallel, take less time, and resume from where it left off if a step/subsection fails.
Other posts with similar problems suggest alternative methods besides the database engine, like implementing a LOOP JOIN in code or using MapReduce or Hadoop; having never used either, I'm unsure whether they are worth looking into for this use case.
What is the standard approach to this sort of operation? I'd expect it to be fairly common, but I might be using the wrong search terms, because I haven't come across any stock standard solutions or clear directions.
The rather cryptic original requirement was:
Compare party_id column in the three very large tables to identify the customer available in three table
i.e if it is AND operation between three.
SAMPLE1.PARTY_ID AND SAMPLE2.PARTY_ID AND SAMPLE3.PARTY_ID
If the operation is OR then pick all the customers available in the three tables.
SAMPLE1.PARTY_ID OR SAMPLE2.PARTY_ID OR SAMPLE3.PARTY_ID
AND / OR are used between tables then performed the comparison as required. SAMPLE1.PARTY_ID AND SAMPLE2.PARTY_ID OR SAMPLE3.PARTY_ID
I set up 4 test tables, each with this definition:
CREATE TABLE `TABLE1` (
`CREATED` datetime DEFAULT NULL,
`PARTY_ID` varchar(45) NOT NULL,
`GROUP_ID` varchar(45) NOT NULL,
`SEQUENCE_ID` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`SEQUENCE_ID`)
) ENGINE=InnoDB AUTO_INCREMENT=978536 DEFAULT CHARSET=latin1;
Then I added 1,000,000 records to each, just random numbers in a range that should produce join matches.
I used the following test query:
SELECT `TABLE1`.`PARTY_ID` AS `pi1`,
       `TABLE2`.`PARTY_ID` AS `pi2`,
       `TABLE3`.`PARTY_ID` AS `pi3`,
       `TABLE4`.`PARTY_ID` AS `pi4`
FROM `devt1`.`TABLE2` AS `TABLE2`,
     `devt1`.`TABLE1` AS `TABLE1`,
     `devt1`.`TABLE3` AS `TABLE3`,
     `devt1`.`TABLE4` AS `TABLE4`
WHERE `TABLE2`.`PARTY_ID` = `TABLE1`.`PARTY_ID`
  AND `TABLE3`.`PARTY_ID` = `TABLE2`.`PARTY_ID`
  AND `TABLE4`.`PARTY_ID` = `TABLE3`.`PARTY_ID`
It's supposed to complete in under 10 minutes, even for tables 10x larger.
My test query still hasn't completed, and it has been running for 15 minutes.
The following may perform better than the existing join-based query:
select party_id from
(select distinct party_id from SAMPLE1 union all
select distinct party_id from SAMPLE2 union all
select distinct party_id from SAMPLE3) as ilv
group by party_id
having count(*) = 3
Amend the count(*) condition to match the number of tables being queried.
If you want to return party_id values that are present in any table rather than all, then omit the final having clause.
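Spelled out, that OR ("present in any table") variant is the same query with the HAVING dropped:
select party_id from
(select distinct party_id from SAMPLE1 union all
select distinct party_id from SAMPLE2 union all
select distinct party_id from SAMPLE3) as ilv
group by party_id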

Optimizing MySQL Left join query between 3 tables to reduce execution time

I have the following query:
SELECT region.id, region.world_id, min_x, min_y, min_z, max_x, max_y, max_z, version, mint_version
FROM minecraft_worldguard.region
LEFT JOIN minecraft_worldguard.region_cuboid
ON region.id = region_cuboid.region_id
AND region.world_id = region_cuboid.world_id
LEFT JOIN minecraft_srvr.lot_version
ON id=lot
WHERE region.world_id = 10
AND region_cuboid.world_id=10;
The MySQL slow query log tells me that it takes more than 5 seconds to execute; it returns 2,300 rows but examines 15,404,545 rows to return them.
The three tables each have only about 6,500 rows, with unique keys on the id and lot fields as well as keys on the world_id fields. I tried to minimize the number of rows examined by filtering both cuboid and region by world_id (hence the double WHERE on world_id), but it did not seem to help.
Any idea how I can optimize this query?
Here is the sqlfiddle with the indexes as of current status.
MySQL can't use an index in this case because the joined fields have different collations:
`lot` varchar(20) COLLATE utf8_unicode_ci NOT NULL
`id` varchar(128) COLLATE utf8_bin NOT NULL
If you change these fields to a common collation (for example, region.id to utf8_unicode_ci), MySQL uses the primary key (fiddle).
According to docs:
Comparison of dissimilar columns (comparing a string column to a
temporal or numeric column, for example) may prevent use of indexes if
values cannot be compared directly without conversion.
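A sketch of the suggested change (length kept as in the question; note that MODIFY rewrites the table, so test on a copy first):
ALTER TABLE minecraft_worldguard.region
MODIFY `id` varchar(128) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;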
You have joined the two tables "minecraft_worldguard.region" and "minecraft_worldguard.region_cuboid" on region.world_id = region_cuboid.world_id, so the WHERE clause doesn't need both conditions.
The two columns in the WHERE clause have been equated in the JOIN condition, hence checking both in the WHERE clause is redundant. Remove one of them and add an index on the column that remains in the WHERE condition.
In your example, leave the WHERE clause as below:
WHERE region.world_id = 10
and add an index on the region.world_id column, that would improve the performance a bit.
NOTE: I am suggesting you discard the "AND region_cuboid.world_id=10" part of the WHERE clause.
Hope that helps.
First, when writing queries that involve multiple tables, it is a very good habit to use "alias" references to the tables so you don't have to retype the entire long name throughout. It is also a really good idea to qualify which table each column comes from: it helps readers understand what is where, and it can even help performance (for example, by suggesting a covering index).
That said, I have applied aliases to your original query, but I AM GUESSING which table each column belongs to; you can obviously verify quickly and adjust.
SELECT
R.id,
R.world_id,
RC.min_x,
RC.min_y,
RC.min_z,
RC.max_x,
RC.max_y,
RC.max_z,
LV.version,
LV.mint_version
FROM
minecraft_worldguard.region R
LEFT JOIN minecraft_worldguard.region_cuboid RC
ON R.id = RC.region_id
AND R.world_id = RC.world_id
LEFT JOIN minecraft_srvr.lot_version LV
ON R.id = LV.lot
WHERE
R.world_id = 10
I also removed from the where clause your "region_cuboid.world_id = 10" as that is redundant as a result of the JOIN clause based on region AND world.
For index suggestions, and assuming I have the proper table references for the columns, I would suggest a covering index on the region table of ( world_id, id ). The world_id in the first position quickly qualifies the WHERE clause, and the id is there for the joins to the RC and LV tables.
For the region_cuboid table, I would also have an index on ( world_id, region_id) to match the region table being joined to it.
For the lot_version table, an index on (lot) or a covering index on (lot, version, mint_version).
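In DDL form, those suggestions would be something like (index names are made up):
ALTER TABLE minecraft_worldguard.region ADD INDEX ix_world_id (world_id, id);
ALTER TABLE minecraft_worldguard.region_cuboid ADD INDEX ix_world_region (world_id, region_id);
ALTER TABLE minecraft_srvr.lot_version ADD INDEX ix_lot_cover (lot, version, mint_version);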

MySQL performance of VIEW for tables combined with UNION ALL

Let's say I have 2 tables in MySQL:
create table `persons` (
`id` bigint unsigned not null auto_increment,
`first_name` varchar(64),
`surname` varchar(64),
primary key(`id`)
);
create table `companies` (
`id` bigint unsigned not null auto_increment,
`name` varchar(128),
primary key(`id`)
);
Now, very often I need to treat them the same, that's why following query:
select persons.id as `id`, concat(persons.first_name, ' ', persons.surname) as `name`, 'person' as `person_type`
from persons
union all
select companies.id as `id`, companies.name as `name`, 'company' as `person_type`
from companies
starts to appear in other queries quite often: as part of joins or subselects.
For now, I simply inject this query into joins or subselects like:
select *
from some_table row
left outer join (>>> query from above goes here <<<) as `persons`
on row.person_id = persons.id and row.person_type = persons.person_type
But today I had to use the discussed union query in another query multiple times, i.e. join it twice.
Since I have no experience with views, and have heard that they have many disadvantages, my question is:
Is it normal practice to create a view for the discussed union query and use it in my joins, subselects, etc.? In terms of performance, will it be worse, equal, or better compared to just inlining it into joins, subselects, etc.? Are there any drawbacks to having a view in this case?
Thanks in advance for any help!
I concur with all of the points in Bill Karwin's excellent answer.
Q: Is it normal practice to create a view for discussed union query and use it in my joins, subselects etc?
A: With MySQL, the more normal practice is to avoid the "CREATE VIEW" statement.
Q: In terms of performance - will it be worse, equal or better comparing to just inserting it into joins, subselects etc?
A: Referencing a view object will have the identical performance to an equivalent inline view.
(There might be a teensy-tiny bit more work to lookup the view object, checking privileges, and then replace the view reference with the stored SQL, vs. sending a statement that is just a teeny-tiny bit longer. But any of those differences are insignificant.)
Q: Are there any drawbacks of having a view in this case?
A: The biggest drawback is in how MySQL processes a view, whether it's stored or inline. MySQL will always run the view query and materialize the results from that query as a temporary MyISAM table. But there's no difference there whether the view definition is stored, or whether it's included inline. (Other RDBMSs process views much differently than MySQL).
One big drawback of a view is that predicates from the outer query NEVER get pushed down into the view query. Every time you reference that view, even with a query for a single id value, MySQL is going to run the view query and create a temporary MyISAM table (with no indexes on it), and THEN MySQL will run the outer query against that temporary MyISAM table.
So, in terms of performance, think of a reference to a view on par with "CREATE TEMPORARY TABLE t (cols) ENGINE=MyISAM" and "INSERT INTO t (cols) SELECT ...".
MySQL actually refers to an inline view as a "derived table", and that name makes a lot of sense, when we understand what MySQL is doing with it.
My personal preference is to not use the "CREATE VIEW" statement. The biggest drawback (as I see it) is that it "hides" SQL that is being executed. For the future reader, the reference to the view looks like a table. And then, when he goes to write a SQL statement, he's going to reference the view like it was a table; very convenient. Then he decides he's going to join that table to itself, with another reference to it. For the second reference, MySQL runs the view query again and creates yet another temporary (and unindexed) MyISAM table. And now there's a JOIN operation on that. And then a predicate "WHERE view.column = 'foo'" gets added on the outer query.
It ends up "hiding" the most obvious performance improvement: sliding that predicate into the view query.
And then someone comes along and decides to create a new view that references the old view. He only needs a subset of rows, and can't modify the existing view because that might break something, so he creates a new view... CREATE VIEW myview AS SELECT * FROM publicview p WHERE p.col = 'foo'.
And, now, a reference to myview is going to first run the publicview query, create a temporary MyISAM table, then the myview query gets run against that, creating another temporary MyISAM table, which the outer query is going to run against.
Basically, the convenience of the view has the potential for unintentional performance problems. With the view definition available on the database for anyone to use, someone is going to use it, even where it's not the most appropriate solution.
At least with an inline view, the person writing the SQL statement is more aware of the actual SQL being executed, and having all that SQL laid out gives an opportunity for tweaking it for performance.
My two cents.
TAMING BEASTLY SQL
I find that applying regular formatting rules (that my tools automatically do) can bend monstrous SQL into something I can read and work with.
SELECT row.col1
, row.col2
, person.*
FROM some_table row
LEFT
JOIN ( SELECT 'person' AS `person_type`
, p.id AS `id`
, CONCAT(p.first_name,' ',p.surname) AS `name`
FROM persons p
UNION ALL
SELECT 'company' AS `person_type`
, c.id AS `id`
, c.name AS `name`
FROM companies c
) person
ON person.id = row.person_id
AND person.person_type = row.person_type
I'd be equally likely to avoid the inline view altogether and use conditional expressions in the SELECT list, though this does get more unwieldy with lots of columns.
SELECT row.col1
, row.col2
, row.person_type AS ref_person_type
, row.person_id AS ref_person_id
, CASE
WHEN row.person_type = 'person' THEN p.id
WHEN row.person_type = 'company' THEN c.id
END AS `person_id`
, CASE
WHEN row.person_type = 'person' THEN CONCAT(p.first_name,' ',p.surname)
WHEN row.person_type = 'company' THEN c.name
END AS `name`
FROM some_table row
LEFT
JOIN persons p
ON row.person_type = 'person'
AND p.id = row.person_id
LEFT
JOIN companies c
ON row.person_type = 'company'
AND c.id = row.person_id
A view makes your SQL shorter. That's all.
It's a common misconception for MySQL users that views store anything. They don't (at least not in MySQL). They're more like an alias or a macro. Querying the view is most often just like running the query in the "expanded" form. Querying a view twice in one query (as in the join example you mentioned) doesn't take any advantage of the view -- it will run the query twice.
In fact, a view can cause worse performance, depending on the query and how you use it, because it may need to store its result in a temporary table every time you query it.
See http://dev.mysql.com/doc/refman/5.6/en/view-algorithms.html for more details on when a view uses the temptable algorithm.
On the other hand, UNION queries also create temporary tables as they accumulate their results. So you're stuck with the cost of a temp table anyway.
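If you do end up creating the view, you can at least make the materialization explicit. A view containing UNION can never use the MERGE algorithm, so it gets TEMPTABLE regardless (person_view is a hypothetical name):
CREATE ALGORITHM = TEMPTABLE VIEW person_view AS
SELECT id, CONCAT(first_name, ' ', surname) AS `name`, 'person' AS person_type FROM persons
UNION ALL
SELECT id, `name`, 'company' AS person_type FROM companies;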