MySQL performance of VIEW for tables combined with UNION ALL - mysql

Let's say I have 2 tables in MySQL:
create table `persons` (
`id` bigint unsigned not null auto_increment,
`first_name` varchar(64),
`surname` varchar(64),
primary key(`id`)
);
create table `companies` (
`id` bigint unsigned not null auto_increment,
`name` varchar(128),
primary key(`id`)
);
Now, very often I need to treat them the same, which is why the following query:
select p.id as `id`, concat(p.first_name, ' ', p.surname) as `name`, 'person' as `person_type`
from persons p
union all
select c.id as `id`, c.name as `name`, 'company' as `person_type`
from companies c
keeps appearing in other queries, as part of joins or subselects.
For now, I simply inject this query into joins or subselects like:
select *
from some_table row
left outer join (>>> query from above goes here <<<) as `persons`
on row.person_id = persons.id and row.person_type = persons.person_type
But today I had to use this union query multiple times within another query, i.e. join it twice.
Since I have no experience with views and have heard that they have many disadvantages, my question is:
Is it normal practice to create a view for this union query and use it in my joins, subselects etc.? In terms of performance, will it be worse, equal or better compared to just inserting it into joins, subselects etc.? Are there any drawbacks of having a view in this case?
Thanks in advance for any help!

I concur with all of the points in Bill Karwin's excellent answer.
Q: Is it normal practice to create a view for discussed union query and use it in my joins, subselects etc?
A: With MySQL, the more normal practice is to avoid using the "CREATE VIEW" statement.
Q: In terms of performance - will it be worse, equal or better comparing to just inserting it into joins, subselects etc?
A: Referencing a view object will have identical performance to an equivalent inline view.
(There might be a teensy-tiny bit more work to look up the view object, check privileges, and then replace the view reference with the stored SQL, vs. sending a statement that is just a teeny-tiny bit longer. But any of those differences are insignificant.)
Q: Are there any drawbacks of having a view in this case?
A: The biggest drawback is in how MySQL processes a view like this one, whether it's stored or inline. A view that contains UNION or UNION ALL cannot use the MERGE algorithm, so MySQL will run the view query and materialize the results from that query as a temporary MyISAM table. But there's no difference there whether the view definition is stored, or whether it's included inline. (Other RDBMSs process views much differently than MySQL.)
One big drawback of a view is that predicates from the outer query NEVER get pushed down into the view query. Every time you reference that view, even with a query for a single id value, MySQL is going to run the view query and create a temporary MyISAM table (with no indexes on it), and THEN MySQL will run the outer query against that temporary MyISAM table.
So, in terms of performance, think of a reference to a view on par with "CREATE TEMPORARY TABLE t (cols) ENGINE=MyISAM" and "INSERT INTO t (cols) SELECT ...".
MySQL actually refers to an inline view as a "derived table", and that name makes a lot of sense, when we understand what MySQL is doing with it.
My personal preference is to not use the "CREATE VIEW" statement. The biggest drawback (as I see it) is that it "hides" the SQL that is being executed. To a future reader, the reference to the view looks like a table. Then, when he goes to write a SQL statement, he's going to reference the view as if it were a table, which is very convenient. Then he decides he's going to join that table to itself, with another reference to it. (For the second reference, MySQL also runs that query again, and creates yet another temporary, unindexed MyISAM table.) And now there's a JOIN operation on that. And then a predicate "WHERE view.column = 'foo'" gets added on the outer query.
It ends up "hiding" the most obvious performance improvement, sliding that predicate into the view query.
And then someone comes along and decides they are going to create a new view, which references the old view. He only needs a subset of rows, and can't modify the existing view because that might break something, so he creates a new view... CREATE VIEW myview AS SELECT p.* FROM publicview p WHERE p.col = 'foo'.
And, now, a reference to myview is going to first run the publicview query, create a temporary MyISAM table, then the myview query gets run against that, creating another temporary MyISAM table, which the outer query is going to run against.
Basically, the convenience of the view has the potential for unintentional performance problems. With the view definition available on the database for anyone to use, someone is going to use it, even where it's not the most appropriate solution.
At least with an inline view, the person writing the SQL statement is more aware of the actual SQL being executed, and having all that SQL laid out gives an opportunity for tweaking it for performance.
My two cents.
TAMING BEASTLY SQL
I find that applying regular formatting rules (that my tools automatically do) can bend monstrous SQL into something I can read and work with.
SELECT row.col1
     , row.col2
     , person.*
  FROM some_table row
  LEFT
  JOIN ( SELECT 'person'  AS `person_type`
              , p.id      AS `id`
              , CONCAT(p.first_name,' ',p.surname) AS `name`
           FROM persons p
          UNION ALL
         SELECT 'company' AS `person_type`
              , c.id      AS `id`
              , c.name    AS `name`
           FROM companies c
       ) person
    ON person.id = row.person_id
   AND person.person_type = row.person_type
I'd be equally likely to avoid the inline view at all, and use conditional expressions in the SELECT list, though this does get more unwieldy for lots of columns.
SELECT row.col1
     , row.col2
     , row.person_type AS ref_person_type
     , row.person_id   AS ref_person_id
     , CASE
         WHEN row.person_type = 'person'  THEN p.id
         WHEN row.person_type = 'company' THEN c.id
       END AS `person_id`
     , CASE
         WHEN row.person_type = 'person'  THEN CONCAT(p.first_name,' ',p.surname)
         WHEN row.person_type = 'company' THEN c.name
       END AS `name`
  FROM some_table row
  LEFT
  JOIN persons p
    ON row.person_type = 'person'
   AND p.id = row.person_id
  LEFT
  JOIN companies c
    ON row.person_type = 'company'
   AND c.id = row.person_id
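As a rough check that the two shapes above are interchangeable, here is a minimal sketch that runs both against toy data. It uses Python's sqlite3 module as a stand-in for MySQL, so it only says something about the result set, not about MySQL's performance. The contents of some_table, its col1 column and the alias r are made up for illustration, and `||` replaces CONCAT because that is SQLite's concatenation operator.

```python
import sqlite3

# SQLite stands in for MySQL: this checks result equivalence only, not speed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons   (id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT);
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    -- some_table and its rows are hypothetical sample data
    CREATE TABLE some_table (col1 TEXT, person_id INTEGER, person_type TEXT);

    INSERT INTO persons   VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
    INSERT INTO companies VALUES (1, 'Acme'), (2, 'Initech');
    INSERT INTO some_table VALUES
        ('a', 1, 'person'), ('b', 1, 'company'),
        ('c', 2, 'company'), ('d', 9, 'person');
""")

# Form 1: LEFT JOIN to the UNION ALL derived table.
union_rows = conn.execute("""
    SELECT r.col1, u.name
      FROM some_table r
      LEFT JOIN ( SELECT 'person' AS person_type, p.id AS id,
                         p.first_name || ' ' || p.surname AS name
                    FROM persons p
                   UNION ALL
                  SELECT 'company' AS person_type, c.id AS id, c.name AS name
                    FROM companies c
                ) u
        ON u.id = r.person_id
       AND u.person_type = r.person_type
     ORDER BY r.col1
""").fetchall()

# Form 2: one LEFT JOIN per base table, with CASE picking the right column.
case_rows = conn.execute("""
    SELECT r.col1,
           CASE WHEN r.person_type = 'person'  THEN p.first_name || ' ' || p.surname
                WHEN r.person_type = 'company' THEN c.name
           END AS name
      FROM some_table r
      LEFT JOIN persons p   ON r.person_type = 'person'  AND p.id = r.person_id
      LEFT JOIN companies c ON r.person_type = 'company' AND c.id = r.person_id
     ORDER BY r.col1
""").fetchall()

print(union_rows == case_rows)
```

The second form joins each base table directly on its primary key, so no derived table has to be built first.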

A view makes your SQL shorter. That's all.
It's a common misconception for MySQL users that views store anything. They don't (at least not in MySQL). They're more like an alias or a macro. Querying the view is most often just like running the query in the "expanded" form. Querying a view twice in one query (as in the join example you mentioned) doesn't take any advantage of the view -- it will run the query twice.
In fact, views can cause worse performance, depending on the query and how you use them, because they may need to store the result in a temporary table every time you query them.
See http://dev.mysql.com/doc/refman/5.6/en/view-algorithms.html for more details on when a view uses the temptable algorithm.
On the other hand, UNION queries also create temporary tables as they accumulate their results. So you're stuck with the cost of a temp table anyway.
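To make the "alias or macro" description concrete, here is a small sketch of the asker's "join it twice" case, once through a stored view and once with the same SQL pasted inline. It uses Python's sqlite3 module rather than MySQL, so it only shows that the results match and says nothing about MySQL's temporary tables. The view name all_parties and the deals table are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL; only the returned rows are compared.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons   (id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT);
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    -- deals is a made-up table that references the union twice
    CREATE TABLE deals (buyer_id INTEGER, buyer_type TEXT,
                        seller_id INTEGER, seller_type TEXT);
    INSERT INTO persons   VALUES (1, 'Ada', 'Lovelace');
    INSERT INTO companies VALUES (1, 'Acme');
    INSERT INTO deals VALUES (1, 'person', 1, 'company'),
                             (1, 'company', 1, 'person');
""")

UNION_SQL = """
    SELECT p.id AS id, p.first_name || ' ' || p.surname AS name,
           'person' AS person_type
      FROM persons p
     UNION ALL
    SELECT c.id AS id, c.name AS name, 'company' AS person_type
      FROM companies c
"""

# The stored view is just the union query under a name.
conn.execute("CREATE VIEW all_parties AS " + UNION_SQL)

# {src} is either the view name or the parenthesised query text.
JOIN_TWICE = """
    SELECT b.name AS buyer, s.name AS seller
      FROM deals d
      LEFT JOIN {src} b ON b.id = d.buyer_id  AND b.person_type = d.buyer_type
      LEFT JOIN {src} s ON s.id = d.seller_id AND s.person_type = d.seller_type
     ORDER BY buyer
"""

via_view = conn.execute(JOIN_TWICE.format(src="all_parties")).fetchall()
inline = conn.execute(JOIN_TWICE.format(src="(" + UNION_SQL + ")")).fetchall()

print(via_view == inline)
```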

Related

MySQL: why do my JOINs appear to make a query faster?

I am trying to improve the performance of a query using a "materialized view" to optimize away joins. The first query below is the original, which employs joins. The second is the query written against a table I generated which includes all the joined data (the equivalent of a materialized view). They both return the same result set. Unfortunately, the second query is MUCH slower when handling a very long set of input ids (the IN clause). I don't understand how that could be! Executing all the joins has to carry a fair amount of overhead that the "materialized view" saves, right?
SELECT
clinical_sample.INTERNAL_ID AS "internalId",
sample.STABLE_ID AS "sampleId",
patient.STABLE_ID AS "patientId",
clinical_sample.ATTR_ID AS "attrId",
cancer_study.CANCER_STUDY_IDENTIFIER AS "studyId",
clinical_sample.ATTR_VALUE AS "attrValue"
FROM clinical_sample
INNER JOIN sample ON clinical_sample.INTERNAL_ID = sample.INTERNAL_ID
INNER JOIN patient ON sample.PATIENT_ID = patient.INTERNAL_ID
INNER JOIN cancer_study ON patient.CANCER_STUDY_ID =
cancer_study.CANCER_STUDY_ID
WHERE cancer_study.CANCER_STUDY_IDENTIFIER = 'xxxxx'
AND sample.STABLE_ID IN
('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND clinical_sample.ATTR_ID IN
(
'CANCER_TYPE'
);
SELECT
internalId,
sampleId,
patientId,
attrId,
studyId,
attrValue
FROM test
WHERE
sampleId IN ('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND studyId = 'xxxxx'
AND attrId = 'CANCER_TYPE';
Update: I did notice in the Workbench report that the query with joins seems to scan far fewer rows: about 829k vs ~2400k for the second, joinless query. So having joins seems to actually be a major optimization somehow. I have indexes on sampleId, studyId and attrId, and a composite index on all three.
Both table "test" and "clinical_sample" have the same number of rows.
It would help to see what the PRIMARY KEY of each table is.
Some of these indexes are likely to help:
clinical_sample: INDEX(ATTR_ID, INTERNAL_ID, ATTR_VALUE)
sample: INDEX(STABLE_ID, INTERNAL_ID, PATIENT_ID)
patient: INDEX(INTERNAL_ID, STABLE_ID, CANCER_STUDY_ID)
cancer_study: INDEX(CANCER_STUDY_IDENTIFIER, CANCER_STUDY_ID)
I agree with Barmar's INDEX(studyId, attrId, sampleId) for the materialized view.
You wrote: "I have index in sampleId, studyId, attrId and composite of all three."
Let's see the EXPLAIN. It may show that it is using your index just on (sampleId) when it should be using the composite index.
Also put the IN column last, not first, regardless of cardinality. More precisely, put = columns first in a composite index.
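A minimal sketch of that column ordering, with the equality columns first and the IN column last. It uses Python's sqlite3 module, so the plan shown comes from SQLite's planner and not MySQL's; on MySQL the equivalent check is EXPLAIN against the real table. The column types and the index name idx_study_attr_sample are assumptions, since the question does not show the table definition.

```python
import sqlite3

# SQLite stands in for MySQL; column types are guesses, not from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (internalId INTEGER, sampleId TEXT, patientId TEXT,
                       attrId TEXT, studyId TEXT, attrValue TEXT);
    -- '=' columns (studyId, attrId) first, the IN column (sampleId) last
    CREATE INDEX idx_study_attr_sample ON test (studyId, attrId, sampleId);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT internalId, sampleId, patientId, attrId, studyId, attrValue
      FROM test
     WHERE sampleId IN ('P-0068343-T02-IM7', 'P-0068353-T01-IM7')
       AND studyId = 'xxxxx'
       AND attrId = 'CANCER_TYPE'
""").fetchall()

# The human-readable plan description is the fourth column of each row.
plan_text = " ".join(row[3] for row in plan)
print(plan_text)
```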
Food for thought: When and why are database joins expensive?
This leads me to believe that normalized tables with indexes could actually be faster than my denormalized attempt (the materialized view).

Should I rather use a subquery or a combined WHERE?

This specific situation may seem a bit silly, but I just want to know how I should solve it: there is a table (schools), and in this table you find all students with their school id. The order is completely random, but you can sort it with a SELECT statement.
CREATE TABLE schools (school_id int, name varchar(32), age ...);
Now I want to search for a student by name (with LIKE '%name%'), but only if he's in a certain school.
I already tried this:
SELECT * FROM `schools` WHERE `school_id` = 33 and `name` LIKE '%max%';
But then I realized that I could also use a subquery like:
SELECT * FROM (SELECT * FROM `schools` WHERE `school_id` = 33) AS a
WHERE a.name LIKE '%max%';
Which way is more efficient/has a higher performance?
You can use the EXPLAIN keyword to see exactly how each query is executed.
I'd say it's almost certain that these two will execute identically.
The query optimizer will probably choose the same plan for both queries. If you want to know for sure, look at the execution plan when you execute each query.
The query without the subquery is probably more efficient in MySQL:
SELECT *
FROM `schools`
WHERE `school_id` = 33 and `name` LIKE '%max%';
MySQL has this nasty tendency to materialize subqueries -- that is, to actually run the subquery and save it as a temporary table (it is getting better, though). Most other databases do not do this. So, in other databases, the two should be equivalent.
MySQL is smart enough to use an index, if available, for school_id, even though there are other comparisons. If no indexes are available, it will be doing a full table scan, which will probably dominate the performance.
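For what it's worth, here is a small sketch confirming that the two forms return the same rows. It runs on SQLite through Python's sqlite3 module, so it says nothing about whether MySQL materializes the subquery; that still needs EXPLAIN on the real server. The sample rows and the index name idx_school are made up.

```python
import sqlite3

# SQLite stands in for MySQL; this compares results, not execution plans.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schools (school_id INTEGER, name TEXT);
    CREATE INDEX idx_school ON schools (school_id);
    -- hypothetical sample rows
    INSERT INTO schools VALUES
        (33, 'Max Power'), (33, 'Anna Smith'),
        (34, 'Maxine Doe'), (33, 'Tom Maxwell');
""")

# Combined WHERE: both conditions on the base table.
combined = conn.execute("""
    SELECT * FROM schools
     WHERE school_id = 33 AND name LIKE '%max%'
     ORDER BY name
""").fetchall()

# Subquery form: filter by school first, then by name.
nested = conn.execute("""
    SELECT * FROM (SELECT * FROM schools WHERE school_id = 33) AS a
     WHERE a.name LIKE '%max%'
     ORDER BY a.name
""").fetchall()

print(combined == nested)
```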

MySQL (version 5.5): Why is `JOIN` faster than an `IN` clause?

[Summary of the question: 2 SQL statements produce the same results, but at different speeds. One statement uses JOIN, the other uses IN. JOIN is faster than IN.]
I tried 2 kinds of SELECT statement on 2 tables, named booking_record and inclusions. The table inclusions has a many-to-one relation with table booking_record.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Second statement: (using JOIN)
SELECT
id,
agent,
source
FROM
booking_record
JOIN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
) inclusions
ON
id = foreign_key_booking_record
With 300,000+ rows in the booking_record table and 6,100,000+ rows in the inclusions table, the 2nd statement delivered 127 rows in just 0.08 seconds, but the 1st statement took nearly 21 minutes for the same records.
Why is JOIN so much faster than the IN clause?
This behavior is well-documented. See here.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. What would happen is that the subquery would be run each time for every row in the outer query. Lots and lots of overhead, running the same query over and over. You could improve this by using good indexing and removing the distinct from the in subquery.
This is one of the reasons that I prefer exists instead of in, if you care about performance.
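A sketch of what the EXISTS form could look like next to the IN and JOIN forms from the question, checked for equal results on toy data. It uses Python's sqlite3 module as a stand-in, so it demonstrates equivalence only and none of the timing behaviour described above. The table definitions are assumptions (the question omits them), the aliases b and i are added for clarity, and `<> 0` stands in for `<> FALSE`.

```python
import sqlite3

# SQLite stands in for MySQL; the schema below is guessed, not from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE booking_record (id INTEGER PRIMARY KEY, agent TEXT, source TEXT);
    CREATE TABLE inclusions (id INTEGER PRIMARY KEY,
                             foreign_key_booking_record INTEGER,
                             foreign_key_bill INTEGER,
                             invoice_closure INTEGER);
    INSERT INTO booking_record VALUES (1, 'a1', 'web'), (2, 'a2', 'phone'), (3, 'a3', 'web');
    INSERT INTO inclusions VALUES
        (1, 1, NULL, 1),  -- qualifies
        (2, 1, NULL, 1),  -- qualifies again for the same booking
        (3, 2, 7,    1),  -- already billed
        (4, 3, NULL, 0);  -- closure flag is false
""")

FILTER = "foreign_key_bill IS NULL AND invoice_closure <> 0"

in_rows = conn.execute(f"""
    SELECT id, agent, source
      FROM booking_record
     WHERE id IN (SELECT foreign_key_booking_record FROM inclusions WHERE {FILTER})
     ORDER BY id
""").fetchall()

join_rows = conn.execute(f"""
    SELECT b.id, b.agent, b.source
      FROM booking_record b
      JOIN (SELECT DISTINCT foreign_key_booking_record
              FROM inclusions WHERE {FILTER}) i
        ON b.id = i.foreign_key_booking_record
     ORDER BY b.id
""").fetchall()

# EXISTS: a correlated subquery that only has to find one matching row.
exists_rows = conn.execute("""
    SELECT b.id, b.agent, b.source
      FROM booking_record b
     WHERE EXISTS (SELECT 1
                     FROM inclusions i
                    WHERE i.foreign_key_booking_record = b.id
                      AND i.foreign_key_bill IS NULL
                      AND i.invoice_closure <> 0)
     ORDER BY b.id
""").fetchall()

print(in_rows == join_rows == exists_rows)
```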
EXPLAIN should give you some clues (see the MySQL EXPLAIN syntax documentation).
I suspect that the IN version is constructing a list which is then scanned by each item (IN is generally considered a very inefficient construct, I only use it if I have a short list of items to manually enter).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.
You should explore this by using EXPLAIN, as said by Ollie.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
booking_record.id = inclusions.foreign_key_booking_record -- new filter
AND
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)

1k entries query with multiple JOIN's takes up to 10 seconds

Here's a simplified version of the structure (left out some regular varchar cols):
CREATE TABLE `car` (
`reg_plate` varchar(16) NOT NULL default '',
`type` text NOT NULL,
`client` int(11) default NULL,
PRIMARY KEY (`reg_plate`)
)
And here's the query I'm trying to run:
SELECT * FROM (
SELECT
car.*,
tire.id as tire,
client.name as client_name
FROM
car
LEFT JOIN client ON car.client = client.id
LEFT JOIN tire ON tire.reg_plate = car.reg_plate
GROUP BY car.reg_plate
) t1
The nested query is necessary due to the framework sometimes adding WHERE / SORT clauses (which assume there are columns named client_name or tire).
Both the car and the tire tables have approx. 1.5K entries, and client has no more than 500, yet for some reason the query still takes up to 10 seconds to complete (worse, the framework runs it twice: first to check how many rows there are, then to actually limit to the requested page).
I'm getting a feeling that this query is very inefficient, I just don't know how to optimize it.
Thanks in advance.
First, read up on MySQL's EXPLAIN syntax.
You probably need indexes on every column in the join clauses, and on every column that your framework uses in WHERE and SORT clauses. Sometimes multi-column indexes are better than single-column indexes.
Your framework probably doesn't require nested queries. Unnesting and creating a view or passing parameters to a stored procedure might give you better performance.
For better suggestions on SO, always include DDL and sample data (as INSERT statements) in your questions. You should probably include EXPLAIN output on performance questions, too.

which query is better and efficient - mysql

I came across the same query written in different ways, as shown below.
Type-I
SELECT JS.JobseekerID
, JS.FirstName
, JS.LastName
, JS.Currency
, JS.AccountRegDate
, JS.LastUpdated
, JS.NoticePeriod
, JS.Availability
, C.CountryName
, S.SalaryAmount
, DD.DisciplineName
, DT.DegreeLevel
FROM Jobseekers JS
INNER
JOIN Countries C
ON JS.CountryID = C.CountryID
INNER
JOIN SalaryBracket S
ON JS.MinSalaryID = S.SalaryID
INNER
JOIN DegreeDisciplines DD
ON JS.DegreeDisciplineID = DD.DisciplineID
INNER
JOIN DegreeType DT
ON JS.DegreeTypeID = DT.DegreeTypeID
WHERE
JS.ShowCV = 'Yes'
Type-II
SELECT JS.JobseekerID
, JS.FirstName
, JS.LastName
, JS.Currency
, JS.AccountRegDate
, JS.LastUpdated
, JS.NoticePeriod
, JS.Availability
, C.CountryName
, S.SalaryAmount
, DD.DisciplineName
, DT.DegreeLevel
FROM Jobseekers JS, Countries C, SalaryBracket S, DegreeDisciplines DD
, DegreeType DT
WHERE
JS.CountryID = C.CountryID
AND JS.MinSalaryID = S.SalaryID
AND JS.DegreeDisciplineID = DD.DisciplineID
AND JS.DegreeTypeID = DT.DegreeTypeID
AND JS.ShowCV = 'Yes'
I am using a MySQL database.
Both work really well, but I am wondering:
Which is the best practice to use all the time, for any situation?
Performance-wise, which one is better? (Say the database has millions of records.)
Are there any advantages of one over the other?
Is there any tool where I can check which query is better?
Thanks in advance
1- It's a no-brainer: use Type-I.
2- Type-II joins are also called 'implicit joins', whereas Type-I joins are called 'explicit joins'. With a modern DBMS, you will not have any performance problem with a normal query. But I think that with a big, complex multi-join query, the DBMS could have trouble with implicit joins. Using only explicit joins could improve your execution plan, and so give faster results!
3- So performance could be an issue, but maybe more importantly, readability is improved for later maintenance. An explicit join states exactly what you want to join and on which field, whereas an implicit join doesn't show whether a condition is a join or a filter. The WHERE clause is for filtering, not for joining!
And a big, big point for explicit joins: outer joins are really annoying to write as implicit joins. A query with multiple joins, some of them outer, is so hard to read that explicit joins are THE solution.
4- Execution plans are what you need (see the doc).
Some duplicates :
Explicit vs implicit SQL joins
SQL join: where clause vs. on clause
INNER JOIN ON vs WHERE clause
In most code I've seen, those queries are written like your Type-II, but I think Type-I is better because of readability (and it is more logical: a join is a join, so you should write it as a join, although the second one is just another writing style for inner joins).
In performance, there shouldn't be a difference (if there is one, I think Type-I would be a bit faster).
Look at "Explain"-syntax
http://dev.mysql.com/doc/refman/5.1/en/explain.html
My suggestion.
Populate all your tables with a reasonable number of records. Open the MySQL console and run both SQL commands one by one. You can see the execution time of each in the console.
For the two queries you mentioned (each with only inner joins) any modern database's query optimizer should produce exactly the same query plan, and thus the same performance.
For MySQL, if you prefix the query with EXPLAIN, it will spit out information about the query plan (instead of running the query). If the information from both queries is the same, then the query plan is the same, and the performance will be identical. From the MySQL Reference Manual:
EXPLAIN returns a row of information for each table used in the SELECT statement. The tables are listed in the output in the order that MySQL would read them while processing the query. MySQL resolves all joins using a nested-loop join method. This means that MySQL reads a row from the first table, and then finds a matching row in the second table, the third table, and so on. When all tables are processed, MySQL outputs the selected columns and backtracks through the table list until a table is found for which there are more matching rows. The next row is read from this table and the process continues with the next table.
When the EXTENDED keyword is used, EXPLAIN produces extra information that can be viewed by issuing a SHOW WARNINGS statement following the EXPLAIN statement. This information displays how the optimizer qualifies table and column names in the SELECT statement, what the SELECT looks like after the application of rewriting and optimization rules, and possibly other notes about the optimization process.
As to which syntax is better? That's up to you, but once you move beyond inner joins to outer joins, you'll need to use the newer syntax, since there's no standard for describing outer joins using the older implicit join syntax.
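As a quick illustration that the two syntaxes describe the same inner join, here is a sketch that runs a cut-down version of the Type-I and Type-II queries on toy data and compares the rows. It uses Python's sqlite3 module in place of MySQL and keeps only two of the five tables, with made-up rows, so it is a check of equivalence and not a benchmark.

```python
import sqlite3

# SQLite stands in for MySQL; tables are reduced to the columns used here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Countries  (CountryID INTEGER PRIMARY KEY, CountryName TEXT);
    CREATE TABLE Jobseekers (JobseekerID INTEGER PRIMARY KEY, FirstName TEXT,
                             CountryID INTEGER, ShowCV TEXT);
    -- hypothetical sample rows
    INSERT INTO Countries  VALUES (1, 'Norway'), (2, 'Peru');
    INSERT INTO Jobseekers VALUES (10, 'Kari', 1, 'Yes'),
                                  (11, 'Luis', 2, 'No'),
                                  (12, 'Ola',  1, 'Yes');
""")

# Type-I: explicit join, join condition in ON, filter in WHERE.
explicit = conn.execute("""
    SELECT JS.JobseekerID, JS.FirstName, C.CountryName
      FROM Jobseekers JS
     INNER JOIN Countries C ON JS.CountryID = C.CountryID
     WHERE JS.ShowCV = 'Yes'
     ORDER BY JS.JobseekerID
""").fetchall()

# Type-II: implicit join, join condition and filter both in WHERE.
implicit = conn.execute("""
    SELECT JS.JobseekerID, JS.FirstName, C.CountryName
      FROM Jobseekers JS, Countries C
     WHERE JS.CountryID = C.CountryID
       AND JS.ShowCV = 'Yes'
     ORDER BY JS.JobseekerID
""").fetchall()

print(explicit == implicit)
```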