I am reviewing a few queries for performance and made a change to one of them, based on the following examples. The change turned a 6 minute query into one that completes in a few seconds, and I was wondering why. How has this altered things to such an extent?
In the example, please assume the BOOK table to contain the general details for all books in a library and the FORMATS table contains details, such as HARDBACK, PAPERBACK and eBOOK (allowing for new formats to be added) where there is a key (called FORMATID) linking the two tables.
Query executes in 6 minutes
select b.bookid, f.formatname
from book b
inner join formats f on f.formatid = b.formatid
select b.bookid, f.formatname
from book b
left join formats f on f.formatid = b.formatid
Query executes in 12 seconds
select b.bookid, (select f.formatname from formats f where f.formatid = b.formatid)
from book b
where b.formatid is not null
select b.bookid, (select f.formatname from formats f where f.formatid = b.formatid)
from book b
In the above, the first query of each pair achieves INNER JOIN results and the second achieves LEFT JOIN results. The row counts on my database are 295166 and 295376 respectively; the time differences remain pretty much the same.
[added] For confirmation; I have tested this (with the same results) by creating the two test tables mentioned herein, populating the BOOKS table with ~1 million rows and NOT applying any index or other optimisation.
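For anyone who wants to reproduce the comparison on a small scale, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the question; the sample rows are invented, and SQLite's planner is not the engine the timings above came from, so this only checks that each pair of queries returns the same rows and says nothing about speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE formats (formatid INTEGER PRIMARY KEY, formatname TEXT);
    CREATE TABLE book (bookid INTEGER PRIMARY KEY, formatid INTEGER);
    INSERT INTO formats VALUES (1, 'HARDBACK'), (2, 'PAPERBACK'), (3, 'eBOOK');
    -- Invented data: book 4 has no format, so only the LEFT JOIN variants return it.
    INSERT INTO book VALUES (1, 1), (2, 2), (3, 3), (4, NULL);
""")

def rows(sql):
    # bookid is unique, so sorting never has to compare the nullable column.
    return sorted(conn.execute(sql).fetchall())

inner_join = rows("""
    SELECT b.bookid, f.formatname
    FROM book b
    INNER JOIN formats f ON f.formatid = b.formatid
""")
left_join = rows("""
    SELECT b.bookid, f.formatname
    FROM book b
    LEFT JOIN formats f ON f.formatid = b.formatid
""")
# The scalar-subquery forms. The first one matches the INNER JOIN only when
# every non-NULL book.formatid has a row in formats (assumed here).
sub_inner = rows("""
    SELECT b.bookid,
           (SELECT f.formatname FROM formats f WHERE f.formatid = b.formatid)
    FROM book b
    WHERE b.formatid IS NOT NULL
""")
sub_left = rows("""
    SELECT b.bookid,
           (SELECT f.formatname FROM formats f WHERE f.formatid = b.formatid)
    FROM book b
""")

print(len(inner_join), len(left_join))
```

On the real database, comparing the execution plans of the two forms is probably the quickest way to see where the 6 minutes go.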
Related
I have a query to get data of friends of user. I have 3 tables, one is user table, second is a user_friend table which has user_id and friend_id (both are foreign key to user table) and 3rd table is feed table which has user_id and feed content. Feed can be shown to friends. I can query in two ways either by join or by using IN clause (I can get all the friends' ids by graph database which I am using for networking).
Here are two queries:
SELECT
a.*
FROM feed a
INNER JOIN user_friend b ON a.user_id = b.friend_id
WHERE b.user_id = 1;
In this query I get friend ids from graph database and will pass to this query:
SELECT
a.*
FROM feed a
WHERE a.user_id IN (2,3,4,5)
Which query runs faster and is better for performance when I have millions of records?
With suitable indexes, a one-query JOIN (Choice 1) will almost always run faster than a 2-query (Choice 2) algorithm.
To optimize Choice 1, b needs this composite index: INDEX(user_id, friend_id). Also, a needs an index (presumably the PRIMARY KEY?) starting with user_id.
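As a rough illustration of the two choices and the indexes mentioned above, here is a sketch using Python's sqlite3 module. The column list of feed, the index names and the sample rows are assumptions made for the example, and the index on feed is a plain secondary index rather than a primary key starting with user_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_friend (user_id INTEGER, friend_id INTEGER);
    CREATE TABLE feed (id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT);
    -- Composite index suggested for b, plus an index on a starting with user_id.
    CREATE INDEX idx_user_friend ON user_friend (user_id, friend_id);
    CREATE INDEX idx_feed_user ON feed (user_id);
    INSERT INTO user_friend VALUES (1, 2), (1, 3), (1, 4), (1, 5), (2, 1);
    INSERT INTO feed (user_id, content) VALUES
        (2, 'post from 2'), (3, 'post from 3'), (6, 'post from 6'), (1, 'post from 1');
""")

# Choice 1: a single query that joins feed to the friend list.
join_rows = conn.execute("""
    SELECT a.*
    FROM feed a
    INNER JOIN user_friend b ON a.user_id = b.friend_id
    WHERE b.user_id = ?
    ORDER BY a.id
""", (1,)).fetchall()

# Choice 2: friend ids fetched elsewhere (hard-coded here) and passed to IN.
friend_ids = [2, 3, 4, 5]
placeholders = ", ".join("?" for _ in friend_ids)
in_rows = conn.execute(
    f"SELECT a.* FROM feed a WHERE a.user_id IN ({placeholders}) ORDER BY a.id",
    friend_ids,
).fetchall()

print(join_rows == in_rows)
```

Both forms return the same feed rows; the difference is that Choice 2 needs an extra round trip to obtain the ids and an IN list that grows with the number of friends.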
This depends on how much data is being compared. When the subquery or the list of values is large, a join is much preferred, because subqueries can be slower than an equivalent LEFT [OUTER] JOIN or INNER JOIN, although in my opinion subqueries are slightly more readable.
If you only have a small amount of data to compare, there is little reason to join against a complete table, so the choice depends on how much data you have.
In my opinion, a small number of values in IN is fine, but if you have a subquery or a large amount of data then you should go for a join.
These are my tables:
Source_Artikelen - columns: article - description (1.438.171 records)
Source_LevArt - columns: article - manufacturer part number (1.751.801 records)
... and this is the query I'm performing
SELECT a.Artikel,a.Omschrijving, l.Artikel_Leverancier
FROM Source_Artikelen AS a
LEFT OUTER JOIN Source_LevArt AS l
ON a.Artikel Like l.Artikel
This query was running tonight for more than 20 hours before I cancelled it manually.
So what am I trying to do?
I want to list down all articles from my table Source_Artikelen. Then I would like to see if there are manufacturer part numbers available in Source_LevArt.
not every article from Source_Artikelen is present in Source_LevArt
sometimes there are multiple manufacturer part numbers in Source_LevArt for one article
That's why I need to use a LEFT OUTER JOIN.
I've tried some things with indexes, but it's not really helping. Possibly I'm doing something wrong.
I can really use some help, as this is only the beginning of the query I'm writing.
I will have to add 2 other (large) tables as left outer joins later...
UPDATE 19/12/2016 16:24:
Hi piet.t
SELECT TOP(20) a.Artikel,a.Omschrijving, l.Artikel_Leverancier
FROM Source_Artikelen AS a
LEFT JOIN Source_LevArt AS l
ON a.Artikel LIKE l.Artikel
this takes 9 seconds
SELECT TOP(20) a.Artikel,a.Omschrijving, l.Artikel_Leverancier
FROM Source_Artikelen AS a
LEFT JOIN Source_LevArt AS l
ON a.Artikel = l.Artikel
this takes 1 second!
I really didn't know there was a difference as I'm not using wildcards.
This is covered by Paul White here: Dynamic Seeks and Hidden Implicit Conversions
Using LIKE, even when there is an exact match, tends to produce a dynamic seek, which means the range to seek on is worked out at execution time, not at compilation time.
Below is how the seek column is derived for the tables in my example further down:
[Expr1005] = Scalar Operator(CONVERT_IMPLICIT(varchar(12),[Aegon_X].[Sales].[Orders].[custid] as [o].[custid],0)),
[Expr1006] = Scalar Operator(LikeRangeStart(CONVERT_IMPLICIT(varchar(12),[Aegon_X].[Sales].[Orders].[custid] as [o].[custid],0))),
[Expr1007] = Scalar Operator(LikeRangeEnd(CONVERT_IMPLICIT(varchar(12),[Aegon_X].[Sales].[Orders].[custid] as [o].[custid],0))),
[Expr1008] = Scalar Operator(LikeRangeInfo(CONVERT_IMPLICIT(varchar(12),[Aegon_X].[Sales].[Orders].[custid] as [o].[custid],0)))
Below is Paul's description of how those are derived:
The upper tooltip shows that the Compute Scalar uses three internal functions, LikeRangeStart, LikeRangeEnd, and LikeRangeInfo.
The first two functions describe the range as an open interval. The third function returns a set of flags encoded in an integer, that are used internally to define certain seek properties for the Storage Engine. The lower tooltip shows the seek on the open interval described by the result of LikeRangeStart and LikeRangeEnd, and the application of the residual predicate ‘LIKE #Like’.
So in summary, when LIKE is used, SQL Server uses a dynamic seek and derives the seek properties at execution time rather than at compile time.
Examples below showing different plans
Using LIKE:
select top 10* from sales.orders o
join
sales.customers c
on c.custid like o.custid
plan:
Now using an exact match:
select top 10* from sales.orders o
join
sales.customers c
on c.custid =o.custid
Here you can see a merge join plan.
Use = instead of like.
These 2 indexes should give you the best performance for a Select.
CREATE INDEX idx ON Source_Artikelen(Artikel) INCLUDE(Omschrijving);
CREATE INDEX idx ON Source_LevArt(Artikel) INCLUDE(Artikel_Leverancier);
If you implement them and try your SELECT again, can you please upload a copy of your execution plan?
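To see the LIKE versus = difference in a form that can be run anywhere, here is a small sketch using Python's sqlite3 module. It is only an analogy: SQLite's planner is not SQL Server's and has no dynamic seek, SQLite has no INCLUDE clause (so the extra column is appended to the index key), and the index names and sample rows are invented. It does show the same general effect, namely that the equality join can seek into the index while the LIKE join falls back to scanning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source_Artikelen (Artikel TEXT, Omschrijving TEXT);
    CREATE TABLE Source_LevArt (Artikel TEXT, Artikel_Leverancier TEXT);
    -- Stand-ins for the two suggested indexes (no INCLUDE in SQLite).
    CREATE INDEX idx_artikelen ON Source_Artikelen (Artikel, Omschrijving);
    CREATE INDEX idx_levart ON Source_LevArt (Artikel, Artikel_Leverancier);
    INSERT INTO Source_Artikelen VALUES
        ('A100', 'bolt'), ('A200', 'nut'), ('A300', 'washer');
    INSERT INTO Source_LevArt VALUES
        ('A100', 'MPN-1'), ('A100', 'MPN-2'), ('A200', 'MPN-3');
""")

QUERY = """
    SELECT a.Artikel, a.Omschrijving, l.Artikel_Leverancier
    FROM Source_Artikelen AS a
    LEFT JOIN Source_LevArt AS l ON a.Artikel {op} l.Artikel
"""

def run(op):
    # key=repr lets rows containing NULL (None) be sorted safely.
    return sorted(conn.execute(QUERY.format(op=op)).fetchall(), key=repr)

def plan(op):
    # The last column of each EXPLAIN QUERY PLAN row is the readable plan step.
    sql = "EXPLAIN QUERY PLAN " + QUERY.format(op=op)
    return [row[-1] for row in conn.execute(sql)]

eq_rows, like_rows = run("="), run("LIKE")
eq_plan, like_plan = plan("="), plan("LIKE")

print(eq_plan)    # expected to contain a SEARCH step on the index
print(like_plan)  # expected to contain only SCAN steps
```

With no wildcards in the data both joins return the same rows, which is why the slowdown is easy to miss.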
I have 3 tables: two of them have 200 000 records and the third has 1 800 000 records. I merge these 3 tables using 2 constraints, OCN and TIMESTAMP (month, year). The first two tables hold month and year in a single column, Monthx (which includes day, month and year), and the other table has separate columns for month and year. I wrote the query as:
mysql--> insert into trail
select * from A,B,C
where A.OCN=B.OCN
and B.OCN=C.OCN
and C.OCN=A.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(A.Monthx)=year(B.Monthx)
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
I started this query 4 days ago and it is still running. Could you tell me whether this query is correct or wrong, and provide me with the exact query? (I used '%b' because my C table has a column which holds months in the form JAN, MAR.)
Please don't use implicit WHERE joins; bury them in 1989, where they belong. Use explicit joins instead:
select * from a inner join b on (a.ocn = b.ocn and
date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b') ....
This select part of the query (had to rewrite it because I refuse to deal with '89 syntax)
select * from A
inner join B on (
A.OCN=B.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and year(A.Monthx)=year(B.Monthx)
)
inner join C on (
C.OCN=A.OCN
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
)
has a lot of problems:
Using a function on a field kills any opportunity to use an index on that field.
You are doing a lot of duplicate tests: if (A = B) and (B = C), then it logically follows that (A = C).
The translations of the date fields take a lot of time.
I would suggest you rewrite your tables to use fields that don't need translating (using functions), but can be compared directly.
A field like yearmonth : char(6) e.g. 201006 can be indexed and compared much faster.
If the tables A, B and C each have a field called ym for short, then your query can be:
INSERT INTO TRAIL
SELECT a.*, b.*, c.* FROM a
INNER JOIN b ON (
a.ocn = b.ocn
AND a.ym = b.ym
)
INNER JOIN c ON (
a.ocn = c.ocn
AND a.ym = c.ym
);
If you put indexes on ocn (primary index probably) and ym the query should run about a million rows a second (or more).
To test if your query is OK, import a small subset of records from A, B and C into a temporary database and test it there.
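A minimal sketch of the ym idea, run on a tiny subset as suggested, using Python's sqlite3 module. The sample rows, the lower-case column names, the index names and the month-name mapping are assumptions for the example; in MySQL the ym value would be derived with DATE_FORMAT rather than SQLite's strftime.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (ocn INTEGER, monthx TEXT, ym TEXT);
    CREATE TABLE b (ocn INTEGER, monthx TEXT, ym TEXT);
    CREATE TABLE c (ocn INTEGER, imonth TEXT, iyear INTEGER, ym TEXT);
    INSERT INTO a (ocn, monthx) VALUES (1, '2010-06-15'), (2, '2010-07-01');
    INSERT INTO b (ocn, monthx) VALUES (1, '2010-06-30'), (2, '2010-08-01');
    INSERT INTO c (ocn, imonth, iyear) VALUES (1, 'JUN', 2010), (2, 'JUL', 2010);

    -- Fill ym once, so the join itself needs no function calls.
    UPDATE a SET ym = strftime('%Y%m', monthx);
    UPDATE b SET ym = strftime('%Y%m', monthx);
    UPDATE c SET ym = iyear || CASE imonth
        WHEN 'JAN' THEN '01' WHEN 'FEB' THEN '02' WHEN 'MAR' THEN '03'
        WHEN 'APR' THEN '04' WHEN 'MAY' THEN '05' WHEN 'JUN' THEN '06'
        WHEN 'JUL' THEN '07' WHEN 'AUG' THEN '08' WHEN 'SEP' THEN '09'
        WHEN 'OCT' THEN '10' WHEN 'NOV' THEN '11' WHEN 'DEC' THEN '12'
    END;

    -- Plain columns can be indexed and compared directly.
    CREATE INDEX idx_a ON a (ocn, ym);
    CREATE INDEX idx_b ON b (ocn, ym);
    CREATE INDEX idx_c ON c (ocn, ym);
""")

matched = conn.execute("""
    SELECT a.ocn, a.ym
    FROM a
    INNER JOIN b ON a.ocn = b.ocn AND a.ym = b.ym
    INNER JOIN c ON a.ocn = c.ocn AND a.ym = c.ym
    ORDER BY a.ocn
""").fetchall()

print(matched)
```

Only OCN 1 survives the join here, because for OCN 2 tables a and b fall in different months.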
You have redundancies in your implicit JOIN because you are joining A.OCN with B.OCN, B.OCN with C.OCN and then C.OCN with A.OCN; one of those can be deleted. If A.OCN = B.OCN and B.OCN = C.OCN, then A.OCN = C.OCN is implied. Further, I guess you have redundancies in your date comparisons.
Hey All.
I have a bunch of tables that have some common fields tying them together, but I can't figure out the right way to dump them in a meaningful way.
Basically, users will be given two tests, and each test may be taken multiple times.
The main table stores information about the user and the test, similar to the below (we'll call this table MAIN):
user_id  test  iteration  completion_time
1        1     1          1:00
1        2     1          1:30
1        1     2          0:49
1        2     2          1:30
Each test page then has its own table to store the answers provided, since some pages have a ton of questions. We'll call this one sample table RESULTS, but there are many tables like this that are basically the same.
user_id  test  iteration  q1  q2  q3
1        1     1          A   B   A
1        2     1          B   B   A
1        1     2          A   B   B
1        2     2          A   B   B
These results tables (again, there are many) basically store the results, plus just enough information to accurately tie the results together across all tables. I set it up this way because to use just one table for the results would have left me with a table with several hundred columns, which I had read was not recommended.
So the problem here is I can't figure out how best to tie together these tables and get the results out. I've read up on joins and unions and neither one seems right, as far as I can tell, because I need to pull data from ~10-15 tables at once.
I can do a huge complex select -- something along the lines of
select m.*, a.*, b.*, c.* from main m, results_a a, results_b b, results_c c where a.user_id=m.user_id and b.user_id=m.user_id and c.user_id=m.user_id
and that works, but there's got to be a better way. Keep in mind that I've only given 3 results tables in this example -- in my actual application, it's going to be more like 15-20 tables of results.
Beyond being really complicated, it returns duplicates of some rows, and if I want to throw in any extra logic (let's say, for example, that I want the same data I queried before, but only for test 2) it gets even more complicated. And let's not even talk about sorting.
From what I've read, JOIN is for 2 tables, and UNION combines rows of results, not columns.
I don't claim to be a mysql expert, but i have looked into this before posting. I feel like I must be close to the right answer, but just not quite hitting on it.
Can anyone give me some guidance?
To use inner joins try:
SELECT m.*, a.*, b.*, c.*
FROM main m
INNER JOIN results_a a ON a.user_id = m.user_id
INNER JOIN results_b b ON b.user_id = m.user_id
INNER JOIN results_c c ON c.user_id = m.user_id
WHERE m.user_id = x
To differentiate column names, explicitly name the column and assign an alias
SELECT m.*, a.q1 as A_Q1, a.q2 as A_Q2, b.q1 as B_Q1, b.q2 as B_Q2 ...
I think you've designed your database in a way that you're going to have to do joins. It sounds better than the alternative (large, non-normalized table). What are you trying to get out? View all tests and determine a user's best score?
So if I understand the tables you have multiple users (obviously). Each user can take multiple tests. Each test could be taken multiple times.
select m.*, a.*, b.*, c.* from main m
join results_a a
on a.user_id = m.user_id and a.test_id = m.test_id and a.iteration_id = m.iteration_id
join results_b b
on b.user_id = m.user_id and b.test_id = m.test_id and b.iteration_id = m.iteration_id
join results_c c
on c.user_id = m.user_id and c.test_id = m.test_id and c.iteration_id = m.iteration_id
That should give you one row per iteration, per test, per user. If you want a certain test for a user then add:
where m.user_id = #userid and m.test_id = #testid
and then you can look at all the iterations for a test, by a user.
If you want the latest iteration:
select ...
where m.user_id = #userid and m.test_id = #testid
order by m.iteration_id desc
limit 1
You might consider generating a view when your questions are set-up so you don't have to generate the select statement at run-time which represents a particular test.
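To make the duplicate-row problem concrete, here is a sketch using Python's sqlite3 module with the sample data from the question and a single results table. It uses the question's column names (test, iteration) rather than test_id and iteration_id, and LIMIT 1 for the latest iteration; the choice of user 1 and test 2 is just for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (user_id INTEGER, test INTEGER, iteration INTEGER,
                       completion_time TEXT);
    CREATE TABLE results_a (user_id INTEGER, test INTEGER, iteration INTEGER,
                            q1 TEXT, q2 TEXT, q3 TEXT);
    INSERT INTO main VALUES
        (1, 1, 1, '1:00'), (1, 2, 1, '1:30'), (1, 1, 2, '0:49'), (1, 2, 2, '1:30');
    INSERT INTO results_a VALUES
        (1, 1, 1, 'A', 'B', 'A'), (1, 2, 1, 'B', 'B', 'A'),
        (1, 1, 2, 'A', 'B', 'B'), (1, 2, 2, 'A', 'B', 'B');
""")

# Joining on user_id alone pairs every main row with every results row.
by_user = conn.execute("""
    SELECT m.*, a.*
    FROM main m
    JOIN results_a a ON a.user_id = m.user_id
""").fetchall()

# Joining on the full key gives one row per user, test and iteration.
by_key = conn.execute("""
    SELECT m.test, m.iteration, m.completion_time, a.q1, a.q2, a.q3
    FROM main m
    JOIN results_a a
      ON a.user_id = m.user_id AND a.test = m.test AND a.iteration = m.iteration
    ORDER BY m.test, m.iteration
""").fetchall()

# Latest iteration of one test for one user.
latest = conn.execute("""
    SELECT m.test, m.iteration, m.completion_time, a.q1, a.q2, a.q3
    FROM main m
    JOIN results_a a
      ON a.user_id = m.user_id AND a.test = m.test AND a.iteration = m.iteration
    WHERE m.user_id = ? AND m.test = ?
    ORDER BY m.iteration DESC
    LIMIT 1
""", (1, 2)).fetchone()

print(len(by_user), len(by_key), latest)
```

The user_id-only join multiplies the 4 main rows by the 4 results rows, which is where the duplicates in the original query come from; each additional results table joined that way multiplies the row count again.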
MySQL setup: step by step.
programs -> linked to --> speakers (by program_id)
At this point, it's easy for me to query all the data:
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
Nice and easy.
The trick for me is this. My speakers table is also linked to a third table, "books." So in the "speakers" table, I have "book_id" and in the "books" table, the book_id is linked to a name.
I've tried this (including a WHERE you'll notice):
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
No results.
My questions:
What am I doing wrong?
What's the most efficient way to make this query?
Basically, I want to get back all the programs data and the books data, but instead of the book_id, I need it to come back as the book name (from the 3rd table).
Thanks in advance for your help.
UPDATE:
(rather than opening a brand new question)
The left join worked for me. However, I have a new problem. Multiple books can be assigned to a single speaker.
The left join returns two rows! What do I need to add to return only a single row that still lists the two books separately?
Is there any chance that the books table doesn't have any matching rows for speakers.book_id?
Try using a left join which will still return the program/speaker combinations, even if there are no matches in books.
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
Btw, could you post the table schemas for all tables involved, and exactly what output (or reasonable representation) you'd expect to get?
Edit: Response to op author comment
You can use GROUP BY and GROUP_CONCAT to put all the books on one row.
e.g.
SELECT speakers.speaker_id,
speakers.speaker_name,
programs.program_id,
programs.program_name,
group_concat(books.book_name)
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
GROUP BY speakers.speaker_id, speakers.speaker_name, programs.program_id, programs.program_name
LIMIT 5
Note: since I don't know the exact column names, these may be off
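Here is a runnable sketch of both points, the INNER JOIN dropping speakers without a matching book and GROUP_CONCAT collapsing several books into one row, using Python's sqlite3 module (SQLite also provides group_concat). The schema is a guess: it assumes one speakers row per speaker and book pair, and all names and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE programs (program_id INTEGER PRIMARY KEY, program_name TEXT,
                           category_id INTEGER);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, book_name TEXT);
    -- Hypothetical layout: one speakers row per (speaker, book) pair.
    CREATE TABLE speakers (speaker_id INTEGER, speaker_name TEXT,
                           program_id INTEGER, book_id INTEGER);
    INSERT INTO programs VALUES (1, 'Morning Talk', 1);
    INSERT INTO books VALUES (10, 'First Book'), (11, 'Second Book');
    INSERT INTO speakers VALUES
        (1, 'Ann', 1, 10), (1, 'Ann', 1, 11), (2, 'Bob', 1, NULL);
""")

BASE = """
    FROM programs
    JOIN speakers ON programs.program_id = speakers.program_id
    {kind} JOIN books ON speakers.book_id = books.book_id
    WHERE programs.category_id = 1
"""

# INNER JOIN drops Bob, who has no book; LEFT JOIN keeps him.
inner_rows = conn.execute("SELECT *" + BASE.format(kind="INNER")).fetchall()
left_rows = conn.execute("SELECT *" + BASE.format(kind="LEFT")).fetchall()

# One row per speaker and program, with the book names concatenated.
grouped = conn.execute("""
    SELECT speakers.speaker_id, speakers.speaker_name,
           programs.program_name, group_concat(books.book_name)
""" + BASE.format(kind="LEFT") + """
    GROUP BY speakers.speaker_id, speakers.speaker_name,
             programs.program_id, programs.program_name
    ORDER BY speakers.speaker_id
""").fetchall()

print(len(inner_rows), len(left_rows), grouped)
```

The order of names inside the concatenated string is not guaranteed here; in MySQL, GROUP_CONCAT accepts an ORDER BY inside the call if a stable order matters.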
That's typically efficient. There is some kind of assumption you are making that isn't true. Do your speakers have books assigned? If they don't that last JOIN should be a LEFT JOIN.
This kind of query is typically pretty efficient, since you almost certainly have primary keys as indexes. The main issue would be whether your indexes are covering (which is more likely to occur if you don't use SELECT *, but instead select only the columns you need).