I'm working on a project involving joins between datasets and we have a requirement to allow previews of arbitrary joins between arbitrary datasets. Which is crazy, but thats why its fun. This is use facing so given a join I want to show ~10 rows of results quickly.
I've been basing my experimentation around different ways to sub-sample the different tables in such a way that I get at least a few result rows but keep the samples small enough that the join is fast and not cause the sampling to be expensive.
Here are the methods I've found pass the smell test. I would like to know a few things about them:
What types of joins or datasets would these fail at?
How could I identify those datasets?
If both of these are bad at the same thing, how could they be improved?
Is there a type of sampling I have not put here that is better?
Subselect with a limit.
Takes a random sample of one dataset to reduce the overall size.
SELECT col1, col2 FROM table1 JOIN
(SELECT col1, col2 FROM table2 LIMIT #) AS sample2
on table1.col1 = sample2.col1
LIMIT 10;
I like this because its easy and there is potential in the future to be smart about which table to samples from. It is also possible to select a portion where table1.col1 never equals sample2.col1 so no results are returned.
Find equals values of col1 and Sample them
More complicated, multi-query approach. Here I would do a distinct select of the columns to join on, compare the results to find common values and then do a subselect limiting the results to the common values.
SELECT DISTINCT col1 FROM table1;
SELECT DISTINCT col1 FROM table2;
commonVals = intersection of above results
SELECT col1, col2 FROM table1 JOIN
(SELECT col1, col2 FROM table2 WHERE col1 IN(commonVals) LIMIT #) as sample2
on table1.col1 = sample2.col1
LIMIT 10;
This gets us a good sample of table2, but the select distinct query may be more expensive than the join. I believe there may be a way to determine if this method is faster if you knew something about how long the distinct cals would take but at this point we don't have that much knowledge of the datasets.
Slap a LIMIT on the join
This is the easiest and the one I'm leaning towards.
SELECT col1, col1 FROM table1 join table2 on table1.col1 = table2.col1 LIMIT #
Assuming the join is good, this will always return data and for at least a large set of cases it will do it fast.
The problem with the first approach is that the rows in the first table might not have a match in the second table. Remember, inner joins not only do matching, they also do filtering.
The second approach could work, if all the columns used for joining have indexes on them. You can then get a list of matching ids by doing something like:
where id in (select id from table1) and id in (select id from table2) . . .
This gets rid of the initial code and should be pretty fast.
The third method is using the capabilities of the database most directly. You would be depending on the ability of MySQL to optimize according to the size of the result set. This is something that it does, at least in theory.
I would strongly recommend the third approach in conjunction with indexes on the columns used in the joins. This requires minimal changes to the query (just add a limit clause). It allows the database to pursue additional optimizations, if appropriate. It works on a more general set of queries.
Related
I have 3 tables in mySQL => table1, table2 and table3 and the data in all three tables is large (>100k)
My join condition is :
select * from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
This query renders result very slow and according to me the issue is in the second join condition as if I remove the second condition, the query renders result instantly.
Can anyone please explain the reason of the query being slow?
Thanks in advance.
Do you have these indexes?
table2: (col1)
table3: (col2, col3) -- in either order
Another tip: Don't use * (as in SELECT *) unless you really need all the columns. It prevents certain optimizations. If you want to discuss this further, please provide the real query and SHOW CREATE TABLE for each table.
If any of the columns used for joining are not the same datatype, character set, and collation, then indexes may not be useful.
Please provide EXPLAIN SELECT ...; it will give some clues we can discuss.
How many rows in the resultset? Sounds like over 100K? If so, then perhaps the network transfer time is the real slowdown?
Since the second join is over both tables (two joins) it creates more checks on evaluation. This is creating a triangle rather than a long joined line.
Also, since all three tables have ~100K lines, even with clustered index on the given columns, it's bound to have a performance hit, also due to all columns being retrieved.
At least, have the select statement as T1.col1, T1.col2...,T2.col1... and so on.
Also have distinct indexes on all columns used in join condition.
More so, do you really want a huge join without a where clause? Try adding restrictive conditions for each table and see the magic as it first filters out the available set of results from each table (100k may become 10k) and then the join is attempted.
Also check SQL Profiler output to see if a TABLE SCAN is being used (most probably yes), if so, having an INDEX SCAN should improve the situation.
I have about 20 tables. These tables have only id (primary key) and description (varchar). The data is a lot reaching about 400 rows for one table.
Right now I have to get data of at least 15 tables at a time.
Right now I am calling them one by one. Which means that in one session I am giving 15 calls. This is making my process slow.
Can any one suggest any better way to get the results from the database?
I am using MySQL database and using Java Springs on server side. Will making view for all combined help me ?
The application is becoming slow because of this issue and I need a solution that will make my process faster.
It sounds like your schema isn't so great. 20 tables of id/varchar sounds like a broken EAV, which is generally considered broken to begin with. Just the same, I think a UNION query will help out. This would be the "View" to create in the database so you can just SELECT * FROM thisviewyoumade and let it worry about the hitting all the tables.
A UNION query works by having multiple SELECT stataements "Stacked" on top of one another. It's important that each SELECT statement has the same number, ordinal, and types of fields so when it stacks the results, everything matches up.
In your case, it makes sense to manufacturer an extra field so you know which table it came from. Something like the following:
SELECT 'table1' as tablename, id, col2 FROM table1
UNION ALL
SELECT 'table2', id, col2 FROM table2
UNION ALL
SELECT 'table3', id, col2 FROM table3
... and on and on
The names or aliases of the fields in the first SELECT statement are the field names that are used in the result set that is returned, so no worries about doing a bunch AS blahblahblah in subsequent SELECT statements.
The real question is whether this union query will perform faster than 15 individual calls on such a tiny tiny tiny amount of data. I think the better option would be to change your schema so this stuff is already stored in one table just like this UNION query outputs. Then you would need a single select statement against a single table. And 400x20=8000 is still a dinky little table to query.
To get a row of all descriptions into app code in a single roundtrip send a query kind of
select t1.description, ... t15.description
from t -- this should contain all needed ids
join table1 t1 on t1.id = t.t1id
...
join table1 t15 on t15.id = t.t15id
I cannot get you what you really need but here merging all those table values into single table
CREATE TABLE table_name AS (
SELECT *
FROM table1 t1
LEFT JOIN table2 t2 ON t1.ID=t2.ID AND
...
LEFT JOIN tableN tN ON tN-1.ID=tN.ID
)
I have big DB. It's about 1 mln strings. I need to do something like this:
select * from t1 WHERE id1 NOT IN (SELECT id2 FROM t2)
But it works very slow. I know that I can do it using "JOIN" syntax, but I can't understand how.
Try this way:
select *
from t1
left join t2 on t1.id1 = t2.id
where t2.id is null
First of all you should optimize your indexes in both tables, and after that you should use join
There are different ways a dbms can deal with this task:
It can select id2 from t2 and then select all t1 where id1 is not in that set. You suggest this using the IN clause.
It can select record by record from t1 and look for each record if it finds a match in t2. You would suggest this using the EXISTS clause.
You can outer join the table then throw away all matches and stay with the non-matching entries. This may look like a bad way, especially when there are many matches, because you would get big intermediate data and then throw most of it away. However, depending on how the dbms works, it can be rather fast, for example when it applies hash join techniques.
It all depends on table sizes, number of matches, indexes, etc. and on what the dbms makes of your query. There are dbms that are able to completely re-write your query to find the best execution plan.
Having said all this, you can just try different things:
the IN clause with (SELECT DISTINCT id2 FROM t2). DISTINCT can reduce the intermediate result significantly and really speed up your query. (But maybe your dbms does that anyhow to get a good execution plan.)
use an EXISTS clause and see if that is faster
the outer join suggested by Parado
I have multiple queries (from different section of my site) i am executing
Some are like this:
SELECT field, field1
FROM table1, table2
WHERE table1.id = table2.id
AND ....
and some are like this:
SELECT field, field1
FROM table1
JOIN table2
USING (id)
WHERE ...
AND ....
and some are like this:
SELECT field, field1
FROM table1
LEFT JOIN table2
ON (table1.id = table2.id)
WHERE ...
AND ....
Which of these queries is better, or slower/faster or more standard?
The first two queries are equivalent; in the MySql world the using keyword is (well, almost - see the documentation but using is part of the Sql2003 spec and there are some differences in NULL values) the same as saying field1.id = field2.id
You could easily write them as:
SELECT field1, field2
FROM table1
INNER JOIN table2 ON (table1.id = table2.id)
The third query is a LEFT JOIN. This will select all the matching rows in both tables, and will also return all the rows in table1 that have no matches in table2. For these rows, the columns in table2 will be represented by NULL values.
I like Jeff Atwood's visual explanation of these
Now, on to what is better or worse. The answer is, it depends. They are for different things. If there are more rows in table1 than table2, then a left join will return more rows than an inner join. But the performance of the queries will be effected by many factors, like table size, the types of the column, what the database is doing at the same time.
Your first concern should be to use the query you need to get the data out. You might honestly want to know what rows in table1 have no match in table2; in this case you'd use a LEFT JOIN. Or you might only want rows that match - the INNER JOIN.
As Krister points out, you can use the EXPLAIN keyword to tell you how the database will execute each kind of query. This is very useful when trying to figure out just why a query is slow, as you can see where the database spends all of its time.
personally, i prefer using left joins in my queries, though you can run into issues in the case of null records or duplicates, but that can be resolved with a simple modification with an outer clause. it's my understanding that a join is a bit more resource intensive, but this is up for debate and might be based on personal preference.
just my $.02.
The third example, using ON (field1=field2) is the more common, and seems to be the more commonly accepted standard.
I don't know about the performance difference, you would have to run some EXPLAIN queries to see what MySQL actually ends up doing with them all really.
I do know though that the first, with WHERE being used to join them all, is much less readable on anything other than trivial queries. Once you have some complex conditions in a query, it's confusing to have "join conditions" all muddled in with "selection conditions".
What's the difference in a clause done the two following ways?
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2 AND
table2.member_id = 4
)
I've compared them both with basic queries and EXPLAIN EXTENDED and don't see a difference. I'm wondering if someone here has discovered a difference in a more complex/processing intensive envornment.
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2
)
WHERE table2.member_id = 4
With an INNER join the two approaches give identical results and should produce the same query plan.
However there is a semantic difference between a JOIN (which describes a relationship between two tables) and a WHERE clause (which removes rows from the result set). This semantic difference should tell you which one to use. While it makes no difference to the result or to the performance, choosing the right syntax will help other readers of your code understand it more quickly.
Note that there can be a difference if you use an outer join instead of an inner join. For example, if you change INNER to LEFT and the join condition fails you would still get a row if you used the first method but it would be filtered away if you used the second method (because NULL is not equal to 4).
If you are trying to optimize and know your data, by adding the clause "STRAIGHT_JOIN" can tremendously improve performance. You have an inner join ON... So, just to confirm, you want only records where table1 and table2 are joined, but only for table 2 member ID = some value.. in this case 4.
I would change the query to have table 2 as the primary table of the select as it has an explicit "member_id" that could be optimized by an index to limit rows, then joining to table 1 like
select STRAIGHT_JOIN
t1.*
from
table2 t2,
table1 t1
where
t2.member_id = 4
and t2.col1 = t1.col2
So the query would pre-qualify only the member_id = 4 records, then match between table 1 and 2. So if table 2 had 50,000 records and table 1 had 400,000 records, having table2 listed first will be processed first. Limiting the ID = 4 even less, and even less when joined to table1.
I know for a fact the straight_join works as I've implemented it many times dealing with gov't data of 14+ million records linking to over 15 lookup tables where the engine got confused trying to think for me on the critical table. One such query was taking 24+ hours before hanging... Adding the "STRAIGHT_JOIN" and prioritizing what the "primary" table was in the query dropped it to a final correct result set in under 2 hours.
There's not really much of a difference in the situation you describe; in a situation with multiple complex joins, my understanding is that the first is somewhat preferential, as it will reduce the complexity somewhat; that said, it's going to be a small difference. Overall, you shouldn't notice much of a difference in most if not all situations.
With an inner join, it makes almost* no difference; if you switch to outer join, all the difference in the world.
*I say "almost" because optimizers are quirky beasts and it isn't impossible that under some circumstances, it might do a better job optimizing the former or the latter. Do not attempt to take advantage of this behavior.