For years, I understood that when tables are joined, one row from the primary table is joined to a row in the target table after applying the conditions, i.e. the query results will be <= the rows in the primary table. But I have seen that one row in the primary table can be joined multiple times if the conditions allow. For example, the COUNT in the query below would not work without duplicating rows from the primary table:
SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;
Which produces this result
+----------------------+-------+
| name | depth |
+----------------------+-------+
| ELECTRONICS | 0 |
| TELEVISIONS | 1 |
| TUBE | 2 |
| LCD | 2 |
| PLASMA | 2 |
| PORTABLE ELECTRONICS | 1 |
| MP3 PLAYERS | 2 |
| FLASH | 3 |
| CD PLAYERS | 2 |
| 2 WAY RADIOS | 2 |
+----------------------+-------+
I know I may be asking something really basic, but how exactly are rows joined together in the simplest joins possible? Does MySQL take steps, the way a regex engine does when executing a pattern against a string?
"How" joins are implemented is actually not important. SQL is a descriptive language, not a procedural language. The query engine can decide the "how". The query is describing the "what".
The conceptual definition of an inner join is rather simple. It is the Cartesian product of two sets that meets the conditions of the on and where clauses.
Most people don't think in terms of Cartesian products. A nested loop is equivalent. The logic is something like this:
for each row1 in table1
for each row2 in table2
output row1 || row2 if the on/where conditions are true
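For example, the comma-separated FROM in the question is exactly that: every (node, parent) pair from the Cartesian product of the table with itself that satisfies the BETWEEN condition. The same query written with an explicit INNER JOIN makes this visible:
SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM nested_category AS node
INNER JOIN nested_category AS parent
    ON node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;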
Outer joins extend this concept, allowing rows from one or both tables to be in the result set even when the on/where conditions are not true.
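As a small hedged sketch (customers and orders are hypothetical tables, not from the question), a LEFT JOIN keeps unmatched rows from the left table and pads the right side with NULLs:
SELECT c.name, o.id AS order_id
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id;
-- customers with no matching order still appear once, with order_id = NULL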
There is no concept whatsoever that "query results will be <= rows in the primary table." With some data structures -- notably a fact table with dimension tables joined in -- you will get that behavior. However, that is because the data model is designed for this purpose, not because SQL works that way.
My two cents. I agree that "how" is not important since SQL is a descriptive language. Well... it's not important until your queries become slow as hell (my experience) when the system is successful and the database grows (a lot).
If you need to find out why a SQL statement is slow or unresponsive, you'll need to understand how the database works under the hood. There are multiple strategies databases use to JOIN tables. Commonly (not a complete list):
Nested Loop Join "NLJ": this is the one you mention.
Merge Join: joining tables "side by side".
Hash Join: hashing one table and then performing a scan on the other.
N-Ary Join: similar to NLJ but with more than two tables at once.
Depending on the size of the tables, column statistics, and the selectivity of your filter (WHERE), your database can use one or the other. It can even change over time if column statistics and value distributions change.
If you want to learn what those strategies are, and when each one is appropriate, you can start by using
EXPLAIN <sql>
to see what strategy MySQL is using for your particular query. Then you can read about database theory to understand the details under the hood.
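For instance, running it against the query from the question shows, per table, which access strategy and index (if any) MySQL picked; the type, key, and rows columns are the ones to look at first:
EXPLAIN
SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM nested_category AS node,
     nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;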
I have Table A
+-------------+---------+----------+
| id | int | NOT NULL |
| name | varchar | NOT NULL |
| number | varchar | NOT NULL |
| description | varchar | NOT NULL |
| type | varchar | NOT NULL |
+-------------+---------+----------+
I then create table C
+--------+---------+----------+
| B_id | int | NOT NULL |
| number | varchar | NOT NULL |
| qty | int | NOT NULL |
+--------+---------+----------+
Our current query looks like the following:
SELECT C.*, A.* FROM C
JOIN A ON A.number = C.number
WHERE C.B_id = '<insert any number here>'
This join seems to be running a little slow even though we've created an INDEX on A.number. My question is: could we simply avoid the join by taking the desired columns we want from A and adding them as columns in table C, or is this bad practice?
I ask this also because at my day job, in our schemas, we have several tables that reference the same column names from table to table. They are indexed, and they pull seamlessly from millions of rows of data. Why can I not achieve this with such small tables? Am I setting up the relationships incorrectly?
Yes, violating normalization is sometimes necessary to resolve performance problems.
But you should try to avoid this, and only do it when absolutely necessary. Adding the redundant columns means you need to ensure that the columns in C are always in sync with A. You may be able to do this with triggers, but it adds complexity and performance impacts to all queries that update the tables.
This shouldn't normally be needed for individual columns that can be fetched using a simple join on indexed columns. It can be more useful for aggregated data, since queries that perform grouping and aggregation can be very expensive for large datasets. For instance, if you frequently need transaction totals by date, you could use the Event Scheduler to update a table with these totals every night. Past transactions are not usually changed, so you don't have to worry about this getting out of sync with the raw transactions table.
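As a hedged sketch of that idea (daily_totals, transactions, amount, and trans_date are hypothetical names, and the Event Scheduler must be enabled for this to run):
-- Assumes daily_totals has PRIMARY KEY (trans_date), so REPLACE overwrites a rerun day
CREATE EVENT refresh_daily_totals
ON SCHEDULE EVERY 1 DAY
DO
  REPLACE INTO daily_totals (trans_date, total)
  SELECT trans_date, SUM(amount)
  FROM transactions
  WHERE trans_date = CURRENT_DATE - INTERVAL 1 DAY
  GROUP BY trans_date;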
Your particular query would benefit from this index:
C: INDEX(B_id)
The query would then:
find the index rows in C for the given B_id
reach over to C's data B-tree to get C.*
use C.number to reach into A's INDEX(number)
reach over to A's data B-tree to get A.*
If you don't need all of *, there may be further optimizations (by using a "covering" index).
Note: The above assumes ENGINE=InnoDB.
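Concretely, that index could be added like this (the index names are illustrative):
ALTER TABLE C ADD INDEX idx_b_id (B_id);
-- Covering variant, only if you need a few specific columns rather than all of C.*:
-- ALTER TABLE C ADD INDEX idx_b_id_covering (B_id, number, qty);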
Here is my table structure:
// posts
+----+-----------+---------------------+-------------+
| id | title | body | keywords |
+----+-----------+---------------------+-------------+
| 1 | title1 | Something here | php,oop |
| 2 | title2 | Something else | html,css,js |
+----+-----------+---------------------+-------------+
// tags
+----+----------+
| id | name |
+----+----------+
| 1 | php |
| 2 | oop |
| 3 | html |
| 4 | css |
| 5 | js |
+----+----------+
// pivot
+---------+--------+
| post_id | tag_id |
+---------+--------+
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
| 2 | 5 |
+---------+--------+
As you see, I store keywords in two ways: both as a string in a column named keywords and relationally in the other tables.
Now I need to select all posts that have specific keywords (for example php and html tags). I can do that in two ways:
1: Using unnormalized design:
SELECT * FROM posts WHERE keywords REGEXP 'php|html';
2: Using normalized design:
SELECT posts.id, posts.title, posts.body, posts.keywords
FROM posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags ON tags.id = pivot.tag_id
WHERE tags.name IN ('html', 'php')
GROUP BY posts.id
See? The second approach uses two JOINs. I guess it will be slower than using REGEXP on a huge dataset.
What do you think? I mean what's your recommendation and why?
The second approach uses two JOINs. I guess it will be slower than
using REGEXP on a huge dataset.
Your intuition is simply wrong. Databases are designed to do JOINs. They can take advantage of indexing and partitioning to speed queries. More advanced databases (than MySQL) use statistics on tables to choose optimal algorithms for executing the query.
Your first query always requires a full table scan of posts. Your second query can be optimized in various ways.
Further, maintaining the consistency of the data is much more difficult with the first approach. You probably need to implement triggers to handle updates and inserts on all the tables. That slows things down.
There are some cases where it is worth the effort to do this -- think about summary counts or totals of dollars or time. Putting tags into a delimited string is much less likely to be beneficial, because parsing the string in SQL is not likely to be a really big benefit relative to the other costs.
In small tables, you can use both at your discretion.
If you expect the table to grow, you really should go with the second choice. The reason is that REGEXP can never use an index in MySQL, and indexes are the key to fast queries.
A join will use an index if an index is declared on the column.
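As a hedged sketch, the indexes that make those joins fast would look something like this (index names are illustrative, and it assumes the pivot table has no keys yet):
ALTER TABLE pivot ADD PRIMARY KEY (post_id, tag_id);
ALTER TABLE pivot ADD INDEX idx_tag_post (tag_id, post_id);  -- drives tag -> posts lookups
ALTER TABLE tags  ADD INDEX idx_name (name);                 -- resolves WHERE tags.name IN (...)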
All of this looks good when we talk about data at a small scale. It is fundamental theory for an OLTP system to use normalized tables; when you expect your tables to scale and want data to be non-redundant and consistent, normalization is the answer. Of course there are costs involved with joins, but they are trivial compared to the issues below.
Let's talk about your scenario:
Pros:
all data is available by querying one table.
Cons:
a function wrapped around a column forces the query optimizer to scan the whole table regardless of any index on that column. This is very important from a data-scaling point of view.
a keyword repeated multiple times, as in your case, leads to data redundancy.
keywords appearing multiple times also lead to data inconsistencies: if you want to remove or update a keyword, the column has to be searched and the value replaced in every row, and if a keyword is left behind anywhere, you have a data integrity issue.
There are many more. Go through data normalization in RDBMS.
I have an interesting question about database design.
I came up with the following design:
first table:
**Survivors:**
Survivor_Id | Name | Strength | Energy
second table:
**Skills:**
Skill_Id | Name
third table:
**Survivor_skills:**
Survivor_Id | Skill_Id | Level
The first table, Survivors, will hold many records and will grow over time.
The second table will hold just a few skills that survivors can learn (for example: recon (higher view range), sniper (better accuracy), ...). These skills aren't like strength or energy, which all survivors have.
The third table is the most interesting: it is where survivors and skills are joined together. Everything will work just fine, but I am worried about data duplication.
For example, the survivor with id 1 will have 5 skills, so the third table would look like this:
// survivor_id | skill_id | level
1 | 1 | 2
1 | 2 | 3
1 | 3 | 1
1 | 4 | 5
1 | 5 | 1
First record: survivor with id 1 has skill with id 1 on level 2
Second record ...
Is this a proper approach or should I use something different?
Looks good to me. If you are worried about data duplication:
1) your server-side code should be geared toward not letting this happen
2) you could check whether the row already exists before inserting
3) you could use MySQL's REPLACE INTO - this will replace duplicate rows if configured properly, or insert new ones (http://dev.mysql.com/doc/refman/5.0/en/replace.html)
4) set a unique index on the columns where you want only unique rows, e.g. (Survivor_Id, Skill_Id) - see the sketch below
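A minimal sketch of option 4, using the column names from the question's third table (the composite primary key doubles as the unique constraint):
CREATE TABLE Survivor_skills (
    Survivor_Id INT NOT NULL,
    Skill_Id    INT NOT NULL,
    Level       INT NOT NULL,
    PRIMARY KEY (Survivor_Id, Skill_Id)  -- each survivor can hold a given skill only once
);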
I concur with the others - this is the proper approach.
However, there is one aspect which hasn't been discussed: the order of columns in the composite key {Survivor_Id, Skill_Id}, which will be governed by the kinds of queries you need to run...
If you need to find the skills of a given survivor, the order needs to be: {Survivor_Id, Skill_Id}.
If you need to find the survivors with a given skill, the order needs to be: {Skill_Id, Survivor_Id}.
If you need both, you'll need both the key (and the implied index) on {Survivor_Id, Skill_Id} and an index on {Skill_Id, Survivor_Id}1. Since InnoDB tables are clustered, accessing Level through that secondary index requires a double lookup - to avoid that, consider using a covering index {Skill_Id, Survivor_Id, Level} instead.
1 Or vice versa.
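A hedged sketch of that covering index, assuming the primary key is {Survivor_Id, Skill_Id} (the index name is illustrative):
ALTER TABLE Survivor_skills
    ADD INDEX idx_skill_survivor_level (Skill_Id, Survivor_Id, Level);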
I have 6 tables. These are simplified for this example.
user_items
ID | user_id | item_name | version
-------------------------------------
1 | 123 | test | 1
data
ID | name | version | info
----------------------------
1 | test | 1 | info
data_emails
ID | name | version | email_id
------------------------
1 | test | 1 | 1
2 | test | 1 | 2
emails
ID | email
-------------------
1 | email#address.com
2 | second#email.com
data_ips
ID | name | version | ip_id
----------------------------
1 | test | 1 | 1
2 | test | 1 | 2
ips
ID | ip
--------
1 | 1.2.3.4
2 | 2.3.4.5
What I am looking to achieve is the following.
The user (123) has the item with name 'test'. This is the basic information we need for a given entry.
There is data in our 'data' table, and the current version is 1, so the version in our user_items table is also 1. The two tables are linked together by the name and version. The setup is like this because a user could have an item for which we don't have data; likewise there could be an item for which we have data but which no user owns.
For each item there are also 0 or more emails and ips associated. These can be the same for many items, so rather than duplicate the actual email varchar over and over, we have the data_emails and data_ips tables, which link to the emails and ips tables respectively based on the email_id/ip_id and the respective ID columns.
The emails and ips are associated with the data version again through the item name and version number.
My first question is: is this a good, well-optimized database setup?
My next question, and my main one, is about joining this complex data structure.
What I had was:
PHP
- get all the user items
- loop through them and get the most recent data entry (if any)
- if there is one get the respective emails
- get the respective ips
Does that count as 3 queries or essentially infinite depending on the number of user items?
I was made to believe that the above was inefficient and as such I wanted to condense my setup into using one query to get the same data.
I have achieved that with the following code
SELECT user_items.name,GROUP_CONCAT( emails.email SEPARATOR ',' ) as emails, x.ip
FROM user_items
JOIN data AS data ON (data.name = user_items.name AND data.version = user_items.version)
LEFT JOIN data_emails AS data_emails ON (data_emails.name = user_items.name AND data_emails.version = user_items.version)
LEFT JOIN emails AS emails ON (data_emails.email_id = emails.ID)
LEFT JOIN
(SELECT name,version,GROUP_CONCAT( the_ips.ip SEPARATOR ',' ) as ip FROM data_ips
LEFT JOIN ips as the_ips ON data_ips.ip_id = the_ips.ID )
x ON (x.name = data.name AND x.version = user_items.version)
I have done loads of reading to get to this point and worked tirelessly to get here.
This works as I require - this question seeks to clarify the benefits of using this approach instead.
I have had to use a subquery (I believe?) to get the ips, as previously the results were being multiplied (I believe because of the complex joins). How this subquery works is, I suppose, my main confusion.
Summary of questions.
-Is my database well set up for my usage? Any improvements would be appreciated, and any useful resources to help me expand my knowledge would be great.
-How does the subquery in my SQL actually work - what is the query doing?
-Am I correct to keep using LEFT JOINs - I want to return the user item, with NULL values on the right where applicable.
-Am I essentially replacing a potentially infinite number of queries with 2? Does this make a REAL difference? Can the above be improved?
-Given that when I update a version of an item in my data table I now also have to update the version in the user_items table, I have a few more update queries to do. Is the tradeoff of this setup worthwhile in practice?
Thanks to anyone who contributes to helping me get a better grasp of this !!
Given your data layout, and your objective, the query is correct. If you've only got a small amount of data it shouldn't be a performance problem - but that will change quickly as the amount of data grows. However, when you have a large amount of data there are very few circumstances where you should ever see all of it in one go, implying that the results will be filtered in some way. Exactly how they are filtered has a huge impact on the structure of the query.
How does the subquery in my SQL actually work
Currently it doesn't work properly - there is no GROUP BY
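A sketch of the derived table with the missing GROUP BY added (without it, GROUP_CONCAT collapses all the ips into a single row):
-- this is the body of the x derived table in the outer query
SELECT data_ips.name,
       data_ips.version,
       GROUP_CONCAT(the_ips.ip SEPARATOR ',') AS ip
FROM data_ips
LEFT JOIN ips AS the_ips ON data_ips.ip_id = the_ips.ID
GROUP BY data_ips.name, data_ips.version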
Is the tradeoff of this setup worthwhile in practice?
No - it implies that your schema is too normalized.
A colleague asked me to explain how indexes (indices?) boost performance; I tried to do so, but got confused myself.
I used the model below for explanation (an error/diagnostics logging database). It consists of three tables:
List of business systems, table "System" containing their names
List of different types of traces, table "TraceTypes", defining what kinds of error messages can be logged
Actual trace messages, having foreign keys from System and TraceTypes tables
I used MySQL for the demo, however I don't recall the table types I used. I think it was InnoDB.
System
----------------
| ID | Name    |
----------------
| 1  | billing |
| 2  | hr      |
----------------

TraceTypes
----------------------------------------
| ID | Code    | Description           |
----------------------------------------
| 1  | Info    | Informational message |
| 2  | Warning | Warning only          |
| 3  | Error   | Failure               |
----------------------------------------

Traces (System_ID references System.ID, TraceTypes_ID references TraceTypes.ID)
---------------------------------------------------
| ID | System_ID | TraceTypes_ID | Message        |
---------------------------------------------------
| 1  | 1         | 1             | Job starting   |
| 2  | 1         | 3             | System.nullr.. |
---------------------------------------------------
First, I added some records to all of the tables and demonstrated that the query below executes in 0.005 seconds:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
inner join TraceTypes on Traces.TraceTypes_ID = TraceTypes.ID
where
System.Name='billing' and TraceTypes.Code = 'Info'
Then I generated more data (no indexes yet):
"System" contained about 100 entries
"TraceTypes" contained about 50 entries
"Traces" contained ~10 million records.
Now the previous query took 8-10 seconds.
I created indexes on Traces.System_ID column and Traces.TraceTypes_ID column. Now this query executed in milliseconds:
select count(*) from Traces where System_id=1 and TraceTypes_ID=1;
This was also fast:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
where System.Name='billing' and TraceTypes_ID=1;
but the previous query, which joined all three tables, still took 8-10 seconds to complete.
Only when I created a compound index (with both the System_ID and TraceTypes_ID columns included) did the query time drop to milliseconds.
The basic statement I was taught earlier is "all the columns you use for joining must be indexed".
However, in my scenario I had indexes on both System_ID and TraceTypes_ID, yet MySQL didn't use them. The question is: why? My bet is that the row-count ratio of 100:10,000,000:50 makes the single-column indexes too large to be used. But is that true?
First, the correct, and easiest, way to analyze a slow SQL statement is to run EXPLAIN. Find out how the optimizer chose its plan and ponder why, and how to improve it. I'd suggest studying the EXPLAIN results with only the 2 separate indexes to see how MySQL executes your statement.
I'm not very familiar with MySQL, but it seems MySQL 4 had a restriction of using only one index per table involved in a query. There seem to have been improvements on this since MySQL 5 (index merge), but I'm not sure whether they apply to your case. Again, EXPLAIN should tell you the truth.
Even where 2 indexes per table are allowed (MySQL 5), using 2 separate indexes is generally slower than a compound index: it requires an index-merge step, compared to the single pass over a compound index.
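For the question's tables, that compound index would look something like this (the index name is illustrative):
ALTER TABLE Traces ADD INDEX idx_system_tracetype (System_ID, TraceTypes_ID);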
Multi Column indexes vs Index Merge might be helpful; it uses MySQL 5.4.2.
It's not the size of the indexes so much as the selectivity that determines whether the optimizer will use them.
My guess would be that it is using one index, then perhaps doing a traditional lookup to move to the other index and then filtering out rows. Please check the execution plan. So, in short, you might be looping through two indexes in a nested loop, as per my understanding. We should try to make a composite index on the columns used for filtering or joining, and then use the INCLUDE clause for the columns that are in the SELECT. I have never worked with MySQL, so this understanding is based on SQL Server 2005.