Isn't using unnormalized design better when there are multiple JOINS? - mysql

Here is my table structure:
// posts
+----+--------+----------------+-------------+
| id | title  | body           | keywords    |
+----+--------+----------------+-------------+
| 1  | title1 | Something here | php,oop     |
| 2  | title2 | Something else | html,css,js |
+----+--------+----------------+-------------+
// tags
+----+------+
| id | name |
+----+------+
| 1  | php  |
| 2  | oop  |
| 3  | html |
| 4  | css  |
| 5  | js   |
+----+------+
// pivot
+---------+--------+
| post_id | tag_id |
+---------+--------+
| 1       | 1      |
| 1       | 2      |
| 2       | 3      |
| 2       | 4      |
| 2       | 5      |
+---------+--------+
As you can see, I store the keywords in two ways: as a comma-separated string in a column named keywords, and relationally in the tags and pivot tables.
Now I need to select all posts that have specific keywords (for example php and html tags). I can do that in two ways:
1: Using unnormalized design:
SELECT * FROM posts WHERE keywords REGEXP 'php|html';
2: Using normalized design:
SELECT posts.id, posts.title, posts.body, posts.keywords
FROM posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags ON tags.id = pivot.tag_id
WHERE tags.name IN ('html', 'php')
GROUP BY posts.id
See? The second approach uses two JOINs. I guess it will be slower than using REGEXP on a huge dataset.
What do you think? I mean, what's your recommendation, and why?

The second approach uses two JOINs. I guess it will be slower than using REGEXP on a huge dataset.
Your intuition is simply wrong. Databases are designed to do JOINs. They can take advantage of indexing and partitioning to speed queries. More advanced databases (than MySQL) use statistics on tables to choose optimal algorithms for executing the query.
Your first query always requires a full table scan of posts. Your second query can be optimized in various ways.
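For instance, here is a sketch of the kind of indexes that help, assuming the schema above and that the pivot table has no primary key yet (the index names are made up for illustration):

-- Let MySQL look up tag ids by name without scanning tags.
ALTER TABLE tags ADD UNIQUE INDEX idx_tags_name (name);

-- A composite key covers the pivot lookups from the post side;
-- the reverse index helps when the query starts from the tag side.
ALTER TABLE pivot ADD PRIMARY KEY (post_id, tag_id);
ALTER TABLE pivot ADD INDEX idx_pivot_tag_post (tag_id, post_id);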
Further, maintaining the consistency of the data in the database is much more difficult with the first approach. You would probably need to implement triggers to handle updates and inserts on all the tables, and that slows things down.
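To make that maintenance burden concrete, here is a rough sketch of just one of the triggers the denormalized keywords column would need (the trigger name and exact logic are my assumption; deletes, tag renames, and reordering would each need their own handling):

-- Assumed helper trigger: append the newly linked tag's name to posts.keywords.
CREATE TRIGGER pivot_after_insert AFTER INSERT ON pivot
FOR EACH ROW
    UPDATE posts
    SET keywords = CONCAT_WS(',', NULLIF(keywords, ''),
                             (SELECT name FROM tags WHERE id = NEW.tag_id))
    WHERE id = NEW.post_id;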
There are some cases where it is worth the effort to do this -- think about summary counts or totals of dollars or time. Putting tags into a delimited string is much less likely to be beneficial, because parsing the string in SQL is not likely to be a really big benefit relative to the other costs.

For small tables, you can use either approach at your discretion.
If you expect the table to grow, you really need the second choice. The reason is that REGEXP can never use an index in MySQL, and indexes are the key to fast queries.
A JOIN will use an index if one is declared on the join column.

All of this looks fine while the data is at a small scale. It is fundamental relational theory that an OLTP system should have normalized tables. When you expect your table to scale and want the data to be non-redundant and consistent, normalization is the answer. Of course there are costs involved with joins, but they are trivial compared with these issues.
Let's talk about your scenario:
Pros:
All the data is available by querying one table.
Cons:
A function or pattern match wrapped around a column forces the query optimizer to scan the whole table regardless of any index on that column. This matters a great deal once the data scales.
Keywords are repeated across many rows, which is data redundancy.
Repeated keywords also lead to data inconsistencies: to remove or update a keyword, the column has to be searched and the value replaced in every affected row, and any row that is missed leaves a data integrity problem (a comparison is sketched below).
There are many more. Read up on data normalization in an RDBMS.
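For illustration only, here is what renaming the tag php to php7 might look like in each design (the exact statements are my assumption, using the table and column names from the question):

-- Normalized design: one row changes, every post stays consistent.
UPDATE tags SET name = 'php7' WHERE name = 'php';

-- Denormalized design: every affected row must be found and rewritten,
-- and REPLACE() can silently hit other keywords that merely contain 'php'.
UPDATE posts
SET keywords = REPLACE(keywords, 'php', 'php7')
WHERE keywords REGEXP 'php';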

Related

How does table joining really work in MySQL?

For years, I understood that when tables are joined, one row from the primary table is joined to a row in the target table after applying the conditions, i.e. the query result will have <= the rows in the primary table. But I have seen that one row from the primary table can be joined multiple times if the conditions allow it. For example, the COUNT in the query below would not work without duplicate rows from the primary table:
SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;
Which produces this result
+----------------------+-------+
| name                 | depth |
+----------------------+-------+
| ELECTRONICS          |     0 |
| TELEVISIONS          |     1 |
| TUBE                 |     2 |
| LCD                  |     2 |
| PLASMA               |     2 |
| PORTABLE ELECTRONICS |     1 |
| MP3 PLAYERS          |     2 |
| FLASH                |     3 |
| CD PLAYERS           |     2 |
| 2 WAY RADIOS         |     2 |
+----------------------+-------+
I know I may be asking something really basic, but how exactly are rows joined together in the simplest possible joins? Does MySQL take steps the way a regex engine does when executing a pattern against a string?
"How" joins are implemented is actually not important. SQL is a descriptive language, not a procedural language. The query engine can decide the "how". The query is describing the "what".
The conceptual definition of an inner join is rather simple. It is the Cartesian product of two sets that meets the conditions of the on and where clauses.
Most people don't think in terms of Cartesian products. A nested loop is equivalent. The logic is something like this:
for each row1 in table1
    for each row2 in table2
        output row1 || row2 if the on/where conditions are true
Outer joins extend this concept, allowing rows from one or both tables to be in the result set even when the on/where conditions are not true.
There is no concept whatsoever that "query results will have <= the rows in the primary table." With some data structures -- notably a fact table with dimension tables joined in -- you will get that behavior. However, that is because the data model is designed for this purpose, not because SQL works that way.
My two cents. I agree that "how" is not important since SQL is a descriptive language. Well... it's not important until your queries become slow as hell (my experience) when the system is successful and the database grows (a lot).
If you need to find out why a SQL is slow or unresponsive you'll need to understand how the database works under the hood. There are multiple strategies databases use to JOIN tables. Commonly (not a complete list):
Nested Loop Join "NLJ": this is the one you mention.
Merge Join: joining tables "side by side".
Hash Join: hashing one table and then perform a scan on the other.
N-Ary Join: similar to NLJ but with more than two tables at once.
Depending on the size of the tables, the column statistics, and the selectivity of your filter (WHERE), your database can use one or the other. It can even change over time if the column statistics and value distributions change.
If you want to learn what those strategies are, and when each one is convenient, you can start by using
EXPLAIN <sql>
To see what strategy MySQL is using for your particular query. Then you can read about database theory to understand the details under the hood.
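For instance, prefixing the query from the question above with EXPLAIN shows which access strategy and indexes MySQL chose (the exact output columns vary by version, so they are omitted here):

EXPLAIN
SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM nested_category AS node,
     nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;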

Can't optimise MySQL query

I am running a query to retrieve some game levels from a MySQL database. The query itself takes around 0.00025 seconds to execute on a database that contains 40 level strings. I thought that was satisfactory, until I got a message from the website host telling me to optimise the below-mentioned query, or the script will be removed because it is putting a lot of strain on their servers.
I tried optimising by using EXPLAIN and EXPLAIN EXTENDED and adjusting the columns accordingly (adding indexes), but I am always getting the same performance. I also noticed that MySQL didn't use indexes where they were available but instead did a full-table scan.
Results from EXPLAIN EXTENDED:
| table   | id | select_type | type | possible_keys  | key     | key_len | ref           | rows | Extra                           |
|---------|----|-------------|------|----------------|---------|---------|---------------|------|---------------------------------|
| users   | 1  | SIMPLE      | ALL  | PRIMARY,id     | NULL    | NULL    | NULL          | 7    | Using temporary; Using filesort |
| AllTime | 1  | SIMPLE      | ref  | PRIMARY,userid | PRIMARY | 4       | Test.users.id | 1    |                                 |
query:
SELECT users.nickname, AllTime.userid, AllTime.id, AllTime.levelname, AllTime.levelstr
FROM AllTime
INNER JOIN users
ON AllTime.userid=users.id
ORDER BY AllTime.id DESC
LIMIT ($value_from_php),20;
The tables:
users
| id(int)                   | nickname(varchar) |
| (Primary, Auto_increment) |                   |
|---------------------------|-------------------|
| 1                         | username1         |
| 2                         | username2         |
| 3                         | username3         |
| ...                       | ...               |
and AllTime
| id(int)                   | userid(int) | levelname(varchar) | levelstr(text) |
| (Primary, Auto_increment) | (index)     |                    |                |
|---------------------------|-------------|--------------------|----------------|
| 1                         | 2           | levelname1         | levelstr1      |
| 2                         | 2           | levelname2         | levelstr2      |
| 3                         | 3           | levelname3         | levelstr3      |
| 4                         | 1           | levelname4         | levelstr4      |
| 5                         | 1           | levelname5         | levelstr5      |
| 6                         | 1           | levelname6         | levelstr6      |
| 7                         | 2           | levelname7         | levelstr7      |
Is there a way to optimize this query or would I be better off by calling two consecutive queries from php just to avoid the warning?
I am just learning MySQL, so please take that information into account when replying, thank you :)
I'm assuming you're using InnoDB.
For an INNER JOIN, MySQL typically starts with the table with the fewest rows, in this case users. However, since you just want the latest 20 AllTime records joined with the corresponding user records, you actually should start with AllTime since with the LIMIT, it will be the smaller data set.
Use STRAIGHT_JOIN to force the join order:
SELECT users.nickname, AllTime.userid, AllTime.id, AllTime.levelname,
AllTime.levelstr
FROM AllTime
STRAIGHT_JOIN users
ON users.id = AllTime.userid
ORDER BY AllTime.id DESC
LIMIT ($value_from_php),20;
It should be able to use the primary key on the AllTime table and follow it in descending order. It'll grab all the data on the same pages as it goes.
It should also use the primary key on the users table to grab the id and nickname. If there are more than just two columns, you might add a multi-column covering index on (id, nickname) to improve the speed.
If you can, convert the levelstr column to VARCHAR so that the data is stored on the same page as the rest of the row; otherwise, MySQL has to go fetch the text columns separately. This assumes that your rows are under the roughly 8000 byte row limit for InnoDB. There is no way to avoid the Using temporary unless you get rid of the text column.
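A sketch of both suggestions (the index name and the VARCHAR length are assumptions; adapt them to your data):

-- Covering index so the nickname can be returned straight from the index.
ALTER TABLE users ADD INDEX idx_users_id_nickname (id, nickname);

-- Keep the level string in the row instead of an off-page TEXT column;
-- 2048 is a guess, pick a length that fits your longest level string.
ALTER TABLE AllTime MODIFY levelstr VARCHAR(2048);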
Most likely, your host has identified this query by using the slow query log, which can identify all queries that don't use an index, or they may have red flagged it because of the Using temporary.
It doesn't look like the query has a problem.
Review the application code. Most likely the issue is in the code.
Check MySQL query execution plan
possibly you are missing an index
Make sure you cache the data in Application and Database (fyi, sometimes you can load the whole database into Application memory)
Make sure you use a connection pool
Create a view (a very small chance for improvement)
Try to remove the "Order By" clause (again a very small chance it will improve the performance)
The query itself takes around 0.00025 seconds ... I got a message from the website host telling me to optimise the below-mentioned query, or the script will be removed since it is pushing a lot of strain onto their servers.
Ask the website host for more details about why this query has been flagged for attention. A query that trivial is not going to cause strain on anything unless it is being called very frequently.
Find out how many times that query is being run. I will bet you a nickel that your site is getting hammered by a bot and being executed hundreds or thousands of times per minute. If so, then that's your real problem.
LIMIT ($value_from_php),20; -- if $value_from_php is huge, then the query is slow. This is because all the "old" pages need to be scanned before getting to the 20 rows you need.
By "remembering where you left off" you can make every page equally fast. See this for further details: http://mysql.rjweb.org/doc.php/pagination

Does a View created by joining two tables read unqueried/unused columns?

Regarding this oversimplified example:
In this database scheme
+--------------+    +-------------------+
| MASTER_TABLE |    | FILES_TABLE       |
+-----+--------+    +-----+------+------+
| nID | field  |    | nID | meta | BLOB |
+-----+--------+    +-----+------+------+
| 1   | ...    |    | 1   | ...  | ...  |
+-----+--------+    +-----+------+------+
if I create a view like this:
CREATE VIEW myView AS
SELECT master.*, file.meta
FROM master_table master
LEFT JOIN files_table file
USING (nid)
does the unused BLOB column get read when querying myView? (read as: will it be much slower to query the view than to query master_table only)
I'm asking this because column BLOB will be used to store files. The reason we split the table in two in the first place was to speed up the queries of master_table.
DISCLAIMER:
When designing the data structure it was decided by the project manager that the files annexed to the data should be stored in the database rather than in the filesystem.
I'm quite aware of the numerous inflamed discussions regarding storing files in the database vs filesystem but, as I said, it was not decided by me nor I have the power to change the decision.
No, only the fields listed in the SELECT of the query that builds the view are "read". However, any join will affect the select time compared with a single-table SELECT statement.
Since you're not using the BLOB field in the view, you won't take this hit.
If nid is indexed in both the master and file tables, performance should be fairly good.
Optimal performance on this view's JOIN would come from a composite index of (nid, meta) on the file table. This assumes meta isn't too big to be part of a composite index.
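For instance (a sketch; the index name is made up, and if meta is a long text type you would need a key prefix length):

ALTER TABLE files_table ADD INDEX idx_files_nid_meta (nID, meta);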

Compound index required to speed up join-ed query?

A colleague asked me to explain how indexes (indices?) boost performance; I tried to do so, but got confused myself.
I used the model below for explanation (an error/diagnostics logging database). It consists of three tables:
List of business systems, table "System" containing their names
List of different types of traces, table "TraceTypes", defining what kinds of error messages can be logged
Actual trace messages, having foreign keys from System and TraceTypes tables
I used MySQL for the demo, however I don't recall the table types I used. I think it was InnoDB.
System              TraceTypes
----------------    ----------------------------------------
| ID | Name    |    | ID | Code    | Description           |
----------------    ----------------------------------------
| 1  | billing |    | 1  | Info    | Informational message |
| 2  | hr      |    | 2  | Warning | Warning only          |
----------------    | 3  | Error   | Failure               |
   |                ----------------------------------------
   |                   |
   |     --------------|
   |     |
Traces
---------------------------------------------------
| ID | System_ID | TraceTypes_ID | Message        |
---------------------------------------------------
| 1  | 1         | 1             | Job starting   |
| 2  | 1         | 3             | System.nullr.. |
---------------------------------------------------
First, I added some records to all of the tables and demonstrated that the query below executes in 0.005 seconds:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
inner join TraceTypes on Traces.TraceTypes_ID = TraceTypes.ID
where
System.Name='billing' and TraceTypes.Code = 'Info'
Then I generated more data (no indexes yet)
"System" contained about 100 entries
"TraceTypes" contained about 50 entries
"Traces" contained ~10 million records.
Now the previous query took 8-10 seconds.
I created indexes on Traces.System_ID column and Traces.TraceTypes_ID column. Now this query executed in milliseconds:
select count(*) from Traces where System_id=1 and TraceTypes_ID=1;
This was also fast:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
where System.Name='billing' and TraceTypes_ID=1;
but the previous query which joined all the three tables still took 8-10 seconds to complete.
Only when I created a compound index (both System_ID and TraceTypes_ID columns included in index), the speed went down to milliseconds.
The basic statement I was taught earlier is "all the columns you use for join-ing, must be indexed".
However, in my scenario I had indexes on both System_ID and TraceTypes_ID, yet MySQL didn't use them. The question is: why? My bet is that the item count ratio of 100:10,000,000:50 makes the single-column indexes too large to be used. But is that true?
First, the correct, and the easiest, way to analyze a slow SQL statement is to run EXPLAIN. Find out how the optimizer chose its plan and ponder why, and how to improve it. I'd suggest studying the EXPLAIN results with only the two separate indexes to see how MySQL executes your statement.
I'm not very familiar with MySQL, but it seems there is a restriction in MySQL 4 that only one index per table involved in a query can be used. There seem to be improvements on this since MySQL 5 (index merge), but I'm not sure whether they apply to your case. Again, EXPLAIN should tell you the truth.
Even where using two indexes per table is allowed (MySQL 5), two separate indexes are generally slower than a compound index: index merge requires an extra merging step, compared with the single pass over a compound index.
The article Multi Column indexes vs Index Merge might be helpful; it uses MySQL 5.4.2.
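In the scenario above, the compound index that finally made the three-table join fast would look something like this (the index name is illustrative):

ALTER TABLE Traces ADD INDEX idx_traces_system_type (System_ID, TraceTypes_ID);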
It's not the size of the indexes so much as the selectivity that determines whether the optimizer will use them.
My guess would be that it uses one index, then does traditional lookups to move to another index, and then filters out rows. Please check the execution plan; in short, you might be looping through two indexes in a nested loop. As I understand it, you should build a composite index on the columns used for filtering or joining, and then use an INCLUDE clause for the columns in the SELECT. I have never worked with MySQL, so this understanding is based on SQL Server 2005.

How can I speed up this SQL query on MySQL 4.1?

I have a SQL query that takes a very long time to run on MySQL (it takes several minutes). The query is run against a table that has over 100 million rows, so I'm not surprised it's slow. In theory, though, it should be possible to speed it up as I really only want to get back the rows from the large table (let's call it A) that have a reference in another table, B.
So my query is:
SELECT id FROM A, B where A.ref = B.ref;
(A has over 100 million rows; B has just a few thousand).
I've added INDEXes:
alter table A add index(ref);
alter table B add index(ref);
But it's still very slow (several minutes -- I'd be happy with one minute).
Unfortunately, I'm stuck with MySQL 4.1.22, so I can't use views.
I'd rather not copy all of the relevant rows from A into a separate, smaller table, as the rows that I need will change from time to time. On the other hand, at the moment that's the only solution I can think of.
Any suggestions welcome!
EDIT: Here's the output of running EXPLAIN on my query:
+----+-------------+-------+------+---------------+-------+---------+----------------+-------+-------------+
| id | select_type | table | type | possible_keys | key   | key_len | ref            | rows  | Extra       |
+----+-------------+-------+------+---------------+-------+---------+----------------+-------+-------------+
|  1 | SIMPLE      | B     | ALL  | B_ref,ref     | NULL  | NULL    | NULL           | 16718 | Using where |
|  1 | SIMPLE      | A     | ref  | A_REF,ref     | A_ref | 4       | DATABASE.B.ref |  5655 |             |
+----+-------------+-------+------+---------------+-------+---------+----------------+-------+-------------+
(In redacting my original query example, I chose to use "ref" as my column name, which happens to be the same as one of the types, but hopefully that's not too confusing...)
The query optimizer is probably already doing the best that it can, but in the unlikely event that it's reading the giant table (A) first, you can explicitly tell it to read B first using the STRAIGHT_JOIN syntax:
SELECT STRAIGHT_JOIN id FROM B, A where B.ref = A.ref;
From the answers, it seems like you're doing the most efficient thing you can with the SQL. The A table seems to be the big problem, how about splitting it into three individual tables, kind of like a local version of sharding? Alternatively, is it worth denormalising the B table into the A table, assuming B doesn't have too many columns?
Finally, you could just have to buy a faster box to run it on - there's no substitute for horsepower!
Good luck.
SELECT id FROM A JOIN B ON A.ref = B.ref
You may be able to optimize further by using an appropriate type of join e.g. LEFT JOIN
http://en.wikipedia.org/wiki/Join_(SQL)