Regarding this oversimplified example:
In this database schema
+--------------+ +-------------------+
| MASTER_TABLE | | FILES_TABLE |
+-----+--------+ +-----+------+------+
| nID | field | | nID | meta | BLOB |
+-----+--------+ +-----+------+------+
| 1 | ... | | 1 | ... | ... |
+-----+--------+ +-----+------+------+
if I create a view like this:
CREATE VIEW myView AS
SELECT master.*, file.meta
FROM master_table master
LEFT JOIN files_table file
USING (nid)
does the unused BLOB column get read when querying myView? (Read as: will querying the view be much slower than querying master_table alone?)
I'm asking because the BLOB column will be used to store files. The reason we split the table in two in the first place was to speed up queries against master_table.
DISCLAIMER:
When designing the data structure it was decided by the project manager that the files annexed to the data should be stored in the database rather than in the filesystem.
I'm quite aware of the numerous inflamed discussions about storing files in the database vs. the filesystem but, as I said, the decision was not mine, nor do I have the power to change it.
No, only the columns listed in the SELECT that defines the view are read. However, any join will add to the query time compared with a single-table SELECT.
Since you're not using the BLOB column in the view, you won't take that hit.
If nid is indexed in both the master and files tables, performance should be fairly good.
Optimal performance on this view's JOIN would come from a composite index of (nid, meta) on the files table. This assumes meta isn't too big to be part of a composite index.
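A minimal sketch of that index, assuming the table and column names from the example above; if meta is a long string type, MySQL needs a prefix length, shown in the commented-out variant:
ALTER TABLE files_table ADD INDEX idx_files_nid_meta (nid, meta);
-- If meta is a long VARCHAR/TEXT, index a prefix instead, e.g.:
-- ALTER TABLE files_table ADD INDEX idx_files_nid_meta (nid, meta(50));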
Related
So I am no MySQL expert and I really need some help figuring this out. I currently have over 60 tables that I want to merge into a single table; none of the data in those tables overlaps, so I need the rows of all the tables combined into a single one. They all have the same schema, if that is the correct term, basically the same format, and their names all end in the same suffix '_dir'.
What I thought could work was something like this:
Get all tables with the same suffix,
For each table in the list, insert its rows into main_table.
I don't know how to do this in MySQL, or if it's even possible. I know I can use
SELECT *
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE '%_dir%'
to get the list of all the tables, but how can I use this to iterate over every table?
Here is an example of input data:
table 1:
| NAME   | INST_NAME                       | DROP |
| data 1 | this is an example instance1    | 1.5  |
| data 1 | this is an example of instance2 | 2.0  |
table 2:
| NAME   | INST_NAME                       | DROP |
| data 2 | this is an example instance1    | 3.0  |
| data 2 | this is an example of instance2 | 4.0  |
Output table:
| NAME   | INST_NAME                       | DROP |
| data 1 | this is an example instance1    | 1.5  |
| data 1 | this is an example of instance2 | 2.0  |
| data 2 | this is an example instance1    | 3.0  |
| data 2 | this is an example of instance2 | 4.0  |
Note that I have to do this for over 60 tables, not just two. There are also other tables with different information in the same database, so I can't just merge every table in there.
You really need to fix your data structure. You should not be storing data in tables with the same structure -- that information should all go into a single table. Then you wouldn't have this issue.
For now, you can construct a view with all the data. You can generate the code for the view with something like this:
SELECT CONCAT('CREATE VIEW vw_dir AS ',
              GROUP_CONCAT(REPLACE('SELECT NAME, INST_NAME, `DROP` FROM [T]', '[T]', TABLE_NAME)
                           SEPARATOR ' UNION ALL ')
      ) AS create_view_sql
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE '%\_dir';  -- \_ matches a literal underscore; bare _ is a single-character wildcard
Then take the SQL, run it, and you'll have a view called vw_dir. The next time you add a table, you'll need to drop the view and then recreate it.
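If you'd rather not copy the generated statement by hand, here is a hedged sketch that runs it via a prepared statement. It assumes your MySQL version allows CREATE VIEW inside PREPARE, and it raises group_concat_max_len (default 1024 bytes), which would otherwise truncate the concatenation of 60+ SELECTs:
SET SESSION group_concat_max_len = 1024 * 1024;
SET @ddl = (SELECT CONCAT('CREATE VIEW vw_dir AS ',
                          GROUP_CONCAT(REPLACE('SELECT NAME, INST_NAME, `DROP` FROM [T]', '[T]', TABLE_NAME)
                                       SEPARATOR ' UNION ALL '))
            FROM INFORMATION_SCHEMA.TABLES
            WHERE TABLE_NAME LIKE '%\_dir');
DROP VIEW IF EXISTS vw_dir;
PREPARE stmt FROM @ddl;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;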
With this solved, you can now start thinking about how to get all the data into a single table, without having the intermediate tables cluttering up your database.
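Once the view exists, consolidating could be as simple as this sketch (it assumes main_table already exists with the same three columns):
INSERT INTO main_table (NAME, INST_NAME, `DROP`)
SELECT NAME, INST_NAME, `DROP`
FROM vw_dir;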
Here is my table structure:
// posts
+----+-----------+---------------------+-------------+
| id | title | body | keywords |
+----+-----------+---------------------+-------------+
| 1 | title1 | Something here | php,oop |
| 2 | title2 | Something else | html,css,js |
+----+-----------+---------------------+-------------+
// tags
+----+----------+
| id | name |
+----+----------+
| 1 | php |
| 2 | oop |
| 3 | html |
| 4 | css |
| 5 | js |
+----+----------+
// pivot
+---------+--------+
| post_id | tag_id |
+---------+--------+
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
| 2 | 5 |
+---------+--------+
As you can see, I store keywords in two ways: as a comma-separated string in a column named keywords, and relationally via the other tables.
Now I need to select all posts that have specific keywords (for example php and html tags). I can do that in two ways:
1: Using unnormalized design:
SELECT * FROM posts WHERE keywords REGEXP 'php|html';
2: Using normalized design:
SELECT posts.id, posts.title, posts.body, posts.keywords
FROM posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags ON tags.id = pivot.tag_id
WHERE tags.name IN ('html', 'php')
GROUP BY posts.id
See? The second approach uses two JOINs. I guess it will be slower than using REGEXP on a huge dataset.
What do you think? I mean what's your recommendation and why?
The second approach uses two JOINs. I guess it will be slower than
using REGEXP on a huge dataset.
Your intuition is simply wrong. Databases are designed to do JOINs. They can take advantage of indexing and partitioning to speed queries. More advanced databases (than MySQL) use statistics on tables to choose optimal algorithms for executing the query.
Your first query always requires a full table scan of posts. Your second query can be optimized in various ways.
Further, maintaining the consistency of the data is much more difficult with the first approach. You would probably need to implement triggers to handle updates and inserts on all the tables, and that slows things down.
There are some cases where denormalization like this is worth the effort -- think of summary counts or totals of dollars or time. Putting tags into a delimited string is much less likely to be beneficial, because parsing the string in SQL brings little benefit relative to the other costs.
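To make that maintenance cost concrete, here is a hedged sketch of one such trigger for the tables in the question (the trigger name is invented, and an AFTER DELETE twin plus update handling would also be needed):
CREATE TRIGGER pivot_after_insert AFTER INSERT ON pivot
FOR EACH ROW
  -- Append the new tag's name to the post's denormalized keywords string
  UPDATE posts
  SET keywords = CONCAT_WS(',', NULLIF(keywords, ''),
                           (SELECT name FROM tags WHERE id = NEW.tag_id))
  WHERE id = NEW.post_id;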
In small tables, you can use both at your discretion.
If you expect the table to grow, you really need the second choice. The reason is that REGEXP can never use an index in MySQL, and indexes are the key to fast queries.
A JOIN will use an index if one is declared on the join column.
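A hedged sketch of such indexes for the normalized query (index names are invented; InnoDB would also create indexes automatically for columns declared as foreign keys):
ALTER TABLE pivot
  ADD INDEX idx_pivot_post (post_id),
  ADD INDEX idx_pivot_tag (tag_id);
ALTER TABLE tags
  ADD INDEX idx_tags_name (name);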
All of this looks fine at a small scale, but normalized tables are fundamental theory for an OLTP system. When you expect your table to scale and want the data to be non-redundant and consistent, normalization is the answer. Of course there are costs involved with joins, but they are trivial next to these issues.
Let's talk about your scenario:
Pros:
All the data is available by querying one table.
Cons:
A function wrapped around a column forces the query optimizer to scan the whole table regardless of any index on that column. This is very important from a data-scaling point of view.
Each keyword in your case is repeated multiple times, leading to data redundancy.
Keywords appearing multiple times also lead to data inconsistencies: if you want to remove or update a keyword, the column has to be searched and replaced in every row, and any occurrence left behind becomes a data integrity issue.
There are many more. Read up on data normalization in RDBMSs.
I have a CMS system that stores data across tables like this:
Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+
Entries META Table
+----+----------+-------+-------+
| id | entry_id | value | param |
+----+----------+-------+-------+
Files Table
+----+----------+----------+
| id | entry_id | filename |
+----+----------+----------+
Entries-to-Tags Table
+----+----------+--------+
| id | entry_id | tag_id |
+----+----------+--------+
Tags Table
+----+-----+
| id | tag |
+----+-----+
I am trying to implement a revision system, a bit like SO has. If I were doing it just for the Entries table, I would simply keep a copy of all changes to that table in a separate table. Since I have to do it for at least 4 tables (the Tags table doesn't need revisions), that doesn't seem like an elegant solution at all.
How would you guys do it?
Please note that the meta table is modeled as EAV (entity-attribute-value).
Thank you in advance.
I am currently working on a solution to a similar problem. I am solving it by splitting each of my tables into two: a control table and a data table. The control table contains a primary key and a reference into the data table; the data table contains an auto-increment revision key and the control table's primary key as a foreign key.
taking your entries table as an example
Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+
becomes
entries entries_data
+----+----------+ +----------+----+--------+------+--------+--------+
| id | revision | | revision | id | title | text | index1 | index2 |
+----+----------+ +----------+----+--------+------+--------+--------+
to query
select * from entries join entries_data on entries.revision = entries_data.revision;
Instead of updating the entries_data table, you insert a new row and then update the entries table's revision to point at the new entries_data revision.
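Here is a hedged sketch of the scheme in SQL, with column types assumed:
CREATE TABLE entries_data (
  revision INT AUTO_INCREMENT PRIMARY KEY,
  id INT NOT NULL,       -- the entry's stable identifier (the control table's PK)
  title VARCHAR(255),
  text TEXT,
  index1 INT,
  index2 INT
);

CREATE TABLE entries (
  id INT PRIMARY KEY,
  revision INT NOT NULL  -- points at the current row in entries_data
);

-- Saving a change: insert a new revision, then repoint the control row.
INSERT INTO entries_data (id, title, text, index1, index2)
VALUES (1, 'new title', 'new text', 0, 0);
UPDATE entries SET revision = LAST_INSERT_ID() WHERE id = 1;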
The advantage of this system is that you can move to a different revision simply by changing the revision property in the entries table. The disadvantage is that you need to update your queries. I am currently integrating this into an ORM layer so the developers don't have to worry about writing SQL anyway. Another idea I am toying with is a centralised revision table which all the data tables use. This would let you describe the state of the database with a single revision number, similar to how Subversion revision numbers work.
Have a look at this question: How to version control a record in a database
Why not have a separate history table for each table (as per the accepted answer on the linked question)? It simply has a compound primary key of the original table's PK and the revision number. You still need to store the data somewhere, after all.
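For example, a history table for Entries might look like this sketch (types assumed):
CREATE TABLE entries_history (
  id INT NOT NULL,        -- PK of the original row in Entries
  revision INT NOT NULL,  -- revision number
  title VARCHAR(255),
  text TEXT,
  index1 INT,
  index2 INT,
  PRIMARY KEY (id, revision)
);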
For one of our projects we went the following way:
Entries Table
+----+-----------+---------+
| id | date_from | date_to |
+----+-----------+---------+
EntryProperties Table
+----------+-----------+---------+-------+------+--------+--------+
| entry_id | date_from | date_to | title | text | index1 | index2 |
+----------+-----------+---------+-------+------+--------+--------+
It's fairly complicated, but it still allows keeping track of an object's full lifecycle. So for querying the active entities we went with:
SELECT
    entry_id, title, text, index1, index2
FROM
    Entries INNER JOIN EntryProperties
    ON Entries.id = EntryProperties.entry_id
    AND Entries.date_to IS NULL
    AND EntryProperties.date_to IS NULL
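Writing a new revision under this scheme looks roughly like the following sketch (the literal values are invented for illustration):
-- Close the current property row, then insert its replacement.
UPDATE EntryProperties
SET date_to = NOW()
WHERE entry_id = 1 AND date_to IS NULL;

INSERT INTO EntryProperties (entry_id, date_from, title, text, index1, index2)
VALUES (1, NOW(), 'new title', 'new text', 0, 0);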
The only concern was the situation where an entity is removed (so we set its date_to) and then restored by an admin. Under the given scheme there's no way to track that kind of trick.
The overall downside of any attempt like this is obvious: you have to write tons of SQL where a non-versioned DB would get by with a plain SELECT from A joined to B.
A colleague asked me to explain how indexes (indices?) boost performance; I tried to do so, but got confused myself.
I used the model below for the explanation (an error/diagnostics logging database). It consists of three tables:
A list of business systems: table "System", containing their names
A list of different trace types: table "TraceTypes", defining what kinds of error messages can be logged
The actual trace messages, with foreign keys to the System and TraceTypes tables
I used MySQL for the demo; I don't recall the storage engine I used, but I think it was InnoDB.
System                TraceTypes
--------------------  ------------------------------------------
| ID | Name        |  | ID | Code    | Description             |
--------------------  ------------------------------------------
| 1  | billing     |  | 1  | Info    | Informational message   |
| 2  | hr          |  | 2  | Warning | Warning only            |
--------------------  | 3  | Error   | Failure                 |
                      ------------------------------------------

Traces (System_ID and TraceTypes_ID are foreign keys into the tables above)
----------------------------------------------------
| ID | System_ID | TraceTypes_ID | Message         |
----------------------------------------------------
| 1  | 1         | 1             | Job starting    |
| 2  | 1         | 3             | System.nullr..  |
----------------------------------------------------
First, I added some records to all of the tables and demonstrated that the query below executes in 0.005 seconds:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
inner join TraceTypes on Traces.TraceTypes_ID = TraceTypes.ID
where
System.Name='billing' and TraceTypes.Code = 'Info'
Then I generated more data (no indexes yet)
"System" contained about 100 entries
"TraceTypes" contained about 50 entries
"Traces" contained ~10 million records.
Now the previous query took 8-10 seconds.
I created indexes on the Traces.System_ID and Traces.TraceTypes_ID columns. After that, this query executed in milliseconds:
select count(*) from Traces where System_id=1 and TraceTypes_ID=1;
This was also fast:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
where System.Name='billing' and TraceTypes_ID=1;
but the previous query, which joined all three tables, still took 8-10 seconds to complete.
Only when I created a compound index (both the System_ID and TraceTypes_ID columns included in one index) did the speed drop to milliseconds.
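In DDL terms, the two attempts were roughly as follows (index names are invented):
-- Two single-column indexes: not enough for the three-table join
CREATE INDEX idx_traces_system ON Traces (System_ID);
CREATE INDEX idx_traces_type   ON Traces (TraceTypes_ID);
-- The compound index that brought the join back to milliseconds
CREATE INDEX idx_traces_system_type ON Traces (System_ID, TraceTypes_ID);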
The basic statement I was taught earlier is "all the columns you use for joining must be indexed".
However, in my scenario I had indexes on both System_ID and TraceTypes_ID, yet MySQL didn't use them. The question is: why? My bet is that the item count ratio of 100:10,000,000:50 makes the single-column indexes too large to be used. But is that true?
First, the correct, and the easiest, way to analyze a slow SQL statement is to run EXPLAIN on it. Find out how the optimizer chose its plan, and ponder why and how to improve it. I'd suggest studying the EXPLAIN results with only the 2 separate indexes in place to see how MySQL executes your statement.
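For example, just prefix the slow query:
EXPLAIN
SELECT count(*) FROM Traces
INNER JOIN System ON Traces.System_ID = System.ID
INNER JOIN TraceTypes ON Traces.TraceTypes_ID = TraceTypes.ID
WHERE System.Name = 'billing' AND TraceTypes.Code = 'Info';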
I'm not very familiar with MySQL, but it seems MySQL 4 had a restriction of using only one index per table involved in a query. There have been improvements since MySQL 5 (index merge), but I'm not sure whether they apply to your case. Again, EXPLAIN should tell you the truth.
Even where using 2 indexes per table is allowed (MySQL 5), using 2 separate indexes is generally slower than a compound index: 2 separate indexes require an index merge step, compared with the single pass over a compound index.
Multi Column indexes vs Index Merge might be helpful; it uses MySQL 5.4.2.
It's not the size of the indexes so much as the selectivity that determines whether the optimizer will use them.
My guess would be that it uses one index, then does a traditional lookup into the other index, and then filters; please check the execution plan. In short, you might be looping through two indexes in a nested loop. As I understand it, you should build a composite index on the columns used for filtering or joining, and then use an INCLUDE clause for the columns in the SELECT. I have never worked with MySQL, so this understanding is based on SQL Server 2005.
I have a SQL query that takes a very long time to run on MySQL (it takes several minutes). The query is run against a table that has over 100 million rows, so I'm not surprised it's slow. In theory, though, it should be possible to speed it up as I really only want to get back the rows from the large table (let's call it A) that have a reference in another table, B.
So my query is:
SELECT id FROM A, B where A.ref = B.ref;
(A has over 100 million rows; B has just a few thousand).
I've added INDEXes:
alter table A add index(ref);
alter table B add index(ref);
But it's still very slow (several minutes -- I'd be happy with one minute).
Unfortunately, I'm stuck with MySQL 4.1.22, so I can't use views.
I'd rather not copy all of the relevant rows from A into a separate, smaller table, as the rows that I need will change from time to time. On the other hand, at the moment that's the only solution I can think of.
Any suggestions welcome!
EDIT: Here's the output of running EXPLAIN on my query:
+----+-------------+------------------------+------+------------------------------------------+-------------------------+---------+------------------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------------+------+------------------------------------------+-------------------------+---------+------------------------------------------------+-------+-------------+
| 1 | SIMPLE | B | ALL | B_ref,ref | NULL | NULL | NULL | 16718 | Using where |
| 1 | SIMPLE | A | ref | A_REF,ref | A_ref | 4 | DATABASE.B.ref | 5655 | |
+----+-------------+------------------------+------+------------------------------------------+-------------------------+---------+------------------------------------------------+-------+-------------+
(In redacting my original query example, I chose to use "ref" as my column name, which happens to be the same as one of the types, but hopefully that's not too confusing...)
The query optimizer is probably already doing the best that it can, but in the unlikely event that it's reading the giant table (A) first, you can explicitly tell it to read B first using the STRAIGHT_JOIN syntax:
SELECT STRAIGHT_JOIN id FROM B, A where B.ref = A.ref;
From the answers, it seems like you're doing the most efficient thing you can with the SQL. The A table seems to be the big problem; how about splitting it into three individual tables, kind of like a local version of sharding? Alternatively, is it worth denormalising the B table into the A table, assuming B doesn't have too many columns?
Finally, you might just have to buy a faster box to run it on -- there's no substitute for horsepower!
Good luck.
SELECT id FROM A JOIN B ON A.ref = B.ref
You may be able to optimize further by using an appropriate type of join, e.g. LEFT JOIN.
http://en.wikipedia.org/wiki/Join_(SQL)