I have two tables that I want to store separately. However, in my application I regularly need to perform a UNION on them, so I effectively need to treat them as one. I suspect this is not good in terms of performance. I was thinking about creating the following view:
CREATE VIEW merged_tables AS  -- columns come from both tables
SELECT * FROM table_a
UNION
SELECT * FROM table_b
How does this impact the performance of a SELECT query? Does it have any real effect (is the internal representation different), or is it just a matter of using a simpler query to select?
Using a UNION inside a view will perform no differently than running the same UNION as a standalone SELECT statement. Since you are using SELECT * on both tables and require no inner complexity (e.g. no JOINs or WHERE clauses), there won't really be any way to optimize it further. (One note: if the two tables can never contain identical rows, UNION ALL will be cheaper than UNION, because it skips the duplicate-elimination step.)
However, if the tables do hold similar data and you only want to keep them logically separated, you might consider storing it all in one table with a boolean column that indicates whether a row would otherwise have been a resident of table_a or table_b. You keep a way to tell the rows apart, avoid the added confusion of the UNION, and performance isn't significantly impacted either.
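For example, a minimal sketch of that design (the payload column is hypothetical, standing in for whatever fields the two tables share):

CREATE TABLE merged (
  id           INT PRIMARY KEY,
  payload      VARCHAR(255),                  -- stands in for the shared columns
  from_table_b BOOLEAN NOT NULL DEFAULT FALSE -- TRUE for rows that would have lived in table_b
);

-- What used to be "SELECT * FROM table_a":
SELECT id, payload FROM merged WHERE from_table_b = FALSE;

-- And the old UNION becomes a plain scan:
SELECT id, payload FROM merged;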
It is just a matter of using a simpler query to select. The view will be no faster or slower than the UNION written out by hand, and in most cases the cost of querying the union should not be much worse than if you kept all the data in a single table.
If the tables are really the same structure, you could also consider another design in which you used a single table for storage and two views to logically separate the records:
CREATE VIEW table_a AS SELECT * FROM table_all WHERE rec_type = 'A'
CREATE VIEW table_b AS SELECT * FROM table_all WHERE rec_type = 'B'
Because these are single-table, non-aggregated views, you can use them like tables in INSERT, UPDATE, DELETE, and SELECT statements, while keeping the option of programming against table_all when that makes sense. The advantage here over your solution is that you can write against either the table_a/table_b entities or the table_all entity, and you don't have two physical tables to maintain.
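For instance, a sketch with a hypothetical payload column; adding WITH CHECK OPTION makes MySQL reject writes through a view whose rec_type doesn't match its filter:

CREATE TABLE table_all (
  id       INT PRIMARY KEY,
  payload  VARCHAR(255),
  rec_type CHAR(1) NOT NULL  -- 'A' or 'B'
);

CREATE OR REPLACE VIEW table_a AS
  SELECT * FROM table_all WHERE rec_type = 'A'
  WITH CHECK OPTION;

-- Both statements write straight through to table_all:
INSERT INTO table_a (id, payload, rec_type) VALUES (1, 'hello', 'A');
UPDATE table_a SET payload = 'updated' WHERE id = 1;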
Related
I have a MySQL table in which I'm searching a pair of columns for a single value. It's quite a large table, so I want the search to be as fast as possible.
I have simplified the tables below for ease of understanding
SELECT * FROM clients WHERE name=? OR sirname=?
VS
SELECT * FROM clients WHERE ? IN (name, sirname)
with indexes on name and sirname
EXPLAIN shows the former using the indexes, but not the latter.
Is that accurate, or is there some optimisation going on under the hood which EXPLAIN doesn't catch?
Strongly related to Checking multiple columns for one value, but cannot discuss there due to age of thread.
Because MySQL generally uses only one index per table reference in a query (the index_merge optimization is a narrow exception), you will probably have to do it this way:
SELECT * FROM clients WHERE name=?
UNION
SELECT * FROM clients WHERE sirname=?
This will count as two table references for purposes of index selection. The appropriate index will be used in each case.
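A quick sketch of how you might verify that, assuming a single-column index on each field:

CREATE INDEX idx_name    ON clients (name);
CREATE INDEX idx_sirname ON clients (sirname);

EXPLAIN
SELECT * FROM clients WHERE name = 'Smith'
UNION
SELECT * FROM clients WHERE sirname = 'Smith';
-- Each branch of the UNION should report ref access on its own index.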
I would like to ask whether there is any performance advantage of one over the other.
Here is an example:
-- suppose I want to retrieve 10000 different records
-- (note: `from` is a reserved word in MySQL, so it must be backtick-quoted)
select *
from table_a
where `from` in (1,2,3,4,5,6 ... 10000)

-- alternatively
select *
from table_a
where `from`=1 or `from`=2 or `from`=3 ... `from`=10000
compared to
select * from table_a where `from`=1
select * from table_a where `from`=2
select * from table_a where `from`=3
.
.
select * from table_a where `from`=10000
What are the scenarios in which one will outperform the other?
The WHERE clause is simplified here; it may have nested AND and OR clauses.
There are many factors involved beyond your simple example.
For your exact example, one query is better than thousands, because the query is simple and filters on a single field. The main costs are network round trips and physical and/or logical reads.
But once you have more WHERE conditions, and especially joins, which form is better becomes questionable.
It also depends on the actual tables, their relationships, index design, join types, table sizes, and so on...
As a general rule, one SQL statement is better in most cases, but other factors can matter much more than that.
Everything starts with very careful database design; mistakes there (which happen quite often) cost a lot later.
Usually many separate queries only win when the database was designed badly.
I have a query which gets data by joining 3 big tables (~1M records each); in addition, they are very busy tables.
Is it better to do the traditional joins, or rather to first fetch the values from the first table and then run a secondary query, passing the retrieved values in a comma-delimited IN clause?
Option #1
SELECT *
FROM BigTable1 a
INNER JOIN BigTable2 b using(someField2)
INNER JOIN BigTable3 c using(someField3)
WHERE a.someField1 = 'value'
vs
Option #2
$values = SELECT someField2 FROM BigTable1 WHERE someField1 = 'value'; # (~20-200 values)
SELECT *
FROM BigTable2
INNER JOIN BigTable3 c using(someField3)
WHERE someField2 in ($values)
Option #3
create a temp table to store these values from BigTable1,
and use it instead of joining to BigTable1 directly (see the sketch below)
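For concreteness, a sketch of what I mean by Option #3, assuming someField2 links BigTable1 to BigTable2 and someField3 links BigTable2 to BigTable3:

CREATE TEMPORARY TABLE tmp_ids
  SELECT someField2 FROM BigTable1 WHERE someField1 = 'value';

SELECT b.*, c.*
FROM tmp_ids t
INNER JOIN BigTable2 b USING (someField2)
INNER JOIN BigTable3 c USING (someField3);

DROP TEMPORARY TABLE tmp_ids;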
any other option?
I think the best option is to try both approaches and run EXPLAIN on them.
One optimization you could make would be to use a stored procedure for the second approach, which would reduce the time/overhead of having to run two queries from the client.
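A minimal sketch of that idea; the procedure name is made up, and it simply wraps the second approach server-side:

DELIMITER //
CREATE PROCEDURE fetch_big_rows(IN p_value VARCHAR(64))
BEGIN
  -- one round trip: the lookup and the IN-based query run in a single call
  SELECT b.*, c.*
  FROM BigTable2 b
  INNER JOIN BigTable3 c USING (someField3)
  WHERE b.someField2 IN
        (SELECT someField2 FROM BigTable1 WHERE someField1 = p_value);
END //
DELIMITER ;

CALL fetch_big_rows('value');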
Also, joining is quite an expensive operation for very large tables, since you're essentially projecting and selecting over 1M × 1M rows (see: What are projection and selection?).
There is no definitive answer to your question, and you should profile both approaches, since the outcome depends on multiple factors.
However, the first approach is the one usually taken, and it should be faster if all of the tables are correctly indexed and the row sizes are ordinary.
Also take into account that with the second approach the network latency will be far worse, since you need multiple round trips to the DB.
I have a dozen tables with the same structure. All of their names match question_20%. Each table has an indexed column named loaded which can have values of 0 and 1.
I want to count all of the records where loaded = 1. If I had only one table, I would run select count(*) from question_2015 where loaded = 1.
Is there a query I can run that finds the tables in INFORMATION_SCHEMA.TABLES, sums over all of these counts, and produces a single output?
You can do what you want with dynamic SQL.
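For example, a rough sketch of that dynamic SQL in MySQL, building one UNION ALL statement out of the matching table names (treat it as a starting point, not production code):

SELECT GROUP_CONCAT(
         CONCAT('SELECT COUNT(*) AS cnt FROM `', table_name, '` WHERE loaded = 1')
         SEPARATOR ' UNION ALL ')
INTO @sql
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name LIKE 'question\_20%';

SET @sql = CONCAT('SELECT SUM(cnt) AS total FROM (', @sql, ') AS t');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;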
However, you have a problem with your data structure. Having multiple parallel tables is usually a very bad idea. SQL handles very large tables well, so keeping all the information in one table is a great convenience, both for querying (as you are now learning) and for maintainability.
SQL offers indexes and partitioning schemes for addressing performance issues on large tables.
Sometimes separate tables are necessary to meet particular system requirements. If so, then a view should be available to combine all the tables:
create view v_tables as
select t1.*, 'table1' as which from table1 union all
select t2.*, 'table2' as which from table2 union all
. . .
If you had such a view, then your query would simply be:
select which, count(*)
from v_tables
where loaded = 1
group by which;
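And if you want the single total the question asks for, rather than per-table counts, drop the GROUP BY:

select count(*)
from v_tables
where loaded = 1;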
To make the system more efficient, should we reduce the number of database IO operations or reduce the size of each data operation?
More specifically, suppose I want to retrieve objects 60-70 of a ranked list.
1st approach:
By joining several tables I get one huge result set. I sort it on some attributes and return the top 70 objects with all of their attributes, even though I only use objects 60-70.
2nd approach:
By joining fewer tables and sorting, I get just the top 70 objects' ids, and then I do a second lookup for objects 60-70 based on their ids.
So which one is more efficient, especially for MySQL?
It will depend on how you design your query.
Usually JOIN operations are more efficient than using IN (group) or nested SELECTs, but when joining 3 or more tables you have to choose the join order carefully to optimize it.
And of course, every join should involve a PRIMARY KEY (or at least an indexed column).
If the query remains too slow despite your efforts, you should use a cache: a new table, or even a file, that stores the results of this query until a given expiration time, at which point it is refreshed.
This is a common practice when the results of a heavy query are needed frequently in the system.
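A sketch of such a cache table, with hypothetical names (refreshing it is left to a scheduled job):

CREATE TABLE top_objects_cache (
  rank_pos     INT PRIMARY KEY,  -- 1 = best
  object_id    INT NOT NULL,
  refreshed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- After each refresh, reading objects 60-70 is trivial:
SELECT object_id
FROM top_objects_cache
WHERE rank_pos BETWEEN 60 AND 70;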
You can always use MySQL Workbench to measure the speed of your queries and play with your options.
Ordinarily, the best way to take advantage of query optimization is to combine the two approaches you present.
SELECT col, col, col, col, etc
FROM tab1
JOIN tab2 ON col = col
JOIN tab3 ON col = col
WHERE tab1.id IN
( SELECT DISTINCT tab1.id
  FROM whatever
  JOIN whatever ON col = col
  WHERE whatever
  ORDER BY col DESC
  LIMIT 70
)
See how that goes? You make a subquery to select the IDs, then use it in the main query.
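One caveat: MySQL (through 8.0, as far as I know) rejects LIMIT inside an IN subquery with error 1235, so in practice you would join against the subquery as a derived table instead. A sketch with hypothetical orders and customers tables:

SELECT c.name, o.total, o.created_at
FROM ( SELECT o2.id
       FROM orders o2
       ORDER BY o2.created_at DESC
       LIMIT 70
     ) AS top_ids
JOIN orders o ON o.id = top_ids.id
JOIN customers c ON c.id = o.customer_id;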