This is somewhat of a conceptual question about query optimization and speed: I am wondering which route would perform best and be the fastest.
Suppose I am using JFreeChart (this bit is somewhat irrelevant). The entire idea of using JFreeChart with a MySQL database is to query for two values, an X and a Y. Suppose the database is full of many different tables, and usually the X and the Y come from two different tables.
Would it be faster, in the query for the chart, to use joins and unions to get the two values, or to first create a table with the joined/unioned values and then run the chart queries against this new table (no joins or unions needed)? This would all be in one program, mind you.
So, overall: joins and unions to get the X and Y values, or create a temporary table joining the values and then query the chart data from that?
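For concreteness, here is a minimal sketch of the two approaches, assuming hypothetical tables measurements_x and measurements_y that share an id column (these names are not from the question):

    -- Approach 1: join at chart-query time
    SELECT x.x_value, y.y_value
    FROM measurements_x AS x
    JOIN measurements_y AS y ON y.id = x.id;

    -- Approach 2: materialize the joined data once, then query the flat table
    CREATE TABLE chart_data AS
    SELECT x.id, x.x_value, y.y_value
    FROM measurements_x AS x
    JOIN measurements_y AS y ON y.id = x.id;

    SELECT x_value, y_value FROM chart_data;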
It would, of course, be faster to pre-join the data and select from a single table than to perform a join. This assumes that you're saving one lookup per row and are properly using indexes in the first place.
However, even though you get a performance improvement from denormalizing in such a manner, it's not commonly done. A few of the reasons why it's not common include:
Redundant data takes up more space
With redundant data, you have to update both copies whenever something changes
JOINs are fast
JOINs over many rows can get faster (so they don't always need one lookup per row) with features such as the Batched Key Access joins introduced in MySQL 5.6, but BKA only helps with some queries, and you have to tell MySQL to use that join type; it's not automatic.
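For reference, a minimal sketch of turning on Batched Key Access in MySQL 5.6+ (BKA depends on Multi-Range Read, so both optimizer_switch flags are involved):

    -- Enable Multi-Range Read and Batched Key Access for the current session
    SET optimizer_switch = 'mrr=on,mrr_cost_based=off,batched_key_access=on';
    -- Joins that use BKA then show "Using join buffer (Batched Key Access)"
    -- in the Extra column of EXPLAIN output.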
Related
Assuming that I have 20L (20 lakh, i.e. 2 million) records:
Approach 1: Hold all 20L records in a single table.
Approach 2: Make 20 tables and put 1L (100,000) records into each.
Which is the best method to increase performance and why, or are there any other approaches?
Splitting a large table into smaller ones can give better performance -- it is called sharding when the tables are then distributed across multiple database servers -- but when you do it manually it is most definitely an antipattern.
What happens if you have 100 tables and you are looking for a row but you don't know which table has it? If you put an index on the tables, you'll need to do it 100 times. If somebody wants to join the data set, they might need to include 100 tables in the join for some use cases. You'd need to invent your own naming conventions and document and enforce them yourself, with no help from the database catalog. Backup, recovery, and all the other maintenance tasks would be a nightmare... just don't do it.
Instead just break up the table by partitioning it. You get 100% of the performance improvement that you would have gotten from multiple tables but now the database is handling the details for you.
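As a rough sketch (the table and column names are illustrative, not from the question), hash partitioning in MySQL looks like this:

    -- Split one logical table into 20 physical partitions by hashing the primary key.
    -- MySQL routes queries to the right partition(s) automatically.
    CREATE TABLE records (
        id BIGINT NOT NULL,
        payload VARCHAR(255),
        PRIMARY KEY (id)
    )
    PARTITION BY HASH(id)
    PARTITIONS 20;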
When you are looking for read performance, indexes are a great way to improve it. However, every index has to be maintained, so indexes can slow down writes.
So if read performance is what you need, prefer indexes.
A few things to keep in mind when creating an index:
Try to avoid NULL values in indexed columns.
Cardinality of the columns matters, but the column order of a composite index should be driven by your queries rather than by cardinality alone: put the columns your common WHERE clauses compare with equality ahead of columns used in range conditions.
The columns in the index should match your WHERE clause. For example, if you create an index on Col A and Col B but query on Col C, your index will not be used. So formulate your indexes according to your WHERE clauses (see the example after this list).
When in doubt about whether an index was used, run EXPLAIN to see which index was chosen.
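A minimal sketch with a hypothetical orders table (the names are illustrative, not from the answer above):

    -- Composite index whose leading column matches the common WHERE clause
    CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);

    -- Can use the index: the leading column customer_id is constrained
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';

    -- Cannot use the index efficiently: the leading column is not constrained
    EXPLAIN SELECT * FROM orders WHERE status = 'shipped';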
DB indexes can be a tricky subject for beginners, but imagining an index lookup as a tree traversal helps visualize the path traced when reading the data.
The best/easiest option is a single table with proper indexes. On 100K rows I had 30 s per query, but with an index I got 0.03 s per query.
When that no longer holds up, you split tables (for me it's when I got to millions of rows),
and preferably onto different servers.
You can then create a microservice that accesses all the servers and returns data to consumers as if there were only one database.
But once you do this you had better not have joins, because it gets messy replicating data across every database.
I would stick to the first method.
There are two big (millions of records) one-to-one tables:
course
prerequisite with a foreign key reference to the course table
in a single-node relational MySQL database. A join is needed to list the full description of all the courses.
An alternative is to have a single table containing both the course and prerequisite data in the same database.
Question: is the join query still slower than a simple SELECT without a join on the single denormalized table, even though both are on the same single-node MySQL database?
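A minimal sketch of the two alternatives, with illustrative column names (only the course and prerequisite table names come from the question):

    -- Normalized: join each course to its prerequisite row
    SELECT c.id, c.title, c.description, p.prerequisite_course_id
    FROM course AS c
    JOIN prerequisite AS p ON p.course_id = c.id;

    -- Denormalized: everything in one wide table, no join needed
    SELECT id, title, description, prerequisite_course_id
    FROM course_with_prerequisite;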
It's true that denormalization is often done to shorten the work to look up one record with its associated details. This usually means the query responds in less time.
But denormalization improves one query at the expense of other queries against the same data. Making one query faster will often make other queries slower. For example, what if you want to query the set of courses that have a given prerequisite?
It's also a risk when you use denormalization that you create data anomalies. For example, if you change a course name, you would also need to update all the places where it is named as a prerequisite. If you forget one, then you'll have a weird scenario where the obsolete name for a course is still used in some places.
How will you know you found them all? How much work in the form of extra queries will you have to do to double-check that you have no anomalies? Do those types of extra queries count toward making your database slower on average?
The purpose of normalizing a database is not performance. It's avoiding data anomalies, which reduces your work in other ways.
I have a huge database, and my task is to improve its performance to avoid timeout issues and minimize SELECT query durations.
Which areas do I need to concentrate on to improve the performance of stored procedures effectively?
How do sites like Facebook store huge amounts of data and still not suffer on performance?
What can be done to improve the performance of SPs?
Ninety percent of slow queries can be fixed by adding/rebuilding indexes. Make sure that you have indexes on all the tables involved, and that your join clause criteria match those index keys.
Note that adding indexes can have its own performance cost, however, especially when you insert records. But it's usually worth it.
If you want to improve stored procedure performance in SQL Server, I would recommend the three things below (see the sketch after this list):
Add SET NOCOUNT ON in the SP -- it can provide a significant performance boost, because network traffic is greatly reduced.
In the WHERE conditions, try to use columns that are indexed.
Verify the execution plan, and if you see excessive parallelism, try OPTION (MAXDOP N), where N is set as per your requirements.
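A minimal T-SQL sketch combining the first and third points (the procedure, table, and parameter names are hypothetical):

    -- Hypothetical procedure; dbo.Orders and @CustomerId are illustrative names
    CREATE PROCEDURE dbo.GetOrdersForCustomer
        @CustomerId INT
    AS
    BEGIN
        SET NOCOUNT ON;  -- suppress "N rows affected" messages to cut network chatter

        SELECT OrderId, OrderDate, Total
        FROM dbo.Orders
        WHERE CustomerId = @CustomerId   -- ideally backed by an index on CustomerId
        OPTION (MAXDOP 1);               -- cap parallelism for this statement if needed
    END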
The question is:
What factors affect the performance of multiple joins?
There are many things that can affect it negatively, but the usual suspects are below.
Lack of Index on the joined columns
Inefficient join orders for OUTER JOIN
Use of Subquery
Modification of search arguments or of the join column (e.g. A.intColumn + 1 = B.intColumn), which makes the predicate non-sargable (see the rewrite example below)
Clauses like ORDER BY will also impact performance in general.
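For the 'modification of the join column' point, a small illustrative rewrite (the table and column names are made up):

    -- If A is the big table being probed by index, wrapping A.intColumn in an
    -- expression prevents the index on A.intColumn from being used for the join:
    SELECT *
    FROM B
    JOIN A ON A.intColumn + 1 = B.intColumn;

    -- Rewritten so A.intColumn stays bare; the index on A.intColumn can now be
    -- probed with the value B.intColumn - 1:
    SELECT *
    FROM B
    JOIN A ON A.intColumn = B.intColumn - 1;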
(MySQL-centric answer)
JOINs are performed by tackling one table at a time. The optimizer picks which one it thinks is best to start with. Here are some criteria:
The table with the most filtering (WHERE ...) will probably be picked first.
If two tables look about the same, the smaller table will probably be picked first.
Something like that occurs when picking the 'next' table to use.
MySQL almost never uses more than one index per table in a SELECT (assuming there are no subqueries or UNIONs). A Composite INDEX is often useful. Sometimes a "covering" index is warranted.
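A minimal sketch of a covering index on a hypothetical orders table (the names are illustrative):

    -- The index contains every column the query touches, so MySQL can answer
    -- the query from the index alone (EXPLAIN shows "Using index")
    CREATE INDEX idx_orders_cust_date_total ON orders (customer_id, order_date, total);

    SELECT order_date, total
    FROM orders
    WHERE customer_id = 42;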
See my index cookbook.
Stored Routines do not help performance much -- unless you are accessing the server over a WAN. In that case, a SP cuts down on the number of roundtrips, thereby improving latency.
30K inserts per day? That is trivial. Where is the performance issue? On big SELECTs? Is this a Data Warehouse application? Do you have Summary Tables? They are the big performance boost (see the sketch after these questions).
Millions of rows? Or Billions?
Normalized? Over-normalized? (Do not normalize 'continuous' values such as FLOAT, DATE, etc.)
That's a lot of hand-waving. If you want some real advice, let's see a slow query.
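For the Summary Tables mentioned above, a minimal sketch of a daily rollup (the table and column names are hypothetical):

    -- One row per sensor per day instead of one row per raw reading
    CREATE TABLE daily_summary (
        sensor_id INT NOT NULL,
        report_day DATE NOT NULL,
        reading_count INT NOT NULL,
        reading_sum DOUBLE NOT NULL,
        PRIMARY KEY (sensor_id, report_day)
    );

    -- Refresh yesterday's rollup (run from a nightly job or MySQL event)
    INSERT INTO daily_summary (sensor_id, report_day, reading_count, reading_sum)
    SELECT sensor_id, DATE(taken_at), COUNT(*), SUM(reading)
    FROM raw_readings
    WHERE taken_at >= CURDATE() - INTERVAL 1 DAY
      AND taken_at < CURDATE()
    GROUP BY sensor_id, DATE(taken_at);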
In my experience, it all comes down to indexing. This is best illustrated by using an example. Suppose you have two tables T1 and T2 and you want to join them. Each table has only 1,000 rows in it. Without indexing, the query execution plan will take the cross product of the two tables and then iterate through it sequentially, filtering out the results that don't match the WHERE condition. For simplicity, let's just assume only one row matches the filter condition.
T1 X T2 = 1000 * 1000 = 1,000,000
Without indexing, filtering will require 1 million steps.
However, with indexes on the filter and join columns, only about 20 steps are required: roughly log2(1000) ≈ 10 steps of B-tree traversal to find the matching row in each table.
I'm working on optimizing a MySQL query that joins two tables together and has a few WHERE clauses and an ORDER BY.
I noticed (using EXPLAIN) that a temporary table is created during the evaluation of the query (since I'm grouping on a field in a table that isn't the first table in the join queue).
I'd really like to know if this temp table is being written to disk or not, which the explain results don't tell me.
It would also be nice to be able to tell what exactly is going into said temporary table. Some of the restrictions in my WHERE clause are on indexed columns and some aren't, so I think that MySQL might not be optimally picking rows into the temporary table.
Specifically, my query is basically of the form: select ... from a join b where ... with restrictions on both a and b on both indexed and non-indexed columns. The problem is that the number of rows going into the temp table selected from a is more than I suspect it should be. I want to investigate this.
All databases use a memory or work area to execute a query, and they will use temp tables in that memory area depending on how you built your query. If you're joining multiple tables, the server may use more than one temp table to build the final result set. Those temp tables usually exist only as long as the session is open.
EXPLAIN illustrates the plan the optimizer chose as it interprets your SQL. If you have a poorly indexed WHERE clause, or if you are filtering inside a join clause, the server can pull an excessive amount of data into memory while it executes and builds your final result set. This is what poor performance at the DB level looks like.
From the pseudo code in your last paragraph, I would say you need some indexes and a rewritten WHERE clause so the join runs on indexed fields. Post your SQL if you really want an opinion.
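On the original 'is the temp table written to disk?' question, one way to check in MySQL is to compare the session's temporary-table counters before and after running the query:

    -- Note the counters, run the query being investigated, then look again
    SHOW SESSION STATUS LIKE 'Created_tmp%';
    -- If Created_tmp_disk_tables increased, the temporary table spilled to disk.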
BACKGROUND
I'm working with a MySQL InnoDB database with 60+ tables, and I'm creating different views in order to make dynamic queries fast and easy in the code. I have a couple of views with INNER JOINs (without many-to-many relationships) of 20 to 28 tables, SELECTing 100 to 120 columns with a row count below 5,000, and they work lightning fast.
ACTUAL PROBLEM
I'm creating a master view with INNER JOINS (without many-to-many relationships) of 34 tables and SELECTING about 150 columns with row count below 5,000 and it seems like it's too much. It takes forever to do a single SELECT. I'm wondering if I hit some kind of view-size limit and if there is any way of increasing it, or any tricks that would help me pass through this apparent limit.
It's important to note that I'm NOT USING aggregate functions, because I know about their negative impact on performance, which, by the way, I'm very concerned about.
MySQL does not use the "System R algorithm" (used by PostgreSQL, Oracle, and SQL Server, I think), which considers not only different join algorithms (MySQL only has nested-loop joins, although you can fake a hash join by using a hash index), but also the possible orders of joining the tables and the possible index combinations. The result seems to be that parsing and execution of queries can be very quick up to a point, but performance can drop off dramatically once the optimizer chooses the wrong path through the data.
Take a look at your EXPLAIN plans and try to see whether (a) the drop in performance is due to the number of columns you are returning (just do SELECT 1 or something), or (b) it is due to the optimizer choosing a table scan instead of using an index.
A view is just a named query. When you refer to a view in MySQL, it just replaces the name with the actual query and runs it.
It seems that you are confusing it with materialized views, which are tables you create from a query. Afterwards you can query that table and do not have to run the original query again.
Materialized views are not implemented in MySQL.
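Since MySQL has no materialized views, a common workaround is to materialize the view by hand into a real table and refresh it periodically. A minimal sketch, assuming the existing view is called master_view (a made-up name):

    -- One-time build of a "materialized" copy of the view
    CREATE TABLE master_view_cache AS
    SELECT * FROM master_view;

    -- Periodic refresh (e.g. from a cron job or a MySQL event)
    TRUNCATE TABLE master_view_cache;
    INSERT INTO master_view_cache
    SELECT * FROM master_view;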
To improve performance, use the EXPLAIN keyword to see where you can optimize your query/view.