What is the optimal solution, use Inner Join or multiple queries? - mysql

What is the optimal solution, use Inner Join or multiple queries?
something like this:
SELECT * FROM brands
INNER JOIN cars
ON brands.id = cars.brand_id
or like this:
SELECT * FROM brands
... (while query)...
SELECT * FROM cars WHERE brand_id = [row(brands.id)]

Generally speaking, one query is better, but there are come caveats to that. For example, older versions of SQL Server had a great decreases in performance if you did more than seven joins. The answer will really depend on the database engine, version, query, schema, fields, etc., so we can't say for sure which is better. Always look into minimizing the number of queries when possible without going too overboard and creating result sets that are crazy or impossible to maintain.

This is a very subjective question but remember that each time you call the database there's significant overhead.
Almost without exception the optimum is to issue as few commands and pull out all the data you'll need. However for practical reasons this clearly may not be possible.
Generally speaking if a database is well maintained one query is quicker than two. If it's not you need to look at your data/indicies and determine why.
A final point, you're hinting in your second example that you'd load the brand then issue a command to get all the cars in each brand. This is without a doubt your worst option as it doesn't issue 2 commands - it issues N+1 where N is the number of brands you have... 100 brands is 101 DB hits!

Your two queries are not exactly the same.
The first returns all fields from brands and cars in one row. The second returns two different result sets that need to be combined together.
In general, it is better to do as many operations in the database as possible. The database is more efficient for processing large amounts of data. And, it generally reduces the amount of data being brought back to the client.
That said, there are a few circumstances where more data is being returned in a single query than with multiple queries. For instance in your example, if you have one brand record with 100 columns and 10,000 car records with three columns, then the two-query method is probably faster. You are only bringing back the columns from brands table once rather than 10,000 times.
These examples where multiple queries is better are few and far between. In general, it is better to do the processing in the database. If performance needs to be improved, then in a few rare cases, you might be able to break up queries and improve performance.

In general, use first query. Why? Because query execution time is not just query itself time, but also some overheads, such as:
Creating connection overhead
Network data sending overhead
Closing (handling) connection overhead
Depending of situation, some overheads may present or not. For example, if you're using persistent connection, then you'll not get connection overhead. But in common case that's not true, thus, it will have place. And creating/maintaining/closing connection overhead is very significant part. Imagine that you have this overhead as only 1% from total query time (in real situation it will be much more). And you have - let's say, 1.000.000 rows. Then first query will produce that overhead only once, while second will be 1.000.000/100 = 10.000 times. Just think about - how slow it will be.
Besides, INNER JOIN will also be done using key - if it exists, thus, in terms of query itself speed it will be near same as second. So I highly recommend to use INNER JOIN option.
Breaking complex query into simple queries may be useful in a very specific cases. For example, case with IN subquery. In this situation, if you're using WHERE id IN (subquery), where (subquery) is some SQL, MySQL will treat that as = ANY subquery and will not use key for that, even if subquery results in narrow list of ids. And - yes, split it into two queries may have sense since WHERE IN(static list) will work in another way - MySQL will use range index scan for that (strange, but true - because for IN (static list) statement IN will be treated as comparison operator, and not =ANY subquery qualifier). This part isn't directly about your case - but to show that - yes, cases, when splitting processing from DBMS may be useful in terms of performance - exist.

One query is better, because up to about 90% of the expense of executing a query is in the overheads:
communication traffic to/from database
syntax checking
authority checking
access plan calculation by optimizer
logging
locking (even read-only requires a lock)
lots of other stuff too
Do all that just once for one query, or do it all n times for n queries, but get the same data.

Related

why query on infomation_schema takes time?

I am running a very simple select query on some tables of information_schema but it always take too much of time.
For example:
select * from REFERENTIAL_CONSTRAINTS limit 3 ;
It takes around 34 seconds.
This query is very simple, no need table scan i think, no need of any condition etc. So why it takes too much of time.
some other tables in information_Schema also takes lot of time.
Thanks
The information_schema tables are not really tables. They are a mechanism that exposes server internals via the SQL interface. The responses to these queries are not from data that is "stored in a table" in any sense that you might expect -- it's "collected" each time the query is run.
The level of communication between the SQL layer and the lower layers that collect the data does not always support the optimizations you might expect; for example, the LIMIT here is most likely not making it down -- the entire table is being rendered internally and then all but the first three rows are discarded... so this query is probably just as slow with and without the limit.
Two general rules of thumb with information_schema -- which are really valid for all of SQL, but particularly here, are to select only the columns you need (not *, which will potentially require the server to do more work than necessary if you do not really need all the columns returned) and specify WHERE, both of which may reduce the amount of internal work being done.
Another potential performance killer is heavy-handed tweaking ("tuning") of server variables. Most variables on most systems need to be left alone often than they are. Some of them, like table_open_cache can even cause the server to perform worse, the more "optimally" you tune them.

MySQL Database design with multiple column or single column

Hi just a simple question
I need to store data to database, there are 2 option to show now
Data : a,b,c,d
1. store a,b,c,d in 1 column, when needed only query and perform splitting in application
2. store a,b,c,d to 4 different column, can query directly from database
Which option will be better? My concern is split it into 4 different column will make the tables contain many column, does it slow down the performance? And also I am curious is it possible the query is fast but the transfer of data to my application is slow?
MySQL performance is a complicated subject. To the issue you raised:
My concern is split it into 4 different column will make the tables
contain many column, does it slow down the performance?
there is nothing inherently worse, from a performance perspective, to have 4 columns, or 10, or 20, or 50.
Now, that being said, there are things that could impact performance, and probably will if you don't know about them. For example, if you SELECT * FROM {my_table} when really you only need to SELECT a FROM {my_table}... yeah, that'll impact your performance (although there are arguments to be made in favor of SELECT * FROM {my_table} depending on your caching strategy).
Likewise, you'll want to consider LIMIT statements. To your question
And also I am curious is it possible the query is fast but the
transfer of data to my application is slow?
Yes, of course. If you only need 50 rows and your table has 50000, you're gonna want to add limit clauses to your SQL statements, or you'll be sending a lot more data over the wire than you need to be. Memory is faster than disk and disk is faster than network. If you're sending a lot of data over the wire that you don't need, you better believe it's gonna cause performance problems. But again, keep in mind, that has nothing to do with how many columns you have. There is absolutely nothing inherent in the number of columns a table has that affects performance (at least not at the scale that you're talking about and in the way that you are thinking about it)
All of which to say, performance is a complex topic. You should take a look into it, if you're interested. And it sounds like a,b,c, and d are logically different columns, so you should probably go ahead and store them in different columns in MySQL. Hope this helps.

large single join queries vs multiple smaller ones

So we are building this app, where the retrieval of data is based on small, modular queries. So for a product it would be something like:
$product = $this->product->getProductData($prod_id); //get main product record
$locations = $this->locations->getAvailableLocations($prod_id); //sale locations
$comments = $this->feedback->getFeedback($prod_id,'COMMENTS'); //user comments
On the other hand we could also do something like: $this->getAllProductData($id)
which would essentially have an SQL that:
get * from product_data
left join locations on <...>
left join comments on <...>
From a programming perspective, the first option makes it much easier for us to handle data, mix and match build separate flows/user experience etc. Our concern is - from a performance perspective would this become an issue when the products run in hundreds of thousands of rows?
There's overhead associated with each execution of a SQL statement. Packets sent to the server, SQL text parsed, statement verified to syntactically correct (keywords, commas, parens, etc.), statement verified to be semantically correct (identifiers reference tables, columns, function, et al. exist and user has sufficient privilegs, evaluating possible execution plans and choosing optimum plan, executing the plan (obtaining locks, accessing data in buffers, etc.), materializing the resultset (metadata and values), and returning to caller, releasing locks, cleanup of resources, etc.. On the client side, there's the overhead of retrieving the resultset, fetching rows, and closing the statement.
In general, it's more efficient to retrieve the actual data that is needed with fewer statements, but not if that entials returning a whole slew of information that's not needed. If we only need 20 rows, then we add LIMIT 20 on the query. If we only need rows for a particular product_id, WHERE product_id = 42.
When we see tight loops of repeated execution of essentially the same statement, that's a tell tale sign that the developer is processing data RBER (row by excruciating row), rather than as a set.
Bottom line, it depends on the use case. Sometimes, it's more efficient to run a couple of smaller statements in place of one humongous statement.
Use your second example (all joins in one query). As long as you have an index on "prod_id" and anything else you're filtering on or joining on, the database query optimizer will do smart things, such as seeing that prod_id will only return a few records and that doing that first will make the query about as fast as it could possibly be. Query Optimizers are very, very good at this in general.
For simple joins you are probably good with one sql statement, but from my experience, multiple separate queries are better for performance when more tables are involved.
I have worked on a website where the product information was scattered across seven different tables and usually there were only two or three tables that needed to be joined. However on one page we had a complicated search function that needed to look at all seven of them, so on that page we wrote the code to join all the tables in one statement. It worked fine until a few months later, it gradually got slower and slower until it just wouldn't load altogether.
We went through all the tables and made sure everything was indexed properly and nothing was fixing it. We noticed that the sql statements were all running fine by themselves, so we ended up splitting it up into separate statements and that fixed it, haven't had to go back and look at it again.

Is SELECT * efficient than selecting particular columns? [duplicate]

Why is SELECT * bad practice? Wouldn't it mean less code to change if you added a new column you wanted?
I understand that SELECT COUNT(*) is a performance problem on some DBs, but what if you really wanted every column?
There are really three major reasons:
Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.
Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.
Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.
But it's not all bad for SELECT *. I use it liberally for these use cases:
Ad-hoc queries. When trying to debug something, especially off a narrow table I might not be familiar with, SELECT * is often my best friend. It helps me just see what's going on without having to do a boatload of research as to what the underlying column names are. This gets to be a bigger "plus" the longer the column names get.
When * means "a row". In the following use cases, SELECT * is just fine, and rumors that it's a performance killer are just urban legends which may have had some validity many years ago, but don't now:
SELECT COUNT(*) FROM table;
in this case, * means "count the rows". If you were to use a column name instead of * , it would count the rows where that column's value was not null. COUNT(*), to me, really drives home the concept that you're counting rows, and you avoid strange edge-cases caused by NULLs being eliminated from your aggregates.
Same goes with this type of query:
SELECT a.ID FROM TableA a
WHERE EXISTS (
SELECT *
FROM TableB b
WHERE b.ID = a.B_ID);
in any database worth its salt, * just means "a row". It doesn't matter what you put in the subquery. Some people use b's ID in the SELECT list, or they'll use the number 1, but IMO those conventions are pretty much nonsensical. What you mean is "count the row", and that's what * signifies. Most query optimizers out there are smart enough to know this. (Though to be honest, I only know this to be true with SQL Server and Oracle.)
The asterisk character, "*", in the SELECT statement is shorthand for all the columns in the table(s) involved in the query.
Performance
The * shorthand can be slower because:
Not all the fields are indexed, forcing a full table scan - less efficient
What you save to send SELECT * over the wire risks a full table scan
Returning more data than is needed
Returning trailing columns using variable length data type can result in search overhead
Maintenance
When using SELECT *:
Someone unfamiliar with the codebase would be forced to consult documentation to know what columns are being returned before being able to make competent changes. Making code more readable, minimizing the ambiguity and work necessary for people unfamiliar with the code saves more time and effort in the long run.
If code depends on column order, SELECT * will hide an error waiting to happen if a table had its column order changed.
Even if you need every column at the time the query is written, that might not be the case in the future
the usage complicates profiling
Design
SELECT * is an anti-pattern:
The purpose of the query is less obvious; the columns used by the application is opaque
It breaks the modularity rule about using strict typing whenever possible. Explicit is almost universally better.
When Should "SELECT *" Be Used?
It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved, as opposed to every column that existed when the query was written. The database will internally expand the * into the complete list of columns - there's no performance difference.
Otherwise, explicitly list every column that is to be used in the query - preferably while using a table alias.
Even if you wanted to select every column now, you might not want to select every column after someone adds one or more new columns. If you write the query with SELECT * you are taking the risk that at some point someone might add a column of text which makes your query run more slowly even though you don't actually need that column.
Wouldn't it mean less code to change if you added a new column you wanted?
The chances are that if you actually want to use the new column then you will have to make quite a lot other changes to your code anyway. You're only saving , new_column - just a few characters of typing.
If you really want every column, I haven't seen a performance difference between select (*) and naming the columns. The driver to name the columns might be simply to be explicit about what columns you expect to see in your code.
Often though, you don't want every column and the select(*) can result in unnecessary work for the database server and unnecessary information having to be passed over the network. It's unlikely to cause a noticeable problem unless the system is heavily utilised or the network connectivity is slow.
If you name the columns in a SELECT statement, they will be returned in the order specified, and may thus safely be referenced by numerical index. If you use "SELECT *", you may end up receiving the columns in arbitrary sequence, and thus can only safely use the columns by name. Unless you know in advance what you'll be wanting to do with any new column that gets added to the database, the most probable correct action is to ignore it. If you're going to be ignoring any new columns that get added to the database, there is no benefit whatsoever to retrieving them.
In a lot of situations, SELECT * will cause errors at run time in your application, rather than at design time. It hides the knowledge of column changes, or bad references in your applications.
Think of it as reducing the coupling between the app and the database.
To summarize the 'code smell' aspect:
SELECT * creates a dynamic dependency between the app and the schema. Restricting its use is one way of making the dependency more defined, otherwise a change to the database has a greater likelihood of crashing your application.
If you add fields to the table, they will automatically be included in all your queries where you use select *. This may seem convenient, but it will make your application slower as you are fetching more data than you need, and it will actually crash your application at some point.
There is a limit for how much data you can fetch in each row of a result. If you add fields to your tables so that a result ends up being over that limit, you get an error message when you try to run the query.
This is the kind of errors that are hard to find. You make a change in one place, and it blows up in some other place that doesn't actually use the new data at all. It may even be a less frequently used query so that it takes a while before someone uses it, which makes it even harder to connect the error to the change.
If you specify which fields you want in the result, you are safe from this kind of overhead overflow.
I don't think that there can really be a blanket rule for this. In many cases, I have avoided SELECT *, but I have also worked with data frameworks where SELECT * was very beneficial.
As with all things, there are benefits and costs. I think that part of the benefit vs. cost equation is just how much control you have over the datastructures. In cases where the SELECT * worked well, the data structures were tightly controlled (it was retail software), so there wasn't much risk that someone was going to sneek a huge BLOB field into a table.
Reference taken from this article.
Never go with "SELECT *",
I have found only one reason to use "SELECT *"
If you have special requirements and created dynamic environment when add or delete column automatically handle by application code. In this special case you don’t require to change application and database code and this will automatically affect on production environment. In this case you can use “SELECT *”.
Generally you have to fit the results of your SELECT * ... into data structures of various types. Without specifying which order the results are arriving in, it can be tricky to line everything up properly (and more obscure fields are much easier to miss).
This way you can add fields to your tables (even in the middle of them) for various reasons without breaking sql access code all over the application.
Using SELECT * when you only need a couple of columns means a lot more data transferred than you need. This adds processing on the database, and increase latency on getting the data to the client. Add on to this that it will use more memory when loaded, in some cases significantly more, such as large BLOB files, it's mostly about efficiency.
In addition to this, however, it's easier to see when looking at the query what columns are being loaded, without having to look up what's in the table.
Yes, if you do add an extra column, it would be faster, but in most cases, you'd want/need to change your code using the query to accept the new columns anyways, and there's the potential that getting ones you don't want/expect can cause issues. For example, if you grab all the columns, then rely on the order in a loop to assign variables, then adding one in, or if the column orders change (seen it happen when restoring from a backup) it can throw everything off.
This is also the same sort of reasoning why if you're doing an INSERT you should always specify the columns.
Selecting with column name raises the probability that database engine can access the data from indexes rather than querying the table data.
SELECT * exposes your system to unexpected performance and functionality changes in the case when your database schema changes because you are going to get any new columns added to the table, even though, your code is not prepared to use or present that new data.
There is also more pragmatic reason: money. When you use cloud database and you have to pay for data processed there is no explanation to read data that you will immediately discard.
For example: BigQuery:
Query pricing
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed.
and Control projection - Avoid SELECT *:
Best practice: Control projection - Query only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.
Understand your requirements prior to designing the schema (if possible).
Learn about the data,
1)indexing
2)type of storage used,
3)vendor engine or features; ie...caching, in-memory capabilities
4)datatypes
5)size of table
6)frequency of query
7)related workloads if the resource is shared
8)Test
A) Requirements will vary. If the hardware can not support the expected workload, you should re-evaluate how to provide the requirements in the workload. Regarding the addition column to the table. If the database supports views, you can create an indexed(?) view of the specific data with the specific named columns (vs. select '*'). Periodically review your data and schema to ensure you never run into the "Garbage-in" -> "Garbage-out" syndrome.
Assuming there is no other solution; you can take the following into account. There are always multiple solutions to a problem.
1) Indexing: The select * will execute a tablescan. Depending on various factors, this may involve a disk seek and/or contention with other queries. If the table is multi-purpose, ensure all queries are performant and execute below you're target times. If there is a large amount of data, and your network or other resource isn't tuned; you need to take this into account. The database is a shared environment.
2) type of storage. Ie: if you're using SSD's, disk, or memory. I/O times and the load on the system/cpu will vary.
3) Can the DBA tune the database/tables for higher performance? Assumming for whatever reason, the teams have decided the select '*' is the best solution to the problem; can the DB or table be loaded into memory. (Or other method...maybe the response was designed to respond with a 2-3 second delay? --- while an advertisement plays to earn the company revenue...)
4) Start at the baseline. Understand your data types, and how results will be presented. Smaller datatypes, number of fields reduces the amount of data returned in the result set. This leaves resources available for other system needs. The system resources are usually have a limit; 'always' work below these limits to ensure stability, and predictable behaviour.
5) size of table/data. select '*' is common with tiny tables. They typically fit in memory, and response times are quick. Again....review your requirements. Plan for feature creep; always plan for the current and possible future needs.
6) Frequency of query / queries. Be aware of other workloads on the system. If this query fires off every second, and the table is tiny. The result set can be designed to stay in cache/memory. However, if the query is a frequent batch process with Gigabytes/Terabytes of data...you may be better off to dedicate additional resources to ensure other workloads aren't affected.
7) Related workloads. Understand how the resources are used. Is the network/system/database/table/application dedicated, or shared? Who are the stakeholders? Is this for production, development, or QA? Is this a temporary "quick fix". Have you tested the scenario? You'll be surprised how many problems can exist on current hardware today. (Yes, performance is fast...but the design/performance is still degraded.) Does the system need to performance 10K queries per second vs. 5-10 queries per second. Is the database server dedicated, or do other applications, monitoring execute on the shared resource. Some applications/languages; O/S's will consume 100% of the memory causing various symptoms/problems.
8) Test: Test out your theories, and understand as much as you can about. Your select '*' issue may be a big deal, or it may be something you don't even need to worry about.
There's an important distinction here that I think most answers are missing.
SELECT * isn't an issue. Returning the results of SELECT * is the issue.
An OK example, in my opinion:
WITH data_from_several_tables AS (
SELECT * FROM table1_2020
UNION ALL
SELECT * FROM table1_2021
...
)
SELECT id, name, ...
FROM data_from_several_tables
WHERE ...
GROUP BY ...
...
This avoids all the "problems" of using SELECT * mentioned in most answers:
Reading more data than expected? Optimisers in modern databases will be aware that you don't actually need all columns
Column ordering of the source tables affects output? We still select and
return data explicitly.
Consumers can't see what columns they receive from the SQL? The columns you're acting on are explicit in code.
Indexes may not be used? Again, modern optimisers should handle this the same as if we didn't SELECT *
There's a readability/refactorability win here - no need to duplicate long lists of columns or other common query clauses such as filters. I'd be surprised if there are any differences in the query plan when using SELECT * like this compared with SELECT <columns> (in the vast majority of cases - obviously always profile running code if it's critical).

RDBMS for extremely large data sets - what are people using?

I have to perform some serious data mining on very large data sets stored in MySQL db. However, queries that require a bit more than a basic SELECT * FROM X WHERE ... tend to become rather inefficient since they return results on the order of 10e6 or more, especially when JOIN on one or more tables is introduced - think of joining 2 or more tables containing several tens of millions rows (after filtering data), which is something that pretty much happens on every query. More than often we'd like to run aggregate functions on these (sum, avg, count, etc), but this is impossible since MySQL simply chokes.
I should note that many efforts were put to optimize the current performance - all tables are indexed properly and queries are tuned, the hardware is top notch, the storage engine was configured and so on. However, still each query takes very long - to the point where "let's run it before we go home and hope for the best when we come to work tomorrow." Not good.
This has to be a solvable problem - many large companies perform very data and computational intensive mining, and handle it well (without writing their own storage engines, google). I'm willing to accept time penalty to get the job done, but on the order of hours, not days. My question is - what do people use to counter problems like this? I've heard of storage engines geared to this type of problem (greenplum, etc.), but I wanted to hear how this problem is typically approached. Our current data store is obviously relational and should probably remain such, but any thoughts or suggestions are welcome. Thanks.
I suggest PostgreSQL, which I've been working with quite successfully on tables with ~0.5B rows that required some complex join operations. Oracle should be good for that too, but I don't have much experience with it.
It should be noted that switching an RDBMS isn't a magic solution, if you want to scale to those sizes there's a LOT of hard work to be done in optimizing your queries, optimizing the database structure and indexes, fine tuning the database configuration, using the right hardware for your usage, replication, using materialized views (which are extremely powerful when used correctly. see here and here - its postgres specific, but applies to other RDBMSs too)... and at some point, you just have to throw more money on the problem.
edited fixed some weird typos (useless android auto correct...) and added some resources about materialized views
We have used MS SqlServer to run analytics on financial data with ten of millions of rows and more using complex JOIN and aggregation. Several things that we have done other than what you have mentioned are:
We chunk the calculation into a lot of temporary tables instead of using sub-query. These tables then we apply proper keys, indexing and so on via the code. Query with sub-query just fails for us
In the temporary tables, we often apply the clustered index that makes sense for us. Note that this temporary tables are filtered results so applying the index on the fly is not expensive compared to use the sub query in place of this temporary tables. Note I am speaking from our experience and might not apply to all cases
As we have done a lot of aggregation function as well, we did a lot indexing on the group columns
We do a lot of query run planning using SQL Query Analyzer that shows us the execution plan. Based on the plan, we revised the query, change the index
We provide hints for the SQL Server that we think could help the execution such as the choice of JOIN Algorithm to take (Hash, Merged or Nested)