Which is faster: a single big query or a few small queries? - mysql

So I want to grab data from the database. Which method is faster: creating several queries or one multi-query?

Each "round trip" to the database will have some overhead. So the fewer round-trips, the less overhead. Consider also that fewer requests means fewer packets from client to server. If the result of the consolidated query gives you just what you want, then single query is the way to go. If your single query is returning extra or redundant data (perhaps because of de-normalization) then the overhead savings of a single round trip may be lost in the extra data transferred.
Another consideration is latency. If the queries have to be completed in sequence because some part of the output of one is needed in the input of the next, consolidating into one query will cut out all the network latencies in between all the individual smaller queries, so a final result can be delivered faster. However, if the smaller queries are independent of each other, launching them in parallel can get all the results delivered faster, albeit less efficiently.
Bottom line: the answer depends on the specifics of your situation. The best way to get an answer will probably be to implement both ways, test, and compare the resource usage of each implementation.
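For example, a minimal sketch of the "dependent queries" case (the orders/customers tables and columns are made up for illustration):

-- two round trips: the second query depends on the value returned by the first
SELECT customer_id FROM orders WHERE order_id = 1001;
SELECT name, email FROM customers WHERE id = ?;  -- the customer_id returned above

-- one round trip: the dependency is resolved inside the server
SELECT c.name, c.email
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.order_id = 1001;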

In general the multi-query would be faster, but it depends on a lot of things such as the hardware, the data structure, etc.
Each connection does take a little time, though.

Related

Do queries per second in MySQL really matter?

For example, say you have a webpage that performs 10 writes (5ms each, so each query is relatively inexpensive) to a MySQL database every time it is visited. Would that be the same as a page load that performs a single 50ms write instead? What I'm basically asking is: do queries per second really matter? Will they bottleneck my database faster?
If I had a pretty large database, is there any difference between writing 10,000 inexpensive queries per second and 1,000 more expensive ones?
Besides the actual time spent executing a query, there is an overhead for your Rails application to establish a connection and close it.
Take your example of comparing one write of a certain amount of data versus 10 writes, each one-tenth the amount of data.
In the former case, only one database connection is needed, putting little stress on the connection pool. The write completes in 50ms, and during that time 9 other writes could also come in and execute. On the other hand, in the latter case, 10 connections from the pool must be used for the same data alone. And likely the amount of time needed for each of the 10 writes would be greater than 5ms, due to the overhead of establishing the connection.
Databases, like anything else, can benefit from economy of scale. Of course, there are instances when you simply need so many connections happening. But, all other factors being the same, if your business logic can be done with one large write rather than ten smaller ones, you might want to go for the single write.
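As a rough sketch of the two write patterns being compared (the page_hits table is invented for illustration):

-- ten separate statements, each paying its own round-trip, parsing and commit overhead
INSERT INTO page_hits (page_id, hit_at) VALUES (1, NOW());
INSERT INTO page_hits (page_id, hit_at) VALUES (2, NOW());
-- ... eight more like these ...

-- one consolidated statement carrying the same data
INSERT INTO page_hits (page_id, hit_at) VALUES
(1, NOW()), (2, NOW()), (3, NOW()), (4, NOW()), (5, NOW()),
(6, NOW()), (7, NOW()), (8, NOW()), (9, NOW()), (10, NOW());

If the writes genuinely cannot be merged into one statement, wrapping them in a single transaction (START TRANSACTION ... COMMIT) at least lets them share one commit.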

How bad is it to have "extra" database queries?

I come from the front-end world in web development where we try really hard to limit the number of HTTP requests issued (by consolidating css, js files, images, etc.).
With db connections (MySQL), obviously you don't want to have unnecessary connections, but as a general rule, how bad is it to have multiple small queries? (they execute quickly)
I ask because I'm moving my application to a clustered environment and where before I was caching some stuff in server memory (as I was running on a single server), I am now trying to make my app "stateless" and in my current implementation that means more small db calls. This will help me with load balancing (avoiding sticky sessions) and also keep server memory usage down.
We're not talking a ton of queries, maybe 6-8 db calls instead of 2-4, returning anywhere from a handful of records to a few thousand records. Each of them executes quickly, less than 30ms (some much less), but I don't know if there is some "connection latency" I should be concerned about.
Thanks for your insight.
Short answer: (1) make sure you're staying at the same big-O level, reuse connections, measure performance; (2) think about how much you care about data consistency.
Long answer:
Performance
Strictly from a performance perspective, and generally speaking, unless you are already close to maxing out your database resources (such as max connections), this is not likely to have a major impact. But there are certain things you should keep in mind:
do the "6-8" queries that replace "2-4" queries stay in the same execution time? e.g. if current database interaction is at O(1) is it going to change to O(n)? Or current O(n) going to change to O(n^2)? If yes, you should think about what that means for your application
most application servers can reuse existing database connections, or have persistent database connection pools; make sure your application does not establish a new connection for every query; otherwise this is going to make it even more inefficient
in many common cases, mainly on larger tables with complex indexes and joins, doing few queries by primary keys may be more efficient than joining those tables in a single query; this would be the case if, while doing such joins, the server not only takes longer to perform the complex query, but also blocks other queries against affected tables
Generally speaking about performance, the rule of thumb is - always measure.
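As a sketch of that primary-key point (the users/profiles tables are invented):

-- single query joining two large tables
SELECT u.*, p.*
FROM users u
JOIN profiles p ON p.user_id = u.id
WHERE u.id = 42;

-- two cheap indexed lookups that may avoid a costly join plan
SELECT * FROM users WHERE id = 42;
SELECT * FROM profiles WHERE user_id = 42;

Which form wins depends on the schema and the data, which is exactly why measuring both is the right move.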
Consistency
Performance is not the only aspect to consider, however. Also think about how much you care about data consistency in your application.
For example, consider a simple case - tables A and B that have one-to-one relationship and you are querying for a single record using a primary key. If you join those tables and retrieve result using a single query, you'll either get a record from both A and B, or no records from either, which is what your application expects too. Now consider if you split that up into 2 queries (and you're not using transactions with preferred isolation levels) - you get a record from table A, but before you could grab the matching record from table B, it is deleted/updated by another process. Now your application has a record from A but none from B.
General question here is - do you care about ACID compliance of your relational data as it pertains to the queries you are breaking apart? If the answer is yes, you must think about how your application logic will react in these specific cases.
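To make that concrete, here is a sketch (tables A and B stand in for real tables, with B.a_id referencing A.id):

-- single query: the rows from A and B come from one consistent read
SELECT a.*, b.*
FROM A a
JOIN B b ON b.a_id = a.id
WHERE a.id = 42;

-- split into two queries: without a transaction, B's row may be changed or deleted in between
START TRANSACTION;  -- with InnoDB's default REPEATABLE READ isolation
SELECT * FROM A WHERE id = 42;
SELECT * FROM B WHERE a_id = 42;  -- sees the same snapshot as the first SELECT
COMMIT;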
6-8 queries for one web page? Usually this is fine. I do it all the time.
Thousands of rows returned? Choke! What is the client going to do with that many? Can the SQL do more processing, then return fewer rows?
With rare exceptions, only 1 connection per web page.
Each query has a lot of overhead. For example, INSERTing 100 rows into a table with 100 single-row INSERT statements will take about 10 times as long as a single 100-row INSERT. So, when practical, use fewer round-trips to the server. This becomes very important if the network is a WAN. The other side of the globe is 250ms away, just for latency. A server in the same datacenter is probably so close that latency can be ignored. In a WAN, use Stored Routines to minimize round trips.
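For example, a sketch of such a routine (the procedure, table, and column names are invented): several dependent statements are bundled so the client pays the WAN latency once per CALL instead of once per statement.

DELIMITER //
CREATE PROCEDURE record_order(IN p_customer_id INT, IN p_amount DECIMAL(10,2))
BEGIN
  -- several statements, but only one client round trip for the whole CALL
  INSERT INTO orders (customer_id, amount, created_at)
    VALUES (p_customer_id, p_amount, NOW());
  UPDATE customers
    SET last_order_at = NOW()
    WHERE id = p_customer_id;
END//
DELIMITER ;

CALL record_order(42, 19.99);  -- one round trip instead of two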
I like to time each query actively in the code. Then, if I perceive a performance problem, I look to see which query to work on first. Or use the SlowLog.

What is the optimal solution: use an INNER JOIN or multiple queries?

Should I write something like this:
SELECT * FROM brands
INNER JOIN cars
ON brands.id = cars.brand_id
or like this:
SELECT * FROM brands
... (then, while looping over each returned brand) ...
SELECT * FROM cars WHERE brand_id = [row(brands.id)]
Generally speaking, one query is better, but there are some caveats to that. For example, older versions of SQL Server had a great decrease in performance if you did more than seven joins. The answer will really depend on the database engine, version, query, schema, fields, etc., so we can't say for sure which is better. Always look into minimizing the number of queries when possible, without going too overboard and creating result sets that are crazy or impossible to maintain.
This is a very subjective question but remember that each time you call the database there's significant overhead.
Almost without exception the optimum is to issue as few commands as possible and pull out all the data you'll need. However, for practical reasons this clearly may not be possible.
Generally speaking, if a database is well maintained, one query is quicker than two. If it's not, you need to look at your data/indices and determine why.
A final point, you're hinting in your second example that you'd load the brand then issue a command to get all the cars in each brand. This is without a doubt your worst option as it doesn't issue 2 commands - it issues N+1 where N is the number of brands you have... 100 brands is 101 DB hits!
Your two queries are not exactly the same.
The first returns all fields from brands and cars in one row. The second returns two different result sets that need to be combined together.
In general, it is better to do as many operations in the database as possible. The database is more efficient for processing large amounts of data. And, it generally reduces the amount of data being brought back to the client.
That said, there are a few circumstances where more data is returned by a single query than with multiple queries. For instance, in your example, if you have one brand record with 100 columns and 10,000 car records with three columns, then the two-query method is probably faster. You are only bringing back the columns from the brands table once rather than 10,000 times.
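A sketch of that exception (the car column names are invented):

-- joined form: the wide brand row is duplicated onto every one of the 10,000 car rows returned
SELECT b.*, c.model, c.year, c.price
FROM brands b
JOIN cars c ON c.brand_id = b.id
WHERE b.id = 7;

-- split form: the 100-column brand row crosses the wire only once
SELECT * FROM brands WHERE id = 7;
SELECT model, year, price FROM cars WHERE brand_id = 7;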
These examples where multiple queries is better are few and far between. In general, it is better to do the processing in the database. If performance needs to be improved, then in a few rare cases, you might be able to break up queries and improve performance.
In general, use the first query. Why? Because query execution time is not only the time of the query itself; it also includes overheads such as:
Creating connection overhead
Network data sending overhead
Closing (handling) connection overhead
Depending on the situation, some of these overheads may or may not apply. For example, if you're using a persistent connection, you won't pay the connection overhead. But in the common case that's not so, and the creating/maintaining/closing connection overhead is a very significant part. Imagine that this overhead is only 1% of the total query time (in a real situation it will be much more), and that you have, say, 1,000,000 rows. The first query pays that overhead only once, while the second approach pays it 1,000,000 times - the equivalent of 1,000,000 x 1% = 10,000 whole queries' worth of pure overhead. Just think about how slow that will be.
Besides, the INNER JOIN will also be done using the key, if one exists, so in terms of the query's own speed it will be nearly the same as the second option. So I highly recommend the INNER JOIN option.
Breaking a complex query into simple queries can be useful in a few very specific cases. One example is the case of an IN subquery. If you write WHERE id IN (subquery), where (subquery) is some SQL, MySQL will treat that as an = ANY subquery and will not use a key for it, even if the subquery results in a narrow list of ids. And yes, splitting it into two queries can make sense, since WHERE IN (static list) works in another way: MySQL will use a range index scan for that (strange, but true - for an IN (static list) statement, IN is treated as a comparison operator and not as an = ANY subquery qualifier). This isn't directly about your case, but it shows that cases do exist where splitting processing away from the DBMS can be useful in terms of performance.
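A sketch of that IN-subquery case (the country column is made up):

-- subquery form: may be executed as a dependent = ANY subquery, ignoring the index on cars.brand_id
SELECT * FROM cars
WHERE brand_id IN (SELECT id FROM brands WHERE country = 'DE');

-- split form: run the inner query first...
SELECT id FROM brands WHERE country = 'DE';  -- suppose it returns 3, 7, 12
-- ...then feed the result back as a static list, which can use a range index scan
SELECT * FROM cars WHERE brand_id IN (3, 7, 12);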
One query is better, because up to about 90% of the expense of executing a query is in the overheads:
communication traffic to/from database
syntax checking
authority checking
access plan calculation by optimizer
logging
locking (even read-only requires a lock)
lots of other stuff too
Do all that just once for one query, or do it all n times for n queries, but get the same data.
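If you want to see several of those overhead items broken out per statement, MySQL's session profiling (deprecated in favor of the Performance Schema, but still available) gives a rough picture; the query here is just a placeholder:

SET profiling = 1;         -- enable per-statement profiling for this session
SELECT * FROM brands JOIN cars ON cars.brand_id = brands.id;
SHOW PROFILES;             -- recent statements with their total durations
SHOW PROFILE FOR QUERY 1;  -- per-stage breakdown: permission checks, optimizing, executing, ...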

Basic question: Querying data and performance tradeoffs

Let's say I have 100 rows in my table, with 3 columns of numbers. I don't need all the rows, only about half of them every time I fetch data. I only want the rows that have been updated, as getting the rest would be redundant.
Is it better to add a datetime field that records when a row was last updated (and use that as a criterion when SELECTing), so that I only fetch rows changed since my last fetch? Or would it be better to simply download all the data each and every time? (Currently the data is being sent back as a JSON file.)
What are the tradeoffs in terms of speed, bandwidth usage, and server cpu usage between these two options? Is the former just plain better than the latter?
Both Jens Struwe and roycl are right - but as you're asking a hypothetical question, you're going to get answers that are right and contradictory.
If only half the data is relevant, how is the client going to determine which data to show? If the decision can be made by software at all, it's more efficient to do it on the database - but it's also more logical.
With tables of 100 rows, performance is neither here nor there; maintainability and long-term upgradability are a far bigger deal. Most developers would expect a logical database design, with sorting/filtering done on the DB rather than the client.
Always (or at least when possible) select only the data that you need to accomplish your task. Conversely: never select data that you then have to filter out. As a result: add a timestamp field for the updates and select only those rows whose timestamp is greater than the given one.
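As a sketch of that approach (the table and column names are just examples):

ALTER TABLE readings
ADD COLUMN updated_at TIMESTAMP NOT NULL
DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

-- fetch only the rows changed since the client's last fetch
SELECT id, col_a, col_b, col_c
FROM readings
WHERE updated_at > ?;  -- last fetch time supplied by the client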
With 100 rows in your table and 3 columns of numbers, it really doesn't matter which approach you use, as either will return the data within a few tens of milliseconds. The rows, if queried frequently, will all be in memory anyway. It also makes your JSON code simpler and your client code dumber (which is probably good, and more maintainable).
If you had a several-million row table with only a small percentage of data that was required, you would naturally want to limit the return set, and the easiest way of doing that is with an SQL WHERE clause, such as WHERE dt_modified > my_timestamp. On a properly optimised database even this query could come in at well under 100ms.
The issue may be more to do with the time the data spends "on the wire", and how much time the client spends either regenerating the page or updating it based on the returned data. Client processing time is often the slowest part of the process. Only testing on different browsers and over different network speeds will find the best balance between server-side tweaks, network fixes (such as gzipping to compress data) and optimising your javascript calls.

How to iteratively optimize a MySQL query?

I'm trying to iteratively optimize a slow MySQL query, which means I run the query, get timing, tweak it, re-run it, get timing, etc. The problem is that the timing is non-stationary, and later executions of the query perform very differently from earlier executions.
I know to clear the query cache, or turn it off, between executions. I also know that, at some level, the OS will affect query performance in ways MySQL can't control or understand. But in general, what's the best I can do wrt this kind of iterative query optimization, so that I can compare apples to apples?
Your best tool for query optimization is EXPLAIN. It will take a bit to learn what the output means, but after doing so, you will understand how MySQL's (horrible, broken, backwards) query planner decides to best retrieve the requested data.
Changes in the parameters to the query can result in wildly different query plans, so this may account for some of the problems you are seeing.
You might want to consider using the slow query log to capture all queries that might be running with low performance. Perhaps you'll find that the query in question only falls into the low performance category when it uses certain parameters?
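A minimal sketch of both suggestions (the query itself is just a placeholder):

-- see the chosen plan: access type, indexes considered, estimated rows
EXPLAIN SELECT c.* FROM cars c JOIN brands b ON b.id = c.brand_id WHERE b.name = 'Acme';

-- capture slow statements server-wide (thresholds are illustrative)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.5;  -- log anything slower than 0.5 seconds
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';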
Create a script that runs the query 1000 times, or whatever number of iterations causes the results to stabilize.
Then follow your process as described above, but just make sure you aren't relying on a single execution, but rather an average of multiple executions, because you're right, the results will not be stable as row counts change, and your machine is doing other things.
Also, try to use a wide array of inputs to the query, if that makes sense for your use case.
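One way to get a stable average without leaving the server is a small throwaway procedure that runs the statement under test N times (the names here are invented; SQL_NO_CACHE only matters on pre-8.0 servers that still have a query cache):

DELIMITER //
CREATE PROCEDURE bench_query(IN n INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    -- the statement being tuned; the result is discarded into a user variable
    SELECT SQL_NO_CACHE COUNT(*) INTO @dummy
    FROM cars WHERE brand_id = 7;
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;

CALL bench_query(1000);  -- wall-clock time of the CALL divided by 1000 gives the average per run

The mysqlslap utility that ships with MySQL can do much the same thing from the command line.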