[mysql]: Querying more db data vs looping over a large array

I need to compare two scenarios, which can be fulfilled either by:
fetching an additional timestamp column from the MySQL database, or
looping over the resultant array.
Elaborating further:
CASE 1: 144 bytes of columns + a 4-byte timestamp column for 10K rows, then looping over an array of size 50 (download size: 1,480,000 bytes).
CASE 2: 144 bytes of columns for 10K rows, then looping over an array of size 10,000 (download size: 1,440,000 bytes).
The data download is roughly 40KB more for Case 1, while Case 2 needs 10,000 more loop iterations.
Which of the two scenarios would be faster: downloading 40KB more, or performing 10,000 more loop iterations?

Your first scenario is by far the best. Here is why.
SQL is designed to extract subsets of rows from tables. It's designed to allow your tables to be many orders of magnitude bigger than your RAM. If you use a query like
SELECT *
FROM mytable
WHERE mytimestamp >= CURDATE() - INTERVAL 1 DAY
AND mytimestamp < CURDATE()
you will get all the rows with timestamps anytime yesterday, for example. If you put an index on the mytimestamp column, SQL will satisfy this query very quickly and only give the rows you need. And it will satisfy any queries looking for a range of timestamps similarly quickly.
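For example, adding an index on that column is a one-line change (a sketch; the index name idx_mytimestamp is arbitrary, and the table/column names come from the query above). EXPLAIN then confirms that MySQL can use a range scan on the index instead of a full table scan:

CREATE INDEX idx_mytimestamp ON mytable (mytimestamp);

EXPLAIN SELECT *
FROM mytable
WHERE mytimestamp >= CURDATE() - INTERVAL 1 DAY
AND mytimestamp < CURDATE();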

There are no answers that are true in 100% of situations. For example, I would prefer to do the first query when I use a fast enough network (anything 1Gbps or higher) and data transfer is free. The difference in the example you show is 40,000 bytes, but it's only 2.7% of the total.
On the other hand, if you need to optimize to minimize bandwidth usage, that's a different scenario. Like if you are transmitting data to a satellite over a very slow link. Or you have a lot of competing network traffic (enough to use up all the bandwidth), and saving 2.7% is significant. Or if you are using a cloud vendor that charges for total bytes transferred on the network.
You aren't taking into account the overhead of executing 1000 separate queries. That means 1000x the bytes sent to the database server, as you send queries. That takes some network bandwidth too. Also the database server needs to parse and optimize each query (MySQL does not cache optimization plans as some other RDBMS products do). And then begin executing the query, starting with an index lookup without the context of the previous query result.
"Performance" is a slippery term. You seem to be concerned only with bandwidth use, but there are other ways of measuring performance.
Throughput (bytes per second)
Latency (seconds per response)
Wait time (seconds until the request begins)
All of these can be affected by the total load, i.e. number of concurrent requests. A busy system may have traffic congestion or request queueing.
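If you want to watch some of these dimensions on the server itself, a few of MySQL's built-in status counters help (a sketch; these are standard status variables, shown only as a starting point):

SHOW GLOBAL STATUS LIKE 'Bytes_sent';      -- cumulative bytes sent to clients (throughput side)
SHOW GLOBAL STATUS LIKE 'Questions';       -- total statements the server has executed
SHOW GLOBAL STATUS LIKE 'Threads_running'; -- statements executing right now (congestion/queueing side)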
You can see that this isn't a simple problem.

Related

Surprising timing stats for sql queries

I have two queries whose timing parameters I want to analyze.
The first query is taking much longer than the second one, while in my opinion it should be the other way round. Any explanations?
First query:
select mrn
from EncounterInformation
limit 20;
Second query:
select enc.mrn, fl.fileallocationid
from EncounterInformation as enc
inner join FileAllocation as fl
on enc.encounterIndexId = fl.encounterid
limit 20;
The first query runs in 0.760 seconds on MySQL, while the second one surprisingly runs in 0.509 seconds.
There are many reasons why measured performance between two queries might be different:
The execution plans for the queries (the dominant factor)
The size of the data being returned (perhaps mrn is a string that is really long for the result set in the first query but not the second)
Other database activity that locks tables and indexes
Other server activity
Data and index caches that are pre-loaded -- either in the database itself or in the underlying OS components
Your observation is correct: the first should be faster than the second. More important, though, is the observation that this timing simply does not make sense for your simple query:
The first query runs in 0.760 seconds
select mrn
from EncounterInformation
limit 20;
The work done for this would typically be to load one data page (or maybe a handful). That would only consistently take 0.760 seconds if:
You had really slow data storage (think "carrier pigeons").
EncounterInformation is a view and not a table.
You don't understand timings.
If the difference is between 0.760 milliseconds and 0.509 milliseconds, then the difference is really small and likely due to other issues -- warm caches, other activity on the server, other database activity.
It is also possible that you are measuring elapsed time and not database time, so network congestion could potentially be an issue.
If you are querying views, all bets are off without knowing what the views are. In fact, if you care about performance you should be including the execution plan in the question.
I can't explain the difference. What I can say is that your observation is reasonable, but your question lacks a lot of information, which suggests that you need to learn more about how to interpret timings. I would suggest that you start by learning about EXPLAIN.
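For example, prefixing each query with EXPLAIN shows the access plan MySQL chose for each (the table and column names below are taken from the question):

EXPLAIN select mrn
from EncounterInformation
limit 20;

EXPLAIN select enc.mrn, fl.fileallocationid
from EncounterInformation as enc
inner join FileAllocation as fl
on enc.encounterIndexId = fl.encounterid
limit 20;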

For the same SQL query, the database takes a different time to return the response

Below, I executed the same select query five times (one after the other), and for each request the database took a different amount of time to return the response. How could that be possible?
Request to Database -- took 10 ms
Request to Database -- took 6 ms
Request to Database -- took 12 ms
Request to Database -- took 6 ms
Request to Database -- took 9 ms
Let's look at the data. The minimum time needed was 50% of the maximum time needed. Generally, if the minimum is at least 80% of the maximum, the timing can be considered fairly stable. If it is 40% or more, but less than 80%, there is moderate variance. If it is below 40%, it can be considered volatile. Your timings show some variance, but they are not volatile yet.
The cause of this variance depends on what else the server has to do while running your query. If there are a lot of requests to the database server, the increased concurrency increases the time needed. If the data changes frequently (a lot of insertions and deletions in a short period), the select's speed can also increase or decrease, depending on the number of rows in the table and the number of rows matching the WHERE criteria. If the table structure changes (for instance, an index is added or removed), that can change the performance of your operation as well. Last, but not least, if you did not actually check that the query being executed was the same each time, there is a chance that the queries differed, especially if they were generated dynamically.
So, what you have observed is not strange. (Whole minutes needed for a query to execute would be a lot of time and worth optimizing, but your timings are only a few milliseconds.)
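If you would rather measure this spread on the server than from the client, and you are on MySQL 5.6 or later with the performance_schema enabled, the statement digest table records min/avg/max timings per normalised query (a sketch; timer values are in picoseconds, hence the division to get milliseconds):

SELECT DIGEST_TEXT,
       COUNT_STAR,
       MIN_TIMER_WAIT / 1e9 AS min_ms,
       AVG_TIMER_WAIT / 1e9 AS avg_ms,
       MAX_TIMER_WAIT / 1e9 AS max_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY COUNT_STAR DESC
LIMIT 10;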

UPDATE vs COUNT vs SELECT performance

Is this statement true or false
The performance of these queries
SELECT * FROM table;
UPDATE table SET field = 1;
SELECT COUNT(*) FROM table;
are identical.
Or is there ever a case in which the performance of one will greatly differ from the other?
UPDATE
I'm more interested in whether there's a large difference between the SELECT and the UPDATE. You can ignore the COUNT(*) if you want.
Assume the select performs a full table scan. The update will also update all rows in the table.
Assume the update is only updating one field, though it will update all rows (it's an indexed field).
I know that they'll take different amounts of time and that they do different things. What I want to know is whether the difference will be significant or not. E.g., if the update takes 5 times longer than the select, then it's significant. Use this as the threshold. There's no need to be precise; just give an approximation.
There are different resource types involved:
disk I/O (this is the most costly part of every DBMS)
buffer pressure: fetching a row will cause fetching a page from disk, which will need buffer memory to be stored in
work/scratch memory for intermediate tables, structures and aggregates.
"terminal" I/O to the front-end process.
cost of locking, serialisation, versioning and journaling
CPU cost: this is negligible in most cases (compared to disk I/O)
The UPDATE query in the question is the hardest: it will cause all disk pages for the table to be fetched, put into buffers, altered into new buffers and written back to disk. In normal circumstances, it will also cause other processes to be locked out, with contention and even more buffer pressure as a result.
The SELECT * query needs all the pages, too; and it needs to convert/format them all into frontend-format and send them back to the frontend.
The SELECT COUNT(*) is the cheapest on all resources. In the worst case all the disk pages have to be fetched. If a suitable index is present, fewer disk I/Os and buffers are needed. The CPU cost is still negligible (IMHO) and the "terminal" output is marginal.
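If you want to see where the work goes before running anything, EXPLAIN on each statement shows the expected access path (a sketch; mytable and field stand in for the table and column in the question, since TABLE is a reserved word, and EXPLAIN for UPDATE requires MySQL 5.6 or later):

EXPLAIN SELECT * FROM mytable;        -- full scan, every column shipped to the client
EXPLAIN SELECT COUNT(*) FROM mytable; -- may scan a smaller secondary index instead
EXPLAIN UPDATE mytable SET field = 1; -- reads and rewrites every row, plus index maintenance on field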
When you say "performance", do you mean "how long it takes them to execute"?
One of them is returning all data in all rows.
One of them is writing data to the rows.
One is counting rows and returning none of the data in the rows.
All three of those queries are doing entirely different things. Therefore, it is reasonable to conclude that all three of them will take different amounts of time to complete.
Most importantly, why are you asking this question? What are you trying to solve? I have a bad feeling that you're going down a wrong path by asking this.
I have a large (granted, indexed) table here at work, and this is what I found:
select * from X (limited to the first 100,000 records) (12.5 seconds)
select count(*) from X (counted millions of records) (15.57 seconds)
An update on an indexed table is very fast (less than a second).
The SELECT and UPDATE should be about the same (but they could easily vary; this depends on the database). COUNT(*) is cached in many databases, at some level, so that query could easily be O(1). (In MySQL, MyISAM stores the exact row count, so an unfiltered COUNT(*) is O(1) there; InnoDB does not, so it has to scan an index.) Of course, a lazy implementation of UPDATE could also be O(1), but I don't know of anyone doing that currently.
tl;dr: "False" or "it depends".
All three queries do vastly different things.
They each have their own performance characteristics and are not directly comparable.
Can you clarify what you are attempting to investigate?

Optimizing Sql Transactions (large single transaction vs many small ones)

I'm working on a webserver. I can have an endpoint that compiles data in multiple transactions, or all in a single transaction. Which would be faster? Which would be better?
The answer: It depends on the amount of data you would expect your database to return.
A: A lot of data being returned (thousands, millions of rows):
Suppose you are building the next Facebook. If you are about to fetch a really enormous amount of data (2 million email addresses), it would probably be better to use some kind of "pagination" and fire a query every few seconds or minutes. You wouldn't want a single query that takes 10 minutes to get your results and keeps the entire server busy.
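A sketch of that kind of pagination, assuming a hypothetical email_addresses table with an auto-increment id column (keyset pagination avoids the growing cost of a large OFFSET):

SELECT id, email
FROM email_addresses
WHERE id > 1000000   -- the largest id seen in the previous batch
ORDER BY id
LIMIT 10000;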
B: Small or moderate amount of data being returned
Or, if you are about to fetch a moderate amount of data (300 cities, 523 employees and 43 phones), then you wouldn't want to waste transaction time by executing a separate SQL query for cities, employees and phones; try to use as few separate queries as possible. This probably means using a lot of JOINs.
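For example, instead of three separate round trips, a single joined query can return cities, employees and phones together (a sketch with hypothetical table and key names):

SELECT c.name AS city, e.name AS employee, p.number AS phone
FROM cities AS c
INNER JOIN employees AS e ON e.city_id = c.id
INNER JOIN phones AS p ON p.employee_id = e.id;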

code ping time meter - is this really true?

I am using a sort of code_ping to measure the time it takes to process the whole page, on all pages of my web portal.
I figured that if I set $count_start in the header (initialised with the current timestamp) and $count_end in the footer (the same), the difference is a meter that roughly tells me how well optimised the page is (queries, loading time of everything on that particular page).
Say for one page I get 0.0075 seconds, for others I get 0.045, etc. I'm working on optimising the queries this way.
My question is: if this "rough loading time" meter says 0.007 seconds for a page,
will 1000 users querying the same page at the same time each get the result in 0.007 * 1000 = 7 seconds? Meaning, will they each get the page after 7 seconds?
Luckily, it doesn't usually mean that.
The missing variable in your equation is how your database and your application server and anything else in your stack handles concurrency.
To illustrate this strictly from the MySQL perspective, I wrote a test client program that establishes a fixed number of connections to the MySQL server, each in its own thread (and so, able to issue a query to the server at approximately the same time).
Once all of the threads have signaled back that they are connected, a message is sent to all of them at the same time, to send their query.
When each thread gets the "go" signal, it looks at the current system time, then sends the query to the server. When it gets the response, it looks at the system time again, and then sends all of the information back to the main thread, which compares the timings and generates the output below.
The program is written in such a way that it does not count the time required to establish the connections to the server, since in a well-behaved application the connections would be reusable.
The query was SELECT SQL_NO_CACHE COUNT(1) FROM ... (an InnoDB table with about 500 rows in it).
threads 1 min 0.001089 max 0.001089 avg 0.001089 total runtime 0.001089
threads 2 min 0.001200 max 0.002951 avg 0.002076 total runtime 0.003106
threads 4 min 0.000987 max 0.001432 avg 0.001176 total runtime 0.001677
threads 8 min 0.001110 max 0.002789 avg 0.001894 total runtime 0.003796
threads 16 min 0.001222 max 0.005142 avg 0.002707 total runtime 0.005591
threads 32 min 0.001187 max 0.010924 avg 0.003786 total runtime 0.014812
threads 64 min 0.001209 max 0.014941 avg 0.005586 total runtime 0.019841
Times are in seconds. The min/max/avg are the best/worst/average times observed running the same query. At a concurrency of 64, you notice the best case wasn't all that different from the best case with only 1 query. But the biggest take-away here is the total runtime column. That value is the difference in time from when the first thread sent its query (they all send their query at essentially the same time, but "precisely" the same time is impossible since I don't have a 64-core machine to run the test script on) to when the last thread received its response.
Observations: the good news is that the 64 queries, taking an average of 0.005586 seconds each, definitely did not require 64 * 0.005586 seconds = 0.357504 seconds to execute... they didn't even require 64 * 0.001089 (the best case time) = 0.069696 seconds. All of those queries were started and finished within 0.019841 seconds... or only about 28.5% of the time it would have theoretically taken for them to run one after another.
The bad news, of course, is that the average execution time on this query at a concurrency of 64 is over 5 times as high as the time when it's only run once... and the worst case is almost 14 times as high. But that's still far better than a linear extrapolation from the single-query execution time would suggest.
Things don't scale indefinitely, though. As you can see, the performance does deteriorate with concurrency, and at some point it would go downhill -- probably fairly rapidly -- as we reached whichever bottleneck occurred first. The number of tables, the nature of the queries, and any locking that is encountered all contribute to how the server performs under concurrent loads, as do the performance of your storage, the size, performance, and architecture of the system's memory, and the internals of MySQL -- some of which can be tuned and some of which can't.
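If you want to try a similar concurrency experiment against your own server without writing a custom client, MySQL ships with mysqlslap, which runs a query from N simulated clients and reports timing statistics (the user, schema and query below are placeholders; substitute your own, and note --no-drop, since mysqlslap otherwise drops the schema named by --create-schema when it finishes):

mysqlslap --user=appuser --password --create-schema=mydb --no-drop --concurrency=64 --iterations=10 --query="SELECT SQL_NO_CACHE COUNT(1) FROM mytable"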
But of course, the database isn't the only factor. The way the application server handles concurrent requests can be another big part of your performance under load, sometimes to a larger extent than the database, and sometimes less.
One big unknown from your benchmarks is how much of that time is spent by the database answering the queries, how much is spent by the application server executing the business logic, and how much is spent by the code rendering the page results into HTML.