Slow data transfer of large result set - mysql

I have a large MySQL table, with proper indices etc. I run a simple select * query from a remote machine and I expect a large result set.
My problem is that when I run the query, the result set returns at a maximum data transfer speed of ~300 KBytes/sec.
I created the same table and ran the same query on SQL Server Express 2008 R2, and the results returned at a transfer speed of 2 MBytes/second (my line limit).
The server machine is Windows Server 2008 R2 x64, quad core, 4 GB RAM, and the MySQL version is 5.6.2 m5 64-bit. I tried disabling compression in the communication protocol, but the results were the same.
Does anyone have an idea as to why this is happening?
--theodore

You might be comparing apples to oranges.
I'd run SELECT * on the MySQL server, and see what kind of data rate you get for retrieving data on the server locally -- without the additional constraint of a network.
If that's slow also -- then it isn't the network's fault.
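One rough way to measure that, as a sketch (the table name and output path below are placeholders): SELECT ... INTO OUTFILE writes the result set to a file on the server itself, which takes the client and the network out of the picture, so you're timing pure server-side read speed.

    -- Run directly on the server; requires the FILE privilege, and the output file must not already exist.
    SELECT SQL_NO_CACHE *
    FROM   big_table
    INTO OUTFILE 'C:/temp/big_table_dump.txt';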
When the MySQL setup program runs, it asks the person setting up MySQL what role MySQL is going to play on the hardware -- i.e., Development Server, Shared Server, Dedicated.
The difference in all of these is how much memory MySQL will seek to consume on the Server.
The slowest setting is Development (use the least memory), and the fastest one is Dedicated (attempt to use a lot of memory). You can tinker with the my.ini file to change how much memory MySQL will allocate for itself, and/or google 'my.ini memory' for more detailed instructions.
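As a starting point, you can check how much memory the InnoDB buffer pool is currently allowed to use (the 2G figure below is only an illustrative assumption -- size it to your own hardware):

    -- How large is the buffer pool, and how much of it is actually in use?
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
    SHOW STATUS    LIKE 'Innodb_buffer_pool_pages%';

    -- The persistent change goes in my.ini under [mysqld] and needs a restart, e.g.:
    --   innodb_buffer_pool_size = 2G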
The memory that MySQL is using (or isn't, as the case may be) will make a huge difference in performance.
First, check what the speed of retrieving data locally on the MySQL server is. If it's slow, the network isn't the problem -- check MySQL's memory usage and ideally give it as much as possible. And of course, if it's fast, then either the network and/or some piece of database middleware (ODBC?) or the tool used to display the data is slow...
One more thing -- try the SELECT * TWICE... why? The second time, some or all of the results (again, depending on memory) should be cached, so it should be faster...
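If you want to tell which cache is responsible for the speed-up, one sketch (big_table is a placeholder name) is to run the statement twice with SQL_NO_CACHE, which bypasses the query cache, so any improvement on the second run comes from the buffer pool / OS cache warming up:

    SELECT SQL_NO_CACHE * FROM big_table;   -- first run: pages largely read from disk
    SELECT SQL_NO_CACHE * FROM big_table;   -- second run: pages served from memory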
Also, don't forget to restart MySQL when changing the my.ini file (and create a backup before you make any changes...)

Related

Amazon RDS MySQL/Aurora query sometimes hangs forever. Any 2 cents on the metrics and approaches we can use to triage it and prevent it from happening?

Just some context: in our old data pipeline system, we are running MySQL 5.6 or Aurora on Amazon RDS. The bad thing about our old data pipeline is that it runs a lot of heavy computations on the database servers, because we are handcuffed by what was designed: treating transactional databases as a data warehouse, with our backend API directly “fishing” the databases heavily in our old system. We are currently patching this old data pipeline while re-designing the new data warehouse in Snowflake.
In our old data pipeline system, the data pipeline calculation is a series of sequential MySQL queries. As our data grows bigger and bigger, the problem now is that the calculation might just hang forever at, for example, the step 3 MySQL query, while all the metrics we monitor in Amazon CloudWatch/Grafana (CPU, database connections, freeable memory, network throughput, swap usage, read latency, available storage, write latency, etc.) look normal. The MySQL slow query log is not really helpful here, because each of our queries in the data pipeline is essentially slow anyway (a query can take hours to run, since the old data pipeline does a lot of heavy computation on the database servers). The way we usually solve these problems is to “blindly” upgrade the MySQL/Aurora Amazon RDS instance and hope it will solve the issue. I am wondering:
(1) What are the recommended database metrics in MySQL 5.6 or Aurora on Amazon RDS that we should monitor in real time to help us identify why a query freezes forever? Something like innodb_buffer_pool_size?
(2) Is there any existing tool and/or in-house approach we could use to predict how many hardware resources we need, so we can confidently execute a query and know it will succeed? Could someone share their 2 cents?
One thought: since Amazon RDS is sometimes a bit of a black box, one possible approach is to host our own MySQL server on an Amazon EC2 instance in parallel to our Amazon MySQL 5.6/Aurora RDS production server, so we can SSH into the MySQL server and run command-line tools like mytop (https://www.tecmint.com/mysql-performance-monitoring/) to gather more real-time MySQL metrics that can help us triage the issue. Open to any 2 cents from gurus. Thank you!
None of the tools mentioned at that link should need to run on the database server itself, and to the extent that this is true, there should be no difference in their behavior if they aren't. Run them on any Linux server, giving the appropriate --host and --user and --password arguments (in whatever form they may expect). Even mysqladmin works remotely. Most of the MySQL command line tools do (such as the mysql cli, mysqldump, mysqlbinlog, and even mysqlcheck).
There is no magic coupling that most administrative utilities can gain by running on the same server as MySQL Server itself -- this is a common misconception but, in fact, even when running on the same machine, they still have to make a connection to the server, just like any other client. They may connect to the unix socket locally rather than using TCP, but it's still an ordinary client connection, and provides no extra capabilities.
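For example, the kind of information mytop or mysqladmin display is available to any ordinary client connection, local or remote (privileges permitting) -- a sketch:

    SHOW FULL PROCESSLIST;                   -- what every connection is doing right now
    SHOW GLOBAL STATUS LIKE 'Threads%';      -- connection/thread counters
    SHOW ENGINE INNODB STATUS;               -- lock waits, buffer pool use, row operations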
It is also possible to run an external replica of an RDS/MySQL or Aurora/MySQL server on your own EC2 instance (or in your own data center, even). But this isn't likely to tell you a whole lot that you can't learn from the RDS metrics, particularly in light of the above. (Note also, that even replica servers acquire their replication streams using an ordinary client connection back to the master server.)
Avoid the temptation to tweak server parameters. On RDS, most of the defaults are quite sane, and unless you know specifically and precisely why you want to adjust a parameter... don't do it.
The most likely explanation for slow queries... is poorly written queries and/or poorly designed indexes.
If you are not familiar with EXPLAIN SELECT, then you need to learn it, live it, and love it. SQL is declarative, not procedural. That is, SQL tells the server what you want -- not specifically how to obtain it internally. For example, SELECT ... FROM x JOIN y tells the server to match up the rows from tables x and y ON certain criteria, but does not tell the server whether to read from x and then find the matching rows in y... or read from y and find the matching rows in x. The net result is the same either way -- it doesn't matter which table the server examines first, internally -- but if the query or the indexes don't allow the server to correctly deduce the optimum path to the results you've requested, it can spend countless hours churning through unnecessary effort.
Take, for an extreme and overly simplified example, a table with millions of rows and a table with 1 row. It would make sense to read the small table first, so you know what single value you're trying to join in the large table. It would make no sense to read through each row in the large table, then go over and check the small table for a match for each of the millions of rows. The order in which you join tables in the query can be different from the order in which the actual joining is done.
And that's where EXPLAIN comes in. It allows you to inspect the query plan -- the strategy the internal query optimizer has concluded will get it to the answer you need with the least amount of effort. This is the core of the magic of relational database systems -- finding the correct solution in the optimal time, based on what it knows about the data. EXPLAIN shows you the order in which the tables are being accessed, how they're being joined, which indexes are being used, and an estimate of how many rows from each table are involved -- and these numbers multiply together to give you an estimate of the number of permutations involved in resolving your query. Two small tables, each with 50,000 rows, joined without a proper index, means an entirely unreasonable 2,500,000,000 unique combinations between the two tables that must be evaluated; every row must be compared to every other row. In short, if this turns out to be the kind of thing that you are (unknowingly) asking the server to do, then you are definitely doing something wrong. Inspecting your query plan should be second nature any time you write a complex query, to ensure that the server is using a sensible strategy to resolve it.
The output is cryptic, but secret decoder rings are available.
https://dev.mysql.com/doc/refman/5.7/en/explain.html#explain-execution-plan
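As a minimal sketch of what to look at (the table and column names below are made up purely for illustration):

    EXPLAIN
    SELECT o.id, c.name
    FROM   orders o
    JOIN   customers c ON c.id = o.customer_id
    WHERE  o.created_at >= '2018-01-01';

    -- In the output, check:
    --   key   -- which index (if any) each table is read with; NULL is a red flag
    --   rows  -- estimated rows examined per table; multiply them to gauge total work
    --   Extra -- 'Using temporary' / 'Using filesort' usually mean extra effort
    -- A missing index on o.customer_id would show up here as type=ALL (a full scan)
    -- with a large rows estimate.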

MySQL localhost vs Amazon RDS instance

I'm surprised by some MySQL performance behavior.
When I run the simple query 'SELECT 1;' on my localhost (MySQL 5.6.x) using Workbench, it executes in 0.000 s, but when I run the same query on Amazon RDS (medium, MySQL 5.5.x) it takes almost 0.094 s.
I cannot understand this behavior of MySQL.
I would propose that you go for simplicity of maintenance and scalability (which RDS apparently provides much better than local MySQL) over performance for now.
Later on, if the output you get for the dollars paid to Amazon becomes insufficient, you could start measuring carefully to find bottlenecks.
Nonetheless, if you are used to maintaining private, tightly packed VPS servers, local MySQL could be simpler to maintain, and you should only go for external services much later :)
The query SELECT 1 requires almost no parsing and no table access, so its execution is quick. For remote servers, however, there's also the time to transmit the request, and shared resources like RDS are not real-time resources, so it might take a millisecond or two to get the task executed. If there's no bigger difference, just ignore this little extra time.
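A rough way to see the split between round-trip time and execution time, as a sketch (BENCHMARK just evaluates an expression many times server-side, so it involves only one round trip):

    SELECT 1;                                 -- almost all of the measured time is network round trip
    SELECT BENCHMARK(1000000, MD5('test'));   -- pure execution cost; typically well under a second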

MySQL query runs 5x slower on staging server than local dev machine

I've got a query that is running 5x slower on my staging server as opposed to my local dev machine.
Stackoverflow doesn't want to play nicely with the formatting; the query, describes, and explains are located here
Looking at the describe statements, I can't see any difference between the local and remote schemas.
The record counts for the 2 machines are in the same order of magnitude (500k vs 600k)
Edit In Response to Comments
It was my highly unscientific approach of throwing the queries into MySQL Workbench and looking at the query time. The local query time was on the order of 1.3 seconds and the remote query time was on the order of 5.2 seconds (so it's 4x as slow). I'm sure there's a better way to test this query time.
The machines are different. My dev machine is a MacBook Pro with 8 gigs of RAM. The staging server is a Linode VPS with 512 megabytes of RAM. There shouldn't be much load on the staging server (I'm the only one who uses it). I've noticed most queries run in approximately the same time frame on the local machine and the staging server, so I was confused as to why this one had such a drastically different time frame.
RAM Issue
Since a temporary table isn't being used (no mention in the EXPLAINS), is the amount of RAM still an issue?
Output from free
                        total       used       free     shared    buffers     cached
    Mem:               508576     453880      54696          0       4428     254200
    -/+ buffers/cache:            195252     313324
    Swap:              262140      19500     242640
Profiling Added to Gist
It looks like the remote is taking 2.5 seconds 'sending data' whereas the local is only taking 0.5 seconds. Is this an I/O issue? (Complete profiling info in gist)
Your staging server has one sixteenth of the RAM that your MacBook Pro has.
Without knowing how much RAM is available to your two instances of MySQL, it's hard to be definitive, but that's the first place I'd look.
Also, if you run these queries from the MySQL command line, locally, how do the times compare?
It could be that the increase in time is in network transfer and not query processing.
Actually... network transfer time is the first place I'd look... then MySQL memory usage.
EDIT following question updates
The 'sending data' phase is the phase where the server is sending data to the client (ref). I don't know exactly how large your dataset is, but 2.5 s seems pretty high for what's probably 50 kB of data or so.
Having looked at the profiling data, nearly all the time is spent sending data, so I'd strongly suspect the network here.
EDIT 2
Some research led me to this page, which indicates that 'Sending data' is misleading and that this is actually the time spent executing your query.
Thus, I really think you need to be looking at CPU and memory usage on your server since it's specced at a level so much lower than your MacBook.
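For reference, this is roughly how that per-phase breakdown is produced (SHOW PROFILE is available in MySQL 5.x; some_table is a placeholder for the query in question):

    SET profiling = 1;                  -- enable per-query profiling for this session
    SELECT COUNT(*) FROM some_table;    -- placeholder for the slow query
    SHOW PROFILES;                      -- lists recent queries with their Query_ID
    SHOW PROFILE FOR QUERY 1;           -- per-phase timings, including 'Sending data'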

What is happening as my Sphinx search server warms up?

I have Sphinx Search running on a Linux server with 38GB of RAM. The sphinx index contains 35M full text documents plus meta data indexed from a MySQL table. When I launch a fresh server, I run a script that "warms up the sphinx cache" by sending my 10,000 most common queries through it. It takes about an hour to run the warm up script the first time, but the same script completes in just a few minutes if I run it again.
My confusion arises from the fact that Sphinx doesn't have any documented caching, other than a file based cache that I am not using. The index is loaded into memory when Sphinx starts, but individual queries take the same length of time each time they are run after the system has been "warmed up".
There is a clear warm up period when I run my scripts. What is going on? Is Linux caching something that helps Sphinx run faster? Does the underlying MySQL system cache queries ( I believe Sphinx is basically a custom MySQL storage engine )? How are new queries that have never been run being made faster by what is going on?
I realize there is likely a very complex explanation for this, but even a little direction should help me dig deeper.
( I believe Sphinx is basically a custom MySQL storage engine )
SphinxSE is a 'fake' storage engine -- fake because it doesn't store any data. It takes requests for data from its 'table', but really it just proxies them back to a running searchd instance in the background.
searchd itself doesn't have any caching -- but, as mentioned, as the indexes are read from, the OS may well start caching the files, so it doesn't have to go all the way back to disk.
If you are using SphinxSE, then queries may be cached by the normal MySQL query cache, so whole result sets are cached. In addition, the usual way to use SphinxSE is to join the search results back with the original dataset, so you get both returned to the app in one go. So your queries are also dependent on the real MySQL data tables, and they will be subject to the same OS caching -- as MySQL reads data, it will be cached.
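If you want to confirm whether the MySQL query cache is actually involved, a quick sketch using standard MySQL 5.x variables and status counters:

    SHOW VARIABLES LIKE 'query_cache%';   -- is the query cache enabled, and how big is it?
    SHOW STATUS    LIKE 'Qcache%';        -- hits vs. inserts show whether repeat queries are served from it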
When I launch a fresh server
That suggests you are using a VM? If so, the virtual disk might actually be located on a remote SAN (or EBS on Amazon EC2),
which means loading a large Sphinx index via that route might well be slow.
Depending on where your VM is hosted, you might be able to get some special high-performance disks -- ideally local to the host, maybe even SSD -- which may well help.
Anyway, to trace the issue further, you should almost certainly enable the Sphinx query log. Look into that to see if queries are executing slowly there. There is also a startup option to searchd where you can enable iostats. This will log additional I/O statistics to the query log as queries are run, which can give you further insight.
Sphinx doesn't cache your queries, but the file system does. So, yes, queries execute faster the second time than the first.

Necessity of static cache for mysql queries?

This seems like it should be a clear-cut issue, but I was unable to find an explicit answer. Consider a simple MySQL database with an indexed ID and no complicated processing -- just reading a row with a WHERE clause. Does it really need to be cached? Reducing MySQL queries apparently satisfies everyone. But I tested reading a text both from a flat cache file and via a MySQL query, in a for loop of 1 to 100,000 cycles. Reading from the flat file was only 1-2 times faster (but needed double the memory). The CPU usage (by rough estimate from top over SSH) was almost the same.
Now I do not see any reason for using a flat-file cache. Am I right, or is the case different in the long term? What might make a query slow in such a simple system? Is it still useful to reduce MySQL queries?
P.S. I am not discussing the internal query cache or systems like memcached.
It depends on how you see the problem.
There is a limit on the number of MySQL connections that can be established at any one time (see the sketch below).
Holding on to MySQL connection resources on a busy site could lead to a max-connections error.
Establishing a connection to MySQL via TCP is a resource eater (if your database is sitting on a different server). In that case, accessing a local disk file will be much faster.
If your database server is located outside the local network, the cost of physical distance will be even heavier.
If records are updated only once daily, storing them in a cache truly means requesting once and reusing the result for the rest of the day.
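To see how close you are to the connection limit mentioned above, a quick sketch using standard MySQL variables and status counters:

    SHOW VARIABLES LIKE 'max_connections';        -- configured ceiling
    SHOW STATUS    LIKE 'Max_used_connections';   -- high-water mark since startup
    SHOW STATUS    LIKE 'Threads_connected';      -- connections open right now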