Consecutive MySQL queries from Go become MUCH slower after some point

I'm writing a job in Go that goes through some MySQL tables, selects rows based on some criteria, extracts email addresses from them, and sends an email to each.
The filtering process looks at a table (let's call it storage) which is pretty big (~6 GB dumped) and looks like this:
Columns:
id varchar(64) PK
path varchar(64) PK
game varchar(64)
guid varchar(64)
value varchar(512)
timestamp timestamp
There are two indices: (id, path) (the PK as seen above) and guid.
The job first retrieves a long list of guids from one table, then batches them and performs consecutive queries like this on the storage table:
SELECT guid, timestamp FROM storage
WHERE game = 'somegame'
AND path = 'path' AND value = 'value' AND timestamp >= '2015-04-22 00:00:00.0' AND timestamp <= '2015-04-29T14:53:07+02:00'
AND guid IN ( ... )
Where the IN clause contains a list of guids.
I need to retrieve timestamps to be able to filter further.
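For reference, this is roughly how the batching and the prepared query fit together (a minimal sketch, not the actual job code; the fixed-size placeholder list and the padding of the final batch are assumptions I'm making so a single prepared statement stays reusable):

package job

import (
    "database/sql"
    "strings"
)

const batchSize = 1000

// prepareBatchStmt builds the IN (...) placeholder list once, so a single
// prepared statement can be reused for every batch of guids.
func prepareBatchStmt(db *sql.DB) (*sql.Stmt, error) {
    placeholders := strings.TrimSuffix(strings.Repeat("?,", batchSize), ",")
    return db.Prepare(`SELECT guid, timestamp FROM storage
        WHERE game = ? AND path = ? AND value = ?
        AND timestamp >= ? AND timestamp <= ?
        AND guid IN (` + placeholders + `)`)
}

// queryBatch runs one batch. The final, short batch is padded by repeating
// its last guid so the argument count always matches the placeholder count
// (assumes guids is non-empty).
func queryBatch(stmt *sql.Stmt, guids []string, from, to string) (map[string]string, error) {
    args := make([]interface{}, 0, 5+batchSize)
    args = append(args, "somegame", "path", "value", from, to)
    for i := 0; i < batchSize; i++ {
        if i < len(guids) {
            args = append(args, guids[i])
        } else {
            args = append(args, guids[len(guids)-1])
        }
    }
    rows, err := stmt.Query(args...)
    if err != nil {
        return nil, err
    }
    defer rows.Close()
    result := make(map[string]string)
    for rows.Next() {
        var guid, ts string
        if err := rows.Scan(&guid, &ts); err != nil {
            return nil, err
        }
        result[guid] = ts
    }
    return result, rows.Err()
}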
When running against my local MySQL, everything works as expected, the query takes about 180ms with batches of 1000 guids.
When running against the same DB on Amazon RDS, the queries start out fast, but at some point they suddenly begin taking around 30 seconds each, and continue to do so until the job ends.
I have tried many, many things to fix this, but can't figure out the reason. Some notes:
The job uses only one sql.DB object. Also, I prepare the above statement once and reuse it quite heavily.
At first, I thought it was because the RDS DB was running MySQL 5.5, and I was running 5.6. I made a replica of the RDS DB, upgraded to 5.6, ran the job again. The problem happened again.
The volume of data in the two databases is the same: I dumped the production database and imported it into my local database and ran the job. Same behaviour (it still ran quickly locally).
The AWS monitoring of the RDS nodes doesn't show any significant spikes. The CPU usage jumps from 1% to up to 10%, and the job seems to open just a few connections (~4).
I had a colleague run the job on their PC, pointing to my MySQL DB, just to make sure the great performance didn't stem from the fact that the connection was local. It ran just as quickly as on my PC (admittedly, over LAN).
I ran the job against RDS both from my local PC and from an Amazon EC2 node, which is considerably closer to RDS. From EC2, it performed better, but the problem still appeared.
The job is very concurrent: each step has input and output channels (with buffer sizes of 1000), and the work is performed by goroutines. Between the steps, other goroutines batch the output of the previous goroutine (see the sketch after these notes).
The slowdown is sudden, one query takes milliseconds, and the next one takes tens of seconds.
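For illustration, a batching stage like the ones described in these notes could look like this (a sketch of the pattern only, not the actual job code):

package job

// batcher sits between two pipeline stages: it collects up to size guids
// from in and emits them as one slice on out, flushing the remainder and
// closing out when in is closed.
func batcher(in <-chan string, out chan<- []string, size int) {
    batch := make([]string, 0, size)
    for guid := range in {
        batch = append(batch, guid)
        if len(batch) == size {
            out <- batch
            batch = make([]string, 0, size)
        }
    }
    if len(batch) > 0 {
        out <- batch
    }
    close(out)
}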
I haven't the faintest idea why this happens. Any suggestions would be appreciated.

So, after lots and lots of experimentation, I found the solution.
I am using Magnetic storage on the RDS instances involved, which provides only about 100 IOPS. This limited the speed at which we could query the data.
I tested using 2000 Provisioned IOPS, and the job ran quickly all the way.

Related

Altering MySQL table column type from INT to BIGINT

I have a table with just under 50 million rows. It hit the limit for INT (2147483647). At the moment the table is not being written to.
I am planning on changing the ID column from INT to BIGINT. I am using a Rails migration to do this with the following migration:
def up
execute('ALTER TABLE table_name MODIFY COLUMN id BIGINT(8) NOT NULL AUTO_INCREMENT')
end
I have tested this locally on a dataset of 2,000 rows and it worked OK. Should running the ALTER TABLE across the 50 million rows be fine, given that the table is not being used at the moment?
I wanted to check before I run the migration. Any input would be appreciated, thanks!
We had exactly the same scenario, but with PostgreSQL, and I know how 50M rows can fill up the whole range of INT: it's the gaps in the IDs, generated by deleting rows over time, by incomplete transactions, and by other factors.
I will explain what we ended up doing, but first: seriously, testing a migration of 50M rows on 2,000 rows is not a good test.
There can be multiple solutions to this problem, depending on factors such as which DB provider you are using. We were using Amazon RDS, which has limits on runtime and on what they call IOPS (input/output operations per second). If you run such an intensive query on a DB with those limits, it will exhaust its IOPS quota midway through, and once the IOPS quota runs out, the DB becomes too slow to be of any use. We had to cancel our query and let the IOPS catch up, which takes about 30 minutes to an hour.
If you have no such restrictions, e.g. the DB is on premises, then there is another factor: whether you can afford downtime.
If you can afford downtime and have no IOPS-type restriction on your DB, you can run this query directly. It will take a lot of time (maybe half an hour or so, depending on many factors), and in the meantime the table will be locked as rows are being changed. Make sure this table is getting not only no writes but also no reads during the process, so that it runs to the end smoothly without any deadlock-type situation.
What we did to avoid downtime and the Amazon RDS IOPS limits:
In our case, there were still about 40M IDs of headroom left in the range when we realized it was going to run out, and we wanted to avoid downtime. So we took a multi-step approach:
Create a new BIGINT column, named new_id or similar (give it a unique index from the start); it will be nullable with default NULL.
Write background jobs that run a few times each night and backfill the new_id column from the id column. We were backfilling about 4-5M rows each night, and a lot more over weekends (as our app had no traffic on weekends).
Once the backfill has caught up, stop all access to this table (we just took our app down for a few minutes at night), then create a new sequence starting from max(new_id), or reuse the existing sequence and bind it to the new_id column with a default of nextval on that sequence.
Make new_id NOT NULL, then switch the primary key from id to new_id.
Delete the id column.
Rename new_id to id.
And resume your DB operations.
The above is a minimal write-up of what we did; you can find good articles about it (one is this). The approach is not new and is pretty common, so I'm sure you will find MySQL-specific write-ups too, or you can adjust a couple of things in the article above and you should be good to go.
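For the record, here is roughly what our SQL amounted to (PostgreSQL flavor, wrapped in Go only so the statements run in order; the table, index, and sequence names are illustrative, not from a real schema):

package migrate

import "database/sql"

// Step 1, done up front while the app keeps running:
//   ALTER TABLE t ADD COLUMN new_id BIGINT;                      -- nullable, default NULL
//   CREATE UNIQUE INDEX CONCURRENTLY t_new_id_idx ON t (new_id);
//
// Step 2, run repeatedly by the nightly jobs with advancing id ranges:
const backfillChunk = `UPDATE t SET new_id = id
    WHERE id BETWEEN $1 AND $2 AND new_id IS NULL`

// switchover covers the remaining steps and runs in the short maintenance
// window, after the backfill has caught up and all access has been stopped.
func switchover(db *sql.DB) error {
    steps := []string{
        // bind a fresh sequence to new_id, starting after the current max
        `CREATE SEQUENCE t_new_id_seq OWNED BY t.new_id`,
        `SELECT setval('t_new_id_seq', (SELECT max(new_id) FROM t))`,
        `ALTER TABLE t ALTER COLUMN new_id SET DEFAULT nextval('t_new_id_seq')`,
        // make new_id NOT NULL, then swap the primary key
        `ALTER TABLE t ALTER COLUMN new_id SET NOT NULL`,
        `ALTER TABLE t DROP CONSTRAINT t_pkey`,
        `ALTER TABLE t ADD CONSTRAINT t_pkey PRIMARY KEY USING INDEX t_new_id_idx`,
        // drop the old column and rename
        `ALTER TABLE t DROP COLUMN id`,
        `ALTER TABLE t RENAME COLUMN new_id TO id`,
    }
    for _, s := range steps {
        if _, err := db.Exec(s); err != nil {
            return err
        }
    }
    return nil
}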

Rails Writes Take 100% Longer After Postgres Migration

I'm working on a migration from MySQL to Postgres on a large Rails app. Most operations perform at a normal rate; however, we have a particular operation that generates job records every 30 minutes or so. There are usually about 200 records generated and inserted, after which separate workers pick up the jobs and work on them from another server.
Under MySQL it takes about 15 seconds to generate the records, and then another 3 minutes for the worker to perform and write back the results, one at a time (so 200 more updates to the original job records).
Under Postgres it takes around 30 seconds, and then another 7 minutes for the worker to perform and write back the results.
The table being written to has roughly 2 million rows and one sequence-backed ID column.
I have tried tweaking checkpoint timeouts and sizes with no luck.
The table is heavily indexed and really shouldn't be any different than it was before.
I can't post code samples, as it's a huge codebase; without posting pages and pages of code it wouldn't make sense.
My question is, can anyone think of why this would possibly be happening? There is nothing in the Postgres log and the process of creating these objects has not changed really. Is there some sort of blocking synchronous write behavior I'm not aware of with Postgres?
I've added all sorts of logging in my code to spot errors or transaction failures but I'm coming up with nothing, it just takes twice as long to run, which doesn't seem correct to me.
The Postgres instance is hosted on AWS RDS on a M3.Medium instance type.
We also use New Relic, and it's showing nothing of interest here, which is surprising.
Why does your job queue contain 2 million rows? Are they all live, or have you not moved finished ones to an archive table to keep your reporting simpler?
Have you used EXPLAIN on your SQL from a psql prompt or your preferred SQL IDE/tool?
Postgres is a completely different RDBMS than MySQL. It allocates space differently and manipulates it differently, so it may need to be indexed differently.
Additionally there's a tool called pgtune that will suggest configuration changes.
edit: 2014-08-13
Also, rails comes with a profiler that might add some insight. Here's a StackOverflow thread about rails profiling.
You also want to watch your DB server at the disk I/O level. Does your job fulfillment perform a large number of updates? Postgres creates a new row version when you update an existing row and marks the old version as dead (to be reclaimed by vacuum), instead of just overwriting the existing row, so you may be seeing a lot more I/O as a result of your RDBMS switch.
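If you want to check whether this MVCC churn is what's costing you, one cheap probe is to compare the updated-tuple and dead-tuple counters around one 30-minute cycle. A sketch, where 'jobs' stands in for your actual table name:

package probe

import (
    "database/sql"
    "log"
)

// deadTuples prints how many row updates the table has absorbed and how
// many dead row versions are still waiting for vacuum; a large jump across
// one job cycle points at MVCC write amplification rather than your code.
func deadTuples(db *sql.DB, table string) {
    var nUpd, nDead int64
    err := db.QueryRow(
        `SELECT n_tup_upd, n_dead_tup FROM pg_stat_user_tables WHERE relname = $1`,
        table,
    ).Scan(&nUpd, &nDead)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("%s: updates=%d dead_tuples=%d", table, nUpd, nDead)
}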

MySQL MyISAM table crashes with lots of inserts and selects, what solution should I pick?

I have the following situation:
A MySQL MyISAM database on an Amazon EC2 instance, with PHP on an Apache web server. We need to store incoming packages as JSON in MySQL. For this I use a staging table, where a cron job every minute moves old data to another table (named stage2) with a WHERE DateTime > 'date - 2min' query.
The stage1 table holds only current information; it contains about 35k rows normally and can grow to 100k when it's busy. We can reach 50k new rows a minute, which comes to about 10k INSERT queries. The insert looks like this:
INSERT DELAYED IGNORE INTO stage1 VALUES ( ... ), (....), (....), (....), (...)
Then we have 4 scripts, each running about every 10 seconds, doing the following:
grab the max RowID from stage1 (the primary key)
export the data between the previous max RowID and that one
a) 2 scripts are in bash, using the mysql command-line export method
b) 1 script is in node.js, using the export method with INTO OUTFILE
c) 1 script is in PHP, using a plain MySQL SELECT statement and looping through each row
send the data to the external client
write the last send time and last RowID to a MySQL table so the script knows where to resume next time.
Then we have one cron job per minute moving old data from stage1 to stage2.
Everything worked well for a long time, but we are growing in users, and during rush hours the stage1 table now crashes every so often. We can easily repair it, but that's not the right way, because we will be down for some time. Memory and CPU are OK during rush hours, but when stage1 crashes, everything crashes.
Also worth saying: I don't care if I miss rows because of a failure, so I don't need any special backup plans in case something goes wrong.
What I did so far:
Adding delayed and ignore to the insert statements.
Tried switching to InnoDB, but this was even worse, mainly because of the large amount of memory it needed. My EC2 instance is currently a t2.medium, which has 4 GB memory and 2 vCPUs with burst capacity. Following https://dba.stackexchange.com/questions/27328/how-large-should-be-mysql-innodb-buffer-pool-size, I ran this query:
SELECT CEILING(Total_InnoDB_Bytes*1.6/POWER(1024,3)) RIBPS FROM
(SELECT SUM(data_length+index_length) Total_InnoDB_Bytes
FROM information_schema.tables WHERE engine='InnoDB') A;
It returned 11 GB; I tried 3 GB, which is the max for my instance (80% of memory). Since it was even more unstable, I switched every table back to MyISAM yesterday.
Recreated the stage1 table structure.
What are my limitations?
I cannot merge all 4 scripts into one export because the output to each client is different; for example, some use JSON, others XML.
Options I'm considering
An m3.xlarge instance with 15 GB memory is 5 times more expensive, but if it's needed I'm willing to pay for it. Then switch to InnoDB again and see if it's stable?
Can I move just stage1 to InnoDB and run it with a 3 GB buffer pool, leaving the rest as MyISAM?
Try a NoSQL database or an in-memory database. Would that work?
Queue the packages in memory and have the 4 scripts read the data from memory, saving everything to MySQL later when done (a rough sketch of this idea follows after this list). Is there some kind of tool for this?
Move stage1 to an RDS instance with InnoDB.
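A rough sketch of the in-memory queue idea from the options above, assuming for brevity that a stage1 row reduces to a single JSON payload column (an illustration of the pattern, not a specific tool):

package stage

import (
    "database/sql"
    "log"
    "strings"
    "time"
)

// flusher buffers incoming JSON packages and writes them to stage1 as one
// multi-row INSERT, either once per second or whenever the buffer fills.
// Losing a batch on error is tolerated, as the question allows.
func flusher(db *sql.DB, in <-chan string) {
    const max = 5000
    buf := make([]string, 0, max)
    flush := func() {
        if len(buf) == 0 {
            return
        }
        query := "INSERT IGNORE INTO stage1 (payload) VALUES " +
            strings.TrimSuffix(strings.Repeat("(?),", len(buf)), ",")
        args := make([]interface{}, len(buf))
        for i, v := range buf {
            args[i] = v
        }
        if _, err := db.Exec(query, args...); err != nil {
            log.Print(err)
        }
        buf = buf[:0]
    }
    tick := time.NewTicker(time.Second)
    defer tick.Stop()
    for {
        select {
        case pkg, ok := <-in:
            if !ok {
                flush()
                return
            }
            buf = append(buf, pkg)
            if len(buf) == max {
                flush()
            }
        case <-tick.C:
            flush()
        }
    }
}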
I'd love to get some advice and help on this! Perhaps I'm missing the easy answer, or which options should I not consider?
Thanks,
Sjoerd Perfors
Today we fixed these issues with the following setup:
An AWS load balancer in front of a t2.small "worker" instance, where Apache and PHP handle the requests and forward the data to an EC2 instance running MySQL, called the "main".
When CPU usage of the t2.small instance goes above 50%, new instances are launched automatically and attached to the load balancer.
The "main" EC2 instance runs MySQL with InnoDB.
Everything was updated to Apache 2.4 and PHP 5.5, with performance tweaks.
Fixed one script so it runs a lot faster.
InnoDB now has 6 GB of buffer pool.
Things we tried that didn't work:
- Setting up DynamoDB, but sending to that DB took almost 5 seconds.
Things we are considering:
- Removing the stage2 table and doing backups directly from stage1. It seems that keeping this many rows isn't bad for performance.

MySQL query slowing down until restart

I have a service that sits on top of a MySQL 5.5 database (INNODB). The service has a background job that is supposed to run every week or so. On a high level the background job does the following:
1. Do some initial DB reads and writes in one transaction.
2. Execute UMQ (described below) with a set of parameters in one transaction; if no records are returned, we are done.
3. Process the result from UMQ (this is a bit heavy, so it is done outside of any DB transaction).
4. Write the outcome of the previous step to the DB in one transaction (this writes to tables queried by UMQ and ensures that the same records are not found again by UMQ).
5. Go to step 2.
UMQ - Ugly Monster Query: This is a nasty database query that joins a bunch of tables, has conditions on columns in several of these tables, and includes a NOT EXISTS subquery with some more joins and conditions. UMQ includes an ORDER BY and also a LIMIT 1000. Even though the query is bad, I have done what I can here: there are indexes on all columns filtered on, and the joins all follow foreign-key relations.
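The real query isn't shown here, but schematically UMQ has roughly this shape (all names hypothetical; the NOT EXISTS anti-join is against the table that step 4 writes to, which is what keeps already-processed records from being selected again):

package job

// umqSQL is a schematic stand-in for the real query: joins over foreign
// keys, filters on several tables, an anti-join against the results table
// t (written in step 4), an ORDER BY, and the LIMIT 1000.
const umqSQL = `
SELECT a.id, a.created_at
FROM a
JOIN b ON b.a_id = a.id
JOIN c ON c.b_id = b.id
WHERE a.state = ? AND c.kind = ?
  AND NOT EXISTS (SELECT 1 FROM t WHERE t.a_id = a.id)
ORDER BY a.created_at
LIMIT 1000`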
I do expect UMQ to be heavy and take some time, which is why it's executed in a background job. However, what I'm seeing is rapidly degrading performance until it eventually causes a timeout in my service (maybe 50 times slower after 10 iterations).
First I thought it was because the data queried by UMQ changes (see step 4 above), but that wasn't it: if I took the last query (the one that caused the timeout) from the slow query log and executed it directly myself, I got the same behavior, and only until I restarted the MySQL service. After the restart, the exact same query on the exact same data that took >30 seconds before now took <0.5 seconds. I can reproduce this behavior every time by restoring the database to its initial state and restarting the process.
Also, using the trick described in this question I could see that the query scans around 60K rows after restart as opposed to 18M rows before. EXPLAIN tells me that around 10K rows should be scanned and the result of EXPLAIN is always the same. No other processes are accessing the database at the same time and the lock_time in the slow query log is always 0. SHOW ENGINE INNODB STATUS before and after restart gives me no hints.
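The trick is presumably the Handler-counter comparison; for completeness, a sketch of it (take both snapshots on the same connection, since the counters are per session):

package probe

import (
    "context"
    "database/sql"
)

// handlerReads snapshots the session's Handler_read_* counters. Diffing two
// snapshots taken around a query gives the rows the server actually
// touched; a huge Handler_read_rnd_next delta means a table scan.
func handlerReads(ctx context.Context, conn *sql.Conn) (map[string]int64, error) {
    rows, err := conn.QueryContext(ctx, `SHOW SESSION STATUS LIKE 'Handler_read%'`)
    if err != nil {
        return nil, err
    }
    defer rows.Close()
    counters := make(map[string]int64)
    for rows.Next() {
        var name string
        var value int64
        if err := rows.Scan(&name, &value); err != nil {
            return nil, err
        }
        counters[name] = value
    }
    return counters, rows.Err()
}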
So finally the question: Does anybody have any clue of why I'm seeing this behavior? And how can I analyze this further?
I have the feeling that I need to configure MySQL differently in some way but I have searched and tested like crazy without coming up with anything that makes a difference.
Turns out that the behavior I saw was the result of how the MySQL optimizer uses InnoDB statistics to decide on an execution plan. This article put me on the right track (even though it does not discuss exactly my problem). The most important thing I learned from it is that MySQL calculates statistics on startup and then only once in a while; these statistics are then used to optimize queries.
The way I had set up the test data, the table T where most writes are done in step 4 started out empty. After each iteration T would contain more and more records, but the InnoDB statistics had not yet been updated to reflect this. Because of that, the MySQL optimizer kept choosing an execution plan for UMQ (which includes a JOIN with T) that worked well when T was empty but got worse and worse the more records T contained.
To verify this I added an ANALYZE TABLE T; before every execution of UMQ and the rapid degradation disappeared. No lightning performance but acceptable. I also saw that leaving the database for half an hour or so (maybe a bit shorter but at least more than a couple of minutes) would allow the InnoDB statistics to refresh automatically.
In a real scenario the relative difference in index cardinality for the tables involved in UMQ will look quite different and will not change as rapidly so I have decided that I don't really need to do anything about it.
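In code, the workaround boils down to something like this (a sketch; T and the query text are placeholders):

package job

import "database/sql"

// runUMQ refreshes InnoDB statistics on the hot table before each UMQ
// execution, so the optimizer re-plans against the table's current shape
// instead of the stale startup statistics.
func runUMQ(db *sql.DB, query string, params ...interface{}) (*sql.Rows, error) {
    if _, err := db.Exec(`ANALYZE TABLE T`); err != nil {
        return nil, err
    }
    return db.Query(query, params...)
}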
Thank you very much for the analysis and answer. I had been chasing this issue for several days during CI on MariaDB 10.1 and Bacula server 9.4 (Debian Buster).
The situation was that after a fresh server installation during a CI cycle, the first two tests (backup and restore) ran smoothly on the unrestarted MariaDB server, and only the third test showed that one particular UMQ took about 20 minutes (building the directory tree during the restore process from a table with about 30k rows).
Until the MariaDB server was restarted or the table had been analyzed, the problem would not go away. ANALYZE TABLE or the restart changed the cardinality of the fields and the internal query processing exactly as stated in the linked article.

MySQL is very slow with DELETE query, Apache is weird while query runs

To start, a few details to describe the situation as a whole:
MySQL (5.1.50) database on a very beefy (32 CPU cores, 64GB RAM) FreeBSD 8.1-RELEASE machine which also runs Apache 2.2.
Apache gets an average of about 50 hits per second. The vast majority of these hits are API calls for a sale platform.
The API calls usually take about half a second or less to generate a result, but can take up to 30 seconds depending on third parties.
Each of the API calls stores a row in a database. The information stored there is important, but only for about fifteen minutes, after which it must expire.
In the table which stores API call information (schema for this table is below), InnoDB row-level locking is used to synchronize between threads (Apache connections, really) requesting the same information at the same time, which happens often (sketched just below). This means that several threads may be waiting on a row lock for up to 30 seconds, as API calls can take that long (but usually don't).
Above all, the most important thing to note is that everything works perfectly under normal circumstances.
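The row-lock synchronization described above presumably looks something like the following (a hypothetical sketch, written in Go only for consistency with the rest of this page; the column names are taken from the schema below):

package api

import "database/sql"

// lockSale takes (or waits for) the row lock that serializes concurrent
// API calls asking for the same sale; the lock is held until the caller
// commits or rolls back tx, which can be up to ~30 seconds later.
func lockSale(tx *sql.Tx, identifier, zip string, income uint32) (int64, error) {
    var saleID int64
    err := tx.QueryRow(
        `SELECT sale_id FROM sales
         WHERE identifier = ? AND zip_code = ? AND income = ?
         FOR UPDATE`,
        identifier, zip, income,
    ).Scan(&saleID)
    return saleID, err
}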
That said, this is the very highly used table (fifty or so INSERTs per second, many SELECTs, row-level locking is utilized) I'm running the DELETE query on:
CREATE TABLE `sales` (
`sale_id` int(32) unsigned NOT NULL auto_increment,
`start_time` int(20) unsigned NOT NULL,
`end_time` int(20) unsigned default NULL,
`identifier` char(9) NOT NULL,
`zip_code` char(5) NOT NULL,
`income` mediumint(6) unsigned NOT NULL,
PRIMARY KEY USING BTREE (`sale_id`),
UNIQUE KEY `SALE_DATA` (`identifier`,`zip_code`,`income`),
KEY `SALE_START` USING BTREE (`start_time`)
) ENGINE=InnoDB DEFAULT CHARSET=ascii ROW_FORMAT=FIXED
The DELETE query looks like this, and is run every five minutes on cron (I'd prefer to run it once per minute):
DELETE FROM `sales` WHERE
`start_time` < UNIX_TIMESTAMP(NOW() - INTERVAL 30 MINUTE);
I've used INT for the time field because, in my testing, MySQL seemed to have trouble using indexes with DATETIME fields.
So this is the problem: the DELETE query seems to run fine the majority of the time (maybe 7 out of 10 runs). Other times, the query finishes quickly, but MySQL seems to get choked up for a while afterwards. I can't exactly prove it's MySQL that is acting up, but the times the symptoms happen definitely coincide with the times this query is run. Here are the symptoms while everything is choked up:
Logging into MySQL and running SHOW FULL PROCESSLIST;, there are just a few INSERT INTO sales ... queries running, where normally there are more than a hundred. What's abnormal here is actually the lack of any tasks in the process list rather than there being too many; it seems MySQL stops taking connections entirely.
Checking Apache server-status, Apache has reached MaxClients. All threads are in "Sending reply" status.
Apache begins using lots of system time CPU. Load averages shoot way up, I've seen 1-minute load averages as high as 100. Normal load average for this machine is around 15. I see that it's using system CPU (as opposed to user CPU) because I use GKrellM to monitor it.
In top, there are many Apache processes using lots of CPU.
The web site and API (served by Apache of course) are unreachable most of the time. Some requests go through, but take around three or four minutes. Other requests reply after a time with a "Can't connect to MySQL server through /tmp/mysql.sock" error - this is the same error as I get when MySQL is over capacity and has too many connections (only it doesn't actually say too many connections).
MySQL accepts a maximum of 1024 connections, mysqltuner.pl reports "[!!] Highest connection usage: 100% (1025/1024)", meaning it's taken on more than it could handle at one point. Generally under normal conditions, there are only a few hundred concurrent MySQL connections at most. mysqltuner.pl reports no other issues, I'd be happy to paste the output if anybody wants.
Eventually, after about a minute or two, things recover on their own without any intervention. CPU usage goes back to normal, Apache and MySQL resume normal operations.
So, what can I do? :) How can I even begin to investigate why this is happening? I need that DELETE query to run for various reasons, why do things go bonkers when it's run (but not all the time)?
A hard one. This is not an answer but the start of a brainstorm.
I would say it is maybe a re-indexing problem on delete; in the docs we can find DELETE QUICK followed by OPTIMIZE TABLE to try to avoid the index-merge work.
Another possibility is a chain of deadlocks between the delete and at least one other thread: row locks could pause the delete operation, and the delete operation could pause the next row locks. You then get either a detected deadlock or an undetected one, and hence a timeout. How do you detect such concurrency-aborted exceptions? Do you re-run your transactions? If your threads take a lot of different row locks in the same transaction, chances are that the first deadlock will impact more and more threads (a traffic jam).
Did you try locking the table in the delete transaction? Check the manual for the way to lock tables in a transaction in InnoDB, or to take a shared lock on all rows. It may take some time to get the table to yourself, but if your delete is fast, nobody will notice you've had the table to yourself for a second.
Even if you have not tried that, it may be what the delete is already doing. Check as well the docs on implicit locks: your delete query should be using the start_time index, so I'm fairly sure your current delete is not locking all rows (not completely sure; InnoDB locks all examined rows, not only the rows matching the WHERE condition), but the delete is almost certainly blocking inserts. Some examples of deadlocks with transactions performing deletes are explained there. Good luck! For me it's too late tonight to work through all the lock-isolation impacts.
Edit: you could try replacing your DELETE with an UPDATE that sets deleted = 1 and performing the real delete at low-usage times (if you have any), changing the client queries to check this indexed deleted flag.
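A sketch of that suggestion (it assumes a new indexed deleted column on sales, which is not in the schema above; the purge batch size is arbitrary):

package purge

import "database/sql"

// markExpired replaces the cron DELETE: it only flips an indexed flag, and
// client queries add "AND deleted = 0" so they never see expired rows.
func markExpired(db *sql.DB) error {
    _, err := db.Exec(`UPDATE sales SET deleted = 1
        WHERE deleted = 0
        AND start_time < UNIX_TIMESTAMP(NOW() - INTERVAL 30 MINUTE)`)
    return err
}

// purgeDeleted does the real deletes in small batches during low-usage
// hours, looping until nothing is left to remove.
func purgeDeleted(db *sql.DB) error {
    for {
        res, err := db.Exec(`DELETE FROM sales WHERE deleted = 1 LIMIT 10000`)
        if err != nil {
            return err
        }
        if n, _ := res.RowsAffected(); n == 0 {
            return nil
        }
    }
}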