Stored procedures either finish in less than a minute, or never - sql-server-2014

Background: the instance was migrated from SQL Server 2008 R2 to SQL Server 2014. The old machine had half the memory and CPU.
On the old machine, the procedures ran slower (20-30 mins), but they never froze completely. All databases and objects are 100% copies of the old ones.
The issue: I have a couple of parameterless stored procedures that each take either about a minute or an eternity to finish. I am also working with a static set of data (and static queries), as I am doing all testing on a copy of the production database (on the production server). That means there are no other queries against this DB, and the table data does not change. Unfortunately, for security reasons, I cannot post a full query, but I will explain what the stored procedures do and where they freeze:
All stored procedures contain three UPDATE statements, each with a couple of correlated subqueries that do the calculations. The logic is as follows:
Statement 1: UPDATE three columns in all rows based on various filters
Statements 2 and 3: UPDATE the same three columns based on the same filters as above, with one exception, and maybe most importantly, only UPDATE rows where the value of a particular column is NULL (it can be calculated as NULL by the previous UPDATE).
I will provide this little snippet:
UPDATE table
SET column1 = CASE
                  WHEN (SELECT TOP 1
                        FROM STUFF (+ various joins)
                        WHERE rate = 1)
                  THEN something
                  ELSE something2
              END
This is the same for all three updates and all three columns, except that rate is 2 and 3 for the second and third statement respectively.
Now the strange issue is, statement 1 always completes in about 30 seconds. If the stored procedure freezes, it is always on statement 2.
Here is what I have tried so far:
Created additional indexes, rebuilt the ones I have, and updated statistics. Created a filtered index for the "WHERE ... IS NOT NULL" filter on Statements 2 and 3
Set the compatibility level to 100
Forced the query to use the cardinality estimator of SQL Server 2008 R2
Forced the query to run with different MAXDOP settings - 0, 1, 4, 8, 16, 28 (we have 28 physical cores with hyperthreading, so 56 logical). A sketch of these two hints follows this list.
Set up all kinds of traces - the only thing I am getting is the wait_info event, and it is always either CXPACKET for MAXDOP > 1 or SOS_SCHEDULER_YIELD, both of which are normal
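For reference, a minimal sketch of how the cardinality-estimator and MAXDOP attempts can be expressed as query-level hints; the table and column names below are placeholders, not the real ones:
-- Placeholder UPDATE; only the OPTION clause reflects what was actually tried.
UPDATE dbo.SomeTable
SET    column1 = 0
WHERE  column1 IS NULL
OPTION (QUERYTRACEON 9481,  -- use the pre-2014 (SQL Server 2008 R2) cardinality estimator
        MAXDOP 1);          -- cap the degree of parallelism for this statement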
Looking at the execution plans, the most expensive operations are a couple of Lazy Table Spools. The Rate column, which is the differentiating filter for the UPDATE statements (besides the IS NULL predicate), has 8000 rows with value 1, 7000 with value 2 and 10000 with value 3.
I am really out of ideas at this point, due to this strange behavior. Please share any suggestions you have!

Related

Will a MySQL SELECT statement interrupt INSERT statement?

I have a MySQL table that keeps gaining new records every 5 seconds.
The questions are:
can I run a query on this set of data that may take more than 5 seconds?
if the SELECT statement takes more than 5 seconds, will it affect the scheduled INSERT statement?
what happens when an INSERT statement is invoked while the SELECT is still running; will the SELECT get the newly inserted records?
I'll go over your questions and some of the comments you added later.
can I run a query on this set of data that may take more than 5 seconds?
Can you? Yes. Should you? It depends. In a MySQL configuration I set up, any query taking longer than 3 seconds was considered slow and logged accordingly. In addition, you need to keep in mind the frequency of the queries you intend to run.
For example, if you try to run a 10 second query every 3 seconds, you can probably see how things won't end well. If you run a 10 second query every few hours or so, then it becomes more tolerable for the system.
That being said, slow queries can often benefit from optimizations, such as not scanning the entire table (i.e. searching by primary key) and using the EXPLAIN keyword to get the database's query planner to tell you how it intends to execute the query internally (e.g. is it using PKs, FKs, or indices, or is it scanning all table rows, etc.).
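As a small illustration (the table and column names here are hypothetical), EXPLAIN shows whether an index is being used:
-- Hypothetical table and column; the point is the shape of the output check.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- Look at the "key" and "rows" columns: a NULL key and a row count close to the
-- table size usually mean a full table scan rather than an index lookup.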
if the SELECT statement takes more than 5 seconds, will it affect the scheduled INSERT statement?
"Affect" in what way? If you mean "prevent insert from actually inserting until the select has completed", that depends on the storage engine. For example, MyISAM and InnoDB are different, and that includes locking policies. For example, MyISAM tends to lock entire tables while InnoDB tends to lock specific rows. InnoDB is also ACID-compliant, which means it can provide certain integrity guarantees. You should read the docs on this for more details.
what happens when an INSERT statement is invoked while the SELECT is still running; will the SELECT get the newly inserted records?
Part of "what happens" is determined by how the specific storage engine behaves. Regardless of what happens, the database is designed to answer application queries in a way that's consistent.
As an example, if the select statement were to lock the entire table, then the insert statement would have to wait until the select has completed and the lock has been released, meaning that the app would see the results as they were prior to the insert.
I understand that locking the database can prevent the SELECT statement from being messed up.
It can also introduce a potentially unacceptable performance bottleneck, especially if, as you say, the system is inserting lots of rows every 5 seconds, and depending on the frequency with which you're running your queries, how efficiently they've been built, and so on.
what is the good practice when I need the data for calculations while that data will be updated within a short period?
My recommendation is to simply accept the fact that the calculations are based on a snapshot of the data at the specific point in time the calculation was requested, and to let the database do its job of ensuring the consistency and integrity of that data. When the app requests data, it should trust that the database has done its best to provide the most up-to-date, consistent information (i.e. not providing a row where some columns have been updated but others haven't yet).
With new rows coming in at the frequency you mentioned, reasonable users will understand that the results they're seeing are based on data available at the time of request.
All of your questions are related to table locking.
They all depend on the way the database is configured.
Read: http://www.mysqltutorial.org/mysql-table-locking/
Performing a SELECT statement while an INSERT statement is working
If you want to perform a SELECT statement while an INSERT is still running, open a new connection and close it for each check. For example, if I want to insert lots of records and find out via a SELECT whether the last record has been inserted, I have to open and close the connection inside a for or while loop.
# Send a request to store the data (the INSERT runs for a long time).
# Then poll with a SELECT in a loop, opening and closing the connection each time.
# (open_connection / run_select are placeholders for your driver's calls.)
while True:
    cnx = open_connection()   # open a fresh connection for every check
    result = run_select(cnx)  # run the SELECT statement
    cnx.close()
    if result:                # break out of the loop once the result appears
        break

How to select consistent data from multiple tables efficiently

I'm using MySQL 5.6. Let's say we have the following two tables:
Every DataSet has a huge number of child DataEntry records; the count can be 10,000 or 100,000 or more. DataSet.md5sum and DataSet.version get updated when its child DataEntry records are inserted or deleted, in one transaction. A DataSet.md5sum is calculated over all of its children's DataEntry.content values.
In this situation, what's the most efficient way to fetch consistent data from those two tables?
If I issue the following two distinct SELECTs, I think I might get inconsistent data due to concurrent INSERTs / UPDATEs:
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000
SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000 -- I think the result of this query may be inconsistent with the md5sum fetched by the former query
I think I can get consistent data with one query as follows:
SELECT e.dataentry_id, e.content, s.md5sum, s.version
FROM DataSet s
INNER JOIN DataEntry e ON (s.dataset_id = e.dataset_id)
WHERE s.dataset_id = 1000
But it produces a redundant result set filled with 10,000 or 100,000 duplicated md5sums, so I guess it's not efficient (EDIT: my concerns are high network bandwidth and memory consumption).
I think using pessimistic read / write lock (SELECT ... LOCK IN SHARE MODE / FOR UPDATE) would be another option but it seems overkill. Are there any other better approaches?
The join will ensure that the data returned is not affected by any updates that would have occurred between the two separate selects, since they are being executed as a single query.
When you say that md5sum and version are updated, do you mean the child table has a trigger on it for inserts and updates?
When you join the tables, you will get a "duplicate md5sum and version" because you are pulling the matching record for each item in the DataEntry table. That is perfectly fine and isn't going to be an efficiency issue. The alternative would be to use the two individual selects, but depending on the frequency of inserts/updates, without a transaction you run the very slight risk of getting data that may be slightly off.
I would just go with the join. You can run explain plans on your query from within mysql and look at how the query is executed and see any differences between the two approaches based upon your data and if you have any indexes, etc...
Perhaps it would be more beneficial to run these groups of records into a staging table of sorts. Before processing, you could call a pre-processor function that takes a "snapshot" of the data about to be processed, putting a copy into a staging table (a rough sketch follows below). Then you could select just the version and md5sum alone, and then all of the records, as two different selects. Since these are copied into a separate staging table, you won't have to worry about immediate updates corrupting your processing session. You could set up timed jobs to do this, or have it as an on-demand call. Again though, you would need to research the best approach given the hardware/network setup you are working with, and any job-scheduling software you have available to you.
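A rough sketch of that staging idea, with a hypothetical DataEntry_staging table; the exact snapshot mechanics would need to fit your setup:
-- Staging copy with the same structure as the live child table.
CREATE TABLE DataEntry_staging LIKE DataEntry;
-- Pre-processor step: freeze the rows for one data set before processing starts.
INSERT INTO DataEntry_staging
SELECT * FROM DataEntry WHERE dataset_id = 1000;
-- Processing then reads the header once and the frozen rows separately,
-- without concurrent inserts into the live table getting in the way.
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000;
SELECT dataentry_id, content FROM DataEntry_staging WHERE dataset_id = 1000;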
Use this pattern:
START TRANSACTION;
SELECT ... FOR UPDATE; -- this locks the row
...
UPDATE ...
COMMIT;
(and check for errors after every statement, including COMMIT.)
"100000" is not "huge", but "BIGINT" is. Recomment INT UNSIGNED instead.
For an MD5, make sure you are not using utf8: CHAR(32) CHARACTER SET ascii. This goes for any other hex strings.
Or, use BINARY(16) for half the space. Then use UNHEX(md5...) when inserting, and HEX(...) when fetching.
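A small sketch of that storage scheme (the table and column names are illustrative):
-- Store the MD5 as 16 raw bytes instead of a 32-character hex string.
CREATE TABLE DataSet_demo (
  dataset_id INT UNSIGNED NOT NULL PRIMARY KEY,
  md5sum     BINARY(16)   NOT NULL
);
INSERT INTO DataSet_demo (dataset_id, md5sum)
VALUES (1, UNHEX(MD5('some content')));   -- pack the hex digest into binary
SELECT dataset_id, HEX(md5sum) AS md5sum  -- unpack back to hex when reading
FROM DataSet_demo;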
You are concerned about bandwidth, etc. Please describe your client (PHP? Java? ...). Please explain how much (100K rows?) needs to be fetched to re-do the MD5.
Note that there is an MD5() function in MySQL. If each of your items had an MD5, you could take the MD5 of the concatenation of those -- and do it entirely in the server; no bandwidth needed. (Be sure to increase group_concat_max_len.)
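For example, something along these lines computes the whole-set checksum server-side, assuming the per-row digests live in a hypothetical content_md5 column:
-- Raise the limit so the concatenation is not silently truncated.
SET SESSION group_concat_max_len = 10000000;
-- MD5 over the ordered concatenation of the per-row digests, computed in the server.
SELECT MD5(GROUP_CONCAT(content_md5 ORDER BY dataentry_id SEPARATOR ''))
FROM DataEntry
WHERE dataset_id = 1000;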

Improving Speed of SQL 'Update' function - break into Insert/ Delete?

I'm running an ETL process and streaming data into a MySQL table.
Now it is being written over a web connection (fairly fast one) -- so that can be a bottleneck.
Anyway, it's a basic insert/ update function. It's a list of IDs as the primary key/ index .... and then a few attributes.
If a new ID is found, insert, otherwise, update ... you get the idea.
Currently, doing an "update, else insert" function based on the (indexed) ID is processing 13 rows/second (which seems pretty abysmal, right?). This is comparing 1000 rows against a database of 250k records, for context.
A "pure insert everything" approach, for comparison, already speeds the process up to 26 rows/second.
The thing with the pure "insert" approach is that I can have 20 parallel connections "inserting" at once (20 is the max allowed by the web host), whereas the "update" function cannot have any parallel runs.
Thus 26 x 20 = 520 r/s. Quite a bit more than 13 r/s, especially if I can rig something up that allows even more data to be pushed through in parallel.
My question is ... given the massive benefit of inserting vs. updating, is there a way to duplicate the 'update' functionality (I only want the most recent insert of a given ID to survive) by doing a massive insert, then running a delete after the fact that removes duplicate IDs that aren't the 'newest'?
Is this something easy to implement, or something that comes up often?
What else I can do to ensure this update process is faster? I know getting rid of the 'web connection' between the ETL tool and DB is a start, but what else? This seems like it would be a fairly common problem.
Ultimately there are 20 columns, max of probably varchar(50) ... should I be getting a lot more than 13 rows processed/ second?
There are many possible 'answers' to your questions.
13/second -- there is a lot that can be done...
INSERT ... ON DUPLICATE KEY UPDATE ... ('IODKU') is usually the best way to do "update, else insert" (unless I don't know what you mean by it).
Batched inserts are much faster than inserting one row at a time; the optimum is around 100 rows, giving roughly a 10x speedup. IODKU can (usually) be batched, too; see the VALUES() pseudo-function and the sketch below.
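A hedged sketch of a batched IODKU, with made-up table and column names:
-- Insert several rows at once; rows whose id already exists get updated instead.
INSERT INTO target_table (id, attr1, attr2)
VALUES
  (101, 'a', 'x'),
  (102, 'b', 'y'),
  (103, 'c', 'z')
ON DUPLICATE KEY UPDATE
  attr1 = VALUES(attr1),
  attr2 = VALUES(attr2);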
BEGIN; ... lots of writes ... COMMIT; cuts back significantly on transaction overhead.
Using a "staging" table to gather things up before the update can have a significant benefit. My blog discusses that; it also covers batch "normalization".
Building Summary Tables on the fly interferes with high speed data ingestion. Another blog covers Summary tables.
Normalization can be used for de-dupping, hence shrinking the disk footprint. This can be important for decreasing I/O for the 'Fact' table in Data Warehousing. (I am referring to your 20 x VARCHAR(50).)
RAID striping is a hardware help.
A battery-backed write cache on a RAID controller makes writes seem instantaneous.
SSDs speed up I/O.
If you provide some more specifics (SHOW CREATE TABLE, SQL, etc), I can be more specific.
Do it in the DBMS, and wrap it in a transaction.
To explain:
Load your data into a temporary table in MySQL in the fastest way possible. Bulk load, insert, do whatever works. Look at "load data infile".
Outer-join the temporary table to the target table, and INSERT those rows where the PK column of the target table is NULL.
Outer-join the temporary table to the target table, and UPDATE those rows where the PK column of the target table is NOT NULL.
Wrap steps 2 and 3 in a BEGIN/COMMIT (or START TRANSACTION/COMMIT) pair. The default behaviour is probably autocommit, which means you are doing a LOT of database work after every insert/update. Use transactions properly, and the work is only done once for each block. A sketch of the whole sequence follows.
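A minimal sketch of those steps, assuming a target table target (id as the PK) and a staging table stage with matching columns; all names and the file path are illustrative:
-- Step 1: bulk-load the incoming batch into the staging table.
LOAD DATA LOCAL INFILE '/tmp/batch.csv'
INTO TABLE stage
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
START TRANSACTION;
-- Step 2: insert rows that do not exist in the target yet.
INSERT INTO target (id, col1, col2)
SELECT s.id, s.col1, s.col2
FROM stage s
LEFT JOIN target t ON t.id = s.id
WHERE t.id IS NULL;
-- Step 3: update rows that already exist.
UPDATE target t
JOIN stage s ON s.id = t.id
SET t.col1 = s.col1,
    t.col2 = s.col2;
COMMIT;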

How can I parallelize Writes to the same row in MySQL?

I'm currently building a system that does running computations, and every 5 seconds inserts or updates information based on those computations to a few rows in MySQL. I'm working on running this system on a few different servers at once right now, with a few agents that each do similar processing and then write to the same set of rows. I already randomize the order in which each agent writes its set of rows, but there's still a lot of deadlock happening. What's the best/fastest way to get through those deadlocks? Should I just rerun the query each time one happens, or use row locks, or something else entirely?
I suggest you try something that won't require more than one client to update your 'few rows.'
For example, you could have each agent that produces results do an INSERT to a staging table with the MEMORY access method.
Then, every five seconds, you can run a MySQL event (a stored procedure within the server) that loops through all the rows in that table, posts their results to your 'few rows', and then deletes them; a rough sketch follows below. If it's important for the rows in your staging table to be processed in order, then you can use an AUTO_INCREMENT id field. But it might not be important for them to be in order.
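A rough sketch of that arrangement, with hypothetical table and column names (the event scheduler must be enabled, and the mysql client needs a DELIMITER change around the compound body):
-- Agents INSERT their computed values here; MEMORY keeps those writes cheap.
CREATE TABLE result_staging (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  target_id INT UNSIGNED NOT NULL,
  value     INT NOT NULL
) ENGINE = MEMORY;
-- Every five seconds a single server-side job folds the staged rows
-- into the shared rows and then clears the staging table.
CREATE EVENT apply_staged_results
ON SCHEDULE EVERY 5 SECOND
DO
BEGIN
  UPDATE shared_rows r
  JOIN result_staging s ON s.target_id = r.id
  SET r.value = s.value;
  DELETE FROM result_staging;
END;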
If you want to get fancier and more scalable than that, you'll need a queue management system like Apache ActiveMQ.

Mysql Update performance suddenly abysmal

MySQL 5.1, Ubuntu 10.10 64bit, Linode virtual machine.
All tables are InnoDB.
One of our production machines uses a MySQL database containing 31 related tables. In one table, there is a field containing display values that may change several times per day, depending on conditions.
These changes to the display values are applied lazily throughout the day during usage hours. A script periodically runs and checks a few inexpensive conditions that may cause a change, and updates the display value if a condition is met. However, this lazy method doesn't catch all possible scenarios in which the display value should be updated, in order to keep background process load to a minimum during working hours.
Once per night, a script purges all display values stored in the table and recalculates them all, thereby catching all possible changes. This is a much more expensive operation.
This has all been running consistently for about 6 months. Suddenly, 3 days ago, the run time of the nightly script went from an average of 40 seconds to 11 minutes.
The overall proportions on the stored data have not changed in a significant way.
I have investigated as best I can, and the part of the script that is suddenly running slower is the last update statement that writes the new display values. It is executed once per row, given the (INT(11)) id of the row and the new display value (also an INT).
update `table` set `display_value` = ? where `id` = ?
The funny thing is that the purge of all the previous values is executed as:
update `table` set `display_value` = null
And this statement still runs at the same speed as always.
The display_value field is not indexed. id is the primary key. There are 4 other foreign keys in the table that are not modified at any point during execution.
And the final curve ball: if I dump this schema to a test VM and execute the same script, it runs in 40 seconds, not 11 minutes. I have not attempted to rebuild the schema on the production machine, as that's simply not a long-term solution and I want to understand what's happening here.
Is something off with my indexes? Do they get cruft in them after thousands of updates on the same rows?
Update
I was able to completely resolve this problem by running OPTIMIZE on the schema. Since InnoDB doesn't support OPTIMIZE directly, this forced a table rebuild, and that resolved the issue. Perhaps I had a corrupted index?
mysqlcheck -A -o -u <user> -p
There is a chance that the UPDATE statement won't use an index on id; however, that's very improbable (if possible at all) for a query like yours.
Is there a chance your table is locked by a long-running concurrent query / DML? Which engine does the table use?
Also, updating the table record by record is not efficient. You can load your values into a temporary table in bulk and update the main table with a single command:
CREATE TEMPORARY TABLE tmp_display_values (id INT NOT NULL PRIMARY KEY, new_display_value INT);
INSERT
INTO tmp_display_values
VALUES
(?, ?),
(?, ?),
…;
UPDATE `table` dv
JOIN tmp_display_values t
  ON dv.id = t.id
SET dv.display_value = t.new_display_value;