SQL log file growing too big - sql-server-2008

I have a table with 10 million records and no indexes, and I am trying to dedupe it. I tried INSERT ... SELECT using either a LEFT JOIN or WHERE NOT EXISTS, but each time I get a key-violation error. The other problem is that the log file grows too large and the transaction never completes. I tried setting the recovery model to SIMPLE, as recommended online, but that does not help. Here are the queries I used:
insert into temp(profile,feed,photo,dateadded)
select distinct profile,feed,photo,dateadded from original as s
where not exists(select 1 from temp as t where t.profile=s.profile)
This just produces the key-violation error. I then tried the following:
insert into temp(profile,feed,photo,dateadded)
select distinct profile,feed,photo,dateadded from original as s
left outer join temp t on t.profile=s.profile where t.profile is null
In both cases the log file fills up before the transaction completes. So my main question is about the log file; I can work out the deduplication queries myself.

You may need to work in batches. Write a loop that processes about 5,000 records at a time (experiment with the number; I've had to go as low as 500 or as high as 50,000 depending on the database and how busy it was).
What is your key? Your query will likely need to pick a single row per key using an aggregate function on dateadded (use MIN or MAX).
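Putting both suggestions together, here is a minimal sketch, using SQLite as a stand-in for SQL Server. The table and column names come from the question; the batch size and the use of MIN as the tie-breaking aggregate are illustrative choices:

```python
import sqlite3

# Dedupe `original` into `temp`, one batch per transaction, keeping one
# row per profile via GROUP BY + aggregates (MIN picks a representative
# value per column). Each batch commits on its own, so no single
# transaction's log grows unbounded.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE original (profile TEXT, feed TEXT, photo TEXT, dateadded TEXT);
    CREATE TABLE temp (profile TEXT PRIMARY KEY, feed TEXT, photo TEXT, dateadded TEXT);
    INSERT INTO original VALUES
        ('a', 'f1', 'p1', '2013-01-01'),
        ('a', 'f2', 'p2', '2013-01-02'),
        ('b', 'f3', 'p3', '2013-01-03');
""")

batch_size = 1  # tune (500 .. 50,000); tiny here so the loop runs more than once
while True:
    with conn:  # one transaction per batch
        cur = conn.execute("""
            INSERT INTO temp (profile, feed, photo, dateadded)
            SELECT s.profile, MIN(s.feed), MIN(s.photo), MIN(s.dateadded)
            FROM original AS s
            WHERE NOT EXISTS (SELECT 1 FROM temp AS t WHERE t.profile = s.profile)
            GROUP BY s.profile
            LIMIT ?
        """, (batch_size,))
    if cur.rowcount == 0:
        break  # nothing left to copy

deduped = conn.execute("SELECT COUNT(*) FROM temp").fetchone()[0]
print(deduped)  # 2
```

The GROUP BY guarantees at most one row per profile within each statement, which is what avoids the key violation the question hit with DISTINCT over all four columns.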

The bigger the transaction, the bigger the transaction log will be.
The log records every change so that an open transaction can be rolled back, so if you're not committing frequently and you're executing a very large transaction, the log file will grow substantially. Once the transaction commits, that space becomes free again. This is to safeguard the data in case something fails and a rollback is needed.
My suggestion would be to run the insert in batches, committing after each batch.

Related

Long-running InnoDB query generates a big undo file in MariaDB

I have a big query in PHP using MYSQLI_USE_RESULT so that all the results are not loaded into PHP memory.
(With MYSQLI_STORE_RESULT, all of the data is buffered in memory at once, which takes multiple GB of RAM, instead of being fetched row by row.)
The query returns millions of rows, and each row generates an API request, so the query runs for days.
In the meantime, other MySQL queries update/insert the tables involved in the first query, and I think this causes the undo log to grow without stopping.
I set innodb_undo_tablespaces=2 and innodb_undo_log_truncate=ON
so the undo log is separated from ibdata1, but the undo files still stay large until I kill the queries that have been running for days.
I executed "SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;" before running the long query, hoping that it would stop the undo file from growing, but it didn't.
The other queries that do the updates/inserts use autocommit.
Within 1-2 days, the undo file is already 40 GB.
The question: how do I stop this undo file from growing? I don't need to keep the previous version of the data while the query is running; it's fine if I get updated data instead of the data as it was when the query started.
Regardless of your transaction isolation level, a given query will always establish a fixed snapshot, which requires the data to be preserved in the state it was when the query started.
In other words, READ-COMMITTED or READ-UNCOMMITTED allow subsequent queries in the same transaction to see updated data, but a single query will never see a changing data set. Thus concurrent updates to data will force old record versions to be copied to the undo log, and those record versions will be preserved there until your long-running query is finished.
READ-UNCOMMITTED doesn't help any more than READ-COMMITTED. In fact, I've never needed to use READ-UNCOMMITTED for any reason. Allowing "dirty reads" of unfinished transactions breaks rules of ACID databases, and leads to anomalies.
The only way to avoid long-lasting growth of your undo log is to finish your query.
The simplest way to achieve this is to use multiple short-running queries, each fetching a subset of the result. Finish each query in a timely way.
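One way to sketch that short-query approach is keyset pagination on the primary key: each chunk is a fresh, quickly-finished query, so old row versions only need to be retained for its brief duration. Shown here with Python's stdlib sqlite3 as a stand-in for the mysqli code; table and column names are made up:

```python
import sqlite3

# Demo data: a table with an auto-assigned integer primary key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO items (payload) VALUES ('a'), ('b'), ('c'), ('d'), ('e');
""")

def fetch_chunks(conn, chunk_size=2):
    """Yield every row, fetching one short-lived chunk at a time,
    keyed on the primary key rather than OFFSET (stays fast on big tables)."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size)).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # resume after the last id seen

seen = [payload for _, payload in fetch_chunks(conn)]
print(seen)  # ['a', 'b', 'c', 'd', 'e']
```

Between chunks, the work per row (the API request, in the question's case) happens with no query open, so the undo log can be purged.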
Another solution would be to run the whole query for the millions of rows of result, and store the result somewhere that isn't constrained by InnoDB transaction isolation.
MyISAM table
Message queue
Plain file on disk
Cache like Memcached or Redis
PHP memory (but you said you aren't comfortable with this because of the size)

MariaDB. Use Transaction Rollback without locking tables

On a website, when a user posts a comment I do several queries, Inserts and Updates. (On MariaDB 10.1.29)
I use START TRANSACTION so if any query fails at any given point I can easily do a rollback and delete all changes.
Now I noticed that this locks the tables: one INSERT blocks another INSERT, and I don't mean while the query is running (that's obvious) but until the transaction is closed.
A DELETE is only blocked if the rows share a common index key (comments for the same page), but luckily an UPDATE is not blocked.
Can I run a transaction that does not block new inserts on the table (while the transaction is open, not just while the actual query runs), or is there any other method that lets me conveniently "undo" any query done after some point?
PS:
I start the transaction with PHP's mysqli_begin_transaction() without any of the flags, and then mysqli_commit().
I don't think that a simple INSERT would block other inserts for longer than the insert time. AUTO_INC locks are not held for the full transaction time.
But if two transactions try to UPDATE the same row like in the following statement (two replies to the same comment)
UPDATE comment SET replies=replies+1 WHERE com_id = ?
the second one will have to wait until the first one is committed. You need that lock to keep the count (replies) consistent.
I think all you can do is keep the transaction time as short as possible. For example, you can prepare all statements before you start the transaction, but that is a matter of milliseconds. If you transfer files and that can take 40 seconds, you shouldn't do it while the database transaction is open. Transfer the files before you start the transaction and save them under a name that indicates the operation is not complete. You can also save them in a different folder, but on the same partition. Then, when you run the transaction, you just need to rename the files, which should not take much time. From time to time you can clean up and remove any files that were never renamed.
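A sketch of that file-handling suggestion (paths and file names are illustrative, and the database work itself is elided):

```python
import os
import tempfile

# 1) The slow part happens BEFORE the transaction: write the upload under
#    a name that marks it as incomplete, on the same partition as the
#    final location so the later rename is a cheap metadata operation.
upload_dir = tempfile.mkdtemp()
incoming = os.path.join(upload_dir, "attachment.bin.incomplete")
final = os.path.join(upload_dir, "attachment.bin")

with open(incoming, "wb") as f:
    f.write(b"file contents")  # stand-in for the 40-second transfer

# 2) Inside the (short) database transaction: insert the comment row,
#    then just rename the file -- milliseconds, not seconds.
os.rename(incoming, final)

# A periodic clean-up job can delete any *.incomplete files left over
# from transactions that never committed.
print(os.path.exists(final))
```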
All write operations work in similar ways -- they lock the rows that they touch (or might touch) from the time the statement is executed until the transaction is closed via either COMMIT or ROLLBACK. SELECT ... FOR UPDATE and SELECT ... LOCK IN SHARE MODE also acquire such locks.
When a write operation occurs, deadlock checking is done.
In some situations, there is "gap" locking. Did com_id happen to be the last id in the table?
Did you leave out any SELECTs that needed FOR UPDATE?

Will a MySQL SELECT statement interrupt INSERT statement?

I have a MySQL table that gains new records every 5 seconds.
The questions are:
can I run a query on this set of data that may take more than 5 seconds?
if a SELECT statement takes more than 5 s, will it affect the scheduled INSERT statement?
what happens when an INSERT statement is invoked while the SELECT is still running? Will the SELECT get the newly inserted records?
I'll go over your questions and some of the comments you added later.
can I run a query on this set of data that may take more than 5 seconds?
Can you? Yes. Should you? It depends. In a MySQL configuration I set up, any query taking longer than 3 seconds was considered slow and logged accordingly. In addition, you need to keep in mind the frequency of the queries you intend to run.
For example, if you try to run a 10 second query every 3 seconds, you can probably see how things won't end well. If you run a 10 second query every few hours or so, then it becomes more tolerable for the system.
That being said, slow queries can often benefit from optimizations, such as not scanning the entire table (i.e. searching using primary keys) and using the EXPLAIN keyword to get the database's query planner to tell you how it intends to execute the query internally (e.g. is it using PKs, FKs, indexes, or is it scanning all table rows?).
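To illustrate the EXPLAIN idea, here is a small sketch using SQLite's EXPLAIN QUERY PLAN, which plays the same role as MySQL's EXPLAIN; the table and index names are made up:

```python
import sqlite3

# Compare the planner's chosen strategy for the same query, with and
# without an index on the filter column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, value REAL)")

plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM readings WHERE sensor_id = 7"
).fetchall()

conn.execute("CREATE INDEX idx_sensor ON readings (sensor_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM readings WHERE sensor_id = 7"
).fetchall()

# The last column of each plan row is a human-readable description:
# without the index the planner scans the whole table, with it the
# plan mentions idx_sensor.
print(plan_before[-1][-1])
print(plan_after[-1][-1])
```

The exact wording of the plan varies by version, but the scan-vs-index distinction is exactly what you look for when tuning the slow query.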
if SELECT statement takes more than 5s, will it affect the scheduled INSERT statement?
"Affect" in what way? If you mean "prevent insert from actually inserting until the select has completed", that depends on the storage engine. For example, MyISAM and InnoDB are different, and that includes locking policies. For example, MyISAM tends to lock entire tables while InnoDB tends to lock specific rows. InnoDB is also ACID-compliant, which means it can provide certain integrity guarantees. You should read the docs on this for more details.
what happens when an INSERT statement is invoked while the SELECT is still running? Will the SELECT get the newly inserted records?
Part of "what happens" is determined by how the specific storage engine behaves. Regardless of what happens, the database is designed to answer application queries in a way that's consistent.
As an example, if the select statement were to lock an entire table, then the insert statement would have to wait until the select has completed and the lock has been released, meaning that the app would see the results prior to the insert's update.
I understand that locking the database can prevent messing up the SELECT statement.
It can also create a potentially unacceptable performance bottleneck, especially if, as you say, the system is inserting lots of rows every 5 seconds, and depending on the frequency with which you run your queries, how efficiently they've been built, etc.
what is the good practice when I need the data for calculations while that data will be updated within a short period?
My recommendation is to simply accept the fact that the calculations are based on a snapshot of the data at the specific point in time the calculation was requested and to let the database do its job of ensuring the consistency and integrity of said data. When the app requests data, it should trust that the database has done its best to provide the most up-to-date piece of consistent information (i.e. not providing a row where some columns have been updated, but others yet haven't).
With new rows coming in at the frequency you mentioned, reasonable users will understand that the results they're seeing are based on data available at the time of request.
All of your questions are related to table locking, and the answers depend on how the database is configured.
Read: http://www.mysqltutorial.org/mysql-table-locking/
Performing a SELECT statement while an INSERT statement is working
If you want to run a SELECT while a long INSERT is still executing, open a new connection for each check and close it again afterwards. For example, if I insert lots of records and want to detect that the last record has been inserted by querying for it, I have to open and close the connection inside the polling loop:
# send a request to store the data
# (the INSERT statement runs for a long time)

# poll with a SELECT in a loop, opening a fresh connection each time
while True:
    cnx = open_connection()
    result = run_select(cnx)
    cnx.close()
    if result:
        break  # stop polling once the result shows up

How to select consistent data from multiple tables efficiently

I'm using MySQL 5.6. Let's say we have the following two tables:
Every DataSet has a huge number of child DataEntry records, often 10,000 or 100,000 or more. DataSet.md5sum and DataSet.version are updated in the same transaction in which child DataEntry records are inserted or deleted. A DataSet.md5sum is calculated over all of its children's DataEntry.content values.
In this situation, what's the most efficient way to fetch consistent data from those two tables?
If I issue the following two distinct SELECTs, I think I might get inconsistent data due to concurrent INSERT / UPDATEs:
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000
SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000 -- the result of this query may be inconsistent with the md5sum fetched by the former query
I think I can get consistent data with one query as follows:
SELECT e.dataentry_id, e.content, s.md5sum, s.version
FROM DataSet s
INNER JOIN DataEntry e ON (s.dataset_id = e.dataset_id)
WHERE s.dataset_id = 1000
But it produces a redundant result set filled with 10,000 or 100,000 duplicated md5sums, so I guess it's not efficient (EDIT: my concerns are network bandwidth and memory consumption).
I think using pessimistic read / write lock (SELECT ... LOCK IN SHARE MODE / FOR UPDATE) would be another option but it seems overkill. Are there any other better approaches?
The join will ensure that the data returned is not affected by any updates that would have occurred between the two separate selects, since they are being executed as a single query.
When you say that md5sum and version are updated, do you mean the child table has a trigger on it for inserts and updates?
When you join the tables, you will get a "duplicate md5sum and version" because you are pulling the matching record for each item in the DataEntry table. It is perfectly fine and isn't going to be an efficiency issue. The alternative would be to use the two individual selects, but depending upon the frequency of inserts/updates, without a transaction, you run the very slight risk of getting data that may be slightly off.
I would just go with the join. You can run EXPLAIN plans on your query from within MySQL, look at how the query is executed, and see any differences between the two approaches based upon your data and whether you have any indexes, etc.
Perhaps it would be more beneficial to run these groups of records into a staging table of sorts. Before processing, you could call a pre-processor function that takes a "snapshot" of the data about to be processed and puts a copy into a staging table. Then you could select the version and md5sum alone, and then all of the records, as two different selects. Since these are copied into a separate staging table, you won't have to worry about immediate updates corrupting your processing session. You could set this up as a timed job or as an on-demand call. Again, though, this is something you would need to research to find the best approach given the hardware/network setup you are working with, and whatever job-scheduling software you have available.
Use this pattern:
START TRANSACTION;
SELECT ... FOR UPDATE; -- this locks the row
...
UPDATE ...
COMMIT;
(and check for errors after every statement, including COMMIT.)
"100000" is not "huge", but "BIGINT" is. Recomment INT UNSIGNED instead.
For an MD5, make sure you are not using utf8: CHAR(32) CHARACTER SET ascii. This goes for any other hex strings.
Or, use BINARY(16) for half the space. Then use UNHEX(md5...) when inserting, and HEX(...) when fetching.
You are concerned about bandwidth, etc. Please describe your client (PHP? Java? ...). Please explain how much (100K rows?) needs to be fetched to re-do the MD5.
Note that there is an MD5() function in MySQL. If each of your items had an MD5, you could take the MD5 of the concatenation of those and do it entirely in the server; no bandwidth needed. (Be sure to increase group_concat_max_len.)
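A Python sketch of that server-side checksum idea, just to show the arithmetic; the MySQL expression in the comment is a hedged sketch of what the server-side form might look like, not tested SQL:

```python
import hashlib

# If each item already has an MD5, a whole-set checksum can be the MD5
# of the concatenated per-item MD5s -- in MySQL, something like
#   MD5(GROUP_CONCAT(md5col ORDER BY dataentry_id SEPARATOR ''))
# The ORDER BY matters: the combined hash changes if the order changes.
items = [b"row one", b"row two", b"row three"]
per_item = [hashlib.md5(x).hexdigest() for x in items]
combined = hashlib.md5("".join(per_item).encode("ascii")).hexdigest()

# As noted above, the hex form is 32 characters, but the raw digest
# (what UNHEX() would store in a BINARY(16) column) is only 16 bytes.
print(len(combined), len(bytes.fromhex(combined)))  # 32 16
```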

MySQL INSERT SELECT on large static table

I need to copy the content of one table to another. So I started using:
INSERT new_table SELECT * FROM old_table
However, I am getting the following error now:
1297, "Got temporary error 233 'Out of operation records in transaction coordinator (increase MaxNoOfConcurrentOperations)' from NDBCLUSTER"
I think I understand why this occurs: my table is huge, and MySQL tries to take a point-in-time snapshot (lock everything and make one large transaction out of it).
However, my data is fairly static and there is no other concurrent session that would modify the data. How can I tell MySQL to copy one row at a time, or in smaller chunks, without locking the whole thing?
Edit note: I already know that I can read the whole table row by row into memory/a file/a dump and write it back. I want to know whether there is an easy way (maybe by setting the isolation level?). Note that the engine is InnoDB.
Data Migration is one of the few instances where a CURSOR can make sense, as you say, to ensure that the number of locks stays sane.
Use a cursor in conjunction with a TRANSACTION, committing after every row or after every N rows (e.g. use a counter with modulo).
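An alternative to a cursor is copying in primary-key ranges, one commit per chunk, so no single transaction has to hold all the operation records. A sketch using Python's stdlib sqlite3 as a stand-in; the table names follow the question, and the chunk size is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_table (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE new_table (id INTEGER PRIMARY KEY, data TEXT);
    INSERT INTO old_table (data) VALUES ('a'), ('b'), ('c'), ('d'), ('e');
""")

chunk = 2       # in practice, thousands of rows per chunk
last_id = 0
while True:
    with conn:  # commit once per chunk
        cur = conn.execute(
            """INSERT INTO new_table (id, data)
               SELECT id, data FROM old_table
               WHERE id > ? ORDER BY id LIMIT ?""",
            (last_id, chunk))
    if cur.rowcount == 0:
        break  # everything has been copied
    last_id = conn.execute("SELECT MAX(id) FROM new_table").fetchone()[0]

copied = conn.execute("SELECT COUNT(*) FROM new_table").fetchone()[0]
print(copied)  # 5
```

Since the data is static, the chunks can't miss or double-copy rows; each chunk's transaction is small enough to stay under limits like MaxNoOfConcurrentOperations.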
SELECT the data from InnoDB INTO OUTFILE and LOAD DATA INFILE into the cluster.