I'm writing a script to load records from a file into a MySQL DB using Hibernate. I'm processing records in batches of 1000 using transactions; an insert fails if the record already exists in the DB, which essentially causes the entire transaction to be rolled back. Is there a way to know which records were processed in the rolled-back transaction?
Also, considering this scenario, is there a better way to do it? Note that the script runs daily, it is not a one-time load, and the file typically has about 250 million records each day.
You can use the StatelessSession API and catch ConstraintViolationException; that lets you discard the failed record without rolling back the whole transaction.
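A minimal sketch of that approach, assuming a SessionFactory is already configured and the mapped entities arrive in batches (the class and method names here are illustrative, not from the original question):

import java.util.List;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;
import org.hibernate.exception.ConstraintViolationException;

public class BatchLoader {

    // sessionFactory and the mapped entity instances are assumed to exist elsewhere.
    public void insertBatch(SessionFactory sessionFactory, List<Object> batch) {
        StatelessSession session = sessionFactory.openStatelessSession();
        Transaction tx = session.beginTransaction();
        try {
            for (Object record : batch) {
                try {
                    // A stateless session writes straight to the database: no first-level
                    // cache and no cascades, so a failure affects only this one statement.
                    session.insert(record);
                } catch (ConstraintViolationException e) {
                    // Duplicate key: log the offending record and keep going
                    // instead of rolling back the whole batch.
                    System.err.println("Skipping duplicate record: " + record);
                }
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}

Because the stateless session keeps no persistence context, memory use also stays flat, which matters for a daily file of roughly 250 million records.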
Is there any good and performant alternative to FOR UPDATE SKIP LOCKED in MariaDB? Or is there any good practice to achieve job queueing in MariaDB?
Instead of using a lock to indicate a queue record is being processed, use an indexed processing column. Set it to 0 for new records, and, in a separate transaction from any processing, select a single not yet processing record and update it to 1. Possibly also store the time and process or thread id and server that is processing the record. Have a separate monitoring process to detect jobs flagged as processing that did not complete processing within the expected time.
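A rough JDBC sketch of that claim step, assuming a job_queue table with indexed processing, claimed_at and worker columns (all of these names are invented for illustration); the claim runs in its own short transaction, separate from the actual processing:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class QueueClaimer {

    /** Claims one unprocessed row and returns its id, or -1 if the queue is empty. */
    public long claimOne(DataSource ds, String workerId) throws SQLException {
        try (Connection conn = ds.getConnection()) {
            conn.setAutoCommit(false);
            long id = -1;
            // Lock a single not-yet-processing row for the duration of this short transaction.
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT id FROM job_queue WHERE processing = 0 ORDER BY id LIMIT 1 FOR UPDATE");
                 ResultSet rs = select.executeQuery()) {
                if (rs.next()) {
                    id = rs.getLong(1);
                }
            }
            if (id != -1) {
                // Flag the row as being processed and record who claimed it and when.
                try (PreparedStatement update = conn.prepareStatement(
                        "UPDATE job_queue SET processing = 1, claimed_at = NOW(), worker = ? WHERE id = ?")) {
                    update.setString(1, workerId);
                    update.setLong(2, id);
                    update.executeUpdate();
                }
            }
            conn.commit(); // releases the row lock; the actual processing happens afterwards
            return id;
        }
    }
}

On MariaDB 10.6 and later the claim query can also use FOR UPDATE SKIP LOCKED (see the update further down), so competing workers do not block on each other's claims.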
An alternative that avoids even the temporary lock on a non-primary index needed to select a record is to use a separate, non-database message queue to notify you of new records available in the database queue. (Unless you truly never care whether a unit of work is processed more than once, I would always use a database table in addition to any non-database queue.)
DELETE FROM QUEUE_TABLE LIMIT 1 RETURNING *
for dequeue operations. Depending on your needs, it might work OK.
Update 2022-06-14:
MariaDB supports SKIP LOCKED now.
I am using the Talend Open Studio for Data Integration tool to transfer SQL Server table data to a MySQL database.
The table has 40 million records.
I created and ran the job, but the connection failed after inserting roughly 20 million rows.
When I tried to insert the data again, the Talend job first truncated the table and then started inserting from the beginning.
The question seems to be incomplete, but assuming that you do not want the table truncated before each load, check the "Action on table" property. It should be set to "Default" or "Create Table if does not exist".
Now, if your question is about handling restartability, so that the job resumes from the 20-million-row mark on the next run, there are multiple ways you could achieve this. Since you are dealing with a high number of records, a pagination-style mechanism would help: load the data in chunks (say 10,000 at a time) and loop, setting the commit interval to 10,000. After each successful batch of 10,000 records is written to the database, make an entry in a log table with the timestamp or the incremental key from your data (to mark the checkpoint). Your job should look something like this:
tLoop--{read checkpoint from table}--tMSSqlInput--tMySqlOutput--{load new checkpoint in table}
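Outside the Talend components themselves, the checkpoint bookkeeping is just two small statements. A rough JDBC sketch, assuming an incremental key column id in the source data and a one-row load_checkpoint table in the target database (both names are invented for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CheckpointedLoad {

    // Read the last committed key so a restarted run resumes instead of reloading everything.
    long readCheckpoint(Connection target) throws SQLException {
        try (PreparedStatement ps = target.prepareStatement(
                "SELECT last_id FROM load_checkpoint WHERE job_name = 'daily_load'");
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0L;
        }
    }

    // After each successfully committed chunk, advance the checkpoint in the same target DB.
    void writeCheckpoint(Connection target, long lastId) throws SQLException {
        try (PreparedStatement ps = target.prepareStatement(
                "UPDATE load_checkpoint SET last_id = ?, updated_at = NOW() WHERE job_name = 'daily_load'")) {
            ps.setLong(1, lastId);
            ps.executeUpdate();
        }
    }
}

Each loop iteration would then read the next chunk from SQL Server with something like WHERE id > last_id ORDER BY id and a 10,000-row limit, write it to MySQL, commit, and call writeCheckpoint with the highest id it loaded.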
You can set a property in a context variable, say 'loadType', which will have the value 'initial' or 'incremental'.
Before truncating the table, use an 'if' link to check the value of this variable: if it is 'initial', truncate the table; if it is 'incremental', run your subjob to load the data.
In my SQL Server stored procedure, I am processing more than a million records. To improve performance, I process the records after splitting them into batches.
Example: if I am processing 10,000 records, my logic picks up 1,000 records at a time, applies the business logic, applies the result to the database, and commits the transaction.
In my situation, when one record out of those 1,000 has an error, I want to log the error record, process the rest of the records, and issue a commit for the successfully processed records.
How can I achieve this?
....
I am using two separate processes via multiprocessing in my application. Both have access to a MySQL database via SQLAlchemy Core (not the ORM). One process reads data from various sources and writes it to the database. The other process just reads the data from the database.
I have a query which gets the latest record from a table and displays its id. However, it always displays the first id, which was created when I started the program, rather than the latest inserted id (new rows are created every few seconds).
If I use a separate MySQL tool and run the query manually I get correct results, but SQLAlchemy always gives me stale results.
Since you can see the changes your writer process is making with another MySQL tool, your writer process is indeed committing the data (at least it is if you are using InnoDB).
InnoDB shows you the state of the database as of when you started your transaction. Whatever other tools you are using probably have an autocommit feature turned on where a new transaction is implicitly started following each query.
To see the changes in SQLAlchemy do as zzzeek suggests and change your monitoring/reader process to begin a new transaction.
One technique I've used to do this myself is to add autocommit=True to the execution_options of my queries, e.g.:
result = conn.execute( select( [table] ).where( table.c.id == 123 ).execution_options( autocommit=True ) )
Assuming you're using InnoDB, the data on your connection will appear "stale" for as long as you keep the current transaction running or the other transaction remains uncommitted. For one process to see the data from the other process, two things need to happen: 1. the transaction that created the new data needs to be committed, and 2. the current transaction, assuming it has already read some of that data, needs to be rolled back or committed and started again. See The InnoDB Transaction Model and Locking.
I have a MySQL table implementing a mail queue, and I also use it to send emails reporting unexpected errors in the system. Sometimes these unexpected errors occur inside a transaction, so when I roll back the transaction I also undo the row inserted into the mail queue table (the mail reporting the unexpected error).
My question is: how can I force a row to be inserted into a table in the middle of a transaction, ignoring a possible transaction rollback? That is, if the transaction finally rolls back, I don't want to also roll back the insertion of the row for the email reporting the error details.
This table can be read by multiple asynchronous processes that send the queued mails, so the rows have to be locked so that each email is sent only once; that rules out the MyISAM table type, and the table uses InnoDB.
Thanks in advance.
If your INSERT should survive a ROLLBACK of the transaction, it is safe to say that it is not part of the transaction. So what you should do is simply move it outside the transaction. There are many ways to achieve that:
While in the transaction, instead of running your INSERT, store the fields in session variables (these will survive a ROLLBACK); after the transaction, run the INSERT from the session variables.
Rethink your schema - this reeks of some deeper-lying problem
Open a second DB connection and run your INSERT on that one; it will not be affected by the transaction on the first connection.
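For the second-connection option, a minimal JDBC sketch (error_mail_queue and its columns are assumptions, not from the question); the key point is that this connection has its own transaction, so a rollback on the main connection never touches it:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ErrorMailLogger {

    private final String url;
    private final String user;
    private final String password;

    public ErrorMailLogger(String url, String user, String password) {
        this.url = url;
        this.user = user;
        this.password = password;
    }

    /** Inserts the error mail on a dedicated connection, independent of any caller transaction. */
    public void queueErrorMail(String recipient, String subject, String body) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO error_mail_queue (recipient, subject, body) VALUES (?, ?, ?)")) {
            // Autocommit is on by default for a fresh connection, so this INSERT
            // commits immediately and survives any rollback on the first connection.
            ps.setString(1, recipient);
            ps.setString(2, subject);
            ps.setString(3, body);
            ps.executeUpdate();
        }
    }
}

If the application already uses a connection pool, borrowing a second pooled connection for the error insert achieves the same effect.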
You could create a different connection to the database to insert the errors; it won't be in the same transaction context, so the error rows would still be inserted even if the main transaction rolls back.