Pentaho Table input to table output in batches - mysql

My input table in MySQL has 20 Million records and the target table in Oracle is empty. I need to load the entire table from MySQL into Oracle. I am simply using a Table Input and Table Output step.
My intention is not to lock the source table for a long time whilst reading.
Is there a problem with the load (Number of records) I am trying to achieve?
I can see a "Use batch update for inserts" option in the Table Output step, but nothing similar in the Table Input step. Is there a way to perform batch processing in Pentaho?

Don't worry, 20 million records is a small number for PDI, and you will not lock the table if it is only opened for input. That's why bulk loading is offered for output tables, not input tables.
A common beginner trap, however, is the Truncate table option on the output table. If you run the output step twice (inadvertently or for parallel processing), each copy will lock the other. Forever.
To speed up: you may use the Lazy conversion check box on the input, so that the data remains in byte format until it is used. But I am not sure you gain anything on a simple table input/output. Also, if something goes wrong with Dates or Blobs when writing to the output, the error message will be quite cryptic.
You can also increase the speed of the output by increasing the commit size (worth a few trials in Oracle), and by increasing the number of rows in the row set, which will increase the number of rows read by the Table Input. To do so, right-click anywhere, then Properties/Miscellaneous.
Something I really advise you to do is to increase the JVM memory size. Use an editor (Notepad or better) to edit the file named spoon.bat. Around lines 94-96 you'll find a line containing something like "-Xmx256K". Change it to "-Xmx4096M" (choose a value of about half your machine's RAM; 4096M suits a machine with 8 GB).
"Batch processing" has many meanings. One of them is Make the transformation database transactional, which you can do with the check box just below the above-mentioned Number of rows in rowset (it is buggily spelled as Make the transformation database in the latest PDI version). With that box checked, if something goes wrong the state of the databases is rolled back as if the transformation had never been executed. But I don't advise doing this in your case.

In addition to AlainD's solution, there are a couple of options:
- Tune MySQL for better performance on Inserts
- Use the MySQL Bulk loader step in PDI
- Write SQL statements to file with PDI and read them with mysql-binary
Speed can also be boosted by using some simple JDBC connection settings:
useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true
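As a rough sketch, the three settings above can be appended to a MySQL JDBC URL as query parameters. The host, port and database name below are placeholders, not values from the question; in PDI the same name/value pairs would typically be entered as connection options rather than built by hand.

```python
from urllib.parse import urlencode

# The three Connector/J settings listed above.
OPTIONS = {
    "useServerPrepStmts": "false",
    "rewriteBatchedStatements": "true",
    "useCompression": "true",
}

def build_mysql_jdbc_url(host, port, database, options):
    """Return a MySQL JDBC URL with the given options as query parameters."""
    return f"jdbc:mysql://{host}:{port}/{database}?{urlencode(options)}"

# Hypothetical host and database, for illustration only.
url = build_mysql_jdbc_url("db.example.com", 3306, "source_db", OPTIONS)
print(url)
```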

Related

Optimize loading time in MySQL Database

I have a huge amount of data which is loaded from an ETL tool into the database. Sometimes the ETL tool generates some unusual data and puts it into a table. For simplicity, say I want to load 5 correct rows but end up with 10 in my database, so I detect the inconsistency.
To get the data into the state I want, I had to TRUNCATE the tables in the MySQL schema and INSERT the data from the ETL tool again under my control. That works, but reloading the data takes too much time.
I investigated this and found that deleting the data and inserting it again takes much more time than, for example, using INSERT ... ON DUPLICATE KEY UPDATE. So I don't need to delete all the data; I can just check and update it when necessary, which will save load time.
I want to use this query, but I am a little confused about the 5 extra wrong rows that are already sitting in my database. How can I remove them without deleting everything from my table before inserting?
As you mention, "Sometimes etl tool generates some unusual data and puts them inside a table".
You need to investigate your ETL code and correct it. It's not supposed to generate any data; an ETL tool only transforms your data according to its rules. Focus on the ETL code rather than the MySQL database.
To me that sounds like there’s a problem in the dataflow setup in your ETL tool. You don’t say what you are using, but I would go back over the select criteria and review which fields you are selecting and what your WHERE criteria are. Perhaps something in your WHERE clauses is causing the extra data.
As for the INSERT ... ON DUPLICATE KEY UPDATE syntax, take care if you have an AUTO_INCREMENT column in an InnoDB table, because in that case the INSERT increases the auto-increment value while an UPDATE does not. Also check that your table doesn’t have multiple unique indexes, because if the new row matches several existing rows, only one of them will be updated. (MySQL 5.7, see reference manual: https://dev.mysql.com/doc/refman/5.7/en/ .)
If you find that your ETL tools are not providing enough flexibility then you could investigate other options. Here is a good article comparing ETL tools.
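To make the upsert idea concrete, here is a minimal sketch of one possible way to reload without truncating: upsert every row of the corrected load, tag it with a load id, then delete whatever the load did not touch. This is an assumption, not something the answers above prescribe. It uses Python's built-in sqlite3 and SQLite's ON CONFLICT ... DO UPDATE as a stand-in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE; the items table and the load_id column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT, load_id INTEGER)"
)

def load(conn, rows, load_id):
    """Upsert rows tagged with load_id, then delete rows this load did not touch."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            # SQLite spelling of an upsert; MySQL would use ON DUPLICATE KEY UPDATE.
            "INSERT INTO items (id, value, load_id) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET "
            "value = excluded.value, load_id = excluded.load_id",
            [(row_id, value, load_id) for row_id, value in rows],
        )
        # Rows left over from earlier loads are the unwanted extras.
        conn.execute("DELETE FROM items WHERE load_id <> ?", (load_id,))

# First load: 10 rows arrive although only 5 were wanted.
load(conn, [(i, f"v{i}") for i in range(1, 11)], load_id=1)
# Corrected load: the 5 wanted rows are upserted, the other 5 are removed.
load(conn, [(i, f"fixed{i}") for i in range(1, 6)], load_id=2)

remaining = conn.execute("SELECT id, value FROM items ORDER BY id").fetchall()
print(remaining)
```

The trade-off is one extra column and one DELETE per load, in exchange for never emptying the table.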

How can I multiply records on mysql table quickly?

I have a project assignment that needs a large amount of data, so I decided to test MySQL query performance with big data. I want to multiply the rows of one table in the database. I've tried it before, but it took a very long time.
First I tried using INSERT INTO on the table itself, and it took a long time.
Second, I tried a different way: I used mysqlimport to import 1 GB of data, and it took about 1.5 hours.
So if I want to enlarge the MySQL table, do you have any suggestions for me?
Though this question should be flagged as "not constructive", I will still suggest something.
If your objective is "only to make the table large", as per your comment, why take all the trouble to insert duplicates or run mysqlimport? Instead, search for and download free sample large databases and play around with them:
https://launchpad.net/test-db/+download
http://dev.mysql.com/doc/index-other.html
http://www.ozerov.de/bigdump/
If a particular table structure is explicitly needed, run some DDL queries (ALTER TABLE) to shape the downloaded tables to your needs.

SSIS; row redirected to error even after inserting to DB

I have an SSIS package that inserts/updates rows in a database. First I use a Lookup to check whether the row has already been inserted into the DB; if yes, I update that row, else I insert it as a new row.
My problem is that when inserting, a row is inserted successfully but also redirected to error. How can both happen at the same time? It also happens only sometimes, not always, so it is very inconsistent. How can I track what caused the error? I used "redirect row" here to get the failed rows.
This happens only when the package is deployed on the server. Running it on my local machine using BIDS works fine.
Your OLE DB Destination is likely set to the default values.
A quick recap of what all these values mean:
Data access mode: You generally want Table or view - fast load or Table or view name variable - fast load as this will perform bulk inserts. The non-fast load choices result in singleton inserts which for any volume of data will have you questioning the sanity of those who told you SSIS can load data really fast.
Keep Identity: This is only needed if you want to explicitly provide an identity seed
Keep nulls: This specifies whether you should allow the defaults to fire
Table lock: Preemptively lock the table. Unless you're dealing with Hekaton tables (new SQL Server 2014 candy), touching a table will involve locks. Yes, even if you use the NOLOCK hint. Inserting data obviously results in locking so we can assure our ACID compliance. SQL Server will start with a small lock, either Row or Page level. If you cross a threshold of modifying data, SQL Server will escalate that lock to encapsulate the whole table. This is a performance optimization as it's easier to work if nobody else has their grubby little paws in the data. The penalty is that during this escalation, we might now have to wait for another process to finish so we can get exclusivity to the table. Had we gone big to begin with, we might have locked the table before the other process had begun.
Check constraints: Should we disable the checking of constraint values? Unless you have a post-import step to ensure the constraint is valid, don't uncheck this. Swiftly loaded data that is invalid for the domain is no good.
Rows per batch: This is a pass-through value to the INSERT BULK statement as the ROWS_PER_BATCH value.
Maximum insert commit size: The FastLoadMaxInsertCommitSize property specifies how many rows should be held in the transaction before committing. The 2005 default was 0 which meant everything gets committed or none of it does. The 2008+ default of 2 billion may be effectively the same, depending on your data volume.
So what
You have bad data somewhere in your insert. Something is causing the insert to fail. It might be the first row, last row, both or somewhere in between. The net effect is that the insert itself is rolled back. You designed your package to allow for the bad data to get routed to a flat file so a data steward can examine it, clean it up and get it re-inserted into the pipeline.
The key, then, is to find a commit size that gives the best balance between insert performance (bigger is better) and the number of good rows that get rejected along with the bad ones. For the sake of argument, let's use a commit size of 5003, because everyone likes prime numbers, and assume our data source supplies 10009 rows of data. Three rows in there will violate the integrity of the target table and will need to be examined by a data steward.
This is going to result in 3 total batches being sent to the destination. The result is one of the following scenarios:
Bad rows are the final 3 rows, resulting in only those 3 rows being sent to the text file and 10006 rows committed to the table in 2 batches.
Bad rows exist in only 1 of the full sets. This would result in 5006 rows being written to the table and 5003 rows sent to our file.
Bad rows are split amongst each commit set. This results in 0 rows written to the table and all the data in our file.
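The arithmetic behind these three scenarios can be checked with a small simulation. This is only a model of the behaviour described above (one bad row fails its whole batch), not SSIS code; the function name and the row positions chosen for the bad rows are made up for illustration.

```python
def simulate_commits(total_rows, commit_size, bad_rows):
    """Return (committed, redirected) when one bad row fails its whole batch."""
    committed = redirected = 0
    for start in range(0, total_rows, commit_size):
        end = min(start + commit_size, total_rows)
        if any(start <= row < end for row in bad_rows):
            redirected += end - start  # whole batch goes to the error output
        else:
            committed += end - start   # whole batch is committed
    return committed, redirected

TOTAL, COMMIT_SIZE = 10009, 5003

# Row numbers are 0-based positions in the source; the bad positions are arbitrary.
last_three = simulate_commits(TOTAL, COMMIT_SIZE, {10006, 10007, 10008})
one_full_set = simulate_commits(TOTAL, COMMIT_SIZE, {10, 20, 30})
every_set = simulate_commits(TOTAL, COMMIT_SIZE, {0, 5003, 10006})

print(last_three)    # (10006, 3)
print(one_full_set)  # (5006, 5003)
print(every_set)     # (0, 10009)
```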
I always felt Murphy was an optimist and the disk holding the error file would get corrupt but I digress.
What would be ideal is to whittle down the space bad data can exist in while maximizing the amount of good data inserted at a shot. In fact, there are a number of people who have written about it but I'm partial to the approach outlined in "Error Redirection with the OLE DB Destination".
We would perform an insert at our initial commit size of 5003 and successful rows will go as they will. Bad rows would go to a second OLE DB Destination, this time with a smaller commit size. There are differences of opinion on whether you should immediately go to singleton inserts here or add an intermediate bulk insert attempt at half your primary commit size. This is where you can evaluate your process and find the optimal solution for your data.
If data still fails the insert at the single row level, then you write that to your error file/log. This approach allows you to put as much good data into the target table as you can, by the most efficient mechanism possible, when you know you're going to have bad data.
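The cascade described above can be sketched as follows. Again this is a model rather than SSIS: insert_batch is a hypothetical stand-in for a bulk insert that fails as a unit, and the sizes (5003, then roughly half, then single rows) follow the example in the text.

```python
def insert_batch(batch, is_valid):
    """Stand-in for a bulk insert: succeeds only if every row in the batch is valid."""
    return all(is_valid(row) for row in batch)

def cascade_load(rows, commit_sizes, is_valid):
    """Try each commit size in turn; rows from failed batches fall to the next size."""
    loaded, pending = [], list(rows)
    for size in commit_sizes:
        failed = []
        for start in range(0, len(pending), size):
            batch = pending[start:start + size]
            if insert_batch(batch, is_valid):
                loaded.extend(batch)
            else:
                failed.extend(batch)  # redirect the whole batch to the next stage
        pending = failed
    return loaded, pending  # pending is what ends up in the error file

bad = {10, 6000, 10008}  # arbitrary positions of the three bad rows
loaded, errors = cascade_load(
    range(10009), (5003, 2501, 1), lambda row: row not in bad
)
print(len(loaded), errors)
```

With a final stage of single-row inserts, only the rows that are actually bad reach the error file, whatever their position in the source.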
Bad data addendum
Yet a final approach to inserting bad data is to not even try to insert it. If you know foreign keys must be satisfied, add a Lookup component to your data flow to ensure that only good values are presented to the insert. Same for NULLs. You're already checking your business key so duplicates shouldn't be an issue. If a column has a constraint that the Foo must begin with Pity, then check it in the data flow. Bad rows all get shunted off to a Derived Column Task that adds business friendly error messages and then they land at a Union All and then all the errors make it to the error file.
I've also written this logic where each check gets its own error column and I only split out the bad data prior to insertion. This prevents the business user from fixing one error in the source data only to learn that there's another error. Oh, and there's still another, try again. Whether this level of data cleansing is required is a conversation with your data supplier, data cleaner and possibly your boss (because they might want to know how much time you're going to have to spend making this solution bullet proof for the horrible data they keep flinging at you)
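A minimal sketch of the "each check gets its own error column" idea, in plain Python rather than SSIS components. The column names, the set of known customers and the messages are all hypothetical; only the rule that Foo must begin with Pity comes from the text.

```python
KNOWN_CUSTOMERS = {101, 102, 103}  # hypothetical stand-in for a Lookup on the parent table

def check_foo(row):
    ok = str(row.get("foo") or "").startswith("Pity")
    return None if ok else "Foo must begin with 'Pity'"

def check_customer(row):
    ok = row.get("customer_id") in KNOWN_CUSTOMERS
    return None if ok else "customer_id not found in customers"

def check_amount(row):
    return None if row.get("amount") is not None else "amount must not be NULL"

# One error column per check, so every problem with a row is reported at once.
CHECKS = {
    "foo_error": check_foo,
    "customer_error": check_customer,
    "amount_error": check_amount,
}

def split_rows(rows):
    """Return (good, bad); each bad row carries one message per failed check."""
    good, bad = [], []
    for row in rows:
        errors = {}
        for column, check in CHECKS.items():
            message = check(row)
            if message is not None:
                errors[column] = message
        if errors:
            bad.append({**row, **errors})
        else:
            good.append(row)
    return good, bad

rows = [
    {"foo": "Pity the fool", "customer_id": 101, "amount": 10.0},
    {"foo": "Nope", "customer_id": 999, "amount": None},
]
good, bad = split_rows(rows)
print(len(good), bad[0])
```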
References
Keep nulls
Error Redirection with the OLE DB Destination
Batch Sizes, Fast Load, Commit Size and the OLE DB Destination
Default value for OLE DB Destination FastLoadMaxInsertCommitSize in SQL Server 2008
I have noticed that if you check Table lock and also have an update, you will get deadlocks between the 2 flows in your data flow. So we do not check Table lock. The performance seems to be the same.
My finding might help those who visit here.
billinkc made a broad comment; I had gone all through that. Later, after digging into the system, the problem turned out to be something different.
My SSIS package has a Script Task within it to do some operations. It uses a folder called TEMP on the disk. The program that triggered this SSIS package was simultaneously using the same TEMP folder, and the file read/write exceptions were not handled.
This caused the Script Task to fail, resulting in a package failure error. Since the INSERT was carried out before the Script Task, the INSERT was successful. Later, when the script failed, it moved the rows to error too!
I tried catching these "file errors/exceptions" and it worked!

Convert Legacy Text Databases to SQL

At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1 GB of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100 MB.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for last modification time.
If the file is not modified, skip it. Otherwise:
Parse the data file, clean up any issues (very simple checks only), and spit out a tab-delimited file of the columns I need (some of the columns I just ignore).
TRUNCATE the table and import the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes, and skip unchanged rows. I.e., in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and then compare that to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and then store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever. This will leave unchanged rows untouched (and thus still cached) on MySQL.
Perform updates inside a transaction to avoid inconsistencies.
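The fingerprint idea can be sketched like this. A plain dict stands in for the cache (GDBM, memcached or a MySQL column would play the same role), and the assumption that the first 4 characters of each fixed-width line form the key is made up for the example. Detecting rows that were deleted from the file would need a separate pass and is not covered here.

```python
import hashlib

def changed_rows(lines, key_of, cache):
    """Yield (key, line) for rows whose SHA-1 differs from the cached fingerprint."""
    for line in lines:
        key = key_of(line)
        digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
        if cache.get(key) != digest:
            cache[key] = digest  # remember the new fingerprint
            yield key, line      # caller updates/inserts this row in MySQL

def key_of(line):
    """Hypothetical layout: the first 4 characters are the record id."""
    return line[:4]

cache = {}
first_run = list(changed_rows(["0001 WIDGET  9.99", "0002 GADGET 19.99"], key_of, cache))
second_run = list(changed_rows(["0001 WIDGET  9.99", "0002 GADGET 24.99"], key_of, cache))

print(len(first_run))  # 2: everything is new on the first run
print(second_run)      # [('0002', '0002 GADGET 24.99')]
```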
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table. You can even build in intelligence by checking for duplicate records, rather than truncating the entire table.
Offload the processing to another mysql server and replicate / transfer it over.
I agree with alex's tips. If you can, update only modified fields, and do mass updates with transactions and multiple inserts grouped together. An additional benefit of transactions is faster updates.
If you are concerned about downtime, instead of truncating the table, insert into a new table, then rename it.
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips such as:
- delayed inserts in MySQL improve performance
- caches can be optimized
- even if you do not have unique rows, you may (or may not) be able to md5 the rows
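The "insert into a new table, then rename it" tip can be sketched as below. The example uses Python's built-in sqlite3 so it runs anywhere; the two-column table layout is invented. In MySQL the two renames can be done in a single RENAME TABLE statement, which swaps the tables atomically, whereas this SQLite version issues them one after the other.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE legacy_sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO legacy_sales VALUES (1, 9.99)")

def reload_by_swap(conn, rows):
    """Load rows into a side table, then swap it in place of legacy_sales."""
    conn.execute("DROP TABLE IF EXISTS legacy_sales_new")
    conn.execute("DROP TABLE IF EXISTS legacy_sales_old")
    conn.execute("CREATE TABLE legacy_sales_new (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO legacy_sales_new VALUES (?, ?)", rows)
    # Readers keep using the old table until the renames below.
    conn.execute("ALTER TABLE legacy_sales RENAME TO legacy_sales_old")
    conn.execute("ALTER TABLE legacy_sales_new RENAME TO legacy_sales")
    conn.execute("DROP TABLE legacy_sales_old")

reload_by_swap(conn, [(1, 10.50), (2, 20.00), (3, 30.25)])
current = conn.execute("SELECT id, amount FROM legacy_sales ORDER BY id").fetchall()
print(current)
```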

MySQL insert query cannot finish

I am inserting part of a large table into a new MyISAM table. I tried both the command line and phpMyAdmin; both take a long time. In the MySQL data folder, the table file actually has gigabytes of data, but phpMyAdmin shows no records. Then I "check" the table, and it takes forever...
What is wrong here? Should I change to InnoDB?
Do you have indices defined on your table? If you're most interested in inserting a lot of data quickly, you could consider dropping the indices, doing the insert, and then re-adding the indices. It won't be any faster overall (in fact the manual intervention would likely make the overall operation slower), but it would give you more direct visibility into how long the data insertion is taking versus the indexing that follows.
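To illustrate separating insert time from indexing time, here is a small sketch using Python's built-in sqlite3 in place of MySQL. The table, index name and row count are invented, and the timings it prints will differ from machine to machine and say nothing about MyISAM specifically.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_copy (id INTEGER, category TEXT)")
conn.execute("CREATE INDEX idx_category ON big_copy (category)")

rows = [(i, f"cat{i % 100}") for i in range(50_000)]

# Drop the index so the insert is timed on its own.
conn.execute("DROP INDEX idx_category")

t0 = time.perf_counter()
with conn:  # single transaction for the whole insert
    conn.executemany("INSERT INTO big_copy VALUES (?, ?)", rows)
t1 = time.perf_counter()

# Rebuild the index and time that step separately.
conn.execute("CREATE INDEX idx_category ON big_copy (category)")
t2 = time.perf_counter()

row_count = conn.execute("SELECT COUNT(*) FROM big_copy").fetchone()[0]
print(f"rows: {row_count}, insert: {t1 - t0:.3f}s, index build: {t2 - t1:.3f}s")
```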