NiFi: insert large CSV into SQL database

I'm trying to insert large quantities of large CSV files into a database.
I'm doing this with the PutDatabaseRecord processor, which makes this process really fast and easy.
The problem is that I don't know how to handle failures properly, e.g. if a value doesn't match a column's datatype or if a row is a duplicate.
If such a thing occurs, the PutDatabaseRecord processor discards all the records of the batch it just converted from the CSV file. So if one record out of 2,000,000 fails, none of the 2,000,000 records make it into the db.
I managed to fix one source of problems by cleaning the CSV data beforehand, but I still run into the issue of duplicate rows.
I attempted to fix this by splitting the CSV into single rows inside NiFi before passing them to the PutDatabaseRecord processor, which is really slow and often results in an OOM error.
Can someone suggest an alternative way of inserting large CSVs into a SQL database?

You should be able to use ValidateCsv or ValidateRecord to handle the datatype checks and other validation. Detecting duplicates in huge files is difficult since you have to keep track of everything you've seen, which can take a lot of memory. If you have a single column that can be used for detecting duplicates, try ValidateCsv with a Unique constraint on that column, and set the Validation Strategy to line-by-line. That should keep all valid rows together so you can still use PutDatabaseRecord afterward.
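As a rough sketch only (the Schema value below assumes a three-column CSV whose first column should be unique; the property names are those of the stock ValidateCsv processor, so adjust the cell-processor list to your actual columns):
Schema: Unique(), NotNull(), NotNull()
Header: true
Validation strategy: Line by line validation
With the line-by-line strategy, rows that fail validation are routed to the invalid relationship while the remaining rows stay together in the valid flow file.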
Alternatively, you can split the CSV into single rows (use at least two SplitText or SplitRecord processors: one to split the flow file into smaller chunks, followed by a second that splits the smaller chunks into individual lines) and use DetectDuplicate to remove duplicate rows. At that point you'd likely want to use something like MergeContent or MergeRecord to bundle rows back up for more efficient use by PutDatabaseRecord.

Related

Can I query a MySQL database while data is being loaded from a CSV file?

It is taking too long, and I don't have a way of knowing if it is going to be loaded as expected after it finishes. Can I query the table to at least make sure that the data is being loaded as expected? Is there a way of seeing some rows while the load is working?
If we assume you are using the LOAD DATA INFILE statement to do the bulk load, then the answer is no: the bulk load executes atomically. This means no other session can see the result of the bulk load until it is complete. If it fails for some reason, the entire dataset is rolled back.
If you want to see incremental progress, you would need to use some client that reads the CSV file and inserts individual rows (or at least subsets of rows) and commits the inserts at intervals.
Or you could use LOAD DATA INFILE if you split your CSV input file into multiple smaller files, so you can load them in batches. If you just want to test if the loading is done properly, you should start with a much smaller file and load that.
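For example, each smaller chunk could then be loaded with its own statement and verified before moving on (table and file names here are made up; adjust the delimiters to your CSV):
LOAD DATA INFILE '/tmp/chunk_0001.csv'
INTO TABLE sales
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;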

Pentaho Table input to table output in batches

My input table in MySQL has 20 Million records and the target table in Oracle is empty. I need to load the entire table from MySQL into Oracle. I am simply using a Table Input and Table Output step.
My intention is not to lock the source table for a long time whilst reading.
Is there a problem with the load (Number of records) I am trying to achieve?
I can see a Use batch update for inserts option in the Table Output step. I could not see anything similar in the Table Input. Is there a way to perform batch processing in Pentaho?
Don't worry, 20 million records is a small number for PDI, and you will not lock the table by reading from it. That is why the bulk-load steps exist for output tables, not input tables.
A common beginner trap, however, is the Truncate table option on the output step. If you run the output step twice (inadvertently or for parallel processing), each one will lock the other. Forever.
To speed things up: you may use the Lazy conversion check box on the input, so that the data remains in byte format until it is used. But I am not sure you gain much on a simple table input/output, and if something goes wrong with Dates or Blobs when writing the output, the error message will be quite cryptic.
You can also increase the speed of the output by increasing the commit size (at worst a few trials in Oracle), and by increasing the Number of rows in rowset, which will increase the number of rows read by the table input. To do so, right-click anywhere, then Properties/Miscellaneous.
Something I really advise is to increase the JVM memory size. Use an editor (Notepad or better) to edit the file named spoon.bat. Around line 94-96 you'll find a line containing something like "-Xmx256K". Change it to "-Xmx4096M" (where 4096 is half the size of your machine's RAM in MB).
"Batch processing" can mean many things. One of them is Make the transformation database transactional, which you can enable with the check box just below the above-mentioned Number of rows in rowset (and buggily spelled as Make the transformation database in the latest PDI version). With that box checked, if something goes wrong, the state of the database is rolled back as if the transformation had never been executed. But I don't advise doing this in your case.
In addition to @AlainD's solution, there are a couple of options:
- Tune MySQL for better performance on Inserts
- Use the MySQL Bulk loader step in PDI
- Write SQL statements to file with PDI and read them with mysql-binary
Speed can also be boosted by using some simple JDBC connection settings:
useServerPrepStmts= false
rewriteBatchedStatements= true
useCompression= true
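In a Connector/J JDBC URL these settings would look something like the line below (host, port, and database name are placeholders):
jdbc:mysql://dbhost:3306/mydb?useServerPrepStmts=false&rewriteBatchedStatements=true&useCompression=true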

Optimize loading time in MySQL Database

I have a huge amount of data which is loaded from an ETL tool into the database. Sometimes the ETL tool generates some unusual data and puts it into a table; say, for simplicity, I want to load 5 correct rows but end up with 10 in my database, which is how I detect the inconsistency.
To bring the data back to the state I want, I had to TRUNCATE the tables in the MySQL schema and INSERT the data from the ETL tool again under my control. In this case everything looks fine, but it takes too much time to reload the data.
I investigated this issue and found out that DELETing the data and INSERTing it again takes much more time than, for example, using INSERT ... ON DUPLICATE KEY UPDATE. So I don't need to delete all the data; I can just check and update it when necessary, which will save load time.
I want to use this query, but I am a little bit confused because of those 5 extra wrong rows which are already sitting in my database. How can I remove them without deleting everything from my table before inserting?
As you mention, "sometimes the ETL tool generates some unusual data and puts it into a table": you need to investigate your ETL code and correct it. It is not supposed to generate any data; an ETL tool only transforms your data according to the rules you define. Focus on the ETL code rather than the MySQL database.
To me that sounds like there's a problem in the dataflow setup in your ETL tool. You don't say what you are using, but I would go back over the select criteria and review what fields you are selecting and what your WHERE criteria are. Perhaps what is in your WHERE statements is causing the extra data.
As for the INSERT ... ON DUPLICATE KEY UPDATE syntax, make sure you don't have an AUTO_INCREMENT column in an InnoDB table, because in that case only the INSERT will increase the auto-increment value. And check that your table doesn't have multiple unique indexes, because if your WHERE a=xx matches several rows then only one will be updated. (MySQL 5.7; see the reference manual: https://dev.mysql.com/doc/refman/5.7/en/.)
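For reference, a minimal sketch of the statement (table and column names are hypothetical; the duplicate is detected through the table's primary key or a unique index):
INSERT INTO target_table (id, col_a, col_b)
VALUES (1, 'x', 'y')
ON DUPLICATE KEY UPDATE col_a = VALUES(col_a), col_b = VALUES(col_b);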
If you find that your ETL tool is not providing enough flexibility, you could investigate other options and compare ETL tools.

Convert Legacy Text Databases to SQL

At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1gb of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100mb.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file's last modification time.
If the file has not been modified, skips it. Otherwise:
Parses the data file, cleans up any issues (very simple checks only), and spits out a tab-delimited file of the columns I need (some of the columns I just ignore).
TRUNCATEs the table and imports the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes, and skip unchanged rows. That is, in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and then compare it to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever (see the sketch after these ideas). This leaves unchanged rows untouched (and thus still cached) on MySQL.
Perform updates inside a transaction to avoid inconsistencies.
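A sketch of the "fingerprint field in your MySQL tables" variant (column and table names are made up; the placeholders stand for the parsed values):
-- hypothetical extra column: row_sha1 CHAR(40) on legacy_sales
SELECT id, row_sha1 FROM legacy_sales;   -- load the fingerprint cache once per run
-- then, only for rows whose id is new or whose hash differs:
INSERT INTO legacy_sales (id, col_a, col_b, row_sha1)
VALUES (?, ?, ?, ?)
ON DUPLICATE KEY UPDATE col_a = VALUES(col_a), col_b = VALUES(col_b), row_sha1 = VALUES(row_sha1);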
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table. You can even build in intelligence by checking for duplicate records, rather than truncating the entire table.
Offload the processing to another MySQL server and replicate/transfer it over.
I agree with alex's tips. If you can, update only modified fields and do mass updates with transactions and multiple grouped inserts. An additional benefit of transactions is faster updates.
If you are concerned about downtime, instead of truncating the table, insert into a new table and then rename it (see the sketch at the end of this answer).
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips, such as:
- delayed inserts in MySQL improve performance
- caches can be optimized
- even if you do not have unique rows, you may (or may not) be able to MD5 the rows
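The insert-into-a-new-table-then-rename idea above could look roughly like this (table names are only illustrative); RENAME TABLE swaps both tables in one atomic step, so readers never see a half-loaded table:
CREATE TABLE legacy_sales_new LIKE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales_new;
RENAME TABLE legacy_sales TO legacy_sales_old, legacy_sales_new TO legacy_sales;
DROP TABLE legacy_sales_old;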

Bulk insert of related MySQL tables from bash

I regularly need to upload quite a lot of data to a MySQL database from CSV files. I used to do this by simply executing LOAD DATA INFILE from bash scripts. Now, however, the data is to be spread over several tables and relations have to be kept. What are the general strategies in such cases?
Let's assume an initially simple task: a one-to-many relation, two tables.
I consider something like:
getting the maximal identifier for table 1
manually applying identifiers to the CSV file
splitting the file with two target tables in mind
inserting both tables
Is this an optimal solution? (In the real case, for example, I'm going to have lots of many-to-many relations to update this way.)
Can I lock table 1 from bash for the duration of the whole process? Or do I have to use some intermediary tool like Perl or Python to keep everything in one session?
There are various conflicting requirements expressed in your question. This answer concentrates on the “keep lock” aspect of it.
In order to maintain a table lock for the whole operation, you'll have to maintain a single connection to the SQL server. One way would be to pass everything as multi-line, multi-command input to a single invocation of the mysql command-line client. Basically like this:
{ echo "LOCK TABLES Table1 WRITE;"
  for i in "${infiles[@]}"; do
    echo "LOAD DATA LOCAL INFILE '${i}' INTO TABLE Table1;"
  done
  echo "UNLOCK TABLES;"
} | mysql
That would work as long as you can generate all the required statements without asking questions from the database (like maximal identifier) while the lock is kept.
In order to mix read operations (like asking for a maximal value) and write operations (like loading the content of some files), you'll need bidirectional communication with the server. Achieving this through bash is very tricky, so I'd advise against it. Even if you don't need to ask questions, the unidirectional connection provided by a bash pipe is a source of danger: if anything goes wrong on the mysql side, bash won't notice and will issue the next command anyway. You might end up committing inconsistent data.
For these reasons, I'd rather suggest some scripting language for which mysql bindings are available, like the Perl or Python options you mentioned. Reading CSV files in those languages is easy, so you might do all of the following in a single script:
lock tables
start transaction
read input csv files
ask questions like max id
adjust input data to match table layout
insert data into tables
if no errors occurred, commit transaction
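Whatever language you pick, the SQL it issues would follow roughly this sequence (the table names parent and child are placeholders; autocommit is disabled first so that LOCK TABLES and the explicit transaction can be combined):
SET autocommit = 0;
LOCK TABLES parent WRITE, child WRITE;
SELECT MAX(id) FROM parent;   -- "ask questions like max id"
INSERT INTO parent (id, name) VALUES (101, 'a'), (102, 'b');
INSERT INTO child (parent_id, value) VALUES (101, 'x'), (102, 'y');
COMMIT;
UNLOCK TABLES;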