I am using NiFi 0.6.1 with a combination of the GetFile + SplitText + ReplaceText processors to split a 30 MB CSV file (300,000 rows).
GetFile passes the 30 MB file to SplitText very quickly.
SplitText + ReplaceText then take 25 minutes to split the data and convert it to JSON.
Just 30 MB of data takes 25 minutes to get from CSV into SQL Server.
The conversion seems to happen byte by byte.
I have tried the Concurrent Tasks option on the processors. It speeds things up a little, but the flow still takes a long time, and CPU usage hits 100%.
How can I get the CSV data into SQL Server faster?
Your incoming CSV file has ~300,000 rows? You might try using multiple SplitText processors to break that down in stages. One big split can be very taxing on system resources, but dividing it into multiple stages can smooth out your flow. The typically recommended maximum is between 1,000 and 10,000 rows per split.
See this answer for more details.
You mention splitting the data into JSON, but you're using SplitText and ReplaceText. What does your incoming data look like? Are you trying to convert to JSON to use ConvertJSONtoSQL?
If you have CSV incoming, and you know the columns, SplitText should pretty quickly split the lines, and ReplaceText can be used to create an INSERT statement for use by PutSQL.
Alternatively, as @Tomalak mentioned, you could put the CSV file somewhere SQL Server can access it, then use PutSQL to issue a BULK INSERT statement.
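For illustration, here is a minimal sketch of the kind of BULK INSERT statement PutSQL could be configured to issue once the file is on a path SQL Server can read; the server, database, table, and share path are placeholder assumptions, and pyodbc is used only to show the statement in runnable form:

```python
# Hedged sketch: issue a BULK INSERT against a file SQL Server can reach.
# Server, database, table, and file path are placeholders, not values from the question.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;",
    autocommit=True,
)

bulk_sql = r"""
BULK INSERT dbo.MyTable
FROM '\\fileshare\incoming\data.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);
"""
conn.cursor().execute(bulk_sql)  # the same statement PutSQL would send
conn.close()
```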
If neither of these is sufficient, you could use ExecuteScript to perform the split, column parsing, and translation to SQL statement(s).
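A rough sketch of that last option, assuming the Jython engine in ExecuteScript; the table and column names (dbo.MyTable, col1-col3) and the naive comma split are assumptions, so treat it as a starting point rather than a drop-in script:

```python
# ExecuteScript (Jython) sketch: turn each CSV line of the flow file into an INSERT statement.
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class CsvToInserts(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        statements = []
        for line in text.splitlines():
            if not line.strip():
                continue
            cols = [c.replace("'", "''") for c in line.split(',')]  # naive CSV split, no quoted fields
            statements.append(
                "INSERT INTO dbo.MyTable (col1, col2, col3) VALUES ('%s', '%s', '%s');"
                % (cols[0], cols[1], cols[2]))
        outputStream.write(bytearray('\n'.join(statements).encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, CsvToInserts())
    session.transfer(flowFile, REL_SUCCESS)
```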
I'm trying to insert large quantites of large CSV files into a database.
I'm doing this with the PutDatabaseRecord processor, which makes this process really fast and easy.
The problem is that I don't know how to handle failures properly, e.g. if a value doesn't match a column's datatype or if a row is a duplicate.
If such a thing occurs, the PutDatabaseRecord processor discards all the records of the batch it just converted from the CSV file. So if one record out of 2,000,000 fails, none of the 2,000,000 records makes it into the database.
I managed to fix one problem source through cleaning the CSV data beforehand, but I still run into the issue of duplicate rows.
I attempted to fix this by splitting the CSV into single rows inside NiFi before passing them to the PutDatabaseRecord processor, but that is really slow and often results in an OOM error.
Can someone suggest an alternative way of inserting large CSVs in a SQL database?
You should be able to use ValidateCsv or ValidateRecord to do the datatype stuff and other validation. Detecting duplicates in huge files is difficult since you have to keep track of everything you've seen, which can take a lot of memory. If you have a single column that could be used for detecting duplicates, try ValidateCsv with a Unique constraint on that column, and set the Validation Strategy to line-by-line. That should keep all valid rows together so you can still use PutDatabaseRecord afterward.
Alternatively, you can split the CSV into single rows (use at least two SplitText or SplitRecord processors, one to split the flow file into smaller chunks, followed by a second that splits the smaller chunks into individual lines) and use DetectDuplicate to remove duplicate rows. At that point you'd likely want to use something like MergeContent or MergeRecord to bundle rows back up for more efficient use by PutDatabaseRecord.
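Since you already clean the CSV beforehand, one more option is to drop duplicates in that same pre-processing step instead of inside NiFi. A minimal sketch, assuming a single key column named "id" (adjust to your schema); note the seen set grows with the number of distinct keys, so this still costs memory on very large files:

```python
# Pre-clean step: drop duplicate rows by a key column before handing the file to PutDatabaseRecord.
# File names and the "id" key column are assumptions.
import csv

seen = set()
with open("input.csv", newline="") as src, open("deduped.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        key = row["id"]              # assumed unique-key column
        if key not in seen:          # keep only the first occurrence
            seen.add(key)
            writer.writerow(row)
```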
I am working with Google BigQuery for the first time on a client project and have created packages in SSIS to insert data into tables (an odd combination but one required by my client), using an SSIS plugin (CData).
I am looking to insert around 100k rows into a BigQuery table, however, when I look to do further update queries on this table, these cannot be performed because the data is still in the buffer. How does one know how long this will take in BigQuery and are there ways to speed up the process?
It doesn't matter if the data is still in the buffer. If you query the table, the data in the buffer will be included too. Just one of the many awesome things about BigQuery.
https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert
A record that arrives in the streaming buffer will remain there for some minimum amount of time (minutes). During this period while the record is buffered, it's possible that you may issue a query that will reference the table. The Instant Availability Reader allows workers from the query engine to read the buffered records prior to being committed to managed storage.
data is still in the buffer. How does one know how long this will take in BigQuery?
Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table.
Data can take up to 90 minutes to become available for copy and export operations. See more in the documentation.
Meanwhile, keep in mind that tables that have been written to recently via BigQuery streaming (tabledata.insertAll) cannot be modified using UPDATE or DELETE statements. So, as stated above, plan for up to 90 minutes.
are there ways to speed up the process?
The only way in your case is to load the data instead of streaming it. As I understand your case, the data is in MS SQL Server, so you could make your SSIS package batch-aware and load it batch by batch through Cloud Storage.
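For example, a minimal load-job sketch with the google-cloud-bigquery Python client (bucket, project, dataset, and table names are placeholders); the SSIS package would just need to drop the batch files into Cloud Storage first:

```python
# Batch load from Cloud Storage instead of streaming inserts.
# Bucket, project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                                  # skip the CSV header row
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/batch_*.csv",                 # files written by the SSIS package
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
```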
In tests ingesting files directly from GCS into BigQuery, we get much better performance than with streaming inserts. However, the performance also fluctuates much more.
For example, we tested loading a large CSV into BQ (10M rows, 2 GB): it loaded in 2.275 minutes the first time but took ~8 minutes the second time. Why is there such a fluctuation in the import times?
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
Update: this turned out to come down to a threshold value:
It turned out to depend on the MaxError property. The time the CSV imported in 2 minutes was when MaxError was too low and some errors (like too-long fields) prevented the CSV file from being parsed fully. I have since raised MaxError to 1000.
I have tried a couple of times, and it takes 7-8 minutes to complete parsing with this threshold set.
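If you load through the BigQuery API rather than the CData plugin, the comparable knob is maxBadRecords on the load job; a small sketch with the Python client using the same threshold of 1000 mentioned above (the config would be passed to a load job as in the earlier answer's example):

```python
# Load-job config with an error threshold comparable to the plugin's MaxError property.
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    max_bad_records=1000,   # tolerate up to 1000 unparsable rows before the job fails
)
```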
Load is basically a query on federated data sources, with the results saved to the destination table. Performance of a query is dependent on the load of the backend system. Felipe explains this well in BigQuery Performance.
Hi everyone, I'm trying to extract a lot of records from many joined tables and views using SSIS (OLE DB Source), but it takes a huge amount of time! The problem is with the query: when I ran it directly in SQL Server, it took more than an hour! Here's my SSIS package design.
I thought of parallelizing the extraction using two OLE DB Sources and a Merge Join, but that isn't recommended, and it actually takes more time! Is there any way to speed this up, please?
Writing the T-SQL query with all the joins in the OLE DB Source will always be faster than using separate sources and then a Merge Join, IMHO. The reason is that SSIS has a memory-oriented architecture: it has to bring all the data from the N different tables into its buffers and then combine it with the Merge Join. Moreover, Merge Join is an asynchronous (semi-blocking) component, so it cannot use the same input buffer for its output. A new buffer is created, and you may run out of memory if a large number of rows are extracted from the tables.
Having said that, there are a few ways you can improve extraction performance with the OLE DB Source:
1. Tune your SQL query. Avoid using SELECT *.
2. Check your network bandwidth. You cannot get faster throughput than your bandwidth supports.
3. All source adapters are asynchronous. The speed of an SSIS source is not about how fast your query runs; it's about how fast the data is retrieved.
As others have suggested above, you should show us the query and the time it takes to retrieve the data; otherwise, these are just a few general optimization techniques that may make the extraction faster.
Thank you for posting a screen shot of your data flow. I doubt whether the slowness you encounter is truly the fault of the OLE DB Source component.
Instead, you have three asynchronous components that result in two full blocks of your data flow and one partial block (AGG, SRT, MRJ). That first aggregate will have to wait for all 500k rows to arrive before it can finish the aggregation and then pass the result along to the sort.
These transformations also result in fragmented memory. Normally, a memory buffer is filled with data and visits each component in a data flow. Any changes happen directly to that address space, and the engine can parallelize operations if it can determine that step 2 is modifying field X and step 3 is modifying field Y. The async components cause data to be copied from one space to another. This is a double slowdown. The first is the physical act of copying data from address space 0x01 to 0xFA or something. The second is that it reduces the available amount of memory for the dtexec process. No longer can SSIS play with all N gigs of memory. Instead, you'll have quartered your memory, and after each async component is done, that memory partition is just left there until the data flow completes.
If you want this run better, you'll need to fix your query. It may result in your aggregated data being materialized into a staging table or all in one big honkin' query.
Open a new question and provide insight into the data structures, indexes, data volumes, the query itself and preferably the query plan - estimated or actual. If you need help identifying these things, there are plenty of helpful folks here that can help you through the process.
I have an SSIS package which I run via a SQL Agent job to bulk copy data from one database to another. The destination is our integration server, where we have enough space for the database. But when I run this job (i.e., the package), it creates a huge number of temp files in the LocalSettings/Temp folder: for a 1 GB MDF file it creates some 20 GB of temp files. I created this package manually and did not use the Import/Export Wizard. Can anyone help me avoid these huge temp files during execution? If any further details are needed, please ask.
Note: many have said this happens if you create a package using the Import/Export Wizard and set 'Optimize for many tables' to true. But in this package I query only one table, and I created it manually without the Import/Export Wizard.
Why is the package creating temp files?
SSIS is an in-memory ETL solution, except when it can't keep everything in memory and begins swapping to disk.
Why would restructuring the package, as @Jeff Hornby suggested, help?
Fully and partially blocking transformations force memory copies in your data flow. Assume you have 10 buckets carrying 1 MB of data each. When you use a blocking transformation, as those buckets arrive at the transformation, the data has to be copied from one memory location to another. You've now doubled your package's total memory consumption, as you have 10 MB of data used before the Union All transformation and then another 10 MB after it.
Only use columns that you need. If a column is not in your destination, don't add it to the data flow. Use the database to perform sorts and merges. Cast your data to the appropriate types before it ever hits the data flow.
What else could be causing temp file usage?
Lookup transformations. I've seen people crush their ETL server when they use SELECT * FROM dbo.BillionRowTable when all they needed was one or two columns for the current time period. The default behaviour of a lookup operation is to execute that source query and cache the results in memory. For large tables, wide and/or deep, this can make it look like your data flow isn't even running as SSIS is busy streaming and caching all of that data as part of the pre-execute phase.
Binary/LOB data. Have an (n)varchar(max)/varbinary(max) or classic BLOB data type in your source table? Sorry, that's not going to be in memory. Instead, the data flow is going to carry a pointer along and write a file out for each one of those objects.
Too much parallel processing. SSIS is awesome in that you get free parallelization of your processing. Except you can have too much of a good thing. If you have 20 data flows all floating in space with no precedence between them, the Integration Services engine may try to run all of them at once. Add a precedence constraint between them, even if it's just on completion (on success/on failure), to force some serialization of operations. Inside a data flow, you can introduce the same challenge by having unrelated operations going on. My rule of thumb is that starting at any source or destination, I should be able to reach all the other sources/destinations.
What else can I do?
Examine what else is using memory on the box. Have you set a sane (non-default) maximum memory value for SQL Server? SSIS likes RAM like a fat kid loves cake, so you need to balance the memory needs of SSIS against the database itself; they have completely separate memory spaces.
Each data flow has the ability to set the BufferTempStoragePath and BlobTempStoragePath properties. Take advantage of this and put them on a drive with sufficient storage.
Finally, add more RAM. If you can't make the package better by doing the above, throw more hardware at it and be done.
If you're getting that many temp files, then you probably have a lot of blocking transforms in your data flow. Try to eliminate the following types of transformations: Aggregate, Fuzzy Grouping, Fuzzy Lookup, Row Sampling, Sort, Term Extraction. Also, partially blocking transformations can create the same problems, but not on the same scale: Data Mining Query, Merge, Merge Join, Pivot, Term Lookup, Union All, Unpivot. You might want to minimize these transformations.
Probably the problem is a Sort transformation somewhere in your data flow (this is the most common cause). You might be able to eliminate it by using an ORDER BY clause in your SQL statement. Just remember to set the IsSorted property on the data source's output.