Camel large CSV file processing Issue - csv

I am trying to process a large CSV file of approximately 1 million records. After reading the rows (line by line or in chunks), I need to push them to camel-flatpack to create a map of field names and their values.
My requirement is to feed all the CSV records to a flatpack config and generate a java.util.Map out of it.
There have been several posts on Stack Overflow that resolve this with the splitter. My process runs fast for almost 35000 records, but thereafter it slows down.
I even tried adding a throttler, but it still doesn't work: I get a GC out-of-memory error. I also increased JAVA_MIN_MEM, JAVA_MAX_MEM, JAVA_PERM_MEM and JAVA_MAX_PERM_MEM, but the result is the same. The Hawtio console shows that heap memory usage is above 95% after about 5-6 minutes.
Here is my code snippet:
<route id="poller-route">
<from uri="file://temp/output?noop=true&amp;maxMessagesPerPoll=10&amp;delay=5000"/>
<split streaming="true" stopOnException="false">
<tokenize token="\n" />
<to uri="flatpack:delim:flatpackConfig/flatPackConfig.pzmap.xml?ignoreFirstRecord=false"/>
</split>
</route>
<route id="output-route">
<from uri="flatpack:delim:flatpackConfig/flatPackConfig.pzmap.xml?ignoreFirstRecord=false"/>
<convertBodyTo type="java.util.Map"/>
<to uri="mock:result"/>
</route>

One potential problem is that when you create hash maps and continuously add data to them, the underlying table has to be rebuilt as it grows. For example, if I have a hash table of size 3 and insert 0, 1, 2, 3 into it, assuming my hash function is mod 3, then 3 would be assigned to slot zero, which is already occupied. So I would either need to store overflows or create a new, larger table.
I believe this is roughly how Java implements its HashMap, so you could try setting the HashMap's initial capacity to the number of records you expect.
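The presizing idea can be sketched in plain Java as follows. This is a minimal illustration, not the flatpack or Camel code: the class name, capacityFor and toRecordMap are hypothetical helpers, and the calculation assumes HashMap's default load factor of 0.75. Whether presizing actually relieves the memory pressure in the route above is not something this sketch can show.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMapSketch {

    // Smallest initial capacity that lets a HashMap hold `expectedEntries`
    // without resizing, assuming the default load factor of 0.75.
    public static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    // Hypothetical helper: builds a field-name -> value map for one CSV record.
    public static Map<String, String> toRecordMap(String[] fieldNames, String[] values) {
        Map<String, String> record = new HashMap<>(capacityFor(fieldNames.length));
        for (int i = 0; i < fieldNames.length; i++) {
            record.put(fieldNames[i], values[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        String[] names = {"id", "name", "city"};
        String[] values = {"1", "Ann", "Oslo"};
        System.out.println(toRecordMap(names, values));
    }
}
```

If the number of entries is known up front, passing the computed capacity to the constructor avoids the intermediate rehashes described above.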

Related

Custom SSIS workflow task

I have a ton of containers that all follow this same basic premise:
When I pull data from a remote database I first blank out a collector table, copy the data from the remote DB to the collector, count the rows in the collector, and if there are enough rows then I merge into the real table. If not, I send an email with an error message.
Instead of repeating this over and over I would like to make a custom component. I think this is just a filter component I would make, but what I'm not really sure of is how to replicate the Data Flow Task piece. Are there any good examples somebody could point me to, or even just let me know what I want to do isn't possible?
When I see problems like this, Biml tends to offer the lowest barrier to creating a simple, repeatable solution. Biml is free: all it costs you is a registration email, and then you install BimlExpress into whatever version of Visual Studio/SSDT you are working with.
I assume that I'm going to collect the data from AdventureWorks2014 Sales.Currency table and transport it to a table in tempdb called dbo.SalesCurrency.
I defined it as
CREATE TABLE dbo.SalesCurrency
(
CurrencyCode nchar(3) NOT NULL
, Name nvarchar(50) NOT NULL
, ModifiedDate datetime NOT NULL
);
Given that, let's look at some Biml concepts. Biml is an XML based dialect that describes the business intelligence artifacts (and then some). If you ever did classic ASP development with the mix of scripting and tags, it's a similar concept but much nicer due to the .NET integration.
<# #> this is a multi-line block
<#= #> is a single line expression
Great, how do I use it? Assuming you've installed BimlExpress, open an SSIS project and right click on the Project section and select Add New Biml File. Do that twice and we'll rename the second one. The first one is a driver, the second one is the worker.
Brains biml
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Connections>
<OleDbConnection Name="Source" ConnectionString="Data Source=localhost\dev2017;Initial Catalog=AdventureWorks2014;Provider=SQLNCLI11.0;Integrated Security=SSPI;" />
<OleDbConnection Name="Target" ConnectionString="Data Source=localhost\dev2017;Initial Catalog=tempdb;Provider=SQLNCLI11.0;Integrated Security=SSPI;" />
</Connections>
<#
string sourceQuery = "SELECT * FROM Sales.Currency;";
string targetSchemaTable = "[dbo].[SalesCurrency]";
string templateName = "so_56050574_include.biml";
dynamic customOutput;
#>
<Packages>
<#= CallBimlScriptWithOutput(templateName, out customOutput, sourceQuery, targetSchemaTable) #>
</Packages>
</Biml>
The first line is just the XML namespace.
In the next block, the Connections collection, I define my Source and Target connections. I'm very creative and named them Source and Target.
The next lines look a lot like C# because they are. I define my source query, fully qualified target table name, square brackets included and the name of my template file. The final variable, customOutput isn't used in here but it's a bag that allows me to pass information back from the template file - namely the name of the SSIS package it built.
I then define a Packages collection and make a single package. The package I make is defined by whatever I send to CallBimlScriptWithOutput and I then use the variables I just defined.
It looks complex but it's not. Why I like this approach is that instead of hard coding these values into my driver program, it allows me to take a metadata driven approach to development. I could look these values up from a spreadsheet, Sharepoint List, webservice, whatever I feel like (or my client offers as a repository).
Worker biml
I name this file so_56050574_include.biml and while there's plenty of text in there, it's fairly straightforward.
The first line helps the Intellisense during the Biml design experience.
The next two lines specify that these variables are going to be passed in - like a function call. I'll be able to use them like a .NET variable within the scope of this file.
The next few lines are a little funky but SSIS doesn't like duplicated names and it also doesn't like "bad" characters in names. I specify the package name will be Populate Collector and then I make the target table safe for SSIS. All the way at the bottom of the file, you'll see I have made a tiny method called MakeSsisSafeName which I use to sanitize the package name.
I create a Package and give it a good name. That Package has a Container. Within the Container, I create a handful of SSIS Variables that I'll need to do my work. That Container has tasks of Execute SQL Task -> Data Flow Task -> Execute SQL Task -> Execute SQL Task --> Send Mail Task
<#@ template designerbimlpath="/Biml/Packages" #>
<#@ property name="SourceQuery" type="string" #>
<#@ property name="TargetSchemaTable" type="string" #>
<#
string packageName = string.Format("Populate Collector {0}", MakeSsisSafeName(TargetSchemaTable));
CustomOutput.PackageName = packageName;
#>
<Package Name="<#= packageName #>" ConstraintMode="Linear">
<Tasks>
<Container Name="SEQC Collector" ConstraintMode="Parallel">
<Variables>
<Variable Name="RowCount" DataType="Int64">0</Variable>
<Variable Name="QueryEmpty" DataType="String">TRUNCATE TABLE <#=TargetSchemaTable#></Variable>
<Variable Name="QueryCount" DataType="String">SET NOCOUNT ON; SELECT COUNT_BIG(1) AS rc FROM <#=TargetSchemaTable#></Variable>
<Variable Name="QuerySource" DataType="String"><#=SourceQuery#></Variable>
<Variable Name="TargetSchemaTable" DataType="String"><#=TargetSchemaTable #></Variable>
</Variables>
<Tasks>
<ExecuteSQL Name="SQL Empty Collector Table" ConnectionName="Target">
<VariableInput VariableName="User.QueryEmpty" />
</ExecuteSQL>
<Dataflow Name="DFT Populate Collector Table">
<Transformations>
<OleDbSource Name="OLESRC Query" ConnectionName="Source">
<VariableInput VariableName="User.QuerySource" />
</OleDbSource>
<OleDbDestination Name="OLEDST Target" ConnectionName="Target">
<TableFromVariableOutput VariableName="User.TargetSchemaTable" />
</OleDbDestination>
</Transformations>
<PrecedenceConstraints>
<Inputs>
<Input OutputPathName="SQL Empty Collector Table.Output" EvaluationValue="Success" />
</Inputs>
</PrecedenceConstraints>
</Dataflow>
<ExecuteSQL Name="SQL Count Collector Table Rows" ConnectionName="Target" ResultSet="SingleRow">
<VariableInput VariableName="User.QueryCount" />
<Results>
<Result Name="0" VariableName="User.RowCount" />
</Results>
<PrecedenceConstraints>
<Inputs>
<Input OutputPathName="DFT Populate Collector Table.Output" EvaluationValue="Success" />
</Inputs>
</PrecedenceConstraints>
</ExecuteSQL>
<ExecuteSQL Name="SQL Merge Collector Data" ConnectionName="Target">
<DirectInput>SELECT 1; -- simulate merge</DirectInput>
<PrecedenceConstraints>
<Inputs>
<Input OutputPathName="SQL Count Collector Table Rows.Output" EvaluationOperation="ExpressionAndConstraint" EvaluationValue="Success" Expression="@[User::RowCount] > 0" />
</Inputs>
</PrecedenceConstraints>
</ExecuteSQL>
<!--
<SendMail Name="Send Mail" ToLine="Foo@bar.com" ConnectionName="Target" Subject="Subject line">
<DirectInput>Body here, I think</DirectInput>
<PrecedenceConstraints>
<Inputs>
<Input OutputPathName="SQL Count Collector Table Rows.Output" EvaluationOperation="ExpressionOrConstraint" EvaluationValue="Success" Expression="@[User::RowCount] == 0" />
</Inputs>
</PrecedenceConstraints>
</SendMail>
-->
<ExecuteSQL Name="SQL Pretend I send mail" ConnectionName="Target">
<DirectInput>SELECT 2; -- simulate send mail</DirectInput>
<PrecedenceConstraints>
<Inputs>
<Input OutputPathName="SQL Count Collector Table Rows.Output" EvaluationOperation="ExpressionAndConstraint" EvaluationValue="Success" Expression="@[User::RowCount] == 0" />
</Inputs>
</PrecedenceConstraints>
</ExecuteSQL>
</Tasks>
</Container>
</Tasks>
</Package>
<#+
private static string MakeSsisSafeName(string name)
{
return name.Replace("/", "_").Replace("\\", "_").Replace(":", "_").Replace("[", "_").Replace("]", "_").Replace(".", "_").Replace("=", "_").Trim();
}
#>
Right click on the BimlScript brains file and select Generate SSIS Package
That should build out a package like this and hey, it works!
What's not covered
I don't know how you actually use this. Maybe you have one big package with lots of containers and your vision is to just push the button and have another template container added. Biml won't do that. It doesn't merge two SSIS packages - it overlays one with current definition. But, the way I defined all of this, you should be able to copy the generated Container and paste it into an existing SSIS package - assuming it has two connections named Source and Target.
Connections can also be tricky. If you're collecting data from N source servers then you'll likely want a looping mechanism to change out the Source value. That's not hard. But if the source data you're pulling back for each Collector has a different signature, then you need each bespoke Data Flow task.
Sending Email. I don't have an SMTP connection handy so I put a best guess at what the Send Mail would look like and then commented it out <!-- ... --> You'll need to add a Connection for your SMTP server in the brains package and then configure the SendMail task to use it. And then remove my "SQL Pretend I send mail" task.
Finally, you'll notice the names are repeated in the worker Biml. That tells the engine how things should be wired up. If you don't like what I called something, you'll need to change it in two places. Search and Replace will be handy in this ;)
The question asked about custom workflow tasks - answer it
Fine. It sucks. The DataFlow stuff gets into COM objects and they aren't pleasant to work with. When you supply a query or source table, you need to check the metadata, add/remove columns and do lots of stuff that's poorly documented and is a lot of scut work. And that's just building a "regular" package through the interfaces. Once you get that solved, you are looking at encapsulating that logic into custom componentry, which used to be documented with fair enough samples on Codeplex, but that's dead now and I don't know if it's been migrated to GitHub. Oh, and custom tasks and components especially are version dependent, so you get to build against the various binaries to get a DLL for each. And then you'll likely need to build out UI components to help folks configure your SSIS task/component. And then you'll need to worry about delivering and installing it on each developer's computer. And the server installation.
Or, I can define it once via Biml and be done.

Returning large collection/result set from WCF

I am trying to return a large result set from a WCF service. The result set would probably have approximately 500K records, and each record will have 150 columns.
I know that we can return a large result set by configuring the WCF bindings, but I am not sure about the limit. I tried this scenario but got an error telling me to increase the limit of the "maxItemsInObjectGraph" property, even though I had set this property to "2147483647". I googled for alternatives and found that this can be achieved with the messageEncoding and transferMode properties of the binding. I tried "Mtom" and "StreamResponse", but I am not sure how they work.
I also referred to this link, but I don't want to use pagination as my client wants the data in one go.
So, in conclusion:
1. Can we return a large result set from WCF? Is WCF a good fit for returning a large result set, or do I need to move to a different approach such as WebAPI?
2. I guess StreamResponse should return the results in chunks, but after implementing it I don't think it is working, as I am getting the whole result at once.
Your inputs will be appreciated.
Thanks!!
You can return large object sets from a WCF service. WebAPI or another .NET service framework has no advantage when it comes to transferring a lot of data.
StreamResponse is the best choice for you. It transfers the message portion by portion, so that you don't have all your 500K records on the wire at one moment. It's a transport-layer option, hence to the client it seems as if the whole message arrived at once. You can manage the portion size by editing the MaxBufferSize property of your binding.
To minimize the size of the data, you can also compress it using BinaryMessageEncoding.
<customBinding>
<binding name="primaryBinding" openTimeout="00:01:00" closeTimeout="00:01:00"
sendTimeout="00:30:00" receiveTimeout="00:30:00">
<binaryMessageEncoding compressionFormat="GZip">
<readerQuotas maxDepth="2147483647" maxStringContentLength="2147483647"
maxArrayLength="2147483647" maxBytesPerRead="2147483647"
maxNameTableCharCount="2147483647" />
</binaryMessageEncoding>
<httpsTransport transferMode="Streamed" maxReceivedMessageSize="6000000"
maxBufferSize="6000000" maxBufferPoolSize="12000000" />
</binding>
</customBinding>

Duplicate values from csv are inserted to DB using Apache Camel

I have a large number of CSV files (each containing millions of records).
So I use seda to take advantage of multi-threading. I split each file into chunks of 50000 lines, process each chunk to get a List of entity objects, and then split that list and persist it to the DB using JPA. Initially I was getting an out-of-heap-memory exception, but after I moved to a higher-spec system the heap issue was solved.
The issue now is that duplicate records are getting inserted into the DB. Say there are 1000000 records in the CSV; around 2000000 records get inserted into the DB.
There is no primary key for the records in the CSV files, so I have used Hibernate to generate a primary key for them.
Below is my code (camel-context.xml):
<camelContext xmlns="http://camel.apache.org/schema/spring">
<route>
<from uri="file:C:\Users\PPP\Desktop\input?noop=true" />
<to uri="seda:StageIt" />
</route>
<route>
<from uri="seda:StageIt?concurrentConsumers=1" />
<split streaming="true">
<tokenize token="\n" group="50000"></tokenize>
<to uri="seda:WriteToFile" />
</split>
</route>
<route>
<from uri="seda:WriteToFile?concurrentConsumers=8" />
<setHeader headerName="CamelFileName">
<simple>${exchangeId}</simple>
</setHeader>
<unmarshal ref="bindyDataformat">
<bindy type="Csv" classType="target.bindy.RealEstate" />
</unmarshal>
<split>
<simple>${body}</simple>
<to uri="jpa:target.bindy.RealEstate"/>
</split>
</route>
</camelContext>
Please Help.
Could you be accidentally starting two contexts so that the routes run twice? How do you start the route?
I think the problem may be with "?noop=true", since it doesn't move the file that is being processed. As a result, Camel will consume the file again and again. Have you tried removing this option so that Camel moves the file to a .camel subdirectory? By default, Camel doesn't process files in "hidden" directories (the ones whose names start with a dot). You can also add "?moveFailed=.failed" as a precaution, so files will always be moved to a directory, even if they fail. Let me know if this helps.
R.
To eliminate the duplicates in the DB, you could create the primary key from a hash of a record's contents instead of using hibernate to generate a random one.
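A content-derived key could look roughly like the sketch below. It is only an illustration of the idea, assuming the raw CSV line is still available when the entity is built; the class and method names are hypothetical, and SHA-256 is just one reasonable choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ContentKeySketch {

    // Derives a deterministic key from the raw text of one CSV record,
    // so the same line always maps to the same key.
    public static String contentKey(String csvLine) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(csvLine.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample record, for illustration only.
        String line = "12 Main St,Springfield,250000";
        System.out.println(contentKey(line));
    }
}
```

With such a key, a record that is read twice maps to the same primary key instead of creating a second row. Note that rows which are genuinely identical in the source file would also collapse into one.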
Add noop.flag: true to the yml file. It flags each file that Camel processes so that it will not be processed again. You can also specify a destination location: as soon as the file is processed, a copy is moved there, and you can then delete the processed files from the source folder yourself. A scheduler would be the best way to do that.

Retrieve row number with supercsv

Is there a way with the super-csv library to find out the number of rows in the file that will be processed?
In other words, before I start to process my rows with a loop:
while ((obj = csvBeanReader.read(obj.getClass(),
csvModel.getNameMapping(), processors)) != null) {
//Do some logic here...
}
Can I retrieve the number of rows contained in the CSV file with some library class?
No, in order to find out how many rows are in your CSV file, you'll have to read the whole file with Super CSV (this is really the only way, as a CSV row can span multiple lines). You could always do an initial pass over the file using CsvListReader (it doesn't do any bean mapping, so it's probably a bit more efficient) just to get the row count...
As an aside (it doesn't help in this situation), you can get the current line/row number from the reader as you are reading using the getLineNumber() and getRowNumber() methods.
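To illustrate why a plain line count is not enough, here is a minimal sketch of a counting pass in plain Java. It deliberately does not use Super CSV: it is a simplified stand-in that only tracks double-quoted fields, skips blank lines, and assumes rows end with \n or \r\n. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CsvRowCountSketch {

    // Counts CSV rows in one streaming pass. A line break inside a quoted
    // field does not end the row. An escaped quote ("") toggles the flag
    // twice, so it leaves the quoted state unchanged.
    public static long countRows(Reader in) throws IOException {
        long rows = 0;
        boolean inQuotes = false;
        boolean rowHasData = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '"') {
                inQuotes = !inQuotes;
                rowHasData = true;
            } else if (c == '\n' && !inQuotes) {
                if (rowHasData) {
                    rows++;
                }
                rowHasData = false;
            } else if (c != '\r' || inQuotes) {
                rowHasData = true;
            }
        }
        if (rowHasData) {
            rows++; // last row without a trailing line break
        }
        return rows;
    }

    // Convenience overload for in-memory text.
    public static long countRows(String csv) {
        try {
            return countRows(new StringReader(csv));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Three physical lines after the header, but only one data row.
        String csv = "id,note\n1,\"line one\nline two\"\n";
        System.out.println(countRows(csv));
    }
}
```

For a real file, wrapping the Reader in a BufferedReader keeps the character-by-character loop reasonably fast. A pass with the library's own reader remains the safer option, since it applies the same parsing rules as the later bean-mapping pass.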

How to define an Aggregator in Camel after using a ProducerTemplate?

I have a route that processes csv files and inserts them as records into the database. Since it's a huge csv-file and the Camel csv-splitter ran out of memory we had to write our own splitter. I wrote the splitter using a ProducerTemplate.
The route to process the csv looks a bit like this:
<route id="processCsvRoute">
<from ref="inbox" />
<to uri="bean:csvBean?method=process"/>
</route>
In the csvBean we do the splitting and finally execute the following java code for every csv-line (a csv line results in a product object).
producer.sendBodyAndHeader("direct:csvAggregator", product, "ID", csv.getFilename());
Now the csvAggregator route picks up the csv:
<route id="csvAggregator">
<from uri="direct:csvAggregator" />
<aggregate strategyRef="exchangeAggregatorStrategy"
completionSize="10000"
completionInterval="10000"
parallelProcessing="true">
<correlationExpression>
<header>ID</header>
</correlationExpression>
<to uri="bean:batchInsertBean"/>
</aggregate>
</route>
Is there a way to define the aggregator in the processCsvRoute? My solution is working, but it doesn't feel right I have to create a separate route for it.
Thanks for your help.
You can just enable streaming mode on the splitter, then it reads the CSV file "line by line" and you won't run out of memory.
The documentation has more details: http://camel.apache.org/splitter
<split streaming="true" ...>
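The reason streaming helps can be shown with a plain-Java analogy. The sketch below is not Camel code and does not use the Camel API: it only illustrates reading lines lazily and handing them on in fixed-size batches, so that at most one batch is held in memory at a time. The class name, forEachBatch and the batch size are hypothetical.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class StreamingBatchSketch {

    // Reads lines lazily and passes them to `sink` in batches of at most
    // `batchSize` lines. Returns the number of batches delivered.
    public static int forEachBatch(BufferedReader reader, int batchSize,
                                   Consumer<List<String>> sink) {
        List<String> batch = new ArrayList<>(batchSize);
        int batches = 0;
        Iterator<String> lines = reader.lines().iterator();
        while (lines.hasNext()) {
            batch.add(lines.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);
                batches++;
                batch = new ArrayList<>(batchSize); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // final, possibly smaller batch
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("a\nb\nc\nd\ne\n"));
        forEachBatch(reader, 2, batch -> System.out.println(batch));
    }
}
```

In the route above, the streaming splitter plays the role of the lazy line reader, and the aggregator's completionSize plays the role of the batch size.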