Read a CSV file that has a JSON column in SSIS?

I have the following CSV file that has 4 columns. The last column, addresses, holds a history of 2 addresses in JSON format. I have tried to read it in SSIS, but it splits the JSON at each comma (,) instead of keeping all the addresses in one column.
I am using a flat-file connector for this. Is there any other source component for this type of content? How can I parse this in SSIS so that there are just 4 columns and all the addresses appear in one column?
id,title,name,addresses
J44011,Mr,James,"{""address_line_1"": 45, ""post_code"": ""XY7 10PG""},{""address_line_1"": 15, ""post_code"": ""AB7 1HG""}"

You can use a script component to process the JSON into its own detail table.
I created the following dataflow:
Here are the steps to the script component:
On inputs add ID and Address columns:
On inputs and outputs: add a new output and create its columns (remember to set the data types):
The script:
public class Addresses
{
public int address_line_1 { get; set; }
public string post_code { get; set; }
}
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
//Test if addresses exist, if not leave the Row processing
if (string.IsNullOrEmpty(Row.addresses)) return;
//Fix Json to make it an array of objects
string json = string.Format("[{0}]", Row.addresses);
//Load into an array of Addresses
Addresses[] adds = new System.Web.Script.Serialization.JavaScriptSerializer().Deserialize<Addresses[]>(json);
//Process the array
foreach (var a in adds)
{
rowsAddressesBuffer.AddRow();
rowsAddressesBuffer.ID = Row.id;
rowsAddressesBuffer.Address1 = a.address_line_1;
rowsAddressesBuffer.PostalCode = a.post_code;
}
}
Notes:
A class was added to store the results.
The JSON had to be wrapped in square brackets to make it an array of objects.
You need to add a reference to System.Web.Extensions.
The output rows then go to the load. Make sure the text qualifier on the flat file connection is defined as a double quote (").

I have tried to read it in SSIS but it splits the JSON along the comma(,) instead of grouping all the addresses under one column.
In order to force SSIS to read the flat file row in 4 columns, you should open the flat file connection manager, go to Advanced Tab, and Add only 4 columns. Make sure the last column length is equal to 4000. This will force reading the 4th column without splitting it.
After importing the data into SQL Server, you can parse the JSON content using the OPENJSON() function:
Parse and Transform JSON Data with OPENJSON (SQL Server)
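To illustrate what the double-quote text qualifier is doing here, below is a minimal Java sketch (not SSIS code) of a quote-aware split for a single line. The class name QualifiedCsvLine is hypothetical. It assumes commas inside a qualified field are data, and that a doubled quote ("") inside a qualified field stands for one literal quote, which is how the sample row is written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SSIS: splits one CSV line the way a
// double-quote text qualifier is expected to behave.
final class QualifiedCsvLine {

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a qualified field = literal quote
                    i++;                 // skip the second quote of the pair
                } else if (c == '"') {
                    inQuotes = false;    // closing qualifier
                } else {
                    current.append(c);   // commas in here are data, not delimiters
                }
            } else if (c == '"') {
                inQuotes = true;         // opening qualifier
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());  // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // The data row from the question, escaped as a Java string literal.
        String line = "J44011,Mr,James,\"{\"\"address_line_1\"\": 45, \"\"post_code\"\": \"\"XY7 10PG\"\"},"
                + "{\"\"address_line_1\"\": 15, \"\"post_code\"\": \"\"AB7 1HG\"\"}\"";
        List<String> fields = split(line);
        System.out.println("columns: " + fields.size());
        System.out.println("addresses: " + fields.get(3));
    }
}
```

With the qualifier honoured, the whole JSON fragment stays in the fourth field and the doubled quotes collapse back to ordinary JSON quotes.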

Related

Parsing csv data format in apache Camel

I followed an example from the book Camel in Action on how to marshal and unmarshal the CSV data format. However, I want to unmarshal a CSV file (comma-separated) and split the body. Then I will use a content-based .choice() to distribute messages according to the required tasks.
In fact, the first, simple example didn't work for me. I used Camel 2.15.6 (camel-core, camel-context, camel-csv, commons-csv) and Java 7.
public void configure()
{
CsvDataFormat csv = new CsvDataFormat();
csv.setDelimiter(",");
from("file:test?noop=true")
.unmarshal().csv()
.split(body())
.to("file:out");
}
Please find below the stack trace.
Can you try by removing noop=true? Actually, if noop is true, the file is not moved or deleted in any way. This option is good for readonly data, or for ETL type requirements.
Pass the configured csv object to unmarshal() like this:
public void configure() throws Exception
{
CsvDataFormat csv = new CsvDataFormat();
csv.setDelimiter(",");
from("file:test?noop=true")
.unmarshal(csv)
.split(body())
.to("file:out");
}
Alternatively, this may help you set up content-based routing. Here I filter according to the header of the CSV:
//Route 1 for filter CSV based on header
from("file:/home/r2/Desktop/csvFile?noop=true")
.choice().when(body().contains("partyName"))
.to("direct:partyNameCSV")
.when(body().contains("\"stuffName\""))
.to("direct:stuffNameCSV")
.otherwise().endChoice();
//Route 2 partyNameCSV
from("direct:partyNameCSV")
.unmarshal(csv)
.process(new PartyNameCSVProcessor())
.end();
//Route 3 stuffNameCSV
from("direct:stuffNameCSV")
.unmarshal(csv)
.process(new StuffCSVProcessor())
.end();

Parsing missing column values CSV in Spark

I have a huge table imported from a DB with ~270 columns. I created a JavaRDD and used it to fill a DataFrame.
Scenario: if all the fields in the CSV file are present, then everything is fine. But if there are some empty fields in the CSV, e.g.
Value1,,,,,,value7,,,,,
then writing to Parquet or the Hive table store fails with an IndexOutOfBounds exception (column > row size). I don't want to use the spark-csv library.
I tried using filters, but to no avail, as I need all the columns even if there is no data in the CSV. Please let me know if I am missing something.
JavaRDD<String> tLogRDD =jsc.textFile(dataFile);
String schema=tLogRDD.first();
List<StructField> columns =new ArrayList<StructField>();
for(String fieldName: schema.split(","))
{
columns.add(DataTypes.createStructField(fieldName,DataTypes.StringType,false));
}
StructType schemaStructType = DataTypes.createStructType(columns);
System.out.println("XXXXXXXXXXXX-Row Read Start-XXXXXXXXXXXXXXX");
@SuppressWarnings("serial")
JavaRDD<Row> rowRDD = tLogRDD.map(
new Function<String, Row>() {
@Override
public Row call(String record) throws Exception {
String[] fields = record.split(",");
Object[] fields_converted = fields;
return RowFactory.create(fields_converted);
}
});
//apply schema to rows
DataFrame tLogfDataFrame=hContext.createDataFrame(rowRDD, schemaStructType);
System.out.println("DataFrame Constructed Successfully");
tLogfDataFrame.show(10);
tLogfDataFrame.save("C:/Users/Documents/1001.csv","parquet");
You can use the CSV reader from Spark, like:
sparkSession.read()
.format("csv")
.option("header","true")
.option("inferSchema","true")
.load(--file path--)
That's easier and offers a set of options.
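If you want to keep the manual parsing from the question, a likely cause of the exception is that Java's String.split(",") drops trailing empty strings, so a record that ends in empty fields produces fewer values than the schema has columns. Passing a negative limit keeps them. A minimal sketch (the class and method names are made up for illustration):

```java
// Hypothetical demo class: shows how the limit argument of String.split
// affects trailing empty fields.
final class TrailingEmptyFields {

    // A negative limit tells split() to keep trailing empty strings.
    static String[] splitKeepingEmpties(String record) {
        return record.split(",", -1);
    }

    public static void main(String[] args) {
        String record = "Value1,,,,,,value7,,,,,";
        // Default split: trailing empty fields are discarded.
        System.out.println("default split:  " + record.split(",").length + " fields");
        // Split with limit -1: one field per column, empty or not.
        System.out.println("split(\",\", -1): " + splitKeepingEmpties(record).length + " fields");
    }
}
```

In the question's map function this would mean using record.split(",", -1) before RowFactory.create(...). Note that this plain split still does not handle quoted fields that contain commas.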

Import Flat File containing multiline fields in SSIS

I would like to import a flat file (*.csv) in SSIS, but one field is a multiline text. I do not have a special record delimiter (and there is no way to get one), so the record delimiter is the carriage return \r\n (CRLF).
The problem is: when SSIS meets a CRLF inside a multiline field, it moves on to the next record instead of continuing the multiline field.
Here is the header and some first lines :
"name", "firstname", "description", "age"
"John", "Smith", "blablablablablabla", 25
"Fred", "Gordon", "blablabla
blablablabla", 33
"Bill", "Buffalo", "bllllllllllllaaaaaaa
blaaaaaaa
blaalalalaaaaaaaaaa", 44
The example above contains 1 header and 3 records. SSIS understands it as 1 header and 6 records and then throws errors, of course.
I don't know how I can handle this problem.
I hope you can help me.
According to your example, the Description field values can contain multiple carriage returns, which is what causes the new lines.
The following record appearing on multiple lines...
"Bill", "Buffalo", "bllllllllllllaaaaaaa
blaaaaaaa
blaalalalaaaaaaaaaa", 44
should appear as below for SSIS to see the expected number of columns.
"Bill", "Buffalo", "bllllllllllllaaaaaaa blaaaaaaa blaalalalaaaaaaaaaa", 44
There are a couple of approaches to resolving the formatting issue.
If possible, the easiest approach is to follow up with the person who created the file and have them do it correctly. For example, assuming they're using SQL Server, then they can apply the following in their TSQL statement for the description field to replace the carriage returns with a blank. (Oracle also has a similar function.)
REPLACE(Description, CHAR(13),' ')
If you need to replace a line feed, then use CHAR(10).
Otherwise, I understand that contacting the source of the file is not always possible. In this case, you can modify the text file programmatically before feeding it into SSIS. The following link discusses how to use Excel to do this; you can then save to a new CSV file and import that through SSIS.
http://www.mrexcel.com/forum/excel-questions/304939-importing-text-data-carriage-returns-into-excel.html
If you are looking at setting up the SSIS package in a job, then you can write a script task in the early part of your control flow that will do the same thing and bypass Excel. The VB code provided in the link can be easily adapted to a script task.
Hope this helps.
Given that the source of the text files cannot be contacted and that the number of columns in each csv will vary, the best option for performing an import is to proceed on a variation of option 2 of Answer #1. This will require some customization and the application of a script task in the control flow.
On the server where the SSIS package will be running, create a bucket folder where a temporary text file will be saved. Each time a CSV file is processed, a temporary file called "destFile.csv" will be created from it and this is what you will import. Each time a different csv file is processed by the script task, it will save to this temporary file and location.
Create two variables in the SSIS package. One for the source file and the second for the destination file.
Create a script task and define the two variables being sent to it.
Add the following C# to the script task and remember to replace at the top the assignments for source File and destination File. They should be set equal to the new user variables just created.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Diagnostics;
using System.IO;
using System.Data;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string sourceFile = @"C:\test\tempfile.csv";
string line;
int count = 0;
int commaCount = 0;
int HeaderCommaCount = 0;
string templine;
string destinationFile = @"C:\test\destFile.csv";
List<string> lines = new List<string>();
// Delete temporary destination file if it already exists
if (File.Exists(destinationFile))
{
File.Delete(destinationFile);
}
// Create temporary destination file
File.Create(destinationFile).Dispose();
if (File.Exists(sourceFile))
{
StreamReader file = null;
try
{
file = new StreamReader(sourceFile);
while ((line = file.ReadLine()) != null)
{
// If Header line, get the number of commas. This is the base by which all following rows will be compared.
if (count == 0)
{
HeaderCommaCount = line.Split(',').Length - 1;
lines.Add(line); //save to a string array
count++;
}
else // This is any row following header row
{
commaCount = line.Split(',').Length - 1;
if (commaCount == HeaderCommaCount) //Row following header contains the correct number of columns
{
lines.Add(line); //save to a string array
count++;
}
else
{
templine = line;
// If comma count is less than that of Header row, keep reading and joining rows until it matches (or the file ends), then write.
while (commaCount < HeaderCommaCount && (line = file.ReadLine()) != null)
{
templine = templine + " " + line;
commaCount = templine.Split(',').Length - 1;
}
lines.Add(templine); //save to a string array
}
}
}
}
finally
{
if (file != null)
file.Close();
}
}
File.WriteAllLines(destinationFile, lines); //send contents of string array to destination file.
//Console.ReadLine();
}
}
}
I wrote this quickly as a console application so that it would be easier to convert over to a C# script task. The file tested successfully when I applied your initial file example. It iterates through the source text file, concatenates the lines that have been split apart, and then saves to a destination file. The destination file is recreated and populated each time it is run. You can test this out first as a console application in Visual Studio and also add a Console.WriteLine(line) call just above or below where you see lines.Add(line) in the code.
After this, all you need to do is import from the temporary destination file to your database.
Hope this helps.
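Counting commas breaks down when the multiline text itself contains commas. A variation on the same idea, sketched here in Java rather than C#, is to count double quotes instead: a record is treated as complete once its quotes are balanced. This is a swapped-in technique, not the code above, and the class name MultilineRecordJoiner is hypothetical. It assumes every multiline field is wrapped in double quotes, as in the sample file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: rebuilds logical CSV records from physical lines by
// tracking whether we are currently inside a double-quoted field.
final class MultilineRecordJoiner {

    static List<String> join(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : physicalLines) {
            if (pending.length() > 0) {
                pending.append(' '); // replace the embedded line break with a space
            }
            pending.append(line);
            // Each quote toggles the state; a doubled quote toggles twice and cancels out.
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    insideQuotes = !insideQuotes;
                }
            }
            if (!insideQuotes) {
                records.add(pending.toString()); // quotes balanced: record is complete
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            records.add(pending.toString()); // unterminated last record, kept as-is
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "\"name\", \"firstname\", \"description\", \"age\"",
                "\"John\", \"Smith\", \"blablablablablabla\", 25",
                "\"Fred\", \"Gordon\", \"blablabla",
                "blablablabla\", 33");
        for (String record : join(lines)) {
            System.out.println(record);
        }
    }
}
```

The joined records can then be written to the temporary destination file and imported as described above.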

Apache camel + csv + header

I have csv file as follows:
A;B;C
1;test;22
2;test2;33
where the first line is a kind of header and the others are data. I need to import all data rows with respect to the header and report how many rows are correct and how many are not.
My first idea is to split source file to multiple files in the form of:
file1:
A;B;C
1;test;22
file2:
A;B;C
2;test2;33
How can I do this in camel, and how can I collect data necessary to print a summary report?
Take a look at Bean IO, and the Camel BeanIO component.
Looks like a good fit for your scenario.
You could probably build upon the example code on the first page of bean IO
BeanIO
http://beanio.org/
Camel BeanIO component
http://camel.apache.org/beanio.html
You should not need to split your incoming file if the only thing you need to do is collect and count successful and unsuccessful records.
If the CSV is not too big and fits in memory, I would read and convert the CSV file to a list of Java objects. The latest Camel CSV component can convert a CSV file into a List<Map>, before Camel 2.13 it produced List<List>. After having read converted CSV file into List of something you can write your own processor to iterate over the List and check its content.
You can unmarshal the file as a CSV file, remove the first line (the header) and then do your validations as desired. Here is an example Camel route implementation:
from("file:mydir/filename?noop=true")
.unmarshal()
.csv()
.process(validateFile())
.to("log:my.package?multiline=true");
Then you need to define the validateFile() method using the Camel Processor class like this:
public Processor validateFile() {
return new Processor() {
@Override
public void process(Exchange exchange) throws Exception {
List<List<String>> data = (List<List<String>>) exchange.getIn().getBody();
List<String> headerLine = data.remove(0);
System.out.println("header: "+headerLine);
System.out.println("total lines: "+data.size());
// iterate over each line
for( List<String> line : data) {
System.out.println("Total columns: "+line.size());
System.out.println(line.get(0)); // first column
}
}
};
}
In this method you can validate each line and column of the file as you wish and then print the result, or even write the report to another output file.
For reference, see the File and CSV component pages in the Apache Camel docs:
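As a rough illustration of the counting part, here is a plain Java sketch that could be called from inside process() once the header has been removed. The class name RowValidationSummary and the validation rule (expected number of columns, no blank values) are assumptions made for the example; replace the rule with whatever "correct" means for your data.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: counts valid and invalid data rows after the header
// row has been removed from the unmarshalled body.
final class RowValidationSummary {

    // Assumed rule for illustration: a row is valid when it has the expected
    // number of columns and none of them is blank.
    static boolean isValid(List<String> row, int expectedColumns) {
        if (row.size() != expectedColumns) {
            return false;
        }
        for (String value : row) {
            if (value == null || value.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Returns a two-element array: {valid count, invalid count}.
    static int[] count(List<List<String>> rows, int expectedColumns) {
        int valid = 0;
        int invalid = 0;
        for (List<String> row : rows) {
            if (isValid(row, expectedColumns)) {
                valid++;
            } else {
                invalid++;
            }
        }
        return new int[] {valid, invalid};
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("1", "test", "22"),
                Arrays.asList("2", "test2", "33"),
                Arrays.asList("3", "", "44"));
        int[] summary = count(rows, 3);
        System.out.println("correct rows: " + summary[0] + ", incorrect rows: " + summary[1]);
    }
}
```

The expected column count would normally come from the size of the header row.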
http://camel.apache.org/file.html
http://camel.apache.org/csv.html

how to insert excel data in a database with java

I want to insert data from an Excel file into a local database on a UNIX server with Java, without any manipulation of the data.
1- Someone told me that I have to convert the Excel file to .csv to be compatible with UNIX. I created a CSV file for each sheet (I have 12) with a macro. The problem is that it changed the date format from DD-MM-YYYY to MM-DD-YYYY. How can I avoid this?
2- I used the LOAD DATA command to insert data from the CSV files into my database. There is a date column that is optionally filled in the Excel file, so in the CSV it becomes ,, and LOAD DATA doesn't work (an argument is needed). How can I fix this?
Thanks for your help.
It should be quite easy to read out the values from Excel with Apache POI. Then you save yourself the extra step of converting to another format and possible problems when your data contains comma and you convert to CSV.
Save the Excel file in CSV (comma-separated values) format. That makes it easy to read and parse with fairly simple use of StringTokenizer.
Use MySQL (or SQLite depending on your needs) and JDBC to load data into the database.
Here is a CSVEnumeration class I developed:
package com.aepryus.util;
import java.util.*;
public class CSVEnumeration implements Enumeration {
private List<String> tokens = new Vector<String>();
private int index=0;
public CSVEnumeration (String line) {
for (int i=0;i<line.length();i++) {
StringBuffer sb = new StringBuffer();
if (line.charAt(i) != '"') {
while (i < line.length() && line.charAt(i) != ',') {
sb.append(line.charAt(i));
i++;
}
tokens.add(sb.toString());
} else {
i++;
while(line.charAt(i) != '"') {
sb.append(line.charAt(i));
i++;
}
i++;
tokens.add(sb.toString());
}
}
}
// Enumeration =================================================================
public boolean hasMoreElements () {
return index < tokens.size();
}
public Object nextElement () {
return tokens.get(index++);
}
}
If you break the lines of the CSV file up using split and then feed them one by one into the CSVEnumeration class, you can then step through the fields. Or here is some code I have lying around that uses StringTokenizer to parse the lines. csv is a string that contains the entire contents of the file.
StringTokenizer lines = new StringTokenizer(csv,"\n\r");
lines.nextToken();
while (lines.hasMoreElements()) {
String line = lines.nextToken();
Enumeration e = new CSVEnumeration(line);
for (int i=0;e.hasMoreElements();i++) {
String token = (String)e.nextElement();
switch (i) {
case 0:/* do stuff */;break;
}
}
}
I suggest MySQL for its performance and because it is open source.
There are two situations:
If you just want to store the Excel cell values in the database, you can convert the Excel file to CSV format so that you can simply use the LOAD DATA command in MySQL.
If you have to do some manipulation before the values go into the tables, I suggest Apache POI. I've used it and it works fine; whatever your Excel format, you just have to use the correct implementation.
We are using SQLite in our Java application. It's serverless, really simple to use and very efficient.