Different column names - CSV

I have 74 csv files.
I am merging them in SSIS. The data is the same in all of the files, but the column names are different in half of them.
How can I do the mapping?
For example, the column name is [Latitude] in one sample file and [Lat] in another.

If the column ordinal position for Latitude/Lat is always the same, i.e. it is always the third column in the file, then you can modify your flat file connection manager to skip the first row and to indicate that the file has no header row. This will result in you having to define your file layout manually, but it's entirely workable.
If Lat is column 4 in one file and Latitude is always column 1, the best solution is to have two connection managers, one for each file type, and to segment your data files by folder, routing them into different Foreach File enumerators and data flows for consumption by the target tables.
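If you'd rather keep a single connection manager, another option (not part of the answer above, just a sketch) is to normalize the header row before SSIS picks the files up, so that every file says Latitude. A minimal standalone sketch in Java, assuming the files are small enough to rewrite in memory; the folder path and class name are illustrative:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class NormalizeHeaders {
    public static void main(String[] args) throws IOException {
        Path folder = Paths.get("C:/data/csv"); // hypothetical location of the 74 files
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.csv")) {
            for (Path file : files) {
                List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
                if (lines.isEmpty()) continue;
                // Compare header tokens exactly, so a file already using "Latitude" is left untouched.
                String[] cols = lines.get(0).split(",", -1);
                for (int i = 0; i < cols.length; i++) {
                    if (cols[i].trim().equals("Lat")) cols[i] = "Latitude";
                }
                lines.set(0, String.join(",", cols));
                Files.write(file, lines, StandardCharsets.UTF_8);
            }
        }
    }
}

After that, all 74 files share one layout and a single flat file connection manager maps cleanly.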

Related

Apache NiFi: Merge rows in two CSV files

I have two CSV files that are funnelled into a MergeContent processor. I want them to be merged together. They both have the same columns. If the first and second CSVs look like this:
First CSV:
id, name
12,John
11,Keels
Second CSV:
id, name
22,Kelly
25,Felder
My output should look like this:
id, name
12,John
11,Keels
22,Kelly
25,Felder
I have tried doing this through the MergeContent processor, but it changes the data into a different format, which I don't want. Both the input files and the output file must be .csv, and the output must keep the same name as the input files. (The input files have the same name.)
Use the MergeRecord processor with a common attribute. For example, if both flow files have the same attribute, such as filename = test.csv, then you can configure the MergeRecord processor as follows:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Merge Strategy: Bin-Packing Algorithm
Correlation Attribute Name: filename
Attribute Strategy: Keep Only Common Attributes
Minimum Number of Records: 3
The important setting is Minimum Number of Records, which is the number of records required before a bin is merged. In this case it should be larger than 2, because each CSV has only 2 rows; the first CSV then waits in the bin until the other arrives and together they exceed the minimum.

Spark DataFrame: Is there any option to insert an extra header in a CSV

I need to create a CSV file from an API which requires two lines at the top of the CSV file.
The first line would be the name of the program (one column) and the second a header with modified column names. I managed to get the second line, but I'm not sure I can easily create the first one.
What's the best way to do it?
dfb.select("name1", "firstname1").write()
    .format("csv")
    .option("header", true)
    .save("file:///home/dse/bin/results.csv");

NiFi: Regular Expression in ExtractText gets CSV header instead of data

I'm working on a flow where I get CSV files. I want to put the records into different directories based on the first field in the CSV record.
For example, the CSV file would look like this:
country,firstname,lastname,ssn,mob_num
US,xxxx,xxxxx,xxxxx,xxxx
UK,xxxx,xxxxx,xxxxx,xxxx
US,xxxx,xxxxx,xxxxx,xxxx
JP,xxxx,xxxxx,xxxxx,xxxx
JP,xxxx,xxxxx,xxxxx,xxxx
I want to get the value of the first field, i.e. country, and put each record into the corresponding directory: US records go to the US directory, UK records go to the UK directory, and so on.
The flow that I have right now is:
GetFile ----> SplitText (line split count = 1 & header line count = 1) ----> ExtractText (line = (.+)) ----> PutFile (Directory = \tmp\data\${line:getDelimitedField(1)}). I need the header line to be replicated across all the split files for a different purpose, so I keep it.
The thing is, the incoming CSV file gets split into multiple flow files with the header successfully. However, the regex I have given in the ExtractText processor evaluates against the split flow files' CSV header instead of the record. So instead of getting US or UK in the "line" attribute, I always get "country", and all the files go to \tmp\data\country. How can I resolve this?
I believe getDelimitedField only works on a single line and is likely not moving past the newline in your split file.
I would advocate for a slightly different approach in which you could alter your ExtractText to find the country code through a regular expression and avoid the need to include the contents of the file as an attribute.
Using a regex of ^.*\n+(\w+) will match past the first (header) line and capture the first run of word characters on the following line, i.e. the country code up to the comma, placing it in capture group 1 of the attribute name you specify (e.g. country.1).
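ExtractText applies Java regular expressions, so the behaviour is easy to verify outside the flow. A quick standalone check (illustrative only) against the contents of one split flow file:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        // A split flow file: the replicated header line plus one record.
        String flowFile = "country,firstname,lastname,ssn,mob_num\nUS,xxxx,xxxxx,xxxxx,xxxx";
        Matcher m = Pattern.compile("^.*\\n+(\\w+)").matcher(flowFile);
        if (m.find()) {
            System.out.println(m.group(1)); // prints: US
        }
    }
}

Without DOTALL, the leading ^.* consumes only the header line, so the capture group grabs the first field of the record line rather than "country".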
I have created a template that should get the value you are looking for available at https://github.com/apiri/nifi-review-collateral/blob/master/stackoverflow/42022249/Extract_Country_From_Splits.xml

SSIS - Process a flat file with varying data

I have to process a flat file whose syntax is as follows, one record per line.
<header>|<datagroup_1>|...|<datagroup_n>|[CR][LF]
The header has a fixed-length field format that never changes (ID, timestamp, etc.). However, there are different types of data groups and, even though they are fixed-length, the number of their fields varies depending on the data group type. The first three numbers of a data group define its type. The number of data groups in each record also varies.
My idea is to have a staging table into which I would insert all the data groups. So two records like this,
12320160101|12323456KKSD3467|456SSGFED43520160101173802|
98720160102|456GGLWSD45960160108854802|
would produce three records in the staging table:
ID   Timestamp    Data
123  01/01/2016   12323456KKSD3467
123  01/01/2016   456SSGFED43520160101173802
987  02/01/2016   456GGLWSD45960160108854802
This would allow me to preprocess the staged records for further processing (some would be discarded, some would have their data broken down further). My question is how to break the flat file down into the staging table. I can split the entire record on the pipe (|) and then use a Derived Column Transformation to break down the header with SUBSTRING. After that it gets trickier because of the varying number of data groups.
The solution I came up with myself doesn't try to split at the flat file source, but rather in a script. My Data Flow is simply Flat File Source -> Script Component -> OLE DB Destination.
The Flat File Source output is just a single column containing the entire line. The Script Component defines an output column for each column in the staging table. The script looks like this.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // splits[0] is the fixed-length header; the remaining entries are data groups.
    var splits = Row.Line.Split('|');
    for (int i = 1; i < splits.Length; i++)
    {
        // The trailing pipe at the end of each line yields an empty last entry; skip it.
        if (string.IsNullOrEmpty(splits[i]))
            continue;

        // Emit one staging row per data group, repeating the header fields.
        Output0Buffer.AddRow();
        Output0Buffer.ID = splits[0].Substring(0, 11);
        Output0Buffer.Time = DateTime.ParseExact(splits[0].Substring(14, 14),
            "yyyyMMddHHmmssFFF", CultureInfo.InvariantCulture);
        Output0Buffer.Datagroup = splits[i];
    }
}
Note that the SynchronousInputID property (Script Transformation Editor > Inputs and Outputs > Output0) must be set to None; otherwise you won't have Output0Buffer available in your script. Finally, the OLE DB Destination just maps the script output columns to the staging table columns. This solves the problem I had with creating multiple output records from a single input record.

Automatically mapping columns to a destination flat file in SSIS

I need to export a table with more than 100 columns to a destination flat file. How can I map them automatically, rather than one by one manually, in SSIS (SQL Server 2012)? Is there any option like the automatic mapping that existed in DTS?
Assuming the flat file already exists, it's a simple matter of going into your flat file destination; there should be an option, possibly on right-click, to auto-map by column name.
Unless your text file has column names specified in it, the columns will be named column1 through column100. These will not automatically map to your destination (unless the destination's columns are also named column1 through column100). You will have to put in the effort and rename the generic column names in the text source to match your destination's column names.