How to split this csv file into multiple contents? - csv

I have CSV File which having below contents,
Input.csv
Sample NiFi Data demonstration for below
Due dates 20-02-2017,23-03-2017
My Input No1 inside csv,,,,,,
Animals,Today-20.02.2017,Yesterday-19-02.2017
Fox,21,32
Lion,20,12
My Input No2 inside csv,,,,
Name,ID,City
Mahi,12,UK
And,21,US
Prabh,32,LI
I need to split above whole csv(Input.csv) into two parts like InputNo1.csv and InputNo2.csv.
For InputNo1.csv should have below contents only.,
Animals,Today-20.02.2017,Yesterday-19-02.2017
Fox,21,32
Lion,20,12
For InputNo2.csv should have below contents.,
Name,ID,City
Mahi,12,UK
And,21,US
Prabh,32,LI
Is this possible to convert csv into Multiple parts in NiFi possible with existing processors?

Yes.
Use the ReplaceText processor to remove the global header, use SplitContent to split the resulting flowfile into multiple flowfiles, use another ReplaceText to remove the leftover comment string because SplitContent needs a literal byte string, not a regex, and then perform the normal SplitText operations.
Here is a template specific to the input you provided in your question.

Related

concern while importing/linking csv to access database

I have a csv file with delimiter as , (comma) and few of the data column of same file has comma in it .
Hence while linking / importing the file, data is getting jumbled in next column.
I have tried all possible means like skip column etc , but not getting any fruitful results.
Please let me know if this can be handled through VBA function in ms-access.
If the CSV file contains text fields that contain commas and are not surrounded by a text qualifier (usually ") then the file is malformed and cannot be parsed in a bulletproof way. That is,
1,Hello world!,1.414
2,"Goodbye, cruel world!",3.142
can be reliably parsed, but
1,Hello world!,1.414
2,Goodbye, cruel world!,3.142
cannot. However, if you have additional information about the file, e.g., that it should contain three columns
a Long Integer column,
a Short Text column, and
a Double column
then your VBA code could read the file line-by-line and split the string on commas into an array. The first array element would be the Long Integer, the last array element would be the Double value, and the remaining "columns" in between could be concatenated together to reconstruct the string.
As you can imagine, that approach could easily be confounded (e.g., if there was more than one text field that might contain commas). Therefore it is not particularly appealing.
(Also worth noting is that the CSV parser in Access has never been able to properly handle text fields that contain line breaks, but at least we can import those CSV files into Excel and then import into Access from the Excel file.)
TL;DR - If the CSV file contains unqualified text containing commas then the system that produced it is broken and should be fixed.

NIFI - Using one ReplaceText Processor how to add brackets at the beginning and end of each line

I have the following 10000 rows of log file every 5 seconds.
log_datetime1 host_name1 log_message1
log_datetime2 host_name2 log_message2
log_datetime3 host_name3 log_message3
I want to send them to kudu or parquet table as the following JSON
{"cureent_datetime":"datetime", "log_data":"log_datetime1 host_name1 log_message1"}
{"cureent_datetime":"datetime", "log_data":"log_datetime2 host_name2 log_message2"}
{"cureent_datetime":"datetime", "log_data":"log_datetime3 host_name3 log_message3"}
Currently I'm using Two ReplaceText Processors. One to add the
{"cureent_datetime":"datetime", "log_data":" at the beginning of each line of 10000 rows log file and the second one to add "} at the end of each line.
Was wondering if I could do the both step in one ReplaceText proecssor.
Using the search pattern (.+)(?=\n) and the replacement pattern {"current_datetime":"datetime", "log_data":"$1"} will result in the desired output. The search pattern looks for text which is followed by a newline, and the replacement includes the capture group inside the templated JSON structure.

Reading a .dat file in Julia, issues with variable delimeter spacing

I am having issues reading a .dat file into a dataframe. I think the issue is with the delimiter. I have included a screen shot of what the data in the file looks like below. My best guess is that it is tab delimited between columns and then new-line delimited between rows. I have tried reading in the data with the following commands:
df = CSV.File("FORCECHAIN00046.dat"; header=false) |> DataFrame!
df = CSV.File("FORCECHAIN00046.dat"; header=false, delim = ' ') |> DataFrame!
My result either way is just a DataFrame with only one column including all the data frome each column concatenated into one string. I tried to even specify the types with the following code:
df = CSV.File("FORCECHAIN00046.dat"; types=[Float64,Float64,Float64,Float64,
Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64]) |> DataFrame!
And I received an the following error:
┌ Warning: 2; something went wrong trying to determine row positions for multithreading; it'd be very helpful if you could open an issue at https://github.com/JuliaData/CS
V.jl/issues so package authors can investigate
I can work around this by uploading it into google sheets and then downloading a csv, but I would like to find a way to make the original .dat file work.
Part of the issue here is that .dat is not a proper file format—it's just something that seems to be written out in a somewhat human-readable format with columns of numbers separated by variable numbers of spaces so that the numbers line up when you look at them in an editor. Google Sheets has a lot of clever tricks built in to "do what you want" for all kinds of ill-defined data files, so I'm not too surprised that it manages to parse this. The CSV package on the other hand supports using a single character as a delimiter or even a multi-character string, but not a variable number of spaces like this.
Possible solutions:
if the files aren't too big, you could easily roll your own parser that splits each line and then builds a matrix
you can also pre-process the file turning multiple spaces into single spaces
That's probably the easiest way to do this and here's some Julia code (untested since you didn't provide test data) that will open your file and convert it to a more reasonable format:
function dat2csv(dat_path::AbstractString, csv_path::AbstractString)
open(csv_path, write=true) do io
for line in eachline(dat_path)
join(io, split(line), ',')
println(io)
end
end
return csv_path
end
function dat2csv(dat_path::AbstractString)
base, ext = splitext(dat_path)
ext == ".dat" ||
throw(ArgumentError("file name doesn't end with `.dat`"))
return dat2csv(dat_path, "$base.csv")
end
You would call this function as dat2csv("FORCECHAIN00046.dat") and it would create the file FORCECHAIN00046.csv, which would be a proper CSV file using commas as delimiters. That won't work well if the files contain any values with commas in them, but it looks like they are just numbers, in which case it should be fine. So you can use this function to convert the files to proper CSV and then load that file with the CSV package.
A little explanation of the code:
the two-argument dat2csv method opens csv_path for writing and then calls eachline on dat_path to read one line form it at a time
eachline strips any trailing newline from each line, so each line will be bunch of numbers separated by whitespace with some leading and/or trailing whitespace
split(line) does the default splitting of line which splits it on whitespace, dropping any empty values—this leaves just the non-whitespace entries as strings in an array
join(io, split(line), ',') joins the strings in the array together, separated by the , character and writes that to the io write handle for csv_path
println(io) writes a newline after that—otherwise everything would just end up on a single very long line
the one-argument dat2csv method calls splitext to split the file name into a base name and an extension, checking that the extension is the expected .dat and calling the two-argument version with the .dat replaced by .csv
Try using the readdlm function in DelimitedFiles library, and convert to DataFrame afterwards:
using DelimitedFiles, DataFrames
df = DataFrame(readdlm("FORCECHAIN00046.dat"), :auto)

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS, then abstract a Hive table on top it using SerDe. I wonder though what would be the appropriate delimiter per JSON entry in a file? Is it new line ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
How about if we encounter a new line character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
Indented JSON won't work well with hadoop/hive, since to distribute processing, hadoop must be able to tell when a records ends, so it can split processing of a file with N bytes with W workers in W chunks of size roughly N/W.
The splitting is done by the particular InputFormat that's been used, in case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having other "\n" around would confuse Hadoop and it will give you incomplete records.
As an alternative, I wouldn't recommend it, but if you really wanted you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through hadoop/hive, using a character that won't be in JSON (for instance, \001 or CTRL-A is commonly used by Hive as a field delimiter) but that can be tricky since it has to also be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies/uses the file on HDFS must be aware of the delimiter, or they won't be able to parse it correctly, and will need special code to do it, while keeping "\n" as a delimiter, the files will still be normal text files and can be used by other tools.
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde

How do I deal with commas/tabs that are part of the data in CSV/TSV in MarkLogic

I am trying to load a CSV file that have commas as part of the data into MarkLogic using RecordLoader. The data loads but MarkLogic takes commas that are part of the data as delimiters. I tried to escape commas by using backslashes but didn't work and the data remains dirty with the backslashes. I thought about replacing the data commas with other symbols so that I can change them back to commas after I load but I don't know if there is a way to modify the data after I load and I would have to reposition the XML tags line by line.
How can I load a CSV/TSV file and keep the commas/tabs that are part of the data as part of the data and not as delimiters?
Thanks in advance.
RecordLoader's DelimitedDataLoader doesn't support any escaping today. If you want to add it as a patch, https://github.com/marklogic/recordloader/blob/master/src/java/com/marklogic/recordloader/xcc/DelimitedDataLoader.java#L102 is the place to start looking at the code.
Although you asked about RecordLoader, you could also use the MarkLogic Content Pump. See Creating Documents from Delimited Text Files.