apache nifi - use different separators to process a text file - csv

I have a text file with the following content:
2020-10-19 12:12:00.001;alan;male;{"id":"255","val":"22","type":"1","location":"12530,95823","status":1}
2020-10-19 12:12:00.001;anna;female;{"id":"256","val":"12","type":"1","location":"12140,25630","status":2}
I want to convert this to a CSV file:
first separate the fields by ;, then treat the last column as a JSON object and extract its values. The output needs to look like this:
date,name,gender,id,val,type,location1,location2,status
2020-10-19 12:12:00.001,alan,male,255,22,1,12530,95823,1
2020-10-19 12:12:00.001,anna,female,256,12,1,12140,25630,2
I am a beginner in NiFi and I want to figure out which processors and configuration can do this conversion. I have tried ConvertRecord but could only separate the content by ;.
It would be a great help if anyone could suggest a way to do this.

Not an easy task, but interesting!
I hope the structure is not changing, e.g. the JSON column getting more attributes!
So I would do this:
1 - SplitText by line (one line per split) - remove the header if there is one
2 - ExtractText (create an attribute called body with a value of (?s)(^.*$))
3 - UpdateAttribute with two properties:
csv = ${body:substringBefore(';{'):replace(';',',')}
json = ${body:substringAfter(';{')}
4 - ReplaceText - put {${json} as the Replacement Value, Replacement Strategy: Always Replace
5 - EvaluateJsonPath - extract all the JSON attributes
6 - AttributesToCSV with this attribute list:
csv,id,val,type,location,status
7 - MergeContent - add a Header (your column names), Delimiter Strategy = Text, and a Demarcator of Shift+Enter (newline)
Quite a long walk and maybe not so optimal; you might want to look into Jolt for better performance, but I am too lazy to think about a Jolt spec :).
I have a template for this, but I cannot upload it here as it is too big, and I cannot use any file-sharing service.
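For reference, here is a rough standalone Python sketch (outside NiFi) of what the flow above boils down to, assuming the exact line format from the question; the column names are the ones from the desired output.

import csv
import json
import sys

# Columns in the desired output (the JSON "location" value is split into two columns)
HEADER = ["date", "name", "gender", "id", "val", "type", "location1", "location2", "status"]

def convert(lines, out):
    writer = csv.writer(out)
    writer.writerow(HEADER)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # the first three fields are separated by ';', the remainder is the JSON part
        date, name, gender, json_part = line.split(";", 3)
        obj = json.loads(json_part)
        loc1, loc2 = obj["location"].split(",")
        writer.writerow([date, name, gender, obj["id"], obj["val"], obj["type"], loc1, loc2, obj["status"]])

if __name__ == "__main__":
    convert(sys.stdin, sys.stdout)   # e.g. python convert.py < input.txt > output.csv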
Also, if you have a MySQL DB at hand, you could just load the file as CSV and use the JSON_EXTRACT function:
SELECT JSON_EXTRACT(json_col, "$.id") AS id
FROM your_table

Related

NiFi - Using one ReplaceText processor, how to add brackets at the beginning and end of each line

I get a log file like the following, with 10,000 rows, every 5 seconds.
log_datetime1 host_name1 log_message1
log_datetime2 host_name2 log_message2
log_datetime3 host_name3 log_message3
I want to send them to kudu or parquet table as the following JSON
{"cureent_datetime":"datetime", "log_data":"log_datetime1 host_name1 log_message1"}
{"cureent_datetime":"datetime", "log_data":"log_datetime2 host_name2 log_message2"}
{"cureent_datetime":"datetime", "log_data":"log_datetime3 host_name3 log_message3"}
Currently I'm using two ReplaceText processors: one to add
{"current_datetime":"datetime", "log_data":" at the beginning of each line of the 10,000-row log file, and a second one to add "} at the end of each line.
I was wondering if I could do both steps in one ReplaceText processor.
Using the search pattern (.+)(?=\n) and the replacement pattern {"current_datetime":"datetime", "log_data":"$1"} will result in the desired output. The search pattern looks for text which is followed by a newline, and the replacement includes the capture group inside the templated JSON structure.
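To try the same one-pass substitution outside NiFi, here is a small illustrative check with Python's re module (the "datetime" value is left as the literal placeholder from the question):

import re

log = (
    "log_datetime1 host_name1 log_message1\n"
    "log_datetime2 host_name2 log_message2\n"
    "log_datetime3 host_name3 log_message3\n"
)

# NiFi's $1 back-reference is written \1 in Python's re module
pattern = r"(.+)(?=\n)"
replacement = r'{"current_datetime":"datetime", "log_data":"\1"}'

print(re.sub(pattern, replacement, log))
# {"current_datetime":"datetime", "log_data":"log_datetime1 host_name1 log_message1"}
# {"current_datetime":"datetime", "log_data":"log_datetime2 host_name2 log_message2"}
# {"current_datetime":"datetime", "log_data":"log_datetime3 host_name3 log_message3"}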

convert json text entries to a dataframe in r

I have a text file with a JSON-like structure that contains values for certain variables, as below.
[{"variable1":"111","variable2":"666","variable3":"11","variable4":"aaa","variable5":"0"}]
[{"variable1":"34","variable2":"12","variable3":"78","variable4":"qqq","variable5":"-9"}]
Every line is a new set of values for the same variables 1 through 5. There can be thousands of lines in a text file, but the variables always remain the same. I want to extract variables 1 through 5 along with their values and convert them into a dataframe. Currently I perform these operations in Excel using string manipulation and transpose.
How can I do this in R? Much appreciated. Thanks.
There is a package named jsonlite that you can use.
library("jsonlite")
# each line of the file is a separate JSON array, so parse the lines one by one and bind the rows
df <- do.call(rbind, lapply(readLines("YourPathToTheFile"), fromJSON))
You can find more details in the jsonlite documentation.

Apache Nifi - store lines into 1 file

Using Apache NiFi, I created a flow that reads a JSON file and splits it line by line in order to verify whether the content is correct. After that I have 2 outputs: 1 - for successful lines and 2 - for unsuccessful ones; the output is a JSON file.
At the moment all the lines are stored in separate files, but what I want is to store every "good" line in one file and every "bad" one in another.
What processor should I use?
The RouteText processor was designed for exactly what you are trying to do. It allows you to route lines of text to different relationships based on expressions you create. It bundles the lines from each FlowFile together for each relationship.
You can see the documentation for it here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.RouteText/index.html
You can get an example template (doing almost exactly what you would like to do) using RouteText here: https://github.com/hortonworks-gallery/nifi-templates/blob/master/templates/SplitRouteMergeVsRouteText.xml
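As a rough illustration of the idea (this is not NiFi code), the sketch below does by hand what RouteText does for you: it checks each line and bundles the matching lines in one output and the rest in another. The file names and the JSON-validity check are just assumptions for the example.

import json

def route_lines(in_path, good_path, bad_path):
    # every line is tested; good lines end up together in one file, bad lines in another
    with open(in_path) as src, open(good_path, "w") as good, open(bad_path, "w") as bad:
        for line in src:
            try:
                json.loads(line)      # the "is this line correct?" check
                good.write(line)
            except ValueError:
                bad.write(line)

route_lines("input.json", "good_lines.json", "bad_lines.json")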

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS and then abstract a Hive table on top of using a SerDe. I wonder, though, what would be the appropriate delimiter per JSON entry in a file? Is it a newline ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
What about when we encounter a newline character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
Indented JSON won't work well with Hadoop/Hive, since to distribute processing, Hadoop must be able to tell where a record ends, so it can split the processing of a file of N bytes across W workers into W chunks of size roughly N/W.
The splitting is done by the particular InputFormat that's been used, in case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having other "\n" around would confuse Hadoop and it will give you incomplete records.
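As a toy illustration of that splitting rule (the sample records are made up):

# Each worker i starts at byte i*N/W and skips ahead to the first "\n" it finds,
# so a "\n" inside a record would be mistaken for a record boundary.
data = b'{"id": "entry1", "post": "..."}\n{"id": "entry2", "post": "..."}\n'
N, W = len(data), 2
for i in range(1, W):
    start = i * N // W
    boundary = data.index(b"\n", start) + 1
    print("worker %d starts reading records at byte %d" % (i, boundary))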
As an alternative (I wouldn't recommend it), if you really wanted to you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through Hadoop/Hive, using a character that won't appear in the JSON (for instance \001, i.e. CTRL-A, which is commonly used by Hive as a field delimiter), but that can be tricky since it also has to be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies/uses the file on HDFS must be aware of it, or they won't be able to parse it correctly and will need special code to do so, while with "\n" as the delimiter the files are still normal text files that can be used by other tools.
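To make the "one record per line, with "\n" escaped inside values" recommendation concrete, here is a small Python sketch (the field names are just placeholders taken from the question):

import json

posts = [
    {"id": "entry1", "post": "first line\nsecond line"},  # value contains a newline
    {"id": "entry2", "post": "single line"},
]

# json.dumps escapes the embedded newline as the two characters \ and n,
# so each record occupies exactly one physical line in the file.
with open("posts.json", "w") as out:
    for post in posts:
        out.write(json.dumps(post) + "\n")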
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde

How to read a file where one column data is present in other column using Talend Data Integration

I get data from a CSV format daily.
Example data looks like:
Emp_ID emp_leave_id EMP_LEAVE_reason Emp_LEAVE_Status Emp_lev_apprv_cnt
E121 E121- 21 Head ache, fever, stomach-ache Approved 16
E139 E139_ 5 Attending a marraige of my cousin Approved 03
Here you can see that emp_leave_id and EMP_LEAVE_reason column data is shifted/scattered into the next columns.
The problem is that, using tFileInputDelimited and various reading patterns, I couldn't load the data correctly into my target database. Mainly, I'm not able to read the data correctly with that component in Talend.
Is there a way that I can properly parse this CSV to get my data in the format that I want?
This is probably a TSV file. Not sure about Talend, but uniVocity can parse these files for you:
TsvDataStoreConfiguration tsv = new TsvDataStoreConfiguration("my_TSV_datastore");
tsv.setLimitOfRowsLoadedInMemory(10000);
tsv.addEntities("/some/dir/with/your_files", "ISO-8859-1"); //all files in the given directory path will be accessible as entities.
JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("my_Database", myDataSource);
database.setLimitOfRowsLoadedInMemory(10000);
Univocity.registerEngine(new EngineConfiguration("My_ETL_Engine", tsv, database));
DataIntegrationEngine engine = Univocity.getEngine("My_ETL_Engine");
DataStoreMapping dataStoreMapping = engine.map("my_TSV_datastore", "my_Database");
EntityMapping entityMapping = dataStoreMapping.map("your_TSV_filename", "some_database_table");
entityMapping.identity().associate("Emp_ID", "emp_leave_id").toGeneratedId("pk_leave"); //assumes your database does not keep the original ids.
entityMapping.value().copy("EMP_LEAVE_reason", "Emp_LEAVE_Status").to("reason", "status"); //just copies whatever you need
engine.executeCycle(); //executes the mapping.
Do not use a CSV parser to parse TSV inputs. It won't handle escape sequences properly (such as \t inside a value; you will get the escape sequence instead of a tab character), and it will surely break if your value has a quote in it (a CSV parser will try to find the closing quote character and will keep reading characters until it finds another quote).
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).