Apache NiFi: Creating new column using a condition - csv

I have asked a similar question before, but I wasn't able to solve my problem with that approach. I have a CSV which looks like this:
studentID,regger,age,number
123,west,12,076392367
456,nort,77,098123124
231,west,33,076346325
I want to add a new column and fill it with values based on the data in the number field. This is the logic:
If the first 4 digits of the value in the number column equal "0763", then the new column (named status) must be set to INSIDE; for any other value it must be OUTSIDE.
Following that logic, the output must look like this:
studentID,regger,age,number,status
123,west,12,076392367,INSIDE
456,nort,77,098123124,OUTSIDE
231,west,33,076346325,INSIDE
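In plain Python, the rule I am trying to reproduce in NiFi amounts to this (a rough sketch; the file names are placeholders):

import csv

# Numbers starting with "0763" are INSIDE, everything else is OUTSIDE.
with open("students.csv", newline="") as src, open("students_with_status.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["status"])
    writer.writeheader()
    for row in reader:
        row["status"] = "INSIDE" if row["number"].startswith("0763") else "OUTSIDE"
        writer.writerow(row)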
My Approach
I tried to achieve this by first duplicating the number column into the status column, and then taking the first 4 digits and working from there.
I hope you can suggest a NiFi workflow that makes this possible.

I used the UpdateRecord processor twice and got the results that you want.
Input
I started with your input data.
studentID,regger,age,number
123,west,12,076392367
456,nort,77,098123124
231,west,33,076346325
Process
First, set the UpdateRecord processor as follows:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Replacement Value Strategy: Record Path Value
/status: /number
This creates the new status column with the value of the number column.
Second, route the output of the first processor to another UpdateRecord processor with these options:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Replacement Value Strategy: Literal Value
/status: ${field.value:substring(0,4):equals('0763'):ifElse('INSIDE','OUTSIDE')}
and this will give you the final results.
Be aware that the number column must be treated as a string rather than an integer (it has a leading zero), so in the CSVReader set the Schema Access Strategy option to Use String Fields From Header.
Output
studentID,regger,age,number,status
123,west,12,076392367,INSIDE
456,nort,77,098123124,OUTSIDE
231,west,33,076346325,INSIDE

You can try the logic below:
SplitText ->
ExtractText ->
RouteOnAttribute (add a condition that checks whether the first four digits of the number are 0763)
-----Matched relation --> ReplaceText (extracted attributes from the file + "INSIDE") -> PutFile
-----Unmatched relation --> ReplaceText (extracted attributes from the file + "OUTSIDE") -> PutFile
Hope this helps.

Related

JSON flattening in AWS Glue ETL job creates inferred schema with duplicated columns

I'm relatively new to AWS Glue and using the visual AWS Glue studio at the moment. Kind of a niche issue I'm having here...
Context:
I'm building an ETL job that, among other things, should parse/flatten JSON from a string column and replace it with separate fields in different formats, which I can then select and load into my data warehouse table.
Approach:
I first extract my data from the Glue catalog as a dynamicFrame (in this case only one table).
Then I'm trying to use the approach of unboxing and unnesting.
Let's call that json column data:
from awsglue.dynamicframe import DynamicFrameCollection
from awsglue.transforms import Unbox, UnnestFrame

def transformTable(glueContext, dfc) -> DynamicFrameCollection:
    dyf = dfc.select(list(dfc.keys())[0])  # the single incoming frame
    dyf = Unbox.apply(frame=dyf, path="data", format="json")  # parse the JSON string column
    dyf = UnnestFrame.apply(frame=dyf)  # flatten nested fields into top-level columns
    return DynamicFrameCollection({"TranformedTable": dyf}, glueContext)
(Then I have a step to select the right frame from the frame collection, and then I can apply mapping to my fields and load.)
My issue:
Glue automatically infers the data types of my frame schema (rather successfully),
but it duplicates certain fields into several columns when the data type is unclear (similar to make_cols in the resolveChoice method). For example, I end up with two fields in the output schema, price_int and price_double, where price_int contains only the values that happened to be round numbers and is null everywhere else, etc.
So it seems like the default behavior of this method is to split columns in case of data type doubt (make_cols).
I understand that I could write a resolveChoice for each field, but with this approach they are already split in separate columns in the output schema.
Note: There are dozens of fields in this json, so I'm trying to devise a blanket solution that automatically makes all the fields of the json available in the schema to select and map in the next step, and avoid having to add one line of code for each field I want to extract. (And the json structure will grow with new fields in the future, so I'm trying to limit future ETL maintenance...)
Questions/help needed:
Any idea if there's a way to change this default behavior (like in the resolveChoice method)?
Alternatively, is there a way to apply a kind of default resolveChoice to all problematic fields from the JSON unboxing? For instance, I could force all problematic fields into string (similar to 'project:string'), and then reformat them if needed in the applyMapping step (see the sketch after this list). But resolveChoice seems to need to be applied field by field...
What's a different/better approach I could try? I would like to keep it as dynamic/automated as possible... e.g.:
I think I could maybe extract specific fields from the JSON line by line, but I'm not sure how (looks like the Unbox method is already splitting columns by format). And as explained, it's dozens of fields and growing... so it requires updating the code regularly, instead of just ticking boxes in the list of available fields.
The Relationalize method could be an option, but it creates distinct frames and this quickly becomes much more complex (there are actually several columns with JSON, all of which need to be flattened...).
Creating crawlers or classifiers that run automatically on a regular schedule to extract the schema from that specific string column could be an option as well...
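To illustrate the kind of blanket resolution I have in mind, here is a rough, untested sketch; whether resolveChoice accepts a global choice such as "cast:string" at this point is exactly what I'm unsure about:

from awsglue.dynamicframe import DynamicFrameCollection

def transformTable(glueContext, dfc) -> DynamicFrameCollection:
    dyf = dfc.select(list(dfc.keys())[0])
    # Hypothetical blanket resolution: cast every ambiguous (choice) field to string
    # instead of letting it split into price_int / price_double style columns.
    dyf = dyf.resolveChoice(choice="cast:string")
    return DynamicFrameCollection({"TranformedTable": dyf}, glueContext)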
Thanks in advance!

AWS glue: crawler does not identify metadata when CSV contains string and timestamp/date values

I have come across an issue when a CSV is used as input to a crawler:
the crawler doesn't identify the column headers when all the data in the CSV is in string format.
#P1 Headers are displayed as col0, col1, ..., colN.
#P2 The actual column names are treated as data.
#P3 Column datatypes in the metadata are shown as string even when the CSV dataset contains date/timestamp values.
If we use a custom (CSV) classifier, we have to specify the column headers manually.
#P2 is then covered, i.e. the column names are no longer treated as data; however,
#P1 remains the same: the column headers are still displayed as col0, col1, ..., colN.
There are three things I want to fix to achieve the expected result:
A CSV containing only strings should show the actual column names instead of col0, col1, ..., colN.
The metadata of the generated table should be shown correctly (i.e. date/timestamp, string) once it is crawled by the crawler.
If a custom classifier is used, we need to list the column header names manually in the classifier, and the result is still not satisfactory.
I need a generic solution instead of manual intervention.
I have gone through this document: here.
If anyone has already implemented a solution, please help.
I found a solution to one of the above points: the headers, i.e. the first line of the CSV, are recognized by using 'Has heading' in the CSV classifier.
However, solutions for the following are yet to be figured out:
The metadata of the CSV file is shown as string even if a column contains timestamp/date values; the crawler reads these datatypes as string.
The custom classifier needs manual intervention; I have listed all the column names in the classifier. Is there a generic solution?
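For reference, the same kind of CSV classifier can also be defined programmatically; a minimal boto3 sketch (the classifier name is a placeholder), although it still only tells the crawler about the header row, not about date/timestamp types:

import boto3

glue = boto3.client("glue")
# CSV classifier that treats the first row as the header,
# so the crawler stops generating col0, col1, ... names.
glue.create_classifier(
    CsvClassifier={
        "Name": "my-csv-classifier",  # placeholder name
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)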
If you are using pandas to write the dataframe, then to avoid getting column names like col1, col2 and so on, call to_csv with the parameter
index_label='index', such as:
df.to_csv('my_file.csv', index_label='index')

Apache NiFi: Mapping an external file to get a new column

I was following the answer to this Stack Overflow question, but I couldn't get the output I was expecting. I have a sample CSV which looks like this:
id,name,city
12,Jimmy,Ontario
33,Kimmel,York
Every city has a unique code, which I have stored in another CSV. This is what my mapping CSV looks like (in the real txt file the two values in each row are separated by a tab):
California 5435
Ontario 2342
York 3456
The final output must be like the following:
id,name,city,code
12,Jimmy,Ontario,2342
33,Kimmel,York,3456
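In plain terms, what I am after is just a join on the city column; a rough Python sketch of the intended result (the file names are placeholders):

import pandas as pd

students = pd.read_csv("students.csv")  # placeholder name for the id,name,city file
codes = pd.read_csv("city_codes.txt", sep="\t", names=["city", "code"])  # tab-separated mapping

# Add the code column by joining on city.
result = students.merge(codes, on="city", how="left")
result.to_csv("students_with_codes.csv", index=False)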
This CSV has much more data, so the replacement cannot be achieved with the ReplaceText processor; it can only be done with the ReplaceTextWithMapping processor.
I have followed the exact steps from the answer to that question, but ReplaceTextWithMapping does not seem to work as expected: a new column is created successfully, but it just contains the same content as the city column, not the codes I want.
I would greatly appreciate an answer I can follow to finally get the desired output using the ReplaceTextWithMapping processor.
I could make it work, more or less, using LookupRecord. The overall flow looks like this:
GenerateFlowFile:
LookupRecord:
Configure CSVReader and CSVRecordSetWriter to treat the first line as the header line; leave the other properties at their defaults. Configure CSVRecordLookupService:
The mapping file has following content:
city,code
Ontario,2342
California,5435
The LookupRecord processor will take the city value and look up the matching record from the mapping file. It extracts the code and adds the field to the current record. Result:
As you can see, the code has been added to the CSV file. However, it is enclosed in an object, even though I selected the Insert Record Fields setting. I could not solve this problem; it could be another question to ask explicitly.

Extract comma-separated values from JSON Records within a List with PowerQuery

As part of a tool I am creating for my team I am connecting to an internal web service via PowerQuery.
The web service returns nested JSON, and I have trouble parsing the JSON data to the format I am looking for. Specifically, I have a problem with extracting the content of records in a column to a comma separated list.
The data
As you can see, the data contains details related to a specific "race" (race_id). What I want to focus on is the information in driver_codes, which is a List of Records. The number of records varies from 0 to 4, and each record is structured as id: 50000 (where 50000 could be any 5-digit number). So it could be:
id: 10000
id: 20000
id: 30000
As requested, an example snippet of the raw JSON:
<race>
<race_id>ABC123445</race_id>
<begin_time>2018-03-23T00:00:00Z</begin_time>
<vehicle_id>gokart_11</vehicle_id>
<driver_code>
<id>90200</id>
</driver_code>
<driver_code>
<id>90500</id>
</driver_code>
</race>
I want it to be structured as:
10000,20000,30000
The problem
When I choose "Extract values" on the column with the list, then I get the following message:
Expression.Error: We cannot convert a value of type Record to type
Text.
If I instead choose "Expand to new rows", then duplicate rows are created for each unique driver code. I now have several rows per unique race_id, but what I wanted was one row per unique race_id and a concatenated list of driver codes.
What I have tried
I have tried grouping the data by the race_id, but the operations allowed when grouping data do not include concatenating rows.
I have also tried unpivoting the column, but that leaves me with the same problem: I still get multiple rows.
I have googled (and Stack Overflowed) this issue extensively without luck. It might be that I am using the wrong keywords, however, so I apologize if a duplicate exists.
UPDATE: What I have tried based on the answers so far
I tried Alexis Olson's excellent and very detailed method, but I end up with the following error:
Expression.Error: We cannot convert the value "id" to type Number. Details:
Value=id
Type=Type
The error comes from using either of these lines of M code (one with a List.Transform and one without):
= Table.Group(#"Renamed Columns", {"race_id", "begin_time", "vehicle_id"},
{{"DriverCodes", each Text.Combine([driver_code][id], ","), type text}})
= Table.Group(#"Renamed Columns", {"race_id", "begin_time", "vehicle_id"},
{{"DriverCodes", each Text.Combine(List.Transform([driver_code][id], each Number.ToText(_)), ","), type text}})
NB: if I do not write [driver_code][id] but only [id] then I get another error saying that column [id] does not exist.
Here's the JSON equivalent to the XML example you gave:
{"race": {
"race_id": "ABC123445",
"begin_time": "2018-03-23T00:00:00Z",
"vehicle_id": "gokart_11",
"driver_code": [
{ "id": "90200" },
{ "id": "90500" }
]}}
If you load this into the query editor, convert it to a table, and expand out the Value record, you'll have a table that looks like this:
At this point, choose Expand to New Rows, and then expand the id column so that your table looks like this:
At this point, you can apply the trick #mccard suggested. Group by the first columns and aggregate over the last using, say, max.
This last step produces M code like this:
= Table.Group(#"Expanded driver_code1",
{"Name", "race_id", "begin_time", "vehicle_id"},
{{"id", each List.Max([id]), type text}})
Instead of this, you want to replace List.Max with Text.Combine as follows:
= Table.Group(#"Changed Type",
{"Name", "race_id", "begin_time", "vehicle_id"},
{{"id", each Text.Combine([id], ","), type text}})
Note that if your id column is not in the text format, then this will throw an error. To fix this, insert a step before you group rows using Transform Tab > Data Type: Text to convert the type. Another option is to use List.Transform inside your Text.Combine like this:
Text.Combine(List.Transform([id], each Number.ToText(_)), ",")
Either way, you should end up with this:
An approach would be to use the Advanced Editor and change the operation done when grouping the data directly there in the code.
First, create the grouping using one of the operations available in the menu. For instance, create a column "Sum" using the Sum operation. It will give an error, but we should get the starting code to work on.
Then, open the Advanced Editor and find the code corresponding to the operation. It should be something like:
{{"Sum", each List.Sum([driver_codes]), type text}}
Change it to:
{{"driver_codes", each Text.Combine([driver_codes], ","), type text}}

Best way to parse a big and intricate JSON file with OpenRefine (or R)

I know how to parse JSON cells in OpenRefine, but this one is too tricky for me.
I've used an API to extract the calendars of 4730 Airbnb rooms, identified by their IDs.
Here is an example of one Json file : https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day from now until November 2017, I would like to extract the availability of the room (true or false) and its price on that day.
I can't figure out how to parse out this information. I guess it requires a series of nested forEach calls, but I can't find the right way to do this with OpenRefine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionaries, which confuses me.
Any help would be appreciated. If the operation is too difficult in OpenRefine, a solution in R (or Python) would also be fine for me.
Rather than just creating your Project as text, and working with GREL to parse out...
The best way is to just select the JSON record part that you want to work with using our visual importer wizard for JSON and XML files (you can even use a URL pointing to a JSON file, as in your example). (A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw )
Select the JSON part that contains your records that you want to parse and work with (this can be any repeating part, just select one of them and OpenRefine will extract all the rest)
Limit the amount of data rows that you want to load in during creation, or leave default of all rows.
Click Create Project and now you're in Rows mode. However, if you think that Records mode might be better suited for the context, just import the project again as JSON and then select the next outside area of the content, perhaps a larger array that contains a key field, etc. In the example, the key field would probably be the Date, which is why I highlight the whole record for a given date. This way OpenRefine will have keys for each record, and Records mode lets you work with them better than Rows mode.
Feel free to take this example and make it better and even more helpful for everyone, and add it to our Wiki section on How to Use.
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OpenRefine (OR) array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days.
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from the example URL above, this gives me 441 rows for the single ID, each containing the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.
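Since the question mentions that a Python solution would also be acceptable, here is a rough sketch outside OpenRefine; the exact field names inside each day object (date, available, price) are assumptions and should be checked against the real payload:

import requests

# Same endpoint as in the question; listing_id would be varied per room ID.
url = ("https://fr.airbnb.com/api/v2/calendar_months"
       "?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr"
       "&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions")

data = requests.get(url).json()
rows = []
for month in data["calendar_months"]:  # one entry per month
    for day in month["days"]:  # one entry per day
        # "date", "available" and "price" are assumed field names; inspect the payload to confirm.
        rows.append((day.get("date"), day.get("available"), day.get("price")))

for date, available, price in rows:
    print(date, available, price)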