Parse JSON in Google Refine

I'm trying to pull out specific elements from results from the Data Science Toolkit coordinates2politics API, using Google Refine.
Here is sample cell #1:
[{"politics":[
{"type":"admin2","friendly_type":"country","code":"usa","name":"United States"},
{"type":"admin6","friendly_type":"county","code":"55_025","name":"Dane"},
{"type":"constituency","friendly_type":"constituency","code":"55_02","name":"Second district, WI"},
{"type":"admin5","friendly_type":"city","code":"55_48000","name":"Madison"},
{"type":"admin5","friendly_type":"city","code":"55_53675","name":"Monona"},
{"type":"admin4","friendly_type":"state","code":"us55","name":"Wisconsin"},
{"type":"neighborhood","friendly_type":"neighborhood","code":"Eastmorland|Madison|WI","name":"Eastmorland"}
],"location":{"longitude":"-89.3259404","latitude":"43.0859191"}}]
I added a column based on this column using this GREL syntax to pull out the county, Dane:
value.parseJson()[0]["politics"][1]["name"]
But when I got to Sample Cell #2, the syntax no longer works because the JSON result is a little different:
[{"politics":[
{"type":"admin2","friendly_type":"country","code":"usa","name":"United States"},
{"type":"constituency","friendly_type":"constituency","code":"55_05","name":"Fifth district, WI"},
{"type":"admin4","friendly_type":"state","code":"us55","name":"Wisconsin"},
{"type":"admin6","friendly_type":"county","code":"55_079","name":"Milwaukee"},
{"type":"admin5","friendly_type":"city","code":"55_84675","name":"Wauwatosa"},
{"type":"constituency","friendly_type":"constituency","code":"55_04","name":"Fourth district, WI"}
],"location":{"longitude":"-88.0075875","latitude":"43.0494572"}}]
Is there some way to sort the JSON or phrase my syntax so that I can find the county in either case?
Update
Here's the magic GREL that allowed me to find elements in the JSON string by name, not just position:
filter(value.parseJson()[0]["politics"], item, item["type"]=="admin6")[0]["name"]

The field named politics is an array, which you return with:
value.parseJson()[0]["politics"]
One element of that array is associated with the county (it's the one whose friendly_type field is "county"). So you need to filter the politics field to find the one whose friendly_type is county, like this:
filter(value.parseJson()[0]["politics"], item, item["friendly_type"]=="county")
That returns an array with one element. You want to get the name field out of that one element, so you need to extract the name of the zeroth array element, making your complete expression:
filter(value.parseJson()[0]["politics"], item, item["friendly_type"]=="county")[0]["name"]
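One caveat worth noting (not part of the original answer): if a cell happens to have no county entry at all, indexing [0] into the empty filtered array will produce an error. A guarded variant, assuming GREL's if() and length() behave as documented, would be something like:
if(
  filter(value.parseJson()[0]["politics"], item, item["friendly_type"]=="county").length() > 0,
  filter(value.parseJson()[0]["politics"], item, item["friendly_type"]=="county")[0]["name"],
  "no county"
)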

Related

How to convert Excel to JSON in Azure Data Factory?

I want to convert an Excel file, which contains two tables in a single worksheet, into this JSON format:
{
  "parent": {
    "P1": "x1",
    "P2": "y1",
    "P3": "z1"
  },
  "children": [
    {"C1": "a1", "C2": "b1", "C3": "c1", "C4": "d1"},
    {"C1": "a2", "C2": "b2", "C3": "c2", "C4": "d2"},
    ...
  ]
}
And then post the JSON to a REST endpoint.
How do I perform the mapping and post to the REST service?
Also, it appears that I need to sink the JSON to a physical JSON file before I can post it as a payload to the REST service - is this physical sink step necessary, or can it be held in memory?
I cannot use the Lookup activity to read in the Excel file because it is limited to 5,000 rows and 4 MB.
I managed to do it in ADF. The solution is a bit long, but you could also use Azure Functions to do it programmatically.
Here is a quick demo that I built:
The main idea is to split the data, add headers as requested, and then re-join the data and add the relevant keys, like parent and children.
ADF:
Added a Conditional Split to split the data.
Added a surrogate key for each table.
Filtered the first row to get rid of the headers in the CSV.
Mapped the children/parent columns: renamed columns using a Derived Column activity.
Added a constant value in the children data flow so I can aggregate by it and convert the CSV into a complex data type.
childrenArray: in a Derived Column, added a subcolumn to a new column named children, and as values I added the relevant columns.
Aggregated the children JSONs by using the constant value.
In the parents data flow: after mapping columns, I created JSONs using a Derived Column.
Joined the children array and the parent JSONs into one table so it would be converted to the requested JSON.
Wrote to a cached sink (here you can do the POST request instead of writing to a sink).
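To make the two key steps concrete, here is a rough sketch of the expressions involved, written in the ADF Data Flow expression language. The column names C1-C4 are taken from the question's sample JSON and are illustrative only, not from the original solution:
In the childrenArray Derived Column, build a complex (struct) value per row:
children = @(C1=C1, C2=C2, C3=C3, C4=C4)
In the Aggregate activity, grouped by the constant value, collect those structs into an array:
children = collect(children)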
DataFlow:
Activities:
Conditional Split:
AddSurrogateKey:
(The same applies to the parents data flow; just change the name of the incoming stream, as shown in the data flow above.)
FilterFirstRow:
MapChildrenColumns:
MapParentColumns:
AddConstantValue:
ParentsJson:
Here I added a subcolumn in the Expression Builder and set the column name as the value; this builds the parent JSON.
ChildrenArray:
Again in a Derived Column, I added a column named "children" and added the relevant columns in the Expression Builder.
Aggregate:
The purpose of this activity is to aggregate the children JSONs and build the array; without it you will not get an array. The aggregation function is collect().
Join Activity:
Here I added an outer join to combine the parent JSONs and the children array.
Select Relevant Columns:
Output:

Extract comma-separated values from JSON Records within a List with PowerQuery

As part of a tool I am creating for my team I am connecting to an internal web service via PowerQuery.
The web service returns nested JSON, and I have trouble parsing the JSON data to the format I am looking for. Specifically, I have a problem with extracting the content of records in a column to a comma separated list.
The data
As you can see, the data contains details related to a specific "race" (race_id). What I want to focus on is the information in driver_codes, which is a List of Records. The number of records varies from 0 to 4, and each record is structured as id: 50000 (where 50000 could be any 5-digit number). So it could be:
id: 10000
id: 20000
id: 30000
As requested, an example snippet of the raw data (shown here as XML; an answer below gives the JSON equivalent):
<race>
  <race_id>ABC123445</race_id>
  <begin_time>2018-03-23T00:00:00Z</begin_time>
  <vehicle_id>gokart_11</vehicle_id>
  <driver_code>
    <id>90200</id>
  </driver_code>
  <driver_code>
    <id>90500</id>
  </driver_code>
</race>
I want it to be structured as:
10000,20000,30000
The problem
When I choose "Extract values" on the column with the list, then I get the following message:
Expression.Error: We cannot convert a value of type Record to type Text.
If I instead choose "Expand to new rows", then duplicate rows are created for each unique driver code. I now have several rows per unique race_id, but what I wanted was one row per unique race_id and a concatenated list of driver codes.
What I have tried
I have tried grouping the data by the race_id, but the operations allowed when grouping data do not include concatenating rows.
I have also tried unpivoting the column, but that leaves me with the same problem: I still get multiple rows.
I have googled (and Stack Overflowed) this issue extensively without luck. It might be that I am using the wrong keywords, however, so I apologize if a duplicate exists.
UPDATE: What I have tried based on the answers so far
I tried Alexis Olson's excellent and very detailed method, but I end up with the following error:
Expression.Error: We cannot convert the value "id" to type Number. Details:
Value=id
Type=Type
The error comes from using either of these lines of M code (one with a List.Transform and one without):
= Table.Group(#"Renamed Columns", {"race_id", "begin_time", "vehicle_id"},
{{"DriverCodes", each Text.Combine([driver_code][id], ","), type text}})
= Table.Group(#"Renamed Columns", {"race_id", "begin_time", "vehicle_id"},
{{"DriverCodes", each Text.Combine(List.Transform([driver_code][id], each Number.ToText(_)), ","), type text}})
NB: if I do not write [driver_code][id] but only [id] then I get another error saying that column [id] does not exist.
Here's the JSON equivalent to the XML example you gave:
{"race": {
"race_id": "ABC123445",
"begin_time": "2018-03-23T00:00:00Z",
"vehicle_id": "gokart_11",
"driver_code": [
{ "id": "90200" },
{ "id": "90500" }
]}}
If you load this into the query editor, convert it to a table, and expand out the Value record, you'll have a single-row table whose driver_code column holds a list of records.
At this point, choose Expand to New Rows, and then expand the id column so that you have one row per id.
At this point, you can apply the trick #mccard suggested. Group by the first columns and aggregate over the last using, say, max.
This last step produces M code like this:
= Table.Group(#"Expanded driver_code1",
{"Name", "race_id", "begin_time", "vehicle_id"},
{{"id", each List.Max([id]), type text}})
Instead of this, you want to replace List.Max with Text.Combine as follows:
= Table.Group(#"Changed Type",
{"Name", "race_id", "begin_time", "vehicle_id"},
{{"id", each Text.Combine([id], ","), type text}})
Note that if your id column is not in the text format, then this will throw an error. To fix this, insert a step before you group rows using Transform tab > Data Type: Text to convert the type. Another option is to use List.Transform inside your Text.Combine like this:
Text.Combine(List.Transform([id], each Number.ToText(_)), ",")
Either way, you should end up with one row per race and a comma-separated list of driver codes.
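For reference, here is a minimal end-to-end M sketch of the same flow, assuming the JSON above has been saved to a local file (the file path and step names are illustrative, not from the original answer):
let
    Source = Json.Document(File.Contents("C:\data\race.json")),
    // The top-level "race" record from the JSON above
    Race = Source[race],
    AsTable = Table.FromRecords({Race}),
    // driver_code is a list of records; expand it to one row per record
    ExpandedList = Table.ExpandListColumn(AsTable, "driver_code"),
    ExpandedIds = Table.ExpandRecordColumn(ExpandedList, "driver_code", {"id"}),
    // Re-group to one row per race with a comma-separated id list
    Grouped = Table.Group(ExpandedIds, {"race_id", "begin_time", "vehicle_id"},
        {{"driver_codes", each Text.Combine([id], ","), type text}})
in
    Grouped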
Another approach is to use the Advanced Editor and change the grouping operation directly in the code.
First, create the grouping using one of the operations available in the menu. For instance, create a column "Sum" using the Sum operation. It will give an error, but it provides the starting code to work on.
Then, open the Advanced Editor and find the code corresponding to the operation. It should be something like:
{{"Sum", each List.Sum([driver_codes]), type text}}
Change it to:
{{"driver_codes", each Text.Combine([driver_codes], ","), type text}}

Best way to parse a big and intricate JSON file with OpenRefine (or R)

I know how to parse JSON cells in OpenRefine, but this one is too tricky for me.
I've used an API to extract the calendars of 4,730 Airbnb rooms, identified by their IDs.
Here is an example of one JSON file: https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until November 2017, I would like to extract the availability of the room (true or false) and its price on that day.
I can't figure out how to parse out this information. I guess it requires a series of nested forEach calls, but I can't find the right way to do this with OpenRefine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionaries, which confuses me.
Any help would be appreciated. If the operation is too difficult in OpenRefine, a solution in R (or Python) would also be fine for me.
Rather than just creating your project as text and working with GREL to parse it out...
The best way is to select the JSON record part that you want to work with using our visual importer wizard for JSON and XML files (you can even use a URL pointing to a JSON file, as in your example). (A video tutorial shows how: https://www.youtube.com/watch?v=vUxdB-nl0Bw )
Select the JSON part that contains the records you want to parse and work with (this can be any repeating part; just select one of them and OpenRefine will extract all the rest).
Limit the number of data rows that you want to load in during creation, or leave the default of all rows.
Click Create Project, and now you're in Rows mode. However, if you think that Records mode might be better suited for the context, just import the project again as JSON and then select the next outside area of the content, perhaps a larger array that contains a key field. In the example, the key field would probably be the date, which is why I would highlight the whole record for a given date. This way OpenRefine will have keys for each record, and Records mode lets you work with them better than Rows mode does.
Feel free to take this example, make it better and even more helpful for all, and add it to our wiki section on How to Use.
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OpenRefine array containing twelve items (one for each month of the year). The items in the OpenRefine array are JSON strings - each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OpenRefine can't store arrays directly in a cell - it has to be a string.
Then use "Edit cells -> Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OpenRefine.
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days.
Then use "Edit cells -> Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from the example URL above, this gives me 441 rows for the single ID, each containing the JSON describing the availability and price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.
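Since you mentioned that Python would also be fine: here is a minimal Python sketch of the same flattening done outside OpenRefine. The calendar_months, days and available fields appear in the expressions above; the date and price field names are assumptions, so inspect one real response and adjust.
import requests

# Fetch one listing's calendar (URL and parameters taken from the question)
url = "https://fr.airbnb.com/api/v2/calendar_months"
params = {
    "key": "d306zoyjsyarp7ifhu67rjxn52tv0t20",
    "currency": "EUR",
    "locale": "fr",
    "listing_id": 4212133,
    "month": 11,
    "year": 2016,
    "count": 12,
    "_format": "with_conditions",
}
data = requests.get(url, params=params).json()

rows = []
for month in data["calendar_months"]:  # one entry per month
    for day in month["days"]:          # one entry per day
        # "date" and "price" are assumed field names; check the real response
        rows.append((day.get("date"), day.get("available"), day.get("price")))

print(rows[:5])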

How to find last item in a repeated structure in bigquery

I have a nested repeated structure; the repeated structure is of variable length. For example, it could be a person object with a repeated structure that holds the cities the person has lived in. I'd like to find the last item in that list, say, to find the current city the person lives in. Is there an easy way to do this? I tried looking at JSONPath functions, but I'm not sure how to use them with "within". Any help please?
1) You can use LAST and WITHIN
SELECT
  FIRST(cell.value) WITHIN RECORD,
  LAST(cell.value) WITHIN RECORD
FROM [publicdata:samples.trigrams]
WHERE ngram = "! ! That"
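Applied to your person/cities example, that would look something like this (the table and field names here are hypothetical):
SELECT
  name,
  LAST(cities.name) WITHIN RECORD AS current_city
FROM [mydataset.people]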
2) Or, if you want something more advanced, you can use POSITION.
POSITION(field) - Returns the one-based, sequential position of field within a set of repeated fields.
You can check the samples from trigrams (click on Details to see the unflattened schema):
https://bigquery.cloud.google.com/table/publicdata:samples.trigrams?pli=1
And when you run POSITION, you get the ordering of that field.
SELECT
  ngram,
  cell.value,
  POSITION(cell.volume_count) AS pos
FROM [publicdata:samples.trigrams]
WHERE ngram = "! ! That"
Now that you have the position, you can query for the last one.
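For example, to keep only the last element of the repeated field, you can compare each value's position against the per-record count (a sketch, untested, using legacy SQL scoped aggregation):
SELECT ngram, val
FROM (
  SELECT
    ngram,
    cell.value AS val,
    POSITION(cell.value) AS pos,
    COUNT(cell.value) WITHIN RECORD AS cnt
  FROM [publicdata:samples.trigrams]
  WHERE ngram = "! ! That"
)
WHERE pos = cnt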

Filter a rest service by category or field

I am using the Extension Library's REST control to provide a JSON data feed. Is it possible to filter by a category or a field with a URL parameter?
I understand that I can use a search string "&search=something", but that can give me erroneous results. I have tried searching for a field equal to some value, but that doesn't seem to work for me.
If I cannot do this with the REST control, is it possible with Domino Data Services?
You can filter by a category or field value in a viewJsonService if you add ?keys=yourValue to the URL.
The REST service returns the same documents as you would get with view.getAllDocumentsByKey("yourValue").
The default is non-exact-match filtering, which means that only the beginning of the column value has to match. If you want an exact match, then add &keysexactmatch=true to the URL, which is the equivalent of view.getAllDocumentsByKey("yourValue", true).
Example:
Assuming we have a view "Forms" with a first sorted column "Form".
With the REST service
<xe:restService
id="restService1"
pathInfo="DocsByForm">
<xe:this.service>
<xe:viewJsonService
viewName="Forms"
defaultColumns="true">
</xe:viewJsonService>
</xe:this.service>
</xe:restService>
and the URL
http://server/database.nsf/RestServices.xsp/DocsByForm?keys=Memo&keysexactmatch=true
we'd get all documents with Form="Memo" as JSON
[
{
"#entryid":"7-D5029CB83351A9A6C1257D820031E927",
"#unid":"D5029CB83351A9A6C1257D820031E927",
"#noteid":"11DA",
"#position":"7",
"#siblings":14,
"#form":"Memo",
"Form":"Memo",
... other columns ...
},
... other documents
]
We'd get the same result if the first column is categorized.