I have a bunch of JSON files, each with an array of column names and a separate array for the rows.
I want a dynamic way of retrieving the column names and merging them with the rows for each JSON file.
I've been playing around with derived columns and column patterns, but I'm struggling to get it working.
I want the column names from [data.columns.shortText] and the values from each corresponding [data.rows.value], matched by position.
Example format
{
  "messages":{
  },
  "data":{
    "columns":[
      {
        "columnName":"SelectionCriteria1",
        "shortText":"Case no."
      },
      {
        "columnName":"SelectionCriteria2",
        "shortText":"Period for periodical values",
      },
      {
        "columnName":"SelectionCriteria3",
        "shortText":"Location"
      },
      {
        "columnName":"SelectionCriteriaAggregate",
        "shortText":"Value"
      }
    ],
    "rows":[
      [
        {
          "value":"23523"
        },
        {
          "value":12342349
        },
        {
          "value":"234234",
          "code":3342
        },
        {
          "value":234234234
        }
      ]
    ]
  }
}
First, you need to fix your JSON data: there is an extra comma in the second object of the columns array, and in rows some values are integers while others are strings. When I tried to parse the file in ADF as-is, I got an error.
I don't quite understand why you want to merge by position, because you would normally have more rows than columns, and if you get 5 rows and 3 columns you will get an error.
Here is my approach to your problem:
The main idea is to add an index column to both arrays and join them with an inner join on that index.
Created a source for the data (I used two, but you can use one to simplify your data flow).
Added a Select activity to select the relevant arrays from the data.
Flattened the arrays (in order to add an index column).
Added the index using a Rank activity (please read up on rank and dense rank and the difference between the two).
Added a Join activity: an inner join on the index column.
Added a Select activity to remove the index column from the result.
Saved the output to a sink.
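The same index-and-join idea can be sketched outside ADF in plain Python. This is only an illustration of the logic, not a data flow definition: the function name merge_columns_with_rows is hypothetical, and the input is assumed to be the corrected JSON (without the extra comma) already parsed into a dict.

```python
def merge_columns_with_rows(doc):
    """Pair each row cell with the column shortText at the same position."""
    columns = doc["data"]["columns"]
    # Index the column names: position -> shortText (the "rank" step on columns).
    names = dict(enumerate(col["shortText"] for col in columns))
    records = []
    for row in doc["data"]["rows"]:
        # Index the cells of this row: position -> value (the "rank" step on rows).
        cells = dict(enumerate(cell["value"] for cell in row))
        # Inner join on the position: positions missing on either side are dropped.
        records.append({names[i]: cells[i] for i in names if i in cells})
    return records


# Hypothetical two-column sample in the same shape as the question's JSON.
sample = {
    "data": {
        "columns": [{"shortText": "Case no."}, {"shortText": "Location"}],
        "rows": [[{"value": "23523"}, {"value": "234234", "code": 3342}]],
    }
}
print(merge_columns_with_rows(sample))
```

Because the join is an inner join on the position, a row with more cells than there are columns silently loses the extra cells instead of raising an error.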
[Screenshots not included: the JSON data I worked with, the data flow, and the settings of the SelectRows, Flatten, Rank and Join activities.]
Please check these links:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expressions-usage#mapAssociation
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-map-functions
I'm starting to explore the JSON1 library for sqlite and have been so far successful in the basic queries I've created. I'm now looking to create a more complicated query that pulls data from multiple levels.
Here's the example JSON object I'm starting with (and most of the data is very similar).
{
  "height": 140.0,
  "id": "cp",
  "label": {
    "bind": "cp_label"
  },
  "type": "color_picker",
  "user_data": {
    "my_property": 2
  },
  "uuid": "948cb959-74df-4af8-9e9c-c3cb53ac9915",
  "value": {
    "bind": "cp_color"
  },
  "width": 200.0
}
This json object is buried about seven levels deep in a json structure and I pulled it from the larger json construct using an sql statement like this:
SELECT value FROM forms, json_tree(forms.formJSON, '$.root')
WHERE type = 'object'
AND json_extract(value, '$.id') = #sControlID
-- In this example, #sControlID is a variable holding the `id` value we're looking for, which is 'cp'
But what I really need to pull from this object are the following:
the value from key type ("color_picker" in this example)
the values from keys bind ("cp_color" and "cp_label" in this example)
the keys value and label (which have values of {"bind":"<string>"} in this example)
For that last item, the key name (value and label in this case) can be any number of keywords, but no matter the keyword, the value will be an object of the form {"bind":"<some_string>"}. Also, there could be multiple keys that have a bind object associated with them, and I'd need to return all of them.
For the first two items, the keywords will always be type and bind.
With the json example above, I'd ideally like to retrieve two rows:
type          key    value
color_picker  value  cp_color
color_picker  label  cp_label
When I use json_extract methods, I end up retrieving the object {"bind":"cp_color"} from the json_tree table, but I also need to retrieve the data from the parent object. I feel like I need to do some kind of union, but my attempts have so far been unsuccessful. Any ideas here?
Note: if the {"bind":"<string>"} object doesn't exist as a child of the parent object, I don't want any rows returned.
Well, I was on the right track and eventually figured it out. I created a separate query for each of the items I was looking for, then INNER JOINed the json_tree tables from those queries to have all the required fields available. Then I used json_extract to pull the required data from each of the JSON fields I needed. In the end, it gave me exactly what I was looking for, though I'm sure it could be written more efficiently.
For anyone interested, this is what the final query ended up looking like:
SELECT IFNULL(json_extract(parent.value, '$.type'), '_window_'),
       child.key,
       json_extract(child.value, '$.bind')
FROM (
    SELECT json_tree.*
    FROM nui_forms, json_tree(nui_forms.formJSON, '$')
    WHERE type = 'object'
      AND json_extract(nui_forms.formJSON, '$.id') = #sWindowID
) parent
INNER JOIN (
    SELECT json_tree.*
    FROM nui_forms, json_tree(nui_forms.formJSON, '$')
    WHERE type = 'object'
      AND json_extract(value, '$.bind') != 'NULL'
      AND json_extract(nui_forms.formJSON, '$.id') = #sWindowID
) child ON child.parent = parent.id;
If you have any tips on reducing its complexity, feel free to comment!
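One possible simplification, sketched here with Python's sqlite3 module so it can be run on its own, is to walk the tree once in a CTE and join on json_tree's own id, parent, key and atom columns, which avoids json_extract entirely. This is a sketch under assumptions, not a drop-in replacement: the table is called forms as in the question, the names find_bindings, form_id, control_type, bind_key and bind_value are made up for the example, the filter on the window id is left out, and parents without a type key are skipped rather than reported as '_window_'. It also assumes a SQLite build with the JSON functions available.

```python
import json
import sqlite3

QUERY = """
WITH tree AS (
    -- One row per JSON node, tagged with the form it came from.
    SELECT forms.rowid AS form_id, json_tree.*
    FROM forms, json_tree(forms.formJSON)
)
SELECT owner_type.atom AS control_type,
       holder.key      AS bind_key,
       leaf.atom       AS bind_value
FROM tree AS leaf
JOIN tree AS holder                      -- the {"bind": ...} object
  ON holder.form_id = leaf.form_id AND holder.id = leaf.parent
JOIN tree AS owner_type                  -- the sibling "type" entry of that object
  ON owner_type.form_id = holder.form_id
 AND owner_type.parent = holder.parent
 AND owner_type.key = 'type'
WHERE leaf.key = 'bind' AND holder.type = 'object'
"""


def find_bindings(form_json):
    """Return (type, key, bind value) rows for every bind object in form_json."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE forms (formJSON TEXT)")
    conn.execute("INSERT INTO forms VALUES (?)", (json.dumps(form_json),))
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


control = {
    "id": "cp",
    "label": {"bind": "cp_label"},
    "type": "color_picker",
    "user_data": {"my_property": 2},
    "value": {"bind": "cp_color"},
}
for row in find_bindings({"root": {"children": [control]}}):
    print(row)
```

An object without any bind child produces no rows, which matches the requirement stated in the question.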
A sample of the JSON is as shown below:
{
  "AN": {
    "dates": {
      "2020-03-26": {
        "delta": {
          "confirmed": 1
        },
        "total": {
          "confirmed": 1
        }
      }
    }
  },
  "KA": {
    "dates": {
      "2020-03-09": {
        "delta": {
          "confirmed": 1
        },
        "total": {
          "confirmed": 1
        }
      },
      "2020-03-10": {
        "delta": {
          "confirmed": 3
        },
        "total": {
          "confirmed": 4
        }
      }
    }
  }
}
I would like to load it into a DataFrame, such that the state names (AN, KA) are represented as Row names, and the dates and nested entries are present as Columns.
Any tips to achieve this would be very much appreciated. [I am aware of json_normalize, however I haven't figured out how to work it out yet.]
The output I am expecting, is roughly as shown below:
Can you update your post with the DataFrame you have in mind? It'll be easier to understand what you want.
Also, sometimes it's better to reshape your data if you can't make it work the way it is now.
Update:
Following your update here's what you can do.
You need to reshape your data. As I said, when you can't achieve what you want, it is best to look at the problem from another point of view. For instance (judging from the sample you shared), the 'dates' key is meaningless, as the keys below it are already dates and there are no other keys at the same level.
One way to achieve what you want is to use a MultiIndex; it will help you group your data the way you want. To use it, you can for instance create all the indices you need and store the associated values in a dictionary.
Example:
If the only index you have is ('2020-03-26', 'delta', 'confirmed'), you should have values = {'AN': [1], 'KA': [None]}
Then you only need to create your DataFrame and transpose it.
I gave it a quick try and came up with a piece of code that should work. If you're looking for performance, I don't think this will do the trick.
import pandas as pd

# d is the sample you shared
index = [[], [], []]
values = {}
methods = ['delta', 'total']
steps = ['confirmed', 'recovered', 'tested']
# Get all the dates (sorted, without duplicates)
dates = sorted({date for c in d for date in d[c]['dates']})
for country in d:
    # For each country we create a list containing all 6 values for each date
    # (missing values as None)
    values[country] = []
    for date in dates:
        # When the country does not have a date, entry is empty
        # and every value for that date is None
        entry = d[country]['dates'].get(date, {})
        for method in methods:
            for step in steps:
                # Incrementing indices
                index[0].append(date)
                index[1].append(method)
                index[2].append(step)
                values[country].append(entry.get(method, {}).get(step))
# Removing duplicates introduced because we added the indices
# n_countries times
# 3 is the number of steps
# 2 is the number of methods
number_of_rows = len(steps) * len(methods) * len(dates)
index = [level[:number_of_rows] for level in index]
df = pd.DataFrame(values, index=index).T
Here is what I have for the transposed data frame of my output:
Hope this helps.
You clearly need to reshape your JSON data before loading it into a DataFrame.
Have you tried loading your JSON as a dict?
dataframe = pd.DataFrame.from_dict(JsonDict, orient="index")
The “orient” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
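With JSON this deeply nested, loading the dict directly tends to leave nested dicts inside the cells, so some flattening is still needed first. A minimal sketch of one way to do that, assuming the structure of the sample above (state -> 'dates' -> date -> method -> field): each state becomes a Series keyed by (date, method, field) tuples, which pandas turns into a MultiIndex. The function name states_to_frame is hypothetical.

```python
import pandas as pd


def states_to_frame(d):
    """One row per state, MultiIndex columns of (date, method, field)."""
    flat = {}
    for state, info in d.items():
        # Tuple keys become a three-level MultiIndex on the Series.
        flat[state] = pd.Series({
            (date, method, field): value
            for date, entry in info["dates"].items()
            for method, fields in entry.items()
            for field, value in fields.items()
        })
    # States start out as columns; transpose so they become the row labels.
    return pd.DataFrame(flat).T.sort_index(axis=1)


# Shortened version of the sample from the question.
d = {
    "AN": {"dates": {"2020-03-26": {"delta": {"confirmed": 1},
                                    "total": {"confirmed": 1}}}},
    "KA": {"dates": {"2020-03-09": {"delta": {"confirmed": 1},
                                    "total": {"confirmed": 1}},
                     "2020-03-10": {"delta": {"confirmed": 3},
                                    "total": {"confirmed": 4}}}},
}
df = states_to_frame(d)
print(df)
```

Dates that a state does not have come out as NaN, so the integer counts are stored as floats.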
I have a field in a table called duration that contains a JSON string like this:
{
  "videos": {
    "en":"00:03:11",
    "es":"00:03:11"
  },
  "audios": {
    "en":"00:00:03",
    "es":"00:00:03"
  }
}
Is it possible to execute a query that sums all the values under both the videos and audios keys, across all the languages present? Meaning that, given this JSON structure, I'd like a query to return:
00:06:28
EDIT:
Don't mind the format of the values for the moment, there are ways to sum datetime values in SQL. What I'm struggling with now is to traverse the values in the JSON to actually sum them.
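To make the intended traversal concrete, here is a minimal sketch of the same sum done outside the database, in Python rather than SQL. It assumes the structure shown above (two levels of keys, with HH:MM:SS strings as leaves); the function name total_duration is hypothetical.

```python
import json
from datetime import timedelta


def total_duration(duration_json):
    """Sum every HH:MM:SS leaf under every top-level key of the JSON string."""
    data = json.loads(duration_json)
    total = timedelta()
    for per_language in data.values():      # "videos", "audios", ...
        for hms in per_language.values():   # "en", "es", ...
            hours, minutes, seconds = (int(part) for part in hms.split(":"))
            total += timedelta(hours=hours, minutes=minutes, seconds=seconds)
    secs = int(total.total_seconds())
    return f"{secs // 3600:02d}:{secs % 3600 // 60:02d}:{secs % 60:02d}"


duration = json.dumps({
    "videos": {"en": "00:03:11", "es": "00:03:11"},
    "audios": {"en": "00:00:03", "es": "00:00:03"},
})
print(total_duration(duration))  # 00:06:28
```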
I am trying to create a relationship between two different graphs, using information in a CSV file. I built the query the way I did because of the size of each graph, one being 500k+ and the other 1.5m+.
This is the query I have:
LOAD CSV WITH HEADERS FROM "file:///customers_table.csv" AS row WITH row
MATCH (m:Main) WITH m
MATCH (c:Customers) USING INDEX c:Customers(customer)
WHERE m.ASIN = row.asin AND c.customer = row.customer
CREATE (c)-[:RATED]->(m)
This is the error I receive:
Variable `row` not defined (line 4, column 16 (offset: 164))
"WHERE m.ASIN = row.asin AND c.customer = row.customer"
^
An example of the Main table is:
{
  "ASIN": "0827229534",
  "totalreviews": "2",
  "categories": "2",
  "title": "Patterns of Preaching: A Sermon Sampler",
  "avgrating": "5",
  "group": "Book"
}
And an example of a customer is:
{
  "customer": "A2FMUVHRO76A32"
}
And inside the customers table csv, I have:
Customer, ASIN, rating
A2FMUVHRO76A32, 0827229534, 5
I can't seem to figure out why it's throwing back that error.
The first WITH clause in your query (WITH row) is unnecessary, but you do have to carry row through the second WITH clause (WITH m), because any variable not listed there goes out of scope. This version compiles:
LOAD CSV WITH HEADERS FROM "file:///customers_table.csv" AS row
MATCH (m:Main)
WITH m, row
MATCH (c:Customers) USING INDEX c:Customers(customer)
WHERE m.ASIN = row.asin AND c.customer = row.customer
CREATE (c)-[:RATED]->(m)
The reason is that WITH, in essence, chains two query parts together while limiting the scope to the variables it lists (and in some cases also performs calculations, aggregations, etc.).
Having said that, you do not even need the second WITH clause, you can just omit it and even merge the two MATCH clauses to a single one:
LOAD CSV WITH HEADERS FROM "file:///customers_table.csv" AS row
MATCH (m:Main), (c:Customers) USING INDEX c:Customers(customer)
WHERE m.ASIN = row.asin AND c.customer = row.customer
CREATE (c)-[:RATED]->(m)
I have a table that contains a json type column history, and the structure is this:
[
  {
    "admin_id": "1",
    "process_time": "2017-6-6 14:14:14"
  },
  {
    "admin_id": "2",
    "process_time": "2017-6-6 14:14:14"
  }
]
Each record's history column may contain multiple elements in the array. Now I want to build a query that selects the records whose history array contains a specific id. For example, I want to select all records whose history array has an element with admin_id equal to 1. I don't know how to write this query. Can someone help me? Thanks.