I am looking into using AWS Athena to do queries against a mass of JSON files.
My JSON files have this format (prettyprinted for convenience):
{
  "data": [
    {<ROW1>},
    {<ROW2>},
    ...
  ],
  "foo": [...],
  "bar": [...]
}
The ROWs contained in the "data" array are what should be queried. The rest of the JSON file is unimportant.
Can this be done without modifying the JSON files? If so, how? From what I've been able to find, it looks like the SerDes (or is it Hive itself?) assume one row of output per line of input, which would mean I'm stuck modifying all my JSON files (turning them into JSONL?) before uploading them to S3.
(Athena uses the Hive JSON SerDe and the OpenX JSON SerDe; AFAICT, there is no option to write my own SerDe or file format...)
You can't make the serde do it automatically, but you can achieve what you're after in a query. You can then create a view to simulate a table with the data elements unwrapped.
The way you do this is to use the UNNEST keyword. This produces one new row per element in an array:
SELECT
  foo,
  bar,
  element
FROM my_table, UNNEST(data) AS t(element)
If your JSON looked like this:
{"foo": "f1", "bar": "b1", "data": [1, 2, 3]}
{"foo": "f2", "bar": "b2", "data": [4, 5]}
The result of the query would look like this:
foo | bar | element
----+-----+--------
f1 | b1 | 1
f1 | b1 | 2
f1 | b1 | 3
f2 | b2 | 4
f2 | b2 | 5
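If it helps to see the one-row-per-array-element expansion outside of Athena, the same behavior can be sketched in pandas with `explode` (purely an illustration, not Athena code):

```python
import pandas as pd

# The two JSON records from the example above
rows = [
    {"foo": "f1", "bar": "b1", "data": [1, 2, 3]},
    {"foo": "f2", "bar": "b2", "data": [4, 5]},
]

df = pd.DataFrame(rows)
# explode() behaves like UNNEST: one output row per array element,
# with the scalar columns repeated alongside each element
unnested = df.explode("data").rename(columns={"data": "element"})
print(unnested)
```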
Related
I have a table like this (with a jsonb column):
https://dbfiddle.uk/lGuHdHEJ
If I load this json with python into a dataframe:
import pandas as pd
import json
data = {
    "id": [1, 2],
    "myyear": [2016, 2017],
    "value": [5, 9]
}
data = json.dumps(data)
df = pd.read_json(data)
print(df)
I get this result:
id myyear value
0 1 2016 5
1 2 2017 9
How can I get this result directly from the JSON column via SQL in a Postgres view?
Note: This assumes that your id, myyear, and value arrays are consistent and have the same length.
This answer uses PostgreSQL's jsonb_array_elements_text function to explode array elements into rows.
select jsonb_array_elements_text(payload -> 'id') as "id",
       jsonb_array_elements_text(payload -> 'myyear') as "myyear",
       jsonb_array_elements_text(payload -> 'value') as "value"
from main
This gives the following output:
id myyear value
1 2016 5
2 2017 9
That said, storing parallel arrays in a jsonb object is not the best design and could lead to data inconsistencies later. If it's in your control, I'd suggest storing the data so that each property's mapping is explicit. Some suggestions:
You can have separate columns for each property.
If you want to stay with jsonb, consider [{"id": "", "year": "", "value": ""}] instead.
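To see why the equal-length assumption matters: multiple set-returning functions in the same SELECT list advance in lockstep (in PostgreSQL 10+), which is effectively a zip over the arrays. A plain-Python illustration:

```python
import json

payload = json.loads('{"id": [1, 2], "myyear": [2016, 2017], "value": [5, 9]}')

# Several jsonb_array_elements_text calls in one SELECT list pair up
# elements positionally, i.e. effectively a zip():
rows = list(zip(payload["id"], payload["myyear"], payload["value"]))
print(rows)  # [(1, 2016, 5), (2, 2017, 9)]
```

If the arrays had different lengths, PostgreSQL 10+ pads the shorter ones with NULLs, whereas zip() silently truncates, so the analogy holds only for the consistent case assumed above.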
It seems like there should be a function for this in Spark SQL similar to pivoting, but I haven't found any solution for transforming a JSON key into a value. Suppose I have a badly formed JSON (the format of which I cannot change):
{"A long string containing serverA": {"x": 1, "y": 2}}
how can I process it to
{"server": "A", "x": 1, "y": 2}
?
I read the JSONs into a sql.DataFrame and would then like to process them as described above:
val cs = spark.read.json("sample.json")
.???
If you want to use only Spark functions and no UDFs, you can use from_json to parse the JSON into a map (you need to specify a schema). Then you just need to extract the information with Spark functions.
One way to do it is as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = MapType(
  StringType,
  StructType(Array(
    StructField("x", IntegerType),
    StructField("y", IntegerType)
  ))
)

spark.read.text("...")
  .withColumn("json", from_json('value, schema))
  .withColumn("key", map_keys('json).getItem(0))
  .withColumn("value", map_values('json).getItem(0))
  .withColumn("server",
    // Extracting the server name with a regex
    regexp_replace(regexp_extract('key, "server[^ ]*", 0), "server", ""))
  .select("server", "value.*")
  .show(false)
which yields:
+------+---+---+
|server|x |y |
+------+---+---+
|A |1 |2 |
+------+---+---+
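The parse-then-extract steps above are easier to follow stripped of the Spark machinery. Here is the same pipeline in plain Python (an illustration only, not Spark code):

```python
import json
import re

record = '{"A long string containing serverA": {"x": 1, "y": 2}}'

obj = json.loads(record)   # like from_json with a MapType schema
key = next(iter(obj))      # map_keys(...).getItem(0)
value = obj[key]           # map_values(...).getItem(0)

# Same regex idea: grab "server..." then strip the "server" prefix
server = re.search(r"server([^ ]*)", key).group(1)

result = {"server": server, **value}
print(result)  # {'server': 'A', 'x': 1, 'y': 2}
```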
I've a Spark (v.3.0.1) job written in Java that reads Json from Kafka, does some transformation and then writes it back to Kafka. For now, the incoming message structure in Kafka is something like:
{"catKey": 1}. The output from the Spark job that's written back to Kafka is something like: {"catKey":1,"catVal":"category-1"}. The code for processing input data from Kafka goes something as follows:
DataFrameReader dfr = putSrcProps(spark.read().format("kafka"));
for (String key : srcProps.stringPropertyNames()) {
    dfr = dfr.option(key, srcProps.getProperty(key));
}
Dataset<Row> df = dfr.option("group.id", getConsumerGroupId())
    .load()
    .selectExpr("CAST(value AS STRING) as value")
    .withColumn("jsonData", from_json(col("value"), schemaHandler.getSchema()))
    .select("jsonData.*");
// transform df
df.toJSON().write().format("kafka").option("key", "val").save();
I want to change the message structure in Kafka. Now it should be of the format: {"metadata": <whatever>, "payload": {"catKey": 1}}. While reading, we need to read only the contents of payload, so the dataframe stays the same. While writing back to Kafka, I first need to wrap the message in payload and add the metadata. The output has to be of the format: {"metadata": <whatever>, "payload": {"catKey":1,"catVal":"category-1"}}. I've tried manipulating the contents of selectExpr and the from_json call, but no luck so far. Any pointers on how to achieve this would be much appreciated.
To extract the content of payload in your JSON you can use get_json_object. And to create the new output you can use the built-in functions struct and to_json.
Given a DataFrame:
val df = Seq(("""{"metadata": "whatever", "payload": {"catKey": 1}}""")).toDF("value").as[String]
df.show(false)
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|{"metadata": "whatever", "payload": {"catKey": 1}}|
+--------------------------------------------------+
Then build the payload and metadata columns:
val df2 = df
  .withColumn("catVal", lit("category-1")) // whatever your logic is to fill this column
  .withColumn("payload",
    struct(
      get_json_object(col("value"), "$.payload.catKey").as("catKey"),
      col("catVal").as("catVal")
    )
  )
  .withColumn("metadata",
    get_json_object(col("value"), "$.metadata")
  )
  .select("metadata", "payload")
df2.show(false)
+--------+---------------+
|metadata|payload |
+--------+---------------+
|whatever|[1, category-1]|
+--------+---------------+
val df3 = df2.select(to_json(struct(col("metadata"), col("payload"))).as("value"))
df3.show(false)
+----------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------+
|{"metadata":"whatever","payload":{"catKey":"1","catVal":"category-1"}}|
+----------------------------------------------------------------------+
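The read-payload/transform/re-wrap logic is independent of Spark; in plain Python (an illustration of the message shape, not the Spark job itself) it amounts to:

```python
import json

incoming = '{"metadata": "whatever", "payload": {"catKey": 1}}'

msg = json.loads(incoming)
payload = msg["payload"]           # read only the contents of payload
payload["catVal"] = "category-1"   # whatever your transformation is

# Re-wrap with metadata for the outgoing message
outgoing = json.dumps({"metadata": msg["metadata"], "payload": payload},
                      separators=(",", ":"))
print(outgoing)
```

Note one difference from the Spark answer: here catKey stays numeric, whereas get_json_object returns it as a string ("catKey":"1").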
I'd like to use jq to output in CSV format, but for multiple headers, followed by multiple details. The solutions that I've already seen on Stack Overflow provide a way to insert a single header, but I haven't found anything for multiple headers.
To give you an idea of what I'm talking about, here is some sample JSON input:
[
  {
    "HDR": [1, "abc"],
    "DTL": [ [101, "Descr A"], [102, "Descr B"] ]
  },
  {
    "HDR": [2, "def"],
    "DTL": [ [103, "Descr C"], [104, "Descr D"] ]
  }
]
Desired output:
HDR|1|abc
DTL|101|Descr A
DTL|102|Descr B
HDR|2|def
DTL|103|Descr C
DTL|104|Descr D
I don't know if it's possible, but my approach so far has been to try to create a filter to give me the following, since transforming this to what I need would be trivial:
["HDR", 1, "abc"]
["DTL", 101, "Descr A"]
["DTL", 102, "Descr B"]
["HDR", 2, "def"]
["DTL", 103, "Descr C"]
["DTL", 104, "Descr D"]
To be clear, I know how to do this in any number of scripting languages, but I'm really trying to stick with a single jq filter, if it's at all possible.
Edit: I should clarify that I don't necessarily need to copy the "HDR" and "DTL" keys into the CSV (I can hard-code those), so the sample JSON could look like this, if it makes the problem easier.
[
  [
    [1, "abc"],
    [ [101, "Descr A"], [102, "Descr B"] ]
  ],
  [
    [2, "def"],
    [ [103, "Descr C"], [104, "Descr D"] ]
  ]
]
Edit: This filter technically answers the question with the second sample data I provided (the array-only version), but I would still appreciate a better answer, if for no other reason than that the header length has to be hard-coded, and wrapping the HDR in two sets of arrays just so it can be flatten()'d later feels wrong. I'll leave it here for reference.
.[] | flatten(1) | [[["HDR"] + .[0:2]]] as $hdr | .[2:] as $dtl | $dtl | map([["DTL"] + .]) as $dtl | $hdr + $dtl | flatten(1) | .[] | join("|")
This works for your original input, assuming you chose | as the delimiter because none of your fields can contain |.
jq -r 'map(["HDR"]+.HDR, ["DTL"] + .DTL[])[] | join("|")' data.json
map produces multiple array elements per object.
.DTL[] ensures "DTL" is prefixed to each sublist.
The trailing [] flattens the result of the map.
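For reference, the same flattening is easy to trace in a non-jq sketch; here is the transformation in plain Python (an illustration only):

```python
import json

data = json.loads("""[
  {"HDR": [1, "abc"], "DTL": [[101, "Descr A"], [102, "Descr B"]]},
  {"HDR": [2, "def"], "DTL": [[103, "Descr C"], [104, "Descr D"]]}
]""")

lines = []
for obj in data:
    # One HDR line per object, followed by one DTL line per sublist
    lines.append("|".join(map(str, ["HDR"] + obj["HDR"])))
    for dtl in obj["DTL"]:
        lines.append("|".join(map(str, ["DTL"] + dtl)))

print("\n".join(lines))
```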
I want to practice building models and I figured I'd do it with something I'm familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value from a JSON file.
The datasets I'm using come from Kaggle. You can grab them and run this yourself.
https://www.kaggle.com/datasnaek/league-of-legends
I have a JSON file of the form (it's actually much bigger, but I've shortened it):
{
  "type": "champion",
  "version": "7.17.2",
  "data": {
    "1": {
      "title": "the Dark Child",
      "id": 1,
      "key": "Annie",
      "name": "Annie"
    },
    "2": {
      "title": "the Berserker",
      "id": 2,
      "key": "Olaf",
      "name": "Olaf"
    }
  }
}
and dataframe of the form
print df
gameDuration t1_champ1id
0 1949 1
1 1851 2
2 1493 1
3 1758 1
4 2094 2
I want to replace the ID in t1_champ1id with the lookup value in the json.
If both of these were dataframes, I could use merge.
This is what I've tried. I don't know if this is the best way to read in the json file.
import pandas
df = pandas.read_csv("lol_file.csv",header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
for j in df:
if df.loc[j,('t1_champ1id')] == i:
df.loc[j,('t1_champ1id')] = champ[0][i]['name']
I get the below error:
'the label [gameDuration] is not in the [index]'
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!
for j in df: iterates over the column names in df, which isn't what you want, since you're only matching against the column 't1_champ1id'. A better use of pandas functionality is to condense the id:name pairs from your JSON file into a dictionary and then map it onto df['t1_champ1id'].
# json_file is the parsed JSON, e.g. json_file = json.load(open('champion_info.json'))
player_names = {v['id']: v['name'] for v in json_file['data'].values()}  # .itervalues() on Python 2
df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
# gameDuration t1_champ1id
# 0 1949 Annie
# 1 1851 Olaf
# 2 1493 Annie
# 3 1758 Annie
# 4 2094 Olaf
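Putting the dict-then-map approach together as a self-contained sketch, using the shortened champion JSON from the question as inline data:

```python
import pandas as pd

# Inline stand-ins for champion_info.json and lol_file.csv from the question
champ_json = {
    "data": {
        "1": {"title": "the Dark Child", "id": 1, "key": "Annie", "name": "Annie"},
        "2": {"title": "the Berserker", "id": 2, "key": "Olaf", "name": "Olaf"},
    }
}
df = pd.DataFrame({"gameDuration": [1949, 1851, 1493, 1758, 2094],
                   "t1_champ1id": [1, 2, 1, 1, 2]})

# Condense id:name pairs into a dict, then map it onto the column
player_names = {v["id"]: v["name"] for v in champ_json["data"].values()}
df["t1_champ1id"] = df["t1_champ1id"].map(player_names)
print(df)
```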
Create a dataframe from the 'data' in the JSON file (transpose the result and set the index to the field you want to map on, id), then map that onto the original df.
import json
import pandas as pd

with open('champion_info.json') as data_file:
    champ_json = json.load(data_file)

champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id', inplace=True)
df['champ_name'] = df.t1_champ1id.map(champs['name'])