I am using Python Spark SQL to read streaming files sent from Kinesis. The nested JSON streams are saved as gzipped files in an S3 bucket. These gzip files are not standard JSON: each one contains a series of JSON objects with no commas between them and no [ ] at the beginning or end of the file. Here is a sample with two JSON objects; I expect each object to become a row in the output:
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
}
}
{
"id": "0002",
"type": "donut",
"name": "Cake",
"ppu": 0.65,
"batters":
{
"batter":
[
{ "id": "1221", "type": "Regular" },
{ "id": "1223", "type": "Chocolate" },
{ "id": "1225", "type": "Blueberry" },
{ "id": "1228", "type": "Devil's Food" }
]
}
}
I can use pyspark sqlContext.read.json('streaming.json') to read this file, but the function only returns the first object. When I add [ ] around the objects and commas between them, it reads both objects successfully. But in my case I have about 1 TB of files per day, so it is not practical to rewrite the streams into standard JSON files. I am new to Spark; is there a way to read these concatenated JSON objects with Spark or Spark SQL? The output I expect is the flattened JSON as CSV, or a DataFrame, saved back to S3. Thanks a lot.
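One possible approach (a minimal sketch with placeholder S3 paths, not a tested production job) is to read each gzipped file whole, split the concatenated objects with Python's incremental JSON decoder, and hand the resulting one-document-per-record strings to Spark for schema inference. Note that wholeTextFiles pulls each file entirely into a single executor's memory, so this works best when the individual gzip files are reasonably small.

import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def split_objects(whole_text):
    # Walk the file with raw_decode, yielding one compact JSON string per object
    decoder = json.JSONDecoder()
    idx, n = 0, len(whole_text)
    while idx < n:
        while idx < n and whole_text[idx].isspace():   # skip whitespace between objects
            idx += 1
        if idx >= n:
            break
        obj, idx = decoder.raw_decode(whole_text, idx)
        yield json.dumps(obj)

# .gz files are decompressed transparently by Hadoop's codec support
raw = sc.wholeTextFiles("s3://your-bucket/incoming/*.gz")        # (path, content) pairs
records = raw.flatMap(lambda kv: split_objects(kv[1]))
df = spark.read.json(records)                                    # schema is inferred

# Flatten the nested batter array and write the result back to S3
flat = (df.select("id", "name", "ppu", explode("batters.batter").alias("batter"))
          .select(col("id"), col("name"), col("ppu"),
                  col("batter.id").alias("batter_id"),
                  col("batter.type").alias("batter_type")))
flat.write.mode("overwrite").csv("s3://your-bucket/output/", header=True)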
As an overview, let's say I have a CSV file with 5 entries (in practice it will have a large number of entries) that I need to use dynamically while building a JSON payload in Python (in Databricks).
test.csv
1a2b3c
2n3m6g
333b4c
2m345j
123abc
payload.json
{
"records": {
"id": "37c8323c",
"names": [
{
"age": "1",
"identity": "Dan",
"powers": {
"key": "plus",
"value": "1a2b3c"
}
},
{
"age": "2",
"identity": "Jones",
"powers": {
"key": "minus",
"value": "2n3m6g"
}
},
{
"age": "3",
"identity": "Kayle",
"powers": {
"key": "multiply",
"value": "333b4c"
}
},
{
"age": "4",
"identity": "Donnis",
"powers": {
"key": "divide",
"value": "2m345j"
}
},
{
"age": "5",
"identity": "Layla",
"powers": {
"key": "power",
"value": "123abc"
}
}
]
}
}
I need to construct the payload above with multiple objects in the names array, and I would like the value property to be read dynamically from the CSV file.
Essentially, I need to append a JSON object like the one below to the existing names array, taking the value inside the powers object from the CSV file.
{
"age": "1",
"identity": "Dan",
"powers": {
"key": "plus",
"value": "1a2b3c"
}
}
Since I'm a newbie in Python, any guidance would be appreciated. Thanks to the StackOverflow team in advance.
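A minimal sketch of one way to do this with the standard csv and json modules; the ages, identities, and powers keys below are copied from the example payload and assumed to be known ahead of time, and only the value field is read from test.csv:

import csv
import json

# Assumed to be known ahead of time (taken from the example payload);
# only the value comes from the CSV file.
keys = ["plus", "minus", "multiply", "divide", "power"]
identities = ["Dan", "Jones", "Kayle", "Donnis", "Layla"]

payload = {"records": {"id": "37c8323c", "names": []}}

with open("test.csv", newline="") as f:
    for i, row in enumerate(csv.reader(f)):
        payload["records"]["names"].append({
            "age": str(i + 1),
            "identity": identities[i],
            "powers": {"key": keys[i], "value": row[0]},  # one value per CSV line
        })

print(json.dumps(payload, indent=2))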
Our data is stored in a GCS bucket as newline-delimited JSON files. We create external and native tables in BigQuery using those buckets as the source.
While loading the data we get an error saying the JSON cannot be loaded.
On further analysis we found that the JSON documents contain the same keys with different letter cases. The conflict happens because JSON parsing is case-sensitive while BigQuery column names are case-insensitive.
For example:
{
"metrics": [
{
"labels": {
"__name__": "name",
"cluster_id": "cluster1",
"cluster_profile": "dev",
"cluster_region": "cloud",
"container": "POD",
"endpoint": "https-metrics",
"id": "ID",
"image": "IMAGE",
"instance": "Instance",
"namespace": "default",
"node": "node",
"pod": "pod"
},
"samples": [
{
"value": 0.04,
"timestamp": 1654756143044
}
]
},
{
"labels": {
"__name__": "name",
"cluster_id": "cluster1",
"cluster_profile": "dev",
"cluster_region": "cloud",
"container": "POD",
"endpoint": "https-metrics",
"id": "ID",
"Image": "IMAGE",
"instance": "Instance",
"namespace": "default",
"node": "node",
"pod": "pod"
},
"samples": [
{
"value": 0.04,
"timestamp": 1654756143044
}
]
}
]
}
Here, as you can see, there are two label keys that differ only by case: image and Image.
Error:
Error in query string: Duplicate(Case Insensitive) field names: image and Image. Table: table1
Consider the approach below:
Format the keys of your JSON file to lowercase by running the command gsutil cat gs://[YOUR-BUCKET-NAME]/[FILENAME].json | jq 'walk(if type=="object" then with_entries(.key|=ascii_downcase) else . end)' > [FILENAME].json
Then run the gsutil cp command to upload the formatted JSON file from your local machine back to your GCS bucket.
Sample output: the formatted JSON file with lowercased keys, uploaded to the GCS bucket.
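If you would rather do the same key normalization in Python instead of jq (for example as a small preprocessing step before loading into BigQuery), here is a rough sketch assuming the files are newline-delimited JSON. Note that if a single object contains both image and Image, the two keys collapse into one and one of the values is silently dropped:

import json

def lowercase_keys(obj):
    # Recursively lowercase every dict key; lists and scalars pass through unchanged
    if isinstance(obj, dict):
        return {k.lower(): lowercase_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [lowercase_keys(v) for v in obj]
    return obj

with open("input.json") as src, open("output.json", "w") as dst:
    for line in src:                     # one JSON document per line (NDJSON)
        if line.strip():
            dst.write(json.dumps(lowercase_keys(json.loads(line))) + "\n")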
I am trying to filter my GET response (JSON) based on a value in a nested JSON array. For example, in the following JSON I want to filter the array and print the names of cakes that use Chocolate as a batter.
{
"id": "0001",
"type": "donut",
"name": "Choco Blueberry Cake",
"ppu": 0.55,
"batter":[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" }
]
}
I have tried something like:
List<String> chocolateCakeList =jsonPath.getList("findAll{it.batter.type=='chocolate'}.name");
and
List<String> chocolateCakeList =jsonPath.getList("findAll{it.batter.it.type=='chocolate'}.name");
Both return empty Lists.
Problem
First, if you want to use findAll, the object you extract from must be an array. Your JSON is an object {}, not an array []; only the nested batter field is an array [].
Second, the keyword to search for is Chocolate, not chocolate; the comparison is case-sensitive.
Solution
- If your whole response is exactly what you posted, then the path to extract is
String name = jsonPath.getString("name");
- If your response has a structure like this:
[
{
"id": "0001",
"type": "donut",
"name": "Choco Blueberry Cake",
"ppu": 0.55,
"batter": [
{
"id": "1001",
"type": "Regular"
},
{
"id": "1002",
"type": "Chocolate"
},
{
"id": "1003",
"type": "Blueberry"
}
]
},
{
"id": "0002",
"type": "donut",
"name": "Choco Blueberry Cake",
"ppu": 0.55,
"batter": [
{
"id": "1001",
"type": "Regular"
},
{
"id": "1002",
"type": "Chocolate 2"
},
{
"id": "1003",
"type": "Blueberry"
}
]
}
]
Then the extraction is
List<String> list = jsonPath.getList("findAll {it.batter.findAll {it.type == 'Chocolate'}}.name");
I have a JSON file that stores my data, and I convert it to CSV to edit the data. But when I convert it back to JSON, the structure is lost. How can I convert my CSV back to the same structure as my original JSON?
JSON
{
"product": [
{
"id": "item0001",
"category": "12",
"name": "Name1",
"tag": "tag1",
"more": [
{
"id": "1",
"name": "AL"
},
{
"id": "1",
"name": "BS"
}
],
"active": true
},
{
"id": "item0002",
"categoryId": "13",
"name": "Name2",
"tag": "tag2",
"size": "2",
"more": [
{
"id": "2",
"name": "DL"
},
{
"id": "2",
"name": "AS"
}
],
"active": true
}
]
}
CSV
id,categoryId,name,shortcut,more/0/optionId,more/0/price,more/1/optionId,more/1/price,active,more/2/optionId,more/2/price,spanSize
item0001,ab92d2c6-010e-4182-844d-65050e746617,Name1,Shortcut1,1,60,1,70,TRUE,,,
item0002,ab92d2c6-010e-4182-844d-65050e746617,Name2,Shortcut2,2,60,2,70,TRUE,2,2,4
You can use Miller (mlr) to convert your file both ways:
https://miller.readthedocs.io/en/latest/flatten-unflatten/
First, from JSON to CSV:
mlr --ijson --ocsv cat test.json > test.csv
Then edit the CSV (VisiData is a very nice command-line tool for the job).
Then convert it back to JSON:
mlr --icsv --ojson cat test.csv > test_v2.json
If you want JSON Lines output instead, use --ojsonl.
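If you prefer to stay in pure Python rather than install Miller, here is a rough sketch of the same flatten/unflatten idea, using the slash-and-index path convention from your CSV header (more/0/..., more/1/...). It is an illustration rather than a drop-in replacement for mlr, and note that scalar types (booleans, numbers) come back as strings after a CSV round trip:

import csv
import json

def flatten(obj, prefix=""):
    # {"more": [{"id": "1"}]}  ->  {"more/0/id": "1"}
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            out.update(flatten(v, f"{prefix}{k}/"))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            out.update(flatten(v, f"{prefix}{i}/"))
    else:
        out[prefix[:-1]] = obj
    return out

def unflatten(flat):
    # {"more/0/id": "1"}  ->  {"more": [{"id": "1"}]}
    root = {}
    for path, value in flat.items():
        if value in ("", None):          # skip empty CSV cells
            continue
        parts = path.split("/")
        node = root
        for cur, nxt in zip(parts, parts[1:]):
            default = [] if nxt.isdigit() else {}
            if isinstance(node, list):
                i = int(cur)
                while len(node) <= i:
                    node.append(None)
                if node[i] is None:
                    node[i] = default
                node = node[i]
            else:
                node = node.setdefault(cur, default)
        if isinstance(node, list):
            i = int(parts[-1])
            while len(node) <= i:
                node.append(None)
            node[i] = value
        else:
            node[parts[-1]] = value
    return root

with open("test.json") as f:
    products = json.load(f)["product"]

rows = [flatten(p) for p in products]
fieldnames = sorted({k for r in rows for k in r})
with open("test.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=fieldnames)
    w.writeheader()
    w.writerows(rows)

# ... edit test.csv ...

with open("test.csv", newline="") as f:
    back = {"product": [unflatten(row) for row in csv.DictReader(f)]}
with open("test_v2.json", "w") as f:
    json.dump(back, f, indent=2)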
I have a roughly 10G JSON file. Each line contains exactly one JSON document. I was wondering what is the best way to convert this to Avro. Ideally I would like to keep several documents (like 10M) per file. I think Avro supports having multiple documents in the same file.
You should be able to use Avro tools' fromjson command (see here for more information and examples). You'll probably want to split your file into 10M chunks beforehand (for example using split(1)).
The easiest way to convert a large JSON file to Avro is using avro-tools from the Avro website.
After creating a simple schema the file can be directly converted.
java -jar avro-tools-1.7.7.jar fromjson --schema-file cpc.avsc --codec deflate test.1g.json > test.1g.deflate.avro
The example schema:
{
"type": "record",
"name": "cpc_schema",
"namespace": "com.streambright.avro",
"fields": [{
"name": "section",
"type": "string",
"doc": "Section of the CPC"
}, {
"name": "class",
"type": "string",
"doc": "Class of the CPC"
}, {
"name": "subclass",
"type": "string",
"doc": "Subclass of the CPC"
}, {
"name": "main_group",
"type": "string",
"doc": "Main-group of the CPC"
}, {
"name": "subgroup",
"type": "string",
"doc": "Subgroup of the CPC"
}, {
"name": "classification_value",
"type": "string",
"doc": "Classification value of the CPC"
}, {
"name": "doc_number",
"type": "string",
"doc": "Patent doc_number"
}, {
"name": "updated_at",
"type": "string",
"doc": "Document update time"
}],
"doc:": "A basic schema for CPC codes"
}
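If you would rather do the conversion in Python than shell out to avro-tools, here is a rough sketch using the fastavro library; the schema file and input name are taken from the example above, and the chunking into roughly 10M documents per output file follows the question (each chunk is held in memory, so tune CHUNK to what your machine can handle):

import itertools
import json
from fastavro import parse_schema, writer

with open("cpc.avsc") as f:
    schema = parse_schema(json.load(f))

CHUNK = 10_000_000   # roughly 10M JSON documents per Avro file, as in the question

def records(path):
    # One JSON document per line of the input file
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

docs = records("test.1g.json")
for part in itertools.count():
    chunk = list(itertools.islice(docs, CHUNK))
    if not chunk:
        break
    with open(f"part-{part:05d}.avro", "wb") as out:
        writer(out, schema, chunk, codec="deflate")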