Exclude column header when writing DataFrame to JSON

I have the following DataFrame, df1:
SomeJson
=================
[{
"Number": "1234",
"Color": "blue",
"size": "Medium"
}, {
"Number": "2222",
"Color": "red",
"size": "Small"
}
]
and I am trying to write just the contents of this column to blob storage as JSON.
df1.select("SomeJson")
.write
.option("header", false)
.mode("append")
.json(blobStorageOutput)
This code works, but it creates the following JSON in blob storage:
{
"SomeJson": [{
"Number": "1234",
"Color": "blue",
"size": "Medium"
}, {
"Number": "2222",
"Color": "red",
"size": "Small"
}
]
}
But I just want the contents of the column, not the column header as well; I don't want the "SomeJson" in my final JSON. Any suggestions?

If you don't want the DataFrame column name to appear in the output, write your DataFrame as text rather than JSON. That writes only the content of the column.
df1.select("SomeJson")
.write
.option("header", false)
.mode("append")
.text(blobStorageOutput)
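If SomeJson is not already a string column (the sample output suggests it may be an array of structs), the text writer will reject it. A minimal PySpark sketch, assuming it is acceptable to serialize the column with to_json first:
from pyspark.sql import functions as F

# Sketch only: serialize the column to a JSON string, then write it as plain
# text so no column name or wrapping object appears in the output files.
(df1.select(F.to_json("SomeJson").alias("value"))
    .write
    .mode("append")
    .text(blobStorageOutput))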

An additional scenario for this question: sometimes the JSON structure itself is derived from the dataset, and we then run into this same header problem. The approach below can be used.
spark.sql("SELECT COLLECT_SET(STRUCT(<field_name>)) AS `` FROM <table_name> LIMIT 1").coalesce(1).write.format("org.apache.spark.sql.json").mode("overwrite").save(<Blob Path1/ ADLS Path1>)
The output will look like:
{"":[{<field_name>:<field_value>}]}
The header can then be avoided with the following three statements (assuming no tilde character appears in the data):
jsonToCsvDF=spark.read.format("com.databricks.spark.csv").option("delimiter", "~").load(<Blob Path1/ ADLS Path1>)
jsonToCsvDF.createOrReplaceTempView("json_to_csv")
spark.sql("SELECT SUBSTR(`_c0`,5,length(`_c0`)-5) FROM json_to_csv").coalesce(1).write.option("header",false).mode("overwrite").text(<Blob Path2/ ADLS Path2>)

Related

Retrieve specific value from a JSON blob in MS SQL Server, using a property value?

In my DB I have a column storing JSON. The JSON looks like this:
{
"views": [
{
"id": "1",
"sections": [
{
"id": "1",
"isToggleActive": false,
"components": [
{
"id": "1",
"values": [
"02/24/2021"
]
},
{
"id": "2",
"values": []
},
{
"id": "3",
"values": [
"5393",
"02/26/2021 - Weekly"
]
},
{
"id": "5",
"values": [
""
]
}
]
}
]
}
]
}
I want to create a migration script that will extract a value from this JSON and store it in its own column.
In the JSON above, in that components array, I want to extract the second value from the component with an ID of "3" (among other things, but this is a good example). So, I want to extract the value "02/26/2021 - Weekly" to store in its own column.
I was looking at the JSON_VALUE docs, but I only see examples that specify indexes for the JSON properties. I can't figure out what kind of JSON path I'd need. Is this even possible to do with JSON_VALUE?
EDIT: To clarify, the views and sections components can have static array indexes, so I can use views[0].sections[0] for them. Currently, this is all I have with my SQL query:
SELECT
*
FROM OPENJSON(@jsonInfo, '$.views[0].sections[0]')
You need to use OPENJSON to break out the inner array, then filter it with a WHERE clause, and finally select the correct value with JSON_VALUE:
SELECT
JSON_VALUE(components.value, '$.values[1]')
FROM OPENJSON (@jsonInfo, '$.views[0].sections[0].components') components
WHERE JSON_VALUE(components.value, '$.id') = '3'

pandas json_normalize nested json where dictionary only exists on some records

I am trying to run pandas.json_normalize on a data file that has highly varied, nested JSON, where the content of the records can vary considerably.
I am processing a house listing file and trying to pull out prices. The prices data is stored as follows, and 'prices' is at the first nesting level within the JSON file:
"prices": [
{
"amountMax": 420000,
"amountMin": 420000,
"availability": "false",
"currency": "USD",
"dateSeen": [
"2020-12-21T11:57:17.190Z",
"2020-12-25T02:35:41.009Z"
],
"isSale": "false",
"isSold": "true",
"pricePerSquareFoot": 235,
"sourceURLs": [
"https://www.redfin.com/FL/Coconut-Creek/.../home/4146834"
]
}, # followed by additional entries
I am using the following line of code, which works if I edit the input file down to a single record that includes a 'prices' section:
df3 = pd.json_normalize(df['records'], record_path='prices',
                        meta=['id'],
                        errors='ignore')
However, the full file includes many records that do not include a prices section. If I run the code against a file with two records (one with, one without), it fails with KeyError: 'prices'.
Clearly errors='ignore' in json_normalize is not enough to handle this.
What can I do? I would just like to skip the records without prices entirely.
A list comprehension on your JSON will do it. I've synthesized some JSON to match your description of input data.
js = {
"records": [
{
"prices": [
{
"amountMax": 420000,
"amountMin": 420000,
"availability": "false",
"currency": "USD",
"dateSeen": [
"2020-12-21T11:57:17.190Z",
"2020-12-25T02:35:41.009Z"
],
"isSale": "false",
"isSold": "true",
"pricePerSquareFoot": 235,
"sourceURLs": [
"https://www.redfin.com/FL/Coconut-Creek/.../home/4146834"
]
}
],
"id": 1
},{"id":2}
]
}
pd.json_normalize({"records":[r for r in js["records"] if "prices" in r.keys()]}["records"],record_path="prices",meta="id")

PySpark - Getting list of dicts and converting its keys/values to columns

I have the following JSON (located in my local file system at path_json):
[
{
"name": "John",
"email": "john#hisemail.com",
"gender": "Male",
"dict_of_columns": [
{
"column_name": "hobbie",
"columns_value": "guitar"
},
{
"column_name": "book",
"columns_value": "1984"
}
]
},
{
"name": "Mary",
"email": "mary#heremail.com",
"gender": "Female",
"dict_of_columns": [
{
"column_name": "language",
"columns_value": "Python"
},
{
"column_name": "job",
"columns_value": "analyst"
}
]
}
]
As you can see, this is nested JSON.
I am reading it with the following command:
df = spark.read.option("multiline", "true").json(path_json)
OK. Now, it produces the following DataFrame:
+------------------------------------+-------------------+------+----+
|dict_of_columns |email |gender|name|
+------------------------------------+-------------------+------+----+
|[[hobbie, guitar], [book, 1984]] |john@hisemail.com |Male |John|
|[[language, Python], [job, analyst]]|mary@heremail.com |Female|Mary|
+------------------------------------+-------------------+------+----+
I want to know if there is a way to produce the following DataFrame:
+----+-----------------+------+------+-------+--------+----+
|book|email |gender|hobbie|job |language|name|
+----+-----------------+------+------+-------+--------+----+
|1984|john@hisemail.com|Male |guitar|null |null |John|
|null|mary@heremail.com|Female|null |analyst|Python |Mary|
+----+-----------------+------+------+-------+--------+----+
A few comments:
My real data has thousands and thousands of lines
I don't know all the column_name values in my dataset (there are many of them)
email is unique for each line, so it can be used as a key if a join is necessary. I tried this approach before: create a main DataFrame with the columns [name, gender, email] and other DataFrames, one per row, containing the dictionaries. But without success (and it doesn't have good performance).
Thank you so much!
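One possible sketch (an assumption, not a confirmed solution): explode the dict_of_columns array of structs into rows, then pivot column_name back into real columns, grouping on the fields that identify a person. This assumes the number of distinct column_name values is small enough for pivot to handle.
from pyspark.sql import functions as F

# Sketch only: flatten each {column_name, columns_value} struct into its own row,
# then pivot the names back into columns, one row per person.
exploded = (
    df.withColumn("kv", F.explode("dict_of_columns"))
      .select("name", "email", "gender",
              F.col("kv.column_name").alias("column_name"),
              F.col("kv.columns_value").alias("columns_value"))
)
pivoted = (
    exploded.groupBy("name", "email", "gender")
            .pivot("column_name")
            .agg(F.first("columns_value"))
)
pivoted.show(truncate=False)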

Read first line of huge Json file with Spark using Pyspark

I'm pretty new to Spark and, to teach myself, I have been using small JSON files, which work perfectly. I'm using PySpark with Spark 2.2.1. However, I don't get how to read in a single data line instead of the entire JSON file. I have been looking for documentation on this, but it seems pretty scarce. I have to process a single large (larger than my RAM) JSON file (a Wikidata dump: https://archive.org/details/wikidata-json-20150316) and want to do this in chunks or line by line. I thought Spark was designed to do just that, but I can't find out how, and when I request the top 5 observations in a naive way I run out of memory. I have tried an RDD:
SparkRDD = spark.read.json("largejson.json").rdd
SparkRDD.take(5)
and a DataFrame:
SparkDF = spark.read.json("largejson.json")
SparkDF.show(5, truncate=False)
So in short:
1) How do I read in just a fraction of a large JSON file? (Show first 5 entries)
2) How do I filter a large JSON file line by line to keep just the required results?
Also: I don't want to have to predefine the data schema for this to work.
I must be overlooking something.
Thanks
Edit: With some help I have gotten a look at the first observation, but by itself it is already too huge to post here, so I'll just include a fraction of it:
[
{
"id": "Q1",
"type": "item",
"aliases": {
"pl": [{
"language": "pl",
"value": "kosmos"
}, {
"language": "pl",
"value": "\\u015bwiat"
}, {
"language": "pl",
"value": "natura"
}, {
"language": "pl",
"value": "uniwersum"
}],
"en": [{
"language": "en",
"value": "cosmos"
}, {
"language": "en",
"value": "The Universe"
}, {
"language": "en",
"value": "Space"
}],
...etc
That's very similar to Select only first line from files under a directory in pyspark.
Hence, something like this should work:
def read_firstline(filename):
    # return a one-element list so flatMap yields whole lines rather than single bytes
    with open(filename, 'rb') as f:
        return [f.readline()]

# files is a list of filenames
rdd_of_firstlines = sc.parallelize(files).flatMap(read_firstline)
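A hypothetical usage sketch (the paths below are made up), assuming sc is an active SparkContext and the dump has already been split into several files on storage the executors can read:
# Placeholder paths; replace with the real locations of the split dump files.
files = ["/data/wikidata/part-000.json", "/data/wikidata/part-001.json"]
first_lines = sc.parallelize(files).flatMap(read_firstline)
for line in first_lines.take(2):
    print(line)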

MySQL regexp search JSON array

I am storing JSON data in one of the fields of a table, and I am having trouble using REGEXP to return the correct entries.
Basically, it matches other attributes in the JSON object that it should not.
Sample JSON
{
"data": {
"en": {
"containers": [
{
"id": 1441530944931,
"template": "12",
"columns": {
"column1": [
"144",
"145",
"148"
],
"column2":[
"135",
"148",
"234"
]
}
}
],
"left": "152",
"right": "151"
}
}
}
Now, I would like to search the columns arrays for a specific value (i.e. 148).
Right now I have the query below:
WHERE (w.`_attrs` REGEXP '"column[0-9]":.*\\[.*"148".*\\]'
which works just fine
However, if I change the value from 148 to 152 or 151, it also matches.
For some reason the query matches the attributes left and right as well, which is not desirable.
Any help?
Thanks
Or... Switch to MariaDB 10 and index the components of the JSON.