How to extract a value from a JSON-formatted column using PySpark

I have a PySpark DataFrame with one column of fairly long JSON strings. Each string has many keys, but I am only interested in one of them. How can I extract the value for that key?
Here is an example of a string in the column userbehavior:
[{"num":"1234","Projections":"test", "intent":"test", "Mtime":11333.....}]
I want to extract only the value for "Mtime". I tried:
user_hist_df=user_hist_df.select(get_json_object(user_hist_df.userbehavior, '$.Mtime').alias("Time"))
However, it does not work.

You are almost right. It isn't working because your JSON is an array of objects, so the path has to step into the array before it can reach the key. Just change the path to this:
get_json_object('userbehavior', '$[*].Mtime').alias("Time")
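To see why the original path finds nothing, it may help to look at the same string in plain Python. This is only an illustration of the data shape, not Spark code; the sample record is a shortened, made-up stand-in for the truncated string in the question. Indexing the first element here corresponds to a path such as '$[0].Mtime'.

```python
import json

# Sample value shaped like the userbehavior column (the truncated
# original is replaced by a short, made-up record).
raw = '[{"num": "1234", "Projections": "test", "intent": "test", "Mtime": 11333}]'

parsed = json.loads(raw)

# The top level is a list, so there is no "Mtime" key to look up directly.
# This is the situation the path '$.Mtime' runs into.
top_level_is_list = isinstance(parsed, list)

# Step into the array first, then read the key from each object.
mtimes = [obj["Mtime"] for obj in parsed]
first_mtime = parsed[0]["Mtime"]

print(top_level_is_list, mtimes, first_mtime)
```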

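To extract values from a JSON column you can also use from_json() and specify the schema, e.g. in PySpark:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import MapType, StringType
df = df.withColumn("parsed_col", from_json(col("Body"), MapType(StringType(), StringType())))
Once the JSON is parsed according to the schema, extract whichever key you need:
df = df.withColumn("col_1", col("parsed_col").getItem("col_1"))
This map schema fits a string holding a single JSON object. For an array of objects like the one in the question, the schema would need to be wrapped in ArrayType (also imported from pyspark.sql.types).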
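The parse-then-pick logic can also be sketched as a plain Python function, which is roughly what the body of a UDF would look like if the built-in functions do not fit. The name extract_key and its handling of arrays (first element only) are assumptions made for this illustration, not part of any Spark API.

```python
import json
from typing import Optional


def extract_key(json_str: Optional[str], key: str) -> Optional[str]:
    """Hypothetical helper: parse a JSON string and return one key as a string.

    Accepts either a single object or an array of objects (the first
    element is used). Returns None for null, malformed, or non-matching input.
    """
    if json_str is None:
        return None
    try:
        parsed = json.loads(json_str)
    except json.JSONDecodeError:
        return None
    if isinstance(parsed, list):
        parsed = parsed[0] if parsed else None
    if not isinstance(parsed, dict) or key not in parsed:
        return None
    value = parsed[key]
    # Keep strings as they are; serialize numbers and nested values.
    return value if isinstance(value, str) else json.dumps(value)


print(extract_key('{"col_1": "a"}', "col_1"))
print(extract_key('[{"Mtime": 11333}]', "Mtime"))
```

Returning None on bad input keeps one malformed row from failing a whole job, at the cost of hiding parse errors.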

Related

Pyspark to concatenate/combine string columns with an escaped json column into a single json column

I am trying to accomplish the following with PySpark.
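Here is an input dataframe with 2 columns:
+-------+-----------------------------------------------------------+
|string |escaped_json                                               |
+-------+-----------------------------------------------------------+
|Contact|{"id":"27","person":{"firstName":"Dan","lastName":"Jones"}}|
+-------+-----------------------------------------------------------+
I need to transform the above dataframe into the dataframe below. Note that the schema of the escaped_json column varies (it is not fixed), and that the elements of escaped_json are escaped with \, like {\"id\":\"27\",\"person\":{\"firstName\":\"Dan\",\"lastName\":\"Jones\"}}.
+------------------------------------------------------------------------------+
|string_with_regular_json                                                      |
+------------------------------------------------------------------------------+
|{"string":"Contact","id":"27","person":{"firstName":"Dan","lastName":"Jones"}}|
+------------------------------------------------------------------------------+
So far I have figured out how to transform just the escaped_json column into its own dataframe, but I need the string column (and possibly more columns) to be concatenated or included too: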
json_column_df = spark.read.json(input_df.rdd.map(lambda row: row.escaped_json))
Please help :)

Since the JSON does not have a fixed schema, we can edit it as a string with regexp_replace: replace the leading { with the prefix you want. Note that in PySpark versions where F.regexp_replace only accepts string literals for the pattern and replacement, building the replacement from a column's content only works within expr.
from pyspark.sql import functions as F
# '^\\\\{' reaches the regex engine as ^\{ and matches only the leading brace.
df.select(F.expr(
    "regexp_replace(escaped_json, '^\\\\{', concat('{\"string\":\"', string, '\",'))"
).alias("string_with_regular_json")).show(truncate=False)
+------------------------------------------------------------------------------+
|string_with_regular_json                                                      |
+------------------------------------------------------------------------------+
|{"string":"Contact","id":"27","person":{"firstName":"Dan","lastName":"Jones"}}|
+------------------------------------------------------------------------------+
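The regex approach assumes every value starts with { and that the string column contains nothing that needs JSON escaping. As an alternative sketch, the same merge can be done by parsing the JSON in plain Python; the function name merge_string_into_json is hypothetical, and wrapping it in a UDF would likely be slower than the built-in expression.

```python
import json


def merge_string_into_json(string_value: str, escaped_json: str) -> str:
    """Hypothetical helper: prepend a "string" key to a JSON object string."""
    payload = json.loads(escaped_json)
    # Build a new dict so "string" comes first, matching the desired output.
    merged = {"string": string_value, **payload}
    # Compact separators reproduce the no-whitespace form of the input.
    return json.dumps(merged, separators=(",", ":"))


result = merge_string_into_json(
    "Contact",
    '{"id":"27","person":{"firstName":"Dan","lastName":"Jones"}}',
)
print(result)
```

If the payload already contains a "string" key, its value replaces the one passed in, so a different key name may be safer for real data.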

How to store dynamically generated JSON objects in a BigQuery table?

I have a use case that requires storing dynamic JSON objects in a BigQuery column. The schema of each object is generated dynamically by the source and is not known beforehand. The number of key-value pairs in the object can differ as well, as shown below.
Example JSON objects:
{"Fruit":"Apple","Price":"10","Sale":"No"}
{"Movie":"Avatar","Genre":"Fiction"}
I could achieve the same in Hive by defining the column as a map<string, string> and querying it like col_name["Fruit"] or col_name["Movie"] for the corresponding row.
Is there an equivalent in BigQuery? I came across the RECORD data type, but its schema needs to be the same for all the objects in the column.
Note: Storing the column as a string is not an option, as the users need to query the data on the keys directly, without parsing it after retrieval.

Storing the data as a JSON string seems to be the only way to implement your requirement, at the moment. As a workaround, you can create a JavaScript UDF that parses the JSON string and extracts the necessary information. Below is a sample UDF.
CREATE TEMP FUNCTION extract_from_json(json STRING, key STRING)
RETURNS STRING
LANGUAGE js AS """
  // Parse the JSON string and return the value stored under the given key.
  const obj = JSON.parse(json);
  return obj[key];
""";
WITH json_table AS (
  SELECT '{"Fruit":"Apple","Price":"10","Sale":"No"}' AS json_data UNION ALL
  SELECT '{"Movie":"Avatar","Genre":"Fiction"}' AS json_data
)
SELECT extract_from_json(json_data, 'Movie') AS movie
FROM json_table;
You can also check out the JSON data type in BigQuery, which offers more flexibility when handling JSON data. Note that when this answer was written the data type was still in preview, required enrollment, and was not recommended for production, so check its current status before relying on it. For more information on working with JSON data, refer to the BigQuery documentation.
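To make the per-row behavior of the UDF concrete, here is the same lookup in plain Python over the two sample objects. It is only an illustration of the logic, not BigQuery code. A row without the requested key yields None here; the JavaScript version returns undefined in that case, which I would expect BigQuery to surface as NULL.

```python
import json

rows = [
    '{"Fruit":"Apple","Price":"10","Sale":"No"}',
    '{"Movie":"Avatar","Genre":"Fiction"}',
]


def extract_from_json(json_str: str, key: str):
    # Mirrors the UDF body: parse, then look the key up.
    # A missing key gives None instead of raising an error.
    return json.loads(json_str).get(key)


movies = [extract_from_json(row, "Movie") for row in rows]
print(movies)
```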

How to set value in MySQL(5.6) column if that contains json document as string

For example, suppose we have a table user with three columns: id, name, and jsonConfig. The jsonConfig column contains a JSON document stored as a string:
{"key1":"val1","key2":"val2","key3":"val3"}
I would like to change val1 to, say, val4 in the jsonConfig column.
Can we do that using MySQL 5.6 queries?

I don't think there is a direct way to do this. A lot of JSON support, such as JSON_EXTRACT and JSON_CONTAINS, was only added in later versions (MySQL 5.7 and up). You might have to write your own custom function.

With MySQL 5.6, since it does not have the JSON data type or the supporting functions, you are going to have to replace the entire string via an UPDATE query if you want to change any part of the JSON document in your string.
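Replacing the entire string usually means a read-modify-write round trip in application code. The sketch below shows that flow using Python's built-in sqlite3 module purely as a stand-in so it runs without a server; with MySQL 5.6 the same three steps would go through a MySQL driver. The row values (id 1, name "alice") are made up for the example.

```python
import json
import sqlite3

# In-memory SQLite is used only as a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, jsonConfig TEXT)")
conn.execute(
    "INSERT INTO user (id, name, jsonConfig) VALUES (?, ?, ?)",
    (1, "alice", '{"key1":"val1","key2":"val2","key3":"val3"}'),
)

# 1. Read the stored string and decode it.
(stored,) = conn.execute("SELECT jsonConfig FROM user WHERE id = ?", (1,)).fetchone()
config = json.loads(stored)

# 2. Change the one value in application code.
config["key1"] = "val4"

# 3. Write the whole document back with a single UPDATE.
conn.execute(
    "UPDATE user SET jsonConfig = ? WHERE id = ?",
    (json.dumps(config, separators=(",", ":")), 1),
)

(updated,) = conn.execute("SELECT jsonConfig FROM user WHERE id = ?", (1,)).fetchone()
print(updated)
```

Because the read and the write are separate statements, concurrent writers would need a transaction or row lock around them.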

JSON update single value in MySQL table

I have a JSON document in the details column of the MySQL payment table. I need to update a single value inside this document. What is the procedure to update JSON using MySQL?
JSON document
{"items":[{"ca_id":18,"appointment_date":"2018-09-15 15:00:00","service_name":"Software Installation / Up-gradation","service_price":165}],"coupon":{"code":"GSSPECIAL","discount":"10","deduction":"0.00"},"subtotal":{"price":165,"deposit":0},"tax_in_price":"included","adjustments":[{"reason":"Over-time","amount":"20","tax":"0"}]}
I need to update the appointment_date from 2018-09-15 15:00:00 to 2018-09-28 15:00:00.

Here is a pure MySQL JSON way of doing this:
UPDATE yourTable
SET col = JSON_REPLACE(col, '$.items[0].appointment_date', '2018-09-28 15:00:00');
The best I could come up with is to address the first element of the JSON array called items, and then update the appointment_date field in that array element.
You could equally well do this JSON work in your PHP layer, and it might make more sense to do it there.

If you want to do this in PHP, the steps to follow are:
1. Select the respective column from the table.
2. Use json_decode to convert the string to an array or object.
3. Apply your modifications to the decoded value.
4. Use json_encode to convert the value back to a string.
5. Save this string in the table.
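The decode, modify, and encode steps look much the same in any application language. Below is a sketch in Python, where json.loads and json.dumps play the roles of json_decode and json_encode. The document is an abbreviated copy of the one in the question, and the existence check is an assumption added so that a missing field is left alone rather than created.

```python
import json

# Abbreviated copy of the details document from the question.
details = (
    '{"items":[{"ca_id":18,"appointment_date":"2018-09-15 15:00:00",'
    '"service_price":165}],"tax_in_price":"included"}'
)

doc = json.loads(details)  # decode the stored string

# Modify only the first element of "items", and only if the field exists.
items = doc.get("items", [])
if items and "appointment_date" in items[0]:
    items[0]["appointment_date"] = "2018-09-28 15:00:00"

new_details = json.dumps(doc, separators=(",", ":"))  # encode for saving
print(new_details)
```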

Complex Json schema into custom spark dataframe

Ok so I'm getting a big Json string from an API call, and I want to save some of that string into Cassandra. I'm trying to parse the Json string into a more table like structure, but with only some fields. The overall schema looks like this:
And I want my table structure using regnum, date and value fields.
With sqlContext.read.json(vals).select(explode('register) as 'reg).select("reg.#attributes.regnum","reg.data.date","reg.data.value").show I can get a table like this:
But as you can see date and value fields are arrays. I would like to have one element per record, and duplicate the corresponding regnum for each record. Any help is very much appreciated.

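You can convert your DataFrame to a typed Dataset and then flatMap over it, pairing each date with its value and repeating regnum for every pair.
import spark.implicits._ // encoders needed by .as[...] and flatMap
df.select("reg.#attributes.regnum","reg.data.date","reg.data.value")
  .as[(Long, Array[String], Array[String])]
  // zip pairs date(i) with value(i); regnum is copied onto each pair
  .flatMap(s => s._2.zip(s._3).map(p => (s._1, p._1, p._2)))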
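The zip-and-flatten step can be illustrated in plain Python without Spark. The rows below are hypothetical, since the question's actual data is not shown; only the shape (regnum, list of dates, list of values) follows the text.

```python
# Hypothetical rows shaped like (regnum, dates, values); the values are made up.
rows = [
    (1, ["2016-01-01", "2016-01-02"], ["10", "20"]),
    (2, ["2016-01-01"], ["30"]),
]

# Pair each date with the value at the same position and repeat regnum.
flat = [
    (regnum, date, value)
    for regnum, dates, values in rows
    for date, value in zip(dates, values)
]

for record in flat:
    print(record)
```

As with zip in Scala, pairing stops at the shorter list, so a row whose date and value arrays differ in length silently loses the extra entries.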
