Convert string column to JSON and parse in PySpark

My dataframe looks like this:
| ID | Notes |
---------------
| 1 | '{"Country":"USA","Count":"1000"}' |
| 2 | {"Country":"USA","Count":"1000"} |
ID : int
Notes : string
When I use from_json to parse the Notes column, it returns all null values.
I need help parsing the Notes column into separate columns in PySpark.

When you use the from_json() function, make sure the column value is exactly a JSON/dictionary in string format. In the sample data you have given, the Notes value for ID=1 is not valid JSON: the JSON string is enclosed in additional single quotes. That is why you are getting NULL values. Applying the following code to the input dataframe demonstrates this.
from pyspark.sql.functions import from_json
from pyspark.sql.types import MapType, StringType

df = df.withColumn("Notes", from_json(df.Notes, MapType(StringType(), StringType())))
You need to change your input data so that every value in the Notes column has the same format, a JSON/dictionary as a string and nothing more, because the extra quoting is the root cause of the issue. The following is the corrected format:
| ID | Notes |
---------------
| 1 | {"Country":"USA","Count":"1000"} |
| 2 | {"Country":"USA","Count":"1000"} |
To parse the Notes column values into columns in PySpark, you can simply use the json_tuple() function (no need for from_json()). It extracts the elements from a JSON column (in string format) and returns them as new columns.
from pyspark.sql.functions import col, json_tuple

df = df.select(col("id"), json_tuple(col("Notes"), "Country", "Count")) \
       .toDF("id", "Country", "Count")
df.show()
Output:
| id | Country | Count |
------------------------
| 1  | USA     | 1000  |
| 2  | USA     | 1000  |
NOTE: json_tuple() also returns null if the column value is not in the correct format, so make sure the column values are JSON/dictionary strings without additional quotes.
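If you cannot correct the source data upstream, one possible workaround (not part of the original answer) is to strip the surrounding single quotes in Spark before parsing. The sketch below assumes the extra quotes are literal characters stored in the string, and it rebuilds the sample dataframe just for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, json_tuple

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, '\'{"Country":"USA","Count":"1000"}\''),   # wrapped in extra single quotes
     (2, '{"Country":"USA","Count":"1000"}')],      # already a clean JSON string
    ["id", "Notes"])

# Drop a leading/trailing single quote if present, then extract the keys.
clean = df.withColumn("Notes", regexp_replace("Notes", r"^'|'$", ""))
result = (clean.select(col("id"), json_tuple(col("Notes"), "Country", "Count"))
               .toDF("id", "Country", "Count"))
result.show()
# Both rows now parse: (1, USA, 1000) and (2, USA, 1000)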

Related

Pull data from JSON column and create new output with ABSENT ON NULL option

I have a JSON column in an Oracle DB that was populated without the ABSENT ON NULL option, and some values are quite long because of this.
I would like to trim things down, so I have created a new table similar to the first. I would like to select the JSON from the old table, add the ABSENT ON NULL option, and insert the new values, reducing the column length.
I can see the JSON easily enough like this:
SELECT json_query(json_data,'$') FROM table;
This will give a result like:
{
  "REC_TYPE_IND":"1",
  "ID":"1234",
  "OTHER_ID":"4321",
  "LOCATION":null,
  "EFF_BEG_DT":"19970101",
  "EFF_END_DT":"99991231",
  "NAME":"Joe",
  "CITY":null
}
When I try to remove the null values like
SELECT json_object (json_query(json_data,'$') ABSENT ON NULL
RETURNING VARCHAR2(4000)
) AS col1 FROM table;
I get the following:
ORA-02000: missing VALUE keyword
I assume this is because the json_object function expects the format:
json_object ('REC_TYPE_IND' VALUE '1',
'ID' VALUE '1234')
Is there a way around this, to turn the JSON back into values that JSON_OBJECT can recognize like above, or is there a function I am missing?

Spark Window function, is this a bug in Spark?

Would someone be able to verify whether this is a bug in Spark, or am I doing something horribly wrong with the PySpark Window function?
Here is the dataframe (also shown as JSON further down):
Here is the code that I am running to replace the null values in the post_evar8 column:
from pyspark.sql import Window
from pyspark.sql.functions import col, desc, last

win_mid_desc_ts = Window.partitionBy('post_visid_high_low').orderBy(desc('hit_time_gmt'))
step3win = step3win.withColumn("post_evar8", last(col('post_evar8'), ignorenulls=True).over(win_mid_desc_ts))
step3win.orderBy("visit_page_num").show(100, truncate=False)
After running the above code, I get the following results:
As you can see, the window function filled in the null values in the post_evar8 column but also replaced 184545857 with 32526519 (visit_page_num 26 and 27). I am not sure why the 184545857 value was replaced.
Here is the same dataframe in JSON (you can copy and paste it into a file):
{"post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524187","visit_page_num":1}
{"post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524197","visit_page_num":2}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524197","visit_page_num":3}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524568","visit_page_num":14}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524568","visit_page_num":15}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":18}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":19}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":20}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590524599","visit_page_num":21}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590525921","visit_page_num":26}
{"post_evar8":"184545857","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590525921","visit_page_num":27}
{"post_evar8":"32526519","post_visid_high_low":"3283497750620215155_4391202461688050070","hit_time_gmt":"1590525921","visit_page_num":28}
Update with more examples:
Here is an example with unique hit_time_gmt values (yellow box), where post_evar8 looks correct (red box).
Here is an example where I modified only hit_time_gmt (yellow box), so there are two identical values (1590525922) and one unique value (1590525921). The post_evar8 value in the middle gets updated from 184545857 to 32526519 (red box). This is wrong.
With this window function I just want to update the NULL values in post_evar8, not values that are already populated. That part looks correct in all cases (green box). hit_time_gmt only provides the ordering, so why does changing a hit_time_gmt value change the value of post_evar8 (in the red box)?
No, it's not a bug.
Because you apply partitionBy on the post_visid_high_low column, which has the same value for every row in your dataframe, the entire dataframe is treated as one partition. Within that partition you order by hit_time_gmt descending, so the final result is ordered like below.
>>> df.orderBy(F.desc("hit_time_gmt")).show(truncate=False)
+------------+----------+---------------------------------------+--------------+
|hit_time_gmt|post_evar8|post_visid_high_low |visit_page_num|
+------------+----------+---------------------------------------+--------------+
|1590525921 |184545857 |3283497750620215155_4391202461688050070|26 |
|1590525921 |32526519 |3283497750620215155_4391202461688050070|28 |
|1590525921 |184545857 |3283497750620215155_4391202461688050070|27 |
|1590524568 |32526519 |3283497750620215155_4391202461688050070|15 |
|1590524568 |32526519 |3283497750620215155_4391202461688050070|14 |
|1590524197 |null |3283497750620215155_4391202461688050070|2 |
|1590524197 |32526519 |3283497750620215155_4391202461688050070|3 |
|1590524187 |null |3283497750620215155_4391202461688050070|1 |
+------------+----------+---------------------------------------+--------------+
From the result above, the last value of post_evar8 is 32526519, and that same value is used to replace the other values in the post_evar8 column.
Add some more distinct values to the post_visid_high_low column, then try running the same code and check.
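This is not part of the original answer, but since the stated goal is to fill only the nulls, here is one possible sketch. It keeps already populated post_evar8 values with coalesce and breaks ties in hit_time_gmt with visit_page_num so the window frame is deterministic; the tiebreaker column and the file name are assumptions, not something taken from the question.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import coalesce, col, desc, last

spark = SparkSession.builder.getOrCreate()
step3win = spark.read.json("hits.json")  # the JSON lines shown above, one object per line

# desc("visit_page_num") is an assumed tiebreaker; with it, rows that share a
# hit_time_gmt are ordered deterministically instead of all falling into one frame.
win = (Window.partitionBy("post_visid_high_low")
             .orderBy(desc("hit_time_gmt"), desc("visit_page_num")))

filled = step3win.withColumn(
    "post_evar8",
    # coalesce keeps existing values; only null rows take the windowed fill value.
    coalesce(col("post_evar8"), last(col("post_evar8"), ignorenulls=True).over(win)))

filled.orderBy("visit_page_num").show(100, truncate=False)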

Kusto KQL reference first object in an JSON array

I need to grab the value of the first entry in a json array with Kusto KQL in Microsoft Defender ATP.
The data format looks like this (anonymized), and I want the value of "UserName":
[{"UserName":"xyz","DomainName":"xyz","Sid":"xyz"}]
How do I split or in any other way get the "UserName" value?
In WDATP/MSTAP, for the "LoggedOnUsers" type of array, you want "mv-expand" (multi-value expand) in conjunction with "parsejson".
"parsejson" turns the string into JSON, and mv-expand expands it into LoggedOnUsers.UserName, LoggedOnUsers.DomainName, and LoggedOnUsers.Sid:
DeviceInfo
| mv-expand parsejson(LoggedOnUsers)
| project DeviceName, LoggedOnUsers.UserName, LoggedOnUsers.DomainName
Keep in mind that if the packed field has multiple entries (as DeviceNetworkInfo's IPAddresses field often does), the entire row will be expanded once per entry, so a row for a machine with 3 entries in IPAddresses will be duplicated 3 times, once for each expansion of IPAddresses:
DeviceNetworkInfo
| where Timestamp > ago(1h)
| mv-expand parsejson(IPAddresses)
| project DeviceName, IPAddresses.IPAddress
To access the first entry's UserName property, you can do the following:
print d = dynamic([{"UserName":"xyz","DomainName":"xyz","Sid":"xyz"}])
| extend result = d[0].UserName
To get the UserName for all entries, you can use mv-expand/mv-apply:
print d = dynamic([{"UserName":"xyz","DomainName":"xyz","Sid":"xyz"}])
| mv-apply d on (
project d.UserName
)
Thanks for the reply, but the proposed solution didn't work for me. Instead, I found the following solution:
project substring(split(split(LoggedOnUsers,',',0),'"',4),2,9)
The output of this is: UserName

Need to add the "_corrupt_record" column explicitly in the schema if you need to do schema validation when reading JSON via Spark

When I read JSON through Spark (using Scala):
val rdd = spark.sqlContext.read.json("/Users/sanyam/Downloads/data/input.json")
val df = rdd.toDF()
df.show()
println(df.schema)
//val schema = df.schema.add("_corrupt_record",org.apache.spark.sql.types.StringType,true)
//val rdd1 = spark.sqlContext.read.schema(schema).json("/Users/sanyam/Downloads/data/input_1.json")
//rdd1.toDF().show()
this results in the following DF:
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
| appId| appTimestamp|appVersion| bankCode|bankLocale| data|date| environment| event| id| logTime| logType| msid| muid| owner|recordType| uuid|
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
|services| 1 446026400000 | 2.10.4|loadtest81| en|Properties : {[{"...|user|af593c4b000c29605c90|Payment| 1|152664593|AppActivityLog|90022384526564ffc...|22488dcc8b29-235c...|productOwner|event-logs|781ce0aaaaa82313e8c9|
|services| 1 446026400000 | 2.10.4|loadtest81| en|Properties : {[{"...|user|af593c4b000c29605c90|Payment| 1|152664593|AppActivityLog|90022384526564ffc...|22488dcc8b29-235c...|productOwner|event-logs|781ce0aaaaa82313e8c9|
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
StructType(StructField(appId,StringType,true), StructField(appTimestamp,StringType,true), StructField(appVersion,StringType,true), StructField(bankCode,StringType,true), StructField(bankLocale,StringType,true), StructField(data,StringType,true), StructField(date,StringType,true), StructField(environment,StringType,true), StructField(event,StringType,true), StructField(id,LongType,true), StructField(logTime,LongType,true), StructField(logType,StringType,true), StructField(msid,StringType,true), StructField(muid,StringType,true), StructField(owner,StringType,true), StructField(recordType,StringType,true), StructField(uuid,StringType,true))
If I want to apply validation to any further JSON I read, I take the schema as a variable and pass it to .schema as an argument [see the commented lines of code]. But then even the corrupt records don't go into the _corrupt_record column (which should happen by default); instead those bad records are parsed as null in all columns, and this results in data loss because there is no trace of them.
However, when you add the _corrupt_record column to the schema explicitly, everything works fine and the corrupt record goes into that column. I want to know why this is so.
(Also, if you give Spark malformed JSON without a schema, it handles it automatically by adding a _corrupt_record column, so why does schema validation need the column to be added explicitly?)
Reading corrupt JSON data with an inferred schema returns the schema [_corrupt_record: string]. But you are reading the corrupt data with a user-supplied schema that does not include that column, which is why you get the whole row as null.
When you add _corrupt_record explicitly, you get the whole JSON record in that column and, I assume, null in all other columns.
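For illustration, here is a minimal PySpark rendering of the workaround the question describes (the question uses Scala; this sketch uses a trimmed-down schema and treats the file path and field list as placeholders). Including _corrupt_record in the user-supplied schema lets malformed rows be captured instead of coming back as all nulls:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Only a few of the original fields are shown; the rest follow the same pattern.
schema = StructType([
    StructField("appId", StringType(), True),
    StructField("id", LongType(), True),
    StructField("logTime", LongType(), True),
    # Without this field, a user-supplied schema turns malformed rows into all nulls.
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .schema(schema)
      .option("columnNameOfCorruptRecord", "_corrupt_record")  # the default name, stated explicitly
      .json("/Users/sanyam/Downloads/data/input_1.json"))

df.show(truncate=False)  # malformed lines appear with only _corrupt_record populated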

JSON path parent object, or equivalent MongoDB query

I am selecting nodes in a JSON input but can't find a way to include parent object details for each array entry I am querying. I am using Pentaho Data Integration to query the data with a JSON Input step fed from a MongoDB input.
I have also tried to create a MongoDB query to achieve the same thing, but cannot seem to do that either.
Here are the two fields/paths that display the data:
$.size_break_costs[*].size
$.size_break_costs[*].quantity
Here is the json source format:
{
  "_id" : ObjectId("4f1f74ecde074f383a00000f"),
  "colour" : "RAVEN-SMOKE",
  "name" : "Authority",
  "size_break_costs" : [
    {
      "quantity" : NumberLong("80"),
      "_id" : ObjectId("518ffc0697eee36ff3000002"),
      "size" : "S"
    },
    {
      "quantity" : NumberLong("14"),
      "_id" : ObjectId("518ffc0697eee36ff3000003"),
      "size" : "M"
    },
    {
      "quantity" : NumberLong("55"),
      "_id" : ObjectId("518ffc0697eee36ff3000004"),
      "size" : "L"
    }
  ],
  "sku" : "SK3579"
}
I currently get the following results:
S,80
M,14
L,55
I would like to get the SKU and Name as well as my source will have multiple products (SKU/Description):
SK3579,Authority,S,80
SK3579,Authority,M,14
SK3579,Authority,L,55
When I try to include the SKU using $.sku, the process errors.
The end result I'm after is a report of all products and the available quantities of their various sizes. Possibly there's an alternative MongoDB query that provides this.
EDIT:
It seems the issue may be due to the fact that not all lines have the same structure. For example, the product above contains 3 sizes: S, M, L. Some products come in one size (PACK). Others come in multiple sizes (28, 30, 32, 33, 34, 36, 38, etc.).
The error produced is:
*The data structure is not the same inside the resource! We found 1 values for json path [$.sku], which is different that the number retourned for path [$.size_break_costs[].quantity] (7 values). We MUST have the same number of values for all paths.
I have tried the following MongoDB query separately, which gives the correct results, but the corresponding export doesn't work: no values are returned for Size and Quantity.
Query:
db.product_details.find( {}, {sku: true, "size_break_costs.size": true, "size_break_costs.quantity": true}).pretty();
Export:
mongoexport --db brandscope_production --collection product_details --csv --out Test01.csv --fields sku,"size_break_costs.size","size_break_costs.quantity" --query '{}';
Shortly after I added my own bounty, I figured out the solution. My problem has the same basic structure: a parent identifier and some number N of child key/value pairs for ratings (quality, value, etc.).
First, you'll need a JSON Input step that gets the SKU, Name, and size_break_costs array, all as Strings. The important part is that size_break_costs is a String, and is basically just a stringified JSON array. Make sure that under the Content tab of the JSON Input, that "Ignore missing path" is checked, in case you get one with an empty array or the field is missing for some reason.
For your fields, use:
Name | Path | Type
ProductSKU | $.sku | String
ProductName | $.name | String
SizeBreakCosts | $.size_break_costs | String
I added a "Filter rows" block after this step, with the condition "SizeBreakCosts IS NOT NULL", which is then passed to a second JSON Input block. This second JSON block, you'll need to check "Source is defined in a field?", and set the value of "Get source from field" to "SizeBreakCosts", or whatever you named it in the first JSON Input block.
Again, make sure "Ignore missing path" is checked, as well as "Ignore empty file". From this block, we'll want to get two fields. We'll already have ProductSKU and ProductName with each row that's passed in, and this second JSON Input step will further split it into however many rows are in the SizeBreakCosts input JSON. For fields, use:
Name | Path | Type
Quantity | $.[*].quantity | Integer
Size | $.[*].size | String
As you can see, these paths use "$.[*].FieldName", because the JSON string we passed in has an array as the root item, so we're getting every item in that array, and parsing out its quantity and size.
Now every row should have the SKU and name from the parent object, and the quantity and size from each child object. Dumping this example to a text file, I got:
ProductSKU;ProductName;Size;Quantity
SK3579;Authority;S; 80
SK3579;Authority;M; 14
SK3579;Authority;L; 55
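This is not part of the original Pentaho answer, but for readers coming from the Spark questions above, the same parent/child flattening can be sketched in PySpark with explode. It assumes the documents have been exported as plain JSON lines (without the ObjectId/NumberLong wrappers) to an illustrative file name:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Each product document holds sku, name, and an array of size/quantity entries.
products = spark.read.json("product_details.json")

flat = (products
        .select("sku", "name", explode("size_break_costs").alias("sbc"))
        .select("sku", "name",
                col("sbc.size").alias("size"),
                col("sbc.quantity").alias("quantity")))

flat.show()
# Expected shape: one row per size entry, e.g. SK3579, Authority, S, 80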