Dynamic replacement of Spark dataframe columns per row - json

I have two JSON strings per record.
The first JSON string is variable and indicates which column(s) should be replaced in the second JSON string.
E.g. the input:
import spark.implicits._
// Leaving the timestamp aspect and multiple occurrences of changes in the same run out, for simplicity.
val df = Seq((1, """{"b": "new", "c": "new"}""", """{"k": 1, "b": "old", "c": "old", "d": "old"}""" ),
(2, """{"b": "new", "d": "new"}""", """{"k": 2, "b": "old", "c": "old", "d": "old"}""" )).toDF("id", "chg", "before")
Like so:
+---+------------------------+--------------------------------------------+
|id |chg                     |before                                      |
+---+------------------------+--------------------------------------------+
|1 |{"b": "new", "c": "new"}|{"k": 1, "b": "old", "c": "old", "d": "old"}|
|2 |{"b": "new", "d": "new"}|{"k": 2, "b": "old", "c": "old", "d": "old"}|
+---+------------------------+--------------------------------------------+
What I want, but cannot get working, is this output, produced on a per-row basis:
+---+--------------------------------------------+
|id |after                                       |
+---+--------------------------------------------+
|1  |{"k": 1, "b": "new", "c": "new", "d": "old"}|
|2  |{"k": 2, "b": "new", "c": "old", "d": "new"}|
+---+--------------------------------------------+
We could have an array in the JSON; that's fine as well.

You can go the strongly typed route, which gives you a lot of control over the exact transformations you want to have occur.
I made the following assumptions:
your before column contains all of the fields in each row
the chg column never contains the k field.
import spark.implicits._
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, from_json, to_json}
// Creating case classes with the schema of your json objects. We're making
// these to make use of strongly typed Datasets. Notice that the MyChgClass has
// each field as an Option: this will enable us to choose between "chg" and
// "before"
case class MyChgClass(b: Option[String], c: Option[String], d: Option[String])
case class MyFullClass(k: Int, b: String, c: String, d: String)
case class MyEndClass(id: Int, after: MyFullClass)
// Creating schemas for the from_json function
val chgSchema = Encoders.product[MyChgClass].schema
val beforeSchema = Encoders.product[MyFullClass].schema
// Your dataframe from the example
val df = Seq(
(1, """{"b": "new", "c": "new"}""", """{"k": 1, "b": "old", "c": "old", "d": "old"}""" ),
(2, """{"b": "new", "d": "new"}""", """{"k": 2, "b": "old", "c": "old", "d": "old"}""" )
).toDF("id", "chg", "before")
// Parsing the json string into our case classes and finishing by creating a
// strongly typed dataset with the .as[] method
val parsedDf = df
  .withColumn("parsedChg", from_json(col("chg"), chgSchema))
  .withColumn("parsedBefore", from_json(col("before"), beforeSchema))
  .drop("chg")
  .drop("before")
  .as[(Int, MyChgClass, MyFullClass)]
// Mapping over our dataset with a lot of control of exactly what we want. Since
// the "chg" fields are options, we can use the getOrElse method to choose
// between either the "chg" field or the "before" field
val output = parsedDf.map {
  case (id, chg, before) =>
    MyEndClass(id, MyFullClass(
      before.k,
      chg.b.getOrElse(before.b),
      chg.c.getOrElse(before.c),
      chg.d.getOrElse(before.d)
    ))
}
output.show(false)
+---+------------------+
|id |after             |
+---+------------------+
|1 |[1, new, new, old]|
|2 |[2, new, old, new]|
+---+------------------+
So there you have it, the output dataset contains the values you want. Since you asked for it, we can turn it into a json string again (Opinion incoming: this is less useful within Spark IMO, so unless you absolutely need to output this json somewhere I would not do this) by doing the following:
output.select($"id", to_json($"after")).show(false)
+---+-------------------------------------+
|id |structstojson(after)                 |
+---+-------------------------------------+
|1 |{"k":1,"b":"new","c":"new","d":"old"}|
|2 |{"k":2,"b":"new","c":"old","d":"new"}|
+---+-------------------------------------+
Hope this helps!


Count elements in nested JSON with jq

I am trying to count all elements in a nested JSON document with jq.
Given the following JSON document:
{"a": true, "b": [1, 2], "c": {"a": {"aa":1, "bb": 2}, "b": "blue"}}
I want to calculate the result 6.
In order to do this, I tried the following:
echo '{"a": true, "b": [1, 2], "c": {"a": {"aa":1, "bb": 2}, "b": "blue"}}' \
| jq 'reduce (.. | if (type == "object" or type == "array")
then length else 0 end) as $counts
(1; . + $counts)'
# Actual output: 10
# Desired output: 6
However, this counts the encountered objects and arrays as well and therefore yields 10 as opposed to the desired output of 6.
So, how can I only count the document's elements/leaf-nodes?
Thanks in advance for your help!
Edit: What would be an efficient approach to count empty arrays and objects as well?
You can use the scalars filter to find leaf nodes. Scalars are all "simple" JSON values, i.e. null, true, false, numbers and strings. Alternatively you can compare the type of each item and use length to determine if an object or array has children.
I've expanded your input data a little to distinguish a few more corner cases:
Input:
{
  "a": true,
  "b": [1, 2],
  "c": {
    "a": {
      "aa": 1,
      "bb": 2
    },
    "b": "blue"
  },
  "d": [],
  "e": [[], []],
  "f": {}
}
This has 15 JSON entities:
5 of them are arrays or objects with children.
4 of them are empty arrays or objects.
6 of them are scalars.
Depending on what you're trying to do, you might consider only scalars to be "leaf nodes", or you might consider both scalars and empty arrays and objects to be leaf nodes.
Here's a filter that counts scalars:
[..|scalars]|length
Output:
6
And here's a filter that counts all entities which have no children. It just checks for all the scalar types explicitly (there are only six possible types for a JSON value) and if it's not one of those it must be an array or object, where we can check how many children it has with length.
[
  .. |
  select(
    (type | IN("boolean","number","string","null")) or
    length == 0
  )
] |
length
Output:
10

Error while reading JSON file in chunksizes with python

I have a large json file, so I want to read the file in chunks while testing. I have implemented the code below:
if fpath.endswith('.json'):
    with open(fpath, 'r') as f:
        read_query = pd.read_json(f, lines=True, chunksize=100)
        for chunk in read_query:
            print(chunk)
I get the error:
File "nameoffile.py", line 168, in read_queries_func
for chunk in read_query:
File "C:\Users\Me\Python38\lib\site-packages\pandas\io\json\_json.py", line 798, in __next__
obj = self._get_object_parser(lines_json)
File "C:\Users\Me\Python38\lib\site-packages\pandas\io\json\_json.py", line 770, in _get_object_parser
obj = FrameParser(json, **kwargs).parse()
File "C:\Users\Me\Python38\lib\site-packages\pandas\io\json\_json.py", line 885, in parse
self._parse_no_numpy()
File "C:\Users\Me\Python38\lib\site-packages\pandas\io\json\_json.py", line 1159, in _parse_no_numpy
loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value
Why am I getting an error?
The JSON file looks like this:
[
  {
    "a": "13",
    "b": "55"
  },
  {
    "a": "15",
    "b": "16"
  },
  {
    "a": "18",
    "b": "45"
  },
  {
    "a": "1650",
    "b": "26"
  },
  .
  .
  .
  {
    "a": "214",
    "b": "23"
  }
]
Also, is there a way to extract just the 'a' attribute's values while reading the file? Or can that only be done after I've read the file?
Your JSON file contains just one top-level JSON value (a single array), not one object per line. As per the line-delimited json doc to which the doc of the chunksize argument points:
pandas is able to read and write line-delimited json files that are common in data processing pipelines using Hadoop or Spark.
For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time. This can be useful for large files or to read from a stream.
It also implies that lines=True, and the doc for lines says:
Read the file as a json object per line.
This means that files like this work:
{"a": 1, "b": 2}
{"a": 3, "b": 4}
{"a": 5, "b": 6}
{"a": 7, "b": 8}
{"a": 9, "b": 10}
These don’t:
[
{"a": 1, "b": 2},
{"a": 3, "b": 4},
{"a": 5, "b": 6},
{"a": 7, "b": 8},
{"a": 9, "b": 10}
]
So you have to read the file in one go, or convert it so that it has one object per line.
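For example, here is a minimal sketch of both options (reusing fpath from your snippet; the data.ndjson output name is just an example), which also shows how to pull out only the 'a' values once the data is loaded:
import json
import pandas as pd

# Option 1: read the whole top-level array in one go (no lines=True, no chunksize).
df = pd.read_json(fpath)
a_values = df['a'].tolist()  # extract just the 'a' attribute's values afterwards

# Option 2: rewrite the file once as line-delimited JSON, then chunked reads work.
with open(fpath) as f:
    records = json.load(f)
with open('data.ndjson', 'w') as out:
    for record in records:
        out.write(json.dumps(record) + '\n')

for chunk in pd.read_json('data.ndjson', lines=True, chunksize=100):
    print(chunk['a'])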

How can I emit delimited text (like CSV) from Jq?

When using Jq for data processing, it's often more convenient to emit the processed text in some kind of "delimited" form that other CLI tools can consume, such as Awk, Cut, and the read builtin in Bash.
Is there a straightforward way to achieve this?
Sample data:
[
{"a": 11, "b": 12, "c": 13},
{"a": 21, "b": 22, "c": 23},
{"a": 31, "b": 32, "c": 33},
{"a": 41, "b": 42, "c": 43}
]
Desired output:
a,c
11,13
21,23
31,33
41,43
jq --raw-output 'map({ a, c }) | ( .[0] | keys_unsorted), (.[] | [.[]]) | @csv'
Will produce:
"a","c"
11,13
21,23
31,33
41,43
If you can assume that the attribute names are the same in all array elements, you can use the @csv formatter along with --raw-output:
Put this in a script like json-records-to-csv.jq, adjusting the shebang as needed:
#!/usr/bin/jq --raw-output -f
# Like `keys`; extracts object values as an array.
def values:
  to_entries | map(.value)
;

# Get the column names from the first array element's keys
(.[0] | keys | @csv)
,
# Get the values from every array element's values
(.[] | values | @csv)
Usage example:
json-records-to-csv.jq <<'JSON'
[
{"a": 11, "b": 12, "c": 13},
{"a": 21, "b": 22, "c": 23},
{"a": 31, "b": 32, "c": 33},
{"a": 41, "b": 42, "c": 43}
]
JSON
Output:
"a","b","c"
11,12,13
21,22,23
31,32,33
41,42,43

JQ: how can I remove keys based on regex?

I would like to remove all keys that start with "hide". Important to note that the keys may be nested at many levels. I'd like to see the answer using a regex, although I recognise that in my example a simple contains would suffice. (I don't know how to do this with contains, either, BTW.)
Input JSON 1:
{
  "a": 1,
  "b": 2,
  "hideA": 3,
  "c": {
    "d": 4,
    "hide4": 5
  }
}
Desired output JSON:
{
  "a": 1,
  "b": 2,
  "c": {
    "d": 4
  }
}
Input JSON 2:
{
  "a": 1,
  "b": 2,
  "hideA": 3,
  "c": {
    "d": 4,
    "hide4": 5
  },
  "e": null,
  "f": "hiya",
  "g": false,
  "h": [{
    "i": 343.232,
    "hide9": "private",
    "so_smart": true
  }]
}
Thanks!
Since you're just checking the start of the keys, you could use startswith/1 instead in this case, otherwise you could use test/1 or test/2. Then you could pass those paths to be removed to delpaths/1.
You might want to filter the keys with the strings filter (or convert them to strings) beforehand to account for array indices in your tree.
delpaths([paths | select(.[-1] | strings | startswith("hide"))])
delpaths([paths | select(.[-1] | strings | test("^hide"; "i"))])
A straightforward approach to the problem is to use walk in conjunction with with_entries, e.g.
walk(if type == "object"
then with_entries(select(.key | test("^hide") | not))
else . end)
If your jq does not have walk/1 simply include its def (available e.g. from https://raw.githubusercontent.com/stedolan/jq/master/src/builtin.jq) before invoking it.

Merge several json arrays in circe

Let's say we have two JSON arrays. How can I merge them into a single array with circe? Example:
Array 1:
[{"id": 1}, {"id": 2}, {"id": 3}]
Array 2:
[{"id": 4}, {"id": 5}, {"id": 6}]
Needed:
[{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}, {"id": 6}]
I've tried deepMerge, but it only keeps the contents of the argument, not of the calling object.
Suppose we've got the following set-up (I'm using circe-literal for convenience, but your Json values could come from anywhere):
import io.circe.Json, io.circe.literal._
val a1: Json = json"""[{"id": 1}, {"id": 2}, {"id": 3}]"""
val a2: Json = json"""[{"id": 4}, {"id": 5}, {"id": 6}]"""
Now we can combine them like this:
for { a1s <- a1.asArray; a2s <- a2.asArray } yield Json.fromValues(a1s ++ a2s)
Or:
import cats.std.option._, cats.syntax.cartesian._
(a1.asArray |@| a2.asArray).map(_ ++ _).map(Json.fromValues)
Both of these approaches are going to give you an Option[Json] that will be None if either a1 or a2 doesn't represent a JSON array. It's up to you to decide what you want to happen in that situation; .getOrElse(a2) or .getOrElse(a1.deepMerge(a2)) might be reasonable choices, for example.
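For example, reusing a1 and a2 from the set-up above, here is a minimal sketch of the first approach with a deepMerge fallback (assuming that fallback is what you want when one side isn't an array):
// Concatenate when both values are JSON arrays; otherwise fall back to a deep merge.
val merged: Json =
  (for { a1s <- a1.asArray; a2s <- a2.asArray } yield Json.fromValues(a1s ++ a2s))
    .getOrElse(a1.deepMerge(a2))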
As a side note, the current contract of deepMerge says the following:
Null, Array, Boolean, String and Number are treated as values, and values from the argument JSON completely replace values from this JSON.
This isn't set in stone, though, and it might not be unreasonable to have deepMerge concatenate JSON arrays—if you want to open an issue we can do some more thinking about it.