This is an extension to "In Python, how to concisely get nested values in json data?".
I have data loaded from JSON and am trying to replace arbitrary nested values using a list as input, where the list corresponds to the names of successive children. I want a function replace_value(data,lookup,value) that replaces the value in the data by treating each entry in lookup as a nested child.
Here is the structure of what I'm trying to do:
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
def replace_value(data,lookup,value):
    # DEFINITION
lookup = ['alldata','TimeSeries','rates']
replace_value(json_data,lookup,[2,3])
# The following should return [2,3]
print(json_data['alldata']['TimeSeries']['rates'])
I was able to make a start with get_value(), but am stumped about how to do the replacement. I'm not fixed to this code structure, but I want to be able to programmatically replace a value in the data given the list of successive children and the value to replace.
Note: it is possible that lookup can be of length 1
Follow the lookups until we're at the second-to-last one, then assign the value under the final lookup key in the current object:
def get_value(data, lookup):  # Or whatever definition you like
    res = data
    for item in lookup:
        res = res[item]
    return res
def replace_value(data, lookup, value):
    obj = get_value(data, lookup[:-1])
    obj[lookup[-1]] = value
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
lookup = ['alldata','TimeSeries','rates']
replace_value(json_data,lookup,[2,3])
print(json_data['alldata']['TimeSeries']['rates']) # [2, 3]
If you're worried about the list copy lookup[:-1], you can replace it with an iterator slice:
from itertools import islice

def replace_value(data, lookup, value):
    it = iter(lookup)
    prefix = islice(it, len(lookup) - 1)  # everything except the last lookup, lazily
    obj = get_value(data, prefix)
    final = next(it)
    obj[final] = value
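As a quick check that this also covers the length-1 lookup mentioned in the question (here the whole 'alldata' entry is replaced, purely as an illustration):

json_data = {'alldata': {'name': 'CAD/USD'}}
replace_value(json_data, ['alldata'], 'gone')
print(json_data)  # {'alldata': 'gone'}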
You can obtain the parent of the final sub-dict first, so that you can use that reference to alter the value stored under the final key:
def replace_value(data, lookup, replacement):
    *parents, key = lookup
    for parent in parents:
        data = data[parent]
    data[key] = replacement
so that:
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
lookup = ['alldata','TimeSeries','rates']
replace_value(json_data,lookup,[2,3])
print(json_data['alldata']['TimeSeries']['rates'])
outputs:
[2, 3]
Once you have get_value, replacement is a one-liner:
get_value(json_data, lookup[:-1])[lookup[-1]] = value
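For instance, with the same json_data and lookup as above:

value = [2, 3]
get_value(json_data, lookup[:-1])[lookup[-1]] = value
print(json_data['alldata']['TimeSeries']['rates'])  # [2, 3]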
Related
I have datasets with identical schemas stored in folders that denote an id for the dataset, e.g.:
\11111\dataset
\11112\dataset
where '11111' etc. indicates the dataset id. I am trying to write a transform in a code repository to loop through the datasets and append them all together. The following code works for this:
from transforms.api import transform_df, Input, Output

def create_outputs(dataset_ids):
    transforms = []
    for id in dataset_ids:
        @transform_df(
            Output(output_path + "/appended_dataset"),
            input_path=Input(input_path + id + "/dataset"),
        )
        def compute(input_path):
            return input_path
        transforms.append(compute)
    return transforms
id_list = ['11111','11112']
TRANSFORMS = create_outputs(id_list)
However, rather than having the ids hardcoded in id_list, I would like to have a separate dataset that holds the dataset ids that need to be appended. I am having difficulty getting something that works.
I have tried the following code, where the id_list_dataset holds the ids to be included in the append:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import types as T

# input dataset
id_list_dataset = ["ri.foundry.main.dataset.abcdefg"]

schema = T.StructType([
    T.StructField('ID', T.StringType())
])

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(id_list_dataset)
sqlContext = SQLContext(sc)

# define dataframe
temp_df = sqlContext.createDataFrame(rdd, schema)

# get list of ID's
id_list = temp_df.select('ID').collect

TRANSFORMS = create_outputs(id_list)
However, this is giving the following error:
TypeError: 'method' object is not iterable
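The error most likely comes from referencing collect without calling it, so create_outputs ends up trying to iterate over the bound method itself. A minimal sketch of the fix, reusing temp_df from above and extracting the ID field from each Row:

# call collect() and pull the ID field out of each Row so that
# create_outputs receives a plain list of id strings
id_list = [row.ID for row in temp_df.select('ID').collect()]
TRANSFORMS = create_outputs(id_list)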
Goal
I've got some complex JSON data with nested data in it which I am retrieving from an API I'm working with. In order to pull out the specific values I care about, I've created a function that will pull out all the values for a specific key that I can define. This works well for retrieving the values in a list; however, I am running into an issue where I need to return multiple values and associate them with one another so that I can get each result into a row in a CSV file. Currently the code just returns separate arrays for each key. How would I go about associating them with each other? I've messed with the zip function in Python but can't seem to get it working properly. I sincerely appreciate any input you can give me.
Extract Function
def json_extract(obj, key):
    """Recursively fetch values from nested JSON."""
    arr = []

    def extract(obj, arr, key):
        """Recursively search for values of key in JSON tree."""
        if isinstance(obj, dict):
            for k, v in obj.items():
                if isinstance(v, (dict, list)):
                    extract(v, arr, key)
                elif k == key:
                    arr.append(v)
        elif isinstance(obj, list):
            for item in obj:
                extract(item, arr, key)
        return arr

    values = extract(obj, arr, key)
    return values
Main.py
res = requests.get(prod_url, headers=prod_headers, params=payload)
record_id = json_extract(res.json(), 'record_id')
status = json_extract(res.json(), 'status')
The solution was simple: just use the zip function, e.g. zip(record_id, status).
I had a syntax error that was preventing it from working before.
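For completeness, a minimal sketch of pairing the two lists and writing them out as CSV rows (the file name output.csv is arbitrary):

import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['record_id', 'status'])      # header row
    for rec_id, stat in zip(record_id, status):   # pair values positionally
        writer.writerow([rec_id, stat])

Note that zip pairs the values purely by position, so this only lines up correctly if each record in the response contributes exactly one record_id and one status.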
I have a JSON object, data, I need to modify.
Right now, I am modifying the object as follows:
data['Foo']['Bar'] = 'ExampleString'
Is it possible to use a string variable to do the indexing?
s = 'Foo/Bar'
data[s.split('/')] = 'ExampleString'
The above code does not work.
How can I achieve the behavior I am after?
NB: I am looking for a solution which supports an arbitrary number of key "levels"; for instance, the string variable may be Foo/Bar/Baz or Foo/Bar/Baz/Foo/Bar, which would correspond to data['Foo']['Bar']['Baz'] and data['Foo']['Bar']['Baz']['Foo']['Bar'] respectively.
Without completely changing the data class you want to use, this might be easiest:
def jsonSetPath(jobj, path, item):
    prev = None
    y = jobj
    for x in path.split('/'):
        prev = y
        y = y[x]
    prev[x] = item
A small Python wrapper that descends iteratively into the object. Then you can call it normally:
jsonSetPath(data, 'foo/obj', 3)
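For example, with the Foo/Bar structure from the question:

data = {'Foo': {'Bar': 'old'}}
jsonSetPath(data, 'Foo/Bar', 'ExampleString')
print(data)  # {'Foo': {'Bar': 'ExampleString'}}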
You can add this functionality to your dictionary by inheriting from dict if you prefer:
class JsonDict(dict):
    def __getitem__(self, path):
        # We only accept strings in this dictionary
        y = self
        for x in path.split('/'):
            y = dict.get(y, x)
        return y

    def __setitem__(self, path, item):
        # We only accept strings in this dictionary
        y = self
        prev = None
        for x in path.split('/'):
            prev = y
            y = dict.get(y, x)
        prev[x] = item
Note: using UserDict from collections may be advisable, but it seems too much of a hassle without converting all the inner dictionaries to user dictionaries. Now you can wrap your data (data = JsonDict(data)) and use it as you wanted. If you want to use non-strings as keys, you need to handle that (though I am not sure that makes sense in this specific dictionary implementation).
Note that only the "outer" dictionary is your custom dictionary. If the use case is more advanced, you would need to convert all the inner ones as well, and then you might as well use UserDict.
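A quick usage sketch of the subclass, using the Foo/Bar structure from the question:

data = JsonDict({'Foo': {'Bar': 'old'}})
data['Foo/Bar'] = 'ExampleString'
print(data['Foo/Bar'])  # ExampleString
print(data)             # {'Foo': {'Bar': 'ExampleString'}}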
A very naive solution to get you going in the right direction.
You still need to add error handling: for example, what happens if somewhere along the path a key is missing? You can either bail out or add a new dict on the fly (a sketch of that variant follows the example output below).
def update(path, d, value):
    for nested_key in path.split('/'):
        temp = d[nested_key]
        if isinstance(temp, dict):
            d = d[nested_key]
    d[nested_key] = value
one_level_path = 'Foo/Bar'
one_level_dict = {'Foo': {'Bar': None}}
print(one_level_dict)
update(one_level_path, one_level_dict, 1)
print(one_level_dict)
two_level_path = 'Foo/Bar/Baz'
two_level_dict = {'Foo': {'Bar': {'Baz': None}}}
print(two_level_dict)
update(two_level_path, two_level_dict, 1)
print(two_level_dict)
Outputs
{'Foo': {'Bar': None}}
{'Foo': {'Bar': 1}}
{'Foo': {'Bar': {'Baz': None}}}
{'Foo': {'Bar': {'Baz': 1}}}
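As a sketch of the "add a new dict on the fly" variant mentioned above (the helper name update_or_create is just for illustration):

def update_or_create(path, d, value):
    keys = path.split('/')
    for nested_key in keys[:-1]:
        # create intermediate dictionaries when a key is missing
        d = d.setdefault(nested_key, {})
    d[keys[-1]] = value

missing_level_dict = {'Foo': {}}
update_or_create('Foo/Bar/Baz', missing_level_dict, 1)
print(missing_level_dict)  # {'Foo': {'Bar': {'Baz': 1}}}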
Using recursion:
x = {'foo': {'in': {'inner': 9}}}
path = "foo/in/inner"

def setVal(obj, pathList, val):
    if len(pathList) == 1:
        obj[pathList[0]] = val
    else:
        return setVal(obj[pathList[0]], pathList[1:], val)

print(x)
setVal(x, path.split('/'), 10)
print(x)
I have data loaded from JSON and am trying to extract arbitrary nested values using a list as input, where the list corresponds to the names of successive children. I want a function get_value(data,lookup) that returns the value from data by treating each entry in lookup as a nested child.
In the example below, when lookup=['alldata','TimeSeries','rates'], the return value should be [1.3241,1.3233].
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
def get_value(data, lookup):
    res = data
    for item in lookup:
        res = res[item]
    return res
lookup = ['alldata','TimeSeries','rates']
get_value(json_data,lookup)
My example works, but there are two problems:
It's inefficient: in my for loop, I copy the whole TimeSeries object to res, only to then replace it with the rates list. (As @Andrej Kesely explained, res is just a reference at each iteration, so the data isn't actually being copied.)
It's not concise: I was hoping to find a concise (e.g. one- or two-line) way of extracting the data using something like list comprehension syntax.
If you want a one-liner and you are using Python 3.8+, you can use an assignment expression (the "walrus operator"):
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
def get_value(data, lookup):
    return [data := data[item] for item in lookup][-1]
lookup = ['alldata','TimeSeries','rates']
print( get_value(json_data,lookup) )
Prints:
[1.3241, 1.3233]
I don't think you can do it without a loop, but you could use a reducer here to increase readability.
functools.reduce(dict.get, lookup, json_data)
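For example, with the json_data and lookup defined above:

import functools

print(functools.reduce(dict.get, lookup, json_data))  # [1.3241, 1.3233]

operator.getitem also works here and raises a KeyError when a key is missing, rather than silently returning None as dict.get does.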
I have a JDBC connection between Apache Spark and PostgreSQL and I want to insert some data into my database. When I use append mode I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?
Scala:
If all you need is unique numbers you can use zipWithUniqueId and recreate DataFrame. First some imports and dummy data:
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(
  ("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")
Extract schema for further usage:
val schema = df.schema
Add id field:
val rows = df.rdd.zipWithUniqueId.map{
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
Create DataFrame:
val dfWithPK = sqlContext.createDataFrame(
  rows, StructType(StructField("id", LongType, false) +: schema.fields))
The same thing in Python:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType

row = Row("foo", "bar")
df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()
row_with_index = Row(*["id"] + df.columns)

def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row

f = make_row(df.columns)

df_with_pk = (df.rdd
              .zipWithUniqueId()
              .map(lambda x: f(*x))
              .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
If you prefer consecutive numbers, you can replace zipWithUniqueId with zipWithIndex, but it is a little bit more expensive.
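Either way, since the original question is about appending to PostgreSQL over JDBC, here is a minimal sketch of writing df_with_pk back out with the generated id column (the connection details below are placeholders, not taken from the question):

# placeholder connection details; adjust to your environment
(df_with_pk.write
    .jdbc(url="jdbc:postgresql://localhost:5432/testdb",
          table="my_table",
          mode="append",
          properties={"user": "user",
                      "password": "password",
                      "driver": "org.postgresql.Driver"}))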
Directly with DataFrame API:
(universal Scala, Python, Java, R with pretty much the same syntax)
Previously I'd missed the monotonicallyIncreasingId function, which should work just fine as long as you don't require consecutive numbers:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar| id|
// +---+----+-----------+
// | a|-1.0|17179869184|
// | b|-2.0|42949672960|
// | c|-3.0|60129542144|
// +---+----+-----------+
While useful, monotonicallyIncreasingId is non-deterministic. Not only may the ids differ from execution to execution, but without additional tricks they cannot be used to identify rows when subsequent operations contain filters.
Note:
It is also possible to use rowNumber window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()
Unfortunately:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
So unless you have a natural way to partition your data and ensure uniqueness, it is not particularly useful at this moment.
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("id", monotonically_increasing_id()).show()
Note that the second argument of df.withColumn is monotonically_increasing_id(), not monotonically_increasing_id.
I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those desiring consecutive integers.
In this case, we're using pyspark and relying on dictionary comprehension to map the original row object to a new dictionary which fits a new schema including the unique index.
# read the initial dataframe without index
dfNoIndex = sqlContext.read.parquet(dataframePath)

# Need to zip together with a unique integer
# First create a new schema with uuid field appended
newSchema = StructType([StructField("uuid", IntegerType(), False)]
                       + dfNoIndex.schema.fields)

# zip with the index, map it to a dictionary which includes the new field
df = dfNoIndex.rdd.zipWithIndex()\
    .map(lambda pair: {k: v
                       for k, v
                       in list(pair[0].asDict().items()) + [("uuid", pair[1])]})\
    .toDF(newSchema)
For anyone else who doesn't require integer types, concatenating the values of several columns whose combinations are unique across the data can be a simple alternative. You have to handle nulls since concat/concat_ws won't do that for you. You can also hash the output if the concatenated values are long:
import pyspark.sql.functions as sf

unique_id_sub_cols = ["a", "b", "c"]

df = df.withColumn(
    "UniqueId",
    sf.md5(
        sf.concat_ws(
            "-",
            *[
                sf.when(sf.col(sub_col).isNull(), sf.lit("Missing")).otherwise(
                    sf.col(sub_col)
                )
                for sub_col in unique_id_sub_cols
            ]
        )
    ),
)
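A quick sanity check on a toy DataFrame (spark is assumed to be an active SparkSession, and the column names match unique_id_sub_cols above):

# toy data with a null to exercise the "Missing" branch
toy_df = spark.createDataFrame([("x", None, "z"), ("x", "y", "z")], ["a", "b", "c"])
toy_df = toy_df.withColumn(
    "UniqueId",
    sf.md5(
        sf.concat_ws(
            "-",
            *[
                sf.when(sf.col(c).isNull(), sf.lit("Missing")).otherwise(sf.col(c))
                for c in ["a", "b", "c"]
            ]
        )
    ),
)
toy_df.show(truncate=False)  # two rows with two distinct 32-character hex hashes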