How to use values in a dataset to transform another dataset in code repositories - palantir-foundry

I have datasets with identical schemas stored in folders that denote an id for the dataset, e.g.:
\11111\dataset
\11112\dataset
where '11111' etc. indicates the dataset ID. I am trying to write a transform in a code repository that loops through the datasets and appends them all together. The following code works for this:
def create_outputs(dataset_ids):
    transforms = []
    for id in dataset_ids:
        @transform_df(
            Output(output_path + "/appended_dataset"),
            input_path=Input(input_path + id + "/dataset"),
        )
        def compute(input_path):
            return input_path
        transforms.append(compute)
    return transforms
id_list = ['11111','11112']
TRANSFORMS = create_outputs(id_list)
However, rather than having the IDs hardcoded in id_list, I would like to have a separate dataset that holds the dataset IDs that need to be appended. I am having difficulty getting something that works.
I have tried the following code, where id_list_dataset holds the IDs to be included in the append:
# input dataset
id_list_dataset = ["ri.foundry.main.dataset.abcdefg"]
schema = T.StructType([
    T.StructField('ID', T.StringType())
])
sc = SparkContext.getOrCreate()
rdd = sc.parallelize(id_list_dataset)
sqlContext = SQLContext(sc)
# define dataframe
temp_df = sqlContext.createDataFrame(rdd, schema)
# get list of ID's
id_list = temp_df.select('ID').collect
TRANSFORMS = create_outputs(id_list)
However, this is giving the following error:
TypeError: 'method' object is not iterable
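The immediate cause of that TypeError is most likely the missing parentheses on collect: temp_df.select('ID').collect is a reference to the method object itself (which is not iterable) rather than the list of rows it would return. Note also that collect() yields Row objects, not plain strings, so the ID field still has to be pulled out of each row. A minimal sketch of that part of the fix, reusing temp_df and create_outputs from the question and assuming the rest of the transform setup stays the same:

# get list of IDs; collect() returns Row objects, so extract the ID field
id_list = [row.ID for row in temp_df.select('ID').collect()]

TRANSFORMS = create_outputs(id_list)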

Related

editing (xarray) object in a function

I'm trying to assign extra variables to an existing dataset "ds" using a function which takes the dataset as an argument and should return the adjusted dataset as well:
def Assign_Variable(dataset):
    dataset = dataset.assign(new_var = dataset.x + dataset.y)
    # (and some more manipulations)
    return dataset
and then use this function to loop through some datasets to manipulate them:
for dataset in [ds]:
    dataset = Assign_Variable(dataset)
Yet, if I now check my dataset ds, the function is not doing anything.
How can I adjust my datasets in a function, and return them?
I think what you are looking for can be done by treating the dataset like a dictionary (e.g. dataset['new_var'] = ...).
As an example I have some code that calculates windspeed from two wind components u and v and saves the output windspeed (ws) in the existing dataset (ds):
import xarray as xr
import numpy as np

def windspeed(dataset):
    dataset['ws'] = np.sqrt(dataset['u']**2 + dataset['v']**2)
    return dataset

# Create sample data
lon = np.arange(129.4, 153.75+0.05, 0.25)
lat = np.arange(-43.75, -10.1+0.05, 0.25)
data = 10 * np.random.rand(len(lat), len(lon))
ds = xr.Dataset({"u": (["lat", "lon"], data), "v": (["lat", "lon"], data)}, coords={"lon": lon, "lat": lat})

# Calc wind speed
ds = windspeed(ds)

# Print output
print(ds)
The output dataset now contains the three variables u, v and ws.
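For what it's worth, the reason the original loop appears to do nothing is that assign() returns a new Dataset and the for loop only rebinds the local name dataset, so ds itself is never touched, whereas the dictionary-style assignment above mutates the existing object in place. A minimal sketch of the difference, using a small made-up dataset with x and y variables so the question's Assign_Variable works:

import numpy as np
import xarray as xr

def Assign_Variable(dataset):
    # .assign() returns a NEW Dataset; the original is left untouched
    return dataset.assign(new_var=dataset.x + dataset.y)

ds = xr.Dataset({"x": ("i", np.arange(3.0)), "y": ("i", np.ones(3))})

# Rebinding the loop variable does not modify ds:
for dataset in [ds]:
    dataset = Assign_Variable(dataset)  # only the local name is rebound
print("new_var" in ds)  # False

# Capture the returned object instead:
ds = Assign_Variable(ds)
print("new_var" in ds)  # True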

SQLAlchemy query db with filter for all tables

I have SQLAlchemy models on top of a MySQL db. I need to query almost all models (their string or text fields) and find everything that contains a specific substring, and also apply common filtering like object_type=type1. For example:
class Model1(Model):
    name = Column(String(100), nullable=False, unique=True)
    version = Column(String(100))
    description = Column(String(100))
    updated_at = Column(TIMESTAMP(timezone=True))
    # other fields

class Model2(Model):
    name = Column(String(100), nullable=False, unique=True)
    version = Column(String(100))
    description = Column(String(100))
    updated_at = Column(TIMESTAMP(timezone=True))
    # other fields

class Model3(Model):
    name = Column(String(100), nullable=False, unique=True)
    version = Column(String(100))
    description = Column(String(100))
    updated_at = Column(TIMESTAMP(timezone=True))
    # other fields
And then run a query something like:
db.query(
    Model1.any_of_all_columns.contains('sub_string') or
    Model2.any_of_all_columns.contains('sub_string') or
    Model3.any_of_all_columns.contains('sub_string')
).all()
Is it possible to build such an ORM query as a single SQL statement to the db and dynamically add model (table) names and columns?
To apply common filtering across all the columns, you can subscribe to SQLAlchemy events as follows:
@event.listens_for(Query, "before_compile", retval=True)
def before_compile(query):
    for ent in query.column_descriptions:
        entity = ent['entity']
        if entity is None:
            continue
        inspect_entity_for_mapper = inspect(ent['entity'])
        mapper = getattr(inspect_entity_for_mapper, 'mapper', None)
        if mapper and has_tenant_id:
            query = query.enable_assertions(False).filter(
                ent['entity'].object == object)
    return query
This function will be called whenever you run Model.query() and will add the filter for your object.
I eventually gave up and did one big loop in which I make a separate request for each model:
from sqlalchemy import or_

def db_search(self, model, q, object_ids=None, status=None, m2m_ids=None):
    """
    Build a query to the db for the given model using the 'q' search substring
    and filter it by object ids, status and the m2m related model.

    :param model: a model object whose columns will be used for the search.
    :param q: the query substring we are trying to find in all
        string/text columns of the model.
    :param object_ids: list of ids we want to include in the search.
        If the list is empty, the search query will return 0 results.
        If object_ids is None, we will ignore this filter.
    :param status: name of object status.
    :param m2m_ids: list of many-to-many related object ids.
    :return: sqlalchemy query result.
    """
    # Filter out private columns and non-string/text columns
    string_text_columns = [
        column.name for column in model.__table__.columns
        if isinstance(column.type, (db.String, db.Text))
        and column.name not in PRIVATE_COLUMN_NAMES
    ]
    # Find only enum ForeignKey columns
    foreign_key_columns = [
        column.name for column in model.__table__.columns
        if column.name.endswith("_id") and column.name in ENUM_OBJECTS
    ]
    query_result = model.query
    # Search in all string/text columns for the required query
    # as % LIKE %
    if q:
        query_result = query_result.join(
            # Join related enum tables to be able to search in them
            *[enum_tables_to_model_map[col]["model_name"]
              for col in foreign_key_columns]
        ).filter(
            or_(
                # Search 'q' substring in all string/text columns
                *[
                    getattr(model, col_name).like(f"%{q}%")
                    for col_name in string_text_columns
                ],
                # Search 'q' substring in the enum tables
                *[
                    enum_tables_to_model_map[col]["model_field"]
                    .like(f"%{q}%") for col in foreign_key_columns
                ]
            )
        )
    # Apply the object ids filter if it is not None.
    # If the filter exists but is empty, we should return an empty result
    if object_ids is not None:
        query_result = query_result.filter(model.id.in_(object_ids))
    # Apply the status filter if given and if the model has the status column
    if status and 'status_id' in model.__table__.columns:
        query_result = query_result.filter(model.status_id == status.id)
    if m2m_ids:
        query_result = query_result.filter(
            model.labels.any(Label.id.in_(m2m_ids)))
    return query_result.all()
And call it:
result = {}
for model in db.Model._decl_class_registry.values():
    # Search only in the public tables.
    # A sqlalchemy.ext.declarative.clsregistry._ModuleMarker object is also
    # located in the _decl_class_registry, which is why we check the
    # instance type and whether it is a subclass of db.Model
    if isinstance(model, type) and issubclass(model, db.Model) \
            and model.__name__ in PUBLIC_MODEL_NAMES:
        query_result = self.db_search(
            model, q, object_ids.get(model.__name__), status=status,
            m2m_ids=m2m_ids)
        result[model.__tablename__] = query_result
This is far from the best solution, but it works for me.

Why must use DataParallel when testing?

Train on the GPU, num_gpus is set to 1:
device_ids = list(range(num_gpus))
model = NestedUNet(opt.num_channel, 2).to(device)
model = nn.DataParallel(model, device_ids=device_ids)
Test on the CPU:
model = NestedUNet_Purn2(opt.num_channel, 2).to(dev)
device_ids = list(range(num_gpus))
model = torch.nn.DataParallel(model, device_ids=device_ids)
model_old = torch.load(path, map_location=dev)
pretrained_dict = model_old.state_dict()
model_dict = model.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)
This will get the correct result, but when I delete:
device_ids = list(range(num_gpus))
model = torch.nn.DataParallel(model, device_ids=device_ids)
the result is wrong.
nn.DataParallel wraps the model, where the actual model is assigned to the module attribute. That also means that the keys in the state dict have a module. prefix.
Let's look at a very simplified version with just one convolution to see the difference:
class NestedUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)

model = NestedUNet()
model.state_dict().keys() # => odict_keys(['conv1.weight', 'conv1.bias'])

# Wrap the model in DataParallel
model_dp = nn.DataParallel(model, device_ids=range(num_gpus))
model_dp.state_dict().keys() # => odict_keys(['module.conv1.weight', 'module.conv1.bias'])
The state dict you saved with nn.DataParallel does not line up with the regular model's state. You are merging the current state dict with the loaded state dict, which means that the loaded state is ignored, because the model does not have any attributes that match those keys, and you are left with the randomly initialised model.
To avoid making that mistake, you shouldn't merge the state dicts, but rather directly apply it to the model, in which case there will be an error if the keys don't match.
RuntimeError: Error(s) in loading state_dict for NestedUNet:
Missing key(s) in state_dict: "conv1.weight", "conv1.bias".
Unexpected key(s) in state_dict: "module.conv1.weight", "module.conv1.bias".
To make the state dict that you have saved compatible, you can strip off the module. prefix:
pretrained_dict = {key.replace("module.", ""): value for key, value in pretrained_dict.items()}
model.load_state_dict(pretrained_dict)
You can also avoid this issue in the future by unwrapping the model from nn.DataParallel before saving its state, i.e. saving model.module.state_dict(). So you can always load the model first with its state and then later decide to put it into nn.DataParallel if you wanted to use multiple GPUs.
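As a small illustration of that last point, here is a sketch of the save/load round trip without the module. prefix, reusing the model class and opt from the question (the checkpoint path is a placeholder):

import torch

# Save the unwrapped weights so the checkpoint keys have no 'module.' prefix
torch.save(model.module.state_dict(), "checkpoint.pth")  # hypothetical path

# Later: load into a plain model; wrap it in nn.DataParallel afterwards if needed
model = NestedUNet(opt.num_channel, 2)
model.load_state_dict(torch.load("checkpoint.pth", map_location="cpu"))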
You trained your model using DataParallel and saved it. So, the model weights were stored with a module. prefix. Now, when you load without DataParallel, you basically are not loading any model weights (the model has random weights). As a result, the model predictions are wrong.
I am giving an example.
model = nn.Linear(2, 4)
model = torch.nn.DataParallel(model, device_ids=device_ids)
model.state_dict().keys() # => odict_keys(['module.weight', 'module.bias'])
On the other hand,
another_model = nn.Linear(2, 4)
another_model.state_dict().keys() # => odict_keys(['weight', 'bias'])
See the difference in the OrderedDict keys.
So, in your code, the following three lines run but no model weights are loaded.
pretrained_dict = model_old.state_dict()
model_dict = model.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
Here, model_dict has keys without the module. prefix, but pretrained_dict does have them when you do not use DataParallel. So, after the filtering, pretrained_dict is essentially empty when DataParallel is not used.
Solution: If you want to avoid using DataParallel, you can load the saved model, create a new OrderedDict without the module. prefix, and load that back.
Something like the following would work for your case without using DataParallel.
# original saved file with DataParallel
model_old = torch.load(path, map_location=dev)
# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in model_old.state_dict().items():
    name = k[7:]  # remove `module.`
    new_state_dict[name] = v
# load params
model.load_state_dict(new_state_dict)

In Python, how to concisely replace nested values in json data?

This is an extension to In Python, how to concisely get nested values in json data?
I have data loaded from JSON and am trying to replace arbitrary nested values using a list as input, where the list corresponds to the names of successive children. I want a function replace_value(data,lookup,value) that replaces the value in the data by treating each entry in lookup as a nested child.
Here is the structure of what I'm trying to do:
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
def replace_value(data,lookup,value):
    DEFINITION

lookup = ['alldata','TimeSeries','rates']
replace_value(json_data,lookup,[2,3])
# The following should return [2,3]
print(json_data['alldata']['TimeSeries']['rates'])
I was able to make a start with get_value(), but am stumped about how to do the replacement. I'm not fixed to this code structure, but want to be able to programmatically replace a value in the data given the list of successive children and the value to replace.
Note: it is possible that lookup can be of length 1
Follow the lookups until we're second from the end, then assign the value to the last lookup in the current object:
def get_value(data, lookup):  # Or whatever definition you like
    res = data
    for item in lookup:
        res = res[item]
    return res

def replace_value(data, lookup, value):
    obj = get_value(data, lookup[:-1])
    obj[lookup[-1]] = value
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
lookup = ['alldata','TimeSeries','rates']
replace_value(json_data,lookup,[2,3])
print(json_data['alldata']['TimeSeries']['rates']) # [2, 3]
If you're worried about the list copy lookup[:-1], you can replace it with an iterator slice:
from itertools import islice
def replace_value(data, lookup, value):
    it = iter(lookup)
    slice = islice(it, len(lookup) - 1)
    obj = get_value(data, slice)
    final = next(it)
    obj[final] = value
You can obtain the parent of the final sub-dict first, so that you can use it to alter the value stored under the final key:
def replace_value(data, lookup, replacement):
    *parents, key = lookup
    for parent in parents:
        data = data[parent]
    data[key] = replacement
so that:
json_data = {'alldata':{'name':'CAD/USD','TimeSeries':{'dates':['2018-01-01','2018-01-02'],'rates':[1.3241,1.3233]}}}
lookup = ['alldata','TimeSeries','rates']
replace_value(json_data,lookup,[2,3])
print(json_data['alldata']['TimeSeries']['rates'])
outputs:
[2, 3]
Once you have get_value:
get_value(json_data, lookup[:-1])[lookup[-1]] = value

DynamoDB JSON response parsing prints vertically

I have a script that scans a DynamoDB table that stores my instance IDs. Then I query another table to see if it also has that same instance and get all of the metadata attributes into a master table. When I iterate through the query using the instance ID from the initial scan of the first table, each character of the instance ID string is printed on a new line, instead of the entire string on one line. I am confused about how to fix this. Below is my code, sample output, and the expected output.
CODE:
import boto3
import json
from boto3.dynamodb.conditions import Key, Attr

def table_diff():
    dynamo = boto3.client('dynamodb')
    dynamodb = boto3.resource('dynamodb')
    table_missing = dynamodb.Table('RunningInstances')
    missing_response = dynamo.scan(TableName='CWPMissingAgent')
    for instances in missing_response['Items']:
        instance_id = instances['missing_instances']['S']
        # This works how I want, prints i-xxxxx
        print(instance_id)
        for id in instance_id:
            # This does not print how I want (vertically)
            print(id)
            query_response = table_missing.query(KeyConditionExpression=Key('ID').eq(id))
OUTPUT:
i
-
x
x
x
x
x
EXPECTED OUTPUT:
i-xxxxx
etc etc
instance_id is a string. Thus, when you loop over it (for id in instance_id), you are actually looping over each character in the string, and printing them out individually.
Why do you try to loop over it, when you say that just printing it produces the correct result?
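In other words, the inner character loop can be dropped and the full instance_id passed to the query. A minimal sketch of that change, reusing the names from the question's code:

for instances in missing_response['Items']:
    instance_id = instances['missing_instances']['S']
    print(instance_id)  # prints the whole id, e.g. i-xxxxx
    # Query the second table with the full instance id, not one character at a time
    query_response = table_missing.query(
        KeyConditionExpression=Key('ID').eq(instance_id)
    )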