How can I get my dataset's name in a code repository? - palantir-foundry

When combining multiple datasets in Python in a code repository, I want to put the dataset name in the first column, but I couldn't figure out how to do it by accessing the dataset's path:
import os

@transform_df(
    Output("/folder/folder1/datasets/mydatset"),
    df1=Input("A"),
    df2=Input("B"),
)
def compute(df1, df2):
    print(list(filter(os.path.isfile, os.listdir())))
How can I get my dataset name from within a transform?

This is not possible using the @transform_df decorator. However, it is possible using the more powerful @transform decorator.
API documentation for @transform
Using @transform causes your function arguments to be of type TransformInput rather than DataFrames directly, and a TransformInput has a path property. Note that you will also need to read the input dataframes and write to the output dataset manually when using @transform.
For example:
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
# union_many comes from the transforms-verbs library available in Foundry repositories.
from transforms.verbs.dataframes import union_many


@transform(
    out=Output("/path/to/my/output"),
    inp1=Input("/path/to/my/input1"),
    inp2=Input("/path/to/my/input2"),
)
def compute(out, inp1, inp2):
    # Add columns containing dataset paths.
    df1 = inp1.dataframe().withColumn("dataset_path", F.lit(inp1.path))
    df2 = inp2.dataframe().withColumn("dataset_path", F.lit(inp2.path))
    # For example.
    result = union_many(df1, df2, how="strict")
    # Write the output manually.
    out.write_dataframe(result)
However, note that a dataset's path is an unstable identifier. If someone were to move or rename these inputs, it could cause unintended behaviour in your pipeline.
For this reason, for a production pipeline I would generally recommend using a more stable identifier: either a manually chosen, hard-coded one (in which case you can use @transform_df again):
@transform_df(
    Output("/path/to/my/output"),
    df1=Input("/path/to/my/input1"),
    df2=Input("/path/to/my/input2"),
)
def compute(df1, df2):
    df1 = df1.withColumn("input_dataset", F.lit("input_1"))
    df2 = df2.withColumn("input_dataset", F.lit("input_2"))
    # ...etc
or the dataset's RID, using inp1.rid instead of inp1.path.
Note that if you have a large number of inputs, all of these methods can be made neater using Python's **kwargs syntax and comprehensions:
# Using path or rid
@transform(
    out=Output("/path/to/my/output"),
    inp1=Input("/path/to/my/input1"),
    inp2=Input("/path/to/my/input2"),
    # and many more...
)
def compute(out, **inps):
    # Add columns containing dataset rids (or paths).
    dfs = [
        inp.dataframe().withColumn("dataset_rid", F.lit(inp.rid))
        for key, inp in inps.items()
    ]
    # For example
    result = union_many(*dfs, how="strict")
    out.write_dataframe(result)
# Using manual keys, we can reuse the argument names as keys.
@transform_df(
    Output("/path/to/my/output"),
    df1=Input("/path/to/my/input1"),
    df2=Input("/path/to/my/input2"),
    # and many more...
)
def compute(**dfs):
    # Add columns containing dataset keys.
    dfs = [
        df.withColumn("dataset_key", F.lit(key))
        for key, df in dfs.items()
    ]
    # For example
    return union_many(*dfs, how="strict")

Related

Save json list in a text file

I have a JSON log file that I rearrange to be correct. After this, I am trying to save the results to the same file. The results are a list, but the problem is that I am unable to save it and I get the following error:
write() argument must be str, not list
Here is the code itself:
import regex as re
import re

f_name = 'test1.txt'

splitter = r'"Event\d+":{(.*?)}'  # a search pattern to capture the stuff in braces

# Open the file as Read.
with open(f_name, 'r') as src:
    data = src.readlines()

# tokenize the data source...
tokens = re.findall(splitter, str(data))
# print(tokens)

# now we can operate on the tokens and split them up into key-value pairs and put them into a list
result = []
for token in tokens:
    # make an empty dictionary to hold the row elements
    line_dict = {}
    # we can split the line (token) by comma to get the key-value pairs
    pairs = token.split(',')
    for pair in pairs:
        # another regex split needed here, because the timestamps have colons too
        splitter = r'"(.*)"\s*:\s*"(.*)"'  # capture two groups of things in quotes on opposite sides of colon
        parts = re.search(splitter, pair)
        key, value = parts.group(1), parts.group(2)
        line_dict[key] = value
    # add the dictionary of line elements to the result
    result.append(line_dict)

with open(f_name, 'w') as src:
    for line in result:
        src.write(result)  # <-- this line raises: write() argument must be str, not list
Note: the code itself was not written by me; it comes from Log file management with python (thanks AirSquid).
Thanks for the assistance, I'm new at Python.
I tried to import json and use json.dump, and also tried appending the text, but in most cases I end up with just [] or an empty file.
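For reference, a minimal sketch of the json.dump approach mentioned above; the separate output file name and the stand-in result list are assumptions for illustration, not part of the original code:

import json

# `result` is the list of dictionaries built by the parsing loop above;
# this stand-in only shows the shape of the data.
result = [{"Time": "2023-01-01 10:00:00", "Status": "OK"}]

# json.dump serializes the whole list in one call; writing to a new file
# avoids clobbering the source log while testing.
with open('test1_clean.json', 'w') as out:
    json.dump(result, out, indent=2)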

How can I extract information quickly from 130,000+ JSON files located in S3?

I have an S3 bucket with over 130k JSON files, and I need to calculate numbers based on the data in those files (for example, counting how many speakers there are of each gender). I am currently using the S3 paginator and json.loads to read each file and extract information from it, but it takes a very long time to process such a large number of files (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you.
Here is some of my code:
import json
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name', StartAfter='')

for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            s3 = boto3.resource('s3')
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = json_content['dict-name']
In order to use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear whether your 2-3 files per second covers only the read or also includes part of the number crunching; nonetheless, multiprocessing will speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, and then do your analysis.
To be useful for me, I run this on spot instances that have lots of vCPUs and memory. I've found that the network-optimized instance types (like c5n - look for the n) and the inf1 types (built for machine learning) are much faster at reading/writing than T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. The multiprocessing is orders of magnitude faster than running in a single process.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import json
from utils_file_handling import *

ufh = file_utilities()  # instantiate the class functions - see below (second file)

bucket = 'your-bucket'
prefix = 'your-prefix/here/'  # if you don't have a prefix pass '' (empty string or function will fail)

# define multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as pool:
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
        pool.close()
        pool.join()
    return df_list

# create your master keys list upfront; you can loop through all or slice the list to test
keys_list = ufh.get_keys_from_prefix(bucket, prefix)
# keys_list = keys_list[0:2000]  # as an example

num_proc = os.cpu_count()  # tells you how many processors your machine has; function above defaults to 4 unless given
df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc)  # collect dataframes for each file
df_new = pd.concat(df_list, sort=False)
df_new = df_new.reset_index(drop=True)
# do your analysis on the dataframe
File 2: class functions
# utils_file_handling.py
# create this in a separate file; name as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd

# define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')


class file_utilities:
    """file handling functions"""

    def get_keys_from_prefix(self, bucket, prefix):
        '''gets list of keys and dates for given bucket and prefix'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # use Delimiter to limit search to that level of hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            keys = [content['Key'] for content in page.get('Contents')]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list

    def read_json_file_from_s3(self, bucket, key):
        """read json file"""
        obj = s3sc.get_object(Bucket=bucket, Key=key)
        data = obj['Body'].read().decode('utf-8')
        return data

    # you may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''returns dataframe'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])
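As a concrete follow-up to the gender example from the question (the 'gender' column name and the stand-in data are assumptions, not from the original post), the analysis step on the concatenated dataframe could be a simple pandas aggregation:

import pandas as pd

# df_new is the concatenated dataframe produced by script.py above;
# the stand-in below only illustrates the shape of the data.
df_new = pd.DataFrame({'gender': ['female', 'male', 'female', 'unknown']})

# Count speakers per gender across all files.
gender_counts = df_new['gender'].value_counts()
print(gender_counts)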

Pytorch: Overfitting on a small batch: Debugging

I am building a multi-class image classifier.
There is a debugging trick where you overfit on a single batch to check whether there are any deeper bugs in the program.
How can I design the code so that this check can be done in a portable way?
One arduous and not very smart way is to build a holdout train/test folder for a small batch, where the test set consists of two distributions - seen data and unseen data - and if the model performs well on the seen data and poorly on the unseen data, we can conclude that the network doesn't have any deeper structural bug.
But this does not seem like a smart or portable approach, and it has to be repeated for every problem.
Currently, I have a dataset class where I partition the data into train/dev/test in the way shown below:
import pandas as pd
from sklearn.model_selection import train_test_split


def split_equal_into_val_test(csv_file=None, stratify_colname='y',
                              frac_train=0.6, frac_val=0.15, frac_test=0.25,
                              ):
    """
    Split a Pandas dataframe into three subsets (train, val, and test).

    Follows fractional ratios provided by the user, where the val and
    test sets have the same number of each class while the train set has
    the remaining samples of each class.

    Parameters
    ----------
    csv_file : Input data csv file to be passed
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val : float
    frac_test : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    """
    df = pd.read_csv(csv_file).iloc[:, 1:]

    if frac_train + frac_val + frac_test != 1.0:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))
    if stratify_colname not in df.columns:
        raise ValueError('%s is not a column in the dataframe' %
                         (stratify_colname))

    df_input = df
    no_of_classes = 4
    sfact = int((0.1 * len(df)) / no_of_classes)

    # Shuffling the data frame
    df_input = df_input.sample(frac=1)

    df_temp_1 = df_input[df_input['labels'] == 1][:sfact]
    df_temp_2 = df_input[df_input['labels'] == 2][:sfact]
    df_temp_3 = df_input[df_input['labels'] == 3][:sfact]
    df_temp_4 = df_input[df_input['labels'] == 4][:sfact]

    dev_test_df = pd.concat([df_temp_1, df_temp_2, df_temp_3, df_temp_4])
    dev_test_y = dev_test_df['labels']

    # Split the temp dataframe into val and test dataframes.
    df_val, df_test, dev_Y, test_Y = train_test_split(
        dev_test_df, dev_test_y,
        stratify=dev_test_y,
        test_size=0.5,
    )

    df_train = df[~df['img'].isin(dev_test_df['img'])]

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test


def train_val_to_ids(train, val, test, stratify_columns='labels'):  # noqa
    """
    Convert the stratified dataset into dictionaries: partition['train_set'] / partition['val_set'] and labels.

    To generate the parallel code according to https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel

    Parameters
    -----------
    csv_file : Input data csv file to be passed
    stratify_columns : The label column

    Returns
    -----------
    partition, labels:
        partition dictionary containing train and validation ids and label dictionary containing ids and their labels  # noqa
    """
    train_list, val_list, test_list = train['img'].to_list(), val['img'].to_list(), test['img'].to_list()  # noqa
    partition = {"train_set": train_list,
                 "val_set": val_list,
                 }
    labels = dict(zip(train.img, train.labels))
    labels.update(dict(zip(val.img, val.labels)))
    return partition, labels
P.S. - I know that PyTorch Lightning has an overfitting feature which can be used easily, but I don't want to move to PyTorch Lightning.
I don't know how portable it will be, but a trick that I use is to modify the __len__ function in the Dataset.
If I modify it from
def __len__(self):
    return len(self.data_list)
to
def __len__(self):
    return 20
it will only use the first 20 elements in the dataset (regardless of shuffle). You only need to change one line of code and the rest should work just fine, so I think it's pretty neat.
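A related option, not from the original answer, is torch.utils.data.Subset, which restricts an existing dataset to a fixed set of indices without editing its __len__; a minimal sketch with a stand-in TensorDataset:

import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Stand-in dataset; replace with your own Dataset instance.
full_dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                             torch.randint(0, 4, (1000,)))

# Keep only the first 20 samples for the overfit-on-a-tiny-batch check.
tiny_dataset = Subset(full_dataset, list(range(20)))
tiny_loader = DataLoader(tiny_dataset, batch_size=20, shuffle=True)

for images, labels in tiny_loader:
    # Train repeatedly on this single batch and watch the loss approach zero.
    print(images.shape, labels.shape)

This keeps the original Dataset untouched, so the same code path can be reused for the real training run.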

Apache NiFi: How to compare multiple rows in a csv and create new column

I have a CSV which looks like this:
Jc,TXF,timer,alpha,beta
15,44,55,12,33
18,87,33,111
9,87,61,29,77
Alpha and beta combined make up a city code. I want to add the name of the city to the CSV as a new column:
Jc,TXF,timer,alpha,beta,city
15,44,55,12,33,York
18,87,33,111,London
9,87,61,29,77,Sydney
I have another CSV with only the columns alpha, beta, city, which looks like this:
alpha,beta,city
12,33,York
33,111,London
29,77,Sydney
How can I achieve this using Apache NiFi? Please suggest the processors and the workflow needed to achieve this.
I see two ways of solving this.
First, by using CsvLookupService. However, CsvLookupService only supports a single key, and you have two: alpha and beta. So to use this solution you would have to concatenate both keys into a single key, like 12_33.
Second, by using the ExecuteScript processor. This one is better because you don't have to modify your source data. Strategy:
Split the CSV text into lines
Enrich each line with the city column by looking up the alpha and beta keys in the mapping file
Merge the individual lines into a single CSV file.
Overall flow:
GenerateFlowFile:
SplitText:
Set the header line count to 1 to include the header line in the split content. For the ExecuteScript processor, set python as the scripting engine and provide the following script body:
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
import csv

# Define a subclass of StreamCallback for use in session.write()
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # fetch the mapping CSV file
        with open('/home/nifi/mapping.csv', 'r') as mapping:
            # read the mapping file
            mappingContent = csv.reader(mapping, delimiter=',')
            # flowfile content is CSV text with two lines, header and actual content
            # split by newline to get access to each individual line
            lines = IOUtils.toString(inputStream, StandardCharsets.UTF_8).split('\n')
            # the result will contain the header line
            # the result will have the additional city column
            result = lines[0] + ',city\n'
            # take the second line and split it
            # to get access to alpha, beta and city values
            lineSplit = lines[1].split(',')
            # Go through the mapping file
            #   item[0] -> alpha
            #   item[1] -> beta
            #   item[2] -> city
            # See if alpha and beta match the line content
            matched = False
            for item in mappingContent:
                if item[0] == lineSplit[3] and item[1] == lineSplit[4]:
                    result += lines[1] + ',' + item[2]
                    matched = True
                    break
            if not matched:
                raise Exception('No matching found.')
            else:
                outputStream.write(bytearray(result.encode('utf-8')))
# end class

flowFile = session.get()
if flowFile is not None:
    try:
        flowFile = session.write(flowFile, PyStreamCallback())
        session.transfer(flowFile, REL_SUCCESS)
    except Exception as e:
        session.transfer(flowFile, REL_FAILURE)
See the comments for a detailed description of the script. /home/nifi/mapping.csv has to be available on your NiFi instance. If you want to learn more about the ExecuteScript processor, refer to the ExecuteScript Cookbook. Finally, you merge all the lines into a single CSV file:
Set the CSV reader and writer and leave their default properties. Adjust the MergeContent properties to control how many lines should end up in each resulting CSV file. Result:

How to handle large JSON file in Pytorch?

I am working on a time series problem. Different training time series are stored in a large JSON file about 30 GB in size. In TensorFlow I know how to use TFRecords. Is there a similar way in PyTorch?
I suppose IterableDataset (docs) is what you need, because:
you probably want to traverse the files without random access;
the number of samples in the JSONs is not pre-computed.
I've made a minimal usage example under the assumption that every line of the dataset file is a JSON object itself, but you can change the logic.
import json
from torch.utils.data import DataLoader, IterableDataset


class JsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        for json_file in self.files:
            with open(json_file) as f:
                for sample_line in f:
                    sample = json.loads(sample_line)
                    yield sample['x'], sample['time'], ...

...

dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    y = model(batch)
Generally, you do not need to change/overload the default data.DataLoader.
What you should look into is how to create a custom data.Dataset.
Once you have your own Dataset that knows how to extract the items one by one from the JSON files, you feed it to the "vanilla" data.DataLoader, and all the batching/multi-processing etc. is done for you based on the dataset you provide.
If, for example, you have a folder with several JSON files, each containing several examples, you can have a Dataset that looks like this:
import bisect
import glob
import json
import os

from torch.utils import data


class MyJsonsDataset(data.Dataset):
    def __init__(self, jfolder):
        super(MyJsonsDataset, self).__init__()
        self.filenames = []  # keep track of the jfiles you need to load
        self.cumulative_sizes = [0]  # keep track of number of examples viewed so far
        # assumption: each json file in the folder holds a list of examples
        for jsonfile in sorted(glob.glob(os.path.join(jfolder, '*.json'))):
            self.filenames.append(jsonfile)
            with open(jsonfile) as f:
                l = len(json.load(f))  # number of examples in jsonfile
            self.cumulative_sizes.append(self.cumulative_sizes[-1] + l)
        # discard the first element
        self.cumulative_sizes.pop(0)

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # first you need to know which of the files holds the idx example
        jfile_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if jfile_idx == 0:
            sample_idx = idx
        else:
            sample_idx = idx - self.cumulative_sizes[jfile_idx - 1]
        # now retrieve the `sample_idx` example from self.filenames[jfile_idx]
        with open(self.filenames[jfile_idx]) as f:
            retrieved_example = json.load(f)[sample_idx]
        return retrieved_example
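A brief usage sketch for the class above; the folder path and DataLoader settings are illustrative, and it assumes the same one-list-of-examples-per-file layout as in __init__:

from torch.utils.data import DataLoader

# Build the dataset over a folder of json files and let the vanilla
# DataLoader handle batching and multi-process loading.
dataset = MyJsonsDataset('data/jsons')
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    # Each batch is a collated group of examples, possibly drawn from different files.
    pass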