How to implement modulo operation using PyArrow Expression API so that I can use it in filter? - pyarrow

I want to shard Arrow Dataset. To achieve that, I'd like to use a monotonously increasing field and implement a sharding operation in the following filter, which I can use in pyarrow Scanner: pc.field('id') % num_shards == shard_id
Any ideas on how to do this using PyArrow compute API?

Although there is not yet a modulo function there is a bit_wise_and function which can achieve the same thing:
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
arr = pa.array(range(100))
tab = pa.Table.from_arrays([arr], names=['x'])
my_filter = pc.bit_wise_and(pc.field('x'), 7) == 0
filtered = ds.dataset(tab).to_table(filter=my_filter)
print(filtered)
# pyarrow.Table
# x: int64
# ----
# x: [[0,8,16,24,32,...,64,72,80,88,96]]

Related

how can i solve "empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType)" in MushroomRL?

I'm using MushroomRL for a Deep Reinforcement Learning project, and I'm using a Graph representation as RL Environment where the number of nodes represents the number of actions now in my neural network the input is one value ex: tensor([[5.]], and the output Q is the number of nodes which is ten ex: tensor([[5972.4927, 8562.3330, 7443.6479, 7326.1587, 6615.2090, 6617.3145,6911.8672, 8233.7930, 6821.0093, 7000.1182,]] now I'm using a new framework called MushroomRL, and this is the code
if __name__ == '__main__':
from mushroom_rl.core import Core
from mushroom_rl.algorithms.value import TrueOnlineSARSALambda
from mushroom_rl.policy import EpsGreedy
from mushroom_rl.features import Features
from mushroom_rl.features.tiles import Tiles
from mushroom_rl.utils.dataset import compute_J
from mushroom_rl.utils.parameters import LinearParameter, Parameter
from mushroom_rl.approximators.parametric import TorchApproximator
from mushroom_rl.algorithms.value import DQN
# Set the seed
np.random.seed(1)
# Create the toy environment with default parameters
#mdp = Environment.make('graph_env')
mdp=graph_env()
# Using an epsilon-greedy policy
epsilon = Parameter(value=0.1)
pi = EpsGreedy(epsilon=epsilon)
# Policy
epsilon = LinearParameter(value=1.,
threshold_value=.1,
n=1000000)
epsilon_test = Parameter(value=.05)
epsilon_random = Parameter(value=1)
pi = EpsGreedy(epsilon=epsilon_random)
approximator_params = dict(
network=Network,
input_shape=(1,),
output_shape=(1,),
n_actions=mdp.info.action_space.n,
optimizer=optimizer,
loss=F.mse_loss
)
approximator = TorchApproximator
algorithm_params = dict(
batch_size=32,
target_update_frequency=target_update_frequency // train_frequency,
replay_memory=True,
initial_replay_size=initial_replay_size,
max_replay_size=max_replay_size
)
agent=agent = DQN(mdp.info, pi, approximator,
approximator_params=approximator_params,
**algorithm_params)
# Algorithm
core = Core(agent, mdp)
# RUN
# Fill replay memory with random dataset
print_epoch(0)
core.learn(n_steps=initial_replay_size,n_steps_per_fit=initial_replay_size)
# Evaluate initial policy
pi.set_epsilon(epsilon_test)
#mdp.set_episode_end(False)
dataset = core.evaluate(n_steps=test_samples)
scores.append(get_stats(dataset))
for n_epoch in range(1, max_steps // evaluation_frequency + 1):
print_epoch(n_epoch)
print('- Learning:')
# learning step
pi.set_epsilon(epsilon)
mdp.set_episode_end(True)
core.learn(n_steps=evaluation_frequency,
n_steps_per_fit=train_frequency)
print('- Evaluation:')
# evaluation step
pi.set_epsilon(epsilon_test)
mdp.set_episode_end(False)
dataset = core.evaluate(n_steps=test_samples)
scores.append(get_stats(dataset))
it givs me this error when i run the code
TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
* (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
* (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
i beleve the proplem in part of the code can any one help to fix it ?

How can i extract information quickly from 130,000+ Json files located in S3?

i have an S3 was over 130k Json Files which i need to calculate numbers based on data in the json files (for example calculate the number of gender of Speakers). i am currently using s3 Paginator and JSON.load to read each file and extract information form. but it take a very long time to process such a large number of file (2-3 files per second). how can i speed up the process? please provide working code examples if possible. Thank you
here is some of my code:
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name',StartAfter='')
for page in result:
if "Contents" in page:
for key in page[ "Contents" ]:
keyString = key[ "Key" ]
s3 = boto3.resource('s3')
content_object = s3.Bucket('bucket-name').Object(str(keyString))
file_content = content_object.get()['Body'].read().decode('utf-8')
json_content = json.loads(file_content)
x = (json_content['dict-name'])
In order to use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear if your 2-3 seconds is on the read or includes part of the number crunching, nonetheless multiprocessing will speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, then do your analysis.
To be useful for me, I run this on spot instances that have lots of vCPUs and memory. I've found the instances that are network optimized (like c5n - look for the n) and the inf1 (for machine learning) are much faster at reading/writing than T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. The multithreading is orders of magnitude faster than single threading.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import json
from utils_file_handling import *
ufh = file_utilities() #instantiate the class functions - see below (second file)
bucket = 'your-bucket'
prefix = 'your-prefix/here/' # if you don't have a prefix pass '' (empty string or function will fail)
#define multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
with Pool(num_proc) as pool:
df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
pool.close()
pool.join()
return df_list
#create your master keys list upfront; you can loop through all or slice the list to test
keys_list = ufh.get_keys_from_prefix(bucket, prefix)
# keys_list = keys_list[0:2000] # as an exampmle
num_proc = os.cpu_count() #tells you how many processors your machine has; function above defaults to 4 unelss given
df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc) #collect dataframes for each file
df_new = pd.concat(df_list, sort=False)
df_new = df_new.reset_index(drop=True)
# do your analysis on the dataframe
File 2: class functions
#utils_file_handling.py
# create this in a separate file; name as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd
#define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')
class file_utilities:
"""file handling function"""
def get_keys_from_prefix(self, bucket, prefix):
'''gets list of keys and dates for given bucket and prefix'''
keys_list = []
paginator = s3sr.meta.client.get_paginator('list_objects_v2')
# use Delimiter to limit search to that level of hierarchy
for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
keys = [content['Key'] for content in page.get('Contents')]
print('keys in page: ', len(keys))
keys_list.extend(keys)
return keys_list
def read_json_file_from_s3(self, bucket, key):
"""read json file"""
bucket_obj = boto3.resource('s3').Bucket(bucket)
obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
data = obj['Body'].read().decode('utf-8')
return data
# you may need to tweak this for your ['dict-name'] example; I think I have it correct
def reader_json(self, bucket, key):
'''returns dataframe'''
return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])

Pytorch: Overfitting on a small batch: Debugging

I am building a multi-class image classifier.
There is a debugging trick to overfit on a single batch to check if there any deeper bugs in the program.
How to design the code in a way that can do it in a much portable format?
One arduous and a not smart way is to build a holdout train/test folder for a small batch where test class consists of 2 distribution - seen data and unseen data and if the model is performing better on seen data and poorly on unseen data, then we can conclude that our network doesn't have any deeper structural bug.
But, this does not seems like a smart and a portable way, and have to do it with every problem.
Currently, I have a dataset class where I am partitioning the data in train/dev/test in the below way -
def split_equal_into_val_test(csv_file=None, stratify_colname='y',
frac_train=0.6, frac_val=0.15, frac_test=0.25,
):
"""
Split a Pandas dataframe into three subsets (train, val, and test).
Following fractional ratios provided by the user, where val and
test set have the same number of each classes while train set have
the remaining number of left classes
Parameters
----------
csv_file : Input data csv file to be passed
stratify_colname : str
The name of the column that will be used for stratification. Usually
this column would be for the label.
frac_train : float
frac_val : float
frac_test : float
The ratios with which the dataframe will be split into train, val, and
test data. The values should be expressed as float fractions and should
sum to 1.0.
random_state : int, None, or RandomStateInstance
Value to be passed to train_test_split().
Returns
-------
df_train, df_val, df_test :
Dataframes containing the three splits.
"""
df = pd.read_csv(csv_file).iloc[:, 1:]
if frac_train + frac_val + frac_test != 1.0:
raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
(frac_train, frac_val, frac_test))
if stratify_colname not in df.columns:
raise ValueError('%s is not a column in the dataframe' %
(stratify_colname))
df_input = df
no_of_classes = 4
sfact = int((0.1*len(df))/no_of_classes)
# Shuffling the data frame
df_input = df_input.sample(frac=1)
df_temp_1 = df_input[df_input['labels'] == 1][:sfact]
df_temp_2 = df_input[df_input['labels'] == 2][:sfact]
df_temp_3 = df_input[df_input['labels'] == 3][:sfact]
df_temp_4 = df_input[df_input['labels'] == 4][:sfact]
dev_test_df = pd.concat([df_temp_1, df_temp_2, df_temp_3, df_temp_4])
dev_test_y = dev_test_df['labels']
# Split the temp dataframe into val and test dataframes.
df_val, df_test, dev_Y, test_Y = train_test_split(
dev_test_df, dev_test_y,
stratify=dev_test_y,
test_size=0.5,
)
df_train = df[~df['img'].isin(dev_test_df['img'])]
assert len(df_input) == len(df_train) + len(df_val) + len(df_test)
return df_train, df_val, df_test
def train_val_to_ids(train, val, test, stratify_columns='labels'): # noqa
"""
Convert the stratified dataset in the form of dictionary : partition['train] and labels.
To generate the parallel code according to https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel
Parameters
-----------
csv_file : Input data csv file to be passed
stratify_columns : The label column
Returns
-----------
partition, labels:
partition dictionary containing train and validation ids and label dictionary containing ids and their labels # noqa
"""
train_list, val_list, test_list = train['img'].to_list(), val['img'].to_list(), test['img'].to_list() # noqa
partition = {"train_set": train_list,
"val_set": val_list,
}
labels = dict(zip(train.img, train.labels))
labels.update(dict(zip(val.img, val.labels)))
return partition, labels
P.S - I know about the Pytorch lightning and know that they have an overfitting feature which can be used easily but I don't want to move to PyTorch lightning.
I don't know how portable it will be, but a trick that I use is to modify the __len__ function in the Dataset.
If I modified it from
def __len__(self):
return len(self.data_list)
to
def __len__(self):
return 20
It will only output the first 20 elements in the dataset (regardless of shuffle). You only need to change one line of code and the rest should work just fine so I think it's pretty neat.

How to handle large JSON file in Pytorch?

I am working on a time series problem. Different training time series data is stored in a large JSON file with the size of 30GB. In tensorflow I know how to use TF records. Is there a similar way in pytorch?
I suppose IterableDataset (docs) is what you need, because:
you probably want to traverse files without random access;
number of samples in jsons is not pre-computed.
I've made a minimal usage example with an assumption that every line of dataset file is a json itself, but you can change the logic.
import json
from torch.utils.data import DataLoader, IterableDataset
class JsonDataset(IterableDataset):
def __init__(self, files):
self.files = files
def __iter__(self):
for json_file in self.files:
with open(json_file) as f:
for sample_line in f:
sample = json.loads(sample_line)
yield sample['x'], sample['time'], ...
...
dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)
for batch in dataloader:
y = model(batch)
Generally, you do not need to change/overload the default data.Dataloader.
What you should look into is how to create a custom data.Dataset.
Once you have your own Dataset that knows how to extract item-by-item from the json file, you feed it do the "vanilla" data.Dataloader and all the batching/multi-processing etc, is done for you based on your dataset provided.
If, for example, you have a folder with several json files, each containing several examples, you can have a Dataset that looks like:
import bisect
class MyJsonsDataset(data.Dataset):
def __init__(self, jfolder):
super(MyJsonsDataset, self).__init__()
self.filenames = [] # keep track of the jfiles you need to load
self.cumulative_sizes = [0] # keep track of number of examples viewed so far
# this is not actually python code - just pseudo code for you to follow
for each jsonfile in jfolder:
self.filenames.append(jsonfile)
l = number of examples in jsonfile
self.cumulative_sizes.append(self.cumulative_sizes[-1] + l)
# discard the first element
self.cumulative_sizes.pop(0)
def __len__(self):
return self.cumulative_sizes[-1]
def __getitem__(self, idx):
# first you need to know wich of the files holds the idx example
jfile_idx = bisect.bisect_right(self.cumulative_sizes, idx)
if jfile_idx == 0:
sample_idx = idx
else:
sample_idx = idx - self.cumulative_sizes[jfile_idx - 1]
# now you need to retrieve the `sample_idx` example from self.filenames[jfile_idx]
return retrieved_example

FiPy Simple Convection

I am trying to understand how FiPy works by working an example, in particular I would like to solve the following simple convection equation with periodic boundary:
$$\partial_t u + \partial_x u = 0$$
If initial data is given by $u(x, 0) = F(x)$, then the analytical solution is $u(x, t) = F(x - t)$. I do get a solution, but it is not correct.
What am I missing? Is there a better resource for understanding FiPy than the documentation? It is very sparse...
Here is my attempt
from fipy import *
import numpy as np
# Generate mesh
nx = 20
dx = 2*np.pi/nx
mesh = PeriodicGrid1D(nx=nx, dx=dx)
# Generate solution object with initial discontinuity
phi = CellVariable(name="solution variable", mesh=mesh)
phiAnalytical = CellVariable(name="analytical value", mesh=mesh)
phi.setValue(1.)
phi.setValue(0., where=x > 1.)
# Define the pde
D = [[-1.]]
eq = TransientTerm() == ConvectionTerm(coeff=D)
# Set discretization so analytical solution is exactly one cell translation
dt = 0.01*dx
steps = 2*int(dx/dt)
# Set the analytical value at the end of simulation
phiAnalytical.setValue(np.roll(phi.value, 1))
for step in range(steps):
eq.solve(var=phi, dt=dt)
print(phi.allclose(phiAnalytical, atol=1e-1))
As addressed on the FiPy mailing list, FiPy is not great at handling convection only PDEs (absent diffusion, pure hyperbolic) as it's missing higher order convection schemes. It is better to use CLAWPACK for this class of problem.
FiPy does have one second order scheme that might help with this problem, the VanLeerConvectionTerm, see an example.
If the VanLeerConvectionTerm is used in the above problem, it does do a better job of preserving the shock.
import numpy as np
import fipy
# Generate mesh
nx = 20
dx = 2*np.pi/nx
mesh = fipy.PeriodicGrid1D(nx=nx, dx=dx)
# Generate solution object with initial discontinuity
phi = fipy.CellVariable(name="solution variable", mesh=mesh)
phiAnalytical = fipy.CellVariable(name="analytical value", mesh=mesh)
phi.setValue(1.)
phi.setValue(0., where=mesh.x > 1.)
# Define the pde
D = [[-1.]]
eq = fipy.TransientTerm() == fipy.VanLeerConvectionTerm(coeff=D)
# Set discretization so analytical solution is exactly one cell translation
dt = 0.01*dx
steps = 2*int(dx/dt)
# Set the analytical value at the end of simulation
phiAnalytical.setValue(np.roll(phi.value, 1))
viewer = fipy.Viewer(phi)
for step in range(steps):
eq.solve(var=phi, dt=dt)
viewer.plot()
raw_input('stopped')
print(phi.allclose(phiAnalytical, atol=1e-1))