How to handle large JSON file in Pytorch? - deep-learning

I am working on a time series problem. Different training time series data is stored in a large JSON file with the size of 30GB. In tensorflow I know how to use TF records. Is there a similar way in pytorch?

I suppose IterableDataset (docs) is what you need, because:
you probably want to traverse files without random access;
number of samples in jsons is not pre-computed.
I've made a minimal usage example with an assumption that every line of dataset file is a json itself, but you can change the logic.
import json
from torch.utils.data import DataLoader, IterableDataset
class JsonDataset(IterableDataset):
def __init__(self, files):
self.files = files
def __iter__(self):
for json_file in self.files:
with open(json_file) as f:
for sample_line in f:
sample = json.loads(sample_line)
yield sample['x'], sample['time'], ...
...
dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)
for batch in dataloader:
y = model(batch)

Generally, you do not need to change/overload the default data.Dataloader.
What you should look into is how to create a custom data.Dataset.
Once you have your own Dataset that knows how to extract item-by-item from the json file, you feed it do the "vanilla" data.Dataloader and all the batching/multi-processing etc, is done for you based on your dataset provided.
If, for example, you have a folder with several json files, each containing several examples, you can have a Dataset that looks like:
import bisect
class MyJsonsDataset(data.Dataset):
def __init__(self, jfolder):
super(MyJsonsDataset, self).__init__()
self.filenames = [] # keep track of the jfiles you need to load
self.cumulative_sizes = [0] # keep track of number of examples viewed so far
# this is not actually python code - just pseudo code for you to follow
for each jsonfile in jfolder:
self.filenames.append(jsonfile)
l = number of examples in jsonfile
self.cumulative_sizes.append(self.cumulative_sizes[-1] + l)
# discard the first element
self.cumulative_sizes.pop(0)
def __len__(self):
return self.cumulative_sizes[-1]
def __getitem__(self, idx):
# first you need to know wich of the files holds the idx example
jfile_idx = bisect.bisect_right(self.cumulative_sizes, idx)
if jfile_idx == 0:
sample_idx = idx
else:
sample_idx = idx - self.cumulative_sizes[jfile_idx - 1]
# now you need to retrieve the `sample_idx` example from self.filenames[jfile_idx]
return retrieved_example

Related

Is there a good way to access batch id in pytorch datasets using dataloaders?

In pytorch datasets, the way to access individual samples is given by implementing the __getitem__ method, but there seems to me that there are no natural way to get the batch id for the extracted sample. One may argue that batch ids should be handled outside the dataset (e.g. in training loops or similar), but I want to modify the processing of the sample when it is retrieved based on the batch id.
I have a hacked solution given below, but I am wondering if there are better ways of doing this.
The "solution" below doesn't work when using num_workers > 1, so it is non-functional.
from typing import List, Tuple, Iterator
from torch.utils.data import RandomSampler, Dataset, DataLoader, BatchSampler
class intwithbtx(int):
def __new__(cls, theint: int, btx: int):
x = int.__new__(cls, theint)
x.btx_number = btx
return x
class IdBatchSampler(BatchSampler):
def __iter__(self) -> Iterator[List[int]]:
batch = []
iii = 0
for idx in self.sampler:
batch.append(intwithbtx(idx, iii))
if len(batch) == self.batch_size:
yield batch
iii += 1
batch = []
if len(batch) > 0 and not self.drop_last:
yield batch
class RangeDataset(Dataset):
def __init__(self, lgt: int):
self.data = list(range(lgt))
def __getitem__(self, item: intwithbtx):
dt = self.data[item] + item.btx_number*1000
return dt
def __len__(self):
return len(self.data)
if __name__ == '__main__':
ds = RangeDataset(30)
smp = IdBatchSampler(RandomSampler(range(len(ds))), batch_size=3, drop_last=False)
loader = DataLoader(ds, batch_sampler=smp)
for btx in loader:
print(btx)
loader2 = DataLoader(ds, batch_sampler=smp, num_workers=2) # Fails.
for btx in loader2:
print(btx)

How can i extract information quickly from 130,000+ Json files located in S3?

i have an S3 was over 130k Json Files which i need to calculate numbers based on data in the json files (for example calculate the number of gender of Speakers). i am currently using s3 Paginator and JSON.load to read each file and extract information form. but it take a very long time to process such a large number of file (2-3 files per second). how can i speed up the process? please provide working code examples if possible. Thank you
here is some of my code:
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name',StartAfter='')
for page in result:
if "Contents" in page:
for key in page[ "Contents" ]:
keyString = key[ "Key" ]
s3 = boto3.resource('s3')
content_object = s3.Bucket('bucket-name').Object(str(keyString))
file_content = content_object.get()['Body'].read().decode('utf-8')
json_content = json.loads(file_content)
x = (json_content['dict-name'])
In order to use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear if your 2-3 seconds is on the read or includes part of the number crunching, nonetheless multiprocessing will speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, then do your analysis.
To be useful for me, I run this on spot instances that have lots of vCPUs and memory. I've found the instances that are network optimized (like c5n - look for the n) and the inf1 (for machine learning) are much faster at reading/writing than T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. The multithreading is orders of magnitude faster than single threading.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import json
from utils_file_handling import *
ufh = file_utilities() #instantiate the class functions - see below (second file)
bucket = 'your-bucket'
prefix = 'your-prefix/here/' # if you don't have a prefix pass '' (empty string or function will fail)
#define multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
with Pool(num_proc) as pool:
df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
pool.close()
pool.join()
return df_list
#create your master keys list upfront; you can loop through all or slice the list to test
keys_list = ufh.get_keys_from_prefix(bucket, prefix)
# keys_list = keys_list[0:2000] # as an exampmle
num_proc = os.cpu_count() #tells you how many processors your machine has; function above defaults to 4 unelss given
df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc) #collect dataframes for each file
df_new = pd.concat(df_list, sort=False)
df_new = df_new.reset_index(drop=True)
# do your analysis on the dataframe
File 2: class functions
#utils_file_handling.py
# create this in a separate file; name as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd
#define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')
class file_utilities:
"""file handling function"""
def get_keys_from_prefix(self, bucket, prefix):
'''gets list of keys and dates for given bucket and prefix'''
keys_list = []
paginator = s3sr.meta.client.get_paginator('list_objects_v2')
# use Delimiter to limit search to that level of hierarchy
for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
keys = [content['Key'] for content in page.get('Contents')]
print('keys in page: ', len(keys))
keys_list.extend(keys)
return keys_list
def read_json_file_from_s3(self, bucket, key):
"""read json file"""
bucket_obj = boto3.resource('s3').Bucket(bucket)
obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
data = obj['Body'].read().decode('utf-8')
return data
# you may need to tweak this for your ['dict-name'] example; I think I have it correct
def reader_json(self, bucket, key):
'''returns dataframe'''
return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])

Pytorch: Overfitting on a small batch: Debugging

I am building a multi-class image classifier.
There is a debugging trick to overfit on a single batch to check if there any deeper bugs in the program.
How to design the code in a way that can do it in a much portable format?
One arduous and a not smart way is to build a holdout train/test folder for a small batch where test class consists of 2 distribution - seen data and unseen data and if the model is performing better on seen data and poorly on unseen data, then we can conclude that our network doesn't have any deeper structural bug.
But, this does not seems like a smart and a portable way, and have to do it with every problem.
Currently, I have a dataset class where I am partitioning the data in train/dev/test in the below way -
def split_equal_into_val_test(csv_file=None, stratify_colname='y',
frac_train=0.6, frac_val=0.15, frac_test=0.25,
):
"""
Split a Pandas dataframe into three subsets (train, val, and test).
Following fractional ratios provided by the user, where val and
test set have the same number of each classes while train set have
the remaining number of left classes
Parameters
----------
csv_file : Input data csv file to be passed
stratify_colname : str
The name of the column that will be used for stratification. Usually
this column would be for the label.
frac_train : float
frac_val : float
frac_test : float
The ratios with which the dataframe will be split into train, val, and
test data. The values should be expressed as float fractions and should
sum to 1.0.
random_state : int, None, or RandomStateInstance
Value to be passed to train_test_split().
Returns
-------
df_train, df_val, df_test :
Dataframes containing the three splits.
"""
df = pd.read_csv(csv_file).iloc[:, 1:]
if frac_train + frac_val + frac_test != 1.0:
raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
(frac_train, frac_val, frac_test))
if stratify_colname not in df.columns:
raise ValueError('%s is not a column in the dataframe' %
(stratify_colname))
df_input = df
no_of_classes = 4
sfact = int((0.1*len(df))/no_of_classes)
# Shuffling the data frame
df_input = df_input.sample(frac=1)
df_temp_1 = df_input[df_input['labels'] == 1][:sfact]
df_temp_2 = df_input[df_input['labels'] == 2][:sfact]
df_temp_3 = df_input[df_input['labels'] == 3][:sfact]
df_temp_4 = df_input[df_input['labels'] == 4][:sfact]
dev_test_df = pd.concat([df_temp_1, df_temp_2, df_temp_3, df_temp_4])
dev_test_y = dev_test_df['labels']
# Split the temp dataframe into val and test dataframes.
df_val, df_test, dev_Y, test_Y = train_test_split(
dev_test_df, dev_test_y,
stratify=dev_test_y,
test_size=0.5,
)
df_train = df[~df['img'].isin(dev_test_df['img'])]
assert len(df_input) == len(df_train) + len(df_val) + len(df_test)
return df_train, df_val, df_test
def train_val_to_ids(train, val, test, stratify_columns='labels'): # noqa
"""
Convert the stratified dataset in the form of dictionary : partition['train] and labels.
To generate the parallel code according to https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel
Parameters
-----------
csv_file : Input data csv file to be passed
stratify_columns : The label column
Returns
-----------
partition, labels:
partition dictionary containing train and validation ids and label dictionary containing ids and their labels # noqa
"""
train_list, val_list, test_list = train['img'].to_list(), val['img'].to_list(), test['img'].to_list() # noqa
partition = {"train_set": train_list,
"val_set": val_list,
}
labels = dict(zip(train.img, train.labels))
labels.update(dict(zip(val.img, val.labels)))
return partition, labels
P.S - I know about the Pytorch lightning and know that they have an overfitting feature which can be used easily but I don't want to move to PyTorch lightning.
I don't know how portable it will be, but a trick that I use is to modify the __len__ function in the Dataset.
If I modified it from
def __len__(self):
return len(self.data_list)
to
def __len__(self):
return 20
It will only output the first 20 elements in the dataset (regardless of shuffle). You only need to change one line of code and the rest should work just fine so I think it's pretty neat.

How to load all 5 batches of CIFAR10 in a single data structure as on MNIST in PyTorch?

With Mnist I have a single file with the labels and a single file for the train, so I simply do:
self.data = datasets.MNIST(root='./data', train=True, download=True)
Basically I create a set of labels (from 0-9) and save the i-th position of the image in the data structure, to create my custom tasks:
def make_tasks (self):
        self.task_to_examples = {} #task 0-9
        self.all_tasks = set (self.data.train_labels.numpy ())
        for i, digit in enumerate (self.data.train_labels.numpy ()):
            if str(digit) not in self.task_to_examples:
                self.task_to_examples[str(digit)] = []
            self.task_to_examples[str(digits)].append(i)
I don't understand how to do the same thing using CIFAR10 because it is divided into 5 batches, I would like all the data in a single structure.
If your desired structure is {"class_id": [indices of the samples]}, then for CIFAR10 you can do something like this:
import numpy as np
import torchvision
# set root accordingly
cifar = torchvision.datasets.CIFAR10(root=".", train=True, download=True)
task_to_examples = {
str(task_id): np.where(cifar.targets == task_id)[0].tolist()
for task_id in np.unique(cifar.targets)
}

Simple LSTM in PyTorch with Sequential module

In PyTorch, we can define architectures in multiple ways. Here, I'd like to create a simple LSTM network using the Sequential module.
In Lua's torch I would usually go with:
model = nn.Sequential()
model:add(nn.SplitTable(1,2))
model:add(nn.Sequencer(nn.LSTM(inputSize, hiddenSize)))
model:add(nn.SelectTable(-1)) -- last step of output sequence
model:add(nn.Linear(hiddenSize, classes_n))
However, in PyTorch, I don't find the equivalent of SelectTable to get the last output.
nn.Sequential(
nn.LSTM(inputSize, hiddenSize, 1, batch_first=True),
# what to put here to retrieve last output of LSTM ?,
nn.Linear(hiddenSize, classe_n))
Define a class to extract the last cell output:
# LSTM() returns tuple of (tensor, (recurrent state))
class extract_tensor(nn.Module):
def forward(self,x):
# Output shape (batch, features, hidden)
tensor, _ = x
# Reshape shape (batch, hidden)
return tensor[:, -1, :]
nn.Sequential(
nn.LSTM(inputSize, hiddenSize, 1, batch_first=True),
extract_tensor(),
nn.Linear(hiddenSize, classe_n)
)
According to the LSTM cell documentation the outputs parameter has a shape of (seq_len, batch, hidden_size * num_directions) so you can easily take the last element of the sequence in this way:
rnn = nn.LSTM(10, 20, 2)
input = Variable(torch.randn(5, 3, 10))
h0 = Variable(torch.randn(2, 3, 20))
c0 = Variable(torch.randn(2, 3, 20))
output, hn = rnn(input, (h0, c0))
print(output[-1]) # last element
Tensor manipulation and Neural networks design in PyTorch is incredibly easier than in Torch so you rarely have to use containers. In fact, as stated in the tutorial PyTorch for former Torch users PyTorch is built around Autograd so you don't need anymore to worry about containers. However, if you want to use your old Lua Torch code you can have a look to the Legacy package.
As far as I'm concerned there's nothing like a SplitTable or a SelectTable in PyTorch. That said, you are allowed to concatenate an arbitrary number of modules or blocks within a single architecture, and you can use this property to retrieve the output of a certain layer. Let's make it more clear with a simple example.
Suppose I want to build a simple two-layer MLP and retrieve the output of each layer. I can build a custom class inheriting from nn.Module:
class MyMLP(nn.Module):
def __init__(self, in_channels, out_channels_1, out_channels_2):
# first of all, calling base class constructor
super().__init__()
# now I can build my modular network
self.block1 = nn.Linear(in_channels, out_channels_1)
self.block2 = nn.Linear(out_channels_1, out_channels_2)
# you MUST implement a forward(input) method whenever inheriting from nn.Module
def forward(x):
# first_out will now be your output of the first block
first_out = self.block1(x)
x = self.block2(first_out)
# by returning both x and first_out, you can now access the first layer's output
return x, first_out
In your main file you can now declare the custom architecture and use it:
from myFile import MyMLP
import numpy as np
in_ch = out_ch_1 = out_ch_2 = 64
# some fake input instance
x = np.random.rand(in_ch)
my_mlp = MyMLP(in_ch, out_ch_1, out_ch_2)
# get your outputs
final_out, first_layer_out = my_mlp(x)
Moreover, you could concatenate two MyMLP in a more complex model definition and retrieve the output of each one in a similar way.
I hope this is enough to clarify, but in case you have more questions, please feel free to ask, since I may have omitted something.