How to extract images and labels from a CSV file and create a training set using torch? - deep-learning

I downloaded a dataset for facial keypoint detection. The images and the labels were in a CSV file, which I extracted using pandas, but I don't know how to convert them into tensors and load them into a DataLoader for training.
import numpy as np
import pandas as pd

dataframe = pd.read_csv("training_facial_keypoints.csv")
# Each 'Image' cell is a space-separated string of pixel values; parse it into an array
dataframe['Image'] = dataframe['Image'].apply(lambda i: np.fromstring(i, sep=' '))
dataframe = dataframe.dropna()
# Stack the per-row pixel arrays and scale to [0, 1]
images_array = np.vstack(dataframe['Image'].values) / 255.0
images_array = images_array.astype(np.float32)
images_array = images_array.reshape(-1, 96, 96, 1)
print(images_array.shape)
# All columns except 'Image' are keypoint coordinates; normalise them to roughly [-1, 1]
labels_array = dataframe[dataframe.columns[:-1]].values
labels_array = (labels_array - 48) / 48
labels_array = labels_array.astype(np.float32)
I have the images and labels in two arrays. How do I create a training set from them, apply transforms, and then load it with a DataLoader?

Create a subclass of torch.utils.data.Dataset and fill it with your data.
You can pass the desired torchvision.transforms to it and apply them to your data in __getitem__(self, index).
Then you can pass it to torch.utils.data.DataLoader, which handles batching, shuffling, and parallel loading of the data.
PyTorch also has extensive documentation that you should refer to first.
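A minimal sketch of that idea, using the images_array and labels_array from the question (the class name KeypointsDataset and the chosen transform are illustrative, not prescribed):

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class KeypointsDataset(Dataset):
    def __init__(self, images, labels, transform=None):
        self.images = images          # (N, 96, 96, 1) float32 array
        self.labels = labels          # (N, num_keypoint_coords) float32 array
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        label = self.labels[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, torch.from_numpy(label)

# ToTensor converts the HxWxC numpy array into a CxHxW float tensor
trainset = KeypointsDataset(images_array, labels_array, transform=transforms.ToTensor())
trainloader = DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)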

Related

Best approach for geospatial indexes in Palantir Foundry

What is the recommended approach for building a pipeline that needs to find a point contained in a polygon (shape) in Palantir Foundry? In the past, this has been pretty difficult in Spark. GeoSpark has been pretty popular, but can still lag. If there is nothing specific to Foundry, I can implement something with GeoSpark. I have ~13k shapes and batches of thousands of points.
How large are the datasets? With a big enough driver and some optimizations, I previously got it working using geopandas. Just make sure that the coordinate points are in the same projection (CRS) as the polygons.
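For example, geopandas can declare and align the CRS before the join; a small sketch under assumptions (the GeoDataFrame names and EPSG code are placeholders, not from the original answer):

import geopandas

# Suppose gdf_points and gdf_polygons are GeoDataFrames built as in the helper below.
# Declare the CRS each layer is actually in (placeholder EPSG code), then reproject
# the points to match the polygons before doing the spatial join.
gdf_points = gdf_points.set_crs(epsg=4326)
gdf_polygons = gdf_polygons.set_crs(epsg=4326)
gdf_points = gdf_points.to_crs(gdf_polygons.crs)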
Here is a helper function:
from shapely import geometry
import json
import geopandas
from pyspark.sql import functions as F

def geopandas_spatial_join(df_left, df_right, geometry_left, geometry_right, how='inner', op='intersects'):
    '''
    Computes a spatial join of two Geopandas dataframes. Implements the Geopandas "sjoin" method, reference: https://geopandas.org/reference/geopandas.sjoin.html.
    Expects both dataframes to contain a GeoJSON geometry column, whose names are passed as the 'geometry_left' and 'geometry_right' arguments.
    Inputs:
        df_left (PANDAS_DATAFRAME): Left input dataframe.
        df_right (PANDAS_DATAFRAME): Right input dataframe.
        geometry_left (string): Name of the geometry column of the left dataframe.
        geometry_right (string): Name of the geometry column of the right dataframe.
        how (string): The type of join, one of {'left', 'right', 'inner'}.
        op (string): Binary predicate, one of {'intersects', 'contains', 'within'}.
    Outputs:
        (PANDAS_DATAFRAME): Joined dataframe.
    '''
    # Parse the GeoJSON strings into Shapely geometries and build GeoDataFrames
    df1 = df_left
    df1["geometry_left_shape"] = df1[geometry_left].apply(json.loads)
    df1["geometry_left_shape"] = df1["geometry_left_shape"].apply(geometry.shape)
    gdf_left = geopandas.GeoDataFrame(df1, geometry="geometry_left_shape")

    df2 = df_right
    df2["geometry_right_shape"] = df2[geometry_right].apply(json.loads)
    df2["geometry_right_shape"] = df2["geometry_right_shape"].apply(geometry.shape)
    gdf_right = geopandas.GeoDataFrame(df2, geometry="geometry_right_shape")

    # Run the spatial join and drop the helper geometry columns before returning
    joined = geopandas.sjoin(gdf_left, gdf_right, how=how, op=op)
    joined = joined.drop(joined.filter(items=["geometry_left_shape", "geometry_right_shape"]).columns, axis=1)
    return joined
We can then run the join (this snippet assumes it runs inside a Foundry transform where a SparkSession spark is available):
import pandas as pd

left_df = points_df.toPandas()
left_geo_column = "point_geometry"
right_df = polygon_df.toPandas()
right_geo_column = "polygon_geometry"

pdf = geopandas_spatial_join(left_df, right_df, left_geo_column, right_geo_column)
return_df = spark.createDataFrame(pdf).dropDuplicates()
return return_df

Python: Reading and Writing HUGE Json files

I am new to Python, so please excuse me if I am not asking the questions in a Pythonic way.
My requirements are as follows:
I need to write python code to implement this requirement.
I will be reading 60 json files as input. Each file is approximately 150 GB.
The sample structure for all 60 json files is shown below. Please note that each file will have only ONE json object, and the huge size of each file comes from the number and size of the "array_element" array contained in that one huge json object.
{
    "string_1":"abc",
    "string_1":"abc",
    "string_1":"abc",
    "string_1":"abc",
    "string_1":"abc",
    "string_1":"abc",
    "array_element":[]
}
Transformation logic is simple. I need to merge all the array_element entries from all 60 files and write them into one HUGE json file, so the output json file will be roughly 150 GB x 60 in size.
Questions for which I am requesting your help on:
For reading: I am planning on using the "ijson" module's ijson.items(file_object, "array_element"). Could you please tell me if ijson.items will "yield" (that is, NOT load the entire file into memory) one item at a time from the "array_element" array in the json file? I don't think json.load is an option here because we cannot hold such a huge dictionary in memory.
For writing: I am planning to read each item using ijson.items, "encode" it with json.dumps, and then write it to the file using file_object.write, NOT json.dump, since I cannot hold such a huge dictionary in memory. Could you please let me know if the f.flush() call in the code shown below is needed? To my understanding, the internal buffer will automatically get flushed when it is full, and the size of the internal buffer is constant and won't dynamically grow to the point of overloading memory. Please let me know.
Is there a better approach than the ones mentioned above for incrementally reading and writing huge json files?
Code snippet showing the reading and writing logic described above:
import ijson
import json

for input_file in input_files:
    with open(input_file, "r") as f:
        objects = ijson.items(f, "array_element")
        for item in objects:
            item_str = json.dumps(item, indent=2)
            with open("output.json", "a") as out:
                out.write(item_str)
                out.write(",\n")
                out.flush()

with open("output.json", "a") as f:
    f.seek(0, 2)
    f.truncate(f.tell() - 2)  # drop the trailing ",\n"
    f.write("]\n}")
Hope I have asked my questions clearly. Thanks in advance!!
The following program assumes that the input files have a format that is predictable enough to skip JSON parsing for the sake of performance.
My assumptions, inferred from your description, are:
All files have the same encoding.
All files have a single position somewhere at the start where "array_element":[ can be found, after which the "interesting portion" of the file begins
All files have a single position somewhere at the end where ]} marks the end of the "interesting portion"
All "interesting portions" can be joined with commas and still be valid JSON
When all of these points are true, concatenating a predefined header fragment, the respective file ranges, and a footer fragment would produce one large, valid JSON file.
import re
import mmap

head_pattern = re.compile(br'"array_element"\s*:\s*\[\s*', re.S)
tail_pattern = re.compile(br'\s*\]\s*\}\s*$', re.S)

input_files = ['sample1.json', 'sample2.json']

with open('result.json', "wb") as result:
    head_bytes = 500
    tail_bytes = 50
    chunk_bytes = 16 * 1024

    result.write(b'{"JSON": "fragment", "array_element": [\n')

    for input_file in input_files:
        print(input_file)
        with open(input_file, "r+b") as f:
            mm = mmap.mmap(f.fileno(), 0)
            # Locate the interesting portion: just after "array_element": [ and just before ]}
            start = head_pattern.search(mm[:head_bytes])
            end = tail_pattern.search(mm[-tail_bytes:])
            if not (start and end):
                print('unexpected file format')
                break
            start_pos = start.span()[1]
            end_pos = mm.size() - end.span()[1] + end.span()[0]

            # Separate the portions of consecutive files with a comma
            if input_files.index(input_file) > 0:
                result.write(b',\n')

            # Copy the interesting portion in fixed-size chunks
            pos = start_pos
            mm.seek(pos)
            while True:
                if pos + chunk_bytes >= end_pos:
                    result.write(mm.read(end_pos - pos))
                    break
                else:
                    result.write(mm.read(chunk_bytes))
                    pos += chunk_bytes

    result.write(b']\n}')
If the file format is 100% predictable, you can throw out the regular expressions and use mm[:head_bytes].index(b'...') etc for the start/end position arithmetic.
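A minimal sketch of that variant, assuming every file literally contains the byte sequence "array_element": [ near the start and ends with ]} (the exact marker strings are an assumption, and error handling for missing markers is omitted):

marker = b'"array_element": ['                     # assumed to appear verbatim in the first head_bytes
start_pos = mm.find(marker, 0, head_bytes) + len(marker)
end_pos = mm.rfind(b']', mm.size() - tail_bytes)   # position of the ']' that closes the array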

Get value of object in json file if var is equal to object name - python

I have a function that sees what card is in a player's hand and will add to their score depending on the card in their hand. I have all the card values stored in a JSON file. I have this code so far:
with open("values.json") as values:
value = json.load(values)
for i in range(0, len(hand)):
card = hand[i]
values.json:
{
    "3Hearts": 3
}
If the card is 3Hearts, how could I get the 3 to be returned?
Or is there a better way to store the data?
I will admit I am not very familiar with JSON files. However, if the JSON file is not a necessity, you could just store the data in another .py file (Cards.py, for example).
Also, because you are using Python, you would be better off making a Card class and creating Card objects.
This is what it would look like:
# Make Card class
class Card:
    def __init__(self, name, number):
        self.name = name
        self.number = number

# Make Card objects
threehearts = Card("3Hearts", "3")
Here I used threehearts instead of 3Hearts because a Python identifier cannot start with a digit. To compensate, I added a Card.name attribute where you can "name" the card "3Hearts" as you did in the question.
So assuming you are going to use that .py file to store your data, this is what I would propose:
# Import data here
from Cards import *

# Make the player's hand
hand = [threehearts]

# Display the number corresponding to the player's hand
for i in range(0, len(hand)):
    card = hand[i]
    print(card.number)
The output of this code will be:
3
You can also store hand = [threehearts] in the Cards.py file as well if you need to.

Is it bad practice to have more than 1 geometry column in a GeoDataFrame?

I'm trying to create a GeoDataFrame with 2 zip codes per row, whose distances from each other I want to compare.
I took a list of approximately 220 zip codes and ran an itertools combination on them to get all combinations, then unpacked the tuples into two columns:
code_combo = list(itertools.combinations(df_with_all_zip_codes['code'], 2))
df_distance_ctr = pd.DataFrame(code_combo, columns=['first_code','second_code'])
Then I did some standard pandas merges and column renaming to get the polygon/geometry column from the original geodataframe into this new one, right beside the respective zip code columns.
The problem is I can't seem to get the polygon columns to be read as geometry, even after 1) attempting to convert the dataframe to a GeoDataFrame - AttributeError: No geometry data set yet, and 2) applying wkt.loads to the geometry column - AttributeError: 'MultiPolygon' object has no attribute 'encode'.
I've tried to look for a way to convert a series to a geoseries but can't find anything on SO nor the documentation. Can anyone please point out where I'm likely going wrong?
Looking at the __init__ method of a GeoDataFrame at https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py, it looks like a GeoDataFrame can only have one active geometry column at a time. The other columns you've created should still have geometry objects in them, though.
Since you still have geometry objects in each column, you could write a method that uses Shapely's distance method, like so:
import pandas as pd
import geopandas
from shapely.geometry import Point
import matplotlib.pyplot as plt

lats = [-34.58, -15.78, -33.45, 4.60, 10.48]
lons = [-58.66, -47.91, -70.66, -74.08, -66.86]
df = pd.DataFrame(
    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': lats,
     'Longitude': lons})

df['Coordinates'] = list(zip(df.Longitude, df.Latitude))
df['Coordinates'] = df['Coordinates'].apply(Point)
df['Coordinates_2'] = list(zip(lons[::-1], lats[::-1]))
df['Coordinates_2'] = df['Coordinates_2'].apply(Point)
gdf = geopandas.GeoDataFrame(df, geometry='Coordinates')

def get_distance(row):
    distance = row.Coordinates.distance(row.Coordinates_2)
    print(distance)
    return distance

gdf['distance'] = gdf.apply(lambda row: get_distance(row), axis=1)
As for the AttributeError: 'MultiPolygon' object has no attribute 'encode': MultiPolygon is a Shapely geometry class, and encode is a method on string objects, so your column already contains geometry objects and you can probably remove the call to wkt.loads.
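If you still want a proper GeoSeries out of such a column, a minimal sketch, assuming the column already holds Shapely geometry objects (the column name second_code_geometry is hypothetical):

import geopandas

# Wrap an existing pandas Series of Shapely geometries in a GeoSeries,
# which then exposes the usual geopandas spatial methods.
second_geoseries = geopandas.GeoSeries(df_distance_ctr['second_code_geometry'])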

Getting alignment/attention during translation in OpenNMT-py

Does anyone know how to get the alignment weights when translating in OpenNMT-py? Usually the only output is the resulting sentences, and I have tried to find a debugging flag or similar for the attention weights. So far, I have been unsuccessful.
I'm not sure if this is a new feature, since I did not come across this when looking for alignments a few months back, but onmt seems to have added a flag -report_align to output word alignments along with the translation.
https://opennmt.net/OpenNMT-py/FAQ.html#raw-alignments-from-averaging-transformer-attention-heads
Excerpt from opennmt.net -
Currently, we support producing word alignment while translating for Transformer based models. Using -report_align when calling translate.py will output the inferred alignments in Pharaoh format. Those alignments are computed from an argmax on the average of the attention heads of the second to last decoder layer.
You can get the attention matrices. Note that it is not the same as alignment which is a term from statistical (not neural) machine translation.
There is a thread on GitHub discussing it. Here is a snippet from the discussion. When you get the translations from the model, the attentions are in the attn field.
import onmt
import onmt.io
import onmt.translate
import onmt.ModelConstructor
from collections import namedtuple

# Load the model.
Opt = namedtuple('Opt', ['model', 'data_type', 'reuse_copy_attn', "gpu"])
opt = Opt("PATH_TO_SAVED_MODEL", "text", False, 0)
fields, model, model_opt = onmt.ModelConstructor.load_test_model(
    opt, {"reuse_copy_attn": False})

# Test data
data = onmt.io.build_dataset(
    fields, "text", "PATH_TO_DATA", None, use_filter_pred=False)
data_iter = onmt.io.OrderedIterator(
    dataset=data, device=0,
    batch_size=1, train=False, sort=False,
    sort_within_batch=True, shuffle=False)

# Translator
translator = onmt.translate.Translator(
    model, fields, beam_size=5, n_best=1,
    global_scorer=None, cuda=True)
builder = onmt.translate.TranslationBuilder(
    data, translator.fields, 1, False, None)

batch = next(data_iter)
batch_data = translator.translate_batch(batch, data)
translations = builder.from_batch(batch_data)
translations[0].attn  # <--- here are the attentions