Image classification on my own data (deep learning)

I have my own dataset folder which has 40 images, and I want to perform image classification on my data. I don't understand how to load the dataset images into Jupyter or how to perform image classification on them.

Let's say the name of your data directory is train.
Create a new folder inside the train directory and name it dog (one folder per class).

import os
import cv2
import matplotlib.pyplot as plt

Datadirectory = 'train/'
Classes = ['dog']
for category in Classes:
    path = os.path.join(Datadirectory, category)
    for img in os.listdir(path):
        # Read the image with OpenCV (BGR) and display it with matplotlib (RGB)
        img_array = cv2.imread(os.path.join(path, img))
        plt.imshow(cv2.cvtColor(img_array, cv2.COLOR_BGR2RGB))
        plt.show()
        break
    break
This will read and display the first image, confirming that your data loads correctly.
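If you then want to go further and actually build training data for a classifier, here is a minimal sketch, assuming you resize every image to a fixed size and label it by its class folder; IMG_SIZE and the normalization step are illustrative assumptions, not part of the original answer.

import os
import cv2
import numpy as np

IMG_SIZE = 128  # assumed target size, adjust to your model
training_data = []
for class_index, category in enumerate(Classes):
    path = os.path.join(Datadirectory, category)
    for img in os.listdir(path):
        img_array = cv2.imread(os.path.join(path, img))
        if img_array is None:  # skip unreadable files
            continue
        resized = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
        training_data.append((resized, class_index))

X = np.array([img for img, _ in training_data], dtype=np.float32) / 255.0  # normalized pixel values
y = np.array([label for _, label in training_data])
# X and y can now be fed to any image classifier (for example a small Keras CNN).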


Palantir foundry code workbook, export individual xmls from dataset

I have a dataset which has an XML column, and I am trying to export the individual XMLs as files, with the filename taken from another column, using a Code Workbook.
I filtered the rows I want using the code below:
def prepare_input(xml_with_debug):
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    filter_column = "key"
    filter_value = "test_key"
    df_filtered = xml_with_debug.filter(F.col(filter_column) == filter_value)

    approx_number_of_rows = 1
    sample_percent = float(approx_number_of_rows) / df_filtered.count()
    df_sampled = df_filtered.sample(False, sample_percent, seed=0)

    important_columns = ["key", "xml"]
    return df_sampled.select([F.col(c).cast(StringType()).alias(c) for c in important_columns])
It works up to here. For the last part I tried the following in a Python task, but it complained about the parameters (I probably set it up wrongly). And even if it worked, I think it would produce a single file.
from transforms.api import transform, Input, Output

@transform(
    output=Output("/path/to/python_csv"),
    my_input=Input("/path/to/input")
)
def my_compute_function(output, my_input):
    output.write_dataframe(my_input.dataframe().coalesce(1), output_format="csv", options={"header": "true"})
I am trying to set it up in the GUI like below.
My question, I guess, is: what should the code in the last Python task (write_file) be, after prepare_input, so that I extract the individual XMLs (and, if possible, zip them into a single file for download)?
You can access the output dataset filesystem and write files into it in whatever format you want.
The documentation for that can be found here: https://www.palantir.com/docs/foundry/code-workbook/transforms-unstructured/#writing-files
(If you want to do it from a code repository it's very similar https://www.palantir.com/docs/foundry/transforms-python/unstructured-files/#writing-files)
By doing that you can create multiple different files or you can create a single zip file and write it into a dataset.
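As a rough illustration of that approach, here is a minimal sketch written against the Code Repositories transforms API (the Code Workbook variant linked above is analogous); the dataset paths and the key/xml column names are assumptions taken from the question, and this is a sketch rather than a tested implementation.

import io
import zipfile
from transforms.api import transform, Input, Output

@transform(
    output=Output("/path/to/xml_files"),
    my_input=Input("/path/to/input"),
)
def write_xml_files(output, my_input):
    # Assumes the sampled input is small enough to collect to the driver
    rows = my_input.dataframe().select("key", "xml").collect()

    # Option 1: one XML file per row, named after the "key" column
    for row in rows:
        with output.filesystem().open(f"{row['key']}.xml", "w") as f:
            f.write(row["xml"])

    # Option 2: a single zip archive containing all of the XMLs
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for row in rows:
            zf.writestr(f"{row['key']}.xml", row["xml"])
    with output.filesystem().open("xmls.zip", "wb") as f:
        f.write(buffer.getvalue())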

I am training YOLOv5. I have label .txt files that contain 60 labels, but I want to train the model on only 3 classes; how can I do that?

I am training YOLOv5 on the xView dataset, which contains 60 classes, and the label .txt files contain all 60 labels. But I want to train the model on only 3 classes so that training is faster. Does anyone know how I can do that? Should I change the class names in data.yaml?
Delete all the classes you don't want to use from the .txt label files that belong to the images in your dataset (you can script this, as sketched below). Then modify the label files and data.yaml to reflect your new 3-class setup. It should work.
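As an illustration, here is a minimal Python sketch that does the filtering; the class IDs in KEEP and the label directory path are assumptions you would replace with your own, and it assumes the standard YOLO label format of one "class x_center y_center width height" line per object.

import glob
import os

KEEP = {0: 0, 10: 1, 23: 2}  # original class IDs to keep, remapped to 0..2 (illustrative values)
LABEL_DIR = "datasets/xview/labels/train"  # assumed path, adjust to your layout

for label_path in glob.glob(os.path.join(LABEL_DIR, "*.txt")):
    kept_lines = []
    with open(label_path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            class_id = int(parts[0])
            if class_id in KEEP:
                # Remap the class ID and keep the box coordinates unchanged
                kept_lines.append(" ".join([str(KEEP[class_id])] + parts[1:]))
    with open(label_path, "w") as f:
        f.write("\n".join(kept_lines) + ("\n" if kept_lines else ""))

# After this, set nc: 3 and list the three class names in data.yaml.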

Embedding Altair HTMLs in Google Sites: how to `mark_image` using private Google Drive links?

I am trying to embed an interactive plot made with Altair into a Google Site. In this plot, I want to interactively display one image at a time, where the images are stored on Google Drive. When I attempted this, mark_image failed silently, presumably because it could not read the image. This is no surprise, because the Google Drive images are private; with publicly shared images I wouldn't have this issue. For the purpose of this plot, I would like to keep the images private. Also, there are a lot of images in total (~1K), so I probably should not encode them as data/bytes; I suspect that would make my HTML file very big and slow. Please correct me if I am wrong about this.
I wonder if mark_image could read the images from the Google Drive links, perhaps through a "reader" of some sort (an upstream JS or Python library) that feeds the read image to mark_image. If anybody has experience with this, solutions, suggestions, or workarounds would be greatly appreciated.
Here's a demo code to test this:
Case 1: Publicly accessible image (no problem). Displayed using mark_image and saved in HTML format.
import altair as alt
import pandas as pd

path = "https://vega.github.io/vega-datasets/data/gimp.png"
source = pd.DataFrame([{"x": 0, "y": 0, "img": path}])
chart = alt.Chart(source).mark_image(width=100, height=100).encode(x='x', y='y', url='img')
chart.save('test.html')
Then I embed the HTML in a Google Site (private, not shared to the public) using this option, and paste the content of the HTML file into the Embed code tab.
Case 2: Image on Google Drive (problem!). The case of an image stored on Google Drive (private, not shared to the public).
# Please use the code above with `path` variable generated like this:
file_id='' # google drive file id
path=f"https://drive.google.com/uc?export=view&id={file_id}"
In this case, mark_image apparently fails silently and the image is not shown on the plot.
After searching for a better solution, I decided to fall back on the workaround of encoding the images as data/bytes. This eliminates the issue of reading the URLs from Google Drive, for which I could not find a solution.
Encoding the images as data/bytes, as I suspected, made the HTML large, but, surprisingly (to me), not slow to load at all. I guess that's the best I could do for what I wanted.
In the example below, the get_data function obtains the data/bytes of an image. I put that into a column of the dataframe that Altair takes as input.
def plot_(images_from):
    import altair as alt
    import pandas as pd
    import numpy as np
    np.random.seed(0)
    n_objects = 20
    n_times = 50

    # Create one (x, y) pair of metadata per object
    locations = pd.DataFrame({
        'id': range(n_objects),
        'x': np.random.randn(n_objects),
        'y': np.random.randn(n_objects)
    })

    def get_data(p):
        import base64
        with open(p, "rb") as f:
            return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

    import urllib.request
    if images_from == 'url':
        l1 = [f"https://vega.github.io/vega-datasets/data/{k}.png" for k in ['ffox', '7zip', 'gimp']]
    elif images_from == 'data':
        l1 = [get_data(urllib.request.urlretrieve(f"https://vega.github.io/vega-datasets/data/{k}.png", f'/tmp/{k}.png')[0]) for k in ['ffox', '7zip', 'gimp']]
    np.random.seed(0)
    locations['img'] = np.random.choice(l1, size=len(locations))

    # Create a 50-element time-series for each object
    timeseries = pd.DataFrame(np.random.randn(n_times, n_objects).cumsum(0),
                              columns=locations['id'],
                              index=pd.RangeIndex(0, n_times, name='time'))
    # Melt the wide-form timeseries into a long-form view
    timeseries = timeseries.reset_index().melt('time')
    # Merge the (x, y) metadata into the long-form view
    timeseries['id'] = timeseries['id'].astype(int)  # make merge not complain
    data = pd.merge(timeseries, locations, on='id')

    # Data is prepared, now make a chart
    selector = alt.selection_single(empty='none', fields=['id'])
    base = alt.Chart(data).properties(
        width=250,
        height=250
    ).add_selection(selector)
    points = base.mark_point(filled=True, size=200).encode(
        x='mean(x)',
        y='mean(y)',
        color=alt.condition(selector, 'id:O', alt.value('lightgray'), legend=None),
    )
    timeseries = base.mark_line().encode(
        x='time',
        y=alt.Y('value', scale=alt.Scale(domain=(-15, 15))),
        color=alt.Color('id:O', legend=None)
    ).transform_filter(
        selector
    )
    images = base.mark_image(filled=True, size=200).encode(
        x='x',
        y='y',
        url='img',
    ).transform_filter(
        selector
    )
    chart = points | timeseries | images
    chart.save(f'test/chart_images_{images_from}.html')

# generate htmls
plot_(images_from='url')   # generate the HTML using URLs
plot_(images_from='data')  # generate the HTML using data/bytes
The HTML made using the data was about 78 times bigger than the one made using URLs (~12 MB vs ~0.16 MB), but not noticeably slower.
Update: As I later found out, Google Sites does not allow embedding an HTML file larger than 1 MB, so in the end encoding the images did not really help.

How can I process large files in Code Repositories?

I have a data feed that delivers a large .txt file (50-75 GB) every day. The file contains several different schemas, where each row corresponds to one schema. I would like to split this into partitioned datasets, one per schema; how can I do this efficiently?
The largest problem you need to solve is the iteration speed for recovering your schemas, which can be challenging for a file at this scale.
Your best tactic here is to get an example 'notional' file that contains one line for each of the schemas you want to recover, and to add it as a file in your repository. When you add this file to your repo (alongside your transformation logic), you will be able to push it into a DataFrame, much as you would with the raw files in your dataset, for quick testing iteration.
First, make sure you specify .txt files as part of your package contents so that your tests will discover them (this is covered in the documentation under Read a file from a Python repository):
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, edit setup.py in your Python repository:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.txt']
    }
)
I am using a txt file with the following contents:
my_column, my_other_column
some_string,some_other_string
some_thing,some_other_thing,some_final_thing
This text file is at the following path in my repository: transforms-python/src/myproject/datasets/raw.txt
Once you have configured the text file to be shipped with your logic, and after you have included the file itself in your repository, you can then include the following code. This code has a couple of important functions:
It keeps raw file parsing logic completely separate from the stage of reading the file into a Spark DataFrame. This is so that the way this DataFrame is constructed can be left to the test infrastructure, or to the run time, depending on where you are running.
Keeping the logic separate lets you ensure that the actual row-by-row parsing you want to do is its own testable function, instead of living purely inside your my_compute_function.
This code uses the Spark-native spark_session.read.text method, which will be orders of magnitude faster than row-by-row parsing of a raw txt file. This will ensure the parallelized DataFrame is what you operate on, not a single file, line by line, inside your executors (or worse, your driver).
from transforms.api import transform, Input, Output
from pkg_resources import resource_filename


def raw_parsing_logic(raw_df):
    return raw_df


@transform(
    my_output=Output("/txt_tests/parsed_files"),
    my_input=Input("/txt_tests/dataset_of_files"),
)
def my_compute_function(my_input, my_output, ctx):
    all_files_df = None
    for file_status in my_input.filesystem().ls('**/**'):
        raw_df = ctx.spark_session.read.text(my_input.filesystem().hadoop_path + "/" + file_status.path)
        parsed_df = raw_parsing_logic(raw_df)
        all_files_df = parsed_df if all_files_df is None else all_files_df.unionByName(parsed_df)
    my_output.write_dataframe(all_files_df)


def test_my_compute_function(spark_session):
    file_path = resource_filename(__name__, "raw.txt")
    raw_df = raw_parsing_logic(
        spark_session.read.text(file_path)
    )
    assert raw_df.count() > 0
    raw_columns_set = set(raw_df.columns)
    expected_columns_set = {"value"}
    assert len(raw_columns_set.intersection(expected_columns_set)) == 1
Once you have this code up and running, your test_my_compute_function method will be very fast to iterate on, so you can perfect your schema recovery logic. This makes it substantially easier to get your dataset building right at the very end, without any of the overhead of a full build.
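To make that concrete, here is one possible shape raw_parsing_logic could take, assuming, purely for illustration, that the schemas in the feed can be told apart by the number of comma-separated fields on a line; the schema_id and fields column names are made up.

from pyspark.sql import functions as F

def raw_parsing_logic(raw_df):
    # Illustrative only: split each raw line on commas and tag it with a
    # hypothetical schema identifier based on its field count.
    split_col = F.split(F.col("value"), ",")
    return (
        raw_df
        .withColumn("fields", split_col)
        .withColumn("schema_id", F.size(split_col))
    )

From there you can filter on schema_id (or whatever marker your real feed carries) and select the appropriate fields into one output dataset per schema.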

How to create a custom PyTorch dataset when the order and the total number of training samples are not known in advance?

I have a 42 GB jsonl file. Every element of this file is a JSON object, and I create training samples from every JSON object. However, the number of training samples that I extract from each JSON object can vary between 0 and 5. What is the best way to create a custom PyTorch dataset without reading the entire jsonl file into memory?
This is the dataset I am talking about - Google Natural Questions.
You have a couple of options.
The simplest option, if having lots of small files is not a problem, is to preprocess each JSON object into its own file. Then you can just read the one that corresponds to the requested index, e.g.:
import numpy as np
from torch.utils.data import Dataset

class SingleFileDataset(Dataset):
    def __init__(self, list_of_file_paths):
        self.list_of_file_paths = list_of_file_paths
    def __getitem__(self, index):
        return np.load(self.list_of_file_paths[index])  # Or equivalent reading code for a single file
    def __len__(self):
        return len(self.list_of_file_paths)
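For the preprocessing step itself, here is a minimal sketch that streams the 42 GB jsonl line by line (so nothing is held in memory) and saves each extracted sample as its own .npy file; extract_samples and the output directory are hypothetical placeholders for your own extraction logic.

import json
import os
import numpy as np

def extract_samples(obj):
    # Hypothetical placeholder: return 0-5 training samples (as numpy arrays)
    # derived from one Natural Questions JSON object.
    return []

def preprocess_jsonl(jsonl_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    file_paths = []
    with open(jsonl_path) as f:  # iterates lazily, one line at a time
        for line_no, line in enumerate(f):
            for i, sample in enumerate(extract_samples(json.loads(line))):
                path = os.path.join(out_dir, f"{line_no}_{i}.npy")
                np.save(path, sample)
                file_paths.append(path)
    return file_paths  # pass this list to SingleFileDataset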
You can also split the data into a constant number of files and then calculate, given the index, which file the sample resides in. Then you need to load that file into memory and read the appropriate index. This gives a trade-off between disk access and memory usage. Assume you have n samples and split them evenly into c files during preprocessing. Now, to read the sample at index i we would do:
class SplitIntoFilesDataset(Dataset):
    def __init__(self, list_of_file_paths, n_splits):
        self.list_of_file_paths = list_of_file_paths
        self.n_splits = n_splits
    def __getitem__(self, index):
        # With a round-robin split into n_splits files, sample `index` lives in
        # file `index % n_splits`, at position `index // n_splits` within that file
        file_to_load = self.list_of_file_paths[index % self.n_splits]
        # Load the file and pick out the requested sample
        file_contents = np.load(file_to_load)
        return file_contents[index // self.n_splits]
Finally, you could use an HDF5 file, which allows access to rows on disk. This is possibly the best solution if you have a lot of data, since the data will be close together on disk. There's an implementation here, which I have copy-pasted below:
import h5py
import torch
import torch.utils.data as data

class H5Dataset(data.Dataset):
    def __init__(self, file_path):
        super(H5Dataset, self).__init__()
        h5_file = h5py.File(file_path, 'r')  # open read-only
        self.data = h5_file.get('data')
        self.target = h5_file.get('label')

    def __getitem__(self, index):
        return (torch.from_numpy(self.data[index, :, :, :]).float(),
                torch.from_numpy(self.target[index, :, :, :]).float())

    def __len__(self):
        return self.data.shape[0]
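For completeness, a short usage sketch, assuming a hypothetical samples.h5 file that was written with 'data' and 'label' datasets; the file name and batch size are illustrative.

from torch.utils.data import DataLoader

dataset = H5Dataset("samples.h5")  # hypothetical file containing 'data' and 'label' datasets
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)  # num_workers=0 avoids sharing the open HDF5 handle across worker processes

for inputs, targets in loader:
    # inputs / targets are float tensors read from disk batch by batch
    pass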