How to convert multiple images to csv? - csv

I want to run some images through a neural network, and I want to create a .csv file for the data. How can I create a csv that will represent the images and keep each image separate?

One way to approach this is to use NumPy to convert the image to an array, which can then be written out as a CSV file or simply a comma-separated list.
The CSV data can be manipulated later, or the original image can be reconstructed when needed.
Here is basic code that demonstrates the concept:
from PIL import Image  # modern Pillow import; a bare "import Image" only works with the legacy PIL package
import numpy as np

# Function to convert an image to an array or a nested list
def loadImage(inFileName, outType):
    img = Image.open(inFileName)
    img.load()
    data = np.asarray(img, dtype="int32")
    if outType == "anArray":
        return data
    if outType == "aList":
        return data.tolist()  # nested Python lists rather than a list of NumPy rows

# Load an image into an array
myArray1 = loadImage("bug.png", "anArray")
# Load an image into a list
myList1 = loadImage("bug.png", "aList")
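To actually produce a CSV from those arrays, here is a minimal sketch (the one-image-per-row layout, the second filename, and the output name are assumptions for illustration):
import numpy as np

# Assumed layout: each image becomes one row of the CSV, flattened to 1-D
images = [loadImage("bug.png", "anArray"), loadImage("bug2.png", "anArray")]  # "bug2.png" is hypothetical
flat = np.stack([img.reshape(-1) for img in images])

# Write one image per row, comma separated
np.savetxt("images.csv", flat, fmt="%d", delimiter=",")

# Read back and restore the original shape (assumed identical for all images)
restored = np.loadtxt("images.csv", delimiter=",", dtype="int32").reshape((-1,) + images[0].shape)
This keeps each image on its own row, but it assumes all images share the same dimensions; if they don't, store the shapes alongside the rows.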

You can encode your image in Base64 and still use CSV, since the comma is not part of the Base64 character set.
See: Best way to separate two base64 strings
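A minimal sketch of that idea (the input filenames and the filename/data column layout are assumptions for illustration):
import base64
import csv

filenames = ["bug.png", "bug2.png"]  # hypothetical input files

with open("images_b64.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["filename", "data"])
    for name in filenames:
        with open(name, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")
        writer.writerow([name, encoded])  # Base64 contains no commas, so the CSV stays well-formed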

If possible, create a storage location just for images. If your images have unique filenames, then all you need to track is the filename. If they do not have unique filenames, you can assign them using a timestamp-plus-randomizer function to name each photo. Once named, an image must be stored in the proper location so that the filename alone is enough to reference it.
Due to size constraints, I would not recommend storing the actual images in the csv.
Cheers!
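A minimal sketch of that approach (the directory name, the label column, and the helper function here are hypothetical):
import csv
import os
import random
import time

IMAGE_DIR = "images"  # assumed dedicated storage location

def unique_name(ext=".png"):
    # Timestamp plus a random suffix, as suggested above
    return f"{int(time.time() * 1000)}_{random.randint(0, 9999)}{ext}"

# The CSV only tracks filenames (and optionally labels), never pixel data
with open("index.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["filename", "label"])
    writer.writerow([os.path.join(IMAGE_DIR, unique_name()), "bug"])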

I guess that depends a lot on what algorithm and what implementation you select. It is not even clear that CSV is the correct choice.
For your stated requirements, the Netpbm format comes to mind; if you want one line per image, just squish all the numbers onto one line. Note that a naive neural network will ignore the topology of the image; you'd need a somewhat more advanced setup to take it into account.
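For reference, a plain (ASCII) Netpbm grayscale file (PGM, type P2) is simple enough to write by hand. A minimal sketch, assuming the image is already an 8-bit grayscale NumPy array:
import numpy as np

def write_pgm(path, img):
    # img: 2-D array of grayscale values in 0..255
    height, width = img.shape
    with open(path, "w") as f:
        f.write(f"P2\n{width} {height}\n255\n")
        # One image row per line; a strictly conforming writer would also cap lines at 70 characters
        for row in img:
            f.write(" ".join(str(v) for v in row) + "\n")

write_pgm("toy.pgm", np.random.randint(0, 256, size=(8, 8)))  # toy 8x8 grayscale image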

Related

JSON variable indent for different entries

Background: I want to store a dict object in JSON format that has, say, 2 entries:
(1) Some object that describes the data in (2). This is small data: mostly definitions, controlling parameters, and other things (call it metadata) that one would like to read before using the actual data in (2). In short, I want good human readability for this portion of the file.
(2) The data itself, which is a large chunk: it should be machine-readable, with no need for a human to gaze over it on opening the file.
Problem: How do I specify a custom indent, say 4, for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4) where data = {'meta_data': small_description, 'actual_data': big_chunk}, the large data also gets a lot of whitespace, making the file large.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the "container" object by hand, using your JSON formatter as appropriate for each of the contained objects.
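A minimal sketch of those steps (small_description and big_chunk are the names from the question; their contents and the output filename are made up):
import json

small_description = {"version": 1, "units": "seconds"}  # hypothetical metadata
big_chunk = list(range(1000))                           # hypothetical large data

with open("out.json", "w") as f:
    f.write('{"meta_data":\n')
    json.dump(small_description, f, indent=4)   # human-readable part
    f.write(',\n"actual_data":\n')
    json.dump(big_chunk, f, indent=None)        # compact, machine-readable part
    f.write("\n}")
The result is still a single valid JSON document that json.load can read back as a whole.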
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
"description": "human-readable content goes here",
"split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See How do I use the 'json' module to read in one JSON object at a time? for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
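A minimal reading sketch for that layout (the filename is an assumption; this relies on each value document directly following its "next_item" marker):
import ijson

with open("interleaved.json", "rb") as f:
    docs = ijson.items(f, "", multiple_values=True)  # yields each top-level document in turn
    for marker in docs:
        name = marker["next_item"]
        value = next(docs)  # the very next document is the value announced by the marker
        print(name, type(value))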

Julia: Visualize images saved in csv form

What would be the best way of visualizing images saved in .csv format?
The following doesn't work:
using CSV, ImageView
data = CSV.read("myfile.csv");
imshow(data)
This is the error:
MethodError: no method matching pixelspacing(::DataFrames.DataFrame)
Closest candidates are:
pixelspacing(!Matched::MappedArrays.AbstractMultiMappedArray) at /Users/xxx/.julia/packages/ImageCore/yKxN6/src/traits.jl:63
pixelspacing(!Matched::MappedArrays.AbstractMappedArray) at /Users/xxx/.julia/packages/ImageCore/yKxN6/src/traits.jl:62
pixelspacing(!Matched::OffsetArrays.OffsetArray) at /Users/xxx/.julia/packages/ImageCore/yKxN6/src/traits.jl:67
...
Stacktrace:
[1] imshow(::Any, ::Reactive.Signal{GtkReactive.ZoomRegion{RoundingIntegers.RInt64}}, ::ImageView.SliceData, ::Any; name::Any, aspect::Any) at /Users/xxx/.julia/packages/ImageView/sCn9Q/src/ImageView.jl:269
[2] imshow(::Any; axes::Any, name::Any, aspect::Any) at /Users/xxx/.julia/packages/ImageView/sCn9Q/src/ImageView.jl:260
[3] imshow(::Any) at /Users/xxx/.julia/packages/ImageView/sCn9Q/src/ImageView.jl:259
[4] top-level scope at In[5]:2
[5] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
Reference on github.
This question was answered at https://github.com/JuliaImages/ImageView.jl/issues/241. Copying the answer here:
imshow(Matrix(data))
where data is your DataFrame. But CSV is a poor choice for images; use Netpbm if you simply must have text-formatted images, otherwise a binary format is recommended. Binary Netpbm files are especially easy to write if you have to roll your own writer (e.g., if the images are coming from some language that doesn't support other file formats); otherwise PNG is typically a good choice.
Does the CSV file have a header line of names for its columns or is it just a delimited file full of text number values?
If the CSV file is actually in the form of a matrix of values, such that the values are the bytes of a 2D image, you may use DelimitedFiles -- see the readdlm() docs. Read the file with readdlm() into a matrix and see if ImageView can display the result.

use dask to store larger than memory csv file(s) to hdf5 file

Task: read larger than memory csv files, convert to arrays and store in hdf5.
One simple way is to use pandas to read the files in chunks, but I wanted to use dask; so far without success.
Latest attempt:
fname='test.csv'
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None)
dset.to_records().to_hdf5('/tmp/test.h5', '/x')
How could I do this?
Actually, I have a set of csv files representing 2D slices of a 3D array
that I would like to assemble and store. A suggestion on how to do the latter
would be welcome as well.
Given the comments below, here is one of many variations I tried:
dset = dd.read_csv(fname, sep=',', skiprows=0, header=None, dtype='f8')
shape = (num_csv_records(fname), num_csv_cols(fname))
arr = da.Array( dset.dask, 'arr12345', (500*10, shape[1]), 'f8', shape)
da.to_hdf5('/tmp/test.h5', '/x', arr)
which results in the error:
KeyError: ('arr12345', 77, 0)
You will probably want to do something like the following. The real crux of the problem is that, in the read_csv case, dask doesn't know the number of rows of the data before a full load, so the resulting dataframe has an unknown length (as is usually the case for dataframes). Arrays, on the other hand, generally need to know their complete shape for most operations. In your case you have extra information, so you can sidestep the problem.
Here is an example.
Data
0,1,2
2,3,4
Code
import dask.dataframe as dd

dset = dd.read_csv('data', sep=',', skiprows=0, header=None)
arr = dset.astype('float').to_dask_array(lengths=True)
arr.to_hdf5('/test.h5', '/x')
Where "True" means "find the lengths", or you can supply your own set of values.
You should use the to_hdf method on dask dataframes instead of to_hdf5 on dask arrays:
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_hdf('myfile.hdf', '/data')
Alternatively, you might consider using Parquet. This will be faster and simpler in many ways:
import dask.dataframe as dd
df = dd.read_csv('myfile.csv')
df.to_parquet('myfile.parquet')
For more information, see the documentation on creating and storing dask dataframes: http://docs.dask.org/en/latest/dataframe-create.html
For arrays
If for some reason you really want to convert to a dask array first, then you'll need to figure out how many rows each chunk of your data has and assign that to the chunks attribute. See http://docs.dask.org/en/latest/array-chunks.html#unknown-chunks. I don't recommend this approach, though; it's needlessly complex.
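For completeness, a rough sketch of that array route (the per-partition row counts below are made up; passing lengths explicitly to to_dask_array is just a shortcut for setting the chunk sizes yourself):
import dask.dataframe as dd

dset = dd.read_csv('data', sep=',', skiprows=0, header=None)
# Suppose you already know the row count of each partition, e.g. two partitions of 500 and 300 rows
arr = dset.astype('float').to_dask_array(lengths=[500, 300])
arr.to_hdf5('/test.h5', '/x')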

Python Loop through variable in URLs

What I want to do here is change a user id within a URL, for every URL, and then get the output from each URL.
What I did so far:
import urllib
import requests
import json
url="https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
data=requests.get(url).json()
print (data['user'])
(I put 'user' inside the print because it gives all the information about the focal user in JSON format.)
My question: I want to change the user id (which is 12345 in this example URL) to another number (any random number) and then get the output from every URL I build. For example, change it to 5211 and get the result, then change it to 959444 and get the result, and so on. I think I need a loop that iterates by changing only the number inside the URL, but I kept failing because of the difficulty of splitting the original URL and changing only the user id within it. Could anyone help me out?
Thank you so much in advance.
===================== Follow-up question =====================
Thank you for your previous answer! I built on it and got it working, but then ran into another issue. I can now iterate through and fetch each user's information in JSON format. The output gave me single quotes (rather than double quotes) and a weird u' notation in front of every key, but I was able to solve that and clean it up into neatly formatted JSON.
My plan is to convert each JSON document into one row of a single CSV file, stacking everything I scrape into that one file. For example, the first JSON document, for user1, becomes row 1: the JSON keys become the column names and the corresponding values fill those columns. The second JSON document I scrape goes into the same CSV file as row 2, and so on.
import pandas as pd
from pandas.io.json import json_normalize  # in newer pandas: from pandas import json_normalize

eg_data = [data['user']]
df = pd.DataFrame.from_dict(json_normalize(data['user']))
print(df)
df.to_csv('C:/Users/todd/Downloads/eg.csv')
So, I found that json_normalize flattens the nested brackets, which makes it useful on real-world data, and I tried to use a pandas DataFrame to turn the result into a table. Here I have 2 questions: 1. How do I stack each JSON document I scrape, one per row, into one CSV file? (If there's a way to do this without pandas, that would also be appreciated.) 2. As far as I know, a pandas DataFrame won't give you an output unless every row has the same number of columns, but every JSON document I've scraped has either 10 or 20 columns, depending on whether it has nested brackets or not. In that case, how do I stack all the rows into one CSV file?
Comments or questions will be greatly appreciated.
You can split the URL into two parts initially and join them together every time you generate a random number:
import random

url1 = "https://api.abc.com/users/"
url2 = "?api_key=5632lkjgdlg&_format=_show"
for i in range(4):
    num = random.randint(1000, 10000)  # you can change the range here for generating a random number
    url = url1 + str(num) + url2
    print(url)
OUTPUT
https://api.abc.com/users/2079?api_key=5632lkjgdlg&_format=_show
https://api.abc.com/users/2472?api_key=5632lkjgdlg&_format=_show
and so on...
But if you want to split at that exact place without knowing what the URL looks like beforehand, you can use a regex, since you know for sure that a ? comes right after the number.
import re

url = "https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
matches = re.split(r'\d+(?=\?)', url)
print(matches)
# ['https://api.abc.com/users/', '?api_key=5632lkjgdlg&_format=_show']
Now just set
url1=matches[0]
url2=matches[1]
And use the for loop.
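Putting the pieces together to address the follow-up about stacking every scraped record into one CSV, here is a minimal sketch (the URL parts come from the answer above, the output filename is an assumption, and pd.concat aligns rows with different columns by name, filling the gaps with NaN):
import random
import requests
import pandas as pd
from pandas.io.json import json_normalize  # in newer pandas: from pandas import json_normalize

url1 = "https://api.abc.com/users/"
url2 = "?api_key=5632lkjgdlg&_format=_show"

frames = []
for i in range(4):
    num = random.randint(1000, 10000)
    data = requests.get(url1 + str(num) + url2).json()
    frames.append(json_normalize(data['user']))  # one flattened row per user

# Rows with 10 or 20 columns are aligned by column name; missing values become NaN
result = pd.concat(frames, ignore_index=True, sort=False)
result.to_csv('all_users.csv', index=False)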

Convert blob to text in a mysql export

I have some blob data such as:
0x3333332c2044e963617269652c20356520e9746167650d0a53742d4c617572656e7420285175e9626563292048344e20334d390d0a
that I'd like to convert to text, because the new database has text fields instead of blobs and it now causes trouble with some accented characters.
Is there some kind of blob-to-string converter somewhere?
Thanks a lot!
Try:
CONVERT(blobname USING latin1)
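If you want to sanity-check the encoding outside of MySQL first, here is a small sketch in Python (assuming the hex dump shown in the question and that the text really is latin1-encoded):
# Hex string taken from the question, without the leading 0x
hex_blob = "3333332c2044e963617269652c20356520e9746167650d0a53742d4c617572656e7420285175e9626563292048344e20334d390d0a"
text = bytes.fromhex(hex_blob).decode("latin1")
print(text)  # prints an address containing accented characters, which suggests latin1 is the right charset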
It depends on what the blob is. For example, I've dealt with some blobs that could be represented as basic XML files. Those would have been relatively easy to convert. However, I dealt with other blobs that were image files. If you tried to represent them as text you'd lose data.
What's in your blobs?
Create your new database from your export. Once done, create your text column on the table, populate it using CONVERT, then drop the old blob column, renaming the new one if required.
However, if the data contains a simple byte stream (that is, unstructured data: files, audio, video, whatever) and you need to represent it as pure ASCII, you could convert it to a Base64 string.
If using phpmyadmin, tick the box that says "Dump binary columns in hexadecimal notation (for example, "abc" becomes 0x616263)" at the bottom of the export page.