Read every nth batch in pyarrow.dataset.Dataset - pyarrow

In Pyarrow now you can do:
a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return me every Nth batch instead of every other? Seems like this could be something in FragmentScanOptions but that's not documented at all.

No, there is no way to do that today. I'm not sure what you're after but if you are trying to sample your data there are a few choices but none that achieve quite this effect.
To load only a fraction of your data from disk you can use pyarrow.dataset.head
There is a request in place for randomly sampling a dataset although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
Update: If your dataset is only parquet files then there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
for row_group_fragment in fragment.split_by_row_group():
all_fragments.append(row_group_fragment)
sampled_fragments = all_fragments[::2]
# Have to construct the sample dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()

Related

Psychopy: how to avoid to store variables in the csv file?

When I run my PsychoPy experiment, PsychoPy saves a CSV file that contains my trials and the values of my variables.
Among these, there are some variables I would like to NOT be included. There are some variables which I decided to include in the CSV, but many others which automatically felt in it.
is there a way to manually force (from the code block) the exclusion of some variables in the CSV?
is there a way to decide the order of the saved columns/variables in the CSV?
It is not really important and I know I could just create myself an output file without using the one of PsychoPy, or I can easily clean it afterwards but I was just curious.
PsychoPy spits out all the variables it thinks you could need. If you want to drop some of them, that is a task for the analysis stage, and is easily done in any processing pipeline. Unless you are analysing data in a spreadsheet (which you really shouldn't), the number of columns in the output file shouldn't really be an issue. The philosophy is that you shouldn't back yourself into a corner by discarding data at the recording stage - what about the reviewer who asks about the influence of a variable that you didn't think was important?
If you are using the Builder interface, the saving of onset & offset times for each component is optional, and is controlled in the "data" tab of each component dialog.
The order of variables is also not under direct control of the user, but again, can be easily manipulated at the analysis stage.
As you note, you can of course write code to save custom output files of your own design.
there is a special block called session_variable_order: [var1, var2, var3] in experiment_config.yaml file, which you probably should be using; also, you should consider these methods:
from psychopy import data
data.ExperimentHandler.saveAsWideText(fileName = 'exp_handler.csv', delim='\t', sortColumns = False, encoding = 'utf-8')
data.TrialHandler.saveAsText(fileName = 'trial_handler.txt', delim=',', encoding = 'utf-8', dataOut = ('n', 'all_mean', 'all_raw'), summarised = False)
notice the sortColumns and dataOut params

How to load image from csv file in tensorflow

I have image save in 0.csv files.
The format is as picture below.
How can I read it to tensorflow?
Thanks!
You should use the Dataset input pipeline introduced in tensorflow 1.4:
https://www.tensorflow.org/programmers_guide/datasets#consuming_text_data
Here's the example from the developers guide (though you'll want to read through that guide, it's quite well written):
filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.Dataset.from_tensor_slices(filenames)
# Use `Dataset.flat_map()` to transform each file as a separate nested dataset,
# and then concatenate their contents sequentially into a single "flat" dataset.
# * Skip the first line (header row).
# * Filter out lines beginning with "#" (comments).
dataset = dataset.flat_map(
lambda filename: (
tf.data.TextLineDataset(filename)
.skip(1)
.filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), "#"))))
The Dataset preprocessing pipeline has a few nice advantages. Most of the functionality you'll need such as reading text records, shuffling, batching, etc. are reduced to one-liners. More importantly though, it forces you into writing your preprocessing pipeline in a good, modular, testable way. It takes a little bit to get used to the API, but it's time well spent.

Tf-slim: ValueError: Variable vgg_19/conv1/conv1_1/weights already exists, disallowed. Did you mean to set reuse=True in VarScope?

I am using tf-slim to extract features from several batches of images. The problem is my code works for the first batch , after that I get the error in the title.My code is something like this:
for i in range(0, num_batches):
#Obtain the starting and ending images number for each batch
batch_start = i*training_batch_size
batch_end = min((i+1)*training_batch_size, read_images_number)
#obtain the images from the batch
images = preprocessed_images[batch_start: batch_end]
with slim.arg_scope(vgg.vgg_arg_scope()) as sc:
_, end_points = vgg.vgg_19(tf.to_float(images), num_classes=1000, is_training=False)
init_fn = slim.assign_from_checkpoint_fn(os.path.join(checkpoints_dir, 'vgg_19.ckpt'),slim.get_model_variables('vgg_19'))
feature_conv_2_2 = end_points['vgg_19/pool5']
So as you can see, in each batch, I select a batch of images and use the vgg-19 model to extract features from the pool5 layer. But after the first iteration I get error in the line where I am trying to obtain the end-points. One solution, as I found on the internet is to reset the graph each time , but I don't want to do that because I have some weights in my graph in later part of the code which I train using these extracted features. I don't want to reset them. Any leads highly appreciated. Thanks!
You should create your graph once, not in a loop. The error message tells you exactly that - you try to build the same graph twice.
So it should be (in pseudocode)
create_graph()
load_checkpoint()
for each batch:
process_data()

Octave: Saving Files In a loop

I am currently working on a project that involves long csv files. I have a for loop that separates different values in the time column, then finds the max in each section of time (there are many data points for each point in time). I want to save the data as either a .csv or a .dat, but I can only seem to save either the first or the last value. How can I get octave to save data in a new row on every pass through the loop?
If you are not too keen on writing to file on every loop which is generally slow, you can accumulate data in a variable and write data in one go.
X = [];
for i = 1:100,
X = [X;i]; //instead of i you can use row vectors
end
save("myfile.dat",'X');
And if you are keen on loops then use '-append' option
X = [];
for i = 1 : 10,
save("-append","myfile.dat",'i');
end

Loading a pandas Dataframe into a sql database with Django

I describe the outcome of a strategy by numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest) and a price + weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()
for symbol in symbols.keys():
s = symbols[symbol]
for t in portfolio.prices.index:
p = prices[symbol][t]
w = weights[symbol][t]
row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
if not math.isnan(p):
row.price = p
if not math.isnan(w):
row.weight = w
row.save()
This works but is very, very slow. Is there a chance to achive the same with write_frame from pandas? Or maybe using faster raw sql?
I don't think the first thing you should try is the raw SQL route (more on that in a bit)
But I think it's because of calling row.save() on many objects, that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first, https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is you pass it a list of your StrategyRow model, instead of calling .save() on individual instances. It's pretty straightforward, bundle up a few rows then create them in batches, maybe try 10, 20, a 100 etc at a time, your database configs can also help find the optimum batch size. (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet)
Back to your idea of raw SQL, that would make a difference, if e.g. the Python code that creates the StrategyRow instances is slow (e.g. StrategyRow.objects.create()), but still I believe the key is to batch insert them instead of running N queries