I am currently working on a project that involves long csv files. I have a for loop that separates different values in the time column, then finds the max in each section of time (there are many data points for each point in time). I want to save the data as either a .csv or a .dat, but I can only seem to save either the first or the last value. How can I get octave to save data in a new row on every pass through the loop?
Writing to the file on every pass through the loop is generally slow, so if you can avoid it, accumulate the data in a variable and write it out in one go:
X = [];
for i = 1:100
  X = [X; i];   % instead of i you can use row vectors
end
save("myfile.dat", "X");
And if you do want to write inside the loop, use the '-append' option:
for i = 1:10
  save("-append", "myfile.dat", "i");
end
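If you would rather end up with a .csv directly, another option (just a sketch, with a placeholder row vector and file name) is dlmwrite with the "-append" flag, which writes a new comma-separated row on every pass:
% Sketch only: "row" stands for whatever you compute per time slice,
% e.g. [time_value, max_value]
for i = 1:100
  row = [i, 2*i];                        % placeholder data
  dlmwrite("myfile.csv", row, "-append");
end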
I have a flat file with several hundred thousand rows. This file has no header rows. I need to load just the first row into a hold table and then read the last field into a variable. This hold table has just two columns, first one for most of the row, second for the field I need to move into the variable. Optionally, how can I read this one field, from the flat file, into a variable?
I should note that I am currently loading the entire file, then reading just the first row to get the FILE_NBR into a variable. I would like to speed it up a bit by only loading that first row, instead of the entire file.
My source is a fixed position file, so I am putting all fields except for the last 6 bytes into one field and then the last 6 bytes into the FILE_NBR field.
I am looking to only load one record, instead of the entire file, as I only need that field from one record (the number is the same on every record in the file), for comparison to another table.
For the use case you're describing, I would likely use a Data Flow Task with a Script Component (acting as a source) feeding an OLE DB/ADO Destination.
Assumptions
A variable named #[User::CurrentFileName] exists, is of type String and is populated with a fully qualified path to the source file.
The Script Component, acting as a Source, will have two output columns (ROR, FILENBR) defined with appropriate lengths (not to exceed 4000 characters for ROR; 6 for FILENBR), and the output buffer is left at the default name of Output0.
Approximate source component code (ensure you set CurrentFileName as a ReadOnly variable in the component)
// A variable for holding our data
string inputRow = "";
// Convert the SSIS variable into a C# variable
// (in a Script Component, read-only variables are exposed through the
// strongly typed Variables collection rather than Dts.Variables)
string sourceFile = Variables.CurrentFileName;
// Read from the source file
// (I was lazy, feel free to improve this)
foreach (string line in System.IO.File.ReadLines(sourceFile))
{
inputRow = line;
// We have the one row we want, let's blow this popsicle stand
break;
}
// Split the line into RestOfRow and FileNumber
// Guessing at the exact layout here; adjust the offsets if needed
int lineLen = inputRow.Length;
// Everything except the final 6 characters
string ror = inputRow.Substring(0, lineLen - 6);
// The final 6 characters hold the file number
// (Python slicing would be much more elegant)
string fileNumber = inputRow.Substring(lineLen - 6);
// Now that we have the two pieces we need, let's do the SSIS specific thing
// Create a row in our output buffer and assign values
Output0Buffer.AddRow();
Output0Buffer.ROR = ror;
Output0Buffer.FILENBR = fileNumber;
Ref: Is File.ReadLines buffering read lines?
In PyArrow you can now do:
import pyarrow.dataset as ds

a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return me every Nth batch instead of every other? Seems like this could be something in FragmentScanOptions but that's not documented at all.
No, there is no way to do that today. I'm not sure what you're after but if you are trying to sample your data there are a few choices but none that achieve quite this effect.
To load only a fraction of your data from disk you can use pyarrow.dataset.head
There is a request in place for randomly sampling a dataset although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
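For example (a minimal sketch, assuming a pyarrow version in which Dataset.head is available):
import pyarrow.dataset as ds

a = ds.dataset("blah.parquet")
# Only scans enough of the dataset to produce the first 1000 rows
first_rows = a.head(1000)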
Update: If your dataset is only parquet files then there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
sampled_fragments = all_fragments[::2]
# Have to construct the sampled dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()
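If you want every Nth batch rather than every other, the same trick applies: slice the fragment list with all_fragments[::n] (for whatever step n you need) before constructing the sampled dataset.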
I am using tf-slim to extract features from several batches of images. The problem is that my code works for the first batch, but after that I get the error in the title. My code is something like this:
for i in range(0, num_batches):
    # Obtain the starting and ending image numbers for each batch
    batch_start = i*training_batch_size
    batch_end = min((i+1)*training_batch_size, read_images_number)
    # Obtain the images for this batch
    images = preprocessed_images[batch_start: batch_end]
    with slim.arg_scope(vgg.vgg_arg_scope()) as sc:
        _, end_points = vgg.vgg_19(tf.to_float(images), num_classes=1000, is_training=False)
        init_fn = slim.assign_from_checkpoint_fn(os.path.join(checkpoints_dir, 'vgg_19.ckpt'), slim.get_model_variables('vgg_19'))
        feature_conv_2_2 = end_points['vgg_19/pool5']
So as you can see, in each iteration I select a batch of images and use the vgg-19 model to extract features from the pool5 layer. But after the first iteration I get an error on the line where I try to obtain the end points. One solution I found on the internet is to reset the graph each time, but I don't want to do that because later in the code I train some weights using these extracted features, and I don't want to reset them. Any leads highly appreciated. Thanks!
You should create your graph once, not in a loop. The error message tells you exactly that: you are trying to build the same graph twice.
So it should be (in pseudocode)
create_graph()
load_checkpoint()
for each batch:
    process_data()
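A rough sketch of that structure using the code from the question (the placeholder shape and names such as images_ph are assumptions; variables like checkpoints_dir, preprocessed_images and training_batch_size are the ones already defined in the question):
import os
import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.contrib.slim.nets import vgg

# Build the graph once, feeding batches through a placeholder
images_ph = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])  # assumed VGG input size
with slim.arg_scope(vgg.vgg_arg_scope()):
    _, end_points = vgg.vgg_19(images_ph, num_classes=1000, is_training=False)
features = end_points['vgg_19/pool5']
init_fn = slim.assign_from_checkpoint_fn(
    os.path.join(checkpoints_dir, 'vgg_19.ckpt'),
    slim.get_model_variables('vgg_19'))

with tf.Session() as sess:
    init_fn(sess)  # restore the checkpoint once
    for i in range(num_batches):
        batch_start = i*training_batch_size
        batch_end = min((i+1)*training_batch_size, read_images_number)
        batch = preprocessed_images[batch_start: batch_end]
        # Only sess.run happens inside the loop; the graph is never rebuilt
        batch_features = sess.run(features, feed_dict={images_ph: batch})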
I'm using Audio Computer-Assisted Self-Interview (ACASI) data (http://www.novaresearch.com/QDS/) that includes some multiple choice, "select all that apply" values coded as binary (0100010) depending on what the participants chose:
Raw binary data for "What kind of insurance do you have? Please select all that apply."
What is the easiest way to read this data into SAS so that it understands that multiple values were selected per participant?
Note: I looked at the answer here, How to clean and re-code check-all-that-apply responses in R survey data?, but wonder if--since my data is already binary--I can just read it into SAS as is? I'm also not sure what the syntax is for SAS, since it obviously varies from R.
Thanks!
If you want to split the TextDisplay variable into 12 separate variables, you can use an array to create the set of variables and a loop to step through them. Here is an example:
data result;
    TextDisplay = "001001001000";
    array tab(*) $1 q1-q12;
    do i = 1 to dim(tab);
        tab(i) = substr(TextDisplay, i, 1);
    end;
    drop i;
run;
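To double-check the split, a quick look at the new variables (just a sketch):
proc print data=result;
    var TextDisplay q1-q12;
run;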
Is there a way with the Super CSV library to find out the number of rows in the file that will be processed?
In other words, before I start to process my rows with a loop:
while ((obj = csvBeanReader.read(obj.getClass(),
        csvModel.getNameMapping(), processors)) != null) {
    //Do some logic here...
}
Can I retrieve, with some library class, the number of rows contained in the CSV file?
No, in order to find out how many rows are in your CSV file, you'll have to read the whole file with Super CSV (this is really the only way, as CSV records can span multiple lines). You could always do an initial pass over the file using CsvListReader (it doesn't do any bean mapping, so it's probably a bit more efficient) just to get the row count...
As an aside (it doesn't help in this situation), you can get the current line/row number from the reader as you are reading using the getLineNumber() and getRowNumber() methods.
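For the initial counting pass, a rough sketch along those lines (the file path and preference are assumptions; use whatever your bean reader already uses):
import java.io.FileReader;
import java.io.IOException;
import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

public class RowCount {
    public static int countRows(String csvPath) throws IOException {
        CsvListReader listReader = new CsvListReader(new FileReader(csvPath),
                CsvPreference.STANDARD_PREFERENCE);
        try {
            // read() consumes one CSV row at a time (a row may span multiple lines)
            while (listReader.read() != null) {
                // nothing to do; we only want the count
            }
            return listReader.getRowNumber();
        } finally {
            listReader.close();
        }
    }
}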