How to use the 7z DLL to compress and append many small chunks of data to a file - open-source

I would like to use the 7z DLL to append small amounts of data to one compressed file. At the moment my best guess would be to uncompress the 7z file, append the data, and recompress it. Obviously, this is not a good solution performance-wise if the size of the 7z file becomes large (say 1 GB) and I want to save a new chunk every second. How could I do this in a better way?
I could use any compression format supported by the 7z DLL.

Have a look at the Python LZMA bindings (LZMA is the compression algorithm used by 7z); you should be able to do what you want without any ctypes work.
EDIT
To be confirmed, but a quick look at py7zlib.py shows only support for reading 7z files, not writing. However, in the src dir there is a pylzma_compressfile.c, so there may be something to work with.
EDIT 2
The pylzma.compressfile function seems to be there, so that should work.
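As a rough sketch of how that might look (an assumption-laden example: it presumes pylzma.compressfile accepts a file-like object and returns a reader whose read() yields LZMA-compressed bytes; check the pylzma docs for the exact API), each new chunk could be compressed and appended to a growing file:

import pylzma
from io import BytesIO

def append_compressed_chunk(chunk, archive_path="chunks.lzma"):
    # Assumed API: compressfile() wraps a file-like object and returns a
    # reader whose read() produces the compressed bytes for that chunk.
    compressed = pylzma.compressfile(BytesIO(chunk))
    with open(archive_path, "ab") as dst:
        while True:
            block = compressed.read(8192)
            if not block:
                break
            dst.write(block)

append_compressed_chunk(b"data captured this second")

Note that this produces a file of concatenated LZMA streams rather than a real .7z archive, so whatever reads it back must decompress the streams one after another.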

This is not my answer; it comes from the question "How can I use a DLL file from Python?".
I think ctypes is the way to go.
The following example of ctypes is from actual code I've written (in Python 2.5). This has been, by far, the easiest way I've found for doing what you ask.
import ctypes

# Load the DLL into memory.
hllDll = ctypes.WinDLL("c:\\PComm\\ehlapi32.dll")

# Set up the prototype and parameters for the desired function call (HLLAPI).
hllApiProto = ctypes.WINFUNCTYPE(ctypes.c_int, ctypes.c_void_p,
                                 ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p)
hllApiParams = (1, "p1", 0), (1, "p2", 0), (1, "p3", 0), (1, "p4", 0),

# Actually map the call ("HLLAPI(...)") to a Python name.
hllApi = hllApiProto(("HLLAPI", hllDll), hllApiParams)

# This is how you actually call the DLL function: set up the variables and
# call the Python name with them. (sessionVar is defined elsewhere in the
# original program.)
p1 = ctypes.c_int(1)
p2 = ctypes.c_char_p(sessionVar)
p3 = ctypes.c_int(1)
p4 = ctypes.c_int(0)
hllApi(ctypes.byref(p1), p2, ctypes.byref(p3), ctypes.byref(p4))
The ctypes stuff has all the C-type data types (int, char, short, void*, ...) and can pass by value or reference. It can also return specific data types although my example doesn't do that (the HLL API returns values by modifying a variable passed by reference).
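If a DLL function returns its result directly instead of writing it through a pointer argument, you can also declare the argument and return types on the function object itself. A minimal sketch (the DLL path and function name here are made up purely for illustration):

import ctypes

mylib = ctypes.WinDLL("c:\\path\\to\\mylib.dll")   # hypothetical DLL

# Tell ctypes the C signature so it converts arguments and the return value.
mylib.add_ints.argtypes = [ctypes.c_int, ctypes.c_int]
mylib.add_ints.restype = ctypes.c_int

result = mylib.add_ints(2, 3)   # result comes back as a plain Python int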

Related

How to edit a binary at specific address locations using Nim?

I'm trying to understand how to modify a Windows executable using Nim to write NOPs in specific locations of the file. In Python I do this as follows:
import binascii

with open('file.exe', 'r+b') as f:
    f.seek(0x0014bc0f)
    f.write(binascii.unhexlify('9090'))
    f.seek(0x0014bc18)
    f.write(binascii.unhexlify('9090'))
But in Nim I can't find the methods needed to seek to a specific address and write these opcodes properly. I'm pretty new to Nim, and the different file-operation possibilities are a little confusing.
The docs for io are quite good, but to start you off:
Opening a file for reading and writing (the equivalent of 'r+'):
let f = open("file.exe", fmReadWriteExisting)
The equivalent of with here, to make sure the file is closed on any error:
defer: f.close
Seeking to an offset (by default in bytes, relative to the start):
f.setFilePos(0x0014bc0f)
Writing two bytes (NOP NOP) at that position:
f.write("\x90\x90")
Check out the documentation for each proc for more options.

How can I process large files in Code Repositories?

I have a data feed that gives a large .txt file (50-75 GB) every day. The file contains several different schemas within it, where each row corresponds to one schema. I would like to split this into partitioned datasets for each schema; how can I do this efficiently?
The largest problem you need to solve is the iteration speed to recover your schemas, which can be challenging for a file at this scale.
Your best tactic here will be to get an example 'notional' file that contains one line for each schema you want to recover, and to add this file to your repository. Once the file is in your repo (alongside your transformation logic), you can read it into a DataFrame just as you would the raw files in your dataset, which gives you quick test iteration.
First, make sure you specify txt files as part of your package contents so that your tests will discover them (this is covered in the documentation under "Read a file from a Python repository"):
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, edit setup.py in your Python repository:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.txt']
    }
)
I am using a txt file with the following contents:
my_column, my_other_column
some_string,some_other_string
some_thing,some_other_thing,some_final_thing
This text file is at the following path in my repository: transforms-python/src/myproject/datasets/raw.txt
Once you have configured the text file to be shipped with your logic, and after you have included the file itself in your repository, you can add the following code. It does a couple of important things:
It keeps raw file parsing logic completely separate from the stage of reading the file into a Spark DataFrame. This is so that the way this DataFrame is constructed can be left to the test infrastructure, or to the run time, depending on where you are running.
Keeping the logic separate ensures that the actual row-by-row parsing you want to do is its own testable function, instead of living purely inside your my_compute_function.
This code uses the Spark-native spark_session.read.text method, which will be orders of magnitude faster than row-by-row parsing of a raw txt file. This will ensure the parallelized DataFrame is what you operate on, not a single file, line by line, inside your executors (or worse, your driver).
from transforms.api import transform, Input, Output
from pkg_resources import resource_filename


def raw_parsing_logic(raw_df):
    # Placeholder: replace with your row-by-row schema-recovery logic.
    return raw_df


@transform(
    my_output=Output("/txt_tests/parsed_files"),
    my_input=Input("/txt_tests/dataset_of_files"),
)
def my_compute_function(my_input, my_output, ctx):
    all_files_df = None
    for file_status in my_input.filesystem().ls('**/**'):
        raw_df = ctx.spark_session.read.text(
            my_input.filesystem().hadoop_path + "/" + file_status.path
        )
        parsed_df = raw_parsing_logic(raw_df)
        all_files_df = parsed_df if all_files_df is None else all_files_df.unionByName(parsed_df)
    my_output.write_dataframe(all_files_df)


def test_my_compute_function(spark_session):
    file_path = resource_filename(__name__, "raw.txt")
    raw_df = raw_parsing_logic(
        spark_session.read.text(file_path)
    )
    assert raw_df.count() > 0
    raw_columns_set = set(raw_df.columns)
    expected_columns_set = {"value"}
    assert len(raw_columns_set.intersection(expected_columns_set)) == 1
Once you have this code up and running, your test_my_compute_function method will be very fast to iterate on, so you can perfect your schema-recovery logic. This makes it substantially easier to get your dataset building at the very end, without any of the overhead of a full build.
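For the schema-recovery step itself, here is a minimal sketch of what raw_parsing_logic might grow into. It assumes, as in the raw.txt example above, that the schemas can be told apart by the number of comma-separated fields on a line; your real discriminator may well be different:

from pyspark.sql import functions as F

def raw_parsing_logic(raw_df):
    # Tag each line with its field count so that each schema can later be
    # filtered into its own DataFrame, e.g. df.filter(F.col("num_fields") == 2).
    return raw_df.withColumn(
        "num_fields", F.size(F.split(F.col("value"), ","))
    )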

PsychoPy: how to avoid storing variables in the CSV file?

When I run my PsychoPy experiment, PsychoPy saves a CSV file that contains my trials and the values of my variables.
Among these are some variables I would like NOT to be included. There are some variables I deliberately chose to include in the CSV, but many others end up in it automatically.
Is there a way to manually force (from a code block) the exclusion of some variables from the CSV?
Is there a way to decide the order of the saved columns/variables in the CSV?
It is not really important; I know I could create my own output file without using PsychoPy's, or easily clean it afterwards, but I was curious.
PsychoPy spits out all the variables it thinks you could need. If you want to drop some of them, that is a task for the analysis stage, and is easily done in any processing pipeline. Unless you are analysing data in a spreadsheet (which you really shouldn't), the number of columns in the output file shouldn't really be an issue. The philosophy is that you shouldn't back yourself into a corner by discarding data at the recording stage - what about the reviewer who asks about the influence of a variable that you didn't think was important?
If you are using the Builder interface, the saving of onset & offset times for each component is optional, and is controlled in the "data" tab of each component dialog.
The order of variables is also not under direct control of the user, but again, can be easily manipulated at the analysis stage.
As you note, you can of course write code to save custom output files of your own design.
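For instance, dropping and reordering columns at the analysis stage is only a few lines of pandas; the file and column names below are hypothetical, so adjust them to whatever your experiment actually saves:

import pandas as pd

df = pd.read_csv("my_experiment.csv")                             # hypothetical file
df = df.drop(columns=["frameRate", "expName"], errors="ignore")   # drop unwanted columns
df = df[["participant", "trials.thisN", "key_resp.rt"]]           # keep and reorder the rest
df.to_csv("my_experiment_clean.csv", index=False)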
There is a special block called session_variable_order: [var1, var2, var3] in the experiment_config.yaml file, which you should probably be using; also consider these methods:
from psychopy import data
data.ExperimentHandler.saveAsWideText(fileName = 'exp_handler.csv', delim='\t', sortColumns = False, encoding = 'utf-8')
data.TrialHandler.saveAsText(fileName = 'trial_handler.txt', delim=',', encoding = 'utf-8', dataOut = ('n', 'all_mean', 'all_raw'), summarised = False)
Notice the sortColumns and dataOut parameters.

How to load an image from a CSV file in TensorFlow

I have an image saved in a 0.csv file.
The format is shown in the picture below.
How can I read it into TensorFlow?
Thanks!
You should use the Dataset input pipeline introduced in TensorFlow 1.4:
https://www.tensorflow.org/programmers_guide/datasets#consuming_text_data
Here's the example from the developers guide (though you'll want to read through that guide, it's quite well written):
import tensorflow as tf

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.Dataset.from_tensor_slices(filenames)

# Use `Dataset.flat_map()` to transform each file as a separate nested dataset,
# and then concatenate their contents sequentially into a single "flat" dataset.
# * Skip the first line (header row).
# * Filter out lines beginning with "#" (comments).
dataset = dataset.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), "#"))))
The Dataset preprocessing pipeline has a few nice advantages. Most of the functionality you'll need such as reading text records, shuffling, batching, etc. are reduced to one-liners. More importantly though, it forces you into writing your preprocessing pipeline in a good, modular, testable way. It takes a little bit to get used to the API, but it's time well spent.
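To go from text lines to image tensors you could add a map step that parses each CSV row into pixels. A sketch in the same TensorFlow 1.x style, assuming each row of 0.csv holds one flattened grayscale image and guessing 28x28 dimensions (adjust both to your actual data):

import tensorflow as tf

IMG_HEIGHT, IMG_WIDTH = 28, 28   # assumed image dimensions

def parse_line(line):
    # Split the comma-separated pixel values and reshape them into an image.
    pixels = tf.string_to_number(tf.string_split([line], ",").values,
                                 out_type=tf.float32)
    return tf.reshape(pixels, [IMG_HEIGHT, IMG_WIDTH])

dataset = tf.data.TextLineDataset(["0.csv"]).map(parse_line).batch(32)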

Converting data stored in Fortran 90 binaries to human readable format

In your experience, in Fortran 90, what is the best way to store large arrays in output files? Previously, I had been trying to write large arrays to ASCII text files. For example, I would do something like this (thanks to the recommendation at the bottom of the page In Fortran 90, what is a good way to write an array to a text file, row-wise?):
PROGRAM testing1
  IMPLICIT NONE
  INTEGER :: i, j, k
  INTEGER, DIMENSION(4,10) :: a

  k = 1
  DO i = 1, 4
    DO j = 1, 10
      a(i,j) = k
      k = k + 1
    END DO
  END DO

  OPEN(UNIT=12, FILE="output.txt", ACTION="WRITE", STATUS="REPLACE")
  DO i = 1, 4
    DO j = 1, 10
      WRITE(12, "(i2,x)", ADVANCE="NO") a(i,j)
    END DO
    WRITE(12, *)
  END DO
  CLOSE(UNIT=12)
END PROGRAM testing1
This works, but as pointed out by the topmost reply at In Fortran 90, what is a good way to write an array to a text file, row-wise?, writing large arrays to text files is very slow and creates files that are somewhat larger in size than is necessary. The poster there recommended instead writing to an unformatted Fortran binary, using something like:
PROGRAM testing2
  IMPLICIT NONE
  INTEGER :: i, j, k
  INTEGER, DIMENSION(4,10) :: a

  k = 1
  DO i = 1, 4
    DO j = 1, 10
      a(i,j) = k
      k = k + 1
    END DO
  END DO

  OPEN(UNIT=13, FILE="output.dat", ACTION="WRITE", STATUS="REPLACE", &
       FORM="UNFORMATTED")
  WRITE(13) a
  CLOSE(UNIT=13)
END PROGRAM testing2
This seems to work, and is indeed much faster and results in smaller file sizes, as promised by the reply here. However, what do I do if I would like to be able to later work with the data stored in Fortran binary (e.g., output.dat above) and analyze its contents? For example, what if I want to open the array stored in the binary in a program such as Microsoft Excel?
When I mentioned MATLAB in my previous post, the reply suggested that I open the binary as a hexadecimal file and figure out and extract the records from there. But I am nervous that I am getting into deep water, since I have no prior experience in hexadecimal sleuthing. When I asked on the MATLAB board (here: http://www.mathworks.com/matlabcentral/answers/12639-advice-on-reading-an-unformatted-fortran-binary-file-into-matlab) about reading Fortran files into MATLAB, the person there suggested that using Fortran stream access might be easy. But is Fortran stream (i.e., using ACCESS="STREAM" in the OPEN statement) likely to be similar in time and file size to the ASCII text file that I created in my first example above?
Or do you know of any other software that can automatically read Fortran binaries into some sort of human-readable form? (Or do you know of any good tutorials on either hexadecimal sleuthing or Fortran stream access?)
Thank you very much for your time.
Stream is a choice independent of the choice of formatted / unformatted -- one is the "access" specifier, the other the "form" specifier. The default for Fortran I/O is record-oriented access. The typical approach of a Fortran compiler (at least for unformatted records) is to write a 4-byte record length before and after each record. (The "after" is there to make reading backwards easier.) Using a hex editor you could verify these extra data items and skip them in MATLAB. But they are not part of the language standard, are not portable, and are certainly not obvious in other languages.

If you select stream and unformatted, you will just get the raw sequence of bytes corresponding to your data items -- no extra data items to worry about in the other language! In my experience this output tends to be fairly easy to read in other languages (not tried in MATLAB). If this is a small and simple project, and portability of the files to other computers is not an issue, I would probably use this approach (stream & unformatted) rather than a file format specification such as HDF5 or FITS. I'd write the array as write (13) a, as in your final example. Depending on the other language, you might have to transpose the dimensions. If this is a major and long-lived project where portability is a concern, then a portable and standard file interface is worth considering.
I don't know whether any of these formats can be read from Excel. More research.... You might have to write a program to read the binary file of whatever format and output a file in a format that Excel understands.
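For example, if output.dat were written with ACCESS="STREAM" and FORM="UNFORMATTED" (so there are no record markers), a short Python/numpy script could do that conversion; this is only a sketch, and the dtype and shape below are assumptions you would adjust to your data:

import numpy as np

# Read the raw bytes written by WRITE(13) a, assuming 4-byte default integers.
raw = np.fromfile("output.dat", dtype=np.int32)

# Fortran stores arrays column-major, so restore the 4x10 shape with order="F".
a = raw.reshape((4, 10), order="F")

# Dump to CSV, which Excel (and most other tools) can open directly.
np.savetxt("output.csv", a, delimiter=",", fmt="%d")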
(converting comment into an answer for posterity)
Are you specifically trying to get information into Matlab? If you are, I highly recommend HDF5. This is the portable binary format you have been looking for.
For converting a Fortran binary to HDF5, you're going to have to read in the original Fortran binary and then write out the same data to an HDF5 file. If you have the Fortran source, this should be pretty easy. Allocate your arrays, make sure you read the arrays in the same order as you wrote them and then write out your new shiny HDF5 file.
The HDF5 group has tutorials with examples in C and Fortran. There is likely an example very close to what you're trying to do. When you build HDF5, make sure to manually enable Fortran support. It is disabled by default.
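If MATLAB is the target, another lightweight route (again only a sketch, done outside Fortran and reusing the same numpy read and assumptions as the earlier example) is to convert the binary to HDF5 with h5py:

import numpy as np
import h5py

# Read the stream/unformatted Fortran output and restore its column-major shape.
a = np.fromfile("output.dat", dtype=np.int32).reshape((4, 10), order="F")

# Write it to HDF5; MATLAB can then read it with h5read('output.h5', '/a').
with h5py.File("output.h5", "w") as f:
    f.create_dataset("a", data=a)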
% In MATLAB
fid = fopen('YOUR_FILE.direct', 'r');   % Fortran DIRECT access file
frewind(fid);
tbb = ones(367, 45203);
for i = 1:367
    temp = fread(fid, [45203], 'single');
    tbb(i, :) = temp;
end
fclose(fid);