Reading binary files in Cython

I am attempting to read a binary file in Cython. Previously this was working in Python, but I am looking to speed up the process. The code below was written as a familiarisation and logic check before writing the complete module. Once this section is complete, the code will be expanded to read in multiple 400 MB files and process them.
A function was created that opens the file, reads in a number of data points, and returns them in an array.
from libc.stdlib cimport malloc, free
from libc.stdio cimport fopen, fclose, FILE, fscanf, fread

def readin_binary(filename, int number_of_points):
    """
    Test reading in a file and returning data
    """
    header_bytes = <unsigned char*>malloc(number_of_points)
    filename_byte_string = filename.encode("UTF-8")
    cdef FILE *in_binary_file
    in_binary_file = fopen(filename_byte_string, 'rb')
    if in_binary_file is NULL:
        print("file not found")
    else:
        print("Read file {}".format(filename))
        fread(&header_bytes, 1, number_of_points, in_binary_file)
        fclose(in_binary_file)
    return header_bytes
print(hDVS.readin_binary(filename, 10))
The code compiles.
When the code is run, it crashes with a "Python has stopped working" error.
I've been playing with this for a few days now. I think there is a simple error but I cannot see it. Any ideas?
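For comparison, here is a hedged sketch of the usual pattern, not the original code: the crash is consistent with fread(&header_bytes, ...) writing the file's bytes over the pointer variable itself (and the stack around it) rather than into the malloc'd buffer.

from libc.stdlib cimport malloc, free
from libc.stdio cimport fopen, fclose, FILE, fread

def readin_binary_fixed(filename, int number_of_points):
    filename_bytes = filename.encode("UTF-8")  # keep the bytes object alive while fopen uses it
    cdef unsigned char *buf = <unsigned char*>malloc(number_of_points)
    cdef FILE *f = fopen(filename_bytes, b"rb")
    cdef size_t n
    if f is NULL:
        free(buf)
        raise IOError("file not found: {}".format(filename))
    n = fread(buf, 1, number_of_points, f)  # pass the buffer itself, not &buf
    fclose(f)
    data = (<char *>buf)[:n]  # explicit-length slice copies all n bytes, including zero bytes
    free(buf)
    return data

Copying the bytes into a Python object before free() also avoids returning a pointer to memory that the caller cannot manage.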

Related

Can you import a file by using a variable

I have a Python script that uses JSON to store data. In the data, there are also file names, so I was wondering if I could import a file using a variable. Example:
file = "apps/messanger"
import file as msg
If this isn't possible, that would confirm my hypothesis and I would just import all of my files separately. But if it is possible, I would like to know how, just because it would make my life easier.
Thanks for any help!
-Jester
I'm not too good with Python, but when you handle files you normally use
file = open("path to file", 'r or w') # r for read, w for write
file.close() # when you are done with the file you must close it
If you are going to name it msg, then change the variable from file to msg, like
msg = open("apps/messenger", 'r')
msg.close() # when finished with the file
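For the actual import-by-a-variable part, the standard library's importlib handles this; a minimal sketch, assuming apps/messanger is importable as a module (i.e. part of a package on the module path):

import importlib

module_name = "apps.messanger"  # dotted module path, not a file path
msg = importlib.import_module(module_name)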

Control the name of a CSV output file [duplicate]

I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file-based exports).
The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.
The issues are:
By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that whenever a new file is uploaded, it appears in the same folder as an additional file, and the only way to tell which is the newest version is by last-modified date.
Every version of the data is stored in my external system, which takes up unnecessary storage unless I frequently go in and delete old files.
All of this adds unnecessary complexity to my downstream system; I just want to be able to pull the latest version of the data in a single step.
This is possible by renaming the single parquet file in the dataset so that it always has the same file name, that way the export task will overwrite the previous file in the external system.
This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.
Notes
The build will fail if the input contains more than one parquet file; as pointed out in the question, calling .coalesce(1) (or .repartition(1)) is necessary in the upstream transform.
If you require transaction history in your external store, or your dataset is much larger than 512 MB, this method is not appropriate, as only the latest version is kept, and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
The transform does not require any Spark executors, so it is possible to use @configure() to give it a driver-only profile; a sketch follows these notes. Giving the driver additional memory should fix out-of-memory errors when working with larger datasets.
shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.
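For example, a driver-only transform might be configured like this (the profile name is an assumption; the Spark profiles available vary by enrollment):

from transforms.api import configure, transform, Input, Output

@configure(profile=["KUBERNETES_NO_EXECUTORS"])  # assumed profile name for a driver-only build
@transform(
    output=Output("/path/to/output"),
    source_df=Input("/path/to/input"),
)
def compute(output, source_df):
    ...  # driver-only work, e.g. the file copy shown in the full snippet below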
Full code snippet
example_transform.py
from transforms.api import transform, Input, Output
from . import utils


@transform(
    output=Output("/path/to/output"),
    source_df=Input("/path/to/input"),
)
def compute(output, source_df):
    return utils.write_single_named_parquet_file(output, source_df, "readable_file_name")
utils.py
from transforms.api import Input, Output
import shutil
import logging
log = logging.getLogger(__name__)
def write_single_named_parquet_file(output: Output, input: Input, file_name: str):
    """Write a single ".snappy.parquet" file with a given file name to a transforms output, containing the data of
    the single ".snappy.parquet" file in the transforms input. This is useful when you need to export the data using
    Magritte and want a human-readable name in the output; when not using separate transaction folders this should
    cause the previous output to be automatically overwritten.

    The input to this function must contain a single ".snappy.parquet" file; this can be achieved by calling
    `.coalesce(1)` or `.repartition(1)` on your dataframe at the end of the upstream transform that produces the input.

    This function should not be used for large dataframes (e.g. those greater than 512 MB in size); instead,
    transaction folders should be enabled in the export. This function can work for larger sizes, but you may find
    you need additional driver memory to perform both the coalesce/repartition in the upstream transform, and here.

    This produces a dataset without a schema, so features like expectations can't be used.

    Parameters:
        output (Output): The transforms output to write the single custom-named ".snappy.parquet" file to; this is
            the dataset you want to export
        input (Input): The transforms input containing the data to be written to output; this must contain only one
            ".snappy.parquet" file (it can contain other files, for example logs)
        file_name: The name of the file to be written; ".snappy.parquet" will be automatically appended if not
            already there, and ".snappy" and ".parquet" will be corrected to ".snappy.parquet"

    Raises:
        RuntimeError: Input dataset must be coalesced or repartitioned into a single file.
        RuntimeError: Input dataset file system cannot be empty.

    Returns:
        void: writes the response to output, no return value
    """
    output.set_mode("replace")  # Make sure it is snapshotting
    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non-parquet files
    if len(input_files) > 1:
        raise RuntimeError("Input dataset must be coalesced or repartitioned into a single file.")
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")
    input_file_path = input_files[0]
    log.info("Initial output file name: " + file_name)

    # check for ".snappy.parquet" and append if needed
    if file_name.endswith(".snappy.parquet"):
        pass  # if it is already correct, do nothing
    elif file_name.endswith(".parquet"):
        # if it ends with ".parquet" (and not ".snappy.parquet"), remove ".parquet" and append ".snappy.parquet"
        file_name = file_name.removesuffix(".parquet") + ".snappy.parquet"
    elif file_name.endswith(".snappy"):
        # if it ends with just ".snappy" then append ".parquet"
        file_name = file_name + ".parquet"
    else:
        # if it doesn't end with any of the above, add ".snappy.parquet"
        file_name = file_name + ".snappy.parquet"
    log.info("Final output file name: " + file_name)

    with input.filesystem().open(input_file_path, "rb") as in_f:  # open the input file
        with output.filesystem().open(file_name, "wb") as out_f:  # open the output file
            shutil.copyfileobj(in_f, out_f)  # write the file into a new file
You can also use the rewritePaths functionality of the export plugin to rename the file under spark/*.snappy.parquet to "export.parquet" while exporting. This of course only works if there is only a single file, so .coalesce(1) in the transform is a must:
excludePaths:
  - ^_.*
  - ^spark/_.*
rewritePaths:
  '^spark/(.*[\/])(.*)': $1/export.parquet
uploadConfirmation: exportedFiles
incrementalType: snapshot
retriesPerFile: 0
bucketPolicy: BucketOwnerFullControl
directoryPath: features
setBucketPolicy: true
I ran into the same requirement; the only difference was that the dataset had to be split into multiple parts due to its size. Posting the code here, along with how I updated it to handle this use case.
def rename_multiple_parquet_outputs(output: Output, input: Input, file_name_prefix: str):
    """
    Slight improvement to allow multiple output files to be renamed
    """
    output.set_mode("replace")  # Make sure it is snapshotting
    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non-parquet files
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")

    print(f"input files {input_files}")
    print("prefix for target name: " + file_name_prefix)
    for i, f in enumerate(input_files):
        with input.filesystem().open(f, "rb") as in_f:  # open the input file
            with output.filesystem().open(f"{file_name_prefix}_part_{i}.snappy.parquet", "wb") as out_f:  # open the output file
                shutil.copyfileobj(in_f, out_f)  # write the file into a new file
Also, to use this in a code workbook, the input needs to be persisted and the output parameter can be retrieved as shown below.
def rename_outputs(persisted_input):
    output = Transforms.get_output()
    rename_multiple_parquet_outputs(output, persisted_input, "prefix_for_renamed_files")

dask.delayed KeyError with distributed scheduler

I have a function interpolate_to_particles written in C and wrapped with ctypes. I want to use dask.delayed to make a series of calls to this function.
The code runs successfully without dask
# Interpolate w/o dask
result = interpolate_to_particles(arg1, arg2, arg3)
and with the distributed scheduler in single-threaded mode
# Interpolate w/ dask
from dask.distributed import Client
client = Client()
result = dask.delayed(interpolate_to_particles)(arg1, arg2, arg3)
result_c = result.compute(scheduler='single-threaded')
but if I instead call
result_c = result.compute()
I get the following KeyError:
Traceback (most recent call last):
  File "/path/to/lib/python3.6/site-packages/distributed/worker.py", line 3287, in dumps_function
    result = cache_dumps[func]
  File "/path/to/lib/python3.6/site-packages/distributed/utils.py", line 1518, in __getitem__
    value = super().__getitem__(key)
  File "/path/to/lib/python3.6/collections/__init__.py", line 991, in __getitem__
    raise KeyError(key)
KeyError: <function interpolate_to_particles at 0x1228ce510>
The worker logs accessed from the dask dashboard do not provide any information. Actually, I do not see any indication that the workers have done anything besides start up.
Any ideas on what could be occurring, or suggested tools that I can use to further debug? Thanks!
Given your comments it sounds like your function does not serialize well. To test this, you might try pickling the function in one process, and try unpickling it in another.
>>> import pickle
>>> print(pickle.dumps(interpolate_to_particles))
b'some bytes printed out here'
And then in another process
>>> import pickle
>>> interpolate_to_particles = pickle.loads(b'the same bytes you had before')
If this doesn't work then you'll know that that's your problem. I would encourage you to look up "how to make sure that ctypes functions are serializable" or something similar, or ask another question with that smaller scope here on Stack Overflow.
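If pickling does turn out to be the problem, one common workaround is to build the ctypes handle inside the function, so that only a plain Python function has to be serialized; a minimal sketch, where the library name and signature are assumptions:

import ctypes

def interpolate_to_particles(arg1, arg2, arg3):
    # Loading the library here keeps the unpicklable ctypes objects out of
    # what dask has to serialize; each worker rebuilds the handle locally.
    lib = ctypes.CDLL("libinterpolate.so")  # hypothetical library name
    lib.interpolate_to_particles.restype = ctypes.c_int  # assumed return type
    return lib.interpolate_to_particles(arg1, arg2, arg3)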

jsonstat.from_file() return error "can't multiply sequence by non-int of type 'list'"

I'm trying to parse a json-stat file using jsonstat.py (v 0.1.7) but am getting an error.
The code below is copied from the examples on github (https://github.com/26fe/jsonstat.py/tree/master/examples-notebooks):
from __future__ import print_function
import os
import jsonstat
os.chdir(r'D:\Desktop\JSON_Stat')
url = 'http://www.cso.ie/StatbankServices/StatbankServices.svc/jsonservice/responseinstance/NQQ25'
file_name = "test02.json"
file_path = os.path.abspath(os.path.join("..","JSON_Stat", "CSO", file_name))
I added this line to deal with non-ASCII characters in the file:
# -*- coding: utf-8 -*-
This successfully downloads the JSON file to my desktop:
if os.path.exists(file_path):
    print("using already downloaded file {}".format(file_path))
else:
    print("download file and storing on disk")
    jsonstat.download(url, file_path)
From here, I can load and pprint the data using the json module:
import json
import pprint as pp
with open(r"CSO\test02.json") as data_file:
data = json.load(data_file)
pp.pprint(data)
... but when I try to use the jsonstat module (as specified in the examples) I get the error mentioned in the subject:
collection = jsonstat.from_file(r"D:\Desktop\JSON_Stat\CSO\test02.json")
collection
# abbreviated error message
--> 384 self.__pos2cat = self.__size * [None]
TypeError: can't multiply sequence by non-int of type 'list'
I understand what the error message itself means but, having studied the dimensions.py module where it occurs, I am stuck trying to understand why. I was able to run the sample OECD code without issue, so perhaps the data itself is not formatted in the expected way, though the source site (http://www.cso.ie/webserviceclient/) states that the JSON-stat format is being used.
So, finally, my questions are: has anyone run into this error and resolved it? Has anyone successfully used the jsonstat module to parse this specific data? Alternatively, any general advice towards troubleshooting this issue is welcome.
Thanks
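Since the OECD sample parses but this file does not, a quick structural check of the JSON can narrow things down; a minimal sketch, assuming the file is at the path used above:

import json

with open(r"D:\Desktop\JSON_Stat\CSO\test02.json") as data_file:
    data = json.load(data_file)

# JSON-stat 1.x keeps "id"/"size" under each dataset's "dimension" entry,
# while 2.0 declares "version" and moves them to the top level; a parser
# expecting one layout but fed the other can end up with a list where it
# expects an int, which would match the TypeError above.
print(data.get("version"))
for key, value in data.items():
    print(key, type(value))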

Sympy's autowrap with cython and Matrix generates fatal error: 'numpy/arrayobject.h' file not found

I'm trying to execute the simple example from SymPy's autowrap module that includes a matrix/vector product with the Cython language (since I do not have gfortran installed):
import sympy.utilities.autowrap as aw
from sympy.utilities.autowrap import autowrap
from sympy import symbols, IndexedBase, Idx, Eq
A, x, y = map(IndexedBase, ['A', 'x', 'y'])
m, n = symbols('m n', integer=True)
i = Idx('i', m)
j = Idx('j', n)
instruction = Eq(y[i], A[i, j]*x[j])
matvec = autowrap(instruction, language='C', backend='cython')
I'm on OSX 10.9.4, with the anaconda distribution for python 2.7, sympy 0.7.6.1 and cython 0.23.2.
I get the following (known) error: fatal error: 'numpy/arrayobject.h' file not found
It seems to be a systematic error, and one needs to include the appropriate NumPy header directory in the setup file used to compile the Cython code, as suggested here.
How can I get rid of this issue in an autowrap context?
It seems this is a bug fixed here, but it does not work for me... Is this bug fix included in SymPy release 0.7.6.1?
Any idea?
This was a bug and is now fixed. See this pull request:
https://github.com/sympy/sympy/pull/8848
If you use the development version of SymPy, it should work. Otherwise, you could have autowrap write the files out to a temporary directory, add the correct include statement to the generated files, and manually compile the code.
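For the manual route, autowrap's tempdir argument keeps the generated sources in a directory you choose (the directory name below is just an example):

# Write the generated .pyx/.c/setup.py files to a visible directory instead of
# a throwaway temp dir, so the missing NumPy include can be added by hand.
matvec = autowrap(instruction, language='C', backend='cython',
                  tempdir='./autowrap_build', verbose=True)

The generated setup.py in that directory can then be given include_dirs=[numpy.get_include()] before re-running the build.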