I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).
The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.
The issues are:
By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that, whenever a new file is uploaded, it appears in the same folder as an additional file, the only way to tell which is the newest version is by last modified date.
Every version of the data is stored in my external system, this takes up unnecessary storage unless I frequently go in and delete old files.
All of this is unnecessary complexity being added to my downstream system, I just want to be able to pull the latest version of data in a single step.
This is possible by renaming the single parquet file in the dataset so that it always has the same file name, that way the export task will overwrite the previous file in the external system.
This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.
Notes
The build will fail if the input contains more than one parquet file, as pointed out in the question, calling .coalesce(1) (or .repartition(1)) is necessary in the upstream transform
If you require transaction history in your external store, or your dataset is much larger than 512 MB this method is not appropriate, as only the latest version is kept, and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
The transform does not require any spark executors, so it is possible to use #configure() to give it a driver only profile. Giving the driver additional memory should fix out of memory errors when working with larger datasets.
shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.
Full code snippet
example_transform.py
from transforms.api import transform, Input, Output
import .utils
#transform(
output=Output("/path/to/output"),
source_df=Input("/path/to/input"),
)
def compute(output, source_df):
return utils.write_single_named_parquet_file(output, source_df, "readable_file_name")
utils.py
from transforms.api import Input, Output
import shutil
import logging
log = logging.getLogger(__name__)
def write_single_named_parquet_file(output: Output, input: Input, file_name: str):
"""Write a single ".snappy.parquet" file with a given file name to a transforms output, containing the data of the
single ".snappy.parquet" file in the transforms input. This is useful when you need to export the data using
magritte, wanting a human readable name in the output, when not using separate transaction folders this should cause
the previous output to be automatically overwritten.
The input to this function must contain a single ".snappy.parquet" file, this can be achieved by calling
`.coalesce(1)` or `.repartition(1)` on your dataframe at the end of the upstream transform that produces the input.
This function should not be used for large dataframes (e.g. those greater than 512 mb in size), instead
transaction folders should be enabled in the export. This function can work for larger sizes, but you may find you
need additional driver memory to perform both the coalesce/repartition in the upstream transform, and here.
This produces a dataset without a schema, so features like expectations can't be used.
Parameters:
output (Output): The transforms output to write the single custom named ".snappy.parquet" file to, this is
the dataset you want to export
input (Input): The transforms input containing the data to be written to output, this must contain only one
".snappy.parquet" file (it can contain other files, for example logs)
file_name: The name of the file to be written, if the ".snappy.parquet" will be automatically appended if not
already there, and ".snappy" and ".parquet" will be corrected to ".snappy.parquet"
Raises:
RuntimeError: Input dataset must be coalesced or repartitioned into a single file.
RuntimeError: Input dataset file system cannot be empty.
Returns:
void: writes the response to output, no return value
"""
output.set_mode("replace") # Make sure it is snapshotting
input_files_df = input.filesystem().files() # Get all files
input_files = [row[0] for row in input_files_df.collect()] # noqa - first column in files_df is path
input_files = [f for f in input_files if f.endswith(".snappy.parquet")] # filter non parquet files
if len(input_files) > 1:
raise RuntimeError("Input dataset must be coalesced or repartitioned into a single file.")
if len(input_files) == 0:
raise RuntimeError("Input dataset file system cannot be empty.")
input_file_path = input_files[0]
log.info("Inital output file name: " + file_name)
# check for snappy.parquet and append if needed
if file_name.endswith(".snappy.parquet"):
pass # if it is already correct, do nothing
elif file_name.endswith(".parquet"):
# if it ends with ".parquet" (and not ".snappy.parquet"), remove parquet and append ".snappy.parquet"
file_name = file_name.removesuffix(".parquet") + ".snappy.parquet"
elif file_name.endswith(".snappy"):
# if it ends with just ".snappy" then append ".parquet"
file_name = file_name + ".parquet"
else:
# if doesn't end with any of the above, add ".snappy.parquet"
file_name = file_name + ".snappy.parquet"
log.info("Final output file name: " + file_name)
with input.filesystem().open(input_file_path, "rb") as in_f: # open the input file
with output.filesystem().open(file_name, "wb") as out_f: # open the output file
shutil.copyfileobj(in_f, out_f) # write the file into a new file
You can also use the rewritePaths functionality of the export plugin, to rename the file under spark/*.snappy.parquet file to "export.parquet" while exporting. This of course only works if there is only a single file, so .coalesce(1) in the transform is a must:
excludePaths:
- ^_.*
- ^spark/_.*
rewritePaths:
'^spark/(.*[\/])(.*)': $1/export.parquet
uploadConfirmation: exportedFiles
incrementalType: snapshot
retriesPerFile: 0
bucketPolicy: BucketOwnerFullControl
directoryPath: features
setBucketPolicy: true
I ran into the same requirement the only difference was that the dataset required to be split into multiple parts due to the size. Posting here the code and how I have updated it to handle this use case.
def rename_multiple_parquet_outputs(output: Output, input: list, file_name_prefix: str):
"""
Slight improvement to allow multiple output files to be renamed
"""
output.set_mode("replace") # Make sure it is snapshotting
input_files_df = input.filesystem().files() # Get all files
input_files = [row[0] for row in input_files_df.collect()] # noqa - first column in files_df is path
input_files = [f for f in input_files if f.endswith(".snappy.parquet")] # filter non parquet files
if len(input_files) == 0:
raise RuntimeError("Input dataset file system cannot be empty.")
input_file_path = input_files[0]
print(f'input files {input_files}')
print("prefix for target name: " + file_name_prefix)
for i,f in enumerate(input_files):
with input.filesystem().open(f, "rb") as in_f: # open the input file
with output.filesystem().open(f'{file_name_prefix}_part_{i}.snappy.parquet', "wb") as out_f: # open the output file
shutil.copyfileobj(in_f, out_f) # write the file into a new file
Also to use this into a code workbook the input needs to be persisted and the output parameter can be retrieved as shown below.
def rename_outputs(persisted_input):
output = Transforms.get_output()
rename_parquet_outputs(output, persisted_input, "prefix_for_renamed_files")
I am extracting prosody features from an audio file while using Opensmile using Windows version of Opensmile. It runs successful and an output csv is generated. But when I open csv, it shows some rows that are not readable. I used this command to extract prosody feature:
SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav -O prosody_sample1.csv
And the output of csv looks like this:
[
Even I tried to use the sample wave file given in Example audio folder given in opensmile directory and the output is same (not readable). Can someone help me in identifying where the problem is actually? and how can I fix it?
You need to enable the csvSink component in the configuration file to make it work. The file config\prosody\prosodyShs.conf that you are using does not have this component defined and always writes binary output.
You can verify that it is the standart binary output in this way: omit the -O parameter from your command so it becomesSMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav and execute it. You will get a output.htk file which is exactly the same as the prosody_sample1.csv.
How output csv? You can take a look at the example configuration in opensmile-3.0-win-x64\config\demo\demo1_energy.conf where a csvSink component is defined.
You can find more information in the official documentation:
Get started page of the openSMILE documentation
The section on configuration files
Documentation for cCsvSink
This is how I solved the issue. First I added the csvSink component to the list of the component instances. instance[csvSink].type = cCsvSink
Next I added the configuration parameters for this instance.
[csvSink:cCsvSink]
reader.dmLevel = energy
filename = \cm[outputfile(O){output.csv}:file name of the output CSV
file]
delimChar = ;
append = 0
timestamp = 1
number = 1
printHeader = 1
\{../shared/standard_data_output_lldonly.conf.inc}`
Now if you run this file it will throw you errors because reader.dmLevel = energy is dependent on waveframes. So the final changes would be:
[energy:cEnergy]
reader.dmLevel = waveframes
writer.dmLevel = energy
[int:cIntensity]
reader.dmLevel = waveframes
[framer:cFramer]
reader.dmLevel=wave
writer.dmLevel=waveframes
Further reference on how to configure opensmile configuration files can be found here
I am running a knime chunk loop to write always the same procedure in different csv files:
The Part with the python script until the csv write is working, when I do it without loop, but somehow he is not writing in the customized folder path, if I have the loop inside.
The target is to write a new csv-file for every loop (the output is a list).
The nodes are:
Chunk Loop: Rows per chunk: 51
Create file name:
Options Selected directioy: C:/....
Flow Variables: FileName: currentIteration
CSV Writer: Flow Variables: filename: CurrentIteration
How can I change the folder path of the file? He is always saving it in the default folder
Here is an example workflow (apologies for the weird blurry Windows 10 screenshots):
Create File Name config:
CSV Writer config:
You may need to run each node individually in order to create the flow variable before you can select it in the following node.
I am trying to use a formatted string to identify the file location when using 'print -dpdf file_name' to write a plot (or figure) to a file.
I've tried:
k=1;
file_name = sprintf("\'/home/user/directory to use/file%3.3i.pdf\'",k);
print -dpdf file_name;
but that only gets me a figure written to ~/file_name.pdf which is not what I want. I've tried several other approaches but I cannot find an approach that causes the the third term (file_name, in this example) to be evaluated. I have not found any other printing function that will allow me to perform a formatted write (the '-dpdf' option) of a plot (or figure) to a file.
I need the single quotes because the path name to the location where I want to write the file contains spaces. (I'm working on a Linux box running Fedora 24 updated daily.)
If I compute the file name using the line above, then cut and paste it into the print statement, everything works exactly as I wish it to. I've tried using
k=1;
file_name = sprintf("\'/home/user/directory to use/file%3.3i.pdf\'",k);
print ("-dpdf", '/home/user/directory to use/file001.pdf');
But simply switching to a different form of print statement doesn't solve the problem,although now I get an error message:
GPL Ghostscript 9.16: **** Could not open the file '/home/user/directory to use/file001.pdf' .
**** Unable to open the initial device, quitting.
warning: broken pipe
if you use foo a b this is the same as foo ("a", "b"). In your case you called print ("-dpdf", "file_name")
k = 1;
file_name = sprintf ("/home/user/directory to use/file%3.3i.pdf", k);
print ("-dpdf", file_name);
Observe:
>> k=1;
>> file_name = sprintf ('/home/tasos/Desktop/a folder with spaces in it/this is file number %3.3i.pdf', k)
file_name = /home/tasos/Desktop/a folder with spaces in it/this is file number 001.pdf
>> plot (1 : 10);
>> print (gcf, file_name, '-dpdf')
Tadaaa!
So yeah, no single quotes needed. The reason single quotes work when you're "typing it by hand" is because you're literally creating the string on the spot with the single quotes.
Having said that, it's generally a good idea when generating absolute paths to use the fullfile command instead. Have a look at it.
Tasos Papastylianou #TasosPapastylianou provided great help. My problem is now solved.
I have to process multiple CSV and TXT files in one awk script. My cmd file on windows looks like: gawk -f script.awk *.csv *.txt > output.file
I'd like to use this cmd file as I don't want to always type into the command prompt whenever I want to run the script. I would like to perform different tasks with the different file types. I have tried some stuff inside the script file like if (match(FILENAME, ".csv")) && (FNR > 1) but none of them were working. I have about 4-5 CSV files and a lot of (like 1000+) TXT files, these are all input files. The content of the CSV files are all in the same schema, one column between quotes. Example:
"Player"
"adigabor"
I want to ignore the first line of all the input CSV files when processing them and add each record w/o the quotes into an array and after that I'd like to process the TXT files which I can do just fine, my problem is that I couldn't perform the different tasks with the different input file extensions in one script.
It would be extremely useful if you told us in what way "none of them were working" so we're not just guessing but here goes anyway:
The main problem with match(FILENAME, ".csv") is it'll match csv preceded by any char anywhere in the file name. To get files that end in literally .csv you want:
match(FILENAME,/\.csv$/)
but you don't need to call a function for that:
FILENAME ~ /\.csv$/
So your script would look like:
FILENAME ~ /\.csv$/ {
if ( FNR > 1 ) {
do CSV stuff
}
next
}
{
do TXT stuff
}
If you still can't do whatever you're trying to do then edit your question to include sample input files (at least one of each small .csv and .txt files) and expected output along with a better explanation of what you are trying to do.