Control the name of a CSV output file [duplicate] - palantir-foundry

I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).
The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.
The issues are:
By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that, whenever a new file is uploaded, it appears in the same folder as an additional file, the only way to tell which is the newest version is by last modified date.
Every version of the data is stored in my external system, this takes up unnecessary storage unless I frequently go in and delete old files.
All of this is unnecessary complexity being added to my downstream system, I just want to be able to pull the latest version of data in a single step.

This is possible by renaming the single parquet file in the dataset so that it always has the same file name, that way the export task will overwrite the previous file in the external system.
This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.
Notes
The build will fail if the input contains more than one parquet file. As pointed out in the question, calling .coalesce(1) (or .repartition(1)) in the upstream transform is necessary.
If you require transaction history in your external store, or your dataset is much larger than 512 MB this method is not appropriate, as only the latest version is kept, and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
The transform does not require any Spark executors, so it is possible to use @configure() to give it a driver-only profile. Giving the driver additional memory should fix out-of-memory errors when working with larger datasets.
shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.
Full code snippet
example_transform.py
from transforms.api import transform, Input, Output
from . import utils


@transform(
    output=Output("/path/to/output"),
    source_df=Input("/path/to/input"),
)
def compute(output, source_df):
    return utils.write_single_named_parquet_file(output, source_df, "readable_file_name")
utils.py
from transforms.api import Input, Output
import shutil
import logging

log = logging.getLogger(__name__)


def write_single_named_parquet_file(output: Output, input: Input, file_name: str):
    """Write a single ".snappy.parquet" file with a given file name to a transforms output, containing the data of the
    single ".snappy.parquet" file in the transforms input. This is useful when you need to export the data using
    magritte and want a human readable name in the output. When not using separate transaction folders, this should
    cause the previous output to be automatically overwritten.

    The input to this function must contain a single ".snappy.parquet" file, this can be achieved by calling
    `.coalesce(1)` or `.repartition(1)` on your dataframe at the end of the upstream transform that produces the input.

    This function should not be used for large dataframes (e.g. those greater than 512 MB in size), instead
    transaction folders should be enabled in the export. This function can work for larger sizes, but you may find you
    need additional driver memory to perform both the coalesce/repartition in the upstream transform, and here.

    This produces a dataset without a schema, so features like expectations can't be used.

    Parameters:
        output (Output): The transforms output to write the single custom named ".snappy.parquet" file to, this is
            the dataset you want to export
        input (Input): The transforms input containing the data to be written to output, this must contain only one
            ".snappy.parquet" file (it can contain other files, for example logs)
        file_name: The name of the file to be written, ".snappy.parquet" will be automatically appended if not
            already there, and ".snappy" and ".parquet" will be corrected to ".snappy.parquet"

    Raises:
        RuntimeError: Input dataset must be coalesced or repartitioned into a single file.
        RuntimeError: Input dataset file system cannot be empty.

    Returns:
        void: writes the response to output, no return value
    """
    output.set_mode("replace")  # Make sure it is snapshotting
    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non parquet files
    if len(input_files) > 1:
        raise RuntimeError("Input dataset must be coalesced or repartitioned into a single file.")
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")
    input_file_path = input_files[0]
    log.info("Initial output file name: " + file_name)
    # check for snappy.parquet and append if needed
    if file_name.endswith(".snappy.parquet"):
        pass  # if it is already correct, do nothing
    elif file_name.endswith(".parquet"):
        # if it ends with ".parquet" (and not ".snappy.parquet"), remove parquet and append ".snappy.parquet"
        file_name = file_name.removesuffix(".parquet") + ".snappy.parquet"
    elif file_name.endswith(".snappy"):
        # if it ends with just ".snappy" then append ".parquet"
        file_name = file_name + ".parquet"
    else:
        # if doesn't end with any of the above, add ".snappy.parquet"
        file_name = file_name + ".snappy.parquet"
    log.info("Final output file name: " + file_name)
    with input.filesystem().open(input_file_path, "rb") as in_f:  # open the input file
        with output.filesystem().open(file_name, "wb") as out_f:  # open the output file
            shutil.copyfileobj(in_f, out_f)  # write the file into a new file
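The suffix handling above can be tried outside Foundry. The following is a minimal sketch of just that step, pulled into a standalone helper; the name normalize_parquet_file_name is hypothetical and not part of the answer's code, and slicing is used instead of str.removesuffix so it also runs on Python versions older than 3.9.

```python
def normalize_parquet_file_name(file_name: str) -> str:
    """Return file_name ending in exactly one ".snappy.parquet" suffix."""
    full_suffix = ".snappy.parquet"
    if file_name.endswith(full_suffix):
        return file_name  # already correct
    if file_name.endswith(".parquet"):
        # swap a bare ".parquet" for ".snappy.parquet"
        return file_name[: -len(".parquet")] + full_suffix
    if file_name.endswith(".snappy"):
        return file_name + ".parquet"
    return file_name + full_suffix


for name in ["report", "report.parquet", "report.snappy", "report.snappy.parquet"]:
    print(name, "->", normalize_parquet_file_name(name))
```

All four inputs should map to the same name, which is what lets the export overwrite the previous file.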

You can also use the rewritePaths functionality of the export plugin to rename the file matching spark/*.snappy.parquet to "export.parquet" while exporting. This of course only works if there is only a single file, so .coalesce(1) in the transform is a must:
excludePaths:
- ^_.*
- ^spark/_.*
rewritePaths:
'^spark/(.*[\/])(.*)': $1/export.parquet
uploadConfirmation: exportedFiles
incrementalType: snapshot
retriesPerFile: 0
bucketPolicy: BucketOwnerFullControl
directoryPath: features
setBucketPolicy: true

I ran into the same requirement; the only difference was that the dataset had to be split into multiple parts due to its size. Posting the code here, updated to handle this use case.
def rename_multiple_parquet_outputs(output: Output, input: Input, file_name_prefix: str):
    """
    Slight improvement to allow multiple output files to be renamed
    """
    output.set_mode("replace")  # Make sure it is snapshotting
    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non parquet files
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")
    print(f'input files {input_files}')
    print("prefix for target name: " + file_name_prefix)
    for i, f in enumerate(input_files):
        with input.filesystem().open(f, "rb") as in_f:  # open the input file
            with output.filesystem().open(f'{file_name_prefix}_part_{i}.snappy.parquet', "wb") as out_f:  # open the output file
                shutil.copyfileobj(in_f, out_f)  # write the file into a new file
Also, to use this in a code workbook, the input needs to be persisted, and the output parameter can be retrieved as shown below.
def rename_outputs(persisted_input):
    output = Transforms.get_output()
    rename_multiple_parquet_outputs(output, persisted_input, "prefix_for_renamed_files")

Related

How to split a large CSV file with no code?

I have a CSV file which has nearly 22M records. I want to split this into multiple CSV files so that I can use it further.
I tried to open it using Excel(tried Transform Data Option as well)/Notepad++/Notepad, but all give me an error.
When I explore the options, I found that we can split the file using some coding methodologies like Java, Python, etc.. I am not much familiar with coding and want to know if there is any option to split the file without using any coding process. Also, since the file has client sensitive data I don't want to download/use any external tools.
Any help would be much appreciated.
I know you're concerned about security of sensitive data, and that makes you want to avoid external tools (even a nominally trusted tool like Google Big Query... unless your data is medical in nature).
I know you don't want a custom solution w/Python, but I don't understand why that is—this is a big problem, and CSVs can be tricky to handle.
Maybe your CSV is a "simple one" where there are no embedded line breaks, and the quoting is minimal. But if it isn't, you're going to want a tool that's meant for CSV.
And because the file is so big, I don't see how you can do it without code. Even if you could load it into a trusted tool, how would you process the 22M records?
I look forward to seeing what else the community has to offer you.
The best I can think of based on my experience is exactly what you said you don't want.
It's a small-ish Python script that uses its CSV library to correctly read in your large file and write out several smaller files. If you don't trust this, or me, maybe find someone you do trust who can read this and assure you it won't compromise your sensitive data.
#!/usr/bin/env python3
import csv

MAX_ROWS = 22_000

# The name of your input
INPUT_CSV = 'big.csv'

# The "base name" of all new sub-CSVs, a counter will be added after the '-':
# e.g., new-1.csv, new-2.csv, etc...
NEW_BASE = 'new-'


# This function will be called over-and-over to make a new CSV file
def make_new_csv(i, header=None):
    # New name
    new_name = f'{NEW_BASE}{i}.csv'
    # Create a new file from that name
    new_f = open(new_name, 'w', newline='')
    # Creates a "writer", a dedicated object for writing "rows" of CSV data
    writer = csv.writer(new_f)
    if header:
        writer.writerow(header)
    return new_f, writer


# Open your input CSV
with open(INPUT_CSV, newline='') as in_f:
    # Like the "writer", dedicated to reading CSV data
    reader = csv.reader(in_f)
    your_header = next(reader)  # the first row is the header, repeated in every new file

    # Give your new files unique, and sequential names: e.g., new-1.csv, new-2.csv, etc...
    new_i = 1
    # Make first new file and writer
    new_f, writer = make_new_csv(new_i, your_header)

    # Loop over all input rows, and count how many
    # records have been written for each "new file"
    new_rows = 0
    for row in reader:
        if new_rows == MAX_ROWS:
            new_f.close()  # This file is full, close it and...
            new_i += 1
            new_f, writer = make_new_csv(new_i, your_header)  # get a new file and writer
            new_rows = 0  # Reset row counter
        writer.writerow(row)
        new_rows += 1

    # All done reading input rows, close last file
    new_f.close()
There's also a fantastic tool I use daily for processing large CSVs, also with sensitive client contact and personally identifying information, GoCSV.
Its split command is exactly what you need:
Split a CSV into multiple files.
Usage:
gocsv split --max-rows N [--filename-base FILENAME] FILE
I'd recommend downloading it for your platform, unzipping it, putting a sample file with non-sensitive information in that folder and trying it out:
gocsv split --max-rows 1000 --filename-base New sample.csv
would end up creating a number of smaller CSVs, New-1.csv, New-2.csv, etc..., each with a header and no more than 1000 rows.

I am trying to parse 350 files on my local disk and store the data into database as json objects

I am parsing 350 txt files containing JSON data using Python. I am able to retrieve 62 of those objects and store them in a MySQL database, but after that I get an error saying JSONDecodeError: Extra data
Python:
import os
import ast
import json
import mysql.connector as mariadb
from mysql.connector.constants import ClientFlag

mariadb_connection = mariadb.connect(user='root', password='137800000', database='shaproject',client_flags=[ClientFlag.LOCAL_FILES])
cursor = mariadb_connection.cursor()
sql3 = """INSERT INTO shaproject.alttwo (alttwo_id,responses) VALUES """
os.chdir('F:/Code Blocks/SEM 2/DM/Project/350/For Merge Disqus')
current_list_dir=os.listdir()
print(current_list_dir)
cur_cwd=os.getcwd()
cur_cwd=cur_cwd.replace('\\','/')
twoid=1
for every_file in current_list_dir:
    file=open(cur_cwd + "/" + every_file)
    utffile=file.read()
    data=json.loads(utffile)
    for i in range(0,len(data['response'])):
        data123 = json.dumps(data['response'][i])
        tup=(twoid,data123)
        print(sql3+str(tup))
        twoid+=1
        cursor.execute(sql3+str(tup)+";")
        print(tup)
mariadb_connection.commit()
I have searched online and found that multiple dump statements are resulting in this error. But I am unable to resolve it.
You want to use glob.
Rather than os.listdir(), which is too permissive,
use glob to focus on just the *.json files.
Print out the name of the file before asking .loads() to parse it.
Rename any badly formatted files to .txt rather than .json, in order to skip them.
Note that you can pass the open file directly to .load(), if you wish.
Closing open files would be a good thing.
Rather than a direct assignment (with no close()!)
you would be better off with with:
with open(cur_cwd + "/" + every_file) as file:
    data = json.load(file)
Talking about current current working directory seems
both repetitive and redundant.
It would suffice to call it cwd.
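Putting those suggestions together, a minimal sketch might look like the following. The helper name load_json_files and its return shape are assumptions made for illustration; it collects parse failures instead of renaming files, so the badly formatted ones can be inspected afterwards.

```python
import glob
import json
import os


def load_json_files(directory, pattern="*.json"):
    """Parse every file in directory matching pattern.

    Returns (parsed, failed): parsed maps path -> decoded JSON,
    failed maps path -> the decoder's error message.
    """
    parsed = {}
    failed = {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        # "with" closes the file even if parsing raises
        with open(path, encoding="utf-8") as f:
            try:
                parsed[path] = json.load(f)
            except json.JSONDecodeError as exc:
                failed[path] = str(exc)  # e.g. "Extra data: line ..."
    return parsed, failed
```

For the files in the question, which are named .txt, the call would be load_json_files(cwd, "*.txt"); printing the keys of failed then shows which files trigger the "Extra data" error.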

Spark - How to write a single csv file WITHOUT folder?

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).
Any help is appreciated.
A possible solution could be convert the Spark dataframe to a pandas dataframe and save it as csv:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka or snark suggest, this works for small dataframes that fits into driver. It works for real cases that you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
If you want to use only the python standard library this is an easy function that will write to a single file. You don't have to mess with tempfiles or going through another dir.
import csv

def spark_to_csv(df, file_path):
    """ Converts spark dataframe to CSV file """
    # newline="" lets the csv module control line endings itself
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        writer.writeheader()  # header row taken from df.columns
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
If the result size is comparable to spark driver node's free memory, you may have problems with converting the dataframe to pandas.
I would tell spark to save to some temporary location, and then copy the individual csv files into desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with databricks, spark operates with files like dbfs:/mnt/..., and to use python's file operations on them, you need to change the path into /dbfs/mnt/... or (more native to databricks) replace shutil.copyfile with dbutils.fs.cp.
A more databricks'y' solution is here:
TEMPORARY_TARGET="dbfs:/my_folder/filename"
DESIRED_TARGET="dbfs:/my_folder/filename.csv"
spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
temporary_csv = os.path.join(TEMPORARY_TARGET, dbutils.fs.ls(TEMPORARY_TARGET)[3][1])
dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
Note if you are working from Koalas data frame you can replace spark_df with koalas_df.to_spark()
For pyspark, you can convert to pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
There is no dataframe spark API which writes/creates a single file instead of directory as a result of write operation.
Below both options will create one single file inside directory along with standard files (_SUCCESS , _committed , _started).
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header",
"true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header",
"true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1) and take advantage of sparks parallelism for writing files then it will create multiple data files inside directory.
You then need to write a function in the driver which combines all the data file parts into a single file (e.g. cat part-* > singlefilename) once the write operation is done.
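A minimal sketch of such a combining function is below. It assumes the part files sit on a file system the driver can read with ordinary Python file calls, that every part ends with a newline, and that a header, if present, is a single line; the name merge_part_files is hypothetical.

```python
import glob
import os
import shutil


def merge_part_files(part_dir, dest_path, has_header=True):
    """Concatenate the part-* CSV files in part_dir into dest_path."""
    # "part-*" skips _SUCCESS and hidden .crc files
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    if not part_paths:
        raise FileNotFoundError(f"no part-* files found in {part_dir}")
    with open(dest_path, "wb") as out_f:
        for index, part_path in enumerate(part_paths):
            with open(part_path, "rb") as in_f:
                if has_header and index > 0:
                    in_f.readline()  # drop the repeated header line
                shutil.copyfileobj(in_f, out_f)
    return dest_path
```

When the dataframe was written with option("header", "true"), every part carries its own header, which is why all but the first are skipped.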
I had the same problem and used NamedTemporaryFile from Python's tempfile module to solve this.
import boto3
from tempfile import NamedTemporaryFile

# S3_BUCKET and S3_FOLDER are defined elsewhere
s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
    df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
    s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file()
Create temp folder inside output folder. Copy file part-00000* with the file name to output folder. Delete the temp folder. Python code snippet to do the same in Databricks.
fpath=output+'/'+'temp'

def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise

if file_exists(fpath):
    dbutils.fs.rm(fpath, True)  # recurse, since the temp folder is not empty
df.coalesce(1).write.option("header", "true").csv(fpath)

fname=([x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')])
dbutils.fs.cp(fpath+"/"+fname[0], output+"/"+"name.csv")
dbutils.fs.rm(fpath, True)
You can go with pyarrow, as it provides file pointer for hdfs file system. You can write your content to file pointer as a usual file writing. Code example:
import pyarrow.fs as fs
HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'
hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
For pyarrow to initialize its HDFS connection, the CLASSPATH environment variable must be defined properly: set it to the output of hadoop classpath --glob.
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
You can use this, and if you don't want to give the name of the CSV every time, you can write a UDF or create an array of the CSV file names and pass it in.

Reading all CSV files in current working directory into pandas with correct filenames

I'm trying to use a loop to read in multiple CSVs (for now but mix of that and xls in the future).
I'd like each data frame in pandas to be the same name excluding file extension in my folder.
import os
import pandas as pd
files = filter(os.path.isfile, os.listdir( os.curdir ) )
files # this shows a list of the files that I want to use/have in my directory- they are all CSVs if that matters
# i want to load these into pandas data frames with the corresponding filenames
# not sure if this is the right approach....
# but what is wrong is the variable is named 'weather_today.csv'... i need to drop the .csv or .xlsx or whatever it might be
for each_file in files:
frame = pd.read_csv( each_file)
each_file = frame
Bernie's answer seems great, but there is one problem:
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    # Right below I am assigning my looped data frame the literal variable name of "filename_only" rather than the value that filename_only represents
    # rather than what happens if I print(filename_only)
    filename_only = frame
For example, if my two files are weather_today.csv and earthquakes.csv (in that order) in my files list, then neither 'earthquakes' nor 'weather_today' will be created.
however, if I simply type 'filename_only' and click the enter key in python - then I will see the earthquake dataframe. If I have 100 files, then the last data frame name in the list loop will be titled 'filename_only' and the other 99 won't because the previous assignments are never made and the 100th one overwrites them.
You can use os.path.splitext() for this to "split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period."
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    filename_only = frame
As asked in a comment we would like a way to filter for just CSV files so you can do something like this:
files = [file for file in os.listdir( os.curdir ) if file.endswith(".csv")]
Use a dictionary to store your frames:
frames = {}
for each_file in files:
    frames[os.path.splitext(each_file)[0]] = pd.read_csv(each_file)
Now you can get the DataFrame of your choice with:
frames[filename_without_ext]
Simple, right? Be careful about RAM usage though, reading a bunch of files can quickly fill up system memory and cause a crash.

Spark: spark-csv takes too long

I am trying to create a DataFrame from a CSV source that is on S3 on an EMR Spark cluster, using the Databricks spark-csv package and the flights dataset:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('s3n://h2o-airlines-unpacked/allyears.csv')
df.first()
This does not terminate on a cluster of 4 m3.xlarges. I am looking for suggestions to create a DataFrame from a CSV file on S3 in PySpark. Alternatively, I have tried putting the file on HDFS and reading from HDFS as well, but that also does not terminate. The file is not overly large (12 GB).
For reading a well-behaved csv file that is only 12GB, you can copy it onto all of your workers and the driver machines, and then manually split on ",". This may not parse any RFC4180 csv, but it parsed what I had.
Add at least 12GB extra space for worker disk space for each worker when you requisition the cluster.
Use a machine type that has at least 12GB RAM, such as c3.2xlarge. Go bigger if you don't intend to keep the cluster around idle and can afford the larger charges. Bigger machines means less disk file copying to get started. I regularly see c3.8xlarge under $0.50/hour on the spot market.
copy the file to each of your workers, in the same directory on each worker. This should be a physically attached drive, i.e. different physical drives on each machine.
Make sure you have that same file and directory on the driver machine as well.
import re

raw = sc.textFile("/data.csv")
print("Counted %d lines in /data.csv" % raw.count())
raw_fields = raw.first()
# this regular expression is for quoted fields. i.e. "23","38","blue",...
matchre = r'^"(.*)"$'
pmatchre = re.compile(matchre)

def uncsv_line(line):
    return [pmatchre.match(s).group(1) for s in line.split(',')]

fields = uncsv_line(raw_fields)

def raw_to_dict(raw_line):
    return dict(zip(fields, uncsv_line(raw_line)))

parsedData = (raw
    .map(raw_to_dict)
    .cache()
)
print("Counted %d parsed lines" % parsedData.count())
parsedData will be a RDD of dicts, where the keys of the dicts are the CSV field names from the first row, and the values are the CSV values of the current row. If you don't have a header row in the CSV data, this may not be right for you, but it should be clear that you could override the code reading the first line here and set up the fields manually.
Note that this is not immediately useful for creating data frames or registering a spark SQL table. But for anything else, it is OK, and you can further extract and transform it into a better format if you need to dump it into spark SQL.
I use this on a 7GB file with no issues, except I've removed some filter logic to detect valid data that has as a side effect the removal of the header from the parsed data. You might need to reimplement some filtering.
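The manual split on "," breaks as soon as a quoted field contains a comma. A possible variation, sketched here as an assumption rather than as part of the answer above, is to parse each line with Python's csv module; it copes with quoted commas, though not with records that span several lines. The helper names and the sample values are made up for illustration.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV record that is held in a single line of text."""
    return next(csv.reader([line]))


def line_to_dict(fields, line):
    """Pair the header fields with the values parsed from one line."""
    return dict(zip(fields, parse_csv_line(line)))


header = parse_csv_line('"year","carrier","note"')
row = line_to_dict(header, '"1987","PS","late, weather"')
print(row)  # {'year': '1987', 'carrier': 'PS', 'note': 'late, weather'}
```

In the Spark code, line_to_dict could stand in for raw_to_dict, for example raw.map(lambda line: line_to_dict(fields, line)).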