How to see the compression used to create a parquet file with pyarrow?

If I have a parquet file I can do
import pyarrow.parquet as pq
pqfile = pq.ParquetFile("pathtofile.parquet")
pqfile.metadata
but after exploring the pqfile object with dir, I can't find anything that indicates the compression of the file. How can I get that info?

@0x26res makes a good point in the comments: converting the metadata to a dict is easier than exploring it with dir.
Compression is stored at the column level. A parquet file consists of a number of row groups. Each row group has columns. So you would want something like...
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pydict({'x': list(range(100000))})
pq.write_table(table, '/tmp/foo.parquet')
pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).compression
# 'SNAPPY'

Related

converting parquet files in S3 to CSV and store back in S3

Information:
I have parquet files stored in S3 which I need to convert into CSV and store back into S3.
the way I have the parquet files structured in S3 is as so:
2019
2020
  |- 01
  ...
  |- 12
      |- 01
      ...
      |- 29
          |- part-0000.snappy.parquet
          |- part-0001.snappy.parquet
          ...
          |- part-1000.snappy.parquet
...
The solution required:
Any AWS tooling (needs to use lambda, no EC2, ECS) (open to suggestions though)
That the CSV files keep their headers during conversion (if they are split up)
That the CSVs retain their original information and have no added columns/information
That the converted CSV files remain around 50-100MB
The solution I have already tried:
"entire folder method"
Using Athena CREATE EXTERNAL TABLE -> CREATE TABLE AS on the entire data folder (e.g: s3://2020/06/01/)
fig: #1
CREATE EXTERNAL TABLE IF NOT EXISTS database.table_name (
  value_0 bigint,
  value_1 string,
  value_2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 'serialization.format' = '1' )
LOCATION 's3://2020/06/01' TBLPROPERTIES ('has_encrypted_data'='false')
fig: #2
CREATE TABLE database.different_table_name
WITH ( format='TEXTFILE', field_delimiter=',', external_location='s3://2020/06/01-output') AS
SELECT * FROM database.table_name
Doing this "entire folder method" converts the parquet to CSV, but it leaves CSV files of 1GB+ each, which is way too large. I tried building a solution to split up the CSV files (thanks to help from this guide), but it failed because Lambda's 15-minute limit and memory constraints made it difficult to split all these 1GB+ CSV files into 50-100MB files.
"single file method"
using the same CREATE EXTERNAL TABLE (see fig: #1) and
fig: #3
CREATE TABLE database.different_table_name
WITH ( format = 'TEXTFILE', field_delimiter=',', external_location = 's3://2020/06/01-output') AS
SELECT *, "$path" FROM database.table_name
WHERE "$path" LIKE 's3://2020/06/01/part-0000.snappy.parquet';
Doing this "single file method" required me to integrate AWS SQS to listen to S3 events for newly created objects whose names end in .snappy.parquet. This solution converted the parquet to CSV and created CSVs that fit the size requirements. The only issue is that the CSVs were missing headers and had additional fields that never existed in the parquet in the first place, such as the entire bucket location.
You can use dask
import dask.dataframe as dd
df = dd.read_parquet('s3://bucket_path/*.parquet')
# convert the dask DataFrame to a pandas DataFrame
df = df.compute()
df.to_csv('out.csv')
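The snippet above produces a single out.csv, while the question asks for size-bounded files that each keep the header. One way to get there is to stream the CSV and cut it into chunks, repeating the header at the top of every chunk. This is only a sketch: split_csv and format_row are hypothetical helper names, and the limit is counted in characters, which equals bytes only for ASCII data.

```python
import csv
import io


def format_row(row):
    """Render one row as a CSV line, including the trailing newline."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


def split_csv(source, max_chars):
    """Yield CSV text chunks that each start with the header row of `source`.

    A chunk is closed before a row would push it past max_chars, unless the
    chunk holds only the header, so a single oversized row still gets emitted.
    """
    reader = csv.reader(source)
    header = next(reader, None)
    if header is None:
        return
    header_text = format_row(header)
    parts, size = [header_text], len(header_text)
    for row in reader:
        text = format_row(row)
        if size + len(text) > max_chars and len(parts) > 1:
            yield "".join(parts)
            parts, size = [header_text], len(header_text)
        parts.append(text)
        size += len(text)
    if len(parts) > 1:
        yield "".join(parts)


# Tiny demo with an artificially small limit so that every data row gets its own chunk.
chunks = list(split_csv(io.StringIO("a,b\n1,2\n3,4\n5,6\n"), max_chars=8))
print(len(chunks))
```

Because only one chunk is held in memory at a time, the same loop can write each chunk to its own object instead of collecting them in a list.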
While there's no way to configure the output file sizes, you can control the number of files in each output partition when using CTAS in Athena. The key is to use the bucket_count and bucketed_by configuration parameters, as described here: How can I set the number or size of files when I run a CTAS query in Athena?. Run a few conversions and record the sizes of the Parquet and CSV files, then use that as a heuristic for how many buckets to configure for each job. Each bucket will become one file.
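The bucket-count heuristic can be sketched as a small calculation plus a CTAS template. The function names, the 4.0 CSV-to-Parquet size ratio, and the 75MB target are illustrative assumptions, not measured values; value_0 is the first column of the table defined in the question.

```python
import math


def estimate_bucket_count(parquet_bytes, csv_to_parquet_ratio, target_csv_bytes):
    """Estimate how many buckets (output files) a CTAS job should use."""
    expected_csv_bytes = parquet_bytes * csv_to_parquet_ratio
    return max(1, math.ceil(expected_csv_bytes / target_csv_bytes))


def build_ctas(source_table, target_table, location, bucket_column, bucket_count):
    """Build a CTAS statement that writes bucket_count CSV files."""
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (format = 'TEXTFILE', field_delimiter = ',',\n"
        f"      external_location = '{location}',\n"
        f"      bucketed_by = ARRAY['{bucket_column}'],\n"
        f"      bucket_count = {bucket_count}) AS\n"
        f"SELECT * FROM {source_table}"
    )


MB = 1024 ** 2
# Assumed inputs: 300MB of Parquet, CSV about 4x larger, 75MB target per file.
buckets = estimate_bucket_count(300 * MB, 4.0, 75 * MB)
sql = build_ctas(
    "database.table_name",
    "database.different_table_name",
    "s3://2020/06/01-output",
    "value_0",
    buckets,
)
print(sql)
```

The ratio would come from the recorded conversions mentioned above. Files only come out evenly sized if the bucketing column spreads rows evenly across buckets.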
When working with Athena from Lambda you can use Step Functions to avoid the need for the Lambda function to run while Athena is executing. Use the Poll for Job Status tutorial as a starting point. It's especially useful when running CTAS jobs since these tend to take longer to run.

Spark read multiple CSV files, one partition for each file

Suppose I have multiple CSV files in the same directory, and these files all share the same schema.
/tmp/data/myfile1.csv, /tmp/data/myfile2.csv, /tmp/data/myfile3.csv, /tmp/data/myfile4.csv
I would like to read these files into a Spark DataFrame or RDD, and I would like each file to be a partition of the DataFrame. How can I do this?
You have two options I can think of:
1) Use the Input File name
Instead of trying to control the partitioning directly, add the name of the input file to your DataFrame and use that for any grouping/aggregation operations you need to do. This is probably your best option as it is more aligned with the parallel processing intent of spark where you tell it what to do and let it figure out the how. You do this with code like this:
SQL:
SELECT input_file_name() as fname FROM dataframe
Or Python:
from pyspark.sql.functions import input_file_name
newDf = df.withColumn("filename", input_file_name())
2) Gzip your CSV files
Gzip is not a splittable compression format. This means that when loading gzipped files, each file will be its own partition.
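For option 2, the files have to be compressed before Spark reads them. A minimal sketch using only the standard library follows; gzip_csv_files is a hypothetical helper, and it assumes the files sit on a local filesystem.

```python
import gzip
import shutil
from pathlib import Path


def gzip_csv_files(directory):
    """Write a .csv.gz copy next to every .csv file and return the new paths."""
    compressed = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        gz_path = csv_path.with_name(csv_path.name + ".gz")
        # Stream the bytes so that large files are never fully loaded into memory.
        with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        compressed.append(gz_path)
    return compressed
```

After running gzip_csv_files("/tmp/data"), the compressed copies could be loaded with something like spark.read.csv("/tmp/data/*.csv.gz"), and each file should then arrive as a single partition.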

PySpark Querying Multiple JSON Files

I have loaded many JSONL files (the structure is the same for all of them) from a directory into Spark 2.2.0 using these commands (Python Spark):
df = spark.read.json(mydirectory)
df.createGlobalTempView("MyDatabase")
sqlDF = spark.sql("SELECT count(*) FROM MyDatabase")
sqlDF.show()
The uploading works, but when I query sqlDF (sqlDF.show()), it seems that Spark counts the rows of just one file (the first?) and not those of all of them. I am assuming that "MyDatabase" is a dataframe containing all the files.
What am I missing?
If I upload just one file consisting of only one line of multiple json objects {...}, Spark can properly identify the tabular structure. If I have more than one file, I have to put each {} on a new line to get the same result.
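Putting each object on its own line is the JSON Lines layout. If the files are not already in that shape, they can be rewritten before loading. The sketch below assumes each source file holds a single JSON array of objects; both function names are hypothetical.

```python
import json


def to_json_lines(records):
    """Serialize an iterable of dicts as JSON Lines: one object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def convert_array_file(source_path, target_path):
    """Rewrite a file holding one JSON array as a JSON Lines file."""
    with open(source_path, encoding="utf-8") as source:
        records = json.load(source)
    with open(target_path, "w", encoding="utf-8") as target:
        target.write(to_json_lines(records))
    return len(records)


print(to_json_lines([{"a": 1}, {"a": 2}]), end="")
```

json.dumps never emits raw newlines inside a value, so each record is guaranteed to stay on a single line.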

Spark - How to write a single csv file WITHOUT folder?

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).
Any help is appreciated.
A possible solution could be to convert the Spark dataframe to a pandas dataframe and save it as CSV:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka and snark suggest, this works only for small dataframes that fit into the driver's memory. It is fine for real cases where you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
If you want to use only the python standard library this is an easy function that will write to a single file. You don't have to mess with tempfiles or going through another dir.
import csv

def spark_to_csv(df, file_path):
    """Write a Spark dataframe to a single CSV file."""
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        writer.writeheader()
        # toLocalIterator streams the rows to the driver instead of collecting them all at once
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
If the result size is comparable to spark driver node's free memory, you may have problems with converting the dataframe to pandas.
I would tell spark to save to some temporary location, and then copy the individual csv files into desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with databricks, spark operates with files like dbfs:/mnt/..., and to use python's file operations on them, you need to change the path into /dbfs/mnt/... or (more native to databricks) replace shutil.copyfile with dbutils.fs.cp.
A more Databricks-native solution:
TEMPORARY_TARGET = "dbfs:/my_folder/filename"
DESIRED_TARGET = "dbfs:/my_folder/filename.csv"
spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
# pick the part file by name instead of relying on its position in the listing
temporary_csv = next(f.path for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith("part-"))
dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
Note: if you are working from a Koalas data frame, you can replace spark_df with koalas_df.to_spark().
For pyspark, you can convert to pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
There is no Spark DataFrame API that writes a single file instead of a directory as the result of a write operation.
Both options below create one single data file inside the directory, along with the standard files (_SUCCESS, _committed, _started).
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1) and instead take advantage of Spark's parallelism for writing files, it will create multiple data files inside the directory.
You then need to write a function in the driver that combines all the data file parts into a single file (cat part-00000* > singlefilename) once the write operation is done.
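A driver-side combine step along those lines might look like the sketch below. merge_part_files is a hypothetical helper. It assumes the part files are on a filesystem the driver can read with plain file I/O, and that the header row contains no embedded newlines.

```python
import glob
import os
import shutil


def merge_part_files(part_dir, target_path, has_header=True):
    """Concatenate Spark part-* files into one file, keeping a single header."""
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(target_path, "wb") as target:
        for index, part_path in enumerate(part_paths):
            with open(part_path, "rb") as part:
                if has_header and index > 0:
                    # Every part repeats the header, so skip it after the first file.
                    part.readline()
                shutil.copyfileobj(part, target)
    return len(part_paths)
```

Unlike a plain cat, this drops the repeated header lines that appear when every part was written with the header option enabled.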
I had the same problem and used Python's NamedTemporaryFile (from the tempfile module) to solve this.
from tempfile import NamedTemporaryFile
import boto3
s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
    df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
    s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file().
Create a temp folder inside the output folder. Copy the part-00000* file to the output folder under the desired file name. Delete the temp folder. Python code snippet to do this in Databricks:
fpath = output + '/' + 'temp'

def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise

# remove any leftover temp folder (recursively) before writing
if file_exists(fpath):
    dbutils.fs.rm(fpath, True)
df.coalesce(1).write.option("header", "true").csv(fpath)
fname = [x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')]
dbutils.fs.cp(fpath + "/" + fname[0], output + "/" + "name.csv")
dbutils.fs.rm(fpath, True)
You can use pyarrow, as it provides a file pointer for the HDFS file system. You can write your content to the file pointer just as you would write to a regular file. Code example:
import pyarrow.fs as fs
HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'
hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
    f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initialize pyarrow, you need to define the CLASSPATH environment variable properly: set it to the output of hadoop classpath --glob.
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
You can use this. If you don't want to give the name of the CSV every time, you can write a UDF or create an array of the CSV file names and pass that in.

Spark: spark-csv takes too long

I am trying to create a DataFrame from a CSV source that is on S3 on an EMR Spark cluster, using the Databricks spark-csv package and the flights dataset:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('s3n://h2o-airlines-unpacked/allyears.csv')
df.first()
This does not terminate on a cluster of 4 m3.xlarges. I am looking for suggestions to create a DataFrame from a CSV file on S3 in PySpark. Alternatively, I have tried putting the file on HDFS and reading from HDFS as well, but that also does not terminate. The file is not overly large (12 GB).
For reading a well-behaved csv file that is only 12GB, you can copy it onto all of your workers and the driver machines, and then manually split on ",". This may not parse any RFC4180 csv, but it parsed what I had.
Add at least 12GB extra space for worker disk space for each worker when you requisition the cluster.
Use a machine type that has at least 12GB RAM, such as c3.2xlarge. Go bigger if you don't intend to keep the cluster around idle and can afford the larger charges. Bigger machines means less disk file copying to get started. I regularly see c3.8xlarge under $0.50/hour on the spot market.
Copy the file to each of your workers, in the same directory on each worker. This should be a physically attached drive, i.e. different physical drives on each machine.
Make sure you have that same file and directory on the driver machine as well.
import re

raw = sc.textFile("/data.csv")
print("Counted %d lines in /data.csv" % raw.count())
raw_fields = raw.first()
# this regular expression is for quoted fields, i.e. "23","38","blue",...
matchre = r'^"(.*)"$'
pmatchre = re.compile(matchre)

def uncsv_line(line):
    # strip the surrounding quotes from each comma-separated field
    return [pmatchre.match(s).group(1) for s in line.split(',')]

fields = uncsv_line(raw_fields)

def raw_to_dict(raw_line):
    return dict(zip(fields, uncsv_line(raw_line)))

parsedData = (raw
    .map(raw_to_dict)
    .cache()
)
print("Counted %d parsed lines" % parsedData.count())
parsedData will be an RDD of dicts, where the keys of the dicts are the CSV field names from the first row, and the values are the CSV values of the current row. If you don't have a header row in the CSV data, this may not be right for you, but it should be clear that you could override the code reading the first line here and set up the fields manually.
Note that this is not immediately useful for creating data frames or registering a spark SQL table. But for anything else, it is OK, and you can further extract and transform it into a better format if you need to dump it into spark SQL.
I use this on a 7GB file with no issues, except I've removed some filter logic to detect valid data that has as a side effect the removal of the header from the parsed data. You might need to reimplement some filtering.
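As noted above, splitting on "," does not handle every RFC 4180 file; a quoted field containing a comma is one case that breaks. Where that matters, the standard library's csv module can parse each line instead. This is a sketch with hypothetical helper names, and it still assumes that no field contains an embedded newline, since the input is processed line by line.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV line, honoring quotes and commas inside quoted fields."""
    return next(csv.reader([line]))


def line_to_dict(fields, line):
    """Pair the header field names with the values parsed from one line."""
    return dict(zip(fields, parse_csv_line(line)))


header = parse_csv_line('"id","size","color"')
record = line_to_dict(header, '"23","3,8","blue"')
print(record)
```

In the RDD pipeline above, line_to_dict could stand in for raw_to_dict, for example raw.map(lambda line: line_to_dict(fields, line)).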