This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 4 years ago.
Say I have a Spark DataFrame which I want to save as CSV file. After Spark 2.0.0 , DataFrameWriter class directly supports saving it as a CSV file.
The default behavior is to save the output in multiple part-*.csv files inside the path provided.
How would I save a DF with :
Path mapping to the exact file name instead of folder
Header available in first line
Save as a single file instead of multiple files.
One way to deal with it, is to coalesce the DF and then save the file.
df.coalesce(1).write.option("header", "true").csv("sample_file.csv")
However this has disadvantage in collecting it on Master machine and needs to have a master with enough memory.
Is it possible to write a single CSV file without using coalesce ? If not, is there a efficient way than the above code ?
Just solved this myself using pyspark with dbutils to get the .csv and rename to the wanted filename.
save_location= "s3a://landing-bucket-test/export/"+year
csv_location = save_location+"temp.folder"
file_location = save_location+'export.csv'
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")
file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)
This answer can be improved by not using [-1], but the .csv seems to always be last in the folder. Simple and fast solution if you only work on smaller files and can use repartition(1) or coalesce(1).
Use:
df.toPandas().to_csv("sample_file.csv", header=True)
See documentation for details:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.toPandas
df.coalesce(1).write.option("inferSchema","true").csv("/newFolder",header =
'true',dateFormat = "yyyy-MM-dd HH:mm:ss")
The following scala method works in local or client mode, and writes the df to a single csv of the chosen name. It requires that the df fit into memory, otherwise collect() will blow up.
import org.apache.hadoop.fs.{FileSystem, Path}
val SPARK_WRITE_LOCATION = some_directory
val SPARKSESSION = org.apache.spark.sql.SparkSession
def saveResults(results : DataFrame, filename: String) {
var fs = FileSystem.get(this.SPARKSESSION.sparkContext.hadoopConfiguration)
if (SPARKSESSION.conf.get("spark.master").toString.contains("local")) {
fs = FileSystem.getLocal(new conf.Configuration())
}
val tempWritePath = new Path(SPARK_WRITE_LOCATION)
if (fs.exists(tempWritePath)) {
val x = fs.delete(new Path(SPARK_WRITE_LOCATION), true)
assert(x)
}
if (results.count > 0) {
val hadoopFilepath = new Path(SPARK_WRITE_LOCATION, filename)
val writeStream = fs.create(hadoopFilepath, true)
val bw = new BufferedWriter( new OutputStreamWriter( writeStream, "UTF-8" ) )
val x = results.collect()
for (row : Row <- x) {
val rowString = row.mkString(start = "", sep = ",", end="\n")
bw.write(rowString)
}
bw.close()
writeStream.close()
val resultsWritePath = new Path(WRITE_DIRECTORY, filename)
if (fs.exists(resultsWritePath)) {
fs.delete(resultsWritePath, true)
}
fs.copyToLocalFile(false, hadoopFilepath, resultsWritePath, true)
} else {
System.exit(-1)
}
}
This solution is based on a Shell Script and is not parallelized, but is still very fast, especially on SSDs. It uses cat and output redirection on Unix systems. Suppose that the CSV directory containing partitions is located on /my/csv/dir and that the output file is /my/csv/output.csv:
#!/bin/bash
echo "col1,col2,col3" > /my/csv/output.csv
for i in /my/csv/dir/*.csv ; do
echo "Processing $i"
cat $i >> /my/csv/output.csv
rm $i
done
echo "Done"
It will remove each partition after appending it to the final CSV in order to free space.
"col1,col2,col3" is the CSV header (here we have three columns of name col1, col2 and col3). You must tell Spark to don't put the header in each partition (this is accomplished with .option("header", "false") because the Shell Script will do it.
For those still wanting to do this here's how I got it done using spark 2.1 in scala with some java.nio.file help.
Based on https://fullstackml.com/how-to-export-data-frame-from-apache-spark-3215274ee9d6
val df: org.apache.spark.sql.DataFrame = ??? // data frame to write
val file: java.nio.file.Path = ??? // target output file (i.e. 'out.csv')
import scala.collection.JavaConversions._
// write csv into temp directory which contains the additional spark output files
// could use Files.createTempDirectory instead
val tempDir = file.getParent.resolve(file.getFileName + "_tmp")
df.coalesce(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save(tempDir.toAbsolutePath.toString)
// find the actual csv file
val tmpCsvFile = Files.walk(tempDir, 1).iterator().toSeq.find { p =>
val fname = p.getFileName.toString
fname.startsWith("part-00000") && fname.endsWith(".csv") && Files.isRegularFile(p)
}.get
// move to desired final path
Files.move(tmpCsvFile, file)
// delete temp directory
Files.walk(tempDir)
.sorted(java.util.Comparator.reverseOrder())
.iterator().toSeq
.foreach(Files.delete(_))
The FileUtil.copyMerge() from the Hadoop API should solve your problem.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
// the "true" setting deletes the source files once they are merged into the new output
}
See Write single CSV file using spark-csv
This is how distributed computing work! Multiple files inside a directory is exactly how distributed computing works, this is not a problem at all since all software can handle it.
Your question should be "how is it possible to download a CSV composed of multiple files?" -> there are already lof of solutions in SO.
Another approach could be to use Spark as a JDBC source (with the awesome Spark Thrift server), write a SQL query and transform the result to CSV.
In order to prevent OOM in the driver (since the driver will get ALL
the data), use incremental collect
(spark.sql.thriftServer.incrementalCollect=true), more info at
http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/.
Small recap about Spark "data partition" concept:
INPUT (X PARTITIONs) -> COMPUTING (Y PARTITIONs) -> OUTPUT (Z PARTITIONs)
Between "stages", data can be transferred between partitions, this is the "shuffle". You want "Z" = 1, but with Y > 1, without shuffle? this is impossible.
Related
I am trying to load the airbnb_nyc data set from GCS bucket to BigqueryTable. Link to the dataset.
I am using the following Code:
def parse_file(element):
for line in csv.reader([element],delimiter=','):
return line
class DataIngestion2:
def parse_method2(self, values):
row1 = dict(
zip(('id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
'calculated_host_listings_count', 'availability_365'),
values))
return row1
with beam.Pipeline(options=pipeline_options) as p:
lines= p | 'Read' >> ReadFromText(known_args.input,skip_header_lines=1)\
| 'parse' >> beam.Map(parse_file)
pipeline2 = lines | 'Format to Dict _ original CSV' >> beam.Map(lambda x: data_ingestion2.parse_method2(x))
pipeline2 | 'Load2' >> beam.io.WriteToBigQuery(table_spec, schema=table_schema,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
)
`
But my output on BigQuery Table is wrong.
I am only getting values for the first two columns and the rest of the 14 columns are showing NULL. I am not able to figure out what I am doing wrong. Can Someone Help me find the error in my logic. I basically want to know how to transfer a csv from GCS bucket to BigQuery through DataFlow pipeline.
Thank you,
You can use the ReadFromText method and then create your own transform by extending beam.DoFn. Attached the code below for reference.
https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText
Note that you can use gs:// for GCS in file_pattern.
More details about Pardo and DoFn
https://beam.apache.org/documentation/programming-guide/#pardo
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText,ReadFromText
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.io.gcp.gcsio import GcsIO
import csv
COLUMN_NAMES = ['id','name','host_id','host_name','neighbourhood_group','neighbourhood','latitude','longitude','room_type','price','minimum_nights','number_of_reviews','last_review','reviews_per_month','calculated_host_listings_count','availability_365']
def files(path='gs:/some/path'):
return list(GcsIO(storage_client='<ur storage client>').list_prefix(path=path).keys())
def transform_csv(element):
rows = []
with open(element,newline='\r\n') as f:
itr = csv.reader(f, delimiter = ',',quotechar= '"')
skip_head = next(itr)
for row in itr:
rows.append(row)
return rows
def to_dict(element):
rows = []
for item in element:
row_dict = {}
zipped = zip(COLUMN_NAMES,item)
for key,val in zipped:
row_dict[key] =val
rows.append(row_dict)
yield rows
with beam.Pipeline() as p:
read =(
p
|'read-file'>> beam.Create(files())
|'transform-dict'>>beam.Map(transform_csv)
|'list-to-dict'>>beam.FlatMap(to_dict )
|'print'>>beam.Map(print)
#|'write-to-bq'>>WriteToBigQuery(schema=COLUMN_NAMES,table='ur table',project='',dataset='')
)
EDITED1 The ReadFromText supports \r\n as newline char.But,this fails to consider the condition where column data itself has \r\n. Updating the code below.
EDITED 2 GcsIo error fixed.
Note - I have used GCSIO for getting the list of files.
Details here
Please Up-vote and mark as answer if this helps.
Let me suggest another approch for this use case. BiqQuery offers special feature for uploading from Google Could Storage (GCS) to Bigquery. You can load data in several formats and CSV is among them.
There is nice tutorial on Google documentation explaining how to do it. You do not have to use Dataflow or apache_beam. Such process is available through BigQuery API itself.
This is working in many languages, but you do not have to use any language as such process can be done from console or via Cloud SDK using bq command. Everything can be found in mentioned tutorial.
I'm trying to create multiple DataFrames from the two lists below,
val paths = ListBuffer("s3://abc_xyz_tableA.json",
"s3://def_xyz_tableA.json",
"s3://abc_xyz_tableB.json",
"s3://def_xyz_tableB.json",
"s3://abc_xyz_tableC.json",....)
val tableNames = ListBuffer("tableA","tableB","tableC","tableD",....)
I want to create different dataframes using the table names by bringing all the common table name ending s3 paths together as they have the unique schema.
so for example if the tables and paths related to it are brought together then -
"tableADF" will have all the data from these paths "s3://abc_xyz_tableA.json", "s3://def_xyz_tableA.json" as they have "tableA" in the path
"tableBDF" will have all the data from these paths "s3://abc_xyz_tableB.json", "s3://def_xyz_tableB.json" as they have "tableB" in the path
and so on there can be many tableNames and Paths
I'm trying different approaches but not successful yet.
Any leads in achieving the desired solution will be of great help. Thanks!
using input_file_name() udf, you can filter based on the file names to get the dataframe for each file/file patterns
import org.apache.spark.sql.functions._
import spark.implicits._
var df = spark.read.format("json").load("s3://data/*.json")
df = df.withColumn(
"input_file", input_file_name()
)
val tableADF= df.filter($"input_file".endsWith("tableA.json"))
val tableBDF= df.filter($"input_file".endsWith("tableB.json"))
If the file post fix name list is pretty long then you an use something as below,
Also find the code explanation inline
import org.apache.spark.sql.functions._
object DFByFileName {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
//Load your JSON data
var df = spark.read.format("json").load("s3://data/*.json")
//Add a column with file name
df = df.withColumn(
"input_file", (input_file_name())
)
//Extract unique file postfix from the file names in a List
val fileGroupList = df.select("input_file").map(row => {
val fileName = row.getString(0)
val index1 = fileName.lastIndexOf("_")
val index2 = fileName.lastIndexOf(".")
fileName.substring(index1 + 1, index2)
}).collect()
//Iterate file group name to map of (fileGroup -> Dataframe of file group)
fileGroupList.map(fileGroupName => {
df.filter($"input_file".endsWith(s"${fileGroupName}.json"))
//perform dataframe operations
})
}
}
Check below code & Final result type is
scala.collection.immutable.Map[String,org.apache.spark.sql.DataFrame] = Map(tableBDF -> [...], tableADF -> [...], tableCDF -> [...]) where ... is your column list.
paths
.map(path => (s"${path.split("_").last.split("\\.json").head}DF",path)) // parsing file names and extracting table name and path into tuple
.groupBy(_._1) // grouping paths based same table name
.map(p => (p._1 -> p._2.map(_._2))).par // combining paths for same table into list and also .par function to execute subsequent steps in Parallel
.map(mp => {
(
mp._1, // table name
mp._2.par // For same DF multiple Files load parallel.
.map(spark.read.json(_)) // loading files s3
.reduce(_ union _) // union if same table has multiple files.
)
}
)
I need to convert a bunch (23) of CSV files (source s3) into parquet format. The input CSV contains headers in all files. When I generated code for that using Glue. The output contains 22 header rows also in separate rows which means it ignored the first header. I need help in ignoring all the headers while doing this transformation.
Since I'm using from_catalog function for my input, I don't have any format_options to ignore the header rows.
Also, can I set an option in the Glue table that the header is present in the files? Will that automatically ignore the header when my job runs?
Part of my current approach is below. I'm new to Glue. This code was actually auto-generated by Glue.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
datasink1 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://my-bucket-name/full/s3/path-parquet"}, format = "parquet", transformation_ctx = "datasink1")
Faced exact issue while working on a ETL job which used AWS Glue.
The documentation for from_catalog says:
additional_options – A collection of optional name-value pairs. The possible options include those listed in Connection Types and Options for ETL in AWS Glue except for endpointUrl, streamName, bootstrap.servers, security.protocol, topicName, classification, and delimiter.
I tried using the below snippet and some of its permutations with from_catalog. But nothing worked for me.
additional_options = {"format": "csv", "format_options": '{"withHeader": "True"}'},
One way to go about fixing this is by using from_options instead of from_catalog and pointing directly to the S3 bucket or folder. This is what it should look like:
datasource0 = glueContext.create_dynamic_frame.from_options(
connection_type="s3",
connection_options={
'paths': ['s3://bucket_name/folder_name'],
"recurse": True,
'groupFiles': 'inPartition'
},
format="csv",
format_options={
"withHeader": True
},
transformation_ctx = "datasource0"
)
But if you can't do this for any reason and want to stick with from_catalog, using a filter worked for me.
Assuming that one of your header's name is name, this is what the snippet can look like:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["name"] != "name")
Not very sure about how spark's dataframes or glue's dynamicframes deal with csv headers and why data read from catalog had headers in rows as well as schema, but this seemed to solve my issue by removing the header values from the rows.
I have a large number of fairly large daily files stored in a blog storage engine(S3, Azure datalake exc.. exc..) data1900-01-01.csv, data1900-01-02.csv,....,data2017-04-27.csv. My goal is to preform a rolling N-day linear regression but I am having trouble with the data loading aspect. I am not sure how to do this without nested RDD's.
The schema for every .csv file is the same.
In other words for every date d_t, I need data x_t and to join data (x_t-1, x_t-2,... x_t-N).
How can I use PySpark to load an N-day Window of these daily files? All of the PySpark examples I can find seem to load from one very large file or data set.
Here's an example of my current code:
dates = [('1995-01-03', '1995-01-04', '1995-01-05'), ('1995-01-04', '1995-01-05', '1995-01-06')]
p = sc.parallelize(dates)
def test_run(date_range):
dt0 = date_range[-1] #get the latest date
s = '/daily/data{}.csv'
df0 = spark.read.csv(s.format(dt0), header=True, mode='DROPMALFORM')
file_list = [s.format(dt) for dt in date_range[:-1]] # Get a window of trailing dates
df1 = spark.read.csv(file_list, header=True, mode='DROPMALFORM')
return 1
p.filter(test_run)
p.map(test_run) #fails with same error as p.filter
I'm on PySpark version '2.1.0'
I'm running this on an Azure HDInsight cluster jupyter notebook.
spark here is of type <class 'pyspark.sql.session.SparkSession'>
A smaller more reproducible example is as follows:
p = sc.parallelize([1, 2, 3])
def foo(date_range):
df = spark.createDataFrame([(1, 0, 3)], ["a", "b", "c"])
return 1
p.filter(foo).count()
You are better off with using Dataframes rather than RDD. Dataframe's read.csv api accepts list of paths like -
pathList = ['/path/to/data1900-01-01.csv','/path/to/data1900-01-02.csv']
df = spark.read.csv(pathList)
have a look at documentation for read.csv
You can form the list of paths to date files to your data files by doing some date operation over window of N days like "path/to/data"+datetime.today().strftime("%Y-%m-%d"))+.csv" (This will get you file name of today only but its not hard to figure out date calculation for N days)
However keep in mind that schema of all date csvs should be same for above to work.
edit : When you parallelize list of dates i.e. p, each date gets processed individually by different executors, so input to test_run2 wasnt really as list of dates, it was one individual string like 1995-01-01
Try this instead, see if this works.
# Get the list of dates
date_range = window(dates, N)
s = '/daily/data{}.csv'
dt0 = date_range[-1] # most recent file
df0 = spark.read.csv(s.format(dt0), header=True, mode='DROPMALFORM')
# read previous files
file_list = [s.format(dt) for dt in date_range[:-1]]
df1 = spark.read.csv(file_list, header=True, mode='DROPMALFORM')
r, resid = computeLinearRegression(df0,df1)
r.write.save('daily/r{}.csv'.format(dt0))
resid.write.save('/daily/resid{}.csv'.format(dt0))
I am playing with spark RDD with json files and i am doing something like below
val uisJson5 = sqlContext.read.json(
sc.textFile("s3n://localtion/*")
.filter(line =>
line.contains("\"xyz\":\"A\"")
&& line.contains("\"id\":\"adasdfasdfasd\"")
))
uisJson5.show()
I want to know the source json files as well from where the results are coming. Is there any way i can do this?
Edit:
I was able to do it using below code
val uisJson1 = sc.textFile("s3n://localtion/*”)
.filter(line => line.contains("\"xyz\":\"A\"")
&& line.contains("\"id\":\"adasdfasdfasd\""))
uisJson1.collect().foreach(println)
You are looking for wholeTextFiles along with flatMapValues.
wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.