I need Spark to write a compressed csv file to HDFS, but I need it to start with a few lines of version information.
Example of file content
version=2
date=2020-01-31
id,name,age
1,Alice,21
2,Bob,23
Three ideas for ways to do this:
1. First write out to hdfs://data/tmp/file1.csv.gz, then use hadoop fs -cat to stream it to hdfs://data/real/file1.csv.gz
2. Convert the output DataFrame to text format / RDD[String] and union the real data with the extra header lines
3. Change the first column name to a multi-line string
So for approach 3:
column1 ="version=2\ndate=2020-01-31\n\nid"
Let me know if you know a more elegant way to do this.
I tried all the approaches. Here is simplified code:
Approach 1
Approach 1 uses Hadoop commands from a bash script. This works, but it requires a double HDFS write and a cleanup, and it doesn't fit well into a Scala Spark project.
(echo -e "version=2\ndate=2020-01-31\n\nid,name,age" | gzip -vc ; hadoop fs -cat "$INPUT_DIR/*" ) | hadoop fs -put - "$OUTPUT_PATH"
What is happening here is that it will:
- echo the multi-line header to stdout
- pipe that into gzip, which writes the compressed header to stdout
- cat the already-compressed files in the other HDFS dir to stdout as well
- pipe everything into hadoop fs -put, which combines it all into a single file
Approach 2
The code is a little more complicated. There are no bad quote characters in the header lines, but the header lines sometimes end up after the CSV part.
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD

val heading = """version=2
date=2020-01-31
id,name,age""".split("\n", -1).toSeq

val headingRdd: RDD[String] = sc.parallelize(heading)
val dataRdd: RDD[String] = df.rdd.map(row => row.mkString(","))

// Note: union followed by repartition does not guarantee the heading lines come first
val combinedResult: RDD[String] = headingRdd union dataRdd
combinedResult.repartition(1).saveAsTextFile(path, classOf[GzipCodec])
Approach 3
The simplest approach, but the output is slightly off
df.repartition(1)
  .withColumnRenamed("id", "version=2\ndate=2020-01-31\n\nid")
  .write
  .option("header", true)
  .option("delimiter", ",")
  .option("quoteMode", "NONE")
  .option("quote", " ")
  .option("codec", "gzip")
  .csv(path)
The result will look like this, which might or might not be acceptable:
version=2
date=2020-01-31
id ,name,age
1,Alice,21
2,Bob,23
I also tried with:
.option("quote", "\u0000")
It actually prints the ASCII character zero, and while it did not show up in my HDFS viewer, it was not part of the spec.
Best Approach
None of them are perfect for what seems like a very simple task. Maybe there is a small fix to make approach 2 work perfectly.
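One possible small fix for approach 2, sketched below and untested: coalesce the data to a single partition first, and then prepend the header lines inside that partition with mapPartitions, so the ordering no longer depends on how union and repartition shuffle things.

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD

// Glue the header lines onto the front of the single data partition so they are
// guaranteed to be written first ("heading" is the Seq[String] from approach 2).
val dataRdd: RDD[String] = df.rdd.map(_.mkString(","))
val withHeader: RDD[String] = dataRdd
  .coalesce(1)
  .mapPartitions(rows => heading.iterator ++ rows)

withHeader.saveAsTextFile(path, classOf[GzipCodec])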
Related
I'm reading a pipe-delimited CSV data file using Spark. It's quote-qualified. A block of text has a \n in it, and it's causing the read to corrupt. What I don't understand is that it's quote-qualified text, so surely it should just skip that!? The rows themselves are CR+LF delimited.
Anyhow, it's not skipping it. How do I get around this? I can cleanse the newlines out on extract, but that doesn't seem very elegant to me.
This is what I'm using to load the data
val sch = spark.table("db.mytable").schema
val df = spark.read
.format("csv")
.schema(sch)
.option("header", "true")
.option("delimiter", "|")
.option("mode", "PERMISSIVE")
.option("quote", "\"")
.load("/yadaydayda/mydata.txt")
Glad to know I'm not the only one who's dealt with this issue in Spark!
Spark reads files line by line, so CSVs with embedded newlines cause problems for the parser. Reading line by line makes it easier for Spark to handle large CSV files, rather than parsing all of the content for quotes, which would significantly impair performance for a case that is usually not an issue in high-performing analytics workloads.
For cases where I knew newlines were a possibility, I've used a third-party CSV parsing library: run the CSV "lines" through it (which handles the newlines correctly), strip the newlines, write/cache the cleaned file somewhere, and read from that cached file. For a production use case, those files would be loaded into a database. For log files, or anything you don't want in a database, using Parquet like you suggested works pretty well, or really just enforcing the lack of newlines somewhere before the files get to Spark.
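A rough sketch of that workflow, assuming Apache Commons CSV as the third-party parser (my assumption; any parser that understands quoted newlines would do) and reading each file whole so that quoted records can span lines:

import java.io.StringReader
import org.apache.commons.csv.{CSVFormat, CSVParser}
import scala.collection.JavaConverters._

// Pipe-delimited, double-quote qualified, matching the read options above.
val format = CSVFormat.DEFAULT.withDelimiter('|').withQuote('"')

val cleaned = spark.sparkContext
  .wholeTextFiles("/yadaydayda/mydata.txt")      // one (path, content) pair per file
  .flatMap { case (_, content) =>
    CSVParser.parse(new StringReader(content), format)
      .getRecords.asScala
      .map(rec => rec.iterator().asScala
        .map(_.replaceAll("[\r\n]+", " "))       // strip the embedded newlines
        .mkString("|"))                          // simplified: does not re-quote fields
  }

// Cache/write the cleaned lines somewhere, then read them back with spark.read as above.
cleaned.saveAsTextFile("/yadaydayda/mydata_cleaned")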
Got around this by initially stripping them on extract. The final solution I settled on, however, was to use a Parquet format on extract; then all these problems just go away.
I've exported a client database to a csv file, and tried to import it to Spark using:
spark.sqlContext.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("table.csv")
After doing some validation, I found out that some ids were null because a column sometimes contains a carriage return. That shifted all the following columns, with a domino effect, corrupting all the data.
What is strange is that when calling printSchema the resulting table structure is good.
How to fix the issue?
You seem to have had luck with inferSchema in that it worked fine (since it only reads a few records to infer the schema), and so printSchema gives you a correct result.
Since the CSV export file is broken, and assuming you want to process the file using Spark (given its size, for example), read it using textFile, fix the ids, save it back out in CSV format, and load that.
I'm not sure what version of Spark you are using, but beginning in 2.2 (I believe), there is a 'multiLine' option that can be used to keep fields together that have line breaks in them. From some other things I've read, you may need to apply some quoting and/or escape character options to get it working just how you want it.
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .csv("table.csv")
I have a csv data file containing commas within a column value. For example,
value_1,value_2,value_3
AAA_A,BBB,B,CCC_C
Here, the values are "AAA_A","BBB,B","CCC_C". But, when trying to split the line by comma, it is giving me 4 values, i.e. "AAA_A","BBB","B","CCC_C".
How to get the right values after splitting the line by commas in PySpark?
Use the spark-csv package from Databricks.
Delimiters between quotes (by default, ") are ignored.
Example:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For more info, review https://github.com/databricks/spark-csv
If your quote character is (') instead of ("), you can configure that with this class as well.
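For instance, something like this small, untested variation of the snippet above should do it:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "'")   // use ' as the quote character instead of "
  .load("cars.csv")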
EDIT:
For python API:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
Best regards.
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd

Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
I'm (really) new to PySpark, but have been using Pandas for the past few years. What I'm going to put here might not ultimately be the best solution, but it works for me, so I think it's worth posting.
I encountered the same issue loading a CSV file with an extra comma embedded in one particular field, which triggered an error when using PySpark but caused no problem with Pandas. So I looked around for a way to deal with this extra delimiter, and the following piece of code solved my issue:
df = sqlContext.read.format('csv').option('header','true').option('maxColumns','3').option('escape','"').load('cars.csv')
I personally like to force the 'maxColumns' parameter to allow only a specific number of columns. So if "BBB,B" somehow got parsed into two strings, Spark is going to give an error message and print the whole line for you. And the 'escape' option is the one that really fixed my issue. I don't know if this helps, but hopefully it's something to run experiments with.
How to convert JSON to Parquet in streaming with Spark?
Actually I have to ssh to a server, receive a big JSON file, convert it to Parquet, and upload it to Hadoop.
Is there a way to do this in a pipelined way?
They are backup files, so I have a directory with a predefined number of files that don't change in size over time.
Something like:
scp host /dev/stdout | spark-submit myprogram.py | hadoop /dir/
edit:
Actually I'm working on this:
sc = SparkContext(appName="Test")
sqlContext = SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

# Since I couldn't get stdin, went for a named pipe:
with open("mypipe", "r") as o:
    while True:
        line = o.readline()
        print "Processing: " + line
        lineRDD = sc.parallelize([line])
        df = sqlContext.jsonRDD(lineRDD)
        # Create and append
        df.write.parquet("file:///home/user/spark/test", mode="append")
        print "Done."
This is working fine, but the resulting Parquet files are very large (280 kB for a 4-line, 2-column JSON). Any improvements?
If anyone is interested, I managed to resolve this using the .pipe() method.
https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe
I am using the following code to save a Spark DataFrame to a JSON file:
unzipJSON.write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
the output result is:
part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
_SUCCESS
._SUCCESS.crc
How do I generate a single JSON file and not a file per line?
How can I avoid the *crc files?
How can I avoid the SUCCESS file?
If you want a single file, you need to do a coalesce to a single partition before calling write, so:
unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
Personally, I find it rather annoying that the number of output files depends on the number of partitions you have before calling write - especially if you do a write with a partitionBy - but as far as I know, there is currently no other way.
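For illustration (the column name here is hypothetical), a partitioned write like the one below produces at least one file per distinct partition value, no matter how much you coalesce beforehand:

unzipJSON.coalesce(1).write.partitionBy("someColumn").mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")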
I don't know if there is a way to disable the .crc files - I don't know of one - but you can disable the _SUCCESS file by setting the following on the hadoop configuration of the Spark context.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Note, that you may also want to disable generation of the metadata files with:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
Apparently, generating the metadata files takes some time (see this blog post), but they aren't actually that important (according to this). Personally, I always disable them and have had no issues.
Just a little update to the above answer. To disable the crc and _SUCCESS files, simply set the property on the Spark session as follows (example):
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.enableHiveSupport() \
.getOrCreate()
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
To ignore crc files on .write:
val hadoopConf = spark.sparkContext.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
fs.setWriteChecksum(false)