Databricks Write Json file is too slow

I have a simple Scala snippet that reads and writes JSON files totaling 10 GB (with a directory mounted from a storage account). It took 1.7 hours, with almost all of the time spent in the line that writes the JSON files.
Cluster setup:
Azure Databricks DBR 7.3 LTS, Spark 3.0.1, Scala 2.12
11 workers + one driver of type Standard_E4as_v4 (each with 32 GB memory, 4 cores)
Why is writing so slow?
Isn't writing parallelized across partitions/workers, just like reading?
How can I speed up the write, or the whole process?
Code for mounting the directory:
val containerName = "containerName"
val storageAccountName = "storageAccountName"
val sas = "sastoken"
val config = "fs.azure.sas." + containerName+ "." + storageAccountName + ".blob.core.windows.net"
dbutils.fs.mount(
  source = "wasbs://containerName@storageAccountName.blob.core.windows.net/",
  mountPoint = "/mnt/myfile",
  extraConfigs = Map(config -> sas))
Code for reading/writing the JSON files:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._
val jsonDF = spark.read.json("/mnt/myfile")
jsonDF.write.json("/mnt/myfile/spark_output_test")

Solved by using what I would say are better Spark APIs, like
read --> spark.read.format("json").load("path")
write --> res.write.format("parquet").save("path")
and by writing in Parquet format, as it is compressed and much better optimized for Spark.
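A minimal sketch of the reworked job, reusing the paths from the question (treat it as an illustration, not a drop-in fix):
val jsonDF = spark.read.format("json").load("/mnt/myfile")
// Parquet is compressed and columnar, so the write is far cheaper than plain-text JSON
jsonDF.write.format("parquet").save("/mnt/myfile/spark_output_test")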

Related

How do I split / chunk Large JSON Files with AWS glueContext before converting them to JSON?

I'm trying to convert a 20GB JSON gzip file to parquet using AWS Glue.
I've set up a job using PySpark with the code below.
I got this log WARN message:
LOG.WARN: Loading one large unsplittable file s3://aws-glue-data.json.gz with only one partition, because the file is compressed by unsplittable compression codec.
I was wondering if there was a way to split / chunk the file? I know I can do it with pandas, but unfortunately that takes far too long (12+ hours).
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
import pyspark.sql.functions
from pyspark.sql.functions import col, concat, reverse, translate
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
test = glueContext.create_dynamic_frame_from_catalog(
    database="test_db",
    table_name="aws-glue-test_table")
# Create Spark DataFrame, remove timestamp field and re-name other fields
reconfigure = test.drop_fields(['timestamp']).rename_field('name', 'FirstName').rename_field('LName', 'LastName').rename_field('type', 'record_type')
# Create pyspark DF
spark_df = reconfigure.toDF()
# Filter and only return 'a' record types
spark_df = spark_df.where("record_type == 'a'")
# Once filtered, remove the record_type column
spark_df = spark_df.drop('record_type')
spark_df = spark_df.withColumn("LastName", translate("LastName", "LName:", ""))
spark_df = spark_df.withColumn("FirstName", reverse("FirstName"))
spark_df.write.parquet("s3a://aws-glue-bucket/parquet/test.parquet")
Spark does not parallelize reading a single gzip file. However, you can split it into chunks yourself.
Also, Spark is really slow at reading gzip files (since the read isn't parallelized). You can do this to speed it up:
import gzip
# list_of_files holds the paths of the (pre-split) gzip files
file_names_rdd = sc.parallelize(list_of_files, 100)
lines_rdd = file_names_rdd.flatMap(lambda _: gzip.open(_).readlines())

Reading Json files using pyspark

I am trying to read multiple json files from dbfs in databricks.
raw_df = spark.read.json('/mnt/testdatabricks/metrics-raw/',recursiveFileLookup=True)
This returns data for only 35 files whereas there are around 1600 files.
I tried to read some of the files (except those 35) using pandas and it returned data.
However the driver fails when I try to read all 1600 files using pandas.
import pandas as pd
from glob import glob
jsonFiles = glob('/dbfs/mnt/testdatabricks/metrics-raw/***/*.json')
dfList = []
for jsonFile in jsonFiles:
    df = pd.read_json(jsonFile)
    dfList.append(df)
    print("written :", jsonFile)
dfTrainingDF = pd.concat(dfList, axis=0)
Not sure why spark is not able to read all the files.
Try:
spark.read.option("recursiveFileLookup", "true").json("file:///dir1/subdirectory")
Ref: How to make Spark session read all the files recursively?
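Applied to the directory from the question, that becomes (as a sketch, reusing the mount path from the question):
spark.read.option("recursiveFileLookup", "true").json("/mnt/testdatabricks/metrics-raw/")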

spark streaming+kafka - appending to parquet

I am streaming meter reading records as JSON from Kafka 2.11-1 into Spark 2.1. I don't understand how to convert the streamed object into a DataFrame before saving it to a Parquet file. I want the Scala script to infer the schema from the JSON, so that a new Parquet file format will be generated automatically when the JSON format of the streaming source data changes (I'll figure out later how to detect this and start a new file whenever a format change occurs). For now, I am unable to write the Parquet file.
import org.apache.spark
import org.apache.spark.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode, SparkSession}
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
val sqlContext = new SQLContext(sc)
ssc.checkpoint("_checkpoint")
// Connect to Kafka
import org.apache.spark.streaming.kafka.KafkaUtils
import _root_.kafka.serializer.StringDecoder
val kafkaParams = Map("metadata.broker.list" -> "xx.xx.xx.xx:9092")
val kafkaTopics = Set("test")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, kafkaTopics)
messages.print()
messages.foreachRDD(rdd => {
  val part1 = rdd.map(_._1)
  val part2 = rdd.map(_._2) // this has the json
  print("%%%% part1 is : " + part1)
  print("%%%% part2 is : " + part2)
  // here: infer the schema from json and append the streamed data to a parquet file on hdfs
})
ssc.start()
ssc.awaitTermination()
The json looks like this:
-------------------------------------------
Time: 1513155855000 ms
-------------------------------------------
(null,{"customer_id":"customer_51","customer_acct_id":"cusaccid_1197","serv_acct_id":"service_1957","installed_service_id":"instserv_946","meter_id":"meter_319","channel_number":"156","interval_read_date":"2013-06-16 11:26:04","interval_received":"5","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N", "provisioned_meter_ind:"N"})
(null,{"customer_id":"customer_25","customer_acct_id":"cusaccid_1303","serv_acct_id":"service_844","installed_service_id":"instserv_1636","meter_id":"meter_663","channel_number":"1564","interval_read_date":"2014-02-13 12:52:34","interval_received":"8","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1955","customer_acct_id":"cusaccid_1793","serv_acct_id":"service_577","installed_service_id":"instserv_1971","meter_id":"meter_1459","channel_number":"1312","interval_read_date":"2017-05-23 07:32:13","interval_received":"11","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1833","customer_acct_id":"cusaccid_1381","serv_acct_id":"service_461","installed_service_id":"instserv_477","meter_id":"meter_1373","channel_number":"1769","interval_read_date":"2011-12-13 10:12:20","interval_received":"15","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1597","customer_acct_id":"cusaccid_1753","serv_acct_id":"service_379","installed_service_id":"instserv_1061","meter_id":"meter_1759","channel_number":"632","interval_read_date":"2013-07-22 05:49:55","interval_received":"7","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
2017-12-13 09:04:15,626 INFO org.apache.spark.streaming.scheduler.JobGenerator (Logging.scala:logInfo(54)) - Checkpointing graph for time 1513155855000 ms
I'm testing this using spark-shell:
spark-shell --jars /opt/alti-spark-2.1.1/external/kafka-0-8/target/spark-streaming-kafka-0-8_2.11-2.1.1.jar --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0
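As a minimal sketch of the commented-out step inside foreachRDD, assuming the JSON strings are the message values (the second element of each tuple) and that appending to a fixed Parquet directory on HDFS is acceptable (the output path below is a placeholder):
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Infer the schema from the JSON strings in this micro-batch
    val jsonDF = sqlContext.read.json(rdd.map(_._2))
    // Append the micro-batch to a Parquet directory on HDFS
    jsonDF.write.mode(SaveMode.Append).parquet("hdfs:///tmp/meter_readings_parquet")
  }
}
Each micro-batch gets its own inferred schema, so if the JSON format changes, later files in the directory will simply carry different columns and will need mergeSchema when read back.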

pyspark csv at url to dataframe, without writing to disk

How can I read a csv at a url into a dataframe in Pyspark without writing it to disk?
I've tried the following with no luck:
import urllib.request
from io import StringIO
url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
response = urllib.request.urlopen(url)
data = response.read()
text = data.decode('utf-8')
f = StringIO(text)
df1 = sqlContext.read.csv(f, header = True, schema=customSchema)
df1.show()
TL;DR It is not possible, and in general transferring data through the driver is a dead end.
Before Spark 2.3 the csv reader can only read from a URI (and http is not supported).
In Spark 2.3 you can use an RDD:
spark.read.csv(sc.parallelize(text.splitlines()))
but the data will be written to disk.
You can use createDataFrame with Pandas:
spark.createDataFrame(pd.read_csv(url))
but this once again writes to disk.
If the file is small, I'd just use SparkFiles:
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("iris.csv"), header=True)

spark failing to read mysql data and save in hdfs [duplicate]

I have a csv file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark dataset and taking the first 500 rows from it. The code is as follows:
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set = df.load("s3n://"+this.accessId.replace("\"", "")+":"+this.accessToken.replace("\"", "")+"@"+this.bucketName.replace("\"", "")+"/"+this.filePath.replace("\"", "")+"");
set.take(500);
The whole operation takes 20 to 30 seconds.
Now I am trying the same thing, but instead of the csv I am using a MySQL table with 119,000 rows. The MySQL server is on Amazon EC2. The code is as follows:
String url ="jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;
SparkSession spark=StartSpark.getSparkSession();
SQLContext sc = spark.sqlContext();
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set = sc
    .read()
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver", "com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();
set.take(500);
This is taking 5 to 10 minutes.
I am running Spark inside a JVM, using the same configuration in both cases.
I could use partitionColumn, numPartitions, etc., but I don't have any numeric column, and a further issue is that the schema of the table is unknown to me.
My issue is not how to decrease the required time, since I know that in the ideal case Spark will run in a cluster; what I cannot understand is why there is such a big time difference in the above two cases.
This problem has been covered multiple times on StackOverflow:
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
spark jdbc df limit... what is it doing?
How to use JDBC source to write and read data in (Py)Spark?
and in external sources:
https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads
So just to reiterate: by default DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread and a single executor.
To distribute reads:
use ranges with lowerBound / upperBound:
Properties properties;
Dataset<Row> set = sc
    .read()
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver", "com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();
use predicates:
Properties properties = new Properties(); // JDBC connection properties (user, password, driver, ...)
Dataset<Row> set = sc
    .read()
    .jdbc(
        url, this.tableName,
        new String[]{"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
        properties
    );
Please follow the steps below.
1. Download a copy of the JDBC connector for MySQL. I believe you already have one.
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar
2. Create a db-properties.flat file in the format below:
jdbcUrl=jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}
user=<username>
password=<password>
3. Create an empty table first, where you want to load the data.
Invoke the Spark shell with the driver class path:
spark-shell --driver-class-path <your path to mysql jar>
Then import all the required packages:
import java.io.{File, FileInputStream}
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
Initiate a HiveContext or an SQLContext:
val sQLContext = new HiveContext(sc)
import sQLContext.implicits._
import sQLContext.sql
Set some of the properties:
sQLContext.setConf("hive.exec.dynamic.partition", "true")
sQLContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Load the MySQL DB properties from the file:
val dbProperties = new Properties()
dbProperties.load(new FileInputStream(new File("your_path_to/db-properties.flat")))
val jdbcurl = dbProperties.getProperty("jdbcUrl")
Create a query to read the data from your table and pass it to the read method of the SQLContext; this is where you can manage your WHERE clause:
val df1 = "(SELECT * FROM your_table_name) as s1"
Pass the jdbcurl, the select query, and the DB properties to the read method:
val df2 = sQLContext.read.jdbc(jdbcurl, df1, dbProperties)
Write it to your table:
df2.write.format("orc").partitionBy("your_partition_column_name").mode(SaveMode.Append).saveAsTable("your_target_table_name")