Not able to read streaming files using Spark structured streaming - csv

I have a set of CSV files that need to be read through Spark Structured Streaming. After creating a DataFrame, I need to load it into a Hive table.
When a file is already present before running the code through spark-submit, the data is loaded into Hive successfully. But when I add new CSV files at runtime, they are not inserted into Hive at all.
The code is:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
val spark = SparkSession.builder()
  .appName("Spark SQL Example")
  .config("hive.metastore.uris", "thrift://hostname:port")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._
val df = spark.readStream.option("header", true).csv("file:///folder path/")
val query = df.writeStream.queryName("tab").format("memory").outputMode(OutputMode.Append()).start()
spark.sql("insert into hivetab select * from tab").show()
query.awaitTermination()
Am I missing something here?
Any suggestions would be helpful.
Thanks
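A likely reason new files never reach Hive (offered as a hedged suggestion, not a verified fix): the insert into hivetab select * from tab statement runs exactly once when the job starts, while the memory sink keeps appending later micro-batches only to the in-memory tab table. On Spark 2.4 or later, writing each micro-batch into Hive with foreachBatch sidesteps the one-shot insert; df and hivetab are the names from the code above:
// A minimal sketch, assuming Spark 2.4+ and an existing Hive table named hivetab.
val writeToHive = (batchDF: org.apache.spark.sql.DataFrame, batchId: Long) => {
  // Append each incoming micro-batch to the Hive table instead of a one-time SQL insert.
  batchDF.write.mode("append").insertInto("hivetab")
}

val query = df.writeStream
  .outputMode(OutputMode.Append())
  .foreachBatch(writeToHive)
  .start()

query.awaitTermination()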

Related

Get the schema of a parquet file with an Azure Function in Python

Is it possible to get the schema of a parquet file using an Azure Function in Python without downloading the file from the data lake? I am using BlobStorageClient to connect to the data lake and get the files and containers, but I have no idea how to dispatch the command using, for example, pyarrow.
About pyarrow: https://arrow.apache.org/docs/python/parquet.html
BlobStorageClient: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python-legacy
Regarding the issue, please refer to the following script:
import pyarrow.parquet as pq
import io
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(conn_str)
container_client = blob_service_client.get_container_client('test')
blob_client = container_client.get_blob_client('test.parquet')
with io.BytesIO() as f:
    download_stream = blob_client.download_blob(0)
    download_stream.readinto(f)
    schema = pq.read_schema(f)
    print(schema)
It is possible to read both the parquet schema and the parquet metadata without reading the data content, using read_schema and read_metadata:
import pyarrow.parquet as pq
fname = 'filename.parquet'
meta = pq.read_metadata(fname)
schema = pq.read_schema(fname)

use pre-defined schema in pyspark json

Currently, if I want to read JSON with pyspark, I either use the inferred schema or I have to define my schema manually as a StructType.
Is it possible to use a file as a reference for the schema?
You can indeed use a file to define your schema. For example, for the following schema:
TICKET:string
TRANSFERRED:string
ACCOUNT:integer
you can use this code to import it:
import csv
from collections import OrderedDict
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
schema = OrderedDict()
with open(r'schema.txt') as csvfile:
    schemareader = csv.reader(csvfile, delimiter=':')
    for row in schemareader:
        schema[row[0]] = row[1]
and then you can use it to create your StructType schema on the fly:
mapping = {"string": StringType, "integer": IntegerType}
schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in schema.items()])
You may have to create a more complex schema file for a JSON source; note, however, that you can't use a JSON file to define your schema, as the order of the columns is not guaranteed when parsing JSON.

how to load data from csv to mysql database in Spark?

I would like to load data from CSV to MySQL as a batch. But I could only find tutorials/logic for inserting data from CSV into a Hive database. Could anyone kindly help me achieve this integration in Spark using Scala?
There is a reason why those tutorials don't exist. This task is very straightforward. Here is a minimal working example:
val dbStr = "jdbc:mysql://[host1][:port1][,[host2][:port2]]...[/[database]]"
spark
.read
.format("csv")
.option("header", "true")
.load("some/path/to/file.csv")
.write
.mode("overwrite")
.jdbc(dbStr, tablename, props)
Create the DataFrame by reading the CSV with the Spark session, then write it using the jdbc method with the MySQL connection properties:
import java.util.Properties

val url = "jdbc:mysql://[host][:port][/[database]]"
val table = "mytable"
val property = new Properties()
spark
.read
.csv("some/path/to/file.csv")
.write
.jdbc(url, table, property)
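In both snippets above, the JDBC connection properties (props in the first, the empty property in the second) are left unspecified. As a rough sketch with placeholder credentials and a driver class that is not part of the original answers, they would typically carry at least the user, password, and driver:
import java.util.Properties

val props = new Properties()
props.setProperty("user", "myuser")         // placeholder username
props.setProperty("password", "mypassword") // placeholder password
// Classic MySQL Connector/J driver class; recent Connector/J versions use com.mysql.cj.jdbc.Driver.
props.setProperty("driver", "com.mysql.jdbc.Driver")
The MySQL Connector/J jar also needs to be on the Spark classpath (for example via spark-submit --jars) for the jdbc write to work.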

Converting DataSet to Json Array Spark using Scala

I am new to Spark and unable to figure out a solution to the following problem.
I have a JSON file to parse, then I create a couple of metrics and write the data back in JSON format.
The following is the code I am using:
import org.apache.spark.sql._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions._

object quick2 {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder
      .appName("quick1")
      .master("local[*]")
      .getOrCreate()

    val rawData = spark.read.json("/home/umesh/Documents/Demo2/src/main/resources/sampleQuick.json")

    val mat1 = rawData.select(rawData("mal_name"), rawData("cust_id")).distinct().orderBy("cust_id").toJSON.cache()
    val mat2 = rawData.select(rawData("file_md5"), rawData("mal_name")).distinct().orderBy(asc("file_md5")).toJSON.cache()

    val write1 = mat1.coalesce(1).toJavaRDD.saveAsTextFile("/home/umesh/Documents/Demo2/src/test/mat1/")
    val write = mat2.coalesce(1).toJavaRDD.saveAsTextFile("/home/umesh/Documents/Demo2/src/test/mat2/")
  }
}
Now the above code is writing the proper JSON format. However, the matrices can contain duplicate results as well.
Example:
md5 mal_name
1 a
1 b
2 c
3 d
3 e
So with the above code, every object gets written on a single line, like this:
{"file_md5":"1","mal_name":"a"}
{"file_md5":"1","mal_name":"b"}
{"file_md5":"2","mal_name":"c"}
{"file_md5":"3","mal_name":"d"}
and so on.
But I want to combine the data for common keys, so the output should be:
{"file_md5":"1","mal_name":["a","b"]}
Can somebody please suggest what I should do here, or whether there is a better way to approach this problem?
Thanks!
You can use collect_list or collect_set on the mal_name column, as per your need (collect_list keeps duplicates, collect_set drops them).
You can save the DataFrame/Dataset directly as a JSON file:
import org.apache.spark.sql.functions.collect_set
import spark.implicits._

rawData.groupBy($"file_md5")
  .agg(collect_set($"mal_name").alias("mal_name"))
  .write
  .format("json")
  .save("json/file/location/to/save")
As written by @mrsrinivas, I changed my code as below:
val mat2 = rawData.select(rawData("file_md5"),rawData("mal_name")).distinct().orderBy(asc("file_md5")).cache()
val labeledDf = mat2.toDF("file_md5","mal_name")
labeledDf.groupBy($"file_md5")
  .agg(collect_list($"mal_name"))
  .coalesce(1)
  .write
  .format("json")
  .save("/home/umesh/Documents/Demo2/src/test/run8/")
Keeping this question open for more suggestions, if any.
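A small follow-up on the modified code above (a suggestion, not part of the original thread): without an alias, the aggregated column is written as collect_list(mal_name) in the resulting JSON rather than mal_name. Adding an alias keeps the key from the desired output:
import org.apache.spark.sql.functions.collect_list

labeledDf.groupBy($"file_md5")
  .agg(collect_list($"mal_name").alias("mal_name")) // alias keeps the JSON key as "mal_name"
  .coalesce(1)
  .write
  .format("json")
  .save("/home/umesh/Documents/Demo2/src/test/run8/")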

How to save JSON data fetched from URL in PySpark?

I have fetched some .json data from an API.
import urllib2
test=urllib2.urlopen('url')
print test
How can I save it as a table or data frame? I am using Spark 2.0.
This is how I succeeded in importing .json data from the web into a DataFrame:
from pyspark.sql import SparkSession, functions as F
from urllib.request import urlopen
spark = SparkSession.builder.getOrCreate()
url = 'https://web.url'
jsonData = urlopen(url).read().decode('utf-8')
rdd = spark.sparkContext.parallelize([jsonData])
df = spark.read.json(rdd)
For this, you can do some research and try using sqlContext. Here is sample code:
>>> df2 = sqlContext.jsonRDD(test)
>>> df2.first()
Moreover, visit the link below and check for more details:
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
Adding to Rakesh Kumar's answer, the way to do it in Spark 2.0 is:
http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#data-sources
As an example, the following creates a DataFrame based on the content of a JSON file:
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Project").getOrCreate()

# Click on "Raw" and then copy the URL.
zip_url = "https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json"
spark.sparkContext.addFile(zip_url)

zip_df = spark.read.json("file://" + SparkFiles.get("zipcodes.json"))