Use a pre-defined schema when reading JSON in PySpark

Currently, if I want to read JSON with PySpark, I either use the inferred schema or have to define my schema manually as a StructType.
Is it possible to use a file as a reference for the schema?

You can indeed use a file to define your schema. For example, for the following schema:
TICKET:string
TRANSFERRED:string
ACCOUNT:integer
you can use this code to import it:
import csv
from collections import OrderedDict
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Read the schema file (one "NAME:type" pair per line) into an ordered mapping
schema = OrderedDict()
with open(r'schema.txt') as csvfile:
    schemareader = csv.reader(csvfile, delimiter=':')
    for row in schemareader:
        schema[row[0]] = row[1]
and then you can use it to create your StructType schema on the fly:
mapping = {"string": StringType, "integer": IntegerType}
schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in schema.items()
])
You may have to create a more complex schema file for nested JSON. Note, however, that you cannot use a JSON file to define the schema itself, because the order of the keys is not guaranteed when parsing JSON.
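As an illustrative sketch (not part of the original answer), assuming the data lives in a newline-delimited file called data.json, the schema built above can then be applied when reading:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# 'data.json' is a placeholder path; 'schema' is the StructType constructed above
df = spark.read.schema(schema).json("data.json")
df.printSchema()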

Related

How can I create an empty dataset using sparkcontext in Code Workbook in Palantir Foundry?

How do I create a bare minimum PySpark DataFrame in a Palantir Foundry Code Workbook?
To do this in a Code Repository I'd use:
my_df = ctx.spark_session.createDataFrame([('1',)], ["a"])
Code Workbook injects a global spark as the Spark session, rather than providing a transform context in ctx. You can use it in a Python transform ('New Transform' > 'Python Code'):
def my_dataframe():
    return spark.createDataFrame([('1',)], ["a"])
Or with a defined schema:
from pyspark.sql import types as T
from datetime import datetime
SCHEMA = T.StructType([
    T.StructField('entity_name', T.StringType()),
    T.StructField('thing_value', T.IntegerType()),
    T.StructField('created_at', T.TimestampType()),
])

def my_dataframe():
    return spark.createDataFrame([("Name", 3, datetime.now())], SCHEMA)

Is it possible to create a dataframe column with json data which doesn't have a fixed schema?

I am trying to create a DataFrame column with JSON data which does not have a fixed schema. I am trying to write it in its original form as a map/object, but I keep getting various errors.
I don't want to convert it to a string, as I need to write this data to the file in its original form.
This file is later used for JSON processing, so the original structure should not be compromised.
Currently, when I try writing the data to a file, it contains all the escape characters and the entire JSON is treated as a string instead of a complex type, e.g.:
{"field1":"d1","field2":"app","value":"{\"data\":\"{\\\"app\\\":\\\"am\\\"}\"}"}
You could try to make up a schema for the JSON file.
I don't know exactly what output you expect, so as a clue here is an example and two useful links:
spark-read-json-with-schema
spark-schema-explained-with-examples
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object RareJson {
  val spark = SparkSession
    .builder()
    .appName("RareJson")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "RareJson") // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  val input = "/home/cloudera/files/tests/rare.json"

  def main(args: Array[String]): Unit = {
    Logger.getRootLogger.setLevel(Level.ERROR)
    try {
      val structureSchema = new StructType()
        .add("field1", StringType)
        .add("field2", StringType)
        .add("value", StringType, true)

      val rareJson = sqlContext
        .read
        .option("allowBackslashEscapingAnyCharacter", true)
        .option("allowUnquotedFieldNames", true)
        .option("multiLine", true)
        .option("mode", "DROPMALFORMED")
        .schema(structureSchema)
        .json(input)

      rareJson.show(truncate = false)

      // To have the opportunity to view the web console of Spark: http://localhost:4041/
      println("Type whatever to the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}
Output:
+------+------+---------------------------+
|field1|field2|value |
+------+------+---------------------------+
|d1 |app |{"data":"{\"app\":\"am\"}"}|
+------+------+---------------------------+
You can also try to parse the value column, provided it keeps the same format across all rows.
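As a rough sketch (not part of the original answer), the nested value column could then be unpacked in PySpark with from_json, assuming every row's value holds JSON of the shape {"data": "<escaped JSON string>"}:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Sample row mirroring the output above; the value column is a JSON string
df = spark.createDataFrame(
    [("d1", "app", '{"data":"{\\"app\\":\\"am\\"}"}')],
    ["field1", "field2", "value"],
)
# First pass: extract the "data" field, which is itself an escaped JSON string
outer = df.withColumn("parsed", F.from_json("value", "data STRING"))
# Second pass: parse that inner string and pull out its "app" field
inner = outer.withColumn("app_struct", F.from_json("parsed.data", "app STRING"))
inner.select("field1", "field2", "app_struct.app").show(truncate=False)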

How to export all data from Elastic Search Index to file in JSON format with _id field specified?

I'm new to both Spark and Scala. I'm trying to read all data from a particular index in Elasticsearch into an RDD and use this data to write to MongoDB.
I'm loading the Elasticsearch data into an esJsonRDD, and when I try to print the RDD contents, it is in the following format:
(1765770532{"FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"})
Expected format:
{_id:"1765770532","FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
How can I get the output from Elasticsearch formatted this way?
Any help would be appreciated.
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark.rdd.EsSpark

object readFromES {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("readFromES")
      .set("es.nodes", Config.ES_NODES)
      .set("es.nodes.wan.only", Config.ES_NODES_WAN_ONLY)
      .set("es.net.http.auth.user", Config.ES_NET_HTTP_AUTH_USER)
      .set("es.net.http.auth.pass", Config.ES_NET_HTTP_AUTH_PASS)
      .set("es.net.ssl", Config.ES_NET_SSL)
      .set("es.output.json", "true")
    val sc = new SparkContext(conf)
    val RDD = EsSpark.esJsonRDD(sc, "userdata/user")
    //RDD.coalesce(1).saveAsTextFile(args(0))
    RDD.take(5).foreach(println)
  }
}
I would like the RDD output to be written to a file in the following JSON format (one line per doc):
{_id:"1765770532","FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
{_id:"1765770533","FirstName":DEF,"LastName":"DEF",Zipcode":"35525","City":"PortWinchestor","StateCode":"AI"}
"_id" is a part of metadata, to access it you should add .config("es.read.metadata", true) to config.
Then you can access it two ways, You can use
val RDD = EsSpark.esJsonRDD(sc, "userdata/user")
and manually add the _id field in json
Or easier way is to read as a dataframe
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("userdata/user")
  .withColumn("_id", $"_metadata".getItem("_id"))
  .drop("_metadata")

// Write as JSON to a file
df.write.json("output folder ")
Here, spark is the Spark session, created as:
val spark = SparkSession.builder().master("local[*]").appName("Test")
  .config("spark.es.nodes", "host")
  .config("spark.es.port", "ports")
  .config("spark.es.nodes.wan.only", "true")
  .config("es.read.metadata", true) // for enabling metadata
  .getOrCreate()
Hope this helps
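For readers working in PySpark, a rough equivalent of the DataFrame approach above might look like the sketch below (same assumed index userdata/user and elasticsearch-hadoop connector options taken from the answer; the output path is a placeholder):
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("Test")
         .config("spark.es.nodes", "host")
         .config("spark.es.port", "ports")
         .config("spark.es.nodes.wan.only", "true")
         .config("es.read.metadata", "true")  # expose _metadata, including _id
         .getOrCreate())

df = (spark.read.format("org.elasticsearch.spark.sql")
      .load("userdata/user")
      .withColumn("_id", F.col("_metadata").getItem("_id"))
      .drop("_metadata"))

df.write.json("output_folder")  # hypothetical output path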

Not able to read streaming files using Spark structured streaming

I have a set of CSV files which need to be read through Spark Structured Streaming. After creating a DataFrame I need to load it into a Hive table.
When a file is already present before running the code through spark-submit, the data is loaded into Hive successfully. But when I add new CSV files at runtime, nothing is inserted into Hive.
Code is:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
val spark = SparkSession.builder()
  .appName("Spark SQL Example")
  .config("hive.metastore.uris", "thrift://hostname:port")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._
val df = spark.readStream.option("header", true).csv("file:///folder path/")
val query = df.writeStream.queryName("tab").format("memory").outputMode(OutputMode.Append()).start()
spark.sql("insert into hivetab select * from tab").show()
query.awaitTermination()
Am I missing out something here?
Any suggestions would be helpful.
Thanks

How to save JSON data fetched from URL in PySpark?

I have fetched some JSON data from an API:
import urllib2
test=urllib2.urlopen('url')
print test
How can I save it as a table or data frame? I am using Spark 2.0.
This is how I succeeded in importing JSON data from the web into a DataFrame:
from pyspark.sql import SparkSession
from urllib.request import urlopen

spark = SparkSession.builder.getOrCreate()
url = 'https://web.url'
# Read the raw JSON payload, wrap it in a one-element RDD, and let Spark parse it
jsonData = urlopen(url).read().decode('utf-8')
rdd = spark.sparkContext.parallelize([jsonData])
df = spark.read.json(rdd)
For this you can do some research and try using sqlContext. Here is sample code:
>>> df2 = sqlContext.jsonRDD(test)
>>> df2.first()
Also check the docs for more details here:
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
Adding to Rakesh Kumar's answer, the way to do it in Spark 2.0 is:
http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#data-sources
As an example, the following creates a DataFrame based on the content of a JSON file:
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
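If the source really is a regular multi-line JSON document rather than JSON Lines, one option (a sketch, assuming Spark 2.2+ and a placeholder file path) is the multiLine reader option, which the Scala example earlier also uses:
# 'examples/multiline.json' is a placeholder path; multiLine parses one JSON document spanning several lines
df = spark.read.option("multiLine", True).json("examples/multiline.json")
df.show()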
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Project").getOrCreate()
# Click on "Raw" on GitHub and then copy the URL
zip_url = "https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json"
spark.sparkContext.addFile(zip_url)
zip_df = spark.read.json("file://" + SparkFiles.get("zipcodes.json"))