Spark read file into a dataframe - json

I get a corrupt record when I try to read the below file in.
I am trying to use sqlContext.read.json(file location) but get _corrupt_record: string. Could someone help me out? I have added the head of the dataset below for the file that I am trying to read in.
Any assistance appreciated.

To read multiline JSON, you need to pass the option multiLine=True:
df = spark.read.json('/path/to/json', multiLine=True)
You should also consider using the SparkSession to read JSON, instead of the deprecated SQLContext.
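If you don't already have a spark variable in scope, a minimal sketch of the SparkSession route looks like this (the app name and file path are placeholders):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession, the modern entry point that replaces SQLContext.
spark = SparkSession.builder.appName("read-json").getOrCreate()

# multiLine=True tells Spark that a single JSON record may span several lines.
df = spark.read.option("multiLine", True).json("/path/to/json")
df.printSchema()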

For anyone who wants to do it in Scala, you can do it as below:
val df = spark.read.option("multiline", true).json("/path/to/json")

val DB_DETAILS_FILE_PATH = "file:///C:/Users/sshashank/Desktop/db_details.json"
var dbDetailsDF = spark.read
.option("multiline", "true")
.json(DB_DETAILS_FILE_PATH)

Related

Reading JSON Array with Apache Spark

I have a json array file, which looks like this:
["{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}",{"meta":{"headers":{"app":"music"},"customerId":"2"}}]
I am trying to read this file in Scala through the spark-shell:
val s1 = spark.read.json("path/to/file/file.json")
However, this results in a corrupt_record error:
org.apache.spark.sql.DataFrame = [_corrupt_record: string]
I have also tried reading it like this:
val df = spark.read.json(spark.sparkContext.wholeTextFiles("path.json").values)
val df = spark.read.option("multiline", "true").json("<file>")
But still the same error.
Since the json array contains a mix of strings and json objects, maybe that is why I am not able to read it.
Can anyone shed some light on this error? How can we read it via spark udf?
Yes, the reason is the mix of plain text and actual JSON objects. To me it looks as though the two entries belong together, so why not change the schema to something like this:
{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}
A new line also means a new record, so for multiple events your file would look like this:
{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}
{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"2\",\"events\":[{\"event_type\":\"ON\"}]}"}

Many errors when trying to load and work with json in python

I have tried using .read() and .decode("utf-8") but I just keep getting errors like this: 'TypeError: can only concatenate str (not "dict") to str'
from requests import get
import json
url = 'http://taco-randomizer.herokuapp.com/random/?full-taco=true'
requested_taco = get(url)
requested_taco_data = json.loads(requested_taco.read())
title = requested_taco_data['name']
Thank you in advance to anyone who is able to help me figure out how to get the json to become a dictionary in python.
There is no response.read() in Requests; instead you should use response.json(), like so:
taco = requested_taco.json()
print(taco['name'])
which would output:
'Black Bean, Potato, and Onion Tacos'
There is no need for the json library at all.
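Putting it together, a corrected version of the script from the question would be roughly:

from requests import get

url = 'http://taco-randomizer.herokuapp.com/random/?full-taco=true'
requested_taco = get(url)
requested_taco.raise_for_status()             # fail loudly on HTTP errors
requested_taco_data = requested_taco.json()   # parses the JSON body into a Python dict
title = requested_taco_data['name']
print(title)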

how to load data from csv to mysql database in Spark?

I would like to load data from a csv into MySQL as a batch, but I could only find tutorials/logic for inserting data from a csv into a Hive database. Could anyone kindly help me achieve this integration in Spark using Scala?
There is a reason why those tutorials don't exist: this task is very straightforward. Here is a minimal working example:
val dbStr = "jdbc:mysql://[host1][:port1][,[host2][:port2]]...[/[database]]"
val tablename = "mytable"                      // placeholder table name
val props = new java.util.Properties()
props.setProperty("user", "username")          // placeholder credentials
props.setProperty("password", "password")

spark
  .read
  .format("csv")
  .option("header", "true")
  .load("some/path/to/file.csv")
  .write
  .mode("overwrite")
  .jdbc(dbStr, tablename, props)
Create the dataframe by reading the CSV with the Spark session, then write it out using the jdbc method with MySQL connection properties:
import java.util.Properties

val url = "jdbc:mysql://[host][:port][/[database]]"
val table = "mytable"
val property = new Properties()
property.setProperty("user", "username")       // placeholder credentials
property.setProperty("password", "password")

spark
  .read
  .csv("some/path/to/file.csv")
  .write
  .jdbc(url, table, property)
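If you happen to be working in PySpark instead, the same write looks roughly like this (the connection details are placeholders, and the MySQL JDBC driver still needs to be on the classpath, for example via the --packages option of spark-submit):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-mysql").getOrCreate()

# Read the CSV with a header row into a DataFrame.
df = spark.read.option("header", "true").csv("some/path/to/file.csv")

# Write the DataFrame to MySQL through the generic JDBC data source.
(df.write
   .mode("overwrite")
   .format("jdbc")
   .option("url", "jdbc:mysql://host:3306/database")   # placeholder connection string
   .option("dbtable", "mytable")
   .option("user", "username")
   .option("password", "password")
   .option("driver", "com.mysql.cj.jdbc.Driver")
   .save())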

How to read a JSON object that I created in R into sparkR

I would like to take a dataframe I've created in R, turn it into a JSON object, and then read that JSON object into SparkR. With my current project, I can't just pass a dataframe into SparkR; I have to use this roundabout method to get my project to work. I also can't create a local JSON file first to read into SparkR, so I am trying to make a JSON object to hold my data and then read that into SparkR.
In other posts I read, Scala Spark has a function
sqlContext.read.json(anotherPeopleRDD)
That seems to do what I am trying to accomplish. Is there something similar for SparkR?
Here is the code I am working with right now:
.libPaths(c(.libPaths(), '/root/Spark1.6.2/spark-1.6.2-bin-hadoop2./R/lib'))
Sys.setenv(SPARK_HOME = '/root/Spark1.6.2/spark-1.6.2-bin-hadoop2.6')
Sys.setenv(R_HOME = '/root/R-3.4.1')
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
Sys.setenv("spark.r.command" = '/usr/bin')
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf.cloudera.yarn")
Sys.setenv(PATH = paste(Sys.getenv(c('PATH')), '/root/Spark1.6.2/spark1.6.2-bin-hadoop2.6/bin', sep=':'))
library(SparkR)
sparkR.stop()
sc <- sparkR.init(sparkEnvir = list(spark.shuffle.service.enabled=TRUE,spark.dynamicAllocation.enabled=TRUE, spark.dynamicAllocation.initialExecutors="2"), master = "yarn-client", appName = "SparkR")
sqlContext <- sparkRSQL.init(sc)
options(warn=-1)
n = 1000
x = data.frame(id = 1:n, val = rnorm(n))
library(RJSONIO)
exportJson <- toJSON(x)
testJsonData = read.json(sqlContext, exportJson) #fails
collect(testJsonData)
remove(sc)
remove(sqlContext)
sparkR.stop()
options(warn=0)
This is the error message I'm getting from read.json:
17/08/03 12:25:35 ERROR r.RBackendHandler: json on 2 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {
The solution to this problem was that the JSON files I was working with were not supported by the Spark read.json function, due to how RJSONIO formatted them. Instead I had to use another R library, jsonlite, to create my JSON, and now it works as intended.
This is how it looks when I create the JSON now:
library(jsonlite)
exportJson <- toJSON(x)
testJsonData = read.json(sqlContext, exportJson) # now works
collect(testJsonData)
I hope that helps anyone!

to.JSON() in Spark Streaming using pyspark

I used the toJSON() method to convert a DataFrame to an RDD of JSON documents within the transform() function of Spark Streaming.
I am using pyspark; my code looks like the following:
def process(rdd):
    rddDataframe = sqlContext.createDataFrame(rdd)
    rddback = rddDataframe.toJSON()
    return rdd

dstream_test = dstream_in.transform(lambda rdd: process(rdd))
But I got the following error:
UnpicklingError: invalid load key, '{'
Please help me figure out how to solve this.
Don't pass an rdd to a function; pass the function to your rdd.
Define your transformation for each row, then apply it:
def transform(row):
    ...  # per-row transformation goes here

your_rdd = your_rdd.map(transform)
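Following that advice, a rough sketch for the streaming case from the question (dstream_in and the shape of each row are assumptions carried over from the original code):

import json

def to_json(row):
    # Convert a pyspark Row to a plain dict before serializing; fall back to the
    # value itself if it is already a dict or another JSON-serializable object.
    return json.dumps(row.asDict() if hasattr(row, "asDict") else row)

# Map the per-row function over each batch RDD instead of building a DataFrame
# inside transform().
dstream_test = dstream_in.transform(lambda rdd: rdd.map(to_json))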