Writing dataframe to CSV - Spark 1.6

I am trying to write a PySpark dataframe to CSV. I have Spark 1.6, and I have tried lines such as df.write.format('com.intelli.spark.csv').save('mycsv.csv') and df.write.format('com.databricks.spark.csv').save(PATH).
These always give an error along the lines of java.lang.ClassNotFoundException: Failed to find data source: com.intelli.spark.csv. Please find packages at http://spark-packages.org.
I have tried downloading spark-csv_2.10-0.1.jar and passing it in the --jars argument of spark-submit, but that also leads to a similar error. I have also tried spark-shell --packages com.databricks:spark-csv_2.10:1.5.0, but it gives server access errors.

Try this way. In Spark 1.6, you will have to convert the dataframe to an RDD and write that:
def toCSVLine(data):
    return ','.join(str(d) for d in data)

rdd1 = df.rdd.map(toCSVLine)
rdd1.saveAsTextFile('output_dir')
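If you also need a header row in the output, one option is to prepend the column names before saving. A minimal sketch, assuming df and sc are already defined (the output path is illustrative):

header = sc.parallelize([','.join(df.columns)])  # single-line RDD holding the header
rdd_with_header = header.union(df.rdd.map(toCSVLine))
rdd_with_header.saveAsTextFile('output_dir_with_header')  # header ends up in the first part file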
Edit: try adding this to your Spark code after passing the --py-files argument:
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")

Related

"Unable to infer schema for JSON." error in PySpark?

I have a json file with about 1,200,000 records.
I want to read this file with PySpark as:
spark.read.option("multiline","true").json('file.json')
But it causes this error:
AnalysisException: Unable to infer schema for JSON. It must be specified manually.
When I create a JSON file with a smaller number of records from the main file, this code can read it.
I can read this JSON file with pandas when I set the encoding to utf-8-sig:
pd.read_json("file.json", encoding = 'utf-8-sig')
How can I solve this problem?
Try this out:
spark.read.option("multiline","true").option("inferSchema", "true").json('file.json')
Since adding the encoding helps, maybe the following is what you need:
spark.read.json("file.json", multiLine=True, encoding="utf8")

How to compare two JSON responses on Robot Framework?

I'm trying to find a library that contains a keyword to help me with that and didn't succeed.
What I'm doing at the moment is converting each JSON response to a dictionary and then comparing the dictionaries, but I hate it.
I was trying to find similar libraries and found this Python code, but I don't know how to make this function work for me.
import re
import json as JSON
import jsondiff

def _verify_json_file(self, result, exp):
    '''
    Verifies if two json files are different
    '''
    # ID and ID_REP are assumed to be a regex pattern and its replacement
    # defined elsewhere in the original module.
    with open(exp) as json_data:
        data = re.sub(ID, ID_REP, json_data.read())
    expected = JSON.loads(data)
    differences = jsondiff.diff(expected, result, syntax='explicit')
    if not differences:
        return True
    if differences == expected or differences == result:
        raise AssertionError("ERROR! Jsons have different structure")
    return False
APPROACH#0
To make the above function work for you, create a Python file, put the function in it, make sure the file is on the PYTHONPATH, and use it in your Robot code by calling it in the Settings section with the Library keyword (a minimal sketch of such a file follows the steps below). I have answered this question in detail, with all the steps, in this link.
Create a Python file (comparejsons.py) with the above code in it
Keep the above python file in PYTHONPATH
Use Library comparejsons.py under settings section in your robot file
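A minimal sketch of what comparejsons.py could contain, assuming the jsondiff package is installed (the keyword name compare_jsons is an illustration, not part of the original answer; Robot exposes it as "Compare Jsons"):

# comparejsons.py -- minimal sketch of a custom comparison keyword
import json
import jsondiff

def compare_jsons(json_str_1, json_str_2):
    """Raise AssertionError if the two JSON strings do not contain the same data."""
    first = json.loads(json_str_1)
    second = json.loads(json_str_2)
    differences = jsondiff.diff(first, second, syntax='explicit')
    if differences:
        raise AssertionError("JSON responses differ: %s" % differences)
    return True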
APPROACH#1
You should create a custom keyword which makes use of the below library and then compare the 2 jsons.
You can make use of "robotframework-jsonvalidator" module
Sample code below,
*** Settings ***
Library           JsonValidator
Library           OperatingSystem

*** Test Cases ***
Check Element
    ${json_example}=    OperatingSystem.Get File    ${CURDIR}${/}json_example.json
    Element should exist    ${json_example}    .author:contains("Evelyn Waugh")
APPROACH#2
After converting the JSON to a dictionary, you can just make use of the built-in keyword below; here, values=True is important.
Dictionaries Should Be Equal    ${dict1}    ${dict2}    values=True

Extracting JSON object from BigQuery Client in AWS Lambda using Python

I am running a SQL query via google.cloud.bigquery.Client.query in AWS Lambda (Python 2.7 runtime). The native BQ object extracted from a query is a BigQuery Row(), i.e.,
Row((u'exampleEmail#gmail.com', u'XXX1234XXX'), {u'email': 0, u'email_id': 1})
I need to convert this to JSON, i.e.,
[{'email_id': 'XXX1234XXX', 'email': 'exampleEmail#gmail.com'}]
When running locally, I am able to just call the Python dict function on the row to transform it, i.e.,
queryJob = bigquery.Client.query(sql)
list = []
for row in queryJob.result():
    # at this point row = the BQ sample Row object shown above
    tmp = dict(row)
    list.append(tmp)
but when I load this into AWS Lambda it throws the error:
ValueError: dictionary update sequence element #0 has length 22; 2 is required
I have tried forcing it in different ways, breaking it out into sections etc but cannot get this into the JSON format desired.
I took a brief dive into the rabbit hole of transforming the QueryJob into a Pandas dataframe and then from there into a JSON object, which also works locally but runs into numpy package errors in AWS Lambda, which seems to be a known issue.
I feel like this should have an easy solution but just haven't found it yet.
Try doing it like this:
L = []
sql = "..."  # sql_statement placeholder
query_job = client.query(sql)  # API request
query_job.result()

for row in query_job:
    email_id = row.get('email_id')
    email = row.get('email')
    L.append([email_id, email])
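Note that this builds a list of lists; to get the list of dicts shown in the question, a small variation would work (a sketch, assuming the same row objects):

import json

L = []
for row in query_job:
    L.append({'email': row.get('email'), 'email_id': row.get('email_id')})
payload = json.dumps(L)  # JSON string in the desired shape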

Read multiple JSON schemas with Spark

Software Configuration:
Hadoop distribution: Amazon 2.8.3
Applications: Hive 2.3.2, Pig 0.17.0, Hue 4.1.0, Spark 2.3.0
I tried to read JSON files with multiple schemas:
val df = spark.read.option("mergeSchema",
"true").json("s3a://s3bucket/2018/01/01/*")
It throws this error:
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:206)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:340)
How can I read JSON with multiple schemas in Spark?
This sometimes happens when you are pointing to the wrong path (i.e., the data does not exist at that location).
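A quick way to confirm that the prefix actually contains data is to read it as plain text first; a minimal PySpark sketch (the path is the one from the question):

# Sketch: check that the glob matches files before trying to infer a JSON schema.
raw = spark.read.text("s3a://s3bucket/2018/01/01/*")
print(raw.count())  # 0, or a "Path does not exist" error, suggests the path/pattern is wrong or empty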

Spark exception handling for json

I am trying to catch/ignore a parsing error when I'm reading a json file
val DF = sqlContext.jsonFile("file")
There are a couple of lines that aren't valid json objects, but the data is too large to go through individually (~1TB)
I've come across exception handling for mapping using import scala.util.Try and in.map(a => Try(a.toInt)), referencing:
how to handle the Exception in spark map() function?
How would I catch an exception when reading a json file with the function sqlContext.jsonFile?
Thanks!
Unfortunately, you are out of luck here. DataFrameReader.json, which is used under the hood, is pretty much all-or-nothing. If your input contains malformed lines, you have to filter them out manually. A basic solution could look like this:
import scala.util.parsing.json._

val df = sqlContext.read.json(
  sc.textFile("file").filter(JSON.parseFull(_).isDefined)
)
Since the above validation is rather expensive, you may prefer to drop jsonFile / read.json entirely and use the parsed JSON lines directly.
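For reference, the same pre-filtering idea in PySpark, as a sketch (assumes line-delimited JSON, with sc and sqlContext set up as in the question):

import json

def is_valid_json(line):
    # json.loads raises ValueError (JSONDecodeError) on malformed input
    try:
        json.loads(line)
        return True
    except ValueError:
        return False

valid_lines = sc.textFile("file").filter(is_valid_json)
df = sqlContext.read.json(valid_lines)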