"Unable to infer schema for JSON." error in PySpark? - json

I have a json file with about 1,200,000 records.
I want to read this file with pyspark as :
spark.read.option("multiline","true").json('file.json')
But it causes this error:
AnalysisException: Unable to infer schema for JSON. It must be specified manually.
When I create a JSON file containing a smaller subset of the records from the main file, this code can read it.
I can read the full JSON file with pandas when I set the encoding to utf-8-sig:
pd.read_json("file.json", encoding = 'utf-8-sig')
How can I solve this problem?

Try this out:
spark.read.option("multiline","true").option("inferSchema", "true").json('file.json')

Since adding the encoding helps, maybe the following is what you need:
spark.read.json("file.json", multiLine=True, encoding="utf8")

Related

Read an html table via URL with R

I'm trying to add this data https://datahub.io/core/country-list/r/0.html to R as a df, but can't find a function to read a table from an HTML file.
(the link to the git post about the data https://datahub.io/core/country-list#readme)
read_csv and read.table don't work. I also tried the XML package but got an error:
readHTMLTable("http://datahub.io/core/country-list/r/0.html", header = c(Country_Name, Code))
Error: failed to load external entity "http://datahub.io/core/country-list/r/0.html"
I got the same error using htmlParse.
I want to add this data to my df and use it in a project. Will appreciate your help!

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON that is returned, pass that back to a function to call the API again and get the next "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe, so essentially I'll have one big data frame with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then I'm guessing when a value actually appears in that field it changes to a boolean. The problem is that the unionByName then fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however reading a 758 MB JSON file would be a huge undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
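Alternatively, as the question itself suggests, you could pass that same schema straight to the reader so the troublesome field stays a string in every one of the 88 files. A sketch along those lines, assuming the downloaded file paths are collected in a list (the paths below are placeholders):

from functools import reduce

# Hypothetical list of the 88 downloaded files
json_paths = ["/dbfs/tmp/page_001.json", "/dbfs/tmp/page_002.json"]

# Every file is forced through the same schema, so no field can flip type
frames = [spark.read.schema(my_schema).json(p) for p in json_paths]
combined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)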
I'm not sure this is what you are looking for. I hope it helps.

Pyspark: write json from schema

I need to delete certain entries from nested JSON files. As far as I know, I can't just delete them from the JSON file directly, so my next choice would be to load them into a PySpark dataframe, delete the entries there, create a new JSON file with the same schema (and preferably the same name) and replace the old JSON file. I have extracted the schema into a JSON file; is there a way to write the dataframe back into a JSON file, somehow parsing the extracted schema?
Thanks!
Spark's DataFrameWriter also has a mode() method to specify the SaveMode; the argument to this method takes either one of the strings below or a constant from the SaveMode class.
overwrite – mode is used to overwrite the existing file, alternatively, you can use SaveMode.Overwrite.
append – To add the data to the existing file, alternatively, you can use SaveMode.Append.
ignore – Ignores write operation when the file already exists, alternatively you can use SaveMode.Ignore.
errorifexists or error – This is a default option when the file already exists, it returns an error, alternatively, you can use SaveMode.ErrorIfExists.
df2.write.mode("overwrite").json("/tmp/spark_output/zipcodes.json")
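For the part about parsing the extracted schema, PySpark can rebuild a StructType from a schema that was saved as JSON and apply it when reading the file back in. A sketch, assuming the schema was produced by df.schema.json() and that the paths and filter condition are placeholders:

import json
from pyspark.sql.types import StructType

# Rebuild the schema from the previously extracted JSON description
with open("/tmp/schema.json") as f:
    saved_schema = StructType.fromJson(json.load(f))

df = spark.read.schema(saved_schema).json("/tmp/input.json")
cleaned = df.filter("id != 42")                       # hypothetical delete condition
cleaned.write.mode("overwrite").json("/tmp/output")   # write to a separate path, then replace the old file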

Python - How to update a value in a json file?

I hate json files. They are unwieldy and hard to handle :( Please tell me why the following doesn't work:
with open('data.json', 'r+') as file_object:
    data = json.load(file_object)[user_1]['balance']
    am_nt = 5
    data += int(am_nt['amount'])
    print(data)
    file_object[user_1]['balance'] = data
Through trial and error (and many print statements), I have discovered that it opens the file, goes to the correct place, and then actually adds the am_nt, but I can't make the original json file update. Please help me :( :( . I get:
2000
TypeError: '_io.TextIOWrapper' object is not subscriptable
JSON is fun to work with as it is similar to Python data structures.
The error is: object is not subscriptable
This error is for this line:
file_object[user_1]['balance'] = data
file_object is a file handle, not JSON/dictionary data that can be updated like that. Hence the error.
Try to read the JSON data:
data = json.load(file_object)
Then manipulate the data as a Python dictionary, and save the file.
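In other words, read the whole file into a dictionary, change the value there, and then write the dictionary back to the file. A minimal sketch, assuming the same data.json layout and user_1 key as in the question:

import json

with open('data.json', 'r') as file_object:
    data = json.load(file_object)              # data is now a plain dict

data[user_1]['balance'] += 5                   # update the nested value

with open('data.json', 'w') as file_object:
    json.dump(data, file_object, indent=4)     # write the updated dict back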

CSV parsing nested quotes

I am trying to parse a fairly complex CSV with Apache Spark's CSV reader, which internally relies on the Apache Commons library (https://github.com/databricks/spark-csv).
I tried different combinations of quoteMode and escape but could not get it to work, i.e. prevent the exceptions. Do you have any hints as to which parameters would support such a nested structure?
ERROR CsvRelation$: Exception while parsing line: "Gabriella's Song" From The Motion Picture "The Mission";
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
I know that sed could be used to pre-process the data. However, it would be great if this were handled within Spark, i.e. if no further pre-processing was needed. I did not find a way to specify a regex or anything similar.
The CSV file looks like:
"Gabriella's Song" From The Motion Picture "The Mission";
This is related to https://github.com/databricks/spark-csv/issues/295
Some more special fields like & or "Eccoli; attenti ben (Don Pasquale)" cause these problems. We will write our own CSV pre-processor for Apache Camel.
Try this, it worked very well for me -
HDFS file -
spark.read.option("WholeFile", true).option("delimiter", ",").csv(s"hdfs://{my-hdfs-file-path}")
Non-HDFS file -
spark.read.option("WholeFile", true).option("delimiter", ",").csv(my-hdfs-file-path)
Above approach works for any delimited file, just change the delimiter value.
You can also use a regex, but that will be very inefficient for large files.
Hope this is helpful.
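If you would rather stay inside Spark than build a pre-processor, one workaround worth trying is to switch off quote handling entirely, since the quotes in these lines are part of the data rather than field delimiters. A sketch, assuming a recent Spark version with the built-in CSV reader, a ";" delimiter as in the sample line, and a placeholder path:

# Setting quote to an empty string turns off quote handling for reads,
# so the embedded quotes pass through as ordinary characters.
df = (spark.read
      .option("delimiter", ";")
      .option("quote", "")
      .csv("/path/to/file.csv"))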