Import JSON content from CSV file using spark - json

Currently, I'm working with the following architecture.
I have a DocumentDB database whose data is exported to S3 using DMS (a CDC task); once this data lands on S3, I need to load it into Databricks.
I'm already able to read the CSV content (which contains a lot of JSON strings), but I don't know how to parse it and insert it into a Databricks table.
Below is the JSON payload that is exported to S3:
{
    "_id": {
        "$oid": "12332334"
    },
    "processed": false,
    "col1": "0000000122",
    "startedAt": {
        "$date": 1635667097000
    },
    "endedAt": {
        "$date": 1635667710000
    },
    "col2": "JFHFGALJF-DADAD",
    "col3": 2.75,
    "created_at": {
        "$date": 1635726018693
    },
    "updated_at": {
        "$date": 1635726018693
    },
    "__v": 0
}
To extract the data into a DataFrame I'm using the following Spark command:
df = spark.read \
    .option("header", "true") \
    .option("delimiter", "|") \
    .option("inferSchema", "false") \
    .option("lineSep", "\n") \
    .option("encoding", "ISO-8859-1") \
    .option("quote", '"') \
    .option("escape", '"') \
    .csv("dbfs:/mnt/s3-data-2/folder_name/LOAD00000001.csv")

Thank you Alex Ott; as per your suggestion and as per this document, you can use from_json to parse the JSON strings contained in the CSV file.
To read a JSON string from a CSV file, first read the CSV file into a Spark DataFrame using spark.read.csv("path"), then parse the JSON string column and expand it into columns with the from_json() function. from_json() takes the JSON column as its first argument and the JSON schema as its second.
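For example, a minimal sketch of this approach in PySpark, assuming the JSON payload sits in a single CSV column (the column name json_str and the target table name below are hypothetical) and using a schema derived from the sample payload above:
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               BooleanType, LongType, DoubleType, IntegerType)

# Schema matching the sample document exported by DMS (field names taken from the payload shown above).
payload_schema = StructType([
    StructField("_id", StructType([StructField("$oid", StringType())])),
    StructField("processed", BooleanType()),
    StructField("col1", StringType()),
    StructField("startedAt", StructType([StructField("$date", LongType())])),
    StructField("endedAt", StructType([StructField("$date", LongType())])),
    StructField("col2", StringType()),
    StructField("col3", DoubleType()),
    StructField("created_at", StructType([StructField("$date", LongType())])),
    StructField("updated_at", StructType([StructField("$date", LongType())])),
    StructField("__v", IntegerType()),
])

# Parse the JSON string column and expand the struct into top-level columns.
parsed = (df
          .withColumn("doc", F.from_json(F.col("json_str"), payload_schema))
          .select("doc.*"))

# Write the result to a Databricks (Delta) table; the table name is a placeholder.
parsed.write.mode("append").saveAsTable("my_database.my_table")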

Related

How to write multiline json using pyspark

We can easily read multiline JSON using the command below:
df = spark.read.option("multiline", "true").json("any multiline.json")
but we are not able to write multiline JSON as easily with the write command.
Example of multiline json
[{
    "RecordNumber": 2,
    "Zipcode": 704,
    "ZipCodeType": "STANDARD",
    "City": "PASEO COSTA DEL SUR",
    "State": "PR"
},
{
    "RecordNumber": 10,
    "Zipcode": 709,
    "ZipCodeType": "STANDARD",
    "City": "BDA SAN LUIS",
    "State": "PR"
}]
I tried the solution below, but it aggregates all the values and saves them in text format:
How to save a dataframe into a json file with multiline option in pyspark
Could you please suggest another solution that avoids the aggregation and saves the output directly with a .json extension?
Since Spark does not have an option to prettify the output JSON, you can convert the result to JSON strings using toJSON and then use the Python json library to save a properly indented JSON file.
For example:
import json

jsonResults = resultDF.toJSON().collect()
jsonResultList = [json.loads(x) for x in jsonResults]
with open('result.json', 'w', encoding='utf-8') as f:
    json.dump(jsonResultList, f, ensure_ascii=False, indent=4)

Json data as dict for cerberus validations

I am trying to read a JSON file as a string and use its contents as the validation schema for Cerberus. I am using a custom function with check_with. The schema works fine if I define it directly inside my Python test script.
abc.json
{
    "rows": {
        "type": "list",
        "schema": {
            "type": "dict",
            "schema": {
                "amt": {"type": "integer"},
                "amt": {"check_with": util_cls.amt_gt_than}
            }
        }
    }
}
python test code
with open("abc.json") as f:
s = f.read()
#s = ast.literal_eval(s)
v = Validator()
r = v.validate(json_data, s)
Cerberus requires the variable s to be a dict, and I can't convert the abc.json file contents with json.load because the file is not valid JSON. Any ideas on how to convert the string to a dict?
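A minimal sketch of one possible workaround (not from the original thread): keep the rule name as a plain string in abc.json so the file stays valid JSON, and implement the check as a method on a Validator subclass. The threshold and the sample document below are hypothetical.
import json
from cerberus import Validator

# Assumes abc.json is rewritten as valid JSON, e.g.:
#   "amt": {"type": "integer", "check_with": "amt_gt_than"}
class CustomValidator(Validator):
    def _check_with_amt_gt_than(self, field, value):
        # Hypothetical rule: the amount must be greater than 100.
        if value <= 100:
            self._error(field, "amt must be greater than 100")

with open("abc.json") as f:
    schema = json.load(f)  # parses now that the file is valid JSON

v = CustomValidator(schema)
document = {"rows": [{"amt": 150}]}  # sample document under test
print(v.validate(document), v.errors)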

Load JSON data into BigQuery table

I'm trying to load simple JSON data into a BigQuery table the following way:
$ bq load \
--apilog \
--source_format=NEWLINE_DELIMITED_JSON \
my_dataset.my_table \
./input.json ./schema.json
but get the following error message:
Upload complete.
Waiting on bqjob_xxxx_xxx ... (3s) Current status: DONE
BigQuery error in load operation: Error processing job 'my_project_id:bqjob_xxxx_xxx': CSV table encountered too many errors, giving up. Rows: 1; errors: 1.
Failure details:
- file-00000000: Error detected while parsing row starting at
position: 0. Error: Data between close double quote (") and field
separator.
It complains about a CSV error, but I'm trying to load JSON (--source_format=NEWLINE_DELIMITED_JSON).
My input.json contains this data:
{"domain":"stackoverflow.com","key":"hello","value":"world"}
My schema.json is the following:
[
    {
        "name": "domain",
        "type": "string",
        "mode": "nullable"
    },
    {
        "name": "key",
        "type": "string",
        "mode": "nullable"
    },
    {
        "name": "value",
        "type": "string",
        "mode": "nullable"
    }
]
bq version 2.0.25:
$ gcloud version | grep ^bq
bq 2.0.25
The problem here is that the flag apilog expects a string as input. This command should work for you:
bq load \
--apilog '' \
--source_format=NEWLINE_DELIMITED_JSON \
my_dataset.my_table \
./input.json ./schema.json
An empty string sends the output to stdout. If you want to save the log to a local file, you can pass a non-empty string instead, such as --apilog 'localfile_name'.
The bq command usage says:
USAGE: bq.py [--global_flags] <command> [--command_flags] [args]
As you can see, there are global_flags and command_flags.
For global_flags that take values, you need to use the equals sign:
--flag=value
The command_flags are either boolean:
--[no]replace
Or they take arguments that must follow the flag:
--source_format NEWLINE_DELIMITED_JSON
Also, do not mix global and command flags: apilog is a global flag.
I would rewrite your command as:
$ bq --apilog load \
--source_format NEWLINE_DELIMITED_JSON \
my_dataset.my_table \
./input.json ./schema.json

Read JSON files from multiple line file in spark scala

I'm learning Spark in Scala. I have a JSON file as follows:
[
    {
        "name": "ali",
        "age": "13",
        "phone": "09123455737",
        "sex": "m"
    },
    {
        "name": "amir",
        "age": "24",
        "phone": "09123475737",
        "sex": "m"
    }
]
and there is just this code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonFile = sqlContext.read.json("path-to-json-file")
I just receive a _corrupt_record: String column and nothing else,
but when I put every person (or object) on a single line, the code works fine.
How can I read a multi-line JSON file with sqlContext in Spark?
You will have to read it into an RDD yourself and then convert it to a Dataset:
spark.read.json(sparkContext.wholeTextFiles(...).values)
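For reference, a PySpark equivalent of the same idea (a sketch; the path below is hypothetical): wholeTextFiles reads each file as a single (path, content) pair, so a multi-line JSON array arrives as one string per file and can be handed to the JSON reader.
rdd = spark.sparkContext.wholeTextFiles("/path/to/people.json").values()
df = spark.read.json(rdd)
df.show()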
This problem is caused by having multi-line JSON records. By default, spark.read.json expects each record to be on a single line, but this is configurable:
You can set the option before reading: spark.read.option("multiLine", true).json("path-to-json-file")

Converting json file to a single line

I have a huge .json file like the one below. I want to convert that JSON to a DataFrame in Spark.
{
    "movie": {
        "id": 1,
        "name": "test"
    }
}
When I execute the following code, I get a _corrupt_record error:
val df = sqlContext.read.json("example.json")
df.first()
Recently I learned that, by default, Spark only supports single-line JSON records, like:
{ "movie": { "id": 1, "name": "test test" } }
How can I convert JSON text from multiple lines to a single line?
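For reference (this is not from the original thread), a minimal sketch of one straightforward way to do the conversion with Python's json module, assuming the file fits in memory; the output file name is hypothetical:
import json

# Read the pretty-printed JSON and rewrite it as a single line.
with open("example.json", encoding="utf-8") as f:
    data = json.load(f)

with open("example_single_line.json", "w", encoding="utf-8") as f:
    json.dump(data, f, separators=(",", ":"))  # no newlines or extra spaces
Alternatively, a recent Spark version can avoid the conversion entirely by reading with the multiLine option, as mentioned in the previous question.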