In GCP Dataproc (with PySpark), I am trying to read a JSON file using a custom schema and load it into a DataFrame.
I have the following sample JSON for testing:
{"Transactions": [{"schema": "a",
"id": "1",
"app": "testing",
"description": "JSON schema for testing purpose"}]}
I have created following schema:
custom_schema = StructType([
    StructField("Transactions",
        StructType([
            StructField("schema", StringType()),
            StructField("id", StringType()),
            StructField("app", StringType()),
            StructField("description", StringType())
        ])
    )])
Reading JSON as:
df_2 = spark.read.json(json_path, schema = custom_schema)
The read itself completes, but when I try to check the data with df_2.show(), it takes too much time and the kernel shows as busy for hours.
What am I missing in the code, and how can I see the data in the DataFrame in tabular format?
I think the problem is with your custom schema definition and the JSON file. The following code and JSON file worked for me:
Code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType
from pyspark.sql.types import ArrayType
spark = SparkSession \
    .builder \
    .appName("JSON test") \
    .getOrCreate()

custom_schema = StructType([
    StructField("schema", StringType(), False),
    StructField("id", StringType(), True),
    StructField("app", StringType(), True),
    StructField("description", StringType(), True)])

df = spark.read.format("json") \
    .schema(custom_schema) \
    .load("gs://my-bucket/transactions.json")

df.show()
JSON file
The contents of gs://my-bucket/transactions.json is:
{"schema": "a", "id": "1", "app": "foo", "description": "test"}
{"schema": "b", "id": "2", "app": "bar", "description": "test2"}
Output
+------+---+---+-----------+
|schema| id|app|description|
+------+---+---+-----------+
|     a|  1|foo|       test|
|     b|  2|bar|      test2|
+------+---+---+-----------+
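If you want to keep the original nested layout from the question (a top-level Transactions array) instead of flattening the file, a sketch that should work is shown below. It assumes the file is a single multi-line JSON object (so it is read with the multiLine option) and models Transactions as an ArrayType of the inner struct:
from pyspark.sql.functions import explode
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# "Transactions" holds an array of objects, so it must be an ArrayType, not a bare StructType.
nested_schema = StructType([
    StructField("Transactions", ArrayType(StructType([
        StructField("schema", StringType()),
        StructField("id", StringType()),
        StructField("app", StringType()),
        StructField("description", StringType())
    ])))
])

df_nested = spark.read \
    .option("multiLine", "true") \
    .schema(nested_schema) \
    .json(json_path)

# Explode the array to get one row per transaction, then flatten the struct fields.
df_nested.select(explode("Transactions").alias("t")).select("t.*").show()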
Related
I am trying to read a JSON message from a Kafka topic using Spark Streaming with a custom schema. I can see data coming through when I cast the value as a string only, but when I apply the schema it is not working.
The data looks like this:
|{"items": [{"SKU": "22673", "title": "FRENCH GARDEN SIGN BLUE METAL", "unit_price": 1.25, "quantity": 6}, {"SKU": "20972", "title": "PINK CREAM FELT CRAFT TRINKET BOX ", "unit_price": 1.25, "quantity": 2}, {"SKU": "84596F", "title": "SMALL MARSHMALLOWS PINK BOWL", "unit_price": 0.42, "quantity": 1}, {"SKU": "21181", "title": "PLEASE ONE PERSON METAL SIGN", "unit_price": 2.1, "quantity": 12}], "type": "ORDER", "country": "United Kingdom", "invoice_no": 154132552854862, "timestamp": "2023-01-20 07:34:22"}
|
I have used the following schema:
schema = StructType([
    StructField("items", StructType([
        StructField("SKU", IntegerType(), True),
        StructField("title", StringType(), True),
        StructField("unit_price", FloatType(), True),
        StructField("quantity", IntegerType(), True)
    ]), True),
    StructField("type", StringType(), True),
    StructField("country", StringType(), True),
    StructField("invoice_no", StringType(), True),
    StructField("timestamp", TimestampType(), True)
])
I am using the function :
kafkaDF = lines.selectExpr('CAST(value AS STRING)') \
    .select(from_json('value', schema).alias("value")) \
    .select("value.items.SKU", "value.items.title", "value.items.unit_price", "value.items.quantity",
            "value.type", "value.country", "value.invoice_no", "value.timestamp")
Still, the output comes out as null.
It's null because the schema is incorrect.
Your items field needs to be an ArrayType containing your defined StructType.
That being said, you cannot select value.items.X directly, since items does not hold a single element. You need to explode(value.items) first.
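Putting those two points together, a minimal sketch could look like this (assuming the same streaming DataFrame lines as in the question; note also that SKUs such as "84596F" in the sample are not integers, so StringType is used for that field):
from pyspark.sql.functions import from_json, col, explode
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, FloatType, TimestampType, ArrayType)

schema = StructType([
    StructField("items", ArrayType(StructType([
        StructField("SKU", StringType(), True),       # SKUs like "84596F" won't parse as integers
        StructField("title", StringType(), True),
        StructField("unit_price", FloatType(), True),
        StructField("quantity", IntegerType(), True)
    ])), True),
    StructField("type", StringType(), True),
    StructField("country", StringType(), True),
    StructField("invoice_no", StringType(), True),
    StructField("timestamp", TimestampType(), True)
])

kafkaDF = (lines.selectExpr("CAST(value AS STRING)")
    .select(from_json(col("value"), schema).alias("value"))
    # items is an array, so explode it into one row per item before selecting its fields
    .select(explode("value.items").alias("item"),
            "value.type", "value.country", "value.invoice_no", "value.timestamp")
    .select("item.SKU", "item.title", "item.unit_price", "item.quantity",
            "type", "country", "invoice_no", "timestamp"))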
Currently, I'm working with the following architecture.
I have a DocumentDB database whose data is exported to S3 using DMS (a CDC task). Once this data lands on S3, I need to load it into Databricks.
I'm already able to read the CSV content (which contains a lot of JSON documents), but I don't know how to parse it and insert it into a Databricks table.
Below is the JSON payload as it is exported to S3.
{
"_id": {
"$oid": "12332334"
},
"processed": false,
"col1": "0000000122",
"startedAt": {
"$date": 1635667097000
},
"endedAt": {
"$date": 1635667710000
},
"col2": "JFHFGALJF-DADAD",
"col3": 2.75,
"created_at": {
"$date": 1635726018693
},
"updated_at": {
"$date": 1635726018693
},
"__v": 0
}
To extract the data into a DataFrame, I'm using the following Spark command:
df = spark.read \
    .option("header", "true") \
    .option("delimiter", "|") \
    .option("inferSchema", "false") \
    .option("lineterminator", "\n") \
    .option("encoding", "ISO-8859-1") \
    .option("escape", "\"") \
    .csv("dbfs:/mnt/s3-data-2/folder_name/LOAD00000001.csv")
Thank you Alex Ott. As per your suggestion and as per this document, you can use from_json to parse the JSON strings in the CSV.
In order to read a JSON string from a CSV file, first read the CSV file into a Spark DataFrame using spark.read.csv("path"), then parse the JSON string column and convert it to columns using the from_json() function. This function takes the JSON column name as its first argument and the JSON schema as its second argument.
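A minimal sketch under those assumptions; the column name json_str is a placeholder for whichever column of df actually holds the JSON document, the table name is hypothetical, and the Mongo $oid/$date wrappers are modeled as nested structs:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, DoubleType, LongType

# Schema for the DocumentDB payload shown above.
payload_schema = StructType([
    StructField("_id", StructType([StructField("$oid", StringType())])),
    StructField("processed", BooleanType()),
    StructField("col1", StringType()),
    StructField("startedAt", StructType([StructField("$date", LongType())])),
    StructField("endedAt", StructType([StructField("$date", LongType())])),
    StructField("col2", StringType()),
    StructField("col3", DoubleType()),
    StructField("created_at", StructType([StructField("$date", LongType())])),
    StructField("updated_at", StructType([StructField("$date", LongType())])),
    StructField("__v", LongType())
])

# "json_str" is a placeholder column name; replace it with the header the DMS export produces.
parsed = df.withColumn("doc", from_json(col("json_str"), payload_schema)).select("doc.*")

# Hypothetical target table name.
parsed.write.mode("append").saveAsTable("my_database.my_table")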
What is the best way to create a DataFrame for a JSON file using a separate JSON schema file in PySpark?
Sample json file
{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":1}
{"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":264}
{"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":69}
{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":24}
Code to read this file
df_json = spark.read.format("json") \
    .option("mode", "FAILFAST") \
    .option("inferSchema", "true") \
    .load("C:\\pyspark\\data\\2010-summary.json")
If I don't want to use the "inferschema" option and want to use a json schema file instead, may I know how to do that?
json schema file
{"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {"ORIGIN_COUNTRY_NAME": {"type": "string"},
"DEST_COUNTRY_NAME": {"type": "string"},
"count": {"type": "integer"}
},
"required": ["ORIGIN_COUNTRY_NAME","DEST_COUNTRY_NAME","count"]
}
Option 1:
Assuming your columns are all nullable:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

yourSchema = StructType([
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("count", IntegerType(), True)])
Option 2:
Simply read your file like so:
df_json = spark.read.json("C:\\pyspark\\data\\2010-summary.json")
df_jsonSchema = df_json.schema
print(type(df_jsonSchema))
[each for each in df_jsonSchema]
From the results, you can then build your schema just like in Option 1.
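If you want to keep that schema in a separate file and reuse it later, one possible sketch is to save Spark's own schema JSON (note this is Spark's schema format, not the json-schema.org document shown above) and rebuild the StructType from it:
import json
from pyspark.sql.types import StructType

# Persist the inferred schema once; the path is just an example.
with open("C:\\pyspark\\data\\2010-summary.schema.json", "w") as f:
    f.write(df_json.schema.json())

# On later runs, rebuild the StructType from the saved file instead of inferring it again.
with open("C:\\pyspark\\data\\2010-summary.schema.json") as f:
    saved_schema = StructType.fromJson(json.load(f))

df_json = spark.read.schema(saved_schema).json("C:\\pyspark\\data\\2010-summary.json")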
I am getting started with Apache Spark.
I have a requirement to convert a JSON log into flattened metrics; it can be considered a simple CSV as well.
For example:
"orderId":1,
"orderData": {
"customerId": 123,
"orders": [
{
"itemCount": 2,
"items": [
{
"quantity": 1,
"price": 315
},
{
"quantity": 2,
"price": 300
},
]
}
]
}
This can be considered a single JSON log. I want to convert it into:
orderId,customerId,totalValue,units
1,123,915,3
I was going through the Spark SQL documentation and can use it to get hold of individual values, like "select orderId, orderData.customerId from Order", but I am not sure how to get the summation of all the prices and units.
What is the best practice to get this done using Apache Spark?
Try:
>>> from pyspark.sql.functions import *
>>> doc = {"orderData": {"orders": [{"items": [{"quantity": 1, "price": 315}, {"quantity": 2, "price": 300}], "itemCount": 2}], "customerId": 123}, "orderId": 1}
>>> df = sqlContext.read.json(sc.parallelize([doc]))
>>> df.select("orderId", "orderData.customerId", explode("orderData.orders").alias("order")) \
... .withColumn("item", explode("order.items")) \
... .groupBy("orderId", "customerId") \
... .agg(sum("item.quantity"), sum(col("item.quantity") * col("item.price")))
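To get exactly the orderId,customerId,totalValue,units layout from the question, you can alias the aggregates; for the sample log this should yield 1, 123, 915, 3:
>>> df.select("orderId", "orderData.customerId", explode("orderData.orders").alias("order")) \
...     .withColumn("item", explode("order.items")) \
...     .groupBy("orderId", "customerId") \
...     .agg(sum(col("item.quantity") * col("item.price")).alias("totalValue"),
...          sum("item.quantity").alias("units")) \
...     .show()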
For people looking for a Java solution to the above:
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate();

SQLContext sqlContext = new SQLContext(spark);
Dataset<Row> orders = sqlContext.read().json("order.json");

Dataset<Row> newOrders = orders.select(
        col("orderId"),
        col("orderData.customerId"),
        explode(col("orderData.orders")).alias("order"))
    .withColumn("item", explode(col("order.items")))
    .groupBy(col("orderId"), col("customerId"))
    .agg(sum(col("item.quantity")).alias("units"),
         // multiply quantity by price so this aggregate matches the totalValue column
         sum(col("item.quantity").multiply(col("item.price"))).alias("totalValue"));

newOrders.show();
I would like to create a JSON file for a Python script to parse.
My data is currently in a text file in the format of:
url1,string1
url2,string2
url3,string3
url4,string4
I would like to manually create a JSON file that I could pass to a Python script to scrape for a string.
Thank you, I used your example to build something like it and it worked!
{"url": "url1", "string": "string1"}
{"url": "url2", "string": "string2"}
{"url": "url3", "string": "string3"}
Something like the following should work
import csv
import json

field_names = ("url", "string")

# Read each url,string pair and write one JSON object per line (JSON Lines).
with open('file.csv', 'r') as csv_file, open('file.json', 'w') as json_file:
    reader = csv.DictReader(csv_file, field_names)
    for row in reader:
        json.dump(row, json_file)
        json_file.write('\n')
I may have misunderstood your question; if it's about converting this CSV into JSON manually, it would be:
[
[
"url1",
"string1"
],
[
"url2",
"string2"
],
[
"url3",
"string3"
],
[
"url4",
"string4"
]
]
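If you do want to generate that array-of-arrays layout with Python rather than writing it by hand, a small sketch (assuming the same file.csv / file.json names as above):
import csv
import json

# Read all url,string rows and dump them as one JSON array of [url, string] pairs.
with open('file.csv', newline='') as csv_file:
    rows = list(csv.reader(csv_file))

with open('file.json', 'w') as json_file:
    json.dump(rows, json_file, indent=2)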
If you prefer, you can also use an online CSV to JSON converter.