Convert CSV with dynamic columns to Parquet

I have CSV files for a table whose columns are dynamic and may appear in any order:
csv file 1:
name, id, age, job
Amy, 001, 30, SDE
csv file 2:
id, job, name
002, PM, Brandon
I converted the CSV files to Parquet in PySpark:
spark.read.csv(input_path, header=True).write.parquet(output_path)
but when I read the Parquet output with Spark SQL, the data is shifted:
name, id, age, job
Amy, 001, 30, SDE
002, PM, Brandon
What I want is:
name, id, age, job
Amy, 001, 30, SDE
Brandon, 002, null, PM
I know Parquet is a columnar format, so I expected the write to line data up by column name, which would keep it from getting shifted. Or the problem could be read.csv itself, since it relies on column position and therefore can't cope with a dynamic order.
Is there any config I can add to the code to make this work, or any other way to do it?

You have to use the mergeSchema=true option, as below.
scala> spark.read.option("mergeSchema", "true").parquet("/user/hive/warehouse/test.db/csv_test/")
scala> res16.printSchema
root
|-- id: string (nullable = true)
|-- job: string (nullable = true)
|-- name: string (nullable = true)
|-- age: string (nullable = true)
scala> res16.show
+---+---+-------+----+
| id|job| name| age|
+---+---+-------+----+
|001|SDE| Amy| 30|
|002| PM|Brandon|null|
+---+---+-------+----+
Please beware that schema merging is an expensive operation.
https://spark.apache.org/docs/2.4.0/sql-data-sources-parquet.html#schema-merging
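Since the question is in PySpark, here is a minimal sketch of the same approach in Python; the paths are illustrative, and each CSV is appended into a single Parquet output directory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each CSV keeps its own column order; Parquet stores a schema per file,
# so both files can live in the same output directory.
for path in ["csv_file_1.csv", "csv_file_2.csv"]:  # illustrative paths
    spark.read.csv(path, header=True).write.mode("append").parquet("output_parquet")

# mergeSchema reconciles the per-file schemas by column name,
# so missing columns come back as null instead of shifted values.
df = spark.read.option("mergeSchema", "true").parquet("output_parquet")
df.show()

If the full set of columns is known up front, another option is to select them explicitly (filling missing ones with nulls) before writing, so the schema-merge cost isn't paid on every read.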

Related

Expand JSON from pySpark DataFrame into name / value pairs

I have a pySpark dataframe looking like this:
+--+--------------------------------------+
|id|json                                  |
+--+--------------------------------------+
|1 |{"attr1": "value1"}                   |
|2 |{"attr2": "value2", "attr3": "value3"}|
+--+--------------------------------------+
root
|-- id: string (nullable = true)
|-- json: string (nullable = true)
How do I convert it into a new dataframe which will look like this:
+--+-----+------+
|id|attr |value |
+--+-----+------+
|1 |attr1|value1|
|2 |attr2|value2|
|2 |attr3|value3|
+--+-----+------+
(tried to google for the solution with no success, apologies if it's a duplicate)
Thanks!
Please check the schema; it looks to me like a map type. If the json column is of MapType, use map_entries to extract the entries and explode them.
from pyspark.sql.functions import explode, map_entries
df = spark.createDataFrame(Data, schema)  # Data and schema as defined on your side
df.withColumn('attri', explode(map_entries('json'))).select('id', 'attri.*').show()
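If the json column is actually a plain string (as the printSchema in the question suggests), a minimal sketch under that assumption would first parse it into a map with from_json and then explode the entries; the data here is re-created inline for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json, map_entries
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1", '{"attr1": "value1"}'),
     ("2", '{"attr2": "value2", "attr3": "value3"}')],
    ["id", "json"],
)

# Parse the JSON string into a map<string,string>, then emit one row per key/value pair
result = (
    df.withColumn("parsed", from_json(col("json"), MapType(StringType(), StringType())))
      .select("id", explode(map_entries("parsed")).alias("entry"))
      .select("id", col("entry.key").alias("attr"), col("entry.value").alias("value"))
)
result.show(truncate=False)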

pyspark explode json array of dictionary items with key/values pairs into columns

I have a Spark dataframe built from the JSON below. I want to flatten it so that the header fields and each name/value pair under properties.property end up as their own columns, one row per message.
Data:
{
"header": {
"message-id": "ID:EL2-202103221753-77777777-88888-9999999999-1:2:1:1:1",
"reply-to": "queue://CaseProcess.v2",
"timestamp": "2021-03-22T20:07:27"
},
"properties": {
"property": [
{
"name": "ELIS_EXCEPTION_MSG",
"value": "The AWS Access Key Id you provided does not exist in our records"
},
{
"name": "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
"value": "1616458043704"
}
]
}
}
You should first flatten the header columns, then explode the properties.property array, and finally group and pivot.
Here is an example that produces your wanted result:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import json

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Test").getOrCreate()

    data = {
        "header": {
            "message-id": "ID:EL2-202103221753-77777777-88888-9999999999-1:2:1:1:1",
            "reply-to": "queue://CaseProcess.v2",
            "timestamp": "2021-03-22T20:07:27",
        },
        "properties": {
            "property": [
                {
                    "name": "ELIS_EXCEPTION_MSG",
                    "value": "The AWS Access Key Id you provided does not exist in our records",
                },
                {
                    "name": "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
                    "value": "1616458043704",
                },
            ]
        },
    }

    sc = spark.sparkContext
    df = spark.read.json(sc.parallelize([json.dumps(data)]))

    df = df.select(
        F.col("header.message-id").alias("message-id"),
        F.col("header.reply-to").alias("reply-to"),
        F.col("header.timestamp").alias("timestamp"),
        F.col("properties"),
    )
    df = df.withColumn("propertyexploded", F.explode("properties.property"))
    df = df.withColumn("propertyname", F.col("propertyexploded")["name"])
    df = df.withColumn("propertyvalue", F.col("propertyexploded")["value"])
    df = (
        df.groupBy("message-id", "reply-to", "timestamp")
        .pivot("propertyname")
        .agg(F.first("propertyvalue"))
    )

    df.printSchema()
    df.show()
Result:
root
|-- message-id: string (nullable = true)
|-- reply-to: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- ELIS_EXCEPTION_MSG: string (nullable = true)
|-- ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS: string (nullable = true)
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
| message-id| reply-to| timestamp| ELIS_EXCEPTION_MSG|ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS|
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
|ID:EL2-2021032217...|queue://CaseProce...|2021-03-22T20:07:27|The AWS Access Ke...| 1616458043704|
+--------------------+--------------------+-------------------+--------------------+----------------------------------+
Thanks Vlad. I tried your option and it was successful.
In your solution the properties.property['name'] values are handled dynamically, which I liked.
The only drawback is that the number of rows gets multiplied by the number of properties: when I had 3 rows, the flattened df had 36 rows (of course the pivot brings it back down to 3).
The problem I faced is that all the properties get repeated 12 times per row. That would have been fine if the properties were small, but I have 2 columns whose values can run 2K-20K characters, and some queues go over 100K rows, so repeating that over and over seems like overkill for my process.
However, I found another solution with a drawback of its own: the property names have to be hard-coded, but the repetition of rows is eliminated.
Here is what I ended up using:
XXX = df.rdd.flatMap(lambda x: [(
    x[1]["destination"].replace("queue://", ""),
    x[1]["message-id"].replace("ID:", ""),
    x[1]["delivery-mode"],
    x[1]["expiration"],
    x[1]["priority"],
    x[1]["redelivered"],
    x[1]["timestamp"],
    y[0]["value"],
    y[1]["value"],
    y[2]["value"],
    y[3]["value"],
    y[4]["value"],
    y[5]["value"],
    y[6]["value"],
    y[7]["value"],
    y[8]["value"],
    y[9]["value"],
    y[10]["value"],
    y[11]["value"],
    x[0],
    x[3],
    x[4],
    x[5],
) for y in x[2]]).toDF([
    "queue",
    "message_id",
    "delivery_mode",
    "expiration",
    "priority",
    "redelivered",
    "timestamp",
    "ELIS_EXCEPTION_MSG",
    "ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS",
    "ELIS_MESSAGE_RETRY_COUNT",
    "ELIS_MESSAGE_ORIG_TIMESTAMP",
    "ELIS_MDC_TRACER_ID",
    "tracestate",
    "ELIS_ROOT_CAUSE_EXCEPTION_MSG",
    "traceparent",
    "ELIS_MESSAGE_TYPE",
    "ELIS_EXCEPTION_CLASS",
    "newrelic",
    "ELIS_EXCEPTION_TRACE",
    "body",
    "partition_0",
    "partition_1",
    "partition_2",
])
print(f"... type(XXX): {type(XXX)} | df.count(): {df.count()} | XXX.count(): {XXX.count()}")
output: ... type(XXX): <class 'pyspark.rdd.PipelinedRDD'> | df.count(): 3 | XXX.count(): 3
My column structure comes from ActiveMQ API extracts, which means it is consistent, so it's OK for my use case to hard-code the column names in the flattened df.
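For what it's worth, a middle ground is possible; this is only a sketch, assuming Spark 2.4+ and the df read from the JSON above. map_from_entries turns the name/value structs into a map so properties can be looked up by name, which avoids both the explode/pivot row blow-up and the positional y[0]...y[11] indexing, although the property names are still spelled out:

import pyspark.sql.functions as F

# `df` is the dataframe read from the JSON above (assumption).
flat = df.select(
    F.col("header.message-id").alias("message-id"),
    F.col("header.reply-to").alias("reply-to"),
    F.col("header.timestamp").alias("timestamp"),
    # array<struct<name, value>> -> map keyed by property name
    F.map_from_entries("properties.property").alias("props"),
).select(
    "message-id",
    "reply-to",
    "timestamp",
    F.col("props")["ELIS_EXCEPTION_MSG"].alias("ELIS_EXCEPTION_MSG"),
    F.col("props")["ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS"].alias("ELIS_MESSAGE_ORIG_TIMESTAMP_MILLIS"),
)
flat.show()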

Extract and explode embedded json fields in apache spark

I'm completely new to Spark, but I don't mind whether the answer is in Python or Scala. I can't show the actual data for privacy reasons, but basically I am reading JSON files with a structure like this:
{
  "EnqueuedTimeUtc": "some date time",
  "Properties": {},
  "SystemProperties": {
    "connectionDeviceId": "an id",
    "some other fields that we don't care about": "data"
  },
  "Body": {
    "device_id": "an id",
    "tabs": [
      {
        "selected": false,
        "title": "some title",
        "url": "https:...."
      },
      {"same again, for multiple tabs"}
    ]
  }
}
Most of the data is of no interest. What I want is a Dataframe consisting of the time, device_id, and url. There can be multiple urls for the same device and time, so I'm looking to explode these into one row per url.
| timestamp | device_id | url |
My immediate problem is that when I read this, although Spark can work out the structure of SystemProperties, Body comes back as just a string, probably because of variation. Perhaps I need to specify the schema; would that help?
root
|-- Body: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: struct (nullable = true)
| |-- connectionAuthMethod: string (nullable = true)
| |-- connectionDeviceGenerationId: string (nullable = true)
| |-- connectionDeviceId: string (nullable = true)
| |-- contentEncoding: string (nullable = true)
| |-- contentType: string (nullable = true)
| |-- enqueuedTime: string (nullable = true)
Any idea of an efficient way (there are lots and lots of these records) to extract the urls and associate them with the time and device_id? Thanks in advance.
Here's an example of the extraction. Basically you can use from_json to convert Body into something more structured, and explode(transform()) to pull out the URLs and expand them into separate rows.
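For reference, here is a minimal sketch of how a sample dataframe like the one shown below could be constructed; the values are illustrative:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Body is kept as a raw JSON string and SystemProperties as a struct,
# mirroring the schema printed below.
df = spark.createDataFrame([
    Row(
        Body='{"device_id":"an id","tabs":[{"selected":false,"title":"some title","url":"https:1"},'
             '{"selected":false,"title":"some title","url":"https:2"}]}',
        EnqueuedTimeUtc="some date time",
        SystemProperties=Row(connectionDeviceId="an id"),
    )
])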
# Sample dataframe
df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
|Body |EnqueuedTimeUtc|SystemProperties|
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
|{"device_id":"an id","tabs":[{"selected":false,"title":"some title","url":"https:1"},{"selected":false,"title":"some title","url":"https:2"}]}|some date time |[an id] |
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
df.printSchema()
root
|-- Body: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: struct (nullable = true)
| |-- connectionDeviceId: string (nullable = true)
# Extract desired properties
df2 = df.selectExpr(
    "EnqueuedTimeUtc as timestamp",
    "from_json(Body, 'device_id string, tabs array<map<string,string>>') as Body"
).selectExpr(
    "timestamp",
    "Body.device_id",
    "explode(transform(Body.tabs, x -> x.url)) as url"
)
df2.show()
+--------------+---------+-------+
| timestamp|device_id| url|
+--------------+---------+-------+
|some date time| an id|https:1|
|some date time| an id|https:2|
+--------------+---------+-------+
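If you'd rather keep the tab fields typed instead of using map<string,string>, a struct schema works as well; this is a sketch against the same illustrative dataframe:

# Same idea, but with a typed struct for each tab; extracting .url from an
# array of structs yields an array of strings, which explode then flattens.
df3 = df.selectExpr(
    "EnqueuedTimeUtc as timestamp",
    "from_json(Body, 'device_id string, tabs array<struct<selected:boolean,title:string,url:string>>') as Body",
).selectExpr(
    "timestamp",
    "Body.device_id",
    "explode(Body.tabs.url) as url",
)
df3.show()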

AWS Glue write_dynamic_frame_from_options encounters schema exception

I'm new to PySpark and AWS Glue, and I'm having an issue when I try to write out a file with Glue.
When I try to write some output to S3 using Glue's write_dynamic_frame_from_options, it fails with an exception saying:
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 199.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 199.0 (TID 7991, 10.135.30.121, executor 9): java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
Header length: 7, schema size: 6
CSV file: s3://************************************cache.csv
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:180)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource$$anonfun$checkHeaderColumnNames$1.apply(CSVDataSource.scala:176)
at scala.Option.foreach(Option.scala:257)
at .....
It seems to be saying that my dataframe's schema has 6 fields but the CSV has 7. I don't understand which CSV it's talking about, because I am actually trying to create a new CSV from the dataframe...
Any insight into this specific issue, or into how the write_dynamic_frame_from_options method works in general, would be very helpful!
Here is the source code for the function in my job that is causing this issue.
def update_geocache(glueContext, originalDf, newDf):
    logger.info("Got the two df's to union")
    logger.info("Schema of the original df")
    originalDf.printSchema()
    logger.info("Schema of the new df")
    newDf.printSchema()
    # add the two Dataframes together
    unioned_df = originalDf.unionByName(newDf).distinct()
    logger.info("Schema of the union")
    unioned_df.printSchema()
    ## root
    # |-- location_key: string (nullable = true)
    # |-- addr1: string (nullable = true)
    # |-- addr2: string (nullable = true)
    # |-- zip: string (nullable = true)
    # |-- lat: string (nullable = true)
    # |-- lon: string (nullable = true)
    # Create just 1 partition, because there is so little data
    unioned_df = unioned_df.repartition(1)
    logger.info("Unioned the geocache and the new addresses")
    # Convert back to dynamic frame
    dynamic_frame = DynamicFrame.fromDF(
        unioned_df, glueContext, "dynamic_frame")
    logger.info("Converted the unioned tables to a Dynamic Frame")
    # Write data back to S3
    # THIS IS THE LINE THAT THROWS THE EXCEPTION
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={
            "path": "s3://" + S3_BUCKET + "/" + TEMP_FILE_LOCATION
        },
        format="csv"
    )

Nested dynamic schema not working while parsing JSON using pyspark

I am trying to extract certain parameters from a nested JSON (with a dynamic schema) and generate a Spark dataframe using PySpark.
My code works perfectly for level 1 (key:value) pairs, but fails to get independent columns for each (key:value) pair that is part of the nested JSON.
JSON schema sample
Note - This is not the exact schema. It's just to give an idea of the nested nature of the schema.
{
  "tweet": {
    "text": "RT #author original message",
    "user": {
      "screen_name": "Retweeter"
    },
    "retweeted_status": {
      "text": "original message",
      "user": {
        "screen_name": "OriginalTweeter"
      },
      "place": {},
      "entities": {},
      "extended_entities": {}
    }
  },
  "entities": {},
  "extended_entities": {}
}
PySpark Code
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("text", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("retweeted_status", StructType([
        StructField("text", StringType(), True),
        StructField("created_at", StringType(), True)
    ]))
])

df = spark.read.schema(schema).json("/user/sagarp/NaMo/data/NaMo2019-02-12_00H.json")
df.show()
Current output (with real JSON data):
All (key:value) pairs under the nested retweeted_status JSON are squashed into one single list, e.g. [text, created_at, entities].
+--------------------+--------------------+--------------------+
| text| created_at| retweeted_status|
+--------------------+--------------------+--------------------+
|RT #Hoosier602: #...|Mon Feb 11 19:04:...|[#CLeroyjnr #Gabr...|
|RT #EgSophie: Oh ...|Mon Feb 11 19:04:...|[Oh cool so do yo...|
|RT #JacobAWohl: #...|Mon Feb 11 19:04:...|[#realDonaldTrump...|
Expected output
I want independent columns for each key. Also, note that there is already a parent-level key with the same name, text; how would you deal with such instances?
Ideally, I would want columns like "text", "entities", "retweet_status_text", "retweet_status_entities", etc.
Your schema is not mapped properly. Please see these posts if you want to manually construct the schema (which is recommended if the data doesn't change):
PySpark: How to Update Nested Columns?
https://docs.databricks.com/_static/notebooks/complex-nested-structured.html
Also, if your JSON is multi-line (like your example), then you can:
read the JSON with the multiline option so Spark infers the schema,
then save that nested schema,
then read the data back in with the saved schema, which avoids triggering another Spark job for schema inference.
! cat nested.json
[
  {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
  {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
  {
    "string": "string3",
    "int": 3,
    "array": [
      3,
      6,
      9
    ],
    "dict": {
      "key": "value3",
      "extra_key": "extra_value3"
    }
  }
]
getSchema = spark.read.option("multiline", "true").json("nested.json")
extractSchema = getSchema.schema
print(extractSchema)
StructType(List(StructField(array,ArrayType(LongType,true),true),StructField(dict,StructType(List(StructField(extra_key,StringType,true),StructField(key,StringType,true))),true),StructField(int,LongType,true),StructField(string,StringType,true)))
loadJson = spark.read.option("multiline", "true").schema(extractSchema ).json("nested.json")
loadJson.printSchema()
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
loadJson.show(truncate=False)
+---------+----------------------+---+-------+
|array |dict |int|string |
+---------+----------------------+---+-------+
|[1, 2, 3]|[, value1] |1 |string1|
|[2, 4, 6]|[, value2] |2 |string2|
|[3, 6, 9]|[extra_value3, value3]|3 |string3|
+---------+----------------------+---+-------+
Once you have the data loaded with the correct mapping, you can start transforming it into a normalized schema via "dot" notation for nested columns, explode to flatten arrays, etc.
loadJson\
.selectExpr("dict.key as key", "dict.extra_key as extra_key").show()
+------+------------+
| key| extra_key|
+------+------------+
|value1| null|
|value2| null|
|value3|extra_value3|
+------+------------+
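Applied to the tweet data from the question, a minimal sketch (assuming the real files expose the top-level text, created_at and retweeted_status fields seen in the "Current output" above) would use the same dot notation, with aliases to resolve the duplicate text name:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

tweets = spark.read.json("/user/sagarp/NaMo/data/NaMo2019-02-12_00H.json")

# Aliases keep the parent-level "text" and the nested one apart.
tweets.select(
    F.col("text"),
    F.col("created_at"),
    F.col("retweeted_status.text").alias("retweet_status_text"),
    F.col("retweeted_status.created_at").alias("retweet_status_created_at"),
    F.col("retweeted_status.entities").alias("retweet_status_entities"),
).show()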