I have a text file with complex-type columns. Is it possible to automatically infer a schema with array, map, and struct types in Spark?
Source:
name,work_place,gender_age,skills_score,depart_title,work_contractor
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
code example:
val employeeComplexDF = spark
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("src/main/resources/employee_complex/employee.txt")
parsed schema (actual result):
root
|-- name: string (nullable = true)
|-- work_place: string (nullable = true)
|-- gender_age: string (nullable = true)
|-- skills_score: string (nullable = true)
|-- depart_title: string (nullable = true)
|-- work_contractor: string (nullable = true)
The expected schema would instead use ArrayType, MapType, and StructType for the complex columns.
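Spark's CSV reader never infers complex types: with inferSchema it only chooses among primitive types (numbers, booleans, timestamps, strings), so every delimited column falls back to string. To get ArrayType, MapType, and StructType columns, the file has to be read with the right field delimiter and the complex columns built explicitly. A minimal Scala sketch, assuming the header line uses the same '|' delimiter as the data rows (the sample above shows it comma-separated, so that part may need adjusting) and the sub-delimiters shown above; names and the path are taken from the question:
import org.apache.spark.sql.functions._

// Read with '|' as the field delimiter; inferSchema is omitted because it
// would still only yield primitive types.
val raw = spark
  .read
  .option("header", "true")
  .option("sep", "|")
  .csv("src/main/resources/employee_complex/employee.txt")

val employeeComplexDF = raw
  // "Montreal,Toronto" -> array<string>
  .withColumn("work_place", split(col("work_place"), ","))
  // "Male,30" -> struct<gender:string, age:int>
  .withColumn("gender_age", struct(
    split(col("gender_age"), ",")(0).as("gender"),
    split(col("gender_age"), ",")(1).cast("int").as("age")))
  // "DB:80" / "Sales:89,HR:94" -> map<string,string>
  .withColumn("skills_score", expr("str_to_map(skills_score, ',', ':')"))
  // "Product:Developer^DLead" -> map<string,string> (the '^' sub-level is left as-is here)
  .withColumn("depart_title", expr("str_to_map(depart_title, ',', ':')"))

employeeComplexDF.printSchema()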
Related
I am trying to read a .csv with multiline records into a Spark DataFrame. My .csv looks like the sample below. The first line is a header.
Software,Version,Date,Update Date,Extended Support,Reference,Notes
Windows,Windows XP,12/28/2022,12/28/2023,12/28/2024,https://www.software.com/,"Some notes"
VxWorks,VxWorks ,,,12/28/2024,https://www.software.com/,"Some Notes
with multiple lines"
I am using the below code to read the above file.
val df = spark.read
.option("header", true)
.option("sep", ",")
.option("inferSchema", false)
.option("multiLine", true)
.option("escape","\"")
.csv(s"${file_path}")
This read treats all of the row values as column names; I'm not sure where it's going wrong.
scala> df.printSchema()
root
|-- Software: string (nullable = true)
|-- Version: string (nullable = true)
|-- Date: string (nullable = true)
|-- Update Date: string (nullable = true)
|-- Extended Support: string (nullable = true)
|-- Reference: string (nullable = true)
Windows: string (nullable = true)
|-- Windows XP: string (nullable = true)
|-- 12/28/2022: string (nullable = true)
|-- 12/28/2023: string (nullable = true)
|-- 12/28/2024: string (nullable = true)
|-- https://www.software.com/: string (nullable = true)
|-- "Some notes
VxWorks: string (nullable = true)
|-- VxWorks: string (nullable = true)
|-- <blank>: string (nullable = true)
|-- <blank>: string (nullable = true)
|-- 12/28/2024: string (nullable = true)
|-- https://www.software.com/: string (nullable = true)
I have flattened a nested JSON file, and now I am facing an ambiguity issue in getting the original column names using PySpark.
DataFrame with the following schema:
Before flattening:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- foo: struct (nullable = true)
| |-- a: float (nullable = true)
| |-- b: float (nullable = true)
| |-- c: integer (nullable = true)
After Flattening:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- foo_a: float (nullable = true)
|-- foo_b: float (nullable = true)
|-- foo_c: integer (nullable = true)
Is it possible to get only the original field names (without the foo_ prefix) in the DataFrame, as shown below?
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- a: float (nullable = true)
|-- b: float (nullable = true)
|-- c: integer (nullable = true)
Yes, just do the following instead of flattening:
df.select("*", "foo.*").drop("foo")
or
df.select("x", "y", "foo.*")
The foo.* syntax pulls all the fields out of the struct and puts them at the top level.
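A tiny self-contained sketch (hypothetical literal data, Scala API; the PySpark calls are the same) showing the expansion:
import org.apache.spark.sql.functions._

// Build a DataFrame shaped like the one above: two strings plus a struct column.
val nested = spark.range(1).select(
  lit("x1").as("x"),
  lit("y1").as("y"),
  struct(lit(1.5f).as("a"), lit(2.5f).as("b"), lit(3).as("c")).as("foo"))

// foo.* expands the struct fields a, b, c into top-level columns.
nested.select("x", "y", "foo.*").printSchema()
// root
//  |-- x: string (nullable = false)
//  |-- y: string (nullable = false)
//  |-- a: float (nullable = false)
//  |-- b: float (nullable = false)
//  |-- c: integer (nullable = false)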
I have a json structure which contains some top-level metadata and a payload which is equivalent to what pandas would define as a split orientation json payload. This was done to reduce duplication as we ingest a lot of these files.
Usually in pandas I would load the json file and pass the required parts (index, column names and data) to a DataFrame constructor, which would give me a flat table which is easy to work with and can then be exported to influxdb or sql.
import json
import pandas as pd

obj = json.load(open('file.json'))
df = pd.DataFrame(index=obj['Payload']['Time'],
                  columns=obj['Payload']['Names'],
                  data=obj['Payload']['Data'])
df['Machine_ID'] = obj['Machine_ID']
df['TimeSend'] = obj['TimeSend']
df['Version'] = obj['Version']
It seems that this schema is not easy to flatten with Spark, since the data isn't record-based, so the column names and the data aren't associated. Is there any way I can process this into a flat schema with Spark, or should I add an extra pandas processing step to my pipeline?
root
|-- Machine_ID: string (nullable = true)
|-- TimeSend: string (nullable = true)
|-- Version: long (nullable = true)
|-- Payload: struct (nullable = true)
| |-- Data: array (nullable = true)
| | |-- element: array (containsNull = true)
| | | |-- element: double (containsNull = true)
| |-- Names: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- Time: array (nullable = true)
| | |-- element: string (containsNull = true)
Edit: I found a way that works; however, I'm curious whether the ordering can be relied on, because I split the DataFrame before adding the ids.
It is probably better to zip Time and Data so that they can be exploded together, and work from there (see the sketch after the resulting schema below).
from pyspark.sql.functions import col, explode, monotonically_increasing_id

# Make flattened dataframe
df_ = df.select(col('Payload.Time').alias('Time'), col('Payload.Names').alias('Names'),
                col('Payload.Data').alias('Data'), col('Machine_ID'), col('TimeSend'), col('Version'))
# Make exploded `Data` table
columns = df_.rdd.flatMap(lambda x: x.Names).collect()
df_a = df_.select(explode(col('Data')))  # explode() names its output column 'col'
df_a = df_a.select([df_a['col'][x] for x in range(len(columns))])
df_a = df_a.toDF(*columns)
df_a = df_a.withColumn("id", monotonically_increasing_id())
# Make exploded `Metadata` table
df_b = df_.select(explode(col('Time')).alias('Index'), col('Machine_ID'), col('TimeSend'), col('Version'))
df_b = df_b.withColumn("id", monotonically_increasing_id())
# Join tables
df_c = df_a.join(df_b, "id")
# Schema is now flattened and joined
df_c.printSchema()
root
|-- id: long (nullable = false)
|-- Machine_ID: string (nullable = true)
|-- TimeSend: string (nullable = true)
|-- Version: long (nullable = true)
|-- Index: string (nullable = true) <- From Payload.Time
|-- TagA: double (nullable = true) <- From Payload.Names & Data
|-- TagB: double (nullable = true) <- From Payload.Names & Data
|-- TagC: double (nullable = true) <- From Payload.Names & Data
|-- TagD: double (nullable = true) <- From Payload.Names & Data
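On the ordering question: monotonically_increasing_id is only guaranteed to be increasing and unique, not to produce matching ids across two separately derived DataFrames, so the join above is generally not something to rely on. A hedged sketch of the zip idea mentioned in the edit (written in Scala here; PySpark has the same arrays_zip and explode functions, Spark 2.4+), which keeps each Time entry attached to its row of Data before exploding:
import org.apache.spark.sql.functions._

// Column names follow the schema shown above; 'names' is collected once on the driver.
val names = df.select(col("Payload.Names")).first().getSeq[String](0)

val flat = df
  .select(col("Machine_ID"), col("TimeSend"), col("Version"),
    col("Payload.Time").as("t"), col("Payload.Data").as("d"))
  // zip first so each timestamp stays paired with its row of data
  .withColumn("z", explode(arrays_zip(col("t"), col("d"))))
  .select(
    Seq(col("Machine_ID"), col("TimeSend"), col("Version"),
      col("z.t").as("Index")) ++
      names.zipWithIndex.map { case (n, i) => col("z.d")(i).as(n) }: _*)

flat.printSchema()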
I have a PySpark DataFrame with an input schema like
|-- runName: string (nullable = true)
|-- action_name: string (nullable = true)
|-- model_payload: string (nullable = true)
|-- model_type: string (nullable = true)
|-- did_pass: string (nullable = true)
|-- ymd: string (nullable = false)
Inside model_payload is a list containing JSON, and I want to pull the data out of it and create a separate DataFrame for it, with a schema like the one below. However, at the moment model_payload is a string.
root
|-- dataset_A: string (nullable = true)
|-- dataset_B: string (nullable = true)
|-- ks_statistic: double (nullable = true)
|-- pvalue: double (nullable = true)
|-- rejected_hypothesis: boolean (nullable = true)
|-- target_ks_statistic: double (nullable = true)
|-- target_pvalue: double (nullable = true)
|-- action: string (nullable = true)
where the JSON in model_payload looks like
d = {
"dataset_A": str,
"dataset_B": str,
"ks_statistic": str,
"pvalue": str,
"rejected_hypothesis": bool,
"target_ks_statistic": str,
"target_pvalue": str,
}
The only solution I've found so far is to transform this into a pandas DataFrame and use json.loads(). However, this is very slow and not suitable for large datasets.
According to your payload, you have to define the corresponding struct schema in PySpark and use it to parse your data with from_json:
from pyspark.sql import functions as F, types as T
schm = T.StructType(
[
T.StructField("dataset_A", T.StringType()),
T.StructField("dataset_B", T.StringType()),
T.StructField("ks_statistic", T.StringType()),
T.StructField("pvalue", T.StringType()),
T.StructField("rejected_hypothesis", T.BooleanType()),
T.StructField("target_ks_statistic", T.StringType()),
T.StructField("target_pvalue", T.StringType()),
]
)
df.withColumn("model_payload", F.from_json("model_payload", schm)).select(
"model_payload.*"
)
I am trying to create a schema for a nested JSON file so that it can become a DataFrame.
However, I am not sure if there is a way to create the schema without defining all the fields in the JSON file, if I only need a subset of them - say 'id' and 'text'.
I am currently doing this using Scala in the Spark shell. As you can see from the file name, I downloaded it as part-00000 from HDFS.
From the manuals on JSON:
Apply the schema using the .schema method. This read returns only
the columns specified in the schema.
So you are good to go with what you propose.
E.g.
import org.apache.spark.sql.types.{StructType, StringType}
val schema = new StructType()
.add("op_ts", StringType, true)
val df = spark.read.schema(schema)
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)
returns:
root
|-- op_ts: string (nullable = true)
+--------------------------+
|op_ts |
+--------------------------+
|2019-05-31 04:24:34.000327|
+--------------------------+
for this schema:
root
|-- after: struct (nullable = true)
| |-- CODE: string (nullable = true)
| |-- CREATED: string (nullable = true)
| |-- ID: long (nullable = true)
| |-- STATUS: string (nullable = true)
| |-- UPDATE_TIME: string (nullable = true)
|-- before: string (nullable = true)
|-- current_ts: string (nullable = true)
|-- op_ts: string (nullable = true)
|-- op_type: string (nullable = true)
|-- pos: string (nullable = true)
|-- primary_keys: array (nullable = true)
| |-- element: string (containsNull = true)
|-- table: string (nullable = true)
|-- tokens: struct (nullable = true)
| |-- csn: string (nullable = true)
| |-- txid: string (nullable = true)
obtained from the same file using:
val df = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)
This latter read is just for proof: it shows the full schema of the same file.
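The same subset trick also works for nested fields. A hedged sketch against the schema above (field names taken from that printed schema), pulling only after.ID, after.CODE, and op_ts from the same file:
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Partial nested schema: only the listed leaves are returned.
val nestedSubset = new StructType()
  .add("after", new StructType()
    .add("ID", LongType, true)
    .add("CODE", StringType, true), true)
  .add("op_ts", StringType, true)

val dfSubset = spark.read.schema(nestedSubset)
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/FileStore/tables/json_stuff.txt")

dfSubset.printSchema()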