Actual column name after flattening Nested JSON using PySpark - json

I have flattened a nested JSON file, and now I am facing an ambiguity issue when trying to get the actual column names using PySpark.
The DataFrame has the following schema.
Before flattening:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- foo: struct (nullable = true)
| |-- a: float (nullable = true)
| |-- b: float (nullable = true)
| |-- c: integer (nullable = true)
After Flattening:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- foo_a: float (nullable = true)
|-- foo_b: float (nullable = true)
|-- foo_c: integer (nullable = true)
Is it possible to get only the actual column names in the DataFrame, as shown below?
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- a: float (nullable = true)
|-- b: float (nullable = true)
|-- c: integer (nullable = true)

Yes, just do the following instead of flattening:
df.select("*", "foo.*").drop("foo")
or
df.select("x", "y", "foo.*")
The foo.* syntax pulls all the fields out of the struct and puts them at the top level.
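For example, a minimal self-contained PySpark sketch (the sample values are made up; the schema matches the question) showing that the struct fields come out under their bare names:
from pyspark.sql import SparkSession

# Hand-built DataFrame with the same shape as the question's schema (values are assumptions).
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("x1", "y1", (1.0, 2.0, 3))],
    "x string, y string, foo struct<a:float,b:float,c:int>",
)
df.select("x", "y", "foo.*").printSchema()  # x, y, a, b, c -- no foo_ prefix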

Related

infer schema with complex type

I have a text file with complex-type columns. Could you please tell me how to automatically infer a schema with array, map, and struct types in Spark?
Source:
name,work_place,gender_age,skills_score,depart_title,work_contractor
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
code example:
val employeeComplexDF = spark
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("src/main/resources/employee_complex/employee.txt")
Parsed schema (actual result):
root
|-- name: string (nullable = true)
|-- work_place: string (nullable = true)
|-- gender_age: string (nullable = true)
|-- skills_score: string (nullable = true)
|-- depart_title: string (nullable = true)
|-- work_contractor: string (nullable = true)
The expected schema is one with ArrayType, ...
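The question's code is Scala, but the point holds in either API: CSV schema inference only ever produces atomic types, so the complex columns have to be built explicitly. Below is a PySpark sketch of one possible approach; the path and column names come from the question, while the split/str_to_map logic is an assumption rather than anything from the original post.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The data rows are pipe-delimited while the header line is comma-delimited,
# so read the file as plain text, drop the header, and split each record ourselves.
raw = (spark.read.text("src/main/resources/employee_complex/employee.txt")
       .filter(~F.col("value").startswith("name,"))
       .withColumn("p", F.split("value", r"\|")))

employee = raw.select(
    F.col("p")[0].alias("name"),
    F.split(F.col("p")[1], ",").alias("work_place"),              # array<string>
    F.struct(
        F.split(F.col("p")[2], ",")[0].alias("gender"),
        F.split(F.col("p")[2], ",")[1].cast("int").alias("age"),
    ).alias("gender_age"),                                         # struct<gender, age>
    F.expr("str_to_map(p[3], ',', ':')").alias("skills_score"),    # map<string, string>
    F.expr("str_to_map(p[4], ',', ':')").alias("depart_title"),    # map<string, string>
)
employee.printSchema()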

Spark - Not able to read a .csv file with Multiple lines record to a dataframe

I am trying to read a .csv file with multiline records into a Spark data frame. My .csv looks like the example below; the first line is a header.
Software,Version,Date,Update Date,Extended Support,Reference,Notes
Windows,Windows XP,12/28/2022,12/28/2023,12/28/2024,https://www.software.com/,"Some notes"
VxWorks,VxWorks ,,,12/28/2024,https://www.software.com/,"Some Notes
with multiple lines"
I am using the below code to read the above file.
val df = spark.read
.option("header", true)
.option("sep", ",")
.option("inferSchema", false)
.option("multiLine", true)
.option("escape","\"")
.csv(s"${file_path}")
This reads all the row values as columns. I am not sure where it is going wrong.
scala> df.printSchema()
root
|-- Software: string (nullable = true)
|-- Version: string (nullable = true)
|-- Date: string (nullable = true)
|-- Update Date: string (nullable = true)
|-- Extended Support: string (nullable = true)
|-- Reference: string (nullable = true)
Windows: string (nullable = true)
|-- Windows XP: string (nullable = true)
|-- 12/28/2022: string (nullable = true)
|-- 12/28/2023: string (nullable = true)
|-- 12/28/2024: string (nullable = true)
|-- https://www.software.com/: string (nullable = true)
|-- "Some notes
VxWorks: string (nullable = true)
|-- VxWorks: string (nullable = true)
|-- <blank>: string (nullable = true)
|-- <blank>: string (nullable = true)
|-- 12/28/2024: string (nullable = true)
|-- https://www.software.com/: string (nullable = true)
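No answer is recorded here, but notice that the printed schema shows the data values themselves becoming column names, which usually means the whole file was parsed as a single record and that record was taken as the header. A hedged diagnostic sketch (in PySpark syntax for brevity; the lineSep and encoding values are guesses to try, not a confirmed fix):
# Check how many records Spark actually sees in the raw file.
raw = spark.read.text(file_path)
print(raw.count())               # expect one row per physical CSV line

# Re-read with the quote character stated explicitly alongside escape; if the file
# uses unusual line endings or encoding, forcing them is worth a try.
df = (spark.read
      .option("header", True)
      .option("multiLine", True)
      .option("quote", '"')
      .option("escape", '"')
      .option("lineSep", "\r\n")   # assumption: also try "\n"
      .option("encoding", "UTF-8") # assumption
      .csv(file_path))
df.printSchema()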

PySpark Error while trying to analyze twitter dataset

I'm trying to analyze massive Twitter dumps, which my professor collected over a period of 2 years into multiple parquet files. I have run all my code with one file and everything runs smoothly, but I consistently face issues when I try to read and analyze all the parquet files. This is one example of the error I'm facing. I'm trying to create a temp view so that I can run SQL queries against the data.
Please find my Spark Configuration:
[('spark.app.name', 'Colab'),
('spark.driver.memory', '50g'),
('spark.app.id', 'local-1636990273526'),
('spark.memory.offHeap.size', '20g'),
('spark.app.startTime', '1636990272425'),
('spark.driver.port', '40705'),
('spark.executor.id', 'driver'),
('spark.driver.host', '3c3ebb00872b'),
('spark.sql.warehouse.dir', 'file:/content/spark-warehouse'),
('spark.memory.offHeap.enabled', 'true'),
('spark.rdd.compress', 'True'),
('spark.serializer.objectStreamReset', '100'),
('spark.master', 'local[*]'),
('spark.submit.pyFiles', ''),
('spark.submit.deployMode', 'client'),
('spark.ui.showConsoleProgress', 'true'),
('spark.executor.memory', '50g')]
The spark dataframe has 393 million rows and 383 columns.
Whenever I try to run the following code:
df.createOrReplaceTempView("twitlogs")
tbl = spark.sql("""
SELECT * FROM twitlogs LIMIT 20
""")
tbl.show()
I get the following error:
Py4JJavaError: An error occurred while calling o114.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 34 in stage 7.0 failed 1 times, most recent failure: Lost task 34.0 in stage 7.0 (TID 1219) (3c3ebb00872b executor driver): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file file:///content/drive/MyDrive/ProjAbe/tweets-2020-08-25T09___30___02Z.parq. Column: [geo], Expected: double, Found: BINARY
at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:570)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:172)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:268)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1077)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:172)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:154)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:184)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
... 16 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2403)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2352)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2351)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2351)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1109)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1109)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1109)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2591)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2533)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file file:///content/drive/MyDrive/ProjAbe/tweets-2020-08-25T09___30___02Z.parq. Column: [geo], Expected: double, Found: BINARY
at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedSchemaColumnConvertError(QueryExecutionErrors.scala:570)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:172)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:268)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1077)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:172)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:154)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:184)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
... 16 more
I have run the same code with just 1 parquet file which has about 183000 rows and it runs smoothly.
I also get the same error when I run this code trying to find missing values:
from pyspark.sql.functions import col, count, isnan, when

def get_dtype(df, colname):
    return [dtype for name, dtype in df.dtypes if name == colname][0]

df_mis = df.select([count(when(col(c).isNull() if get_dtype(df, c) not in ['double', 'float'] else isnan(col(c)), c)).alias(c) for c in df.columns])
df_mis.show()
I think it's a memory issue, but I'm not sure how to deal with it. I have spent a whole day doing extensive research on the issue and have come up with nothing. Please help.
Update 11/16/21: After being asked to include the schema, I could only include part of it, as there is a character limit on Stack Overflow. Please note that the dataset has 383 columns.
root
|-- created_at: string (nullable = true)
|-- id: long (nullable = true)
|-- id_str: string (nullable = true)
|-- text: string (nullable = true)
|-- display_text_range: string (nullable = true)
|-- source: string (nullable = true)
|-- truncated: boolean (nullable = true)
|-- in_reply_to_status_id: double (nullable = true)
|-- in_reply_to_status_id_str: string (nullable = true)
|-- in_reply_to_user_id: double (nullable = true)
|-- in_reply_to_user_id_str: string (nullable = true)
|-- in_reply_to_screen_name: string (nullable = true)
|-- geo: double (nullable = true)
|-- coordinates: double (nullable = true)
|-- place: double (nullable = true)
|-- contributors: string (nullable = true)
|-- is_quote_status: boolean (nullable = true)
|-- quote_count: long (nullable = true)
|-- reply_count: long (nullable = true)
|-- retweet_count: long (nullable = true)
|-- favorite_count: long (nullable = true)
|-- favorited: boolean (nullable = true)
|-- retweeted: boolean (nullable = true)
|-- filter_level: string (nullable = true)
|-- lang: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- user_id: long (nullable = true)
|-- user_id_str: string (nullable = true)
|-- user_name: string (nullable = true)
|-- user_screen_name: string (nullable = true)
|-- user_location: string (nullable = true)
|-- user_url: string (nullable = true)
|-- user_description: string (nullable = true)
|-- user_translator_type: string (nullable = true)
|-- user_protected: boolean (nullable = true)
|-- user_verified: boolean (nullable = true)
|-- user_followers_count: long (nullable = true)
|-- user_friends_count: long (nullable = true)
|-- user_listed_count: long (nullable = true)
|-- user_favourites_count: long (nullable = true)
|-- user_statuses_count: long (nullable = true)
|-- user_created_at: string (nullable = true)
|-- user_utc_offset: string (nullable = true)
|-- user_time_zone: string (nullable = true)
|-- user_geo_enabled: boolean (nullable = true)
|-- user_lang: string (nullable = true)
|-- user_contributors_enabled: boolean (nullable = true)
|-- user_is_translator: boolean (nullable = true)
|-- user_profile_background_color: string (nullable = true)
|-- user_profile_background_image_url: string (nullable = true)
|-- user_profile_background_image_url_https: string (nullable = true)
|-- user_profile_background_tile: boolean (nullable = true)
|-- user_profile_link_color: string (nullable = true)
|-- user_profile_sidebar_border_color: string (nullable = true)
|-- user_profile_sidebar_fill_color: string (nullable = true)
|-- user_profile_text_color: string (nullable = true)
|-- user_profile_use_background_image: boolean (nullable = true)
|-- user_profile_image_url: string (nullable = true)
|-- user_profile_image_url_https: string (nullable = true)
|-- user_profile_banner_url: string (nullable = true)
|-- user_default_profile: boolean (nullable = true)
|-- user_default_profile_image: boolean (nullable = true)
|-- user_following: string (nullable = true)
|-- user_follow_request_sent: string (nullable = true)
|-- user_notifications: string (nullable = true)
|-- entities_hashtags: string (nullable = true)
|-- entities_urls: string (nullable = true)
|-- entities_user_mentions: string (nullable = true)
|-- entities_symbols: string (nullable = true)
|-- retweeted_status_created_at: string (nullable = true)
|-- retweeted_status_id: double (nullable = true)
|-- retweeted_status_id_str: string (nullable = true)
|-- retweeted_status_text: string (nullable = true)
|-- retweeted_status_display_text_range: string (nullable = true)
|-- retweeted_status_source: string (nullable = true)
|-- retweeted_status_truncated: boolean (nullable = true)
|-- retweeted_status_in_reply_to_status_id: double (nullable = true)
|-- retweeted_status_in_reply_to_status_id_str: string (nullable = true)
|-- retweeted_status_in_reply_to_user_id: double (nullable = true)
|-- retweeted_status_in_reply_to_user_id_str: string (nullable = true)
|-- retweeted_status_in_reply_to_screen_name: string (nullable = true)
|-- retweeted_status_user_id: double (nullable = true)
|-- retweeted_status_user_id_str: string (nullable = true)
|-- retweeted_status_user_name: string (nullable = true)
|-- retweeted_status_user_screen_name: string (nullable = true)
|-- retweeted_status_user_location: string (nullable = true)
|-- retweeted_status_user_url: string (nullable = true)
|-- retweeted_status_user_description: string (nullable = true)
|-- retweeted_status_user_translator_type: string (nullable = true)
|-- retweeted_status_user_protected: boolean (nullable = true)
|-- retweeted_status_user_verified: boolean (nullable = true)
|-- retweeted_status_user_followers_count: double (nullable = true)
|-- retweeted_status_user_friends_count: double (nullable = true)
|-- retweeted_status_user_listed_count: double (nullable = true)
|-- retweeted_status_user_favourites_count: double (nullable = true)
|-- retweeted_status_user_statuses_count: double (nullable = true)
|-- retweeted_status_user_created_at: string (nullable = true)
|-- retweeted_status_user_utc_offset: double (nullable = true)
|-- retweeted_status_user_time_zone: double (nullable = true)
|-- retweeted_status_user_geo_enabled: boolean (nullable = true)
|-- retweeted_status_user_lang: double (nullable = true)
|-- retweeted_status_user_contributors_enabled: boolean (nullable = true)
|-- retweeted_status_user_is_translator: boolean (nullable = true)
|-- retweeted_status_user_profile_background_color: string (nullable = true)
|-- retweeted_status_user_profile_background_image_url: string (nullable = true)
|-- retweeted_status_user_profile_background_image_url_https: string (nullable = true)
|-- retweeted_status_user_profile_background_tile: boolean (nullable = true)
|-- retweeted_status_user_profile_link_color: string (nullable = true)
|-- retweeted_status_user_profile_sidebar_border_color: string (nullable = true)
|-- retweeted_status_user_profile_sidebar_fill_color: string (nullable = true)
|-- retweeted_status_user_profile_text_color: string (nullable = true)
|-- retweeted_status_user_profile_use_background_image: boolean (nullable = true)
|-- retweeted_status_user_profile_image_url: string (nullable = true)
|-- retweeted_status_user_profile_image_url_https: string (nullable = true)
|-- retweeted_status_user_profile_banner_url: string (nullable = true)
|-- retweeted_status_user_default_profile: boolean (nullable = true)
|-- retweeted_status_user_default_profile_image: boolean (nullable = true)
|-- retweeted_status_user_following: double (nullable = true)
|-- retweeted_status_user_follow_request_sent: double (nullable = true)
|-- retweeted_status_user_notifications: double (nullable = true)
|-- retweeted_status_geo: double (nullable = true)
|-- retweeted_status_coordinates: double (nullable = true)
|-- retweeted_status_place: double (nullable = true)
|-- retweeted_status_contributors: double (nullable = true)
|-- retweeted_status_is_quote_status: boolean (nullable = true)
|-- retweeted_status_extended_tweet_full_text: string (nullable = true)
Update 11/18/2021: I have made further discoveries about my error. It seems that some of the parquet files have the column "geo" as string instead of double. After trying mergeSchema, I get a similar error: "Failed to merge fields 'geo' and 'geo'. Failed to merge incompatible data types double and string".
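One way to work around this kind of per-file schema drift (a sketch, not an answer from the original thread; the glob pattern and the choice of string as the common type are assumptions) is to load each file with its own schema, cast the conflicting column, and union the results:
import glob
from functools import reduce
from pyspark.sql import functions as F

# Directory taken from the error message; the *.parq pattern is an assumption.
paths = glob.glob("/content/drive/MyDrive/ProjAbe/*.parq")

# "geo" is the column named in the error; other drifting columns may need the same cast.
parts = [
    spark.read.parquet(p).withColumn("geo", F.col("geo").cast("string"))
    for p in paths
]
df = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), parts)
If geo is not needed for the analysis, dropping it (or selecting only the required columns) per file before the union avoids the conversion entirely.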

Converting List with string to json pyspark

I have a pyspark dataframe with an input schema like
|-- runName: string (nullable = true)
|-- action_name: string (nullable = true)
|-- model_payload: string (nullable = true)
|-- model_type: string (nullable = true)
|-- did_pass: string (nullable = true)
|-- ymd: string (nullable = false)
Inside model_payload is a list containing JSON, and I want to pull the data out of it and create a separate dataframe. However, at the moment model_payload is a string. The desired schema for the extracted data is:
root
|-- dataset_A: string (nullable = true)
|-- dataset_B: string (nullable = true)
|-- ks_statistic: double (nullable = true)
|-- pvalue: double (nullable = true)
|-- rejected_hypothesis: boolean (nullable = true)
|-- target_ks_statistic: double (nullable = true)
|-- target_pvalue: double (nullable = true)
|-- action: string (nullable = true)
where the JSON in model_payload looks like:
d = {
"dataset_A": str,
"dataset_B": str,
"ks_statistic": str,
"pvalue": str,
"rejected_hypothesis": bool,
"target_ks_statistic": str,
"target_pvalue": str,
}
The only solution I've found so far is to transform this into a pandas dataframe and use json.loads(). However, this is very slow and not suitable for large datasets.

According to your payload, you have to create the corresponding struct in PySpark and use it to parse your data:
from pyspark.sql import functions as F, types as T
schm = T.StructType(
[
T.StructField("dataset_A", T.StringType()),
T.StructField("dataset_B", T.StringType()),
T.StructField("ks_statistic", T.StringType()),
T.StructField("pvalue", T.StringType()),
T.StructField("rejected_hypothesis", T.BooleanType()),
T.StructField("target_ks_statistic", T.StringType()),
T.StructField("target_pvalue", T.StringType()),
]
)
df.withColumn("model_payload", F.from_json("model_payload", schm)).select(
"model_payload.*"
)
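Since the question says model_payload actually holds a list of such objects, wrapping the struct in an ArrayType and exploding is a small extension of the same idea (a sketch reusing the schm defined above):
payload_df = (
    df.withColumn("model_payload", F.from_json("model_payload", T.ArrayType(schm)))
      .withColumn("model_payload", F.explode("model_payload"))
      .select("model_payload.*")
)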

How to create a schema from JSON file using Spark Scala for subset of fields?

I am trying to create a schema for a nested JSON file so that it can become a dataframe.
However, I am not sure if there is a way to create a schema without defining all the fields in the JSON file when I only need the 'id' and 'text' fields from it - a subset.
I am currently doing this using Scala in the spark shell. As you can see from the file, I downloaded it as part-00000 from HDFS.
From the manuals on JSON:
Apply the schema using the .schema method. This read returns only
the columns specified in the schema.
So you are good to go with what you imply.
E.g.
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val schema = new StructType()
.add("op_ts", StringType, true)
val df = spark.read.schema(schema)
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)
returns:
root
|-- op_ts: string (nullable = true)
+--------------------------+
|op_ts |
+--------------------------+
|2019-05-31 04:24:34.000327|
+--------------------------+
for this schema:
root
|-- after: struct (nullable = true)
| |-- CODE: string (nullable = true)
| |-- CREATED: string (nullable = true)
| |-- ID: long (nullable = true)
| |-- STATUS: string (nullable = true)
| |-- UPDATE_TIME: string (nullable = true)
|-- before: string (nullable = true)
|-- current_ts: string (nullable = true)
|-- op_ts: string (nullable = true)
|-- op_type: string (nullable = true)
|-- pos: string (nullable = true)
|-- primary_keys: array (nullable = true)
| |-- element: string (containsNull = true)
|-- table: string (nullable = true)
|-- tokens: struct (nullable = true)
| |-- csn: string (nullable = true)
| |-- txid: string (nullable = true)
obtained from the same file using:
val df = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)
The latter is just for proof.
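The question asks for a subset of nested fields ('id' and 'text'); the same trick extends to nested structs by declaring only the branches you need. A minimal PySpark sketch (the field names and file name come from the question, the layout of the JSON is an assumption):
from pyspark.sql import types as T

# Only the fields we care about; everything else in the JSON is simply ignored.
subset_schema = T.StructType([
    T.StructField("id", T.LongType()),
    T.StructField("text", T.StringType()),
])

df = (spark.read
      .schema(subset_schema)
      .option("multiLine", True)
      .option("mode", "PERMISSIVE")
      .json("part-00000"))
df.printSchema()
If id and text sit inside a parent struct, declare that parent as a StructType containing only those two fields and read with the same options.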