I need to write data into a JSON file in the format below using PySpark.
{
"list-item": [
{"author":"author1","title":"title1","pages":1,"email":"author1#gmail.com"},
{"author":"author2","title":"title2","pages":2,"email":"author2#gmail.com"},
{"author":"author3","title":"title3","pages":3,"email":"author3#gmail.com"},
{"author":"author4","title":"title4","pages":4,"email":"author4#gmail.com"},
],
"version": 1
}
I have written the PySpark code below, but it adds \" inside each item and " at the beginning and end of each item. How do I remove the backslashes and double quotes?
import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col,to_json,struct,collect_list,lit
from datetime import datetime
from time import time
if __name__ == '__main__':
    spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
    schema = StructType([
        StructField("author", StringType(), False),
        StructField("title", StringType(), False),
        StructField("pages", IntegerType(), False),
        StructField("email", StringType(), False)
    ])
    data = [
        ["author1", "title1", 1, "author1#gmail.com"],
        ["author2", "title2", 2, "author2#gmail.com"],
        ["author3", "title3", 3, "author3#gmail.com"],
        ["author4", "title4", 4, "author4#gmail.com"]
    ]
    df = spark.createDataFrame(data, schema)
    df = df.select(to_json(struct("author", "title", "pages", "email")).alias("json-data")) \
           .agg(collect_list("json-data").alias("list-item"))
    df = df.withColumn("version", lit("1.0").cast(IntegerType()))
    df.printSchema()
    df.show(2, False)
    curDT = datetime.now()
    targetPath = curDT.strftime("%m-%d-%Y-%H-%M-%S")
    df.write.format("json").mode("overwrite").option("escape", "").save(targetPath)
My code writes the JSON with each item enclosed in double quotes and with backslashes, like below. How do I remove those?
{"list-item":["{\"author\":\"author1\",\"title\":\"title1\",\"pages\":1,\"email\":\"author1#gmail.com\"}","{\"author\":\"author2\",\"title\":\"title2\",\"pages\":2,\"email\":\"author2#gmail.com\"}","{\"author\":\"author3\",\"title\":\"title3\",\"pages\\":3,\"email\":\"author3#gmail.com\\"}","{\"author\":\"author4\",\"title\":\"title4\",\"pages\":4,\"email\":\"author4#gmail.com\"}"],"version":1}
The reason is that the elements of the list-item array are plain strings (the output of to_json), so the double quotes inside them have to be escaped with \.
To avoid that you can try:
import pyspark.sql.functions as f
from pyspark.sql.types import *

schema = StructType([
    StructField("author", StringType(), False),
    StructField("title", StringType(), False),
    StructField("pages", IntegerType(), False),
    StructField("email", StringType(), False)
])
data = [
    ["author1", "title1", 1, "author1#gmail.com"],
    ["author2", "title2", 2, "author2#gmail.com"],
    ["author3", "title3", 3, "author3#gmail.com"],
    ["author4", "title4", 4, "author4#gmail.com"]
]
df = spark.createDataFrame(data, schema)
df = df.groupby().agg(
    f.collect_list(
        f.struct(f.col("author"), f.col("title"), f.col("pages"), f.col("email"))
    ).alias("list-item")
)
df = df.withColumn("version", f.lit("1.0").cast(IntegerType()))
df.printSchema()
df.show(2, False)
df.write.format("json").mode("overwrite").option("escape", "").save("./TestJson")
and the output JSON file will look like:
{"list-item":[{"author":"author1","title":"title1","pages":1,"email":"author1#gmail.com"},{"author":"author2","title":"title2","pages":2,"email":"author2#gmail.com"},{"author":"author3","title":"title3","pages":3,"email":"author3#gmail.com"},{"author":"author4","title":"title4","pages":4,"email":"author4#gmail.com"}],"version":1}
my_data=[
{'stationCode': 'NB001',
'summaries': [{'period': {'year': 2017}, 'rainfall': 449},
{'period': {'year': 2018}, 'rainfall': 352.4},
{'period': {'year': 2019}, 'rainfall': 253.2},
{'period': {'year': 2020}, 'rainfall': 283},
{'period': {'year': 2021}, 'rainfall': 104.2}]},
{'stationCode': 'NA003',
'summaries': [{'period': {'year': 2019}, 'rainfall': 58.2},
{'period': {'year': 2020}, 'rainfall': 628.2},
{'period': {'year': 2021}, 'rainfall': 120}]}]
In Pandas I can:
import pandas as pd
from pandas import json_normalize
pd.concat([json_normalize(entry, 'summaries', 'stationCode')
for entry in my_data])
That will give me the following table:
rainfall period.year stationCode
0 449.0 2017 NB001
1 352.4 2018 NB001
2 253.2 2019 NB001
3 283.0 2020 NB001
4 104.2 2021 NB001
0 58.2 2019 NA003
1 628.2 2020 NA003
2 120.0 2021 NA003
Can this be achieved in one line of code in PySpark?
I have tried the code below and it gives me the same result; however, it is quite long. Is there a way to shorten it?
from pyspark.sql import functions as F

df = sc.parallelize(my_data)
df1 = spark.read.json(df)
df1.select("stationCode", "summaries.period.year", "summaries.rainfall").display()
df1 = (df1.select("stationCode", "summaries.period.year", "summaries.rainfall")
          .withColumn("year_rainfall", F.arrays_zip("year", "rainfall"))
          .withColumn("year_rainfall", F.explode("year_rainfall"))
          .select("stationCode",
                  F.col("year_rainfall.rainfall").alias("Rainfall"),
                  F.col("year_rainfall.year").alias("Year")))
df1.display(20, False)
I am new to PySpark, so explanations or pointers to good information sources would be highly appreciated.
What you have looks fine to me and is readable. However, you can also zip and explode directly:
out = (df1.select("stationCode",
F.explode(F.arrays_zip(*["summaries.period.year","summaries.rainfall"])))
.select("stationCode",F.col("col")['0'].alias("year"),F.col("col")['1'].alias("rainfall")))
out.show()
+-----------+----+--------+
|stationCode|year|rainfall|
+-----------+----+--------+
| NB001|2017| 449.0|
| NB001|2018| 352.4|
| NB001|2019| 253.2|
| NB001|2020| 283.0|
| NB001|2021| 104.2|
| NA003|2019| 58.2|
| NA003|2020| 628.2|
| NA003|2021| 120.0|
+-----------+----+--------+
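For comparison, here is an equivalent sketch (my own, assuming df1 is the DataFrame read from my_data as above) that explodes the summaries array first and then picks the nested fields out by dotted path:

from pyspark.sql import functions as F

# Explode the array of summary structs, then project the nested fields.
out2 = (df1.select("stationCode", F.explode("summaries").alias("s"))
           .select("stationCode",
                   F.col("s.period.year").alias("year"),
                   F.col("s.rainfall").alias("rainfall")))
out2.show()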
Consider a sample JSON file with the following data.
{
"Name": "TestName",
"Date": "2021-04-09",
"Readings": [
{
"Id": 1,
"Reading": 5.678,
"datetime": "2021-04-09 00:00:00"
},
{
"Id": 2,
"Reading": 3.692,
"datetime": "2020-04-09 00:00:00"
}
]
}
Define a schema that we can enforce when reading our data.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType
data_schema = StructType(fields=[
    StructField('Name', StringType(), False),
    StructField('Date', StringType(), True),
    StructField('Readings', ArrayType(
        StructType([
            StructField('Id', IntegerType(), False),
            StructField('Reading', DoubleType(), True),
            StructField('datetime', StringType(), True)
        ])
    ))
])
Now we can use our schema to read the JSON files in our directory:
data_df = spark.read.json('/mnt/data/' + '*.json', schema=data_schema)
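One caveat worth noting (my assumption, since the snippet above does not mention it): spark.read.json treats each line as a separate record by default, so if the files are pretty-printed across multiple lines like the sample above, the multiLine option is likely needed:

# Assumes each file holds one pretty-printed JSON object spanning several lines.
data_df = spark.read.json('/mnt/data/' + '*.json', schema=data_schema, multiLine=True)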
We want the data that’s nested in "Readings" so we can use explode to get these sub-columns.
from pyspark.sql.functions import explode
data_df = data_df.select(
"Name",
explode("Readings").alias("ReadingsExplode")
).select("Name", "ReadingsExplode.*")
data_df.show()
This should provide the required output as a flattened DataFrame.
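With the sample file above, the flattened DataFrame should look roughly like this:

+--------+---+-------+-------------------+
|    Name| Id|Reading|           datetime|
+--------+---+-------+-------------------+
|TestName|  1|  5.678|2021-04-09 00:00:00|
|TestName|  2|  3.692|2020-04-09 00:00:00|
+--------+---+-------+-------------------+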
I have many "wide" CSV files (100+ columns) in a directory. I think I read somewhere that by applying a schema I can preselect the columns that should be read. Unfortunately, my code only returns NULLs.
Does somebody know if my assumption about the schema is wrong? The path in the read statement in the code below is correct.
Here is the code:
from pyspark.sql import functions as F
from pyspark.sql import types as T
DCU_schema = T.StructType([
    T.StructField("consistId", T.StringType(), True),
    T.StructField("subsystemId", T.StringType(), True),
    T.StructField("E13", T.BooleanType(), True),
    T.StructField("E40", T.BooleanType(), True),
    T.StructField("Strom_links", T.DoubleType(), True),
    T.StructField("Strom_rechts", T.DoubleType(), True),
    T.StructField("Spannung_links", T.DoubleType(), True),
    T.StructField("Spannung_rechts", T.DoubleType(), True),
    T.StructField("Position_links", T.IntegerType(), True),
    T.StructField("Position_rechts", T.IntegerType(), True),
    T.StructField("canTimeStamp", T.LongType(), True),
    T.StructField("latitude", T.DoubleType(), True),
    T.StructField("longitude", T.DoubleType(), True),
    T.StructField("fileName", T.StringType(), True)
])
first_kb_df = (spark.read.csv(path=path, schema=DCU_schema, inferSchema=False, header=True, sep=";")
.orderBy("canTimeStamp"))
display(first_kb_df)
Attached is also a screenshot of the result.
Thanks in advance for your help and best regards
Alex
Screenshot of Returned Data
Screenshot of Input Data
From the Microsoft Documentation
import org.apache.spark.sql.types._
val schema = new StructType()
.add("_c0",IntegerType,true)
.add("carat",DoubleType,true)
.add("cut",StringType,true)
.add("color",StringType,true)
.add("clarity",StringType,true)
.add("depth",DoubleType,true)
.add("table",DoubleType,true)
.add("price",IntegerType,true)
.add("x",DoubleType,true)
.add("y",DoubleType,true)
.add("z",DoubleType,true)
val diamonds_with_schema = spark.read.format("csv")
.option("header", "true")
.schema(schema)
.load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
I'm trying to read a CSV file with PySpark. The CSV file contains some meta-information rows and data rows, which have different numbers of columns and different structures.
Excel has no problem reading this file.
I would like to define a custom schema in Spark to read this file.
Here is an Example:
HEADER_TAG\tHEADER_VALUE
FORMAT\t2.00
NUMBER_PASSES\t0001
"Time"\t"Name"\t"Country"\t"City"\t"Street"\t"Phone1"\t"Phone2"
0.49\tName1\tUSA\tNewYork\t5th Avenue\t123456\t+001236273
0.5\tName2\tUSA\tWashington\t524 Street\t222222\t+0012222
0.62\tName3\tGermany\tBerlin\tLinden Strasse\t3434343\t+491343434
NUM_DATA_ROWS\t3
NUM_DATA_COLUMNS\t7
START_TIME_FORMAT\tMM/dd/yyyy HH:mm:ss
START_TIME\t06/04/2019 13:04:23
END_HEADER
Without a pre-defined schema, Spark reads only 2 columns:
df_static = spark.read.options(header='false', inferschema='true', multiLine=True, delimiter = "\t",mode="PERMISSIVE",).csv("/FileStore/111.txt")
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
Define a custom schema:
from pyspark.sql.types import *
# define it as per your data types
user_schema = StructType([
    StructField("time", TimestampType(), True),
    StructField("name", StringType(), True),
    StructField("Country", StringType(), True),
    StructField("City", StringType(), True),
    StructField("Street", StringType(), True),  # 7 data columns in the file, per NUM_DATA_COLUMNS
    StructField("Phone1", StringType(), True),
    StructField("Phone2", StringType(), True)])
Refer: https://spark.apache.org/docs/2.1.2/api/python/_modules/pyspark/sql/types.html
df_static = spark.read.schema(user_schema).options(header='false', multiLine=True, delimiter = "\t", mode="PERMISSIVE").csv("/FileStore/111.txt")
I have multiple schemas like the ones below:
user_schema1 = StructType([
    StructField("time", TimestampType(), True),
    StructField("name", StringType(), True),
    StructField("Country", StringType(), True),
    ...
])
user_schema2 = StructType([
    ...
    StructField("Phone1", StringType(), True),
    StructField("Phone2", StringType(), True)])

# I want to pass the schema name dynamically here:
df_static = spark.read.schema(user_schema(send schema name dynamic)).options(header='false', multiLine=True, delimiter="\t", mode="PERMISSIVE").csv("/FileStore/111.txt")
Kindly provide me with a solution.
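One way this is commonly handled (a sketch on my part, not from the original thread; the schema contents and variable names are illustrative) is to keep the schemas in a dict and look the right one up by name at read time:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical registry of named schemas; extend with the real field lists.
schemas = {
    "user_schema1": StructType([
        StructField("time", TimestampType(), True),
        StructField("name", StringType(), True),
        StructField("Country", StringType(), True),
    ]),
    "user_schema2": StructType([
        StructField("Phone1", StringType(), True),
        StructField("Phone2", StringType(), True),
    ]),
}

schema_name = "user_schema1"  # could come from a job parameter or config
df_static = (spark.read.schema(schemas[schema_name])
             .options(header='false', multiLine=True, delimiter="\t", mode="PERMISSIVE")
             .csv("/FileStore/111.txt"))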
JSON byte data is streamed from kafka-console-producer, and PySpark has to parse this JSON data into a DataFrame.
I have tried to parse the JSON using the schema below, but it gives me the error "AssertionError: keyType should be DataType".
What do I need to do to parse JSON with a custom schema?
schema = StructType()\
.add("contact_id", LongType())\
.add("first_name", StringType())\
.add("last_name", StringType())\
.add("contact_number", MapType(StringType,
StructType()
.add("home", LongType())
.add("contry_code", StringType())))
I am expecting JSON data in this format:
{"contact_id":"23","first_name":"John","last_name":"Doe","contact_number":{"home":4564564567,"country_code":"+1"}}
I have found the solution. This should be the correct schema definition.
schema = StructType([
StructField('contactId', LongType(), True),
StructField('firstName', StringType(), True),
StructField('lastName', StringType(), True),
StructField("contactNumber", ArrayType(
StructType([
StructField("type", StringType(), True),
StructField("number", LongType(), True),
StructField("countryCode", StringType(), True)
])
), True)
])
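For the JSON sample shown in the question itself, a schema that mirrors that exact shape would look slightly different; the sketch below is mine, and it also notes the root cause of the original error, which is that MapType was given the class StringType instead of an instance StringType():

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Matches the sample record from the question. Every type is instantiated
# (StringType(), not StringType); passing the bare class into MapType is what
# raised "AssertionError: keyType should be DataType".
schema = StructType([
    StructField("contact_id", StringType(), True),   # "23" is quoted in the sample, so it arrives as a string
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("contact_number", StructType([
        StructField("home", LongType(), True),
        StructField("country_code", StringType(), True),
    ]), True),
])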
I was trying to load a CSV file with a custom schema, but every time I end up with the following error:
Project_Bank.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [110, 111, 13, 10]
This is what my program looks like, along with my CSV file entries:
age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
My code:
$spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val bankSchema = StructType(Array(
  StructField("age", IntegerType, true),
  StructField("job", StringType, true),
  StructField("marital", StringType, true),
  StructField("education", StringType, true),
  StructField("default", StringType, true),
  StructField("balance", IntegerType, true),
  StructField("housing", StringType, true),
  StructField("loan", StringType, true),
  StructField("contact", StringType, true),
  StructField("day", IntegerType, true),
  StructField("month", StringType, true),
  StructField("duration", IntegerType, true),
  StructField("campaign", IntegerType, true),
  StructField("pdays", IntegerType, true),
  StructField("previous", IntegerType, true),
  StructField("poutcome", StringType, true),
  StructField("y", StringType, true)))
val df = sqlContext.
  read.
  schema(bankSchema).
  option("header", "true").
  option("delimiter", ";").
  load("/user/amit.kudnaver_gmail/hadoop/project_bank/Project_Bank.csv").toDF()
df.registerTempTable("people")
df.printSchema()
val distinctage = sqlContext.sql("select distinct age from people")
Any suggestions as to why I am not able to work with the CSV file here after applying the correct schema? Thanks in advance for your advice.
Thanks
Amit K
The problem here is that the DataFrame reader expects a Parquet file by default while processing the data. To handle the data as CSV, here is what you can do.
First of all, remove the header row from the data.
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
Next, we write the following code to read the data.
Create a case class:
case class BankSchema(age: Int, job: String, marital:String, education:String, default:String, balance:Int, housing:String, loan:String, contact:String, day:Int, month:String, duration:Int, campaign:Int, pdays:Int, previous:Int, poutcome:String, y:String)
Read data from HDFS and parse it
val bankData = sc.textFile("/user/myuser/Project_Bank.csv").map(_.split(";")).map(p => BankSchema(p(0).toInt, p(1), p(2),p(3),p(4), p(5).toInt, p(6), p(7), p(8), p(9).toInt, p(10), p(11).toInt, p(12).toInt, p(13).toInt, p(14).toInt, p(15), p(16))).toDF()
Then register the table and execute queries:
bankData.registerTempTable("bankData")
val distinctage = sqlContext.sql("select distinct age from bankData")
Here is what the output would look like
+---+
|age|
+---+
| 33|
| 44|
| 58|
+---+
Here the expected file format is CSV, but as per the error it is looking for the Parquet format.
This can be overcome by explicitly specifying the file format as below (which was missing in the code shared in the question), because if we don't specify the file format, Parquet is expected by default.
A Java version (sample example):
Dataset<Row> resultData = session.read().format("csv")
.option("sep", ",")
.option("header", true)
.option("mode", "DROPMALFORMED")
.schema(definedSchema)
.load(inputPath);
Here, the schema can be defined either by using a Java class (i.e. a POJO class) or by using StructType, as already mentioned.
And inputPath is the path of the input CSV file.
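For completeness, a PySpark sketch of the same fix, explicitly declaring the CSV source (only the first few schema fields are shown here; in practice list all the columns as in the question's schema):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Trimmed mirror of the question's schema; extend with the remaining columns
# in the same way so the field count matches the file.
bank_schema = StructType([
    StructField("age", IntegerType(), True),
    StructField("job", StringType(), True),
    StructField("marital", StringType(), True),
])

df = (spark.read.format("csv")          # the missing piece: declare the CSV source explicitly
      .option("header", "true")
      .option("delimiter", ";")
      .schema(bank_schema)
      .load("/user/amit.kudnaver_gmail/hadoop/project_bank/Project_Bank.csv"))
df.createOrReplaceTempView("people")
distinct_age = spark.sql("select distinct age from people")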