pyspark - read csv with custom row delimiter

How can I read a csv file with a custom row delimiter (\x03) using pyspark?
I tried the following code, but it did not work:
df = spark.read.option("lineSep","\x03").csv(path)
display(df)

Works just fine with both OSS Spark (3.2.0) and DBR 9.1 ML:
>>> df = spark.read.option("lineSep","\x03")\
.option("header", "true").csv("/path_to_file.csv")
>>> df.show()
+----+----+
|val1|val2|
+----+----+
| 1| 2|
| 3| 4|
+----+----+
Look for problems inside the file itself, or check your Spark version (the CSV lineSep option is only available in Spark 3.0+).
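One way to narrow it down (a minimal sketch, using the same placeholder path as above): count the \x03 delimiters in the raw file and compare with how many records Spark produces.
# count the 0x03 bytes in the raw file locally...
with open("/path_to_file.csv", "rb") as f:
    raw = f.read()
print(raw.count(b"\x03"))       # number of \x03 delimiters in the file

# ...and compare with how Spark splits the same file into records
records = spark.read.option("lineSep", "\x03").text("/path_to_file.csv")
print(records.count())          # usually delimiters + 1, or equal if the file ends with \x03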

Wrong encoding when reading csv file with pyspark

For my course in university, I run pyspark-notebook docker image
docker pull jupyter/pyspark-notebook
docker run -it --rm -p 8888:8888 -v /path/to/my/working/directory:/home/jovyan/work jupyter/pyspark-notebook
And then run next python code
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark
listings_df = spark.read.csv("listings.csv", header=True, mode='DROPMALFORMED')
# adding encoding="utf8" to the line above doesn't help also
listings_df.printSchema()
The problem appears while reading the file. It seems that Spark reads my file incorrectly (possibly because of an encoding problem?): after reading, listings_df has 16494 lines, while the correct number of lines is 16478 (checked with pandas.read_csv()). You can also see that something is definitely broken by running
listings_df.groupBy("room_type").count().show()
which gives the following output
+---------------+-----+
| room_type|count|
+---------------+-----+
| 169| 1|
| 4.88612| 1|
| 4.90075| 1|
| Shared room| 44|
| 35| 1|
| 187| 1|
| null| 16|
| 70| 1|
| 27| 1|
| 75| 1|
| Hotel room| 109|
| 198| 1|
| 60| 1|
| 280| 1|
|Entire home/apt|12818|
| 220| 1|
| 190| 1|
| 156| 1|
| 450| 1|
| 4.88865| 1|
+---------------+-----+
only showing top 20 rows
while the real room_type values are only ['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'].
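For reference, the pandas check mentioned above was essentially the following (a sketch; pandas' default parser respects quoted line breaks):
import pandas as pd
print(len(pd.read_csv("listings.csv")))   # 16478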
Spark info which might be useful:
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.1.2
Master: local[*]
AppName: pyspark-shell
And the encoding of the file:
!file listings.csv
listings.csv: UTF-8 Unicode text
listings.csv is an Airbnb statistics csv file downloaded from here.
I've also uploaded all the runnable code to Colab.
There are two things that I've found:
Some lines have quotes that need escaping (escape='"')
Also, @JosefZ mentioned unwanted line breaks inside fields (multiLine=True)
This is how you should read it:
input_df = spark.read.csv(path, header=True, multiLine=True, escape='"')
output_df = input_df.groupBy("room_type").count()
output_df.show()
+---------------+-----+
| room_type|count|
+---------------+-----+
| Shared room| 44|
| Hotel room| 110|
|Entire home/apt|12829|
| Private room| 3495|
+---------------+-----+
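As a sanity check (a minimal sketch), with these options the total row count should now match the pandas figure from the question:
print(input_df.count())   # 16478 = 44 + 110 + 12829 + 3495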
I think specifying the encoding should solve the problem, so add encoding="utf8" to the arguments of the spark.read.csv() call that creates listings_df.
As shown below:
listings_df = spark.read.csv("listings.csv", encoding="utf8", header=True, mode='DROPMALFORMED')

spark.createDataFrame() returns a 'NoneType' object

I've parsed a nested json file using the following code:
df = df.select(df.data.attributes.signatures_by_constituency.ons_code.alias("wc_code"), \
               df.data.attributes.signatures_by_constituency.signature_count.alias("sign_count"))
This results in the following dataframe with values sitting in arrays.
+--------------------+--------------------+
| wc_code| sign_count|
+--------------------+--------------------+
|[E14000530, E1400...|[28, 6, 17, 15, 5...|
+--------------------+--------------------+
The next step is to parse the arrays into columns, for which I'm using explode() as follows:
df1 = spark.createDataFrame(df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
    .select("wc_count.wc_code", "wc_count.sign_count")\
    .show()
)
which prints the table but then results in the following error:
+---------+----------+
| wc_code|sign_count|
+---------+----------+
|E14000530| 28|
|E14000531| 6|
|E14000532| 17|
|E14000533| 15|
|E14000534| 54|
|E14000535| 12|
|E14000536| 34|
|E14000537| 10|
|E14000538| 32|
|E14000539| 29|
|E14000540| 3|
|E14000541| 10|
|E14000542| 8|
|E14000543| 13|
|E14000544| 15|
|E14000545| 19|
|E14000546| 8|
|E14000547| 28|
|E14000548| 7|
|E14000549| 13|
+---------+----------+
only showing top 20 rows
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-16ce85050366> in <module>
2
3 df1 = spark.createDataFrame(df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
----> 4 .select("wc_count.wc_code", "wc_count.sign_count")\
5 .show()
6 )
/usr/local/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
305 Py4JJavaError: ...
306 """
--> 307 return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
308
309 #since(1.3)
/usr/local/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
746 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
747 else:
--> 748 rdd, schema = self._createFromLocal(map(prepare, data), schema)
749 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
750 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
TypeError: 'NoneType' object is not iterable
Note that the table is displayed.
I don't get the error if I simply use df1 = ..., but no matter what I do with the variable afterwards, it errors out with 'NoneType' object has no attribute 'something'.
If I don't assign the above to a variable and don't use df1 = spark.createDataFrame(), I don't get this error, so I'm guessing something breaks when the variable gets created.
In case this is important, the df.printSchema() produces the following:
root
|-- wc_code: array (nullable = true)
| |-- element: string (containsNull = true)
|-- sign_count: array (nullable = true)
| |-- element: long (containsNull = true)
Can you try the query below?
val df = spark.sql("select array('E14000530', 'E1400') as wc_code, array(28, 6, 17, 15) as sign_count")
df.selectExpr("inline_outer(arrays_zip(wc_code, sign_count)) as (wc_code, sign_count)").show()
/**
* +---------+----------+
* | wc_code|sign_count|
* +---------+----------+
* |E14000530| 28|
* | E1400| 6|
* | null| 17|
* | null| 15|
* +---------+----------+
*/
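Since the question uses PySpark, the same idea can be written there as well; a sketch via selectExpr, assuming df holds the two array columns wc_code and sign_count:
df.selectExpr("inline_outer(arrays_zip(wc_code, sign_count)) as (wc_code, sign_count)").show()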
I think I've fixed the issue.
Running the code like this seems to have fixed it:
df1 = df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
    .select("wc_count.wc_code", "wc_count.sign_count")
For some reason, calling .show() at the end of it was messing with the newly created dataframe. I would be curious to know whether there's a valid reason for this or it's a bug.
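For what it's worth, the likely explanation (not a bug): DataFrame.show() is an action that prints the rows and returns None, so passing its result into spark.createDataFrame() amounts to calling createDataFrame(None), which is what raises 'NoneType' object is not iterable. A minimal sketch:
result = df1.show()   # prints the table as a side effect
print(result)         # None -- so spark.createDataFrame(result) has nothing to iterate over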

CSV Columns removed From file while loading Dataframe

While loading the csv below via the Databricks csv reader, the 4th column of the 2nd row is not loaded.
The csv's number of columns varies per row.
In test_01.csv,
a,b,c
s,d,a,d
f,s
I loaded the above csv file via the Databricks csv reader as below:
>>> df2 = sqlContext.read.format("com.databricks.spark.csv").load("sample_files/test_01.csv")
>>> df2.show()
+---+---+----+
| C0| C1| C2|
+---+---+----+
| a| b| c|
| s| d| a|
| f| s|null|
+---+---+----+
I tried loading it with textFile:
rdd = sc.textFile ("sample_files/test_01.csv")
rdd.collect()
[u'a,b,c', u's,d,a,d', u'f,s']
But converting the above RDD to a dataframe causes an error.
I was able to solve it by specifying the schema, as below.
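The schema object itself isn't shown in the question; a minimal PySpark sketch of what it could look like, with the placeholder column names e1..e5 taken from the output below:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField(name, StringType(), True)
                     for name in ["e1", "e2", "e3", "e4", "e5"]])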
df2 = sqlContext.read.format("com.databricks.spark.csv").schema(schema).load("sample_files/test_01.csv")
df2.show()
+---+---+----+----+----+
| e1| e2| e3| e4| e5|
+---+---+----+----+----+
| a| b| c|null|null|
| s| d| a| d|null|
| f| s|null|null|null|
+---+---+----+----+----+
I tried with inferSchema; it still doesn't work:
df2 = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")
df2.show()
+---+---+----+
| C0| C1| C2|
+---+---+----+
| a| b| c|
| s| d| a|
| f| s|null|
+---+---+----+
But is there any other way, without specifying a schema, since the number of columns varies?
Ensure you have fixed headers, i.e. rows can have data missing but the column names should be fixed.
If you don't specify column names, you can still create the schema while reading the csv:
val schema = new StructType()
.add(StructField("keyname", StringType, true))

Better way to handle REST calls in pyspark with json schema [duplicate]

I have an existing Spark dataframe that has columns as such:
--------------------
pid | response
--------------------
12 | {"status":"200"}
response is a string column.
Is there a way to cast it into JSON and extract specific fields? Can lateral view be used as it is in Hive? I looked up some examples online that used explode and lateral view, but it doesn't seem to work with Spark 2.1.1.
From pyspark.sql.functions, you can use any of from_json, get_json_object, or json_tuple to extract fields from a JSON string, as below:
>>> from pyspark.sql.functions import json_tuple, from_json, get_json_object
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> l = [(12, '{"status":"200"}'),(13,'{"status":"200","somecol":"300"}')]
>>> df = spark.createDataFrame(l,['pid','response'])
>>> df.show()
+---+--------------------+
|pid| response|
+---+--------------------+
| 12| {"status":"200"}|
| 13|{"status":"200",...|
+---+--------------------+
>>> df.printSchema()
root
|-- pid: long (nullable = true)
|-- response: string (nullable = true)
Using json_tuple:
>>> df.select('pid',json_tuple(df.response,'status','somecol')).show()
+---+---+----+
|pid| c0| c1|
+---+---+----+
| 12|200|null|
| 13|200| 300|
+---+---+----+
Using from_json (note the extra import for the schema types):
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("status", StringType()), StructField("somecol", StringType())])
>>> df.select('pid',from_json(df.response, schema).alias("json")).show()
+---+----------+
|pid| json|
+---+----------+
| 12|[200,null]|
| 13| [200,300]|
+---+----------+
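The from_json result is a single struct column; if flat columns are preferred, the fields can be selected out of the parsed struct (a sketch using the same df and schema):
>>> df.select('pid', from_json(df.response, schema).alias("json")).select('pid', 'json.status', 'json.somecol').show()
The values come out the same as in the get_json_object example below.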
Using get_json_object:
>>> df.select('pid',get_json_object(df.response,'$.status').alias('status'),get_json_object(df.response,'$.somecol').alias('somecol')).show()
+---+------+-------+
|pid|status|somecol|
+---+------+-------+
| 12| 200| null|
| 13| 200| 300|
+---+------+-------+

How to use a JSON mapping file to generate a new DataFrame in Spark using Scala

I have two DataFrames, DF1 and DF2, and a JSON file which I need to use as a mapping file to create another dataframe (DF3).
DF1:
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| 100| John| Mumbai|
| 101| Alex| Delhi|
| 104| Divas|Kolkata|
| 108| Jerry|Chennai|
+-------+-------+-------+
DF2:
+-------+-----------+-------+
|column4| column5|column6|
+-------+-----------+-------+
| S1| New| xxx|
| S2| Old| yyy|
| S5|replacement| zzz|
| S10| New| ppp|
+-------+-----------+-------+
Apart from this, I have one mapping file in JSON format which will be used to generate DF3.
Below is the JSON mapping file:
{"targetColumn":"newColumn1","sourceField1":"column2","sourceField2":"column4"}
{"targetColumn":"newColumn2","sourceField1":"column7","sourceField2":"column5"}
{"targetColumn":"newColumn3","sourceField1":"column8","sourceField2":"column6"}
So from this JSON file I need to create DF3 with the columns listed in the targetColumn field of the mapping: for each entry, if sourceField1 is present in DF1, the target column is mapped from DF1; otherwise it is mapped from sourceField2 in DF2.
Below is the expected output.
+----------+-----------+----------+
|newColumn1| newColumn2|newColumn3|
+----------+-----------+----------+
| John| New| xxx|
| Alex| Old| yyy|
| Divas|replacement| zzz|
| Jerry| New| ppp|
+----------+-----------+----------+
Any help here will be appreciated.
Parse the JSON and create the List of custom objects below:
case class SrcTgtMapping(targetColumn:String,sourceField1:String,sourceField2:String)
val srcTgtMappingList=List(SrcTgtMapping("newColumn1","column2","column4"),SrcTgtMapping("newColumn2","column7","column5"),SrcTgtMapping("newColumn3","column8","column6"))
Add a dummy index column to both dataframes and join them on that column:
import org.apache.spark.sql.functions._
val df1WithIndex=df1.withColumn("index",monotonicallyIncreasingId)
val df2WithIndex=df2.withColumn("index",monotonicallyIncreasingId)
val joinedDf=df1WithIndex.join(df2WithIndex,df1WithIndex.col("index")===df2WithIndex.col("index"))
Create the query and execute it.
val df1Columns=df1WithIndex.columns.toList
val df2Columns=df2WithIndex.columns.toList
val query=srcTgtMappingList.map(stm=>if(df1Columns.contains(stm.sourceField1)) joinedDf.col(stm.sourceField1).alias(stm.targetColumn) else joinedDf.col(stm.sourceField2).alias(stm.targetColumn))
val output=joinedDf.select(query:_*)
output.show
Sample Output:
+----------+-----------+----------+
|newColumn1| newColumn2|newColumn3|
+----------+-----------+----------+
| John| New| xxx|
| Alex| Old| yyy|
| Jerry| New| ppp|
| Divas|replacement| zzz|
+----------+-----------+----------+
Hope this approach helps you.