Count occurrences of an element in a PySpark DataFrame - csv

I have a csv file, marks.csv. I have read it using PySpark and created a DataFrame df.
The csv file looks like this:
sub1,sub2,sub3
a,a,b
b,b,a
c,a,b
How can I get the count of 'a' in each column of the DataFrame df?
Thanks.

Since we can leverage SQL features in Spark, we can simply do the following:
df.selectExpr(
    "sum(if(sub1 = 'a', 1, 0)) as count1",
    "sum(if(sub2 = 'a', 1, 0)) as count2",
    "sum(if(sub3 = 'a', 1, 0)) as count3").show()
It should give output as below:
+------+------+------+
|count1|count2|count3|
+------+------+------+
| 1| 2| 1|
+------+------+------+
To learn more, see the Spark SQL documentation.
EDIT:
If you want to do it for all columns then you can try something like below:
# Loop through all columns, compute a one-row aggregate per column,
# and cross-join the one-row results together
final_out = None
for col_name in df.columns:
    col_count = df.selectExpr("sum(if(" + col_name + " = 'a', 1, 0)) as " + col_name)
    final_out = col_count if final_out is None else final_out.crossJoin(col_count)
final_out.show()
It should give you output like below:
+----+----+----+
|sub1|sub2|sub3|
+----+----+----+
| 1| 2| 1|
+----+----+----+
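If you prefer, the same per-column counts can be computed in a single pass without the cross joins; this is only a sketch, assuming the same df as in the question:
# Build one aggregate expression per column and evaluate them all in one selectExpr
exprs = ["sum(if({0} = 'a', 1, 0)) as {0}".format(c) for c in df.columns]
df.selectExpr(*exprs).show()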

You can use a CASE WHEN expression (via when/otherwise) to get the count of "a" in each column:
import pyspark.sql.functions as F
df2 = df.select(
    F.sum(F.when(df["sub1"] == "a", 1).otherwise(0)).alias("sub1_cnt"),
    F.sum(F.when(df["sub2"] == "a", 1).otherwise(0)).alias("sub2_cnt"),
    F.sum(F.when(df["sub3"] == "a", 1).otherwise(0)).alias("sub3_cnt"))
df2.show()
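To apply the same when/otherwise pattern to every column without writing each expression by hand, you could try something like the sketch below (assuming the df from the question):
import pyspark.sql.functions as F

# One sum(when(...)) aggregate per column, built with a list comprehension
counts = df.select([
    F.sum(F.when(F.col(c) == "a", 1).otherwise(0)).alias(c + "_cnt")
    for c in df.columns
])
counts.show()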

Related

Writing out spark dataframe as nested JSON doc

I have a spark dataframe as:
A      B      val_of_B   val1   val2   val3   val4
"c1"   "MCC"  "cd1"      1      2      1.1    1.05
"c1"   "MCC"  "cd2"      2      3      1.1    1.05
"c1"   "MCC"  "cd3"      3      4      1.1    1.05
val1 and val2 are obtained with a group by on A, B and val_of_B, whereas val3 and val4 are A-level information only (for example, taking the distinct of A and val3 gives only "c1", 1.1).
I would like to write this out as nested JSON, which should look like:
For each A, JSON format should look like
{"val3": 1.1, "val4": 1.05, "MCC":[["cd1",1,2], ["cd2",2,3], ["cd3",3,4]]}
Is it possible to accomplish this with existing tools under spark api? If not, can you provide guidelines?
You should groupBy on column A and aggregate the necessary columns using the first, collect_list and array built-in functions:
import org.apache.spark.sql.functions._

def zipping = udf((arr1: Seq[String], arr2: Seq[Seq[String]]) => arr1.indices.map(index => Array(arr1(index)) ++ arr2(index)))

val jsonDF = df.groupBy("A")
  .agg(first(col("val3")).as("val3"), first(col("val4")).as("val4"), first(col("B")).as("B"), collect_list("val_of_B").as("val_of_B"), collect_list(array("val1", "val2")).as("list"))
  .select(col("val3"), col("val4"), col("B"), zipping(col("val_of_B"), col("list")).as("list"))
  .toJSON
which should give you
+-----------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","B":"MCC","list":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+-----------------------------------------------------------------------------------------------+
The next step is to rename the list field to the value of B, using a udf function:
def exchangeName = udf((json: String) => {
  val splitted = json.split(",")
  val name = splitted(2).split(":")(1).trim
  val value = splitted(3).split(":")(1).trim
  splitted(0).trim + "," + splitted(1).trim + "," + name + ":" + value + "," + (4 until splitted.size).map(splitted(_)).mkString(",")
})

jsonDF.select(exchangeName(col("value")).as("json"))
  .show(false)
which should give you your desired output
+------------------------------------------------------------------------------------+
|json |
+------------------------------------------------------------------------------------+
|{"val3":"1.1","val4":"1.05","MCC":[["cd1","1","2"],["cd2","2","3"],["cd3","3","4"]]}|
+------------------------------------------------------------------------------------+

Remove special characters from csv data using Spark

I want to remove specific special characters (e.g. # and &) from csv data using PySpark. I have gone through optimuspyspark (https://github.com/ironmussa/Optimus), but it removes all special characters, whereas I only want to remove specific ones. Are there any built-in functions, custom functions or third-party libraries to achieve this? Thanks in advance.
A few links I tried:
https://community.hortonworks.com/questions/49802/escaping-double-quotes-in-spark-dataframe.html
Hope this is what you are looking for:
assume you have a simple csv file (2 lines) that looks like this:
#A 234, 'B' 225, 'C' !556
#D 235, 'E' 2256, 'F'! 557
read csv into dataframe:
df = spark.read.csv('test1.csv',
                    mode="DROPMALFORMED",
                    inferSchema=True,
                    header=False)
df.show()
+------+---------+---------+
| _c0| _c1| _c2|
+------+---------+---------+
|#A 234| 'B' 225| 'C' !556|
|#D 235| 'E' 2256| 'F'! 557|
+------+---------+---------+
use pyspark functions to remove specific unwanted characters
from pyspark.sql.functions import *
newDf = df.withColumn('_c0', regexp_replace('_c0', '#', '')) \
    .withColumn('_c1', regexp_replace('_c1', "'", '')) \
    .withColumn('_c2', regexp_replace('_c2', '!', ''))
newDf.show()
+-----+-------+--------+
| _c0| _c1| _c2|
+-----+-------+--------+
|A 234| B 225| 'C' 556|
|D 235| E 2256| 'F' 557|
+-----+-------+--------+
if you want to remove a specific character from ALL columns try this:
starting with the same simplified textfile/dataFrame as above:
+------+---------+---------+
| _c0| _c1| _c2|
+------+---------+---------+
|#A 234| 'B' 225| 'C' !556|
|#D 235| 'E' 2256| 'F'! 557|
+------+---------+---------+
function to remove a character from a column in a dataframe:
def cleanColumn(tmpdf, colName, findChar, replaceChar):
    tmpdf = tmpdf.withColumn(colName, regexp_replace(colName, findChar, replaceChar))
    return tmpdf
remove the " ' " character from ALL columns in the df (replace with nothing i.e. "")
allColNames = df.schema.names
charToRemove = "'"
replaceWith = ""
for colName in allColNames:
    df = cleanColumn(df, colName, charToRemove, replaceWith)
The resultant output is:
df.show()
+------+-------+-------+
| _c0| _c1| _c2|
+------+-------+-------+
|#A 234| B 225| C !556|
|#D 235| E 2256| F! 557|
+------+-------+-------+
With Optimus you can:
df.cols.replace("*",["a","b","c"]," ").table()
Use the star ("*") to select all the columns.
Pass an array with the elements you want to match.
The last argument is the character that will replace the match, whitespace in this case.
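If you would rather stay with plain PySpark instead of Optimus, the same idea (replace a set of characters in every column) can be written as a single select; this is only a sketch, the character class below is an illustrative choice, and df is assumed to be the dataframe read above:
from pyspark.sql import functions as F

# Strip '#', '&' and single quotes from every column in one pass
chars_to_remove = "[#&']"
cleaned = df.select([
    F.regexp_replace(F.col(c), chars_to_remove, "").alias(c)
    for c in df.columns
])
cleaned.show()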

Returning a new Dataframe (by transforming an existing one) using a function - spark/scala

I am new to Spark. I am trying to read a JSON array into a DataFrame and perform some transformations on it. I am trying to cleanse my data by removing some HTML tags and some newline characters. For example:
Initial dataframe read from JSON:
+-----+---+-----+-------------------------------+
|index| X|label| date |
+-----+---+-----+-------------------------------+
| 1| 1| A|<div>&quot2017-01-01&quot</div>|
| 2| 3| B|<div>2017-01-02</div> |
| 3| 5| A|<div>2017-01-03</div> |
| 4| 7| B|<div>2017-01-04</div> |
+-----+---+-----+-------------------------------+
Should be transformed to :
+-----+---+-----+------------+
|index| X|label| date |
+-----+---+-----+------------+
| 1| 1| A|'2017-01-01'|
| 2| 3| B|2017-01-02 |
| 3| 5| A|2017-01-03 |
| 4| 7| B|2017-01-04 |
+-----+---+-----+------------+
I know that we can perform these transformations using:
df.withColumn("col_name",regexp_replace("col_name",pattern,replacement))
I am able to cleanse my data using the withColumn as shown above. However, I have a large number of columns and writing a .withColumn method for every column doesn't seem to be elegant, concise or efficient. So I tried doing something like this:
val finalDF = htmlCleanse(intialDF, columnsArray)

def htmlCleanse(df: DataFrame, columns: Array[String]): DataFrame = {
  var retDF = hiveContext.emptyDataFrame
  for (i <- 0 to columns.size - 1) {
    val name = columns(i)
    retDF = df.withColumn(name, regexp_replace(col(name), "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", ""))
      .withColumn(name, regexp_replace(col(name), "&quot", "'"))
      .withColumn(name, regexp_replace(col(name), " ", " "))
      .withColumn(name, regexp_replace(col(name), ":", ":"))
  }
  retDF
}
I defined a new function htmlCleanse and I am passing the Dataframe to be transformed and the columns array to the function. The function creates a new emptyDataFrame and iterates over the columns list performing the cleansing on a column for a single iteration and assigns the transformed df to the retDF variable.
This gave me no errors, but it doesn't seem to remove the html tags from all the columns while some of the columns appear to be cleansed. Not sure what's the reason for this inconsistent behavior(any ideas on this?).
So, what would be an efficient way to cleanse my data? Any help would be appreciated. Thank you!
The first issue is that initializing retDF to an empty frame does nothing useful; you just create a new, unrelated DataFrame. You can't then "add" things to it from another dataframe without a join (which would be a bad idea performance-wise).
The second issue is that retDF is always redefined from df, which means you throw away everything you did except the cleansing of the last column.
Instead, you should initialize retDF to df and, in every iteration, fix one column and overwrite retDF, as follows:
def htmlCleanse(df: DataFrame, columns: Array[String]): DataFrame = {
  var retDF = df
  for (i <- 0 to columns.size - 1) {
    val name = columns(i)
    retDF = retDF.withColumn(name, regexp_replace(col(name), "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", ""))
      .withColumn(name, regexp_replace(col(name), "&quot", "'"))
      .withColumn(name, regexp_replace(col(name), " ", " "))
      .withColumn(name, regexp_replace(col(name), ":", ":"))
  }
  retDF
}

Parsing JSON file and extracting keys and values using Spark

I'm new to Spark. I have tried to parse the JSON file below using Spark SQL, but it didn't work. Can someone please help me resolve this?
InputJSON:
[{"num":"1234","Projections":[{"Transactions":[{"14:45":0,"15:00":0}]}]}]
Expected output:
1234 14:45 0\n
1234 15:00 0
I have tried with the below code but it did not work
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("hdfs:/user/aswin/test.json").toDF();
val sql_output = sqlContext.sql("SELECT num, Projections.Transactions FROM df group by Projections.TotalTransactions ")
sql_output.collect.foreach(println)
Output:
[01532,WrappedArray(WrappedArray([0,0]))]
Spark recognizes your {"14:45":0,"15:00":0} map as a struct, so probably the only way to read your data is to specify the schema manually:
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField('num', StringType()),
...     StructField('Projections', ArrayType(StructType([
...         StructField('Transactions', ArrayType(MapType(StringType(), IntegerType())))
...     ])))
... ])
Then register a temporary table and query it using multiple explodes to get the results:
>>> sqlContext.read.json('sample.json', schema=schema).registerTempTable('df')
>>> sqlContext.sql("select num, explode(col) from (select explode(col.Transactions), num from (select explode(Projections), num from df))").show()
+----+-----+-----+
| num| key|value|
+----+-----+-----+
|1234|14:45| 0|
|1234|15:00| 0|
+----+-----+-----+
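For comparison, here is a sketch of the same query written with the DataFrame API instead of an SQL string, assuming the schema defined above:
>>> from pyspark.sql import functions as F
>>> df = sqlContext.read.json('sample.json', schema=schema)
>>> df.select('num', F.explode('Projections').alias('proj')) \
...     .select('num', F.explode('proj.Transactions').alias('tx')) \
...     .select('num', F.explode('tx').alias('key', 'value')) \
...     .show()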

Why does reading a csv file with empty values lead to an IndexOutOfBoundsException?

I have a csv file with the following structure:
Name | Val1 | Val2 | Val3 | Val4 | Val5
John 1 2
Joe 1 2
David 1 2 10 11
I am able to load this into an RDD fine. I tried to create a schema and then a Dataframe from it and get an indexOutOfBound error.
Code is something like this ...
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
When I try to perform an action on rowRDD, it gives the error.
Any help is greatly appreciated.
This is not an answer to your question, but it may help to solve your problem.
From the question I see that you are trying to create a dataframe from a CSV.
Creating a dataframe from a CSV can be done easily using the spark-csv package.
With spark-csv, the Scala code below can be used to read a CSV:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)
For your sample data I got the following result
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
You can also use the inferSchema option with the latest version.
Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field delimited by its own commas):
David,1,2,10,,11
The problem is your CSV file contains 6 columns, yet with:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
You try to read 7 columns. Just change your mapping to:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
And Spark will take care of the rest.
A possible solution is to replace the missing values with Double.NaN. Suppose I have a file example.csv with this row in it:
David,1,2,10,,11
You can read the csv file as a text file as follows:
val fileRDD = sc.textFile("example.csv").map { x =>
  val y = x.split(",")
  y.map(k => if (k == "") Double.NaN else k.toDouble)  // every non-empty field must be numeric for toDouble to succeed
}
And then you can use your code to create a dataframe from it.
You can do it as follows.
val rowRDD = sc
  .textFile(csvFilePath)
  .map(_.split(delimiter_of_file, -1))
  .map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
Split using the delimiter of your file. When you set -1 as the limit, all empty fields are kept (otherwise split drops trailing empty strings).