I have a CSV file with the following structure:
Name | Val1 | Val2 | Val3 | Val4 | Val5
John 1 2
Joe 1 2
David 1 2 10 11
I am able to load this into an RDD fine. I tried to create a schema and then a DataFrame from it, and I get an IndexOutOfBounds error.
My code is something like this:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
When I try to perform an action on rowRDD, it gives the error.
Any help is greatly appreciated.
This is not an answer to your question, but it may help to solve your problem.
From the question I see that you are trying to create a DataFrame from a CSV.
Creating a DataFrame from a CSV can be done easily using the spark-csv package.
With spark-csv, the Scala code below can be used to read a CSV:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)
For your sample data I got the following result
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
You can also use inferSchema with the latest version. See this answer.
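For reference, schema inference is just one more option on the same read; a minimal sketch (csvFilePath is the same path variable as above):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")   // infer column types instead of reading everything as strings
  .load(csvFilePath)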
Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field delimited by its own commas):
David,1,2,10,,11
The problem is that your CSV file contains 6 columns, yet with:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
you try to read 7 columns. Just change your mapping to:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
And Spark will take care of the rest.
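For completeness, here is a minimal sketch of the remaining step under that fix, assuming all six columns are kept as strings and using the column names from your sample header:
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// one StringType field per column in the header
val schema = StructType(
  Seq("Name", "Val1", "Val2", "Val3", "Val4", "Val5")
    .map(name => StructField(name, StringType, nullable = true)))
val df = sqlContext.createDataFrame(rowRDD, schema)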
A possible solution to that problem is replacing the missing values with Double.NaN. Suppose I have a file example.csv with these columns in it:
David,1,2,10,,11
You can read the CSV file as a text file as follows:
val fileRDD = sc.textFile("example.csv")
  .map(x => x.split(",").map(k => if (k == "") Double.NaN else k.toDouble))
And then you can use your code to create a DataFrame from it.
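A minimal sketch of that last step, assuming purely numeric columns (note that in this sample the first field is a name, not a number, so in practice it would have to be kept as a string and handled separately; the column names col1..col6 are made up here):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}

// six DoubleType columns with placeholder names
val schema = StructType((1 to 6).map(i => StructField(s"col$i", DoubleType, nullable = true)))
// wrap each Array[Double] in a Row so createDataFrame can consume it
val rowRDD = fileRDD.map(values => Row.fromSeq(values.toSeq))
val df = sqlContext.createDataFrame(rowRDD, schema)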
You can do it as follows.
val rowRDD = sc
  .textFile(csvFilePath)
  .map(_.split(delimiter_of_file, -1))
  .map(p =>
    Row(
      p(0),
      p(1),
      p(2),
      p(3),
      p(4),
      p(5)))
Split using the delimiter of your file. When you set -1 as the limit, it keeps all the empty fields (trailing empty strings are not dropped).
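To see exactly what the -1 limit changes, here is a plain Scala illustration (not Spark-specific) on a line with trailing empty fields:
"David,1,2,10,,".split(",")      // Array(David, 1, 2, 10)         -- trailing empties dropped
"David,1,2,10,,".split(",", -1)  // Array(David, 1, 2, 10, "", "") -- all 6 fields kept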
Hi all, I am getting null values when using from_json; can you help me figure out the missing piece here?
~ Input is the .csv file with JSON, e.g.
id,request
1,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}
2,{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}
~ My code (Scala/Spark):
val input_df = spark.read.option("header",true).option("escape","\"").csv(json_file_input)
val json_schema_abc = StructType(Array(
StructField("Zipcode",IntegerType,true),
StructField("ZipCodeType",StringType,true),
StructField("City",StringType,true),
StructField("State",StringType,true))
)
val output_df = input_df.select($"id",from_json(col("request"),json_schema_abc).as("json_request"))
.select("id","json_request.*")
Your problem is that the commas in your JSON column are being used as delimiters. If you have a look at the contents of your input_df:
val input_df = spark.read.option("header",true).option("escape","\"").csv(json_file_input)
input_df.show(false)
+---+--------------+
|id |request |
+---+--------------+
|1 |{"Zipcode":704|
|2 |{"Zipcode":704|
+---+--------------+
You can see that the request column is not complete: it was cut off at its first comma.
The rest of your code is correct, you can test it like this:
val input_df = Seq(
(1, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""),
(2, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PASEO COSTA DEL SUR","State":"PR"}""")
).toDF("id","request")
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
import org.apache.spark.sql.functions.{from_json, col}
val json_schema_abc = StructType(Array(
StructField("Zipcode",IntegerType,true),
StructField("ZipCodeType",StringType,true),
StructField("City",StringType,true),
StructField("State",StringType,true))
)
val output_df = input_df
.select($"id",from_json(col("request"),json_schema_abc).as("json_request"))
.select("id","json_request.*")
output_df.show(false)
+---+-------+-----------+-------------------+-----+
|id |Zipcode|ZipCodeType|City |State|
+---+-------+-----------+-------------------+-----+
|1 |704 |STANDARD |PARC PARQUE |PR |
|2 |704 |STANDARD |PASEO COSTA DEL SUR|PR |
+---+-------+-----------+-------------------+-----+
I would suggest changing your CSV file's delimiter (for example ;, if that character does not appear in your data); that way the commas won't bother you.
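If you do rewrite the file with ; as the separator, the read only needs the sep option changed; a minimal sketch (json_file_semicolon_input is a hypothetical path for the rewritten file):
val input_df = spark.read
  .option("header", true)
  .option("escape", "\"")
  .option("sep", ";")   // use the new delimiter instead of the default comma
  .csv(json_file_semicolon_input)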
Below is my DataFrame. I'm getting this DataFrame by converting a Java JSON list to a DataFrame.
+---+--------------------------------------------------------------------------------------------------+
|   |Value                                                                                             |
+---+--------------------------------------------------------------------------------------------------+
| 1 |{"Name":"john","type":"On(r) TV: Channel","desc_lang":"en~en~en~en","copyright":"Copyright 2021"}|
| 2 |{"Name":"Dane","type":"On(r) TV: Prgrm","desc_lang":"FR~FR~FR~FR","copyright":"Copyright 2022"}  |
+---+--------------------------------------------------------------------------------------------------+
The required output is as below.
+----+-----------------+-----------+--------------+
|Name|type             |desc_lang  |copyright     |
+----+-----------------+-----------+--------------+
|john|On(r) TV: Channel|en~en~en~en|Copyright 2021|
|Dane|On(r) TV: Prgrm  |FR~FR~FR~FR|Copyright 2022|
+----+-----------------+-----------+--------------+
This is only sample data; I actually have around 180 columns that need to be flattened to the above tabular format. Below is the code I tried for splitting, but it did not give me the desired output.
val dfcollect = DF.withColumn("finalop", split($"Value", ":"))
Could someone please advise on how to achieve this output?
You can use from_json and star expand the resulting struct:
import org.apache.spark.sql.functions.{from_json, schema_of_json, col}

val df2 = df.select(
  from_json(
    col("Value"),
    schema_of_json(df.select("Value").head().getString(0))
  ).as("Value")
).select("Value.*")
df2.show
+----+--------------+-----------+-----------------+
|Name|     copyright|  desc_lang|             type|
+----+--------------+-----------+-----------------+
|john|Copyright 2021|en~en~en~en|On(r) TV: Channel|
|Dane|Copyright 2022|FR~FR~FR~FR|  On(r) TV: Prgrm|
+----+--------------+-----------+-----------------+
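If you would rather not depend on the first row containing every field, you can pass an explicit schema instead of schema_of_json; a minimal sketch for the four sample fields (your real data would need entries for all ~180 columns):
import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// one StringType field per JSON key; extend this list to cover every column
val valueSchema = StructType(Seq("Name", "type", "desc_lang", "copyright")
  .map(f => StructField(f, StringType, nullable = true)))
val df3 = df.select(from_json(col("Value"), valueSchema).as("Value")).select("Value.*")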
Here is some example CSV data:
"ID", "name", "abbreviation", "CreatedTime", "CreatedByAccount", "UpdatedTime", "UpdatedByAccount", "inc_country_id", "time_zone_id"
"1","NULL","UNITED ARAB EMIRATES"",NULL","AE","NULL","2015-07-01 20:41:49","NULL","379","NULL","2016-03-16 07:38:49","NULL","8215","NULL","262","NULL","9","NULL"
This is causing a column mismatch when I am trying to create a DataFrame using PySpark.
There are about 600+ such files containing data like the above, and I need to read all of them with proper column mapping.
>>> df=spark.read.csv("s3://xyz.csv",header=True)
>>> df.show()
+---+----+--------------------+-----------+----------------+-----------+-------------------+--------------+------------+
| ID|name| abbreviation|CreatedTime|CreatedByAccount|UpdatedTime| UpdatedByAccount|inc_country_id|time_zone_id|
+---+----+--------------------+-----------+----------------+-----------+-------------------+--------------+------------+
| 1|NULL|UNITED ARAB EMIRATES| NULL| AE| NULL|2015-07-01 20:41:49| NULL| 379|
| 2|NULL| ARGENTINA| NULL| AR| NULL|2015-07-01 20:41:49| NULL| 379|
I tried an approach of creating a custom schema and then reading the CSV file, but this has to be done for 600-plus files with different sizes and columns.
>>> abc=StructType([StructField('ID',StringType(),True),StructField('c1',StringType(),True),StructField('name',StringType(),True),StructField('c2',StringType(),True),StructField('abbreviation',StringType(),True),StructField('c3',StringType(),True),StructField('CreatedTime',StringType(),True),StructField('c4',StringType(),True),StructField('CreatedByAccount',StringType(),True),StructField('c5',StringType(),True),StructField('UpdatedTime',StringType(),True),StructField('c6',StringType(),True),StructField('UpdatedByAccount',StringType(),True),StructField('c7',StringType(),True),StructField('inc_country_id',StringType(),True),StructField('c8',StringType(),True),StructField('time_zone_id',StringType(),True),StructField('c9',StringType(),True)])
>>> df=spark.read.csv("s3://xyz.csv/",schema=abc)
>>> df.show()
+---+----+--------------------+-----------+----------------+-----------+-------------------+--------------+----------------+----+-------------------+----+----------------+----+--------------+----+------------+----+
| ID| c1| name| c2| abbreviation| c3| CreatedTime| c4|CreatedByAccount| c5| UpdatedTime| c6|UpdatedByAccount| c7|inc_country_id| c8|time_zone_id| c9|
+---+----+--------------------+-----------+----------------+-----------+-------------------+--------------+----------------+----+-------------------+----+----------------+----+--------------+----+------------+----+
| 1|NULL|UNITED ARAB EMIRATES| NULL| AE| NULL|2015-07-01 20:41:49| NULL| 379|NULL|2016-03-16 07:38:49|NULL| 8215|NULL| 262|NULL| 9|NULL|
| 2|NULL| ARGENTINA| NULL| AR| NULL|2015-07-01 20:41:49| NULL| 379|NULL|2015-10-28 21:07:47|NULL| 379|NULL| 187|NULL| None|NULL|
Is there any generic way to reload all those files without the NULLs using PySpark?
My solution is reading the file twice: once to fetch the schema (and then manipulate it), and once for the actual read.
from pyspark.sql import types as T

# keep original fields so we can `select` later
df_schema = spark.read.csv('a.csv', header=True)
original_fields = df_schema.schema.fields

# add an extra dummy column after each valid column
expanded_fields = []
for i, field in enumerate(original_fields):
    expanded_fields.append(field)
    expanded_fields.append(T.StructField(f'col_{i}', T.StringType()))

# build a "fake" schema to fit the csv
schema = T.StructType(expanded_fields)

# use the "fake" schema to load the CSV, then select only the valid columns from the original fields
df = spark.read.csv('a.csv', header=True, schema=schema).select([field.name for field in original_fields])
df.show()
# +---+--------------------+------------+-------------------+----------------+-------------------+----------------+--------------+------------+
# | ID| name|abbreviation| CreatedTime|CreatedByAccount| UpdatedTime|UpdatedByAccount|inc_country_id|time_zone_id|
# +---+--------------------+------------+-------------------+----------------+-------------------+----------------+--------------+------------+
# | 1|UNITED ARAB EMIRATES| AE|2015-07-01 20:41:49| 379|2016-03-16 07:38:49| 8215| 262| 9|
# +---+--------------------+------------+-------------------+----------------+-------------------+----------------+--------------+------------+
I have a CSV file marks.csv. I have read it using PySpark and created a DataFrame df.
It looks like this (the CSV file):
sub1,sub2,sub3
a,a,b
b,b,a
c,a,b
How can I get the count of ‘a’ in each column in the data frame df?
Thanks.
As we can leverage SQL's features in Spark, we can simply do the following:
df.selectExpr(
    "sum(if( sub1 = 'a' , 1, 0 )) as count1",
    "sum(if( sub2 = 'a' , 1, 0 )) as count2",
    "sum(if( sub3 = 'a' , 1, 0 )) as count3").show()
It should give output as below:
+------+------+------+
|count1|count2|count3|
+------+------+------+
| 1| 2| 1|
+------+------+------+
To know more about Spark SQL, please visit this.
:EDIT:
If you want to do it for all columns then you can try something like below:
from pyspark.sql.types import Row
final_out = spark.createDataFrame([Row()]) # create an empty dataframe
#Just loop through all columns
for col_name in df.columns:
    final_out = final_out.crossJoin(df.selectExpr("sum(if( " + col_name + " = 'a' , 1, 0 )) as " + col_name))
final_out.show()
It should give you output like below:
+----+----+----+
|sub1|sub2|sub3|
+----+----+----+
| 1| 2| 1|
+----+----+----+
You can use a CASE WHEN expression to get the count of "a" in each column:
import pyspark.sql.functions as F
df2 = df.select(
    F.sum(F.when(df["sub1"] == "a", 1).otherwise(0)).alias("sub1_cnt"),
    F.sum(F.when(df["sub2"] == "a", 1).otherwise(0)).alias("sub2_cnt"),
    F.sum(F.when(df["sub3"] == "a", 1).otherwise(0)).alias("sub3_cnt"))
df2.show()
I am new to Spark. I am trying to read a JSONArray into a DataFrame and perform some transformations on it. I am trying to cleanse my data by removing some HTML tags and some newline characters. For example:
Initial dataframe read from JSON:
+-----+---+-----+---------------------------------+
|index|  X|label|                             date|
+-----+---+-----+---------------------------------+
|    1|  1|    A|<div>&quot;2017-01-01&quot;</div>|
|    2|  3|    B|            <div>2017-01-02</div>|
|    3|  5|    A|            <div>2017-01-03</div>|
|    4|  7|    B|            <div>2017-01-04</div>|
+-----+---+-----+---------------------------------+
Should be transformed to :
+-----+---+-----+------------+
|index|  X|label|        date|
+-----+---+-----+------------+
|    1|  1|    A|'2017-01-01'|
|    2|  3|    B|  2017-01-02|
|    3|  5|    A|  2017-01-03|
|    4|  7|    B|  2017-01-04|
+-----+---+-----+------------+
I know that we can perform these transformations using:
df.withColumn("col_name",regexp_replace("col_name",pattern,replacement))
I am able to cleanse my data using withColumn as shown above. However, I have a large number of columns, and writing a .withColumn call for every column doesn't seem elegant, concise, or efficient. So I tried doing something like this:
val finalDF = htmlCleanse(intialDF, columnsArray)
def htmlCleanse(df: DataFrame, columns: Array[String]): DataFrame = {
  var retDF = hiveContext.emptyDataFrame
  for(i <- 0 to columns.size-1){
    val name = columns(i)
    retDF = df.withColumn(name, regexp_replace(col(name), "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", ""))
      .withColumn(name, regexp_replace(col(name), "&quot;", "'"))
      .withColumn(name, regexp_replace(col(name), "&nbsp;", " "))
      .withColumn(name, regexp_replace(col(name), "&#58;", ":"))
  }
  retDF
}
I defined a new function htmlCleanse, and I pass it the DataFrame to be transformed and the array of columns. The function creates a new empty DataFrame, iterates over the columns list performing the cleansing on one column per iteration, and assigns the transformed df to the retDF variable.
This gave me no errors, but it doesn't seem to remove the HTML tags from all the columns, while some of the columns appear to be cleansed. I'm not sure what the reason for this inconsistent behavior is (any ideas on this?).
So, what would be an efficient way to cleanse my data? Any help would be appreciated. Thank you!
The first issue is that initializing retDF to an empty DataFrame does nothing useful; you just create something new. You can't then "add" things to it from another DataFrame without a join (which would be a bad idea performance-wise).
The second issue is that retDF is always defined from df. This means that you throw away everything you did except for cleaning the last column.
Instead you should initialize retDF to df and in every iteration fix a column and overwrite retDF as follows:
def htmlCleanse(df: DataFrame, columns: Array[String]): DataFrame = {
  var retDF = df
  for(i <- 0 to columns.size-1){
    val name = columns(i)
    retDF = retDF.withColumn(name, regexp_replace(col(name), "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", ""))
      .withColumn(name, regexp_replace(col(name), "&quot;", "'"))
      .withColumn(name, regexp_replace(col(name), "&nbsp;", " "))
      .withColumn(name, regexp_replace(col(name), "&#58;", ":"))
  }
  retDF
}
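As a purely stylistic alternative (not part of the original answer), the same loop can be written without the var by folding over the columns; a minimal sketch:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}

def htmlCleanse(df: DataFrame, columns: Array[String]): DataFrame =
  columns.foldLeft(df) { (acc, name) =>
    // same four replacements as above, applied column by column
    acc.withColumn(name, regexp_replace(col(name), "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", ""))
      .withColumn(name, regexp_replace(col(name), "&quot;", "'"))
      .withColumn(name, regexp_replace(col(name), "&nbsp;", " "))
      .withColumn(name, regexp_replace(col(name), "&#58;", ":"))
  }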