I am running the following query on a Spark DataFrame:
input
  .select("id")
  .groupBy("id")
  .agg(count("*").as("count"))
I am getting a java.lang.NegativeArraySizeException:
at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:234)
at org.apache.spark.unsafe.types.UTF8String.toString(UTF8String.java:827)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:276)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:273)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:533)
The following should work:
input.groupBy("id").count()
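For reference, a minimal PySpark sketch of the working form (the sample data and the input_df name are illustrative, not from the original post):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the original `input` DataFrame.
input_df = spark.createDataFrame([("a",), ("a",), ("b",)], ["id"])

# groupBy().count() yields the per-id counts without the manual select/agg
# projection that triggered the exception above.
input_df.groupBy("id").count().show()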
I converted a JSON file into a Pandas DataFrame.
Dataframe:
I want to flatten the 'pos' column into the following columns and filter it:
Result:
What should I do?
You can use:
df['word'] = df['pos'].apply(lambda x: x[4]['word'])
df['tag'] = df['pos'].apply(lambda x: x[3])
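For a self-contained illustration, here is a runnable sketch; the shape of the 'pos' entries (tag at index 3, a dict with a 'word' key at index 4) is an assumption inferred from the indices used above:
import pandas as pd

# Assumed structure: element 3 of each 'pos' list is the tag and element 4
# is a dict containing the word.
df = pd.DataFrame({
    'pos': [
        ['a', 'b', 'c', 'NN', {'word': 'cat'}],
        ['a', 'b', 'c', 'VB', {'word': 'run'}],
    ]
})

df['word'] = df['pos'].apply(lambda x: x[4]['word'])
df['tag'] = df['pos'].apply(lambda x: x[3])

# Filter, e.g. keep only the noun rows:
print(df[df['tag'] == 'NN'][['word', 'tag']])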
I'm using PySpark to write a dataframe to a CSV file like this:
df.write.csv(PATH, nullValue='')
There is a column in that dataframe of type string. Some of the values are null. These null values display like this:
...,"",...
I would like them to be displayed like this instead:
...,,...
Is this possible with an option in df.write.csv()?
Thanks!
Easily, with the emptyValue option set:
emptyValue: sets the string representation of an empty value. If None is set, it uses the default value, "".
Since nullValue already defaults to the empty string when writing, values that show up as "" are most likely empty strings rather than true nulls, and emptyValue is the option that controls how those are written.
from pyspark.sql import Row
from pyspark.shell import spark  # importing pyspark.shell gives a ready-made SparkSession named spark

df = spark.createDataFrame([
    Row(col_1=None, col_2='20151231', col_3='Hello'),
    Row(col_1=2, col_2='20160101', col_3=None),
    Row(col_1=3, col_2=None, col_3='World')
])

df.write.csv(PATH, header=True, emptyValue='')  # PATH is your output location
Output
col_1,col_2,col_3
,20151231,Hello
2,20160101,
3,,World
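If the column can hold both true nulls and empty strings, setting both options makes the intent explicit; a sketch reusing the df and PATH from above:
# nullValue controls how null/None is written; emptyValue controls how the
# empty string '' is written. With both set to '', either case appears as ,,
df.write.csv(PATH, header=True, nullValue='', emptyValue='')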
I have a Scala DataFrame in the below format:
I need output in the below format:
The output needs to be written to a JSON file.
Here it is. Change the formatting according to your need.
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax

df.withColumn("arr", format_string("{%d,%d,%d}", $"pd_id", $"score", $"rank"))
  .groupBy("event_tra", "customer", "itemId", "ckey")
  .agg(collect_list("arr").as("collection"))
  .select(format_string("{%s,%s,%s,%s,%s}", $"event_tra", $"customer", $"itemId", $"ckey",
    concat_ws(",", $"collection")).as("data"))
  // the resulting single string column can then be persisted, e.g. with .write.text(<path>)
I am trying to use Spark to read a CSV file in a Jupyter notebook. So far I have:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
reviews_df = spark.read.option("header", "true").csv("small.csv")
reviews_df.collect()
This is what reviews_df looks like:
[Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5'),
Row(reviewerID=u'A2YB0B3QOHEFR', asin=u'B000JJSRNY', overall=u'5'),
Row(reviewerID=u'AAI0092FR8V1W', asin=u'B0060MYKYY', overall=u'5'),
Row(reviewerID=u'A2TAPSNKK9AFSQ', asin=u'6303187218', overall=u'5'),
Row(reviewerID=u'A316JR2TQLQT5F', asin=u'6305364206', overall=u'5')...]
But each row of the DataFrame contains the column names. How can I reformat the data so that it becomes:
[(u'A1YKOIHKQHB58W', u'B0001VL0K2', u'5'),
(u'A2YB0B3QOHEFR', u'B000JJSRNY', u'5')....]
A DataFrame always returns Row objects; that's why, when you call collect() on a DataFrame, it shows:
Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5')
To get what you want, you can do:
reviews_df.rdd.map(lambda row: (row.reviewerID, row.asin, row.overall)).collect()
This will return a list of tuples of the row values.
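Since Row is a subclass of tuple, a shorter equivalent sketch is:
# Row subclasses tuple, so a plain tuple() conversion drops the field names.
reviews_df.rdd.map(tuple).collect()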
When I create a DataFrame from a JSON file, the fields are sorted alphabetically by default in the DataFrame. How do I avoid this sorting?
The JSON file has one JSON message per line:
{"name":"john","age":10,"class":2}
{"name":"rambo","age":11,"class":3}
When I create a DataFrame from this file:
val jDF = sqlContext.read.json("/user/inputfiles/sample.json")
a DF is created as jDF: org.apache.spark.sql.DataFrame = [age: bigint, class: bigint, name: string]. In the DF, the fields are sorted by default.
How do we avoid this from happening?
I'm unable to understand what is going wrong here.
Appreciate any help in sorting out the problem.
For the first question:
A simple way is to do a select on the DataFrame:
val newDF = jDF.select("name", "age", "class")  // or jDF.select(columnOrder.map(col): _*) for a predefined Seq[String]
The order of the parameters is the order of the columns you want.
But this could be verbose if there are many columns and you have to define the order yourself.