I am running the following query on a Spark DataFrame:
input
  .select("id")
  .groupBy("id")
  .agg(count("*").as("count"))
I am getting a java.lang.NegativeArraySizeException:
at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:234)
at org.apache.spark.unsafe.types.UTF8String.toString(UTF8String.java:827)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:276)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:273)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:533)
The following should work:
input.groupBy("id").count()
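For reference, a minimal PySpark sketch of the working form (the sample data and the input_df name are illustrative, not from the original post):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the original `input` DataFrame.
input_df = spark.createDataFrame([("a",), ("a",), ("b",)], ["id"])

# groupBy().count() yields the per-id counts without the manual select/agg
# projection that triggered the exception above.
input_df.groupBy("id").count().show()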
I converted a JSON file into a Pandas DataFrame.
Dataframe:
I want to flatten the 'pos' column into the following columns and filter it:
Result:
What should I do?
You can use:
df['word'] = df['pos'].apply(lambda x: x[4]['word'])
df['tag'] = df['pos'].apply(lambda x: x[3])
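For a self-contained illustration, here is a runnable sketch; the shape of the 'pos' entries (tag at index 3, a dict with a 'word' key at index 4) is an assumption inferred from the indices used above:
import pandas as pd

# Assumed structure: element 3 of each 'pos' list is the tag and element 4
# is a dict containing the word.
df = pd.DataFrame({
    'pos': [
        ['a', 'b', 'c', 'NN', {'word': 'cat'}],
        ['a', 'b', 'c', 'VB', {'word': 'run'}],
    ]
})

df['word'] = df['pos'].apply(lambda x: x[4]['word'])
df['tag'] = df['pos'].apply(lambda x: x[3])

# Filter, e.g. keep only the noun rows:
print(df[df['tag'] == 'NN'][['word', 'tag']])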
I'm using PySpark to write a dataframe to a CSV file like this:
df.write.csv(PATH, nullValue='')
There is a column in that dataframe of type string. Some of the values are null. These null values display like this:
...,"",...
I would like them to be displayed like this instead:
...,,...
Is this possible with an option in df.write.csv()?
Thanks!
Easily, with the emptyValue option set:
emptyValue: sets the string representation of an empty value. If None is set, it uses the default value, "".
Since nullValue already defaults to the empty string when writing, values that show up as "" are most likely empty strings rather than true nulls, and emptyValue is the option that controls how those are written.
from pyspark.sql import Row
from pyspark.shell import spark  # importing pyspark.shell gives a ready-made SparkSession named spark

df = spark.createDataFrame([
    Row(col_1=None, col_2='20151231', col_3='Hello'),
    Row(col_1=2, col_2='20160101', col_3=None),
    Row(col_1=3, col_2=None, col_3='World')
])

df.write.csv(PATH, header=True, emptyValue='')  # PATH is your output location
Output
col_1,col_2,col_3
,20151231,Hello
2,20160101,
3,,World
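If the column can hold both true nulls and empty strings, setting both options makes the intent explicit; a sketch reusing the df and PATH from above:
# nullValue controls how null/None is written; emptyValue controls how the
# empty string '' is written. With both set to '', either case appears as ,,
df.write.csv(PATH, header=True, nullValue='', emptyValue='')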
I have a Scala DataFrame in the below format:
I need output in the below format:
The output needs to be written to a JSON file.
Here it is. Change the formatting according to your need.
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." column syntax

df.withColumn("arr", format_string("{%d,%d,%d}", $"pd_id", $"score", $"rank"))
  .groupBy("event_tra", "customer", "itemId", "ckey")
  .agg(collect_list("arr").as("collection"))
  .select(format_string("{%s,%s,%s,%s,%s}", $"event_tra", $"customer", $"itemId", $"ckey",
    concat_ws(",", $"collection")).as("data"))
  // the resulting single string column can then be persisted, e.g. with .write.text(<path>)
I am trying to use Spark to read a CSV file in a Jupyter notebook. So far I have:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
reviews_df = spark.read.option("header", "true").csv("small.csv")
reviews_df.collect()
This is what reviews_df looks like:
[Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5'),
Row(reviewerID=u'A2YB0B3QOHEFR', asin=u'B000JJSRNY', overall=u'5'),
Row(reviewerID=u'AAI0092FR8V1W', asin=u'B0060MYKYY', overall=u'5'),
Row(reviewerID=u'A2TAPSNKK9AFSQ', asin=u'6303187218', overall=u'5'),
Row(reviewerID=u'A316JR2TQLQT5F', asin=u'6305364206', overall=u'5')...]
But each row of the DataFrame contains the column names. How can I reformat the data so that it becomes:
[(u'A1YKOIHKQHB58W', u'B0001VL0K2', u'5'),
(u'A2YB0B3QOHEFR', u'B000JJSRNY', u'5')....]
A DataFrame always returns Row objects; that's why, when you call collect() on a DataFrame, it shows:
Row(reviewerID=u'A1YKOIHKQHB58W', asin=u'B0001VL0K2', overall=u'5')
To get what you want, you can do:
reviews_df.rdd.map(lambda row: (row.reviewerID, row.asin, row.overall)).collect()
This will return a list of tuples of the row values.
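Since Row is a subclass of tuple, a shorter equivalent sketch is:
# Row subclasses tuple, so a plain tuple() conversion drops the field names.
reviews_df.rdd.map(tuple).collect()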
When I create a DataFrame from a JSON file, the fields are sorted alphabetically by default in the DataFrame. How do I avoid this sorting?
The JSON file has one JSON message per line:
{"name":"john","age":10,"class":2}
{"name":"rambo","age":11,"class":3}
When I create a DataFrame from this file:
val jDF = sqlContext.read.json("/user/inputfiles/sample.json")
a DF is created as jDF: org.apache.spark.sql.DataFrame = [age: bigint, class: bigint, name: string]. In the DF, the fields are sorted by default.
How do we avoid this from happening?
I'm unable to understand what is going wrong here.
Appreciate any help in sorting out the problem.
For the first question:
A simple way is to do a select on the DataFrame:
val newDF = jDF.select("name", "age", "class")  // or jDF.select(columnOrder.map(col): _*) for a predefined Seq[String]
The order of the parameters is the order of the columns you want.
But this could be verbose if there are many columns and you have to define the order yourself.