Adding a unique consecutive row number to a dataframe in PySpark - CSV

I want to add a unique row number to my dataframe in PySpark and don't want to use the monotonically_increasing_id or partitionBy methods.
I think this question might be a duplicate of similar questions asked earlier, but I'm still looking for advice on whether I am doing it the right way.
The following is a snippet of my code.
I have a CSV file with the below set of input records:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
I have loaded this CSV file into a dataframe:
PATH_TO_FILE = "file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
    .option("mode", "DROPMALFORMED") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD = empRDD.map(lambda x: list(x[0]) + [x[1]])
newRDD.take(2)
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
When I appended the index value to each row's list, I lost the dataframe schema.
newdf = newRDD.toDF(['emp_id', 'emp_name', 'emp_city', 'emp_salary', 'row_id'])
newdf.show()
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing this the right way? Or is there a better way to add the row number while preserving the schema of the dataframe in PySpark?
Is it feasible to use the zipWithIndex method to add a unique consecutive row number to a large dataframe as well? Can we use this row_id to re-partition the dataframe so the data is distributed uniformly across the partitions?
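For completeness, here is a minimal sketch of keeping the original column types when using zipWithIndex: extend the existing schema with a LongType field and rebuild the dataframe from the indexed RDD (the row_id column name is just an assumption):
from pyspark.sql.types import StructType, StructField, LongType

# append a LongType index field to the original schema (a sketch, not the only way)
new_schema = StructType(emp_df.schema.fields + [StructField("row_id", LongType(), False)])
newdf = spark.createDataFrame(
    emp_df.rdd.zipWithIndex().map(lambda row: list(row[0]) + [row[1]]),
    schema=new_schema)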

I have found a solution and it's very simple.
Since no column in my dataframe has the same value across all rows, using row_number with a partitionBy clause on an existing column does not generate unique row numbers over the whole dataframe.
Let's add a new column to the existing dataframe with some constant default value in it.
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

emp_df = emp_df.withColumn("new_column", lit("ABC"))
and create a window function with partitionBy using that column "new_column":
w = Window.partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
You will get the desired result:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
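As a side note, a similar result can be obtained without the helper column by ordering the window over a constant literal. This is only a sketch: like the approach above, it pulls all rows into a single partition, so it may not scale well to very large dataframes.
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

# no partitionBy: Spark will warn that all data moves to a single partition
w = Window.orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w))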

Using Spark SQL:
df = spark.sql("""
SELECT
row_number() OVER (
PARTITION BY ''
ORDER BY ''
) as id,
*
FROM
VALUES
('Bob ', 20),
('Alice', 21),
('Gary ', 21),
('Kent ', 25),
('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
| 1|Bob | 20|
| 2|Alice| 21|
| 3|Gary | 21|
| 4|Kent | 25|
| 5|Gary | 35|
+---+-----+----+

Related

How to apply group by on pyspark dataframe and a transformation on the resulting object

I have a Spark data frame:
+---------+---------------+-----------------+
| item_id | attribute_key | attribute_value |
+---------+---------------+-----------------+
| id_1    | brand         | Samsung         |
| id_1    | ram           | 6GB             |
| id_2    | brand         | Apple           |
| id_2    | ram           | 4GB             |
+---------+---------------+-----------------+
I want to group this data frame by item_id and output it as a file, with each line being a JSON object:
{id_1: "properties":[{"brand":['Samsung']},{"ram":['6GB']} ]}
{id_2: "properties":[{"brand":['Apple']},{"ram":['4GB']} ]}
This is a big distributed data frame, so converting to pandas is not an option.
Is this kind of transformation even possible in PySpark?
In Scala, but the Python version will be very similar (sql.functions):
val df = Seq((1,"brand","Samsung"),(1,"ram","6GB"),(1,"ram","8GB"),(2,"brand","Apple"),(2,"ram","6GB")).toDF("item_id","attribute_key","attribute_value")
+-------+-------------+---------------+
|item_id|attribute_key|attribute_value|
+-------+-------------+---------------+
| 1| brand| Samsung|
| 1| ram| 6GB|
| 1| ram| 8GB|
| 2| brand| Apple|
| 2| ram| 6GB|
+-------+-------------+---------------+
df.groupBy('item_id,'attribute_key)
.agg(collect_list('attribute_value).as("list2"))
.groupBy('item_id)
.agg(map(lit("properties"),collect_list(map('attribute_key,'list2))).as("prop"))
.select(to_json(map('item_id,'prop)).as("json"))
.show(false)
Output:
+------------------------------------------------------------------+
|json |
+------------------------------------------------------------------+
|{"1":{"properties":[{"ram":["6GB","8GB"]},{"brand":["Samsung"]}]}}|
|{"2":{"properties":[{"brand":["Apple"]},{"ram":["6GB"]}]}} |
+------------------------------------------------------------------+
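A rough PySpark translation of the Scala above, as a sketch (assuming Spark 2.1+, where to_json supports map columns):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, "brand", "Samsung"), (1, "ram", "6GB"), (1, "ram", "8GB"),
     (2, "brand", "Apple"), (2, "ram", "6GB")],
    ["item_id", "attribute_key", "attribute_value"])

(df.groupBy("item_id", "attribute_key")
   .agg(F.collect_list("attribute_value").alias("list2"))
   .groupBy("item_id")
   .agg(F.create_map(F.lit("properties"),
                     F.collect_list(F.create_map("attribute_key", "list2"))).alias("prop"))
   .select(F.to_json(F.create_map("item_id", "prop")).alias("json"))
   .show(truncate=False))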

Pyspark - getting values from an array that has a range of min and max values

I'm trying to write a query in PySpark that will get the correct value from an array.
For example, I have a dataframe called df with three columns, 'companyId', 'companySize' and 'weightingRange'. The 'companySize' column is just the number of employees. The column 'weightingRange' is an array containing the following:
[ {"minimum":0, "maximum":100, "weight":123},
{"minimum":101, "maximum":200, "weight":456},
{"minimum":201, "maximum":500, "weight":789}
]
So the dataframe looks like this (weightingRange is as above; it's truncated in the example below for clearer formatting):
+-----------+-------------+------------------------+--+
| companyId | companySize | weightingRange | |
+-----------+-------------+------------------------+--+
| ABC1 | 150 | [{"maximum":100, etc}] | |
| ABC2 | 50 | [{"maximum":100, etc}] | |
+-----------+-------------+------------------------+--+
So for an entry with company size = 150, I need to return the weight 456 into a column called 'companyWeighting'.
So it should show the following:
+-----------+-------------+------------------------+------------------+
| companyId | companySize | weightingRange | companyWeighting |
+-----------+-------------+------------------------+------------------+
| ABC1 | 150 | [{"maximum":100, etc}] | 456 |
| ABC2 | 50 | [{"maximum":100, etc}] | 123 |
+-----------+-------------+------------------------+------------------+
I've had a look at
df.withColumn("tmp", explode(col("weightingRange"))).select("tmp.*")
and then joining, but applying that would produce a Cartesian product of the data.
Suggestions appreciated!
You can approach it like this.
First, create a sample dataframe:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ('ABC1', 150, [{"min": 0, "max": 100, "weight": 123},
                   {"min": 101, "max": 200, "weight": 456},
                   {"min": 201, "max": 500, "weight": 789}]),
    ('ABC2', 50, [{"min": 0, "max": 100, "weight": 123},
                  {"min": 101, "max": 200, "weight": 456},
                  {"min": 201, "max": 500, "weight": 789}])],
    ['companyId', 'companySize', 'weightingRange'])
Then, create a UDF and apply it on each row to get the new column:
def get_weight(wt, wt_rnge):
    for _d in wt_rnge:
        if _d['min'] <= wt <= _d['max']:
            return _d['weight']

get_weight_udf = F.udf(lambda x, y: get_weight(x, y))
df = df.withColumn('companyWeighting', get_weight_udf(F.col('companySize'), F.col('weightingRange')))
df.show()
You get the output as:
+---------+-----------+--------------------+----------------+
|companyId|companySize| weightingRange|companyWeighting|
+---------+-----------+--------------------+----------------+
| ABC1| 150|[Map(weight -> 12...| 456|
| ABC2| 50|[Map(weight -> 12...| 123|
+---------+-----------+--------------------+----------------+
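For what it's worth, on Spark 2.4+ the same lookup can be sketched without a UDF, using the filter higher-order function. This assumes weightingRange is an array of structs with fields min, max and weight; with an array of maps (as in the sample above) the field access syntax would differ:
import pyspark.sql.functions as F

# keep only the ranges that contain companySize, then take the first match's weight
df = df.withColumn(
    "companyWeighting",
    F.expr("filter(weightingRange, r -> companySize >= r.min and companySize <= r.max)[0].weight"))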

Convert Parquet to JSON [duplicate]

I have the following sample DataFrame:
   a |    b |    c |
   1 |    2 |    4 |
   0 | null | null |
null |    3 |    4 |
And I want to replace null values only in the first two columns, "a" and "b":
   a |    b |    c |
   1 |    2 |    4 |
   0 |    0 | null |
   0 |    3 |    4 |
Here is the code to create sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)
df.fillna(0, subset=['a', 'b'])
There is a parameter named subset that lets you choose the columns, unless your Spark version is lower than 1.3.1.
Use a dictionary to fill values of certain columns:
df.fillna( { 'a':0, 'b':0 } )
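A quick sanity check on the sample data above (a sketch; only columns a and b are filled, column c keeps its null value):
df2.fillna(0, subset=['a', 'b']).show()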

Split string into multiple rows with one character each

I want to split a word in a column into multiple rows, each with a single character. Small example below:
Id Name StartDate EndDate
1 raj 2017-07-05 2008-08-06
here the expected result is:
Id Name StartDate EndDate
1 r 2017-07-05 2008-08-06
1 a 2017-07-05 2008-08-06
1 j 2017-07-05 2008-08-06
First split the string into a list and then use explode. Note that a filter needs to be used, as otherwise one row would contain an empty string.
val df = spark.createDataFrame(Seq((1, "raj"), (2, "test"))).toDF("Id", "Name")
val df2 = df.withColumn("Name", explode(split($"Name", ""))).filter($"Name" =!= "")
This will give you:
+---+----+
| Id|Name|
+---+----+
| 1| r|
| 1| a|
| 1| j|
| 2| t|
| 2| e|
| 2| s|
| 2| t|
+---+----+
Note, for older versions of Spark (older than 2.0.0), use !== instead of =!= when checking for inequality.
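A PySpark equivalent would be roughly as follows (a sketch; the same idea of split, explode and filtering out the empty string):
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "raj"), (2, "test")], ["Id", "Name"])
df2 = (df.withColumn("Name", F.explode(F.split(F.col("Name"), "")))
         .filter(F.col("Name") != ""))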

Convert Json to separate columns in HIVE

I have 4 columns in a Hive database table. The first two columns are of type string; the 3rd and 4th are of JSON type. How can I extract the JSON data into different columns?
The SERDEs available in Hive seem to handle only JSON data. I have both normal (STRING) and JSON data. How can I extract the data into separate columns here?
Example:
abc 2341 {max:2500e0,value:"20",Type:"1",ProviderType:"ABC"} {Name:"ABC",minA:1200e0,StartDate:1483900200000,EndDate:1483986600000,Flags:["flag4","flag3","flag2","flag1"]}
xyz 6789 {max:1300e0,value:"10",Type:"0",ProviderType:"foo"} {Name:"foo",minA:3.14159e0,StartDate:1225864800000,EndDate:1225864800000,Flags:["foo","foo"]}
Given a fixed JSON structure:
create table mytable (str string,i int,jsn1 string, jsn2 string);
insert into mytable values
('abc',2341,'{"max":2500e0,"value":"20","Type":"1","ProviderType":"ABC"}','{"Name":"ABC","minA":1200e0,"StartDate":1483900200000,"EndDate":1483986600000,"Flags":["flag4","flag3","flag2","flag1"]}')
,('xyz',6789,'{"max":1300e0,"value":"10","Type":"0","ProviderType":"foo"}','{"Name":"foo","minA":3.14159e0,"StartDate":1225864800000,"EndDate":1225864800000,"Flags":["foo","foo"]}')
;
select str,i
,jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
,jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate
,jsn2_Flags
from mytable
lateral view json_tuple (jsn1,'max','value','Type','ProviderType')
j1 as jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
lateral view json_tuple (jsn2,'Name','minA','StartDate','EndDate','Flags')
j2 as jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate,jsn2_Flags
;
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| str | i | jsn1_max | jsn1_value | jsn1_type | jsn1_providertype | jsn2_name | jsn2_mina | jsn2_startdate | jsn2_enddate | jsn2_flags |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| abc | 2341 | 2500.0 | 20 | 1 | ABC | ABC | 1200.0 | 1483900200000 | 1483986600000 | ["flag4","flag3","flag2","flag1"] |
| xyz | 6789 | 1300.0 | 10 | 0 | foo | foo | 3.14159 | 1225864800000 | 1225864800000 | ["foo","foo"] |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+