I have data similar to the below:
Col1,col2,col3
a,1,#
b,2,$
c,3,%
I need to create a new column with col2 as the key and col3 as the value, similar to the below:
Col1,col2,col3,col4
a,1,#,{1:#}
b,2,$,{2:$}
c,3,%,{3:%}
How can I achieve this using pyspark?
Try format_string:
import pyspark.sql.functions as F
df2 = df.withColumn('col4', F.format_string('{%d:%s}', 'col2', 'col3'))
df2.show()
+----+----+----+-----+
|Col1|col2|col3| col4|
+----+----+----+-----+
| a| 1| #|{1:#}|
| b| 2| $|{2:$}|
| c| 3| %|{3:%}|
+----+----+----+-----+
If you want a key-value relationship, maps might be more appropriate:
df2 = df.withColumn('col4', F.create_map('col2', 'col3'))
df2.show()
+----+----+----+--------+
|Col1|col2|col3| col4|
+----+----+----+--------+
| a| 1| #|[1 -> #]|
| b| 2| $|[2 -> $]|
| c| 3| %|[3 -> %]|
+----+----+----+--------+
You can also convert the map to a JSON string, similar to your expected output:
df2 = df.withColumn('col4', F.to_json(F.create_map('col2', 'col3')))
df2.show()
+----+----+----+---------+
|Col1|col2|col3| col4|
+----+----+----+---------+
| a| 1| #|{"1":"#"}|
| b| 2| $|{"2":"$"}|
| c| 3| %|{"3":"%"}|
+----+----+----+---------+
While loading the CSV below via Databricks, the 4th column of the 2nd row is not loaded.
The number of columns in the CSV varies per row.
In test_01.csv,
a,b,c
s,d,a,d
f,s
I loaded the above CSV file via Databricks as below:
>>> df2 = sqlContext.read.format("com.databricks.spark.csv").load("sample_files/test_01.csv")
>>> df2.show()
+---+---+----+
| C0| C1| C2|
+---+---+----+
| a| b| c|
| s| d| a|
| f| s|null|
+---+---+----+
Tried loading it with textFile:
rdd = sc.textFile ("sample_files/test_01.csv")
rdd.collect()
[u'a,b,c', u's,d,a,d', u'f,s']
But conversion of the above RDD to a DataFrame causes an error.
I was able to solve it by specifying the schema as below.
df2 = sqlContext.read.format("com.databricks.spark.csv").schema(schema).load("sample_files/test_01.csv")
df2.show()
+---+---+----+----+----+
| e1| e2| e3| e4| e5|
+---+---+----+----+----+
| a| b| c|null|null|
| s| d| a| d|null|
| f| s|null|null|null|
+---+---+----+----+----+
Tried with inferSchema; it still does not work:
df2 = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("sample_files/test_01.csv")
df2.show()
+---+---+----+
| C0| C1| C2|
+---+---+----+
| a| b| c|
| s| d| a|
| f| s|null|
+---+---+----+
But is there any other way without specifying the schema, since the number of columns varies?
Ensure you have fixed headers, i.e. rows can have data missing but the column names should be fixed.
If you don't specify column names, you can still create the schema while reading the CSV, for example (in Scala):
import org.apache.spark.sql.types._

val schema = new StructType()
  .add(StructField("keyname", StringType, true))
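In PySpark, the same idea could look roughly like this (a minimal sketch, assuming five is the maximum number of fields in any row; the column names e1..e5 are just placeholders):
from pyspark.sql.types import StructType, StructField, StringType

# A schema wide enough for the longest row; shorter rows are padded with nulls.
schema = StructType([StructField("e%d" % i, StringType(), True) for i in range(1, 6)])

df2 = sqlContext.read.format("com.databricks.spark.csv").schema(schema).load("sample_files/test_01.csv")
df2.show()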
I want to split a word in a column into multiple rows, each with a single character. Small example below:
Id Name StartDate EndDate
1 raj 2017-07-05 2008-08-06
here the expected result is:
Id Name StartDate EndDate
1 r 2017-07-05 2008-08-06
1 a 2017-07-05 2008-08-06
1 j 2017-07-05 2008-08-06
First split the string into a list and then use explode. Note that filter needs to be used, as otherwise one row will contain an empty string.
val df = spark.createDataFrame(Seq((1, "raj"), (2, "test"))).toDF("Id", "Name")
val df2 = df.withColumn("Name", explode(split($"Name", ""))).filter($"Name" =!= "")
This will give you:
+---+----+
| Id|Name|
+---+----+
| 1| r|
| 1| a|
| 1| j|
| 2| t|
| 2| e|
| 2| s|
| 2| t|
+---+----+
Note, for older versions of Spark (older than 2.0.0), use !== instead of =!= when checking for inequality.
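If you are working in PySpark instead, the same split/explode/filter approach applies; here is a minimal sketch (assuming the SparkSession is available as spark):
import pyspark.sql.functions as F

df = spark.createDataFrame([(1, "raj"), (2, "test")], ["Id", "Name"])

# Split into single characters, explode each character into its own row,
# and drop the empty strings that splitting on an empty pattern can produce.
df2 = df.withColumn("Name", F.explode(F.split("Name", ""))) \
        .filter(F.col("Name") != "")
df2.show()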
I have some dictionaries and a function defined:
dict_TEMPERATURE = {(0, 70): 'Low', (70.01, 73.99): 'Normal-Low',(74, 76): 'Normal', (76.01, 80): 'Normal-High', (80.01, 300): 'High'}
...
hierarchy_dict = {'TEMP': dict_TEMPERATURE, 'PRESS': dict_PRESSURE, 'SH_SP': dict_SHAFT_SPEED, 'POI': dict_POI, 'TRIG': dict_TRIGGER}
def function_definition(valor, atributo):
    dict_atributo = hierarchy_dict[atributo]
    valor_generalizado = None
    if isinstance(valor, (int, long, float, complex)):
        for key, value in dict_atributo.items():
            if isinstance(key, tuple):
                lista = list(key)
                if valor > key[0] and valor < key[1]:
                    valor_generalizado = value
    else:  # if it is not numeric
        valor_generalizado = dict_atributo.get(valor)
    return valor_generalizado
What this function basically does is check the value passed as an argument to "function_definition" and replace it according to the corresponding dictionary's ranges.
So, if I call "function_definition(60, 'TEMP')", it will return 'Low'.
On the other hand, I have a dataframe with the next structure (this is an example):
+----+-----+-----+---+----+
|TEMP|SH_SP|PRESS|POI|TRIG|
+----+-----+-----+---+----+
| 0| 1| 2| 0| 0|
| 0| 2| 3| 1| 1|
| 0| 3| 4| 2| 1|
| 0| 4| 5| 3| 1|
| 0| 5| 6| 4| 1|
| 0| 1| 2| 5| 1|
+----+-----+-----+---+----+
What I want to do is to replace the values of one column of the dataframe based on the function defined above, so I have the following code line:
dataframe_new = dataframe.withColumn(atribute_name, function_definition(dataframe[atribute_name], atribute_name))
But I get the following error message when executing it:
AssertionError: col should be Column
What is wrong in my code? How could I do that?
Your function_definition(valor, atributo) returns a single string (valor_generalizado) for a single valor.
AssertionError: col should be Column means that you are passing an argument to withColumn(colName, col) that is not a Column.
So you have to transform your data so that you pass a Column, for example as you can see below.
Dataframe for example (same structure as yours):
a = [(10.0,1.2),(73.0,4.0)] # like your dataframe, this is only an example
dataframe = spark.createDataFrame(a,["tp", "S"]) # tp and S are random names for these columns
dataframe.show()
+----+---+
| tp| S|
+----+---+
|10.0|1.2|
|73.0|4.0|
+----+---+
As you can see here, udf creates a Column expression representing a user defined function (UDF).
Solution:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

attr = 'TEMP'
udf_func = udf(lambda x: function_definition(x, attr), returnType=StringType())
dataframe_new = dataframe.withColumn("newCol", udf_func(dataframe.tp))
dataframe_new.show()
+----+---+----------+
| tp| S| newCol|
+----+---+----------+
|10.0|1.2| Low|
|73.0|4.0|Normal-Low|
+----+---+----------+
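If the goal is to replace the values of the original column rather than add newCol, the same UDF can be written back to that column. A sketch reusing the question's function_definition, dataframe, and atribute_name:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# atribute_name is whichever column you are generalizing, e.g. 'TEMP' in the real data.
udf_func = udf(lambda x: function_definition(x, atribute_name), returnType=StringType())
dataframe_new = dataframe.withColumn(atribute_name, udf_func(dataframe[atribute_name]))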
tx = 'a,b,c,"[""d"", ""e""]""'
file=open('temp.csv','wt')
file.writelines(tx)
file.close()
sparkSession.read.csv('temp.csv', quote='"').show()
+---+---+---+-------+---------+
|_c0|_c1|_c2| _c3| _c4|
+---+---+---+-------+---------+
| a| b| c|"[""d""| ""e""]""|
+---+---+---+-------+---------+
Where the desired output is
+---+---+---+-----------------+
|_c0|_c1|_c2|              _c3|
+---+---+---+-----------------+
|  a|  b|  c|"[""d"", ""e""]""|
+---+---+---+-----------------+
I am not familiar with PySpark, but there seems to be something wrong with the quotes (one too many) - should probably be:
'a,b,c,"[""d"", ""e""]"'
and the output should then be:
+---+---+---+----------+
|_c0|_c1|_c2|       _c3|
+---+---+---+----------+
|  a|  b|  c|["d", "e"]|
+---+---+---+----------+
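If it helps, here is a minimal PySpark sketch for checking this with the corrected line; note that, depending on the Spark version, you may also need to set the escape option to '"' so the doubled quotes inside the quoted field are collapsed (this option is an assumption, not something stated in the question):
# Rewrite the file with the corrected quoting (one closing quote), then read it back.
tx = 'a,b,c,"[""d"", ""e""]"'
with open('temp.csv', 'wt') as f:
    f.write(tx)

sparkSession.read.csv('temp.csv', quote='"', escape='"').show(truncate=False)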
Need your help on the following: I need to select the last three comments for each client and put them into separate columns. So the input looks like this:
ID| Client_ID| Comment_Date| Comments|
1| 1| 29-Apr-13| d|
2| 1| 30-Apr-13| dd|
3| 1| 01-May-13| ddd|
4| 1| 03-May-13| dddd|
5| 2| 02-May-13| a|
6| 2| 04-May-13| aa|
7| 2| 06-May-13| aaa|
8| 3| 03-May-13| b|
9| 3| 06-May-13| bb|
10| 4| 01-May-13| c|
The output I need to get is as follows:
Client_ID| Last comment| (Last-1) comment| (Last-2) comment|
1| dddd| ddd| dd|
2| aaa| aa| a|
3| bb| b|
4| c|
Please, help!!
SELECT x.*
  FROM my_table x
  JOIN my_table y
    ON y.client_id = x.client_id
   AND y.id >= x.id
 GROUP BY x.client_id, x.id
HAVING COUNT(*) <= 3;
For each comment, the self-join counts how many comments from the same client have an id greater than or equal to its own, so keeping counts of 3 or less leaves only the last three comments per client.
I don't think you can get this with a single SQL query. Maybe you can, but I think it's easier with PHP. For example, you can get the comments with this query:
SELECT * FROM Comment
WHERE Client_ID = ?
ORDER BY Comment_Date DESC
LIMIT 0, 3
It will return the last three comments of a user. Then you can do whatever you want with them!
Hope it helps.