How can I use a function in the DataFrame withColumn method in PySpark?

I have some dictionaries and a function defined:
dict_TEMPERATURE = {(0, 70): 'Low', (70.01, 73.99): 'Normal-Low',(74, 76): 'Normal', (76.01, 80): 'Normal-High', (80.01, 300): 'High'}
...
hierarchy_dict = {'TEMP': dict_TEMPERATURE, 'PRESS': dict_PRESSURE, 'SH_SP': dict_SHAFT_SPEED, 'POI': dict_POI, 'TRIG': dict_TRIGGER}
def function_definition(valor, atributo):
    dict_atributo = hierarchy_dict[atributo]
    valor_generalizado = None
    if isinstance(valor, (int, float, complex)):  # 'long' only exists in Python 2
        for key, value in dict_atributo.items():
            if isinstance(key, tuple):
                if valor > key[0] and valor < key[1]:
                    valor_generalizado = value
    else:  # if it is not numeric
        valor_generalizado = dict_atributo.get(valor)
    return valor_generalizado
What this function basically does is: check the value passed as an argument to function_definition and replace it according to the ranges in the corresponding dictionary.
So if I call function_definition(60, 'TEMP') it will return 'Low'.
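Outside Spark, the lookup logic above can be exercised in plain Python. This is a minimal sketch using only the temperature dictionary; the helper name `lookup` is illustrative, not part of the original code:

```python
# Only the temperature dictionary from the question is used here
dict_TEMPERATURE = {(0, 70): 'Low', (70.01, 73.99): 'Normal-Low',
                    (74, 76): 'Normal', (76.01, 80): 'Normal-High',
                    (80.01, 300): 'High'}

def lookup(valor, dict_atributo):
    # Numeric values fall into a (low, high) range; anything else is a direct key lookup
    if isinstance(valor, (int, float)):
        for key, value in dict_atributo.items():
            if isinstance(key, tuple) and key[0] < valor < key[1]:
                return value
        return None
    return dict_atributo.get(valor)

print(lookup(60, dict_TEMPERATURE))  # 'Low'
print(lookup(75, dict_TEMPERATURE))  # 'Normal'
```

Note that the comparisons are strict, so a value sitting exactly on a boundary (e.g. 74) falls into no range and returns None.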
On the other hand, I have a dataframe with the following structure (this is an example):
+----+-----+-----+---+----+
|TEMP|SH_SP|PRESS|POI|TRIG|
+----+-----+-----+---+----+
| 0| 1| 2| 0| 0|
| 0| 2| 3| 1| 1|
| 0| 3| 4| 2| 1|
| 0| 4| 5| 3| 1|
| 0| 5| 6| 4| 1|
| 0| 1| 2| 5| 1|
+----+-----+-----+---+----+
What I want to do is replace the values of one column of the dataframe using the function defined above, so I have the following line of code:
dataframe_new = dataframe.withColumn(atribute_name, function_definition(dataframe[atribute_name], atribute_name))
But I get the following error message when executing it:
AssertionError: col should be Column
What is wrong in my code? How could I do that?

Your function_definition(valor, atributo) returns a single string (valor_generalizado) for a single valor.
AssertionError: col should be Column means that you are passing an argument to withColumn(colName, col) that is not a Column.
So you have to wrap your function so that it produces a Column, for example as you can see below.
Dataframe for example (same structure as yours):
a = [(10.0,1.2),(73.0,4.0)] # like your dataframe, this is only an example
dataframe = spark.createDataFrame(a,["tp", "S"]) # tp and S are random names for these columns
dataframe.show()
+----+---+
| tp| S|
+----+---+
|10.0|1.2|
|73.0|4.0|
+----+---+
As the documentation says:
udf: Creates a Column expression representing a user defined function (UDF).
Solution:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

attr = 'TEMP'
udf_func = udf(lambda x: function_definition(x, attr), returnType=StringType())
dataframe_new = dataframe.withColumn("newCol",udf_func(dataframe.tp))
dataframe_new.show()
+----+---+----------+
| tp| S| newCol|
+----+---+----------+
|10.0|1.2| Low|
|73.0|4.0|Normal-Low|
+----+---+----------+

Related

Pyspark - how to group by and create a key value pair column

I have data similar to the one below:
Col1,col2,col3
a,1,#
b,2,$
c,3,%
I need to create a new column with col2 as key and col3 as value, similar to below:
Col1,col2,col3,col4
a,1,#,{1:#}
b,2,$,{2:$}
c,3,%,{3:%}
How can I achieve this using pyspark?
Try format_string:
import pyspark.sql.functions as F
df2 = df.withColumn('col4', F.format_string('{%d:%s}', 'col2', 'col3'))
df2.show()
+----+----+----+-----+
|Col1|col2|col3| col4|
+----+----+----+-----+
| a| 1| #|{1:#}|
| b| 2| $|{2:$}|
| c| 3| %|{3:%}|
+----+----+----+-----+
If you want a key-value relationship, maps might be more appropriate:
df2 = df.withColumn('col4', F.create_map('col2', 'col3'))
df2.show()
+----+----+----+--------+
|Col1|col2|col3| col4|
+----+----+----+--------+
| a| 1| #|[1 -> #]|
| b| 2| $|[2 -> $]|
| c| 3| %|[3 -> %]|
+----+----+----+--------+
You can also convert the map to a JSON string, similar to your expected output:
df2 = df.withColumn('col4', F.to_json(F.create_map('col2', 'col3')))
df2.show()
+----+----+----+---------+
|Col1|col2|col3| col4|
+----+----+----+---------+
| a| 1| #|{"1":"#"}|
| b| 2| $|{"2":"$"}|
| c| 3| %|{"3":"%"}|
+----+----+----+---------+
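For intuition, the same per-row transformation can be reproduced outside Spark in plain Python: the first list mirrors what `F.format_string('{%d:%s}', 'col2', 'col3')` produces, and the second mirrors `F.to_json(F.create_map(...))`, which serializes both key and value as JSON strings. The `rows` data is just the example table:

```python
import json

rows = [('a', 1, '#'), ('b', 2, '$'), ('c', 3, '%')]

# format_string applies a printf-style template per row
formatted = ['{%d:%s}' % (col2, col3) for _, col2, col3 in rows]
print(formatted)  # ['{1:#}', '{2:$}', '{3:%}']

# to_json(create_map(...)) turns each (key, value) pair into a JSON object string
as_json = [json.dumps({str(col2): col3}, separators=(',', ':')) for _, col2, col3 in rows]
print(as_json)  # ['{"1":"#"}', '{"2":"$"}', '{"3":"%"}']
```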

MySQL : Having a table with minimum and maximum value columns, how could I find the "highest" values if my search term exceeds maximum value?

Sorry if my question is a bit confusing, but neither database design nor queries are my strong points.
Let's say I sell a product: cables. Those products have three "variations", whose prices are applied in layers, like post-processing steps:
Type 1: Ordinary cable.
Type 2: Ordinary cable plus custom color.
Type 3: Ordinary cable plus custom color plus terminals.
Also, the final price of the cable will depend on the length: the more meters of cable you buy, the lower the per-meter price I apply.
So, I've designed a cable_pricings table like this:
id|product_id|product_type_id|min_length|max_length|price|
--|----------|---------------|----------|----------|-----|
1| 1| 1| 0| 10| 0.50|
2| 1| 1| 10| 20| 0.45|
3| 1| 1| 20| 40| 0.40|
4| 1| 1| 40| 50| 0.30|
5| 1| 1| 50| 60| 0.25|
6| 1| 1| 60| 0| 0.15|
7| 1| 2| 0| 10| 0.35|
8| 1| 2| 10| 20| 0.30|
9| 1| 2| 20| 40| 0.30|
10| 1| 2| 40| 50| 0.20|
11| 1| 2| 50| 60| 0.20|
12| 1| 2| 60| 0| 0.20|
13| 1| 3| 0| 10| 0.40|
14| 1| 3| 10| 20| 0.40|
15| 1| 3| 20| 40| 0.30|
16| 1| 3| 40| 50| 0.30|
17| 1| 3| 50| 60| 0.25|
18| 1| 3| 60| 0| 0.25|
Now with this structure, let's say I want to buy 47 meters of cable, with custom color. With a single query like this:
SELECT * FROM cable_pricings
WHERE product_id = 1
AND product_type_id IN (1,2)
AND min_length <= 47
AND max_length > 47;
I get two rows that hold those types of cable and fall in the right length interval; then in my server code I iterate over the results and compute the final price. Up to here, everything is good.
But my problem is on the "edge" cases:
If I want to buy 60 meters of cable, my query won't work, as max_length is 0.
If I want to buy more than 60 meters of cable, my approach won't work either, because in that case none of the conditions apply.
I've already tried with MAX and MIN, but I'm not getting the expected results (and since aggregate functions scan the whole table, I'd prefer, if possible, not to use aggregates).
I also thought of putting 9999999 as the 'edge' max_length, but I think that's just... a dirty fix. Also, this will be managed from a backend, and I don't expect the final user to type lots of 999999s for the edge case.
Then my questions are:
Can I solve the "edge" cases with a single query, or do I have to split my cases into two separate queries?
Is my table design correct at all?
You can change:
AND max_length > 47
to:
AND (max_length > 47 OR max_length = 0)
I would use this query:
SELECT *
FROM cable_pricings
WHERE product_id = 1
AND product_type_id IN (1,2)
AND min_length <= 60
AND (max_length > 60 OR max_length = 0);
(db-fiddle example)
You have to include the max_length = 0 case, otherwise the open-ended maximum ranges will never match.
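The fixed predicate can be checked end-to-end with SQLite's stdlib driver. This is a minimal sketch that reproduces only the product_type_id = 2 rows from the example table; the helper name `price_rows` is illustrative:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("""CREATE TABLE cable_pricings
               (id INT, product_id INT, product_type_id INT,
                min_length INT, max_length INT, price REAL)""")
con.executemany("INSERT INTO cable_pricings VALUES (?,?,?,?,?,?)", [
    (7, 1, 2, 0, 10, 0.35), (8, 1, 2, 10, 20, 0.30), (9, 1, 2, 20, 40, 0.30),
    (10, 1, 2, 40, 50, 0.20), (11, 1, 2, 50, 60, 0.20), (12, 1, 2, 60, 0, 0.20),
])

def price_rows(length):
    # max_length = 0 marks the open-ended top range
    return con.execute("""SELECT id, price FROM cable_pricings
                          WHERE product_id = 1 AND product_type_id = 2
                          AND min_length <= ?
                          AND (max_length > ? OR max_length = 0)""",
                       (length, length)).fetchall()

print(price_rows(47))  # [(10, 0.2)]
print(price_rows(75))  # [(12, 0.2)] -- the open-ended range now matches
```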

CakePHP (sub)query to get youngest record of each group

In a project I manage invoices that have a status which is changed throughout their lifetime. The status changes are saved in another database table which is similar to this:
|id|invoice_id|user_id|old_status_id|new_status_id|change_date |
-----------------------------------------------------------------------
| 1| 1| 1| 1| 3|2013-11-11 12:00:00|
| 2| 1| 2| 3| 5|2013-11-11 12:30:00|
| 3| 2| 3| 1| 2|2013-11-10 08:00:00|
| 4| 1| 1| 5| 6|2013-11-11 13:10:00|
| 5| 2| 2| 2| 5|2013-11-10 09:00:00|
For each invoice, I would like to retrieve the last status change. Thus the result should contain the records with the ids 4 and 5.
|id|invoice_id|user_id|old_status_id|new_status_id|change_date |
-----------------------------------------------------------------------
| 4| 1| 1| 5| 6|2013-11-11 13:10:00|
| 5| 2| 2| 2| 5|2013-11-10 09:00:00|
If I group by invoice_id and use max(change_date), I retrieve the latest date, but the values of the other fields are not taken from the row containing that latest date in the group.
That's challenge #1 for me.
Challenge #2 would be to realize the query with CakePHP's methods, if possible.
Challenge #3 would be to filter the result to those records belonging to the current user. So if the current user has the id 1, the result is
|id|invoice_id|user_id|old_status_id|new_status_id|change_date |
-----------------------------------------------------------------------
| 4| 1| 1| 5| 6|2013-11-11 13:10:00|
If he or she has user id 2, the result is
|id|invoice_id|user_id|old_status_id|new_status_id|change_date |
-----------------------------------------------------------------------
| 5| 2| 2| 2| 5|2013-11-10 09:00:00|
For the user with id 3 the result would be empty.
In other words, I do not want to find all latest changes that a user has made, regardless of whether he was the last one to make a change. Instead, I want to find all invoice changes where that user was the last one so far who made a change. The motivation is that I want to enable a user to undo their change, which is only possible if no other user performed another change after them.
In case anyone needs an answer
Strictly focusing on:
I want to find all invoice changes where that user was the last one so far who made a change
Write the SQL as
SELECT foo.*
FROM foo
LEFT JOIN foo AS after_foo
ON foo.invoice_id = after_foo.invoice_id
AND foo.change_date < after_foo.change_date
WHERE after_foo.id IS NULL
AND foo.user_id = 1;
Implement it using the joins option of CakePHP's find().
The SQL for the suggested algorithm is something like:
SELECT foo.*
FROM foo
JOIN (SELECT invoice_id, MAX(change_date) AS most_recent
FROM foo
GROUP BY invoice_id) AS recently
ON recently.invoice_id = foo.invoice_id
AND recently.most_recent = foo.change_date
WHERE foo.user_id = 1;
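The anti-join pattern from the first query can be verified against the example data with SQLite; this is a sketch reusing the answer's table name `foo`:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("""CREATE TABLE foo (id INT, invoice_id INT, user_id INT,
               old_status_id INT, new_status_id INT, change_date TEXT)""")
con.executemany("INSERT INTO foo VALUES (?,?,?,?,?,?)", [
    (1, 1, 1, 1, 3, '2013-11-11 12:00:00'),
    (2, 1, 2, 3, 5, '2013-11-11 12:30:00'),
    (3, 2, 3, 1, 2, '2013-11-10 08:00:00'),
    (4, 1, 1, 5, 6, '2013-11-11 13:10:00'),
    (5, 2, 2, 2, 5, '2013-11-10 09:00:00'),
])

def last_changes_by(user_id):
    # Anti-join: keep rows that have no later change on the same invoice,
    # then restrict to the given user
    return con.execute("""SELECT foo.id FROM foo
                          LEFT JOIN foo AS after_foo
                            ON foo.invoice_id = after_foo.invoice_id
                           AND foo.change_date < after_foo.change_date
                          WHERE after_foo.id IS NULL
                            AND foo.user_id = ?""", (user_id,)).fetchall()

print(last_changes_by(1))  # [(4,)] -- user 1 made the latest change on invoice 1
print(last_changes_by(3))  # []     -- user 3's change was superseded
```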

Select last 3 entries for each client and insert it into columns

Need your help on the following: I need to select the last three comments for each client and pivot them into columns. So, the input looks like this:
ID| Client_ID| Comment_Date| Comments|
1| 1| 29-Apr-13| d|
2| 1| 30-Apr-13| dd|
3| 1| 01-May-13| ddd|
4| 1| 03-May-13| dddd|
5| 2| 02-May-13| a|
6| 2| 04-May-13| aa|
7| 2| 06-May-13| aaa|
8| 3| 03-May-13| b|
9| 3| 06-May-13| bb|
10| 4| 01-May-13| c|
The output I need to get is as follows:
Client_ID| Last comment| (Last-1) comment| (Last-2) comment|
1| dddd| ddd| dd|
2| aaa| aa| a|
3| bb| b|
4| c|
Please, help!!
SELECT x.*
  FROM my_table x
  JOIN my_table y
    ON y.client_id = x.client_id
   AND y.id >= x.id
 GROUP BY x.client_id, x.id
HAVING COUNT(*) <= 3;
I don't think you can get this with a single SQL query. Maybe you can, but I think it's easier with PHP. For example, you can get the comments with this query:
SELECT * FROM Comment
WHERE Client_ID = ?
ORDER BY Date DESC
LIMIT 0,3
It will return the three last comments of a user. Then you can do whatever you want with them!
Hope it helps.
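The first answer's self-join trick is easy to check: for each row, count itself plus the later rows of the same client, and keep counts of at most 3. A sketch against the example data, run through SQLite:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE my_table (id INT, client_id INT, comments TEXT)")
con.executemany("INSERT INTO my_table VALUES (?,?,?)", [
    (1, 1, 'd'), (2, 1, 'dd'), (3, 1, 'ddd'), (4, 1, 'dddd'),
    (5, 2, 'a'), (6, 2, 'aa'), (7, 2, 'aaa'),
    (8, 3, 'b'), (9, 3, 'bb'), (10, 4, 'c'),
])

# Each group in the GROUP BY is one row x; COUNT(*) is the number of
# rows of the same client with id >= x.id, i.e. x's position from the end
rows = con.execute("""SELECT x.client_id, x.comments
                      FROM my_table x
                      JOIN my_table y
                        ON y.client_id = x.client_id AND y.id >= x.id
                      GROUP BY x.client_id, x.id
                      HAVING COUNT(*) <= 3
                      ORDER BY x.client_id, x.id""").fetchall()
print(rows)
```

Client 1's oldest comment 'd' is dropped, while clients with fewer than three comments keep everything; pivoting the surviving rows into columns is then straightforward in application code.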

Know which condition didn't match in a MySQL query

In a highly selective MySQL query, is there a way to know which condition did not match?
E.g.: if I'm looking for 12 conditions and MySQL returns 0 rows, which one of these 12 conditions did not match?
SELECT
article.id,
article.name,
GROUP_CONCAT(tags.name order by tags.name) AS nameTags,
GROUP_CONCAT(tags.id order by tags.id) AS idTags,
MAX(IF(article.name LIKE '%var1%',1,0)) AS var1match,
MAX(IF(article.name LIKE '%var2%',1,0)) AS var2match,
... and 10 more conditions
FROM
article
LEFT JOIN ....
LEFT JOIN ....
GROUP BY id
HAVING
(nameTags LIKE '%var1%' OR var1match=1)
AND (nameTags LIKE '%var2%' OR var2match=1)
AND ... 10 more conditions
I mean, can I do it without making 12 separate queries to see which one returns nothing?
Update: I've made some progress.
If I do this:
SELECT
article.id,
MAX(IF(article.name LIKE '%var1%',1,0)) AS var1,
MAX(IF(article.name LIKE '%var2%',1,0)) AS var2,
MAX(IF(article.name LIKE '%var3%',1,0)) AS var3
FROM
article
LEFT JOIN ....
LEFT JOIN ....
GROUP BY article.id
I'll get an aliased table like this
|id|var1|var2|var3|
| 1| 0| 0| 1|
| 2| 0| 1| 0|
| 3| 0| 0| 0|
| 4| 0| 1| 1|
So I know that 'var1' doesn't match the condition because the whole column is equal to 0. How do I get the name of that column back?
I tried to sum the values and get the columns whose sum = 0, but I don't know how to do it.
Update:
I found a way to do it using temporary tables.
1) Create the tmp table:
CREATE TEMPORARY TABLE tmp_tagReport AS
SELECT
article.id,
MAX(IF(article.name LIKE '%var1%',1,0)) AS l_var1,
MAX(IF(article.name LIKE '%var2%',1,0)) AS l_var2,
MAX(IF(article.name LIKE '%var3%',1,0)) AS l_var3
FROM
article
LEFT JOIN ....
LEFT JOIN ....
GROUP BY article.id
You'll have something like this:
|id|l_var1|l_var2|l_var3|
| 1| 0| 0| 1|
| 2| 0| 1| 0|
| 3| 0| 0| 0|
| 4| 0| 1| 1|
2) Sum the values; any column whose sum is 0 had no match:
SELECT
SUM(l_var1) as l_var1,
SUM(l_var2) as l_var2,
SUM(l_var3) as l_var3
FROM tmp_tagReport;
You'll get something like this:
|l_var1|l_var2|l_var3|
| 0| 2| 2|
3) Get name of columns with value = 0 in PHP
$querySelect = "SELECT
SUM(l_var1) as l_var1,
SUM(l_var2) as l_var2,
SUM(l_var3) as l_var3
FROM tmp_tagReport
";
$result = mysql_query($querySelect);
$arrayUnfound = array_keys(@mysql_fetch_assoc($result), "0");
print_r($arrayUnfound);
4) It is a temporary table, but I added a drop just to be sure and to free memory:
mysql_query("DROP TABLE tmp_tagReport; ");
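The flag-summing idea transfers directly to other engines. Below is a minimal sketch in Python with SQLite (table contents are made up for illustration); it collapses the two steps into one SELECT, since here a boolean comparison already yields 0/1 and can be summed directly:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE article (id INT, name TEXT)")
con.executemany("INSERT INTO article VALUES (?, ?)",
                [(1, 'var3'), (2, 'var2'), (3, 'other'), (4, 'var2 var3')])

# One 0/1 flag per condition, summed over all rows
row = con.execute("""SELECT
                       SUM(name LIKE '%var1%') AS l_var1,
                       SUM(name LIKE '%var2%') AS l_var2,
                       SUM(name LIKE '%var3%') AS l_var3
                     FROM article""").fetchone()

totals = dict(zip(['l_var1', 'l_var2', 'l_var3'], row))
# Columns that sum to 0 are the conditions nothing matched
unmatched = [col for col, total in totals.items() if total == 0]
print(unmatched)  # ['l_var1'] -- no article name contains 'var1'
```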
I found a way to do it myself, but it still takes 2 or more queries. I hope it's helpful anyway.
Sorry for my English.
Thanks