I have an issue relating to an earlier post I made (for ref: SQL - How to find optimal performance numbers for query)
Essentially, I am now trying to use a CASE expression to create a variety of different groups. For example, following the example from the previous post, I have the following:
SELECT
Vehicle_type,
case when Number_of_passengers::numeric = 0 then 'cat1=0'
when Number_of_passengers::numeric < 2 then 'cat1<2'
when Number_of_passengers::numeric < 3 then 'cat1<3'
when Number_of_passengers::numeric < 4 then 'cat1<4'
when Number_of_passengers::numeric < 5 then 'cat1<5'
when Number_of_passengers::numeric < 6 then 'cat1<6'
when Number_of_passengers::numeric < 7 then 'cat1<7'
when Number_of_passengers::numeric > 2 then 'cat1>2'
when Number_of_passengers::numeric > 3 then 'cat1>3'
when Number_of_passengers::numeric > 4 then 'cat1>4'
when Number_of_passengers::numeric > 5 then 'cat1>5'
when Number_of_passengers::numeric > 6 then 'cat1>6'
when Number_of_passengers::numeric > 7 then 'cat1>7' end as category1,
case when Number_of_doors::numeric = 0 then 'cat2=0'
when Number_of_doors::numeric > 2 then 'cat2>2'
when Number_of_doors::numeric > 3 then 'cat2>3'
when Number_of_doors::numeric > 4 then 'cat2>4'
when Number_of_doors::numeric > 5 then 'cat2>5'
when Number_of_doors::numeric > 6 then 'cat2>6'
when Number_of_doors::numeric > 7 then 'cat2>7'
when Number_of_doors::numeric < 2 then 'cat2<2'
when Number_of_doors::numeric < 3 then 'cat2<3'
when Number_of_doors::numeric < 4 then 'cat2<4'
when Number_of_doors::numeric < 5 then 'cat2<5'
when Number_of_doors::numeric < 6 then 'cat2<6'
when Number_of_doors::numeric < 7 then 'cat2<7' end as category2,
round(sum(case when in_accident='t' then 1.0 end) / count(*), 3) as accident_rate
FROM Accidents
GROUP by 1,2,3
This actually gives me the correct output in terms of format; however, the numbers I get in the 'accident_rate' column vary.
If I run this whole query, I get a different accident_rate for the groups than if I look at the groups individually.
To explain: if I run the above query and look at Car, Number_of_passengers > 2 and Number_of_doors > 2, my accident rate could be 60%.
Yet, if I run the query as:
SELECT
Vehicle_type,
case when Number_of_passengers::numeric > 2 then 'cat1>2' end as category1,
case when Number_of_doors::numeric > 2 then 'cat2>2' end as category2,
round(sum(case when in_accident='t' then 1.0 end) / count(*), 3) as accident_rate
FROM Accidents
GROUP by 1,2,3
my accident rate might only be 20%.
Any suggestions?
Thanks
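Worth noting: a CASE expression returns only the first matching branch, so overlapping conditions such as `< 7` listed before `> 2` claim rows that the later branches would also match, which changes group membership (and hence the rates) between the two queries. A minimal sqlite3 sketch with hypothetical data illustrating the first-match behaviour:

```python
import sqlite3

# Hypothetical data: CASE returns the FIRST matching branch, so with
# overlapping conditions like "< 4" listed before "> 2", a row with
# 3 passengers falls into 'cat1<4' and never into 'cat1>2'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accidents (passengers INTEGER)")
conn.executemany("INSERT INTO accidents VALUES (?)", [(3,), (3,), (8,)])

rows = conn.execute("""
    SELECT CASE WHEN passengers < 4 THEN 'cat1<4'
                WHEN passengers > 2 THEN 'cat1>2'
           END AS category, COUNT(*)
    FROM accidents
    GROUP BY 1
""").fetchall()
print(rows)  # the two rows with 3 passengers land in 'cat1<4' only
```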
I have a question. As I am new to programming, I can't figure out this simple logic. I am creating buckets with the code below, and for some reason the last buckets I created, i.e. 5 and 6, are not working: everything falls under classification 1. I need some help categorising the data into different buckets. Can someone please tell me what can be done so that the data meeting conditions 5 and 6 falls under them?
case when Sum(SubscriptLineAmount) > 0 and sum([Previous Amount]) = 0 then 1
when Sum(SubscriptLineAmount) > 0 and Sum(differenceAmount) > 0 then 2
when Sum(SubscriptLineAmount) > 0 and Sum(differenceAmount) < 0 then 3
when Sum(SubscriptLineAmount) = 0 and Sum(differenceAmount) <> 0 then 4
when Sum(SubscriptLineAmount) > 0 and sum([Previous Amount]) = 0 and [SUBSCRIPTION ORDER TYPE LIST] = 'Acquisition' then 5
when Sum(SubscriptLineAmount) > 0 and sum([Previous Amount]) = 0 and [SUBSCRIPTION ORDER TYPE LIST] = 'Other' then 6
else 7
end classification
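A likely cause, given that conditions 5 and 6 are stricter versions of condition 1: CASE stops at the first true branch, so condition 1 always matches before 5 and 6 can be reached. Moving the more specific branches above the general one fixes it. A small Python sketch of the same first-match logic (names simplified):

```python
def classify(line_amt, prev_amt, diff_amt, order_type):
    # More specific branches (the old 5 and 6) must come BEFORE the
    # general "line_amt > 0 and prev_amt == 0" branch, or they are
    # shadowed and everything falls under classification 1.
    if line_amt > 0 and prev_amt == 0 and order_type == 'Acquisition':
        return 5
    if line_amt > 0 and prev_amt == 0 and order_type == 'Other':
        return 6
    if line_amt > 0 and prev_amt == 0:
        return 1
    if line_amt > 0 and diff_amt > 0:
        return 2
    if line_amt > 0 and diff_amt < 0:
        return 3
    if line_amt == 0 and diff_amt != 0:
        return 4
    return 7

print(classify(100, 0, 0, 'Acquisition'))  # 5, not 1
print(classify(100, 0, 0, 'Renewal'))      # 1
```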
I have a sql query that calculates a field as follows:
Select
Case When b.xField > 10
then 20
when b.xField > 48
then 10
else 0
end as field1
from (Select CASE WHEN numberChosen < 15
THEN 10
WHEN numberChosen > 15
THEN 48
END as xField
From secondTable) B
What I need is: when field1 is 10, then do some other calculation and save the result in another field.
Example something like:
then 20 AND field2 = yData - 26
so when the case lands on 20, have another field equal to yData - 26.
Can that be achieved using CASE in SQL? Can two fields be calculated in a single CASE?
I want to calculate two fields in one Case
You Can't Do That™.
You can put two case clauses in your query like this:
Select
Case When xField > 10
then 20
when xField > 48
then 10
else 0
end as field1,
Case When xField > 10
then ydata - 26
else 0
end as field2
from myData
Or you can generate the extra field value in a wrapper query, if field2 is really hard to compute:
SELECT field1, CASE WHEN field1 > 10 THEN ydata - 26 ELSE 0 END field2
FROM (
Select
ydata,
Case When xField > 10
then 20
when xField > 48
then 10
else 0
end as field1
from myData
) subquery
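To see the wrapper idea end to end, here is a sqlite3 sketch (table and values are made up; it also assumes field2 applies when field1 is 10, and puts the > 48 branch first so that it can actually match):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myData (xField INTEGER, ydata INTEGER)")
conn.executemany("INSERT INTO myData VALUES (?, ?)",
                 [(50, 100), (12, 100), (5, 100)])

# The inner query computes field1 once; the wrapper derives field2
# from it, so the field1 logic is never repeated.
rows = conn.execute("""
    SELECT field1,
           CASE WHEN field1 = 10 THEN ydata - 26 ELSE 0 END AS field2
    FROM (SELECT ydata,
                 CASE WHEN xField > 48 THEN 10
                      WHEN xField > 10 THEN 20
                      ELSE 0 END AS field1
          FROM myData) sub
""").fetchall()
print(rows)
```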
You can use the base logic that decides when field1 = 10 (i.e. xField > 48) in the second case:
SELECT CASE WHEN xField > 48
THEN 10
WHEN xField > 10
THEN 20
ELSE 0
END as field1,
CASE WHEN xField > 48
THEN yData - 26
END as field2
FROM (SELECT CASE WHEN numberChosen < 15
THEN 10
WHEN numberChosen > 15
THEN 48
END as xField,
yData
FROM secondTable) B
I changed the order of your CASE because putting the > 10 condition before the > 48 condition means the > 48 branch can never be hit.
I have a database that has a large number of rows with people's names in them. However, the names may include trailing numbers.
For example, it could look like this:
Bob
Mike
Betsy
Bob 2/2
Kevin
Mike 2/3
Mike 3/3
I'd like to run a query so I can count the number of names, but I'm not sure how to do this so that "Mike X/Y" is included in the count for "Mike".
e.g. my counts would be:
Bob = 2
Mike = 3
Betsy = 1
Kevin = 1
Is this possible with MySQL?
A bit clunky, but you could try testing for the existence of a number with a regular expression, then substringing to get everything before the first number:
select newname, count(*)
from
(
select name,
case when name REGEXP '[0-9]' = 1 then
case
when locate('0',name) > 0 then substring(name,1,locate('0',name) -2)
when locate('1',name) > 0 then substring(name,1,locate('1',name) -2)
when locate('2',name) > 0 then substring(name,1,locate('2',name) -2)
when locate('3',name) > 0 then substring(name,1,locate('3',name) -2)
when locate('4',name) > 0 then substring(name,1,locate('4',name) -2)
when locate('5',name) > 0 then substring(name,1,locate('5',name) -2)
when locate('6',name) > 0 then substring(name,1,locate('6',name) -2)
when locate('7',name) > 0 then substring(name,1,locate('7',name) -2)
when locate('8',name) > 0 then substring(name,1,locate('8',name) -2)
when locate('9',name) > 0 then substring(name,1,locate('9',name) -2)
end
else name
end as newname
from t
) s
group by newname
;
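For comparison, the same strip-at-first-digit idea sketched in Python (sample names taken from the question):

```python
import re
from collections import Counter

names = ["Bob", "Mike", "Betsy", "Bob 2/2", "Kevin", "Mike 2/3", "Mike 3/3"]

# Cut each name at the first digit, dropping any whitespace just
# before it, then count the cleaned names.
counts = Counter(re.split(r"\s*\d", name, maxsplit=1)[0] for name in names)
print(counts)  # Bob: 2, Mike: 3, Betsy: 1, Kevin: 1
```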
I have a data frame that contains 508383 rows. I am only showing the first 10 rows.
0 1 2
0 chr3R 4174822 4174922
1 chr3R 4175400 4175500
2 chr3R 4175466 4175566
3 chr3R 4175521 4175621
4 chr3R 4175603 4175703
5 chr3R 4175619 4175719
6 chr3R 4175692 4175792
7 chr3R 4175889 4175989
8 chr3R 4175966 4176066
9 chr3R 4176044 4176144
I want to iterate through each row and check the value of column #2 of the first row to the value of the next row. I want to check if the difference between these values is less than 5000. If the difference is greater than 5000 then I want to slice the data frame from the first row to the previous row and have this be a subset data frame.
I then want to repeat this process and create a second subset data frame. I've only managed to get this done by using the csv reader in combination with pandas.
Here is my code:
#!/usr/bin/env python
import pandas as pd
data = pd.read_csv('sort_cov_emb_sg.bed', sep='\t', header=None, index_col=None)
import csv
file = open('sort_cov_emb_sg.bed')
readCSV = csv.reader(file, delimiter="\t")
first_row = readCSV.next()
print first_row
count_1 = 0
while count_1 < 100000:
next_row = readCSV.next()
value_1 = int(next_row[1]) - int(first_row[1])
count_1 = count_1 + 1
if value_1 < 5000:
continue
else:
break
print next_row
print count_1
print value_1
window_1 = data[0:63]
print window_1
first_row = readCSV.next()
print first_row
count_2 = 0
while count_2 < 100000:
next_row = readCSV.next()
value_2 = int(next_row[1]) - int(first_row[1])
count_2 = count_2 + 1
if value_2 < 5000:
continue
else:
break
print next_row
print count_2
print value_2
window_2 = data[0:74]
print window_2
I wanted to know if there is a better way to do this process (without repeating the code every time) and get all the subset data frames I need.
Thanks.
Rodrigo
This is yet another example of the compare-cumsum-groupby pattern. Using only the rows you showed (and so changing the threshold to 100 instead of 5000):
jumps = df[2] > df[2].shift() + 100
grouped = df.groupby(jumps.cumsum())
for k, group in grouped:
print(k)
print(group)
produces
0
0 1 2
0 chr3R 4174822 4174922
1
0 1 2
1 chr3R 4175400 4175500
2 chr3R 4175466 4175566
3 chr3R 4175521 4175621
4 chr3R 4175603 4175703
5 chr3R 4175619 4175719
6 chr3R 4175692 4175792
2
0 1 2
7 chr3R 4175889 4175989
8 chr3R 4175966 4176066
9 chr3R 4176044 4176144
This works because the comparison gives us a new True every time a new group starts, and when we take the cumulative sum of that, we get what is effectively a group id, which we can group on:
>>> jumps
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: 2, dtype: bool
>>> jumps.cumsum()
0 0
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
Name: 2, dtype: int32
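For reference, a self-contained version of the pattern, assuming pandas is available; it rebuilds the ten rows shown and splits them into a list of subset frames:

```python
import pandas as pd

starts = [4174822, 4175400, 4175466, 4175521, 4175603,
          4175619, 4175692, 4175889, 4175966, 4176044]
df = pd.DataFrame({0: ["chr3R"] * 10, 1: starts,
                   2: [s + 100 for s in starts]})

# A new group starts whenever column 2 jumps by more than the threshold;
# the cumulative sum of those jumps acts as a group id.
jumps = df[2] > df[2].shift() + 100
windows = [group for _, group in df.groupby(jumps.cumsum())]
print([len(w) for w in windows])  # [1, 6, 3]
```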
Is there a way to pass a variable value in ddply/sapply directly to a function without the function(x) notation?
E.g. Instead of:
ddply(bu,.(trial), function (x) print(x$tangle) )
Is there a way to do:
ddply(bu,.(trial), print(tangle) )
I am asking because with many variables this notation becomes very cumbersome.
Thanks!
You can use fn$ in the gsubfn package. Just preface the function in question with fn$ and then you can use a formula notation as shown here:
> library(gsubfn)
>
> # instead of specifying function(x) mean(x) / sd(x)
>
> fn$sapply(iris[-5], ~ mean(x) / sd(x))
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.056602 7.014384 2.128819 1.573438
> library(plyr)
> # instead of specifying function(x) colMeans(x[-5]) / sd(x[-5])
>
> fn$ddply(iris, .(Species), ~ colMeans(x[-5]) / sd(x[-5]))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 14.20183 9.043319 8.418556 2.334285
2 versicolor 11.50006 8.827326 9.065547 6.705345
3 virginica 10.36045 9.221802 10.059890 7.376660
Just add your function parameters in the **ply command. For example:
ddply(my_data, c("var1","var2"), my_function, param1=something, param2=something)
where my_function usually looks like
my_function(x, param1, param2)
Here's a working example of this:
require(plyr)
n=1000
my_data = data.frame(
subject=1:n,
city=sample(1:4, n, T),
gender=sample(1:2, n, T),
income=sample(50:200, n, T)
)
my_function = function(data_in, dv, extra=F){
dv = data_in[,dv]
output = data.frame(mean=mean(dv), sd=sd(dv))
if(extra) output = cbind(output, data.frame(n=length(dv), se=sd(dv)/sqrt(length(dv)) ) )
return(output)
}
#with params
ddply(my_data, c("city", "gender"), my_function, dv="income", extra=T)
city gender mean sd n se
1 1 1 127.1158 44.64347 95 4.580324
2 1 2 125.0154 44.83492 130 3.932283
3 2 1 130.3178 41.00359 107 3.963967
4 2 2 128.1608 43.33454 143 3.623816
5 3 1 121.1419 45.02290 148 3.700859
6 3 2 120.1220 45.01031 123 4.058443
7 4 1 126.6769 38.33233 130 3.361968
8 4 2 125.6129 44.46168 124 3.992777
#without params
ddply(my_data, c("city", "gender"), my_function, dv="income", extra=F)
city gender mean sd
1 1 1 127.1158 44.64347
2 1 2 125.0154 44.83492
3 2 1 130.3178 41.00359
4 2 2 128.1608 43.33454
5 3 1 121.1419 45.02290
6 3 2 120.1220 45.01031
7 4 1 126.6769 38.33233
8 4 2 125.6129 44.46168
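For what it's worth, the same parameter-passing idea exists in pandas: `GroupBy.apply` forwards extra keyword arguments to the function. A small sketch with made-up data:

```python
import pandas as pd

my_data = pd.DataFrame({
    "city":   [1, 1, 2, 2],
    "gender": [1, 2, 1, 2],
    "income": [100, 120, 90, 150],
})

def my_function(data_in, dv, extra=False):
    # dv and extra arrive via the **kwargs of GroupBy.apply,
    # just like ddply(..., my_function, dv="income", extra=T).
    out = {"mean": data_in[dv].mean()}
    if extra:
        out["n"] = len(data_in)
    return pd.Series(out)

result = my_data.groupby(["city", "gender"]).apply(my_function,
                                                   dv="income", extra=True)
print(result)
```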