pandas query a grouping within a grouping - mysql

I'm having trouble writing a query that returns the top 10 of a top 10, based on a count.
My starting table comes from this query:
top_10_cars = 'CH', 'DA', 'AG', 'DC', 'LA', 'NY', 'SA', 'SE', 'DE', 'MI'
df = pd.read_sql("select\
 count(*) as count\
 ,ID\
 ,CAR\
 from "+db+"\
 where CAR in ('"+ "','".join(top_10_cars) +"')\
 group by\
 ID\
 ,CAR\
 order by count desc\
 ",conn)
Result is a list with all the IDs for every car grouping sorted by count:
Count ID CAR
67210048 7922 CH
2081655 20001 LA
488850583 7018 AG
567585985 7018 DA
450991 7922 SA
41123124 7018 CH
4135532 11427 DA
...
..
.
The dataframe above is too big. I only want the top 10 IDs for each car.
For example CH:
Count ID CAR
67210048 7922 CH
25100548 7546 CH
465100 8542 CH
67254828 5622 CH
1251048 3522 CH
...
..
.
The resulting table should look like this
Count ID CAR
67210048 7922 CH
25100548 7546 CH
..
.
7210048 1546 DA
251005 5678 DA
25100548 7546 DA
465100 8542 DA
...
..
67254828 5622 DA
and
so
on.. 'AG', 'DC', 'LA', 'NY', 'SA', 'SE', 'DE', 'MI'

This is probably not the correct way to do this, but I just wrapped it in some Python:
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
frames = []
for x in top_10_cars:
    dftemp = pd.read_sql("select\
 count(*) as count\
 ,ID\
 ,CAR\
 from "+db+"\
 where CAR in ('"+x+"')\
 group by\
 ID\
 ,CAR\
 order by count desc limit 10",conn)
    frames.append(dftemp)
df = pd.concat(frames)
I'm open to better solutions, but the loop above did work.
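A possible alternative to the loop, assuming the database supports window functions (MySQL 8.0 and later does; older versions do not): rank the grouped rows within each CAR using ROW_NUMBER() and keep the first N, all in one query. The sketch below runs against an in-memory SQLite table as a stand-in for the real database. The table name requests, the sample rows, and the alias cnt are made up for illustration.

```python
import sqlite3

# Needs SQLite 3.25+ for window functions.
# `requests` and `cnt` are hypothetical names, not from the original question.
QUERY = """
SELECT cnt, ID, CAR
FROM (
    SELECT cnt, ID, CAR,
           ROW_NUMBER() OVER (PARTITION BY CAR ORDER BY cnt DESC, ID) AS rn
    FROM (
        SELECT COUNT(*) AS cnt, ID, CAR
        FROM requests
        GROUP BY ID, CAR
    ) AS grouped
) AS ranked
WHERE rn <= ?
ORDER BY CAR, cnt DESC, ID
"""


def top_n_per_car(conn, n):
    """Return (cnt, ID, CAR) rows: the n most frequent IDs within each CAR."""
    return conn.execute(QUERY, (n,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (ID INTEGER, CAR TEXT)")
# Invented sample data: one row per request.
sample = (
    [(7922, "CH")] * 3 + [(7018, "CH")] * 2 + [(7546, "CH")]
    + [(7018, "DA")] * 3 + [(11427, "DA")] * 2 + [(1546, "DA")]
)
conn.executemany("INSERT INTO requests (ID, CAR) VALUES (?, ?)", sample)

top = top_n_per_car(conn, 2)  # the question wants 10; 2 keeps the demo small
for row in top:
    print(row)
```

The same query text could in principle be handed to pd.read_sql with a params argument, though the placeholder style (? here) depends on the database driver in use.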

How to truncate double precision value in PostgreSQL by keeping exactly first two decimals?

I'm trying to truncate a double precision value when I build JSON with the json_build_object() function in PostgreSQL 11.8, but with no luck. To be more precise, I'm trying to truncate the number 19.9899999999999984 to ONLY two decimals, making sure it is NOT rounded to 20.00 (which is what happens) but kept at 19.98.
BTW, what I've tried so far was to use:
1) TRUNC(found_book.price::numeric, 2) and I get value 20.00
2) ROUND(found_book.price::numeric, 2) and I get value 19.99 -> so far this is the closest value, but not what I need
3) ROUND(found_book.price::double precision, 2) and I get
[42883] ERROR: function round(double precision, integer) does not exist
Also here is whole code I'm using:
create or replace function public.get_book_by_book_id8(b_id bigint) returns json as
$BODY$
declare
found_book book;
book_authors json;
book_categories json;
book_price double precision;
begin
-- Load book data:
select * into found_book
from book b2
where b2.book_id = b_id;
-- Get assigned authors
select case when count(x) = 0 then '[]' else json_agg(x) end into book_authors
from (select aut.*
from book b
inner join author_book as ab on b.book_id = ab.book_id
inner join author as aut on ab.author_id = aut.author_id
where b.book_id = b_id) x;
-- Get assigned categories
select case when count(y) = 0 then '[]' else json_agg(y) end into book_categories
from (select cat.*
from book b
inner join category_book as cb on b.book_id = cb.book_id
inner join category as cat on cb.category_id = cat.category_id
where b.book_id = b_id) y;
book_price = trunc(found_book.price, 2);
-- Build the JSON response:
return (select json_build_object(
'book_id', found_book.book_id,
'title', found_book.title,
'price', book_price,
'amount', found_book.amount,
'is_deleted', found_book.is_deleted,
'authors', book_authors,
'categories', book_categories
));
end
$BODY$
language 'plpgsql';
select get_book_by_book_id8(186);
How can I keep EXACTLY the first two decimal digits, i.e. 19.98? Any suggestion or help is greatly appreciated.
P.S. PostgreSQL version is 11.8
In PostgreSQL 11.8 or 12.3 I cannot reproduce:
# select trunc('19.9899999999999984'::numeric, 2);
trunc
-------
19.98
(1 row)
# select trunc(19.9899999999999984::numeric, 2);
trunc
-------
19.98
(1 row)
# select trunc(19.9899999999999984, 2);
trunc
-------
19.98
(1 row)
Actually, I can reproduce it with the right type and a special setting:
# set extra_float_digits=0;
SET
# select trunc(19.9899999999999984::double precision::text::numeric, 2);
trunc
-------
19.99
(1 row)
And a possible solution:
# show extra_float_digits;
extra_float_digits
--------------------
3
(1 row)
select trunc(19.9899999999999984::double precision::text::numeric, 2);
trunc
-------
19.98
(1 row)
But note that:
Note: The extra_float_digits setting controls the number of extra
significant digits included when a floating point value is converted
to text for output. With the default value of 0, the output is the
same on every platform supported by PostgreSQL. Increasing it will
produce output that more accurately represents the stored value, but
may be unportable.
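To see why the text conversion matters, here is a small Python sketch of the same effect, outside PostgreSQL. It formats the double with a given number of significant digits, then truncates the resulting text as an exact decimal. The digit counts 15 and 18 are my assumption of what extra_float_digits = 0 and 3 correspond to in PostgreSQL 11; the helper name trunc2_via_text is made up.

```python
from decimal import Decimal, ROUND_DOWN


def trunc2_via_text(value: float, significant_digits: int) -> Decimal:
    """Format a float as text, then truncate that text to two decimals."""
    text = format(value, f".{significant_digits}g")
    return Decimal(text).quantize(Decimal("0.01"), rounding=ROUND_DOWN)


price = 19.9899999999999984  # the double from the question

# 15 digits: the text is already rounded to "19.99" before truncation happens.
low = trunc2_via_text(price, 15)
# 18 digits: the text keeps "19.9899999999999984", so truncation gives 19.98.
high = trunc2_via_text(price, 18)

print(low, high)
```

So the rounding happens in the double-to-text step, not in trunc itself, which is why the setting changes the result.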
As @pifor suggested, I managed to get it done by passing trunc(found_book.price::double precision::text::numeric, 2) directly as the value in json_build_object, like this:
json_build_object(
'book_id', found_book.book_id,
'title', found_book.title,
'price', trunc(found_book.price::double precision::text::numeric, 2),
'amount', found_book.amount,
'is_deleted', found_book.is_deleted,
'authors', book_authors,
'categories', book_categories
)
Using book_price = trunc(found_book.price::double precision::text::numeric, 2); and passing it as the value for the 'price' key didn't work.
Thank you for your help. :)

loading NA values from r to sql [duplicate]

I am trying to load a dataframe from R to SQL and am having trouble getting the NAs to load as their equivalent NULL in SQL. They come up as blank cells instead. Example data:
df <- data.frame(name = c('Sara', 'Matt', 'Kyle', 'Steve', 'Maggie', NA, 'Alex', 'Morgan'),
                 student_id = c(123,124,125,126,127,128,129,130),
                 score = c(78, 83, 91, NA, 88, 92, NA, 77))
Table schema: student_score with columns name (varchar), student_id (int), and score (int)
R code:
load = "Insert into schema.student_score (name, student_id, score) values"
data = list()
for (i in seq(nrow(df))) {
info = paste0("('", df$name[i], "','",
df$student_id[i], "','",
df$score[i], "')")
data[[i]] = info
}
rows = do.call(rbind, data)
values = paste(rows[,1], collapse = ',')
send = paste0(load, values)
dbSendQuery(conn, send)
When the rows are loaded into SQL, the table comes out as:
name student_id score
Sara 123 78
Matt 124 83
Kyle 125 91
Steve 126
Maggie 127 88
128 92
Alex 129
Morgan 130 77
I want the blank values to be replaced by NULL
In your code, NA gets pasted into the string as 'NA'. You need to replace every 'NA' in send with NULL. Simply add the code below at the end:
send <- gsub("'NA'", "NULL", send)
send
"Insert into schema.student_score (name, student_id, score) values('Sara','123','78'),('Matt','124','83'),('Kyle','125','91'),('Steve','126',NULL),('Maggie','127','88'),(NULL,'128','92'),('Alex','129',NULL),('Morgan','130','77')"

Wrong data due to Joins

How can I remove the incorrect rows that I'm getting after running several joins?
My entire query is:
SELECT
distinct vortex_dbo.vw_public_material_location.material_name
,vw_public_request_material_location_mir.material_request_id
,vw_public_request_material_location_mir.parttype_name
,operation_code
,vw_public_request_material_location_mir.result_name
,vw_public_request_material_location_mir.qdf_number
, requestor
,[vortex_hvc].[vortex_dbo].[material_request].created_by
,[vortex_hvc].[vortex_dbo].[material_request].created_datetime as time1
,[vortex_hvc].[vortex_dbo].[material_request].distribution_list
,[vortex_hvc].[vortex_dbo].[material_request].recipient_name
, DATEPART(WW,[vortex_hvc].[vortex_dbo].[material_request].created_datetime) as WW
,vw_public_request_material_location_mir.product_code_name
,task_name
,vw_public_request_material_location_mir.full_location_name
FROM [vortex_hvc].[vortex_dbo].[vw_public_request_material_location_mir]
left join request on vw_public_request_material_location_mir.material_request_id = request.request_key
left join vortex_dbo.material_request on vw_public_request_material_location_mir.material_request_id = vortex_dbo.material_request.material_request_id
left join vortex_dbo.vw_public_material_location on vw_public_request_material_location_mir.last_result_id = vortex_dbo.vw_public_material_location.last_result_id
left join vortex_dbo.vw_public_material_history on vw_public_request_material_location_mir.material_request_id like (substring(vw_public_material_history.comments,12,6))
where (vw_public_request_material_location_mir.qdf_number not like 'null' and vw_public_request_material_location_mir.qdf_number not like '')
and vw_public_request_material_location_mir.product_code_name like 'LAKE%'
and vw_public_request_material_location_mir.task_id not like 'null'
and (vw_public_request_material_location_mir.result_name like 'bin 100' or vw_public_request_material_location_mir.result_name like 'bin 01'
or vw_public_request_material_location_mir.result_name like 'bin 02' or vw_public_request_material_location_mir.result_name like 'pass')
and (requestor like 'BUGANIM, RINAT' and employee_name like 'BUGANIM, RINAT')
and ( DateDiff(DD,[vortex_hvc].[vortex_dbo].[material_request].created_datetime, getdate()) < 180)
and (concat('',substring(vortex_dbo.vw_public_material_location.comments,12,6)) like vw_public_request_material_location_mir.material_request_id
or vortex_dbo.vw_public_material_location.comments like 'Changed by Matrix Transaction Handler' or vortex_dbo.vw_public_material_location.comments like 'Unit Ownership:%')
and (unit_number = vortex_dbo.vw_public_material_location.material_name or unit_number is null)
and vortex_dbo.vw_public_material_location.material_name like 'D7QM748200403'
order by vortex_dbo.vw_public_material_location.material_name desc
The result I'm getting is two rows, of which only the second contains correct data:
material_name material_request_id parttype_name operation_code result_name qdf_number requestor created_by time1 WW product_code_name task_name full_location_name
D7QM748200403 332160 H6 4GXDCV K Y 7295 BIN 01 Q1T5 BUGANIM, RINAT SMS_Interface 2017-12-03 20:27:30.327 49 CANNON LAKE Y 2+2 PPV-M SAMPLE: QDF INVENTORY
D7QM748200403 332176 H6 4GXDCV K Y 7295 BIN 01 Q1T5 BUGANIM, RINAT SMS_Interface 2017-12-03 21:02:33.247 49 CANNON LAKE Y 2+2 PPV-M SAMPLE: QDF INVENTORY
What can I do to retrieve only the correct data? I have more cases like this.
Thanks!!

Issue with Union Sub-query

I'm attempting to use a union sub-query to get the results for a couple of different queries. What I'm looking to do is select all the players who hit a home run in the 2014 season, create a home run count for each player, and find the average pitch speed of each home run. I'm also attempting to break things down by pitch type. My current code and results are as follows:
Select output.Batter_Name,
output.Qty,
output.speed,
output.avg_Speed,
output.break,
output.Type_Pitch,
Output.CH_Qty,
Output.CH_Pitch,
Output.Ch_Speed,
Output.CH_Avg_speed,
Output.CH_Break,
Output.CH_Type_Pitch
From(
SELECT
count(gameday.atbats.event) as Qty,
gameday.batters.name_display_first_last as Batter_Name,
gameday.pitches.type as Pitch,
gameday.pitches.start_speed as speed,
avg(gameday.pitches.start_speed) as avg_speed,
avg(gameday.pitches.break_length) as Break,
gameday.pitches.Pitch_type as Type_Pitch,
"0" as CH_Qty,
"0" as CH_Pitch,
"0" as Ch_Speed,
"0" as CH_Avg_speed,
"0" as CH_Break,
"0" as CH_Type_Pitch
FROM
gameday.atbats
JOIN
gameday.pitches ON gameday.atbats.num = gameday.pitches.gameAtBatID
AND gameday.pitches.gamename = gameday.atbats.gamename
INNER JOIN
gameday.batters ON gameday.atbats.batter = gameday.batters.ID
AND gameday.atbats.gamename = gameday.batters.gameName
INNER JOIN
gameday.pitchers ON gameday.atbats.pitcher = gameday.pitchers.ID
AND gameday.atbats.gamename = gameday.pitchers.gamename
WHERE
(gameday.atbats.event = 'Home Run')
AND gameday.pitches.type = 'x'
and gameday.pitches.Pitch_type = 'FF'
group by gameday.batters.name_display_first_last
UNION ALL
SELECT
"0" as Qty,
gameday.batters.name_display_first_last as Batter_Name,
"0" as Pitch,
"0" as Speed,
"0" as Avg_speed,
"0" as Break,
"0" as Type_Pitch,
count(gameday.atbats.event) as CH_Qty,
gameday.pitches.type as CH_Pitch,
gameday.pitches.start_speed as CH_speed,
avg(gameday.pitches.start_speed) as CH_avg_speed,
avg(gameday.pitches.break_length) as CH_Break,
gameday.pitches.Pitch_type as CH_Type_Pitch
FROM
gameday.atbats
JOIN
gameday.pitches ON gameday.atbats.num = gameday.pitches.gameAtBatID
AND gameday.pitches.gamename = gameday.atbats.gamename
INNER JOIN
gameday.batters ON gameday.atbats.batter = gameday.batters.ID
AND gameday.atbats.gamename = gameday.batters.gameName
INNER JOIN
gameday.pitchers ON gameday.atbats.pitcher = gameday.pitchers.ID
AND gameday.atbats.gamename = gameday.pitchers.gamename
WHERE
(gameday.atbats.event = 'Home Run')
AND gameday.pitches.type = 'x'
and gameday.pitches.Pitch_type = 'CH'
group by gameday.batters.name_display_first_last
) as Output
group by Output.Batter_name
A sample of my results is below:
Batter_Name, Qty, speed, avg_Speed, break, Type_Pitch, CH_Qty, CH_Pitch, Ch_Speed, CH_Avg_speed, CH_Break, CH_Type_Pitch
A.J. Pollock 1 89 90 4.3 FF 0 0 0 0 0 0
Aaron Hicks 0 0 0 0 0 1 X 83 83 6 CH
The first player, A.J. Pollock, shows one home run on an FF and zero on a CH. The second player, Aaron Hicks, shows zero home runs on an FF but one on a CH. The issue is that I know these players had home runs on both types of pitches, but the query returns only one or the other, not both. My intended results are something like this:
Batter_Name, Qty, speed, avg_Speed, break, Type_Pitch, CH_Qty, CH_Pitch, Ch_Speed, CH_Avg_speed, CH_Break, CH_Type_Pitch
A.J. Pollock 1 89 90 4.3 FF 2 X 84 82 3.2 CH
Aaron Hicks 4 90 91 2.5 FF 1 X 83 83 6 CH
I'm thinking the issue has to be my setting some fields to 0 as a kind of placeholder, but I can't seem to find a workable solution that gets me the results I want.
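One possible reading of the symptom: the outer GROUP BY collapses a batter's FF row and CH row into a single row without aggregating the other columns, so only one row's values survive. A common way around placeholder rows is conditional aggregation, where each pitch type gets its own CASE expression inside a single GROUP BY. The sketch below shows the idea on an in-memory SQLite table; the flattened table home_runs, its columns, the batter labels, and all numbers are invented for illustration and are not the gameday schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, already-joined table: one row per home run.
conn.execute(
    "CREATE TABLE home_runs (batter TEXT, pitch_type TEXT, start_speed REAL)"
)
conn.executemany(
    "INSERT INTO home_runs VALUES (?, ?, ?)",
    [
        ("Batter A", "FF", 90.0),
        ("Batter A", "CH", 82.0),
        ("Batter A", "CH", 84.0),
        ("Batter B", "FF", 91.0),
        ("Batter B", "CH", 83.0),
    ],
)

# One pass over the data: each CASE only "sees" rows of its own pitch type.
# AVG ignores the NULLs produced when the CASE has no matching branch.
QUERY = """
SELECT batter,
       SUM(CASE WHEN pitch_type = 'FF' THEN 1 ELSE 0 END) AS ff_qty,
       AVG(CASE WHEN pitch_type = 'FF' THEN start_speed END) AS ff_avg_speed,
       SUM(CASE WHEN pitch_type = 'CH' THEN 1 ELSE 0 END) AS ch_qty,
       AVG(CASE WHEN pitch_type = 'CH' THEN start_speed END) AS ch_avg_speed
FROM home_runs
GROUP BY batter
ORDER BY batter
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each batter then comes back as one row carrying both the FF and the CH figures, which is the shape of the intended results above.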

Perl (or R, or SQL): Count how often string appears across columns

I have a text file that looks like this:
gene1 gene2 gene3
a d c
b e d
c f g
d g
h
i
(Each column is a human gene, and each contains a variable number of proteins (strings, shown as letters here) that can bind to those genes).
What I want to do is count how many columns each string is represented in, output that number and all the column headers, like this:
a 1 gene1
b 1 gene1
c 2 gene1 gene3
d 3 gene1 gene2 gene3
e 1 gene2
f 1 gene2
g 2 gene2 gene3
h 1 gene2
i 1 gene2
I have been trying to figure out how to do this in Perl and R, but without success so far. Thanks for any help.
This solution seems like a bit of a hack, but it gives the desired output. It relies on using both plyr and reshape packages, though I'm sure you could find base R alternatives. The trick is that function melt lets us flatten the data out into a long format, which allows for easy(ish) manipulation from that point forward.
library(reshape)
library(plyr)
#Recreate your data
dat <- data.frame(gene1 = c(letters[1:4], NA, NA),
gene2 = letters[4:9],
gene3 = c("c", "d", "g", NA, NA, NA)
)
#Melt the data. You'll need to update this if you have more columns
dat.m <- melt(dat, measure.vars = 1:3)
#Tabulate counts
counts <- as.data.frame(table(dat.m$value))
#I'm not sure what to call this column since it's a smooshing of column names
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " "))
#Merge the two together. You could fix the column names above, or just deal with it here
merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Gives:
> merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Var1 Freq V1
1 a 1 gene1
2 b 1 gene1
3 c 2 gene1 gene3
4 d 3 gene1 gene2 gene3
....
In Perl, assuming the proteins in each column don't have duplicates that need to be removed (if they do, a hash of hashes should be used instead):
use strict;
use warnings;
# Map the starting offset of each header word to its gene name.
my $header = <>;
my %column_genes;
while ($header =~ /(\S+)/g) {
    $column_genes{$-[1]} = "$1";
}
# For every protein, collect the genes whose column it starts in.
my %proteins;
while (my $line = <>) {
    while ($line =~ /(\S+)/g) {
        if (exists $column_genes{$-[1]}) {
            push @{ $proteins{$1} }, $column_genes{$-[1]};
        }
        else {
            warn "line $. column $-[1] unexpected protein $1 ignored\n";
        }
    }
}
# Print protein, number of genes, and the sorted gene names.
for my $protein (sort keys %proteins) {
    print join("\t",
        $protein,
        scalar @{ $proteins{$protein} },
        join(' ', sort @{ $proteins{$protein} } )
    ), "\n";
}
Reads from stdin, writes to stdout.
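For comparison, a rough Python sketch of the same offset-based idea: the start position of each header word identifies the column, and every protein is assigned to the gene whose header starts at the same offset. It assumes the columns are aligned with spaces exactly under their headers, as in the made-up sample lines below; the function name is hypothetical.

```python
import re
from collections import defaultdict


def genes_per_protein(lines):
    """Map each protein to (number of genes, sorted gene names)."""
    header, *rows = lines
    # Start offset of each header word -> gene name.
    column_gene = {m.start(): m.group() for m in re.finditer(r"\S+", header)}
    genes_for = defaultdict(list)
    for row in rows:
        for m in re.finditer(r"\S+", row):
            gene = column_gene.get(m.start())
            if gene is not None:  # ignore words not aligned with a header
                genes_for[m.group()].append(gene)
    return {p: (len(g), sorted(g)) for p, g in sorted(genes_for.items())}


# Sample input, aligned by hand so each protein sits under its header.
sample = [
    "gene1 gene2 gene3",
    "a     d     c",
    "b     e     d",
    "c     f     g",
    "d     g",
    "      h",
    "      i",
]

result = genes_per_protein(sample)
for protein, (count, genes) in result.items():
    print(protein, count, " ".join(genes), sep="\t")
```

Like the Perl version, this does not remove duplicate proteins within one column.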
A one-liner (or rather a three-liner):
ddply(na.omit(melt(dat, m = 1:3)), .(value), summarize,
len = length(variable),
var = paste(variable, collapse = " "))
If it's not a lot of columns, you can do something like this in sql. You basically flatten out the data into a 2 column derived table of protein/gene and then summarize it as needed.
;with cte as (
select gene1 as protein, 'gene1' as gene from genes -- "genes" is a placeholder name for the wide table
union select gene2 as protein, 'gene2' as gene from genes
union select gene3 as protein, 'gene3' as gene from genes
)
select protein, count(*) as cnt, group_concat(gene) as gene
from cte
group by protein
In mysql, like so:
select protein, count(*), group_concat(gene order by gene separator ' ') from gene_protein group by protein;
assuming data like:
create table gene_protein (gene varchar(255) not null, protein varchar(255) not null);
insert into gene_protein values ('gene1','a'),('gene1','b'),('gene1','c'),('gene1','d');
insert into gene_protein values ('gene2','d'),('gene2','e'),('gene2','f'),('gene2','g'),('gene2','h'),('gene2','i');
insert into gene_protein values ('gene3','c'),('gene3','d'),('gene3','g');