SQL: value higher than percentage of population of values - mysql

I wish to calculate, per group, the value that is higher than a given percentage of the values in that group.
Suppose I have:
CREATE TABLE project
(
id int,
event int,
val int
);
INSERT INTO project(id,event,val)
VALUES
(1, 11, 43),
(1, 12, 19),
(1, 13, 19),
(1, 14, 53),
(1, 15, 45),
(1, 16, 35),
(2, 21, 22),
(2, 22, 30),
(2, 23, 25),
(2, 24, 28);
I now want to calculate, for each id, the val that is higher than, for example, 5% or 30% of the val values for that id.
For example, for id=1, we have the following values: 43, 19, 19, 53, 45, 35.
So the contingency table would look like this:
19 35 43 45 53
2 1 1 1 1
and val=20 (the smallest integer higher than 19) would be chosen, because it is higher than at least 5% of the rows (actually 2 out of 6).
The contingency table for id 2 is:
22 25 28 30
1 1 1 1
My expected output is:
id val_5p_coverage val_50p_coverage
1 20 36
2 23 26
val_5p_coverage is the value that val must reach to be above at least 5% of the val values for that id.
val_50p_coverage is the value that val must reach to be above at least 50% of the val values for that id.
How can I calculate this with SQL?

I managed to do it in HiveQL (for Hadoop) as follows:
create table prep as
select *,
CUME_DIST() OVER(PARTITION BY id ORDER BY val ASC) as proportion_val_equal_or_lower
from project;
SELECT id,
MIN(IF(proportion_val_equal_or_lower>=0.05, val, NULL)) AS val_5p_coverage,
MIN(IF(proportion_val_equal_or_lower>=0.50, val, NULL)) AS val_50p_coverage
FROM prep
GROUP BY id;
Although this is neither MySQL nor SQL per se, it might help in doing the same in MySQL or SQL.
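The same idea can be sketched without window functions, which may matter on MySQL versions before 8.0. The example below is only an illustration under stated assumptions: it runs the query through Python's sqlite3 module as a stand-in for MySQL, it replaces CUME_DIST() with correlated COUNT(*) subqueries, and it adds 1 to the smallest qualifying val on the assumption that val is an integer (matching "val=20 (higher than 19)" in the question).

```python
import sqlite3

# Assumption: SQLite is used only so the sketch runs anywhere; the SELECT itself
# uses plain subqueries and CASE, which MySQL also accepts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (id INT, event INT, val INT);
INSERT INTO project (id, event, val) VALUES
  (1, 11, 43), (1, 12, 19), (1, 13, 19), (1, 14, 53), (1, 15, 45), (1, 16, 35),
  (2, 21, 22), (2, 22, 30), (2, 23, 25), (2, 24, 28);
""")

# For each row, "proportion" is the share of rows in the same id whose val is
# lower than or equal to this val (the quantity CUME_DIST() would return).
# The smallest val reaching the threshold, plus 1, is the coverage value.
COVERAGE_SQL = """
SELECT id,
       MIN(CASE WHEN proportion >= 0.05 THEN val END) + 1 AS val_5p_coverage,
       MIN(CASE WHEN proportion >= 0.50 THEN val END) + 1 AS val_50p_coverage
FROM (
    SELECT p.id, p.val,
           (SELECT COUNT(*) FROM project q WHERE q.id = p.id AND q.val <= p.val) * 1.0
           / (SELECT COUNT(*) FROM project q WHERE q.id = p.id) AS proportion
    FROM project p
) t
GROUP BY id
ORDER BY id
"""

coverage_rows = conn.execute(COVERAGE_SQL).fetchall()
print(coverage_rows)  # [(1, 20, 36), (2, 23, 26)]
```

On the sample data this reproduces the expected output table from the question. The correlated subqueries scan the group once per row, so on large tables the window-function form is likely the better choice where it is available.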

Related

MySQL: Join records 1:1 that have a one to many relationship

I have a table that contains properties and their types:
INSERT INTO properties (property_id, year_built, type_id, code)
VALUES
(1, 2000, 3, 'ABC'),
(2, 2001, 3, 'ABC'),
(3, 2002, 3, 'ABC'),
(4, 2003, 3, 'ABC'),
(5, 2004, 3, 'ABC'),
(6, 2005, 3, 'ABC'),
(7, 2000, 3, 'DEF'),
(8, 2001, 3, 'DEF'),
(9, 2002, 3, 'DEF'),
(10, 2003, 3, 'DEF'),
(11, 2004, 3, 'DEF'),
(12, 2005, 3, 'DEF'),
(13, 2000, 3, 'GHI'),
(14, 2001, 3, 'GHI'),
(15, 2002, 3, 'GHI'),
(16, 2003, 3, 'GHI'),
(17, 2004, 3, 'GHI'),
(18, 2005, 3, 'GHI');
I have a second table 'agents' with the same number of records as the properties table.
INSERT INTO agents (agent_id, year_built, type_id)
VALUES
(50, 2000, 3),
(51, 2001, 3),
(52, 2002, 3),
(53, 2003, 3),
(54, 2004, 3),
(55, 2005, 3),
(56, 2000, 3),
(57, 2001, 3),
(58, 2002, 3),
(59, 2003, 3),
(60, 2004, 3),
(61, 2005, 3),
(62, 2000, 3),
(63, 2001, 3),
(64, 2002, 3),
(65, 2003, 3),
(66, 2004, 3),
(67, 2005, 3);
There is a field in the properties table: 'agent_id' that should be populated with a single agent of the same year and type. For example, this is the expected result of the properties table for the year 2000 after running an update statement:
SELECT * FROM properties WHERE year_built = 2000;
property_id year_built type_id code agent_id
1 2000 3 ABC 50
7 2000 3 DEF 56
13 2000 3 GHI 62
Every join that I try results in all matching agent records returned for each property_id. For example:
SELECT properties.*, agents.agent_id
FROM properties
JOIN agents
USING(year_built, type_id)
WHERE properties.year_built = 2000;
Would give the result:
property_id year_built type_id code agent_id
1 2000 3 ABC 50
1 2000 3 ABC 56
1 2000 3 ABC 62
7 2000 3 DEF 50
7 2000 3 DEF 56
7 2000 3 DEF 62
13 2000 3 GHI 50
13 2000 3 GHI 56
13 2000 3 GHI 62
I'm aware that a simple join will return all the agent records, but I'm not sure how to match a single agent record to a single properties record with just the fields I have to work with. In addition, I would want these to be ordered - so that the first property_id for a year/type matches with the first agent_id of the same year/type. I should also add that neither table's fields, keys, or properties can be modified.
As the rows in the properties table can be evenly matched to the rows in the agents table, we can add a row number to each table and use it for precise matching. This was written and tested in Workbench using MySQL 5.7:
select p.property_id,p.year_built,p.type_id,p.code,agent_id from
(select property_id,year_built,type_id,code,@row_id:=@row_id+1 as rowid
from properties,(select @row_id:=0) t ) p
join
(select agent_id,year_built,type_id,@row_number:=@row_number+1 as rownumber
from agents,(select @row_number:=0) t ) a
on p.year_built=a.year_built and p.type_id=a.type_id and p.rowid=a.rownumber
where p.year_built=2000
;
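A variation on the row-number idea is to number the rows within each (year_built, type_id) group, so that the pairing does not depend on both tables happening to be read in the same order. The sketch below is an assumption-laden illustration, not the answer's method: it uses Python's sqlite3 module as a stand-in for MySQL and computes the per-group position with a correlated COUNT(*) instead of user variables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE properties (property_id INT, year_built INT, type_id INT, code TEXT);
CREATE TABLE agents (agent_id INT, year_built INT, type_id INT);
""")

# Rebuild the question's sample rows: ids 1..18 and 50..67, years 2000..2005.
property_rows = [(i * 6 + j + 1, 2000 + j, 3, code)
                 for i, code in enumerate(["ABC", "DEF", "GHI"])
                 for j in range(6)]
agent_rows = [(50 + k, 2000 + k % 6, 3) for k in range(18)]
conn.executemany("INSERT INTO properties VALUES (?, ?, ?, ?)", property_rows)
conn.executemany("INSERT INTO agents VALUES (?, ?, ?)", agent_rows)

# rn is the position of a row inside its (year_built, type_id) group,
# ordered by its own id; equal positions are paired 1:1.
PAIR_SQL = """
SELECT p.property_id, p.year_built, p.type_id, p.code, a.agent_id
FROM (
    SELECT p1.*,
           (SELECT COUNT(*) FROM properties p2
            WHERE p2.year_built = p1.year_built AND p2.type_id = p1.type_id
              AND p2.property_id <= p1.property_id) AS rn
    FROM properties p1
) p
JOIN (
    SELECT a1.*,
           (SELECT COUNT(*) FROM agents a2
            WHERE a2.year_built = a1.year_built AND a2.type_id = a1.type_id
              AND a2.agent_id <= a1.agent_id) AS rn
    FROM agents a1
) a ON a.year_built = p.year_built AND a.type_id = p.type_id AND a.rn = p.rn
WHERE p.year_built = 2000
ORDER BY p.property_id
"""

matched_rows = conn.execute(PAIR_SQL).fetchall()
for row in matched_rows:
    print(row)
```

For the year 2000 this pairs property 1 with agent 50, 7 with 56, and 13 with 62, which is the result the question asks for.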

Calculate the number of jobs reviewed per hour per day for November 2020?

This question really confuses me. It was given without enough detail; everything that was provided is written below.
job_id: unique identifier of jobs
actor_id: unique identifier of actor
event: decision/skip/transfer
language: language of the content
time_spent: time spent to review the job in seconds
org: organization of the actor
ds: date in the yyyy/mm/dd format. It is stored as text and we run the queries with Presto, so no date function is needed.
CREATE TABLE job_data
(
ds DATE,
job_id INT NOT NULL,
actor_id INT NOT NULL,
event VARCHAR(15) NOT NULL,
language VARCHAR(15) NOT NULL,
time_spent INT NOT NULL,
org CHAR(2)
);
INSERT INTO job_data (ds, job_id, actor_id, event, language, time_spent, org)
VALUES ('2020-11-30', 21, 1001, 'skip', 'English', 15, 'A'),
('2020-11-30', 22, 1006, 'transfer', 'Arabic', 25, 'B'),
('2020-11-29', 23, 1003, 'decision', 'Persian', 20, 'C'),
('2020-11-28', 23, 1005,'transfer', 'Persian', 22, 'D'),
('2020-11-28', 25, 1002, 'decision', 'Hindi', 11, 'B'),
('2020-11-27', 11, 1007, 'decision', 'French', 104, 'D'),
('2020-11-26', 23, 1004, 'skip', 'Persian', 56, 'A'),
('2020-11-25', 20, 1003, 'transfer', 'Italian', 45, 'C');
Above is the data. Points to consider:
What does the event mean? What to consider for reviewing?
And here's the query I've tried:
SELECT ds, COUNT(*)/24 AS no_of_job
FROM job_data
WHERE ds BETWEEN '2020-11-01' AND '2020-11-30'
GROUP BY ds;
Check whether the approach below is what you are looking for.
select ds,count(job_id) as jobs_per_day, sum(time_spent)/3600 as hours_spent
from job_data
where ds >='2020-11-01' and ds <='2020-11-30'
group by ds ;
Demo MySQL 5.6: https://www.db-fiddle.com/f/7yUJcuMJPncBBnrExKbzYz/26
Demo MySQL 8.0.26: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=83e89a2ad2a7e73b7ca990ac36ae4df0
The difference between the demos, as @FaNo_FN pointed out in the comments, is that MySQL 8.0.26 raises an error if the date 2020-11-31 is used, because there is no 31 November.
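You can also write the range with explicit >= and <= conditions instead of BETWEEN; the two forms are equivalent, and the explicit bounds make the inclusive range easier to read.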
You need to sum the time_spent for the day.
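To see what the per-day aggregation produces on the sample rows, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL. The jobs_per_hour column is an assumption: it reads "jobs reviewed per hour" as the job count divided by the hours spent that day, which is only one possible interpretation of the task.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ds is stored as text here, as the question describes.
conn.executescript("""
CREATE TABLE job_data (ds TEXT, job_id INT, actor_id INT, event TEXT,
                       language TEXT, time_spent INT, org TEXT);
INSERT INTO job_data (ds, job_id, actor_id, event, language, time_spent, org) VALUES
  ('2020-11-30', 21, 1001, 'skip', 'English', 15, 'A'),
  ('2020-11-30', 22, 1006, 'transfer', 'Arabic', 25, 'B'),
  ('2020-11-29', 23, 1003, 'decision', 'Persian', 20, 'C'),
  ('2020-11-28', 23, 1005, 'transfer', 'Persian', 22, 'D'),
  ('2020-11-28', 25, 1002, 'decision', 'Hindi', 11, 'B'),
  ('2020-11-27', 11, 1007, 'decision', 'French', 104, 'D'),
  ('2020-11-26', 23, 1004, 'skip', 'Persian', 56, 'A'),
  ('2020-11-25', 20, 1003, 'transfer', 'Italian', 45, 'C');
""")

# 3600.0 forces floating-point division so short review times are not rounded to 0.
DAILY_SQL = """
SELECT ds,
       COUNT(job_id) AS jobs_per_day,
       SUM(time_spent) / 3600.0 AS hours_spent,
       COUNT(job_id) * 3600.0 / SUM(time_spent) AS jobs_per_hour
FROM job_data
WHERE ds >= '2020-11-01' AND ds <= '2020-11-30'
GROUP BY ds
ORDER BY ds
"""

daily_rows = conn.execute(DAILY_SQL).fetchall()
for row in daily_rows:
    print(row)
```

On 2020-11-30, for example, two jobs took 40 seconds in total, which under this reading corresponds to a rate of 180 jobs per hour.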

Recursive query with optional depth limit with MySQL 5.6

I have two table schemas (MySQL 5.6 so no CTE), roughly looking like this:
CREATE TABLE nodes (
node_id INT PRIMARY KEY,
name VARCHAR(10)
);
CREATE TABLE edges (
edge_id INT PRIMARY KEY,
source INT,
target INT,
FOREIGN KEY (source) REFERENCES nodes(node_id),
FOREIGN KEY (target) REFERENCES nodes(node_id)
);
In our design, a logical edge between two nodes (logically n1 -> n2) is actually represented as (n1 -> proxy node -> n2) in the db. The reason we use two edges and a proxy node for a logical edge is so that we can store properties on the edge. Therefore, when a client queries for two nodes connected by an edge, the query is translated to query three connected nodes instead.
I have written a query to get a path with a fixed length. For example, "give me all the paths that start with a node with some properties, and end with a node with some properties, with exactly 5 edges on the path." This is done without using recursion on the SQL side; I just generate a long query programmatically with the specified fixed length.
The challenge is, we want to support querying of a variable-length path. For example, "give me all the paths that start with a node with some properties, and end with a node with some properties, with no fewer than 3 edges and no more than 10 edges on the path." Is this feasible without (or even with) CTE?
EDIT:
Some sample data:
-- Logical nodes are 1, 3, 5, 7, 9, 11. The rest are proxy nodes.
INSERT INTO nodes VALUES
(1, 'foo'),
(2, '_proxy_'),
(3, 'foo'),
(4, '_proxy_'),
(5, 'bar'),
(6, '_proxy_'),
(7, 'bar'),
(8, '_proxy_'),
(9, 'bar'),
(10, '_proxy_'),
(11, 'bar');
-- Connects 1 -> 2 -> ... -> 11.
INSERT INTO edges VALUES
(1, 1, 2),
(2, 2, 3),
(3, 3, 4),
(4, 4, 5),
(5, 5, 6),
(6, 6, 7),
(7, 7, 8),
(8, 8, 9),
(9, 9, 10),
(10, 10, 11);
The query can be, "select the ID and names of all the nodes on a path such that the path starts with a node named 'foo' and ends with a node named 'bar', with at least 2 nodes and at most 4 nodes on the path." Such paths include 1 -> 3 -> 5, 1 -> 3 -> 5 -> 7, 3 -> 5, 3 -> 5 -> 7, and 3 -> 5 -> 7 -> 9. So the result set should include the IDs and names of nodes 1, 3, 5, 7, 9.
The following query returns all paths of interest as comma-separated strings.
with recursive rcte as (
select e.source, e.target, 1 as depth, concat(e.source) as path
from nodes n
join edges e on e.source = n.node_id
where n.name = 'foo' -- start node name
union all
select e.source, e.target, r.depth + 1 as depth, concat_ws(',', r.path, e.source)
from rcte r
join edges p on p.source = r.target -- p for proxy
join edges e on e.source = p.target
where r.depth < 4 -- max path nodes
)
select r.path
from rcte r
join nodes n on n.node_id = r.source
where r.depth >= 2 -- min path nodes
and n.name = 'bar' -- end node name
The result looks like this:
| path |
| ------- |
| 3,5 |
| 1,3,5 |
| 3,5,7 |
| 1,3,5,7 |
| 3,5,7,9 |
You can now parse the strings in application code and merge/union the arrays. If you only want the contained node ids, you can also change the outer query to:
select distinct r2.source
from rcte r
join nodes n on n.node_id = r.source
join rcte r2 on find_in_set(r2.source, r.path)
where r.depth >= 2 -- min path nodes
and n.name = 'bar' -- end node name
Result:
| source |
| ------ |
| 1 |
| 3 |
| 5 |
| 7 |
| 9 |
Note that a JOIN on FIND_IN_SET() might be slow if rcte contains too many rows. I would rather do this step in application code, which should be quite simple in a procedural language.
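As a minimal sketch of that application-side step, the following Python splits each comma-separated path and unions the node ids. The function name merge_paths is made up for this illustration.

```python
def merge_paths(paths):
    """Union the node ids found in comma-separated path strings, sorted ascending."""
    node_ids = set()
    for path in paths:
        node_ids.update(int(part) for part in path.split(","))
    return sorted(node_ids)


# The path strings returned by the recursive query for the sample data.
paths = ["3,5", "1,3,5", "3,5,7", "1,3,5,7", "3,5,7,9"]
print(merge_paths(paths))  # [1, 3, 5, 7, 9]
```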
MySQL 5.6 solution
Prior to MySQL 8.0 and MariaDB 10.2, there was no support for recursive queries. Furthermore, there are many other limitations that make a workaround difficult. For example:
No dynamic queries in stored functions
No way to use a temporary table twice in a single statement
No TEXT type in the MEMORY engine
However, an RCTE can be emulated in a stored procedure by moving rows between two (temporary) tables. The following procedure does that:
delimiter //
create procedure get_path(
in source_name text,
in target_name text,
in min_depth int,
in max_depth int
)
begin
create temporary table tmp_sources (id int, depth int, path text) engine=innodb;
create temporary table tmp_targets like tmp_sources;
insert into tmp_sources (id, depth, path)
select n.node_id, 1, n.node_id
from nodes n
where n.name = source_name;
set @depth = 1;
while @depth < max_depth do
set @depth = @depth+1;
insert into tmp_targets(id, depth, path)
select e.target, @depth, concat_ws(',', t.path, e.target)
from tmp_sources t
join edges p on p.source = t.id
join edges e on e.source = p.target
where t.depth = @depth - 1;
insert into tmp_sources (id, depth, path)
select id, depth, path
from tmp_targets;
truncate tmp_targets;
end while;
select t.path
from tmp_sources t
join nodes n on n.node_id = t.id
where n.name = target_name
and t.depth >= min_depth;
end //
delimiter ;
Use it as:
call get_path('foo', 'bar', 2, 4)
Result:
| path |
| ------- |
| 3,5 |
| 1,3,5 |
| 3,5,7 |
| 1,3,5,7 |
| 3,5,7,9 |
This is far from optimal. If the result has many or long paths, you might need to define some indexes on the temporary tables. I also don't like the idea of creating (temporary) tables in stored procedures. See it as a "proof of concept" and use it at your own risk.
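The level-by-level expansion that the procedure performs can also be followed in a few lines of Python. This is only a sketch of the same idea on the sample data: the in-memory nodes and edges structures are assumptions that stand in for the two tables, and the function mirrors the procedure's parameters.

```python
# Sample data from the question: node id -> name, and (source, target) edges.
nodes = {1: "foo", 2: "_proxy_", 3: "foo", 4: "_proxy_", 5: "bar", 6: "_proxy_",
         7: "bar", 8: "_proxy_", 9: "bar", 10: "_proxy_", 11: "bar"}
edges = [(n, n + 1) for n in range(1, 11)]


def get_path(source_name, target_name, min_depth, max_depth):
    """Return comma-separated logical paths, expanding one logical edge per round."""
    targets_of = {}
    for source, target in edges:
        targets_of.setdefault(source, []).append(target)

    # Depth 1: every start node is a path of a single node.
    frontier = [(nid, str(nid)) for nid, name in nodes.items() if name == source_name]
    found = [(nid, 1, path) for nid, path in frontier]

    depth = 1
    while depth < max_depth:
        depth += 1
        next_frontier = []
        for nid, path in frontier:
            # Two physical hops (node -> proxy -> node) make one logical edge.
            for proxy in targets_of.get(nid, []):
                for target in targets_of.get(proxy, []):
                    next_frontier.append((target, f"{path},{target}"))
        found.extend((nid, depth, path) for nid, path in next_frontier)
        frontier = next_frontier

    return [path for nid, d, path in found
            if nodes[nid] == target_name and d >= min_depth]


print(get_path("foo", "bar", 2, 4))
```

Called with ('foo', 'bar', 2, 4), it yields the same five paths as the procedure: 3,5 / 1,3,5 / 3,5,7 / 1,3,5,7 / 3,5,7,9.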
I've solved this sort of problem with a transitive closure table. This enumerates every direct and indirect path through your nodes. The edges you currently have are paths of length 1. But you also need paths of length 0 (i.e., a node has a path to itself), and then every path from one source node to an eventual target node, for paths with length greater than 1.
create table closure (
source int,
target int,
length int,
is_direct bool,
primary key (source, target)
);
insert into closure values
(1, 1, 0, false), (1, 2, 1, true), (1, 3, 2, false), (1, 4, 3, false), (1, 5, 4, false), (1, 6, 5, false), (1, 7, 6, false), (1, 8, 7, false), (1, 9, 8, false), (1, 10, 9, false), (1, 11, 10, false),
(2, 2, 0, false), (2, 3, 1, true), (2, 4, 2, false), (2, 5, 3, false), (2, 6, 4, false), (2, 7, 5, false), (2, 8, 6, false), (2, 9, 7, false), (2, 10, 8, false), (2, 11, 9, false),
(3, 3, 0, false), (3, 4, 1, true), (3, 5, 2, false), (3, 6, 3, false), (3, 7, 4, false), (3, 8, 5, false), (3, 9, 6, false), (3, 10, 7, false), (3, 11, 8, false),
(4, 4, 0, false), (4, 5, 1, true), (4, 6, 2, false), (4, 7, 3, false), (4, 8, 4, false), (4, 9, 5, false), (4, 10, 6, false), (4, 11, 7, false),
(5, 5, 0, false), (5, 6, 1, true), (5, 7, 2, false), (5, 8, 3, false), (5, 9, 4, false), (5, 10, 5, false), (5, 11, 6, false),
(6, 6, 0, false), (6, 7, 1, true), (6, 8, 2, false), (6, 9, 3, false), (6, 10, 4, false), (6, 11, 5, false),
(7, 7, 0, false), (7, 8, 1, true), (7, 9, 2, false), (7, 10, 3, false), (7, 11, 4, false),
(8, 8, 0, false), (8, 9, 1, true), (8, 10, 2, false), (8, 11, 3, false),
(9, 9, 0, false), (9, 10, 1, true), (9, 11, 2, false),
(10, 10, 0, false), (10, 11, 1, true),
(11, 11, 0, false);
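Writing these rows by hand does not scale, so the closure would normally be generated from the edges. The Python sketch below shows one way to do that with a breadth-first search from every node. The function name build_closure and the in-memory edge list are assumptions for illustration, and the sketch stores one row per (source, target) pair holding the shortest length, in line with the table's primary key.

```python
from collections import deque


def build_closure(node_ids, edges):
    """Return {(source, target): (length, is_direct)} for every reachable pair."""
    targets_of = {}
    for source, target in edges:
        targets_of.setdefault(source, []).append(target)

    closure = {}
    for start in node_ids:
        closure[(start, start)] = (0, False)  # length-0 path from a node to itself
        queue = deque([(start, 0)])
        while queue:
            node, length = queue.popleft()
            for nxt in targets_of.get(node, []):
                if (start, nxt) not in closure:
                    # is_direct is true only for the original edges (length 1)
                    closure[(start, nxt)] = (length + 1, length + 1 == 1)
                    queue.append((nxt, length + 1))
    return closure


# The sample chain 1 -> 2 -> ... -> 11 from the question.
edges = [(n, n + 1) for n in range(1, 11)]
closure = build_closure(range(1, 12), edges)
print(len(closure))      # 66
print(closure[(1, 11)])  # (10, False)
```

For the sample chain this produces the 66 rows listed above. In a live system the closure table also has to be maintained whenever an edge is added or removed.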
Now we can write your query:
select the ID and names of all the nodes on a path such that the path starts with a node named 'foo' and ends with a node named 'bar', with at least 2 nodes and at most 4 nodes on the path.
I translate this into paths of length 4,6,8 because you have a proxy node in between each, so it really takes two hops to go between nodes.
select source.node_id as source_node, target.node_id as target_node, c.length
from nodes as source
join closure as c on source.node_id = c.source
join nodes as target on c.target = target.node_id
where source.name='foo' and target.name = 'bar' and c.length in (4,6,8)
Here's the result, which in fact also includes node 11:
+-------------+-------------+--------+
| source_node | target_node | length |
+-------------+-------------+--------+
| 1 | 5 | 4 |
| 1 | 7 | 6 |
| 1 | 9 | 8 |
| 3 | 7 | 4 |
| 3 | 9 | 6 |
| 3 | 11 | 8 |
+-------------+-------------+--------+
Re comment from Paul Spiegel:
Once you have the endpoints of the path, you can query the closure for all paths that start at the source, and end at a node that also has a path to the target.
select source.node_id as source_node, target.node_id as target_node,
group_concat(i1.target order by i1.target) as interim_nodes
from nodes as source
join closure as c on source.node_id = c.source
join nodes as target on c.target = target.node_id
join closure as i1 on source.node_id = i1.source
join closure as i2 on target.node_id = i2.target and i1.target = i2.source
where source.name='foo' and target.name = 'bar' and c.length in (4,6,8)
group by source.node_id, target.node_id
+-------------+-------------+---------------------+
| source_node | target_node | interim_nodes |
+-------------+-------------+---------------------+
| 1 | 5 | 1,2,3,4,5 |
| 1 | 7 | 1,2,3,4,5,6,7 |
| 1 | 9 | 1,2,3,4,5,6,7,8,9 |
| 3 | 7 | 3,4,5,6,7 |
| 3 | 9 | 3,4,5,6,7,8,9 |
| 3 | 11 | 3,4,5,6,7,8,9,10,11 |
+-------------+-------------+---------------------+

Mysql update count of records

I have a table 'product_records' as follows:
id - int (6),
product_group - int (7),
product_subgroup - int (7),
type - int(3),
count_of_reports - int (6)
At the moment the values in column count_of_reports for all records are 0.
What is the most efficient way of adding count_of_reports for every row for matching product_group, product_subgroup and type?
Example:
1, 23, 1, 1, count here (i.e. 2);
2, 23, 2, 1, count here (i.e. 1);
3, 23, 1, 1, count here (i.e. 2);
4, 24, 1, 3, count here (i.e. 1);
Thank you in advance.
You can use the update statement below, which counts the rows in each (product_group, product_subgroup, type) combination:
UPDATE product_records PR
JOIN (SELECT product_group, product_subgroup, type, COUNT(*) NUM
FROM product_records
GROUP BY product_group, product_subgroup, type) PR2
ON PR.product_group = PR2.product_group
AND PR.product_subgroup = PR2.product_subgroup
AND PR.type = PR2.type
SET PR.count_of_reports = PR2.NUM
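As a quick cross-check of the counts the question expects, the grouping can be reproduced in plain Python on the four example rows. This is only an illustration of the intended result, with the rows typed in by hand as (id, product_group, product_subgroup, type).

```python
from collections import Counter

# Example rows from the question: (id, product_group, product_subgroup, type).
rows = [(1, 23, 1, 1), (2, 23, 2, 1), (3, 23, 1, 1), (4, 24, 1, 3)]

# Count how many rows share each (product_group, product_subgroup, type) key.
group_sizes = Counter((group, subgroup, type_) for _id, group, subgroup, type_ in rows)

# Every row receives the size of the group it belongs to.
counts_by_id = {row_id: group_sizes[(group, subgroup, type_)]
                for row_id, group, subgroup, type_ in rows}
print(counts_by_id)  # {1: 2, 2: 1, 3: 2, 4: 1}
```

Grouping on the three columns separately, rather than on a concatenated string, also avoids collisions such as group 23 with subgroup 1 versus group 2 with subgroup 31.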

How to compute column by specific value?

How can I query a result like the one in this picture?
The first column selected is the current month, plus the following columns as determined by the condition field (SelectColumn).
The cells with a yellow background are the columns selected for the sum.
My example code:
declare @myDate date = getdate(),@qry varchar(max)
set @qry = 'select case v.SelectColumn
when 0 then (SELECT '+DATENAME(month,@myDate)+')
when 2 then (SELECT '+DATENAME(month,@myDate)+'+'+DATENAME(MONTH,DATEADD(MONTH,1,@myDate))+')
when 1 then (SELECT '+DATENAME(month,@myDate)+'+'+DATENAME(MONTH,DATEADD(MONTH,1,@myDate))+'+'+DATENAME(MONTH,DATEADD(MONTH,2,@myDate))+')
end
as SumColumn
from vwQC12Month v'
exec (@qry)
This problem requires the monthly values to be accessible as rows such that selected aggregation can be applied. Typically, you can use UNPIVOT or VALUES to rotate columns into rows. Below is a working example using UNPIVOT.
In the query below, you'd have to change the short form month names to match your column names for it to work. Here is a working demo of this: http://sqlfiddle.com/#!18/094e4/2
Also you might want to consider how you want to deal with year end wrapping and adjust the aggregation accordingly.
-- setup sample data
create table Test (
id int, Jan int, Feb int, Mar int, Apr int, May int,
Jun int, Jul int, Aug int, Sep int, Oct int, Nov int, Dec int,
SelectColumn int)
insert Test values
(1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 1),
(2, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 2),
(3, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 3),
(4, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 4)
-- query
DECLARE @currentMonthNumber int = MONTH(getdate())
-- CTE that rotates the data into rows
; WITH RowsByMonth(id, SelectColumn, monthNumber, val) AS
(
SELECT id, SelectColumn, CONVERT(int, month) AS monthNumber, val
FROM
(SELECT id, SelectColumn,
Jan AS [1], Feb AS [2], Mar AS [3], Apr AS [4],
May AS [5], Jun AS [6], Jul AS [7], Aug AS [8],
Sep AS [9], Oct AS [10], Nov AS [11], Dec AS [12]
FROM Test) AS Source
UNPIVOT
(val FOR month IN
([1], [2], [3], [4],
[5], [6], [7], [8],
[9], [10], [11], [12])
) AS asRows
) -- aggregation below
SELECT id, SUM(val) AS SumAcrossSelectedMonths
FROM RowsByMonth
WHERE
monthNumber >= @currentMonthNumber
AND monthNumber - @currentMonthNumber < SelectColumn
GROUP BY id
-- Results
| id | SumAcrossSelectedMonths |
|----|-------------------------|
| 1 | 40 |
| 2 | 90 |
| 3 | 150 |
| 4 | 220 |
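The filter in the aggregation picks the current month and the next SelectColumn - 1 months, stopping at December. The Python sketch below restates that rule so it can be checked by hand. The function name and the wrap parameter are assumptions added for illustration; wrap=True shows one possible way to handle the year-end case mentioned above and is not part of the query.

```python
def sum_selected_months(monthly_values, current_month, select_column, wrap=False):
    """Sum select_column months starting at current_month (1 = Jan).

    monthly_values holds 12 numbers, January first. Without wrap, months past
    December are ignored, as in the query; with wrap, they continue at January.
    """
    total = 0
    for offset in range(select_column):
        index = current_month - 1 + offset
        if index >= 12 and not wrap:
            break
        total += monthly_values[index % 12]
    return total


values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]

# With April as the current month, SelectColumn 1..4 gives the sums shown above.
print([sum_selected_months(values, 4, k) for k in (1, 2, 3, 4)])  # [40, 90, 150, 220]

# In December, three months without wrap is just Dec; with wrap it is Dec + Jan + Feb.
print(sum_selected_months(values, 12, 3), sum_selected_months(values, 12, 3, wrap=True))  # 120 150
```

The results table above matches a run in April, since 40 is the April value in the sample data.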