Outliers of data by groups - mysql

I want to analyse outliers of grouped data. Let's say I have this data:
+--------+---------+-------+
| fruit | country | price |
+--------+---------+-------+
| apple | UK | 1 |
| apple | USA | 3 |
| apple | LT | 2 |
| apple | LV | 5 |
| apple | EE | 4 |
| pear | SW | 6 |
| pear | NO | 2 |
| pear | FI | 3 |
| pear | PL | 7 |
+--------+---------+-------+
Let's take pears. If my method of finding outliers is to take the highest 25% and the lowest 25% of pear prices, the outliers for pears would be:
+--------+---------+-------+
| pear | NO | 2 |
| pear | PL | 7 |
+--------+---------+-------+
As for apples:
+--------+---------+-------+
| apple | UK | 1 |
| apple | LV | 5 |
+--------+---------+-------+
What I want is to create a view that shows the union of the outliers of all fruits. With this view I could analyse only the tails, or exclude the view's rows from the main table to get a table without outliers - that's my goal. A solution would be something like:
(SELECT * FROM fruits f WHERE f.fruit = 'pear' ORDER BY f.price ASC
LIMIT (SELECT ROUND(COUNT(*) * 0.25,0)
FROM fruits f2
WHERE f2.fruit = 'pear')
)
union all
(SELECT * FROM fruits f WHERE f.fruit = 'pear' ORDER BY f.price DESC
LIMIT (SELECT ROUND(COUNT(*) * 0.25,0)
FROM fruits f2
WHERE f2.fruit = 'pear')
)
union all
(SELECT * FROM fruits f WHERE f.fruit = 'apple' ORDER BY f.price ASC
LIMIT (SELECT ROUND(COUNT(*) * 0.25,0)
FROM fruits f2
WHERE f2.fruit = 'apple')
)
union all
(SELECT * FROM fruits f WHERE f.fruit = 'apple' ORDER BY f.price DESC
LIMIT (SELECT ROUND(COUNT(*) * 0.25,0)
FROM fruits f2
WHERE f2.fruit = 'apple')
)
This would give me the table I want, but the code after LIMIT doesn't seem to be valid... Another problem is the number of groups. In this example there are only two groups (pears, apples), but in my actual data there are around 100 groups. So the 'union all' should somehow go through all unique fruits automatically, without code written for each fruit: find the number of outliers for each unique fruit, take only that number of rows, and show it all in another table (view).

You can't supply LIMIT with a value from a subquery, in any RDBMS I'm aware of. Some dbs don't even allow host variables/parameters in their versions of the clause (I'm thinking of iSeries DB2).
This is essentially a greatest-n-per-group problem. Similar queries in most other RDBMSs are solved with what are called windowing functions - essentially, you're looking at a movable selection of data.
MySQL doesn't have this functionality (window functions only arrived in MySQL 8.0), so we have to counterfeit it. The actual mechanics of the query will depend on the actual data you need, so I can only speak to what you're attempting here. The techniques should be generally adaptable, but may require rather more creativity than otherwise.
To start with, you want a function that will return a number indicating a row's position - I'm assuming duplicate prices should be given the same rank (ties), and that doing so won't create a gap in the numbering. This is essentially the DENSE_RANK() windowing function. We can get these results by doing the following:
SELECT fruit, country, price,
@Rnk := IF(@last_fruit <> fruit, 1,
IF(@last_price = price, @Rnk, @Rnk + 1)) AS Rnk,
@last_fruit := fruit,
@last_price := price
FROM Fruits
JOIN (SELECT @Rnk := 0) n
ORDER BY fruit, price
Example Fiddle
... Which generates the following for the 'apple' group:
fruit country price rank
=============================
apple UK 1 1
apple LT 2 2
apple USA 3 3
apple EE 4 4
apple LV 5 5
Now, you're trying to get the top/bottom 25% of rows. In this case, you need a count of distinct prices:
SELECT fruit, COUNT(DISTINCT price)
FROM Fruits
GROUP BY fruit
... And now we just need to join this to the previous statement to limit the top/bottom:
SELECT RankedFruit.fruit, RankedFruit.country, RankedFruit.price
FROM (SELECT fruit, COUNT(DISTINCT price) AS priceCount
FROM Fruits
GROUP BY fruit) CountedFruit
JOIN (SELECT fruit, country, price,
@Rnk := IF(@last_fruit <> fruit, 1,
IF(@last_price = price, @Rnk, @Rnk + 1)) AS rnk,
@last_fruit := fruit,
@last_price := price
FROM Fruits
JOIN (SELECT @Rnk := 0) n
ORDER BY fruit, price) RankedFruit
ON RankedFruit.fruit = CountedFruit.fruit
AND (RankedFruit.rnk > ROUND(CountedFruit.priceCount * .75)
OR RankedFruit.rnk <= ROUND(CountedFruit.priceCount * .25))
SQL Fiddle Example
...which yields the following:
fruit country price
=======================
apple UK 1
apple LV 5
pear NN 2
pear NO 2
pear PL 7
(I duplicated a pear row to show "tied" prices.)
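On MySQL 8.0 or later, window functions remove the need for the variable trick. The sketch below is only an illustration, not the method used above: it runs the query through Python's built-in sqlite3 module (assuming the bundled SQLite is 3.25 or newer, which also has window functions) so it can be tried without a MySQL server. The table name and sample rows come from the question; the CTE names `ranked` and `counted` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (fruit TEXT, country TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?, ?)",
    [
        ("apple", "UK", 1), ("apple", "USA", 3), ("apple", "LT", 2),
        ("apple", "LV", 5), ("apple", "EE", 4),
        ("pear", "SW", 6), ("pear", "NO", 2), ("pear", "FI", 3), ("pear", "PL", 7),
    ],
)

QUERY = """
WITH ranked AS (
    -- dense rank of each price inside its fruit group (ties share a rank)
    SELECT fruit, country, price,
           DENSE_RANK() OVER (PARTITION BY fruit ORDER BY price) AS rnk
    FROM fruits
),
counted AS (
    -- number of distinct prices per fruit, i.e. the highest dense rank
    SELECT fruit, COUNT(DISTINCT price) AS price_count
    FROM fruits
    GROUP BY fruit
)
SELECT r.fruit, r.country, r.price
FROM ranked r
JOIN counted c ON c.fruit = r.fruit
WHERE r.rnk <= ROUND(c.price_count * 0.25)   -- bottom 25% of ranks
   OR r.rnk >  ROUND(c.price_count * 0.75)   -- top 25% of ranks
ORDER BY r.fruit, r.price, r.country
"""

outliers = conn.execute(QUERY).fetchall()
for row in outliers:
    print(row)
```

Because every group is handled by the PARTITION BY clause, the query does not need to change when there are 100 fruits instead of two.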

Does ROUND not need 2 or 3 arguments? I.e. do you not need to specify the decimal place you wish to round to?
So
...
LIMIT (SELECT ROUND(COUNT(*) * 0.25)
FROM #fruits f2
WHERE f2.fruit = 'apple')
becomes
...
LIMIT (SELECT ROUND(COUNT(*) * 0.25,2)
FROM #fruits f2
WHERE f2.fruit = 'apple')
Also, I only had a quick look over lunch, but it looks like you're just expecting the min / max values. Could you not just use those functions instead?

Related

SQL Query sort and update by row number (SQLFiddle example)

I have a sports database where I want to sort the data by a custom field ('Rating') and update the field ('Ranking') with the row number.
I have tried the following code to sort the data by my custom field 'Rating'. It works when I sort it by a normal field, but not with a custom/calculated field. When the sorting has been done, I want it to update the field 'Ranking' with the row number.
I.e. the fighter with the highest 'Rating' should have the value '1' as 'Ranking'.
SELECT id,lastname, wins, Round(((avg(indrating)*13) + (avg(Fightrating)*5) * 20) / 2,2) as Rating,
ROW_NUMBER() OVER (ORDER BY 'Rating' DESC) AS num
from fighters
JOIN fights ON fights.fighter1 = fighters.id
GROUP BY id
The code above isn't sorting the Rating accurately. It sorts by row number, but the highest Rating isn't rated as #1. It seems a bit random.
SQL Fiddle: http://sqlfiddle.com/#!9/aa1fca/1 (This example is correctly sorted, but I want it to update the "Ranking" column by row number - meaning the highest rated fighter (by 'Rating') gets '1' in the Ranking column, the second highest rated fighter gets '2' in the Ranking column, etc.).
Also I would like to be able to add WHERE clause in the fighters table (where fighters.organization = 'UFC') for example.
First, let's fix your query so it runs on MySQL < 8.0. This requires doing the computing and sorting in a subquery, then using a variable to compute the rank:
select
id,
rating,
@rnk := @rnk + 1 ranking
from
(select @rnk := 0) r
cross join (
select
fighter1 id,
round(((avg(indrating)*13) + (avg(fightrating)*5) * 20) / 2,2) as rating
from fights
group by fighter1
order by rating desc
) x
Now we use the update ... join ... set ... syntax to update the fighters table:
update fighters f
inner join (
select
id,
rating,
@rnk := @rnk + 1 ranking
from
(select @rnk := 0) r
cross join (
select
fighter1 id,
round(((avg(indrating)*13) + (avg(fightrating)*5) * 20) / 2,2) as rating
from fights
group by fighter1
order by rating desc
) x
) y on y.id = f.id
set f.ranking = y.ranking;
Demo in a MySQL 5.6 fiddle based on the fiddle you provided in the comments.
The select query returns:
| id | rating | ranking |
| --- | ------ | ------- |
| 3 | 219.5 | 1 |
| 4 | 213 | 2 |
| 1 | 169.5 | 3 |
| 2 | 156.5 | 4 |
And here is the content of the fighters table after the update:
| id | lastname | ranking |
| --- | ---------- | ------- |
| 1 | Gustafsson | 3 |
| 2 | Cyborg | 4 |
| 3 | Jones | 1 |
| 4 | Sonnen | 2 |
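For comparison, here is a minimal sketch of the same idea written with ROW_NUMBER(), which is available from MySQL 8.0 on. It is run through Python's built-in sqlite3 module (assuming SQLite 3.25 or newer) so it can be tried locally, and it uses a correlated subquery in the UPDATE because that form works in SQLite; on MySQL the update ... join ... set form shown above would stay. The fight rows are invented sample values chosen so the ratings match the table above; they are not taken from the fiddle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fighters (id INTEGER PRIMARY KEY, lastname TEXT, ranking INTEGER);
CREATE TABLE fights (fighter1 INTEGER, indrating INTEGER, fightrating INTEGER);
INSERT INTO fighters (id, lastname) VALUES
    (1, 'Gustafsson'), (2, 'Cyborg'), (3, 'Jones'), (4, 'Sonnen');
-- hypothetical ratings, one fight per fighter
INSERT INTO fights VALUES (1, 3, 3), (2, 1, 3), (3, 3, 4), (4, 2, 4);
""")

UPDATE_RANKING = """
WITH rated AS (
    -- the rating has to be computed first; it cannot be referenced
    -- as a quoted string such as ORDER BY 'Rating'
    SELECT fighter1 AS id,
           ROUND((AVG(indrating) * 13 + AVG(fightrating) * 5 * 20) / 2, 2) AS rating
    FROM fights
    GROUP BY fighter1
),
ranked AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY rating DESC) AS ranking
    FROM rated
)
UPDATE fighters
SET ranking = (SELECT ranked.ranking FROM ranked WHERE ranked.id = fighters.id)
"""

conn.execute(UPDATE_RANKING)
rows = conn.execute(
    "SELECT id, lastname, ranking FROM fighters ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

A filter such as organization = 'UFC' would go into the `rated` step (joined to fighters), so that only the remaining fighters are numbered.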

How to use variable from another table for UNION

I have two SQL commands I want to combine. I have changed the variables I am actually using in an attempt to make it simpler to explain.
I would like to get the names of all fruits and vegetables whose color is a favoriteColor of anyone whose age is equal to the given value.
Currently I have these queries split up and I get the favorite color of people with SELECT favoriteColor FROM people WHERE age = ? and then I get all the fruits and vegetables where the color matches the favoriteColor of each person.
I get the matching fruits and vegetables like this:
SELECT * FROM ((SELECT 1 as type, name FROM fruits WHERE color = ?)
UNION ALL
(SELECT 2 as type, name FROM vegetables WHERE color = ?)) results
I basically want something like this, but I haven't been able to get it to work and I also do not want to have to run the same SELECT query twice:
SELECT * FROM ((SELECT 1 as type, name FROM fruits WHERE color =
(SELECT favoriteColor FROM people WHERE age = ?))
UNION ALL
(SELECT 2 as type, name FROM vegetables WHERE color =
(SELECT favoriteColor FROM people WHERE age = ?))) results
And I don't mind if I get duplicated fruits and vegetables, I need the duplicates for my situation.
For example:
If there are 2 people who are 30 years old and both of them like the color red, I want to get all fruits and vegetables that are red twice.
If there are 2 people who are 10 years old and one of them likes the color red while the other likes the color green, I want to get all fruits and vegetables that are red and all that are green.
Not sure why you thought you had to test the colour in the union, since the driver is people. I have guessed what your output should be.
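create table people(id int,name varchar(10),colour varchar(1),age int);
create table fruits(id int,name varchar(10),colour varchar(1));
create table vegetables(id int,name varchar(10),colour varchar(1));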
insert into fruits values
(1,'a','a'),(2,'b','a'),(3,'b','b'),(4,'b','c');
insert into vegetables values
(1,'t','a'),(2,'t','u'),(3,'v','v'),(4,'v','w');
insert into people values
(1,'aa','a',10),(2,'bb','b',10),(3,'cc','c',10),(4,'dd','c',11);
select p.name,p.age,p.name,s.`type`,s.name,s.colour
from people p
join
(
select 1 as type, name,colour from fruits
union
select 2 as type, name,colour from vegetables
) s
on s.colour = p.colour
where p.age = 10;
+------+------+------+------+------+--------+
| name | age | name | type | name | colour |
+------+------+------+------+------+--------+
| aa | 10 | aa | 2 | t | a |
| aa | 10 | aa | 1 | b | a |
| aa | 10 | aa | 1 | a | a |
| bb | 10 | bb | 1 | b | b |
| cc | 10 | cc | 1 | b | c |
+------+------+------+------+------+--------+
5 rows in set (0.00 sec)
I'm not sure, but a simplified version may be:
SELECT * FROM ((SELECT 1 as type, name, color FROM fruits WHERE color = ?) UNION ALL
(SELECT 2 as type, name, color FROM vegetables WHERE color = ?)) results
where results.color= (SELECT favoriteColor FROM people WHERE age = ?)
sorry for indentation
I'd do it as a pair of unions to create one unified dataset, joined to another (then filtered) dataset:
SELECT * FROM
(
SELECT 1 as type, name, color FROM fruits
UNION ALL
SELECT 2 as type, name, color FROM vegetables
) pl
INNER JOIN
people pe ON pl.color = pe.favoriteColor
WHERE
pe.age = 30
If you want different columns out of fruit and veg, and there might not be a fruit or veg row for a given color:
SELECT * FROM
people pe
LEFT JOIN fruits f on pe.favoriteColor = f.color
LEFT JOIN vegetables v on pe.favoriteColor = v.color
WHERE
pe.age = 30
But bear in mind that multiple fruits or vegetables of a given color will multiply the result set, repeating each row of one plant for every row of the other, which could become a nightmare to deal with on the front end.
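The duplicate behaviour the question asks for can be checked with a small sketch of the UNION ALL plus join approach. It uses Python's built-in sqlite3 module so it runs without a server; the people, fruit, and vegetable rows are invented for the example, and the column names follow the question (favoriteColor, color).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (name TEXT, age INTEGER, favoriteColor TEXT);
CREATE TABLE fruits (name TEXT, color TEXT);
CREATE TABLE vegetables (name TEXT, color TEXT);
-- hypothetical sample rows: two 30-year-olds who both like red
INSERT INTO people VALUES ('Ann', 30, 'red'), ('Bob', 30, 'red'), ('Cy', 10, 'green');
INSERT INTO fruits VALUES ('cherry', 'red'), ('lime', 'green');
INSERT INTO vegetables VALUES ('radish', 'red'), ('kale', 'green');
""")

QUERY = """
SELECT pl.type, pl.name
FROM (
    SELECT 1 AS type, name, color FROM fruits
    UNION ALL
    SELECT 2 AS type, name, color FROM vegetables
) pl
JOIN people pe ON pe.favoriteColor = pl.color   -- one output row per matching person
WHERE pe.age = ?
ORDER BY pl.type, pl.name
"""

rows_for_30 = conn.execute(QUERY, (30,)).fetchall()
rows_for_10 = conn.execute(QUERY, (10,)).fetchall()
print(rows_for_30)
print(rows_for_10)
```

The people table is read only once, and each red plant appears twice for age 30 because two people of that age like red.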

Truncate and concatenate mysql results based on number of results

I would like to concatenate all the item names of an order; however, if the total number of unique item names reaches a certain number, then I want to truncate each name before concatenating the names. Below are the conditions:
If the total number of unique item names in the order is less than 5, then use the full-length item name and concatenate the names; otherwise (5 or more unique item names), truncate each item name to 20 characters and concatenate the truncated names. For example, below is my table:
order_id | item_name | item_name_len
---------|-------------------------------------|--------------
1 | "pampers diapers ultra sensitive" | 31
1 | "cabbage salad pure organic greens" | 33
1 | "milky way" | 9
1 | "sea salt" | 8
1 | "cool waters fruit juice" | 23
| |
2 | "pure clear glass crystals" | 25
2 | "simple sugar edible paper" | 25
I want the following results:
order_id | all_item_names
---------|-----------------------------------------------------------
1 | "pampers diapers ultr ; cabbage salad pure o ; milky way ;
| sea salt ; cool waters fruit ju"
|
2 | "pure clear glass crystals ; simple sugar edible paper"
For Order #1, since there are 5 unique item names in the order, we truncate each of the item names to 20 characters and concatenate the truncated names. For Order #2, since there are only 2 unique item names in the order, we take the full-length of the name and concatenate them. (I've included the item name strlen in the table above for illustration.)
I'm trying to use a ternary condition, but it's not working.
IF( COUNT(DISTINCT item_name) < 5, item_name, SUBSTRING(item_name, 1, 20) )
See query below. I get Error code: 1111. Invalid use of group function
SELECT
w.order_id,
(SELECT GROUP_CONCAT( IF ( COUNT(DISTINCT o.item_name) < 5 ,
o.item_name, SUBSTRING(o.item_name, 1, 20) ) ) separator ' ; ' )
FROM order_items o WHERE o.order_id = w.order_id)AS all_item_names
FROM order_items w
GROUP BY order_id
You can do this with one aggregation and no join:
select oi.order_id,
(case when count(*) < 5
then group_concat(oi.item_name separator ' ; ')
else group_concat(left(oi.item_name, 20) separator ' ; ')
end) as all_item_names
from order_items oi
group by oi.order_id
Group once to get the number of items and join to the table for the final group_concat:
select
o.order_id,
group_concat(
case
when counter < 5 then item_name
else left(item_name, 20)
end SEPARATOR ' ; '
) all_item_names
from order_items o inner join (
select
order_id, count(*) counter
from order_items
group by order_id
) g on g.order_id = o.order_id
group by o.order_id
See the demo
Results:
| order_id | all_item_names |
| -------- | ----------------------------------------------------------------------------------------- |
| 1 | pampers diapers ultr ; cabbage salad pure o ; milky way ; sea salt ; cool waters fruit ju |
| 2 | pure clear glass crystals ; simple sugar edible paper |
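The single-aggregation idea can be tried locally with the sketch below, which uses Python's built-in sqlite3 module. Two things differ from MySQL and are assumptions of this example: SQLite spells the separator as a second argument to GROUP_CONCAT rather than with the SEPARATOR keyword, and it uses SUBSTR instead of LEFT. COUNT(DISTINCT item_name) is used here because the question speaks of unique item names. The rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (order_id INTEGER, item_name TEXT)")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?)",
    [
        (1, "pampers diapers ultra sensitive"),
        (1, "cabbage salad pure organic greens"),
        (1, "milky way"),
        (1, "sea salt"),
        (1, "cool waters fruit juice"),
        (2, "pure clear glass crystals"),
        (2, "simple sugar edible paper"),
    ],
)

QUERY = """
SELECT order_id,
       CASE WHEN COUNT(DISTINCT item_name) < 5
            THEN GROUP_CONCAT(item_name, ' ; ')                -- few items: full names
            ELSE GROUP_CONCAT(SUBSTR(item_name, 1, 20), ' ; ') -- 5 or more: truncate
       END AS all_item_names
FROM order_items
GROUP BY order_id
ORDER BY order_id
"""

# Split the concatenated string again; the order of names inside
# GROUP_CONCAT is not guaranteed, so it is compared as a sorted list.
names_by_order = {
    order_id: sorted(names.split(" ; "))
    for order_id, names in conn.execute(QUERY)
}
print(names_by_order)
```

The CASE sits outside the aggregates, which avoids the "Invalid use of group function" error caused by nesting COUNT inside GROUP_CONCAT.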

MySQL: JOINs on 1-to-1 basis

I think this problem belongs to the more advanced SQL category (MySQL in this case): I have two tables (TABLE_FRUIT, TABLE_ORIGIN - just example names) which have a column that can be used to join them (fruit_name).
Consider the following diagram:
TABLE_FRUIT
fruit_id|fruit_name |variety
--------|----------------------
1|Orange |sweet
2|Orange |large
3|Lemon |wild
4|Apple |red
5|Apple |yellow
6|Pear |early
etc...
TABLE_ORIGIN
fruit_id |fruit_name|Origin
---------|----------|--------
1|Apple | Italy
2|Pear | Portugal
3|Grape | Italy
4|Orange | Spain
5|Orange | Portugal
6|Orange | Italy
etc...
Desired Result:
TABLE_FRUIT_ORIGIN
fruit_id |fruit_name|Origin
---------|----------|--------
1|Orange | Spain
2|Orange | Portugal
3|Apple | Italy
4|Pear | Portugal
The tables have multiple identical values in the join column (fruit_name). Despite that, I need to join the values on a 1-to-1 basis. In other words, the "Orange" value appears 2 times in TABLE_FRUIT and 3 times in TABLE_ORIGIN. I am looking for a result of two matches, one for Spain, one for Portugal. The Italy value from TABLE_ORIGIN must be ignored, because there is no third Orange value available in TABLE_FRUIT to match it.
I tried what I could, but I can not find anything relevant on Google. For example, I tried adding one more column record_used and tried UPDATE but without success.
TABLE_ORIGIN
fruit_id |fruit_name|origin |record_used
---------|----------|-----------|-----------
1|Apple | Italy |
2|Pear | Portugal |
3|Grape | Italy |
4|Orange | Spain |
5|Orange | Portugal |
6|Orange | Italy |
etc...
UPDATE
TABLE_FRUIT t1
INNER JOIN
TABLE_ORIGIN t2
ON
(t1.fruit_name = t2.fruit_name)
AND
(t2.record_used IS NULL)
SET
t2.record_used = 1;
Summary:
Find matching records between two tables on 1-to-1 basis (probably JOIN)
For each record in TABLE_FRUIT find just one (next first) matching record in TABLE_ORIGIN
If a record in TABLE_ORIGIN was already matched once with a record from TABLE_FRUIT, it may not be considered again in the same query run.
Here is what I had in mind with the RANK function. After commenting, I realized MySQL doesn't have a built-in RANK over GROUP BY function, so I had to find this workaround.
SELECT *
FROM (SELECT fruit_name,
@f_rank := IF(@f_name = fruit_name, @f_rank + 1, 1) AS rank,
@f_name := fruit_name
FROM table_fruit
ORDER BY fruit_name DESC) f
INNER JOIN (SELECT fruit_name,
@f_rank := IF(@f_name = fruit_name, @f_rank + 1, 1) AS rank,
@f_name := fruit_name
FROM table_origin
ORDER BY fruit_name DESC) o
ON f.fruit_name = o.fruit_name
AND f.rank = o.rank;
Explanation: Rank each item in the table for each fruit. So Orange in the first table would have rank 1 and 2 and so will Apple. In the second table, Orange will have rank 1, 2 and 3 but others will only have rank 1. Then when joining the tables based on names, you can also join based on rank so that way, you'll get Orange rank 1 and 2 match but Orange with rank 3 will not match.
This is based on my understanding of the problem. Let me know if the requirement is something different than what I have given here.
There is an arbitrary relationship between the number of entries and the order of those entries, so use techniques to match the number of items and order of those items. In MariaDB v10 which supports "window functions" dense_rank() and row_number() this is relatively easy:
select
row_number() over(order by fn.fruit_id) as fruit_id
, fn.fruit_name, o.Origin, fn.variety
from (
select fruit_name, variety, fruit_id
, dense_rank() over(partition by fruit_name order by fruit_id) rnk
from table_fruit
) fn
inner join (
select fruit_name, Origin
, dense_rank() over(partition by fruit_name order by fruit_id) rnk
from table_origin
) o on fn.fruit_name = o.fruit_name and fn.rnk = o.rnk
fruit_id | fruit_name | Origin | variety
-------: | :--------- | :------- | :------
1 | Orange | Spain | sweet
2 | Orange | Portugal | large
3 | Apple | Italy | red
4 | Pear | Portugal | early
dbfiddle here
A pure MySQL solution is a bit more complex because it requires use of @variables that will substitute for those window functions.
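The pairing logic itself (number the rows per fruit_name in both tables, then join on name and number) can be checked with a short sketch. It uses Python's built-in sqlite3 module, assuming SQLite 3.25 or newer for window functions, and ROW_NUMBER() in place of dense_rank(), which gives the same numbering here because fruit_id is unique. The rows are the ones from the question, with the id column of TABLE_ORIGIN assumed to be named fruit_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_fruit (fruit_id INTEGER, fruit_name TEXT, variety TEXT);
CREATE TABLE table_origin (fruit_id INTEGER, fruit_name TEXT, origin TEXT);
INSERT INTO table_fruit VALUES
    (1, 'Orange', 'sweet'), (2, 'Orange', 'large'), (3, 'Lemon', 'wild'),
    (4, 'Apple', 'red'), (5, 'Apple', 'yellow'), (6, 'Pear', 'early');
INSERT INTO table_origin VALUES
    (1, 'Apple', 'Italy'), (2, 'Pear', 'Portugal'), (3, 'Grape', 'Italy'),
    (4, 'Orange', 'Spain'), (5, 'Orange', 'Portugal'), (6, 'Orange', 'Italy');
""")

QUERY = """
SELECT f.fruit_name, f.variety, o.origin
FROM (
    SELECT fruit_id, fruit_name, variety,
           ROW_NUMBER() OVER (PARTITION BY fruit_name ORDER BY fruit_id) AS rn
    FROM table_fruit
) f
JOIN (
    SELECT fruit_name, origin,
           ROW_NUMBER() OVER (PARTITION BY fruit_name ORDER BY fruit_id) AS rn
    FROM table_origin
) o ON o.fruit_name = f.fruit_name
   AND o.rn = f.rn          -- n-th fruit row pairs with n-th origin row only
ORDER BY f.fruit_id
"""

pairs = conn.execute(QUERY).fetchall()
for row in pairs:
    print(row)
```

The third Orange origin (Italy) and the second Apple variety (yellow) have no partner with the same row number, so they drop out of the inner join.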

Select Most Frequent from Multiple Columns

I've got a table like:
ID | Winner | Loser | WinningCaster | LosingCaster
0 | Player A | Player B | Warcaster A | Warcaster B
1 | Player A | Player B | Warcaster C | Warcaster A
2 | Player C | Player D | Warcaster A | Warcaster B
etc..
With various values for Player, and Warcaster.
WinningCaster / LosingCaster is a finite namelist, and I want to make a query that will find me which name occurs the most often, across both columns, both with and without a particular player entry.
I.e. a query for Player A should return Warcaster A with 2, and an overall query should return Warcaster A with 3.
So far I've only been able to get the most frequent from either column, not from both, with the following:
SELECT
ID, Winner, Loser, CasterWinner, Count(CasterWinner) AS Occ
FROM
`Games`
GROUP BY
CasterWinner
ORDER BY
Occ DESC
LIMIT 1
Use union all:
select caster, count(*)
from ((select casterwinner as caster from games
) union all
(select casterloser from games
)
) c
group by caster
order by count(*) desc
limit 1;
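A runnable sketch of the union all approach, extended with the optional per-player filter from the question, is shown below using Python's built-in sqlite3 module. The column names follow the table in the question (WinningCaster, LosingCaster); the helper name most_frequent_caster and the :player parameter are made up for the example. It assumes that "for a player" means counting both caster columns in every game that player took part in, which is what makes Player A come out as Warcaster A with 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (ID INTEGER, Winner TEXT, Loser TEXT, "
    "WinningCaster TEXT, LosingCaster TEXT)"
)
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?, ?, ?)",
    [
        (0, "Player A", "Player B", "Warcaster A", "Warcaster B"),
        (1, "Player A", "Player B", "Warcaster C", "Warcaster A"),
        (2, "Player C", "Player D", "Warcaster A", "Warcaster B"),
    ],
)

QUERY = """
SELECT caster, COUNT(*) AS occ
FROM (
    -- stack both caster columns into one column
    SELECT WinningCaster AS caster, Winner, Loser FROM games
    UNION ALL
    SELECT LosingCaster, Winner, Loser FROM games
) c
WHERE :player IS NULL OR :player IN (Winner, Loser)
GROUP BY caster
ORDER BY occ DESC, caster
LIMIT 1
"""

def most_frequent_caster(player=None):
    """Return (caster, count); player=None counts across all games."""
    return conn.execute(QUERY, {"player": player}).fetchone()

overall = most_frequent_caster()
for_player_a = most_frequent_caster("Player A")
print(overall, for_player_a)
```

UNION ALL is needed rather than UNION, since UNION would remove the repeated caster names that the count depends on.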