MySQL - AVG() and STD() functions, weird results... - mysql

I'm going crazy with the results from MySQL regarding standard functions:
- AVG() the average
- STD() the standard deviation
Check the following results from my table 'Auction':
mysql> SELECT avg(buyout) avg FROM auction where buyout <> 0 and item =72988;
+-------------+
| avg |
+-------------+
| 234337.3622 |
+-------------+
That result looks correct, no issue.
But when I run std:
mysql> SELECT std(buyout) std FROM auction where buyout <> 0 and item =72988;
+-------------+
| std |
+-------------+
| 574373.6098 |
+-------------+
! The STD is greater than the AVG (STD > AVG), and that's... impossible because my AVG > 0.
Where am I wrong here ... ?
thx in advance !

There is no mathematical constraint saying that, if the mean is positive, the standard deviation has to be smaller than the mean.
I read the extract of your data in R
data <- read.table("extract_72988.csv", h=1, sep="\t")
And confirmed that
> mean(data$BUYOUT)
[1] 234337.4
> sd(data$BUYOUT)
[1] 574421.3
Further analysis of your data shows that it is far from being normally distributed
[Figure: histogram of your data]
[Figure: histogram of the log-transformed data]
[Figure: normal Q-Q plot]
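To make the first point concrete, here is a minimal Python sketch using made-up numbers (not the auction data) and only the standard library. Every value is positive, yet the standard deviation exceeds the mean. It also computes both the population and the sample formula: MySQL's STD() is the population standard deviation, while R's sd() divides by n - 1, which would account for the small gap between 574373.6 and 574421.3.

```python
import statistics

# Hypothetical, strongly right-skewed positive values (not the real auction data).
prices = [1, 1, 1, 1, 100]

mean = statistics.mean(prices)          # arithmetic mean
pop_sd = statistics.pstdev(prices)      # population SD (divides by n), like MySQL STD()
sample_sd = statistics.stdev(prices)    # sample SD (divides by n - 1), like R sd()

print(mean, pop_sd, sample_sd)
# Every value is positive, yet both standard deviations exceed the mean,
# so "mean - SD" is negative without any negative value in the data.
print(min(prices) > 0, pop_sd > mean, mean - pop_sd < 0)
```

The standard deviation only describes spread; for skewed data it does not mark out a symmetric range in which values must fall.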

Or, said differently: we are looking at auction prices, and each price in the database is a positive value. Our mean is neither reduced nor centered and is around 2.35, but the computed standard deviation is higher than 2.35. Put on a graph, this would mean that prices vary around the mean by more than the mean itself. If we draw this standard deviation "to the left" of our mean, it would say that there is some probability of finding a NEGATIVE price -> impossible!
Right ?

Related

Mysql Sort price , when price thousand to K, million to M

In a MySQL database, prices are stored in a way like this:
98.06K
97.44K
929.14K
91.87K
2.66M
146.64K
14.29K
When I try to sort by price ASC or price DESC, it returns an unexpected result.
Please suggest how I can sort the price when it is stored in forms like
10K, 20M, 1.6B
I want result
14.29K
91.87K
97.44K
98.06K
146.64K
929.14K
2.66M
MySQL ignores trailing non-digits when casting string to numeric. This will return the correct price:
price *
case right(price,1)
when 'K' then 1000
when 'M' then 1000000
else 1
end
Of course, you can order by this, but you would do better to apply it during load and store the price in a numeric column.
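If the conversion is applied during load, as suggested, it could look like the following Python sketch. The function name `price_to_number` is hypothetical, and the 'B' multiplier is an assumption based on the "1.6B" mentioned in the question; the CASE expression above only covers K and M.

```python
from decimal import Decimal

# Multipliers for the suffixes; 'B' is an assumption taken from the "1.6B" in the question.
FACTORS = {"K": Decimal(1000), "M": Decimal(1000000), "B": Decimal(1000000000)}

def price_to_number(price: str) -> Decimal:
    """Convert a string such as '98.06K' to its numeric value."""
    price = price.strip()
    suffix = price[-1].upper()
    if suffix in FACTORS:
        return Decimal(price[:-1]) * FACTORS[suffix]
    return Decimal(price)  # no suffix: plain number

prices = ["98.06K", "97.44K", "929.14K", "91.87K", "2.66M", "146.64K", "14.29K"]
sorted_prices = sorted(prices, key=price_to_number)
print(sorted_prices)
```

Decimal is used rather than float so that the stored numeric value matches the written digits exactly.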
The problem lies in your data model. I understand that 2.66M is not necessarily exactly 2,660,000, which is why you don't want to store the whole number, but store '2.66M' instead to indicate the precision. This, however, is two pieces of information: the value and the precision, so use two columns:
mytable
value | unit
-------+-----
98.06 | K
97.44 | K
929.14 | K
91.87 | K
2.66 | M
146.64 | K
14.29 | K
Along with a lookup table:
units
unit | factor
-----+--------
K | 1000
M | 1000000
A possible query would be:
select *
from mytable
join units using (unit)
order by mytable.value * units.factor;
where you may want to extend the ORDER BY clause to something like
order by mytable.value * units.factor, units.factor;
or apply some rounding or whatever to consider precision of two seemingly equal values.
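The two-table layout can be tried out quickly with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for MySQL here, and the column types are assumptions, but the join and ORDER BY are the ones shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.executescript("""
    CREATE TABLE units (unit TEXT PRIMARY KEY, factor INTEGER);
    CREATE TABLE mytable (value REAL, unit TEXT REFERENCES units(unit));
    INSERT INTO units VALUES ('K', 1000), ('M', 1000000);
    INSERT INTO mytable VALUES
        (98.06, 'K'), (97.44, 'K'), (929.14, 'K'), (91.87, 'K'),
        (2.66, 'M'), (146.64, 'K'), (14.29, 'K');
""")

# Sort by the real magnitude; the unit factor breaks ties between equal magnitudes.
rows = conn.execute("""
    SELECT value, unit
    FROM mytable
    JOIN units USING (unit)
    ORDER BY mytable.value * units.factor, units.factor
""").fetchall()
print(rows)
```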
It is possible, though not advisable:
https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=0a837287c7646823fa6657706f9ae634
SELECT *
, CAST(LEFT(price, LENGTH(price) - 1) AS DECIMAL(10,2)) AS value
, RIGHT(price, 1) AS unit
, CASE RIGHT(price,1)
WHEN 'K' THEN 1000
WHEN 'M' THEN 1000000
ELSE 1
END AS amount
FROM test1
ORDER BY amount, value;
Why not advisable? As the Explain in the dbfiddle shows, this query uses filesort for sorting, which is not very fast. If you do not have too many rows in your data, this should be no problem though.

If more than 10% of results are over X in mysql

I have a database table with lists of temperature readings from many locations in a number of buildings. I need a query that will give me a true or false if more than 10% of the readings in a building, taken on a date, are greater than X
I am not looking for an average. If there are 100 measurements taken in a building on a date, and 10 of them are over X (say 80 degrees) then create a flag.
The table is laid out as
| Building # | location # | date | temperature |
| 123 | 555 | 2016-04-08 | 68.5 |
| 123 | 556 | 2016-04-08 | 70.2 |
| 123 | 557 | 2016-04-08 | 65.4 |
| 888 | 999 | 2013-03-22 | 80.4 |
Typically a building would have over 100 readings. There are many hundreds of building/date entries in the table
Can this be done with a single mysql query and can you share that query with me?
I obviously haven't made my question clear.
The result I am looking for is a single True or False.
If more than 10% of the results for a building/date combination were over X (say 80 degrees) then show true, or some flag equal to true.
The known fields will be building and date. The location is not relevant, and can be ignored. So given the input of building (123) and date (2016-04-08) are more than 10% of the entries in the table that have that building number and date greater than X (e.g. 80). The only data to be tested are those for that building and date. So the query would end in:
where building_id = '123' AND date = '2016-04-08'
I am NOT looking for an average or a median. I am NOT looking to see a list of the data for that 10%. I am just looking for true or false.
You can use conditional aggregation, something like this:
select building, date,
(case when avg(temperature > x) > 0.1 then 'Y' else 'N' end) as flag
from t
group by building, date;
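The comparison temperature > x evaluates to 1 or 0, so its average is the fraction of readings over x. The sketch below checks that idea with Python's built-in sqlite3 module. SQLite stands in for MySQL, and the readings are hypothetical: building 123 has 2 of 10 readings over 80, building 888 has exactly 1 of 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.execute(
    "CREATE TABLE t (building INTEGER, location INTEGER, date TEXT, temperature REAL)"
)

# Hypothetical readings: 10 per building/date.
rows = []
for i in range(10):
    # Building 123: 2 of 10 readings over 80 (20%).
    rows.append((123, 500 + i, "2016-04-08", 85.0 if i < 2 else 68.5))
    # Building 888: 1 of 10 readings over 80 (exactly 10%, not more).
    rows.append((888, 900 + i, "2013-03-22", 80.4 if i < 1 else 70.2))
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

x = 80  # threshold X
flags = conn.execute("""
    SELECT building, date,
           CASE WHEN AVG(temperature > ?) > 0.1 THEN 'Y' ELSE 'N' END AS flag
    FROM t
    GROUP BY building, date
    ORDER BY building
""", (x,)).fetchall()
print(flags)
```

Because the test is strictly "more than 10%", a building with exactly 10% of its readings over X is not flagged; use >= 0.1 if that case should count.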
To return building and date, and "create a flag" for rows where more than 10% of the readings for that building on that date are over a given value X ...
SELECT r.building
, DATE(r.date)
, ( SUM(r.reading > X ) > SUM(.10) ) AS _flag
FROM myreadings r
GROUP BY r.building, DATE(r.date)
Absent more specification of the actual result set you want returned, we're just guessing.
FOLLOWUP
Based on the update to the question... to return a row for a single building and a single date, add the WHERE clause as shown in the question. And remove expressions from the SELECT list.
SELECT ( SUM(r.reading > X ) > SUM(.10) ) AS _flag
FROM myreadings r
WHERE r.building = '123'
AND r.date >= '2016-04-08'
AND r.date < '2016-04-08' + INTERVAL 1 DAY
If there are no rows for the given building and given date, the query still returns a single row, because an aggregate query without GROUP BY always returns exactly one row; _flag is NULL in that case, since SUM() over no rows is NULL. If there is at least one row, and the number of rows that have a reading greater than X is more than 10% of the total number of rows, the query will return a single row, with _flag having a value of 1 (TRUE). Otherwise, the query will return a single row with _flag having a value of 0 (FALSE).
If you want the query to return 0 rather than NULL when there are no matching rows, wrap the expression, for example in IFNULL(..., 0).
If you want the query to return string values 'TRUE' or 'FALSE', that can be accomplished as well.
Again, absent an example of the result set you expect (an actual specification we can compare a result set against), we're just guessing.

Oracle SQL when querying a range of data

I have a table that for an ID, will have data in several bucket fields. I want a function to pull out a sum of buckets, but the function parameters will include the start and end bucket field.
So, if I had a table like this:
ID Bucket0 Bucket30 Bucket60 Bucket90 Bucket120
10 5.00 12.00 10.00 0.0 8.00
If I send in the ID and the parameters Bucket0, Bucket0, it would return only the value in the Bucket0 field: 5.00
If I send in the ID and the parameters Bucket30, Bucket120, it would return the sum of the buckets from 30 to 120, or (12+10+0+8) 30.00.
Is there a nicer way to write this other than a huge ugly
if parameter1=bucket0 and parameter2=bucket0
then select bucket0
else if parameter1=bucket0 and parameter2=bucket1
then select bucket0 + bucket1
else if parameter1=bucket0 and parameter2=bucket2
then select bucket0 + bucket1 + bucket2
and so on?
The table already exists, so I don't have a lot of control over that. I can make my parameters for the function however I want. I can safely say that if a set of buckets are wanted, none in the middle will be skipped, so specifying start and end buckets would work. I could have a single comma delimited string of all buckets wanted.
It would have been better if your table had been normalised, like this:
id | bucket | value
---+-----------+------
10 | bucket000 | 5
10 | bucket030 | 12
10 | bucket060 | 10
10 | bucket090 | 0
10 | bucket120 | 8
Also, the buckets should have names that are easy to compare in ranges, so that bucket030 comes between bucket000 and bucket120 in normal alphabetical order, which is not the case if you leave out the padding zeroes.
If the above normalisation is not possible, then use an unpivot clause to turn your current table into the structure depicted above:
select id, sum(value)
from (
select *
from mytable
unpivot (value for bucket_id in (bucket0 as 'bucket000',
bucket30 as 'bucket030',
bucket60 as 'bucket060',
bucket90 as 'bucket090',
bucket120 as 'bucket120'))
) normalised
where bucket_id between 'bucket000' and 'bucket060'
group by id
When you do this with parameter variables, make sure those parameters have the padded zeroes as well.
You could for instance ensure that as follows for parameter1:
if parameter1 like 'bucket%' then
parameter1 := 'bucket' || lpad(+substr(parameter1, 7), 3, '0');
end if;
...etc.
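The same idea, zero-padding the names so that a plain string comparison selects the range, can be sketched outside the database in Python. The helper names `pad_bucket` and `sum_buckets` are hypothetical, and the row is the example from the question.

```python
import re

# Bucket columns from the example table.
BUCKETS = ["Bucket0", "Bucket30", "Bucket60", "Bucket90", "Bucket120"]

def pad_bucket(name: str) -> str:
    """Zero-pad the numeric part: 'Bucket30' -> 'bucket030'."""
    match = re.fullmatch(r"bucket(\d+)", name.strip().lower())
    if match is None:
        raise ValueError(f"not a bucket name: {name!r}")
    return "bucket" + match.group(1).zfill(3)

def sum_buckets(row: dict, start: str, end: str) -> float:
    """Sum every bucket whose padded name lies between start and end, inclusive."""
    lo, hi = pad_bucket(start), pad_bucket(end)
    return sum(row[b] for b in BUCKETS if lo <= pad_bucket(b) <= hi)

row = {"ID": 10, "Bucket0": 5.0, "Bucket30": 12.0, "Bucket60": 10.0,
       "Bucket90": 0.0, "Bucket120": 8.0}
print(sum_buckets(row, "Bucket0", "Bucket0"))
print(sum_buckets(row, "Bucket30", "Bucket120"))
```

Without the padding, 'bucket30' sorts after 'bucket120' as a string, which is why both the column aliases and the parameters need the extra zeroes.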

MS Access Calc Field with combined fields

I have been trying to resolve this calculated field issue for about 30 minutes. The single-field conditions in the expression, such as [points] and [contrib], appear to be correct, but the combined ([points]+[contrib]) condition does not set the field to the correct member type: when the two are added, it returns some other member type, such as Basic. Might I use the Between operator with the added fields? I tried it, but there is some syntax error. In other words, if you have 45 points you are set to Basic based on the points field alone; if you have a contrib of 45 you are set to Basic in the calc field, as expected; but with 50 + 50 it sets Basic when it should give the "Better" member label. This simple statement seems like it should be correct, but it is not evaluated as I expect when adding. It must not be recognizing the combined value for some reason, and calc fields do not have a Sum() function.
Focus here: (([points]+[Contrib]) >= 45 And ([points]+[Contrib]) < 100),"Basic",
IIf(([points] >=45 And [points]<100) Or ([Contrib] >=45 And [Contrib] <100) Or (([points]+[Contrib]) > = 45 And ([points]+[contrib] < 100),"Basic",
IIf(([points] >=100 And [points] <250) Or ([Contrib] >=100 And [Contrib] <250) Or ((([points]+[Contrib]) >=100) And (([points]+[Contrib])<250)),"Better",
IIf(([points] >=250 And [points]<500) Or ([Contrib] >=250 And [Contrib] <500) Or ((([points]+[Contrib]) >=250) And (([points]+[Contrib])<500)),"Great",
IIf(([points] >=500) Or ([Contrib] >=500) Or (([points]+[Contrib]) >=500),"Best","Non-member"))))
Here is a data sample from an Access 2010 table which includes a calculated field named member_type:
id points Contrib member_type
-- ------ ------- ----------
1 1 1 Non-member
2 50 1 Basic
3 200 1 Better
4 300 1 Great
5 600 1 Best
If that is what you want for your calculated field, here is the expression I used for member_type:
IIf([points]+[Contrib]>=45 And [points]+[Contrib]<100,'Basic',IIf([points]+[Contrib]>=100 And [points]+[Contrib]<250,'Better',IIf([points]+[Contrib]>=250 And [points]+[Contrib]<500,'Great',IIf([points]+[Contrib]>=500,'Best','Non-member'))))
In case I didn't get it exactly correct, here is that same expression formatted so that you can better see where you need changes:
IIf([points]+[Contrib]>=45 And [points]+[Contrib]<100,'Basic',
IIf([points]+[Contrib]>=100 And [points]+[Contrib]<250,'Better',
IIf([points]+[Contrib]>=250 And [points]+[Contrib]<500,'Great',
IIf([points]+[Contrib]>=500,'Best','Non-member'
))))
Note if either points or Contrib is Null, member_type will display "Non-member". If that is not the behavior you want, you will need a more complicated expression. Since a calculated field expression can not use Nz(), you would have to substitute something like IIf([points] Is Null,0,[points]) for every occurrence of [points] and IIf([Contrib] Is Null,0,[Contrib]) for [Contrib]
It would be simpler to prohibit Null for those fields (set their Required property to Yes) and set Default Value to zero.
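For checking the tier boundaries outside Access, the nested IIf expression can be mirrored in a short Python sketch. The function name `member_type` is hypothetical, and None is used as a stand-in for Null.

```python
from typing import Optional

def member_type(points: Optional[float], contrib: Optional[float]) -> str:
    """Tier based on points + contrib, mirroring the nested IIf expression."""
    if points is None or contrib is None:
        # In Access, Null + number is Null and every comparison fails,
        # so the expression falls through to 'Non-member'.
        return "Non-member"
    total = points + contrib
    if 45 <= total < 100:
        return "Basic"
    if 100 <= total < 250:
        return "Better"
    if 250 <= total < 500:
        return "Great"
    if total >= 500:
        return "Best"
    return "Non-member"

samples = [(1, 1), (50, 1), (200, 1), (300, 1), (600, 1), (50, 50), (None, 50)]
for points, contrib in samples:
    print(points, contrib, member_type(points, contrib))
```

Testing only the sum avoids the problem in the question, where a single field of 50 matches the "Basic" branch before the combined total of 100 is ever considered.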
The BETWEEN operator is inclusive: it returns TRUE if the value you are testing is >= the lower limit and <= the upper limit.
If you are looking at 50+50 then that total = 100 and you are Between 44 and 100. That would result in an answer of "Basic". Change the range from ([points]+[Contrib] Between 44 And 100) to ([points]+[Contrib] Between 44 And 99)

Access partition function: Is there a way to make it show bin categories that don't have a count?

I'm trying to use the Access Partition function to generate the bins for a histogram chart showing the frequency distribution of my % utilization data set. However, the Partition function shows the category bin ranges (e.g. 0:9, 10:19 etc) only for the categories that have a count. I would like it to show all the bins up to 100.
Example:
Using this function:
% Utilization: Partition([Max],0,100,10)
The Full SQL is:
SELECT Count([qry].[Max]) AS Actuals, Partition([Max],0,100,10) AS [% Utilization]
FROM [qry]
GROUP BY Partition([Max],0,100,10);
gives me:
Actuals | % Utilization
4 | 0: 9
4 | 10: 19
4 | 20: 29
but I want it to show 0s for the ranges that don't have values up to 90:99. Can this be done?
Thanks in Advance
The only way I can think of doing this is with an additional Bins table that contains all the bins you wish to illustrate:
SELECT Bins.[% Utilization], t.Actuals FROM Bins
LEFT JOIN
(SELECT Count(max) AS Actuals,
Partition([max],0,100,10) AS [% Utilization]
FROM qry
GROUP BY Partition([max],0,100,10)) t
ON t.[% Utilization]=bins.[% Utilization]
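The logic of the Bins table plus LEFT JOIN can be sketched in Python as follows. The utilization values are hypothetical and the labels are simplified (Access pads its Partition labels with spaces, so the rows of a real Bins table must match that padding exactly). Note that the LEFT JOIN itself yields Null rather than 0 for empty bins; the sketch substitutes 0.

```python
from collections import Counter

def bin_label(value: int, width: int = 10) -> str:
    """Simplified label such as '20:29' (Access pads its labels with spaces)."""
    lo = (value // width) * width
    return f"{lo}:{lo + width - 1}"

# Hypothetical utilization values covering only the first three bins.
values = [1, 3, 5, 9, 10, 12, 15, 19, 20, 22, 25, 29]

counts = Counter(bin_label(v) for v in values)           # like the GROUP BY subquery
all_bins = [bin_label(lo) for lo in range(0, 100, 10)]   # like the Bins table

# LEFT JOIN equivalent: every bin appears, missing counts become 0.
histogram = [(label, counts.get(label, 0)) for label in all_bins]
for label, actuals in histogram:
    print(label, actuals)
```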