Conditional probability p(y|x) in SQL - mysql

How can I calculate a conditional probability in vendor-agnostic SQL while reading a precomputed table (histogram) just once?
Let's imagine we have a query which returns a histogram relation. The histogram contains the following attributes: {x, y, cnt}, where cnt is the count of co-occurrences of the nominal attributes x and y. Calculating the histogram is time consuming.
Once we have the histogram, we want to calculate the conditional probability p(y|x). A possible way to do that is to take p(y|x) = count(y,x) / count(x), as outlined in the following query:
with histogram as (
-- Long and time consuming subquery returning {x, y, cnt}
), x_count as (
select x
, sum(cnt) as cnt
from histogram
group by x
)
select y
, x
, histogram.cnt / x_count.cnt as probability
from histogram
join x_count
using(x)
However, common table expressions (CTEs) are not portable (e.g. MySQL before version 8.0 does not support them). Is there a way to rewrite the CTE so that:
The same query can be executed without change on MySQL, MSSQL and PostgreSQL?
The histogram relation is calculated just once?
All I can think of is to materialize the histogram into a table, process it, and then drop the table.

First, just because you declare something as a CTE does not mean that it is run only once. For instance, SQL Server does not materialize CTEs, so using your logic it would run the histogram once for each reference. It is the same as a view.
In addition, the using clause is not supported by all databases.
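For reference, the join in the question can be made portable by replacing using(x) with an explicit ON condition (column qualifications added to avoid the ambiguous cnt):
select h.y, h.x, h.cnt * 1.0 / xc.cnt as probability
from histogram h
join x_count xc on h.x = xc.x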
So, the one thing that you could do that is vendor agnostic is to use a view. There is a slight hitch, because dropping a view that already exists is vendor-specific. But the following would generally work to express the query:
create view histogram as -- you might want to give this a more unique name
-- Long and time consuming subquery returning {x, y, cnt}
select h.y, h.x, h.cnt * 1.0 / total.cnt as probability
from histogram h join
(select x, sum(cnt) as cnt
from histogram
group by x
) total
on h.x = total.x;
drop view histogram;
Of course, this runs the histogram query multiple times. So, you could solve this using temporary tables:
create table histogram (
x ??, -- I don't know what the types are
y ??,
cnt ??
);
insert into histogram (x, y, cnt)
select . . . ; -- your complicated query here
select h.y, h.x, h.cnt * 1.0 / total.cnt as probability
from histogram h join
(select x, sum(cnt) as cnt
from histogram
group by x
) total
on h.x = total.x;
drop table histogram;
Unfortunately, dropping a table only if it already exists is database-specific. This does meet your requirements, though.
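For what it's worth, the IF EXISTS guard is now widely supported (MySQL, PostgreSQL, and SQL Server 2016+), which makes the cleanup step reasonably portable:
drop table if exists histogram;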
My advice would be to drop MySQL from the requirement -- it is rather degraded from the perspective of ANSI functionality. Then simply do:
select h.*, cnt * 1.0 / sum(cnt) over (partition by x) as probability
from histogram h;
(The * 1.0 is because some databases do integer division and cnt sounds like it might be an integer.)
This would be the simplest way to express the query without re-calculating histogram. And, it will work in a lot of databases -- SQL Server, Postgres, Oracle, Teradata, DB2, BigQuery, Redshift, Hive. In fact, I think it will work in pretty much all current versions of what is commonly called a "database" except MySQL, SQLite, and MS Access.
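To make the arithmetic concrete, here is a minimal sketch with made-up values:
-- hypothetical histogram: p(y|x) = cnt / sum of cnt within each x
create table histogram (x varchar(10), y int, cnt int);
insert into histogram values ('a', 1, 2), ('a', 2, 2), ('b', 1, 1);
select h.*, cnt * 1.0 / sum(cnt) over (partition by x) as probability
from histogram h;
-- yields p(1|a) = 0.5, p(2|a) = 0.5, p(1|b) = 1.0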

Related

GROUP BY Syntax Mysql - Leaving out a groupable column

I have Table A with columns X,Y,Z.
X is an FK, Y is a description. Each X has exactly one corresponding Y. So if X stays the same over multiple records, Y stays the same too.
So there may be any number of records where X and Y are the same.
Now I'm running the following query:
SELECT X, Y
FROM A
GROUP BY X;
Will this query work?
Y is supposed to be grouped alongside X, but I didn't explicitly specify it in the query.
Does MySQL still implicitly act this way, and is this behavior reliable/standardized?
Furthermore, will the results vary based on the datatype of Y? For example, is there a difference if Y is VARCHAR, CHAR or INT? In the case of an INT, will the result be a SUM() of the grouped records?
Is the behavior MySQL exhibits in such a case standardized, and where can I look it up?
Each X has exactly one corresponding Y
SELECT X, Y FROM A GROUP BY X;
Will this query work?
Technically, what happens when you run this query under MySQL depends on whether the SQL mode ONLY_FULL_GROUP_BY is enabled or not:
if it is enabled, the query errors: all non-aggregated columns must appear in the GROUP BY clause (you need to add Y to the GROUP BY clause);
otherwise, the query executes and gives you an arbitrary value of Y for each X; but since Y is functionally dependent on X, the value is actually predictable, so this is OK.
Generally, although the SQL standard does recognize the notion of a functionally dependent column, it is good practice to always include all non-aggregated columns in the GROUP BY clause. It is also a requirement in most databases other than MySQL (and, starting with MySQL 5.7, ONLY_FULL_GROUP_BY is enabled by default). This also protects you from various pitfalls and unpredictable behaviors.
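To check or change the mode on your own server, a quick sketch (session scope only):
SELECT @@sql_mode; -- shows whether ONLY_FULL_GROUP_BY is in the list
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY'); -- enable it (assumes it is not already listed)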
Using ANY_VALUE() makes the query both valid and explicit about its purpose:
SELECT X, ANY_VALUE(Y) FROM A GROUP BY X;
Note that if you only want the distinct combinations of X, Y, it is simpler to use SELECT DISTINCT:
SELECT DISTINCT X, Y FROM A;
Your query will work if Y is functionally dependent on X (depending on the SQL mode being used), but if you are trying to get distinct X,Y pairs from the table, it is better to use DISTINCT. GROUP BY is meant to be used with aggregate functions.
So you should use:
SELECT DISTINCT X, Y
FROM A;
A sample case where you would use GROUP BY would be with an aggregate function:
SELECT X, Y, COUNT(*)
FROM A
GROUP BY X, Y;

How can I apply arithmetic operations to aggregated columns in MySQL?

TL;DR
Is there a way to use aggregated results in arithmetic operations?
Details
I want to take two aggregated columns (SUM(..), COUNT(..)) and operate on them together, e.g.:
-- doesn't work
SELECT
SUM(x) AS x,
COUNT(y) AS y,
(x / y) AS x_per_y -- Problem HERE
FROM
my_tab
GROUP BY groupable_col;
That doesn't work, but I've found this does:
SELECT
SUM(x) AS x,
COUNT(y) AS y,
SUM(x) / COUNT(y) AS x_per_y -- notice the repeated aggregate
FROM
my_tab
GROUP BY groupable_col;
But if I need many columns that operate on aggregates, it quickly becomes very repetitive, and I'm not sure how to tell whether or not MySQL can optimize so that I'm not calculating aggregates multiple times.
I've searched SO, for a while now, as well as asked some pros, and the best alternative I can come up with is nested selects, ~~which my db doesn't support.~~
EDIT: it did support them, I had been doing something wrong, and ruled out nested selects prematurely
Also, the MySQL documentation seems to support it, but I can't get something like this to work (example at the very bottom of the link):
https://dev.mysql.com/doc/refman/5.5/en/group-by-handling.html
One way is to use a subquery:
select x,
y,
x / y as x_per_y
from (
select SUM(x) as x,
COUNT(y) as y
from my_tab
group by groupable_col
) t
Also note that the value of count(y) can be zero (when all the y values are null).
MySQL handles this case automatically and produces NULL when the denominator is zero.
Some DBMSes throw a divide-by-zero error instead; that is usually handled by explicitly producing NULL in that case:
select x,
y,
case when y > 0 then x / y end as x_per_y
from (
select SUM(x) as x,
COUNT(y) as y
from my_tab
group by groupable_col
) t
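A more compact and portable guard uses NULLIF, which is standard SQL: a zero denominator becomes NULL, so the division yields NULL instead of an error:
select x,
y,
x / nullif(y, 0) as x_per_y
from (
select SUM(x) as x,
COUNT(y) as y
from my_tab
group by groupable_col
) t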

Usage of weighting in pure SQL

My front-end (SourcePawn) currently does the following:
float fPoints = 0.0;
float fWeight = 1.0;
while(results.FetchRow())
{
fPoints += (results.FetchFloat(0) * fWeight);
fWeight *= 0.95;
}
In case you don't understand this code, it goes through the resultset of this query:
SELECT points FROM table WHERE auth = 'authentication_id' AND points > 0.0 ORDER BY points DESC;
The resultset is floating-point numbers, sorted by points from high to low.
My front-end takes 100% of the first row's points, then 95% of the second row's, then 90.25% of the third row's, and so on: the weight is multiplied by 0.95 for every subsequent row. Everything adds up into fPoints, which is my 'sum' variable.
What I'm looking for is a way to replicate this code in pure SQL and receive the sum called fPoints in my front-end, so that I can run it for a table that has over 10,000 rows in one query instead of 10,000 queries.
I'm very lost. I don't know where to start and guidance of any kind would be very nice.
You can do this using variables:
SELECT points,
(points * (@f := 0.95 * @f) / 0.95) as fPoints
FROM table t CROSS JOIN
(SELECT @f := 1.0) params
WHERE auth = 'authentication_id' AND points > 0.0
ORDER BY points DESC;
A note about the calculation. The value of @f starts at 1. Because we are dealing with variables, the assignment and the use of the variable need to be in the same expression -- MySQL does not guarantee the order of evaluation of expressions.
So, the 0.95 * @f reduces the value by 5%. However, that is for the next iteration. The / 0.95 undoes that to get the right value for this iteration.
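On MySQL 8+, window functions can express the same weighting without user variables. A sketch, assuming the same table and filter as the question:
SELECT SUM(points * POWER(0.95, rn - 1)) AS fPoints
FROM (
SELECT points,
ROW_NUMBER() OVER (ORDER BY points DESC) AS rn
FROM table t
WHERE auth = 'authentication_id' AND points > 0.0
) ranked;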
While I'm glad the answer Gordon Linoff provides works for you, you should understand it's quite specific. ORDER BY, per the SQL standard, affects only the order in which rows are returned, not how the query is evaluated, and SQL has no notion of "iteration" within a SELECT statement. So the idea of "reducing a variable on each iteration", where the iteration order is governed by ORDER BY, has no basis in standard SQL. You might want to check whether it's guaranteed by MySQL, just for your own edification.
To achieve the effect you want in a standard way, proceed as follows.
Create a table Percentiles( Percentile int not null, Factor float not null )
Populate that table with your factors (20 rows).
Write a view or CTE that ranks your points in descending order. Let us call the rank column rank.
Then join your view to Percentiles:
SELECT auth, sum(points * factor) as weight
FROM "your view" as t join percentiles as p
ON t.rank = p.percentile
WHERE points > 0.0
GROUP BY auth
That query is simple, and its intent obvious. It might even be faster. Most important, it will definitely work, and doesn't depend on any idiosyncrasies of your current DBMS.
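A sketch of the pieces that recipe leaves open, assuming a base table scores(auth, points) and a DBMS with window functions (all names here are illustrative):
-- factors follow Factor = 0.95 ^ (Percentile - 1)
CREATE TABLE Percentiles (Percentile int not null, Factor float not null);
INSERT INTO Percentiles (Percentile, Factor)
VALUES (1, 1.0), (2, 0.95), (3, 0.9025); -- ... continue as far as needed
-- "rank" is a reserved word in some DBMSes and may need quoting
CREATE VIEW ranked AS
SELECT auth, points,
ROW_NUMBER() OVER (PARTITION BY auth ORDER BY points DESC) AS rank
FROM scores;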

SQL select a sample of rows

I need to select sample rows from a set. For example, if my select query returns x rows and x is greater than 50, I want only 50 rows returned - not just the top 50, but 50 that are evenly spread out over the resultset. The table in this case records routes - GPS locations + DateTime.
I am ordering on DateTime and need a reasonable sample of the Latitude & Longitude values.
Thanks in advance
[ SQL Server 2008 ]
To get sample rows in SQL Server, use this query:
SELECT TOP 50 * FROM Table
ORDER BY NEWID();
If you want to get every n-th row (10th, in this example), try this query:
SELECT * From
(
SELECT *, (Dense_Rank() OVER (ORDER BY Column ASC)) AS Rank
FROM Table
) AS Ranking
WHERE Rank % 10 = 0;
More examples of queries selecting random rows for other popular RDBMS can be found here: http://www.petefreitag.com/item/466.cfm
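Since the question asks for rows evenly spread over the resultset, NTILE is another SQL Server 2008 option: assign every row to one of 50 buckets ordered by DateTime, then keep one row per bucket. A sketch (table and column names assumed):
SELECT *
FROM (
SELECT b.*, ROW_NUMBER() OVER (PARTITION BY bucket ORDER BY DateTime) AS rn
FROM (
SELECT t.*, NTILE(50) OVER (ORDER BY DateTime) AS bucket
FROM RouteLog t
) AS b
) AS c
WHERE rn = 1;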
Every n-th row, to get 50 in total. Window functions cannot appear directly in a WHERE clause, so the row number has to come from a derived table:
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER (ORDER BY DateTime) AS rn,
(SELECT COUNT(*) FROM table) / 50 AS step
FROM table t
) numbered
WHERE MOD(rn, step) = 0
FETCH FIRST 50 ROWS ONLY
And if you want a random sample, go with jimmy_keen's answer.
UPDATE:
In regard to the requirement for it to run on MS SQL, I think it should be changed to this (no MS SQL Server around to test, though):
SELECT TOP 50 *
FROM (
SELECT t.*, ROW_NUMBER() OVER (ORDER BY DateTime) AS rn,
(SELECT COUNT(*) FROM table) / 50 AS step
FROM table t
) numbered
WHERE rn % step = 0
I suggest that you add a calculated column to your resultset that holds a random number, and then select the top 50 rows sorted by that column. That will give you a random sample.
For example:
SELECT TOP 50 *, RAND(Id) AS Random
FROM SourceData
ORDER BY Random
where SourceData is your source data table or view. This assumes T-SQL on SQL Server 2008, by the way. It also assumes that you have an Id column with unique ids on your data source. If your ids are very low numbers, it is a good practice to multiply them by a large integer before passing them to RAND, like this:
RAND(Id * 10000000)
If you want a statistically correct sample, TABLESAMPLE is the wrong solution. A good solution, as I described here based on a Microsoft Research paper, is to create a materialized view over your table which includes an additional column like
CAST(ROW_NUMBER() OVER (...) AS BYTE) AS RAND_COL_. You can then add an index on this column, plus other interesting columns, and get statistically correct samples for your queries fairly quickly (by using WHERE RAND_COL_ = 1).

Can there be a database-agnostic SQL query to fetch top N rows?

We want to be able to select top N rows using a SQL Query. The target database could be Oracle or MySQL. Is there an elegant approach to this? (Needless to say, we're dealing with sorted data here.)
To get the top 5 scorers from this table:
CREATE TABLE people
(id int,
name varchar(100),
score int)
try this SQL:
SELECT id,
name,
score
FROM people p
WHERE (SELECT COUNT(*)
FROM people p2
WHERE p2.score > p.score
) <= 4
I believe this should work in most places.
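One caveat: ties can produce more than N rows. For example, with this hypothetical data, two people share the 5th-highest score, and both pass the filter:
INSERT INTO people VALUES
(1, 'a', 90), (2, 'b', 80), (3, 'c', 70),
(4, 'd', 60), (5, 'e', 50), (6, 'f', 50);
-- only 4 scores beat 50, so both rows with score 50 satisfy COUNT(*) <= 4,
-- and the query returns 6 rows instead of 5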
No. The syntax is different.
You may, however, create views:
/* Oracle */
CREATE VIEW v_table
AS
SELECT *
FROM (
SELECT *
FROM table
ORDER BY
column
)
WHERE rownum <= n
/* MySQL */
CREATE VIEW v_table
AS
SELECT *
FROM table
ORDER BY
column
LIMIT n
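Either way, the application can then issue the same portable statement against the view on both databases:
SELECT * FROM v_table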
I don't think that's possible even just between MySQL and MSSQL. I do have an option for simulating such behaviour, though:
create views that have an auto incremented int column; say 'PagingHelperID'
write queries like: SELECT columns FROM viewname WHERE PagingHelperID BETWEEN startindex AND stopindex
This will make ordering difficult: you will need different views for every order in which you intend to retrieve data.
You could also "rewrite" your SQL on the fly when querying, depending on the database, and define your own method for the rewriter, but I don't think there is any "good" way to do this.
If there is a unique key on the table, yes...
Select * From Table O
Where (Select Count(*) From Table I
Where [UniqueKeyValue] < O.UniqueKeyValue) < N
You can substitute your own criteria if you want the "Top" definition to be based on some other logic than on the unique key...
EDIT: If the "sort" that defines the meaning of "Top" is based on a non-unique column, or set of columns, then you can still use this, but you can't guarantee you will be able to get exactly N records out...
Select * From Table O
Where (Select Count(*) From Table I
Where nonUniqueCol < O.nonUniqueCol) < 10
If records 8, 9, 10, 11, and 12 all have the same value in [nonUniqueCol], then the query will either generate only 7 records (with '<') or 12 (with '<=').
NOTE: As this involves a correlated sub-query, the performance can be an issue for very large tables...
Starting with MySQL 8, you can use ROW_NUMBER() filtering to get the semantics of LIMIT (MySQL) or FETCH (Oracle) in a uniform, standards compliant syntax:
SELECT t.a, t.b, t.c, t.o
FROM (
SELECT a, b, c, o, ROW_NUMBER() OVER (ORDER BY o) AS rn
FROM x
) t
WHERE rn <= :limit
ORDER BY o
But this is likely to be less efficient than using the vendor specific syntax, so if you have some means of abstracting over LIMIT and FETCH (e.g. using an ORM like jOOQ or Hibernate, or even some templating language), that should be preferred.
The big problem, after looking this over, is that MySQL isn't ISO SQL:2003 compliant. If it were, you'd have these handy windowing functions:
SELECT * from
( SELECT
RANK() OVER (ORDER BY <blah>) AS ranking,
<rest of columns here>
FROM <table>
) t
WHERE ranking <= <N>
Alas, MySQL (and others that mimic its behavior, e.g. SQLite) do not, hence the whole limiting issue.
Check out this snippet from Wikipedia (http://en.wikipedia.org/wiki/Window_function_(SQL)#Limiting_result_rows)