I have a simple MySQL table like below, used to compute MPG for a car.
+-------------+-------+---------+
| DATE | MILES | GALLONS |
+-------------+-------+---------+
| JAN 25 1993 | 20.0 | 3.00 |
| FEB 07 1993 | 55.2 | 7.22 |
| MAR 11 1993 | 44.1 | 6.28 |
+-------------+-------+---------+
I can easily compute the miles per gallon (MPG) for the car using a SELECT statement, but because the MPG varies widely from fill-up to fill-up (i.e. you don't fill the exact same amount of gas each time), I would like to compute a 'MOVING AVERAGE' as well. So for any row the MPG is MILES/GALLONS for that row, and the MOVINGMPG is SUM(MILES)/SUM(GALLONS) over the last N rows. If fewer than N rows exist by that point, just use SUM(MILES)/SUM(GALLONS) up to that point.
Is there a single SELECT statement that will fetch the rows with MPG and MOVINGMPG by substituting N into the select statement?
Yes, it's possible to return the specified resultset with a single SQL statement.
Unfortunately, MySQL does not support analytic functions, which would make for a fairly simple statement. Even though MySQL does not have syntax to support them, it is possible to emulate some analytic functions using MySQL user variables.
One of the ways to achieve the specified result set (with a single SQL statement) is to use a JOIN operation, assigning a unique ascending integer value (rownum, derived within the query) to each row.
For example:
SELECT q.rownum AS rownum
, q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM ( SELECT @s_rownum := @s_rownum + 1 AS rownum
, s.date
, s.miles
, s.gallons
FROM mytable s
JOIN (SELECT @s_rownum := 0) c
ORDER BY s.date
) q
JOIN ( SELECT @t_rownum := @t_rownum + 1 AS rownum
, t.date
, t.miles
, t.gallons
FROM mytable t
JOIN (SELECT @t_rownum := 0) d
ORDER BY t.date
) r
ON r.rownum <= q.rownum
AND r.rownum > q.rownum - 2
GROUP BY q.rownum
Your desired value of "n" (how many rows to include in each rollup row) is specified in the predicate just before the GROUP BY clause; in this example, up to "2" rows go into each running-total row.
If you specify a value of 1, you will get (basically) the original table returned.
To eliminate any "incomplete" running total rows (consisting of fewer than "n" rows), that value of "n" would need to be specified again, adding:
HAVING COUNT(1) >= 2
sqlfiddle demo: http://sqlfiddle.com/#!2/52420/2
Followup:
Q: I'm trying to understand your SQL statement. Does your solution do a select of twenty rows for each row in the db? In other words, if I have 1000 rows will your statement perform 20000 selects? (I'm worried about performance)...
A: You are right to be concerned with performance.
To answer your question, no, this does not perform 20,000 selects for 1,000 rows.
The performance hit comes from the two (essentially identical) inline views (aliased as q and r). What MySQL does with these (basically) is create temporary MyISAM tables (MySQL calls them "derived tables"), which are basically copies of mytable, with an extra column, each row assigned a unique integer value from 1 to the number of rows.
Once the two "derived" tables are created and populated, MySQL runs the outer query, using those two "derived" tables as a row source. Each row from q, is matched with up to n rows from r, to calculate the "running total" miles and gallons.
For better performance, you could use a column already in the table, rather than having the query assign unique integer values. For example, if the date column is unique, then you could calculate "running total" over a certain period of days.
SELECT q.date AS latest_date
, SUM(q.miles)/SUM(q.gallons) AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM mytable q
JOIN mytable r
ON r.date <= q.date
AND r.date > q.date + INTERVAL -30 DAY
GROUP BY q.date
(For performance, you would want an appropriate index defined with date as a leading column in the index.)
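For example, a minimal sketch of such an index (the index name is my own choice; including miles and gallons as well would let the rolling query be served from the index alone):
CREATE INDEX mytable_date_mpg_idx ON mytable (date, miles, gallons);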
For the first query, any predicates included (in the inline view definition queries) to reduce the number of rows returned (for example, return only date values in the past year) would reduce the number of rows to be processed, and would also likely improve performance.
Again, to your question about running 20,000 selects for 1,000 rows... a nested loops operation is another way to get the same result set. For a large number of rows, this can exhibit slower performance. (On the other hand, this approach can be fairly efficient when only a few rows are being returned.) For example:
SELECT q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, ( SELECT SUM(r.miles)/SUM(r.gallons)
FROM mytable r
WHERE r.date <= q.date
AND r.date >= q.date + INTERVAL -90 DAY
) AS rtot_mpg
FROM mytable q
ORDER BY q.date
Something like this should work:
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallMiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal
FROM YourTable
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
SQL Fiddle Demo
Which produces the following:
DATE MILES GALLONS MILESPERGALLON RUNNINGTOTAL
January, 25 1993 20 3 6.666667 6.666666666667
February, 07 1993 55.2 7.22 7.645429 7.358121330724
March, 11 1993 44.1 6.28 7.022293 7.230303030303
--EDIT--
In response to the comment, you can add another Row Number to limit your results to the last N rows:
SELECT *
FROM (
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallMiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal,
@RowNumber:=@RowNumber+1 rowNumber
FROM (SELECT * FROM YourTable ORDER BY Date DESC) u
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
JOIN (SELECT @RowNumber:= 0) r
) t
WHERE rowNumber <= 3
Just change your ORDER BY clause accordingly. And here is the updated fiddle.
Related
As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.
Specifically, if I've got a purchases table that looks like this:
SELECT * FROM purchases;
My Output:
 id | customer | total
----+----------+-------
  1 | Joe      |     5
  2 | Sally    |     3
  3 | Joe      |     2
  4 | Sally    |     1
I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:
SELECT FIRST(id), customer, FIRST(total)
FROM purchases
GROUP BY customer
ORDER BY total DESC;
Expected Output:
 FIRST(id) | customer | FIRST(total)
-----------+----------+--------------
         1 | Joe      |            5
         2 | Sally    |            3
DISTINCT ON is typically simplest and fastest for this in PostgreSQL.
(For performance optimization for certain workloads see below.)
SELECT DISTINCT ON (customer)
id, customer, total
FROM purchases
ORDER BY customer, total DESC, id;
Or shorter (if not as clear) with ordinal numbers of output columns:
SELECT DISTINCT ON (2)
id, customer, total
FROM purchases
ORDER BY 2, 3 DESC, 1;
If total can be null, add NULLS LAST:
...
ORDER BY customer, total DESC NULLS LAST, id;
Works either way, but you'll want to match existing indexes
db<>fiddle here
Major points
DISTINCT ON is a PostgreSQL extension of the standard, where only DISTINCT on the whole SELECT list is defined.
List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates. The manual:
Obviously, two rows are considered distinct if they differ in at least
one column value. Null values are considered equal in this
comparison.
Bold emphasis mine.
DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY must be in the set of expressions in DISTINCT ON, but you can rearrange order among those freely. Example.
You can add additional expressions to ORDER BY to pick a particular row from each group of peers. Or, as the manual puts it:
The DISTINCT ON expression(s) must match the leftmost ORDER BY
expression(s). The ORDER BY clause will normally contain additional
expression(s) that determine the desired precedence of rows within
each DISTINCT ON group.
I added id as last item to break ties:
"Pick the row with the smallest id from each group sharing the highest total."
To order results in a way that disagrees with the sort order determining the first per group, you can nest above query in an outer query with another ORDER BY. Example.
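A minimal sketch of that nesting, reusing the column list from the query above (the outer ORDER BY is free to differ from the inner one):
SELECT *
FROM  (
   SELECT DISTINCT ON (customer)
          id, customer, total
   FROM   purchases
   ORDER  BY customer, total DESC, id
   ) sub
ORDER BY total DESC;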
If total can be null, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated. See:
Sort by column ASC, but NULL values first?
The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way:
You don't have to include any of the expressions in DISTINCT ON or ORDER BY.
You can include any other expression in the SELECT list. This is instrumental for replacing complex subqueries and aggregate / window functions.
I tested with Postgres versions 8.3 – 15. But the feature has been there at least since version 7.1, so basically always.
Index
The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
May be too specialized. But use it if read performance for the particular query is crucial. If you have DESC NULLS LAST in the query, use the same in the index so that sort order matches and the index is perfectly applicable.
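For instance, the NULLS LAST variant of the same index might look like this (the index name is my own):
CREATE INDEX purchases_3c_nl_idx ON purchases (customer, total DESC NULLS LAST, id);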
Effectiveness / Performance optimization
Weigh cost and benefit before creating tailored indexes for each query. The potential of above index largely depends on data distribution.
The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though. Example.
For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more:
Configuration parameter work_mem in PostgreSQL on Linux
Optimize simple query using ORDER BY date and text
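As a rough sketch of scoping work_mem to a single transaction with SET LOCAL (the value shown is only a placeholder; pick one based on EXPLAIN ANALYZE):
BEGIN;
SET LOCAL work_mem = '256MB';  -- applies to this transaction only
SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;
COMMIT;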
For many rows per customer (low cardinality in column customer), an "index skip scan" or "loose index scan" would be (much) more efficient. But that's not implemented up to Postgres 15. Serious work to implement it one way or another has been ongoing for years now, but so far unsuccessful. See here and here.
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:
SELECT DISTINCT is slower than expected on my table in PostgreSQL
Optimize GROUP BY query to retrieve latest row per user
Optimize groupwise maximum query
Query last N related rows per row
Benchmarks
See separate answer.
On databases that support CTE and windowing functions:
WITH summary AS (
SELECT p.id,
p.customer,
p.total,
ROW_NUMBER() OVER(PARTITION BY p.customer
ORDER BY p.total DESC) AS rank
FROM PURCHASES p)
SELECT *
FROM summary
WHERE rank = 1
Supported by any database:
But you need to add logic to break ties:
SELECT MIN(x.id), -- change to MAX if you want the highest
x.customer,
x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
MAX(total) AS max_total
FROM PURCHASES p
GROUP BY p.customer) y ON y.customer = x.customer
AND y.max_total = x.total
GROUP BY x.customer, x.total
Benchmarks
I tested the most interesting candidates:
Initially with Postgres 9.4 and 9.5.
Added accented tests for Postgres 13 later.
Basic test setup
Main table: purchases:
CREATE TABLE purchases (
id serial -- PK constraint added below
, customer_id int -- REFERENCES customer
, total int -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);
Dummy data (with some dead tuples), PK, index:
INSERT INTO purchases (customer_id, total, some_column) -- 200k rows
SELECT (random() * 10000)::int AS customer_id -- 10k distinct customers
, (random() * random() * 100000)::int AS total
, 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM generate_series(1,200000) g;
ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);
DELETE FROM purchases WHERE random() > 0.9; -- some dead rows
INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int AS customer_id -- 10k customers
, (random() * random() * 100000)::int AS total
, 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM generate_series(1,20000) g; -- add 20k to make it ~ 200k
CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);
VACUUM ANALYZE purchases;
customer table - used for optimized query:
CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM purchases
GROUP BY 1
ORDER BY 1;
ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);
VACUUM ANALYZE customer;
In my second test for 9.5 I used the same setup, but with 100000 distinct customer_id to get few rows per customer_id.
Object sizes for table purchases
Basic setup: 200k rows in purchases, 10k distinct customer_id, avg. 20 rows per customer.
For Postgres 9.5 I added a 2nd test with 86446 distinct customers - avg. 2.3 rows per customer.
Generated with a query taken from here:
Measure the size of a PostgreSQL table row
Gathered for Postgres 9.5:
what | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
core_relation_size | 20496384 | 20 MB | 102
visibility_map | 0 | 0 bytes | 0
free_space_map | 24576 | 24 kB | 0
table_size_incl_toast | 20529152 | 20 MB | 102
indexes_size | 10977280 | 10 MB | 54
total_size_incl_toast_and_indexes | 31506432 | 30 MB | 157
live_rows_in_text_representation | 13729802 | 13 MB | 68
------------------------------ | | |
row_count | 200045 | |
live_tuples | 200045 | |
dead_tuples | 19955 | |
Queries
1. row_number() in CTE, (see other answer)
WITH cte AS (
SELECT id, customer_id, total
, row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
)
SELECT id, customer_id, total
FROM cte
WHERE rn = 1;
2. row_number() in subquery (my optimization)
SELECT id, customer_id, total
FROM (
SELECT id, customer_id, total
, row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
) sub
WHERE rn = 1;
3. DISTINCT ON (see other answer)
SELECT DISTINCT ON (customer_id)
id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC, id;
4. rCTE with LATERAL subquery (see here)
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC
LIMIT 1
)
UNION ALL
SELECT u.*
FROM cte c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id > c.customer_id -- lateral reference
ORDER BY customer_id, total DESC
LIMIT 1
) u
)
SELECT id, customer_id, total
FROM cte
ORDER BY customer_id;
5. customer table with LATERAL (see here)
SELECT l.*
FROM customer c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id = c.customer_id -- lateral reference
ORDER BY total DESC
LIMIT 1
) l;
6. array_agg() with ORDER BY (see other answer)
SELECT (array_agg(id ORDER BY total DESC))[1] AS id
, customer_id
, max(total) AS total
FROM purchases
GROUP BY customer_id;
Results
Execution time for the above queries with EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF), best of 5 runs, to compare with warm cache.
All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some only to benefit from the smaller size of the index, others more effectively.
A. Postgres 9.4 with 200k rows and ~ 20 per customer_id
1. 273.274 ms
2. 194.572 ms
3. 111.067 ms
4. 92.922 ms -- !
5. 37.679 ms -- winner
6. 189.495 ms
B. Same as A. with Postgres 9.5
1. 288.006 ms
2. 223.032 ms
3. 107.074 ms
4. 78.032 ms -- !
5. 33.944 ms -- winner
6. 211.540 ms
C. Same as B., but with ~ 2.3 rows per customer_id
1. 381.573 ms
2. 311.976 ms
3. 124.074 ms -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms
Retest with Postgres 13 on 2021-08-11
Simplified test setup: no deleted rows, because VACUUM ANALYZE cleans the table completely for the simple case.
Important changes for Postgres:
General performance improvements.
CTEs can be inlined since Postgres 12, so query 1. and 2. now perform mostly identical (same query plan).
D. Like B. ~ 20 rows per customer_id
1. 103 ms
2. 103 ms
3. 23 ms -- winner
4. 71 ms
5. 22 ms -- winner
6. 81 ms
db<>fiddle here
E. Like C. ~ 2.3 rows per customer_id
1. 127 ms
2. 126 ms
3. 36 ms -- winner
4. 620 ms
5. 145 ms
6. 203 ms
db<>fiddle here
Accented tests with Postgres 13
1M rows, 10,000 vs. 100 vs. 1.6 rows per customer.
F. with ~ 10,000 rows per customer
1. 526 ms
2. 527 ms
3. 127 ms
4. 2 ms -- winner !
5. 1 ms -- winner !
6. 356 ms
db<>fiddle here
G. with ~ 100 rows per customer
1. 535 ms
2. 529 ms
3. 132 ms
4. 108 ms -- !
5. 71 ms -- winner
6. 376 ms
db<>fiddle here
H. with ~ 1.6 rows per customer
1. 691 ms
2. 684 ms
3. 234 ms -- winner
4. 4669 ms
5. 1089 ms
6. 1264 ms
db<>fiddle here
Conclusions
DISTINCT ON uses the index effectively and typically performs best for few rows per group. And it performs decently even with many rows per group.
For many rows per group, emulating an index skip scan with an rCTE performs best - second only to the query technique with a separate lookup table (if that's available).
The row_number() technique demonstrated in the currently accepted answer never wins any performance test. Not then, not now. It never comes even close to DISTINCT ON, not even when the data distribution is unfavorable for the latter. The only good thing about row_number(): it does not scale terribly, just mediocre.
More benchmarks
Benchmark by "ogr" with 10M rows and 60k unique "customers" on Postgres 11.5. Results are in line with what we have seen so far:
Proper way to access latest row for each individual identifier?
Original (outdated) benchmark from 2011
I ran three tests with PostgreSQL 9.1 on a real life table of 65579 rows and single-column btree indexes on each of the three columns involved and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):
Select the whole table, results in 5958 rows in this case.
A: 567.218 ms
B: 386.673 ms
Use condition WHERE customer BETWEEN x AND y resulting in 1000 rows.
A: 249.136 ms
B: 55.111 ms
Select a single customer with WHERE customer = x.
A: 0.143 ms
B: 0.072 ms
Same test repeated with the index described in the other answer:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
1A: 277.953 ms
1B: 193.547 ms
2A: 249.796 ms -- special index not used
2B: 28.679 ms
3A: 0.120 ms
3B: 0.048 ms
This is the common greatest-n-per-group problem, which already has well-tested and highly optimized solutions. Personally, I prefer the LEFT JOIN solution by Bill Karwin (the original post with lots of other solutions).
Note that a bunch of solutions to this common problem can, surprisingly, be found in the MySQL manual -- even though your problem is in Postgres, not MySQL, the solutions given should work with most SQL variants. See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.
In Postgres you can use array_agg like this:
SELECT customer,
(array_agg(id ORDER BY total DESC))[1],
max(total)
FROM purchases
GROUP BY customer
This will give you the id of each customer's largest purchase.
Some things to note:
array_agg is an aggregate function, so it works with GROUP BY.
array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default.
Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed).
You could use array_agg in a similar way for your third output column, but max(total) is simpler.
Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons; a combined sketch follows.
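Combining the notes above, a sketch that orders NULLs explicitly inside the aggregate and adds another aggregate alongside (the NULLS LAST only matters if total can actually be null):
SELECT customer,
       (array_agg(id ORDER BY total DESC NULLS LAST))[1] AS id_of_largest,
       max(total) AS largest_total,
       count(*)   AS purchase_count
FROM purchases
GROUP BY customer;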
The Query:
SELECT purchases.*
FROM purchases
LEFT JOIN purchases as p
ON
p.customer = purchases.customer
AND
purchases.total < p.total
WHERE p.total IS NULL
HOW DOES THAT WORK? (I've been there.)
We want to make sure that we only keep the row with the highest total for each customer.
Some Theoretical Stuff (skip this part if you only want to understand the query)
Let total be a function T(customer, id) that returns the total for a given customer and id.
To prove that a given total T(customer, id) is the highest, we want to prove either
∀x T(customer,id) > T(customer,x) (this total is higher than all other
total for that customer)
OR
¬∃x T(customer, id) < T(customer, x) (there exists no higher total for
that customer)
The first approach would require fetching all the records for that customer, which I do not really like.
The second one needs a smart way of saying there can be no record with a higher total than this one.
Back to SQL
If we LEFT JOIN the table to itself on the customer matching and the total being less than the joined table's total:
LEFT JOIN purchases as p
ON
p.customer = purchases.customer
AND
purchases.total < p.total
we make sure that every record that has another record with a higher total for the same customer gets joined to it:
+--------------+---------------------+-----------------+------+------------+---------+
| purchases.id | purchases.customer | purchases.total | p.id | p.customer | p.total |
+--------------+---------------------+-----------------+------+------------+---------+
| 1 | Tom | 200 | 2 | Tom | 300 |
| 2 | Tom | 300 | | | |
| 3 | Bob | 400 | 4 | Bob | 500 |
| 4 | Bob | 500 | | | |
| 5 | Alice | 600 | 6 | Alice | 700 |
| 6 | Alice | 700 | | | |
+--------------+---------------------+-----------------+------+------------+---------+
That lets us filter for the row with the highest total for each customer, with no grouping needed:
WHERE p.total IS NULL
+--------------+----------------+-----------------+------+--------+---------+
| purchases.id | purchases.name | purchases.total | p.id | p.name | p.total |
+--------------+----------------+-----------------+------+--------+---------+
| 2 | Tom | 300 | | | |
| 4 | Bob | 500 | | | |
| 6 | Alice | 700 | | | |
+--------------+----------------+-----------------+------+--------+---------+
And that's the answer we need.
This solution is not very efficient, as pointed out by Erwin, because of the presence of subqueries.
select * from purchases p1 where total in
(select max(total) from purchases where p1.customer=customer) order by total desc;
I use this approach (PostgreSQL only): https://wiki.postgresql.org/wiki/First/last_%28aggregate%29
-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
SELECT $1;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
sfunc = public.first_agg,
basetype = anyelement,
stype = anyelement
);
-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
SELECT $2;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
Then your example should work almost as is:
SELECT FIRST(id), customer, FIRST(total)
FROM purchases
GROUP BY customer
ORDER BY FIRST(total) DESC;
CAVEAT: It ignores NULLs.
Edit 1 - Use the postgres extension instead
Now I use this way: http://pgxn.org/dist/first_last_agg/
To install on ubuntu 14.04:
apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_agg
make && sudo make install
psql -c 'create extension first_last_agg'
It's a postgres extension that gives you first and last functions; apparently faster than the above way.
Edit 2 - Ordering and filtering
If you use aggregate functions (like these), you can order the results, without the need to have the data already ordered:
http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-AGGREGATES
So the equivalent example, with ordering would be something like:
SELECT first(id order by id), customer, first(total order by id)
FROM purchases
GROUP BY customer
ORDER BY first(total);
Of course you can order and filter as you deem fit within the aggregate; it's very powerful syntax.
Use ARRAY_AGG function for PostgreSQL, U-SQL, IBM DB2, and Google BigQuery SQL:
SELECT customer, (ARRAY_AGG(id ORDER BY total DESC))[1], MAX(total)
FROM purchases
GROUP BY customer
In SQL Server you can do this:
SELECT *
FROM (
SELECT ROW_NUMBER()
OVER(PARTITION BY customer
ORDER BY total DESC) AS StRank, *
FROM Purchases) n
WHERE StRank = 1
Explanation: rows are partitioned by customer and ordered by total descending; each row within its partition gets a serial number as StRank, and we keep the first row of each partition, i.e. the one whose StRank is 1.
Very fast solution
SELECT a.*
FROM
purchases a
JOIN (
SELECT customer, min( id ) as id
FROM purchases
GROUP BY customer
) b USING ( id );
and really very fast if the table is indexed by id:
create index purchases_id on purchases (id);
Snowflake/Teradata support the QUALIFY clause, which works like HAVING for window functions:
SELECT id, customer, total
FROM PURCHASES p
QUALIFY ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) = 1
In PostgreSQL, another possibility is to use the first_value window function in combination with SELECT DISTINCT:
select distinct customer_id,
first_value(row(id, total)) over(partition by customer_id order by total desc, id)
from purchases;
I create a composite row value (id, total), so both values are returned by the same window function call. You can of course always apply first_value() twice instead.
This works for me:
SELECT article, dealer, price
FROM shop s1
WHERE price=(SELECT MAX(s2.price)
FROM shop s2
WHERE s1.article = s2.article
GROUP BY s2.article)
ORDER BY article;
It selects the highest price for each article.
This is how we can achieve it using window functions:
create table purchases (id int4, customer varchar(10), total integer);
insert into purchases values (1, 'Joe', 5);
insert into purchases values (2, 'Sally', 3);
insert into purchases values (3, 'Joe', 2);
insert into purchases values (4, 'Sally', 1);
select ID, CUSTOMER, TOTAL from (
select ID, CUSTOMER, TOTAL,
row_number () over (partition by CUSTOMER order by TOTAL desc) RN
from purchases) A where RN = 1;
The accepted "Supported by any database" solution from OMG Ponies has good speed in my test.
Here I provide the same approach as a more complete and clean any-database solution. Ties are handled (assuming you want only one row per customer, even when there are multiple records with the max total for that customer), and other purchase fields (e.g. purchase_payment_id) are selected for the actual matching rows in the purchase table.
Supported by any database:
select * from purchase
join (
select min(id) as id from purchase
join (
select customer, max(total) as total from purchase
group by customer
) t1 using (customer, total)
group by customer
) t2 using (id)
order by customer
This query is reasonably fast especially when there is a composite index like (customer, total) on the purchase table.
Remark:
t1 and t2 are subquery aliases, which could be removed depending on the database.
Caveat: the using (...) clause is currently not supported in MS-SQL and Oracle db as of this edit on Jan 2017. You have to expand it yourself to e.g. on t2.id = purchase.id etc. The USING syntax works in SQLite, MySQL and PostgreSQL.
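As a sketch of what that expansion might look like (the same query as above with USING replaced by explicit ON conditions; untested against MS-SQL/Oracle):
select purchase.* from purchase
join (
    select min(purchase.id) as id from purchase
    join (
        select customer, max(total) as total from purchase
        group by customer
    ) t1 on t1.customer = purchase.customer
        and t1.total = purchase.total
    group by purchase.customer
) t2 on t2.id = purchase.id
order by purchase.customer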
If you want to select any row (picked by some specific condition of your own) from each set of aggregated rows, or if you want to use another aggregation function (sum/avg) in addition to max/min, you cannot rely on the DISTINCT ON trick.
In that case you can use the following subquery:
SELECT
(
SELECT id FROM t2
WHERE id = ANY ( ARRAY_AGG( tf.id ) ) AND amount = MAX( tf.amount )
) id,
name,
MAX(amount) ma,
SUM( ratio )
FROM t2 tf
GROUP BY name
You can replace amount = MAX( tf.amount ) with any condition you want, with one restriction: the subquery must not return more than one row.
But if you want to do such things, you are probably looking for window functions; a sketch follows.
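For instance, a hedged window-function sketch that picks one row per name while still exposing another aggregate (column names are taken from the query above and assumed to exist in t2):
SELECT id, name, amount AS ma, total_ratio
FROM (
    SELECT id, name, amount,
           SUM( ratio ) OVER ( PARTITION BY name ) AS total_ratio,
           ROW_NUMBER() OVER ( PARTITION BY name ORDER BY amount DESC ) AS rn
    FROM t2
) s
WHERE rn = 1;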
For SQL Server the most efficient way is:
with
ids as ( --condition for split table into groups
select i from (values (9),(12),(17),(18),(19),(20),(22),(21),(23),(10)) as v(i)
)
,src as (
select * from yourTable where <condition> --use this as filter for other conditions
)
,joined as (
select tops.* from ids
cross apply -- it's like a for-each over the rows
(
select top(1) *
from src
where CommodityId = ids.i
) as tops
)
select * from joined
and don't forget to create a clustered index on the columns used.
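As a sketch (table and column names are taken from the example above; this assumes the table does not already have a clustered index):
CREATE CLUSTERED INDEX IX_yourTable_CommodityId ON yourTable (CommodityId);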
This can be achieved easily with the MAX() function on total and GROUP BY id and customer.
SELECT id, customer, MAX(total) FROM purchases GROUP BY id, customer
ORDER BY total DESC;
My approach via window functions (dbfiddle):
Assign a row_number within each group: row_number() over (partition by agreement_id, order_id) as nrow
Take only the first row of each group: filter (where nrow = 1)
with intermediate as (select
*,
row_number() over ( partition by agreement_id, order_id ) as nrow,
(sum( suma ) over ( partition by agreement_id, order_id ))::numeric( 10, 2) as order_suma
from <your table>)
select
*,
sum( order_suma ) filter (where nrow = 1) over (partition by agreement_id)
from intermediate
I have this:
SELECT * FROM history JOIN value WHERE history.the_date >= value.the_date
Is it possible to somehow phrase this so that history.the_date is greater than or equal to the largest possible value of value.the_date?
HISTORY
the_date amount
2014-02-27 200
2015-02-26 2000
VALUE
the_date interest
2010-02-10 2
2015-01-01 3
I need to pair the correct interest with the amount!
So value.the_date is the date since when the interest is valid. Interest 2 was valid from 2010-02-10 till 2014-12-31, because since 2015-01-01 the new interest 3 applies.
To get the current interest for a date you'd use a subquery where you select all interest records with a valid-from date up to then and only keep the latest:
select
the_date,
amount,
(
select v.interest
from value v
where v.the_date <= h.the_date
order by v.the_date desc
limit 1
) as interest
from history h;
Use the join condition after ON, not in the WHERE clause:
SELECT * FROM history JOIN (select max(value.the_date) as d from value) as x on history.the_date >= x.d
WHERE 1=1
Presumably, you want this:
select h.*
from history h
where h.the_date >= (select max(v.the_date) from value v);
I have a Transactions table with over 2,500,000 rows and three columns (that are relevant): id, company_id, and created_at. id identifies the transaction, company_id identifies which company received it, created_at is a timestamp with the time that transaction was performed.
What I want is to get a list of the differences between every consecutive pair of transactions of a given company. In other words, if my table goes:
id | company_id | created_at
------------------------------
01 | ab | 2016/01/02
02 | ab | 2016/01/03
03 | cd | 2016/01/03
04 | ab | 2016/01/03
05 | cd | 2016/01/04
06 | ab | 2016/01/05
(Note that there may be an arbitrary number of transactions of other companies between two consecutive transaction of a given company.)
Then I want the output to be:
diff | company_id
-------------------
01 | ab
00 | ab
01 | cd
02 | ab
(I wrote the created_at and diff values in days, but that's just for ease of visualisation.)
I tried using this but it was too slow.
--EDIT:
"This" is:
SELECT (B.created_at - A.created_at) AS diff, A.company_id
FROM Transactions A CROSS JOIN Transactions B
WHERE B.id IN (SELECT MIN (C.id) FROM Transactions C WHERE C.id > A.id AND C.company_id = A.company_id)
ORDER BY A.id ASC
To get a result like the one it looks like you're expecting, I will sometimes make use of MySQL user-defined variables, and have MySQL perform the processing of the rows "in order", so I can compare the current row to values from the previous row.
For this to run efficiently, we'll want an appropriate index, to avoid an expensive "Using filesort" operation. (We're going to need the rows in company_id order, then by id order, so those will be the first two columns in the index. While we're at it, we might just as well include the created_at column and make it a covering index.)
... ON Transactions (company_id, id, created_at)
Then we can try a query like this:
SELECT t.diff
, t.company_id
FROM (
SELECT IF(r.company_id = @pv_company_id, r.created_at - @pv_created_at, NULL) AS diff
, IF(r.company_id = @pv_company_id, 1, 0) AS include_
, @pv_company_id := r.company_id AS company_id
, @pv_created_at := r.created_at AS created_at
FROM (SELECT @pv_company_id := NULL, @pv_created_at := NULL) i
CROSS
JOIN Transactions r
ORDER
BY r.company_id
, r.id
) t
WHERE t.include_
The MySQL Reference Manual explicitly warns against using user-defined variables like this within a statement. But the behavior we observe in MySQL 5.1 and 5.5 is consistent. (The big problem is that some future version of MySQL could use a different execution plan.)
The inline view aliased as i is just to initialize a couple of user-defined variables. We could just as easily do that as a separate step, before we run our query. But I like to include the initialization right in the statement itself, so I don't need a separate SELECT/SET statement.
MySQL accesses the Transactions table, and processes the ORDER BY first, ordering the rows from Transactions in (company_id, id) order. (We prefer to have this done via an index, rather than via an expensive "Using filesort" operation, which is why we want that index defined, with company_id and id as the leading columns.)
The "trick" is saving the values from the current row into user-defined variables. When processing the next row, the values from the previous row are available in the user-defined variables, for performing comparisons (is the current row for the same company_id as the previous row?) and for performing a calculation (the difference between the created_at values of the two rows).
Based on the usage of the subtraction operation, I'm assuming that the created_at column is integer/numeric. That is, I'm assuming that created_at is not a DATE, DATETIME, or TIMESTAMP, because with those types we wouldn't use the subtraction operator to find a difference. For example:
SELECT a
, b
, a - b AS `subtraction`
, DATEDIFF(a,b) AS `datediff`
, TIMESTAMPDIFF(DAY,b,a) AS `tsdiff`
FROM ( SELECT DATE('2015-02-17') AS a
, DATE('2015-01-16') AS b
) t
returns:
a b subtraction datediff tsdiff
---------- ---------- ----------- -------- ------
2015-02-17 2015-01-16 101 32 32
(The subtraction operation doesn't throw an error. But what it returns may be unexpected. In this example, it returns the difference between two integer values 20150217 and 20150116, which is not the number of days between the two DATE expressions.)
EDIT
I notice that the original query includes an ORDER BY. If you need the rows returned in a specific order, you can include that column in the inline view query, and use an ORDER BY on the outer query.
SELECT t.diff
, t.company_id
FROM (
SELECT IF(r.company_id = @pv_company_id, r.created_at - @pv_created_at, NULL) AS diff
, IF(r.company_id = @pv_company_id, 1, 0) AS include_
, @pv_company_id := r.company_id AS company_id
, @pv_created_at := r.created_at AS created_at
, r.id AS id
FROM (SELECT @pv_company_id := NULL, @pv_created_at := NULL) i
CROSS
JOIN Transactions r
ORDER
BY r.company_id
, r.id
) t
WHERE t.include_
ORDER BY t.id
Sorry, there's no getting around a "Using filesort" for the ORDER BY on the outer query.
You could use cursor functionality. If you open a cursor, you slide over every row and compute a difference for every two rows fetched. I think this method is more efficient because it slides over all the rows of the table once instead of making a join over two and a half million rows.
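A rough sketch of that idea as a MySQL stored procedure (the procedure name, the temporary results table, and the column types are my own assumptions, not part of the original suggestion):
DELIMITER //
CREATE PROCEDURE company_diffs()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_company VARCHAR(16);
  DECLARE v_created DATETIME;
  DECLARE prev_company VARCHAR(16) DEFAULT NULL;
  DECLARE prev_created DATETIME DEFAULT NULL;
  DECLARE cur CURSOR FOR
    SELECT company_id, created_at FROM Transactions ORDER BY company_id, id;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  CREATE TEMPORARY TABLE IF NOT EXISTS diffs (company_id VARCHAR(16), diff INT);
  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_company, v_created;
    IF done THEN LEAVE read_loop; END IF;
    IF v_company = prev_company THEN
      -- same company as the previous row: record the gap in days
      INSERT INTO diffs VALUES (v_company, DATEDIFF(v_created, prev_created));
    END IF;
    SET prev_company = v_company, prev_created = v_created;
  END LOOP;
  CLOSE cur;
END//
DELIMITER ;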
Try this one too.
SELECT company_id,
(SELECT DATEDIFF(created_at,TR.created_at)
FROM transactions
WHERE id > TR.id AND company_id = TR.company_id LIMIT 0,1) AS diff
FROM transactions AS TR
HAVING diff is not null
Try this
SELECT
t1.company_id,
t2.created_at - t1.created_at as diff
FROM Transactions t1
LEFT JOIN Transactions t2
on t2.created_at > t1.created_at
and t2.company_id = t1.company_id
I have the following table (visits):
id (int) | fb_id (varchar) | flipbook (varchar)
---------+-----------------+-------------------
       1 | 1123            | november 2014
       2 | 1124            | november 2014
       3 | 1123            | december 2014
       4 | 1124            | december 2014
       5 | 1123            | december 2014
       6 | 1123            | january 2015
       7 | 1126            | january 2015
       8 | 1125            | february 2015
       9 | 1123            | february 2015
      10 | 1124            | march 2015
      11 | 1125            | march 2015
      11 | 1123            | march 2015
After the query runs, I want to get the following results:
sequence count
5 1 (1 user visited 5 flipbooks in a row: 1123)
2 2 (2 users visited 2 flipbooks in a row: 1124, 1125)
1 1 (1 user visited only 1 flipbook: 1126)
Any ideas how to achieve this?
fb_id=1124 visited only 1 flipbook "in a row", unless row id=4 is supposed to be "december 2014" and not "december 2015".
Yes, it is possible to accomplish this in MySQL, using user-defined variables. (The MySQL Reference Manual cautions that the behavior of reading and assigning user-defined variables within the same statement is undefined.)
sequence count info
-------- ------ ---------------------------------------------
5 1 (1 user visited 5 flipbooks in a row: 1123
2 1 (1 user visited 2 flipbooks in a row: 1125
1 2 (2 users visited only 1 flipbook: 1124,1126
That result set is produced from the following SQL query:
SELECT d.seq AS `sequence`
, COUNT(1) AS `count`
, CONCAT('('
,COUNT(1)
,' user'
,IF(COUNT(1)>1,'s','')
,' visited'
,IF(d.seq>1,'',' only')
,' '
,d.seq
,' flipbook'
,IF(d.seq>1,'s in a row: ',': ')
,GROUP_CONCAT(d.fb_id ORDER BY d.fb_id)
) AS `info`
FROM ( SELECT c.fb_id
, MAX(c.cnt) AS seq
FROM ( SELECT @cnt := IF(@prev_fb_id = v.fb_id AND PERIOD_DIFF(v.yyyymm,@prev_yyyymm)=1, @cnt + 1, 1) AS cnt
, @prev_yyyymm := v.yyyymm AS yyyymm
, @prev_fb_id := v.fb_id AS fb_id
FROM ( SELECT @prev_fb_id := NULL
, @prev_yyyymm := NULL
, @cnt := 0
) i
CROSS
JOIN ( SELECT t.fb_id
, DATE_FORMAT(STR_TO_DATE(CONCAT('01 ',t.flipbook),'%d %M %Y'),'%Y%m') AS yyyymm
FROM t
GROUP BY t.fb_id, yyyymm
ORDER BY t.fb_id, yyyymm
) v
) c
GROUP BY c.fb_id
) d
GROUP BY d.seq
ORDER BY d.seq DESC
FOLLOWUP
The table name of the source table goes into the query in the innermost inline view, aliased as v. (In the example above, the table name is t.)
To understand how this works, you can run just the query for that innermost inline view, to see what it returns. Its big job is reformatting the flipbook column into a YYYYMM format, and ordering the rows. (We will later use the PERIOD_DIFF function to calculate the number of months between flipbook values.)
The inline view i is only there to initialize the user-defined variables we are going to use. We do it in an innermost inline view query so that gets done before those variables are referenced in the outer query. It's essentially equivalent to running separate SET statements immediately before the query runs. (We don't want any left-over values of the variables mucking with our results.)
Once the v and i views are materialized (as derived tables), the query in the inline view aliased as v can run. (The view queries in the FROM clause essentially serve like tables.)
This query is where the magic is. We're using user-defined variables to preserve the values of the "previous" row, so we can compare that to the current row. If the current row is for the same user, and is exactly one month after the previous row, we increment the sequence count by 1, otherwise, we set it to 1.
Once that query completes, we have a derived table we can use as the row source for yet another query. In this query, we want to find the "maximum" value of that sequence counter for each user. That will give us the longest sequence for each user.
With that set, the outermost query is almost trivial... order by the longest sequence in descending order, and collapse the rows to get a count of the number of users that had the same maximum sequence value.
To get the highest number of visits by an fb_id within a sequence, we can accumulate the number of visits in that innermost view query. A COUNT(1) or a SUM(1) will give us the number of visits in each month.
That can feed into the next query. We can do the same check we do for accumulating the contiguous months. Instead of incrementing by 1, we'll accumulate the total number of visits.
The next query has to be modified. We can't just wrap a MAX() around the tot, because we wouldn't be guaranteed that the total visits would be from that same longest sequence. We might have 6 visits in a 5 month sequence, but that same user might have visited 8 times in 3 months. So we scrap the MAX() function, and instead use ordering (from highest to lowest). We'll keep the value for the first row for a fb_id, and set the other values to NULL. Then, on the outermost query, we can use MAX() aggregate which will ignore the NULLs, and return the highest total visits from among all the users that had the same sequence value.
We can get this result:
sequence count highest_tot
-------- ------ -----------
5 1 6
2 1 2
1 2 1
From a query like this:
SELECT d.seq AS `sequence`
, COUNT(1) AS `count`
, MAX(FLOOR(d.tot)) AS `highest_tot`
FROM ( SELECT IF(@c_fb_id=c.fb_id,NULL,c.cnt) AS seq
, IF(@c_fb_id=c.fb_id,NULL,c.tot) AS tot
, @c_fb_id := c.fb_id AS fb_id
FROM ( SELECT @cnt := IF(@prev_fb_id = v.fb_id AND PERIOD_DIFF(v.yyyymm,@prev_yyyymm)=1, @cnt + 1, 1) AS cnt
, @tot := IF(@prev_fb_id = v.fb_id AND PERIOD_DIFF(v.yyyymm,@prev_yyyymm)=1, @tot + v.tot, v.tot) AS tot
, @prev_yyyymm := v.yyyymm AS yyyymm
, @prev_fb_id := v.fb_id AS fb_id
FROM ( SELECT @prev_fb_id := NULL
, @prev_yyyymm := NULL
, @cnt := 0
, @tot := 0
, @c_fb_id := NULL
) i
CROSS
JOIN ( SELECT t.fb_id
, DATE_FORMAT(STR_TO_DATE(CONCAT('01 ',t.flipbook),'%d %M %Y'),'%Y%m') AS yyyymm
, SUM(1) AS tot
FROM t
GROUP BY t.fb_id, yyyymm
ORDER BY t.fb_id, yyyymm
) v
) c
ORDER BY c.fb_id DESC, c.cnt DESC, c.tot DESC
) d
WHERE d.seq IS NOT NULL
GROUP BY d.seq
ORDER BY d.seq DESC
Pure SQL makes it hard to compute the sequence of visits per user. However, it is easier if you use some other method, e.g. create or reuse a table of your own:
users(fb_id, best_sequence, current_sequence, last_modified)
-------------------------------------------------------
1123 5 3 2016-01
Loop over your flipbook visits, e.g.
SELECT * FROM visits WHERE fb_id=1123 AND flipbook >= ....
(you might redesign the data to make the SQL easier here)
Update current_sequence (e.g. to 4) if the sequence continues, or reset it to 0 if a gap is found.
If current_sequence > best_sequence, set best_sequence = current_sequence.
You can do it via a cron job, a trigger, or whatever other method you feel most comfortable with.
This is just an idea; write your own code. A sketch of the two update steps follows.
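A hedged sketch of those two update steps (all names are assumptions based on the layout above; it assumes last_modified stores the last visited month as an integer period like 201601, and that @fb_id and @yyyymm identify the visit just processed):
-- extend the streak when the new month directly follows the stored one, else reset it
UPDATE users
SET    current_sequence = IF(PERIOD_DIFF(@yyyymm, last_modified) = 1,
                             current_sequence + 1, 1)  -- the original text says reset to 0; 1 counts the month just visited
WHERE  fb_id = @fb_id;

-- remember the best streak and the month just processed
UPDATE users
SET    best_sequence = GREATEST(best_sequence, current_sequence),
       last_modified = @yyyymm
WHERE  fb_id = @fb_id;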
Suppose I have a table with 3 columns:
id (PK, int)
timestamp (datetime)
title (text)
I have the following records:
1, 2010-01-01 15:00:00, Some Title
2, 2010-01-01 15:00:02, Some Title
3, 2010-01-02 15:00:00, Some Title
I need to do a GROUP BY records that are within 3 seconds of each other. For this table, rows 1 and 2 would be grouped together.
There is a similar question here: Mysql DateTime group by 15 mins
I also found this: http://www.artfulsoftware.com/infotree/queries.php#106
I don't know how to convert these methods into something that will work for seconds. The trouble with the method on the SO question is that it seems to me that it would only work for records falling within a bin of time that starts at a known point. For instance, if I were to get FLOOR() to work with seconds, at an interval of 5 seconds, a time of 15:00:04 would be grouped with 15:00:01, but not grouped with 15:00:06.
Does this make sense? Please let me know if further clarification is needed.
EDIT: For the set of numbers, {1, 2, 3, 4, 5, 6, 7, 50, 51, 60}, it seems it might be best to group them {1, 2, 3, 4, 5, 6, 7}, {50, 51}, {60}, so that each grouping row depends on if the row is within 3 seconds of the previous. I know this changes things a bit, I'm sorry for being wishywashy on this.
I am trying to fuzzy-match logs from different servers. Server #1 may log an item, "Item #1", and Server #2 will log that same item, "Item #1", within a few seconds of server #1. I need to do some aggregate functions on both log lines. Unfortunately, I only have title to go on, due to the nature of the server software.
I'm using Tom H.'s excellent idea but doing it a little differently here:
Instead of finding all the rows that are the beginnings of chains, we can find all times that are the beginnings of chains, then go back and find the rows that match those times.
Query #1 here should tell you which times are the beginnings of chains by finding which times do not have any times below them but within 3 seconds:
SELECT DISTINCT Timestamp
FROM Table a
LEFT JOIN Table b
ON (b.Timestamp >= a.TimeStamp - INTERVAL 3 SECOND
AND b.Timestamp < a.Timestamp)
WHERE b.Timestamp IS NULL
And then for each row, we can find the largest chain-starting timestamp that is less than our timestamp with Query #2:
SELECT Table.id, MAX(StartOfChains.TimeStamp) AS ChainStartTime
FROM Table
JOIN ([query #1]) StartofChains
ON Table.Timestamp >= StartOfChains.TimeStamp
GROUP BY Table.id
Once we have that, we can GROUP BY it as you wanted.
SELECT COUNT(*) --or whatever
FROM Table
JOIN ([query #2]) GroupingQuery
ON Table.id = GroupingQuery.id
GROUP BY GroupingQuery.ChainStartTime
I'm not entirely sure this is distinct enough from Tom H's answer to be posted separately, but it sounded like you were having trouble with implementation, and I was thinking about it, so I thought I'd post again. Good luck!
Now that I think that I understand your problem, based on your comment response to OMG Ponies, I think that I have a set-based solution. The idea is to first find the start of any chains based on the title. The start of a chain is going to be defined as any row where there is no match within three seconds prior to that row:
SELECT
MT1.my_id,
MT1.title,
MT1.my_time
FROM
My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
MT2.title = MT1.title AND
(
MT2.my_time < MT1.my_time OR
(MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
) AND
MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
MT2.my_id IS NULL
Now we can assume that any non-chain starters belong to the chain starter that appeared before them. Since MySQL doesn't support CTEs, you might want to throw the above results into a temporary table, as that would save you the multiple joins to the same subquery below.
SELECT
SQ1.my_id,
COUNT(*) -- You didn't say what you were trying to calculate, just that you needed to group them
FROM
(
SELECT
MT1.my_id,
MT1.title,
MT1.my_time
FROM
My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
MT2.title = MT1.title AND
(
MT2.my_time < MT1.my_time OR
(MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
) AND
MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
MT2.my_id IS NULL
) SQ1
INNER JOIN My_Table MT3 ON
MT3.title = SQ1.title AND
MT3.my_time >= SQ1.my_time
LEFT OUTER JOIN
(
SELECT
MT1.my_id,
MT1.title,
MT1.my_time
FROM
My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
MT2.title = MT1.title AND
(
MT2.my_time < MT1.my_time OR
(MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
) AND
MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
MT2.my_id IS NULL
) SQ2 ON
SQ2.title = SQ1.title AND
SQ2.my_time > SQ1.my_time AND
SQ2.my_time <= MT3.my_time
WHERE
SQ2.my_id IS NULL
This would look much simpler if you could use CTEs or if you used a temporary table. Using the temporary table might also help performance.
Also, there will be issues with this if you can have timestamps that match exactly. If that's the case then you will need to tweak the query slightly to use a combination of the id and the timestamp to distinguish rows with matching timestamp values.
EDIT: Changed the queries to handle exact matches by timestamp.
Warning: Long answer. This should work, and is fairly neat, except for one step in the middle where you have to be willing to run an INSERT statement over and over until it doesn't do anything since we can't do recursive CTE things in MySQL.
I'm going to use this data as the example instead of yours:
id Timestamp
1 1:00:00
2 1:00:03
3 1:00:06
4 1:00:10
Here is the first query to write:
SELECT a.id as aid, b.id as bid
FROM Table a
JOIN Table b
ON (ABS(TIMESTAMPDIFF(SECOND, a.Timestamp, b.Timestamp)) <= 3) -- i.e. a.Timestamp is within 3 seconds of b.Timestamp
It returns:
aid bid
1 1
1 2
2 1
2 2
2 3
3 2
3 3
4 4
Let's create a nice table to hold those things that won't allow duplicates:
CREATE TABLE
Adjacency
( aid INT(11)
, bid INT(11)
, PRIMARY KEY (aid, bid) --important for later
)
Now the challenge is to find something like the transitive closure of that relation.
To do so, let's find the next level of links. By that I mean: since we have 1 2 and 2 3 in the Adjacency table, we should add 1 3:
INSERT IGNORE INTO Adjacency(aid,bid)
SELECT adj1.aid, adj2.bid
FROM Adjacency adj1
JOIN Adjacency adj2
ON (adj1.bid = adj2.aid)
This is the non-elegant part: You'll need to run the above INSERT statement over and over until it doesn't add any rows to the table. I don't know if there is a neat way to do that.
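One way to automate that repetition, as a sketch only (it relies on a stored procedure and MySQL's ROW_COUNT(), neither of which is part of the original answer):
DELIMITER //
CREATE PROCEDURE close_adjacency()
BEGIN
  REPEAT
    -- each pass adds the newly reachable pairs; existing pairs are ignored
    INSERT IGNORE INTO Adjacency (aid, bid)
    SELECT adj1.aid, adj2.bid
    FROM Adjacency adj1
    JOIN Adjacency adj2 ON (adj1.bid = adj2.aid);
  UNTIL ROW_COUNT() = 0 END REPEAT;
END//
DELIMITER ;

CALL close_adjacency();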
Once this is over, you will have a transitively-closed relation like this:
aid bid
1 1
1 2
1 3 --added
2 1
2 2
2 3
3 1 --added
3 2
3 3
4 4
And now for the punchline:
SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
FROM Adjacency
GROUP BY aid
returns:
aid Neighbors
1 1,2,3
2 1,2,3
3 1,2,3
4 4
So
SELECT DISTINCT Neighbors
FROM (
SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
FROM Adjacency
GROUP BY aid
) Groupings
returns
Neighbors
1,2,3
4
Whew!
I like @Chris Cunningham's answer, but here's another take on it.
First, my understanding of your problem statement (correct me if I'm wrong):
You want to look at your event log as a sequence, ordered by the time of the event,
and partition it into groups, defining the boundary as being an interval of
more than 3 seconds between two adjacent rows in the sequence.
I work mostly in SQL Server, so I'm using SQL Server syntax. It shouldn't be too difficult to translate into MySQL SQL.
So, first our event log table:
--
-- our event log table
--
create table dbo.eventLog
(
id int not null ,
dtLogged datetime not null ,
title varchar(200) not null ,
primary key nonclustered ( id ) ,
unique clustered ( dtLogged , id ) ,
)
Given the above understanding of the problem statement, the following query should give you the upper and lower bounds of your groups. It's a simple nested select statement with two GROUP BY levels to collapse things:
The innermost select defines the upper bound of each group. That upper boundary defines a group.
The outer select defines the lower bound of each group.
Every row in the table should fall into one of the groups so defined, and any given group may well consist of a single date/time value.
[edited: the upper bound is the lowest date/time value where the interval is more than 3 seconds]
select dtFrom = min( t.dtFrom ) ,
dtThru = t.dtThru
from ( select dtFrom = t1.dtLogged ,
dtThru = min( t2.dtLogged )
from dbo.EventLog t1
left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
and datediff(second,t1.dtLogged,t2.dtLogged) > 3
group by t1.dtLogged
) t
group by t.dtThru
You could then pull rows from the event log and tag them with the group to which they belong thus:
select *
from ( select dtFrom = min( t.dtFrom ) ,
dtThru = t.dtThru
from ( select dtFrom = t1.dtLogged ,
dtThru = min( t2.dtLogged )
from dbo.EventLog t1
left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
and datediff(second,t1.dtLogged,t2.dtLogged) > 3
group by t1.dtLogged
) t
group by t.dtThru
) period
join dbo.EventLog t on t.dtLogged >= period.dtFrom
and t.dtLogged <= coalesce( period.dtThru , t.dtLogged )
order by period.dtFrom , period.dtThru , t.dtLogged
Each row is tagged with its group via the dtFrom and dtThru columns returned. You could get fancy and assign an integral row number to each group if you want.
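For example, one hedged way to do that in SQL Server is to add DENSE_RANK over the same query to number the groups:
select dense_rank() over ( order by period.dtFrom ) as group_id ,
       t.*
from ( select dtFrom = min( t.dtFrom ) ,
              dtThru = t.dtThru
       from ( select dtFrom = t1.dtLogged ,
                     dtThru = min( t2.dtLogged )
              from dbo.EventLog t1
              left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
                                       and datediff(second,t1.dtLogged,t2.dtLogged) > 3
              group by t1.dtLogged
            ) t
       group by t.dtThru
     ) period
join dbo.EventLog t on t.dtLogged >= period.dtFrom
                   and t.dtLogged <= coalesce( period.dtThru , t.dtLogged )
order by group_id , t.dtLogged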
Simple query:
SELECT * FROM time_history GROUP BY ROUND(UNIX_TIMESTAMP(time_stamp)/3);