I have one billion lines. Each line is a sequence of numbers:
32098;1278;23902;8469
42710;17864;32230
230984;812918;420322;182972
339028;232329;2190120;23302;182972
232329;17864;32230;23302;182972
How should I store this data and search in it so that the time to find any sub-sequence is minimal?
Example: searching for sequence "17864;32230" outputs:
42710;17864;32230
232329;17864;32230;23302;182972
What I have tried:
storing lines in varchar (ascii) and searching: like "%17864;32230%" => very slow...
storing lines in varchar (ascii) with a fulltext index and searching: against(' "17864;32230" ' in boolean mode) => faster...
storing lines in varchar (ascii) with a fulltext index and searching: against(' +17864 +32230' in boolean mode) and line like "%17864;32230%" => fastest I found...
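For reference, a rough sketch of that last attempt (table and column names are hypothetical, and the server's fulltext token-length settings must allow the numeric terms):
ALTER TABLE lines_tbl ADD FULLTEXT INDEX ft_line (line);
SELECT line FROM lines_tbl
WHERE MATCH(line) AGAINST('+17864 +32230' IN BOOLEAN MODE)
AND line LIKE '%17864;32230%';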
Any faster method?
Searching for sequence "17864;32230": will the next two values be selected: "17864;123456;32230", "123456;32230;17864"? – Akina
@Akina, "17864;123456;32230" and "123456;32230;17864" must not be output, because they do not contain the sequence "17864;32230" – JoJo
I.e. your sequence is position-dependent... well. Is the sequence to be found always 2-valued, or may its length (in elements) vary? – Akina
@Akina, the sequence to be found is always 2-valued. You are right :) – JoJo
Does each separate value in the "array" have some upper limit? Not more than 6 digits, for example... – Akina
@Akina, you are right; in my specific case, numbers in the sequence are limited to 8 digits – JoJo
Look at this solution:
fiddle
CREATE TABLE sourcetable ( id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
dataarray TEXT );
INSERT INTO sourcetable (dataarray) VALUES
('32098;1278;23902;8469'),
('42710;17864;32230'),
('230984;812918;420322;182972'),
('339028;232329;2190120;23302;182972'),
('232329;17864;32230;23302;182972');
-- create indexing table
CREATE TABLE indexingtable ( id BIGINT UNSIGNED NOT NULL,
sequence BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (sequence, id) );
-- and fill it
INSERT IGNORE INTO indexingtable
-- assume not more than 6 elements per "array"
WITH cte AS ( SELECT 1 num UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 )
SELECT id, CONCAT(LPAD(SUBSTRING_INDEX(SUBSTRING_INDEX(dataarray, ';', num), ';', -1), 9, '0'),
LPAD(SUBSTRING_INDEX(SUBSTRING_INDEX(dataarray, ';', num+1), ';', -1), 9, '0'))
FROM sourcetable, cte;
-- search for "17864;32230"
SET @criteria := 17864000032230;
-- perform searching
SELECT sourcetable.*
FROM sourcetable
JOIN indexingtable USING (id)
WHERE sequence = @criteria;
id | dataarray
-: | :------------------------------
2 | 42710;17864;32230
5 | 232329;17864;32230;23302;182972
EXPLAIN
SELECT sourcetable.*
FROM sourcetable
JOIN indexingtable USING (id)
WHERE sequence = @criteria;
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
-: | :---------- | :------------ | :--------- | :----- | :------------ | :------ | :------ | :------------------------------------------- | ---: | -------: | :-----------------------
1 | SIMPLE | indexingtable | null | ref | PRIMARY | PRIMARY | 8 | const | 2 | 100.00 | Using where; Using index
1 | SIMPLE | sourcetable | null | eq_ref | PRIMARY | PRIMARY | 8 | fiddle_KJQBRBTPCZAIOJRJHGJJ.indexingtable.id | 1 | 100.00 | null
db<>fiddle here
Creating the indexingtable with this query will be an extremely long and expensive process on a billion source records. I'd recommend exporting the source data to text (SELECT .. INTO OUTFILE), converting it with any scripting/programming language, and then importing the result into the indexingtable. That will also take a while, but much less than doing it with the query.
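A hedged sketch of that round trip, assuming tab-separated intermediate files (the paths and the conversion step are hypothetical):
SELECT id, dataarray FROM sourcetable INTO OUTFILE '/tmp/source.tsv' FIELDS TERMINATED BY '\t';
-- convert /tmp/source.tsv into one (id, sequence) pair per line with your script, then:
LOAD DATA INFILE '/tmp/pairs.tsv' IGNORE INTO TABLE indexingtable FIELDS TERMINATED BY '\t' (id, sequence);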
If you want to rebuild the data, you could do the following.
Structure the data vertically -- you'll have billions of rows:
id n val
1 1 32098
1 2 1278
1 3 23902
1 4 8469
2 1 42710
2 2 17864
2 3 32230
With another table:
id sequence
1 32098;1278;23902;8469
2 42710;17864;32230
Then you can try:
select ta.id
from table1 ta join
table1 tb
on tb.id = ta.id and
tb.n = ta.n + 1 and
tb.val = 32230
where ta.val = 17864
For this you want indexes on (id, n, val) and (val, id, n).
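A minimal sketch of that vertical table (names are hypothetical; the query above assumes it is called table1):
CREATE TABLE table1 (
id BIGINT UNSIGNED NOT NULL, -- id of the original line
n INT UNSIGNED NOT NULL, -- position of the value within the line
val BIGINT UNSIGNED NOT NULL, -- the number at that position
PRIMARY KEY (id, n, val),
KEY idx_val_id_n (val, id, n)
);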
I would expect this to be fairly competitive with the full-text search method. I'm actually surprised that option (3) is faster than option (2).
The advantage is that it might give you more flexibility on the types of sequences that you are looking for.
Related
Consider a typical GROUP BY statement in SQL: you have a table like
+------+-------+
| Name | Value |
+------+-------+
| A | 1 |
| B | 2 |
| A | 3 |
| B | 4 |
+------+-------+
And you ask for
SELECT Name, SUM(Value) as Value
FROM table
GROUP BY Name
You'll receive
+------+-------+
| Name | Value |
+------+-------+
| A | 4 |
| B | 6 |
+------+-------+
In your head, you can imagine that SQL generates an intermediate sorted table like
+------+-------+
| Name | Value |
+------+-------+
| A | 1 |
| A | 3 |
| B | 2 |
| B | 4 |
+------+-------+
and then aggregates together successive rows: the "Value" column has been given an aggregator (in this case SUM), so it's easy to aggregate. The "Name" column has been given no aggregator, and thus uses what you might call the "trivial partial aggregator": given two things that are the same (e.g. A and A), it aggregates them into a single copy of one of the inputs (in this case A). Given any other input it doesn't know what to do and is forced to begin aggregating anew (this time with the "Name" column equal to B).
I want to do a more exotic kind of aggregation. My table looks like
+------+-------+
| Name | Value |
+------+-------+
| A | 1 |
| BC | 2 |
| AY | 3 |
| AZ | 4 |
| B | 5 |
| BCR | 6 |
+------+-------+
And the intended output is
+------+-------+
| Name | Value |
+------+-------+
| A | 8 |
| B | 13 |
+------+-------+
Where does this come from? A and B are the "minimal prefixes" for this set of names: they occur in the data set and every Name has exactly one of them as a prefix. I want to aggregate data by grouping rows together when their Names have the same minimal prefix (and add the Values, of course).
In the toy grouping model from before, the intermediate sorted table would be
+------+-------+
| Name | Value |
+------+-------+
| A | 1 |
| AY | 3 |
| AZ | 4 |
| B | 5 |
| BC | 2 |
| BCR | 6 |
+------+-------+
Instead of using the "trivial partial aggregator" for Names, we would use one that can aggregate X and Y together iff X is a prefix of Y; in that case it returns X. So the first three rows would be aggregated together into a row with (Name, Value) = (A, 8), then the aggregator would see that A and B couldn't be aggregated and would move on to a new "block" of rows to aggregate.
The tricky thing is that the value we're grouping by is "non-local": if A were not a name in the dataset, then AY and AZ would each be a minimal prefix. It turns out that the AY and AZ rows are aggregated into the same row in the final output, but you couldn't know that just by looking at them in isolation.
Miraculously, in my use case the minimal prefix of a string can be determined without reference to anything else in the dataset. (Imagine that each of my names is one of the strings "hello", "world", and "bar", followed by any number of z's. I want to group all of the Names with the same "base" word together.)
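For instance, under that toy assumption the minimal prefix could be computed row by row with a plain string expression (purely illustrative, since the real rule depends on the actual names):
SELECT Name, TRIM(TRAILING 'z' FROM Name) AS minimal_prefix FROM table1;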
As I see it I have two options:
1) The simple option: compute the prefix for each row and group by that value directly. Unfortunately I have an index on the Name, and computing the minimal prefix (whose length depends on the Name itself) prevents me from using that index. This forces a full table scan, which is prohibitively slow.
2) The complicated option: somehow convince MySQL to use the "partial prefix aggregator" for Name. This runs into the "non-locality" problem above, but that's fine as long as we scan the table according to my index on Name, since then every minimal prefix will be encountered before any of the other strings it is a prefix of; we would never try to aggregate AY and AZ together if A were in the dataset.
In a procedural programming language #2 would be rather easy: extract rows one at a time, in alphabetical order, keeping track of the current prefix. If your new row's Name has that as a prefix, it goes in the bucket you're currently using. Otherwise, start a new bucket with that Name as your prefix. In MySQL I am lost as to how to do it. Note that the set of minimal prefixes is not known beforehand.
Edit 2
It occurred to me that if the table is ordered by Name, this would be a lot easier (and faster). Since I don't know if your data is sorted, I've included a sort in this query; if the data is already sorted, you can strip out (SELECT * FROM table1 ORDER BY Name) t1 and just use FROM table1.
SELECT prefix, SUM(`Value`)
FROM (SELECT Name, Value, @prefix:=IF(Name NOT LIKE CONCAT(@prefix, '_%'), Name, @prefix) AS prefix
FROM (SELECT * FROM table1 ORDER BY Name) t1
JOIN (SELECT @prefix := '~') p
) t2
GROUP BY prefix
Updated SQLFiddle
Edit
Having slept on the problem, I realised that there is no need to do the IN, it's enough to just have a WHERE NOT EXISTS clause on the JOINed table:
SELECT t1.Name, SUM(t2.Value) AS `Value`
FROM table1 t1
JOIN table1 t2 ON t2.Name LIKE CONCAT(t1.Name, '%')
WHERE NOT EXISTS (SELECT *
FROM table1 t3
WHERE t1.Name LIKE CONCAT(t3.Name, '_%')
)
GROUP BY t1.Name
Updated Explain (Name changed to UNIQUE key from PRIMARY)
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY t1 index Name Name 11 NULL 6 Using where; Using index; Using temporary; Using filesort
1 PRIMARY t2 ALL NULL NULL NULL NULL 6 Using where; Using join buffer (Block Nested Loop)
3 DEPENDENT SUBQUERY t3 index NULL Name 11 NULL 6 Using where; Using index
Updated SQLFiddle
Original Answer
Here is one way you could do it. First, you need to find all the unique prefixes in your table. You can do that by looking for all values of Name where it does not look like another value of Name with other characters on the end. This can be done with this query:
SELECT Name
FROM table1 t1
WHERE NOT EXISTS (SELECT *
FROM table1 t2
WHERE t1.Name LIKE CONCAT(t2.Name, '_%')
)
For your sample data, that will give
Name
A
B
Now you can sum all the values where the Name starts with one of those prefixes. Note we change the LIKE pattern in this query so that it also matches the prefix, otherwise we wouldn't count the values for A and B in your example:
SELECT t1.Name, SUM(t2.Value) AS `Value`
FROM table1 t1
JOIN table1 t2 ON t2.Name LIKE CONCAT(t1.Name, '%')
WHERE t1.Name IN (SELECT Name
FROM table1 t3
WHERE NOT EXISTS (SELECT *
FROM table1 t4
WHERE t3.Name LIKE CONCAT(t4.Name, '_%')
)
)
GROUP BY t1.Name
Output:
Name Value
A 8
B 13
An EXPLAIN says that both of these queries use the index on Name, so should be reasonably efficient. Here is the result of the explain on my MySQL 5.6 server:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY t1 index PRIMARY PRIMARY 11 NULL 6 Using index; Using temporary; Using filesort
1 PRIMARY t3 eq_ref PRIMARY PRIMARY 11 test.t1.Name 1 Using where; Using index
1 PRIMARY t2 ALL NULL NULL NULL NULL 6 Using where; Using join buffer (Block Nested Loop)
3 DEPENDENT SUBQUERY t4 index NULL PRIMARY 11 NULL 6 Using where; Using index
SQLFiddle Demo
Here are some hints on how to do the task. This locates all the prefixes that are useful. That's not quite what you asked for, but the flow of the query and the usage of @variables, plus the need for 2 (actually 3) levels of nesting, might help you.
SELECT DISTINCT `Prev`
FROM
(
SELECT @prev := @next AS 'Prev',
@next := IF(LEFT(city, LENGTH(@prev)) = @prev, @next, city) AS 'Next'
FROM ( SELECT @next := ' ' ) AS init
JOIN ( SELECT DISTINCT city FROM us ) AS dedup
ORDER BY city
) x
WHERE `Prev` = `Next` ;
Partial output:
+----------------+
| Prev |
+----------------+
| Alamo |
| Allen |
| Altamont |
| Ames |
| Amherst |
| Anderson |
| Arlington |
| Arroyo |
| Auburn |
| Austin |
| Avon |
| Baker |
Check the Al% cities:
mysql> SELECT DISTINCT city FROM us WHERE city LIKE 'Al%' ORDER BY city;
+-------------------+
| city |
+-------------------+
| Alabaster |
| Alameda |
| Alamo | <--
| Alamogordo | <--
| Alamosa |
| Albany |
| Albemarle |
...
| Alhambra |
| Alice |
| Aliquippa |
| Aliso Viejo |
| Allen | <--
| Allen Park | <--
| Allentown | <--
| Alliance |
| Allouez |
| Alma |
| Aloha |
| Alondra Park |
| Alpena |
| Alpharetta |
| Alpine |
| Alsip |
| Altadena |
| Altamont | <--
| Altamonte Springs | <--
| Alton |
| Altoona |
| Altus |
| Alvin |
+-------------------+
40 rows in set (0.01 sec)
We have a MyISAM table with a single column bit and two rows, containing 0 and 1. We group by this column, compute a count, and select it. The following result is as expected.
select count(bit), bit from tab GROUP BY bit;
| count(bit) | bit |
|------------|-----|
| 1 | 0 |
| 1 | 1 |
But when using the distinct keyword, the output value of the column is always 1. Why?
select count(distinct bit), bit from tab GROUP BY bit;
| count(distinct bit) | bit |
|---------------------|-----|
| 1 | 1 | # WHYYY
| 1 | 1 |
I've been crawling the documentation and the internet but with no luck.
Here is the setup:
CREATE TABLE `tab` (
`bit` bit(1) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8; # When using InnoDB everything's fine
INSERT INTO `tab` (`bit`) VALUES
(CONV('1', 2, 10) + 0),
(CONV('0', 2, 10) + 0);
PS: One more thing. I've been doing several experiments. Using group_concat, the bit column becomes distinguishable again.
select count(distinct bit), group_concat(bit) from tab GROUP BY bit;
| count(distinct bit) | group_concat(bit) |
|---------------------|-------------------|
| 1 | 1 byte (0) |
| 1 | 1 byte (1) |
Thanks to the comments, I am now convinced not to use the bit type at all. The more reliable alternative is tinyint(1).
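A minimal sketch of switching to that alternative, assuming the table can be altered in place:
-- existing 0/1 values are preserved as TINYINT 0/1
ALTER TABLE `tab` MODIFY `bit` TINYINT(1) NOT NULL;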
Inspired by the Adminer application's bit handling, I recommend using the BIN() function to cast the bit column to the expected value every time you select it:
select count(distinct bit), BIN(bit) from tab GROUP BY bit;
I have a table that looks something like the following:
| id | sub_id | fk_id |
|----|--------|-------|
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 1 |
| 5 | 5 | 1 |
| 6 | 1 | 2 |
| 7 | 2 | 2 |
| 8 | 3 | 2 |
| 9 | 4 | 2 |
| 10 | 5 | 2 |
Within this table id is the primary key, and sub_id and fk_id make up a compound unique key, where fk_id is the primary key in another table.
I've found myself in the situation where I need to be able to remove rows within the table, but then renumber sub_id so that there aren't any gaps, e.g. remove (1, 1, 1) and all rows where fk_id=1 have their respective sub_id renumbered as 1, 2, 3, 4, etc.
I also need to be able to remove one or more rows at a time, then trigger the re-numbering (as I assume it's inefficient to try and renumber them multiple times when once will suffice). However, there's a maximum of 60 rows for each value of fk_id but there can be thousands of different values of fk_id.
How should I go about the re-numbering? I'm thinking of some sort of INSERT ... SELECT query, but I can't get my head around how it should work.
You can renumber the rows for a given fk_id using this query:
select t_renum.*, count(t_lower.id) as new_sub_id
from mytable t_renum
join mytable t_lower
on t_lower.fk_id = t_renum.fk_id
and t_lower.id <= t_renum.id
where t_renum.fk_id = @renumber_fk_id
group by t_renum.id
The result can be joined with the original table for update like this:
update mytable t
join (
select t_renum.*, count(t_lower.id) as new_sub_id
from mytable t_renum
join mytable t_lower
on t_lower.fk_id = t_renum.fk_id
and t_lower.id <= t_renum.id
where t_renum.fk_id = @renumber_fk_id
group by t_renum.id
) t_renum using (id)
set t.sub_id = t_renum.new_sub_id
sqlfiddle
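Here @renumber_fk_id is a user variable holding the fk_id whose rows should be renumbered, e.g. (hypothetical value):
SET @renumber_fk_id := 1;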
After digging around some more, I discovered another answer that was remarkably simple and avoided the need for a new table, which is what many similar questions recommend. I converted it to a stored procedure, which suits my needs better:
DELIMITER //
CREATE PROCEDURE reindex (IN fk_key INT UNSIGNED)
BEGIN
SET @num := 0;
UPDATE example
SET sub_id = (@num := @num + 1)
WHERE fk_id = fk_key
ORDER BY id;
END //
DELIMITER ;
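For example, after deleting rows for fk_id = 1, the renumbering is triggered with:
CALL reindex(1);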
If I have a MySQL table such as:
I want to use SQL to calculate the sum of the PositiveResult column and also the NegativeResult column. Normally I could simply do SUM(PositiveResult) in a query.
But what if I wanted to go a step further and place the totals in a row at the bottom of the result set:
Can this be achieved at the data level or is it a presentation layer issue? If it can be done by SQL, how might I do this? I am a bit of an SQL newbie.
Thanks to the respondents. I will now check things with the customer.
Also, can a text column be added so that the value of the last row of data is not shown in the summary row? Like this:
I would also do this in the presentation layer, but you can do it in MySQL...
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
,pos DECIMAL(5,2)
,neg DECIMAL(5,2)
);
INSERT INTO my_table VALUES
(1,0,0),
(2,1,-2.5),
(3,1.6,-1),
(4,1,-2);
SELECT COALESCE(id,'total') my_id,SUM(pos),SUM(neg) FROM my_table GROUP BY id WITH ROLLUP;
+-------+----------+----------+
| my_id | SUM(pos) | SUM(neg) |
+-------+----------+----------+
| 1 | 0.00 | 0.00 |
| 2 | 1.00 | -2.50 |
| 3 | 1.60 | -1.00 |
| 4 | 1.00 | -2.00 |
| total | 3.60 | -5.50 |
+-------+----------+----------+
5 rows in set (0.02 sec)
Here's a hack for the amended problem - it ain't pretty but I think it works...
SELECT COALESCE(id,'') my_id
, SUM(pos)
, SUM(neg)
, COALESCE(string,'') n
FROM my_table
GROUP
BY id
, string
WITH ROLLUP
HAVING n <> '' OR my_id = ''
;
select keyword, sum(positiveResults) + sum(NegativeResults)
from mytable
group by keyword
If you need the absolute value, use sum(abs(NegativeResults)).
This should be handled at least one layer above the SQL query layer.
The initial query can fetch the detail info and then the application layer can calculate the aggregation (summary row). Or, a second db call to fetch the summary directly can be used (although this would be efficient only for cases where the calculation of the summary is very resource-intensive and a second db call is really necessary - most of the time the app layer can do it more efficiently).
The ordering/layout of the results (i.e. the detail rows followed by the "footer" summary row) should be handled at the presentation layer.
I'd recommend doing this at the presentation layer. To do something like this in SQL is also possible.
create table test (
keywordid int,
positiveresult decimal(10,2),
negativeresult decimal(10,2)
);
insert into test values
(1, 0, 0), (2, 1, -2.5), (3, 1.6, -1), (4, 1, -2);
select * from (
select keywordid, positiveresult, negativeresult
from test
union all
select null, sum(positiveresult), sum(negativeresult) from test
) main
order by
case when keywordid is null then 1000000 else keywordid end;
I added ordering using an arbitrarily high number when keywordid is null, to make sure the ordered recordset can be pulled easily by the view for display.
Result:
+-----------+----------------+----------------+
| keywordid | positiveresult | negativeresult |
+-----------+----------------+----------------+
| 1 | 0.00 | 0.00 |
| 2 | 1.00 | -2.50 |
| 3 | 1.60 | -1.00 |
| 4 | 1.00 | -2.00 |
| NULL | 3.60 | -5.50 |
+-----------+----------------+----------------+
I run the following query on a weekly basis, but it has got to the point where it now takes 22 hours to run! The purpose of the report is to aggregate impression and conversion data by ad placement and date, so the main table I am querying does not have a primary key, as there can be multiple events for the same date/placement.
The main data set has about 400K records, so it shouldn't take more than a few minutes to run this report.
The table descriptions are:
tbl_ads (400,000 records)
day_est DATE (index)
conv_day_est DATE (index)
placement_id INT (index)
adunit_id INT (index)
cost_type VARCHAR(20)
cost_value DECIMAL(10,2)
adserving_cost DECIMAL(10,2)
conversion1 INT
estimated_spend DECIMAL(10,2)
clicks INT
impressions INT
publisher_clicks INT
publisher_impressions INT
publisher_spend DECIMAL (10,2)
source VARCHAR(30)
map_external_id (75,000 records)
placement_id INT
adunit_id INT
external_id VARCHAR (50)
primary key(placement_id,adunit_id,external_id)
SQL Query
SELECT A.day_est,A.placement_id,A.placement_name,A.adunit_id,A.adunit_name,A.imp,A.clk, C.ads_cost, C.ads_spend, B.conversion1, B.conversion2,B.ID_Matched, C.pub_imps, C.pub_clicks, C.pub_spend, COALESCE(A.cost_type,B.cost_type) as cost_type, COALESCE(A.cost_value,B.cost_value) as cost_value, D.external_id
FROM (SELECT day_est, placement_id,adunit_id,placement_name,adunit_name,cost_type,cost_value,
SUM(impressions) as imp, SUM(clicks) as clk
FROM tbl_ads
WHERE source='delivery'
GROUP BY 1,2,3 ) as A LEFT JOIN
(
SELECT conv_day_est, placement_id,adunit_id, cost_type,cost_value, SUM(conversion1) as conversion1,
SUM(conversion2) as conversion2,SUM(id_match) as ID_Matched
FROM tbl_ads
WHERE source='attribution'
GROUP BY 1,2,3
) as B on A.day_est=B.conv_day_est AND A.placement_id=B.placement_id AND A.adunit_id=B.adunit_id
LEFT JOIN
(
SELECT day_est,placement_id,adunit_id,SUM(adserving_cost) as ads_cost, SUM(estimated_spend) as ads_spend,sum(publisher_clicks) as pub_clicks,sum(publisher_impressions) as pub_imps,sum(publisher_spend) as pub_spend
FROM tbl_ads
GROUP BY 1,2,3 ) as C on A.day_est=C.day_est AND A.placement_id=C.placement_id AND A.adunit_id=C.adunit_id
LEFT JOIN
(
SELECT placement_id,adunit_id,external_id
FROM map_external_id
) as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id
INTO OUTFILE '/tmp/weekly_report.csv';
Results of EXPLAIN:
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 136518 | |
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 5180 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 198190 | |
| 1 | PRIMARY | <derived5> | ALL | NULL | NULL | NULL | NULL | 23766 | |
| 5 | DERIVED | map_external_id | index | NULL | PRIMARY | 55 | NULL | 20797 | Using index |
| 4 | DERIVED | tbl_ads | index | NULL | PIndex | 13 | NULL | 318400 | |
| 3 | DERIVED | tbl_ads | ALL | NULL | NULL | NULL | NULL | 318400 | Using filesort |
| 2 | DERIVED | tbl_ads | index | NULL | PIndex | 13 | NULL | 318400 | Using where |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
More of a speculative answer, but I don't think 22 hours is too unrealistic...
First things first... you don't need the last subquery, just state
LEFT JOIN map_external_id as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id
Second, in the first and second subqueries you filter on the field source in your WHERE clause, and that field does not appear to be indexed in your table scheme. Does it have an index? I've had a table with 1'000'000 or so entries where a missing index caused a processing time of 30 seconds for a simple query (can't believe the guy who put that query in the login process).
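If it isn't indexed, an index that covers the filter and the grouping columns might already help; a hedged sketch (the index name is hypothetical):
ALTER TABLE tbl_ads ADD INDEX idx_source_grp (source, day_est, placement_id, adunit_id);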
An unrelated question in between: what's the final result set size?
Thirdly, my assumption is that in running the aggregating subqueries MySQL actually creates temporary tables that do not have any indices, which is bad.
Have you had a look at the result sets of the individual subqueries yet? What is the typical set size? From your statements and my guesses about your typical data I would assume that the aggregation only marginally reduces the set size (apart from the WHERE filter). So let me guess, in order of the subqueries: 200'000, 100'000, 200'000.
Each of the subqueries then joins with the next on three presumably unindexed fields. So worst case for the first join: 200'000 * 100'000 = 20'000'000'000 comparisons. Going by my experience of 30 sec for a query on 1'000'000 records, that makes 20'000 * 30 = 600'000 sec, roughly 166 hours. Obviously that's way too much; maybe there's a digit missing, maybe it was 20 sec not 30, the result sets might be different, and worst case is not average case, but you get the picture.
My approach then would be to create additional tables that replace your aggregation subqueries. Judging from your queries you could update them daily, as I guess you just insert rows for impressions etc., so you can add the aggregation data incrementally. Then you transform your mega-query into the two steps of
updating your aggregation tables
doing the final dump.
The aggregation tables obviously should be indexed meaningfully. I think that should bring the final queries down to a few seconds.
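A hedged sketch of one such aggregation table, mirroring the first subquery (table and column names here are hypothetical, and the incremental refresh assumes a day's rows are complete when it runs):
CREATE TABLE agg_delivery (
day_est DATE NOT NULL,
placement_id INT NOT NULL,
adunit_id INT NOT NULL,
imp BIGINT NOT NULL DEFAULT 0,
clk BIGINT NOT NULL DEFAULT 0,
PRIMARY KEY (day_est, placement_id, adunit_id)
);
-- daily incremental refresh for yesterday's data
INSERT INTO agg_delivery (day_est, placement_id, adunit_id, imp, clk)
SELECT day_est, placement_id, adunit_id, SUM(impressions), SUM(clicks)
FROM tbl_ads
WHERE source = 'delivery' AND day_est = CURDATE() - INTERVAL 1 DAY
GROUP BY day_est, placement_id, adunit_id
ON DUPLICATE KEY UPDATE imp = VALUES(imp), clk = VALUES(clk);
The weekly report then joins agg_delivery (and similar tables for the attribution and spend subqueries) on (day_est, placement_id, adunit_id), which the primary keys cover.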
Thanks for all your advice. I ended up splitting up the subqueries and creating temporary tables (with PKs) for each, then joining the temp tables together at the end, and it now takes about 10 minutes to run.