Optimize MySQL full outer join for a massive amount of data

We have the following MySQL tables (simplified to get straight to the point):
CREATE TABLE `MONTH_RAW_EVENTS` (
`idEvent` int(11) unsigned NOT NULL,
`city` varchar(45) NOT NULL,
`country` varchar(45) NOT NULL,
`ts` datetime NOT NULL,
`idClient` varchar(45) NOT NULL,
`event_category` varchar(45) NOT NULL,
... bunch of other fields
PRIMARY KEY (`idEvent`),
KEY `idx_city` (`city`),
KEY `idx_country` (`country`),
KEY `idClient` (`idClient`)
) ENGINE=InnoDB;
CREATE TABLE `compilation_table` (
`idClient` int(11) unsigned DEFAULT NULL,
`city` varchar(200) DEFAULT NULL,
`month` int(2) DEFAULT NULL,
`year` int(4) DEFAULT NULL,
`events_profile` int(10) unsigned NOT NULL DEFAULT '0',
`events_others` int(10) unsigned NOT NULL DEFAULT '0',
`events_total` int(10) unsigned NOT NULL DEFAULT '0',
KEY `idx_month` (`month`),
KEY `idx_year` (`year`),
KEY `idx_idClient` (`idClient`),
KEY `idx_city` (`city`)
) ENGINE=InnoDB;
MONTH_RAW_EVENTS contains almost 20M rows recording actions that users performed on a website; it is almost 4GB in size.
compilation_table holds a summary of clients/cities for each month; we use it to display stats on a website in real time.
We process the statistics (from the first table into the second) once per month, and we're trying to optimize a query that performs this operation (until now we've been processing everything in PHP, which takes a very long time).
Here's the query we came up with. It seems to do the job on small subsets of data,
but the problem is that it takes more than 6 hours to process the full data set:
INSERT INTO compilation_table (idClient,city,month,year,events_profile,events_others)
SELECT IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
01,2014,
IFNULL(CLIPROFILE.cnt,0) as events_profile,
IFNULL(OTHERS.cnt,0) as events_others
FROM
(
SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt
FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
AND `event_category`!='CLIENT PROFILE'
GROUP BY idClient,city
) as OTHERS
LEFT JOIN
(
SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt
FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
AND `event_category`='CLIENT PROFILE'
GROUP BY idClient,city
) as CLIPROFILE
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
UNION
SELECT IFNULL(OTHERS.idClient,CLIPROFILE.idClient) as idClient,
IF(IFNULL(OTHERS.city,CLIPROFILE.city)='','Others',IFNULL(OTHERS.city,CLIPROFILE.city)) as city,
01,2014,
IFNULL(CLIPROFILE.cnt,0) as events_profile,
IFNULL(OTHERS.cnt,0) as events_others
FROM
(
SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt
FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
AND `event_category`!='CLIENT PROFILE'
GROUP BY idClient,city
) as OTHERS
RIGHT JOIN
(
SELECT idClient,CONCAT(city,', ',country) as city,count(*) as cnt
FROM `MONTH_RAW_EVENTS` WHERE `ts`>'2014-01-01 00:00:00' AND `ts`<='2014-01-31 23:59:59'
AND `event_category`='CLIENT PROFILE'
GROUP BY idClient,city
) as CLIPROFILE
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
What we're trying to do is a FULL OUTER JOIN in MySQL, so the basic structure of the query is like the one proposed here.
How can we optimize the query? We've been trying different indexes and switching things around, but after 8 hours it still hadn't finished running.
The MySQL server is a Percona MySQL 5.5 dedicated machine with 2 CPUs, 2GB RAM, and an SSD disk.
We optimized the server's configuration using Percona tools.
Any help would be really appreciated.
Thanks

You're doing a UNION, which results in DISTINCT processing over the combined result.
It's usually better to rewrite a full join as a left join plus only the non-matching rows of a right join, combined with UNION ALL (this works if it's a proper 1:n join):
OTHERS LEFT JOIN CLIPROFILE
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
union all
OTHERS RIGHT JOIN CLIPROFILE
ON CLIPROFILE.city=OTHERS.city and CLIPROFILE.idClient=OTHERS.idClient
WHERE OTHERS.idClient IS NULL
Additionally you might materialize the results of the Derived Tables in temp tables before joining them, thus the calculation is only done once (I don't know if MySQL's optimizer is smart enough to do that automatically).
Plus it might be more efficient to group by and join on city/country as separate columns and do the CONCAT(city,', ',country) as city in the outer step.
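Putting these suggestions together, a sketch might look like the following. The temporary table names tmp_others and tmp_profile are made up for illustration. The two branches are written as separate INSERT statements rather than one UNION ALL because MySQL does not allow a TEMPORARY table to be referenced more than once in the same query. This is untested against the real data, so treat it as a starting point rather than a drop-in replacement.

```sql
-- Aggregate each event class once (hypothetical table names).
-- The key definition in parentheses is applied to the columns produced by the SELECT.
CREATE TEMPORARY TABLE tmp_others (PRIMARY KEY (idClient, city, country))
SELECT idClient, city, country, COUNT(*) AS cnt
FROM MONTH_RAW_EVENTS
WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
  AND event_category != 'CLIENT PROFILE'
GROUP BY idClient, city, country;

CREATE TEMPORARY TABLE tmp_profile (PRIMARY KEY (idClient, city, country))
SELECT idClient, city, country, COUNT(*) AS cnt
FROM MONTH_RAW_EVENTS
WHERE ts > '2014-01-01 00:00:00' AND ts <= '2014-01-31 23:59:59'
  AND event_category = 'CLIENT PROFILE'
GROUP BY idClient, city, country;

-- Left join: every OTHERS row, with the profile count where one exists.
-- The city expression is the same one used in the original query.
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)
SELECT o.idClient,
       IF(CONCAT(o.city, ', ', o.country) = '', 'Others', CONCAT(o.city, ', ', o.country)),
       1, 2014,
       IFNULL(p.cnt, 0),
       o.cnt
FROM tmp_others o
LEFT JOIN tmp_profile p
  ON p.idClient = o.idClient AND p.city = o.city AND p.country = o.country;

-- Non-matching rows of the other side: profile rows with no OTHERS counterpart.
INSERT INTO compilation_table (idClient, city, month, year, events_profile, events_others)
SELECT p.idClient,
       IF(CONCAT(p.city, ', ', p.country) = '', 'Others', CONCAT(p.city, ', ', p.country)),
       1, 2014,
       p.cnt,
       0
FROM tmp_profile p
LEFT JOIN tmp_others o
  ON o.idClient = p.idClient AND o.city = p.city AND o.country = p.country
WHERE o.idClient IS NULL;

DROP TEMPORARY TABLE tmp_others;
DROP TEMPORARY TABLE tmp_profile;
```

Grouping and joining on city and country as separate columns, and concatenating only in the final SELECT, follows the last suggestion above. Each aggregation now scans MONTH_RAW_EVENTS once instead of twice.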

Related

Improve query speed suggestions

For self-education I am developing an invoicing system for an electricity company. I have multiple time series tables with different intervals. One table represents consumption, and two others represent prices. A third price table still has to be incorporated. I am now running calculation queries, but they are slow. I would like to improve the query speed, especially since these are only the first calculations and the queries will only become more complicated. Please also note that this is the first database I have created and these are the first exercises I have done, so a simplified explanation is preferred. Thanks for any help provided.
I have indexed DATE, PERIOD_FROM, PERIOD_UNTIL in each table. This sped up the process from 60 seconds to 5 seconds.
The structure of the tables is the following:
CREATE TABLE `apxprice` (
`APX_id` int(11) NOT NULL AUTO_INCREMENT,
`DATE` date DEFAULT NULL,
`PERIOD_FROM` time DEFAULT NULL,
`PERIOD_UNTIL` time DEFAULT NULL,
`PRICE` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`APX_id`)
) ENGINE=MyISAM AUTO_INCREMENT=28728 DEFAULT CHARSET=latin1;
CREATE TABLE `imbalanceprice` (
`imbalanceprice_id` int(11) NOT NULL AUTO_INCREMENT,
`DATE` date DEFAULT NULL,
`PTU` tinyint(3) DEFAULT NULL,
`PERIOD_FROM` time DEFAULT NULL,
`PERIOD_UNTIL` time DEFAULT NULL,
`UPWARD_INCIDENT_RESERVE` tinyint(1) DEFAULT NULL,
`DOWNWARD_INCIDENT_RESERVE` tinyint(1) DEFAULT NULL,
`UPWARD_DISPATCH` decimal(10,2) DEFAULT NULL,
`DOWNWARD_DISPATCH` decimal(10,2) DEFAULT NULL,
`INCENTIVE_COMPONENT` decimal(10,2) DEFAULT NULL,
`TAKE_FROM_SYSTEM` decimal(10,2) DEFAULT NULL,
`FEED_INTO_SYSTEM` decimal(10,2) DEFAULT NULL,
`REGULATION_STATE` tinyint(1) DEFAULT NULL,
`HOUR` int(2) DEFAULT NULL,
PRIMARY KEY (`imbalanceprice_id`),
KEY `DATE` (`DATE`,`PERIOD_FROM`,`PERIOD_UNTIL`)
) ENGINE=MyISAM AUTO_INCREMENT=117427 DEFAULT CHARSET=latin1;
CREATE TABLE `powerload` (
`powerload_id` int(11) NOT NULL AUTO_INCREMENT,
`EAN` varchar(18) DEFAULT NULL,
`DATE` date DEFAULT NULL,
`PERIOD_FROM` time DEFAULT NULL,
`PERIOD_UNTIL` time DEFAULT NULL,
`POWERLOAD` int(11) DEFAULT NULL,
PRIMARY KEY (`powerload_id`)
) ENGINE=MyISAM AUTO_INCREMENT=61039 DEFAULT CHARSET=latin1;
Now when running this query:
SELECT i.DATE, i.PERIOD_FROM, i.TAKE_FROM_SYSTEM, i.FEED_INTO_SYSTEM,
a.PRICE, p.POWERLOAD, sum(a.PRICE * p.POWERLOAD)
FROM imbalanceprice i, apxprice a, powerload p
WHERE i.DATE = a.DATE
and i.DATE = p.DATE
AND i.PERIOD_FROM >= a.PERIOD_FROM
and i.PERIOD_FROM = p.PERIOD_FROM
AND i.PERIOD_FROM < a.PERIOD_UNTIL
AND i.DATE >= '2018-01-01'
AND i.DATE <= '2018-01-31'
group by i.DATE
I have run the query with EXPLAIN and get the following result:
select_type: SIMPLE for all tables
partitions: NULL for all tables
possible_keys: a, p = NULL; i = DATE
key: a, p = NULL; i = DATE
key_len: a, p = NULL; i = 8
ref: a, p = NULL; i = timeseries.a.DATE, timeseries.p.PERIOD_FROM
rows: a = 28727; p = 61038; i = 1
filtered: a = 100; p = 10; i = 100
Extra (a): Using where; Using temporary; Using filesort
Extra (b): Using where; Using join buffer (Block Nested Loop)
Extra (c): NULL
Ideally I would run a more complicated query for a whole year, grouped by month for example, with all price tables incorporated. However, this would be too slow. I have indexed DATE, PERIOD_FROM, PERIOD_UNTIL in each table. The calculation result must not change: in this case, quarter-hourly consumption of two meters multiplied by hourly prices.
"Categorically speaking," the first thing you should look at is indexes.
Your clauses such as WHERE i.DATE = a.DATE ... are categorically known as INNER JOINs, and the SQL engine needs to have the ability to locate the matching rows "instantly." (That is to say, without looking through the entire table!)
FYI: Just like any index in real-life – here I would be talking about "library card catalogs" if we still had such a thing – indexes will assist both "equal to" and "less/greater than" queries. The index takes the computer directly to a particular point in the data, whether that's a "hit" or a "near miss."
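As a concrete sketch of that advice: the table definitions in the question show a composite key only on imbalanceprice, and the EXPLAIN output reports no usable key for a or p. Assuming that is still the case, indexes along these lines would give the join something to seek on (the index names are made up for illustration):

```sql
-- Hypothetical index names; columns follow the join conditions in the question.
-- apxprice is matched on DATE (equality), then on a PERIOD_FROM / PERIOD_UNTIL range.
ALTER TABLE apxprice
  ADD INDEX idx_apx_date_period (`DATE`, `PERIOD_FROM`, `PERIOD_UNTIL`);

-- powerload is matched on DATE and PERIOD_FROM (both equality);
-- POWERLOAD is appended so the index alone can answer the query for this table.
ALTER TABLE powerload
  ADD INDEX idx_pl_date_period (`DATE`, `PERIOD_FROM`, `POWERLOAD`);
```

Equality columns come first and the range column after them, so the engine can jump to the matching date and then scan only the relevant periods.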
Finally, the EXPLAIN verb is very useful: put that word in front of your query, and the SQL engine should "explain to you" exactly how it intends to carry out your query. (The SQL engine looks at the structure of the database to make that decision.) Although the EXPLAIN output is ... (heh) ... "not exactly standardized," it will help you to see if the computer thinks that it needs to do something very time-wasting in order to deliver your answer.
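For instance, prefixing the query from the question with EXPLAIN looks like this. It is written here with explicit JOIN syntax, which is equivalent to the comma-separated form, and the non-aggregated columns are left out for brevity:

```sql
EXPLAIN
SELECT i.DATE, SUM(a.PRICE * p.POWERLOAD) AS total
FROM imbalanceprice i
JOIN apxprice a
  ON a.DATE = i.DATE
 AND i.PERIOD_FROM >= a.PERIOD_FROM
 AND i.PERIOD_FROM <  a.PERIOD_UNTIL
JOIN powerload p
  ON p.DATE = i.DATE
 AND p.PERIOD_FROM = i.PERIOD_FROM
WHERE i.DATE >= '2018-01-01'
  AND i.DATE <= '2018-01-31'
GROUP BY i.DATE;
```

In the output, a row whose type is ALL and whose key is NULL means that table is read in full for that step, which is the time-wasting case to look for.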

MySQL Query/Table in need of optimization

I have a query that is taking an embarrassingly long time. ~7 minutes embarrassing. I would really appreciate some help. Missing indexes? Rewrite the query? All of the above?
Many thanks
mysql Ver 14.14 Distrib 5.7.25, for Linux (x86_64)
The query looks like:
SELECT COUNT(*) AS count_all, name
FROM api_events ae
INNER JOIN products p on p.token=ae.product_token
WHERE (ae.created_at > '2019-01-21 12:16:53.853732')
GROUP BY name
Here are the two table definitions
api_events has ~31 million records
CREATE TABLE `api_events` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`api_name` varchar(200) NOT NULL,
`hostname` varchar(200) NOT NULL,
`controller_action` varchar(2000) NOT NULL,
`duration` decimal(12,5) NOT NULL DEFAULT '0.00000',
`view` decimal(12,5) NOT NULL DEFAULT '0.00000',
`db` decimal(12,5) NOT NULL DEFAULT '0.00000',
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`product_token` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `product_token` (`product_token`)
) ENGINE=InnoDB AUTO_INCREMENT=64851218 DEFAULT CHARSET=latin1;
and
products has only 12 records
CREATE TABLE `products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`code` varchar(30) NOT NULL,
`name` varchar(100) NOT NULL,
`description` varchar(2000) NOT NULL,
`token` varchar(50) NOT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=19 DEFAULT CHARSET=latin1;
You could improve the join performance by adding indexes:
create index idx1 on api_events(product_token, created_at);
create index idx2 on products(token);
You could also try inverting the columns for api_events (as an alternative to the first idx1 above):
create index idx1 on api_events(created_at, product_token);
and try adding redundancy to the products index (again as an alternative to the first idx2):
create index idx2 on products(token, name);
For the query as stated, you need
api_events: INDEX(created_at, product_token)
products: INDEX(token, name)
Because the WHERE mentions api_events, the Optimizer is likely to start with that table. created_at is in the WHERE, so the index starts with that, even though starting with a 'range' is usually wrong. In this case, the pair is "covering".
Then, INDEX(token, name) is also "covering".
"Covering" indexes give a small, but widely varying, amount of performance improvement.
What happens if you group by the token instead of the name?
SELECT ae.product_token, COUNT(*) AS count_all
FROM api_events ae
WHERE ae.created_at > '2019-01-21 12:16:53.853732'
GROUP BY ae.product_token;
For this query, an index on api_events(created_at, product_token) will probably help.
If this is faster, then you can bring in the name using a subquery.
It seems like the criteria on created_at is very selective (looking at only the past 7 days?). That's crying out to explore an index with created_at as a leading column.
The query is also referencing the product_token column from the same table, so we can include that column in the index, to make it a covering index.
CREATE INDEX api_events_IX ON api_events ( created_at, product_token );
Using that index, we can probably avoid looking at the vast majority of the 31 million rows, and quickly narrow in on the subset of rows we actually need to look at.
Using the index, the query will still need a "Using filesort" operation to satisfy the GROUP BY.
(My guess here is that the join to the 12 rows in products doesn't exclude a lot of rows; that is, on the vast majority of rows in api_events, product_token refers to a row that exists in products.)
Use MySQL EXPLAIN to see the query execution plan.
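A minimal sketch of that check, using the query from the question unchanged apart from qualifying the name column:

```sql
EXPLAIN
SELECT COUNT(*) AS count_all, p.name
FROM api_events ae
INNER JOIN products p ON p.token = ae.product_token
WHERE ae.created_at > '2019-01-21 12:16:53.853732'
GROUP BY p.name;
```

In the row for api_events, the key column shows which index was actually chosen, and "Using index" in the Extra column indicates that the index is being used as a covering index.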
A further possible refinement (to test the performance of) would be to do some of the aggregation in an inline view:
SELECT SUM(s.count_all) AS count_all
, p.name
FROM ( SELECT COUNT(*) AS count_all
, ae.product_token
FROM api_events ae
WHERE ae.created_at > '2019-01-21 12:16:53.853732'
GROUP
BY ae.product_token
) s
JOIN products p
ON p.token = s.product_token
GROUP
BY p.name
If the assumption about product_token is misinformed, if there are lots of rows in api_event that have product_token values that don't reference a row in product ... we might take a different tack ...

MySQL select optimization

A table with a few million rows, something like this:
CREATE TABLE my_table (
`CONTVISITID` bigint(20) NOT NULL AUTO_INCREMENT,
`NODE_ID` bigint(20) DEFAULT NULL,
`CONT_ID` bigint(20) DEFAULT NULL,
`NODE_NAME` varchar(50) DEFAULT NULL,
`CONT_NAME` varchar(100) DEFAULT NULL,
`CREATE_TIME` datetime DEFAULT NULL,
`HITS` bigint(20) DEFAULT NULL,
`UPDATE_TIME` datetime DEFAULT NULL,
`CLIENT_TYPE` varchar(20) DEFAULT NULL,
`TYPE` bigint(1) DEFAULT NULL,
`PLAY_TIMES` bigint(20) DEFAULT NULL,
`FIRST_PUBLISH_TIME` bigint(20) DEFAULT NULL,
PRIMARY KEY (`CONTVISITID`),
KEY `cont_visit_contid` (`CONT_ID`),
KEY `cont_visit_createtime` (`CREATE_TIME`),
KEY `cont_visit_publishtime` (`FIRST_PUBLISH_TIME`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=57676834 DEFAULT CHARSET=utf8;
Starting from a flat select, I have managed to optimize my query to the following:
SELECT a.cont_id, SUM(a.hits)
FROM (
SELECT cont_id,hits,type,first_publish_time
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time>1398310263000
AND type=1) as a group by a.cont_id
order by sum(HITS) DESC LIMIT 10;
Can this be further optimized?
Edit:
I started with a flat select, as mentioned above. By "flat" I mean a single SELECT without a derived table, unlike my current composite query and like the single select that someone responded with. That single select is twice as slow, so it is not viable in my case.
Edit2: I have a DBA friend who suggested that I change the query to this:
SELECT a.cont_id, SUM(a.hits)
FROM (
SELECT cont_id,hits
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time>1398310263000
AND type=1) as a group by a.cont_id
order by sum(HITS) DESC LIMIT 10;
As I do not need the extra fields (type, first_publish_time), the temporary table is smaller, and the query now runs in about 1/4 of the total time of the fastest version I had. He also suggested adding a composite index on (create_time, cont_id, hits). He says that with this index I will get really good performance, but I have not done that yet, as this is a production DB and the ALTER might affect replication. I will post results once done.
Add these two indexes:
INDEX(type, first_publish_time)
INDEX(type, create_time)
Then do
SELECT cont_id, SUM(hits) AS tot_hits
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time > 1398310263000
AND type = 1
group by cont_id
order by tot_hits DESC
LIMIT 10;
Start the index with any = filters (type, in this case); then you get one chance to use a range.
The reason for two indexes: the Optimizer will look at statistics and decide which looks better based on the values given.
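Written out as DDL, the two indexes could be added like this (the index names are made up for illustration):

```sql
-- Hypothetical index names. Each index leads with the equality column (TYPE)
-- and ends with one of the two range columns from the WHERE clause.
ALTER TABLE my_table
  ADD INDEX idx_type_publish (`TYPE`, `FIRST_PUBLISH_TIME`),
  ADD INDEX idx_type_create  (`TYPE`, `CREATE_TIME`);
```

Only one of the two will be used for a given execution; running EXPLAIN on the query afterwards shows which one was picked.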
Consider shrinking the BIGINTs (8 bytes) to some smaller INT type. Saving space will help speed, especially if the table is too big to be cached.
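A sketch of what shrinking might look like. The target types are assumptions about the data (TYPE holds only small codes, and HITS and PLAY_TIMES are non-negative and stay below about 4.29 billion), so check the actual ranges first:

```sql
-- Check the real value ranges before changing any type.
SELECT MIN(`TYPE`), MAX(`TYPE`),
       MIN(HITS), MAX(HITS),
       MIN(PLAY_TIMES), MAX(PLAY_TIMES)
FROM my_table;

-- Assumed target types; adjust them to what the check above shows.
ALTER TABLE my_table
  MODIFY `TYPE` TINYINT DEFAULT NULL,               -- 1 byte instead of 8
  MODIFY `HITS` INT UNSIGNED DEFAULT NULL,          -- 4 bytes instead of 8
  MODIFY `PLAY_TIMES` INT UNSIGNED DEFAULT NULL;    -- 4 bytes instead of 8
```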
For further discussion, please provide EXPLAIN SELECT ...;.

Ordering in MySQL Bogs Down

I've been working on a small Perl program that works with a table of articles, displaying them to the user if they have not been already read. It has been working nicely and it has been quite speedy, overall. However, this afternoon, the performance has degraded from fast enough that I wasn't worried about optimizing the query to a glacial 3-4 seconds per query. To select articles, I present this query:
SELECT channelitem.ciid, channelitem.cid, name, description, url, creationdate, author
FROM `channelitem`
WHERE ciid NOT
IN (
SELECT ciid
FROM `uninet_channelitem_read`
WHERE uid = '1030'
)
AND (
cid =117
OR cid =308
OR cid =310
)
ORDER BY `channelitem`.`creationdate` DESC
LIMIT 0 , 100
The list of possible cid's varies and could be quite a bit more. In any case, I noted that about 2-3 seconds of the total time to make the query is devoted to "ORDER BY." If I remove that, it only takes about a half second to give me the query back. If I drop the subquery, the performance goes back to normal... but the subquery didn't seem to be problematic until just this afternoon, after working fine for a week or so.
Any ideas what could be slowing it down so much? What might I do to try to get the performance back up to snuff? The table being queried has 45,000 rows. The subquery's table has fewer than 3,000 rows at present.
Update: Incidentally, if anyone has suggestions on how to do multiple queries or some other technique that would be more efficient to accomplish what I am trying to do, I am all ears. I'm really puzzled how to solve the problem at this point. Can I somehow apply the order by before the join to make it apply to the real table and not the derived table? Would that be more efficient?
Here is the latest version of the query, derived from suggestions from @Gordon, below:
SELECT channelitem.ciid, channelitem.cid, name, description, url, creationdate, author
FROM `channelitem`
LEFT JOIN (
SELECT ciid, dateRead
FROM `uninet_channelitem_read`
WHERE uid = '1030'
)alreadyRead ON channelitem.ciid = alreadyRead.ciid
WHERE (
alreadyRead.ciid IS NULL
)
AND `cid`
IN ( 6648, 329, 323, 6654, 6647 )
ORDER BY `channelitem`.`creationdate` DESC
LIMIT 0 , 100
Also, I should mention what my db structure looks like with regards to these two tables -- maybe someone can spot something odd about the structure:
CREATE TABLE IF NOT EXISTS `channelitem` (
`newsversion` int(11) NOT NULL DEFAULT '0',
`cid` int(11) NOT NULL DEFAULT '0',
`ciid` int(11) NOT NULL AUTO_INCREMENT,
`description` text CHARACTER SET utf8 COLLATE utf8_unicode_ci,
`url` varchar(222) DEFAULT NULL,
`creationdate` datetime DEFAULT NULL,
`urgent` varchar(10) DEFAULT NULL,
`name` varchar(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`lastchanged` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`author` varchar(255) NOT NULL,
PRIMARY KEY (`ciid`),
KEY `newsversion` (`newsversion`),
KEY `cid` (`cid`),
KEY `creationdate` (`creationdate`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1638554365 ;
CREATE TABLE IF NOT EXISTS `uninet_channelitem_read` (
`ciid` int(11) NOT NULL,
`uid` int(11) NOT NULL,
`dateRead` datetime NOT NULL,
PRIMARY KEY (`ciid`,`uid`),
KEY `ciid` (`ciid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
It never hurts to try the left outer join version of such a query:
SELECT ci.ciid, ci.cid, ci.name, ci.description, ci.url, ci.creationdate, ci.author
FROM `channelitem` ci left outer join
(SELECT ciid
FROM `uninet_channelitem_read`
WHERE uid = '1030'
) cr
on ci.ciid = cr.ciid
where cr.ciid is null and
ci.cid in (117, 308, 310)
ORDER BY ci.`creationdate` DESC
LIMIT 0 , 100
This query will be faster with an index on uninet_channelitem_read(ciid) and probably on channelitem(cid, ciid, creationdate).
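As a sketch, the composite index on channelitem could be created as follows. The index name is made up, and the date column is called creationdate in the posted schema. According to that schema, uninet_channelitem_read already has a key on ciid, so only the existing indexes are listed for that table:

```sql
-- Hypothetical index name; columns as suggested in the answer.
CREATE INDEX idx_cid_ciid_creationdate
  ON channelitem (cid, ciid, creationdate);

-- List the indexes already present on the "read" table.
SHOW INDEX FROM uninet_channelitem_read;
```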
The problem could be that you need to create an index on the channelitem table for the column creationdate. Indexes help a database to run queries faster. Here is a link about MySQL Indexing

SQL query that fries the server

I'm having a really weird issue that overloads my MySQL server. From my point of view (which is surely wrong), the query is pretty trivial.
I have a table that stores PBX events, and I try to get the last event for every agent to see his/her status whenever my application is restarted.
Whenever I launch it, the server goes up to 99% CPU and takes about 5 minutes to recover by itself.
It seems that's because of the number of records: more than 100,000.
The table is as follows:
CREATE TABLE IF NOT EXISTS `eventos_centralita` (
`idEvent` int(11) NOT NULL AUTO_INCREMENT,
`fechaHora` datetime NOT NULL,
`codAgente` varchar(8) DEFAULT NULL,
`extension` varchar(20) DEFAULT NULL,
`evento` varchar(45) DEFAULT NULL,
PRIMARY KEY (`idEvent`),
KEY `codAgente` (`codAgente`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=105847 ;
And the query is as follows:
SELECT a.* FROM eventos_centralita a
LEFT JOIN eventos_centralita b ON b.codAgente = a.codAgente AND b.fechaHora > a.fechaHora
GROUP BY a.codAgente
I've tried to limit it by date, but no luck: the query then returns nothing. How can I improve the query to avoid this?
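Please try the query below. The derived table finds the latest fechaHora for each agent, and the join then fetches the full row for that timestamp (if two events for one agent share the same latest timestamp, both are returned):
SELECT a.*
FROM eventos_centralita a
INNER JOIN
(
SELECT codAgente, MAX(fechaHora) AS maxFechaHora
FROM eventos_centralita
GROUP BY codAgente
) as b
ON a.codAgente = b.codAgente
AND a.fechaHora = b.maxFechaHora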
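Both the self-join in the question and a per-agent lookup of the latest fechaHora compare rows by codAgente and fechaHora, while the table as posted only has an index on codAgente. A composite index along these lines may help; the index name is made up, and the benefit should be confirmed with EXPLAIN:

```sql
-- Hypothetical index name. With (codAgente, fechaHora) indexed together,
-- the latest fechaHora per agent can be read from the index
-- instead of scanning every event for that agent.
ALTER TABLE eventos_centralita
  ADD INDEX idx_agente_fecha (codAgente, fechaHora);
```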