I am in the middle of my first data science project and I am having some trouble with extremely slow queries using MySQL Workbench.
These are my 3 tables (each of which comes from data sets from various websites that have been cleaned up and inserted into MySQL):
CREATE TABLE IF NOT EXISTS `starbucks` (
`STORE_NUMBER` varchar(20) NOT NULL,
`CITY` varchar(50) NOT NULL,
`STATE` char(2) NOT NULL,
`ZIPCODE` char(5) NOT NULL,
`LONG` varchar(10) NOT NULL,
`LAT` varchar(10) NOT NULL,
PRIMARY KEY (`STORE_NUMBER`)
)ENGINE=InnoDB")
CREATE TABLE IF NOT EXISTS `income`(
`STATEFIPS` char(2) NOT NULL,
`STATE` char(2) NOT NULL,
`ZIPCODE` char(5) NOT NULL,
`AGI_STUB` tinyint NOT NULL,
`NUM_RETURNS` float(15,4) NOT NULL,
`TOTAL_INCOME` float(15,4) NOT NULL,
PRIMARY KEY (`STATE`, `ZIPCODE`, `AGI_STUB`)
)ENGINE=InnoDB")
CREATE TABLE IF NOT EXISTS `diversity`(
`COUNTY` varchar(50) NOT NULL,
`STATE` char(2) NOT NULL,
`INDEX` float(7,6) NOT NULL,
`1` float(3,1) NOT NULL,
`2` float(3,1) NOT NULL,
`3` float(3,1) NOT NULL,
`4` float(3,1) NOT NULL,
`5` float(3,1) NOT NULL,
`6` float(3,1) NOT NULL,
`7` float(3,1) NOT NULL,
PRIMARY KEY (`COUNTY`, `STATE`)
)ENGINE=InnoDB")
starbucks has 13,608 records,
income has 166,740 records,
diversity has 3,143 records.
The query I am trying to run:
SELECT i.TOTAL_INCOME,
CASE
WHEN s.STORE_NUMBER IS NOT NULL THEN 1
ELSE 0
END AS has_starbucks
FROM income as i
LEFT OUTER JOIN starbucks as s
ON i.ZIPCODE = s.ZIPCODE
If I limit the result to 1,000 rows it will run fast, however I need to get ALL of the records (no row limit) and this is resulting in the query never returning, and eventually timing out and disconnecting me from the MySQL server.
In the past when working for companies on their databases with MILLIONS of records in them I have never had this much trouble.
What table optimizations do I need to do to fix this? What MySQL settings do I need to change? Any other suggestions welcome.
EDIT
It appears that the 'Duration' of the query never exceeds 0.500 seconds, it is the 'Fetch' part that lasts > 120 seconds. I am not sure if this is useful information.
The first issue is create a proper index on join columns
CREATE INDEX idx1 ON starbucks (ZIPCODE );
CREATE INDEX idx2 ON income (ZIPCODE );
or a verbous index adding the column you select
CREATE INDEX idx2 ON income (ZIPCODE , TOTAL_INCOME);
and check the behavior using explain plan
This won't do much to fix your performance issue but will fix your duplicate row issue -
SELECT i.TOTAL_INCOME, 1 AS has_starbucks
FROM income as i
WHERE i.Zipcode in (Select zipcode from Starbucks)
UNION
SELECT i.TOTAL_INCOME, 0 AS has_starbucks
FROM income as i
WHERE i.Zipcode not in (Select zipcode from Starbucks)
EXISTS is sometimes more efficient than IN
SELECT i.TOTAL_INCOME, 1 AS has_starbucks
FROM income as i
WHERE EXISTS
( SELECT 1
FROM Starbucks s
WHERE s.zipcode = i.Zipcode
)
UNION
SELECT i.TOTAL_INCOME, 0 AS has_starbucks
FROM income as i
WHERE NOT EXISTS
( SELECT 1
FROM Starbucks s
WHERE s.zipcode = i.Zipcode
)
Related
I'm trying to optimize my query speed as much as possible. A side problem is that I cannot see the exact query speed, because it is rounded to a whole second. The query does get the expected result and takes about 1 second. The final query should be extended even more and for this reason i am trying to improve it. How can this query be improved?
The database is constructed as an electricity utility company. The query should eventually calculate an invoice. I basically have 4 tables, APX price, powerdeals, powerload, eans_power.
APX price is an hourly price, powerload is a quarterly hour volume. First step is joining these two together for each quarter of an hour.
Second step is that I currently select the EAN that is indicated in the table eans_power.
Finally I will join the Powerdeals that currently consist only of a single line and indicates from which hour, until which hour and weekday from/until it should be applicable. It consist of an hourly volume and price. Currently it is only joined on the hours, but it will be extended to weekdays as well.
MYSQL Query:
SELECT l.DATE, l.PERIOD_FROM, a.PRICE, l.POWERLOAD,
SUM(a.PRICE*l.POWERLOAD), SUM(d.hourly_volume/4)
FROM timeseries.powerload l
INNER JOIN timeseries.apxprice a ON l.DATE = a.DATE
INNER JOIN contracts.eans_power c ON l.ean = c.ean
LEFT OUTER JOIN timeseries.powerdeals d ON d.period_from <= l.period_from
AND d.period_until >= l.period_until
WHERE l.PERIOD_FROM >= a.PERIOD_FROM
AND l.PERIOD_FROM < a.PERIOD_UNTIL
AND l.DATE >= '2018-01-01'
AND l.DATE <= '2018-12-31'
GROUP BY l.date
Explain:
1 SIMPLE c NULL system PRIMARY,ean NULL NULL NULL 1 100.00 Using temporary; Using filesort
1 SIMPLE l NULL ref EAN EAN 21 const 35481 11.11 Using index condition
1 SIMPLE d NULL ALL NULL NULL NULL NULL 1 100.00 Using where; Using join buffer (Block Nested Loop)
1 SIMPLE a NULL ref DATE DATE 4 timeseries.l.date 24 11.11 Using index condition
Create table queries:
apxprice
CREATE TABLE `apxprice` (
`apx_id` int(11) NOT NULL AUTO_INCREMENT,
`date` date DEFAULT NULL,
`period_from` time DEFAULT NULL,
`period_until` time DEFAULT NULL,
`price` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`apx_id`),
KEY `DATE` (`date`,`period_from`,`period_until`)
) ENGINE=MyISAM AUTO_INCREMENT=29664 DEFAULT CHARSET=latin1
powerdeals
CREATE TABLE `powerdeals` (
`deal_id` int(11) NOT NULL AUTO_INCREMENT,
`date_deal` date NOT NULL,
`start_date` date NOT NULL,
`end_date` date NOT NULL,
`weekday_from` int(11) NOT NULL,
`weekday_until` int(11) NOT NULL,
`period_from` time NOT NULL,
`period_until` time NOT NULL,
`hourly_volume` int(11) NOT NULL,
`price` int(11) NOT NULL,
`type_deal_id` int(11) NOT NULL,
`contract_id` int(11) NOT NULL,
PRIMARY KEY (`deal_id`)
) ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=latin1
powerload
CREATE TABLE `powerload` (
`powerload_id` int(11) NOT NULL AUTO_INCREMENT,
`ean` varchar(18) DEFAULT NULL,
`date` date DEFAULT NULL,
`period_from` time DEFAULT NULL,
`period_until` time DEFAULT NULL,
`powerload` int(11) DEFAULT NULL,
PRIMARY KEY (`powerload_id`),
KEY `EAN` (`ean`,`date`,`period_from`,`period_until`)
) ENGINE=MyISAM AUTO_INCREMENT=61039 DEFAULT CHARSET=latin1
eans_power
CREATE TABLE `eans_power` (
`ean` char(19) NOT NULL,
`contract_id` int(11) NOT NULL,
`invoicing_id` int(11) NOT NULL,
`street` varchar(255) NOT NULL,
`number` int(11) NOT NULL,
`affix` char(11) NOT NULL,
`postal` char(6) NOT NULL,
`city` varchar(255) NOT NULL,
PRIMARY KEY (`ean`),
KEY `ean` (`ean`,`contract_id`,`invoicing_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Sample data tables
apx_prices
apx_id,date,period_from,period_until,price
1,2016-01-01,00:00:00,01:00:00,23.86
2,2016-01-01,01:00:00,02:00:00,22.39
powerdeals
deal_id,date_deal,start_date,end_date,weekday_from,weekday_until,period_from,period_until,hourly_volume,price,type_deal_id,contract_id
1,2019-05-15,2018-01-01,2018-12-31,1,5,08:00:00,20:00:00,1000,50,3,1
powerload
powerload_id,ean,date,period_from,period_until,powerload
1,871688520000xxxxxx,2018-01-01,00:00:00,00:15:00,9
2,871688520000xxxxxx,2018-01-01,00:15:00,00:30:00,11
eans_power
ean,contract_id,invoicing_id,street,number,affix,postal,city
871688520000xxxxxx,1,1,road,14,postal,city
Result, without sum() and group by:
DATE,PERIOD_FROM,PRICE,POWERLOAD,a.PRICE*l.POWERLOAD,d.hourly_volume/4,
2018-01-01,00:00:00,27.20,9,244.80,NULL
2018-01-01,00:15:00,27.20,11,299.20,NULL
Result, with sum() and group by:
DATE, PERIOD_FROM, PRICE, POWERLOAD, SUM(a.PRICE*l.POWERLOAD), SUM(d.hourly_volume/4)
2018-01-01,08:00:00,26.33,21,46193.84,12250.0000
2018-01-02, 08:00:00,47.95,43,90623.98,12250.0000
Preliminary optimizations:
Use InnoDB, not MyISAM.
Use CHAR only for constant-lenght strings
Use consistent datatypes (see ean, for example)
For an alternative to using time-to-the-second, check out the Handler counts .
Because range tests (such as l.PERIOD_FROM >= a.PERIOD_FROM AND l.PERIOD_FROM < a.PERIOD_UNTIL) are essentially impossible to optimize, I recommend you expand the table to have one entry per hour (or 1 per quarter hour, if necessary). Looking up a row via a key is much faster than doing a scan of "ALL" the table. 9K rows for an entire year is trivial.
When you get past these recommendations (and the Comments), I will have more tips on optimizing the indexes, especially InnoDB's PRIMARY KEY.
How can I proceed to make my response time more faster, approximately the average time of response is 0.2s ( 8039 records in my items table & 81 records in my tracking table )
Query
SELECT a.name, b.cnt FROM `items` a LEFT JOIN
(SELECT guid, COUNT(*) cnt FROM tracking WHERE
date > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day ) GROUP BY guid) b ON
a.`id` = b.guid WHERE a.`type` = 'streaming' AND a.`state` = 1
ORDER BY b.cnt DESC LIMIT 15 OFFSET 75
Tracking table structure
CREATE TABLE `tracking` (
`id` bigint(11) NOT NULL AUTO_INCREMENT,
`guid` int(11) DEFAULT NULL,
`ip` int(11) NOT NULL,
`date` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `i1` (`ip`,`guid`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=4303 DEFAULT CHARSET=latin1;
Items table structure
CREATE TABLE `items` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`guid` int(11) DEFAULT NULL,
`type` varchar(255) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`embed` varchar(255) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`description` text,
`tags` varchar(255) DEFAULT NULL,
`date` int(11) DEFAULT NULL,
`vote_val_total` float DEFAULT '0',
`vote_total` float(11,0) DEFAULT '0',
`rate` float DEFAULT '0',
`icon` text CHARACTER SET ascii,
`state` int(11) DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=9258 DEFAULT CHARSET=latin1;
Your query, as written, doesn't make much sense. It produces all possible combinations of rows in your two tables and then groups them.
You may want this:
SELECT a.*, b.cnt
FROM `items` a
LEFT JOIN (
SELECT guid, COUNT(*) cnt
FROM tracking
WHERE `date` > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day)
GROUP BY guid
) b ON a.guid = b.guid
ORDER BY b.cnt DESC
The high-volume data in this query come from the relatively large tracking table. So, you should add a compound index to it, using the columns (date, guid). This will allow your query to random-access the index by date and then scan it for guid values.
ALTER TABLE tracking ADD INDEX guid_summary (`date`, guid);
I suppose you'll see a nice performance improvement.
Pro tip: Don't use SELECT *. Instead, give a list of the columns you want in your result set. For example,
SELECT a.guid, a.name, a.description, b.cnt
Why is this important?
First, it makes your software more resilient against somebody adding columns to your tables in the future.
Second, it tells the MySQL server to sling around only the information you want. That can improve performance really dramatically, especially when your tables get big.
Since tracking has significantly fewer rows than items, I will propose the following.
SELECT i.name, c.cnt
FROM
(
SELECT guid, COUNT(*) cnt
FROM tracking
WHERE date > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day )
GROUP BY guid
) AS c
JOIN items AS i ON i.id = c.guid
WHERE i.type = 'streaming'
AND i.state = 1;
ORDER BY c.cnt DESC
LIMIT 15 OFFSET 75
It will fail to display any items for which cnt is 0. (Your version displays the items with NULL for the count.)
Composite indexes needed:
items: The PRIMARY KEY(id) is sufficient.
tracking: INDEX(date, guid) -- "covering"
Other issues:
If ip is an IP-address, it needs to be INT UNSIGNED. But that covers only IPv4, not IPv6.
It seems like date is not just a "date", but really a date+time. Please rename it to avoid confusion.
float(11,0) -- Don't use FLOAT for integers. Don't use (m,n) on FLOAT or DOUBLE. INT UNSIGNED makes more sense here.
OFFSET is naughty when it comes to performance -- it must scan over the skipped records. But, in your query, there is no way to avoid collecting all the possible rows, sorting them, stepping over 75, and only finally delivering 15 rows. (And, with no more than 81, it won't be a full 15.)
What version are you using? There have been important changes to the Optimization of LEFT JOIN ( SELECT ... ). Please provide EXPLAIN SELECT for each query under discussion.
I have a query that is executed in 35s, which is waaaaay too long.
Here are the 3 tables concerned by the query (each table is approx. 13000 lines long, and should be much longer in the future) :
Table 1 : Domains
CREATE TABLE IF NOT EXISTS `domain` (
`id_domain` int(11) NOT NULL AUTO_INCREMENT,
`domain_domain` varchar(255) NOT NULL,
`projet_domain` int(11) NOT NULL,
`date_crea_domain` int(11) NOT NULL,
`date_expi_domain` int(11) NOT NULL,
`active_domain` tinyint(1) NOT NULL,
`remarques_domain` text NOT NULL,
PRIMARY KEY (`id_domain`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Table 2 : Keywords
CREATE TABLE IF NOT EXISTS `kw` (
`id_kw` int(11) NOT NULL AUTO_INCREMENT,
`kw_kw` varchar(255) NOT NULL,
`clics_kw` int(11) NOT NULL,
`cpc_kw` float(11,3) NOT NULL,
`date_kw` int(11) NOT NULL,
PRIMARY KEY (`id_kw`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Table 3 : Linking between domain and keyword
CREATE TABLE IF NOT EXISTS `kw_domain` (
`id_kd` int(11) NOT NULL AUTO_INCREMENT,
`kw_kd` int(11) NOT NULL,
`domain_kd` int(11) NOT NULL,
`selected_kd` tinyint(1) NOT NULL,
PRIMARY KEY (`id_kd`),
KEY `kw_to_domain` (`kw_kd`,`domain_kd`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The query is as follows :
SELECT ng.*, kd.*, kg.*
FROM domain ng
LEFT JOIN kw_domain kd ON kd.domain_kd = ng.id_domain
LEFT JOIN kw kg ON kg.id_kw = kd.kw_kd
GROUP BY ng.id_domain
ORDER BY kd.selected_kd DESC, kd.id_kd DESC
Basically, it selects all domains, with, for each one of these domains, the last associated keyword.
Does anyone have an idea on how to optimize the tables or the query ?
The following will get the last keyword, according to your logic:
select ng.*,
(select kw_kd
from kw_domain kd
where kd.domain_kd = ng.id_domain and kd.selected_kd = 1
order by kd.id_kd desc
limit 1
) as kw_kd
from domain ng;
For performance, you want an index on kw_domain(domain_kd, selected_kd, kw_kd). In this case, the order of the fields matters.
You can use this as a subquery to get more information about the keyword:
select ng.*, kg.*
from (select ng.*,
(select kw_kd
from kw_domain kd
where kd.domain_kd = ng.id_domain and kd.selected_kd = 1
order by kd.id_kd desc
limit 1
) as kw_kd
from domain ng
) ng left join
kw kg
on kg.id_kw = ng.kw_kd;
In MySQL, group by can have poor performance, so this might work better, particularly with the right indexes.
I want to subtract between two rows of different table:
I have created a view called leave_taken and table called leave_balance.
I want this result from both table:
leave_taken.COUNT(*) - leave_balance.balance
and group by leave_type_id_leave_type
Code of both table
-----------------View Leave_Taken-----------
CREATE ALGORITHM = UNDEFINED DEFINER=`1`#`localhost` SQL SECURITY DEFINER
VIEW `leave_taken`
AS
select
`leave`.`staff_leave_application_staff_id_staff` AS `staff_leave_application_staff_id_staff`,
`leave`.`leave_type_id_leave_type` AS `leave_type_id_leave_type`,
count(0) AS `COUNT(*)`
from
(
`leave`
join `staff` on((`staff`.`id_staff` = `leave`.`staff_leave_application_staff_id_staff`))
)
where (`leave`.`active` = 1)
group by `leave`.`leave_type_id_leave_type`;
----------------Table leave_balance----------
CREATE TABLE IF NOT EXISTS `leave_balance` (
`id_leave_balance` int(11) NOT NULL AUTO_INCREMENT,
`staff_id_staff` int(11) NOT NULL,
`leave_type_id_leave_type` int(11) NOT NULL,
`balance` int(3) NOT NULL,
`date_added` date NOT NULL,
PRIMARY KEY (`id_leave_balance`),
UNIQUE KEY `id_leave_balance_UNIQUE` (`id_leave_balance`),
KEY `fk_leave_balance_staff1` (`staff_id_staff`),
KEY `fk_leave_balance_leave_type1` (`leave_type_id_leave_type`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;
------- Table leave ----------
CREATE TABLE IF NOT EXISTS `leave` (
`id_leave` int(11) NOT NULL AUTO_INCREMENT,
`staff_leave_application_id_staff_leave_application` int(11) NOT NULL,
`staff_leave_application_staff_id_staff` int(11) NOT NULL,
`leave_type_id_leave_type` int(11) NOT NULL,
`date` date NOT NULL,
`active` int(11) NOT NULL DEFAULT '1',
`date_updated` date NOT NULL,
PRIMARY KEY (`id_leave`,`staff_leave_application_id_staff_leave_application`,`staff_leave_application_staff_id_staff`),
KEY `fk_table1_leave_type1` (`leave_type_id_leave_type`),
KEY `fk_table1_staff_leave_application1` (`staff_leave_application_id_staff_leave_application`,`staff_leave_application_staff_id_staff`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=32 ;
Well, I still don't think you've provided enough information. It would be very helpful to have some sample data and your expected output (in tabular format). That said, I may have something you can start working with. This query finds all staff members, calculates their current leave (grouped by type), and determines the difference between that and their balance by leave type. Take a look at it, and more importantly (perhaps) the sqlfiddle here that I used which has the sample data in it (very important to determining if this is the correct path for your data).
SELECT
staff.id_staff,
staff.name,
COUNT(`leave`.id_leave) AS leave_count,
leave_balance.balance,
(COUNT(`leave`.id_leave) - leave_balance.balance) AS leave_difference,
`leave`.leave_type_id_leave_type AS leave_type
FROM
staff
JOIN `leave` ON staff.id_staff = `leave`.staff_leave_application_staff_id_staff
JOIN leave_balance ON
(
staff.id_staff = leave_balance.staff_id_staff
AND `leave`.leave_type_id_leave_type = leave_balance.leave_type_id_leave_type
)
WHERE
`leave`.active = 1
GROUP BY
staff.id_staff, leave_type;
Good luck!
I have a table with 40,000,000 rows, and I'm trying to optimize my query, because takes too long.
First, this is my table:
CREATE TABLE resume (
yearMonth char(6) DEFAULT NULL,
type char(1) DEFAULT NULL,
agen_o char(5) DEFAULT NULL,
tar char(2) DEFAULT NULL,
cve_ent char(1) DEFAULT NULL,
cve_mun char(3) DEFAULT NULL,
cve_reg int(1) DEFAULT NULL,
id_ope char(1) DEFAULT NULL,
ope_tip char(2) DEFAULT NULL,
ope_cve char(3) DEFAULT NULL,
cob_u int(9) DEFAULT NULL,
tot_imp bigint(15) DEFAULT NULL,
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This is my query:
SELECT m.name_ope AS cve,
SUBSTRING(r.yearMonth,5,2) AS period,
COUNT(DISTINCT(CONCAT(r.agen_ope,r.cve_ope))) AS num,
SUM(CASE WHEN r.type='A' THEN r.cob_u ELSE 0 END) AS tot_u,
FROM resume r, media m
WHERE CONCAT(r.id_ope,SUBSTRING(r.ope_cve,3,1))=m.ope_cve AND
r.type IN ('C','D','E') AND
SUBSTRING(r.yearMonth,1,4)='2012' AND
r.id_ope='X' AND
SUBSTRING(r.ope_cve,1,2) IN (SELECT cve_med FROM catNac WHERE numero='0')
GROUP BY SUBSTRING(r.yearMonth,5,2),SUBSTRING(r.ope_cve,3,1)
ORDER BY SUBSTRING(r.yearMonth,5,2),SUBSTRING(r.ope_cve,3,1)
So, I added an index with these fields: id_ope, yearMonth, agen_o, because I have other's queries that have this fields in WHERE, with this order
Now my explain output:
1 PRIMARY r ref indice indice 2 const 14774607 Using where; Using filesort
So i added another index with yearMonth, ope_cve, but I still have "using filesort". How can I optimize this?
Thanks
Without modifying your table structure, if you have an index on yearMonth, you can try this:
SELECT m.name_ope AS cve,
SUBSTRING(r.yearMonth,5,2) AS period,
COUNT(DISTINCT(CONCAT(r.agen_ope,r.cve_ope))) AS num,
SUM(CASE WHEN r.type='A' THEN r.cob_u ELSE 0 END) AS tot_u,
FROM resume r, media m
WHERE CONCAT(r.id_ope,SUBSTRING(r.ope_cve,3,1))=m.ope_cve AND
r.type IN ('C','D','E') AND
r.yearMonth LIKE '2012%' AND
r.id_ope='X' AND
SUBSTRING(r.ope_cve,1,2) IN (SELECT cve_med FROM catNac WHERE numero='0')
GROUP BY r.yearMonth,SUBSTRING(r.ope_cve,3,1)
The changes:
Using r.yearMonth LIKE '2012%' should allow an index to be used for that part of your where clause.
Since you're already filtering out every year but 2012, you can group by GROUP BY r.yearMonth alone.
The ORDER BY clause is not needed since MySQL sorts on GROUP BY, unless you include ORDER BY NULL