Difficult MySQL Update Query with Self-Join - mysql

Our website has listings. We use a connections table with the following structure to connect our members with these various listings:
CREATE TABLE `connections` (
`cid1` int(9) unsigned NOT NULL DEFAULT '0',
`cid2` int(9) unsigned NOT NULL DEFAULT '0',
`type` char(2) NOT NULL,
`created` datetime DEFAULT NULL,
`updated` datetime DEFAULT NULL,
PRIMARY KEY (`cid1`,`cid2`,`type`,`cid3`),
KEY `cid1` (`cid1`,`type`),
KEY `cid2` (`cid2`,`type`)
);
The problem we've run into is when we have to combine duplicate listings from time to time we also need to update our member connections and have been using the following query which breaks if a member is connected to both listings:
update connections set cid2=100000
where type IN ('MC','MT','MW') AND cid2=100001;
What I can't figure out is how to do the following which would solve this issue:
update connections set cid2=100000
where type IN ('MC','MT','MW') AND cid2=100001 AND cid1 NOT IN (
select cid1 from connections
where type IN ('MC','MT','MW') AND cid2=100000
);
When I try to run that query I get the following error:
ERROR 1093 (HY000): You can't specify target table 'connections' for update in FROM clause
Here is some sample data. Notice the update conflict for cid1 = 10025925
+----------+--------+------+---------------------+---------------------+
| cid1 | cid2 | type | created | updated |
+----------+--------+------+---------------------+---------------------+
| 10010388 | 100000 | MC | 2010-08-05 18:04:51 | 2011-06-16 16:26:17 |
| 10025925 | 100000 | MC | 2010-10-31 09:21:25 | 2010-10-31 16:21:25 |
| 10027662 | 100000 | MC | 2011-06-13 16:31:12 | NULL |
| 10038375 | 100000 | MW | 2011-02-05 05:32:35 | 2011-02-05 19:51:58 |
| 10065771 | 100000 | MW | 2011-04-24 17:06:35 | NULL |
| 10025925 | 100001 | MC | 2010-10-31 09:21:45 | 2010-10-31 16:21:45 |
| 10034884 | 100001 | MC | 2011-01-20 18:54:51 | NULL |
| 10038375 | 100001 | MC | 2011-02-04 05:00:35 | NULL |
| 10041989 | 100001 | MC | 2011-02-26 09:33:18 | NULL |
| 10038259 | 100001 | MC | 2011-05-07 13:34:20 | NULL |
| 10027662 | 100001 | MC | 2011-06-13 16:33:54 | NULL |
| 10030855 | 100001 | MT | 2010-12-31 20:40:18 | NULL |
| 10038375 | 100001 | MT | 2011-02-04 05:00:36 | NULL |
+----------+--------+------+---------------------+---------------------+
I'm hoping that someone can suggest the right way to run the above query. Thanks in advance!

The reason for the error in your query is because in MySQL you cannot SELECT from the table you are trying to UPDATE in the same query.
Use UPDATE IGNORE to avoid duplicate conflicts.
I think you should try reading INSERT ON DUPLICATE KEY. The idea is that you frame an INSERT query that always create a DUPLICATE conflict and then the UPDATE part will do its part.

One possible way would be to use a temporary table for your subquery then select from the temp table. Efficiency could break down fast if you need to execute a lot of these queires though.
create temporary table subq as select cid1 from connections where type IN ('MC','MT','MW') AND cid2=100000
And
update connections set cid2=100000 where type IN ('MC','MT','MW') AND cid2=100001 AND cid1 NOT IN (select cid1 from subq);

UPDATE connections cn1
LEFT JOIN connections cn2 ON cn1.cid1 != cn2.cid1
AND cn2.type IN ('MC','MT','MW')
AND cn2.cid2=100000
SET cn1.cid2=100000
WHERE cn1.TYPE IN ('MC','MT','MW')
AND cn1.cid2=100001
AND cn2.cid1 IS NULL -- i.e. there is no matching record

I was thinking something like the following, but I'm not 100% sure how your data looks before and after to determine accuracy. The idea is to join the table to itself on your subquery's where clause and the exclusion where cid1 must not match.
update connections c1 left outer join connections c2
on (c2.cid2 = 100000 and c2.type in ('MC','MT','MW') and c1.cid1 != c2.cid1)
set c1.cid2 = 100000
where c1.type in ('MC', 'MT', 'MW') and c1.cid2=100001 and c2.cid1 is null;
As near as I can tell, it'll work. I used your create table (minus the cid3 in the primary key) and made sure I had 2 rows with the same cid1 and different cid2 (one 100000 and the other as 100001) and that the statement only affected 1 row.

Related

Create a view and don't let the database update mysql?

create table Branch
(
BranchNo char(4),
Street varchar(30),
City varchar(30),
PostCode varchar(10)
)
INSERT INTO BRANCH
VALUES ('B002', '55 cOVER', 'LONDON',NULL)
INSERT INTO BRANCH
VALUES ('B003', '163 Main Street', 'Glasgow',NULL)
INSERT INTO BRANCH
VALUES ('B004', '32 Manse Road', 'Bristol',NULL)
INSERT INTO BRANCH
VALUES ('B005', '22 Dear Road', 'LONDON',NULL)
INSERT INTO BRANCH
VALUES ('B007', '16 Argyll', 'Abend',NULL)
Create a view named ViewDeC that displays information of all branches. Must say
make sure it is not possible to update the data for the branch table (Branch) through this View
Create a view and don't let the database update mysql?
enter image description here
If I am not mistaken, this is about how to create a readonly view. Though MySQL does not support creating a view with readonly attribute DIRECTLY, certain things can be done to make the view READONLY. One workaround is to make the view through joined tables.
create view ViewDeC as
select BranchNo,Street,City,PostCode
from Branch
join (select 1) t;
select * from ViewDec;
INSERT INTO ViewDec
VALUES ('B009', '99 Argyll', 'bender',NULL);
-- Error Code: 1471. The target table ViewDec of the INSERT is not insertable-into
Note, this is implemented at the cost of some performance, but not terribly unbearable. I have a table with 1.4 million rows. Here is the test with and without join using a table scan as the access method.
select * from proctable;
-- 1429158 rows in set (1.26 sec)
select * from proctable join (select 1) t;
-- 1429158 rows in set (1.40 sec)
However, for an index lookup access method, this is almost non-existent.
select * from proctable join (select 1) t where id between 100 and 500;
-- 401 rows in set (0.00 sec)
explain select * from proctable join (select 1) t where id between 100 and 500;
+----+-------------+------------+------------+--------+---------------+---------+---------+------+------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+--------+---------------+---------+---------+------+------+----------+----------------+
| 1 | PRIMARY | <derived2> | NULL | system | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
| 1 | PRIMARY | proctable | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 401 | 100.00 | Using where |
| 2 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+------------+------------+--------+---------------+---------+---------+------+------+----------+----------------+

MySQL 8 is not using INDEX when subquery has a group column

We have just moved from mariadb 5.5 to MySQL 8 and some of the update queries have suddenly become slow. On more investigation, we found that MySQL 8 does not use index when the subquery has group column.
For example, below is a sample database. Table users maintain the current balance of the users per type and table 'accounts' maintain the total balance history per day.
CREATE DATABASE 'test';
CREATE TABLE `users` (
`uid` int(10) unsigned NOT NULL DEFAULT '0',
`balance` int(10) unsigned NOT NULL DEFAULT '0',
`type` int(10) unsigned NOT NULL DEFAULT '0',
KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `accounts` (
`uid` int(10) unsigned NOT NULL AUTO_INCREMENT,
`balance` int(10) unsigned NOT NULL DEFAULT '0',
`day` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`uid`),
KEY `day` (`day`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Below is a explanation for the query to update accounts
mysql> explain update accounts a inner join (
select uid, sum(balance) balance, day(current_date()) day from users) r
on r.uid=a.uid and r.day=a.day set a.balance=r.balance;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| 1 | UPDATE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | no matching row in const table |
| 2 | DERIVED | users | NULL | ALL | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
2 rows in set, 1 warning (0.00 sec)
As you can see, mysql is not using index.
On more investigation, I found that if I remove sum() from the subquery, it starts using index. However, that's not the case with mariadb 5.5 which was correctly using the index in all the case.
Below are two select queries with and without sum(). I've used select query to cross check with mariadb 5.5 since 5.5 does not have explanation for update queries.
mysql> explain select * from accounts a inner join (
select uid, balance, day(current_date()) day from users
) r on r.uid=a.uid and r.day=a.day ;
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------+
| 1 | SIMPLE | a | NULL | ref | PRIMARY,day | day | 4 | const | 1 | 100.00 | NULL |
| 1 | SIMPLE | users | NULL | eq_ref | PRIMARY | PRIMARY | 4 | test.a.uid | 1 | 100.00 | NULL |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------+
2 rows in set, 1 warning (0.00 sec)
and with sum()
mysql> explain select * from accounts a inner join (
select uid, sum(balance) balance, day(current_date()) day from users
) r on r.uid=a.uid and r.day=a.day ;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | no matching row in const table |
| 2 | DERIVED | users | NULL | ALL | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
2 rows in set, 1 warning (0.00 sec)
Below is output from mariadb 5.5
MariaDB [test]> explain select * from accounts a inner join (
select uid, sum(balance) balance, day(current_date()) day from users
) r on r.uid=a.uid and r.day=a.day ;
+------+-------------+------------+------+---------------+------+---------+-----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+------+---------------+------+---------+-----------------------+------+-------------+
| 1 | PRIMARY | a | ALL | PRIMARY,day | NULL | NULL | NULL | 1 | |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | 10 | test.a.uid,test.a.day | 2 | Using where |
| 2 | DERIVED | users | ALL | NULL | NULL | NULL | NULL | 1 | |
+------+-------------+------------+------+---------------+------+---------+-----------------------+------+-------------+
3 rows in set (0.00 sec)
Any idea what are we doing wrong?
As others have commented, break your update query apart...
update accounts join
then your query
on condition of the join.
Your inner select query of
select uid, sum(balance) balance, day(current_date()) day from users
is the only thing that is running, getting some ID and the sum of all balances and whatever the current day. You never know which user is getting updated, let alone the correct amount. Start by getting your query to see your expected results per user ID. Although the context does not make sense that your users table has a "uid", but no primary key thus IMPLYING there is multiple records for the same "uid". The accounts (to me) implies ex: I am a bank representative and sign up multiple user accounts. Thus my active portfolio of client balances on a given day is the sum from users table.
Having said that, lets look at getting that answer
select
u.uid,
sum( u.balance ) allUserBalance
from
users u
group by
u.uid
This will show you per user what their total balance is as of right now. The group by now gives you the "ID" key to tie back to the accounts table. In MySQL, the syntax of a correlated update for this scenario would be... (I am using above query and giving alias "PQ" for PreQuery for the join)
update accounts a
JOIN
( select
u.uid,
sum( u.balance ) allUserBalance
from
users u
group by
u.uid ) PQ
-- NOW, the JOIN ON clause ties the Accounts ID to the SUM TOTALS per UID balance
on a.uid = PQ.uid
-- NOW you can SET the values
set Balance = PQ.allUserBalance,
Day = day( current_date())
Now, the above will not give a proper answer if you have accounts that no longer have user entries associated... such as all users get out. So, whatever accounts have no users, their balance and day record will be as of some prior day. To fix this, you could to a LEFT-JOIN such as
update accounts a
LEFT JOIN
( select
u.uid,
sum( u.balance ) allUserBalance
from
users u
group by
u.uid ) PQ
-- NOW, the JOIN ON clause ties the Accounts ID to the SUM TOTALS per UID balance
on a.uid = PQ.uid
-- NOW you can SET the values
set Balance = coalesce( PQ.allUserBalance, 0 ),
Day = day( current_date())
With the left-join and COALESCE(), if there is no record summation in the user table, it will set the account balance to zero.

MySQL: Strange behavior of UPDATE query (ERROR 1062 Duplicate entry)

I have a MySQL database the stores news articles with the publications date (just day information), the source, and category. Based on these I want to generate a table that holds the article counts w.r.t. to these 3 parameters.
Since for some combinations of these 3 parameters there might be no article, a simple GROUP BY won't do. I therefore first generate a table news_article_counts with all possible combinations of the 3 parameters, and an default article_count of 0 -- like this:
SELECT * FROM news_article_counts;
+--------------+------------+----------+---------------+
| published_at | source | category | article_count |
+------------- +------------+----------+---------------+
| 2016-08-05 | 1826089206 | 0 | 0 |
| 2016-08-05 | 1826089206 | 1 | 0 |
| 2016-08-05 | 1826089206 | 2 | 0 |
| 2016-08-05 | 1826089206 | 3 | 0 |
| 2016-08-05 | 1826089206 | 4 | 0 |
| ... | ... | ... | ... |
+--------------+------------+----------+---------------+
For testing, I now created a temporary table tmp as the GROUP BY result from the original news article table:
SELECT * FROM tmp LIMIT 6;
+--------------+------------+----------+-----+
| published_at | source | category | cnt |
+--------------+------------+----------+-----+
| 2016-08-05 | 1826089206 | 3 | 1 |
| 2003-09-19 | 1826089206 | 4 | 1 |
| 2005-08-08 | 1826089206 | 3 | 1 |
| 2008-07-22 | 1826089206 | 4 | 1 |
| 2008-11-26 | 1826089206 | 8 | 1 |
| ... | ... | ... | ... |
+--------------+------------+----------+-----+
Given these two tables, the following query works as expected:
SELECT * FROM news_article_counts c, tmp t
WHERE c.published_at = t.published_at AND c.source = t.source AND c.category = t.category;
But now I need to update the article_count of table news_article_counts with the values in table tmp where the 3 parameters match up. For this I'm using the following query (I've tried different ways but with the same results):
UPDATE
news_article_counts c
INNER JOIN
tmp t
ON
c.published_at = t.published_at AND
c.source = t.source AND
c.category = t.category
SET
c.article_count = t.cnt;
Executing this query yields this error:
ERROR 1062 (23000): Duplicate entry '2018-04-07 14:46:17-1826089206-1' for key 'uniqueIndex'
uniqueIndex is a joint index over published_at, source, category of table news_article_counts. But this shouldn't be a problem since I do not -- as far as I can tell -- update any of those 3 values, only article_count.
What confuses me most is that in the error it mentions the timestamp I executed the query (here: 2018-04-07 14:46:17). I have no absolutely idea where this comes into play. In fact, some rows in news_article_counts now have 2018-04-07 14:46:17 as value for published_at. While this explains the error, I cannot see why published_at gets overwritten with the current timestamp. There is no ON UPDATE CURRENT_TIMESTAMP on this column; see:
CREATE TABLE IF NOT EXISTS `test`.`news_article_counts` (
`published_at` TIMESTAMP NOT NULL,
`source` INT UNSIGNED NOT NULL,
`category` INT UNSIGNED NOT NULL,
`article_count` INT UNSIGNED NOT NULL DEFAULT 0,
UNIQUE INDEX `uniqueIndex` (`published_at` ASC, `source` ASC, `category` ASC))
ENGINE = MyISAM
DEFAULT CHARACTER SET = utf8mb4;
What am I missing here?
UPDATE 1: I actually checked the table definition of news_article_counts in the database. And there's indeed the following:
mysql> SHOW COLUMNS FROM news_article_counts;
+---------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+-------------------+-----------------------------+
| published_at | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| source | int(10) unsigned | NO | | NULL | |
| category | int(10) unsigned | NO | | NULL | |
| article_count | int(10) unsigned | NO | | 0 | |
+---------------+------------------+------+-----+-------------------+-----------------------------+
But why is on update CURRENT_TIMESTAMP set. I double and triple-checked my CREATE TABLE statement. I removed the joint index, I added an artificial primary key (auto_increment). Nothing help. I've even tried to explicitly remove these attributes from published_at with:
ALTER TABLE `news_article_counts` CHANGE `published_at` `published_at` TIMESTAMP NOT NULL;
Nothing seems to work for me.
It looks like you have the explicit_defaults_for_timestamp system variable disabled. One of the effects of this is:
The first TIMESTAMP column in a table, if not explicitly declared with the NULL attribute or an explicit DEFAULT or ON UPDATE attribute, is automatically declared with the DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP attributes.
You could try enabling this system variable, but that could potentially impact other applications. I think it only takes effect when you're actually creating a table, so it shouldn't affect any existing tables.
If you don't to make a system-level change like this, you could add an explicit DEFAULT attribute to the published_at column of this table, then it won't automatically add ON UPDATE.

Subquery for faster result

I have this query which takes me more than 117 seconds on a mysql database.
select users.*, users_oauth.* FROM users LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id WHERE (
(MATCH (user_email) AGAINST ('sometext')) OR
(MATCH (user_firstname) AGAINST ('sometext')) OR
(MATCH (user_lastname) AGAINST ('sometext')) )
ORDER BY user_date_accountcreated DESC LIMIT 1400, 50
How can I use a subquery in order to optimize it ?
The 3 fields are fulltext :
ALTER TABLE `users` ADD FULLTEXT KEY `email_fulltext` (`user_email`);
ALTER TABLE `users` ADD FULLTEXT KEY `firstname_fulltext` (`user_firstname`);
ALTER TABLE `users` ADD FULLTEXT KEY `lastname_fulltext` (`user_lastname`);
There is only one search input in a website to search in different table users fields.
If the limit is for example LIMIT 0,50, the query will run in less than 3 seconds but when the LIMIT increase the query becomes very slow.
Thanks.
Use a single FULLTEXT index:
FULLTEXT(user_email, user_firstname, user_lastname)
And change the 3 matches to just one:
MATCH (user_email, user_firstname, user_lastname) AGAINST ('sometext')
Here's another issue: ORDER BY ... DESC LIMIT 1400, 50. Read about the evils of pagination via OFFSET . That has a workaround, but I doubt if it would apply to your statement.
Do you really have thousands of users matching the text? Does someone (other than a search engine robot) really page through 29 pages? Think about whether it makes sense to really have such a long-winded UI.
And a 3rd issue. Consider "lazy eval". That is, find the user ids first, then join back to users and users_oauth to get the rest of the columns. It would be a single SELECT with the MATCH in a derived table, then JOIN to the two tables. If the ORDER BY an LIMIT can be in the derived table, it could be a big win.
Please indicate which table each column belongs to -- my last paragraph is imprecise because of not knowing about the date column.
Update
In your second attempt, you added OR, which greatly slows things down. Let's turn that into a UNION to try to avoid the new slowdown. First let's debug the UNION:
( SELECT * -- no mention of oauth columns
FROM users -- No JOIN
WHERE users.user_id LIKE ...
ORDER BY user_id DESC
LIMIT 0, 50
)
UNION ALL
( SELECT * -- no mention of oauth columns
FROM users
WHERE MATCH ...
ORDER BY user_id DESC
LIMIT 0, 50
)
Test it by timing each SELECT separately. If one of the is still slow, then let's focus on it. Then test the UNION. (This is a case where using the mysql commandline tool may be more convenient than PHP.)
By splitting, each SELECT can use an optimal index. The UNION has some overhead, but possibly less than the inefficiency of OR.
Now let's fold in users_oauth.
First, you seem to be missing a very important INDEX(oauth_user_id). Add that!
Now let's put them together.
SELECT u.*
FROM ( .... the entire union query ... ) AS u
LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id
ORDER BY user_id DESC -- yes, repeat
LIMIT 0, 50 -- yes, repeat
Yes #Rick
I changed the index fulltext to:
ALTER TABLE `users`
ADD FULLTEXT KEY `fulltext_adminsearch` (`user_email`,`user_firstname`,`user_lastname`);
And now there is some php conditions, $_POST['search'] can be empty:
if(!isset($_POST['search'])) {
$searchId = '%' ;
} else {
$searchId = $_POST['search'] ;
}
$searchMatch = '+'.str_replace(' ', ' +', $_POST['search']);
$sqlSearch = $dataBase->prepare(
'SELECT users.*, users_oauth.*
FROM users
LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id
WHERE ( users.user_id LIKE :id OR
(MATCH (user_email, user_firstname, user_lastname)
AGAINST (:match IN BOOLEAN MODE)) )
ORDER BY user_id DESC LIMIT 0,50') ;
$sqlSearch->execute(array('id' => $searchId,
'match' => $searchMatch )) ;
The users_oauth table has a column with user_id:
Table users:
+--------------------------+-----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------+-----------------+------+-----+---------+----------------+
| user_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| user_activation_key | varchar(40) | YES | | NULL | |
| user_email | varchar(40) | NO | UNI | | |
| user_login | varchar(30) | YES | | NULL | |
| user_password | varchar(40) | YES | | NULL | |
| user_firstname | varchar(30) | YES | | NULL | |
| user_lastname | varchar(50) | YES | | NULL | |
| user_lang | varchar(2) | NO | | en
+--------------------------+-----------------+------+-----+---------+----------------+
Table users_oauth:
+----------------------+-----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------------+-----------------+------+-----+---------+----------------+
| oauth_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| oauth_user_id | int(8) unsigned | NO | | NULL | |
| oauth_google_id | varchar(30) | YES | UNI | NULL | |
| oauth_facebook_id | varchar(30) | YES | UNI | NULL | |
| oauth_windowslive_id | varchar(30) | YES | UNI | NULL | |
+----------------------+-----------------+------+-----+---------+----------------+
The Left Join is long, the request takes 3 seconds with, 0,0158 seconds wihtout.
It would be more rapid to make a sql request for each 50 rows.
Would it be more rapid with a subquery ? How to make it with a subquery ?
Thanks

Query taking very long (Explain included)

Goal of query:
Display race by district.
Query:
SELECT school_data_schools_outer.district_id,
school_data_race_ethnicity_raw_outer.year,
school_data_race_ethnicity_raw_outer.race,
ROUND(
SUM( school_data_race_ethnicity_raw_outer.count) /
(SELECT SUM(count)
FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_inner
INNER JOIN school_data_schools as school_data_schools_inner
USING (school_id)
WHERE school_data_schools_outer.district_id = school_data_schools_inner.district_id
AND school_data_race_ethnicity_raw_outer.year = school_data_race_ethnicity_raw_inner.year) * 100, 2)
FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_outer
INNER JOIN school_data_schools as school_data_schools_outer USING (school_id)
GROUP BY school_data_schools_outer.district_id,
school_data_race_ethnicity_raw_outer.year,
school_data_race_ethnicity_raw_outer.race
mysql> explain SELECT school_data_schools_outer.district_id, school_data_race_ethnicity_raw_outer.year, school_data_race_ethnicity_raw_outer.race,ROUND(SUM(school_data_race_ethnicity_raw_outer.count)/( SELECT SUM(count) FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_inner INNER JOIN school_data_schools as school_data_schools_inner USING (school_id) WHERE school_data_schools_outer.district_id = school_data_schools_inner.district_id and school_data_race_ethnicity_raw_outer.year = school_data_race_ethnicity_raw_inner.year ) * 100,2) FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_outer INNER JOIN school_data_schools as school_data_schools_outer USING (school_id) GROUP BY school_data_schools_outer.district_id, school_data_race_ethnicity_raw_outer.year, school_data_race_ethnicity_raw_outer.race;
+----+--------------------+--------------------------------------+--------+----------------------------+---------+---------+----------------------------------------------------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------------------------+--------+----------------------------+---------+---------+----------------------------------------------------------------------+-------+---------------------------------+
| 1 | PRIMARY | school_data_race_ethnicity_raw_outer | ALL | school_id,school_id_2 | NULL | NULL | NULL | 84012 | Using temporary; Using filesort |
| 1 | PRIMARY | school_data_schools_outer | eq_ref | PRIMARY | PRIMARY | 257 | rocdocs_main_drupal_7.school_data_race_ethnicity_raw_outer.school_id | 1 | |
| 2 | DEPENDENT SUBQUERY | school_data_race_ethnicity_raw_inner | ref | school_id,year,school_id_2 | year | 4 | func | 8402 | |
| 2 | DEPENDENT SUBQUERY | school_data_schools_inner | eq_ref | PRIMARY | PRIMARY | 257 | rocdocs_main_drupal_7.school_data_race_ethnicity_raw_inner.school_id | 1 | Using where |
+----+--------------------+--------------------------------------+--------+----------------------------+---------+---------+----------------------------------------------------------------------+-------+---------------------------------+
4 rows in set (0.00 sec)
mysql>
mysql> describe school_data_race_ethnicity_raw;
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| school_id | varchar(255) | NO | MUL | NULL | |
| year | int(11) | NO | MUL | NULL | |
| race | varchar(255) | NO | | NULL | |
| count | int(11) | NO | | NULL | |
+-----------+--------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)
mysql> describe school_data_schools;
+-------------+----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------------+------+-----+---------+-------+
| school_id | varchar(255) | NO | PRI | NULL | |
| grade_level | varchar(255) | NO | | NULL | |
| district_id | varchar(255) | NO | | NULL | |
| school_name | varchar(255) | NO | | NULL | |
| address | varchar(255) | NO | | NULL | |
| city | varchar(255) | NO | | NULL | |
| lat | decimal(20,10) | NO | | NULL | |
| lon | decimal(20,10) | NO | | NULL | |
+-------------+----------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
NOTE: I also have tried:
select sds.school_id,
detail.year,
detail.race,
ROUND((detail.count / summary.total) * 100 ,2) as percent
FROM school_data_race_ethnicity_raw as detail
inner join school_data_schools as sds USING (school_id)
inner join (
select sds2.district_id, year, sum(count) as total
from school_data_race_ethnicity_raw
inner join school_data_schools as sds2 USING (school_id)
group by sds2.district_id, year
) as summary on summary.district_id = sds.district_id
and summary.year = detail.year
This is slow beacuse:
You have no index in use on school_data_race_ethnicity_raw_outer, so it's scanning each of the ~84,000 rows
You are using a correlated subquery which means that your complex calculation has to be run once per row i.e. 84,000 times.
The best approach is not to use a correlated subquery, but if not, then to make it go fast, you need to use covering indexes so that the whole of that inner query (and the other parts via their own indexes) can be run lightning fast using just the index. For a great tutorial on the subject of indexes, check this out. It taught me a lot! Right now, your inner query just uses the year index on school_data_race_ethnicity_raw, so it has to look up the rest of the stuff it needs by reading 8000 rows for every one of the 84000 calculations. Indexes will make this far faster e.g. create a composite index on school_data_race_ethnicity_raw and you will find it helps:
CREATE index inner_composite ON school_data_race_ethnicity_raw (year, district_id, schoolid, count)
This will allow all the fields used in the WHERE to be gotten from the index, then the join field, then the field you want for the select. You should see it show up in the 'key' column of your explain result. Also, if you get it right, you'll see 'using index' in the right-most column, showing that no table access is happening, which is orders of magnitude faster.
You can experiment quick-and-dirty style by adding loads of indexes for the columns that the query mentions and see what gets picked up in the key column. If something appears, read your query to see what other columns from that table are in use, then add a new index with those columns added in too on the right hand side and see if that works better. Remember to delete the unused indexes once you find out what works.
MySQL doesn't allow you to directly index the SUM of a column, which would be the fastest way, so unless you want to move to another DB (good idea if you can), this will always be a little slow.
This should be all you need to aggregate your data to get a count of race by district, not sure why you are doing so much math in your original, as it is unnecessary to achieve your goal, and is forcing some crazy sub queries.
SELECT SUM(students.count) as studentCount, School.district_id, students.race
FROM school_data_schools schools,
school_data_race_ethnicity_raw students
WHERE shools.school_id = students.school_id
GROUP BY district_id, race
You probably also want an index on school_data_race_ethnicity_raw.school_id (alone, not as part of a multiple column key)
EDIT was not aware OP was looking for a percentage breakdown, and not just totals
SELECT ((studentCount / districtTotal) * 100) as percentage, district_id, race
FROM(
SELECT SUM(students.count) as studentCount, Schools.district_id, students.race,
(SELECT SUM(inStudents.count)
FROM school_data_schools inSchools,
school_data_race_ethnicity_raw inStudents
WHERE inSchools.school_id = inStudents.school_id
AND inSchools.district_ID = Schools.district_id
GROUP BY inSchools.district_id) as districtTotal
FROM school_data_schools schools,
school_data_race_ethnicity_raw students
WHERE schools.school_id = students.school_id
GROUP BY district_id, race
) table1
This will run pretty quick, still need to make sure there is an index on school_data_race_ethnicity_raw.school_id that is not part of a multiple column index. you can see it in action here, though my test case is rather small, it does seem to check out.