I wanted to simulate a large amount of data in a database and test how my queries would perform under such conditions. I was not surprised when the query turned out to be slow, so here I am seeking advice on how I could better index my tables and improve my queries.
Before I post the tables' SQL and the queries I use, let me explain what is what. I have a user table, which is populated with 100,000 records. Most of its columns are ENUM types, like hair color, looking_for, etc. The first query is generated when a search is done. It consists of a WHERE clause in which some or all column values are searched for, and only ids are retrieved, limited to 20.
Then I have 3 more tables that hold about 50 to 1,000 records per user, so the numbers could really grow. These tables hold information on who visited whose profile, who marked whom as a favorite, and who blocked whom, plus a messaging table. My goal is to retrieve 20 records that match the search criteria, but also determine whether I (the user who's browsing) have:
blocked them
favorited them
been favorited by them
unread messages from them
sent or received any messages to or from them
For this I tried using both joins and subqueries, but the problem is that the second query, which retrieves users and the data listed above, is still slow. I think I need better indexes and possibly better queries. Here is what I have right now: table definitions first and the 2 queries at the end. The first does the search and determines the IDs; the second uses the IDs from the first query to retrieve the data. I hope you guys can help me create better indexes and optimize the query.
CREATE TABLE user (id BIGINT AUTO_INCREMENT, dname VARCHAR(255) NOT NULL, email VARCHAR(255) NOT NULL UNIQUE, email_code VARCHAR(255), email_confirmed TINYINT(1) DEFAULT '0', password VARCHAR(255) NOT NULL, gender ENUM('male', 'female'), description TEXT, dob DATE, height MEDIUMINT, looks ENUM('thin', 'average', 'athletic', 'heavy'), looking_for ENUM('marriage', 'dating', 'friends'), looking_for_age1 BIGINT, looking_for_age2 BIGINT, color_hair ENUM('black', 'brown', 'blond', 'red'), color_eyes ENUM('black', 'brown', 'blue', 'green', 'grey'), marital_status ENUM('single', 'married', 'divorced', 'widowed'), smokes ENUM('no', 'yes', 'sometimes'), drinks ENUM('no', 'yes', 'sometimes'), has_children ENUM('no', 'yes'), wants_children ENUM('no', 'yes'), education ENUM('school', 'college', 'university', 'masters', 'phd'), occupation ENUM('no', 'yes'), country_id BIGINT, city_id BIGINT, lastlogin_at DATETIME, deleted_at DATETIME, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX country_id_idx (country_id), INDEX city_id_idx (city_id), INDEX image_id_idx (image_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE block (id BIGINT AUTO_INCREMENT, blocker_id BIGINT, blocked_id BIGINT, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX blocker_id_idx (blocker_id), INDEX blocked_id_idx (blocked_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE city (id BIGINT AUTO_INCREMENT, name_eng VARCHAR(30), name_geo VARCHAR(30), name_geo_shi VARCHAR(30), name_geo_is VARCHAR(30), country_id BIGINT NOT NULL, active TINYINT(1) DEFAULT '0', INDEX country_id_idx (country_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE country (id BIGINT AUTO_INCREMENT, code VARCHAR(2), name_eng VARCHAR(30), name_geo VARCHAR(30), name_geo_shi VARCHAR(30), name_geo_is VARCHAR(30), active TINYINT(1) DEFAULT '1', PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE favorite (id BIGINT AUTO_INCREMENT, favoriter_id BIGINT, favorited_id BIGINT, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX favoriter_id_idx (favoriter_id), INDEX favorited_id_idx (favorited_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE message (id BIGINT AUTO_INCREMENT, body TEXT, sender_id BIGINT, receiver_id BIGINT, read_at DATETIME, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX sender_id_idx (sender_id), INDEX receiver_id_idx (receiver_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE visitor (id BIGINT AUTO_INCREMENT, visitor_id BIGINT, visited_id BIGINT, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX visitor_id_idx (visitor_id), INDEX visited_id_idx (visited_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
SELECT s.id AS s__id FROM user s WHERE (s.gender = 'female' AND s.marital_status = 'single' AND s.smokes = 'no' AND s.deleted_at IS NULL) LIMIT 20
SELECT s.id AS s__id, s.dname AS s__dname, s.gender AS s__gender, s.height AS s__height, s.dob AS s__dob, s3.id AS s3__id, s3.code AS s3__code, s3.name_geo AS s3__name_geo, s4.id AS s4__id, s4.name_geo AS s4__name_geo, s5.id AS s5__id, s6.id AS s6__id, s7.id AS s7__id, s8.id AS s8__id, s9.id AS s9__id FROM user s LEFT JOIN country s3 ON s.country_id = s3.id LEFT JOIN city s4 ON s.city_id = s4.id LEFT JOIN block s5 ON ((s.id = s5.blocked_id AND s5.blocker_id = '1')) LEFT JOIN favorite s6 ON ((s.id = s6.favorited_id AND s6.favoriter_id = '1')) LEFT JOIN favorite s7 ON ((s.id = s7.favoriter_id AND s7.favorited_id = '1')) LEFT JOIN message s8 ON ((s.id = s8.sender_id AND s8.receiver_id = '1' AND s8.read_at IS NULL)) LEFT JOIN message s9 ON (((s.id = s9.sender_id AND s9.receiver_id = '1') OR (s.id = s9.receiver_id AND s9.sender_id = '1'))) WHERE (s.id IN ('22', '36', '53', '105', '152', '156', '169', '182', '186', '192', '201', '215', '252', '287', '288', '321', '330', '351', '366', '399')) GROUP BY s.id ORDER BY s.id
Here are the results of EXPLAIN for the 2 queries above (columns: id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra):
First:
1 SIMPLE s ALL NULL NULL NULL NULL 100420 Using Where
Second:
1 SIMPLE s range PRIMARY PRIMARY 8 NULL 20 Using where; Using temporary; Using filesort
1 SIMPLE s2 eq_ref PRIMARY PRIMARY 8 sagule.s.image_id 1 Using index
1 SIMPLE s3 eq_ref PRIMARY PRIMARY 8 sagule.s.country_id 1
1 SIMPLE s4 eq_ref PRIMARY PRIMARY 8 sagule.s.city_id 1
1 SIMPLE s5 ref blocker_id_idx,blocked_id_idx blocked_id_idx 9 sagule.s.id 5
1 SIMPLE s6 ref favoriter_id_idx,favorited_id_idx favorited_id_idx 9 sagule.s.id 6
1 SIMPLE s7 ref favoriter_id_idx,favorited_id_idx favoriter_id_idx 9 sagule.s.id 6
1 SIMPLE s8 ref sender_id_idx,receiver_id_idx sender_id_idx 9 sagule.s.id 7
1 SIMPLE s9 index_merge sender_id_idx,receiver_id_idx receiver_id_idx,sender_id_idx 9,9 NULL 66 Using union(receiver_id_idx,sender_id_idx); Using where
I'm an MSSQL guy and haven't used MySQL, but the concepts should be the same.
First, can you remove the GROUP BY and ORDER BY and comment out all tables except for the first one, "user"? Also comment out any columns of the removed tables, as I have below.
SELECT s.id AS s__id,
s.dname AS s__dname,
s.gender AS s__gender,
s.height AS s__height,
s.dob AS s__dob
-- s3.id AS s3__id,
-- s3.code AS s3__code,
-- s3.name_geo AS s3__name_geo,
-- s4.id AS s4__id,
-- s4.name_geo AS s4__name_geo,
-- s5.id AS s5__id,
-- s6.id AS s6__id,
-- s7.id AS s7__id,
-- s8.id AS s8__id,
-- s9.id AS s9__id
FROM user s -- LEFT JOIN
-- country s3 ON s.country_id = s3.id LEFT JOIN
-- city s4 ON s.city_id = s4.id LEFT JOIN
-- block s5 ON ((s.id = s5.blocked_id AND s5.blocker_id = '1')) LEFT JOIN
-- favorite s6 ON ((s.id = s6.favorited_id AND s6.favoriter_id = '1')) LEFT JOIN
-- favorite s7 ON ((s.id = s7.favoriter_id AND s7.favorited_id = '1')) LEFT JOIN
-- message s8 ON ((s.id = s8.sender_id AND s8.receiver_id = '1' AND s8.read_at IS NULL)) LEFT JOIN
-- message s9 ON (((s.id = s9.sender_id AND s9.receiver_id = '1') OR (s.id = s9.receiver_id AND s9.sender_id = '1')))
WHERE (s.id IN ('22', '36', '53', '105', '152', '156', '169', '182', '186', '192', '201', '215', '252', '287', '288', '321', '330', '351', '366', '399'))
Run the query and record the time. Then add one table and its columns back in at a time and run it until you find which one causes it to slow significantly.
SELECT s.id AS s__id,
s.dname AS s__dname,
s.gender AS s__gender,
s.height AS s__height,
s.dob AS s__dob,
s3.id AS s3__id,
s3.code AS s3__code,
s3.name_geo AS s3__name_geo
-- s4.id AS s4__id,
-- s4.name_geo AS s4__name_geo,
-- s5.id AS s5__id,
-- s6.id AS s6__id,
-- s7.id AS s7__id,
-- s8.id AS s8__id,
-- s9.id AS s9__id
FROM user s LEFT JOIN
country s3 ON s.country_id = s3.id -- LEFT JOIN
-- city s4 ON s.city_id = s4.id LEFT JOIN
-- block s5 ON ((s.id = s5.blocked_id AND s5.blocker_id = '1')) LEFT JOIN
-- favorite s6 ON ((s.id = s6.favorited_id AND s6.favoriter_id = '1')) LEFT JOIN
-- favorite s7 ON ((s.id = s7.favoriter_id AND s7.favorited_id = '1')) LEFT JOIN
-- message s8 ON ((s.id = s8.sender_id AND s8.receiver_id = '1' AND s8.read_at IS NULL)) LEFT JOIN
-- message s9 ON (((s.id = s9.sender_id AND s9.receiver_id = '1') OR (s.id = s9.receiver_id AND s9.sender_id = '1')))
WHERE (s.id IN ('22', '36', '53', '105', '152', '156', '169', '182', '186', '192', '201', '215', '252', '287', '288', '321', '330', '351', '366', '399'))
My guess is that the block, favorite, and message joins are what is giving you the performance hit (the one with the most rows will be the biggest hit).
For the block table, can you remove one of the indexes and change the other to something along the lines of the following? (I am not sure of the syntax, but you'll get the point.)
INDEX blocker_id_idx (blocker_id,blocked_id),
and try it with the column order swapped around to find which order is best for your query
INDEX blocker_id_idx (blocked_id,blocker_id),
For the favorite table, change the indexes to
INDEX favoriter_id_idx (favoriter_id,favorited_id),
INDEX favorited_id_idx (favorited_id,favoriter_id),
Again, try it with the columns swapped around to find which gives better performance.
Do the same for the message indexes.
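For MySQL, the index changes described above could be written roughly as follows. This is only a sketch: the new index names are made up for illustration, and it assumes no foreign keys depend on the single-column indexes being dropped (none appear in the posted table definitions).

```sql
-- block: replace the two single-column indexes with one composite index.
-- blocker_blocked_idx is a hypothetical name.
ALTER TABLE block
  DROP INDEX blocker_id_idx,
  DROP INDEX blocked_id_idx,
  ADD INDEX blocker_blocked_idx (blocker_id, blocked_id);

-- favorite: one composite index per lookup direction.
ALTER TABLE favorite
  DROP INDEX favoriter_id_idx,
  DROP INDEX favorited_id_idx,
  ADD INDEX favoriter_favorited_idx (favoriter_id, favorited_id),
  ADD INDEX favorited_favoriter_idx (favorited_id, favoriter_id);

-- message: same idea for sender and receiver.
ALTER TABLE message
  DROP INDEX sender_id_idx,
  DROP INDEX receiver_id_idx,
  ADD INDEX sender_receiver_idx (sender_id, receiver_id),
  ADD INDEX receiver_sender_idx (receiver_id, sender_id);
```

To try the swapped column order for block, change the ADD INDEX column list to (blocked_id, blocker_id) and compare the EXPLAIN output and timings of the second query.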
Do that and let me know if things improved. There are a few other things that can be done to improve it further. - EDIT: It seems I lied about the few other things; what I had intended would not have made any difference. But I can speed up your first query, which is below.
EDIT: This is for your first SELECT query.
This one is a bit long, but I wanted to show you how indexes work so you can make your own.
Let's say the table contains 100,000 rows.
When you select from it, this is the general process it will take.
1. Are there any indexes that cover or mostly cover the columns that I need? (In your case, no, there aren't.)
2. So use the primary index and scan through every row in the table to check for a match.
3. Every row in the table will need to be read from disk to find which rows match your criteria. So to return the approx 10,000 rows (this is a guess) that match your data, the database engine has to read all 100,000 rows.
You do have a LIMIT 20 in your query, so it will limit the number of rows the engine will read from disk.
Example
read row 1: is match so add to result
read row 2: no match - skip
read row 3: no match - skip
read row 4: is match so add to result.
stop after 20 rows identified
You potentially read about 5000 rows from disk to return 20.
We need to create an index that will help us read as few records as possible from the table/disk, but still get the rows we are after. So here goes.
Your query uses 4 filters to get to the data.
s.gender = 'female' AND
s.marital_status = 'single' AND
s.smokes = 'no' AND
s.deleted_at IS NULL
What we need to do now is identify which filter by itself will return the fewest rows. I can't tell as I don't have any data, but this is what I would guess to be in your table.
The gender column supports 2 values, and it would be fair to estimate that half of the records in your database are male and the other half female, so that filter will return approx 50,000 rows.
Now for marital status. It supports four values, so if we say the data has an equal spread, we would get roughly 25,000 rows back. Of course, it depends on the actual data, and I would say that there are not too many widowed in the data, so a better estimate may be a 30% share for each of the other three. So let's say 30,000 records are marked as single.
Now for the smokes column. I have read that here in Australia about 10% of people smoke, which is a fairly low number compared to other countries. So let's say 25% either smoke or smoke sometimes. That leaves us with approx 75,000 non-smokers.
Now for the last column, deleted_at. This is a pure guess on my part, but let's say 5% are marked as deleted. That leaves us with approx 95,000 rows.
So in summary (remember, this is all pure guesswork on my part; your data may be different)
Gender 50,000 rows or 50%
Marital status 30,000 rows or 30%
Smokes 75,000 rows or 75%
Deleted 95,000 rows or 95%
So if we create an index with the four columns using the one that returns the least amount of rows first, we would get the following
INDEX index01_idx (marital_status,gender,smokes,deleted_at),
Now this is what will happen when we run the select.
1. The server will find an index that covers all the columns in the WHERE clause.
2. It will narrow down the result set to 30,000 "single" records.
3. Of those 30,000, 50% will be female; that leaves 15,000 records.
4. Of those 15,000, 75% will not smoke; that leaves 11,250 records.
5. Of those 11,250, 95% will not be deleted.
That leaves us with just over 10,000 records out of 100,000 total that we have identified as the records we want but not yet read from disk. You also have a LIMIT 20 in the query, so the database engine just needs to read the first 20 of the 10,000 and return the result. It's super quick, the hard disk will love you, and the scary DBA will even mumble and grunt with approval.
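To check whether the index actually helps on your data, something like the following could be run. It is a minimal sketch that reuses the index definition and the search query from above; whether the optimizer picks the index, and how many rows it estimates, depends on the real data distribution.

```sql
-- Composite index from the answer. `user` is backquoted because USER is a MySQL keyword.
ALTER TABLE `user`
  ADD INDEX index01_idx (marital_status, gender, smokes, deleted_at);

-- Re-run EXPLAIN on the search query and compare it with the earlier plan
-- (type = ALL, key = NULL, rows = 100420).
EXPLAIN
SELECT s.id AS s__id
FROM `user` s
WHERE s.gender = 'female'
  AND s.marital_status = 'single'
  AND s.smokes = 'no'
  AND s.deleted_at IS NULL
LIMIT 20;
```

If the key column of the new plan shows index01_idx instead of NULL, the full table scan has been replaced by an index lookup.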
In your second SELECT query, you can remove the GROUP BY clause because you aren't using any Aggregate functions (count, min, max...) in your SELECT clause.
I doubt this will help much improving performance, though.
In any case, I recommend watching the first part of the talk "A Look into a MySQL DBA's Toolchest".
(The first two thirds of the video are about free open-source admin tools for MySQL on Unix; the last third or so is about replication.)
Video A Look into a MySQL DBA's Toolchest
From the same talk: The Guide To Understanding mysqlreport
Without some data to test with, it is not so easy to give good advice.
Creating an index on fields that are searched frequently can help make your query faster. But with an index, your inserts and updates can get slower, so you have to think about the tradeoff. Index the columns that get searched frequently, but test the new index on the data so you can see whether it runs faster.
I don't know which tools you are using, but in MySQL Workbench there is a command "Explain Current Statement" under the "Query" menu. There you can see which actions were performed by MySQL and which keys were used. The EXPLAIN of your first query shows NULL in the key column, which means no key was used and MySQL had to run through the whole table comparing each row with the search terms.
Hope this helps a bit.
Related
I am trying to figure out the most efficient method of writing the query below. Right now it is using a user table of 3k records, scheduleday of 12k records, and scheduleuser of 300k records.
The method I am using works, but it is not fast. It is plenty fast for 100 records or fewer, but that is not how I need it displayed. I know there must be a more efficient way of running this: if I take out the nested select, it runs in .00025 seconds. Add the nested select, and we're pushing 9+ seconds.
All I am trying to do is get the most recent date a user was scheduled. The scheduleuser table only tells the scheduleid and dayid. This is then looked up in scheduleday to get the date. I can't use max(scheduleuser.rec) because the order entered may not be in date order.
The result of this query would be:
Bob 4/6/2022
Ralph 4/7/2022
Please note this query works perfectly fine, I am looking for ways to make it more efficient.
Percona Server Mysql 5.5
SELECT
(
SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser su1
LEFT JOIN scheduleday ON scheduleday.scheduleid=su1.scheduleid AND scheduleday.dayid=su1.dayid WHERE su1.idUser=users.idUser
)
as lastsecheduledate, users.usersName
FROM users
users
idUser  usersName
1       bob
2       ralph

scheduleday
scheduleid  dayid  ddate
1           1      4/5/2022
1           2      4/6/2022
1           3      4/7/2022

scheduleuser (su1)
rec  idUser  dayid  scheduleid
1    1       2      1
1    2       3      1
1    1       1      1
As requested, full query
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
(SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser
LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid WHERE scheduleuser.iduser=users.iduser
) as lastsecheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
As requested, create tables
CREATE TABLE `users` (
`idUser` int(11) NOT NULL,
`usersName` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `users`
ADD PRIMARY KEY (`idUser`);
ALTER TABLE `users`
MODIFY `idUser` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleday` (
`rec` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`ddate` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleday`
ADD PRIMARY KEY (`rec`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleday`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleuser` (
`rec` int(11) NOT NULL,
`idUser` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleuser`
ADD PRIMARY KEY (`rec`),
ADD KEY `idUser` (`idUser`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleuser`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
I think my recommendation would be to do that subquery once with a GROUP BY and join it. Something like
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
lsd.lastscheduledate AS lastsecheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN (SELECT iduser, MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) lastscheduledate FROM scheduleuser LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
GROUP BY iduser
) lsd
ON lsd.iduser=users.iduser
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
This will likely be more efficient since the DB can do the query once for all users, likely using your scheduleuser.iduser index.
If you are using something like above and it's still not performant, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (scheduleid, dayid);
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid);
This would ensure it can do the entire join in the subquery with the indexes. Of course, there are tradeoffs to adding more indexes, so depending on your data profile it might not be worth it (and it might not actually improve anything).
If you are using your original query, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (iduser, scheduleid, dayid);
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid);
This would allow it to do the subquery (both the JOIN and the WHERE) without touching the actual scheduleuser table at all. Again, I say "experiment" since there are tradeoffs and this might not actually improve things much.
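One way to run that experiment is to add the indexes and then look at the plan for the subquery on its own. The sketch below assumes the table definitions posted in the question; the index names and the sample user id are made up for illustration.

```sql
-- Hypothetical index names; the column lists are the ones suggested above.
ALTER TABLE scheduleuser ADD INDEX su_user_sched_day (idUser, scheduleid, dayid);
ALTER TABLE scheduleday  ADD INDEX sd_sched_day (scheduleid, dayid);

-- Plan for the correlated subquery, evaluated for one sample user (idUser = 1).
EXPLAIN
SELECT MAX(STR_TO_DATE(sd.ddate, '%m/%d/%Y'))
FROM scheduleuser su
LEFT JOIN scheduleday sd
       ON sd.scheduleid = su.scheduleid
      AND sd.dayid = su.dayid
WHERE su.idUser = 1;
```

If the scheduleuser row of the plan shows the new index in the key column and "Using index" in Extra, the subquery is being answered from the index alone; if not, the extra index may not be worth its write cost.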
When you nest a query in the SELECT as you're doing, that query will get evaluated for each record in the result set because its WHERE clause is utilizing a column from outside the query. You really just want to calculate a result set of max dates only once and join your users on after it is done:
select usersName, last_scheduled
from users
left join (select su.iduser, max(str_to_date(sd.ddate, '%m/%d/%Y')) as last_scheduled
from scheduleuser as su left join scheduleday as sd on su.dayid = sd.dayid
and su.scheduleid = sd.scheduleid
group by su.iduser) recents on users.iduser = recents.iduser
I've obviously left your other columns off and just given you the name and date, but this is the general principle.
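Potential bug: keep the conversion inside the aggregate, as in
MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y'))
Do not change it to
STR_TO_DATE(MAX(scheduleday.ddate), '%m/%d/%Y')
The second form takes the maximum of the strings, and as text '12/31/2022' sorts after '1/1/2023', so you would be in for a rude surprise next January.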
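The two expressions differ in when the string is converted to a date, and a quick check on two literal values shows how the results can diverge. This is only a sketch with made-up sample values, not data from the question.

```sql
-- Compare the maximum of the raw strings with the maximum of the parsed dates.
SELECT MAX(d)                           AS max_of_strings,
       MAX(STR_TO_DATE(d, '%m/%d/%Y'))  AS max_of_dates
FROM (
    SELECT '12/31/2022' AS d
    UNION ALL
    SELECT '1/1/2023'
) AS sample;
```

Because '/' sorts before the digits, max_of_strings should come back as '12/31/2022', while max_of_dates should be the later date, 2023-01-01. Storing ddate in a DATE column would avoid the conversion altogether.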
Possible better indexes. Switch from MyISAM to InnoDB. The following indexes assume InnoDB; they may not work as well in MyISAM.
users: INDEX(active, type)
userrating: INDEX(iduser, rating)
location: INDEX(idarea, area)
usertypes: INDEX(idtype, name)
useropen: INDEX(iduser)
scheduleday: INDEX(scheduleid, dayid, ddate)
scheduleuser: INDEX(iduser, scheduleid, dayid)
users: INDEX(iduser)
When adding a composite index, DROP index(es) with the same leading columns.
That is, when you have both INDEX(a) and INDEX(a,b), toss the former.
I'm migrating a database to a new one to change one-to-many relationships to many-to-many (and to improve the column naming scheme). [Edit: I've created a SQLFiddle for this.]
oldDB                          newDB
=========================      =======================================
individuals                    people
- individual_id                - id
- individual_name_first        - first_name
- individual_name_last         - last_name
- individual_name_other        - additional_identifier
- individual_position          - role
- individual_group_code        - (replaced with people-groups table)
(There are duplicate rows
in this table for individ's
who are in more than one
group.)

groups                         groups
- (no id in oldDB)             - id
- group_code                   - short_name
- group_name                   - full_name

                               people_groups
                               - id
                               - person_id
                               - group_id
                               - start_date
                               - end_date
Specifically, I'm having trouble with creating the linking table between people and groups.
I've already created the people and groups tables:
CREATE TABLE IF NOT EXISTS people (
id int(11) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
first_name varchar(50) NOT NULL,
last_name varchar(50) NOT NULL,
additional_identifier varchar(50) DEFAULT NULL COMMENT 'In case of duplicate first and last names',
role varchar(50) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE newDB.people ADD UNIQUE `name` (last_name, first_name, additional_identifier);
INSERT INTO newDB.people
(id, first_name, last_name, role)
SELECT
individual_id, individual_name_last, individual_name_first, individual_position, COUNT(*)
FROM
oldDB.individuals
GROUP BY
individual_name_last, individual_name_first;
CREATE TABLE IF NOT EXISTS newDB.groups(
id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
full_name VARCHAR(255) NOT NULL UNIQUE,
short_name VARCHAR(255) NOT NULL UNIQUE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO newDB.groups
(full_name, short_name)
SELECT
group_name, group_code
FROM
oldDB.groups;
Next I will CREATE the newDB.people-groups table, but first I'm making sure I can SELECT the right values:
SELECT
newDB.groups.id 'group id',
newDB.people.id 'person id',
individual_group_code 'group short name',
individual_name_last 'last name',
individual_name_first 'first name'
FROM
oldDB.individuals
LEFT JOIN
newDB.groups ON(
newDB.groups.short_name = oldDB.individuals.individual_group_code
)
LEFT JOIN newDB.people ON (
newDB.people.last_name = oldDB.individuals.individual_name_last
AND
newDB.people.first_name = oldDB.individuals.individual_name_first
)
GROUP BY
individual_name_last ASC,
individual_name_first ASC,
individual_group_code
The first LEFT JOIN is just for displaying the group short names for easy verification. The second LEFT JOIN is important: it is supposed to allow for the 'person id' output to be pulled from newDB.people.id. Instead, I'm just getting NULL in that column for all rows of output. Everything else is displaying correctly. What am I missing?
Edit:
Desired output
Here is what I am hoping to get. (I generated it by replacing newDB.people.id 'person id' with oldDB.individuals.individual_id 'person id'. The problem with that output is that persons 925 and 1232 are the same person in two different groups; the new database simplifies that to just person 925.)
Actual output
Here is what I am getting:
Edit 2:
Here is a SQLFiddle that does work. Why doesn't it work in my phpmyadmin?
You are performing a group by (3 columns) with 5 columns of non-aggregated columns in the select list. Also, not that it matters, there are no aggregates in the column output.
MySQL treats that as a distinct (for the 3 columns) and brings back the first row it encounters in the MRU cache, and if no cache, the first ones encountered in the clustered index or physical ordering to satisfy the 2 non-grouped-by columns.
In other words, it is a user error. A snafu. I recommend cleaning up your intention with the GROUP BY.
Somewhat related, read a recent answer of mine here about ONLY_FULL_GROUP_BY. At the bottom of that answer is a link to "MySQL Handling of GROUP BY", which in my view glosses over the real problems and the non-standard behavior that MySQL allowed, which produced unexpected and hard-to-explain data from violations of the standard.
So what did the MySQL dev team do? They implemented the standard by default (starting in version 5.7) to disallow the types of queries you just performed.
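As an illustration of making the intention explicit, the verification query from the question could be written without GROUP BY, using DISTINCT to collapse duplicate rows and ORDER BY for the sorting. This is only a sketch based on the table and column names in the question; the aliases i, g and p and the underscore-style output names are introduced here for readability.

```sql
-- One row per (person, group) pair; every selected column is listed explicitly,
-- so the query is also valid under ONLY_FULL_GROUP_BY.
SELECT DISTINCT
    g.id                     AS group_id,
    p.id                     AS person_id,
    i.individual_group_code  AS group_short_name,
    i.individual_name_last   AS last_name,
    i.individual_name_first  AS first_name
FROM oldDB.individuals AS i
LEFT JOIN newDB.groups AS g
       ON g.short_name = i.individual_group_code
LEFT JOIN newDB.people AS p
       ON p.last_name  = i.individual_name_last
      AND p.first_name = i.individual_name_first
ORDER BY last_name, first_name, group_short_name;
```

If person_id is still NULL for every row, the join condition itself is not matching, which points at the data in newDB.people rather than at the grouping.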
Edit1
Your query, with no GROUP BY but with an order by newGroups.id,people.id, on a version 5.7.14 server:
I was wondering if anyone had some advice for me regarding a histogram-generating query. I have a query that I like (in that it works), but it is extremely slow. Here is the background:
I have a table of metadata, a table of data values where one row in meta_data is a key-row for many (perhaps several thousand) rows in data_values, and a table of histogram bin information:
create table meta_data (
id int not null primary key,
name varchar(100),
other_data char(10)
);
create table data_values (
id int not null primary key,
meta_data_id int not null,
data_value real
);
create table histogram_bins (
id int not null primary key,
bin_min real,
bin_max real,
bin_center real,
bin_size real
);
And a query that creates the histogram:
SELECT md.name AS `Name`,
md.other_data AS `OtherData`,
hist.bin_center AS `Bin`,
SUM(data.data_value BETWEEN hist.bin_min AND hist.bin_max) AS `Frequency`
FROM histogram_bins hist
LEFT JOIN data_values data ON 1 = 1
LEFT JOIN meta_data md ON md.id = data.meta_data_id
GROUP BY md.id, `Bin`;
In an earlier version of this query, the BETWEEN ... AND logical statement was down in the JOIN (replacing 1 = 1), but then I would only receive histogram rows with non-zero frequency. I need rows for all of the bins (even the zero-frequency ones), for analysis purposes.
It's pretty darn slow, to the tune of 10-15 minutes or so. The data_values table has about 7.9 million rows, and meta_data weighs in at 15,900 rows -- so maybe it is just going to take a long time!
Thanks very much!
I think this might help
SELECT md.name AS `Name`,
md.other_data AS `OtherData`,
h.bin_center AS `Bin`,
IFNULL(F.Frequency, 0) AS `Frequency`
FROM meta_data md
CROSS JOIN histogram_bins h
LEFT JOIN
(SELECT data.meta_data_id,
hist.id AS bin_id,
COUNT(*) AS Frequency
FROM data_values data
JOIN histogram_bins hist ON data.data_value BETWEEN hist.bin_min AND hist.bin_max
GROUP BY data.meta_data_id, hist.id) F ON F.meta_data_id = md.id AND F.bin_id = h.id
I changed the order of the tables because I think it's best to find the corresponding bin for every record in the data and then just count how many there are grouped by bin
Currently I am having an issue with slow queries to my DB - query time varies from 0.0005 seconds to 70 seconds.
Currently my table structure with content is following:
CREATE TABLE IF NOT EXISTS `content` (
`content_id` int(11) NOT NULL AUTO_INCREMENT,
`content_url` text NOT NULL,
`content_text` text NOT NULL,
`seed_id` int(11) NOT NULL,
`created_at` bigint(20) NOT NULL,
`image` varchar(2000) DEFAULT NULL,
`price` varchar(300) DEFAULT NULL,
PRIMARY KEY (`content_id`),
UNIQUE KEY `CONTENT_TEXT_UNIQUE` (`content_text`(255)),
KEY `FK_SEED_CODE` (`seed_id`),
KEY `CONTENT_TEXT_TIME_INDEX` (`content_text`(255),`created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=111357870 ;
ALTER TABLE `content`
ADD CONSTRAINT `FK_SEED_ID` FOREIGN KEY (`seed_id`) REFERENCES `seed` (`seed_id`) ON DELETE CASCADE ON UPDATE CASCADE;
Currently I have only 2 queries to Database:
SELECT seed.seed_code,content.content_id as id, content.content_url, content.content_text, content.created_at, content.image, content.price FROM content
LEFT JOIN seed ON content.seed_id = seed.seed_id
WHERE seed.seed_switch = 1 AND seed.seed_status_id = 3 AND seed.seed_id in (
SELECT seed_id FROM seed WHERE storage_id ='.$storage.') '.$filter.' ORDER BY content.content_id DESC, content.created_at DESC LIMIT 50
And
SELECT seed.seed_code,content.content_id as id, content.content_url, content.content_text, content.created_at, content.image, content.price FROM content
LEFT JOIN seed ON content.seed_id = seed.seed_id
WHERE seed.seed_switch = 1 AND seed.seed_status_id = 3 AND seed.seed_id in (
SELECT seed_id FROM seed WHERE storage_id ='.$storage.') ORDER BY content.content_id DESC, content.created_at DESC LIMIT 50
The seed table contains about 20 entries, which mostly don't change.
The indexes created on the content table don't seem to be working, because I am still seeing very long load times.
What improvements could be made to the DB?
UPDATE 1
The content table contains around 1 million entries and grows by 1-2k entries every day.
The $filter variable contains additional filters: other AND conditions that are generated depending on user input. They filter only on content.content_text and content.created_at.
EDIT
OK, I noticed the AUTO_INCREMENT value in your CREATE TABLE. You have, or have had, millions of records (since the counter is over 100 million) and are running a WHERE ... IN subselect; you are not going to get ideal performance with that approach. Try the query below and see if it improves load times.
You haven't supplied all the details (for example, how many records the tables in question have and what the output of '.$filter.' is), but more than likely the subselect is the cause of the slow load time. Also, save yourself some typing and alias the tables! Cleaned up example:
SELECT s.seed_code, c.content_id as id, c.content_url, c.content_text, c.created_at, c.image, c.price
FROM content c
JOIN seed s USING(seed_id)
WHERE s.seed_switch = 1
AND s.seed_status_id = 3
AND s.storage_id ='.$storage.'
'.$filter.'
ORDER BY c.content_id DESC
LIMIT 50
Let's say I need to query the associates of a corporation. I have a table, "transactions", which contains data on every transaction made.
CREATE TABLE `transactions` (
`transactionID` int(11) unsigned NOT NULL,
`orderID` int(11) unsigned NOT NULL,
`customerID` int(11) unsigned NOT NULL,
`employeeID` int(11) unsigned NOT NULL,
`corporationID` int(11) unsigned NOT NULL,
PRIMARY KEY (`transactionID`),
KEY `orderID` (`orderID`),
KEY `customerID` (`customerID`),
KEY `employeeID` (`employeeID`),
KEY `corporationID` (`corporationID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
It's fairly straightforward to query this table for associates, but there's a twist: A transaction record is registered once per employee, and so there may be multiple records for one corporation per order.
For example, if employees A and B from corporation 1 were both involved in selling a vacuum cleaner to corporation 2, there would be two records in the "transactions" table; one for each employee, and both for corporation 1. This must not affect the results, though. A trade from corporation 1, regardless of how many of its employees were involved, must be treated as one.
Easy, I thought. I'll just make a join on a derived table, like so:
SELECT corporationID FROM transactions JOIN (SELECT DISTINCT orderID FROM transactions WHERE corporationID = 1) AS foo USING (orderID)
The query returns a list of corporations who have been involved in trades with corporation 1. That's exactly what I need, but it's very slow because MySQL can't use the corporationID index to determine the derived table. I understand that this is the case for all subqueries/derived tables in MySQL.
I've also tried to query a collection of orderIDs separately and use a ridiculously large IN() clause (typically 100,000+ IDs), but as it turns out MySQL has issues using indices on ridiculously large IN() clauses as well, and as a result the query time does not improve.
Are there any other options available, or have I exhausted them both?
If I understand your requirement, you could try this.
select distinct t1.corporationID
from transactions t1
where exists (
select 1
from transactions t2
where t2.corporationID = 1
and t2.orderID = t1.orderID)
and t1.corporationID != 1;
or this:
select distinct t1.corporationID
from transactions t1
join transactions t2
on t2.orderID = t1.orderID
and t1.transactionID != t2.transactionID
where t2.corporationID = 1
and t1.corporationID != 1;
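Neither rewrite has been tested against the question's data, so it is worth checking the plan. The sketch below assumes the transactions table from the question; the composite indexes and their names are a suggestion added here, not something the original table has.

```sql
-- Hypothetical composite indexes so both sides of the self-join
-- can be resolved from an index on (corporationID, orderID).
ALTER TABLE transactions
  ADD INDEX corp_order_idx (corporationID, orderID),
  ADD INDEX order_corp_idx (orderID, corporationID);

-- Inspect how MySQL executes the self-join variant.
EXPLAIN
SELECT DISTINCT t1.corporationID
FROM transactions t1
JOIN transactions t2
  ON t2.orderID = t1.orderID
WHERE t2.corporationID = 1
  AND t1.corporationID != 1;
```

A plan in which both t1 and t2 list one of these indexes in the key column would indicate that the derived table's full materialization has been avoided.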
Your data makes no sense to me, I think you are using corporationID where you mean customer ID at some point in there, as your query joins the transaction table to the transaction table for corporationID=1 based on orderID to get the corporationIDs...which would then be 1, right?
Can you please specify what the customerID, employeeID, and corporationIDs mean? How do I know employees A and B are from corporation 1 - in that case, is corporation 1 the corporationID, and corporation 2 is the customer, and so stored in the customerID?
If that is the case, you just need to do a group by:
SELECT customerID
FROM transactions
WHERE corporationID = 1
GROUP BY customerID
(Or select and group by orderID if you want one row per order instead of one row per customer.)
By using the group by, you ignore the fact that there are multiple records that are duplicate except for the employeeID.
Conversely, to return all corporations that have sold to corporation 2:
SELECT corporationID
FROM transactions
WHERE customerID = 2
GROUP BY corporationID