Optimize table to reduce index size - mysql

I have this schema which saves chat messages. Currently I have about 100k rows, which is about 5.5MB of data, and the index size is 6.5MB. When the data size was ~4MB the index size was ~3MB, so is it growing exponentially?
CREATE TABLE `messages` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`author` int(11) unsigned DEFAULT NULL,
`time` int(10) unsigned DEFAULT NULL,
`text` text,
`dest` int(11) unsigned DEFAULT NULL,
`type` tinyint(4) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `history` (`author`,`dest`,`id`) USING BTREE,
KEY `messages_ibfk_1` (`dest`),
FULLTEXT KEY `msg` (`text`),
CONSTRAINT `au` FOREIGN KEY (`author`) REFERENCES `users` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `messages_ibfk_1` FOREIGN KEY (`dest`) REFERENCES `users` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=105895 DEFAULT CHARSET=utf8;
The main query that I run against this table, and that I've tried to optimize for, shows paginated history for a chat between two people:
SELECT id, time, text, dest, type, author
FROM `messages`
WHERE (
(author = ? AND dest = ?) OR (author = ? AND dest = ?)
) AND id <= ? ORDER BY id DESC LIMIT ?, 25
The other queries for history are identical except they have additional filters for a search term or date range.
Is there anything that can be done to reduce index size and maintain optimal performance?

Don't worry about the growth of the indexes. It is probably a fluke; certainly not "exponential".
Assuming the main issue is performance of
SELECT id, time, text, dest, type, author
FROM `messages`
WHERE (
(author = ? AND dest = ?) OR (author = ? AND dest = ?)
) AND id <= ? ORDER BY id DESC LIMIT ?, 25
I see three techniques that will help significantly: Change OR to UNION, deal with LIMIT in UNION, and don't use OFFSET for pagination.
( SELECT id, time, text, dest, type, author
FROM `messages`
WHERE author = ? -- one author & dest
AND dest = ?
AND id < ? -- where you "left off"
ORDER BY id DESC
LIMIT 25
) UNION ALL
( SELECT id, time, text, dest, type, author
FROM `messages`
WHERE author = ? -- the other author & dest
AND dest = ?
AND id < ? -- same as above
ORDER BY id DESC
LIMIT 25
)
ORDER BY id DESC
LIMIT 25; -- get the desired 25 from the 50 above
A separate pagination discussion explains why the OFFSET should be removed. It also covers other techniques, including fetching 26 rows (in all three places) instead of 25 so that you know whether this is the 'last' page.
On the first iteration, AND id < ? could be left off. Or (simpler), you could substitute a very large number.
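For example, a minimal sketch of the first-page call, assuming the two users are 1 and 2 (the sentinel 4294967295 is just the largest unsigned INT, and fetching 26 rows is the last-page test mentioned above):
( SELECT id, time, text, dest, type, author
FROM `messages`
WHERE author = 1 AND dest = 2
AND id < 4294967295 -- sentinel instead of "where you left off"
ORDER BY id DESC
LIMIT 26
) UNION ALL
( SELECT id, time, text, dest, type, author
FROM `messages`
WHERE author = 2 AND dest = 1
AND id < 4294967295
ORDER BY id DESC
LIMIT 26
)
ORDER BY id DESC
LIMIT 26; -- 26 rows back means there is at least one more page; show 25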
Your index (author,dest,id) is optimal for my formulation.
This complex formulation will shine as messages gets bigger and/or the user pages farther through the list.

Sum of averages raw query

I have the following code that I have to optimize:
These are the models:
class Question(models.Model):
question_id = models.CharField(max_length=20)
label = models.CharField(max_length=255, verbose_name='Question')
class Property(models.Model):
name = models.CharField(max_length=200)
class Response(models.Model):
question = models.ForeignKey(Question, on_delete=models.CASCADE)
submit_date = models.DateTimeField()
score = models.IntegerField(null=True, blank=True)
is_null = models.BooleanField(default=False)
ignore = models.BooleanField(default=False)
property = models.ForeignKey(Property, on_delete=models.CASCADE)
class Plan(models.Model):
name = models.CharField(max_length=100)
questions = models.ManyToManyField(Question, through='PlanQuestion')
start_date = models.DateField(null=True)
completion_date = models.DateField(null=True)
class PlanQuestion(models.Model):
question = models.ForeignKey(Question, on_delete=models.CASCADE)
plan = models.ForeignKey(Plan, on_delete=models.CASCADE)
I first iterate over the plans, then over each plan's questions, like this:
plans = Plan.objects.filter(
start_date__isnull=False, completion_date__isnull=False
)
for plan in plans:
plan_questions = plan.questions.through.objects.filter(plan=plan)
for plan_question in plan_questions:
# run the below query for each plan_question here
In the above code, for each plan question, this query is run to calculate the average score:
SELECT AVG(score) AS average_score
FROM Response WHERE question_id=%(question_id)s
AND DATE(submit_date) >= %(start_date)s AND DATE(submit_date) <= %(end_date)s
The problem is that:
If let us say Plan1 has 5 questions:
P1 => Avg(Q1) + Avg(Q2) + Avg(Q3) + Avg(Q4) + Avg(Q5)
The query is run for each question and calculates the average score over its responses (one question can have many responses). So for P1, 5 queries are run; if one query takes 0.5 seconds, the 5 queries for one plan take 2.5 seconds. As the number of plans grows, each with 5 questions, the total time grows linearly with the number of plans.
I want a way to reduce the number of these queries so that I don't have to run a separate query per question. How can I combine all the per-question queries into one? Maybe I could use a UNION, but I don't see how to write a single query that way; or maybe there is a better solution than a UNION.
I also tried adding prefetch_related, but it made no improvement.
Edit:
Create Tables:
CREATE TABLE `Response` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`question_id` int(11) NOT NULL,
`score` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `Response_25110688` (`question_id`),
CONSTRAINT `question_id_refs_id_2dd82bdb` FOREIGN KEY (`question_id`) REFERENCES `Question` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=157533450 DEFAULT CHARSET=latin1
CREATE TABLE `Question` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`question_id` varchar(20) NOT NULL,
`label` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=353 DEFAULT CHARSET=latin1
CREATE TABLE `Plan` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_date` date DEFAULT NULL,
`completion_date` date DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=687 DEFAULT CHARSET=latin1
CREATE TABLE `PlanQuestion` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`plan_id` int(11) NOT NULL,
`question_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `PlanQuestion_plan_id_de8df699_fk_Plan_id` (`plan_id`),
KEY `PlanQuestion_question_id_49c10d5b_fk_Question_id` (`question_id`),
CONSTRAINT `PlanQuestion_plan_id_de8df699_fk_Plan_id` FOREIGN KEY (`plan_id`) REFERENCES `Plan` (`id`),
CONSTRAINT `PlanQuestion_question_id_49c10d5b_fk_Question_id` FOREIGN KEY (`question_id`) REFERENCES `Question` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2130 DEFAULT CHARSET=latin1
CREATE TABLE `Property` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(200) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=188651 DEFAULT CHARSET=latin1
Here is the full query:
SELECT id, COUNT(*) AS count, AVG(int_val) AS int_average
FROM Response WHERE question_id=%(question_id)s
AND property_id=%(property_id)s and is_null=0
AND Response.ignore=0 AND DATE(submit_date) >= %(start_date)s
AND DATE(submit_date) <= %(end_date)s
This does not make a lot of sense:
SELECT id, COUNT(*) AS count, AVG(int_val) AS int_average
FROM Response
WHERE question_id=%(question_id)s
AND DATE(submit_date) >= %(start_date)s
AND DATE(submit_date) <= %(end_date)s
Without a GROUP BY, the COUNT and AVG will be totals for the one "question_id". But then if there is a different id for each row, which id are you hoping for?
OK, assuming id is removed, it needs this composite index with the columns in this order:
INDEX(question_id, submit_date)
Meanwhile, remove INDEX(question_id) because it will be in the way.
Sorry, but sometimes performance requires changes.
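As DDL, that change might look like this (the new index name is arbitrary, and this assumes the submit_date column from the model, which the CREATE TABLE above elides):
ALTER TABLE Response
DROP INDEX Response_25110688,
ADD INDEX resp_question_date (question_id, submit_date);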
Secondly... "for plan_question in plan_questions" implies that you want the query to be run for every question.
Then get rid of the loop and do all the work at the same time:
SELECT question_id, COUNT(*) AS count, AVG(int_val) AS int_average
FROM Response
WHERE DATE(submit_date) >= %(start_date)s
AND DATE(submit_date) <= %(end_date)s
GROUP BY question_id
This will return one row per question; then you can loop through the resultset to deliver the output.
Good news: Even if you don't add the above index, this will work better than what you have now.
Also... cur_date = datetime.now().date() could be removed from the app code; instead, use simply CURDATE() in SQL to get just the date or NOW() to get the date+time.
Indexing
Getting rid of "for plan_question in plan_questions" will be the biggest benefit. The query (as I wrote it) already benefits from the index on question_id. However, adding INDEX(submit_date) might make it run faster if the date range is narrow.
If there are other clauses in the WHERE, we need to see them. There may be other indexes to suggest.
More
SELECT id, COUNT(*) AS count
FROM response
-- (and not JOINing to any other tables)
GROUP BY id;
This query always has a count of 1 because each id occurs in response exactly once.
SELECT
-- (without id)
COUNT(*) AS count
FROM response
-- (and not JOINing to any other tables)
-- (without GROUP BY)
;
This query always returns exactly 1 row.
Still More
Based on
WHERE question_id=%(question_id)s
AND property_id=%(property_id)s and is_null=0
AND Response.ignore=0 AND DATE(submit_date)...
you need
INDEX(question_id, property_id, is_null, ignore)
and drop INDEX(question_id).
But... My statement about doing a single query instead of an app loop still stands.
JOINing to Plan
SELECT r.question_id,
COUNT(*) AS count,
AVG(r.int_val) AS int_average,
p.name -- perhaps you want to say which "plan" is involved?
FROM Plan AS p
JOIN PlanQuestion AS pq ON pq.plan_id = p.id
JOIN Response AS r ON r.question_id = pq.question_id
WHERE p.... -- optionally filter on which plans to include
AND pq.... -- optionally filter on the other columns in pq
AND r.... -- optionally filter on which responses to include
GROUP BY p.id, r.question_id -- needed once a plan column is selected alongside aggregates
ORDER BY ... -- optionally sort the results by any column(s) in any table(s)
And remove the two single-column indexes in PlanQuestion, replacing them with two 2-column indexes:
INDEX(plan_id, question_id),
INDEX(question_id, plan_id)
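A sketch of that swap, reusing the index names from the CREATE TABLE above (the new names are arbitrary; the composite indexes still cover the foreign keys, so the constraints stay satisfied):
ALTER TABLE PlanQuestion
DROP INDEX PlanQuestion_plan_id_de8df699_fk_Plan_id,
DROP INDEX PlanQuestion_question_id_49c10d5b_fk_Question_id,
ADD INDEX plan_question (plan_id, question_id),
ADD INDEX question_plan (question_id, plan_id);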
Sargable
DATE(submit_date) >= "..." is "not sargable". This means an index involving submit_date cannot help with the test. Comparing the bare column lets the index be used, and for the lower bound it is semantically identical:
submit_date >= "..."
Since submit_date is a DATETIME, the upper bound needs a small adjustment to keep the whole last day, as sketched below.
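A sketch of the fully sargable range (placeholders as in the question; the averaged column follows the first query shown there):
SELECT question_id, COUNT(*) AS count, AVG(score) AS average_score
FROM Response
WHERE submit_date >= %(start_date)s -- midnight at the start of the first day
AND submit_date < %(end_date)s + INTERVAL 1 DAY -- includes all of the last day
GROUP BY question_id;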

Any way to do this query faster with big data

This query takes around 2.23 seconds and feels a bit slow... is there any way to make it faster?
The columns member.id, member_id, membership_id, valid_to, and valid_from have indexes as well.
select *
from member
where (member.id in ( select member_id from member_membership mm
INNER JOIN membership m ON mm.membership_id = m.id
where instr(organization_chain, 2513) and m.valid_to > NOW() and m.valid_from < NOW() ) )
order by id desc
limit 10 offset 0
WHAT THE QUERY DOES: every member has many member_membership rows, and member_membership connects to another table called membership, which holds the membership details. So the query should get all members that have valid memberships where organization id 2513 exists in member_membership.
Tables as following:
CREATE TABLE `member` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`first_name` varchar(255) DEFAULT NULL,
`last_name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
CREATE TABLE `member_membership` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`membership_id` int(11) DEFAULT NULL,
`member_id` int(11) DEFAULT NULL,
`organization_chain` text DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `member_membership_to_membership` (`membership_id`),
KEY `member_membership_to_member` (`member_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
CREATE TABLE `membership` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`valid_to` datetime DEFAULT NULL,
`valid_from` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `valid_to` (`valid_to`),
KEY `valid_from` (`valid_from`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
ALTER TABLE `member_membership` ADD CONSTRAINT `member_membership_to_membership` FOREIGN KEY (`membership_id`) REFERENCES `membership` (`id`);
ALTER TABLE `member_membership` ADD CONSTRAINT `member_membership_to_member` FOREIGN KEY (`member_id`) REFERENCES `member` (`id`);
Here is the EXPLAIN output: https://i.ibb.co/xjrcYWR/EXPLAIN.png
Relations
member has many member_membership
membership has many member_membership
So member_membership is a junction (join) table between member and membership.
Well, I found a way to get it down to about 800 ms... like this. Is this a good way, or is there more we can do?
select *
from member
where (member.id in ( select member_id from member_membership mm FORCE INDEX (PRIMARY)
INNER JOIN membership m ON mm.membership_id = m.id
where instr(organization_chain, 2513) and m.valid_to > NOW() and m.valid_from < NOW() ) )
order by id desc
limit 10 offset 0
NEW UPDATE... and I think this solves the issue: 15 ms :)
I added FORCE INDEX hints:
The FORCE INDEX hint acts like USE INDEX (index_list), with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the named indexes to find rows in the table.
select *
from member
where (member.id in ( select member_id from member_membership mm FORCE INDEX (member_membership_to_member)
INNER JOIN membership m FORCE INDEX (organization_to_membership) ON mm.membership_id = m.id
where instr(organization_chain, 2513) and m.valid_to > NOW() and m.valid_from < NOW() ) )
order by id desc
limit 10 offset 0
How big is organization_chain? If you don't need TEXT, use a reasonably sized VARCHAR so that it could be in an index. Better yet, is there some way to get 2513 in a column by itself?
Don't use id int(11) NOT NULL AUTO_INCREMENT, in a many-to-many table; rather have the two columns in PRIMARY KEY.
Put the ORDER BY and LIMIT in the subquery.
Don't use IN ( SELECT ... ); use a JOIN, as sketched below.
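For that last point, a minimal sketch of the JOIN form (DISTINCT guards against a member matching more than one membership row; trim the column list to what you actually need):
SELECT DISTINCT m.*
FROM member AS m
JOIN member_membership AS mm ON mm.member_id = m.id
JOIN membership AS ms ON ms.id = mm.membership_id
WHERE INSTR(mm.organization_chain, '2513') > 0
AND ms.valid_to > NOW()
AND ms.valid_from < NOW()
ORDER BY m.id DESC
LIMIT 10;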

MySQL Index with ordering

I have a table with 5 million rows. I haven't included my indexes here:
CREATE TABLE `my_table` (
`Id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`Title` CHAR(200) NULL DEFAULT NULL,
`ProjectId` INT(10) UNSIGNED NOT NULL,
`RoleId` INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (`Id`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;
When I run the SQL below, it takes more than one minute.
SELECT *
FROM `my_table` t
WHERE
t.ProjectId IN (123, 456, 789) AND
t.RoleId IN (111, 222, 333)
ORDER BY Title DESC
LIMIT 25
The question is: how do I properly add indexes for this table? Can you suggest any solutions?
EXPLAIN with the index on (ProjectId, RoleId) shows:
key = IndxProjectIdRoleId
ref = NULL,
rows: 32,463
Extra: Using where; Using filesort
Thanks for any suggestions.
You can try indexes on (ProjectId, RoleId, Title) and (RoleId, ProjectId, Title). One of them is likely to beat the current execution plan, but neither may help much: with two multi-value IN lists in the WHERE, the index cannot return the rows already sorted by Title, so a filesort is still needed.
MySQL actually has good documentation on multi-column indexes. You might want to review it.
A more complicated version of the query might work better:
(SELECT *
FROM `my_table` t
WHERE t.ProjectId = 123 AND t.RoleId = 111
ORDER BY Title DESC
LIMIT 25
) UNION ALL
(SELECT *
FROM `my_table` t
WHERE t.ProjectId = 123 AND t.RoleId = 222
ORDER BY Title DESC
LIMIT 25
)
UNION ALL
. . . -- The other 7 combinations
ORDER BY Title DESC
LIMIT 25;
This much longer version of the query can take advantage of either of the above indexes so each should be quite fast. In the end, the query has to sort up to 9 * 25 (225) records, and that should be pretty fast, even without an index.
I suggest a composite index:
INDEX my_index_name (ProjectId, RoleId)
In your case:
CREATE TABLE `my_table` (
`Id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`Title` CHAR(200) NULL DEFAULT NULL,
`ProjectId` INT(10) UNSIGNED NOT NULL,
`RoleId` INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (`Id`),
INDEX my_index_name (ProjectId,RoleId)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;
Also check whether the inverse column order is more selective:
INDEX my_index_name (RoleId, ProjectId)
And since your table has only a few columns, you can also try a covering index:
INDEX my_index_name (ProjectId, RoleId, Title, Id)
and select only the indexed columns, this way:
SELECT Id, Title, ProjectId, RoleId
FROM `my_table` t
WHERE
t.ProjectId IN (123, 456, 789) AND
t.RoleId IN (111, 222, 333)
ORDER BY Title DESC
LIMIT 25;
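If you try the covering index, a quick sanity check might look like this (the index name is arbitrary):
ALTER TABLE my_table ADD INDEX cover_idx (ProjectId, RoleId, Title, Id);
EXPLAIN SELECT Id, Title, ProjectId, RoleId
FROM my_table
WHERE ProjectId IN (123, 456, 789)
AND RoleId IN (111, 222, 333)
ORDER BY Title DESC
LIMIT 25;
The Extra column should now include "Using index" (no row lookups), though "Using filesort" may remain because of the two IN lists.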

Eliminating values from one table with another. Super slow

In the same database I have a table messages whose columns I want: id, title, text. I want only the records whose title has no entry in the table lastlogon, where the equivalent of title is named username.
I have been using this SQL command in PHP; it generally took 2-3 seconds to run:
SELECT DISTINCT * FROM messages WHERE title NOT IN (SELECT username FROM lastlogon) LIMIT 1000
This was all good until the table lastlogon came to hold about 80% of the values in table messages. messages has about 8000 entries, lastlogon about 7000. Now it takes one to two minutes to go through, and MySQL shoots up to very high CPU usage.
I tried the following but had no luck reducing the time:
SELECT id,title,text FROM messages a LEFT OUTER JOIN lastlogon b ON (a.title = b.username) LIMIT 1000
Why is it suddenly taking so long for such a low number of entries? I tried restarting MySQL and Apache multiple times. I am using Debian Linux.
Edit: Here are the structures
--
-- Table structure for table `lastlogon`
--
CREATE TABLE IF NOT EXISTS `lastlogon` (
`username` varchar(25) NOT NULL,
`lastlogon` date NOT NULL,
`datechecked` date NOT NULL,
PRIMARY KEY (`username`),
KEY `username` (`username`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
-- --------------------------------------------------------
--
-- Table structure for table `messages`
--
CREATE TABLE IF NOT EXISTS `messages` (
`id` smallint(9) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
`name` varchar(255) NOT NULL,
`email` varchar(50) NOT NULL,
`text` mediumtext,
`folder` tinyint(2) NOT NULL,
`read` smallint(5) unsigned NOT NULL,
`dateline` int(10) unsigned NOT NULL,
`ip` varchar(15) NOT NULL,
`attachment` varchar(255) NOT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`username` varchar(300) NOT NULL,
`error` varchar(500) NOT NULL,
PRIMARY KEY (`id`),
KEY `title` (`title`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=9010 ;
Edit 2
Edited structure with new indexes.
After putting an index on both messages.title and lastlogon.username I came up with these results:
Showing rows 0 - 29 (623 total, Query took 74.4938 sec)
First: replace the key on title with a compound key on (title, id):
ALTER TABLE messages DROP INDEX title;
ALTER TABLE messages ADD INDEX title (title, id);
Now change the select to:
SELECT m.* FROM messages m
LEFT JOIN lastlogon l ON (l.username = m.title)
WHERE l.username IS NULL
-- GROUP BY m.id DESC -- faster replacement for distinct. I don't think you need this.
LIMIT 1000;
Or
SELECT m.* FROM messages m
WHERE m.title NOT IN (SELECT l.username FROM lastlogon l)
-- GROUP BY m.id DESC -- faster than distinct, I don't think you need it though.
LIMIT 1000;
Another cause of the slowness is the SELECT m.* part.
By selecting all columns, you are forcing MySQL to do extra work.
Only select the columns you need:
SELECT m.title, m.name, m.email, ......
This will speed up the query as well.
There's another trick you can use:
Replace the limit 1000 with a cutoff date.
Step 1: Add an index on timestamp (or whatever field you want to use for the cutoff).
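A sketch of that DDL (the index name is arbitrary):
ALTER TABLE messages ADD INDEX idx_timestamp (`timestamp`);
Step 2: Use the cutoff in the query: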
SELECT m.* FROM messages m
LEFT JOIN lastlogon l ON (l.username = m.title)
WHERE (m.id > (SELECT MIN(M2.ID) FROM messages m2 WHERE m2.timestamp >= '2011-09-01'))
AND l.username IS NULL
-- GROUP BY m.id DESC -- faster replacement for distinct. I don't think you need this.
I suggest you add an index on messages.title. Then run the query again and test the performance.

Not Equals In Where Clause Uses Filesort, But Equals Doesn't. Why?

I'm revamping my site's inner-mail system, and I came across something I don't understand. Here are the tables:
CREATE TABLE `mails` (
`id` bigint(12) unsigned not null auto_increment,
`recipient_id` mediumint(8) unsigned not null,
`date_sent` datetime not null,
`status` enum('unread', 'read', 'deleted') default 'unread',
PRIMARY KEY(`id`),
INDEX(`recipient_id`, `status`, `date_sent`),
CONSTRAINT FOREIGN KEY (`recipient_id`) REFERENCES `members` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;
CREATE TABLE `mail_contents` (
`mail_id` bigint(12) unsigned not null,
`sender_id` mediumint(8) unsigned not null,
`subject` varchar(150) default '',
`content` text not null,
CONSTRAINT FOREIGN KEY (`sender_id`) REFERENCES `members` (`id`) ON DELETE CASCADE,
CONSTRAINT FOREIGN KEY (`mail_id`) REFERENCES `mails` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;
And here's the query:
SELECT *
FROM mails AS m
LEFT JOIN mail_contents AS mc ON mc.mail_id = m.id
WHERE recipient_id = 66
AND status != 'deleted'
ORDER BY date_sent DESC
LIMIT 40\G
An EXPLAIN on the query shows "Using where; Using index; Using filesort". However if I change the query to this:
SELECT *
FROM mails AS m
LEFT JOIN mail_contents AS mc ON mc.mail_id = m.id
WHERE recipient_id = 66
AND status = 'read'
ORDER BY date_sent DESC
LIMIT 40\G
The EXPLAIN shows "Using where; Using index". For some reason using != in the first query causes a filesort, but using = in the second query doesn't. What is going on under the hood that causes the difference?
Equals is inclusive and != is exclusive, and it is more efficient for MySQL to satisfy an inclusive test from an index. The "Using filesort" is the negative sign here: it means MySQL must collect the matching rows and sort them itself (in memory or in a temporary file) instead of reading them from the index already in the desired order.
The INDEX(recipient_id, status, date_sent) is used in the second query for sorting, because the first 2 columns are fixed and the ORDER BY uses the 3rd one, date_sent => no filesort.
It cannot be used that way in the first query because status is not a constant (!= 'deleted').
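If the filesort actually hurts, the UNION technique from the first answer above applies here too: turn the exclusive != into two inclusive = lookups so each leg can read the index in date_sent order (a sketch against the mails table alone; the join to mail_contents can then be applied to the 40 survivors):
( SELECT * FROM mails
WHERE recipient_id = 66 AND status = 'unread'
ORDER BY date_sent DESC
LIMIT 40
) UNION ALL
( SELECT * FROM mails
WHERE recipient_id = 66 AND status = 'read'
ORDER BY date_sent DESC
LIMIT 40
)
ORDER BY date_sent DESC
LIMIT 40;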
More information here: ORDER BY Optimization