Performance with MySQL query to load recent chat messages

I'm having some performance issues with a MySQL query for a chat application I'm in the process of building.
I'm trying to grab the most recent messages from a conversation. I'm testing with a table with approx 3 million rows in it (an export from an older version of the application). When loading from some conversations, it's quick. When loading from others, the query takes significantly longer.
Here are the details of the table setup; it's an InnoDB table:
Column Type Comment
id int(10) unsigned Auto Increment
from int(10) unsigned NULL
to int(10) unsigned NULL
date int(10) unsigned NULL
message text NULL
read tinyint(1) NULL [0]
And here are the indexes I have:
PRIMARY id
INDEX from
INDEX to
INDEX date
This is an example of the current query that I'm running:
SELECT *
FROM `chat`
WHERE
(`from` =2 and `to` = 342)
OR
(`to` = 2 and `from` = 342)
ORDER BY `id` DESC
LIMIT 10
Now, when I run this query with this user combination (which only has a total of 325 rows in the database), it takes 1.5+ seconds.
However, if I use a different user combination which has a total of 12,000 rows in the database, like this:
SELECT *
FROM `chat`
WHERE
(`from` =2 and `to` = 10153)
OR
(`to` = 2 and `from` = 10153)
ORDER BY `id` DESC
LIMIT 10
Then the query runs in approximately 35-40 ms. Quite a big difference, and the opposite of what I would expect.
I'm sure I'm missing something here and would appreciate any help pointing me in the right direction for optimizing all of this.

It's not about how many records the user has. You have created one table for all chats, which is the issue when you try to fetch the first 10 records: with ORDER BY `id` DESC, MySQL scans backwards from the newest rows and stops once it finds 10 matches, so a conversation whose users have posted recently is served quickly, while one whose last message is old forces a scan deep into the table.
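If restructuring the tables isn't an option, one thing worth trying (a sketch, not part of the original answer) is a pair of composite indexes so that each (`from`, `to`) combination can be read already ordered by `id`, making the latency independent of how recent the conversation is:
-- Sketch: composite indexes; the index names are illustrative
ALTER TABLE `chat`
  ADD INDEX `from_to_id` (`from`, `to`, `id`),
  ADD INDEX `to_from_id` (`to`, `from`, `id`);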

Another thing you can try: rather than using OR, use UNION, which will give a little advantage.
Try this:
SELECT *
FROM `chat`
WHERE
(`from` =2 and `to` = 342)
UNION
SELECT *
FROM `chat`
WHERE
(`to` = 2 and `from` = 342)
ORDER BY `id` DESC
LIMIT 10
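A variant of that idea (again a sketch, not from the original answer): give each branch its own ORDER BY and LIMIT so that, with composite indexes like those sketched above, each side has to read at most 10 rows before the final merge:
( SELECT * FROM `chat`
  WHERE `from` = 2 AND `to` = 342
  ORDER BY `id` DESC
  LIMIT 10 )
UNION ALL
( SELECT * FROM `chat`
  WHERE `to` = 2 AND `from` = 342
  ORDER BY `id` DESC
  LIMIT 10 )
ORDER BY `id` DESC
LIMIT 10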
The time taken by the query in your case will also depend on how long ago the users last messaged each other.
To avoid that, you would have to change your model and not keep all messages in a single table.


MySQL 8 - Slow select when order by combined with limit

I'm having trouble understanding my options for optimizing this specific query. Looking online, I find various resources, but all for queries that don't quite match mine. From what I can gather, it's very hard to optimize a query when you have an ORDER BY combined with a LIMIT.
My use case is a paginated datatable that displays the latest records first.
The query in question is the following (to fetch 10 latest records):
select
`xyz`.*
from
xyz
where
`xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by
`registration_id` desc
limit 10 offset 0
And the table DDL:
CREATE TABLE `xyz` (
`registration_id` int NOT NULL AUTO_INCREMENT,
`fk_campaign_id` int DEFAULT NULL,
`fk_customer_id` int DEFAULT NULL,
... other fields ...
`voided` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`registration_id`),
.... ~12 other indexes ...
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
) ENGINE=InnoDB AUTO_INCREMENT=280614594 DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;
The explain on the query mentioned gives me the following:
"id","select_type","table","partitions","type","possible_keys","key","key_len","ref","rows","filtered","Extra"
1,SIMPLE,db_campaign_registration,,index,"getTop5,winners,findByPage,foreignKeyExistingCheck,limitReachedIp,byCampaign,emailExistingCheck,getAll,getAllDated,activityOverview",PRIMARY,"4",,1626,0.65,Using where; Backward index scan
As you can see, it says it only hits 1,626 rows. But when I execute it, it takes 200+ seconds to run.
I'm doing this to fetch data for a datatable that should display the latest 10 records. I also have pagination that lets one navigate pages (only to the next page, not to the last page or any big jumps).
To help give the full picture, I've put together a dbfiddle: https://dbfiddle.uk/Jc_K68rj - this fiddle does not show the same behaviour as my table, but I suspect that is because of the data size I have.
The table in question has 120 GB of data and 39,000,000 active records. I already have an index in place that should cover the query and allow it to fetch the data fast. Am I completely missing something here?
Another solution goes something like this:
SELECT b.*
FROM ( SELECT registration_id
FROM xyz
where `xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by `registration_id` desc
limit 10 offset 0 ) AS a
JOIN xyz AS b USING (registration_id)
order by `registration_id` desc;
Explanation:
The derived table (subquery) will use the 'best' index without any extra prompting -- since it is "covering".
That will deliver 10 ids
Then 10 JOINs to the table to get xyz.*
A derived table is unordered, so the ORDER BY does need repeating.
That's tricking the Optimizer into doing what it should have done anyway.
(Again, I encourage getting rid of any indexes that are prefixes of the 3-column, optimal index discussed.)
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
is optimal. (Nearly as good is the same index, but without the DESC).
Let's see the other indexes. I strongly suspect that there is at least one index that is a prefix of that index. Remove it/them. The Optimizer sometimes gets confused and picks the "smaller" index instead of the "better" index.
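For example (hypothetical, since the definitions of the other ~12 indexes weren't shown): if `byCampaign` turned out to be just (`fk_campaign_id`), or (`fk_campaign_id`, `voided`), it would be such a prefix and could be dropped:
-- Hypothetical: only do this if the index really is a prefix of activityOverview
ALTER TABLE xyz DROP INDEX byCampaign;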
Here's a technique for seeing whether it manages to read only 10 rows instead of most of the table: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts
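Roughly, that technique boils down to resetting the session handler counters, running the query, and then inspecting them (a sketch of the approach described at that link):
FLUSH STATUS;
SELECT `xyz`.*
FROM xyz
WHERE `xyz`.`fk_campaign_id` = 95870
  AND `xyz`.`voided` = 0
ORDER BY `registration_id` DESC
LIMIT 10 OFFSET 0;
-- With a good index, the Handler_read_* counters should be in the tens,
-- not in the thousands or millions.
SHOW SESSION STATUS LIKE 'Handler%';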

Seek paginated query gets progressively slower on a big table

I have a biggish table of events. (5.3 million rows at the moment). I need to traverse this table mostly from the beginning to the end in a linear fashion. Mostly no random seeks. The data currently includes about 5 days of these events.
Due to the size of the table I need to paginate the results, and the internet tells me that "seek pagination" is the best method.
However, while this method works great and fast for traversing the first 3 days, after that MySQL really begins to slow down. I've figured out it must be something I/O-bound, as my CPU usage actually falls as the slowdown starts.
I do believe this has something to do with the 2-column sorting I do and the use of filesort; maybe MySQL needs to read all the rows to sort my results or something. Indexing correctly might be the proper fix, but so far I've been unable to find an index that solves my problem.
The complicating part of this database is the fact that the ids and timestamps are NOT perfectly in order. The software requires the data to be ordered by timestamp. However, when adding data to this database, some events are added 1 minute after they actually happened, so the auto-incremented ids are not in chronological order.
As of now, the slowdown is so bad that my 5-day traversal never finishes. It just gets slower and slower...
I've tried indexing the table in multiple ways, but MySQL does not seem to want to use those indexes, and EXPLAIN keeps showing "filesort". An index is used for the WHERE clause, though.
The workaround I'm currently using is to first do a full table traversal and load all the row ids and timestamps into memory. I sort the rows on the Python side of the software and then load the full data in smaller chunks from MySQL as I traverse (by ids only). This works fine, but is quite inefficient due to the two traversals of the same data.
The schema of the table:
CREATE TABLE `events` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`server` varchar(45) DEFAULT NULL,
`software` varchar(45) DEFAULT NULL,
`timestamp` bigint(20) DEFAULT NULL,
`data` text,
`event_type` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index3` (`timestamp`,`server`,`software`,`id`),
KEY `index_ts` (`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=7410472 DEFAULT CHARSET=latin1;
The query (one possible line):
SELECT software,
server,
timestamp,
id,
event_type,
data
FROM events
WHERE ( server = 'a58b'
AND ( software IS NULL
OR software IN ( 'ASD', 'WASD' ) ) )
AND ( timestamp, id ) > ( 100, 100 )
AND timestamp <= 200
ORDER BY timestamp ASC,
id ASC
LIMIT 100;
The query is based on https://blog.jooq.org/2013/10/26/faster-sql-paging-with-jooq-using-the-seek-method/ (and some other postings with the same idea). I believe it is called "seek pagination" with a seek predicate. The basic gist is that I have a starting timestamp and an ending timestamp, and I need to get all the events for the software on the servers I've specified, OR only the server-specific events (software = NULL). The weird-ish parentheses are due to Python constructing the queries based on the parameters it is given. I left them visible on the small chance they might have some effect.
I'm expecting the traversal to finish before the heat death of the universe.
First change
AND ( timestamp, id ) > ( 100, 100 )
to
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
This optimisation is suggested in the official documentation: Row Constructor Expression Optimization
Now the engine will be able to use the index on (timestamp). Depending on the cardinality of the columns server and software, that could already be fast enough.
An index on (server, timestamp, id) should improve the performance further.
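For instance (the index name is illustrative):
ALTER TABLE events ADD INDEX idx_server_ts_id (`server`, `timestamp`, `id`);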
If that is still not fast enough, I would suggest a UNION optimization for
AND (software IS NULL OR software IN ('ASD', 'WASD'))
That would be:
(
SELECT software, server, timestamp, id, event_type, data
FROM events
WHERE server = 'a58b'
AND software IS NULL
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
AND timestamp <= 200
ORDER BY timestamp ASC, id ASC
LIMIT 100
) UNION ALL (
SELECT software, server, timestamp, id, event_type, data
FROM events
WHERE server = 'a58b'
AND software = 'ASD'
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
AND timestamp <= 200
ORDER BY timestamp ASC, id ASC
LIMIT 100
) UNION ALL (
SELECT software, server, timestamp, id, event_type, data
FROM events
WHERE server = 'a58b'
AND software = 'WASD'
AND (timestamp > 100 OR timestamp = 100 AND id > 100)
AND timestamp <= 200
ORDER BY timestamp ASC, id ASC
LIMIT 100
)
ORDER BY timestamp ASC, id ASC
LIMIT 100
You will need to create an index on (server, software, timestamp, id) for this query.
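That could be, for instance (the index name is illustrative):
ALTER TABLE events ADD INDEX idx_server_sw_ts_id (`server`, `software`, `timestamp`, `id`);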
There are multiple complications going on.
The quick fix is
INDEX(server, timestamp, id) -- in this order
together with
WHERE server = 'a58b'
AND timestamp BETWEEN 100 AND 200
AND ( software IS NULL
OR software IN ( 'ASD', 'WASD' ) )
AND ( timestamp, id ) > ( 100, 100 )
ORDER BY timestamp ASC,
id ASC
LIMIT 100;
Note that server needs to be first in the index, not after the thing you are doing a range on (timestamp). Also, I broke out timestamp BETWEEN ... to make it clear to the optimizer that the next column of the ORDER BY might make use of the index.
You said "pagination", so I assume you have an OFFSET, too? Add it back in so we can discuss the implications. My blog on "remembering where you left off" instead of using OFFSET may (or may not) be practical.

Query taking long time to execute in AWS RDS

I am working on some temp tables for practice.
One query is taking too much time, around 550 seconds. The DB is hosted in AWS RDS with 8 CPUs and 16 GB of RAM.
The query below has to be run against a different DB (prod); first I'm checking it against a test DB.
create table test_01 as
select *
from
(
select
person
,age
,dob
,place
from
person
where
person is not null
and age is not null
and dob is not null
and place is not null
limit 1000
) ps_u
left join
employee em_u
on ps_u.age = em_u.em_age
and ps_u.place = em_u.location
order by person
limit 1000
Is the issue with the query or with the resources?
CPU utilization shows about 30%, and RAM usage is fine, not too high.
Let me know any suggestions to optimize the query.
Check your LEFT JOIN; it can be a reason for this. A LEFT JOIN will return everything from your left table, and if that table has a lot of entries, it will slow down your query.
With that in mind, you can break your query into two separate queries and check the execution time with different tweaks.
Try to return specific columns rather than *.
Since you are limiting the result (with LIMIT 1000), do you really need ORDER BY person? If the result is huge, the ORDER BY could adversely affect performance.
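As a rough illustration of splitting it up (hypothetical; it reuses the tables from the question and a throwaway temporary table), you could time the filter step and the join step separately:
-- Step 1: the filter alone
CREATE TEMPORARY TABLE tmp_ps AS
SELECT person, age, dob, place
FROM person
WHERE person IS NOT NULL
  AND age IS NOT NULL
  AND dob IS NOT NULL
  AND place IS NOT NULL
LIMIT 1000;

-- Step 2: the join against the 1000-row result
SELECT tmp_ps.*
FROM tmp_ps
LEFT JOIN employee em_u
  ON tmp_ps.age = em_u.em_age
 AND tmp_ps.place = em_u.location
ORDER BY tmp_ps.person
LIMIT 1000;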
You can remove one SELECT statement; also, the LEFT JOIN brings all records from the left table, which could take time to process.
CREATE TABLE test_01 AS
(SELECT ps_u.person,
ps_u.age,
ps_u.dob,
ps_u.place
FROM person ps_u
LEFT JOIN employee em_u ON ps_u.age = em_u.em_age
AND ps_u.place = em_u.location
WHERE ps_u.person IS NOT NULL
AND ps_u.age IS NOT NULL
AND ps_u.dob IS NOT NULL
AND ps_u.place IS NOT NULL
ORDER BY ps_u.person
LIMIT 1000)
I solved it by creating an index on the columns:
alter table person
add fulltext index `fulltext`
(
  person
, age
, dob
, place
)
;
And then the query took only 3 seconds for 1000 records

MySQL Query suggestions - column vs join

We have a user table like this - with over 20 million entries.
CREATE TABLE `users` (
`uid` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(64) default '',
`email` varchar(64) default '',
`flag` int(10) unsigned DEFAULT '0',
PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
In our admin panel, we'd like to show a few pinned users and search results from the user table.
There are two approaches we have thought of for showing pinned users (please suggest better approaches if there are any):
1) Add a separate column in the users table for pinned users. However, pinned users are a handful (fewer than 100) compared to the total number of users (> 20M). Hence, this approach doesn't appear promising.
2) Keep a separate table of pinned users and use a join:
CREATE TABLE `pinnedusers` (
`uid` int(10) unsigned NOT NULL default 0,
PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and run a join, for example,
select *
from users
left join pinnedusers
on pinnedusers.uid=users.uid
order by pinnedusers.uid desc
limit 200;
However, we are worried about the performance of the second approach as it involves join, order, limit.
What do you suggest?
This should produce the results you want, and indicate which rows are "pinned" users.
SELECT
a.*,
IF(b.`uid` IS NULL,0,1) as `is_pinned`
FROM users a
LEFT JOIN pinnedusers b
on b.uid = a.uid
ORDER BY IF(b.`uid` IS NULL,0,1) DESC, a.uid DESC
LIMIT 200;
It's not the JOIN to fear, it's the need to scan the entire 20M rows.
To "show a few pinned users", use this to "show up to 200 pinned users":
SELECT u.*
FROM pinnedusers p
JOIN users u USING(uid)
ORDER BY ...
LIMIT 200;
If there are fewer than 200 pinned users, the list will have fewer than 200 rows. But the query will be fast.
Your original query and Sloan's are both terribly slow because they have to scan the entire 20M rows (assuming fewer than 200 are pinned).
If you want all the pinned users, plus enough non-pinned users to fill out a list of 200, that is a different task. But, which non-pinned users would you like? The first few? That would be quick. A random few? That is more complex, else it, too, would scan the entire 20M. Some other criteria?
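One possible sketch of the "all pinned users plus the first few non-pinned users" variant (not from the answers above; it assumes "first few" means the highest uids, so the non-pinned branch should be able to stop after roughly 200 rows):
( SELECT u.*, 1 AS is_pinned
  FROM pinnedusers p
  JOIN users u USING (uid) )
UNION ALL
( SELECT u.*, 0 AS is_pinned
  FROM users u
  LEFT JOIN pinnedusers p USING (uid)
  WHERE p.uid IS NULL         -- non-pinned only
  ORDER BY u.uid DESC         -- "the first few", newest uids first
  LIMIT 200 )
ORDER BY is_pinned DESC, uid DESC
LIMIT 200;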

Missing MySQL optimization

I have precomputed some similarities (about 70 million) and want to find the similarities from one track to all other tracks. I only need the top 100 tracks with the highest similarities. For my calculations I run this query about 15,000 times with different tracks as input. After a reboot of the machine, one complete calculation (all 15k queries) needs over 600 seconds. After several runs, MySQL has (I think) cached the indexes, so the complete run needs about 15 seconds. My only worry is that I have a very high Handler_read_rnd_next value.
I have a MySQL table with this structure:
CREATE TABLE `similarity` (
`similarityID` int(11) NOT NULL AUTO_INCREMENT,
`trackID1` int(11) NOT NULL,
`trackID2` int(11) NOT NULL,
`tracksim` double DEFAULT NULL,
`timesim` double DEFAULT NULL,
`tagsim` double DEFAULT NULL,
`simsum` double DEFAULT NULL,
PRIMARY KEY (`similarityID`),
UNIQUE KEY `trackID1` (`trackID1`,`trackID2`),
KEY `trackID1sum` (`trackID1`,`simsum`),
KEY `trackID2sum` (`trackID2`,`simsum`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I want to run a great many queries on this. The queries look like this:
-- simsum is a sum over tracksim, timesim and tagsim
(
SELECT similarityID, trackID2, tracksim, timesim, tagsim, simsum
FROM similarity
WHERE trackID1 = 512
ORDER BY simsum DESC
LIMIT 0,100
)
UNION
(
SELECT similarityID, trackID1, tracksim, timesim, tagsim, simsum
FROM similarity
WHERE trackID2 = 512
ORDER BY simsum DESC
LIMIT 0,100
)
ORDER BY simsum DESC
LIMIT 0,100
The query is quite fast, under 0.1 sec (see my previous question), but I'm worried about the very large numbers on the status page. I thought I had created every index that the query uses.
Handler_read_rnd 88.0 M
Handler_read_rnd_next 20.0 G
Is there anything "wrong"? Could I get the query even faster? Do I have to worry about the 20G?
Thanks in advance
The first thing which is obviously wrong here is that you seem to be storing a directional relationship between pairs: if f(a,b) === f(b,a), then you could simplify your system a lot by swapping trackID1 and trackID2 wherever trackID1 is greater than trackID2, while retaining the existing primary key (and ignoring collisions).
You're only halving the amount of data, so it won't be a huge performance increase.
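Here is a hedged sketch of that normalisation (assuming the similarity measure really is symmetric; note that, unlike the suggestion above, the moved rows get new similarityID values):
-- Re-insert the reversed pairs in canonical order (trackID1 < trackID2);
-- INSERT IGNORE skips pairs whose reversed form already exists (the collisions).
INSERT IGNORE INTO similarity
    (trackID1, trackID2, tracksim, timesim, tagsim, simsum)
SELECT trackID2, trackID1, tracksim, timesim, tagsim, simsum
FROM similarity
WHERE trackID1 > trackID2;

-- The non-canonical originals (including the collision rows) are now redundant.
DELETE FROM similarity WHERE trackID1 > trackID2;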
There may be further scope for improving the performance, but that is very much dependent on how frequently the data changes; more specifically, you should prune the records where the similarity is not in the top 100.