I would appreciate if someone could explain how is it possible MySQL is not churning with a large table on default config.
note: I don't need advice how to increase the memory, improve the performance or migrate etc. I want to understand why it is working and performing well.
I have the following table:
CREATE TABLE `daily_reads` (
`a` varchar(32) NOT NULL DEFAULT '',
`b` varchar(50) NOT NULL DEFAULT '',
`c` varchar(20) NOT NULL DEFAULT '',
`d` varchar(20) NOT NULL DEFAULT '',
`e` varchar(20) NOT NULL DEFAULT '',
`f` varchar(10) NOT NULL DEFAULT 'Wh',
`g` datetime NOT NULL,
`PERIOD_START` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`i` decimal(16,3) NOT NULL,
`j` decimal(16,3) NOT NULL DEFAULT '0.000',
`k` decimal(16,2) NOT NULL DEFAULT '0.00',
`l` varchar(1) NOT NULL DEFAULT 'N',
`m` varchar(1) NOT NULL DEFAULT 'N',
PRIMARY KEY (`a`,`b`,`c`,`PERIOD_START`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
It is running on a VM with 1 CPU Core, 6GB RAM, CentOS 7 (have very limited access to that VM).
It is running on a default MySQL config with 128MB buffer pool (SELECT ##innodb_buffer_pool_size/1024/1024)
DB size is ~96GB, ~560M rows in the 'reads' table, ~710M rows with other tables.
select database_name, table_name, index_name, stat_value*##innodb_page_size
from mysql.innodb_index_stats where stat_name='size';
PRIMARY: 83,213,500,416 (no other indexes)
I get like ~500K reads/month and writes are done only as part of an ETL process directly from Informatica to the DB (~ 75M writes/month).
The read queries are called only via stored procedure:
CALL sp_get_meter_data('678912345678', '1234567765432', '2017-01-13 00:00:00', '2017-05-20 00:00:00');
// striped out the not important bits:
...
SET daily_from_date = DATE_FORMAT(FROM_DATE_TIME, '%Y-%m-%d 00:00:00');
SET daily_to_date = DATE_FORMAT(TO_DATE_TIME, '%Y-%m-%d 23:59:59');
...
SELECT
*
FROM
daily_reads
WHERE
A = FRIST_NUMBER
AND
B = SECOND_NUMBER
AND
daily_from_date <= PERIOD_START
AND
daily_to_date >= PERIOD_START
ORDER BY
PERIOD_START ASC;
My understanding of InnoDB is quite limited, but I thought I need to fit all indexes into memory to do fast queries. The read procedure takes only a few milliseconds. I thought it is not technically possible to query 500M+ tables fast enough on a default MySQL config...?
What am I missing?
note: I don't need advice how to increase the memory, improve the performance or migrate etc. I want to understand why it is working and performing well.
Long answer: Your primary key is a composite of several columns starting with a and b.
Your WHERE clause says this.
WHERE a = FRIST_NUMBER
AND b = SECOND_NUMBER
AND etc etc.
This WHERE clauses exploits the index associated with your primary key very efficiently indeed. It random-accesses the index to precisely the first row it needs, and then scans it sequentially. So it doesn't actually have to page in much of your index or your table to satisfy your query.
Short answer: When queries exploit indexes, MySQL is fast and cheap.
If you wanted an index that was perfect for this query, it would be a composite index on (a, b, daily_from_date). This would use equality matching to hit the first matching row in the index, then range scan the index for your chosen date range. But the performance you have now is pretty good.
You asked whether the index must fit entirely in memory. No. The entire purpose of DBMS software is to handle volumes of data that can't possibly fit in memory at once. Good DBMS implementations do a good job of maintaining memory caches, and refreshing those caches from mass storage, when needed. The innodb buffer pool is one such cache. Keep in mind that any insertions or updates to a table require both the table data and the index data to be written to mass storage eventually.
The performances can be improved with some index.
In your specific case, you are filtering on 3 columns: A, B, and PERIOD_START.
To speed up the query you can use index on this columns.
Add an index over PERIOD_START can be inefficient because this type stores TIME information, so you have a lot of differnt values in the same day.
You can add a new column to store the DATE part of PERIOD_START in the correct type (DATE) (something like PERIOD_START_DATE) and add an index on this column.
This makes a more effective indexing and this can improve the computation performance because you are using a look up table (key -> values).
If you do not want to change your client code, you can use a "Generated stored column". See MySql manual
Best regards
its possible your index is getting used (probably not given the leading edge doesnt match the columns in your query) but even if it isn't, you'd only ever read through the table once because the query doesn't have any joins and the subsequent runs would pick the cached results.
Since You're using informatica to load the data (its a swiss army knife of data loading) it may be doing a lot more than you realise e.g. assuming the data load is all inserts then it may drop and recreate indexes and run in bulk mode to load the data really quickly. It may even prerun the query to prime your cache with the first post load run.
Doesn't the index have to fit in memory?
No, the entire index does not have to fit in memory. Only the part of the index that needs to be examined during the query execution.
Since you have conditions on the left-most columns of your primary key (which is you clustered index), the query only examines rows that match the values you search for. The rest of the table is not examined at all.
You can try using EXPLAIN with your query and see an estimate of the number of rows examined. This is only a rough estimate calculated by the optimizer, but it should show that your query only needs to examine a small subset of the 550 million rows.
The InnoDB buffer pool keeps copies of frequently-used pages in RAM. The more frequently a page is used, the more likely it is to stay in the buffer pool and not get kicked out. Over time, as you run queries, your buffer pool gradually stabilizes with the set of pages that is most worth keeping in RAM.
If your query workload were to really scan your entire table frequently, then the small buffer pool would churn a lot more. But it's likely that your queries request the same small subset of the table repeatedly. A phenomenon called the Pareto Principle applies in many real-world applications: the majority of the requests are satisfied by a small minority of data.
This principle tends to fail when we run complex analytical queries, because those queries are more likely to scan the entire table.
Related
I've recently noticed a significant increase in variance of time needed to complete simple INSERT statements. While these statements on average take around 11ms, they can sometimes take 10-30 seconds, and I even noticed them taking over 5 minutes to execute.
MySQL version is 8.0.24, running on Windows Server 2016. The server's resources are never overloaded as far as I can tell. There is an ample amount of cpu overhead for the server to use, and 32GB of ram is allocated to it.
This is the table I'm working with:
CREATE TABLE `saved_segment` (
`recording_id` bigint unsigned NOT NULL,
`index` bigint unsigned NOT NULL,
`start_filetime` bigint unsigned NOT NULL,
`end_filetime` bigint unsigned NOT NULL,
`offset_and_size` bigint unsigned NOT NULL DEFAULT '18446744073709551615',
`storage_id` tinyint unsigned NOT NULL,
PRIMARY KEY (`recording_id`,`index`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
The table has no other indices or foreign keys, nor is it used as a reference for a foreign key in any other table. The entire table size is approximately 20GB with around 281M rows, which doesn't strike me as too large.
The table is used almost entirely read-only, with up to 1000 reads per second. All of these reads happen in simple SELECT queries, not in complex transactions, and they utilize the primary key index efficiently. There are very few, if any, concurrent writes to this table. This has been done intentionally in order to try to figure out if it would help with slow inserts, but it didn't. Before that there were up to 10 concurrent inserts going on at all times. UPDATE or DELETE statements are never executed on this table.
The queries that I'm having trouble with are all structured like this. They never appear in a transaction. While the inserts are definitely not append-only according to the clustered primary key, the queries almost alwayas insert 1 to 20 adjacent rows into the table:
INSERT IGNORE INTO saved_segment
(recording_id, `index`, start_filetime, end_filetime, offset_and_size, storage_id) VALUES
(19173, 631609, 133121662986640000, 133121663016640000, 20562291758298876, 10),
(19173, 631610, 133121663016640000, 133121663046640000, 20574308942546216, 10),
(19173, 631611, 133121663046640000, 133121663076640000, 20585348350688128, 10),
(19173, 631612, 133121663076640000, 133121663106640000, 20596854568114720, 10),
(19173, 631613, 133121663106640000, 133121663136640000, 20609723363860884, 10),
(19173, 631614, 133121663136640000, 133121663166640000, 20622106425668780, 10),
(19173, 631615, 133121663166640000, 133121663196640000, 20634653501528448, 10),
(19173, 631616, 133121663196640000, 133121663226640000, 20646967172721148, 10),
(19173, 631617, 133121663226640000, 133121663256640000, 20657773176227488, 10),
(19173, 631618, 133121663256640000, 133121663286640000, 20668825200822108, 10)
This is the output for an EXPLAIN statement of the above query:
id
select_type
table
partitions
type
possible_keys
key
key_len
ref
rows
filtered
Extra
1
INSERT
saved_segment
NULL
ALL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
These problems are relatively recent and weren't apparent while the table was around twice as small.
I tried reducing the number of concurrent inserts into the table, from around 10 to 1. I also deleted foreign keys on some columns (recording_id) in order to speed up inserts farther. ANALYZE TABLE and schema profiling didn't yield any actionable information.
One solution I had in mind was to remove the clustered primary key, add an AUTOINCREMENT primary key and a regular index on the (recording_id, index) columns. In my mind this would help by making inserts 'append only'. I'm open to any and all suggestions, thanks in advance!
EDIT:
I'm going to address some points and questions raised in the comments and answers:
autocommit is set to ON
the value of innodb_buffer_pool_size is 21474836480, and the value of innodb_buffer_pool_chunk_size is 134217728
One comment raised a concern about contention between the read-locks used by reads, and the exclusive locks used by writes. The table is question is used somewhat like a cache, and I don't need reads to always reflect the most up to date state of the table, if that would mean increased performance. The table should however remain durable even in cases of server crashes and hardware failures. Is this possible to achieve with a more relaxed transaction isolation level?
The schema could definitely be optimized; recording_id could be 4 byte integer, end_filetime could instead be an elapsed value, and start_filetime could also probably be smaller. I'm afraid that these changes would just push the issue back for a while until the table grows in size to compensate for the space savings.
INSERTs into the table are always sequential
SELECTs executed on the table look like this:
SELECT TRUE
FROM saved_segment
WHERE recording_id = ? AND `index` = ?
SELECT index, start_filetime, end_filetime, offset_and_size, storage_id
FROM saved_segment
WHERE recording_id = ? AND
start_filetime >= ? AND
start_filetime <= ?
ORDER BY `index` ASC
The second type of query could definitely be improved with an index, but I'm afraid this would further degrade INSERT performance.
Another thing that I forgot to mention is the existence of a very similar table to this one. It is queried and inserted into in exactly the same manner, but might further contribute to IO starvation.
EDIT2:
Results of SHOW TABLE STATUS for the table saved_segment, and a very similar table saved_screenshot (this one has an aditional INDEX on an bigint unsigned not null column).
Name
Engine
Version
Row_format
Rows
Avg_row_length
Data_length
Max_data_length
Index_length
Data_free
Auto_increment
Create_time
Update_time
Check_time
Collation
Checksum
Create_options
Comment
saved_screenshot
InnoDB
10
Dynamic
483430208
61
29780606976
0
21380464640
6291456
NULL
"2021-10-21 01:03:21"
"2022-11-07 16:51:45"
NULL
utf8mb4_0900_ai_ci
NULL
saved_segment
InnoDB
10
Dynamic
281861164
73
20802699264
0
0
4194304
NULL
"2022-11-02 09:03:05"
"2022-11-07 16:51:22"
NULL
utf8mb4_0900_ai_ci
NULL
I'll go out on a limb with this Answer.
Assuming that
The value of innodb_buffer_pool_size is somewhat less than 20MB, and
Those 1K Selects/second randomly reach int various parts of the table, then
The system recently become I/O bound because the 'next' block needed for the next Select is becoming more often not cached in the buffer_pool.
The simple solution is to get more RAM and up the setting of that tunable. But the table will only grow to whatever next limit you purchase.
Instead, here are some partial solutions.
If the numbers don't get too big, the first two columns could be INT UNSIGNED (4 bytes instead of 8) or maybe even MEDIUMINT UNSIGNED (3 bytes). Caution the ALTER TABLE would lock the table for a long time.
Those start and end times look like timestamps with fractional seconds that are always ".000". DATETIME and TIMESTAMP take 5 bytes (versus 8).
Your sample shows 0 elapsed time. If (end-start) is usually very small, then storing elapsed instead of endtime would further shrink the data. (But make it messy to use the endtime).
The sample data you presented looks "consecutive". That is about as efficient as an autoincrement. Is that the norm? If not, the INSERTs could be part of the I/O thrashing.
Your suggestion of adding AI, plus a secondary index, sort of doubles the effort for Inserts; so I do not recommend it.
More
just push the issue back for a while until the table grows in size
Yes, that will be the case.
Both of your queries are optimally helped by this as an INDEX or, even better, as the start of the PRIMARY KEY:
(recording_id, index)
Re:
SELECT TRUE
FROM saved_segment
WHERE recording_id = ? AND `index` = ?
If that is used to control some other SQL, consider adding this to that other SQL:
... EXISTS ( SELECT 1
FROM saved_segment
WHERE recording_id = ? AND `index` = ? ) ...
That query (in either form) needs what you already have
PRIMARY KEY(recording_id, index)
Your other query needs
INDEX(recording_id, start_filetime)
So, add that INDEX, or...
Even better... This combination would be better for both SELECTs:
PRIMARY KEY(recording_id, start_filetime, index).
INDEX(recording_id, index)
With that combo,
The single-row existence check would be performed "Using index" because it is "covering".
And the other query would find all the relevant rows clustered together on the PK.
(The PK has those 3 columns because it needs to be Unique. And they are in that order to benefit your second query. And it is the PK, not just an INDEX so it does not need to bounce between the index's BTree and the data's BTree.)
The "clustering" may help your performance by cutting down on the number of disk blocks needed for such queries. This leads to less "thrashing" in the buffer_pool, hence less need to increase RAM.
My index suggestions are mostly orthogonal to my datatype suggestions.
Dear StackOverflow Members
It's my first post, so please be nice :-)
I have a strange SQL behavior which i can't explain and don't find any resources which explains it.
I have built a web honeypot which record all access and attacks and display it on a statistic page.
However since the data increased, the generation of the statistic page is getting slower and slower.
I narrowed it down to a some select statements which takes a quite a long time.
The "issue" seems to be an index on a specific column.
*For sure the real issue is my lack of knowledge :-)
Database: mysql
DB schema
Event Table (removed unrelated columes):
Event table size: 30MB
Event table records: 335k
CREATE TABLE `event` (
`EventID` int(11) NOT NULL,
`EventTime` datetime NOT NULL DEFAULT current_timestamp(),
`WEBURL` varchar(50) COLLATE utf8_bin DEFAULT NULL,
`IP` varchar(15) COLLATE utf8_bin NOT NULL,
`AttackID` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
ALTER TABLE `event`
ADD PRIMARY KEY (`EventID`),
ADD KEY `AttackID` (`AttackID`);
ALTER TABLE `event`
ADD CONSTRAINT `event_ibfk_1` FOREIGN KEY (`AttackID`) REFERENCES `attack` (`AttackID`);
Attack Table
attack table size: 32KB
attack Table records: 11
CREATE TABLE attack (
`AttackID` int(4) NOT NULL,
`AttackName` varchar(30) COLLATE utf8_bin NOT NULL,
`AttackDescription` varchar(70) COLLATE utf8_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
ALTER TABLE `attack`
ADD PRIMARY KEY (`AttackID`),
SLOW Query:
SELECT Count(EventID), IP
-> FROM event
-> WHERE AttackID >0
-> GROUP BY IP
-> ORDER BY Count(EventID) DESC
-> LIMIT 5;
RESULT: 5 rows in set (1.220 sec)
(This seems quite long for me, for a simple query)
QuerySlow
Now the Strange thing:
If I remove the foreign key relationship the performance of the query is the same.
But if I remove the the index on event.AttackID same select statement is much faster:
(ALTER TABLE `event` DROP INDEX `AttackID`;)
The result of the SQL SELECT query:
5 rows in set (0.242 sec)
QueryFast
From my understanding indexes on columns which are used in "WHERE" should improve the performance.
Why does removing the index have such an impact on the query?
What can I do to keep the relations between the table and have a faster
SELECT execution?
Cheers
Why does removing the index improve performance?
The query optimizer has multiple ways to resolve a query. For instance, two methods for filtering data are:
Look up the rows that match the where clause in the index and then fetch related data from the data pages.
Scan the index.
This doesn't get into the use of indexes for joins or aggregations or alternative algorithms.
Which is better? Under some circumstances, the first method is horribly slower than the second. This occurs when the data for the table does not fit into memory. Under such circumstances, the index can read a record from page 124 and then from 1068 and then from 124 again and -- well, all sorts of random intertwined reading of pages. Reading data pages in order is usually faster. And when the data doesn't fit into memory, thrashing occurs, which means that a page in memory is aged (overwritten) -- and then needed again.
I'm not saying that is occurring in your case. I am simply saying that what optimizers do is not always obvious. The optimizer has to make judgements based on the nature of the data -- and those judgements are not right 100% of the time. They are usually correct. But there are borderline cases. Sometimes, the issue is out-of-date statistics. Sometimes the issue is that what looks best to the optimizer is not best in practice.
Let me emphasize that optimizers usually do a very good job, and a better job than a person would do. Even if they occasionally come up with suboptimal plans, they are still quite useful.
Get rid of your redundant UNIQUE KEYs. A primary key is a unique key.
Use COUNT(*) rather than COUNT(IP) in your query. They mean the same thing because you declared IP to be NOT NULL.
Your query can be much faster if you stop saying WHERE AttackId>0. Because that column is a FK to the PK of your other table, those values should be nonzero anyway. But to get that speedup you'll need an index on event(IP) something like this.
CREATE INDEX IpDex ON event (IP)
But you're still summarizing a large table, and that will always take time.
It looks like you want to display some kind of leaderboard. You could add a top_ips table, and use an EVENT to populate it, using your query, every few minutes. Then you could display it to your users without incurring the cost of the query every time. This of course would display slightly stale data; only you know whether that's acceptable in your app.
Pro Tip. Read https://use-the-index-luke.com by Marcus Winand.
Essentially every part of your query, except for the FKey, conspires to make the query slow.
Your query is equivalent to
SELECT Count(*), IP
FROM event
WHERE AttackID >0
GROUP BY IP
ORDER BY Count(*) DESC
LIMIT 5;
Please use COUNT(*) unless you need to avoid NULL.
If AttackID is rarely >0, the optimal index is probably
ADD INDEX(AttackID, -- for filtering
IP) -- for covering
Else, the optimal index is probably
ADD INDEX(IP, -- to avoid sorting
AttackID) -- for covering
You could simply add both indexes and let the Optimizer decide. Meanwhile, get rid of these, if they exist:
DROP INDEX(AttackID)
DROP INDEX(IP)
because any uses of them are handled by the new indexes.
Furthermore, leaving the 1-column indexes around can confuse the Optimizer into using them instead of the covering index. (This seems to be a design flaw in at least some versions of MySQL/MariaDB.)
"Covering" means that the query can be performed entirely in the index's BTree. EXPLAIN will indicate it with "Using index". A "covering" index speeds up a query by 2x -- but there is a very wide variation on this prediction. ("Using index condition" is something different.)
More on index creation: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
What is good approach to handle 3b rec table where concurrent read/write is very frequent within few days?
Linux server, running MySQL v8.0.15.
I have this table that will log device data history. The table need to retain its data for one year, possibly two years. The growth rate is very high: 8,175,000 rec/day (1mo=245m rec, 1y=2.98b rec). In the case of device number growing, the table is expected to be able to handle it.
The table read is frequent within last few days, more than a week then this frequency drop significantly.
There are multi concurrent connection to read and write on this table, and the target to r/w is quite close to each other, therefore deadlock / table lock happens but has been taken care of (retry, small transaction size).
I am using daily partitioning now, since reading is hardly spanning >1 partition. However there will be too many partition to retain 1 year data. Create or drop partition is on schedule with cron.
CREATE TABLE `table1` (
`group_id` tinyint(4) NOT NULL,
`DeviceId` varchar(10) COLLATE utf8mb4_unicode_ci NOT NULL,
`DataTime` datetime NOT NULL,
`first_log` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`first_res` tinyint(1) NOT NULL DEFAULT '0',
`last_log` datetime DEFAULT NULL,
`last_res` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`group_id`,`DeviceId`,`DataTime`),
KEY `group_id` (`group_id`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
/*!50100 PARTITION BY RANGE (to_days(`DataTime`))
(
PARTITION p_20191124 VALUES LESS THAN (737753) ENGINE = InnoDB,
PARTITION p_20191125 VALUES LESS THAN (737754) ENGINE = InnoDB,
PARTITION p_20191126 VALUES LESS THAN (737755) ENGINE = InnoDB,
PARTITION p_20191127 VALUES LESS THAN (737756) ENGINE = InnoDB,
PARTITION p_future VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Insert are performed in size ~1500/batch:
INSERT INTO table1(group_id, DeviceId, DataTime, first_result)
VALUES(%s, %s, FROM_UNIXTIME(%s), %s)
ON DUPLICATE KEY UPDATE last_log=NOW(), last_res=values(first_result);
Select are mostly to get count by DataTime or DeviceId, targeting specific partition.
SELECT DataTime, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DataTime HAVING ct<50;
SELECT DeviceId, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DeviceId HAVING ct<50;
So the question:
Accord to RickJames blog, it is not a good idea to have >50 partitions in a table, but if partition is put monthly, there are 245m rec in one partition. What is the best partition range in use here? Does RJ's blog still taken place with current mysql version?
Is it a good idea to leave the table not partitioned? (the index is running well atm)
note: I have read this stack question, having multiple table is a pain, therefore if it is not necessary i wish not to break the table. Also, sharding is currently not possible.
First of all, INSERTing 100 records/second is a potential bottleneck. I hope you are using SSDs. Let me see SHOW CREATE TABLE. Explain how the data is arriving (in bulk, one at a time, from multiple sources, etc) because we need to discuss batching the input rows, even if you have SSDs.
Retention for 1 or 2 years? Yes, PARTITIONing will help, but only with the deleting via DROP PARTITION. Use monthly partitions and use PARTITION BY RANGE(TO_DAYS(DataTime)). (See my blog which you have already found.)
What is the average length of DeviceID? Normally I would not even mention normalizing a VARCHAR(10), but with billions of rows, it is probably worth it.
The PRIMARY KEY you have implies that a device will not provide two values in less than one second?
What do "first" and "last" mean in the column names?
In older versions of MySQL, the number of partitions had impact on performance, hence the recommendation of 50. 8.0's Data Dictionary may have a favorable impact on that, but I have not experimented yet to see if the 50 should be raised.
The size of a partition has very little impact on anything.
In order to judge the indexes, let's see the queries.
Sharding is not possible? Do too many queries need to fetch multiple devices at the same time?
Do you have Summary tables? That is a major way for Data Warehousing to avoid performance problems. (See my blogs on that.) And, if you do some sort of "staging" of the input, the summary tables can be augmented before touching the Fact table. At that point, the Fact table is only an archive; no regular SELECTs need to touch it? (Again, let's see the main queries.)
One table per day (or whatever unit) is a big no-no.
Ingestion via IODKU
For the batch insert via IODKU, consider this:
collect the 1500 rows in a temp table, preferably with a single, 1500-row, INSERT.
massage that data if needed
do one IODKU..SELECT:
INSERT INTO table1(group_id, DeviceId, DataTime, first_result)
ON DUPLICATE KEY UPDATE
last_log=NOW(), last_res=values(first_result)
SELECT group_id, DeviceId, DataTime, first_result
FROM tmp_table;
If necessary, the SELECT can do some de-dupping, etc.
This approach is likely to be significantly faster than 1500 separate IODKUs.
DeviceID
If the DeviceID is alway 10 characters and limited to English letters and digits, then make it
CHAR(10) CHARACTER SET ascii
Then pick between COLLATION ascii_general_ci and COLLATION ascii_bin, depending on whether you allow case folding or not.
Just for your reference:
I have a large table right now over 30B rows, grows 11M rows daily.
The table is innodb table and is not partitioned.
Data over 7 years is archived to file and purged from the table.
So if your performance is acceptable, partition is not necessary.
From management perspective, it is easier to manage the table with partitions, you might partition the data by week. It will 52 - 104 partitions if you keep last or 2 years data online
I am facing serious performance issue in inserting, selecting and updating rows to a table in mysql.
The table structure I am using is
CREATE TABLE `sessions` (
`sessionid` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`expiry` datetime NOT NULL,
`value` text NOT NULL,
`data` text,
PRIMARY KEY (`sessionid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Sessions';
The queries for which I face issue are :
INSERT INTO sessions (SESSIONID, EXPIRY, DATA, VALUE) VALUES ('b8c10810c505ba170dd9403072b310ed', '2019-05-01 17:25:50', 'PFJlc3BvbnNlIHhtbG5zPSJ1cm46b2FzaXM6bmFtZXM', '7bKDofc/pyFSQhm7QE5jb6951Ahg6Sk8OCVZI7AcbUPb4jZpHdrCAKuCPupJO14DNY3jULxKppLadGlpsKBifiJavZ/');
UPDATE sessions SET EXPIRY = '2019-05-01 17:26:07' WHERE (SESSIONID = 'e99a0889437448091a06a43a44d0f170');
SELECT SESSIONID, EXPIRY, DATA, VALUE FROM sessions WHERE (SESSIONID = '507a752c48fc9cc3043a3dbe889c52eb');
I tried explaining the query but was not able to infer much about optimizing the table/query.
From the slow query report the time taken
for select in average is 23.45, for update it is 15.93 and for insert it is
22.31.
Any help in identifying the issue is much appreciated.
How many queries per second?
How big is the table?
How much RAM?
What is the value of innodb_buffer_pool_size?
UUIDs are terrible for performance. (Is that a SHA1?) This is because they are so random that the 'next' query (any of those you mentioned) is likely not to be in cache, hence necessitating a disk hit.
So, with a table that is much larger than the buffer_pool, you won't be able to sustain more than about 100 queries per second with a spinning drive. SSD would be faster.
More on the evils of UUIDs (SHA1 has the same unfortunate properties, but no solution like the one for uuids): http://mysql.rjweb.org/doc.php/uuid
One minor thing you can do is to shrink the table:
session_id BINARY(20)
and use UNHEX() when inserting/updating/deleting and HEX() when selecting.
More
51KB avg row len --> The TEXT columns are big, and "off-record", hence multiple blocks needed to work with a row.
0.8GB buffer_pool, but 20GB of data, and 'random' PRIMARY KEY --> The cache is virtually useless.
These mean that there will be multiple disk hits to for each query, but probably under 10.
300ms (a fast time) --> about 30 disk hits on HDD (more on SSD; which do you have?).
So, I must guess that 20s for a query happened when there was a burst of activity that had the queries stumbling over each other, leading to lots of I/O contention.
What to do? Most of the data looks like hex. If that is true, you could cut the disk footprint in half (and cut back some on disk hits needed) by packing and using BINARY(..) or BLOB.
INSERT INTO sessions (SESSIONID, EXPIRY, DATA, VALUE)
VALUES (UNHEX('b8c10810c505ba170dd9403072b310ed'),
'2019-05-01 17:25:50',
UNHEX('PFJlc3BvbnNlIHhtbG5zPSJ1cm46b2FzaXM6bmFtZXM'),
UNHEX('7bKDofc/pyFSQhm7QE5jb6951Ahg6Sk8OCVZI7AcbUPb4jZpHdrCAKuCPupJO14DNY3jULxKppLadGlpsKBifiJavZ/'));
UPDATE sessions SET EXPIRY = '2019-05-01 17:26:07'
WHERE SESSIONID = UNHEX('e99a0889437448091a06a43a44d0f170');
SELECT SESSIONID, EXPIRY, DATA, VALUE FROM sessions
WHERE SESSIONID = UNHEX('507a752c48fc9cc3043a3dbe889c52eb');
and
`sessionid` VARBINARY(20) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`expiry` datetime NOT NULL,
`value` BLOB NOT NULL,
`data` BLOB,
And ROW_FORMAT=DYNAMIC might be optimal (but this is not critical).
Your queries looks good, but problem is with your server, it may not be having enough memory to handle such request, you can increase memory of your database server to to get optimised response
I have a table like this:
create table test (
id int primary key auto_increment,
idcard varchar(30),
name varchar(30),
custom_value varchar(50),
index i1(idcard)
)
I insert 30,000,000 rows to the table
and then I execute:
select * from test where idcard='?'
The statement cost 12 seconds to return
when I use iostat to monitor disk
the read speed is about 6 mb/s while the util is 94%
is any way to optimize it?
12 seconds may be realistic.
Assumptions about the question:
A total of 30M rows, but only 3000 rows in the resultset.
Not enough room to cache things in RAM or you are running from a cold start.
InnoDB or MyISAM (the analysis is the same; the details are radically different).
Any CHARACTER SET and COLLATION for idcard.
INDEX(idcard) exists and is used in the query.
HDD disk drive, not SSD.
Here's a breakdown of the processing:
Go to the index, find the first entry with ?, scan forward until hitting an entry that is not ? (about 3K rows later).
For each of those 3K items, reach into the table to find all the columns (cf SELECT *.
Deliver them.
Step 1: Fast.
Step 2: This is (based on the assumption of not being cached) costly. It may involve about 3K disk hits. For an HDD, that would be about 30 seconds. So, 12 seconds could imply some of the stuff was cached or happened to be near each other.
Step 3: This is a network cost, which I am not considering.
Run the query a second time. It may take only 1 second the this time -- because all 3K blocks are cached in RAM! And iostat will show zero activity!
is any way to optimize it?
Well...
You already have the best index.
What are you going to do with 3000 rows all at once? Is this a one-time task?
When using InnoDB, innodb_buffer_pool_size should be about 70% of available RAM, but not so big that it leads to swapping. What is its setting, and how much RAM do you have and what else is running on the machine?
Could you do more of the task while you are fetching the 3K rows?
Switching to SSDs would help, but I don't like hardware bandaids; they are not reusable.
How big is the table (in GB) -- perhaps 3GB data plus index? (SHOW TABLE STATUS.) If you can't make the buffer_pool big enough for it, and you have a variety of queries that compete for different parts of this (and other) tables, then more RAM may be beneficial.
Seems more like an I/O limitation than something that could be solved by adding indices. What will improve the speed is change the collation of the idcard column to latin1_bin. This uses only 1 byte per character. It also uses binary comparison which is faster than case insensitive comparison.
Only do this if you have no special characters in the idcard column, because the character set of latin1 is quite limited.
ALTER TABLE `test` CHANGE COLUMN `idcard` `idcard` VARCHAR(30) COLLATE 'latin1_bin' AFTER `id`;
Furthermore the ROW_FORMAT=FIXED also improves the speed. ROW_FORMAT=FIXED is not available using the InnoDB engine, but it is with MyISAM. The resulting table I now have is shown below. It's 5 times quicker (80% less time) with select statements than the initial table.
Note that I also changed the collation for 'name' and 'custom_value' to latin1_bin. This does make quite a difference in speed in my test setup, and I'm still figuring out why.
CREATE TABLE `test` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`idcard` VARCHAR(30) COLLATE 'latin1_bin',
`name` VARCHAR(30) COLLATE 'latin1_bin',
`custom_value` VARCHAR(50) COLLATE 'latin1_bin',
PRIMARY KEY (`id`),
INDEX `i1` (`idcard`)
)
ENGINE=MyISAM
ROW_FORMAT=FIXED ;
You may try adding the three other columns in the select clause to the index:
CREATE INDEX idx ON test (idcard, id, name, custom_value);
The three columns other than idcard are being added to allow the index to cover everything being selected. The problem with your current index is that it is only on idcard. This means that once MySQL has traversed down to each leaf node in the index, it would have to do another seek back to the clustered index to lookup the values of all columns mentioned in the select *. As a result of this, MySQL may choose to ignore the index completely. The suggestion I made above avoids this additional seek.