I have a database with the following stats
Tables | Data     | Index  | Total
11     | 579,6 MB | 0,9 GB | 1,5 GB
So as you can see the Index is close to 2x bigger. And there is one table with ~7 million rows that takes up at least 99% of this.
I also have two indexes that are very similar
a) UNIQUE KEY `idx_customer_invoice` (`customer_id`,`invoice_no`),
b) KEY `idx_customer_invoice_order` (`customer_id`,`invoice_no`,`order_no`)
Update: Here is the table definition (at least structurally) of the largest table
CREATE TABLE `invoices` (
`id` int(10) unsigned NOT NULL auto_increment,
`customer_id` int(10) unsigned NOT NULL,
`order_no` varchar(10) default NULL,
`invoice_no` varchar(20) default NULL,
`customer_no` varchar(20) default NULL,
`name` varchar(45) NOT NULL default '',
`archived` tinyint(4) default NULL,
`invoiced` tinyint(4) default NULL,
`time` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
`group` int(11) default NULL,
`customer_group` int(11) default NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_customer_invoice` (`customer_id`,`invoice_no`),
KEY `idx_time` (`time`),
KEY `idx_order` (`order_no`),
KEY `idx_customer_invoice_order` (`customer_id`,`invoice_no`,`order_no`)
) ENGINE=InnoDB AUTO_INCREMENT=9146048 DEFAULT CHARSET=latin1
Update 2:
mysql> show indexes from invoices;
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| invoices | 0 | PRIMARY | 1 | id | A | 7578066 | NULL | NULL | | BTREE | |
| invoices | 0 | idx_customer_invoice | 1 | customer_id | A | 17 | NULL | NULL | | BTREE | |
| invoices | 0 | idx_customer_invoice | 2 | invoice_no | A | 7578066 | NULL | NULL | YES | BTREE | |
| invoices | 1 | idx_time | 1 | time | A | 541290 | NULL | NULL | | BTREE | |
| invoices | 1 | idx_order | 1 | order_no | A | 6091 | NULL | NULL | YES | BTREE | |
| invoices | 1 | idx_customer_invoice_order | 1 | customer_id | A | 17 | NULL | NULL | | BTREE | |
| invoices | 1 | idx_customer_invoice_order | 2 | invoice_no | A | 7578066 | NULL | NULL | YES | BTREE | |
| invoices | 1 | idx_customer_invoice_order | 3 | order_no | A | 7578066 | NULL | NULL | YES | BTREE | |
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
My questions are:
Is there a way to find unused indexes in MySQL?
Are there any common mistakes that impact the size of the index?
Can indexA safely be removed?
How can you measure the size of each index? All I get is the total of all indexes.
You can remove index A, because, as you have noted, it is a subset of another index. And it's possible to do this without disrupting normal processing.
The size of the index files is not alarming in itself and it can easily be true that the net benefit is positive. In other words, the usefulness and value of an index shouldn't be discounted because it results in a large file.
Index design is a complex and subtle art involving a deep understanding of the query optimizer explanations and extensive testing. But one common mistake is to include too few fields in an index in order to make it smaller. Another is to test indexes with insufficient, or insufficiently representative data.
I may be wrong, but the first index (idx_customer_invoice) is UNIQUE and the second (idx_customer_invoice_order) is not, so you'll probably lose the uniqueness constraint on (customer_id, invoice_no) if you remove the first one. No?
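If the uniqueness of (customer_id, invoice_no) matters, one possible alternative is to keep the UNIQUE index and look at the wider, non-unique one instead. This is only a sketch: it assumes no query depends on order_no being readable from the index alone, and the literal values are placeholders, not taken from the question.

```sql
-- Placeholder values; substitute a lookup your application really runs.
-- If the plan shows key = idx_customer_invoice_order together with
-- "Using index" in Extra, the wider index is covering this query.
EXPLAIN SELECT order_no
FROM invoices
WHERE customer_id = 1 AND invoice_no = 'INV-0001';

-- Only if nothing relies on that covering behaviour:
ALTER TABLE invoices DROP INDEX idx_customer_invoice_order;
```

Dropping the non-unique index this way leaves the UNIQUE constraint in place.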
Is there a way to find unused indexes in MySQL?
The optimizer picks whichever index it estimates to be cheapest for each query. That estimate depends on the index statistics, so the chosen index can change after the statistics are refreshed. An index that looks unused today could suddenly become used once the data distribution changes.
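On MySQL 5.7 and later, the bundled sys schema gives a starting point for spotting candidates. This is a sketch, assuming the sys schema is installed and performance_schema is enabled; 'your_db_name' is a placeholder. The underlying counters start from zero at every server restart, so an index listed here has only been unused since then.

```sql
-- Indexes with no recorded reads since the server started
SELECT object_schema, object_name, index_name
FROM sys.schema_unused_indexes
WHERE object_schema = 'your_db_name';

-- Indexes whose columns are a leading part of another index
SELECT table_name, redundant_index_name, dominant_index_name
FROM sys.schema_redundant_indexes
WHERE table_schema = 'your_db_name';
```

Treat the output as a list of things to investigate over a full business cycle, not as a list of indexes to drop.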
Can indexA safely be removed?
I would say yes, if indexA and indexB are B-Tree indexes. This is because an index that starts with the same columns in the same order will have the same structure.
Use
show indexes from table;
to see which indexes a particular table has. The Cardinality column gives a rough idea of how selective, and therefore how useful, each index is.
You can remove your indexes safely (it will not break a table), but beware: some queries might execute slower. First you should analyze your queries to decide whether you need a certain index or not.
I don't think you can find out data length of a particular index, though.
BUT, you probably think that an index length twice as large as the data length is something abnormal... Well, you are wrong. All of your indexes might be useful ;) If a table holds a lot of information and you have to search it on a large number of columns, its indexes can easily be twice the size of the table's data.
Index A can be removed because index B includes it.
What affects your index length is the column type and the column length.
To get the total index length of your table, use:
select index_length from information_schema.tables
where table_name='your_table_name' and
table_schema='your_db_name';
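For a per-index breakdown on InnoDB (MySQL 5.6 or later), the persistent statistics table holds an estimated page count for each index. A sketch, assuming persistent statistics are enabled (the default in those versions); the schema name is a placeholder, and the numbers are estimates that ANALYZE TABLE refreshes.

```sql
SELECT index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024, 1) AS size_mb
FROM mysql.innodb_index_stats
WHERE database_name = 'your_db_name'
  AND table_name = 'invoices'
  AND stat_name = 'size'          -- number of pages in the index
ORDER BY size_mb DESC;
```

The PRIMARY row is the clustered index, which is the table data itself; the other rows are the secondary indexes.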
Related
I have a MySQL database with the following structure :
mysql> describe company;
+-------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+----------------+
| id | int | NO | PRI | NULL | auto_increment |
| name | varchar(50) | NO | | NULL | |
+-------+-------------+------+-----+---------+----------------+
mysql> describe nameserver;
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| id | int | NO | PRI | NULL | auto_increment |
| companyId | int | NO | MUL | NULL | |
| ns | varchar(250) | NO | MUL | NULL | |
+-----------+--------------+------+-----+---------+----------------+
mysql> describe domain;
+--------------+--------------+------+-----+-------------------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+-------------------+-------------------+
| id | int | NO | PRI | NULL | auto_increment |
| nameserverId | int | NO | MUL | NULL | |
| domain | varchar(250) | NO | MUL | NULL | |
| tld | varchar(20) | NO | MUL | NULL | |
| createDate | datetime | NO | | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
| updatedAt | datetime | YES | | NULL | |
| status | tinyint | NO | | NULL | |
| fileNo | smallint | NO | MUL | NULL | |
+--------------+--------------+------+-----+-------------------+-------------------+
The indexes structure :
-- Indexes for table `company`
--
ALTER TABLE `company`
ADD PRIMARY KEY (`id`);
--
-- Indexes for table `domain`
--
ALTER TABLE `domain`
ADD PRIMARY KEY (`id`),
ADD KEY `nameserver` (`nameserverId`),
ADD KEY `domain` (`domain`),
ADD KEY `tld` (`tld`),
ADD KEY `fileNo` (`fileNo`);
--
-- Indexes for table `nameserver`
--
ALTER TABLE `nameserver`
ADD PRIMARY KEY (`id`),
ADD KEY `company` (`companyId`),
ADD KEY `ns` (`ns`);
--
-- AUTO_INCREMENT for dumped tables
--
--
-- AUTO_INCREMENT for table `company`
--
ALTER TABLE `company`
MODIFY `id` int NOT NULL AUTO_INCREMENT;
--
-- AUTO_INCREMENT for table `domain`
--
ALTER TABLE `domain`
MODIFY `id` int NOT NULL AUTO_INCREMENT;
--
-- AUTO_INCREMENT for table `nameserver`
--
ALTER TABLE `nameserver`
MODIFY `id` int NOT NULL AUTO_INCREMENT;
--
-- Constraints for dumped tables
--
--
-- Constraints for table `domain`
--
ALTER TABLE `domain`
ADD CONSTRAINT `nameserver` FOREIGN KEY (`nameserverId`) REFERENCES `nameserver` (`id`);
--
-- Constraints for table `nameserver`
--
ALTER TABLE `nameserver`
ADD CONSTRAINT `company` FOREIGN KEY (`companyId`) REFERENCES `company` (`id`);
The amount of data is as follows:
domain table: about 500 million records
nameserver table: about 2 million records
Running this query takes about 4 hours to return the result:
SELECT distinct domain FROM domain
INNER join nameserver on nameserver.id = domain.nameserverId
WHERE nameserver.companyId = 2
The explain result for above query :
+----+-------------+------------+------------+------+-------------------+------------+---------+-----------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+------+-------------------+------------+---------+-----------------------+------+----------+------------------------------+
| 1 | SIMPLE | nameserver | NULL | ref | PRIMARY,company | company | 4 | const | 1738 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | domain | NULL | ref | nameserver,domain | nameserver | 4 | tldzone.nameserver.id | 716 | 100.00 | NULL |
+----+-------------+------------+------------+------+-------------------+------------+---------+-----------------------+------+----------+------------------------------+
My question is: how can I speed up this query?
It is possible for me to change the DB structure or even replace it with another DBMS.
MySQL is running on a VPS with 8.0 GB RAM and dual core CPU.
nameserver: INDEX(companyId, id) -- in this order (you have this)
domain: INDEX(nameserverId, domain) -- in this order
("MUL" does not tell me whether you already have either of these composite indexes. SHOW CREATE TABLE is more descriptive than DESCRIBE.)
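A sketch of adding the second index, assuming it does not already exist; the index name nameserver_domain is made up here. On a table of this size the build will take a long time and need extra disk space, so it is worth trying on a copy first.

```sql
ALTER TABLE domain ADD INDEX nameserver_domain (nameserverId, domain);

-- Re-check the plan afterwards
EXPLAIN
SELECT DISTINCT domain.domain
FROM domain
INNER JOIN nameserver ON nameserver.id = domain.nameserverId
WHERE nameserver.companyId = 2;
```

If the optimizer picks the new index, the domain row of the plan should show "Using index", meaning the domain values are read from the index without visiting the table rows.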
1. Add indexes to the relevant columns: Adding indexes to the companyId, nameserverId, and domain columns in the nameserver and domain tables can help to speed up the query by allowing the database to quickly locate the relevant rows.
2. Use a covering index: A covering index is an index that includes all the columns that are used in the query. By creating a covering index on the companyId, nameserverId, and domain columns, you can avoid the need for the database to look up the data in the actual tables, which can improve query performance.
3. Use a column-store index: A column-store index is an index that stores data by column rather than by row. Column-store indexes can be more efficient for querying large datasets and can improve the performance of the query you provided. Note that stock MySQL with InnoDB does not offer column-store indexes, so this would mean a different storage engine or DBMS.
4. Use a database management system that is optimized for large datasets: If you are using a database management system that is not well-suited to handling large datasets, you may see improved performance by switching to a different system. Some options to consider include column-oriented database management systems such as Vertica or ClickHouse, or distributed database management systems such as Cassandra or HBase.
5. Consider using a distributed database: If you have a very large dataset and are still experiencing slow query performance, you may want to consider using a distributed database management system, which allows you to spread your data across multiple servers and can improve the scalability and performance of your database.
Keep in mind that the solutions that work best for you will depend on the specific requirements of your database and the workload you are placing on it. It may be helpful to perform some benchmarking and testing to determine which approaches work best for your needs.
I'm faced with a MySQL database which contains an events table with ~70 million rows which has foreign keys to other tables and is used to generate reports. Constructing a performant query to select (while counting/summing values) and grouping data per day from this table is proving challenging.
The database structure is as follows:
CREATE TABLE `client` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_client_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8mb3
CREATE TABLE `class` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`client_id` int DEFAULT NULL,
`duration` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_client_id_idx` (`client_id`),
CONSTRAINT `fk_client_id` FOREIGN KEY (`client_id`) REFERENCES `client` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2606 DEFAULT CHARSET=utf8mb3
CREATE TABLE `event` (
`id` int NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`class_id` int DEFAULT NULL,
`venue_id` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `fk_class_id_idx` (`class_id`),
KEY `fk_venue_id_idx` (`venue_id`),
KEY `idx_1` (`venue_id`,`class_id`,`start_time`),
CONSTRAINT `fk_class_id` FOREIGN KEY (`class_id`) REFERENCES `class` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `fk_venue_id` FOREIGN KEY (`venue_id`) REFERENCES `venue` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64093231 DEFAULT CHARSET=utf8mb3
CREATE TABLE `venue` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_venue_id_name` (`id`,`name`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8mb3
The following query runs fine on an events table with a few thousand rows and demonstrates the desired outcome:
SELECT
CAST(event.start_time as date) as day,
class.name,
client.name,
venue.name,
COUNT(class.name) AS occurrences,
SUM(class.duration) AS duration
FROM
class,
client,
event,
venue
WHERE
event.venue_id = venue.id
AND event.class_id = class.id
AND class.client_id = client.id
GROUP BY day, class.name, client.name, venue.name
The database isn't indexed, and although I've tried adding indexes such as alter table events add index idx_test (venue_id, class_id, start_time); to improve performance, the query is still incredibly slow (I tend to abort it once it passes the 10 minute mark, so I don't know for sure how long it would take to complete).
I figured this was a good use case for a summary table (as suggested by Rick James' guide) so that I could hold a separate set of summarized data broken down into day with occurrences and total duration calculated/incremented with each addition to the table (IODKU). However I'm then also up against creating rows per day in a summary table based on what is considered a day in the database (UTC) which may not match with the application's "day" due to timezone offset.
Short of converting the start_time column to a timestamp type (which is then inconsistent with all other date types in the database) is there any way round this or is there any other optimization I could be making to the original events table resulting in a more responsive query? TIA
Update 23/05
Here's the buffer pool size:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
I've also made a bit of progress with indexing, modifying the query and creating a summary table.
I tried various ordering of columns to test indexes and found idx_event_venueid_classid_starttime (below), to be the most efficient for the event table:
SHOW INDEXES FROM EVENT;
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| event | 0 | PRIMARY | 1 | id | A | 62142912 | NULL | NULL | | BTREE | | | YES | NULL |
| event | 1 | fk_class_id_idx | 1 | class_id | A | 51286 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | fk_venue_id_idx | 1 | venue_id | A | 16275 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 1 | venue_id | A | 13378 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 2 | class_id | A | 81331 | NULL | NULL | YES | BTREE | | | YES | NULL |
| event | 1 | idx_event_venueid_classid_starttime | 3 | start_time | A | 63909472 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-------+------------+-------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
Here's my modified version of the query. It uses JOIN syntax, converts from UTC to the timezone required for reporting with CONVERT_TZ, and then groups by the resulting date (discarding the time portion):
SELECT
DATE(CONVERT_TZ(event.start_time,
'UTC',
'Europe/London')) AS tz_date,
class.name,
client.name,
venue.name,
COUNT(class.id) AS occurrences,
SUM(class.duration) AS duration
FROM
event
JOIN
class ON class.id = event.class_id
JOIN
venue ON venue.id = event.venue_id
JOIN
client ON client.id = class.client_id
GROUP BY tz_date, class.name, client.name, venue.name;
And here's the output of explain for that query:
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
| 1 | SIMPLE | venue | NULL | index | PRIMARY,idx_venue_id_name | idx_venue_id_name | 772 | NULL | 28 | 100.00 | Using index; Using temporary |
| 1 | SIMPLE | event | NULL | ref | fk_class_id_idx,fk_venue_id_idx,idx_event_venueid_classid_starttime | idx_event_venueid_classid_starttime | 5 | example.venue.id | 4777 | 100.00 | Using where; Using index |
| 1 | SIMPLE | class | NULL | eq_ref | PRIMARY,fk_client_id_idx | PRIMARY | 4 | example.event.class_id | 1 | 100.00 | Using where |
| 1 | SIMPLE | client | NULL | eq_ref | PRIMARY,idx_client_id_name | PRIMARY | 4 | example.class.client_id | 1 | 100.00 | NULL |
+----+-------------+--------+------------+--------+---------------------------------------------------------------------+-------------------------------------+---------+-------------------------+------+----------+------------------------------+
The query takes ~1m 20s to run now so I figured I could prepend that with an insert into to populate a summary table with the dates being timezone specific and run that on a nightly basis. Summary table structure:
CREATE TABLE `summary` (
`tz_date` date NOT NULL,
`class` varchar(255) NOT NULL,
`client` varchar(255) NOT NULL,
`venue` varchar(255) NOT NULL,
`occurrences` int NOT NULL,
`duration` int NOT NULL,
PRIMARY KEY (`tz_date`,`class`,`client`,`venue`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
From the original ~60m+ rows in the event table, the aggregated summary table is populated with ~66k rows.
To then generate the reports from the summary table it takes a fraction of a second (shown below with data snipped):
SELECT * FROM SUMMARY;
66989 rows in set (0.03 sec)
I haven't looked into the impact of inserting into event while the query to populate the summary table is running - is using InnoDB likely to slow that down?
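For the nightly run, a rough sketch of the IODKU variant that only recomputes recent reporting days could look like this. Assumptions, none of them from the post: MySQL 8.0, a two-day look-back, time zone tables loaded so CONVERT_TZ accepts named zones, and NULL names stored as empty strings because the summary columns are NOT NULL.

```sql
-- UTC instant at which the Europe/London day of two days ago began.
-- The two-day look-back is an arbitrary choice for this sketch.
SET @from_utc = CONVERT_TZ(
    CAST(DATE(CONVERT_TZ(UTC_TIMESTAMP(), 'UTC', 'Europe/London')) - INTERVAL 2 DAY AS DATETIME),
    'Europe/London', 'UTC');

INSERT INTO summary (tz_date, class, client, venue, occurrences, duration)
SELECT s.tz_date, s.class_name, s.client_name, s.venue_name, s.occ, s.dur
FROM (
    SELECT
        DATE(CONVERT_TZ(event.start_time, 'UTC', 'Europe/London')) AS tz_date,
        COALESCE(class.name, '')  AS class_name,
        COALESCE(client.name, '') AS client_name,
        COALESCE(venue.name, '')  AS venue_name,
        COUNT(class.id)                  AS occ,
        COALESCE(SUM(class.duration), 0) AS dur
    FROM event
    JOIN class  ON class.id  = event.class_id
    JOIN venue  ON venue.id  = event.venue_id
    JOIN client ON client.id = class.client_id
    WHERE event.start_time >= @from_utc
    GROUP BY tz_date, class_name, client_name, venue_name
) AS s
-- Whole days are recomputed, so existing rows are overwritten, not incremented.
ON DUPLICATE KEY UPDATE
    occurrences = s.occ,
    duration    = s.dur;
```

Because the cut-off is aligned to a London midnight, every day it touches is recomputed in full. Two caveats: groups that disappear entirely (events deleted or moved) leave stale summary rows behind, and without an index that leads with start_time the WHERE clause still scans the event index.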
No further indexes are likely to help. The query needs to scan the whole event table, reaching into the other tables to get the names.
Some things for us to look at:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
EXPLAIN SELECT ...
How much RAM do you have?
Do the aggregates (COUNT and SUM) look correct? In some situations involving JOIN, they can be over-inflated.
Please use the newer JOIN ... ON syntax. (Won't change performance.)
As you observed, a Summary Table may help -- but only if the older data is not being modified. Please provide the SHOW CREATE TABLE and query for it.
Yes, timezone vs "definition of day" is a thorny issue. Notice how StackOverflow defines day based on UTC.
How many new rows are there per day? Are they spread out somewhat evenly throughout the day? If the average number of rows per hour is at least 20, then the Summary Table could be based on half-hour intervals. (I picked that because of India time vs most of the rest of the world.) The 20 comes from a Rule of Thumb that says that a summary table should have one-tenth as many rows as the Fact table.
Yes, TIMESTAMP instead of DATETIME may be a workaround.
Since you are talking about moderately large tables, consider whether to change INT NULL to SMALLINT UNSIGNED NOT NULL or some other sized integer.
(As for the cliff in 2038, ask yourself how many databases have been active on the same hardware and software since 2006. That may give some perspective on whether your design must survive 16 years.)
I have the following table (it has more data columns, removed them because it would be a long post):
CREATE TABLE `members` (
`memberid` int(11) NOT NULL AUTO_INCREMENT,
`firstname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
`lastname` varchar(45) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`memberid`),
KEY `members_lname_ix` (`lastname`)
) ENGINE=InnoDB AUTO_INCREMENT=1019 DEFAULT CHARSET=utf8
COLLATE=utf8_unicode_ci;
By default, a user only ever accesses 10-20 rows from this table at a time, and the result is usually sorted by the lastname column; it's all paginated server side. So I decided to add an index on lastname to help with sorting, but the index does not seem to work the way I expected. When I run EXPLAIN SELECT * FROM members ORDER BY lastname ASC I get:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra
1 | simple | members | ALL | null | null | null | null | 711 | using filesort
I can at least confirm the index exists because if I run SHOW INDEX FROM members I get:
Table | Non_Unique | Key_name | Seq_in_ix | Col_name | Collation | Cardinality | Sub part | Packed | Null | Ix type
members | 0 | PRIMARY | 1 | memberid | A | 711 | null | null | (blank) | BTREE
members | 1 | members_lname_ix | 1 | lastname | A | 711 | null | null | YES | BTREE
If I add USE INDEX (members_lname_ix), both possible_keys and key remain null. However, if I add FORCE INDEX (members_lname_ix), possible_keys remains null and key shows members_lname_ix. This is my first time trying to apply indexing, and this doesn't seem very intuitive to me: it feels like MySQL should know that I created an index for lastname, no? I can't quite figure out what I'm doing wrong here, unless I am misunderstanding something. Is the solution to just keep using FORCE INDEX?
There are two ways to perform that query:
Plan A (as you were expecting):
Scan through the index sequentially, reading the entire (estimated) 711 rows.
Randomly look up each row in the data BTree. This involves reading the entire dataset.
Deliver the data in order.
Plan B (what it does):
Scan through the data, reading all 711 rows.
Sort the data
Deliver the sorted data.
Plan B does not touch the index at all; this was deemed to be a bigger savings than not having to sort the data.
In a table as tiny as yours, it would be hard to see a difference in speed. (In my test case, it took under 10 milliseconds either way.) In huge tables, the difference could be significant.
For optimal pagination, see http://mysql.rjweb.org/doc.php/pagination
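Since the real queries are paginated, it is worth running EXPLAIN with a LIMIT, which is where the index typically becomes attractive to the optimizer. A sketch, assuming pages of 20 rows; 'Smith' and 123 stand for the lastname and memberid of the last row on the previous page and are placeholders.

```sql
-- First page: with a LIMIT the optimizer has a reason to walk the index in order.
EXPLAIN SELECT * FROM members ORDER BY lastname ASC, memberid ASC LIMIT 20;

-- Next page: continue after the last row already shown, instead of using OFFSET.
SELECT *
FROM members
WHERE lastname > 'Smith'
   OR (lastname = 'Smith' AND memberid > 123)
ORDER BY lastname ASC, memberid ASC
LIMIT 20;
```

Adding memberid to the ORDER BY makes the order deterministic when several members share a lastname. Rows with a NULL lastname never satisfy the WHERE comparison, so they would need separate handling.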
This is my first time with big MySQL tables, and I have a couple of questions about the speed of a search.
I have a MySQL table with 100 million entries. The table currently looks like this:
+-----------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| Accession | char(10) | NO | PRI | NULL | |
| DB | char(6) | NO | | NULL | |
| Organism | varchar(255) | NO | | NULL | |
| Gene | varchar(255) | NO | | NULL | |
| Name | varchar(255) | NO | | NULL | |
| Header | text | NO | | NULL | |
| Sequence | text | NO | | NULL | |
+-----------+--------------+------+-----+---------+-------+
with indexes like this:
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| uniprot | 0 | PRIMARY | 1 | Accession | A | 94275840 | NULL | NULL | | BTREE | | |
| uniprot | 1 | main_index | 1 | Accession | A | 94275840 | NULL | NULL | | BTREE | | |
| uniprot | 1 | main_index | 2 | DB | A | 94275840 | NULL | NULL | | BTREE | | |
| uniprot | 1 | main_index | 3 | Organism | A | 94275840 | 191 | NULL | | BTREE | | |
| uniprot | 1 | main_index | 4 | Gene | A | 94275840 | 191 | NULL | | BTREE | | |
| uniprot | 1 | main_index | 5 | Name | A | 94275840 | 191 | NULL | | BTREE | | |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
My question is about the efficiency of this. The searches I use are very simple, but I need the answer really fast.
80% of the time I use Accession as the query and I want the sequence back.
select sequence from uniprot where accession="q32p44";
...
1 row in set (0.06 sec)
10% of the time I search for a "Gene" and 10% of the time I search for an Organism.
The table is unique for "Accession".
My questions are:
Can I make this table more efficient (search time wise) in any way?
Is the indexing good?
Would I speed up the search by making a multi-column primary key like (Accession, Gene, Organism)?
Thanks a lot!
EDIT1:
As requested in the comments:
mysql> show create table uniprot;
CREATE TABLE `uniprot` (
`Accession` char(10) NOT NULL,
`DB` char(6) NOT NULL,
`Organism` varchar(255) NOT NULL,
`Gene` varchar(255) NOT NULL,
`Name` varchar(255) NOT NULL,
`Header` text NOT NULL,
`Sequence` text NOT NULL,
PRIMARY KEY (`Accession`),
KEY `main_index` (`Accession`,`DB`,`Organism`(191),`Gene`(191),`Name`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
Don't use "prefix" indexing, it almost never does as well as you might expect.
CHAR(10) with utf8mb4 means that you are taking 40 bytes always. accession="q32p44" implies VARCHAR and ascii would be better. With those changes, I would not bother switching to a 'surrogate' key. Consider the same issue for DB.
With PRIMARY KEY(Accession) and InnoDB, there is no advantage in having KEY main_index (Accession, ...). Drop that KEY.
What is Sequence? If it is a text string with only 4 different letters, then it should be highly compressible. And, with 100M rows, shrinking the disk footprint could lead to a noticeable speedup. I would COMPRESS it in the client and store it into a BLOB.
Do you really need 255 in varchar(255)? Please shrink to something 'reasonable' for the data. That way, we can reconsider what index(es) to add, without using prefixing.
select sequence from uniprot where accession="q32p44";
works very efficiently with PRIMARY KEY(accession)
select sequence from uniprot where accession="q32p44" AND gene = '...';
also works efficiently with that PK. It will find the one row for q32p44 and then simply check that gene matches; then deliver 0 or 1 row.
select sequence from uniprot where gene = '...';
would benefit from INDEX(gene). Similarly for Organism.
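Putting the last few points together, a minimal sketch could be the following; the index names are invented for illustration. It assumes a row format that accepts the full-column keys (DYNAMIC, the default since MySQL 5.7). With varchar(255) in utf8mb4 a key can need up to 1020 bytes, which is more than the 767-byte limit of the older row formats, so on those the columns would have to be shrunk first.

```sql
ALTER TABLE uniprot
    DROP INDEX main_index,
    ADD INDEX idx_gene (Gene),
    ADD INDEX idx_organism (Organism);
```

On 100M rows this will run for a long time, so rehearse it on a copy before touching production.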
How big is the table (in GB)? What is the value of innodb_buffer_pool_size? How much RAM do you have? If the table is a lot bigger than the buffer pool, a random "point query" (WHERE accession = constant) will typically take one disk hit. To discuss other queries, please show us the SELECT.
Edit
With 100M rows, shrinking the disk footprint is important for performance. There are multiple ways to do it. I want to focus on (1) Shrink the size of each column; (2) Avoid implicit overhead in indexes.
Each secondary key implicitly includes the PRIMARY KEY. So, if there are 3 indexes, there are 3 copies of the PK. That means that the size of the PK is especially important.
I'm recommending something like
CREATE TABLE `uniprot` (
`Accession` VARCHAR(10) CHARACTER SET ascii NOT NULL,
`DB` VARCHAR(6) NOT NULL,
`Organism` varchar(100) NOT NULL,
`Gene` varchar(100) NOT NULL,
`Name` varchar(100) NOT NULL,
`Header` text NOT NULL,
`Sequence` text NOT NULL,
PRIMARY KEY (`Accession`),
INDEX(Gene), -- implicitly (Gene, Accession)
INDEX(Organism) -- implicitly (Organism, Accession)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
And your main queries are
SELECT Sequence FROM uniprot WHERE Accession = '...';
SELECT Sequence FROM uniprot WHERE Gene = '...';
SELECT Sequence FROM uniprot WHERE Organism = '...';
If Accession is really variable length, shorter than 10 characters, and ascii, then what I suggest brings the size down from 40 bytes * 3 occurrences * 100M rows = 12GB, just for the copies of Accession, to perhaps 2GB. I think the savings of 10GB is worth it. Going to BIGINT would also be about 2GB (no further savings); going to INT would be about 1GB (more savings, but not much).
Shrinking Gene and Organism to 'reasonable' sizes (if practical) avoids the need for using prefixing, hence allowing the index to work better. But, you can argue that maybe prefixing will work "well enough" in INDEX(Gene(11)). Let's get some numbers to make the argument one way or another. What is the average length of Gene (and Organism)? How many initial characters in Gene are usually sufficient to identify a Gene?
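Those numbers can be pulled from the data itself. A sketch for Gene, using the 11-character prefix from the example above (swap in Organism, or another prefix length, as needed):

```sql
SELECT AVG(CHAR_LENGTH(Gene)) AS avg_len,
       MAX(CHAR_LENGTH(Gene)) AS max_len,
       -- share of distinct Gene values that stay distinct after truncation
       COUNT(DISTINCT LEFT(Gene, 11)) / COUNT(DISTINCT Gene) AS prefix11_ratio
FROM uniprot;
```

A prefix11_ratio close to 1 would suggest that an 11-character prefix identifies a Gene almost as well as the full value. This is a full scan over 100M rows, so it is better run on a replica or on a copy of a subset.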
Another space question is whether there are a lot of duplicates in Gene and/or Organism. If so, then "normalizing" those fields would be warranted. Ditto for Name, Header, and Sequence.
The need for a JOIN (or two) if you make surrogates for Accession and/or Gene is only a slight bit of overhead, not enough to worry about.
First off, as mentioned in the comments I wouldn't use a natural key (Accession), I would opt for a surrogate key (Id), however with 100M rows, that would be a painful alter during which the table will be locked.
With that being said, Accession is already indexed b/c it's a Primary Key so for simple queries, you can't optimize further:
select sequence from uniprot where accession="q32p44";
If doing look-ups against other columns then your best bet is to add separate indices for each column:
ALTER TABLE uniprot ADD INDEX (Gene(10)), ADD KEY (Organism(10));
The goal is to index the uniqueness of the values (cardinality). So if you have a lot of values like somethingsomething1, somethingsomething2, somethingsomething3, it would be best to go with a prefix longer than the 18 shared characters, but no larger than, say, 30.
Per MySQL docs:
If names in the column usually differ in the first 10 characters, this index should not be much slower than an index created from the entire name column. Also, using column prefixes for indexes can make the index file much smaller, which could save a lot of disk space and might also speed up INSERT operations.
So the goal is to index the uniqueness (cardinality) but without inflating size on disk.
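A common way to pick the prefix length is to compare how many distinct values each candidate prefix keeps against the full column. The query below is a sketch of that idea for Gene; the prefix lengths 10, 20, and 30 are arbitrary candidates, and the aliases are illustrative.

```sql
-- Selectivity = distinct values / total rows; closer to the full-column value is better.
SELECT COUNT(DISTINCT Gene)           / COUNT(*) AS full_selectivity,
       COUNT(DISTINCT LEFT(Gene, 10)) / COUNT(*) AS prefix10_selectivity,
       COUNT(DISTINCT LEFT(Gene, 20)) / COUNT(*) AS prefix20_selectivity,
       COUNT(DISTINCT LEFT(Gene, 30)) / COUNT(*) AS prefix30_selectivity
  FROM uniprot;
```

The shortest prefix whose selectivity is close to full_selectivity is a reasonable starting point; the same query can be repeated for Organism.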
I would also remove that main_index index. I don't see the benefit, since you are not searching on all of those columns at the same time, and because of its length it will slow down your writes with little gain on reads.
Be sure to test before you run anything on production. Perhaps take a small sample (1-5% of the dataset) and prefix the queries you plan to run with EXPLAIN to see how MySQL will execute them.
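A minimal sketch of that test, assuming a scratch table named uniprot_sample (a hypothetical name) and a roughly 2% random sample:

```sql
-- Copy the table structure, including its indexes, into a scratch table.
CREATE TABLE uniprot_sample LIKE uniprot;

-- Keep roughly 2% of the rows; RAND() is evaluated once per row.
INSERT INTO uniprot_sample
SELECT * FROM uniprot WHERE RAND() < 0.02;

-- Add the candidate prefix indexes to the sample, not to production.
ALTER TABLE uniprot_sample ADD INDEX (Gene(10)), ADD INDEX (Organism(10));

-- Check which index, if any, MySQL picks for a planned query.
EXPLAIN SELECT Sequence FROM uniprot_sample WHERE Gene = '...';
```

Bear in mind that the optimizer's choices on a small sample can differ from those on the full table, so treat the plan as an indication rather than a guarantee.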
My MySQL database has over 350 million rows and is growing. It is 32GB in size right now. I am using SSDs and lots of RAM, but would like advice to make sure I am using appropriate indexes.
CREATE TABLE `qcollector` (
`key` bigint(20) NOT NULL AUTO_INCREMENT,
`instrument` char(4) DEFAULT NULL,
`datetime` datetime DEFAULT NULL,
`last` double DEFAULT NULL,
`lastsize` int(10) DEFAULT NULL,
`totvol` int(10) DEFAULT NULL,
`bid` double DEFAULT NULL,
`ask` double DEFAULT NULL,
PRIMARY KEY (`key`),
KEY `datetime_index` (`datetime`)
) ENGINE=InnoDB;
show index from qcollector;
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| qcollector | 0 | PRIMARY | 1 | key | A | 378866659 | NULL | NULL | | BTREE | | |
| qcollector | 1 | datetime_index | 1 | datetime | A | 63144443 | NULL | NULL | YES | BTREE | | |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.03 sec)
select * from qcollector order by datetime desc limit 1;
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
| key | instrument | datetime | last | lastsize | totvol | bid | ask |
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
| 389054487 | ES | 2012-06-29 15:14:59 | 1358.25 | 2 | 2484771 | 1358.25 | 1358.5 |
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
1 row in set (0.09 sec)
A typical query that is slow (full table scan, this query takes 3-4 minutes):
explain select date(datetime), count(lastsize) from qcollector where instrument = 'ES' and datetime > '2011-01-01' and time(datetime) between '15:16:00' and '15:29:00' group by date(datetime) order by date(datetime) desc;
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | qcollector | ALL | datetime_index | NULL | NULL | NULL | 378866659 | Using where; Using temporary; Using filesort |
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
A couple of ideas for you to consider:
A covering index (that is, an index that includes ALL of the columns referenced in the query) may help some. Such an index is going to require more disk (SSD?) space, but it will remove the need for MySQL to visit the data pages to look up the values of the columns that aren't in the index.
ON qcollector (datetime,instrument,lastsize)
or
ON qcollector (instrument,datetime,lastsize)
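Spelled out as a full statement, the second variant might look like the sketch below. The index name qcollector_ix_cover is hypothetical, and building an index on a table of this size will take a while.

```sql
-- Covering index: every column the slow query references is in the index.
ALTER TABLE qcollector
  ADD INDEX qcollector_ix_cover (instrument, `datetime`, lastsize);

-- Re-check the plan for the slow query afterwards.
EXPLAIN
SELECT DATE(`datetime`), COUNT(lastsize)
  FROM qcollector
 WHERE instrument = 'ES'
   AND `datetime` > '2011-01-01'
   AND TIME(`datetime`) BETWEEN '15:16:00' AND '15:29:00'
 GROUP BY DATE(`datetime`)
 ORDER BY DATE(`datetime`) DESC;
```

If the optimizer treats the index as covering, the Extra column of the EXPLAIN output should include "Using index".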
Do you really need to exclude rows that have a NULL value for lastsize from the count? Could you return a count of all rows instead? If you could instead return COUNT(1) or SUM(1), then the query wouldn't need to reference the lastsize column, so it wouldn't be needed in an index to make it a covering index.
The COUNT(lastsize) expression is equivalent to SUM(IF(lastsize IS NULL,0,1))
Do you need to return dates when there are only NULL lastsize values for the datetime range, or could all of the rows with a NULL lastsize be excluded? That is, could you include a predicate like
AND lastsize IS NOT NULL
in your query?
Those may help some.
I think the big problem is that the predicates on the TIME(datetime) expression are not sargable. That is, MySQL won't use an index range scan operation for those. The predicate on the bare datetime column is sargable... that's why the EXPLAIN is showing the datetime_index as a possible key.
And the other big problem is that the query is doing GROUP BY and ORDER BY operations on a derived expression, which is going to require MySQL to generate an intermediate result set (as a temporary MyISAM table), and then process that result set. And that can be a lot of heavy lifting when there are lots of rows to process.
As far as table changes, I would consider using separate DATE and TIME columns, and using a TIMESTAMP datatype in place of DATETIME (if you need to store the date and time together). I would rewrite the query to reference the bare DATE and bare TIME columns, and consider adding a covering index that includes all columns referenced in the rewritten query, with the leading columns being the ones with the highest cardinality (and the most selective predicates in the query).
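As a rough sketch of that change, assuming new columns named trade_date and trade_time and an index named qcollector_ix_split (all three names are hypothetical):

```sql
-- Add bare DATE and TIME columns and backfill them from the existing column.
-- On 350M+ rows both steps are expensive; consider batching the UPDATE.
ALTER TABLE qcollector
  ADD COLUMN trade_date DATE NULL,
  ADD COLUMN trade_time TIME NULL;

UPDATE qcollector
   SET trade_date = DATE(`datetime`),
       trade_time = TIME(`datetime`);

-- One possible covering index for the rewritten query; the column order is a
-- judgment call and should be checked with EXPLAIN against real data.
ALTER TABLE qcollector
  ADD INDEX qcollector_ix_split (instrument, trade_date, trade_time, lastsize);

-- Rewritten query: no functions wrapped around indexed columns.
-- trade_date >= '2011-01-01' matches the original datetime > '2011-01-01'
-- because the time window excludes midnight.
SELECT trade_date, COUNT(lastsize)
  FROM qcollector
 WHERE instrument = 'ES'
   AND trade_date >= '2011-01-01'
   AND trade_time BETWEEN '15:16:00' AND '15:29:00'
 GROUP BY trade_date
 ORDER BY trade_date DESC;
```

The application (or a trigger) would also have to populate the two new columns on insert, which this sketch leaves out.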
When you use date and time functions on a column the indexes cannot be used efficiently. You could also store the date and time in separate columns and index those, though this will take up more storage space.
You may also want to consider adding multi-column indexes. An index on (instrument, datetime) would probably help you here.