We have a table like this to save login tokens per user session. The table was not partitioned earlier, but we have now decided to partition it to improve performance, as it contains over a few million rows.
CREATE TABLE `tokens` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`uid` int(10) unsigned DEFAULT NULL,
`session` int(10) unsigned DEFAULT '0',
`token` varchar(128) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `usersession` (`uid`,`session`),
KEY `uid` (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 PARTITION BY HASH(id) PARTITIONS 101;
We plan to partition on 'id', as it is the column primarily used in SELECT queries, so pruning can be effective.
However, the problem is that we maintain a unique index on (uid, session), and partitioning requires the partitioning column to be part of every unique index. A unique index on (id, uid, session) doesn't make sense here (it will always be unique, since id is unique by itself).
Is there any way to get around this issue without manually checking the uniqueness of (uid, session)?
Don't use partitioning. It won't speed up this kind of table.
I have yet to see a case of BY HASH that speeds up a system.
It is almost never useful to partition on the PRIMARY KEY.
In general, don't have an AUTO_INCREMENT id when you have a perfectly good "natural" PK -- (uid, session). Or should it be (token)?
Don't have one index being the first part of another: (uid) is redundant, given (uid, session).
Consider using utf8mb4 if you expect to have Emoji or Chinese. On the other hand, if token is, say, base64, then make it ascii or something.
So, I think this will work significantly better (smaller, faster, etc):
CREATE TABLE `tokens` (
`uid` int(10) unsigned DEFAULT NULL,
`session` int(10) unsigned DEFAULT '0',
`token` VARBINARY(128) NOT NULL DEFAULT '',
PRIMARY KEY (token)
) ENGINE=InnoDB;
Which of these do you search by?
WHERE token = ...
WHERE uid = ... AND session = ...
One drawback is that I got rid of id; if id is needed by other tables, then a change is needed there.
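If you need both lookup patterns to be fast, a secondary unique key can sit alongside the token primary key. A minimal sketch (my variant, not necessarily what you need):

CREATE TABLE `tokens` (
  `uid` int(10) unsigned DEFAULT NULL,
  `session` int(10) unsigned DEFAULT '0',
  `token` VARBINARY(128) NOT NULL DEFAULT '',
  PRIMARY KEY (`token`),                       -- serves WHERE token = ...
  UNIQUE KEY `usersession` (`uid`, `session`)  -- serves WHERE uid = ... AND session = ...
) ENGINE=InnoDB;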
Presumably your unique (uid, session) key index enforces some business rule for you.
Do you rely on DBMS enforcement of that rule? Do you use INSERT .... ON DUPLICATE KEY UPDATE... statements, or use error handlers, or some such thing, to handle this uniqueness? Or is it there just for good measure?
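For example, an upsert like the following (illustrative values) works only because the DBMS enforces the (uid, session) unique key:

INSERT INTO tokens (uid, session, token)
VALUES (42, 3, 'abc123...')
ON DUPLICATE KEY UPDATE token = VALUES(token);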
If you rely on that unique index, partitioning this table on id will not work. Fugeddaboudit.
If you can delete that index, or delete its unique constraint, you may be able to proceed with partitioning. But partitioning isn't generally suitable for tables with multiple unique keys.
A 40M-row table is ordinarily not large enough to be a good candidate for partitioning. If you're having performance problems you should investigate improving your indexing instead.
Edit: If you have modern hardware (multi-terabyte storage, plenty of RAM) and well-chosen indexes, partitioning is (I believe) more trouble than it's worth. It's definitely a lot of trouble for tables with fewer than about 10**9 rows. When your auto-incrementing id values must be BIGINT rather than INT (because int.MaxValue isn't big enough), that's when partitioning starts to be worth considering.
It's most effective when all queries filter based on the partitioning key. Filtering on other criteria without the partitioning key is slow.
Pro tip: The old saying about regular expressions also applies to partitions. If you solve a problem with partitioning, now you have two problems.
I have a MySQL 8 database table accounts that has the following columns:
id (primary)
city_id (foreign key)
province_id (foreign key)
country_id (foreign key)
school_id (foreign key)
age (indexed)
EDIT: See bottom for complete table structure.
Now, imagine the following SQL query:
SELECT
COUNT(`id`) AS AGGREGATE
FROM
`accounts`
WHERE
`city_id` = 1
AND
`country_id` = 7
AND
`age` = 3
At 1 million records, this query becomes slow (~200ms).
When running EXPLAIN, I receive the following output:
id: 1
select_type: SIMPLE
table: accounts
partitions: NULL
type: index_merge
possible_keys: accounts_city_id_foreign, accounts_country_id_foreign, accounts_age_index
key: accounts_city_id_foreign, accounts_country_id_foreign, accounts_age_index
key_len: 9,2,9
ref: NULL
rows: 15542
filtered: 100.00
Extra: Using intersect(accounts_city_id_foreign, accounts_country_id_foreign, accounts_age_index); Using where; Using index
Given that MySQL appears to be using the indexes, I'm not sure what I can do to bring the execution time down. Does anyone have any ideas?
EDIT: In the future, the table will include more columns that will make it impossible to use a composite index as it will exceed the 16 column limit.
EDIT: Here's the complete table structure:
CREATE TABLE `accounts` (
`id` bigint unsigned NOT NULL AUTO_INCREMENT,
`city_id` bigint unsigned DEFAULT NULL,
`school_id` bigint unsigned DEFAULT NULL,
`country_id` bigint unsigned DEFAULT NULL,
`province_id` bigint unsigned DEFAULT NULL,
`age` tinyint unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `accounts_city_id_foreign` (`city_id`),
KEY `accounts_school_id_foreign` (`school_id`),
KEY `accounts_country_id_foreign` (`country_id`),
KEY `accounts_province_id_foreign` (`province_id`),
KEY `accounts_age_index` (`age`),
CONSTRAINT `accounts_city_id_foreign` FOREIGN KEY (`city_id`) REFERENCES `cities` (`id`) ON DELETE SET NULL,
CONSTRAINT `accounts_country_id_foreign` FOREIGN KEY (`country_id`) REFERENCES `countries` (`id`) ON DELETE SET NULL,
CONSTRAINT `accounts_province_id_foreign` FOREIGN KEY (`province_id`) REFERENCES `provinces` (`id`) ON DELETE SET NULL,
CONSTRAINT `accounts_school_id_foreign` FOREIGN KEY (`school_id`) REFERENCES `schools` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB AUTO_INCREMENT=1000002 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Try creating a composite index on all three columns, e.g. CREATE INDEX idx_city_country_age ON accounts (city_id, country_id, age)
Indexes are there to help your querying. So, as suggested by Marko and agreed by others, having an index on (city_id, country_id, age) should significantly help. Now, yes, you will add other columns to the table, but are you really trying to filter on 16+ criteria? I doubt it. And of the queries you would be running, even if you have multiple composite indexes to help optimize them, how many columns might you need at any single time? 4, 5, 6? Beyond that, how granular do you plan on getting with your data? Country, state/province, city, town, village, neighborhood, street, house? By the time you are that low in the data, you would be at the page level anyhow, wouldn't you?
So, your query of Country = 7, that already chops off a ton of stuff. Then to a given city within that country? Great, now you are at a finite level.
If you are going to be doing queries against large data sets that require aggregation, and the data is rather fixed from a historical perspective, then having pre-aggregated tables keyed by some common elements might help in the long term.
FEEDBACK
The performance hit is not necessarily in the querying; it is in the inserts, updates, and deletes, since whatever changes has to update every index on the table, single-column or composite. If you are getting more than 5 columns into an index, ask yourself: really? How granular do you need the index to be? Querying the data should be very fast with proper indexes. Updating indexes is also quick, but if you are dealing with millions of inserts in a month, quarter, or year, the user doing theirs may only see a slight delay (a quarter of a second?), yet a million of those small delays adds up. Then again, consider over what period of time that insert/update/delete activity actually happens.
You asked what will bring the query time down, and using a composite index will do that. Searching a single composite index is faster than searching several single-column indexes and performing an intersection merge on the results.
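As a sketch (the index name and the expected plan are my illustration, not output from your server), the change and its effect look like this:

ALTER TABLE accounts
  ADD INDEX accounts_city_country_age_index (city_id, country_id, age);

EXPLAIN
SELECT COUNT(`id`) AS AGGREGATE
FROM `accounts`
WHERE `city_id` = 1 AND `country_id` = 7 AND `age` = 3;
-- Expected: type=ref on the new index and "Using index" in Extra,
-- instead of the index_merge intersection shown in your EXPLAIN output.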
You commented that you will be adding more columns in the future, and there will eventually be more than 16 columns.
You don't have to add ALL the columns to the composite index!
Index design is not magic. It follows rules. You will create indexes designed to support specific queries that you need to run. You don't add columns to an index unless they help the given query. You may have multiple composite indexes in the table, created to help different queries.
You might like my presentation How to Design Indexes, Really (or the video).
Re your comment:
I won't know every possible query combination ahead of time.
Yes, that's true. You can only create indexes for queries that you know. Other queries will not be optimized. If you need to optimize queries in the future, you might need to add new indexes to support them.
In my experience, this happens regularly, and I address this in the presentation. You will review your queries from time to time, because of course your application code changes and the queries you need change. You may add new indexes, or replace an index with a different index, or drop indexes that are no longer needed.
Two tables:
CREATE TABLE `htmlcode_1` (
`global_id` int(11) NOT NULL,
`site_id` int(11) NOT NULL,
PRIMARY KEY (`global_id`),
KEY `k_site` (`site_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `htmlcode_2` (
`global_id` int(11) NOT NULL,
`site_id` int(11) NOT NULL,
PRIMARY KEY (`site_id`,`global_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
which one should be faster for selects and why?
'select * from table where site_id=%s'
The latter table is probably slightly faster for that SELECT query, assuming the table has a nontrivial number of rows.
When querying InnoDB by primary key, the lookup is against the clustered index for the table.
Secondary key lookups require a lookup in the index, then that reveals the primary key value, which is then used to do a lookup by primary key. So this uses two lookups.
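A quick way to see this for yourself (a sketch; the literal 123 is just an example value):

-- htmlcode_1: resolved through the secondary index k_site; in this two-column
-- table the index also carries global_id (the primary key), so it may be covering.
EXPLAIN SELECT * FROM htmlcode_1 WHERE site_id = 123;

-- htmlcode_2: rows are clustered by (site_id, global_id), so all rows for a
-- site are read directly from the primary key in a single range scan.
EXPLAIN SELECT * FROM htmlcode_2 WHERE site_id = 123;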
The reason to use a PRIMARY KEY is to allow for either quick access OR REFERENTIAL INTEGRITY (CONSTRAINT ... FOREIGN KEY ...)
In your second example, you do not have the proper key for referential integrity if any other table refers to your table. In that case, other operations will be very very slow.
The difference in speed in your particular case should be small and trivial, but proper design dictates the first approach.
The first table represents many "globals" in each "site". That is, a "many-to-one" relationship. But it is the "wrong" way to do it. Instead the Globals table should have a column site_id to represent such a relationship to the Sites table. Meanwhile, the existence of htmlcode_1 is an inefficient waste.
The second table may be representing a "many-to-many" relationship between "sites" and "globals". If this is what you really want, then see my tips. Since you are likely to map from globals to sites, another index is needed.
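Something along these lines, if you keep htmlcode_2 as a mapping table (a sketch; the index name is mine):

ALTER TABLE htmlcode_2 ADD INDEX k_global (global_id);
-- With InnoDB the secondary index implicitly carries the primary key columns,
-- so lookups by global_id can also be satisfied from the index alone.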
I'm just getting into indexes on MySQL using InnoDB.
Firstly, and hopefully I am right: because I am using InnoDB and creating foreign keys, will they automatically be used as indexes when querying the table? Is that correct?
Also, I'm reading that the order of the index will affect the speed of a query and even whether it is used.
So... how exactly do I specify the order of the index, if that will indeed impact queries?
If you take my table below for example: it would be very beneficial for a query on this table to first use the FK index on org_id, since that is going to greatly reduce the number of rows read, and it is the org_id that most data is going to be separated by in the application.
CREATE TABLE IF NOT EXISTS `completed_checks` (
`complete_check_id` int(15) NOT NULL AUTO_INCREMENT,
`check_type` varchar(40) NOT NULL,
`check_desc` varchar(200) DEFAULT NULL,
`assigned_user` int(12) DEFAULT NULL,
`assigned_area` int(12) DEFAULT NULL,
`org_id` varchar(8) NOT NULL,
`check_notes` varchar(300) DEFAULT NULL,
`due` date NOT NULL,
`completed_by` int(12) DEFAULT NULL,
`completed_on` datetime DEFAULT NULL,
`status` int(1) DEFAULT NULL,
`passed` int(1) DEFAULT '0',
PRIMARY KEY (`complete_check_id`),
KEY `fk_org_id_CCheck` (`org_id`),
KEY `fk_user_id_CCheck` (`assigned_user`),
KEY `fk_AreaID_CCheck` (`assigned_area`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
So would MySQL use the FK index on org_id first when querying this table with org_id in the where clause?
And on a separate note, how would I specify the order in which the indexes are used in MySQL? If this is something that I need to be concerned about?
Thanks
Yes, this is correct, see MySQL documentation on creating foreign keys:
index_name represents a foreign key ID. The index_name value is ignored if there is already an explicitly defined index on the child table that can support the foreign key. Otherwise, MySQL implicitly creates a foreign key index
The order of the indexes in a table does not affect which index a query will use. You cannot even say that in general all queries should use a particular index first, since different queries may need different indexes. Moreover, MySQL will generally use no more than one index per table in a query (index merge being the rare exception).
In general MySQL decides which index to use (if any). If you believe that MySQL erred in its decision, then you can use index hint to influence MySQL's decision:
Index hints give the optimizer information about how to choose indexes during query processing.
In the newer versions of MySQL you can also use optimizer hints to influence the query plan.
The last way to influence index use is to force an update of the index statistics collected on a table, using the ANALYZE TABLE command:
ANALYZE TABLE analyzes and stores the key distribution for a table.
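Rough sketches of all three mechanisms, using the table and index names from your schema (the literal org_id value is illustrative; index-level optimizer hints require a fairly recent MySQL, 8.0.20 or later):

-- Index hint: force this query to use the org_id index.
SELECT * FROM completed_checks FORCE INDEX (fk_org_id_CCheck)
WHERE org_id = 'ORG001';

-- Optimizer hint: the same intent expressed as a comment hint.
SELECT /*+ INDEX(completed_checks fk_org_id_CCheck) */ *
FROM completed_checks
WHERE org_id = 'ORG001';

-- Refresh the index statistics the optimizer bases its choices on.
ANALYZE TABLE completed_checks;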
I have a client who has asked me to tune his MySQL database in order to implement some new features and to improve the performance of an already existing web app.
The biggest table (~90 GB) has over 200M rows and is growing at periodic intervals (one row per visit to any of the websites he owns). With continuous INSERTs, each SELECT query performed from the backend page takes a while to complete, as the indexes are updated each time.
I've done a simulation on my own server switching from BTREE indexes to HASH indexes. Both SELECTs and INSERTs are not running any faster. The table uses MyISAM as storage engine. There are only INSERTs and SELECTs, no UPDATEs or DELETEs.
I've come up with the idea of creating an auxiliary table, updated together with each INSERT, to speed up every SELECT query coming from the backend. I know this is bad practice, but I'm sure the performance will improve for the statistics page.
I'm not a database performance expert, as you may have noticed... Is there a better approach for this?
By the way, from phpMyAdmin I've seen that most indexes on the table have a cardinality of 0. In my simulation this didn't happen, and I'm not sure why.
Thanks a lot.
1st update: I've just learned that hash indexes aren't available for the MyISAM engine.
2nd update: OK. Here's the table schema.
CREATE TABLE `visits` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`datetime` int(8) NOT NULL,
`webmaster_id` char(18) NOT NULL,
`country` char(2) NOT NULL,
`connection` varchar(15) NOT NULL,
`device` varchar(15) NOT NULL,
`provider` varchar(100) NOT NULL,
`ip_address` varchar(15) NOT NULL,
`url` varchar(300) NOT NULL,
`user_agent` varchar(300) NOT NULL,
PRIMARY KEY (`id`),
KEY `datetime` (`datetime`),
KEY `webmaster_id` (`webmaster_id`),
KEY `country` (`country`),
KEY `connection` (`connection`),
KEY `device` (`device`),
KEY `provider` (`provider`)
) ENGINE=InnoDB;
So, instead of performing queries like select count(*) from visits where datetime=20140715 and device='ios', wouldn't it be better to fetch this with select `count` from visits_stats where datetime=20140715 and device='ios'?
INSERTs are, as said, much more frequent than SELECTs, but my client wants to improve the performance of the backend used to retrieve aggregated data. Using my approach, each visit would imply one INSERT and one INSERT/UPDATE (or REPLACE) which would increment one or more counters (I haven't decided the schema for the visits_stats table yet, the above query was just an example).
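To make the idea concrete, here is a rough sketch of what I have in mind for visits_stats (the schema is not final):

CREATE TABLE `visits_stats` (
  `datetime` int(8) NOT NULL,
  `device` varchar(15) NOT NULL,
  `count` int unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (`datetime`, `device`)
) ENGINE=InnoDB;

-- Each visit does the normal INSERT into visits plus one counter bump:
INSERT INTO visits_stats (`datetime`, `device`, `count`)
VALUES (20140715, 'ios', 1)
ON DUPLICATE KEY UPDATE `count` = `count` + 1;

-- The backend then reads the pre-aggregated value instead of counting rows:
SELECT `count` FROM visits_stats WHERE `datetime` = 20140715 AND `device` = 'ios';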
Apart from this, I've decided to replace some of the fields by their appropriate IDs from a foreign table. So far, data is stored in strings like connection=cable, device=android, and so on. I'm not sure how would this affect performance.
Thanks again.
Edit: I said before not to use partitions, but Bill is right that the way he described would work. Your only concern would be if you tried to select across all 101 partitions; then the whole thing would come to a standstill. If you don't intend to do this, then partitioning would solve the problem. Fix your indexes first, though.
Your primary problem is that MyISAM is not the best engine, and neither is InnoDB. TokuDB would be your best bet, but you'd have to install that on the server.
Now, you need to prune your indexes. This is the major reason for the slowness. Remove the index on everything that isn't part of common SELECT statements. Add a multi-column index on exactly what is requested in the WHERE of your SELECT statements.
So (in addition to your primary key) you want a single multi-column index on (datetime, device), according to your posted SELECT statement.
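A sketch of what that pruning could look like (keep or drop each index according to the queries you actually run; this is not a definitive list):

ALTER TABLE visits
  ADD INDEX `datetime_device` (`datetime`, `device`),  -- covers the posted SELECT
  DROP INDEX `datetime`,     -- replaced by the composite index
  DROP INDEX `country`,      -- drop these only if no common query filters on them
  DROP INDEX `connection`,
  DROP INDEX `device`,
  DROP INDEX `provider`;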
If you change to TokuDB the inserts will be much faster, if you stick with MyISAM then you could speed the whole thing up by using INSERT DELAYED instead of INSERT. The only issue with this is that the inserts will not be live, but will be added whenever MySQL decides there is not too much load.
Alternatively, if the above still does not help, your final option would be to use two tables: one table that you SELECT from, and another that you INSERT to. Once a day or so you would then copy the insert table into the select table. This does mean the data in your select table could be up to 24 hours old.
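The daily copy could be as simple as this sketch (the table names are illustrative; in practice you would lock or pause inserts while the two statements run):

-- Move the accumulated rows from the insert-only table into the table used for SELECTs.
INSERT INTO visits SELECT * FROM visits_insert;
TRUNCATE TABLE visits_insert;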
Other than that, you would have to completely change the table structure, which I can't tell you how to do because it depends on exactly what you are using it for, or use something other than MySQL. However, my above optimizations should work.
I would suggest looking into partitioning. You have to add datetime to the primary key to make that work, because of a limitation of MySQL. The primary or unique keys must include the column by which you partition the table.
Also make the index on datetime into a compound index on (datetime, device). This will be a covering index for the query you showed, so the query can get its answer from the index alone, without having to touch table rows.
CREATE TABLE `visits` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`datetime` int(8) NOT NULL,
`webmaster_id` char(18) NOT NULL,
`country` char(2) NOT NULL,
`connection` varchar(15) NOT NULL,
`device` varchar(15) NOT NULL,
`provider` varchar(100) NOT NULL,
`ip_address` varchar(15) NOT NULL,
`url` varchar(300) NOT NULL,
`user_agent` varchar(300) NOT NULL,
PRIMARY KEY (`id`, `datetime`), -- compound primary key is necessary in this case
KEY `datetime` (`datetime`,`device`), -- compound index for the SELECT
KEY `webmaster_id` (`webmaster_id`),
KEY `country` (`country`),
KEY `connection` (`connection`),
KEY `device` (`device`),
KEY `provider` (`provider`)
) ENGINE=InnoDB
PARTITION BY HASH(datetime) PARTITIONS 101;
So when you query for select count(*) from visits where datetime=20140715 and device='ios', your query is only scanning one partition, with about 1% of the rows in the table. Then within that partition, it narrows down even further using the index.
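You can confirm the pruning yourself (this syntax is for MySQL 5.6 and earlier; from 5.7 on, plain EXPLAIN includes the partitions column):

EXPLAIN PARTITIONS
SELECT COUNT(*) FROM visits WHERE `datetime` = 20140715 AND device = 'ios';
-- The partitions column should list a single partition, not all 101.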
Inserts should also improve, because they are updating much smaller indexes.
I use a prime number when doing hash partitioning, to help the partitions remain more evenly filled in case the dates inserted follow a regular pattern.
Converting a 90GB table to partitioning is going to take a long time. You can use pt-online-schema-change to avoid blocking your application.
You can even make more partitions if you want, in theory up to 1024 in MySQL 5.5 and 8192 in MySQL 5.6. Although with thousands of partitions, you may run into different bottlenecks, like the number of open files.
P.S.: HASH indexes are not supported by either MyISAM or InnoDB. HASH indexes are only supported by the MEMORY and NDB storage engines.
You are facing what is nowadays called a Big Data querying / Big Data handling problem. There are many solutions available for handling big data; unfortunately, none of them are easy to implement. You always need a team to structure big data to fulfill your needs. Some of the solutions are outlined below.
1. Big Table
Google uses this technique to create one very wide table with thousands of columns (to minimize records vertically). For this you have to analyze your data, partition it on the basis of similarity, and then tag those similarities with appropriate names. Your queries then have to be analyzed first by some algorithm to determine which column space has to be queried. Not simple enough.
2. Distribute the Database Across Multiple Machines
The Hadoop file system is an open-source Apache project created specifically to solve the problem of storing and querying big data. In the early days storage space was the issue and systems were only capable of processing small data, but space is no longer the issue; even small organizations have terabytes of data stored locally. Those terabytes cannot be processed in one go on a single machine, though; even a giant machine can take days to process an aggregate operation. That is why Hadoop exists.
If you are working alone, then this will definitely be a painful task and you will need resources to do it. But you can use the essence of these techniques without employing these technologies.
You are free to give these techniques a try; just study articles about handling big data. Plain relational database queries are not going to work in your case.
I have a table in my database which looks like this (names changed to comply with an NDA):
CREATE TABLE `Job` (
`id` varchar(45) NOT NULL,
`type` int(11) NOT NULL,
`status` int(11) NOT NULL,
`created_on` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`parent_id` varchar(45) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `JobTypeFK_idx` (`type`),
KEY `JobStatusFK_idx` (`status`),
KEY `JobTypeFK_idx1` (`type`),
KEY `JobStatusFK_idx1` (`status`),
KEY `JobParentIDFK_idx` (`parent_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I have read the significance of naming Indexes as per this question
significance of index name in creating an index (mySQL)
unfortunately it doesn't talk about the situation where there is more than one duplicate index on the same column.
There is another question, but relevant to SQL Server
Same column with multiple index names. It is possible. What is the use?
Unfortunately I am not working with SQL Server. I was cleaning up the schema to use newer MySQL features when I came across these duplicate index names, which I want to remove. I just want to know if there are any problems I might face later. If I keep worrying about breaking something, I will never be able to clean up the schema.
As far as I know, the only place where index names are used (other than DDL statements to modify and drop indexes) is in index hints. This allows you to suggest or force MySQL to use a specific index in a query, and it identifies them by name. If you ever make use of this feature, and you remove the index that's required by the query, the query will get an error.
As this feature is very rarely used, you can probably remove the redundant indexes without worrying about breaking anything. On the off chance that you do use this feature, just make sure you remove the index that isn't named. On the really unlikely chance that you have different queries that force different names of indexes on the same column, rewrite them to use the same index name, and then remove the other index.
You can search your code for the regular expression:
\b(using|ignore|force)\s+(index|key)\b
to find any uses of this feature.
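If that search turns up nothing, the cleanup itself is a single statement (index names taken from your posted schema):

ALTER TABLE `Job`
  DROP INDEX `JobTypeFK_idx1`,
  DROP INDEX `JobStatusFK_idx1`;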