I need to give my website users the ability to select their country, province and city. So I want to display a list of countries, then a list of provinces in the selected country, then a list of cities in the selected province (I don't want any other UI solution for now). Of course, every name must be in the user's language, so I need additional tables for the translations.
Let's focus on the case of the cities. Here are the two tables:
CREATE TABLE `city` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`province_id` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_fk_city_province` (`province_id`),
CONSTRAINT `fk_city_province` FOREIGN KEY (`province_id`) REFERENCES `province` (`id`)
) ENGINE=InnoDB;
CREATE TABLE `city_translation` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`city_id` int(10) unsigned NOT NULL,
`locale_id` int(10) unsigned DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_fk_city_translation_city` (`city_id`),
KEY `idx_fk_city_translation_locale` (`locale_id`),
KEY `idx_city_translation_city_locale` (`city_id`,`locale_id`),
CONSTRAINT `fk_city_translation_city` FOREIGN KEY (`city_id`) REFERENCES `city` (`id`),
CONSTRAINT `fk_city_translation_locale` FOREIGN KEY (`locale_id`) REFERENCES `locale` (`id`)
) ENGINE=InnoDB;
The city table contains 4 million rows, and the city_translation table 4 million × the number of languages available on my website. That is 12 million rows today; if in the future I want to support 10 languages, it will be 40 million...
So I am wondering: is it a bad idea, performance-wise, to work with a table of this size, or is a good index (here on the join fields, city_id and locale_id) enough to make the size not matter?
If not, what are the common solutions to this specific (but, I guess, common) problem? I'm only interested in performance. I'm OK with denormalizing if necessary, or even with using other tools if they are more appropriate (Elasticsearch?).
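For reference, the lookup this schema has to serve is essentially the following (a prepared-statement sketch; the two parameters are the selected province and the user's locale):
SELECT c.id, t.name
FROM city AS c
JOIN city_translation AS t ON t.city_id = c.id
WHERE c.province_id = ?  -- the selected province
  AND t.locale_id = ?    -- the user's locale
ORDER BY t.name;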
Get rid of id in city_translation. Instead, have PRIMARY KEY(city_id, locale_id). With InnoDB, this may double the speed of the JOINs by cutting out an unnecessary secondary-index lookup. And you can shrink the disk footprint by also removing the two indexes starting with city_id.
Do you think you will go beyond 16M cities? I doubt it. So save one byte by changing (in all tables) city_id to MEDIUMINT UNSIGNED.
Save 3 bytes by changing locale_id to TINYINT UNSIGNED.
Those savings are multiplied by the number of columns and indexes mentioning them.
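Putting those suggestions together, the table might look something like this (an untested sketch; the referenced columns city.id and locale.id must be changed to the same types for the foreign keys to work):
CREATE TABLE `city_translation` (
`city_id` MEDIUMINT UNSIGNED NOT NULL,
`locale_id` TINYINT UNSIGNED NOT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`city_id`, `locale_id`),  -- replaces the AUTO_INCREMENT id and both city_id indexes
KEY `idx_fk_city_translation_locale` (`locale_id`),  -- still needed for the locale FK
CONSTRAINT `fk_city_translation_city` FOREIGN KEY (`city_id`) REFERENCES `city` (`id`),
CONSTRAINT `fk_city_translation_locale` FOREIGN KEY (`locale_id`) REFERENCES `locale` (`id`)
) ENGINE=InnoDB;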
How big are the tables (GB)? What is the setting of innodb_buffer_pool_size? How much RAM is there? See if you can make that setting bigger than the total table size and yet no more than 70% of available memory. (That's the only "tunable" that is worth checking.)
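You can check both numbers like this (standard system variable and information_schema lookups; adjust the table filter to your schema):
SELECT @@innodb_buffer_pool_size;  -- current buffer pool size, in bytes

SELECT table_name,
ROUND((data_length + index_length) / POWER(1024, 3), 2) AS size_gb  -- data + index footprint
FROM information_schema.tables
WHERE table_schema = DATABASE()
AND table_name IN ('city', 'city_translation');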
I hope you have a default of CHARACTER SET utf8mb4 for the sake of Chinese users. (But that is another story.)
I'm creating a user database, and I want to move the cellphone number out of the user table into its own table (user_cellphone),
but I'm having trouble choosing the best index.
The user_cellphone table holds user_id and the cellphone number, but nearly all SELECT queries filter on user_id, so I want to know whether it's better to make user_id the primary key or not.
(Each user has only one cellphone number.)
Which of these two options is better?
CREATE TABLE `user_cellphone_num` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`cellphone_country_code` SMALLINT UNSIGNED NOT NULL,
`cellphone_num` BIGINT UNSIGNED NOT NULL,
`user_id` INT UNSIGNED NOT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `cellphone` (`cellphone_country_code`, `cellphone_num`),
UNIQUE INDEX `user_id` (`user_id`)
)
CREATE TABLE `user_cellphone_num` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`cellphone_country_code` SMALLINT UNSIGNED NOT NULL,
`cellphone_num` BIGINT UNSIGNED NOT NULL,
`user_id` INT UNSIGNED NOT NULL,
PRIMARY KEY (`user_id`),
UNIQUE INDEX `id` (`id`),
UNIQUE INDEX `cellphone` (`cellphone_country_code`, `cellphone_num`)
)
Should I make user_id the primary key, or just give user_id a unique key? Is there any performance difference (I'm talking about when I have millions of rows)?
In the future I'm going to use queries like this:
select u.*, cell.* FROM user AS u LEFT JOIN user_cellphone_num AS cell ON cell.user_id = u.id
So which of these options gives me better performance for queries like this?
May I offer some hard-won data design advice?
Do not use telephone numbers as any kind of unique or primary key.
Why not?
Sometimes multiple people use a single number.
Sometimes people make up fake numbers.
People punctuate numbers based on context. To my neighbors, my number is (978)555-4321. To a customer in the Netherlands it is +1.978.555.4321. Can you write a program to regularize those numbers? Of course. Can you write a correct program to do that? No. Why bother trying? Just take whatever people give you.
(Unless you work for a mobile phone provider, in which case ask your database administrator.)
Read this carefully. https://github.com/google/libphonenumber/blob/master/FALSEHOODS.md
InnoDB tables are stored as a clustered index, also called an index-organized table. If the table has a PRIMARY KEY, then that is used as the key for the clustered index. The other UNIQUE KEY is a secondary index.
Queries where you look up rows by the clustered index are a little bit more efficient than using a secondary index, even if that secondary index is a unique index. So if you want to optimize for the most common query which you say is by user_id, then it would be a good idea to make that your clustered index.
In your case, it would be kind of strange to separate the cellphones into a separate table, but then make user_id alone be the PRIMARY KEY. That means that only one row per user_id can exist in this table. I would have expected that you separated cellphones into a separate table to allow each user to have multiple phone numbers.
You can get the same benefit of the clustered index if you just make sure user_id is the first column in a compound key:
CREATE TABLE `user_cellphone_num` (
`user_id` INT UNSIGNED NOT NULL,
`num` TINYINT UNSIGNED NOT NULL,  -- sequence number of the phone within the user
`cellphone_country_code` SMALLINT UNSIGNED NOT NULL,
`cellphone_num` BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (`user_id`, `num`)
)
So a query like SELECT ... FROM user_cellphone_num WHERE user_id = ? will match one or more rows, but it will be an efficient lookup because it's searching the first column of the clustered index.
Reference: https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
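As a quick sanity check (12345 is just an example id), EXPLAIN should report this as a ref lookup on the PRIMARY key:
EXPLAIN SELECT * FROM user_cellphone_num WHERE user_id = 12345;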
We have a set of users
CREATE TABLE `users` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`email` varchar(254) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_email` (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED
Each user can have one or many domains, such as
CREATE TABLE `domains` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`domain` varchar(254) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `domain` (`domain`),
CONSTRAINT `domains_user_id_fk` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED
And we have a table with some sort of data; for this example it doesn't really matter what it contains:
CREATE TABLE `some_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`content` TEXT NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED
We want certain elements of some_data to be accessible to only certain users or only certain domains (whitelist case).
In other cases we want elements of some_data to be accessible to everyone BUT certain users or certain domains (blacklist case).
Ideally, we would like to retrieve, in a single query, the list of domains that a given element of some_data is accessible to, and ideally do the reverse (list all the data a given domain has access to).
Our approach so far is a single table:
CREATE TABLE `access_rules` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`rule_type` enum('blacklist','whitelist'),
`some_data_id` int(11) NOT NULL,
`user_id` int(11) NOT NULL,
`domain_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
CONSTRAINT `access_rules_some_data_id_fk` FOREIGN KEY (`some_data_id`) REFERENCES `some_data` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED
The problem, however, is that we need to query the db twice (to figure out whether the given data entry is operating a blacklist or a whitelist [whitelist has higher priority]). (EDIT: it can be done in a single query; see the sketch below.)
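Something along these lines works (a sketch with example ids: 42 is the some_data row, 123 the domain; COALESCE handles the no-rules case):
SELECT CASE
WHEN COALESCE(SUM(rule_type = 'whitelist'), 0) > 0
THEN COALESCE(SUM(rule_type = 'whitelist' AND domain_id = 123), 0) > 0  -- whitelist mode: domain must be listed
ELSE COALESCE(SUM(rule_type = 'blacklist' AND domain_id = 123), 0) = 0  -- blacklist mode: domain must not be listed
END AS has_access
FROM access_rules
WHERE some_data_id = 42;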
Also, since domain_id is nullable (to allow blacklisting/whitelisting an entire user), joining is not easy.
The API that will use this schema is currently hit 4-5k times per second, so performance matters.
The users table is relatively small (50k+ rows), the domains table has about 1.5 million entries, and some_data is also relatively small (under 100k rows).
EDIT: the question is more about semantics and best practices. With the above structure I'm confident we can make it work, but the schema "feels wrong", and I'm wondering if there is a better way.
There are two issues to consider, normalization and management.
To normalize it traditionally, you would need 4 tables.
Set up the 3 master tables: USER, DOMAIN, OtherDATA.
Set up a child table with User_Id, Domain_Id, OtherDATA_Id, PermissionLevel.
This stores the least amount of repeated data and makes management at the user-domain level easier. You could also add a default whitelist/blacklist field to the user and domain tables; that way a script could auto-populate the child table, and a manager would just go in and adjust the one value that needs changing.
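A sketch of that child table (the names here are illustrative, not taken from your schema):
CREATE TABLE `permission` (
`user_id` INT UNSIGNED NOT NULL,
`domain_id` INT UNSIGNED NOT NULL,
`other_data_id` INT UNSIGNED NOT NULL,
`permission_level` ENUM('whitelist','blacklist') NOT NULL,
PRIMARY KEY (`user_id`, `domain_id`, `other_data_id`),
KEY `idx_data` (`other_data_id`)  -- for the reverse lookup (data -> domains/users)
) ENGINE=InnoDB;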
If you had two different tables, one for the whitelist and one for the blacklist, you could get a user or domain onto both lists by accident. Actually, it would be 4 tables: 2 for users and 2 for domains. Management would be more complex.
I have these two table schemas:
CREATE TABLE `myTable` (
id int(11) NOT NULL AUTO_INCREMENT,
lat double NOT NULL,
lng double NOT NULL,
date datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
mobile bigint(11) unsigned NOT NULL,
date_updated datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `IDX_Datee` (`mobile`,`date`),
CONSTRAINT `FK_DeviceLocationss` FOREIGN KEY (`mobile`) REFERENCES `device` (`serial`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
And here is the second one:
CREATE TABLE `myTable2` (
lat double NOT NULL,
lng double NOT NULL,
date datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
mobile bigint(11) unsigned NOT NULL,
date_updated datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`mobile`,`date`),
CONSTRAINT `FK_DeviceLocationss2` FOREIGN KEY (`mobile`) REFERENCES `device` (`serial`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
Each table holds around 4,000,000 records so far,
so I'm trying to build the most suitable schema, one that is faster and consumes less storage.
When I check the state of each table in MySQL Workbench, I get a little confused:
First table: (Workbench screenshot)
Second table: (Workbench screenshot)
When I changed the IDX_Datee key from a secondary index to the primary key, it stopped consuming any extra space.
I believe the second schema is better for me, but I don't have a good understanding of the difference.
Can anyone explain that?
The table is index-organized: the data records are stored in index order.
see https://dev.mysql.com/doc/refman/5.5/en/optimizing-primary-keys.html
"With the InnoDB storage engine, the table data is physically organized to do ultra-fast lookups and sorts based on the primary key column or columns"
so no extra index is necessary.
All operations (select, insert, delete, update) on a single row specified by the PK will be very fast and efficient. Drill down the BTree that contains the data and is organized by the PK, and there is the row to work with.
The PK takes only a tiny amount of extra space, just as any BTree is slightly larger than its leaf nodes. As a rule of thumb, MySQL's BTrees (data or index) have a fanout of about 100; that is, each node has about 100 nodes under it. This implies only about 1% overhead in non-leaf nodes for the rest of the PK.
16KB per block / 61 bytes per row is about 268 -- your "fanout".
For starters, I will suggest that DOUBLE (8 bytes) is gross overkill for latitude and longitude unless you are trying to distinguish one flea from another on a dog; I maintain a table of representation choices for lat/lng.
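As one illustration (my example, not from the question): six decimal places resolve to roughly 0.1 m, and the two DECIMAL columns take 4 + 5 = 9 bytes per row instead of 16:
ALTER TABLE myTable
MODIFY lat DECIMAL(8,6) NOT NULL,  -- covers -90.000000 .. 90.000000
MODIFY lng DECIMAL(9,6) NOT NULL;  -- covers -180.000000 .. 180.000000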
INT is 4 bytes. If you are sure you won't go past 16 million, change the PK to MEDIUMINT UNSIGNED (3 bytes). (I suggest this is too risky.)
The size of the PK is doubly important because it is included in every secondary key.
If (mobile, date) is unique, then it may as well be the PK. That shaves off two copies of id and speeds up queries based on mobile.
If mobile contains phone numbers, some numbers won't fit. You would be better off with DECIMAL(11,0), which takes 5 bytes; DECIMAL(13,0) takes 6. If, instead, mobile is an AUTO_INCREMENT in some other table, then perhaps even SMALLINT UNSIGNED (2 bytes per copy, per table) would be better.
Your First table stores 4 extra column copies relative to the Second: id twice (once in the data, once in the secondary index), plus the copies of mobile and date in that index.
I am wondering if there is a better way to design some MySQL tables than what I have been using in this project. I have a series of numbers, each representing a specific time; for example, the number 101 might represent Jan 12, 2012. A number doesn't only represent a time, but time is the most basic part of that information. So I created a lexicon table that holds all the numbers we use, with details such as the time and the meaning of each number.
I have another table in which, whenever a customer makes a purchase, I check off that the purchase is eligible for a specific time. But the table where I check off each purchase and the lexicon table are not linked. I am wondering if there is a better way; maybe a way to have an SQL statement take all the data from the lexicon table and turn it into columns, while the rows consist of customer ID and a true/false selector.
table structure
THIS IS THE CUSTOMER PURCHASED TABLE T/F
CREATE TABLE `group1` (
`CustID` INT NOT NULL,  -- column implied by the PRIMARY KEY below; type assumed
`100` TINYINT(4) NULL DEFAULT '0',
`101` TINYINT(4) NULL DEFAULT '0',
`102` TINYINT(4) NULL DEFAULT '0',
... this goes on for 35 number columns in each table
PRIMARY KEY (`CustID`)
)
THIS IS THE LEXICON TABLE
CREATE TABLE `lexicon` (
`Number` INT(3) NOT NULL DEFAULT '0',
`Date` DATETIME NULL DEFAULT NULL,
`OtherPurtinantInfo` .... etc
)
So I guess, instead of making new groups of numbers for the customers every season, I would prefer to use the updated lexicon table to generate such a table automatically. My only concern is that we have many, many numbers, so combining everything would make a very large table; but perhaps it could be split into groups automatically as well, so that it does not become overwhelming.
I am not sure if I am being clear enough so feel free to comment on things that need to be clarified.
Here's a normalized ERD, based on what I understand your business requirements to be:
The classifieds run on certain dates, and a given advertisement can be run for more than one classifieds date.
The SQL statements to make the tables:
CREATE TABLE IF NOT EXISTS `classified_ads` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
);
CREATE TABLE IF NOT EXISTS `classified_dates` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`date` DATETIME NOT NULL,
`info` TEXT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE IF NOT EXISTS `classified_ad_dates` (
`classified_ad_id` INT UNSIGNED NOT NULL,
`classified_date_id` INT UNSIGNED NOT NULL,
PRIMARY KEY (`classified_ad_id`, `classified_date_id`),
INDEX `fk_classified_ad_dates_classified_ads1` (`classified_ad_id` ASC),
INDEX `fk_classified_ad_dates_classified_dates1` (`classified_date_id` ASC),
CONSTRAINT `fk_classified_ad_dates_classified_ads1`
FOREIGN KEY (`classified_ad_id`)
REFERENCES `classified_ads` (`id`)
ON DELETE CASCADE
ON UPDATE CASCADE,
CONSTRAINT `fk_classified_ad_dates_classified_dates1`
FOREIGN KEY (`classified_date_id`)
REFERENCES `classified_dates` (`id`)
ON DELETE CASCADE
ON UPDATE CASCADE
);
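Example usage with made-up ids: list every date a given ad runs on, and every ad running on a given date:
-- all dates for ad 42
SELECT d.`date`
FROM classified_ad_dates ad
JOIN classified_dates d ON d.id = ad.classified_date_id
WHERE ad.classified_ad_id = 42;

-- all ads for date 7
SELECT a.id
FROM classified_ad_dates ad
JOIN classified_ads a ON a.id = ad.classified_ad_id
WHERE ad.classified_date_id = 7;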
I have a table in MySQL that has 3 fields and I want to enforce uniqueness among two of the fields. Here is the table DDL:
CREATE TABLE `CLIENT_NAMES` (
`ID` int(11) NOT NULL auto_increment,
`CLIENT_NAME` varchar(500) NOT NULL,
`OWNER_ID` int(11) NOT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The ID field is a surrogate key (this table is loaded with ETL).
CLIENT_NAME is a field that contains the names of clients.
OWNER_ID is an id that indicates a client's owner.
I thought I could enforce this with a unique index on CLIENT_NAME and OWNER_ID:
ALTER TABLE `DW`.`CLIENT_NAMES`
ADD UNIQUE INDEX enforce_unique_idx(`CLIENT_NAME`, `OWNER_ID`);
but MySQL gives me an error:
Error executing SQL commands to update table.
Specified key was too long; max key length is 765 bytes (error 1071)
Anyone else have any ideas?
MySQL cannot enforce uniqueness on keys that are longer than 765 bytes (and apparently 500 UTF8 characters can surpass this limit).
Does CLIENT_NAME really need to be 500 characters long? Seems a bit excessive.
Add a new (shorter) column that is hash(CLIENT_NAME). Get MySQL to enforce uniqueness on that hash instead.
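For example (a sketch using MD5; any reasonably collision-resistant hash would do, and the hash must be kept in sync by the application, a trigger, or a generated column on MySQL 5.7+):
ALTER TABLE CLIENT_NAMES ADD COLUMN NAME_HASH BINARY(16) NULL;
UPDATE CLIENT_NAMES SET NAME_HASH = UNHEX(MD5(CLIENT_NAME));  -- backfill existing rows
ALTER TABLE CLIENT_NAMES
MODIFY NAME_HASH BINARY(16) NOT NULL,
ADD UNIQUE INDEX enforce_unique_idx (NAME_HASH, OWNER_ID);  -- 16 + 4 bytes, well under the limit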
Have you looked at CONSTRAINT ... UNIQUE?
Something seems a bit odd about this table; I would actually think about refactoring it. What do ID and OWNER_ID refer to, and what is the relationship between them?
Would it make sense to have
CREATE TABLE `CLIENTS` (
`ID` int(11) NOT NULL auto_increment,
`CLIENT_NAME` varchar(500) NOT NULL,
# other client fields - address, phone, whatever
PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `CLIENTS_OWNERS` (
`CLIENT_ID` int(11) NOT NULL,
`OWNER_ID` int(11) NOT NULL,
PRIMARY KEY (`CLIENT_ID`,`OWNER_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I would really avoid adding a unique key like that on a 500-character string; it's much more efficient to enforce uniqueness on two ints. Besides, an id in a table should refer to something that needs an id. In your version, the ID field identifies just the client/owner relationship, which doesn't need a separate id, since it's just a mapping.
Here's why: for the utf8 charset, MySQL may use up to 3 bytes per character, so CLIENT_NAME needs 3 × 500 = 1500 bytes. Shorten CLIENT_NAME to 250 characters or fewer.
Later: +1 to creating a hash of the name and using that as the key.