MySQL index design with table partitioning

I have 2 MySQL tables with the following schemas for a web site that's kinda like a magazine.
Article (articleId int auto_increment,
title varchar(100),
titleHash guid, -- a hash of the title
articleText varchar(4000),
userId int)
User (userId int auto_increment,
userName varchar(30),
email etc...)
The most important query is:
select title,articleText,userName,email
from Article inner join user
on article.userId = user.UserId
where titleHash = <some hash>
I am thinking of using the articleId and titleHash columns together as a clustered primary key for the Article table, and userId and userName as a primary key for the User table,
as the searches will be based on the titleHash and userName columns.
Also, titleHash and userName are unique by design and will not normally change.
The articleId and userId columns are not business keys and are not visible to the application, so they'll only be used for joins.
I'm going to use MySQL table partitioning on the titleHash column so the selects will be faster, as the db will be able to use partition elimination based on that column.
I'm using InnoDB as the storage engine.
So here are my questions:
Do I need to create another index on the titleHash column, as the primary key (articleId, titleHash) is not good for searches on titleHash because it is the second column of the primary key?
What are the problems with this design?
I need the selects to be very fast and expect the tables to have millions of rows. Please note that the int id columns are not visible to the business layer and can never be used to find a record.
I'm from a sql server background and going to use mysql as using the partitioning on sql server will cost me a fortune as it is only available in the Enterprise edition.
So DB gurus, please help me; Many thanks.

As written, your "most important query" doesn't actually appear to involve the User table at all. If there isn't just something missing, the best way to speed this up will be to get the User table out of the picture and create an index on titleHash. Boom, done.
If there's another condition on that query, we'll need to know what it is to give any more specific advice.
Given your changes, all you should need as far as keys go is:
On Article:
PRIMARY KEY (articleId) (no additional columns, don't try to be fancy)
KEY (userId)
UNIQUE KEY (titleHash)
On User:
PRIMARY KEY (userId)
Don't try to get fancy with composite primary keys. Primary keys which just consist of an autoincrementing integer are handled more efficiently by InnoDB, as the key can be used internally as a row ID. In effect, you get one integer primary key "for free".
Above all else, test with real data and look at the results from EXPLAINing your query.
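For illustration, the key layout above could be written out roughly as follows. This is only a sketch: the column types are assumptions (MySQL has no GUID type, so titleHash is shown here as BINARY(16) holding an MD5 of the title, and the email length is made up). It also assumes the table is not partitioned, because MySQL requires every unique key on a partitioned table, including the primary key, to contain all columns of the partitioning expression.

```sql
-- Sketch only: types and lengths are assumptions, not taken from the question.
CREATE TABLE User (
  userId   INT NOT NULL AUTO_INCREMENT,
  userName VARCHAR(30) NOT NULL,
  email    VARCHAR(255),              -- assumed length
  PRIMARY KEY (userId)
) ENGINE=InnoDB;

CREATE TABLE Article (
  articleId   INT NOT NULL AUTO_INCREMENT,
  title       VARCHAR(100) NOT NULL,
  titleHash   BINARY(16) NOT NULL,    -- assumed: 16-byte MD5 of the title
  articleText VARCHAR(4000),
  userId      INT NOT NULL,
  PRIMARY KEY (articleId),
  KEY (userId),
  UNIQUE KEY (titleHash)
) ENGINE=InnoDB;

-- Check the access plan for the "most important query".
EXPLAIN
SELECT a.title, a.articleText, u.userName, u.email
FROM Article a
INNER JOIN User u ON a.userId = u.userId
WHERE a.titleHash = UNHEX(MD5('Some title'));
```

If things work as intended, the plan should show the unique index on titleHash being used for Article and a primary key lookup for User.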

Related

What if `auto_increment` gaps caused by MySQL `INSERT...ON DUPLICATE KEY UPDATE` cannot be ignored?

While performing INSERT...ON DUPLICATE KEY UPDATE on InnoDB in MySQL, we are often told to ignore the potential gaps in auto_increment columns. What if such gaps are very likely and cannot be ignored?
As an example, suppose there is a table rating that stores the users' ratings of items. The table schema is something like
CREATE TABLE rating (
id INT AUTO_INCREMENT PRIMARY KEY,
user_id INT NOT NULL,
item_id INT NOT NULL,
rating INT NOT NULL,
UNIQUE KEY tuple (user_id, item_id),
FOREIGN KEY (user_id) REFERENCES user(id),
FOREIGN KEY (item_id) REFERENCES item(id)
);
There may be many users and many items, and users may frequently change ratings they have already given. Every time a rating is changed, INSERT...ON DUPLICATE KEY UPDATE creates a gap. The alternatives are to query twice (do a SELECT first), which hurts performance, or to check the affected-row count, which does not work for multi-row INSERTs.
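For a system where 100K users have each rated 10 items and change half of their ratings every day, about 500K auto_increment values are wasted per day, so a signed INT id (maximum 2,147,483,647) would be exhausted in roughly twelve years, and sooner on a larger system. What should we do to prevent this in practice?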

Full answer.
Gaps are OK! Just use a bigger id field, for example BIGINT. Don't try to reuse gaps; that is a bad idea. Don't worry about performance or optimization in this case. It's a waste of time.
Another solution is to make a composite key the primary key. In your case, you can remove the id field and use the pair (user_id, item_id) as the primary key.
In the case of "rating", the most frequent operations are "delete by user_id" and inserts, so you don't really need the "id" primary key for functionality. But you always need some primary key to be present in the table.
The only drawback of this method is that when you want to delete just one row from the table, you will need a query something like:
DELETE FROM rating WHERE user_id = 123 AND item_id=1234
instead of old
DELETE FROM rating WHERE id = 123
But in this case it isn't hard to change one line of code in your application. Furthermore, in most cases people don't need such functionality.
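A minimal sketch of that composite-key variant, based on the table from the question. The explicit index on item_id and the sample values are assumptions for illustration. With no AUTO_INCREMENT column there is nothing for the upsert to consume, so the gap problem does not arise.

```sql
-- Same table as in the question, minus the surrogate id column.
CREATE TABLE rating (
  user_id INT NOT NULL,
  item_id INT NOT NULL,
  rating  INT NOT NULL,
  PRIMARY KEY (user_id, item_id),   -- replaces UNIQUE KEY tuple (user_id, item_id)
  KEY (item_id),                    -- assumed: supports lookups and the FK on item_id
  FOREIGN KEY (user_id) REFERENCES user(id),
  FOREIGN KEY (item_id) REFERENCES item(id)
);

-- Multi-row upsert: new pairs are inserted, existing pairs get the new rating.
INSERT INTO rating (user_id, item_id, rating)
VALUES (123, 1234, 4),
       (123, 1235, 5)
ON DUPLICATE KEY UPDATE rating = VALUES(rating);
```

Note that VALUES() in the UPDATE clause is deprecated in newer MySQL 8.0 releases in favour of a row alias, but it is still accepted.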

We work with large tables, some of them holding hundreds of millions of records, and we repeatedly use INSERT IGNORE or INSERT ... ON DUPLICATE KEY. Making the column an unsigned BIGINT will avoid the id issue.
But I would suggest you think about a long-term solution as well, bearing in mind some known facts:
SELECT followed by INSERT/UPDATE is quite often faster than INSERT ... ON DUPLICATE KEY, again depending on your data size and other factors.
If you have two unique keys (or one primary and one unique key), the result of your query might not always be predictable, and it can cause replication errors if you use statement-based replication.
ID is not the only issue with large tables. If you have a table with more than some 300M records, performance can degrade drastically. You need to think about partitioning/clustering/sharding your database/tables pretty soon.
Personally I would suggest not using INSERT ... ON DUPLICATE KEY. Read extensively on its usage and performance impact if you are planning a highly scalable service.

Mysql table with no primary key but a foreign key

I have read somewhere here that having a primary key in each and every table is a good thing to do... Let's say I have two tables, "student" and "student_details", and I am using InnoDB.
"student" has a few columns like - student_id(Primary Key), student_name
"student_details" has a few columns like - student_id(Foreign Key), Address, Phone, Mobile, etc..
Does "student_details" still need a primary key?

Whether you know it or not, what you are doing is column partitioning the table. You can have student_details.student_id be both a primary key and a foreign key. No problem with that. So, you can have a primary key in the table.
There are several reasons to do column partitioning, usually related to performance on commonly used columns or to create rows with more than the maximum number of columns. I doubt either of these apply in your case.
In fact, given the nature of the data, the student_details table is actually storing a "slowly-changing dimension". In simpler language, students move, so their address changes. Students change their telephone number. And so on. What you should really have is an effective date and an end date for each student details record. Then you can add an auto-incrementing primary key (which is what I would do) or you could declare student_details(student_id, eff_date) as the primary key.
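Both options can be sketched as follows. The column types, the INT type of student.student_id, and the names student_details_history, eff_date and end_date are assumptions for illustration only.

```sql
-- Option 1: one details row per student; student_id is both PK and FK.
CREATE TABLE student_details (
  student_id INT NOT NULL,
  address    VARCHAR(255),
  phone      VARCHAR(30),
  mobile     VARCHAR(30),
  PRIMARY KEY (student_id),
  FOREIGN KEY (student_id) REFERENCES student (student_id)
) ENGINE=InnoDB;

-- Option 2: keep history, one row per student per effective date.
CREATE TABLE student_details_history (
  student_id INT NOT NULL,
  eff_date   DATE NOT NULL,          -- date this version became valid
  end_date   DATE NULL,              -- NULL while this version is current
  address    VARCHAR(255),
  phone      VARCHAR(30),
  mobile     VARCHAR(30),
  PRIMARY KEY (student_id, eff_date),
  FOREIGN KEY (student_id) REFERENCES student (student_id)
) ENGINE=InnoDB;

-- Current details for one student under option 2.
SELECT address, phone, mobile
FROM student_details_history
WHERE student_id = 1 AND end_date IS NULL;
```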

Comma separated list on MySQL database

I am implementing a friends list for users in my database, where the list will store the friends accountID.
I already have a similar structure in my database for achievements, where a separate table holds pairs of accountID and achievementID. My concern with this approach is that it is inefficient: if there are 1 million users with 100 achievements each, there are 100 million entries in this table. Then trying to get every achievement for a user with a certain accountID would be a linear scan of the table (I think).
I am considering having a comma separated string of accountIDs for my friends list table. I realize how annoying it will be to deal with the data as a string, but at least it would be guaranteed to be log(n) search time for a user, with accountID as the primary key and the second column being the list string.
Am I wrong about the search time for these two different structures?

MySQL can make effective use of appropriate indexes, for queries designed to use those indexes, avoiding a "scan" operation on the table.
If you are ALWAYS dealing with the complete set of achievements for a user, retrieving the entire set, and storing the entire set, then a comma separated list in a single column can be a workable approach.
HOWEVER... that design breaks down when you want to deal with individual achievements. For example, if you want to retrieve a list of users that have a particular achievement. Now, you're doing expensive full scans of all achievements for all users, doing "string searches", dependent on properly formatted strings, and MySQL is unable to use an index scan to efficiently retrieve that set.
So, the rule of thumb: if you NEVER need to individually access an achievement, NEVER need to remove an achievement from a user in the database, NEVER need to add an individual achievement for a user, and will ONLY EVER pull and store the achievements as an entire set, in and out of the database, then the comma separated list is workable.
I hesitate to recommend that approach, because it never turns out that way. Inevitably, you'll want a query to get a list of users that have a particular achievement.
With the comma separated list column, you're into some ugly SQL:
SELECT a.user_id
FROM user_achievement_list a
WHERE CONCAT(',',a.list,',') LIKE '%,123,%'
ugly in the sense that MySQL can't use an index range scan to satisfy the predicate; MySQL has to look at EVERY SINGLE list of achievements, and then do a string scan on each and every one of them, from the beginning to the end, to find out if a row matches or not.
And it's downright excruciating if you want to use the individual values in that list to do a join operation, to "lookup" a row in another table. That SQL just gets horrendously ugly.
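To give an idea of what such a join looks like, here is a rough sketch using MySQL's FIND_IN_SET function. The achievement lookup table is hypothetical; only user_achievement_list and its list column come from the query above. It also assumes the list contains no spaces after the commas, since FIND_IN_SET compares each element exactly.

```sql
-- Hypothetical lookup table, for illustration only.
CREATE TABLE achievement (
  id   INT UNSIGNED NOT NULL PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);

-- Join each list element to its row in the lookup table.
-- FIND_IN_SET returns the 1-based position of the value in the list, or 0.
SELECT a.user_id, ach.name
FROM user_achievement_list a
JOIN achievement ach
  ON FIND_IN_SET(ach.id, a.list) > 0;
```

The join condition is a function of the list column, so MySQL cannot use an index on it and has to test every list against every achievement row.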
And declarative enforcement of data integrity is impossible; you can't define any foreign key constraints that restrict the values that are added to the list, or remove all occurrences of a particular achievement_id from every list it occurs in.
Basically, you're "giving up" the advantages of a relational data store, so don't expect the database to be able to do any work with that type of column. As far as the database is concerned, it's just a blob of data; it might as well be a .jpg image stored in that column. MySQL isn't going to help with retrieving or maintaining the contents of that list.
On the other hand, if you go with a design that stores the individual rows, each achievement for each user as a separate row, and you have an appropriate index available, the database can be MUCH more efficient at returning the list, and the SQL is more straightforward:
SELECT a.user_id
FROM user_achievements a
WHERE a.achievement_id = 123
A covering index would be appropriate for that query:
... ON user_achievements (achievement_id, user_id)
An index with user_id as the leading column would be suitable for other queries:
... ON user_achievements (user_id, achievement_id)
FOLLOWUP
Use EXPLAIN SELECT ... to see the access plan that MySQL generates.
For your example, retrieving all achievements for a given user, MySQL can do a range scan on the index to quickly locate the set of rows for the one user. MySQL doesn't need to look at every page in the index, the index is structured as a tree (at least, in the case of B-Tree indexes) so it can basically eliminate a whole boatload of pages it "knows" that the rows you are looking for can't be. And with the achievement_id also in the index, MySQL can return the resultset right from the index, without a need to visit the pages in the underlying table. (For the InnoDB engine, the PRIMARY KEY is the cluster key for the table, so the table itself is effectively an index.)
With a two column InnoDB table (user_id, achievement_id), with those two columns as the composite PRIMARY KEY, you would only need to add one secondary index, on (achievement_id, user_id).
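Put together, the two-column design could look something like this. The column types and the index name are assumptions; the table and column names follow the queries above.

```sql
CREATE TABLE user_achievements (
  user_id        BIGINT UNSIGNED NOT NULL,
  achievement_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, achievement_id),                        -- cluster key
  KEY user_achievements_IX1 (achievement_id, user_id)           -- the one secondary index
) ENGINE=InnoDB;

-- All achievements for one user: range scan on the PRIMARY KEY.
EXPLAIN
SELECT a.achievement_id
FROM user_achievements a
WHERE a.user_id = 42;

-- All users with one achievement: range scan on the secondary index.
EXPLAIN
SELECT a.user_id
FROM user_achievements a
WHERE a.achievement_id = 123;
```

For the second query, the Extra column of the EXPLAIN output should show "Using index", meaning the result comes straight from the index.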
FOLLOWUP
Q: By secondary index, do you mean a 3rd column that contains the key for the composite (userID, achievementID) table. My create table query looks like this
CREATE TABLE `UserFriends`
(`AccountID` BIGINT(20) UNSIGNED NOT NULL
,`FriendAccountID` BIGINT(20) UNSIGNED NOT NULL
,`Key` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT
, PRIMARY KEY (`Key`)
, UNIQUE KEY `AccountID` (`AccountID`, `FriendAccountID`)
);
A: No, I don't mean the addition of a third column. If the only two columns in the table are foreign keys to another table (it looks like they refer to the same table), both columns are NOT NULL, there is a UNIQUE constraint on the combination of the columns, and there are no other attributes on the table, I would consider not using a surrogate as the primary key at all. I would make the UNIQUE KEY the PRIMARY KEY.
Personally, I would be using InnoDB, with the innodb_file_per_table option enabled. And my table definition would look something like this:
CREATE TABLE user_friend
( account_id BIGINT(20) UNSIGNED NOT NULL COMMENT 'PK, FK ref account.id'
, friend_account_id BIGINT(20) UNSIGNED NOT NULL COMMENT 'PK, FK ref account.id'
, PRIMARY KEY (account_id, friend_account_id)
, UNIQUE KEY user_friend_UX1 (friend_account_id, account_id)
, CONSTRAINT FK_user_friend_user FOREIGN KEY (account_id)
REFERENCES account (id) ON UPDATE CASCADE ON DELETE CASCADE
, CONSTRAINT FK_user_friend_friend FOREIGN KEY (friend_account_id)
REFERENCES account (id) ON UPDATE CASCADE ON DELETE CASCADE
) Engine=InnoDB;

Sql Query join suggestions

I was wondering, when having a parent table and a child table with a foreign key like:
users
id | username | password |
users_blog
id | id_user | blog_title
is it OK to use an auto-increment id on the join table (users_blog) as well, or will I have problems with query speed?
Also, I would like to know which fields to add as PRIMARY and which as INDEX in the users_blog table.
Hope the question is clear, sorry for my bad English :P

I don't think you actually need the id column in the users_blog table. I would make id_user the primary key on that table unless you have another reason for keeping id (perhaps the users_blog table actually has more columns and you are just not showing them to us?).
As far as performance goes, having the id column in the users_blog table shouldn't hurt by itself, but your queries will never use its index, since it's very unlikely that you'll ever select data based on that column. Having id_user as the primary key will actually benefit you and will speed up your joins and selects.

What's the cardinality between the user and user_blog? If it's 1:1, why do you need an id field in the user_blog table?

is it ok to use id as auto increment also on join table (users_blog) or will i have problems of query speed?
Whether a field is auto-increment or not has no impact on how quickly you can retrieve data that is already in the database.
also i would like to know which fields to add as PRIMARY and which as INDEX in users_blog table?
The purpose of PRIMARY KEY (and other constraints) is to enforce the correctness of data. Indexes are "just" for performance.
So what fields will be in PRIMARY KEY depends on what you wish to express with your data model:
If a users_blog row is identified with the id alone (i.e. there is a "non-identifying" relationship between these two tables), put id alone in the PRIMARY KEY.
If it is identified by a combination of id_user and id (aka. "identifying" relationship) then you'll have these two fields together in your PK.
As for indexes, that depends on how you are going to access your data. For example, if you do many JOINs you may consider an index on id_user.
A good tutorial on index performance can be found at:
http://use-the-index-luke.com
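As a rough sketch of the "non-identifying" variant with an index for joins: the column types and the index name below are assumptions, while the table and column names are the ones from the question.

```sql
CREATE TABLE users (
  id       INT NOT NULL AUTO_INCREMENT,
  username VARCHAR(50) NOT NULL,
  password VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE users_blog (
  id         INT NOT NULL AUTO_INCREMENT,
  id_user    INT NOT NULL,
  blog_title VARCHAR(255) NOT NULL,
  PRIMARY KEY (id),                         -- a row is identified by id alone
  KEY idx_users_blog_id_user (id_user),     -- supports the JOIN below
  FOREIGN KEY (id_user) REFERENCES users (id)
) ENGINE=InnoDB;

-- Typical access path: all blogs of one user.
SELECT u.username, b.blog_title
FROM users u
JOIN users_blog b ON b.id_user = u.id
WHERE u.id = 1;
```

For the "identifying" variant, the primary key would instead be declared on (id_user, id).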

I don't see any problem with having an auto increment id column on users_blog.
The primary key can be id_user, id. As for indexing, this heavily depends on your usage.
I doubt you will be having any database related performance issue with a blog engine though, so indexing or not doesn't make much of a difference.

You don't have to use an id column in the users_blog table; you can join id_user with the users table. Also, auto increment is not a problem for performance.

It is a good idea to have an identifier column that is auto increment - this guarantees a way of uniquely identifying the row (in case all other columns are the same for two rows)
id is a good name for all table keys and it's the standard
<table>_id is the standard name for foreign keys - in your case use user_id (not id_user as you have)
mysql automatically creates indexes for columns defined as primary or foreign keys - there is no need to do anything here
IMHO, table names should be singular - ie user not users
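Your SQL should look something like:
create table user (
id int not null auto_increment primary key,
...
);
create table user_blog (
id int not null auto_increment primary key,
user_id int not null,
...
-- MySQL parses but ignores an inline "references" on a column,
-- so the foreign key is declared as a separate clause
foreign key (user_id) references user (id)
);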

Is string or int preferred for foreign keys?

I have a user table with userid and username columns, and both are unique.
Between userid and username, which would be better to use as a foreign key and why?
My Boss wants to use string, is that ok?

Is string or int preferred for foreign keys?
It depends
There are many existing discussions on the trade-offs between Natural and Surrogate Keys - you will need to decide on what works for you, and what the 'standard' is within your organisation.
In the OP's case, there is both a surrogate key (int userId) and a natural key (char or varchar username). Either column can be used as a Primary key for the table, and either way, you will still be able to enforce uniqueness of the other key.
Here are some considerations when choosing one way or the other:
The case for using Surrogate Keys (e.g. UserId INT AUTO_INCREMENT)
If you use a surrogate, (e.g. UserId INT AUTO_INCREMENT) as the Primary Key, then all tables referencing table MyUsers should then use UserId as the Foreign Key.
You can still however enforce uniqueness of the username column through use of an additional unique index, e.g.:
CREATE TABLE `MyUsers` (
`userId` int NOT NULL AUTO_INCREMENT,
`username` varchar(100) NOT NULL,
... other columns
PRIMARY KEY(`userId`),
UNIQUE KEY UQ_UserName (`username`)
);
As per @Dagon, using a narrow primary key (like an int) has performance and storage benefits over using a wider (and variable length) value like varchar. This benefit also impacts further tables which reference MyUsers, as the foreign key to userid will be narrower (fewer bytes to fetch).
Another benefit of the surrogate integer key is that the username can be changed easily without affecting tables referencing MyUsers.
If the username was used as a natural key, and other tables are coupled to MyUsers via username, it makes it very inconvenient to change a username (since the Foreign Key relationship would otherwise be violated). If updating usernames was required on tables using username as the foreign key, a technique like ON UPDATE CASCADE is needed to retain data integrity.
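A minimal sketch of that natural-key setup, for comparison. The UserPosts table, its columns and the constraint name are hypothetical; only MyUsers and username come from the example above.

```sql
-- Natural key: username is the primary key and is referenced directly.
CREATE TABLE MyUsers (
  username VARCHAR(100) NOT NULL,
  PRIMARY KEY (username)
) ENGINE=InnoDB;

-- Hypothetical referencing table.
CREATE TABLE UserPosts (
  postId   INT NOT NULL AUTO_INCREMENT,
  username VARCHAR(100) NOT NULL,
  PRIMARY KEY (postId),
  CONSTRAINT FK_UserPosts_MyUsers FOREIGN KEY (username)
    REFERENCES MyUsers (username) ON UPDATE CASCADE
) ENGINE=InnoDB;

-- Renaming the user rewrites every referencing row in UserPosts as well.
UPDATE MyUsers SET username = 'new_name' WHERE username = 'old_name';
```

With a surrogate userId as the foreign key, the same rename would touch a single row in MyUsers.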
The case for using Natural Keys (i.e. username)
One downside of using Surrogate Keys is that other tables which reference MyUsers via a surrogate key will need to be JOINed back to the MyUsers table if the Username column is required. One of the potential benefits of Natural keys is that if a query requires only the Username column from a table referencing MyUsers, that it need not join back to MyUsers to retrieve the user name, which will save some I/O overhead.

An int is 4 bytes; a string can be as many bytes as you like. Because of that, an int will generally perform better. Unless, of course, you stick with usernames that are less than 4 characters long :)
Besides, you should never use a column as PK/FK if the data within the column itself can change. Users tend to change their usernames, and even if that functionality doesn't exist in your app right now, maybe it will in a few years. When that day comes, you might have 1000 tables that reference that user table, and then you'll have to update all 1000 tables within a transaction, and that's just bad.

int will index faster, may or may not be an issue, hard to say based on what you have provided

It depends on the foreign key: If your company has control over it, then I recommend using an Int if there is an ID field for it. However, sometimes an ID field is not on a table because another key makes sense as an alternate unique key. So, the ID field might be a surrogate key in that case.
Rule of thumb: Your foreign key data type should match your primary key data type.
Here's an exception: what about foreign keys that don't belong to your company? What about foreign keys to databases and APIs that you have no control over? Those IDs should always be strings IMO.
To convince you, I ask these questions:
Are you doing math on it? Are you incrementing it? Do you have control over it? APIs are notorious for change, even data types CAN be changed in someone else's database... so how much will it mess you up when an int ID becomes a hex?
