I would like some advice about a MySQL table design for an event logger.
Our needs:
- track a lot of actions
- 10,000 actions / second
- 1 billion rows at this time
Our hardware:
- 2 × Xeon (seen as 32 CPUs by the system)
- 128 GB RAM
- 6 × 600 SSD in RAID 10
Our table design :
CREATE TABLE IF NOT EXISTS `log_event` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`id_event` smallint(6) NOT NULL,
`id_user` bigint(20) NOT NULL,
`date` int(11) NOT NULL,
`data` bigint(20) NOT NULL,
PRIMARY KEY (`id`),
KEY `id_event_2` (`id_event`,`data`),
KEY `id_user` (`id_user`),
KEY `date` (`date`),
KEY `id_event_4` (`id_event`,`date`,`data`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
ALTER TABLE `log_event`
ADD CONSTRAINT `log_event_ibfk_1` FOREIGN KEY (`id_user`) REFERENCES `inscription` (`id_inscri`) ON DELETE CASCADE ON UPDATE CASCADE;
Our problem :
- We have an auto-increment as the primary key, but it is not really used. Is it a problem to remove it? We will have no primary key if we remove it, so how do we identify a row?
- We would like to do partitioning, but with the foreign key it seems to be impossible?
- We don't do bulk inserts. Is it a good idea to insert into a MEMORY table without indexes and copy the data every 5 minutes?
- Do you have any ideas to optimize? Any best practices for this kind of system?
Thanks !
François
Primary keys of relational tables (relations) can be of two types:
Natural - exists in the subject area and completely determines each row of the relational table.
Natural primary keys can be simple (consisting of only one column) or composite (consisting of more than one column). It is not recommended to set a natural primary key on a large string column.
Artificial - a special column injected by the database designer/developer to boost table performance: when the natural key is composite and has to be used in a related table (is a foreign key for something); when it is simple but large and would produce data overhead when copied into related tables as a foreign key; or when it is expensive to search (for example, CRUD operations on VARCHAR IDs can be slower than on INT IDs). There may be other reasons. TL;DR: an artificial key is one special column serving to completely determine each row of a relational table and boost its performance for CRUD operations.
We have an auto-increment as the primary key, but it is not really used. Is it a problem to remove it? We will have no primary key if we remove it, so how do we identify a row?
If you do not need to reference your table from other tables (as a source), then you can probably remove the artificial key without any consequences. Still, I recommend you set some other PRIMARY KEY on this table to avoid data duplication, and for clarity (if it matters).
Your table by itself (if properly normalized) will have a natural key among its candidate keys. It might be composite (consist of a few columns); that is normal. But don't set the primary key on strings, because a PRIMARY KEY always has an index, which will produce data overhead. If it is a combination of INT or "small" VARCHAR columns, then it is fine.
Consider as an option: id_event + id_user + date.
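A minimal sketch of that change (assuming (id_event, id_user, date) really is unique in your data, which you must verify first; on a billion-row table this ALTER rebuilds the whole table):

ALTER TABLE `log_event`
  DROP PRIMARY KEY,
  DROP COLUMN `id`,
  ADD PRIMARY KEY (`id_event`, `id_user`, `date`);
-- Note: since `date` is a second-resolution timestamp, this key assumes
-- one user cannot trigger the same event twice in the same second;
-- if that can happen, the composite key will reject real rows.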
We don't do bulk inserts. Is it a good idea to insert into a MEMORY table without indexes and copy the data every 5 minutes?
It is not a bad idea. But it is not a good idea until it has been properly tested. Run a load test before using it for real.
If you do not reference the MEMORY table from other tables, then you can still join it with any InnoDB table. But you will lose InnoDB functionality (referential integrity). If losing the parent table's ON DELETE CASCADE ON UPDATE CASCADE is not a concern, then it can be done. As for me, InnoDB is not so slow that switching the table engine is worth it in your case.
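A rough sketch of the buffer-and-flush pattern (the buffer table name is hypothetical; load-test before relying on it):

-- Unindexed MEMORY buffer; contents are lost on a server restart,
-- which may be acceptable for log data.
CREATE TABLE log_event_buffer (
  id_event smallint NOT NULL,
  id_user  bigint   NOT NULL,
  `date`   int      NOT NULL,
  data     bigint   NOT NULL
) ENGINE=MEMORY;

-- Every 5 minutes, move buffered rows into InnoDB in one batch.
-- MEMORY tables are non-transactional, so rows arriving between these
-- two statements could be lost; a production version should swap
-- buffers with RENAME TABLE, or cut off by a key range, instead.
INSERT INTO log_event (id_event, id_user, `date`, data)
  SELECT id_event, id_user, `date`, data FROM log_event_buffer;
DELETE FROM log_event_buffer;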
Related
I created/defined an admin table; now I have seen other programmers alter the table and add keys to it.
CREATE TABLE `admin` (
`admin_id` int(11) NOT NULL AUTO_INCREMENT,
`admin_name` varchar(255) NOT NULL,
`admin_surname` varchar(255) NOT NULL,
`phone` CHAR(10) NOT NULL,
`admin_email` varchar(255) NOT NULL,
`password` varchar(255) NOT NULL,
PRIMARY KEY (`admin_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `admin`
ADD PRIMARY KEY (`admin_id`),
ADD UNIQUE KEY `admin_email` (`admin_email`);
If I have already defined the table why should I alter the definition again here?
In InnoDB there is always a clustered index.
When a primary key exists in a table, it is used as the clustered index.
When there is no primary key, but there are unique indexes whose expressions do not include NULLable columns, the uppermost such unique index in the table definition is used as the clustered index.
When there is no such unique index, an internal hidden row number is used as the expression for the clustered index.
Hence, if you create a table (so that some expression is chosen for the clustered index) and then use ALTER TABLE to add a primary key, the table must be rebuilt. That doesn't matter when the table is empty, but when there is data in it the process may take quite a while (because the COPY method is used).
If you create the primary key as part of CREATE TABLE, then this is always fast.
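A sketch of the difference (hypothetical tables):

-- Fast: the primary key is the clustered index from the start.
CREATE TABLE t_fast (
  id int NOT NULL,
  payload varchar(100),
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Slow once populated: t_slow starts out clustered on the hidden
-- row number, so adding the PRIMARY KEY later forces a full rebuild.
CREATE TABLE t_slow (
  id int NOT NULL,
  payload varchar(100)
) ENGINE=InnoDB;
-- ... load data ...
ALTER TABLE t_slow ADD PRIMARY KEY (id);  -- table is copied and re-sorted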
I like to put all the index definitions inside the CREATE TABLE, and put them at the end instead of sitting on the column definitions.
Potential problem 1:
But I notice that some dump utilities like to add the indexes later. This may be a kludge to handle FOREIGN KEY definitions, which have trouble if the tables are not created in just the right order.
It would seem better to simply ADD FOREIGN KEY ... after all the tables are created and indexed, as sketched below.
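For example (table names are illustrative):

-- Create all tables without FOREIGN KEY clauses, in any order...
CREATE TABLE parent (
  id int NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE child (
  id int NOT NULL,
  parent_id int NOT NULL,
  PRIMARY KEY (id),
  KEY (parent_id)
) ENGINE=InnoDB;

-- ...then wire up the constraints once everything exists.
ALTER TABLE child
  ADD CONSTRAINT fk_child_parent
  FOREIGN KEY (parent_id) REFERENCES parent (id);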
Potential problem 2:
If you will be inserting a huge number of rows, it is usually more efficient to build the secondary keys after loading the data; this beats augmenting the indexes as the load goes. For small tables (under, say, a million rows), this is not a big deal.
I do not understand why they ADD PRIMARY KEY after loading the data. That requires (as Akina points out) tossing the fabricated PK, sorting the data, and adding the real PK. That seems like extra work, even for a huge table.
If the rows are sorted in PK order, the loading is more efficient. The table is ordered by the PK (for InnoDB); inserting in that order is faster than jumping around. (mysqldump will necessarily provide them in PK order, so it is usually a non-issue.)
I'll try to explain my problem and what I meant by the title.
Currently I have got a table with around ~8 million rows.
This table is highly active, which means there are constant updates, inserts and deletes.
These are caused by users (it's like a collecting game), which means I also need to make sure the data is accurately displayed.
I've looked so far into:
indexing
partitioning
sharding
mapreduce
optimize
I applied indexing; however, I'm not sure I applied it correctly, and it doesn't seem to help as much as I had hoped.
As I said, my table is highly active; if I added partitioning to this table, it would mean additional inserts/deletes and make this process far more complex than I can manage. I do not have that much experience with databases.
Sharding this database is way too complex for me, and I only have one server I can run this database on, so this option is a no-go.
As for MapReduce, I am not entirely sure what it does, but as far as I understood, it has more to do with the code than with the database.
I ran OPTIMIZE, but it didn't really seem to have much effect either, as far as I experienced.
I have tried not to use * in SELECT statements, and I made sure to get rid of most DISTINCT, COUNT and similar SQL functions, so that these wouldn't affect the speed of the database.
However, even after narrowing down the data in each table, and specifically this table, it's currently slower than it was before.
This table consists of:
CREATE TABLE `claim` (
`global_id` bigint NOT NULL AUTO_INCREMENT,
`fk_user_id` bigint NOT NULL,
`fk_series_id` smallint NOT NULL,
`fk_character_id` smallint NOT NULL,
`fk_image_id` int NOT NULL,
`fk_gif_id` smallint DEFAULT NULL,
`rarity` smallint NOT NULL,
`emoji` varchar(31) DEFAULT NULL,
PRIMARY KEY (`global_id`),
UNIQUE KEY `global_id_UNIQUE` (`global_id`),
KEY `fk_claim_character_id` (`fk_character_id`),
KEY `fk_claim_image_id` (`fk_image_id`),
KEY `fk_claim_series_id` (`fk_series_id`),
KEY `fk_claim_user_id` (`fk_user_id`) /*!80000 INVISIBLE */,
KEY `fk_claim_gif_id` (`fk_gif_id`) /*!80000 INVISIBLE */,
KEY `fk_claim_rarity` (`rarity`) /*!80000 INVISIBLE */,
KEY `fk_claim_emoji` (`emoji`),
CONSTRAINT `fk_claim_character_id` FOREIGN KEY (`fk_character_id`) REFERENCES `character` (`character_id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `fk_claim_image_id` FOREIGN KEY (`fk_image_id`) REFERENCES `image` (`image_id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `fk_claim_series_id` FOREIGN KEY (`fk_series_id`) REFERENCES `series` (`series_id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `fk_claim_user_id` FOREIGN KEY (`fk_user_id`) REFERENCES `user` (`user_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=7622452 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Is there possibly another solution to speed up the database? If so, how? I'm currently at my wits' end and stuck on it. The database needs to respond preferably within 300 ms.
EXAMPLE SLOW QUERIES:
SELECT PK FROM <table> WHERE fk_user_id = ?;
SELECT PK FROM <table> WHERE fk_user_id = ? GROUP BY fk_character_id HAVING MAX(fk_character_id) = 1;
SELECT PK, fk_user_id, fk_character_id, etc, etc, etc FROM <table> WHERE fk_user_id = ? ORDER BY PK ASC LIMIT 0, 20
Redundant
PRIMARY KEY (`global_id`),
UNIQUE KEY `global_id_UNIQUE` (`global_id`),
A PRIMARY KEY, in MySQL, is a UNIQUE KEY. So the UNIQUE KEY is redundant, wastes disk space, and slows down INSERT.
Need VISIBLE index starting with user_id for Q1 and Q2
Replace this
KEY `fk_claim_user_id` (`fk_user_id`) /*!80000 INVISIBLE */,
with
INDEX(fk_user_id, fk_character_id)
in that order -- this will help with your first 2 queries.
Query 3
The 3rd query may still need (in the given order)
INDEX(fk_user_id, global_id)
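Put together, the index changes might look like this (a sketch; MySQL should fall back on the new fk_user_id-leading index to satisfy the fk_claim_user_id constraint, but verify with EXPLAIN before and after):

ALTER TABLE claim
  DROP INDEX fk_claim_user_id,
  ADD INDEX idx_user_character (fk_user_id, fk_character_id),
  ADD INDEX idx_user_global    (fk_user_id, global_id);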
If you need some of the DISTINCTs/COUNTs, let's see them. Changing indexes may help.
Strange query
As for
SELECT PK FROM <table> WHERE fk_user_id = ?;
Why would you just want the PK? Is global_id useful by itself? Or is it useful only for looking up something else? If the latter, let's see it; it is often more practical to optimize a single complex query than two queries that are artificially split.
Tuning
How much RAM is available to MySQL? What is the value of innodb_buffer_pool_size? 30s for 50K rows sounds I/O-bound; maybe that setting is too low.
In some cases, DISTINCT speeds up a query, if for no other reason than that less data is shoveled back to the client.
Redesign PK
Based on the names "claim" and "user_id" and the test for "user_id" in all 3 queries, I deduce that you are frequently looking up stuff for a single "user"? What, if anything, is global_id needed for outside this table?
If you need global_id elsewhere, or nothing else could be used for uniqueness, do
PRIMARY KEY(user_id, global_id), -- for locality of reference
INDEX(global_id) -- to keep AUTO_INCREMENT happy
If (user_id, xx) is known to be unique (for some column(s) xx), toss global_id and change to
PRIMARY KEY(user_id, xx)
In either case, these go away:
PRIMARY KEY (`global_id`),
UNIQUE KEY `global_id_UNIQUE` (`global_id`),
KEY `fk_claim_user_id` (`fk_user_id`) /*!80000 INVISIBLE */,
InnoDB stores the data in PK order. By having the PK start with user_id, all the rows for one user are "adjacent" on the disk, thereby more readily cached in RAM (in the buffer_pool).
Given a user with 100 claims, this restructuring means the data is found in a couple of consecutive blocks (InnoDB's 16KB unit of storage) instead of upwards of 100 scattered blocks.
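A sketch of the first variant (rebuilding the clustered index of an 8-million-row table takes time, so test on a copy first):

ALTER TABLE claim
  DROP PRIMARY KEY,
  DROP INDEX global_id_UNIQUE,             -- redundant, as noted above
  DROP INDEX fk_claim_user_id,             -- the new PK now covers fk_user_id
  ADD PRIMARY KEY (fk_user_id, global_id), -- locality of reference
  ADD INDEX (global_id);                   -- keeps AUTO_INCREMENT happy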
I need a table to store some ratings. In this table I have a composite index (user_id, post_id) and another column to identify different rating systems.
user_id - bigint
post_id - bigint
type - varchar
...
Composite Index (user_id, post_id)
This table has no primary key because a primary key has to be unique and an INDEX does not have to be; in my case uniqueness is a problem.
For example I can have
INSERT INTO tbl_rate
(user_id,post_id,type)
VALUES
(24,1234,'like'),
(24,1234,'love'),
(24,1234,'other');
Can the missing PRIMARY KEY cause performance problems? Is my table structure good, or do I need to change it?
Thank you
A few points:
It sounds like you are just taking what is currently unique about the table and making that the primary key. That works. And natural keys have some advantages when it comes to querying because of locality (the data for each user is stored in the same area), and because the table is clustered by that key, which eliminates lookups to the data if you are searching by the columns in the primary key.
But using a natural primary key like the one you chose also has disadvantages for performance.
A very large primary key makes all other indexes very large in InnoDB, because the primary key value is included in each secondary index entry.
A natural primary key isn't as fast as a surrogate key for INSERTs, because in addition to being bigger, it can't just insert at the end of the table each time; it has to insert into the section for that user and post, etc.
Also, if you are searching by time, you will most likely be seeking all over the table with a natural key unless time is your first column. Surrogate keys tend to be local in time and can often be just right for some queries.
Using a natural key like yours as a primary key can also be annoying. What if you want to refer to a particular vote? You need several fields. It's also a little difficult to use with many ORMs.
Here's the Answer
I would create your own surrogate key and use it as the primary key, rather than relying on InnoDB's internal row id, because you'll be able to use it for updates and lookups.
ALTER TABLE tbl_rate
ADD id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ADD PRIMARY KEY(id);
But if you do create a surrogate primary key, I'd also make your natural key UNIQUE. Same cost, but it enforces correctness.
ALTER TABLE tbl_rate
ADD UNIQUE ( user_id, post_id, type );
Can the missing PRIMARY KEY cause performance problems?
Yes, in InnoDB for sure, as InnoDB will use an algorithm to create its own "ROWID",
which is defined in dict0boot.ic:
/** Returns a new row id.
@return the new id */
UNIV_INLINE
row_id_t
dict_sys_get_new_row_id(void)
/*=========================*/
{
    row_id_t id;

    mutex_enter(&(dict_sys->mutex));

    id = dict_sys->row_id;

    if (0 == (id % DICT_HDR_ROW_ID_WRITE_MARGIN)) {
        dict_hdr_flush_row_id();
    }

    dict_sys->row_id++;

    mutex_exit(&(dict_sys->mutex));

    return(id);
}
The main problem in that code is mutex_enter(&(dict_sys->mutex));, which blocks other threads from entering if one thread is already running this code.
Meaning it will lock the table much as MyISAM would.
% may take a few nanoseconds. That is insignificant compared to
everything else. Anyway #define DICT_HDR_ROW_ID_WRITE_MARGIN 256
Indeed, Rick James, this is insignificant compared to what was mentioned above.
The C/C++ compiler would micro-optimize it further, making the CPU instructions lighter to get even more performance out of it.
Still, the main performance concern is the mutex mentioned above.
Also, the modulo operator (%) is a CPU-heavy instruction.
But depending on the C/C++ compiler (and/or configuration options), it might be optimized when DICT_HDR_ROW_ID_WRITE_MARGIN is a power of two, e.g. (0 == (id & (DICT_HDR_ROW_ID_WRITE_MARGIN - 1))), since bitmasking is much faster; and DICT_HDR_ROW_ID_WRITE_MARGIN is indeed a power of 2 (it is 256).
Two tables:
CREATE TABLE `htmlcode_1` (
`global_id` int(11) NOT NULL,
`site_id` int(11) NOT NULL,
PRIMARY KEY (`global_id`),
KEY `k_site` (`site_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `htmlcode_2` (
`global_id` int(11) NOT NULL,
`site_id` int(11) NOT NULL,
PRIMARY KEY (`site_id`,`global_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
which one should be faster for selects and why?
'select * from table where site_id=%s'
The latter table is probably slightly faster for that SELECT query, assuming the table has a nontrivial number of rows.
When querying InnoDB by primary key, the lookup is against the clustered index for the table.
Secondary key lookups require a lookup in the secondary index; that reveals the primary key value, which is then used to do a lookup by primary key. So this uses two lookups.
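One way to verify is to compare the plans. (A caveat for this particular two-column table: the secondary index k_site implicitly carries the primary key, so it happens to cover SELECT * here; the two-lookup cost shows up clearly once the table has more columns.)

EXPLAIN SELECT * FROM htmlcode_2 WHERE site_id = 42;  -- one clustered index range scan
EXPLAIN SELECT * FROM htmlcode_1 WHERE site_id = 42;  -- secondary index, then PK lookups on wider tables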
The reason to use a PRIMARY KEY is to allow for either quick access or referential integrity (CONSTRAINT ... FOREIGN KEY ...).
In your second example, you do not have the proper key for referential integrity if any other table refers to yours; in that case, other operations will be very, very slow.
The difference in speed between the two for your particular case should be small and trivial, but proper design dictates the first approach.
The first table represents many "globals" in each "site", that is, a many-to-one relationship. But it is the "wrong" way to do it. Instead, the Globals table should have a site_id column to represent such a relationship to the Sites table; meanwhile, the existence of htmlcode_1 is an inefficient waste.
The second table may be representing a many-to-many relationship between "sites" and "globals". If this is what you really want, then see my tips. Since you are likely to map from globals to sites, another index is needed.
I am building a website (LAMP stack) with an Amazon RDS MySQL instance as the back end (type db.m3.medium).
I am happy with the database's integrity, and it works perfectly with regard to SELECT/JOIN/etc. queries (everything is normalized, indexed, and foreign-keyed; all tables have id primary keys and relevant secondary keys / unique keys).
I have a table 'df_products' with approx half a million products in it. The products need to be updated nightly. The process involves a PHP script reading over a large products data-file and inserting data into several tables (products table, product_colours table, brands table, etc), calling either INSERT or UPDATE depending on whether or not a row already exists. This is done as one giant transaction.
What I am seeing is that the UPDATE commands are sufficiently fast (50/sec, not exactly lightning but it should do); however, the INSERT commands are super slow (1/sec) and appear to be consuming 100% of a CPU. On a dual-core instance we see 50% CPU use (i.e. one full core).
I assume that this is because the indexes (1x PRIMARY + 5x INDEX + 1x UNIQUE + 1x FULLTEXT) are being rebuilt after every INSERT. However, I thought that putting the entire process into one transaction should stop indexes being rebuilt until the transaction is committed.
I have tried setting the following params via PHP but there is negligible performance improvement:
$this->db->query('SET unique_checks=0');
$this->db->query('SET foreign_key_checks=0;');
The process will take weeks to complete at this rate, so we must improve performance. Google appears to suggest using LOAD DATA. However:
- I would have to generate five files in order to populate five tables
- The process would have to use UPDATE commands as opposed to INSERT, since the tables already exist
- I would still need to loop over the products and scan the database for which values do and don't already exist
The database is entirely InnoDB and I don't plan to move to MyISAM (I want transactions, foreign keys, etc). This means that I cannot disable indexes. Even if I could, it would probably be a big performance drain, since we need to check whether a row already exists before we insert it, and without an index this would be super slow.
I have provided the products table definition below for information. Can you please advise on what process we should use to achieve faster INSERT/UPDATE on multiple large related tables, or what optimisations we can make to our existing process?
Thank you,
CREATE TABLE `df_products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`id_brand` int(11) NOT NULL,
`title` varchar(255) NOT NULL,
`id_gender` int(11) NOT NULL,
`id_colourSet` int(11) DEFAULT NULL,
`id_category` int(11) DEFAULT NULL,
`desc` varchar(500) DEFAULT NULL,
`seoAlias` varchar(255) CHARACTER SET ascii NOT NULL,
`runTimestamp` timestamp NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `seoAlias_UNIQUE` (`seoAlias`),
KEY `idx_brand` (`id_brand`),
KEY `idx_category` (`id_category`),
KEY `idx_seoAlias` (`seoAlias`),
KEY `idx_colourSetId` (`id_colourSet`),
KEY `idx_timestamp` (`runTimestamp`),
KEY `idx_gender` (`id_gender`),
FULLTEXT KEY `fulltext_title` (`title`),
CONSTRAINT `fk_id_colourSet` FOREIGN KEY (`id_colourSet`) REFERENCES `df_productcolours` (`id_colourSet`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `fk_id_gender` FOREIGN KEY (`id_gender`) REFERENCES `df_lu_genders` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=285743 DEFAULT CHARSET=utf8;
How many "genders" are there? If the usual 2, don't normalize it, don't index it, don't us a 4-byte INT to store it, use a CHAR(1) CHARACTER SET ascii (only 1 byte) or an ENUM (1 byte).
Each unnecessary index is a performance drain on the load, regardless of how it is done.
For INSERT vs UPDATE, look into using INSERT ... ON DUPLICATE KEY UPDATE.
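For example, a sketch keyed on the UNIQUE seoAlias (the values are made up and the column list is abbreviated):

INSERT INTO df_products
    (id_brand, title, id_gender, seoAlias, runTimestamp)
VALUES
    (3, 'Blue Widget', 1, 'blue-widget', NOW())
ON DUPLICATE KEY UPDATE
    id_brand     = VALUES(id_brand),   -- runs instead of erroring
    title        = VALUES(title),      -- when seoAlias already exists
    runTimestamp = VALUES(runTimestamp);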
Load the nightly data into a separate table (this could be MyISAM with no indexes). Then run one query to update existing rows and one to insert new rows (each needs a JOIN). See http://mysql.rjweb.org/doc.php/staging_table, especially the 2 SQLs used for "normalizing"; they can be adapted to your situation, as sketched below.
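Adapted to this case, the pair of statements might look like this (the staging table name df_products_staging is hypothetical):

-- 1) Update rows that already exist, joining on the natural key.
UPDATE df_products p
JOIN   df_products_staging s ON s.seoAlias = p.seoAlias
SET    p.id_brand     = s.id_brand,
       p.title        = s.title,
       p.runTimestamp = s.runTimestamp;

-- 2) Insert the rows that do not exist yet.
INSERT INTO df_products
    (id_brand, title, id_gender, seoAlias, runTimestamp)
SELECT s.id_brand, s.title, s.id_gender, s.seoAlias, s.runTimestamp
FROM   df_products_staging s
LEFT JOIN df_products p ON p.seoAlias = s.seoAlias
WHERE  p.id IS NULL;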
Any kind of multi-row query runs noticeably faster than 1-row at a time. (A 100-row INSERT runs 10 times as fast as 100 1-row inserts.)
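For instance, loading the staging table in batches:

-- One statement, one index-update pass, far fewer round trips:
INSERT INTO df_products_staging
    (id_brand, title, id_gender, seoAlias, runTimestamp)
VALUES
    (3, 'Product 1', 1, 'alias-1', NOW()),
    (3, 'Product 2', 2, 'alias-2', NOW()),
    (4, 'Product 3', 1, 'alias-3', NOW());  -- ...and so on, ~100 rows per batch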
innodb_flush_log_at_trx_commit = 2 will let the individual write statements run much faster. (If you batch them as I suggest, it won't add much more speed.)