I have a table with ~20 columns.
-----------------------------------------------------------------
GUID_PK | GUID_SET_ID | Col_3 | Col_4 | ... | Col_20
-----------------------------------------------------------------
There could be thousands of sets, each with tens up to (but fewer than) a thousand records. Records within a set are all related to each other; sets are completely independent of each other. A whole set is read or written at a time, in one big transaction. Once a record is written it is read-only forever: never altered, only read. Data is rarely deleted from this table, and when it is, the whole set is deleted in one go.
SET_ID is the only incoming foreign key. The PK is an outgoing foreign key to a detail table, in which about 3 or 4 records (each a single blob) are kept per master record.
The question is: should I partition the tables? I think yes. My boss thinks he has a better idea: he wants the tables created dynamically, one master and one detail table per set. I am personally not comfortable with the dynamic-creation idea, but I also fear the one-table-to-rule-them-all architecture.
The bulk inserts and bulk selects are definitely going to hit performance, and bulk deletes will fragment the indexes. What would be an optimal structure?
Taking into account that most of the Col_x columns are populated, you can use HASH partitioning:
CREATE TABLE
....
PARTITION BY HASH(GUID_SET_ID)
PARTITIONS NO_PART;
where NO_PART is the number of partitions you want. This should be established taking into account:
1) the volume of data you receive daily
2) the volume of data you estimate that will be received in the future
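For illustration, here is a more complete sketch of that statement, assuming MySQL and with all names and types invented for the example. Note that MySQL's HASH partitioning requires an integer expression, so for a GUID column the closely related KEY partitioning is the variant that actually applies:

-- A minimal sketch, not definitive DDL: GUIDs stored as BINARY(16),
-- Col_3 ... Col_20 omitted. On a partitioned table every unique key
-- must include the partition column, hence the composite primary key.
CREATE TABLE master_records (
    GUID_PK     BINARY(16) NOT NULL,
    GUID_SET_ID BINARY(16) NOT NULL,
    -- Col_3 ... Col_20 would go here
    PRIMARY KEY (GUID_PK, GUID_SET_ID)
)
PARTITION BY KEY (GUID_SET_ID)
PARTITIONS 16;   -- an arbitrary NO_PART for the sketch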
You can also look at the other partitioning types in the MySQL documentation.
I have a table with login logs which is an EXTREMELY busy and large InnoDB table. New rows are inserted all the time, the table is queried by other parts of the system, and it is by far the busiest table in the DB. In this table there is logid, which is the PRIMARY KEY and is generated by the software as a random hash (not an auto-increment ID). I also want to store some data, like the number of items viewed.
create table loginlogs
(
    logid bigint unsigned primary key,
    some_data varchar(255),
    viewed_items bigint unsigned
)
viewed_items is a value that will get updated for multiple rows very often (assume thousands of updates / second). The dilemma I am facing now is:
Should I
UPDATE loginlogs SET viewed_items = XXXX WHERE logid = YYYYY
or should I create
create table loginlogs_viewed_items
(
    logid bigint unsigned primary key,
    viewed_items bigint unsigned,
    exported tinyint unsigned default 0
)
and then execute with CRON
UPDATE loginlogs_viewed_items t
INNER JOIN loginlogs l ON l.logid = t.logid
SET
t.exported = 1,
l.viewed_items = t.viewed_items
WHERE
t.exported = 0;
e.g. every hour?
Note that either way the viewed_items counter will be updated MANY TIMES per logid; it can be as much as 100 / hour / logid, and there are tons of rows. So whichever table I choose for this, either the main one or the separate one, it will be updated quite frequently.
I want to avoid unnecessary locking of loginlogs table and at the same time I do not want to degrade performance by duplicating data in another table.
Hmm, I wonder why you'd want to change log entries and not just add new ones...
But anyway, as you said either way the updates have to happen, whether individually or in bulk.
If you have less busy time windows, then updating in bulk might have an advantage. Otherwise the bulk update may have a more significant impact when it runs, in contrast to individual updates, which can "interleave" more with the other operations and so make the impact less noticeable.
If the column you need to update is not needed all the time, you could think of having a separate table just for this column. That way queries that just need the other columns may be less affected by the updates.
"Tons of rows" -- To some people, that is "millions". To others, even "billions" is not really big. Please provide some numbers; the answer can be different. Meanwhile, here are some general principles.
I will assume the table is ENGINE=InnoDB.
UPDATEing one row at a time is 10 times as costly as updating 100 rows at a time.
UPDATEing more than 1000 rows in a single statement is problematic. It will lock each row, potentially leading to delays in other statements and maybe even deadlocks.
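A middle ground is to flush pending changes in capped chunks. Here is a hedged sketch against the loginlogs_viewed_items staging table proposed in the question, to be re-run until no rows are affected:

-- Flush at most 1000 pending counters per statement to bound lock time.
UPDATE loginlogs_viewed_items t
INNER JOIN (
    SELECT logid
    FROM loginlogs_viewed_items
    WHERE exported = 0
    LIMIT 1000
) batch ON batch.logid = t.logid
INNER JOIN loginlogs l ON l.logid = t.logid
SET
    t.exported = 1,
    l.viewed_items = t.viewed_items;
-- Repeat until ROW_COUNT() reports 0 affected rows.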
Having a 'random' PRIMARY KEY (as opposed to AUTO_INCREMENT or something roughly chronologically ordered) is very costly when the table is bigger than the buffer_pool. How much RAM do you have?
"the table is queried by other parts of the system" -- by the random PK? One row at a time? How frequently?
Please elaborate on how exported works. For example, does it get reset to 0 by something else?
Is there a single client doing all the work? Or are there multiple servers throwing data and queries at the table? (Different techniques are needed.)
I have two big tables, for example:
'tbl_items' and 'tbl_items_transactions'
The first table keeps some item metadata; it may have 20 (varchar) columns and millions of rows. The second table keeps each transaction against the first table.
For example, if a user inserts a new record into tbl_items, then a new record is automatically added to tbl_items_transactions with the same data plus the date, username, and transaction type, to keep a history of each row.
So in the above scenario the two tables have the same columns, but tbl_items_transactions has 3 extra columns (date, username, transaction_type) to keep the tbl_items history.
Now assume we have 1000 users who want to insert, update, and delete tbl_items records through a web application, so these two tables grow very quickly (maybe a billion rows in tbl_items_transactions).
I have tried MySQL, MariaDB, and PostgreSQL... they are very good, but once the tables scale up and millions of rows have been inserted, some SELECT queries on tbl_items_transactions become slow... though sometimes PostgreSQL is faster than MySQL or MariaDB.
Now I think I'm doing something wrong... If you were me, would you use MariaDB or PostgreSQL or something like that, and would you structure your database the way I did?
Your setup is wrong.
You should not duplicate the columns from tbl_items in tbl_items_transactions, rather you should have a foreign key in the latter table pointing to the former.
That way data integrity is preserved, and tbl_items_transactions will be much smaller. This technique is called normalization.
To speed up queries when the tables get large, define indexes on them that match the WHERE and JOIN conditions.
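A minimal sketch of that normalized layout; the column names and types here are assumptions, not taken from your schema:

-- tbl_items keeps the current row; tbl_items_transactions keeps only a
-- pointer to it plus the three history columns, instead of 20 duplicates.
-- Assumes tbl_items has item_id as its primary key.
CREATE TABLE tbl_items_transactions (
    transaction_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    item_id          BIGINT UNSIGNED NOT NULL,
    transaction_type ENUM('insert','update','delete') NOT NULL,
    username         VARCHAR(64) NOT NULL,
    created_at       DATETIME NOT NULL,
    INDEX (item_id, created_at),   -- matches per-item history lookups
    FOREIGN KEY (item_id) REFERENCES tbl_items (item_id)
) ENGINE=InnoDB;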
I'm about to build an application that stores up to 500 million records of domain names.
I'll index the '.net' or '.com' part and strip the 'www' at the beginning.
So I believe the table would look like this:
domain_id | domain_name | domain_ext
----------+--------------+-----------
1 | dropbox | 2
2 | digitalocean | 2
domain_ext = 2 means it's a '.com' domain.
The queries I'm about to perform:
I need to be able to insert new domains easily.
I also need to make sure I'm not inserting a duplication (each domain should have only 1 record), so I think to make domain_name + domain_ext as UNIQUE index (with MySQL - InnoDB).
Query domains in batches. For example: SELECT * FROM tbl_domains LIMIT 300000, 600;
What do you think? Will that table hold hundreds of millions of records?
How about partitioning by first letter of the domain name, would that be good?
Let me know your suggestions, I'm open minded.
Partitioning is unlikely to provide any benefit, certainly not if you partition on the first letter.
Don't use OFFSET and LIMIT for batching. Instead "remember where you left off". See my blog for more details.
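For example, a sketch of the "remember where you left off" pattern, assuming domain_id is the ascending key:

-- First batch:
SELECT * FROM tbl_domains
ORDER BY domain_id
LIMIT 600;

-- Every later batch starts right after the last domain_id already seen,
-- so no rows are scanned and thrown away the way OFFSET would do:
SELECT * FROM tbl_domains
WHERE domain_id > ?   -- last domain_id from the previous batch
ORDER BY domain_id
LIMIT 600;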
If you have declared domain_ext to be INT, then I ask why? INT takes 4 bytes. So does .com. Even if you counter with SMALLINT or .uk, I will counter-counter with "The small difference does not justify the complexity."
Edit (on UNIQUE)
A non-partitioned table can have a UNIQUE index. (Note: A PRIMARY KEY is a UNIQUE index.) When you have a UNIQUE index, checking for uniqueness is virtually instantaneous, even for 500M rows. (Drilling down about 5 levels of BTree is very fast.)
With PARTITIONing, every UNIQUE key must include the "partition key". If the domain is not split, you cannot use PARTITION BY RANGE. Splitting off the extension (top-level domain) as an INT, you could use BY RANGE or BY LIST. The UNIQUE would be possible since the TLD is both the partition key and needed as part of the domain. But it would not gain any performance. A lookup would (1) pick the partition ("partition pruning"), then (2) drill down 4-5 levels of BTree to get to the row to check.
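To make that concrete, a hedged sketch of such a table (the partition values are illustrative):

-- Both unique keys include domain_ext, the partition key, as required.
CREATE TABLE tbl_domains (
    domain_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    domain_name VARCHAR(63) NOT NULL,
    domain_ext  INT UNSIGNED NOT NULL,
    PRIMARY KEY (domain_id, domain_ext),
    UNIQUE KEY (domain_name, domain_ext)
)
PARTITION BY LIST (domain_ext) (
    PARTITION p_net VALUES IN (1),
    PARTITION p_com VALUES IN (2)
);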
Conclusion: Doing a uniqueness check, while possible in this case, will not be any faster with PARTITIONing.
Background
I have spent a couple of days trying to figure out how I should handle large amounts of data in MySQL. I have selected some programs and techniques for the new server for the software. I will probably use Ubuntu 14.04 LTS running nginx and Percona Server, with TokuDB for the 3 tables I have planned and InnoDB for the rest of the tables.
But the major problem is still unresolved: how do I handle the huge amount of data in the database?
Data
My estimate of the data volume is 500 million rows a year. I will be receiving measurement data from sensors every 4 minutes.
Requirements
Insertion speed is not very critical, but I want to be able to select a few hundred measurements in 1-2 seconds. The amount of required resources is also a key factor.
Current plan
Now I have thought of splitting the sensor data into 3 tables.
EDIT:
On every table:
id = PK, AI
sensor_id will be indexed
CREATE TABLE measurements_minute(
    id bigint(20),
    value float,
    sensor_id mediumint(8),
    created timestamp
) ENGINE=TokuDB;

CREATE TABLE measurements_hour(
    id bigint(20),
    value float,
    sensor_id mediumint(8),
    created timestamp
) ENGINE=TokuDB;

CREATE TABLE measurements_day(
    id bigint(20),
    value float,
    sensor_id mediumint(8),
    created timestamp
) ENGINE=TokuDB;
So I would store this 4-minute data for one month. Once the data is 1 month old it would be deleted from the minute table, after hourly averages have been calculated from the minute values and inserted into the measurements_hour table. Likewise, when the data is 1 year old, all the hour data would be deleted and daily averages would be stored in the measurements_day table.
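For concreteness, the monthly roll-up step could look roughly like this (a sketch that assumes id is AUTO_INCREMENT, so it can be omitted from the INSERT, and that hourly bucketing is done on created):

-- Aggregate expiring minute rows into hourly averages, then purge them.
INSERT INTO measurements_hour (value, sensor_id, created)
SELECT AVG(value),
       sensor_id,
       DATE_FORMAT(created, '%Y-%m-%d %H:00:00')
FROM measurements_minute
WHERE created < NOW() - INTERVAL 1 MONTH
GROUP BY sensor_id, DATE_FORMAT(created, '%Y-%m-%d %H:00:00');

DELETE FROM measurements_minute
WHERE created < NOW() - INTERVAL 1 MONTH;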
Questions
Is this considered a good way of doing this? Is there something else to take into consideration? How about table partitioning, should I do that? How should I execute the splitting of the data into the different tables? Triggers and procedures?
EDIT: My ideas
Any idea if MonetDB or Infobright would be any good for this?
I have a few suggestions, and further questions.
You have not defined a primary key on your tables, so MySQL will create one automatically. Assuming that you meant for "id" to be your primary key, you need to change the line in all your table create statements to be something like "id bigint(20) NOT NULL AUTO_INCREMENT PRIMARY KEY,".
You haven't defined any indexes on the tables; how do you plan on querying? Without indexes, all queries will be full table scans and likely very slow.
Lastly, for this use-case, I'd partition the tables to make the removal of old data quick and easy.
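For example, a hedged sketch of month-based RANGE partitioning on the minute table (the partition boundaries are illustrative, and the primary key must include the column used in the partitioning expression):

CREATE TABLE measurements_minute (
    id        BIGINT NOT NULL AUTO_INCREMENT,
    value     FLOAT,
    sensor_id MEDIUMINT NOT NULL,
    created   TIMESTAMP NOT NULL,
    PRIMARY KEY (id, created),
    KEY (sensor_id, created)
) ENGINE=TokuDB
PARTITION BY RANGE (UNIX_TIMESTAMP(created)) (
    PARTITION p2014_06 VALUES LESS THAN (UNIX_TIMESTAMP('2014-07-01')),
    PARTITION p2014_07 VALUES LESS THAN (UNIX_TIMESTAMP('2014-08-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Removing an old month is then a near-instant metadata operation
-- instead of a long-running DELETE:
ALTER TABLE measurements_minute DROP PARTITION p2014_06;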
I had to solve that type of problem before, with nearly a million rows per hour.
Some tips:
Engine MyISAM: you don't need updates or transaction management on these tables. You are only going to insert, select the values, and eventually delete them.
Be careful with the indexes. In my case insertion speed was critical, and sometimes the MySQL queue was full of pending inserts; an insert takes more time when the table has more indexes. Which indexes you need depends on the values you calculate and when you calculate them.
Shard your buffer tables. I only triggered the calculations when a table was ready: while I was calculating values from the buffer_a table, the insertions were going into buffer_b. In my case I calculated the values every day, so I switched the destination table every day. In fact, I dumped all the data and exported it into another database to compute the averages and do other processing without disturbing the inserts.
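A rough sketch of that double-buffer switch, with assumed table names and reusing the question's measurements_minute definition:

-- buffer_a receives inserts while buffer_b is being aggregated.
CREATE TABLE buffer_a LIKE measurements_minute;
CREATE TABLE buffer_b LIKE measurements_minute;

-- At the daily switch, swap the two tables in one atomic step:
RENAME TABLE buffer_a TO buffer_swap,
             buffer_b TO buffer_a,
             buffer_swap TO buffer_b;
-- Now aggregate from buffer_b (yesterday's rows) without blocking inserts.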
I hope you find this helpful.
I am wondering what is more efficient and faster in performance:
Having an index on one big table or multiple smaller tables without indexes?
Since this is a pretty abstract problem let me make it more practical:
I have one table with statistics about users (20,000 users and about 30 million rows overall). The table has about 10 columns including the user_id, actions, timestamps, etc.
Most common applications are: Inserting data by user_id and retrieving data by user_id (SELECT statements never include multiple user_id's).
Now so far I have an INDEX on the user_id and the query looks something like this
SELECT * FROM statistics WHERE user_id = 1
Now, with more and more rows the table gets slower and slower. INSERT statements slow down because the INDEX gets bigger and bigger; SELECT statements slow down, well, because there are more rows to search through.
Now I was wondering why not have one statistics table for each user and change the query syntax to something like this instead:
SELECT * FROM statistics_1
where 1 represents the user_id obviously.
This way, no INDEX is needed and there is far less data in each table, so INSERT and SELECT statements should be much faster.
Now my questions again:
Are there any real-world disadvantages to handling so many tables (in my case 20,000) instead of using one table with an INDEX?
Would my approach actually speed things up, or might the lookup of the table eventually slow things down more than it helps?
Creating 20,000 tables is a bad idea. You'll need 40,000 tables before long, and then more.
I called this syndrome Metadata Tribbles in my book SQL Antipatterns Volume 1. You see this happen every time you plan to create a "table per X" or a "column per X".
This does cause real performance problems when you have tens of thousands of tables. Each table requires MySQL to maintain internal data structures, file descriptors, a data dictionary, etc.
There are also practical operational consequences. Do you really want to create a system that requires you to create a new table every time a new user signs up?
Instead, I'd recommend you use MySQL Partitioning.
Here's an example of partitioning the table:
CREATE TABLE statistics (
    id INT AUTO_INCREMENT NOT NULL,
    user_id INT NOT NULL,
    PRIMARY KEY (id, user_id)
) PARTITION BY HASH(user_id) PARTITIONS 101;
This gives you the benefit of defining one logical table, while also dividing the table into many physical tables for faster access when you query for a specific value of the partition key.
For example, when you run a query like your example, MySQL accesses only the one partition containing the specific user_id:
mysql> EXPLAIN PARTITIONS SELECT * FROM statistics WHERE user_id = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: statistics
partitions: p1 <--- this shows it touches only one partition
type: index
possible_keys: NULL
key: PRIMARY
key_len: 8
ref: NULL
rows: 2
Extra: Using where; Using index
The HASH method of partitioning means that the rows are placed in a partition by a modulus of the integer partition key. This does mean that many user_id's map to the same partition, but each partition would have only 1/Nth as many rows on average (where N is the number of partitions). And you define the table with a constant number of partitions, so you don't have to expand it every time you get a new user.
You can choose any number of partitions up to 1024 (or 8192 in MySQL 5.6), but some people have reported performance problems when they go that high.
It is recommended to use a prime number of partitions. In case your user_id values follow a pattern (like using only even numbers), using a prime number of partitions helps distribute the data more evenly.
Re your questions in comment:
How could I determine a reasonable number of partitions?
For HASH partitioning, if you use 101 partitions like I show in the example above, then any given partition has about 1% of your rows on average. You said your statistics table has 30 million rows, so if you use this partitioning, you would have only 300k rows per partition. That is much easier for MySQL to read through. You can (and should) use indexes as well -- each partition will have its own index, and it will be only 1% as large as the index on the whole unpartitioned table would be.
So the answer to how can you determine a reasonable number of partitions is: how big is your whole table, and how big do you want the partitions to be on average?
Shouldn't the amount of partitions grow over time? If so: How can I automate that?
The number of partitions doesn't necessarily need to grow if you use HASH partitioning. Eventually you may have 30 billion rows total, but I have found that when your data volume grows by orders of magnitude, that demands a new architecture anyway. If your data grow that large, you probably need sharding over multiple servers as well as partitioning into multiple tables.
That said, you can re-partition a table with ALTER TABLE:
ALTER TABLE statistics PARTITION BY HASH(user_id) PARTITIONS 401;
This has to restructure the table (like most ALTER TABLE changes), so expect it to take a while.
You may want to monitor the size of data and indexes in partitions:
SELECT table_schema, table_name, table_rows, data_length, index_length
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE partition_method IS NOT NULL;
Like with any table, you want the total size of active indexes to fit in your buffer pool, because if MySQL has to swap parts of indexes in and out of the buffer pool during SELECT queries, performance suffers.
If you use RANGE or LIST partitioning, then adding, dropping, merging, and splitting partitions is much more common. See http://dev.mysql.com/doc/refman/5.6/en/partitioning-management-range-list.html
I encourage you to read the manual section on partitioning, and also check out this nice presentation: Boost Performance With MySQL 5.1 Partitions.
It probably depends on the type of queries you plan on making often, and the best way to know for sure is to just implement a prototype of both and do some performance tests.
With that said, I would expect that a single (large) table with an index will do better overall, because most DBMS systems are heavily optimized for the exact situation of finding and inserting data into large tables. If you try to make many little tables in hopes of improving performance, you're kind of fighting the optimizer (which is usually better at this).
Also, keep in mind that one table is probably more practical for the future. What if you want to get some aggregate statistics over all users? Having 20 000 tables would make this very hard and inefficient to execute. It's worth considering the flexibility of these schemas as well. If you partition your tables like that, you might be designing yourself into a corner for the future.
Concrete example:
I have one table with statistics about users (20,000 users and about 30 million rows overall). The table has about 10 columns including the user_id, actions, timestamps, etc.
Most common applications are: Inserting data by user_id and retrieving data by user_id (SELECT statements never include multiple user_id's).
Do this:
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
...
PRIMARY KEY(user_id, id),
INDEX(id)
Having user_id at the start of the PK gives you "locality of reference". That is, all the rows for one user are clustered together thereby minimizing I/O.
The id on the end of the PK is because the PK must be unique.
The strange-looking INDEX(id) is to keep AUTO_INCREMENT happy.
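Put together, a sketch of the whole table (the non-key columns are placeholders):

CREATE TABLE statistics (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id INT UNSIGNED NOT NULL,
    action  VARCHAR(40) NOT NULL,   -- placeholder column
    created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, id),   -- clusters each user's rows together
    INDEX (id)                   -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB;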
Abstract question:
Never have multiple identical tables.
Use PARTITIONing only if it meets one of the use-cases listed in http://mysql.rjweb.org/doc.php/partitionmaint
A PARTITIONed table needs a different set of indexes than the non-partitioned equivalent table.
In most cases a single, non-partitioned, table is optimal.
Use the queries to design indexes.
There is little to add to Bill Karwin's answer. But one hint: check whether all the data for a user is really needed in complete detail for all time.
If you want to report usage statistics or numbers of visits and the like, you usually don't need the granularity of single actions and seconds for, say, the year 2009 when looking back from today. So you could build aggregation tables and an archive table (not ENGINE=ARCHIVE, of course) to keep the recent data at action level plus an overview of the older actions.
Old actions don't change, I think.
And from the aggregation you can still drill down into detail, for example via a week_id in the archive table.
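For instance, a sketch of a weekly aggregation table and the roll-up that fills it (all names assumed):

CREATE TABLE statistics_weekly (
    user_id INT UNSIGNED NOT NULL,
    week_id INT UNSIGNED NOT NULL,   -- e.g. YEARWEEK(created)
    actions INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, week_id)
) ENGINE=InnoDB;

INSERT INTO statistics_weekly (user_id, week_id, actions)
SELECT user_id, YEARWEEK(created), COUNT(*)
FROM statistics
WHERE created < CURDATE() - INTERVAL 1 YEAR
GROUP BY user_id, YEARWEEK(created);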
Instead of going from 1 table to 1 table per user, you can use partitioning to hit a tables-count/table-size ratio somewhere in the middle.
You can also keep stats on users to try to move 'active' users into 1 table to reduce the number of tables that you have to access over time.
The bottom line is that there is a lot you can do, but largely you have to build prototypes and tests and just evaluate the performance impacts of various changes you are making.