I want to store changes that I make to my "entity" table. This should work like a log. Currently it is implemented with this table in MySQL:
CREATE TABLE `entitychange` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`entity_id` int(10) unsigned NOT NULL,
`entitytype` enum('STRING_1','STRING_2','SOMEBOOL','SOMEDOUBLE','SOMETIMESTAMP') NOT NULL DEFAULT 'STRING_1',
`when` TIMESTAMP NOT NULL,
`value` TEXT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
entity_id = the primary key of my entity table.
entitytype = the field that was changed in the entity table. sometimes only one field is changed, sometimes multiple. one change = one row.
value = the string representation of the "new value" of the field.
For example, when changing the field entity.somedouble from 3 to 2, I run these queries:
UPDATE entity SET somedouble = 2 WHERE entity_id = 123;
INSERT INTO entitychange (entity_id,entitytype,value) VALUES (123,'SOMEDOUBLE',2);
I need to select the changes of a specific entity and entitytype of the last 15 days. For example: The last changes with SOMEDOUBLE for entity_id 123 within the last 15 days.
Now, there are two things that I dislike:
All data is stored as TEXT, although most of it isn't really text (less than 1% is); in my case, most values are DOUBLE. Is this a big problem?
The table is getting really, really slow when inserting, since it already has 200 million rows. Currently my server load is up to 10-15 because of this.
My question: How do I address those two "bottlenecks"? I need to scale.
My approaches would be:
Store it like this: http://sqlfiddle.com/#!2/df9d0 (click on browse) - Store the changes in the entitychange table and then store the value according to its datatype in entitychange_[bool|timestamp|double|string]
Use partitioning by HASH(entity_id) - i thought of ~50 partitions.
Should I use another database system, maybe MongoDB?
If I were facing the problem you mentioned, I would design the LOG table like below:
EntityName: (String) Entity that is being manipulated.(mandatory)
ObjectId: Entity that is being manipulated, primary key.
FieldName: (String) Entity field name.
OldValue: (String) Entity field old value.
NewValue: (String) Entity field new value.
UserCode: Application user unique identifier. (mandatory)
TransactionCode: Any operation changing the entities will need a unique transaction code (like a GUID). (mandatory) When an update changes multiple fields of an entity, this column is the key to tracing all changes made by that update (transaction).
ChangeDate: Transaction date. (mandatory)
FieldType: enumeration or text showing the field type like TEXT or Double. (mandatory)
With this approach:
Any entity (table) can be traced.
Reports will be readable.
Only changes will be logged.
The transaction code will be the key to detecting changes made by a single action.
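A minimal sketch of the log table described above, written in Python with the standard-library sqlite3 module as a stand-in for MySQL so that it can be run anywhere. The table name change_log, the snake_case column names and the helper log_update are assumptions made for illustration; only the list of columns comes from the design above.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE change_log (
        entity_name      TEXT NOT NULL,
        object_id        INTEGER NOT NULL,
        field_name       TEXT NOT NULL,
        old_value        TEXT,
        new_value        TEXT,
        user_code        TEXT NOT NULL,
        transaction_code TEXT NOT NULL,
        change_date      TEXT NOT NULL,
        field_type       TEXT NOT NULL
    )
""")

def log_update(conn, entity_name, object_id, user_code, changes, change_date):
    """Write one row per changed field; all rows share one transaction code.

    `changes` maps field name -> (old_value, new_value, field_type).
    """
    transaction_code = str(uuid.uuid4())
    rows = [
        (entity_name, object_id, field, str(old), str(new),
         user_code, transaction_code, change_date, field_type)
        for field, (old, new, field_type) in changes.items()
        if old != new  # only real changes are logged
    ]
    with conn:  # commit all rows of this update together
        conn.executemany("INSERT INTO change_log VALUES (?,?,?,?,?,?,?,?,?)", rows)
    return transaction_code

tx = log_update(
    conn, "entity", 123, "user-42",
    {"somedouble": (3, 2, "DOUBLE"),
     "string_1": ("a", "b", "TEXT"),
     "somebool": (True, True, "BOOL")},  # unchanged, so it is not logged
    "2013-06-16 17:19:31",
)

# All fields touched by one update can be found through the transaction code.
changed_fields = [r[0] for r in conn.execute(
    "SELECT field_name FROM change_log WHERE transaction_code = ? ORDER BY field_name",
    (tx,))]
print(changed_fields)
```

Looking up by transaction_code returns the two fields that actually changed in that single update, which is the tracing behaviour the design aims for.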
BTW
Store the changes in the entitychange table and then store the value
according to its datatype in entitychange_[bool|timestamp|double|string]
Won't be needed, in the single table you will have changes and data types
Use partitioning by HASH(entity_id)
I would prefer partitioning by ChangeDate, or creating backup tables for rows whose ChangeDate is old enough to be archived and removed from the main LOG table.
Should I use another database system, maybe MongoDB?
Any database comes with its own pros and cons; you can use this design on any RDBMS.
A useful comparison of document-based databases like MongoDB can be found here
Hope this helps.
Now I think I understand what you need: a versionable table with a history of the changed records. This could be another way of achieving the same result, and you could easily run some quick tests to see whether it gives you better performance than your current solution. It's the way the Symfony PHP framework does it in Doctrine with the Versionable plugin.
Keep in mind that there is a unique primary key index on two columns, version and fk_entity.
Also take a look at the values saved. You save a 0 value in the fields which didn't change and the changed value in those which did.
CREATE TABLE `entity_versionable` (
`version` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`fk_entity` INT(10) UNSIGNED NOT NULL,
`str1` VARCHAR(255),
`str2` VARCHAR(255),
`bool1` BOOLEAN,
`double1` DOUBLE,
`date` TIMESTAMP NOT NULL,
PRIMARY KEY (`version`,`fk_entity`)
) ENGINE=INNODB DEFAULT CHARSET=latin1;
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, DATE)
VALUES ("1", "a1", "0", "0", "0", "2013-06-02 17:13:16");
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, DATE)
VALUES ("1", "a2", "0", "0", "0", "2013-06-11 17:13:12");
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, DATE)
VALUES ("1", "0", "b1", "0", "0", "2013-06-11 17:13:21");
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, DATE)
VALUES ("1", "0", "b2", "0", "0", "2013-06-11 17:13:42");
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, DATE)
VALUES ("1", "0", "0", "1", "0", "2013-06-16 17:19:31");
/*Another example*/
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, DATE)
VALUES ("1", "a1", "b1", "0", "0", CURRENT_TIMESTAMP);
SELECT * FROM `entity_versionable` t WHERE
(
(t.`fk_entity`="1") AND
(t.`date` >= (CURDATE() - INTERVAL 15 DAY))
);
Another step that would probably improve performance is to move the history log records into separate tables, once per month or so. That way you won't have many records in each table, and searching by date will be really fast.
There are three main challenges here:
1. How to store data efficiently, i.e. taking less space and being in an easy-to-use format
2. Managing a big table: archiving, ease of backup and restore
3. Performance optimisation: faster inserts and selects
Storing data efficiently
value field. I would suggest making it VARCHAR(N).
Reasons:
Using N ≤ 255 will save 1 byte per row just because of the data type.
Using other data types for this field: fixed types use space whatever the value is, normally 8 bytes per row (datetime, long integer, char(8)), and other variable-length data types are too big for this field.
Also, the TEXT data type results in performance penalties (from the manual on BLOB and TEXT data types):
Instances of TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types. Use of disk incurs a performance penalty, so include BLOB or TEXT columns in the query result only if they are really needed. For example, avoid using SELECT *, which selects all columns.
Each BLOB or TEXT value is represented internally by a separately allocated object. This is in contrast to all other data types, for which storage is allocated once per column when the table is opened.
Basically, TEXT is designed to store big strings and pieces of text, whereas VARCHAR() is designed for relatively short strings.
id field. (updated, thanks to #steve) I agree that this field does not carry any useful information. Use 3 columns for your primary key: entity_id, entitytype and when. TIMESTAMP will guarantee you pretty well that there will be no duplicates. The same columns will also be used for partitioning/sub-partitioning.
Table manageability
There are two main options: MERGE tables and Partitioning. The MERGE storage engine is based on MyISAM, which is being gradually phased out as far as I understand. Here is some reading on the MERGE Storage Engine.
Main tool is Partitioning and it provides two main benefits:
1. Partition switching (which is often an instant operation on a large chunk of data) and the rolling window scenario: insert new data in one table and then instantly switch all of it into an archive table.
2. Storing data in sorted order, which enables partition pruning: querying only those partitions that contain the needed data. MySQL allows sub-partitioning to group data further.
Partitioning by entity_id makes sense. If you need to query data for extended periods of time, or you have another pattern in querying your table, use that column for sub-partitioning. There is no need to sub-partition on all columns of the primary key, unless partitions will be switched at that level.
Number of partitions depends on how big you want db file for that partition to be. Number of sub-partitions depends on number of cores, so each core can search its own partition, N-1 sub-partitions should be ok, so 1 core can do overall coordination work.
Optimisation
Inserts:
Inserts are faster on a table without indexes, so insert a big chunk of data (do your updates), then create the indexes (if possible).
Change TEXT to VARCHAR: it takes some strain off the db engine.
Minimal logging and table locks may help, but they are not often possible to use.
Selects:
Text to Varchar should definitely improve things.
Have a current table with recent data (the last 15 days), then move it to the archive via partition switching. Here you have the option to partition the current table differently from the archive table (e.g. by date first, then entity_id), and to change the partitioning scheme by moving a small amount (1 day) of data to a temp table and changing the partitioning of it.
Also you can consider partitioning by date, you have many queries on date ranges. Put usage of your data and its parts first and then decide which schema will support it best.
And as for your 3rd question, I do not see how use of MongoDB will specifically benefit this situation.
This is called a temporal database, and researchers have been struggling with the best way to store and query temporal data for over 20 years.
Trying to store the EAV data as you are doing is inefficient, in that storing numeric data in a TEXT column uses a lot of space, and your table is getting longer and longer, as you have discovered.
Another option which is sometimes called Sixth Normal Form (although there are multiple unrelated definitions for 6NF), is to store an extra table to store revisions for each column you want to be tracked temporally. This is similar to the solution posed by #xtrm's answer, but it doesn't need to store redundant copies of columns that haven't changed. But it does lead to an explosion in the number of tables.
I've started to read about Anchor Modeling, which promises to handle temporal changes of both structure and content. But I don't understand it well enough to explain it yet. I'll just link to it and maybe it'll make sense to you.
Here are a couple of books that contain discussions of temporal databases:
Joe Celko's SQL for Smarties, 4th ed.
Temporal Data & the Relational Model, C.J. Date, Hugh Darwen, Nikos Lorentzos
Storing an integer in a TEXT column is a no-go! TEXT is the most expensive type.
I would go as far as creating one log table per field you want to monitor:
CREATE TABLE entitychange_somestring (
entity_id INT NOT NULL, -- not a PRIMARY KEY: a log needs many rows per entity
ts TIMESTAMP NOT NULL,
newvalue VARCHAR(50) NOT NULL, -- same type as entity.somestring
KEY(entity_id, ts)
) ENGINE=MyISAM;
Partition them, indeed.
Notice I recommend using the MyISAM engine. You do not need transactions for this (these) unconstrained, insert-only table(s).
Why is INSERTing so slow, and what can you do to make it faster?
These are the things I would look at (and roughly in the order I would work through them):
Creating a new AUTO_INCREMENT-id and inserting it into the primary key requires a lock (there is a special AUTO-INC lock in InnoDB, which is held until the statement finishes, effectively acting as a table lock in your scenario). This is not usually a problem as this is a relatively fast operation, but on the other hand, with a (Unix) load value of 10 to 15, you are likely to have processes waiting for that lock to be freed. From the information you supply, I don't see any use in your surrogate key 'id'. See if dropping that column changes performance significantly. (BTW, there is no rule that a table needs a primary key. If you don't have one, that's fine)
InnoDB can be relatively expensive for INSERTs. This is a trade off made to allow additional functionality such as transactions and may or may not be affecting you. Since all your actions are atomic, I see no need for transactions. That said, give MyISAM a try. Note: MyISAM is usually a bad choice for huge tables because it only supports table locking and not record level locking, but it does support concurrent inserts, so it might be a choice here (especially if you do drop the primary key, see above)
You could play with database storage engine parameters. Both InnoDB and MyISAM have options you could change. Some of them have an impact on how TEXT data is actually stored, others have a broader function. One you should specifically look at is innodb_flush_log_at_trx_commit.
TEXT columns are relatively expensive if (and only if) they have non-NULL values. You are currently storing all values in that TEXT column. It is worth giving the following a try: add extra fields value_int and value_double to your table and store those values in the corresponding column. Yes, that will waste some extra space, but it might be faster; this will largely depend on the database storage engine and its settings. Please note that a lot of what people think about TEXT column performance is not true. (See my answer to a related question on VARCHAR vs TEXT)
You suggested spreading the information over more than one table. This is only a good idea if your tables are fully independent of one another. Otherwise you'll end up with more than one INSERT operation for any change, and you're more than likely to make things a lot worse. While normalizing data is usually good(tm), it is likely to hurt performance here.
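The value_int / value_double idea from the list above can be sketched as follows. This uses Python's sqlite3 module as a runnable stand-in for MySQL; the value_text column and the helpers typed_columns and log_change are hypothetical names introduced for illustration, and whether this layout is actually faster has to be measured on the real engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entitychange (
        entity_id    INTEGER NOT NULL,
        entitytype   TEXT NOT NULL,
        ts           TEXT NOT NULL,
        value_int    INTEGER,   -- booleans and integers
        value_double REAL,      -- the common case in the question
        value_text   TEXT       -- stays NULL unless the value really is text
    )
""")

def typed_columns(value):
    """Return (value_int, value_double, value_text) with exactly one non-NULL slot."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return (int(value), None, None)
    if isinstance(value, int):
        return (value, None, None)
    if isinstance(value, float):
        return (None, value, None)
    return (None, None, str(value))

def log_change(conn, entity_id, entitytype, ts, value):
    """Insert a single log row, still one INSERT per change."""
    conn.execute(
        "INSERT INTO entitychange VALUES (?, ?, ?, ?, ?, ?)",
        (entity_id, entitytype, ts, *typed_columns(value)),
    )

log_change(conn, 123, "SOMEDOUBLE", "2013-06-16 17:19:31", 2.0)
log_change(conn, 123, "SOMEBOOL", "2013-06-16 17:19:32", True)
log_change(conn, 123, "STRING_1", "2013-06-16 17:19:33", "hello")

rows = conn.execute(
    "SELECT entitytype, value_int, value_double, value_text "
    "FROM entitychange ORDER BY ts"
).fetchall()
print(rows)
```

Each change is still a single INSERT into a single table, so this avoids the extra writes that a one-table-per-type layout would need.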
What can you do to make SELECTs run fast
Proper keys. And proper keys. And just in case I forgot to mention: proper keys. You don't specify in detail what your selects look like, but I assume them to be similar to "SELECT * FROM entitychange WHERE entity_id=123 AND ts>...". A single compound index on entity_id and ts should be enough to make this operation fast. Since the index has to be updated with every INSERT, it may be worth trying the performance of both entity_id, ts and ts, entity_id: It might make a difference.
Partitioning. I wouldn't even bring this subject up if you hadn't asked in your question. You don't say why you'd like to partition the table. Performance-wise it usually makes no difference, provided that you have proper keys. There are some specific setups that can boost performance, but you'll need the proper hardware setup to go along with this. If you do decide to partition your table, consider doing that by either the entity_id or the TIMESTAMP column. Using the timestamp, you could end up with an archiving system where older data is put on an archive drive. Such a partitioning system would, however, require some maintenance (adding partitions over time).
It seems to me that you're not as concerned about query performance as about the raw insert speed, so I won't go into more detail on SELECT performance. If this does interest you, please provide more detail.
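As a rough illustration of the compound index on entity_id and ts mentioned above, the following sketch asks the query planner whether that index is used. It runs against SQLite through Python's sqlite3 module, which is only a stand-in: MySQL's EXPLAIN output looks different, and the index name idx_entity_ts is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entitychange (
        entity_id  INTEGER NOT NULL,
        entitytype TEXT NOT NULL,
        ts         TEXT NOT NULL,
        value      TEXT
    )
""")
# One compound index: equality column first, range column second.
conn.execute("CREATE INDEX idx_entity_ts ON entitychange (entity_id, ts)")

query = "SELECT * FROM entitychange WHERE entity_id = ? AND ts > ?"
plan = conn.execute(
    "EXPLAIN QUERY PLAN " + query, (123, "2013-06-01 00:00:00")
).fetchall()

# The last column of each plan row is the human-readable description.
plan_detail = " ".join(row[3] for row in plan)
print(plan_detail)
```

A plan that starts with SEARCH and names the index means the rows are located through the index rather than by scanning the table. In MySQL the equivalent check is to run EXPLAIN on the real query and look at the key column.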
I would advise you to do a lot of in-depth testing, but in my tests I am achieving very good results with both INSERT and SELECT using the table definition I posted before. I will describe my tests in detail so that anyone can easily repeat them and check whether they get better results. Back up your data before any test.
I must say that these are only tests, and may not reflect or improve your real case, but it's a good way of learning and probably a way of finding useful information and results.
The advice we have seen here is really good, and you will surely notice a great speed improvement by using a sized VARCHAR instead of TEXT. Although you could gain speed with MyISAM, I would advise against it for data integrity reasons; stay with InnoDB.
TESTING:
1. Set up the table and INSERT 200 million rows:
CREATE TABLE `entity_versionable` (
`version` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`fk_entity` INT(10) UNSIGNED NOT NULL,
`str1` VARCHAR(255) DEFAULT NULL,
`str2` VARCHAR(255) DEFAULT NULL,
`bool1` TINYINT(1) DEFAULT NULL,
`double1` DOUBLE DEFAULT NULL,
`date` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`version`,`fk_entity`)
) ENGINE=INNODB AUTO_INCREMENT=230297534 DEFAULT CHARSET=latin1
In order to insert 200+ million rows into a table in about 35 minutes, please check my other question, where peterm has answered with one of the best ways to fill a table. It works perfectly.
Execute the following query 2 times in order to insert 200 million rows of non-random data (each run inserts 100 million rows; change the data each time to insert random data):
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, DATE)
SELECT 1, 'a1', 238, 2, 524627, '2013-06-16 14:42:25'
FROM
(
SELECT a.N + b.N * 10 + c.N * 100 + d.N * 1000 + e.N * 10000 + f.N * 100000 + g.N * 1000000 + h.N * 10000000 + 1 N FROM
(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) c
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) d
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) e
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) f
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) g
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) h
) t;
*Since you already have the original table with 200 million rows of real data, you probably won't need to fill it; just export your table data and schema and import them into a new testing table with the same schema. That way you will run tests on a new table with your real data, and the improvements you get will also work for the original one.
2. ALTER the new test table for performance (or use my example above in step 1 to get better results).
Once we have our new test table set up and filled with data, we should check the advice above and ALTER the table to speed it up:
Change TEXT to VARCHAR(255).
Select and make a good unique primary key index with two or three columns. Test with the auto-increment version and fk_entity in your first test.
Partition your table if necessary, and check whether it improves speed. I would advise not partitioning it in your first tests, so that you can measure the real performance gain from changing data types and MySQL configuration. Check the following link for some partitioning and improvement tips.
Optimize your table. The indexes will be rebuilt, which speeds up searches a lot:
OPTIMIZE TABLE test.entity_versionable;
REPAIR TABLE test.entity_versionable; -- only for MyISAM tables; InnoDB does not support REPAIR TABLE
*Write a script that runs the optimization to keep your indexes up to date, and launch it every night.
3. Improve your MySQL and hardware configuration by carefully reading the following threads. They are worth reading and I'm sure you will get better results.
Easily improve your database disk configuration by spending a bit of money: if possible, use an SSD for your main MySQL database and a standalone mechanical hard disk for backup purposes. Set MySQL logs to be saved on a third hard disk to improve the speed of your INSERTs. (Remember to defragment mechanical hard disks after some weeks.)
Performance links: general&multiple-cores, configuration,
optimizing IO, Debiancores, best configuration,
config 48gb ram..
Profiling a SQL query: How to profile a query, Check for possible bottleneck in a query
MySQL is very memory intensive; use low-latency CL7 DDR3 memory if possible. A bit off topic, but if your system data is critical, you may look for ECC memory, although it's expensive.
4. Finally, test your INSERTs and SEARCHes on the test table. In my tests with 200+ million rows of data and the above table schema, it takes 0.001 seconds to INSERT a new row and about 2 minutes to search and SELECT 100 million rows. It's only a test, but these seem to be good results :)
5. My System Configuration:
Database: MySQL 5.6.10 InnoDB database (test).
Processor: AMD Phenom II 1090T X6 core, 3910Mhz each core.
RAM: 16GB DDR3 1600Mhz CL8.
HD: Windows 7 64bits SP1 in SSD, mySQL installed in SSD, logs written in mechanical hard disk.
We would probably get better results with one of the latest Intel i5 or i7 processors easily overclocked to 4500MHz+, since MySQL only uses one core per SQL statement. The higher the core speed, the faster it will be executed.
6. Read more about MySQL:
O'Reilly High Performance MySQL
MySQL Optimizing SQL Statements
7. Using another database:
MongoDB or Redis will be perfect for this case and probably a lot faster than MySQL. Both are very easy to learn, and both have their advantages:
- MongoDB: MongoDB log file growth
- Redis
I would definitely go for Redis. If you learn how to save the log in Redis, it will be the best way to manage the log with insanely high speed:
redis for logging
Keep in mind the following advice if you use Redis:
Redis is written in C and stores its data in memory. It has several methods to automatically save the information to disk (persistence), so you probably won't have to worry about it. (In a disaster scenario you will end up losing about 1 second of logging.)
Redis is used on a lot of sites that manage terabytes of data. There are a lot of ways to handle that insane amount of information, which suggests it is dependable (used here on Stack Overflow, Blizzard, Twitter, YouPorn...).
Since your log will be very big, it will need to fit in memory to be fast without having to access the hard disk. You may save different logs for different dates and keep only some of them in memory. What happens when the memory limit is reached depends on the configured maxmemory policy (keys may be evicted or writes rejected), so check the Redis FAQ for more information.
I'm totally sure that Redis will be a lot faster for this purpose than MySQL. You will need to learn how to work with lists and sets to update data and query/search for data. If you need really advanced query searches, you should go with MongoDB, but for this case of simple date searches Redis will be perfect.
Nice Redis article in Instagram Blog.
At work we have logtables on almost every table due to customer conditions (financial sector).
We have done it this way: two tables (the "normal" table and a log table), plus triggers on insert/update/delete of the normal table which store a keyword (I, U, D) and the old record (on update, delete) or the new one (on insert) in the log table.
We have both tables in the same database-schema
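The trigger setup described above might look roughly like this. The sketch uses SQLite via Python's sqlite3 module so that it is runnable as-is; MySQL trigger syntax differs slightly (for example it needs FOR EACH ROW), and the table names account and account_log as well as the trigger names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL);

    CREATE TABLE account_log (
        action    TEXT NOT NULL,      -- 'I', 'U' or 'D'
        id        INTEGER NOT NULL,
        balance   REAL NOT NULL,
        logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Insert: log the new record.
    CREATE TRIGGER account_ai AFTER INSERT ON account BEGIN
        INSERT INTO account_log (action, id, balance) VALUES ('I', NEW.id, NEW.balance);
    END;
    -- Update: log the old record.
    CREATE TRIGGER account_au AFTER UPDATE ON account BEGIN
        INSERT INTO account_log (action, id, balance) VALUES ('U', OLD.id, OLD.balance);
    END;
    -- Delete: log the old record.
    CREATE TRIGGER account_ad AFTER DELETE ON account BEGIN
        INSERT INTO account_log (action, id, balance) VALUES ('D', OLD.id, OLD.balance);
    END;
""")

conn.execute("INSERT INTO account (id, balance) VALUES (1, 100.0)")
conn.execute("UPDATE account SET balance = 80.0 WHERE id = 1")
conn.execute("DELETE FROM account WHERE id = 1")

log = conn.execute(
    "SELECT action, id, balance FROM account_log ORDER BY rowid"
).fetchall()
print(log)
```

The application only touches the normal table; the log rows are written by the database itself, so no code path can forget to log a change.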
I am facing a complete mystery.
I have created a table to store meteorological data. I have one value per hour, since 1979, for every 0.25 degrees of latitude and longitude.
This gives me billions of rows in the database.
Following various pieces of advice, I partitioned the table.
I chose to partition by year. This is how it looks:
CREATE TABLE `MyTable` (
`latitude_100` SMALLINT NOT NULL, -- Smallint is 2 bytes, where float is 4. So we take latitude * 100
`longitude_100` SMALLINT NOT NULL, -- Same logic here
`time` DATETIME NOT NULL,
`final` TINYINT UNSIGNED NOT NULL,
`value` DOUBLE NOT NULL,
PRIMARY KEY (`latitude_100` ASC, `longitude_100` ASC, `time` ASC)
)
PARTITION BY HASH(YEAR(time)) PARTITIONS 45 ; -- This will work until 2023 included
In order to test, I loaded only the data from 2015 to 2021 into the table.
The problem:
All SELECTs from this table are extremely slow.
Even worse, they are sometimes absurdly slow.
For example:
SELECT time, latitude_100, longitude_100, value
FROM MyTable
WHERE latitude_100 BETWEEN 500 AND 2000
AND longitude_100 BETWEEN 11600 AND 12800 AND
YEAR(time) = 1990 ;
Remember that there is NO data for 1990. By looking in the right partition, MySQL should see that immediately, shouldn't it?
EXPLAIN tells me that MySQL will look in all partitions, and I do not understand why:
EXPLAIN SELECT time, latitude_100, longitude_100, value
FROM MyTable
WHERE latitude_100 BETWEEN 500 AND 2000
AND longitude_100 BETWEEN 11600 AND 12800 AND
YEAR(time) = 1990 ;
# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
1, SIMPLE, MyTable, p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32,p33,p34,p35,p36,p37,p38,p39,p40,p41,p42,p43,p44, range, PRIMARY, PRIMARY, 4, , 118295536, 11.11, Using where
When I do
SELECT * FROM information_schema.partitions WHERE TABLE_SCHEMA='MySchema' AND TABLE_NAME = 'MyTable' AND PARTITION_NAME IS NOT NULL
I can see that only 6 partitions have data; all the others are empty.
The last thing I tried was to formulate the WHERE clause differently, to maybe take advantage of the index:
SELECT time, latitude_100, longitude_100, value
FROM MyTable
WHERE latitude_100 BETWEEN 500 AND 2000
AND longitude_100 BETWEEN 11600 AND 12800 AND
time BETWEEN "1990-01-01 00:00:00" AND "1990-12-31 23:00:00" AND
YEAR(time) = 1990 ;
But this does not speed up the execution. Only the EXPLAIN output is a bit different (but not in terms of the partitions read):
# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
1, SIMPLE, MyTable, p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32,p33,p34,p35,p36,p37,p38,p39,p40,p41,p42,p43,p44, range, PRIMARY, PRIMARY, 9, , 118295536, 1.23, Using where
What am I doing wrong?
Why does MySQL not want to cooperate with partitioning?
Thank you very much!
[Edit]
On the technical side, the database is hosted on AWS RDS. It is powered by a "db.t4g.large" instance and uses MySQL 8.0.27.
Do not use PARTITION BY HASH! HASH will fail to do any pruning when using a date range (as you have!). Simply put, the Optimizer is not smart enough to see that your range fits in a single partition. Furthermore, HASH may unnecessarily be lumping two different years into the same partition. Instead, use PARTITION BY RANGE.
I know that RANGE(TO_DAYS(time)) works; perhaps RANGE(YEAR(time)) may work, depending on what version of MySQL you are using; check the specifics.
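To see where rows land under the question's PARTITION BY HASH(YEAR(time)) PARTITIONS 45, here is a small Python sketch. It assumes MySQL's regular (non-LINEAR) HASH partitioning, which places a row in partition MOD(expr, number_of_partitions); the helper name hash_partition is made up for illustration.

```python
PARTITIONS = 45

def hash_partition(year, partitions=PARTITIONS):
    """Partition number for HASH(YEAR(time)): MOD(year, partitions)."""
    return year % partitions

# The seven years that were loaded for testing.
loaded = {year: hash_partition(year) for year in range(2015, 2022)}
print(loaded)

# 1979..2023 is exactly 45 years, so every year gets its own partition...
distinct = len({hash_partition(y) for y in range(1979, 2024)})
print(distinct)

# ...but one year later two different years share a partition.
print(hash_partition(1979), hash_partition(2024))
```

So with this scheme 2024 would be stored together with 1979, which is the kind of lumping mentioned above. With RANGE partitioning the year boundaries are explicit and do not wrap around.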
Hour: With some date arithmetic, you can shrink a 5-byte DATETIME down to a 3-byte MEDIUMINT. (A suitable change to PARTITION BY RANGE would be needed.)
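One possible form of that date arithmetic, sketched in Python: store the number of whole hours since a fixed origin. The origin of 1979-01-01 00:00 is an assumption based on the dataset starting in 1979, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1979, 1, 1)      # assumed origin: first hour of the dataset
MEDIUMINT_MAX = 2**23 - 1         # signed MEDIUMINT upper bound: 8,388,607

def to_hour_number(ts):
    """Whole hours elapsed since EPOCH (the value to store in the MEDIUMINT)."""
    return (ts - EPOCH) // timedelta(hours=1)

def from_hour_number(n):
    """Inverse mapping, back to a datetime."""
    return EPOCH + timedelta(hours=n)

last = to_hour_number(datetime(2023, 12, 31, 23))
print(last, last <= MEDIUMINT_MAX)
print(from_hour_number(last))
```

Hourly data through the end of 2023 stays far below the MEDIUMINT limit, so the 3-byte column has plenty of headroom.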
Not enough: Since you are testing with only 7 years of data, my Partitioning suggestion will help only by a factor of 7.
DOUBLE? What are you measuring? DOUBLE takes 8 bytes and gives you about 16 significant digits. Even FLOAT (4 bytes, 7 digits) is likely to be overkill. For temperature (°C), consider DECIMAL(2) or TINYINT (-128..+127) or DECIMAL(4,2); they are 1,1,2 bytes, respectively. Extremes recorded: -89..+57. Note: °F would need one more byte in any INT or DECIMAL encoding. (I would guess that an instrument too close to a volcano or wildfire would fail to transmit data if the temp exceeded 99°C.)
Shrinking the DOUBLE would shrink the dataset size by about 1/3 -- worth the effort.
If you will end up with about 400 billion rows, datatype size is very important.
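The "about 1/3" figure can be checked with simple arithmetic on the column sizes of the table in the question. This Python sketch counts only the raw column bytes (with DATETIME at 5 bytes, as stated above) and ignores per-row and index overhead, so the real saving on disk will differ.

```python
# Raw storage per column for the question's table, in bytes.
COLUMN_BYTES = {
    "latitude_100": 2,   # SMALLINT
    "longitude_100": 2,  # SMALLINT
    "time": 5,           # DATETIME
    "final": 1,          # TINYINT
    "value": 8,          # DOUBLE
}

current = sum(COLUMN_BYTES.values())

# Same row with value stored as DECIMAL(4,2), which takes 2 bytes.
smaller_value = dict(COLUMN_BYTES, value=2)
shrunk = sum(smaller_value.values())

saving = 1 - shrunk / current
print(current, shrunk, round(saving, 2))
```

Also replacing the DATETIME with a 3-byte MEDIUMINT hour number would bring the row down by another 2 bytes.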
So, let's dig deeper... Please provide
Amount of RAM
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
Any other SELECTs that you are likely to run, including WHERE clauses other than exactly one year.
How much disk space did your 7 years take? If using MyISAM, I would expect about 1.2TB; if using InnoDB, 3TB.
The lat/lng ranges in the sample Select were relatively small. Is this typical? If so, we may be able to take advantage of it.
ENGINE -- Since this is, I assume, mostly a readonly dataset, it may be a rare case where MyISAM is better. See estimates above; multiply by 6 to get estimates for the 43 years.
Usage -- What will you do with the results of a SELECT like the one you have? If that is the 'only' query, then there are more compact ways to store the data. But they will be more complex to Insert and Select. However, the speed improvement may be worth it. I need to see the various Selects before advising further.
For example, I have some GPS devices that send info to my database every second,
so 1 device creates 1 row per second in the MySQL database, with these 8 columns:
id=12341 date=22.02.2018 time=22:40
latitude=22.236558789 longitude=78.9654582 deviceID=24 name=device-name someinfo=asdadadasd
So in 1 minute it creates 60 rows, and in 24 hours it creates 86400 rows,
and in 1 month (31 days) 2678400 rows.
So 1 device creates 2.6 million rows per month in my db table (records are deleted every month),
and with more devices it will be 2.6 million * number of devices.
So my questions are:
Question 1: If I run a search like this from PHP (just for the current day and for 1 device)
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24'
the maximum possible result will be 86400 rows.
Will that overload my server too much?
Question 2: If I limit it to 5 hours (18000 rows), will that be a problem for the database? Will it load the server like the first example, or less?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 18000
Question 3: If I show just 1 result from the db, will it overload the server?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 1
Does it mean that, if I have millions of rows, showing 1000 rows loads the server the same as showing just 1 result?
Millions of rows is not a problem, this is what SQL databases are designed to handle, if you have a well designed schema and good indexes.
Use proper types
Instead of storing your dates and times as separate strings, store them either as a single datetime or as separate date and time types. See indexing below for more about which one to use. This is more compact, allows indexing and faster sorting, and makes the date and time functions available without having to do conversions.
Similarly, be sure to use the appropriate numeric type for latitude, and longitude. You'll probably want to use numeric to ensure precision.
Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular int can only go up to about 2 billion.
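A quick back-of-the-envelope check of that limit, in Python, using the question's rate of one row per second per device. It assumes a signed 32-bit INT and that an auto-increment key keeps counting up even though old records are deleted every month.

```python
ROWS_PER_DAY = 24 * 60 * 60          # one row per second per device
ROWS_PER_MONTH = ROWS_PER_DAY * 31

INT_MAX = 2**31 - 1                  # signed 32-bit INT
BIGINT_MAX = 2**63 - 1               # signed 64-bit BIGINT

# How many "device-months" of inserts fit before an INT key runs out.
device_months = INT_MAX // ROWS_PER_MONTH
print(ROWS_PER_DAY, ROWS_PER_MONTH, device_months)

# Example: with 100 devices, roughly how many months until INT overflows.
print(device_months // 100)
```

With a bigint key the same calculation gives a number of device-months so large that it is not a practical concern.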
Move repeated data into another table.
Instead of storing information about the device in every row, store that in a separate table. Then only store the device's ID in your log. This will cut down on your storage size, and eliminate mistakes due to data duplication. Be sure to declare the device ID as a foreign key, this will provide referential integrity and an index.
Add indexes
Indexes are what allow a database to search through millions or billions of rows very, very efficiently. Be sure there are indexes on the columns you query frequently, such as your timestamp.
A lack of indexes on date and deviceID is likely why your queries are so slow. Without an index, MySQL has to look at every row in the table, which is known as a full table scan.
You can discover whether your queries are using indexes with explain.
datetime or time + date?
Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date to get just the date part like so.
select *
from gps_logs
where date(created_at) = '2018-07-14'
There's a problem. The problem is how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index. MySQL will run date(created_at) on every single row. This means a performance killing full table scan.
You can work around this by working with just the datetime column. This will use an index and be efficient.
select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
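The difference between the two where clauses can be checked with the database's query plan. Here is a hedged sketch using Python's built-in SQLite and its `explain query plan` statement as a stand-in for MySQL's explain; the index name ix_created_at and the trimmed-down table are assumptions, and the plan wording differs between database engines and versions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table gps_logs (
        id integer primary key,
        created_at text not null,
        latitude real not null
    )
""")
conn.execute("create index ix_created_at on gps_logs(created_at)")

def plan(sql):
    # The fourth column of each plan row holds the human-readable step.
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

wrapped = plan("select * from gps_logs where date(created_at) = '2018-07-14'")
bare = plan(
    "select * from gps_logs "
    "where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'"
)
print(wrapped)  # a SCAN step: every row is visited
print(bare)     # a SEARCH step that uses ix_created_at
```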
Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges which cross a day boundary becomes difficult. Like maybe you want a day in a different time zone. It's easy with a single column.
select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'
But it's more involved with a separate date and time.
select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
or (created_date = '2018-07-13' and created_time < '10:00:00');
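Or you can switch to a database with expression indexes, like PostgreSQL. An expression index lets you index the result of a function or expression, such as the date part of a timestamp. (A partial index is a different feature: it indexes only the rows matching a condition.) And PostgreSQL does a lot of things better than MySQL. This is what I recommend.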
Do as much work in SQL as possible.
For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and calculating them yourself, you'd use group by to group them by device and day.
select gps_device_id, count(id) as num_entries, created_at::date as day
from gps_logs
group by gps_device_id, day;
gps_device_id | num_entries | day
---------------+-------------+------------
1 | 29310 | 2018-07-12
2 | 23923 | 2018-07-11
2 | 23988 | 2018-07-12
With this much data, you will want to rely heavily on group by and the associated aggregate functions like sum, count, max, min and so on.
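As a small runnable illustration of letting the database do the counting, here is a sketch using Python's built-in SQLite (a stand-in chosen only to keep the example self-contained; the sample rows are made up). SQLite's date() plays the role of created_at::date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table gps_logs (
        id integer primary key,
        gps_device_id integer not null,
        created_at text not null
    )
""")
conn.executemany(
    "insert into gps_logs (gps_device_id, created_at) values (?, ?)",
    [
        (1, "2018-07-12 08:00:00"),
        (1, "2018-07-12 09:30:00"),
        (2, "2018-07-11 23:59:00"),
        (2, "2018-07-12 00:01:00"),
        (2, "2018-07-12 12:00:00"),
    ],
)

# One result row per device per day, instead of pulling every log row
# into the application and counting there.
per_day = conn.execute("""
    select gps_device_id, date(created_at) as day, count(id) as num_entries
    from gps_logs
    group by gps_device_id, date(created_at)
    order by gps_device_id, date(created_at)
""").fetchall()
print(per_day)
```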
Avoid select *
If you must retrieve 86400 rows, simply fetching all that data from the database can be costly. You can speed this up significantly by fetching only the columns you need. This means using select only, the, specific, columns, you, need rather than select *.
Putting it all together.
In PostgreSQL
Your schema in PostgreSQL should look something like this.
create table gps_devices (
id serial primary key,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigserial primary key,
gps_device_id int references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);
A query can generally only use one index per table. Since you'll be searching on the timestamp and device ID together a lot, timestamp_and_device indexes both columns together.
date_and_device is the same thing, but it's an expression index on just the date part of the timestamp. This will make where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.
In MySQL
create table gps_devices (
id int primary key auto_increment,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigint primary key auto_increment,
gps_device_id int,
foreign key (gps_device_id) references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
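Very similar, but without the expression index, which older MySQL versions do not support (functional key parts arrived in MySQL 8.0.13). So you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time types.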
I just read your question; for me the answer is:
Just create a separate table for latitude and longitude, make your ID a foreign key, and save the values there.
Without knowing the exact queries you want to run I can just guess the best structure. Having said that, you should aim for the optimal types that use the minimum number of bytes per row. This should make your queries faster.
For example, you could use the structure below:
create table device (
id int primary key not null,
name varchar(20),
someinfo varchar(100)
);
create table location (
device_id int not null,
recorded_at timestamp not null,
latitude double not null, -- instead of varchar; maybe float?
longitude double not null, -- instead of varchar; maybe float?
foreign key (device_id) references device (id)
);
create index ix_loc_dev on location (device_id, recorded_at);
If you include the exact queries (naming the columns) we can create better indexes for them.
Since your query selectivity is probably bad, your queries may run full table scans. For that case I took it a step further and used the smallest possible data types for the columns, so scans will be faster:
create table location (
device_id tinyint not null, -- device.id must also be tinyint for the foreign key to be accepted
recorded_at timestamp not null,
latitude float not null,
longitude float not null,
foreign key (device_id) references device (id)
);
Can't really think of anything smaller than this.
The best I can recommend is to use a time-series database for storing and accessing time-series data. You can host any kind of time-series database engine locally and just put a little more resources into developing its access methods, or use a database specialized for telematics data like this.
The application we are developing writes around 4-5 million rows of data every day, and we need to keep this data for the past 90 days.
The table user_data has the following structure (simplified):
id INT PRIMARY AUTOINCREMENT
dt TIMESTAMP CURRENT_TIMESTAMP
user_id varchar(20)
data varchar(20)
About the application:
Data older than 7 days will not be written / updated.
Data is mostly accessed based on user_id (i.e. all queries will have WHERE user_id = XXX)
There are around 13000 users at the moment.
User can still access older data. But, in accessing the older data, we can restrict that he/she can only get the whole day data only and not a time range. (e.g. If a user attempts to get the data for 2016-10-01, he/she will get the data for the whole day and will not be able to get the data for 2016-10-01 13:00 - 2016-10-01 14:00).
At the moment, we are using MySQL InnoDB to store the latest data (i.e. 7 days and newer) and it is working fine and fits in the innodb_buffer_pool.
As for the older data, we created smaller tables in the form of user_data_YYYYMMDD. After a while, we figured that these tables cannot fit into the innodb_buffer_pool and it started to slow down.
We think that, on top of splitting by date, sharding by user_id would be better (i.e. using smaller data sets per user and date, such as user_data_[YYYYMMDD]_[USER_ID]). This will keep each table much smaller (only around 10K rows at most).
After researching around, we have found that there are a few options out there:
Using mysql tables to store per user per date (i.e. user_data_[YYYYMMDD]_[USER_ID]).
Using mongodb collection for each user_data_[YYYYMMDD]_[USER_ID]
Write the old data (json encoded) into [USER_ID]/[YYYYMMDD].txt
The biggest con I see in this is that we will have a huge number of tables/collections/files when we do this (i.e. 13000 x 90 = 1,170,000). I wonder if we are approaching this the right way in terms of future scalability. Or, if there are other standardized solutions for this.
Scaling a database is a problem unique to each application. Most of the time someone else's approach cannot be reused, since almost every application writes its data in its own way. So you have to figure out how you are going to manage your data.
Having said that, if your data continues to grow, the best solution is sharding, where you distribute the data across different servers. As long as you are bound to a single server, for example by creating different tables, you will be hit by resource limits such as memory, storage and processing power. Those cannot be increased without limit.
How to distribute the data is something you have to figure out from your business use cases. As you mentioned, if you don't get many requests for old data, the best way is to distribute the data by date: a DB for 2016 data, a DB for 2015, and so on. Later you can purge or shut down the servers holding the oldest data.
This is a big table, but not unmanageable.
If user_id + dt is UNIQUE, make it the PRIMARY KEY, and get rid of id, thereby saving space. (More in a minute...)
Normalize user_id to a SMALLINT UNSIGNED (2 bytes) or, to be safer MEDIUMINT UNSIGNED (3 bytes). This will save a significant amount of space.
Saving space is important for speed (I/O) for big tables.
PARTITION BY RANGE(TO_DAYS(dt))
with 92 partitions -- the 90 you need, plus 1 waiting to be DROPped and one being filled. See details here.
ENGINE=InnoDB
to get the PRIMARY KEY clustered.
PRIMARY KEY(user_id, dt)
If this is "unique", then it allows efficient access for any time range for a single user. Note: you can remove the "just a day" restriction. However, you must formulate the query without hiding dt in a function. I recommend:
WHERE user_id = ?
AND dt >= ?
AND dt < ? + INTERVAL 1 DAY
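To make the half-open range concrete, here is a small sketch using Python's built-in SQLite as a stand-in for MySQL. The helper name rows_for_day and the sample rows are hypothetical, and the day bounds are computed in Python rather than with INTERVAL:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table user_data (
        user_id integer not null,
        dt text not null,
        data text,
        primary key (user_id, dt)
    )
""")
conn.executemany(
    "insert into user_data (user_id, dt, data) values (?, ?, ?)",
    [
        (7, "2016-09-30 23:59:59", "before"),
        (7, "2016-10-01 00:00:00", "start of day"),
        (7, "2016-10-01 13:30:00", "midday"),
        (7, "2016-10-02 00:00:00", "next day"),
        (8, "2016-10-01 13:30:00", "other user"),
    ],
)

def rows_for_day(user_id, day):
    # Half-open range [day, day + 1 day): dt stays bare, so the key on
    # (user_id, dt) can be used to find the rows.
    start = day.isoformat() + " 00:00:00"
    end = (day + timedelta(days=1)).isoformat() + " 00:00:00"
    return conn.execute(
        "select data from user_data where user_id = ? and dt >= ? and dt < ? order by dt",
        (user_id, start, end),
    ).fetchall()

day_rows = rows_for_day(7, date(2016, 10, 1))
print(day_rows)
```

The same shape works for any time range, not just whole days; only the two bounds change.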
Furthermore,
PRIMARY KEY(user_id, dt, id),
INDEX(id)
Would also be efficient even if (user_id, dt) is not unique. The addition of id to the PK is to make it unique; the addition of INDEX(id) is to keep AUTO_INCREMENT happy. (No, UNIQUE(id) is not required.)
INT --> BIGINT UNSIGNED ??
INT (which is SIGNED) will top out at about 2 billion. At 4-5 million rows a day, that will happen in well under two years. Is that OK? If not, you may need BIGINT (8 bytes vs 4).
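A rough back-of-the-envelope check of that limit, assuming the id grows by one per inserted row with no gaps and taking 4.5 million rows a day as the midpoint of the question's figure (both are assumptions):

```python
INT_SIGNED_MAX = 2**31 - 1   # largest value a signed 4-byte INT can hold
ROWS_PER_DAY = 4_500_000     # assumed midpoint of "4-5 million rows" a day

days_until_full = INT_SIGNED_MAX / ROWS_PER_DAY
years_until_full = days_until_full / 365
print(round(days_until_full), round(years_until_full, 1))
```

INT UNSIGNED would double that headroom; BIGINT removes the concern for practical purposes.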
This partitioning design does not care about your 7-day rule. You may choose to keep the rule and enforce it in your app.
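Writing 92 daily partitions by hand is tedious, so the clause is usually generated. The following is only a sketch: the function name daily_partitions and the pYYYYMMDD partition names are made up for illustration, and it assumes dt is a DATETIME column, since MySQL accepts TO_DAYS() as a partitioning function for DATE and DATETIME columns while a TIMESTAMP column needs UNIX_TIMESTAMP() instead:

```python
from datetime import date, timedelta

def daily_partitions(first_day, count):
    # One RANGE partition per day; each holds rows with dt before the next midnight.
    lines = []
    for offset in range(count):
        day = first_day + timedelta(days=offset)
        upper = day + timedelta(days=1)
        lines.append(
            f"PARTITION p{day:%Y%m%d} VALUES LESS THAN (TO_DAYS('{upper.isoformat()}'))"
        )
    return lines

parts = daily_partitions(date(2016, 10, 1), 92)
ddl = "PARTITION BY RANGE (TO_DAYS(dt)) (\n  " + ",\n  ".join(parts) + "\n)"
print(ddl)
```

A daily job would then drop the oldest partition and add a new empty one for the coming day.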
BY HASH
will not work as well.
SUBPARTITION
is generally useless.
Are there other queries? If so they must be taken into consideration at the same time.
Sharding by user_id would be useful if the traffic were too much for a single server. MySQL, itself, does not (yet) have a sharding solution.
Try TokuDB engine at https://www.percona.com/software/mysql-database/percona-tokudb
Archive data are great for TokuDB. You will need about six times less disk space to store AND memory to PROCESS your dataset compared to InnoDB or about 2-3 times less than archived myisam.
1 million+ tables sounds like a bad idea. Having sharding via dynamic table naming by the app code at runtime has also not been a favorable pattern for me. My first go-to for this type of problem would be partitioning. You probably don't want 400M+ rows in a single unpartitioned table. In MySQL 5.7 you can even subpartition (but that gets more complex). I would first range partition on your date field, with one partition per day. Index on the user_id. If you are on 5.7 and want to dabble with subpartitioning, I would suggest range partition by date, then hash subpartition by user_id. As a starting point, try 16 to 32 hash buckets. Still index the user_id field.
EDIT: Here's something to play with:
CREATE TABLE user_data (
id INT AUTO_INCREMENT
, dt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
, user_id VARCHAR(20)
, data varchar(20)
, PRIMARY KEY (id, user_id, dt)
, KEY (user_id, dt)
) PARTITION BY RANGE (UNIX_TIMESTAMP(dt))
SUBPARTITION BY KEY (user_id)
SUBPARTITIONS 16 (
PARTITION p1 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-25')),
PARTITION p2 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-26')),
PARTITION p3 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-27')),
PARTITION p4 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-28')),
PARTITION pMax VALUES LESS THAN MAXVALUE
);
-- View the metadata if you're interested
SELECT * FROM information_schema.partitions WHERE table_name='user_data';
I am looking into storing a "large" amount of data and not sure what the best solution is, so any help would be most appreciated. The structure of the data is
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data e.g. rows (1,2,3) and columns (5,10,1000)
2) Needs to be scalable will be adding columns every month but the number of rows are fixed.
My understanding is that often its best to store as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries? I have tried storing as just rows and columns as is in MySQL but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in comments, there's no cleaner way than using one mysql row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on colNum, rowNum, or only a non-unique INDEX on colNum if you often access matrix by column (because PRIMARY INDEX is on ( `rowNum`, `colNum` ), note the order, so it will be inefficient when it comes to select a whole column).
You'll probably need more than 200 GB to store the 450,000 x 11,000 values, including indexes.
Inserting data may be slow (because there are two indexes to rebuild, and 450,000 entries [1 per row] to add when adding a column).
Updates should be very fast, as the index wouldn't change and value has a fixed size.
If you access same subsets (rows + cols) often, maybe you can use PARTITIONing of the table if you need something "faster" than what mysql provides by default.
After years of experience (20201 edit)
Re-reading myself years later, I would say the "cache" ideas are totally dumb, as it's MySQL's role to handle this sort of cache (the data should actually already be in the InnoDB buffer pool).
A better approach, if the matrix is full of zeroes, is to not store the zero values and treat 0 as the "default" in the client code. That way, you can lighten the storage (if needed: MySQL should actually respond pretty fast to queries even on such a 5-billion-row table).
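A minimal sketch of the "zero is the default" idea, using Python's built-in SQLite in place of MySQL so it runs as-is. The helper names set_cell and get_cell are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table matrix (
        rowNum integer not null,
        colNum integer not null,
        value integer not null,
        primary key (rowNum, colNum)
    )
""")

def set_cell(row, col, value):
    if value == 0:
        # Zero is the implicit default, so make sure no row is stored for it.
        conn.execute("delete from matrix where rowNum = ? and colNum = ?", (row, col))
    else:
        conn.execute(
            "insert or replace into matrix (rowNum, colNum, value) values (?, ?, ?)",
            (row, col, value),
        )

def get_cell(row, col):
    found = conn.execute(
        "select value from matrix where rowNum = ? and colNum = ?", (row, col)
    ).fetchone()
    # A missing row means the cell holds the default value.
    return found[0] if found else 0

set_cell(1, 5, 42)
set_cell(2, 10, 0)
stored = conn.execute("select count(*) from matrix").fetchone()[0]
print(get_cell(1, 5), get_cell(2, 10), stored)
```

How much this saves depends entirely on how sparse the real matrix is.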
Another option, if storage is an issue, is to use a single ID to identify both row and column: you say the number of rows is fixed (450000), so you can replace (row, col) with a single value (id = 450000*col+row), though it needs BIGINT, so it may not be better than 2 columns.
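The single-ID mapping and its inverse can be sketched as follows, assuming zero-based row and column numbers (the helper names to_id and from_id are made up). The last lines check why a 4-byte integer is not enough for 11,000 columns:

```python
ROWS = 450_000  # fixed number of rows, per the question

def to_id(row, col):
    return ROWS * col + row

def from_id(cell_id):
    col, row = divmod(cell_id, ROWS)
    return row, col

largest = to_id(ROWS - 1, 11_000 - 1)   # last cell of the current matrix
print(largest, largest > 2**32 - 1)     # compare against the INT UNSIGNED maximum
```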
Don't do like below: don't reinvent MySQL cache
Add a cache (actually no)
Since you said you add values and don't seem to edit matrix values, a cache can speed up frequently requested rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column / row:
SELECT the row/column from that caching table
If the SELECT returns an empty/partial result (no data returned or not enough data to match the expected row/column number) then do the SELECT on the matrix table
Save the SELECT from the matrix table to cachedPartialMatrix
If the caching matrix gets too big, clear it (the bigger the cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix (one may use TRIGGERS) for the Nth-row or Kth-column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth-column from cache entries because its lastDate is old enough, you may break some other entries cache) or to regularly clear the cache and only leave the recently selected values.
Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (let's say 100 000) and a lot of properties (let's say 10 000). Then the table is going to be huge (1 billion rows).
When I extract information from the table I always need information about a particular user. So, I use, for example where user_name='Albert Gates'. So, every time the mysql server needs to analyze 1 billion lines to find those of them which contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column, and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows; it just needs to find the appropriate entry in the index, which is stored in a B-Tree, making it easy to find a record in a very small amount of time.
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and using an integer foreign key in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
you should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
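To show how the three tables are read back together, here is a sketch that loads the same sample rows into Python's built-in SQLite (a stand-in for MySQL; the column types are simplified accordingly) and fetches every property of one user. The helper name properties_of is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table users (user_id integer primary key, username text unique not null);
    create table properties (property_id integer primary key, name text unique not null);
    create table user_property_values (
        user_id integer not null,
        property_id integer not null,
        value text not null,
        primary key (user_id, property_id)
    );
    insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
    insert into properties (name) values ('age'),('gender');
    insert into user_property_values values
        (1,1,'30'),(1,2,'Male'),(2,1,'24'),(2,2,'Female'),(3,1,'18'),(4,1,'26'),(4,2,'Male');
""")

def properties_of(username):
    # The lookup starts from the unique username, then follows the
    # (user_id, property_id) primary key of user_property_values.
    return conn.execute(
        """
        select p.name, upv.value
        from users u
        join user_property_values upv on upv.user_id = u.user_id
        join properties p on p.property_id = upv.property_id
        where u.username = ?
        order by p.property_id
        """,
        (username,),
    ).fetchall()

print(properties_of("f00"))
print(properties_of("alpha"))
```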
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (500K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number 1 way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we found out that the tables had no index. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion; also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need constraints based on other fields, e.g. start and end date?
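A tiny illustration of why the cast matters, using Python's built-in SQLite with made-up sample values: when numbers are stored as text, comparisons and aggregates follow string ordering unless every query remembers to cast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table user_property_values (user_id integer, value text not null)")
conn.executemany(
    "insert into user_property_values values (?, ?)",
    [(1, "30"), (2, "24"), (3, "18"), (4, "100")],
)

# Text comparison: '30' sorts after '100' because '3' > '1'.
as_text = conn.execute("select max(value) from user_property_values").fetchone()[0]
# Numeric comparison only happens after an explicit cast on every row.
as_number = conn.execute(
    "select max(cast(value as integer)) from user_property_values"
).fetchone()[0]
print(as_text, as_number)
```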
Why not simply have the properties as fields rather than some many to many relationship?
Have 1 flat table. When your business rules begin to show that properties should be grouped, then you can consider moving them out into other tables and have several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
One way I regularly see database performance get totally castrated is by having a generic
Id, property Type, Property Name, Property Value table.
This is really lazy and exceptionally flexible, but it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure; it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave, which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join, which further degrades performance.
If data has a 1:1 relationship with an entity then it should be a field on the same table. If your table gets to more than 30 fields wide then consider moving them into another table, but don't call it normalisation because it isn't. It is a technique to help developers group fields together at the cost of performance in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns: null values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.