I have a very simple MySQL query on a remote Windows 7 server on which I cannot change most of the parameters. I need to execute it only once for now, to create a table, but I will face the same issue in upcoming projects.
The query is the following. It is a basic filtering query, and it has been running for 24 hours now:
CREATE TABLE compute_us_gum_2013P13
SELECT A.HHID, UPC, DIVISION, yearweek, CAL_DT, RETAILER, DEAL, RAW_PURCH_QTY,
UNITS,VOL,GROSS_DOL,NET_DOL, CREATE_DATE
FROM work_us_gum_2013P13_digital_purchases_with_yearweek A
INNER JOIN compute_us_gum_2013_digital_panelists B
on A.hhid = B.hhid;
Table A is quite big, around 250 million rows.
Table B has 5 million rows.
hhid is indexed on both tables. I haven't put a unique index on table B, though I could. Would it change things dramatically?
My 12 GB of RAM is completely saturated (actually there is 1 GB free, but I think MySQL can't touch it). Of course I closed everything I could, and the processor is basically idle. The status of the query has been stuck on "Sending data" most of the time.
Table A also has a covering index on 7 columns that I could drop since it's not used, but I don't think that would change anything, would it?
One big issue is that I cannot test many things, because I won't know whether something works until it finishes, and I think this query will take long no matter what. I also don't want to throw away the computation time that has already been spent.
If it helps, I could also keep only the columns HHID, UPC and yearweek (respectively bigint(20), bigint(20) and int(11)), though the columns I would drop are only decimals and dates.
And what if I split table B into several parts? The operation is only a filter, so it can be done in several steps. Would I save time? Even if I neither gain nor lose time, at least I could see my progress.
Another possibility would be to delete rows directly from table A (and, if really necessary, columns), so I wouldn't have to write another table. Would that be faster?
I can change some database parameters if I send an email to my client, but that takes some time and is not suitable for a lot of tweaking and testing.
Any solution would be appreciated, even the dirtiest one :). I'm really stuck here.
EDIT:
EXPLAIN gives me this:
id  select_type  table  type   possible_keys  key    key_len  ref             rows     Extra
1   SIMPLE       B      index  hhidx          hhidx  8        NULL            5003865  Using index
1   SIMPLE       A      ref    hhidx          hhidx  8        ncsmars.B.hhid  6        (none)
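The batching idea raised in the question could be sketched roughly as follows. This is only a sketch under assumptions: the hhid boundaries are hypothetical and would have to be chosen from the real range of hhid values, and the unqualified columns are assumed to exist only in table A, as in the original query. Each INSERT ... SELECT handles one slice, so progress is visible after every batch and a failed batch does not discard the earlier ones. Whether it is faster overall would need to be measured.

```sql
-- Create the empty target table with the right columns (LIMIT 0 copies no rows).
CREATE TABLE compute_us_gum_2013P13
SELECT A.HHID, UPC, DIVISION, yearweek, CAL_DT, RETAILER, DEAL, RAW_PURCH_QTY,
       UNITS, VOL, GROSS_DOL, NET_DOL, CREATE_DATE
FROM work_us_gum_2013P13_digital_purchases_with_yearweek A
INNER JOIN compute_us_gum_2013_digital_panelists B
        ON A.hhid = B.hhid
LIMIT 0;

-- Batch 1: only panelists in one hhid slice (boundaries are hypothetical).
INSERT INTO compute_us_gum_2013P13
SELECT A.HHID, UPC, DIVISION, yearweek, CAL_DT, RETAILER, DEAL, RAW_PURCH_QTY,
       UNITS, VOL, GROSS_DOL, NET_DOL, CREATE_DATE
FROM work_us_gum_2013P13_digital_purchases_with_yearweek A
INNER JOIN compute_us_gum_2013_digital_panelists B
        ON A.hhid = B.hhid
WHERE B.hhid >= 0 AND B.hhid < 1000000;

-- Batch 2, 3, ...: repeat with the next slice, e.g.
-- WHERE B.hhid >= 1000000 AND B.hhid < 2000000;
```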
What is the engine? Is it InnoDB?
What are the primary keys of both tables?
Do both primary keys start with HHID? (If HHID is not a candidate key for a table, you can create a composite key that starts with that field.)
If both PKs start with HHID and you then join the tables on that field, disk seeks will be reduced dramatically, so you should see much better performance. In InnoDB the rows are physically stored in primary-key order, so rows with the same HHID end up next to each other. If you cannot alter both tables, start with the smaller one: alter its PK so that HHID comes first, and then check the execution plan.
ALTER TABLE compute_us_gum_2013_digital_panelists ADD PRIMARY KEY (HHID, [other necessary fields (if any)]);
I suppose it will be better than before.
I want to store the changes that I make to my "entity" table. This should work like a log. Currently it is implemented with this table in MySQL:
CREATE TABLE `entitychange` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`entity_id` int(10) unsigned NOT NULL,
`entitytype` enum('STRING_1','STRING_2','SOMEBOOL','SOMEDOUBLE','SOMETIMESTAMP') NOT NULL DEFAULT 'STRING_1',
`when` TIMESTAMP NOT NULL,
`value` TEXT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
entity_id = the primary key of my entity table.
entitytype = the field that was changed in the entity table. Sometimes only one field is changed, sometimes several. One change = one row.
value = the string representation of the "new value" of the field.
Example: when changing the field entity.somedouble from 3 to 2, I run these queries:
UPDATE entity SET somedouble = 2 WHERE entity_id = 123;
INSERT INTO entitychange (entity_id,entitytype,value) VALUES (123,'SOMEDOUBLE',2);
I need to select the changes of a specific entity and entitytype of the last 15 days. For example: The last changes with SOMEDOUBLE for entity_id 123 within the last 15 days.
Now, there are two things that I dislike:
All data is stored as TEXT, although very little of it (less than 1%) is really text; in my case, most values are DOUBLE. Is this a big problem?
The table is getting really, really slow on inserts, since it already has 200 million rows. Currently my server load is up to 10-15 because of this.
My question: how do I address those two "bottlenecks"? I need to scale.
My approaches would be:
Store it like this: http://sqlfiddle.com/#!2/df9d0 (click on browse). Store the changes in the entitychange table and then store the value according to its datatype in entitychange_[bool|timestamp|double|string]
Use partitioning by HASH(entity_id). I thought of ~50 partitions.
Should I use another database system, maybe MongoDB?
If I were facing the problem you mentioned, I would design the LOG table like below:
EntityName: (String) Entity that is being manipulated.(mandatory)
ObjectId: Entity that is being manipulated, primary key.
FieldName: (String) Entity field name.
OldValue: (String) Entity field old value.
NewValue: (String) Entity field new value.
UserCode: Application user unique identifier. (mandatory)
TransactionCode: Any operation changing the entities needs a unique transaction code (like a GUID). (mandatory) When an update on an entity changes multiple fields, this column is the key to tracing all the changes made by that update (transaction).
ChangeDate: Transaction date. (mandatory)
FieldType: enumeration or text showing the field type like TEXT or Double. (mandatory)
With this approach:
Any entity (table) can be traced.
Reports will be readable.
Only changes will be logged.
The transaction code is the key to detecting the changes made by a single action.
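As a rough illustration, the log table described above might look like the following in MySQL. Treat it as a sketch: the table name entity_log, the column types and sizes, the ENUM values and the two indexes are all assumptions that the answer does not specify.

```sql
-- Hypothetical DDL for the LOG table; types and sizes are assumptions.
CREATE TABLE entity_log (
  entity_name      VARCHAR(64)  NOT NULL,  -- EntityName
  object_id        INT UNSIGNED NOT NULL,  -- ObjectId (primary key of the changed row)
  field_name       VARCHAR(64)  NOT NULL,  -- FieldName
  old_value        VARCHAR(255) NULL,      -- OldValue
  new_value        VARCHAR(255) NULL,      -- NewValue
  user_code        VARCHAR(64)  NOT NULL,  -- UserCode
  transaction_code CHAR(36)     NOT NULL,  -- TransactionCode (GUID-like)
  change_date      DATETIME     NOT NULL,  -- ChangeDate
  field_type       ENUM('TEXT','DOUBLE','BOOL','TIMESTAMP') NOT NULL,  -- FieldType
  KEY idx_object_field_date (entity_name, object_id, field_name, change_date),
  KEY idx_transaction (transaction_code)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- One update that touches two fields: both rows share one transaction code.
SET @tx = UUID();
INSERT INTO entity_log
  (entity_name, object_id, field_name, old_value, new_value,
   user_code, transaction_code, change_date, field_type)
VALUES
  ('entity', 123, 'somedouble', '3',   '2',   'user42', @tx, NOW(), 'DOUBLE'),
  ('entity', 123, 'string_1',   'old', 'new', 'user42', @tx, NOW(), 'TEXT');

-- Trace everything that changed in that single action.
SELECT field_name, old_value, new_value
FROM entity_log
WHERE transaction_code = @tx;
```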
BTW
Store the changes in the entitychange table and then store the value
according to its datatype in entitychange_[bool|timestamp|double|string]
Won't be needed, in the single table you will have changes and data types
Use partitioning by HASH(entity_id)
I would prefer partitioning by ChangeDate, or creating backup tables for rows whose ChangeDate is old enough, so that they can be backed up and removed from the main LOG table.
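Partitioning by ChangeDate could be sketched like this in MySQL. The table name, the reduced column list and the monthly boundaries are assumptions for illustration only. Note that MySQL requires the partitioning column to be part of every unique key, which is why this sketch uses a plain (non-unique) index.

```sql
-- Hypothetical log table partitioned by month on the change date.
CREATE TABLE entity_log_by_date (
  object_id   INT UNSIGNED NOT NULL,
  field_name  VARCHAR(64)  NOT NULL,
  new_value   VARCHAR(255) NULL,
  change_date DATETIME     NOT NULL,
  KEY idx_object_date (object_id, change_date)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(change_date)) (
  PARTITION p2013_05 VALUES LESS THAN (TO_DAYS('2013-06-01')),
  PARTITION p2013_06 VALUES LESS THAN (TO_DAYS('2013-07-01')),
  PARTITION p_max    VALUES LESS THAN MAXVALUE
);

-- Add the next month by splitting the catch-all partition.
ALTER TABLE entity_log_by_date REORGANIZE PARTITION p_max INTO (
  PARTITION p2013_07 VALUES LESS THAN (TO_DAYS('2013-08-01')),
  PARTITION p_max    VALUES LESS THAN MAXVALUE
);

-- Once an old month has been backed up, remove it in one cheap operation.
ALTER TABLE entity_log_by_date DROP PARTITION p2013_05;
```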
Should I use another database system, maybe MongoDB?
Any database comes with its own pros and cons; you can use this design on any RDBMS.
A useful comparison of document-based databases like MongoDB can be found here
Hope this helps.
Now I think I understand what you need: a versionable table with a history of the changed records. This could be another way of achieving the same thing, and you could easily run some quick tests to see whether it gives you better performance than your current solution. It's the way the Symfony PHP framework does it in Doctrine with the Versionable plugin.
Keep in mind that there is a unique primary key index on two columns, version and fk_entity.
Also take a look at the values saved. You save a 0 value in the fields which didn't change and the new value in those which changed.
CREATE TABLE `entity_versionable` (
`version` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`fk_entity` INT(10) UNSIGNED NOT NULL,
`str1` VARCHAR(255),
`str2` VARCHAR(255),
`bool1` BOOLEAN,
`double1` DOUBLE,
`date` TIMESTAMP NOT NULL,
PRIMARY KEY (`version`,`fk_entity`)
) ENGINE=INNODB DEFAULT CHARSET=latin1;
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, 'a1', '0', 0, 0, '2013-06-02 17:13:16');
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, 'a2', '0', 0, 0, '2013-06-11 17:13:12');
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, '0', 'b1', 0, 0, '2013-06-11 17:13:21');
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, '0', 'b2', 0, 0, '2013-06-11 17:13:42');
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, '0', '0', 1, 0, '2013-06-16 17:19:31');
/* Another example */
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, 'a1', 'b1', 0, 0, CURRENT_TIMESTAMP);
/* All versions of entity 1 from the last 15 days */
SELECT * FROM `entity_versionable` t
WHERE t.`fk_entity` = 1
AND t.`date` >= (CURDATE() - INTERVAL 15 DAY);
And probably another step to improve performance would be to move the history log records into separate tables, once per month or so. That way you won't have many records in each table, and searching by date will be really fast.
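A minimal sketch of such a monthly move, assuming the entity_versionable table defined above; the archive table name and the month boundaries are hypothetical. On a table with hundreds of millions of rows the DELETE would probably have to be run in smaller chunks.

```sql
-- Archive table with the same structure as the live table.
CREATE TABLE entity_versionable_2013_05 LIKE entity_versionable;

-- Copy one month of history into the archive table.
INSERT INTO entity_versionable_2013_05
SELECT * FROM entity_versionable
WHERE `date` >= '2013-05-01' AND `date` < '2013-06-01';

-- Remove the archived rows from the live table.
DELETE FROM entity_versionable
WHERE `date` >= '2013-05-01' AND `date` < '2013-06-01';
```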
There are three main challenges here:
1. How to store data efficiently, i.e. taking less space and being in an easy-to-use format
2. Managing a big table: archiving, ease of backup and restore
3. Performance optimisation: faster inserts and selects
Storing data efficiently
value field. I would suggest making it VARCHAR(N).
Reasons:
Using N ≤ 255 will save 1 byte per row just because of the data type.
Using other data types for this field: fixed types use space whatever the value is, normally 8 bytes per row (datetime, long integer, char(8)), and the other variable-length data types are too big for this field.
Also, the TEXT data type results in performance penalties (from the manual on BLOB and TEXT data types):
Instances of TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types. Use of disk incurs a performance penalty, so include BLOB or TEXT columns in the query result only if they are really needed. For example, avoid using SELECT *, which selects all columns.
Each BLOB or TEXT value is represented internally by a separately allocated object. This is in contrast to all other data types, for which storage is allocated once per column when the table is opened.
Basically, TEXT is designed to store big strings and pieces of text, whereas VARCHAR() is designed for relatively short strings.
id field. (Updated, thanks to @steve.) I agree that this field does not carry any useful information. Use 3 columns for your primary key: entity_id, entitytype and when. TIMESTAMP will guarantee pretty well that there will be no duplicates. The same columns will also be used for partitioning/sub-partitioning.
Table manageability
There are two main options: MERGE tables and partitioning. The MERGE storage engine is based on MyISAM, which is being gradually phased out as far as I understand. Here is some reading on the MERGE Storage Engine.
Main tool is Partitioning and it provides two main benefits:
1. Partition switching (which is often an instant operation on large chunk of data) and rolling window scenario: insert new data in one table and then instantly switch all of it into archive table.
2. Storing data in sorted order, that enables partition pruning - querying only those partitions, that contain needed data. MySQL allows sub-partitioning to group data further.
Partitioning by entity_id makes sense. If you need to query data for extended periods of time, or you have another pattern in querying your table, use that column for sub-partitioning. There is no need to sub-partition on all columns of the primary key, unless partitions will be switched at that level.
Number of partitions depends on how big you want db file for that partition to be. Number of sub-partitions depends on number of cores, so each core can search its own partition, N-1 sub-partitions should be ok, so 1 core can do overall coordination work.
Optimisation
Inserts:
Inserts are faster on table without indexes, so insert big chunk of data (do your updates), then create indexes (if possible).
Change TEXT to VARCHAR: it takes some strain off the db engine
Minimal logging and table locks may help, but not often possible to use
Selects:
Text to Varchar should definitely improve things.
Have a current table with recent data (the last 15 days), then move data to the archive via partition switching. Here you have the option to partition the current table differently from the archive table (e.g. by date first, then entity_id), and to change the partitioning scheme by moving a small amount (1 day) of data to a temp table and changing its partitioning.
Also you can consider partitioning by date, you have many queries on date ranges. Put usage of your data and its parts first and then decide which schema will support it best.
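In MySQL, the partition switching mentioned above can be approximated with EXCHANGE PARTITION. The sketch below assumes MySQL 5.6 or later; the table names, the daily partitions and the switch from TIMESTAMP to DATETIME for the `when` column are assumptions made for illustration. The partitioning column is part of the primary key, as MySQL requires.

```sql
-- Hypothetical "current" table, partitioned by day.
CREATE TABLE entitychange_current (
  entity_id  INT UNSIGNED NOT NULL,
  entitytype ENUM('STRING_1','STRING_2','SOMEBOOL','SOMEDOUBLE','SOMETIMESTAMP') NOT NULL,
  `when`     DATETIME NOT NULL,
  value      VARCHAR(255),
  PRIMARY KEY (entity_id, entitytype, `when`)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(`when`)) (
  PARTITION p20130601 VALUES LESS THAN (TO_DAYS('2013-06-02')),
  PARTITION p20130602 VALUES LESS THAN (TO_DAYS('2013-06-03')),
  PARTITION p_max     VALUES LESS THAN MAXVALUE
);

-- Empty, non-partitioned table with the same structure to receive one day of data.
CREATE TABLE entitychange_staging LIKE entitychange_current;
ALTER TABLE entitychange_staging REMOVE PARTITIONING;

-- Swap the oldest day out of the current table; this exchanges the two
-- data sets instead of copying rows one by one.
ALTER TABLE entitychange_current
  EXCHANGE PARTITION p20130601 WITH TABLE entitychange_staging;

-- entitychange_staging now holds that day and can be loaded into an archive
-- table (possibly one that is partitioned differently).
```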
And as for your 3rd question, I do not see how use of MongoDB will specifically benefit this situation.
This is called a temporal database, and researchers have been struggling with the best way to store and query temporal data for over 20 years.
Trying to store the EAV data as you are doing is inefficient, in that storing numeric data in a TEXT column uses a lot of space, and your table is getting longer and longer, as you have discovered.
Another option, sometimes called Sixth Normal Form (although there are multiple unrelated definitions of 6NF), is to add an extra table that stores the revisions of each column you want to track temporally. This is similar to the solution posed in @xtrm's answer, but it doesn't need to store redundant copies of columns that haven't changed. It does, however, lead to an explosion in the number of tables.
I've started to read about Anchor Modeling, which promises to handle temporal changes of both structure and content. But I don't understand it well enough to explain it yet. I'll just link to it and maybe it'll make sense to you.
Here are a couple of books that contain discussions of temporal databases:
Joe Celko's SQL for Smarties, 4th ed.
Temporal Data & the Relational Model, C.J. Date, Hugh Darwen, Nikos Lorentzos
Storing an integer in a TEXT column is a no-go! TEXT is the most expensive type.
I would go as far as creating one log table per field you want to monitor:
CREATE TABLE entitychange_somestring (
entity_id INT NOT NULL, -- not a primary key on its own: one entity has many log rows
ts TIMESTAMP NOT NULL,
newvalue VARCHAR(50) NOT NULL, -- same type as entity.somestring
KEY(entity_id, ts)
) ENGINE=MyISAM;
Partition them, indeed.
Notice I recommend using the MyISAM engine. You do not need transactions for this (these) unconstrained, insert-only table(s).
Why is INSERTing so slow, and what can you do to make it faster?
These are the things I would look at (and roughly in the order I would work through them):
Creating a new AUTO_INCREMENT-id and inserting it into the primary key requires a lock (there is a special AUTO-INC lock in InnoDB, which is held until the statement finishes, effectively acting as a table lock in your scenario). This is not usually a problem as this is a relatively fast operation, but on the other hand, with a (Unix) load value of 10 to 15, you are likely to have processes waiting for that lock to be freed. From the information you supply, I don't see any use in your surrogate key 'id'. See if dropping that column changes performance significantly. (BTW, there is no rule that a table needs a primary key. If you don't have one, that's fine)
InnoDB can be relatively expensive for INSERTs. This is a trade off made to allow additional functionality such as transactions and may or may not be affecting you. Since all your actions are atomic, I see no need for transactions. That said, give MyISAM a try. Note: MyISAM is usually a bad choice for huge tables because it only supports table locking and not record level locking, but it does support concurrent inserts, so it might be a choice here (especially if you do drop the primary key, see above)
You could play with database storage engine parameters. Both InnoDB and MyISAM have options you could change. Some of them have an impact on how TEXT data is actually stored, others have a broader function. One you should specifically look at is innodb_flush_log_at_trx_commit.
TEXT columns are relatively expensive if (and only if) they have non-NULL values. You are currently storing all values in that TEXT column. It is worth giving the following a try: add extra fields value_int and value_double to your table and store those values in the corresponding column. Yes, that will waste some extra space, but it might be faster, although this will largely depend on the database storage engine and its settings. Please note that a lot of what people think about TEXT column performance is not true. (See my answer to a related question on VARCHAR vs TEXT)
You suggested spreading the information over more than one table. This is only a good idea if your tables are fully independent of one another. Otherwise you'll end up with more than one INSERT operation for any change, and you're more than likely to make things a lot worse. While normalizing data is usually good(tm), it is likely to hurt performance here.
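Two of the points above could be tried out roughly like this. The column names value_int and value_double come from the answer; everything else is an assumption. The ALTER rebuilds a 200-million-row table, so it is best tried on a copy first, and SET GLOBAL needs the appropriate privilege.

```sql
-- Typed value columns next to the existing TEXT column.
ALTER TABLE entitychange
  ADD COLUMN value_int    INT    NULL,
  ADD COLUMN value_double DOUBLE NULL;

-- A DOUBLE change now leaves the TEXT column NULL.
INSERT INTO entitychange (entity_id, entitytype, value_double)
VALUES (123, 'SOMEDOUBLE', 2);

-- Inspect the current redo-log flush policy.
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

-- 2 = write the log at each commit but flush it to disk only about once per
-- second: faster inserts, at the risk of losing roughly the last second of
-- transactions if the operating system crashes.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```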
What can you do to make SELECTs run fast
Proper keys. And proper keys. And just in case I forgot to mention: proper keys. You don't specify in detail what your selects look like, but I assume them to be similar to "SELECT * FROM entitychange WHERE entity_id=123 AND ts>...". A single compound index on entity_id and ts should be enough to make this operation fast. Since the index has to be updated with every INSERT, it may be worth trying the performance of both entity_id, ts and ts, entity_id: It might make a difference.
Partitioning. I wouldn't even bring this subject up, if you hadn't asked in your question. You don't say why you'd like to partition the table. Performance-wise it usually makes no difference, provided that you have proper keys. There are some specific setups that can boost performance, but you'll need the proper hardware setup to go along with this. If you do decide to partition your table, consider doing that by either the entity_id or the TIMESTAMP column. Using the timestamp, you could end up with archiving system with older data being put on an archive drive. Such a partitioning system would however require some maintenance (adding partitions over time).
It seems to me that you're not as concerned about query performance as about the raw insert speed, so I won't go into more detail on SELECT performance. If this does interest you, please provide more detail.
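For the "proper keys" point, a compound index matching the query from the question (changes of one entity and one entitytype within the last 15 days) might look like this. The index name is hypothetical, and the timestamp column is called `when` in the question's table rather than ts.

```sql
-- Compound index: equality columns first, the range column last.
ALTER TABLE entitychange
  ADD INDEX idx_entity_type_when (entity_id, entitytype, `when`);

-- EXPLAIN should show idx_entity_type_when being used for this lookup.
EXPLAIN
SELECT `when`, value
FROM entitychange
WHERE entity_id = 123
  AND entitytype = 'SOMEDOUBLE'
  AND `when` >= NOW() - INTERVAL 15 DAY
ORDER BY `when` DESC;
```

Every additional index has to be maintained on each INSERT, so it is worth measuring insert speed before and after adding it.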
I would advise you to do a lot of in-depth testing, but in my tests I am achieving very good results with both INSERT and SELECT using the table definition I posted before. I will describe my tests in detail so anyone can easily repeat them and check whether they get better results. Back up your data before any test.
I must say that these are only tests and may not reflect or improve your real case, but it's a good way of learning and probably a way of finding useful information and results.
The advice that we have seen here is really good, and you will surely notice a great speed improvement by using a VARCHAR with a defined size instead of TEXT. However, even though you could gain speed with it, I would advise against using MyISAM for data integrity reasons; stay with InnoDB.
TESTING:
1. Set up the table and INSERT 200 million rows:
CREATE TABLE `entity_versionable` (
`version` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`fk_entity` INT(10) UNSIGNED NOT NULL,
`str1` VARCHAR(255) DEFAULT NULL,
`str2` VARCHAR(255) DEFAULT NULL,
`bool1` TINYINT(1) DEFAULT NULL,
`double1` DOUBLE DEFAULT NULL,
`date` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`version`,`fk_entity`)
) ENGINE=INNODB AUTO_INCREMENT=230297534 DEFAULT CHARSET=latin1
To insert +200 million rows into a table in about 35 minutes, please check my other question, where peterm has answered with one of the best ways to fill a table. It works perfectly.
Execute the following query 2 times in order to insert 200 million rows of non-random data (each run generates 10^8 = 100 million rows; change the values each time to insert varied data):
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, DATE)
SELECT 1, 'a1', 238, 2, 524627, '2013-06-16 14:42:25'
FROM
(
SELECT a.N + b.N * 10 + c.N * 100 + d.N * 1000 + e.N * 10000 + f.N * 100000 + g.N * 1000000 + h.N * 10000000 + 1 N FROM
(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) c
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) d
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) e
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) f
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) g
,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) h
) t;
*Since you already have the original table with 200 million rows of real data, you probably won't need to fill it; just export your table data and schema and import them into a new test table with the same schema. That way you will run your tests on a new table with your real data, and the improvements you get will also apply to the original one.
2. ALTER the new test table for performance (or use my example above in step 1 to get better results).
Once we have our new test table set up and filled with data, we should check the advice above and ALTER the table to speed it up:
Change TEXT to VARCHAR(255).
Select and make a good primary key unique index with two or three
columns. Test with version autoincrement and fk_entity in your first
test.
Partition your table if necessary, and check if it improves speed. I
would advise not to partition it in your first tests, in order to
check for real performance gain by changing data types and mysql
configuration. Check the following link for some partition and
improvement tips.
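Optimize your table. The indexes will be rebuilt, which will
speed up searches a lot (REPAIR TABLE only applies to engines such as MyISAM; InnoDB does not support it):
OPTIMIZE TABLE test.entity_versionable;
REPAIR TABLE test.entity_versionable; -- MyISAM only; on InnoDB it reports that the engine doesn't support repair
*Write a script that runs the optimization and keeps your indexes up to date, and launch it every night.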
3. Improve your MySQL and hardware configuration by carefully reading the following threads. They are worth reading and I'm sure you will get better results.
Easily improve your Database hard disk configuration spending a bit
of money: If possible use a SSD for your main MySQL database, and a
stand alone mechanical hard disk for backup purposes. Set MySQL logs
to be saved on another third hard disk to improve speed in your
INSERTs. (Remember to defragment mechanical hard disks after some
weeks).
Performance links: general&multiple-cores, configuration,
optimizing IO, Debiancores, best configuration,
config 48gb ram..
Profiling a SQL query: How to profile a query, Check for possible bottleneck in a query
MySQL is very memory intensive, so use low-latency CL7 DDR3 memory if
possible. A bit off topic, but if your system data is critical, you may want to look at ECC memory, although it's expensive.
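For the profiling links in the list above, a minimal session might look like the following. It assumes the entity_versionable test table from step 1 and a MySQL version that still supports SHOW PROFILE (it is available in 5.6 and deprecated in newer versions); query ID 1 assumes the SELECT is the first statement profiled in the session.

```sql
-- Turn on per-session profiling.
SET profiling = 1;

-- The statement to be measured.
SELECT COUNT(*)
FROM entity_versionable
WHERE fk_entity = 1
  AND `date` >= CURDATE() - INTERVAL 15 DAY;

-- List the profiled statements with their IDs and durations.
SHOW PROFILES;

-- Time spent in each execution stage of statement 1.
SHOW PROFILE FOR QUERY 1;

-- Check which index (if any) the optimizer picks.
EXPLAIN
SELECT COUNT(*)
FROM entity_versionable
WHERE fk_entity = 1
  AND `date` >= CURDATE() - INTERVAL 15 DAY;

SET profiling = 0;
```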
4. Finally, test your INSERTs and SELECTs in the test table. In my tests with +200 million rows of random data and the above table schema, it takes 0.001 seconds to INSERT a new row and about 2 minutes to search and SELECT 100 million rows. It is only a test, but the results seem good :)
5. My System Configuration:
Database: MySQL 5.6.10 InnoDB database (test).
Processor: AMD Phenom II 1090T X6 core, 3910Mhz each core.
RAM: 16GB DDR3 1600Mhz CL8.
HD: Windows 7 64bits SP1 in SSD, mySQL installed in SSD, logs written in mechanical hard disk.
We would probably get better results with one of the latest Intel i5 or i7 processors, easily overclocked to 4500 MHz+, since MySQL only uses one core for one SQL statement. The higher the core speed, the faster it will be executed.
6. Read more about MySQL:
O'Reilly High Performance MySQL
MySQL Optimizing SQL Statements
7. Using another database:
MongoDB or Redis will be perfect for this case and probably a lot faster than MySQL. Both are very easy to learn, and both have their advantages:
- MongoDB: MongoDB log file growth
Redis
I would definitely go for Redis. If you learn how to save the log in Redis, it will be the best way to manage the log with insanely high speed:
redis for logging
Keep the following advice in mind if you use Redis:
Redis is written in C and keeps its data in memory. It has several
methods to automatically save the information to disk
(persistence), so you probably won't have to worry about it. (In a disaster
scenario you will end up losing about 1 second of logging.)
Redis is used by a lot of sites which manage terabytes of data.
There are a lot of ways to handle that insane amount of information,
and it means that it's secure (used here at Stack Overflow, Blizzard, Twitter, YouPorn...).
Since your log will be very big, it will need to fit in memory in
order to get speed without having to access the hard disk. You may
save different logs for different dates and keep only some of them in
memory. If you reach the memory limit, you won't get any errors and everything will still work perfectly, but check the Redis FAQ for more information.
I'm totally sure that Redis will be a lot faster for this purpose than
MySQL. You will need to learn how to work with lists and
sets to update data and to query/search for data. If you need really advanced query searches, you should go with MongoDB, but for simple date searches like this one Redis will be perfect.
Nice Redis article in Instagram Blog.
At work we have log tables for almost every table due to customer requirements (financial sector).
We have done it this way: two tables (the "normal" table and a log table), plus triggers on insert/update/delete of the normal table which store a keyword (I, U, D) and the old record (on update, delete) or the new one (on insert) in the log table.
We keep both tables in the same database schema.
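A stripped-down sketch of that setup in MySQL could look like this. The account and account_log tables and all column names are hypothetical; a real log table would mirror every column of its "normal" table.

```sql
-- The "normal" table.
CREATE TABLE account (
  account_id INT UNSIGNED  NOT NULL PRIMARY KEY,
  balance    DECIMAL(12,2) NOT NULL
) ENGINE=InnoDB;

-- The log table: keyword, timestamp, and a copy of the logged record.
CREATE TABLE account_log (
  log_id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  action     CHAR(1)         NOT NULL,  -- 'I', 'U' or 'D'
  logged_at  TIMESTAMP       NOT NULL DEFAULT CURRENT_TIMESTAMP,
  account_id INT UNSIGNED    NOT NULL,
  balance    DECIMAL(12,2)   NOT NULL
) ENGINE=InnoDB;

-- On insert: store the new record.
CREATE TRIGGER account_ai AFTER INSERT ON account
FOR EACH ROW
  INSERT INTO account_log (action, account_id, balance)
  VALUES ('I', NEW.account_id, NEW.balance);

-- On update: store the old record.
CREATE TRIGGER account_au AFTER UPDATE ON account
FOR EACH ROW
  INSERT INTO account_log (action, account_id, balance)
  VALUES ('U', OLD.account_id, OLD.balance);

-- On delete: store the old record.
CREATE TRIGGER account_ad AFTER DELETE ON account
FOR EACH ROW
  INSERT INTO account_log (action, account_id, balance)
  VALUES ('D', OLD.account_id, OLD.balance);
```

Because each trigger body is a single statement, no BEGIN ... END block or DELIMITER change is needed.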