A single insert statement is taking, occasionally, more than 2 seconds. The inserts are potentially concurrent, as it depends on our site traffic which can result in 200 inserts per minute.
The table has more than 150M rows, 4 indexes and is accessed using a simple select statement for reporting purposes.
SHOW INDEX FROM ouptut
How to speed up the inserts considering that all indexes are required?
You haven't provided many details but it seems like you need partitions.
An insertion operation in an database index has, in general, an O(logN) time complexity where N is the number of rows in the table. If your table is really huge even logN may become too much.
So, to address that scalability issue you can make use of index partitions to transparently split up your table indexes in smaller internal pieces and reduce that N without changing your application or SQL scripts.
https://dev.mysql.com/doc/refman/5.7/en/partitioning-overview.html
[EDIT]
Considering information initially added in the comments and now updated in the question itself.
200 potentially concurrent inserts per minute
4 indexes
1 select for reporting purposes
There are a few not mutually exclusive improvements:
Check the output of EXPLAIN for that SELECT and remove indexes not being used, or, otherwise, combine them in a single index.
Make the inserts in batch
https://dev.mysql.com/doc/refman/5.6/en/insert-optimization.html
https://dev.mysql.com/doc/refman/5.6/en/optimizing-innodb-bulk-data-loading.html
Partitioning still an option.
Alternatively, change your approach: save the data to a nosql database like redis and populate the mysql table asynchronously for reporting purpose.
Related
So i have database in project Mysql .
I have a main table that have main staff for updating and inserting .
I have huge data traffic on the data . what i am doing mainly reading .csv file and inserting to table .
Everything works file for 3 days but when table record goes above 20 million the database start responding slow , and in 60 million more slow.
What i have done ?
I have applied index in the record where i think i need of it . (where clause field for fast searching) .
I think query optimisation can not be issue because database working fine for 3 days and when data filled in table it get slow . and as i reach 60 million it work more slow .
Can you provide me the approach how can i handle this ?
What should i do ? Should i shift data in every 3 days or what ? What you have done in such situation .
The purpose of database is to store a huge information. I think the problem is not in your database, it should be poor query, joins, Database buffer, index and cache. These are the following reason which makes your response to slow up. For more info check this link
I have applied index in the record where i think i need of it
Yes, index improve the performance of SELECT query, but at the same time it will degrade your DML operation and index has to be restructure whenever you perform any changes to indexed column.
Now, this is totally depending on your business need, whether you need index or not, whether you can compromise SELECT or DML.
Currently, many industries uses two different schemas OLAP for reporting and analytics and OLTP to store real-time data (including some real-time reporting).
First of all it could be helpful for us to now which kind of data you want to store.
Normally it makes no sense to store such a huge amount of data in 3 days because no one ever will be able to use this in an effective way. So it is better to reduce the data before storing in the database.
e.g.
If you get measuring values from a device which gives you one value a millisecond, you should think if any user is ever asking for a special value at a special millisecond or if it not makes more sense to calculate the average value of once a second, minute or hour or perhaps once a day?
If you really need the milliseconds but only if the user takes a deeper look, you can create a table from the main table with only the average values of an hour or day or whatever and work with that table. Only if the user goes in ths "milliseconds" view you use the main table and have to live with the more bad performance.
This all is of course only possible if the database data is read only. If the data in the database is changed from the application (and not only appended by the CSV import) then using more then one table will be error prone.
Whick operation do you want to speed up?
insert operation
A good way to speed it up is to insert records in batch. For example, insert 1000 records in each insert statement:
insert into test values (value_list),(value_list)...(value_list);
other operations
If your table got tens of millions of records, everything will be slowing down. This is quite common.
To speed it up in this situation, here is some advice:
Optimize your table definition. It depends on your particular case. Creating indexes is a common way.
Optimize your SQL statements. Apparently a good SQL statement will run much faster, and a bad SQL statement might be a performance killer.
Data migration. If only part of your data is used frequently, you can shift the infrequently-used data to another big table.
Sharding. This is a more complicated way, but usually used in big data system.
For the .csv file, use LOAD DATA INFILE ...
Are you using InnoDB? How much RAM do you have? What is the value of innodb_buffer_pool_size? That may not be set right -- based on queries slowing down as the data increases.
Let's see a slow query. And SHOW CREATE TABLE. Often a 'composite' index is needed. Or reformulation of the SELECT.
I'm running an ETL process and streaming data into a MySQL table.
Now it is being written over a web connection (fairly fast one) -- so that can be a bottleneck.
Anyway, it's a basic insert/ update function. It's a list of IDs as the primary key/ index .... and then a few attributes.
If a new ID is found, insert, otherwise, update ... you get the idea.
Currently doing an "update, else insert" function based on the ID (indexed) is taking 13 rows/ second (which seems pretty abysmal, right?). This is comparing 1000 rows to a database of 250k records, for context.
When doing a "pure" insert everything approach, for comparison, already speeds up the process to 26 rows/ second.
The thing with the pure "insert" approach is that I can have 20 parallel connections "inserting" at once ... (20 is max allowed by web host) ... whereas any "update" function cannot have any parallels running.
Thus 26 x 20 = 520 r/s. Quite greater than 13 r/s, especially if I can rig something up that allows even more data pushed through in parallel.
My question is ... given the massive benefit of inserting vs. updating, is there a way to duplicate the 'update' functionality (I only want the most recent insert of a given ID to survive) .... by doing a massive insert, then running a delete function after the fact, that deletes duplicate IDs that aren't the 'newest' ?
Is this something easy to implement, or something that comes up often?
What else I can do to ensure this update process is faster? I know getting rid of the 'web connection' between the ETL tool and DB is a start, but what else? This seems like it would be a fairly common problem.
Ultimately there are 20 columns, max of probably varchar(50) ... should I be getting a lot more than 13 rows processed/ second?
There are many possible 'answers' to your questions.
13/second -- a lot that can be done...
INSERT ... ON DUPLICATE KEY UPDATE ... ('IODKU') is usually the best way to do "update, else insert" (unless I don't know what you mean by it).
Batched inserts is much faster than inserting one row at a time. Optimal is around 100 rows giving 10x speedup. IODKU can (usually) be batched, too; see the VALUES() pseudo function.
BEGIN;...lots of writes...COMMIT; cuts back significantly on the overhead for transaction.
Using a "staging" table for gathering things up update can have a significant benefit. My blog discussing that. That also covers batch "normalization".
Building Summary Tables on the fly interferes with high speed data ingestion. Another blog covers Summary tables.
Normalization can be used for de-dupping, hence shrinking the disk footprint. This can be important for decreasing I/O for the 'Fact' table in Data Warehousing. (I am referring to your 20 x VARCHAR(50).)
RAID striping is a hardware help.
Batter-Backed-Write-Cache on a RAID controller makes writes seem instantaneous.
SSDs speed up I/O.
If you provide some more specifics (SHOW CREATE TABLE, SQL, etc), I can be more specific.
Do it in the DBMS, and wrap it in a transaction.
To explain:
Load your data into a temporary table in MySQL in the fastest way possible. Bulk load, insert, do whatever works. Look at "load data infile".
Outer-join the temporary table to the target table, and INSERT those rows where the PK column of the target table is NULL.
Outer-join the temporary table to the target table, and UPDATE those rows where the PK column of the target table is NOT NULL.
Wrap steps 2 and 3 in a begin/commit (or [start transaction]/commit pair for a transaction. The default behaviour is probably autocommit, which will mean you're doing a LOT of database work after every insert/update. Use transactions properly, and the work is only done once for each block.
I have a table which is frequently updated (insert/delete). I also have a script to periodically count how many records are stored in the table. How can I optimize the performance?
Do nothing: Just use the COUNT function.
Create another field to store the number of records: Whenever a new record's added, we increase that field and vice versa.
If your database's main function is storing (frequently inserting/updating), switch storage engine to InnoDB, which is faster with INSERT and UPDATE queries, but slower with reading.
Read more here, here or here.
Method #2 is pretty much the standard way of doing it (if your table is incredibly huge and COUNT is giving you performance issues). You could also store the COUNT value in a MEMORY table which would make retrieval exceedingly fast.
Increment/decrement as you see fit.
If you need accurate numbers, I would build this into the app the updates the database or use triggers to keep the counts up to date. As others have mentioned, the counts could be kept in a MEMORY table, or a Redis instance if you want performance and persistence. There are counts in the INFORMATION_SCHEMA.TABLES table, but they're not precise for InnoDB (+-20% or more).
I have a MySql DataBase. I have a lot of records (about 4,000,000,000 rows) and I want to process them in order to reduce them(reduce to about 1,000,000,000 Rows).
Assume I have following tables:
table RawData: I have more than 5000 rows per sec that I want to insert them to RawData
table ProcessedData : this table is a processed(aggregated) storage for rows that were inserted at RawData.
minimum rows count > 20,000,000
table ProcessedDataDetail: I write details of table ProcessedData (data that was aggregated )
users want to view and search in ProcessedData table that need to join more than 8 other tables.
Inserting in RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are very slow. I used a lot of Indexes. assume my data length is 1G, but my Index length is 4G :). ( I want to get ride of these indexes, they make slow my process)
How can I Increase speed of this process ?
I think I need a shadow table from ProcessedData, name it ProcessedDataShadow. then proccess RawData and aggregate them with ProcessedDataShadow, then insert the result in ProcessedDataShadow and ProcessedData. What is your idea??
(I am developing the project by C++)
thank you in advance.
Without knowing more about what your actual application is, I have these suggestions:
Use InnoDB if you aren't already. InnoDB makes use of row-locks and are much better at handling concurrent updates/inserts. It will be slower if you don't work concurrently, but the row-locking is probably a must have for you, depending on how many sources you will have for RawData.
Indexes usually speeds up things, but badly chosen indexes can make things slower. I don't think you want to get rid of them, but a lot of indexes can make inserts very slow. It is possible to disable indexes when inserting batches of data, in order to prevent updating indexes on each insert.
If you will be selecting huge amount of data that might disturb the data collection, consider using a replicated slave database server that you use only for reading. Even if that will lock rows /tables, the primary (master) database wont be affected, and the slave will get back up to speed as soon as it is free to do so.
Do you need to process data in the database? If possible, maybe collect all data in the application and only insert ProcessedData.
You've not said what the structure of the data is, how its consolidated, how promptly data needs to be available to users nor how lumpy the consolidation process can be.
However the most immediate problem will be sinking 5000 rows per second. You're going to need a very big, very fast machine (probably a sharded cluster).
If possible I'd recommend writing a consolidating buffer (using an in-memory hash table - not in the DBMS) to put the consolidated data into - even if it's only partially consolidated - then update from this into the processedData table rather than trying to populate it directly from the rawData.
Indeed, I'd probably consider seperating the raw and consolidated data onto seperate servers/clusters (the MySQL federated engine is handy for providing a unified view of the data).
Have you analysed your queries to see which indexes you really need? (hint - this script is very useful for this).
Just finished rewriting many queries as batch queries - no more DB calls inside of foreach loops!
One of these new batch queries, and insert ignore into a pivot table, is taking 1-4 seconds each time. It is fairly large (~100 rows per call) and the table is also > 2 million rows.
This is the current bottleneck in my program. Should I consider something like locking the table (never done this before, but I have heard it is ... dangerous) or are there other options I should look at first.
As it is a pivot table, there is a unique key comprised of both the rows I am updating.
Are you using indexes? Indexing the correct columns speeds things up immensely. If you are doing a lot of updating and inserting, sometimes it makes sense to disable indexes until finished, since re-indexing takes time. I don't understand how locking the table would help. Is this table in user by other users or applications? That would be the main reason locking would increase speed.