Optimization of merging huge tables in MySQL

I have a huge indexed MySQL table (~300-400 GB) that I need to append new entries to from time to time (each batch of new data is ~10-20 GB). The raw file with the new data may contain mistakes that can only be fixed manually and only become visible when the processing script reaches them. Also, the new data should become available in the main db only after processing of the raw data has fully finished. So, to avoid screwing up the main table, I decided on the following workflow:
The script creates a temporary table with a structure identical to the main table and fills it.
Once that is done and verified, the temporary table is inserted into the main one:
INSERT INTO main_table (all_fields_except_primary_key) SELECT all_fields_except_primary_key FROM new_table;
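The staging workflow can be sketched end to end like this. This is only an illustration: it uses Python's sqlite3 module as a lightweight stand-in for MySQL, and the table/column names (main_table, new_table, val) are made up. In real MySQL the staging table would be created with CREATE TABLE new_table LIKE main_table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Live table with an auto-increment primary key and a secondary index.
cur.execute("CREATE TABLE main_table (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE INDEX idx_main_val ON main_table (val)")
cur.execute("INSERT INTO main_table (val) VALUES ('old1'), ('old2')")

# Step 1: staging table with an identical structure, filled from the raw file.
cur.execute("CREATE TABLE new_table (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("INSERT INTO new_table (val) VALUES ('new1'), ('new2'), ('new3')")

# ... fix mistakes and verify the staged rows here; main_table is untouched ...

# Step 2: append everything except the primary key into the main table.
cur.execute("INSERT INTO main_table (val) SELECT val FROM new_table")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM main_table").fetchone()[0])  # 5
```

The key property is that all verification happens against new_table, so a bad raw file never touches main_table.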
This procedure is extremely slow, as I understand because of index maintenance on the newly inserted rows.
I have read that inserting into indexed tables is very slow in general, and some professionals suggest dropping the indexes (DROP INDEX) before inserting a large amount of data and then rebuilding them. But with data this huge, indexing the whole table takes very long (much longer than my naive INSERT INTO .. SELECT ..) and, more importantly, the main table is almost unusable during it (without indexes, SELECTs take ages).
So I had the idea of indexing the temporary table before inserting (since that is very fast) and then doing a merge that combines both indexes.
Is it possible somehow in MySQL?
And another question: is there perhaps a different workaround for my task?

Related

Galera: Cannot write to database when indexing large data

I am using a Galera Cluster with 3 nodes and I am currently running into the following problem. I want to write more than 500 million records into the database, for example into table Data. These are the steps:
Create table NewData with the same schema as Data but without indexes.
Write 500 million records into this table (using multiple threads to write, each thread writing a batch of records).
After finishing, add the indexes to this table.
Rename Data to OldData and rename NewData to Data.
The problem I currently have is that during the indexing phase, other services cannot write or read data. After I increased innodb_buffer_pool_size, other nodes can read data but still cannot write.
I have configured the write job to run on a different node than the other APIs, but the problem stays the same. I would think that if one node is under a very high workload, the other nodes should still behave normally. Please tell me why this happens and how to fix it.
Thanks
I think you missed a step.
(one-time) Create table NewData with schema as Data but without index.
Insert into NewData.
Create table Empty (again like Data but without any index)
RENAME TABLE NewData TO ToIndex, Empty TO NewData; -- Now the ingestion can proceed.
ALTER TABLE ToIndex ADD INDEX ...
RENAME TABLE Data TO Old, ToIndex TO Data;
The point is to have two things going on:
Continually writing to the unindexed NewData.
Swapping tables around so that, periodically, the filled table (under a new name) gets indexed and is then used to replace the live table (which is always seen as Data).
This is not quite the same situation, but has some similarities: http://mysql.rjweb.org/doc.php/staging_table
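The rotation described in the steps above might be sketched like this, again with Python's sqlite3 standing in for MySQL. Note that MySQL's RENAME TABLE can swap several tables in one atomic statement, which sqlite cannot, so the renames here are done one by one; the table names follow the answer's.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Live table plus the unindexed ingestion table.
cur.execute("CREATE TABLE Data (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE NewData (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("INSERT INTO NewData (val) VALUES ('a'), ('b')")

# One-time spare table, like Data but without any secondary index.
cur.execute("CREATE TABLE Empty (id INTEGER PRIMARY KEY, val TEXT)")

# Swap the ingestion table aside so writing can continue into a fresh one.
cur.execute("ALTER TABLE NewData RENAME TO ToIndex")
cur.execute("ALTER TABLE Empty RENAME TO NewData")

# Index the frozen batch while ingestion keeps going into NewData.
cur.execute("CREATE INDEX idx_val ON ToIndex (val)")

# Finally replace the live table with the freshly indexed one.
cur.execute("ALTER TABLE Data RENAME TO Old")
cur.execute("ALTER TABLE ToIndex RENAME TO Data")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM Data").fetchone()[0])  # 2
```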

MySQL Table Times Out During Drop

I have a table with several hundred million rows of data. I want to delete the table, but every operation I perform on it loses its connection after running for 50,000+ seconds (about 16 hours), which is under the 60,000-second timeout I have set in the database. I've tried creating a stored procedure with the DROP TABLE code, thinking that if I send the operation to the DB it would not need a connection to process it, but it does the same thing. Is it just timing out? Or do I need to do something else?
Instead, do TRUNCATE TABLE. Internally it creates an equivalent, but empty, table, then swaps it in. This technique should take only about a second, even for a very big table.
If you are deleting most of a table, then it is usually faster (sometimes a lot faster) to do
CREATE TABLE new LIKE real;
INSERT INTO new
SELECT ... FROM real
WHERE ... -- the rows you want to keep
Why do you need to delete everything?
For other techniques in massive deletes, including big chunks out of a huge table, see https://mariadb.com/kb/en/mariadb/big-deletes/
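One of the techniques from that link, deleting in small batches so that no single statement holds locks for hours, might look like this sketch (Python with sqlite3 as a stand-in for MySQL; the batch size, table name big_table, and keep condition are all arbitrary):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, keep INTEGER)")
cur.executemany("INSERT INTO big_table (keep) VALUES (?)",
                [(1 if i % 10 == 0 else 0,) for i in range(1000)])

# Delete unwanted rows in small batches so no single statement
# locks the table for a long time.
BATCH = 100
while True:
    cur.execute("DELETE FROM big_table WHERE id IN "
                "(SELECT id FROM big_table WHERE keep = 0 LIMIT ?)", (BATCH,))
    con.commit()
    if cur.rowcount < BATCH:
        break

print(cur.execute("SELECT COUNT(*) FROM big_table").fetchone()[0])  # 100
```

Committing between batches is what keeps each transaction, and hence each lock hold, short.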

Table with 50 million data and adding index takes too much time

I was working on a table which has about 50 million rows (2 GB in size). I had a requirement to optimize performance. So when I added an index on a column through the phpMyAdmin panel, the table got locked, which held up all queries on that table in a queue and ultimately resulted in restarting/killing all queries. (And yeah, I forgot to mention I was doing this on production. My bad!)
When I did some research I found solutions like creating a duplicate table, but is there any alternative method?
You may follow these steps:
Create a temp table
Create triggers on the first table (for inserts, updates, deletes) so that they are replicated to the temp table
Migrate the data in small batches
When done, rename the temp table to the original table name, and drop the other table
But as you said, you are doing it in production, so you need to consider the live traffic while dropping a table and creating another one.
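The trigger-based steps above (essentially what tools like pt-online-schema-change automate) might be sketched like this, using Python's sqlite3 as a stand-in for MySQL. Only the INSERT trigger is shown for brevity, and all names (orig, tmp, cp) are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE orig (id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO orig (val) VALUES (?)",
                [("row%d" % i,) for i in range(5)])

# 1. Temp table, already carrying the index we want to add.
cur.execute("CREATE TABLE tmp (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE INDEX idx_tmp_val ON tmp (val)")

# 2. Trigger so that new writes to orig are replicated into tmp.
cur.execute("""CREATE TRIGGER cp AFTER INSERT ON orig BEGIN
                   INSERT INTO tmp (id, val) VALUES (NEW.id, NEW.val);
               END""")

# 3. Backfill the existing rows (in real life: in small batches).
cur.execute("INSERT OR IGNORE INTO tmp SELECT id, val FROM orig")

# A write arriving during the migration is caught by the trigger.
cur.execute("INSERT INTO orig (val) VALUES ('live')")

# 4. Swap the tables and clean up.
cur.execute("DROP TRIGGER cp")
cur.execute("ALTER TABLE orig RENAME TO old")
cur.execute("ALTER TABLE tmp RENAME TO orig")
cur.execute("DROP TABLE old")
con.commit()

print(cur.execute("SELECT COUNT(*) FROM orig").fetchone()[0])  # 6
```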

How to avoid to blow up transaction log?

I have a table which stores the results of a complex query. This table is truncated and repopulated once per hour. As you might assume, this is done for performance reasons, so that the application accesses this table instead of running the query.
Is truncate-and-insert the only cheap way to accomplish this, or are there other possibilities with respect to the transaction log?
If I am assuming right, you are using this table as a temp table to store some records and want to remove all records from it every hour, right?
Truncate is always minimally logged. So yes, truncate and then insert will work. Another option is to create a new table with the same structure, drop the old table, and then rename the new table to the old table name.
If you want to avoid the above, you can explore the "simple" recovery model (this has implications for point-in-time recovery, so be very careful with it if you have other tables in the same database). Or you can create a new database which will have just this one table, and set that DB's recovery model to "simple". The simple recovery model will help you keep your t-log small.
Lastly, if you must have full recovery and also cannot use the "truncate" or "drop" options above, you should at the very least back up your t-log at regular intervals (depending on how fast it is growing and how much space you have).

How to manage huge operations on MySQL

I have a MySQL database with a lot of records (about 4,000,000,000 rows) that I want to process in order to reduce them (to about 1,000,000,000 rows).
Assume I have following tables:
table RawData: I receive more than 5000 rows per second that I want to insert into RawData
table ProcessedData: this table is the processed (aggregated) storage for the rows that were inserted into RawData;
minimum row count > 20,000,000
table ProcessedDataDetail: here I write the details of ProcessedData (the data that was aggregated)
Users want to view and search the ProcessedData table, which requires joining more than 8 other tables.
Inserting into RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are very slow. I used a lot of indexes; assume my data length is 1 GB, but my index length is 4 GB :). (I want to get rid of these indexes; they slow down my processing.)
How can I increase the speed of this process?
I think I need a shadow table of ProcessedData, name it ProcessedDataShadow. I would then process RawData, aggregate it against ProcessedDataShadow, insert the result into ProcessedDataShadow, and from there into ProcessedData. What is your idea?
(I am developing the project in C++)
thank you in advance.
Without knowing more about what your actual application is, I have these suggestions:
Use InnoDB if you aren't already. InnoDB makes use of row locks and is much better at handling concurrent updates/inserts. It will be slower if you don't work concurrently, but the row locking is probably a must-have for you, depending on how many sources you will have for RawData.
Indexes usually speed things up, but badly chosen indexes can make things slower. I don't think you want to get rid of them, but a lot of indexes can make inserts very slow. It is possible to disable indexes while inserting batches of data, in order to avoid updating the indexes on each insert.
If you will be selecting huge amounts of data in a way that might disturb the data collection, consider using a replicated slave database server that you use only for reading. Even if reads lock rows/tables there, the primary (master) database won't be affected, and the slave will catch up as soon as it is free to do so.
Do you need to process the data in the database? If possible, maybe collect all the data in the application and only insert ProcessedData.
You've not said what the structure of the data is, how it's consolidated, how promptly the data needs to be available to users, nor how lumpy the consolidation process can be.
However the most immediate problem will be sinking 5000 rows per second. You're going to need a very big, very fast machine (probably a sharded cluster).
If possible, I'd recommend writing a consolidating buffer (using an in-memory hash table, not in the DBMS) to put the consolidated data into, even if it's only partially consolidated, and then updating the ProcessedData table from this buffer rather than trying to populate it directly from RawData.
Indeed, I'd probably consider separating the raw and consolidated data onto separate servers/clusters (the MySQL FEDERATED engine is handy for providing a unified view of the data).
Have you analysed your queries to see which indexes you really need? (hint - this script is very useful for this).
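The consolidating-buffer suggestion above could be sketched like this: raw rows are aggregated in an in-memory hash table, and only the consolidated totals are written to ProcessedData in one batch. This is an illustration only (Python with sqlite3 standing in for MySQL; the grp/total schema and the sensor keys are invented):

```python
import sqlite3
from collections import defaultdict

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE ProcessedData (grp TEXT PRIMARY KEY, total INTEGER)")

buffer = defaultdict(int)  # in-memory hash table: group key -> partial total

def consume(grp, value):
    """Aggregate one raw row in memory instead of hitting the DB per row."""
    buffer[grp] += value

def flush():
    """Push the partially consolidated totals to the DB in one batch.
    (UPSERT syntax: SQLite >= 3.24; MySQL would use ON DUPLICATE KEY UPDATE.)"""
    cur.executemany(
        "INSERT INTO ProcessedData (grp, total) VALUES (?, ?) "
        "ON CONFLICT(grp) DO UPDATE SET total = total + excluded.total",
        list(buffer.items()))
    con.commit()
    buffer.clear()

# Simulate a burst of 10000 raw rows, then a single batched write.
for i in range(10000):
    consume("sensor%d" % (i % 3), 1)
flush()

print(cur.execute("SELECT total FROM ProcessedData WHERE grp = 'sensor0'").fetchone()[0])  # 3334
```

This turns 10000 per-row writes into a single batched statement per flush interval, which is the whole point of buffering outside the DBMS.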