Update or insert into a MySQL table with 60 million entries

I have a MySQL database with a table of around 60 million entries and a primary key, say 'x'. I also have a dataset (a CSV file) with around 60 million entries, keyed on the same 'x'. For values of 'x' common to both the table and the dataset, the corresponding rows in the table just get a counter column incremented; the entries that exist only in the dataset are to be inserted.
A simple serial execution, in which we try to update each entry if present or else insert it, takes around 8 hours to complete. What can I do to speed up this whole procedure?

Plan A: IODKU (INSERT ... ON DUPLICATE KEY UPDATE), as #Rogue suggested; see the sketch after this list.
Plan B: Two SQL statements (an UPDATE plus an INSERT); they might run faster because part of the 8 hours is spent gathering a huge amount of undo information in case of a crash. The normalization section comes close to those 2 queries.
Plan C: Walk through the pair of tables, using the PRIMARY KEY of one of them to do IODKU in chunks of, say, 1000 rows. See my Chunking code (and adapt it from DELETE to IODKU).
In Plans B and C, turn on autocommit so that you don't build up a huge undo log in a single transaction.
Plan D: Build a new table as you merge the two tables with a JOIN. Finish with an atomic rename:
RENAME TABLE real TO old, new TO real;
DROP TABLE old;   -- when happy with the result
Plan E: Plan D + Chunking of the INSERT ... SELECT real JOIN tmp ...
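
A minimal sketch of Plan A, assuming the CSV is first loaded into a staging table; the names main, staging, x and counter are placeholders rather than anything from the question:

LOAD DATA LOCAL INFILE '/path/to/dataset.csv'
INTO TABLE staging
FIELDS TERMINATED BY ','
(x);

-- one pass: insert the new keys, bump the counter for keys that already exist
INSERT INTO main (x, counter)
SELECT x, 1 FROM staging
ON DUPLICATE KEY UPDATE counter = counter + 1;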

Deleting a billion records in a range vs exact ID lookup in MySQL

I have a database table which is around 700 GB with 1 billion rows; the data is approximately 500 GB and the index is 200 GB.
I am trying to delete all the data from before 2021. Roughly 298,970,576 rows are from 2021, and 708,337,583 older rows remain to be deleted.
To delete them I am running a non-stop query in my Python shell:
DELETE FROM table_name WHERE id < 1762163840 LIMIT 1000000;
id = 1762163840 marks the start of the 2021 data. Deleting 1 million rows takes almost 1200-1800 seconds.
Is there any way I can speed this up? The current approach has been running for more than 15 days, not much data has been deleted so far, and it is going to take many more days.
I thought that I could make a table with just the ids of all the records that I want to delete and then do an exact match like
DELETE FROM table_name WHERE id IN (SELECT id FROM _tmp_table_name);
Will that be fast? Is it going to be faster than first making a new table with all the records and then deleting the old one?
The database is set up on RDS; the instance class is db.r3.large (2 vCPU and 15.25 GB RAM), with only 4-5 connections running.
I would suggest recreating the data you want to keep -- if you have enough space:
create table keep_data as
select *
from table_name
where id >= 1762163840;
Then you can truncate the table and re-insert new data:
truncate table table_name;
insert into table_name
select *
from keep_data;
This will recreate the index.
The downside is that this will still take a while to re-insert the data (renaming keep_data would be faster). But it should be much faster than deleting the rows.
AND . . . this will give you the opportunity to partition the table so future deletes can be handled much faster. You should look into table partitioning if you have such a large table.
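
A rough sketch of what such partitioning could look like once the table is rebuilt (the partition names and boundaries are illustrative, and ALTER ... PARTITION BY itself rewrites the table, so it is best done on the freshly rebuilt, smaller table):

ALTER TABLE table_name
PARTITION BY RANGE (id) (
    PARTITION p_pre_2021 VALUES LESS THAN (1762163840),
    PARTITION p_2021     VALUES LESS THAN MAXVALUE
);

-- later, removing a whole range becomes a near-instant metadata operation
ALTER TABLE table_name DROP PARTITION p_pre_2021;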
Multiple techniques for big deletes: http://mysql.rjweb.org/doc.php/deletebig
It points out that LIMIT 1000000 is unnecessarily big and causes more locking than might be desirable.
In the long run, PARTITIONing would be beneficial; it mentions that, too.
If you use Gordon's technique (rebuilding the table with only what you need), you lose access to the table for a long time; I provide an alternative that has essentially zero downtime.
id IN (SELECT ...) can be terribly slow -- both because of the inefficiency of the IN (SELECT ...) construct and because the DELETE will hang on to a huge number of rows for transactional integrity.
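
As a hedged sketch of the chunked approach that link describes, much smaller batches keep each transaction (and its locks) short; the batch size here is arbitrary:

-- repeat from a client script until the statement reports 0 rows affected
DELETE FROM table_name
WHERE id < 1762163840
LIMIT 10000;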

MySQL performance issues in queries

I'm a newbie using MySQL. I'm reviewing a table that has around 200,000 records. When I execute a simple
SELECT * FROM X WHERE Serial=123
it takes a long time, around 15-30 seconds, to return a response (with 200,000 rows).
Before adding an index it took around 50 seconds (with 7 million rows) to return a simple SELECT ... WHERE statement.
This table grows every day; right now it has 7 million rows. I added an index in the following way:
ALTER TABLE `X` ADD INDEX `index_name` (`serial`)
Now it takes 109 seconds to return a response.
Which initial approaches should I apply to this table to improve the performance?
Is MySQL the right tool to handle big tables with around 5-10 million records, or should I move to another tool?
Assuming serial is some kind of numeric datatype...
You do ADD INDEX only once. Normally, you would have foreseen the need for the index and added it very cheaply when you created the table.
Now that you have the index on serial, that select, with any value other than 123, will run very fast.
If there is only one row with serial = 123, the indexed table will spit out the row in milliseconds whether it has 7 million rows or 7 billion.
If serial = 123 shows up in 1% of the table, then finding all 70M rows (out of 7B) will take much longer than finding all 70K rows (out of 7M).
Indexes are your friends!
If serial is a VARCHAR, then...
Plan A: Change serial to be a numeric type (if appropriate), or
Plan B: Put quotes around 123 so that you are comparing strings to strings!
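
For illustration, the two plans could look like this (the numeric type in Plan A is a guess, and the ALTER rewrites the 7-million-row table, so it takes a while):

-- Plan A: store Serial as a number so the comparison needs no conversion
ALTER TABLE `X` MODIFY `Serial` BIGINT UNSIGNED NOT NULL;

-- Plan B: keep it as VARCHAR, but compare strings to strings
SELECT * FROM `X` WHERE `Serial` = '123';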

How to handle large amounts of data in MySQL database?

Background
I have spent a couple of days trying to figure out how I should handle large amounts of data in MySQL. I have selected some programs and techniques for the new server for the software: I am probably going to use Ubuntu 14.04 LTS running nginx and Percona Server, with TokuDB for the 3 tables I have planned and InnoDB for the rest of the tables.
But I still have the major problem unresolved: how should I handle the huge amount of data in the database?
Data
My estimate for the incoming data is about 500 million rows a year; I will be receiving measurement data from the sensors every 4 minutes.
Requirements
Insertion speed is not very critical, but I want to be able to select a few hundred measurements in 1-2 seconds. The amount of required resources is also a key factor.
Current plan
Now I have thought of splitting the sensor data into 3 tables.
EDIT:
On every table:
id = PK, AI
sensor_id will be indexed
CREATE TABLE measurements_minute(
id bigint(20),
value float,
sensor_id mediumint(8),
created timestamp
) ENGINE=TokuDB;
CREATE TABLE measurements_hour(
id bigint(20),
value float,
sensor_id mediumint(8),
created timestamp
) ENGINE=TokuDB;
CREATE TABLE measurements_day(
id bigint(20),
value float,
sensor_id mediumint(8),
created timestamp
) ENGINE=TokuDB;
So I would store this 4-minute data for one month. Once the data is 1 month old, it would be deleted from the minute table, and hourly averages would be calculated from the minute values and inserted into the measurements_hour table. Then, when the data is 1 year old, all the hour data would be deleted and daily averages would be stored in the measurements_day table.
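
A rough sketch of that monthly roll-up, assuming id is AUTO_INCREMENT as in the edit above (the hour bucketing via DATE_FORMAT is just one possible way to do it):

-- roll minute rows older than a month up into hourly averages
INSERT INTO measurements_hour (value, sensor_id, created)
SELECT AVG(value), sensor_id,
       DATE_FORMAT(created, '%Y-%m-%d %H:00:00') AS hour_start
FROM measurements_minute
WHERE created < NOW() - INTERVAL 1 MONTH
GROUP BY sensor_id, hour_start;

-- then remove the rolled-up minute rows
DELETE FROM measurements_minute
WHERE created < NOW() - INTERVAL 1 MONTH;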
Questions
Is this considered a good way of doing this? Is there something else to take into consideration? How about table partitioning, should I do that? How should I implement the splitting of the data into the different tables? Triggers and procedures?
EDIT: My ideas
Any idea if MonetDB or Infobright would be any good for this?
I have a few suggestions, and further questions.
You have not defined a primary key on your tables, so MySQL will create one automatically. Assuming that you meant for "id" to be your primary key, you need to change the line in all your table create statements to be something like "id bigint(20) NOT NULL AUTO_INCREMENT PRIMARY KEY,".
You haven't defined any indexes on the tables; how do you plan on querying them? Without indexes, all queries will be full table scans and likely very slow.
Lastly, for this use-case, I'd partition the tables to make the removal of old data quick and easy.
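
For example, the minute table might be declared along these lines (a sketch only: the secondary index, the monthly ranges, and whether TokuDB accepts this partitioning syntax are assumptions to verify; note that the PRIMARY KEY has to include the partitioning column):

CREATE TABLE measurements_minute (
    id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    value     FLOAT,
    sensor_id MEDIUMINT UNSIGNED NOT NULL,
    created   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id, created),
    KEY idx_sensor_created (sensor_id, created)
) ENGINE=TokuDB
PARTITION BY RANGE (UNIX_TIMESTAMP(created)) (
    PARTITION p2014_07 VALUES LESS THAN (UNIX_TIMESTAMP('2014-08-01')),
    PARTITION p2014_08 VALUES LESS THAN (UNIX_TIMESTAMP('2014-09-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- dropping a month of old data is then a quick operation:
-- ALTER TABLE measurements_minute DROP PARTITION p2014_07;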
I had to solve that type of problem before, with nearly a million rows per hour.
Some tips:
Engine: MyISAM. You don't need updates or transactions on those tables; you are just going to insert the values, select them, and eventually delete them.
Be careful with the indexes. In my case insertion speed was critical, and sometimes the MySQL queue was full of pending inserts; an insert takes more time the more indexes the table has. Which indexes you need depends on the values you calculate and on when you calculate them.
Shard your buffer tables. I only triggered the calculations when a buffer table was ready: while I was calculating the values in the buffer_a table, the insertions were going to buffer_b (a sketch of this rotation appears at the end of this answer). In my case I calculated the values every day, so I switched the destination table every day. In fact, I dumped all the data and exported it to another database to compute the averages and do the other processing without disturbing the inserts.
I hope you find this helpful.
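
A small sketch of that buffer rotation, reusing the buffer_a / buffer_b names from this answer and the measurements_day table from the question (the aggregation itself is an assumption about what gets calculated):

-- while today's inserts land in buffer_b, aggregate yesterday's buffer_a
INSERT INTO measurements_day (value, sensor_id, created)
SELECT AVG(value), sensor_id, DATE(created) AS day_start
FROM buffer_a
GROUP BY sensor_id, day_start;

-- then empty buffer_a so it can become tomorrow's insert target
TRUNCATE TABLE buffer_a;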

Incorrect key file for table; try to repair it, when merging very large MySQL tables into one

I am using MySQL server 5.0.67 on Windows Vista 32-bit with 3 GB of RAM...
I have a 4.5 million row MyISAM table (Table A) and I have created a C# .NET program which goes through each of these rows, extracts certain information and populates another MyISAM table (Table B). For each row in Table A, I gather about 80 rows for Table B. Table B's structure is as follows: Field1 (integer), Field2 (bit), Field3 (varchar(3)), Field4 (mediumint(8)), Field5 (mediumint(8)), Field6 (integer), and there is a unique key on the combination of Field1, Field2 and Field3.
I insert into Table B with the IGNORE ON DUPLICATE KEY UPDATE ... clause.
When I looped through all the rows from A, the program started off OK (about 50 rows a second), but after about 120,000 rows were processed it became painstakingly slow. So I decided to split the inserting of rows across 500 smaller tables (rather than just Table B) with the same structure as Table B. This completed in about 1 day.
So then I tried to merge all 500 tables into Table B (which up until now was empty). I created a script which ran a stored procedure doing an INSERT INTO B SELECT * FROM (one of those 500 tables) IGNORE ON DUPLICATE KEY UPDATE... I did this for a while, and eventually decided to group about 150 calls to this stored procedure with their respective tables and let it run overnight. The problem is that at a certain point, after about 100 merges, I got the error "Incorrect key file for table B.MYI; try to repair it".
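
For reference, each per-table merge described here would be of this general form (the table names and the choice of what to update on a duplicate key are placeholders, not the actual code):

INSERT INTO TableB (Field1, Field2, Field3, Field4, Field5, Field6)
SELECT Field1, Field2, Field3, Field4, Field5, Field6
FROM split_table_001
ON DUPLICATE KEY UPDATE
    Field4 = VALUES(Field4),
    Field5 = VALUES(Field5),
    Field6 = VALUES(Field6);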
Could this be because the temporary file grew too big? Or is there any other reason why the index became corrupted? Maybe because it was dealing with more than 2 GB of data? Would running batches of 5 merges (rather than 150) be a wiser solution? Repairing the table is not an option for me, as I have no way of knowing where in the smaller tables it erred, so I would have to start again and make sure it works. I am now thinking about upgrading to MySQL 5.5.19 using an .msi file, but I do not know if it is worth the hassle. I just want to populate this Table B and then dump it and move it elsewhere. I still have the original 500 tables, so if someone can point me in the right direction for merging them into 1, that would be great!
Thanks in advance,
Tim

MySQL Analyze and Optimize - Are they required if only inserts - and the table has no joins?

I have a MyISAM table in MySQL which consists of two fields (f1 integer unsigned, f2 integer unsigned) and contains 320 million rows. I have an index on f2. Every week I insert about 150,000 rows into this table. I would like to know how frequently I need to run "analyze" and "optimize" on this table (as they would probably take a long time and block access in the meantime). I do not run any delete or update statements, just the weekly inserts of new rows. Also, I am not using this table in any joins, so, based on this information, are "analyze" and "optimize" really required?
Thanks in advance,
Tim
ANALYZE TABLE analyzes and stores the key distribution (the index statistics the optimizer uses); OPTIMIZE TABLE reorganizes the table's data and index storage.
If you never... ever... delete or update the data in your table, and only insert new rows, you won't need analyze or optimize.
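
If you ever do want to refresh the index statistics after the weekly load anyway, it is a single statement (the table name is a placeholder):

ANALYZE TABLE my_table;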