Best way to merge data in MySQL

I have a table that is populated twice a day with data collected from other tables, and I use it to create some reports.
The steps to achieve this are the following: get all the data that will be placed in the table, truncate the table, and then insert all the data again.
Is this the best way to do it in terms of performance? Isn't there a way to update only the things that really changed, insert the new data, and skip the rest?

You can follow this for updating a table with a merge:
UPDATE multiple tables in MySQL using LEFT JOIN
To handle the UPDATE-or-INSERT case you can use this method:
Insert into a MySQL table or update if exists
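The method behind that second link is MySQL's INSERT ... ON DUPLICATE KEY UPDATE. A minimal sketch, assuming a hypothetical reports table with a unique key on id (without such a key the statement degrades to a plain insert):
-- existing ids are updated in place; new ids are inserted
INSERT INTO reports (id, total, updated_at)
VALUES (42, 1000, NOW())
ON DUPLICATE KEY UPDATE total = VALUES(total), updated_at = VALUES(updated_at);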
Obviously it all depends on your DB structure, what kind of data you are merging and, mainly, which keys are available.

CREATE TABLE `new` LIKE `real`;
INSERT INTO `new` ...;
RENAME TABLE `real` TO `old`, `new` TO `real`;
DROP TABLE `old`;
(The backticks matter: real is a reserved word in MySQL.)
Pros/cons:
You always have table `real`. (The RENAME is 'instantaneous' and 'atomic'.)
It is simple.
It may be slower than a combination of UPDATE and INSERT, but does that really matter, based on my first comment?
If there is more than you have let on, then see http://mysql.rjweb.org/doc.php/staging_table , which discusses other aspects of rapidly "ingesting" new data.

Related

MySQL: Best way to update a large table

I have a table with a huge amount of data. The source of the data is an external api. Every few hours, I need to sync the database so that the changes from the external api are reflected. I am doing a full sync (the api doesn't allow delta sync).
While the sync happens, I want to make sure that the data in the database stays available for reads. So I am following the steps below:
1. I have a column in the table which acts as a flag for whether or not the data is readable. Only the rows with the flag set are visible to reads.
2. I am inserting all the data from the api into the table.
3. Once all the data is written, I am deleting all the old data in the table, i.e. the rows with the flag set.
4. After the deletion, I am updating the table and setting the flag for all the rows.
The table has ~50 million rows and is expected to grow. There is a customerId field in the table, and the sync usually happens per customerId by passing it to the api.
My problem is that steps 3 and 4 above are taking a lot of time. The queries are something like:
Step 3 --> delete from foo where customer_id=12345678 and flag=1
Step 4 --> update foo set flag=1 where customer_id=12345678
I have tried partitioning the table on customer_id, and it works great when a customer_id has few rows, but for some customer_ids the number of rows in a single partition grows to ~5 million.
Around 90% of the data doesn't change between two syncs. How can I make this fast?
I was thinking of issuing just update queries instead of insert queries, and then checking whether anything was actually updated; if not, I could issue an insert query for the same row. That way any updates would be taken care of along with the inserts. But I am not sure whether this operation would block read queries while an update is in progress.
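For concreteness, that pattern would look something like this (payload and item_id are hypothetical column names; the affected-row count comes back through whatever driver issues the queries):
UPDATE foo SET payload = 'fresh value' WHERE customer_id=12345678 AND item_id=99;
-- if the driver reports 0 affected rows, fall back to an insert
INSERT INTO foo (customer_id, item_id, payload, flag) VALUES (12345678, 99, 'fresh value', 1);
One caveat: by default MySQL reports rows changed, not rows matched, so an update that matches a row but leaves it identical also reports 0; INSERT ... ON DUPLICATE KEY UPDATE avoids that ambiguity in a single statement.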
For your setup (read-only data, full sync), the fastest way to update the table is to not update at all, but to import the data into a different table and rename it afterwards to make it the new table.
Create a table like your original table, e.g. use
create table foo_import like foo;
If you have e.g. triggers, add them too.
From now on, let the import api write its (full) sync to this new table.
After a sync is done, swap the two tables:
RENAME TABLE foo TO foo_tmp,
foo_import TO foo,
foo_tmp TO foo_import;
It will (literally) just require a second.
This command is atomic: it will wait for transactions that access these tables to finish, it will never present a situation where there is no table foo, and it will fail completely (and do nothing) if one of the tables doesn't exist or foo_tmp already exists.
As a final step, empty your import table (that now contains your old data) to be ready for your next import:
truncate foo_import;
This will again just require a second.
The rest of your queries probably assume that flag=1. Until (if at all) you update the code to not use the flag anymore, you can set its default value to 1 to keep things compatible, e.g. use
alter table foo modify column flag tinyint default 1;
Since you don't have foreign keys, this doesn't have to bother you, but for others with a similar problem it might be useful to know that foreign keys get adjusted: foreign keys that referenced foo will reference foo_import after the tables are renamed. To make them point to the new foo again, they have to be dropped and recreated. Everything else (e.g. views, queries, procedures) resolves by the current name, so it will always access the current foo.
CREATE TABLE `new` LIKE `real`;
Load `new` by whatever means you have; take as long as needed.
RENAME TABLE `real` TO `old`, `new` TO `real`;
DROP TABLE `old`;
The RENAME is atomic and "instantaneous"; `real` is "always" available.
(I don't see the need for flag.)
OR...
Since you are actually updating a chunk of a table, consider these...
If the chunk is small...
Load the new data into a tmp table
DELETE the old rows
INSERT ... SELECT ... to move the new rows in. (Having the new data already in a table is probably the fastest way to achieve this.)
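A minimal sketch of that small-chunk recipe, using the question's foo table and a hypothetical staging table foo_chunk with the same layout:
-- the fresh rows for one customer are already loaded into foo_chunk
DELETE FROM foo WHERE customer_id = 12345678;
INSERT INTO foo SELECT * FROM foo_chunk WHERE customer_id = 12345678;
With InnoDB, wrapping both statements in one transaction keeps readers from seeing the chunk half-replaced.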
If the chunk is big, and you don't want to lock the table for "too long", there are some other tricks. But first, is there some form of unique row number for each row for the customer? (I'm thinking about batch-moving a bunch of rows at a time, but need more specifics before spelling it out.)

Performance comparison for SELECT and UPDATE when updating a table partially

There's a table that needs to be updated. However, the amount of data changed (comparing the fresh data we got with what is in the database) is unknown.
I can think of two ways to implement this.
1. Select all the data and compare it on the web server, then update only the rows that changed.
2. Simply update all the data.
I guess there's a performance borderline between them. If the number of affected rows is, let's say, less than 1,000, then maybe method 2 is better.
My question is:
Is there a general criteria for this?
Can SELECT operations generally be compared with UPDATE operations?
Suppose the database is MySQL, if needed.
If you are replacing the entire table (possibly with mostly the same data), it is fairly straightforward to do it this way, and not worry about which approach to take:
CREATE TABLE `new` LIKE `real`;
Load the new data entirely into `new`
RENAME TABLE `real` TO `old`, `new` TO `real`; -- atomic and instantaneous (no downtime)
DROP TABLE `old`;
If only some of the rows are available, load them into a temp table, then do a multi-table UPDATE to transfer any new values into the real table.
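Such a multi-table UPDATE might look like the following; tmp, id, and the column names are hypothetical:
-- copy changed values from the staging table into the live table
UPDATE `real` r JOIN tmp t ON t.id = r.id
SET r.col1 = t.col1, r.col2 = t.col2;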
If your new data might contain new rows, then you need another step to locate those rows and INSERT ... SELECT ... LEFT JOIN ... them into the real table; a sketch follows.
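With the same hypothetical names, that insert step could be:
-- rows in tmp with no match in the live table are the new ones
INSERT INTO `real` (id, col1, col2)
SELECT t.id, t.col1, t.col2
FROM tmp t
LEFT JOIN `real` r ON r.id = t.id
WHERE r.id IS NULL;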
Please provide more details if you need further discussion.

Duplicate data in SQL

I have to keep duplicate data in my database, so my question is: is it preferable to keep the duplicate data in the same table and just add a column to identify the original data, or do I have to create another table to hold the copied data?
I suggest saving the duplicate data in a different table or even a different schema, so it won't be confusing to keep working with this table.
Imagine yourself six months from now trying to guess what all these duplicate rows are for.
In addition, those duplicate rows do not reflect the business purpose of this table.
It would be nicer to store them in a table named [table_name]_dup or a schema named [schema_name]_dup.
To create a backup you should read this
To duplicate a website with its content: a bad solution, but you still have to make a backup and restore it into a different database.
Duplicate a table in MySQL:
CREATE TABLE newtable LIKE oldtable;
INSERT INTO newtable SELECT * FROM oldtable;

How to insert a new column in a huge MySQL database table?

I have this table in a MySQL database which has about 10 million records/rows. I want to insert a new column in the table. However, a simple add-column query doesn't seem to work well for me.
This is what I have tried,
ALTER TABLE contacts ADD processed INT(11);
I waited for about 5 hours, but nothing happened. Is there any way to insert a new column in such a huge table?
I hope my question is clear. Any help would be appreciated.
If it's production:
You should use pt-online-schema-change of Percona Toolkit.
pt-online-schema-change emulates the way that MySQL alters tables internally, but it works on a copy of the table you wish to alter. This means that the original table is not locked, and clients may continue to read and change data in it.
pt-online-schema-change works by creating an empty copy of the table to alter, modifying it as desired, and then copying rows from the original table into the new table. When the copy is complete, it moves away the original table and replaces it with the new one. By default, it also drops the original table.
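A typical invocation for the contacts example might look like this (a sketch; the D= database name mydb is an assumption, and a --dry-run pass first is advisable):
pt-online-schema-change --alter "ADD COLUMN processed INT" D=mydb,t=contacts --execute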
Or oak-online-alter-table which is part of openark kit
oak-online-alter-table allows for non blocking ALTER TABLE operations, table rebuilds and creating a table's ghost.
Altering tables will be slower, but it doesn't lock tables.
If it's not production and downtime is okay, try this approach:
CREATE TABLE contacts_tmp LIKE contacts;
ALTER TABLE contacts_tmp ADD COLUMN processed INT UNSIGNED NOT NULL;
INSERT INTO contacts_tmp (contact_table_fields) SELECT * FROM contacts;
RENAME TABLE contacts TO contacts_old, contacts_tmp TO contacts;
DROP TABLE contacts_old;

How do I efficiently change a MySQL table structure on a table with millions of entries?

I have a MySQL database that is up to about 17 GB in size and has 38 million entries. At the moment, I need to both increase the size of one column (varchar 40 to varchar 80) and add more columns.
Many of the fields are indexed, including the one that I need to change. It is part of a unique pair that is necessary for the applications to work. When I attempted to just make the change yesterday, the query ran for almost four hours without finishing, at which point I decided to cut the outage short and just bring the service back up.
What is the most efficient way to make changes to something of this size?
Many of these entries are also old, and if there is a good way to shard off older entries while still keeping them available, that might help with this problem by making the table a much more manageable size.
You have some choices.
In any case you should take a backup before you do this stuff.
One possibility is to take your service offline and do it in place, as you have tried. If you do that you should disable key checks and constraints.
ALTER TABLE bigtable DISABLE KEYS;
SET FOREIGN_KEY_CHECKS=0;
ALTER TABLE (whatever);
ALTER TABLE (whatever else);
...
SET FOREIGN_KEY_CHECKS=1;
ALTER TABLE bigtable ENABLE KEYS;
This will allow the ALTER TABLE operation to go faster. It will regenerate the indexes all at once when you do ENABLE KEYS.
Another possibility is to create a new table with the new schema you want, then disable the keys on the new table, then do as #Bader suggested and insert the contents of the old table.
After your new table is built you will re-enable the keys on it, then rename the old table to some name like "old_bigtable" then rename the new table to "bigtable".
It's possible that you can keep your service online while you're populating the new table. But that might work poorly.
A third possibility is to dump your giant table to a flat file and then load it into a new table with the new layout. That is pretty much like the second possibility, except that you get a table backup for free. You can make this go pretty fast with SELECT ... INTO OUTFILE and LOAD DATA INFILE. You'll need access to your server machine's file system to do this.
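A sketch of that dump-and-reload path, with hypothetical names and a file path on the database server (the MySQL server itself writes and reads the file, so it needs filesystem access and the FILE privilege):
-- dump the current table to a tab-delimited flat file on the server
SELECT * FROM bigtable INTO OUTFILE '/tmp/bigtable_dump.tsv';
-- create the new layout, then bulk-load the dump into it
CREATE TABLE bigtable_new (whatever new layout);
LOAD DATA INFILE '/tmp/bigtable_dump.tsv' INTO TABLE bigtable_new;
If the new layout adds columns, give LOAD DATA an explicit column list so the dumped fields line up.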
In all cases, disable, then re-enable, the constraints and keys to get things to go fast.
Create a new table with the new structure you want with a different name for example NewTable.
Then insert data into this new table from the old table using the following query:
INSERT INTO NewTable (field1, field2, etc...) SELECT field1, field2, ... FROM OldTable;
After this is done, you can drop the old table and rename the new table to the original name:
DROP TABLE `OldTable`;
RENAME TABLE `NewTable` TO `OldTable` ;
I have tried this approach on a very large table and it's much much faster than altering the table.
With MySQL 5.1, and again with 5.5, certain ALTER statements were enhanced to modify just the structure without rewriting the entire table (http://dev.mysql.com/doc/refman/5.5/en/alter-table.html - search for "in-place"). The availability of this varies by the type of change you are making and the engine in use; the most value comes from the InnoDB Plugin. In the case of your specific changes, though, the entire table would be rewritten.
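For readers on 5.6 or later, the in-place algorithm can also be requested explicitly, in which case the statement fails up front instead of silently copying the table (the notes column here is just an illustration):
-- errors out if this ALTER can't run in place with concurrent DML
ALTER TABLE bigtable ADD COLUMN notes VARCHAR(80), ALGORITHM=INPLACE, LOCK=NONE;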
When we encounter these issues, we typically try to leverage replica databases. As long as you are adding and not removing, you can run your DDL against the replica first and then schedule a brief outage to promote the replica to the master role. If you happen to be on RDS, this is even one of their suggested uses for replica instances: http://aws.amazon.com/about-aws/whats-new/2012/10/11/amazon-rds-mysql-rr-promotion/.
Some other alternatives include:
Selecting a subset of records out into a new table with the desired structure (use INTO OUTFILE to avoid a table lock). Once complete, you can schedule a maintenance window and REPLACE INTO or UPDATE any records that have changed in the origin table since the initial data copy. Once the update is complete, a RENAME TABLE ... of both tables wraps the changes up; see the sketch after this list.
Using a tool like Percona's pt-online-schema-change: http://www.percona.com/doc/percona-toolkit/2.1/pt-online-schema-change.html. This tool works with triggers so if you already have triggers on the tables you want to change this may not fit your needs.
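A condensed sketch of the first alternative above (bigtable_new, the column list, and the updated_at change marker are all hypothetical; REPLACE INTO needs a unique key to spot existing rows):
-- 1. copy a subset into the new structure while the origin stays online
INSERT INTO bigtable_new (id, col1, col2) SELECT id, col1, col2 FROM bigtable WHERE id <= 30000000;
-- 2. in the maintenance window, catch up rows changed since the copy
REPLACE INTO bigtable_new (id, col1, col2) SELECT id, col1, col2 FROM bigtable WHERE updated_at >= '2013-01-01';
-- 3. swap the tables
RENAME TABLE bigtable TO bigtable_old, bigtable_new TO bigtable;
As the bullet notes, SELECT ... INTO OUTFILE plus LOAD DATA can replace step 1 to avoid locking the origin table during the copy.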