I am a developer working on performance improvements for MySQL queries. The flow is a full load in two stages: the first query reads the FILES table and loads it into the TABLE_STAGE table after a full delete, and the second stage reads from TABLE_STAGE into TABLE_MAIN, again deleting everything first and then selecting all the records.
DELETE FROM TABLE_STAGE;
INSERT INTO TABLE_STAGE SELECT * FROM FILES;
DELETE FROM TABLE_MAIN;
INSERT INTO TABLE_MAIN SELECT * FROM TABLE_STAGE;
As a first step I replaced the DELETE with TRUNCATE. It improved performance immediately, but when I switched back to DELETE the performance stayed the same; the time did not go back up. I don't understand the reason why it is showing the same result.
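The revised flow is just the sketch above with the deletes swapped for truncates (table names as in the original flow):
TRUNCATE TABLE TABLE_STAGE;
INSERT INTO TABLE_STAGE SELECT * FROM FILES;
TRUNCATE TABLE TABLE_MAIN;
INSERT INTO TABLE_MAIN SELECT * FROM TABLE_STAGE;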
Don't use DELETE, it is slow.
CREATE TABLE `new` LIKE `real`;
LOAD DATA INFILE '...' INTO TABLE `new`;   -- or use batched INSERTs
RENAME TABLE `real` TO `old`,
             `new` TO `real`;
DROP TABLE `old`;
(I don't understand where your FILES table is used.)
Related
I have a set of tables in a MySQL database which contain a set of related data (50,000 rows total, so low volume) and which are accessed all the time (7 million times/day). Periodically (let's say once a day) I need to update ALL the data in all the tables (a full refresh).
I'm considering 2 possibilities:
use transactions, but I'm not sure how it will work with reads/locks
use versioning: add a version column in all tables and set all rows from the same "publication" to the same version. The next publication will use version+1, and the lower-version rows can then be deleted. The current version is stored in a parameter table, allowing the reading query to always pick the latest available version.
Has anybody experimented with both solutions? Or is there a different/better solution?
Thanks
Replacing an entire table
CREATE TABLE `new` LIKE `real`;
-- populate `new` with the new stuff (the slow part)
RENAME TABLE `real` TO `old`,
             `new` TO `real`;   -- atomic and fast
Replacing an entire database: do the above for each table, but hold off on the RENAMEs until all the other work is done. Then do all of them in a single RENAME TABLE statement, as sketched below.
No locking, no transactions, no nothing.
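A sketch of that final step, assuming two tables t1 and t2 (names are illustrative) whose replacements were built as t1_new and t2_new:
RENAME TABLE t1 TO t1_old, t1_new TO t1,
             t2 TO t2_old, t2_new TO t2;
DROP TABLE t1_old, t2_old;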
I have a table with huge amount of data. The source of data is an external api. Every few hours, I need to sync the database so that the changes are up to date from the external api. I am doing a full sync (api doesn't allow delta sync).
While sync happens, I want to make sure that the data from the database is also available for read. So, I am following below steps:
I have a column in the table which acts as a flag for whether or not the data is readable. Only rows with the flag set are considered readable.
I am inserting all the data from the api into the table.
Once all the data is written, I am deleting all the data in the table with flag set.
After deletion, I am updating the table and setting the flag for all the rows.
The table has around 50 million rows and is expected to grow. There is a customerId field in the table. Sync usually happens per customerId by passing it to the api.
My problem is that steps 3 and 4 above are taking a lot of time. The queries are something like:
Step 3 --> delete from foo where customer_id=12345678 and flag=1
Step 4 --> update foo set flag=1 where customer_id=12345678
I have tried partitioning the table on customer_id and it works well when a customer_id has few rows, but for some customer_ids the number of rows in a single partition reaches ~5 million.
Around 90% of data doesn't change between two syncs. How can I make this fast?
I was thinking of using update queries instead of insert queries and then checking whether anything was actually updated; if not, I could issue an insert query for the same row. That way updates would be handled along with inserts. But I am not sure whether read queries on this data would be blocked while the update is in progress.
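In MySQL this update-or-else-insert idea can be collapsed into a single upsert statement, provided there is a unique key to match on; a minimal sketch, assuming a hypothetical unique key on (customer_id, external_id) and a payload column:
INSERT INTO foo (customer_id, external_id, payload, flag)
VALUES (12345678, 'ext-1', '...', 1)
ON DUPLICATE KEY UPDATE
    payload = VALUES(payload),
    flag    = VALUES(flag);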
For your setup (read-only data, full sync), the fastest way to update the table is not to update it at all, but to import the data into a different table and rename it afterwards to make it the new table.
Create a table like your original table, e.g. use
create table foo_import like foo;
If you have e.g. triggers, add them too.
From now on, let the import api write its (full) sync to this new table.
After a sync is done, swap the two tables:
RENAME TABLE foo TO foo_tmp,
             foo_import TO foo,
             foo_tmp TO foo_import;
It will (literally) just require a second.
This command is atomic: it will wait for transactions that access these tables to finish, it will never present a situation where there is no table foo, and it will fail completely (doing nothing) if one of the tables doesn't exist or foo_tmp already exists.
As a final step, empty your import table (that now contains your old data) to be ready for your next import:
truncate foo_import;
This will again just require a second.
The rest of your queries probably assume that flag=1. Until (if at all) you update the code to stop using the flag, you can set its default value to 1 to stay compatible, e.g. use
alter table foo modify column flag tinyint default 1;
Since you don't have foreign keys, this doesn't affect you, but for others with a similar problem it might be useful to know that foreign keys get adjusted: foreign keys that referenced foo will reference foo_import after the tables are renamed. To make them point to the new foo again, they have to be dropped and recreated. Everything else (e.g. views, queries, procedures) resolves by the current name, so it will always access the current foo.
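A sketch of that fix-up, assuming a hypothetical child table bar whose foreign key fk_bar_foo now points at foo_import:
ALTER TABLE bar DROP FOREIGN KEY fk_bar_foo;
ALTER TABLE bar ADD CONSTRAINT fk_bar_foo
    FOREIGN KEY (foo_id) REFERENCES foo (id);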
CREATE TABLE `new` LIKE `real`;
-- load `new` by whatever means you have; take as long as needed
RENAME TABLE `real` TO `old`, `new` TO `real`;
DROP TABLE `old`;
The RENAME is atomic and "instantaneous"; real is "always" available.
(I don't see the need for flag.)
OR...
Since you are actually updating a chunk of a table, consider these...
If the chunk is small...
Load the new data into a tmp table
DELETE the old rows
INSERT ... SELECT ... to move the new rows in. (Having the new data already in a table is probably the fastest way to achieve this.)
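A minimal sketch of those three steps, assuming a hypothetical staging table foo_chunk holding the fresh rows for one customer:
CREATE TABLE foo_chunk LIKE foo;
-- load foo_chunk with the new data for this customer, then:
DELETE FROM foo WHERE customer_id = 12345678;
INSERT INTO foo SELECT * FROM foo_chunk;
DROP TABLE foo_chunk;   -- or TRUNCATE it for reuse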
If the chunk is big, and you don't want to lock the table for "too long", there are some other tricks. But first, is there some form of unique row number for each row for the customer? (I'm thinking about batch-moving a bunch of rows at a time, but need more specifics before spelling it out.)
There's a table that needs to be updated. However, the amount of data changed (comparing the fresh data we got with what is in the database) is unknown.
I can think of two ways to implement this.
Select all the data and compare it on the web server, then update only the rows that changed.
Simply update all data.
I guess there's a performance borderline between them. If the number of affected rows is, let's say, less than 1,000, then maybe method 2 is better.
My question is:
Is there a general criterion for this?
How does the cost of a SELECT-and-compare generally compare with just running UPDATEs?
Suppose the database is MySQL, if needed.
If you are replacing the entire table (possibly with mostly the same data), it is fairly straightforward to do it this way, without worrying about which approach to pick:
CREATE TABLE `new` LIKE `real`;
-- load the new data entirely into `new`
RENAME TABLE `real` TO `old`, `new` TO `real`;   -- atomic and instantaneous (no downtime)
DROP TABLE `old`;
If only part of the rows are available, load them into a temp table, then do a multi-table UPDATE to transfer any new values into the real table.
If your new data might have new rows, then you need another step to locate the new rows (e.g. with a LEFT JOIN) and INSERT ... SELECT them into the real table.
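A sketch of both steps, assuming a hypothetical staging table tmp and real table real_table, each keyed by id and carrying a value column:
-- transfer changed values for rows that already exist
UPDATE real_table r
JOIN tmp t ON t.id = r.id
SET r.value = t.value;
-- insert rows that exist only in the staging table
INSERT INTO real_table (id, value)
SELECT t.id, t.value
FROM tmp t
LEFT JOIN real_table r ON r.id = t.id
WHERE r.id IS NULL;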
Please provide more details if you need further discussion.
Using a MySQL DB, I am having trouble with a stored procedure and event timer that I created.
I made an empty table that gets populated with data from another via SELECT INTO.
Prior to populating, I TRUNCATE the current data. It's used to track only log entries that occur within 2 months from the current date.
This turns a 350k+ row log table into about 750 rows, which really speeds up reporting queries.
The problem is that if a client sends a query precisely between the TRUNCATE statement and the SELECT INTO statement (quite likely, considering the EVENT is set to run every minute), the query returns no rows...
I have looked into taking a read lock on the table while this PROCEDURE is run, but locks are not allowed in STORED PROCEDURES.
Can anyone come up with a workaround that (preferably) doesn't require a remodel?
I really need to be pointed in the right direction here.
Thanks,
Max
I'd suggest an alternate approach instead of truncating the table, and then selecting into it...
You can instead select your new data set into a new table. Next, using a single RENAME command, rename the new table to the existing table and the existing table to some backup name.
RENAME TABLE existing_table TO backup_table, new_table TO existing_table;
This is a single, atomic operation... so it wouldn't be possible for the client to read from the table after it is emptied but before it is re-populated.
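A sketch of the statement sequence for each run (table and column names here are illustrative; the population query is whatever your current SELECT INTO uses):
-- rebuild the fresh data set in a work table left over from the previous run
TRUNCATE TABLE recent_log_new;
INSERT INTO recent_log_new
SELECT * FROM full_log
WHERE log_time >= CURDATE() - INTERVAL 2 MONTH;
-- swap it in atomically; readers always see either the old or the new data
RENAME TABLE recent_log TO recent_log_old,
             recent_log_new TO recent_log,
             recent_log_old TO recent_log_new;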
Alternatively, you could change your TRUNCATE to a DELETE FROM, and then wrap it in a transaction along with the repopulating INSERT ... SELECT:
START TRANSACTION;
DELETE FROM YourTable;
INSERT INTO YourTable SELECT ...;   -- your existing population query
COMMIT;
I have a table with 10+ million rows. I need to create an index on a single column, however, the index takes so long to create that I get locks against the table.
It may be important to note that the index is being created as part of a 'rake db:migrate' step... I'm not averse to creating the index manually if that will work.
UPDATE: I suppose I should have mentioned that this is a write-often table.
MySQL's NDBCLUSTER engine can create an index online without locking writes to the table. However, the most widely used InnoDB engine does not support this feature. Another free and open-source DB, Postgres, supports CREATE INDEX CONCURRENTLY.
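For reference, the Postgres form mentioned above looks like this (table and column names are illustrative):
CREATE INDEX CONCURRENTLY idx_my_table_my_column ON my_table (my_column);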
you can prevent the blockage with something like this (pseudo-code):
create table temp like my_table;
-- update logger to log in temp
alter table my_table add index new_index;
insert into my_table select * from temp;
-- update logger to log in my_table
drop table temp;
Here "logger" would be whatever adds rows/updates to your table in regular use (e.g. a PHP script). This sets up a temporary table to log into while the other one is being altered.
Try to make sure that the index is created before the records are inserted. That way, the index will also be filled during the population of the table. Although that will take longer, at least it will be ready to go when the rake task is done.