I need to retrieve around 60,000+ MySQL records from a partner's server and save them to my database. My script needs to do this 3 times a day (60K+ × 3).
Which one is better and faster?
DELETE ALL Records from my DB -> Retrieve Records from Partner DB -> Insert Records to my DB
OR
Retrieve records from partner DB -> Update my DB records (if exist) / INSERT (if not exist)
NOTE: if I UPDATE, I need to update all the fields of the record.
In my opinion, the second approach will be faster than the first one, because if a record already exists it will be skipped rather than deleted and re-inserted.
The two operation sequences you've proposed are NOT equivalent.
The second operation sequence does NOT delete the rows which were removed from the partner DB, while the first sequence does delete them.
MySQL provides the REPLACE statement, which does effectively the same as your second sequence and will probably be the fastest option. Benchmark your code to be sure.
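A minimal sketch of what that could look like (table and column names are hypothetical); on a duplicate key, REPLACE deletes the old row and inserts the new one, so all fields end up updated, which matches the requirement in the question:

```sql
-- If a row with id 42 exists, it is deleted and re-inserted with the
-- new values; otherwise this behaves like a plain INSERT.
REPLACE INTO my_table (id, field1, field2)
VALUES (42, 'value1', 'value2');
```

Note that because REPLACE is a delete-plus-insert, it fires DELETE triggers and churns the indexes more than an in-place UPDATE would; benchmarking on your own data is the only way to be sure which wins.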
Definitely the 2nd one
Retrieve records from partner DB -> Update my DB records (if exist) / INSERT (if not exist)
Deleting is a costly operation, especially when you have 60k+ records, and since the schema stays the same over time, only the values change.
Also consider that when updating, not all values may need to be updated, only some of them, so this is comparatively cheaper than deleting and re-writing rows that might even contain some of the same values you just deleted! :)
Don't just look at it from the deleting point of view; consider also that you have to update the DB. What would you prefer: always writing 60k+ values, or fewer than that?
Related
I have two tables, and each gets 1 million records. I use a cron job every night to insert the records. For the first table I truncate it first and then insert the records; for the second table I update or insert records according to the primary key. I am using MySQL as my database. My problem is that I need to do this task each day, but the job cannot finish inserting all the data. What can be a possible solution for this problem?
The important thing is to switch off all the checks and bookkeeping MySQL wants to perform when loading data, such as autocommit, index maintenance, etc.
https://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-bulk-data-loading.html
If you do not, MySQL does a lot of work after every record added, and it adds up as the process proceeds, resulting in very slow processing and an import that may not complete within a day.
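The settings from the InnoDB bulk-loading guide linked above look roughly like this (a sketch; run them in the same session as the load):

```sql
-- Turn off the per-row checks for the duration of the load...
SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

-- ... perform the bulk INSERT / LOAD DATA statements here ...

-- ...then commit once and restore the defaults.
COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;
```

Disabling unique_checks and foreign_key_checks is only safe if you are certain the incoming data does not violate those constraints, since they will not be verified during the load.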
If you must use MySQL: for the first table, disable the indexes, do the inserts, then re-enable the indexes. This will work faster.
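For a MyISAM table, that index toggle could be done like this (the table name is hypothetical; DISABLE KEYS defers maintenance of non-unique indexes only):

```sql
ALTER TABLE first_table DISABLE KEYS;
-- ... bulk inserts here ...
ALTER TABLE first_table ENABLE KEYS;  -- rebuilds the indexes in one pass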
Alternatively, MongoDB may be faster for this, and Redis is very fast.
I want to delete old records from 10 related tables every 6 months, using primary keys and foreign keys. I am planning to do it in a single transaction block, because in case of any failure I have to roll back the changes. My queries will be something like this:
DELETE FROM PARENT_TABLE WHERE PARENT_ID IN (1, 2, 3,etc);
DELETE FROM CHILD_TABLE1 WHERE PARENT_ID IN (1, 2, 3,etc);
The records to delete number around 1 million. Is it safe to delete all of these in a single transaction? How will the performance be?
Edit
To be clearer about my question, I will detail my execution plan.
I first retrieve the primary keys of all the records that have to be deleted from the parent table and store them in a temporary table.
START TRANSACTION;
DELETE FROM CHILD_ONE WHERE PARENT_ID IN (SELECT * FROM TEMP_ID_TABLE);
DELETE FROM CHILD_TWO WHERE PARENT_ID IN (SELECT * FROM TEMP_ID_TABLE);
DELETE FROM PARENT_TABLE WHERE PARENT_ID IN (SELECT * FROM TEMP_ID_TABLE);
COMMIT;
ROLLBACK on any failure.
Given that I can have around a million records to delete from all these tables, is it safe to put everything inside a single transaction block?
You can probably succeed. But it is not wise. Something random (eg, a network glitch) could come along to cause that huge transaction to abort. You might be blocking other activity for a long time. Etc.
Are the "old" records everything older than date X? If so, it would much more efficient to make use of PARTITIONing for DROPping old rows. We can discuss the details. Oops, you have FOREIGN KEYs, which are incompatible with PARTITIONing. Do all the tables have FKs?
Why do you wait 6 months before doing the delete? 6K rows a day would would have the same effect and be much less invasive and risky.
IN ( SELECT ... )
often has terrible performance in MySQL; use a JOIN instead.
If some of the tables are just normalizations, why bother deleting from them?
Would it work to delete 100 ids per transaction? That would be much safer and less invasive.
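The JOIN rewrite and the batched deletes suggested above could look like this (a sketch using the table names from the question):

```sql
-- Multi-table DELETE with a JOIN instead of IN (SELECT ...):
DELETE c
FROM CHILD_ONE AS c
JOIN TEMP_ID_TABLE AS t ON c.PARENT_ID = t.PARENT_ID;

-- Batched variant: the single-table form is used here because
-- multi-table DELETE does not support LIMIT. Repeat this statement
-- (each in its own transaction) until it affects zero rows:
DELETE FROM PARENT_TABLE
WHERE PARENT_ID IN (SELECT PARENT_ID FROM TEMP_ID_TABLE)
LIMIT 100;
```

Each small batch commits quickly, so a failure only rolls back the current chunk and other sessions are not blocked for long.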
First of all: create a proper backup AND test it before you start to delete the records.
The number of records you can delete at once mostly depends on the configuration (hardware) of your database server. You have to test how many records can be deleted on that specific server without problems. Start with e.g. 1000 records, then increase the amount in each iteration as long as it does not become too slow. If you have replication, the setup and the slave's performance affect the row count too (too many write requests can cause serious replication delay).
A piece of advice: if possible, remove all foreign keys and indexes (except the primary key and the indexes needed by the WHERE clauses you use to perform the deletes) before you start.
Edit:
If the count of records to be deleted is larger than the count of records to be kept, consider just copying the surviving records into a new table, then swapping the old and new tables by renaming them. For the first step, copy the structure of the table using the CREATE TABLE .. LIKE statement, then drop all unnecessary indexes and constraints, copy the records, re-add the indexes, and rename the tables. (Copy the latest new records from the original table into the copy if necessary.) Then you can drop the old table.
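A sketch of that copy-and-swap sequence, reusing the table names from the question (the new table's name is hypothetical):

```sql
-- Clone the structure, strip it down for fast loading:
CREATE TABLE parent_table_new LIKE PARENT_TABLE;
-- (drop unneeded indexes/constraints on parent_table_new here)

-- Copy only the rows that should survive:
INSERT INTO parent_table_new
SELECT p.* FROM PARENT_TABLE AS p
LEFT JOIN TEMP_ID_TABLE AS t ON p.PARENT_ID = t.PARENT_ID
WHERE t.PARENT_ID IS NULL;

-- (re-add indexes on parent_table_new, then swap atomically)
RENAME TABLE PARENT_TABLE TO parent_table_old,
             parent_table_new TO PARENT_TABLE;
DROP TABLE parent_table_old;
```

The RENAME is atomic, so readers see either the old table or the new one, never a half-built state.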
What I believe: first move the data into another (archive) database, then use a single transaction to delete from all 10 tables, which is very safe to roll back immediately on failure, and run the delete on the live database at a time when user interaction is lowest.
I am using MySQL as my database. I need to update some data; however, the data may not have changed, in which case I do not need to update the row.
I wanted to know which one will be better (performance wise):
a) Search the table to determine if the data has changed. For example, I can search by primary key and then see whether the values of the remaining fields have changed. If yes, continue with the update statement; if not, leave it.
b) Use the UPDATE query directly. If there are no changes in the data, MySQL will notice and skip the physical update.
So which one will perform better in such a case?
From the MySQL manual:
If you set a column to the value it currently has, MySQL notices this and does not update it.
So save yourself the latency and leave that task to MySQL. It will even tell you how many rows were actually affected.
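For example, with a hypothetical table t(id, col) where row 1 already holds the value 'x', the mysql client reports the row as matched but not changed:

```sql
UPDATE t SET col = 'x' WHERE id = 1;
-- Query OK, 0 rows affected
-- Rows matched: 1  Changed: 0  Warnings: 0
```

Note the distinction: "Rows matched" counts rows the WHERE clause found, while "Changed" counts rows that were actually rewritten.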
The first option seems better to me, but only in a specific scenario:
Select all or some rows from table and get them in a result set
Traverse the result set (in-memory traversal is really fast) and pick out the primary keys of the records you want to update, then execute the update queries.
This seems a comparatively efficient solution to me, compared to executing an update query on every row regardless of whether it needs one.
If you use the select-then-update approach, you need to lock the row (e.g. SELECT ... FOR UPDATE); otherwise you are in a race condition: the row can be changed after you selected it and checked that it hasn't been changed.
As @AndreKR pointed out, MySQL won't perform any write operation if the values are the same, so using UPDATE directly is faster than using 2 queries.
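If you do go the select-then-update route, the locked version could be sketched like this (table and column names are hypothetical):

```sql
START TRANSACTION;
-- FOR UPDATE takes a row lock, so no other session can change the row
-- between the check and the write:
SELECT col FROM t WHERE id = 1 FOR UPDATE;
-- (compare col against the new value in application code)
UPDATE t SET col = 'new value' WHERE id = 1;
COMMIT;
```

This closes the race condition, but at the cost of a transaction and a row lock per record, which is exactly why the single direct UPDATE tends to win.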
I have 15k objects I need to write to the database every day. I'm using a cron job to write to the MySQL server every night. Each night, 99% of these 15k objects will be the same and are identified uniquely in the DB.
I have set up a DB rule stating there will be no duplicate rows via specifying a unique key.
I do NOT want to check for an existing row before actually inserting it.
Therefore, I have opted to INSERT all 15k objects every night and allow mysql to prevent duplicate rows...(of course it will throw errors).
I do this because if I check for a pre-existing row - it will significantly reduce speed.
My question: Is there anything wrong with inserting all 15k at once and allowing mysql to prevent duplicates? (without manually checking for pre-existing rows) Is there a threshold where if mysql errors out 1,000 times that it will lock itself and reject all subsequent queries?
please help!
Using INSERT IGNORE INTO ... will make MySQL discard a duplicate row without an error and keep the row already present. Maybe this is what you want?
If you instead want to overwrite the existing row, you can use INSERT INTO ... ON DUPLICATE KEY UPDATE .... See the docs.
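Both variants, sketched against a hypothetical table objects(id PK, name):

```sql
-- Silently skip rows whose unique key already exists:
INSERT IGNORE INTO objects (id, name) VALUES (1, 'foo');

-- Or update the existing row's fields when the key collides:
INSERT INTO objects (id, name) VALUES (1, 'foo')
ON DUPLICATE KEY UPDATE name = VALUES(name);
```

Either way, no error is raised for the 99% of duplicate rows, so there is no error threshold to worry about; MySQL will not lock itself out or reject subsequent queries because of duplicate-key conflicts.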
EDIT: To clarify, the records originally come from a flat-file database and are not in the MySQL database.
In one of our existing C programs, whose purpose is to take data from the flat file and insert it (based on criteria) into the MySQL tables:
Open connection to MySQL DB
for record in all_record_of_my_flat_file:
    if record contains a certain field:
        if record is NOT in sql_table A:  // see #1
            insert record information into sql_table A and B  // see #2
Close connection to MySQL DB
select field from sql_table A where field=XXX
2 inserts
I believe that management did not feel it is worth it to add the functionality so that when the field in the flat file is created, it would be inserted into the database. This is specific to one customer (that I know of). I too, felt it odd that we use tool such as this to "sync" the data. I was given the duty of using and maintaining this script so I haven't heard too much about the entire process. The intent is to primarily handle additional records so this is not the first time it is used.
This is typically done every X months to sync everything up, or so I'm told. I've also been told that this process takes roughly a couple of days. There are (currently) at most 2.5 million records (though not necessarily all 2.5M will be inserted, and most likely far fewer). One of the tables contains 10 fields and the other 5 fields. There isn't much to be done about iterating through the records, since that part can't be changed at the moment. What I would like to do is speed up the part where I query MySQL.
I'm not sure if I have left out any important details -- please let me know! I'm also no SQL expert so feel free to point out the obvious.
I thought about:
Putting all the inserts into a transaction (at the moment I'm not sure how important it is for the transaction to be all-or-none or if this affects performance)
Using Insert X Where Not Exists Y
LOAD DATA INFILE (but that would require I create a (possibly) large temp file)
I read that (hopefully someone can confirm) I should drop indexes so they aren't re-calculated.
mysql Ver 14.7 Distrib 4.1.22, for sun-solaris2.10 (sparc) using readline 4.3
Why not upgrade your MySQL server to 5.0 (or 5.1), and then use a trigger so it's always up to date (no need for the monthly script)?
DELIMITER //
CREATE TRIGGER insert_into_a AFTER INSERT ON source_table
FOR EACH ROW
BEGIN
    IF NEW.foo > 1 THEN
        -- check whether the row already exists in table a
        SELECT COUNT(*) INTO @found FROM a WHERE a.id = NEW.id;
        IF @found = 0 THEN
            INSERT INTO a (col1, col2) VALUES (NEW.col1, NEW.col2);
            INSERT INTO b (col1, col2) VALUES (NEW.col1, NEW.col2);
        END IF;
    END IF;
END //
DELIMITER ;
Then, you could even setup update and delete triggers so that the tables are always in sync (if the source table col1 is updated, it'll automatically propagate to a and b)...
Here's my thoughts on your utility script...
1) Is just good practice anyway; I'd do it no matter what.
2) May save you a considerable amount of execution time. If you can solve a problem in straight SQL without iterating in a C program, this can save a fair amount of time. You'll have to profile it in a test environment first to ensure it really does.
3) LOAD DATA INFILE is a tactic to use when inserting a massive amount of data. If you have a lot of records to insert (I'd write a query to do an analysis to figure out how many records you'll have to insert into table B), then it might behoove you to load them this way.
Dropping the indexes before the insert can be helpful to reduce running time, but you'll want to make sure you put them back when you're done.
Although... why aren't all the records in table B in the first place? You haven't mentioned how processing works, but I would think it would be advantageous to ensure (in your app) that the records got there without your service script's intervention. Of course, you understand your situation better than I do, so ignore this paragraph if it's off-base. I know from experience that there are lots of reasons why utility cleanup scripts need to exist.
EDIT: After reading your revised post, your problem domain has changed: you have a bunch of records in a (searchable?) flat file that you need to load into the database based on certain criteria. I think the trick to doing this as quickly as possible is to determine where the C application is actually the slowest and spends the most time spinning its proverbial wheels:
If it's reading off the disk, you're stuck, you can't do anything about that, unless you get a faster disk.
If it's doing the SQL query-insert operation, you could try optimizing that, but you're doing a compare between two data sources (the flat file and the MySQL one).
A quick thought: doing a LOAD DATA INFILE bulk insert to populate a temporary table very quickly (perhaps even an in-memory table, which MySQL's MEMORY engine allows), and then doing the insert-if-not-exists from that table, might be faster than what you're currently doing.
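That staging-table idea could be sketched like this (all table, column, and file names are hypothetical; the LOAD DATA format clauses would have to match your actual flat-file layout):

```sql
-- Bulk-load the flat file into a staging table in one pass:
CREATE TEMPORARY TABLE staging LIKE table_a;
LOAD DATA INFILE '/tmp/records.txt' INTO TABLE staging;

-- Then insert only the rows not already present in table_a,
-- replacing the per-record SELECT + INSERT round trips:
INSERT INTO table_a (col1, col2)
SELECT s.col1, s.col2
FROM staging AS s
LEFT JOIN table_a AS a ON a.col1 = s.col1
WHERE a.col1 IS NULL;
```

This moves the existence check from 2.5 million individual queries into a single set-based operation on the server, which is usually where the big win is.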
In short, do profiling, and figure out where the slowdown is. Aside from that, talk with an experienced DBA for tips on how to do this well.
I discussed with another colleague and here is some of the improvements we came up with:
For:
SELECT X FROM TABLE_A WHERE Y=Z;
Change to (currently waiting on verification of whether X is always unique):
SELECT X FROM TABLE_A WHERE X=Z LIMIT 1;
This was an easy change and we saw some slight improvement. I can't really quantify it well, but I did:
SELECT X FROM TABLE_A ORDER BY RAND() LIMIT 1
and compared it against the first two queries. Over a few test runs there was about a 0.1 second improvement. Perhaps something was cached, but the LIMIT 1 should help somewhat.
Then another (yet to be implemented) improvement(?):
for record number X in entire record range:
    if (no CACHE):
        CACHE = retrieve Y records (sequentially) from the database
    if (X exceeds the highest record number in CACHE):
        CACHE = retrieve the next set of Y records (sequentially) from the database
    search for record number X in CACHE
    ...etc
I'm not sure what to set Y to; are there any methods for determining a good number to try? The table has 200k entries. I will edit in some results when I finish the implementation.
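The cache-refill step in the pseudocode above could be a keyset query like this (reusing the X/TABLE_A naming from earlier; the variable and batch size are illustrative guesses, not measured values):

```sql
-- Fetch the next chunk of Y rows past the highest key already cached:
SELECT X FROM TABLE_A
WHERE X > @highest_cached_x   -- highest record number currently in CACHE
ORDER BY X
LIMIT 1000;                   -- Y = 1000 as a starting point to benchmark
```

A reasonable way to pick Y is empirical: start around 1000, double it while total runtime keeps dropping, and stop when memory use or per-chunk latency becomes a problem.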