overwriting unique rows in database - is this method wrong? - mysql

I have 15k objects I need to write to the database every day. I'm using a cronjob to write to the MySQL server every night. Each night, 99% of these 15k objects will be the same and are identified uniquely in the DB.
I have set up a DB rule stating there will be no duplicate rows by specifying a unique key.
I do NOT want to check for an existing row before actually inserting it.
Therefore, I have opted to INSERT all 15k objects every night and let MySQL prevent duplicate rows (of course it will throw errors).
I do this because checking for a pre-existing row before each insert would significantly reduce speed.
My question: Is there anything wrong with inserting all 15k at once and letting MySQL prevent duplicates (without manually checking for pre-existing rows)? Is there a threshold where, if MySQL errors out 1,000 times, it will lock itself and reject all subsequent queries?
please help!

Using INSERT IGNORE INTO... will make MySQL discard the row without error and keep the row already present. Maybe this is what you want?
If you instead want to overwrite the existing row, you can do INSERT INTO ... ON DUPLICATE KEY UPDATE .... See the docs.
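A minimal sketch of both options, assuming a hypothetical `objects` table with a unique key on `obj_key` (adjust the names to your schema):

```sql
-- Option 1: silently skip the duplicate and keep the row already present.
INSERT IGNORE INTO objects (obj_key, payload)
VALUES ('abc-123', 'new data');

-- Option 2: overwrite the existing row when the unique key collides.
INSERT INTO objects (obj_key, payload)
VALUES ('abc-123', 'new data')
ON DUPLICATE KEY UPDATE payload = VALUES(payload);
```

Either form also works with multi-row VALUES lists, so the nightly 15k objects can be sent in large batches rather than one statement per object.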

Related

Maintain data integrity and consistency when performing sql batch insert/update with unique columns

I have an Excel file that contains contents from the database when downloaded. Each row is identified using an identifier called id_number. Users can add new rows to the file with a new unique id_number. When it is uploaded, for each Excel row:
When the id_number exists on the database, an update is performed on the database row.
When the id_number does not exist on the database, an insert is performed on the database row.
Other than the Excel file, data can be added or updated individually using a file called report.php. Users use this page if they only want to add a single record for an employee, for example.
Ideally, I would like to do an insert ... on duplicate key update for maximum performance. I might also put all of them in a transaction. However, I believe this overall process has some flaws:
Before any adds/updates, validation checks have to be done on all Excel rows against their corresponding database rows. The reason is that there are many unique columns in the table. That's why I'll have to do some select statements to ensure that the data is valid before performing any add/update. Is this efficient on tables with 500 rows and 69 columns? I could probably just get all the data and store it in a PHP array and do the validation check on the array, but what happens if someone adds a new row (with an id_number of 5) through report.php? Then suppose the Excel file I uploaded also contains a row with an id_number of 5? That could destroy my validations, because I cannot be sure my data is up to date without performing a lot of select statements.
Suppose the system is in the middle of a transaction adding/updating the data retrieved from the Excel file, and then someone from report.php adds a row because all the validations have been satisfied (e.g. no duplicate id_numbers). Suppose at this point in time the next row to be added from the Excel file and the row that will be added by the user on report.php have the same id_number. What happens then? I don't have much knowledge of transactions; I think they at least prevent two queries from changing a row at the same time? Is that correct?
I don't really mind these kinds of situations that much. But some files have many rows and it might take a long time to process all of them.
One way I've thought of fixing this is: while the excel file upload is processing, I'll have to prevent users using report.php to modify the rows currently held by the excel file. Is this fine?
What could be the best way to fix these problems? I am using mysql.
If you really need to allow the user to generate their own unique ID, then you could lock the table in question while you're doing your validation and inserting.
If you acquire a write lock, then you can be certain the table isn't changed while you do your work of validation and inserting.
`mysql> LOCK TABLES tbl_name WRITE`
don't forget to
`mysql> UNLOCK TABLES;`
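A rough sketch of the whole sequence, using the hypothetical employees table and id_number column from the question (the validation query is only a placeholder):

```sql
LOCK TABLES employees WRITE;

-- With the write lock held, no other session can insert a competing id_number
-- between the validation check and the insert.
SELECT COUNT(*) FROM employees WHERE id_number = 5;

-- If the count came back 0, it is safe to insert; otherwise update instead.
INSERT INTO employees (id_number, name) VALUES (5, 'Jane Doe');

UNLOCK TABLES;
```

Note that while the lock is held, this session may only touch the tables it locked, so any validation selects against other tables would need to be included in the same LOCK TABLES statement.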
The downside with locking is obvious: the table is locked. If it is high traffic, then all your traffic is waiting, and that could lead to all kinds of pain (MySQL running out of connections would be one common one).
That said, I would suggest a different path altogether: let MySQL be the only one who generates a unique id. That is, make sure the database table has an auto_increment unique id (primary key) and then have new records in the spreadsheet entered without the unique id given. MySQL will then ensure that the new records get a unique id, and you don't have to worry about locking and can validate and insert without fear of a collision.
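For that alternative, a minimal sketch of a table where MySQL generates the id itself (the table and column names are illustrative):

```sql
CREATE TABLE employees (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    PRIMARY KEY (id)
);

-- New spreadsheet records are inserted without an id; MySQL assigns the next value.
INSERT INTO employees (name) VALUES ('Jane Doe');

-- The id MySQL just generated for this connection, if the application needs it back.
SELECT LAST_INSERT_ID();
```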
Regarding the question about performance with a 500-record, 69-column table, I can only say that if the PHP server and the MySQL server are reasonably sized and the columns aren't large data types, then this amount of data should be readily handled in a fraction of a second. That said, performance can be sabotaged by one bad line of code, so if your code is slow to perform, I would take that as a separate optimisation problem.

Inserting 1 million records in mysql

I have two tables, and in both tables I get 1 million records. I am using a cron job every night to insert the records. For the first table I truncate the table first and then insert the records; for the second table I update or insert records according to the primary key. I am using MySQL as my database. My problem is that I need to do this task each day, but I am unable to insert all the data in time. What could be a possible solution to this problem?
The important thing is to switch off all the actions and checks MySQL wants to perform while loading data, like autocommit, index updates, etc.
https://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-bulk-data-loading.html
If you do not do this, MySQL does a lot of work after every record added, and it adds up as the process proceeds, resulting in very slow processing and importing towards the end, which may not complete within one day.
If you must use MySQL: for the first table, disable the indexes, do the inserts, then re-enable the indexes. This will work faster.
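Along the lines of the linked bulk-loading guide, a hedged sketch of the kind of session settings and index toggling meant here (assuming the truncate-and-reload table is called table_one and relaxed checks are acceptable during the load):

```sql
-- Relax per-row work for this session while the bulk load runs.
SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

-- Skip secondary index maintenance during the load
-- (DISABLE KEYS only affects non-unique indexes, and only on MyISAM tables).
ALTER TABLE table_one DISABLE KEYS;

-- ... run the bulk INSERT / LOAD DATA statements here ...

ALTER TABLE table_one ENABLE KEYS;
COMMIT;

SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;
```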
Alternatively MongoDb will be faster, and Redis is very fast.

Best way to insert and query a lot of data MySQL

I have to read approximately 5000 rows of 6 columns max from an xls file. I'm using PHP >= 5.3. I'll be using PHPExcel for that task. I haven't tried it, but I think it can handle the job (if you have other options, they are welcome).
The issue is that every time I read a row, I need to query the database to verify whether that particular row exists; if it does, overwrite it, and if not, add it.
I think that's going to take a lot of time and PHP will simply time out (I can't modify the timeout variable since it's a shared server).
Could you give me a hand with this?
Appreciate your help
Since you're using MySQL, all you have to do is insert data and not worry about a row being there at all.
Here's why and how:
If you query the database from PHP to verify a row exists, that's bad. The reason it's bad is that you are prone to getting false results: there's a lag between PHP and MySQL, so the data can change between your check and your insert, and PHP can't be used to verify data integrity. That's the job of the database.
To ensure there are no duplicate rows, we use UNIQUE constraints on our columns.
MySQL extends the SQL standard with the INSERT INTO ... ON DUPLICATE KEY UPDATE syntax. That lets you just insert data, and if there's a duplicate row, you can update it with the new data instead.
Reading 5000 rows is quick. Inserting 5000 rows is also quick if you wrap it in a transaction. I would suggest reading 100 rows from the Excel file at a time, starting a transaction and just inserting the data (using ON DUPLICATE KEY UPDATE to handle duplicates). That lets you spend one I/O of your hard drive to save 100 records. Doing so, you can finish this whole process in a few seconds, so you don't have to worry about performance or timeouts.
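As a sketch of what one such 100-row chunk could look like on the SQL side (the import_rows table and its columns are made up; in practice the PHP layer would build one multi-row statement per chunk from the Excel data):

```sql
START TRANSACTION;

-- One multi-row INSERT per chunk of ~100 Excel rows.
INSERT INTO import_rows (id_number, col_a, col_b)
VALUES
    (1001, 'a1', 'b1'),
    (1002, 'a2', 'b2'),
    (1003, 'a3', 'b3')
ON DUPLICATE KEY UPDATE
    col_a = VALUES(col_a),
    col_b = VALUES(col_b);

COMMIT;
```

The transaction groups the chunk into a single commit, which is where much of the per-row cost tends to go.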
First, run this process via exec, so the PHP timeout doesn't matter.
Second, select the existing rows before reading the Excel file. Don't select them all in one query; read 2000 rows at a time, for example, and collect them into an array.
Third, use the xlsx format and a chunked reader, which allows you to read the file in parts instead of all at once.
It's not a 100% guarantee, but I did the same.

MySQL INSERT VS UPDATE

I need to retrieve around 60,000+ MySQL Records from a partner's server and save it to my database. My Script needs to do this 3X a day (60K+ X 3)
Which one is better and faster?
DELETE ALL Records from my DB -> Retrieve Records from Partner DB -> Insert Records to my DB
OR
Retrieve records from partner DB -> Update my DB records (if exist) / INSERT (if not exist)
NOTE : if UPDATE, I need to update all the fields of the record
In my opinion, the second approach will be faster than the first one, because if a record already exists then it will skip inserting that record.
The two operation sequences you've proposed are NOT equivalent.
The second operation sequence does NOT delete the rows which were removed from the partner DB, while the first sequence does delete them.
MySQL provides the REPLACE statement, which does effectively the same as your second sequence and will probably be the fastest option. Benchmark your code to be sure.
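For reference, a small sketch of REPLACE next to the equivalent upsert, using a hypothetical mirror table keyed on partner_id (note that REPLACE works by deleting the old row and inserting a new one, so benchmarking both is worthwhile):

```sql
-- REPLACE: if a row with the same unique key exists, it is removed and the new row inserted.
REPLACE INTO mirror (partner_id, field_a, field_b)
VALUES (42, 'a', 'b');

-- The upsert form, updating all fields of the record as the question requires.
INSERT INTO mirror (partner_id, field_a, field_b)
VALUES (42, 'a', 'b')
ON DUPLICATE KEY UPDATE
    field_a = VALUES(field_a),
    field_b = VALUES(field_b);
```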
Definitely the 2nd one
Retrieve records from partner DB -> Update my DB records (if exist) / INSERT (if not exist)
Deleting is a costly operation, especially when you have 60k+ records, and considering that the schema stays the same over time and only the values change.
Also, consider it from this point of view: if updating, not all values may need to be updated, only some of them... so this is comparatively cheaper than deleting and then re-writing values which might even include some of the same values you just deleted! :)
Don't just consider it from the deleting point of view; also consider that you have to update the DB too... what would you prefer, always updating 60k+ values or fewer than that...

Millions of MySQL Insert On Duplicate Key Update - very slow

I have a table called research_words which has some hundred million rows.
Every day I have tens of millions of new rows to be added; about 5% of them are totally new rows, and 95% are updates which have to add to some columns in that row. I don't know which is which, so I use:
INSERT INTO research_words
(word1,word2,origyear,cat,numbooks,numpages,numwords)
VALUES
(34272,268706,1914,1,1,1,1)
ON DUPLICATE KEY UPDATE
numbooks=numbooks+1,numpages=numpages+1,numwords=numwords+1
This is an InnoDB table where the primary key is over word1,word2,origyear,cat.
The issue I'm having is that I have to insert the new rows each day and it's taking longer than 24 hours to insert each day's rows! Obviously I can't have it taking longer than a day to insert the rows for the day. I have to find a way to make the inserts faster.
For other tables I've had great success with ALTER TABLE ... DISABLE KEYS; and LOAD DATA INFILE, which allows me to add billions of rows in less than an hour. That would be great, except that unfortunately I am incrementing columns in this table. I doubt disabling the keys would help either, because surely it will need them to check whether the row exists in order to add it.
My scripts are in PHP but when I add the rows I do so by an exec call directly to MySQL and pass it a text file of commands, instead of sending them with PHP, since it's faster this way.
Any ideas to fix the speed issue here?
Old question, but perhaps worth an answer all the same.
Part of the issue stems from the large number of inserts being run essentially one at a time, with a unique index update after each one.
In these instances, a better technique might be to select n rows to insert and put them in a temp table, left join them to the destination table, calculate their new values (in the OP's situation IFNULL(dest.numpages+1,1) etc.), and then run two further commands: an insert of the rows whose calculated counts are 1 (the genuinely new rows) and an update of the rows where they're greater. The updates don't require an index refresh, so they run much faster; the inserts don't require the same ON DUPLICATE KEY logic.
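A hedged sketch of that staging approach against the research_words table from the question (the temporary table names and the batch-loading step are illustrative):

```sql
-- Stage the next batch of incoming rows.
CREATE TEMPORARY TABLE staging_words LIKE research_words;
-- ... LOAD DATA / INSERT the next n incoming rows into staging_words here ...

-- Calculate the new values by joining the batch to the destination table.
CREATE TEMPORARY TABLE calc_words AS
SELECT s.word1, s.word2, s.origyear, s.cat,
       IFNULL(d.numbooks + 1, 1) AS numbooks,
       IFNULL(d.numpages + 1, 1) AS numpages,
       IFNULL(d.numwords + 1, 1) AS numwords
FROM staging_words s
LEFT JOIN research_words d
  ON d.word1 = s.word1 AND d.word2 = s.word2
 AND d.origyear = s.origyear AND d.cat = s.cat;

-- Genuinely new rows: counts calculated as 1 because the left join found no match.
INSERT INTO research_words (word1, word2, origyear, cat, numbooks, numpages, numwords)
SELECT word1, word2, origyear, cat, numbooks, numpages, numwords
FROM calc_words
WHERE numbooks = 1;

-- Existing rows: a plain UPDATE via join, no ON DUPLICATE KEY logic needed.
UPDATE research_words d
JOIN calc_words c
  ON d.word1 = c.word1 AND d.word2 = c.word2
 AND d.origyear = c.origyear AND d.cat = c.cat
SET d.numbooks = c.numbooks,
    d.numpages = c.numpages,
    d.numwords = c.numwords
WHERE c.numbooks > 1;
```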