Best way to insert and query a lot of data MySQL - mysql

I have to read approximately 5000 rows, with at most 6 columns each, from an xls file. I'm using PHP >= 5.3 and plan to use PHPExcel for the task. I haven't tried it yet, but I think it can handle the job (other suggestions are welcome).
The issue is that every time I read a row, I need to query the database to check whether that particular row exists. If it does, I overwrite it; if not, I add it.
I think that's going to take a lot of time and PHP will simply time out (I can't modify the timeout setting since it's a shared server).
Could you give me a hand with this?
Appreciate your help

Since you're using MySQL, all you have to do is insert data and not worry about a row being there at all.
Here's why and how:
Querying the database from PHP to check whether a row exists is a bad idea, because the result can be stale: between your SELECT and the INSERT that follows it, another connection can add the same row. PHP can't be used to enforce data integrity. That's the job of the database.
To ensure there are no duplicate rows, we use UNIQUE constraints on our columns.
MySQL extends the SQL standard with the INSERT INTO ... ON DUPLICATE KEY UPDATE syntax. It lets you just insert data, and if there's a duplicate row, update it with the new data instead.
Reading 5000 rows is quick. Inserting 5000 rows is also quick if you wrap the inserts in a transaction. I would suggest reading 100 rows from the Excel file, starting a transaction, inserting the data (using ON DUPLICATE KEY UPDATE to handle duplicates) and committing. That way the 100 records cost roughly one disk write at commit time instead of one per row. Done like this, the whole process can finish in a few seconds, so you don't need to worry about performance or timeouts.
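A minimal sketch of the batching idea, written in Python with the built-in sqlite3 module so it runs anywhere. SQLite's ON CONFLICT ... DO UPDATE is used here only as a stand-in for MySQL's ON DUPLICATE KEY UPDATE; the `products` table, its columns and the batch size are assumptions for illustration, not something from the question.

```python
import sqlite3

# SQLite stands in for MySQL here. The MySQL spelling of the same
# statement (hypothetical table and column names) would be:
#   INSERT INTO products (sku, name, price) VALUES (%s, %s, %s)
#   ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price)
UPSERT_SQL = """
INSERT INTO products (sku, name, price) VALUES (?, ?, ?)
ON CONFLICT(sku) DO UPDATE SET name = excluded.name, price = excluded.price
"""

def chunks(rows, size):
    """Yield successive lists of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def import_rows(conn, rows, batch_size=100):
    """Upsert rows in batches, one transaction (one commit) per batch."""
    batches = 0
    for batch in chunks(rows, batch_size):
        with conn:  # commits on success, rolls back on error
            conn.executemany(UPSERT_SQL, batch)
        batches += 1
    return batches

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES ('A-1', 'old name', 1.0)")
conn.commit()

# 250 spreadsheet rows; 'A-1' already exists and gets overwritten.
rows = [("A-%d" % i, "item %d" % i, float(i)) for i in range(1, 251)]
n_batches = import_rows(conn, rows)
```

The unique key (here the primary key on `sku`) is what makes the database decide between insert and update, so no SELECT is needed before each row.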

First, run this process via exec, so the web request timeout doesn't matter.
Second, select all existing rows before reading the Excel file. Don't do it in one query; read, for example, 2000 rows at a time and collect them into an array.
Third, use the xlsx format and chunkReader, which lets you read the file in pieces instead of loading it whole.
It's not a 100% guarantee, but this is what I did and it worked.
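The second step could look roughly like the sketch below: existing keys are fetched in pages of 2000 and collected in memory, then the spreadsheet rows are split into updates and inserts. It is written in Python with sqlite3 standing in for MySQL; the `items` table and the function names are hypothetical, and the approach is only safe if nothing else writes to the table while the import runs.

```python
import sqlite3

def load_existing_keys(conn, page_size=2000):
    """Collect every id from the table, fetching page_size rows per query."""
    keys = set()
    last_id = 0  # assumes ids are positive integers
    while True:
        page = conn.execute(
            "SELECT id FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            break
        keys.update(row[0] for row in page)
        last_id = page[-1][0]  # continue after the last id seen
    return keys

def split_rows(rows, existing_keys):
    """Partition spreadsheet rows into those to update and those to insert."""
    to_update = [r for r in rows if r[0] in existing_keys]
    to_insert = [r for r in rows if r[0] not in existing_keys]
    return to_update, to_insert

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(i, "row") for i in range(1, 5001)])
conn.commit()

existing = load_existing_keys(conn)
to_update, to_insert = split_rows([(4999, "a"), (5000, "b"), (5001, "c")], existing)
```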

Related

Maintain data integrity and consistency when performing sql batch insert/update with unique columns

I have an excel file that contains contents from the database when downloaded. Each row is identified using an identifier called id_number. Users can add new rows on the file with a new unique id_number. When it is uploaded, for each excel row,
When the id_number exists in the database, an update is performed on the database row.
When the id_number does not exist in the database, a new row is inserted.
Other than the excel file, data can be added or updated individually using a file called report.php. Users use this page if they only want to add one data for an employee, for example.
Ideally, I would like to do an insert ... on duplicate key update for maximum performance. I might also put all of them in a transaction. However, I believe this overall process has some flaws:
Before any adds/updates, validation checks have to be done on all excel rows against their corresponding database rows, because there are many unique columns in the table. That's why I'll have to do some select statements to ensure that the data is valid before performing any add/update. Is this efficient on tables with 500 rows and 69 columns? I could probably just fetch all the data, store it in a php array and do the validation checks on the array, but what happens if someone adds a new row (with an id_number of 5) through report.php while the excel file I uploaded also contains a row with an id_number of 5? That could break my validations, because I cannot be sure my data is up to date without performing a lot of select statements.
Suppose the system is in the middle of a transaction adding/updating the data retrieved from the excel file, and someone using report.php adds a row because all the validations have been satisfied (e.g. no duplicate id_numbers). Suppose that at this point the next row to be added from the excel file and the row being added by the user on report.php have the same id_number. What happens then? I don't have much knowledge of transactions; I think they at least prevent two queries from changing a row at the same time. Is that correct?
I don't really mind these kinds of situations that much. But some files have many rows and it might take a long time to process all of them.
One way I've thought of fixing this: while the excel file upload is processing, I prevent users from using report.php to modify the rows currently held by the excel file. Is this fine?
What could be the best way to fix these problems? I am using mysql.
If you really need to allow users to generate their own unique IDs, then you could lock the table in question while you're doing your validation and inserting.
If you acquire a write lock, you can be certain the table isn't changed while you do your validation and inserting.
`mysql> LOCK TABLES tbl_name WRITE;`
don't forget to
`mysql> UNLOCK TABLES;`
The downside of locking is obvious: the table is locked. If it is a high-traffic table, all your traffic is waiting, and that can lead to all kinds of pain (MySQL running out of connections is a common one).
That said, I would suggest a different path altogether: let MySQL be the only one that generates unique ids. That is, make sure the database table has an auto_increment unique id (primary key), and have new records entered in the spreadsheet without a unique id. MySQL will then ensure that the new records get a unique id, so you don't have to worry about locking and can validate and insert without fear of a collision.
As for performance with a table of 500 records and 69 columns, I can only say that if the PHP server and the MySQL server are reasonably sized and the columns aren't large data types, this amount of data should be handled in a fraction of a second. That said, performance can be sabotaged by one bad line of code, so if your code is slow, I would treat that as a separate optimisation problem.
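To illustrate the auto_increment suggestion, here is a small sketch in Python using sqlite3, whose INTEGER PRIMARY KEY AUTOINCREMENT is a stand-in for MySQL's AUTO_INCREMENT primary key. The `employees` table and the `add_employee` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite's INTEGER PRIMARY KEY AUTOINCREMENT stands in for MySQL's
# `id INT AUTO_INCREMENT PRIMARY KEY`; the table is hypothetical.
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)"
)

def add_employee(conn, name):
    """Insert a row without an id and return the id the database assigned."""
    with conn:  # one transaction per insert
        cur = conn.execute("INSERT INTO employees (name) VALUES (?)", (name,))
    return cur.lastrowid

# New rows from the spreadsheet and from the single-entry page both
# leave the id out, so the two sources cannot pick the same value.
from_spreadsheet = [add_employee(conn, n) for n in ("Ana", "Ben")]
from_report_page = add_employee(conn, "Cy")
```

Rows that already carry an id (downloaded from the database earlier) can still be matched for updates; only new rows leave the id empty.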

Will a MySQL SELECT statement interrupt INSERT statement?

I have a mysql table that keeps gaining new records every 5 seconds.
The questions are:
Can I run a query on this set of data that may take more than 5 seconds?
If a SELECT statement takes more than 5s, will it affect the scheduled INSERT statement?
What happens when an INSERT statement is invoked while a SELECT is still running? Will the SELECT get the newly inserted records?
I'll go over your questions and some of the comments you added later.
Can I run a query on this set of data that may take more than 5 seconds?
Can you? Yes. Should you? It depends. In a MySQL configuration I set up, any query taking longer than 3 seconds was considered slow and logged accordingly. In addition, you need to keep in mind the frequency of the queries you intend to run.
For example, if you try to run a 10 second query every 3 seconds, you can probably see how things won't end well. If you run a 10 second query every few hours or so, then it becomes more tolerable for the system.
That being said, slow queries can often benefit from optimizations, such as not scanning the entire table (e.g. by searching on primary keys), and from using the EXPLAIN keyword to get the database's query planner to tell you how it intends to execute the query internally (e.g. is it using PKs, FKs or indices, or is it scanning all table rows?).
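As a rough illustration of asking the planner what it will do, the sketch below uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN (the output format differs between the two). The `readings` table and the `plan` helper are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, value REAL)")

def plan(conn, sql):
    """Return the planner's one-line descriptions for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # last column holds the description

# Lookup by primary key: the planner reports a SEARCH.
by_key = plan(conn, "SELECT * FROM readings WHERE id = 42")

# Filter on a column without an index: the planner reports a full SCAN.
by_unindexed = plan(conn, "SELECT * FROM readings WHERE sensor = 'a'")

# After adding an index, the same query is answered through the index.
conn.execute("CREATE INDEX idx_sensor ON readings (sensor)")
by_indexed = plan(conn, "SELECT * FROM readings WHERE sensor = 'a'")
```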
If a SELECT statement takes more than 5s, will it affect the scheduled INSERT statement?
"Affect" in what way? If you mean "prevent insert from actually inserting until the select has completed", that depends on the storage engine. For example, MyISAM and InnoDB are different, and that includes locking policies. For example, MyISAM tends to lock entire tables while InnoDB tends to lock specific rows. InnoDB is also ACID-compliant, which means it can provide certain integrity guarantees. You should read the docs on this for more details.
What happens when an INSERT statement is invoked while a SELECT is still running? Will the SELECT get the newly inserted records?
Part of "what happens" is determined by how the specific storage engine behaves. Regardless of what happens, the database is designed to answer application queries in a way that's consistent.
As an example, if the select statement were to lock an entire table, then the insert statement would have to wait until the select has completed and the lock has been released, meaning that the app would see the results prior to the insert's update.
I understand that locking the database can prevent messing up the SELECT statement.
It can also create a potentially unacceptable performance bottleneck, especially if, as you say, the system is inserting lots of rows every 5 seconds. How bad it gets depends on how often you run your queries, how efficiently they've been built, etc.
What is good practice when I need the data for calculations while that data will be updated within a short period?
My recommendation is to simply accept the fact that the calculations are based on a snapshot of the data at the specific point in time the calculation was requested and to let the database do its job of ensuring the consistency and integrity of said data. When the app requests data, it should trust that the database has done its best to provide the most up-to-date piece of consistent information (i.e. not providing a row where some columns have been updated, but others yet haven't).
With new rows coming in at the frequency you mentioned, reasonable users will understand that the results they're seeing are based on data available at the time of request.
All of your questions are related to table locking.
The answers depend on how the database is configured.
Read : http://www.mysqltutorial.org/mysql-table-locking/
Perform Select Statement While insert statement working
If you want to run a SELECT while a long INSERT job is still running, open a new connection for each check and close it afterwards. For example, if I am inserting lots of records and want to find out with a SELECT whether the last record has been inserted yet, I have to open and close the connection inside the for or while loop.
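# connection A: send the request that stores the data
run the INSERT statements        # takes a long time
# connection B: poll with a SELECT in a while loop
while True:
    cnx = open a new connection
    result = run the SELECT statement on cnx
    cnx.close()
    if result is not empty:
        break                    # stop polling once the row is visible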

Convert Legacy Text Databases to SQL

At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1 GB of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100 MB.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for last modification time.
If the file is not modified, skip it. Otherwise:
Parse the data file, clean up any issues (very simple checks only), and spit out a tab-delimited file of the columns I need (some of the columns I just ignore).
TRUNCATE the table and import the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes, and skip unchanged rows. I.e., in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and then compare that to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and then store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever. This will leave unchanged rows untouched (and thus still cached) on MySQL.
Perform updates inside a transaction to avoid inconsistencies.
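The fingerprint idea could be sketched like this in Python. The record layout (first 4 characters are the id), the in-memory dict used as the cache and the function names are assumptions for illustration; a real cache would be persisted between cron runs.

```python
import hashlib

def fingerprint(line):
    """SHA-1 hex digest of one raw fixed-width record."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()

def changed_records(lines, cache, key_of):
    """Yield (key, line) for new or modified records and refresh the cache.

    `cache` maps record key -> fingerprint from the previous run.
    `key_of` extracts the primary-key field from a raw line.
    """
    for line in lines:
        key = key_of(line)
        digest = fingerprint(line)
        if cache.get(key) != digest:  # unseen key or different content
            cache[key] = digest
            yield key, line

# Hypothetical layout: the first 4 characters of each record are its id.
key_of = lambda line: line[:4]
cache = {}
first_run = list(changed_records(["0001ALICE 100", "0002BOB   200"], cache, key_of))
second_run = list(changed_records(["0001ALICE 100", "0002BOB   250"], cache, key_of))
```

On the second run only the modified record comes back, so only that row needs an update/insert in MySQL. Two things this sketch leaves out: it would be safer to store the new hash only after the database write has succeeded, and records deleted from the text file have to be detected separately (for example by comparing the set of keys seen against the keys in the cache).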
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table. You can even build in some intelligence by checking for duplicate records rather than truncating the entire table.
Offload the processing to another mysql server and replicate / transfer it over.
I agree with alex's tips. If you can, update only the modified fields, and do mass updates with transactions and multiple inserts grouped together. An additional benefit of transactions is faster updates.
If you are concerned about downtime, insert into a new table instead of truncating the existing one, then rename it.
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips, such as:
- delayed inserts (INSERT DELAYED) in MySQL can improve performance on older versions; the feature is deprecated as of MySQL 5.6
- caches can be optimized
- even if you do not have unique rows, you may (or may not) be able to md5 the rows
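The "insert into a new table, then rename it" tip can be sketched as a list of MySQL statements, generated here by a small Python helper. The `_new` and `_old` suffixes and the helper name are made up; the table and file names are taken from the question and are assumed to come from trusted configuration, since they are interpolated directly into the SQL. One reason this may help: in MySQL, TRUNCATE causes an implicit commit, so wrapping TRUNCATE and LOAD DATA in START TRANSACTION ... COMMIT does not hide the empty table from readers, whereas RENAME TABLE swaps the names atomically.

```python
def swap_statements(table, data_file):
    """Return MySQL statements that reload `table` via a staging copy."""
    staging = table + "_new"
    previous = table + "_old"
    return [
        # Clear leftovers from an earlier, interrupted run.
        "DROP TABLE IF EXISTS `%s`, `%s`" % (staging, previous),
        # Empty copy with the same columns and indexes.
        "CREATE TABLE `%s` LIKE `%s`" % (staging, table),
        "LOAD DATA INFILE '%s' INTO TABLE `%s`" % (data_file, staging),
        # RENAME TABLE swaps both names in one atomic operation.
        "RENAME TABLE `%s` TO `%s`, `%s` TO `%s`" % (table, previous, staging, table),
        "DROP TABLE `%s`" % previous,
    ]

statements = swap_statements("legacy_sales", "/tmp/filesale.data")
for statement in statements:
    print(statement + ";")
```

Readers keep seeing the old data until the rename, and the slow part (the load) happens on a table nobody is querying.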

How do I memory-efficiently process all rows of a MySQL table?

I have a MySQL table with 237 million rows. I want to process all of these rows and update them with new values.
I do have sequential IDs, so I could just use a lot of select statements:
where id = '1'
where id = '2'
This is the method mentioned in Sequentially run through a MYSQL table with 1,000,000 records?.
But I'd like to know if there is a faster way using something like a cursor that would be used to sequentially read a big file without needing to load the full set into memory. The way I see it, a cursor would be much faster than running millions of select statements to get the data back in manageable chunks.
Ideally, you get the DBMS to do the work for you. You make the SQL statement so it runs solely in the database, not returning data to the application. All else apart, this saves the overhead of 237 million messages to the client and 237 million messages back to the server.
Whether this is feasible depends on the nature of the update:
Can the DBMS determine what the new values should be?
Can you get the necessary data into the database so that the DBMS can determine what the new values should be?
Will every single one of the 237 million rows be changed, or only a subset?
Can the DBMS be used to determine the subset?
Will any of the id values be changed at all?
If the id values will never be changed, then you can arrange to partition the data into manageable subsets, for any flexible definition of 'manageable'.
You may need to consider transaction boundaries; can it all be done in a single transaction without blowing out the logs? If you do operations in subsets rather than as a single atomic transaction, what will you do if your driving process crashes at 197 million rows processed? Or the DBMS crashes at that point? How will you know where to resume operations to complete the processing?
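One possible way to handle the subset and resume questions is sketched below in Python, with sqlite3 standing in for MySQL. Each id range is updated by a single statement that runs inside the database, and a checkpoint row is committed in the same transaction, so a crash leaves the checkpoint pointing at the last completed range. The `measurements` and `progress` tables, the doubling update and the chunk size are all assumptions for illustration.

```python
import sqlite3

def process_in_chunks(conn, chunk_size, max_id):
    """Update rows in id ranges, committing each range with its checkpoint."""
    row = conn.execute("SELECT last_id FROM progress").fetchone()
    last_id = row[0]  # resume where the previous run stopped
    while last_id < max_id:
        upper = min(last_id + chunk_size, max_id)
        with conn:  # the update and the checkpoint commit together
            conn.execute(
                "UPDATE measurements SET value = value * 2 WHERE id > ? AND id <= ?",
                (last_id, upper),
            )
            conn.execute("UPDATE progress SET last_id = ?", (upper,))
        last_id = upper
    return last_id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, value INTEGER)")
conn.execute("CREATE TABLE progress (last_id INTEGER NOT NULL)")
conn.execute("INSERT INTO progress VALUES (0)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)", [(i, i) for i in range(1, 1001)])
conn.commit()

done = process_in_chunks(conn, chunk_size=300, max_id=1000)
again = process_in_chunks(conn, chunk_size=300, max_id=1000)  # no-op: already complete
```

No row data travels to the client, and running the function again after an interruption neither skips nor repeats a range. This only works as written if the id values are not changed by the update.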

overwriting unique rows in database - is this method wrong?

I have 15k objects I need to write to the database every day. I'm using a cron job to write to the MySQL server every night. Each night, 99% of these 15k objects will be the same and identified uniquely in the DB.
I have set up a DB rule stating there will be no duplicate rows via specifying a unique key.
I do NOT want to check for an existing row before actually inserting it.
Therefore, I have opted to INSERT all 15k objects every night and allow mysql to prevent duplicate rows...(of course it will throw errors).
I do this because if I check for a pre-existing row - it will significantly reduce speed.
My question: Is there anything wrong with inserting all 15k at once and allowing mysql to prevent duplicates? (without manually checking for pre-existing rows) Is there a threshold where if mysql errors out 1,000 times that it will lock itself and reject all subsequent queries?
please help!
Using INSERT IGNORE INTO... will make MySQL discard the row without error and keep the row already present. Maybe this is what you want?
If you instead want to overwrite the existing row, you can do INSERT INTO ... ON DUPLICATE KEY UPDATE .... See the MySQL documentation on INSERT ... ON DUPLICATE KEY UPDATE for details.
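The difference between the two options, and what a plain INSERT does with a duplicate, can be seen in this small Python sketch. sqlite3 is used so it runs without a server: INSERT OR IGNORE stands in for MySQL's INSERT IGNORE, and ON CONFLICT ... DO UPDATE stands in for ON DUPLICATE KEY UPDATE. The `objects` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (code TEXT PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO objects VALUES ('x1', 'original')")

# Stand-in for MySQL's INSERT IGNORE: the duplicate is skipped silently.
conn.execute("INSERT OR IGNORE INTO objects VALUES ('x1', 'ignored')")
after_ignore = conn.execute("SELECT payload FROM objects WHERE code = 'x1'").fetchone()[0]

# Stand-in for INSERT ... ON DUPLICATE KEY UPDATE: the duplicate overwrites.
conn.execute(
    "INSERT INTO objects VALUES ('x1', 'overwritten') "
    "ON CONFLICT(code) DO UPDATE SET payload = excluded.payload"
)
after_upsert = conn.execute("SELECT payload FROM objects WHERE code = 'x1'").fetchone()[0]

# A plain INSERT raises an error for the duplicate; the connection stays usable.
try:
    conn.execute("INSERT INTO objects VALUES ('x1', 'plain')")
    plain_failed = False
except sqlite3.IntegrityError:
    plain_failed = True
row_count = conn.execute("SELECT COUNT(*) FROM objects").fetchone()[0]
conn.commit()
```

In all three cases the unique key keeps the table at one row per code; the choice is only about whether the existing row is kept, replaced, or reported as an error.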