MySQL, best way to insert 1000 rows - mysql

I have an "item" table with the following structure:
item
id
name
etc.
Users can put items from this item table into their inventory. I store it in the inventory table like this:
inventory
id
item_id
user_id
Is it OK to insert 1000 rows into inventory table? What is the best way to insert 1000 rows?

MySQL can handle millions of records in a single table without any tweaks. With little tweaks it can handle hundreds of millions (I did that). So I wouldn't worry about that.
To improve insert performance you should use batch inserts.
insert into table my_table(col1, col2) VALUES (val1_1, val2_1), (val1_2, val2_2);
Storing records to a file and using load data infile yields even better results (best in my case), but it requires more effort.

It's okay to insert 1000 rows. You should do this as a transaction so the indices are updated all at once at the commit.
You can also construct a single INSERT statement to insert many rows at a time. (See the syntax for INSERT.) However, I wonder how advisable it would be to do that for 1,000 rows.
Most efficient would probably be to use LOAD DATA INFILE or LOAD XML

When it gets to 1000's, I usually use write to a pipe-delimited CSV file and use LOAD DATA INFILE to suck it in quick. By writing to disk, you avoid issues with overflowing your string buffer, if the language you are using has limits on string size. LOAD DATA INFILE is optimized for bulk uploads.
I've done this with up to 1 billion rows (on a cheap $400 4GB 3 year old 32-bit Ubuntu box), so one thousand is not an issue.
Added note: if you don't care about the id assigned and you just want a new unique ID for every record you insert, you could consider setting up AUTO_INCREMENT on id in the table and let Mysql assign an ID for you.

It kind of also depends on how many users you have, if you have 1,000,000 users all doing 1,000 inserts every few minutes then the server is going to struggle to keep up. From a mysql point of view it is certainly capable of handling that much data.

Related

Maintain data integrity and consistency when performing sql batch insert/update with unique columns

I have an excel file that contains contents from the database when downloaded. Each row is identified using an identifier called id_number. Users can add new rows on the file with a new unique id_number. When it is uploaded, for each excel row,
When the id_number exist on the database, an update is performed on the database row.
When the id_number does not exist on the database, an insert is performed on the database row.
Other than the excel file, data can be added or updated individually using a file called report.php. Users use this page if they only want to add one data for an employee, for example.
Ideally, I would like to do an insert ... on duplicate key update for maximum performance. I might also put all of them in a transaction. However, I believe this overall process have some flaws:
Before any add/updates, validation checks have to be done on all excel rows against their corresponding database rows. The reason is because there are many unique columns in the table. That's why I'll have to do some select statements to insure that the data is valid before performing any add/update. Is this efficient on tables with 500 rows and 69 columns? I could probably just get all the data and store all of them in a php array and do the validation check on the array, but what happens if someone adds a new row (with an id_number of 5) through report.php? Then suppose the excel file I uploaded also contains a row with an id_number 5? That could probably destroy my validations because I can not be sure my data is up to date without performing a lot of select statements.
Suppose the system is in the middle of a transaction adding/updating the data retrieved from the excel file, then someone from report.php adds a row because all the validations have been satisfied (E.G. no duplicate id_numbers). Suppose at this point in time the next row to be added from the excel file and the row that will be added by the user on report.php have the same id_number. What happens then? I don't have much knowledge on transactions, I think they at least prevents two queries changing a row at the same time? Is that correct?
I don't really mind these kinds of situations that much. But some files have many rows and it might take a long time to process all of them.
One way I've thought of fixing this is: while the excel file upload is processing, I'll have to prevent users using report.php to modify the rows currently held by the excel file. Is this fine?
What could be the best way to fix these problems? I am using mysql.
If you really need to allow the user to generate their own unique ID then the you could lock the table in question while you're doing you validation and inserting.
If you acquire a write lock, then you can be certain the table isn't changed while you do your work of validation and inserting.
`mysql> LOCK TABLES tbl_name WRITE`
don't forget to
`mysql> UNLOCK TABLES;`
The downside with locking is obvious, the table is locked. If it is high traffic, then all your traffic is waiting, and that could lead all kinds of pain, (mysql running out of connections, would be one common one)
That said, I would suggest a different path altogether, let mysql be the only one who generates a unique id. That is make sure the database table have an auto_increment unique id (primary key) and then have new records in the spreadsheet entered without without the unique id given. Then mysql will ensure that the new records get a unique id, and you don't have to worry about locking and can validate and insert without fear of a collision.
In regards to the question as to performance with a 500 records 69 column table, I can only say that if the php server and the mysql server are reasonably sized and the columns aren't large data types then this amount of data should be readily handled in a fractions of a second. That said performance can be sabotaged by one bad line of code so if your code is slow to perform, I would take that as a separate optimisation problem.

Selectively reading from CSV to MySQL

This is a two part question.
The first is what architecture should I use for the following issue?
And the second is the how i.e. what commands I should use?
I have some log files I want to read into a database. The log files contain fields that are unnecessary (because they can be calculated from other fields).
Approach 1: Should I parse each line of the log file and insert it into the database?
Con: The log entries have to be unique, so I need to first do a SELECT, check if the LogItemID exists, and then INSERT if it doesn’t. This seems to be a high overhead activity and at some point this will be done on an hourly basis.
Approach 2: Or do I use LOAD DATA INFILE (can I even use that in PHP?) and just load the log file into a temporary table, then move the records into the permanent table?
Con: Even in this method though, I will still have to go through the cycle of SELECT, then INSERT.
Approach 3: Or is there a better way? Is there a command to bulk copy records from one table to another with selected fields? Will REPLACE INTO .... ON DUPLICATE UPDATE work (I don't want to UPDATE if the item exists, just ignore) as long as LogItemID is set to UNIQUE ? Either way, I need to throw the extraneous fields out. Which of these approaches is better? Not just easier, but from the standpoint of writing good, scalable code?
P.S. Unrelated, but part of the Architecture issue here is this...
If I have StartTime, EndTime and Interval (EndTime-StartTime), which should I keep - the first two or Interval? And Why?
Edit: To clarify why I did not want to store all three fields - the issue is of course normalization and therefore not good practice. For audit reasons, perhaps I'll store them. Perhaps in another table?
TIA
LOAD DATA INFILE is going to be a lot faster than running individual inserts.
You could load to a separate, temporary table, and then run an INSERT ... SELECT from the temporary table into your actual store. But it's not clear why you would need to do that. To "skip" some fields in the CSV, just assign those to dummy user-defined variables. There's no need to load those fields into the temporary table.
I'd define a UNIQUE key (constraint) and just use INSERT IGNORE; that will be a lot faster than running a separate SELECT, and faster than a REPLACE. (If your requirement is that you don't have any need to update the existing row, you just want to "ignore" the new row.
LOAD DATA INFILE 'my.csv'
IGNORE
INTO TABLE mytable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
( mycol
, #dummy2
, #dummy3
, #mm_dd_yyyy
, somecol
)
SET mydatecol = STR_TO_DATE(#mm_dd_yyyy,'%m-%d-%Y')
If you have start, end and duration, go ahead and store all three. There's redundancy there, the main issues are performance and update anomalies. (If you update end, should you also update duration?) If I don't have a need to do updates, I'd just store all three. I could calculate duration from start_time and end_time, but having the column stored would allow me to add an index, and get better performance on queries looking for durations less than 10 minutes, or whatever. Absent the column, I'd be forced to evaluate an expression for every row in the table, and that gets expensive on large sets.
You could use perl to parse out a subset of the csv fields you want to load, then use the command 'uniq' to remove duplicates, then use LOAD DATA INFILE to load the result.
Typically loading data into a temporary table, then traversing is slower than preprocessing the data ahead of time. As for the LogItemID, if you set it to unique the inserts should fail when you load subsequent matching lines.
When it comes to deciding to store StartTime+Interval (more typically called Duration) or StartTime and EndTime, it really depends on how you plan on using the resulting database table. If you store the duration and are constantly computing the end time it might be better to just store the start/end. If you believe the duration will be commonly used, store it. Depending on how big the database you might decide to just store all three, one more field may not add much overhead.

How should I speed up Inserts in a large table

Scenario
I have an hourly cron that inserts roughly 25k entries into a table that's about 7 million rows. My Primary Key is a composite of 5 different fields. I did this so that I wouldn't have to search the table for duplicates prior to insert, assuming the dupes would just fall to the floor on insert. Due to PHP memory issues I was seeing while reading these 25k entries in (downloading multiple json files from a url and constructing insert queries), I break the entries into 2k chunks and insert them at once via INSERT INTO blah (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);. Lastly I should probably mention I'm on DreamHost so I doubt my server/db setup is all that great. Oh and the db is MyIsam(default).
Problem
Each 2k chunk insert is taking roughly 20-30 seconds(resulting in about a 10 minute total script time including 2 minutes for downloading 6k json files) and while this is happening, user selects from that table appear to be getting blocked/delayed making the website unresponsive for users. My guess would be that the slowdown is coming from the insert trying to index the 5 field PK into a table of 7 million.
What I'm considering
I originally thought enabling concurrent insert/selects would help the unresponsive site, but as far as I can tell, my table is already MyIsam and I have concurrent inserts enabled.
I read that LOAD DATA INFILE is a lot faster so I was thinking of maybe inserting all my values into an empty temp table that will be mostly collision free(besides dupes from the current hour), exporting those w/ SELECT * INTO OUTFILE and then using LOAD DATA INFILE, but i don't know if the overhead of inserting and writing negates the speed benefit. Also the guides I've read talk about further optimizing by disabling my indexes prior to insert, but i think that would break my method of avoiding duplicates on insert...
It's probably obvious that I'm a bit clueless here, I know just enough to get myself really confused on what to do next. Any advice on how to speed up the inserts or just to make selects still responsive while these inserts are occurring would be greatly appreciated.

MySQL multiple insert performance

I have data containing about 30 000 records. And I need to insert this data into MySQL table. I group this data in packages by 1000 and create multiple inserts like this:
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
How can I optimize performance of this inserting? Can I insert more than 1000 records per time? Each row contains data with size about 1KB. Thanks.
Try wrapping your bulk insert inside a transaction.
START TRANSACTION
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
COMMIT
That might improve performance, I'm not sure if mySQL can partially commit a bulk insert though (if it can't then this likely won't really help much)
Remember that even at 1.5 seconds, for 30,000 records each at ~1k in size, you're doing 20MB/s commit speed you could actually be drive limited depending on your hardware setup.
Advice then would be to investigate a SSD or changing your Raid setup or get faster mechanical drives (there's plenty of online articles on the pros and cons of using a SQL db mounted on a SSD).
You need to check mysql server configurations and specifically check buffer size etc.
You can remove indexes from the table, if any, to make it faster. Create the indexes onces data is in.
Look here, you should get all you need.
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
One insert statement with multiple values, it says, is much faster than multiple insert statements.
Is this a once off operation?
If so, just generate a single sql statement per data element and execute them all on the server. 30,000 really shouldnt take very long and you will have the simplest means of inputting your data.

How to manage Huge operations on MySql

I have a MySql DataBase. I have a lot of records (about 4,000,000,000 rows) and I want to process them in order to reduce them(reduce to about 1,000,000,000 Rows).
Assume I have following tables:
table RawData: I have more than 5000 rows per sec that I want to insert them to RawData
table ProcessedData : this table is a processed(aggregated) storage for rows that were inserted at RawData.
minimum rows count > 20,000,000
table ProcessedDataDetail: I write details of table ProcessedData (data that was aggregated )
users want to view and search in ProcessedData table that need to join more than 8 other tables.
Inserting in RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are very slow. I used a lot of Indexes. assume my data length is 1G, but my Index length is 4G :). ( I want to get ride of these indexes, they make slow my process)
How can I Increase speed of this process ?
I think I need a shadow table from ProcessedData, name it ProcessedDataShadow. then proccess RawData and aggregate them with ProcessedDataShadow, then insert the result in ProcessedDataShadow and ProcessedData. What is your idea??
(I am developing the project by C++)
thank you in advance.
Without knowing more about what your actual application is, I have these suggestions:
Use InnoDB if you aren't already. InnoDB makes use of row-locks and are much better at handling concurrent updates/inserts. It will be slower if you don't work concurrently, but the row-locking is probably a must have for you, depending on how many sources you will have for RawData.
Indexes usually speeds up things, but badly chosen indexes can make things slower. I don't think you want to get rid of them, but a lot of indexes can make inserts very slow. It is possible to disable indexes when inserting batches of data, in order to prevent updating indexes on each insert.
If you will be selecting huge amount of data that might disturb the data collection, consider using a replicated slave database server that you use only for reading. Even if that will lock rows /tables, the primary (master) database wont be affected, and the slave will get back up to speed as soon as it is free to do so.
Do you need to process data in the database? If possible, maybe collect all data in the application and only insert ProcessedData.
You've not said what the structure of the data is, how its consolidated, how promptly data needs to be available to users nor how lumpy the consolidation process can be.
However the most immediate problem will be sinking 5000 rows per second. You're going to need a very big, very fast machine (probably a sharded cluster).
If possible I'd recommend writing a consolidating buffer (using an in-memory hash table - not in the DBMS) to put the consolidated data into - even if it's only partially consolidated - then update from this into the processedData table rather than trying to populate it directly from the rawData.
Indeed, I'd probably consider seperating the raw and consolidated data onto seperate servers/clusters (the MySQL federated engine is handy for providing a unified view of the data).
Have you analysed your queries to see which indexes you really need? (hint - this script is very useful for this).