Update huge array of data in MySQL - mysql

I've came across the situation, where I need to select huge amount of data (say 100k records which look like ID | {"points":"9","votes":"2","breakdown":"0,0,0,1,1"}), process it in PHP and then put it back. Question is about putting it back efficiently. I saw a solution using INSERT ... ON DUPLICATE KEY UPDATE, I saw a solution with UPDATE using CASE. Are there any other solutions? Which would be the most efficient way to update huge data array?

Better choice is using simple update.
When you try to put data with insert exceptions your DB will do more additional work: try to insert, verify constraints, raise exception, update row, verify constraints again.
Update
Run tests on my local PC for insert into ... ON DUPLICATE KEY UPDATE and UPDATE statements against the table with 43k rows.
the first approach works on 40% faster.
But both worked faster then 1.5s. I suppose, you php code will be bottleneck of your approach and you should not worry about speed of MySQL statements. Of course, it works if you table not huge and does not had dozens millions rows.
Update 2
My local PC uses MySQL 5.6 in default configuration.
CPU: 8Gb

Related

Improving Speed of SQL 'Update' function - break into Insert/ Delete?

I'm running an ETL process and streaming data into a MySQL table.
Now it is being written over a web connection (fairly fast one) -- so that can be a bottleneck.
Anyway, it's a basic insert/ update function. It's a list of IDs as the primary key/ index .... and then a few attributes.
If a new ID is found, insert, otherwise, update ... you get the idea.
Currently doing an "update, else insert" function based on the ID (indexed) is taking 13 rows/ second (which seems pretty abysmal, right?). This is comparing 1000 rows to a database of 250k records, for context.
When doing a "pure" insert everything approach, for comparison, already speeds up the process to 26 rows/ second.
The thing with the pure "insert" approach is that I can have 20 parallel connections "inserting" at once ... (20 is max allowed by web host) ... whereas any "update" function cannot have any parallels running.
Thus 26 x 20 = 520 r/s. Quite greater than 13 r/s, especially if I can rig something up that allows even more data pushed through in parallel.
My question is ... given the massive benefit of inserting vs. updating, is there a way to duplicate the 'update' functionality (I only want the most recent insert of a given ID to survive) .... by doing a massive insert, then running a delete function after the fact, that deletes duplicate IDs that aren't the 'newest' ?
Is this something easy to implement, or something that comes up often?
What else I can do to ensure this update process is faster? I know getting rid of the 'web connection' between the ETL tool and DB is a start, but what else? This seems like it would be a fairly common problem.
Ultimately there are 20 columns, max of probably varchar(50) ... should I be getting a lot more than 13 rows processed/ second?
There are many possible 'answers' to your questions.
13/second -- a lot that can be done...
INSERT ... ON DUPLICATE KEY UPDATE ... ('IODKU') is usually the best way to do "update, else insert" (unless I don't know what you mean by it).
Batched inserts is much faster than inserting one row at a time. Optimal is around 100 rows giving 10x speedup. IODKU can (usually) be batched, too; see the VALUES() pseudo function.
BEGIN;...lots of writes...COMMIT; cuts back significantly on the overhead for transaction.
Using a "staging" table for gathering things up update can have a significant benefit. My blog discussing that. That also covers batch "normalization".
Building Summary Tables on the fly interferes with high speed data ingestion. Another blog covers Summary tables.
Normalization can be used for de-dupping, hence shrinking the disk footprint. This can be important for decreasing I/O for the 'Fact' table in Data Warehousing. (I am referring to your 20 x VARCHAR(50).)
RAID striping is a hardware help.
Batter-Backed-Write-Cache on a RAID controller makes writes seem instantaneous.
SSDs speed up I/O.
If you provide some more specifics (SHOW CREATE TABLE, SQL, etc), I can be more specific.
Do it in the DBMS, and wrap it in a transaction.
To explain:
Load your data into a temporary table in MySQL in the fastest way possible. Bulk load, insert, do whatever works. Look at "load data infile".
Outer-join the temporary table to the target table, and INSERT those rows where the PK column of the target table is NULL.
Outer-join the temporary table to the target table, and UPDATE those rows where the PK column of the target table is NOT NULL.
Wrap steps 2 and 3 in a begin/commit (or [start transaction]/commit pair for a transaction. The default behaviour is probably autocommit, which will mean you're doing a LOT of database work after every insert/update. Use transactions properly, and the work is only done once for each block.

Uploading enormous amount of data to MySQL server

I have to upload about 16 million records to a MySQL 5.1 server on a shared webspace which does not permit LOAD DATA functionality. The table is an Innodb table. I have not assigned any keys yet.
Therefore, I use a Python script to convert my CSV file (of 2.5 GB of size) to an SQL file with individual INSERT statements. I've launched the SQL file, and the process is incredibly slow, it feels like 1000-1500 lines are processed every minute!
In the meantime, I read about bulk inserts, but did not find any reliable source telling how many records one insert statement can have. Do you know?
Is it an advantage to have no keys and add them later?
Would a transaction around all the insert help speed up the process? In fact, there's just a single connection (mine) working with the database at this time.
If you use insert ... values ... syntax to insert multiple rows running a single request your query size is limited by max_allowed_packet value rather than by number of rows.
Concerning keys: it's a good practice to define keys before any data manipulations. Actually, when you build a model you must think of keys, relations, indexes etc.
It's better do define indexes before you insert data as well. CREATE INDEX works quite slowly on huge datasets. But postponing indexes creation is not a huge disadvantage.
To make your inserts faster try to turn autocommit mode on and do not run concurrent requests on your tables.

Remove over 100,000 rows from mysql table - server crashes

I have a question when I try to remove over 100,000 rows from a mysql table the server freezes and non of its websites can be accessed anymore!
I waited 2 hours and then restarted the server and restored the account.
I used following query:
DELETE FROM `pligg_links` WHERE `link_id` > 10000
-
SELECT* FROM `pligg_links` WHERE `link_id` > 10000
works perfectly
Is there a better way to do this?
You could delete the rows in smaller sets. A quick script that deletes 1000 rows at a time should see you through.
"Delete from" can be very expensive for large data sets.
I recommend using partitioning.
This may be done slightly differently in PostgreSQL and MySQL, but in PostgreSQL you can create many tables that are "partitions" of the larger table or on a partition. Queries and whatnot can be run on the larger table. This can greatly increase the speed with which you can query given you partition correctly. Also, you can delete a partition by simply dropping it. This happens very very quickly because it is somewhat equivalent to dropping a table.
Documentation for table partitioning can be found here:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html
Make sure you have an index on link_id column.
And try to delete with chunks like 10.000 in a time.
Deleting from table is very costy operation.

Mysql performance with multiple update in ONE query

I need to update about 100 000 records in MySQL table (with indexes) so this process can take long time. i'm searching solution which will work faster.
I have three solutions but i have no time for speed tests.
Solutions:
usual UPDATE with each new record in array loop (bad perfomance)
using UPDATE syntax like here Update multiple rows with one query? - can't find any perfomance result
using LOAD DATA INFILE with the same value for key field, i guess in this case it will call UPDATE instead UNSERT - i guess should work faster when ever
Do you know which is solution is best.
The one important criteria is execution speed.
Thanks.
LOAD DATA INFILE the fastest way to upsert large amount of data from file;
second solution is not so bad as you might think. Especially if you can execute something like
update table
set
field = values
where
id in (list_of_ids)
but it would be better to post your update query.

MySQL multiple insert performance

I have data containing about 30 000 records. And I need to insert this data into MySQL table. I group this data in packages by 1000 and create multiple inserts like this:
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
How can I optimize performance of this inserting? Can I insert more than 1000 records per time? Each row contains data with size about 1KB. Thanks.
Try wrapping your bulk insert inside a transaction.
START TRANSACTION
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
COMMIT
That might improve performance, I'm not sure if mySQL can partially commit a bulk insert though (if it can't then this likely won't really help much)
Remember that even at 1.5 seconds, for 30,000 records each at ~1k in size, you're doing 20MB/s commit speed you could actually be drive limited depending on your hardware setup.
Advice then would be to investigate a SSD or changing your Raid setup or get faster mechanical drives (there's plenty of online articles on the pros and cons of using a SQL db mounted on a SSD).
You need to check mysql server configurations and specifically check buffer size etc.
You can remove indexes from the table, if any, to make it faster. Create the indexes onces data is in.
Look here, you should get all you need.
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
One insert statement with multiple values, it says, is much faster than multiple insert statements.
Is this a once off operation?
If so, just generate a single sql statement per data element and execute them all on the server. 30,000 really shouldnt take very long and you will have the simplest means of inputting your data.