Slow SELECT when inserting large amounts of data (MySQL)

I have a process that imports a lot of data (950k rows) using inserts that insert 500 rows at a time. The process generally takes about 12 hours, which isn't too bad. Normally doing a query on the table is pretty quick (under 1 second) as I've put (what I think to be) the proper indexes in place. The problem I'm having is trying to run a query when the import process is running. It is making the query take almost 2 minutes! What can I do to make these two things not compete for resources (or whatever)? I've looked into "insert delayed" but not sure I want to change the table to MyISAM.
Thanks for any help!

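Have you tried using priority hints?
SELECT HIGH_PRIORITY ... and INSERT LOW_PRIORITY ...
Note that these modifiers only affect storage engines that use table-level locking (MyISAM, MEMORY, MERGE), so they will do nothing for an InnoDB table.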

12 hours to insert 950k rows is pretty heavy duty. How big are these rows? What kind of indexes are on them? Even if the actual data insertion goes quickly, the continual updating of the indexes will definitely cause performance degradation for anything using those table(s) at the time.
Are you doing these imports with the bulk INSERT syntax (insert into tab (x) values (a), (b), (c), etc...) or one INSERT per row? Doing the bulk insert will require a longer index updating period (as it has to generate index data for 500 rows) than doing it for a single row. There will no doubt be some sort of internal lock placed on the indexes while the data is updated, in which case you're contending with 950k/500 = 1,900 locking sessions at minimum.
I found that on some of my bulk-insert scripts (an HTTP log analyzer for some custom data mining), it was quicker to disable indexes on the relevant tables, then re-enable/rebuild them after the data dump was completed. If I remember right, it was about 37 minutes to insert 200,000 rows of hit data with keys enabled, and about 3 minutes with no indexing.
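A minimal sketch of the "drop the index, load, rebuild" idea described above. It uses Python's built-in sqlite3 module purely as a runnable stand-in for MySQL, and the `hits` table, its columns, and the `idx_hits_url` index are hypothetical names. In MySQL the closest built-in is ALTER TABLE ... DISABLE KEYS / ENABLE KEYS, which as far as I know only affects non-unique indexes on MyISAM tables, so on InnoDB you would drop and recreate the secondary indexes instead.

```python
import sqlite3

def bulk_load_without_indexes(conn, rows):
    """Drop the secondary index, load every row, then rebuild the index once."""
    with conn:  # one transaction around the whole load
        # Assumption: idx_hits_url is the only secondary index on the table.
        conn.execute("DROP INDEX IF EXISTS idx_hits_url")
        conn.executemany("INSERT INTO hits (url, status) VALUES (?, ?)", rows)
        # The index is built once over all rows instead of once per insert.
        conn.execute("CREATE INDEX idx_hits_url ON hits (url)")
    return conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
```

The trade-off is that queries relying on the index are slow (or blocked) until the rebuild finishes, so this fits an offline import better than one that competes with live SELECTs.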

So I finally found the slowdown while searching during the import of my data. I had one query like this:
SELECT * FROM `properties` WHERE (state like 'Florida%') and (county like 'Hillsborough%') ORDER BY created_at desc LIMIT 0, 50
and when I ran an EXPLAIN on it, I found out it was scanning around 215,000 rows (even with proper indexes on state and county in place). I then ran an EXPLAIN on the following query:
SELECT * FROM `properties` WHERE (state = 'Florida') and (county = 'Hillsborough') ORDER BY created_at desc LIMIT 0, 50
and saw that it only had to scan 500 rows. Considering that the actual result set was something like 350, I think I identified the slowdown.
I've made the switch to not using "like" in my queries and am very happy with the snappier results.
Thanks to everyone for your help and suggestions. They are much appreciated!

You can try importing your data into an auxiliary table and then merging it into the main table. You don't lose performance on your main table, and I think your DB can manage the merge much faster than the many separate insertions.

Improving Speed of SQL 'Update' function - break into Insert/ Delete?

I'm running an ETL process and streaming data into a MySQL table.
Now it is being written over a web connection (fairly fast one) -- so that can be a bottleneck.
Anyway, it's a basic insert/ update function. It's a list of IDs as the primary key/ index .... and then a few attributes.
If a new ID is found, insert, otherwise, update ... you get the idea.
Currently doing an "update, else insert" function based on the ID (indexed) is taking 13 rows/ second (which seems pretty abysmal, right?). This is comparing 1000 rows to a database of 250k records, for context.
For comparison, a "pure" insert-everything approach already speeds the process up to 26 rows/second.
The thing with the pure "insert" approach is that I can have 20 parallel connections "inserting" at once ... (20 is the max allowed by the web host) ... whereas the "update" function cannot run in parallel at all.
Thus 26 x 20 = 520 rows/second. Far greater than 13 rows/second, especially if I can rig something up that allows even more data to be pushed through in parallel.
My question is ... given the massive benefit of inserting vs. updating, is there a way to duplicate the 'update' functionality (I only want the most recent insert of a given ID to survive) .... by doing a massive insert, then running a delete function after the fact, that deletes duplicate IDs that aren't the 'newest' ?
Is this something easy to implement, or something that comes up often?
What else can I do to ensure this update process is faster? I know getting rid of the 'web connection' between the ETL tool and DB is a start, but what else? This seems like it would be a fairly common problem.
Ultimately there are 20 columns, max of probably varchar(50) ... should I be getting a lot more than 13 rows processed/ second?

There are many possible 'answers' to your questions.
13/second -- there is a lot that can be done...
INSERT ... ON DUPLICATE KEY UPDATE ... ('IODKU') is usually the best way to do "update, else insert" (unless I don't know what you mean by it).
Batched inserts are much faster than inserting one row at a time. Around 100 rows per statement is optimal, giving roughly a 10x speedup. IODKU can (usually) be batched, too; see the VALUES() pseudo-function.
BEGIN; ...lots of writes... COMMIT; cuts back significantly on the per-transaction overhead.
Using a "staging" table to gather rows before applying the updates can have a significant benefit. My blog discusses that. It also covers batch "normalization".
Building Summary Tables on the fly interferes with high speed data ingestion. Another blog covers Summary tables.
Normalization can be used for de-duping, hence shrinking the disk footprint. This can be important for decreasing I/O for the 'Fact' table in Data Warehousing. (I am referring to your 20 x VARCHAR(50).)
RAID striping is a hardware help.
A Battery-Backed Write Cache on a RAID controller makes writes seem instantaneous.
SSDs speed up I/O.
If you provide some more specifics (SHOW CREATE TABLE, SQL, etc), I can be more specific.
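To illustrate the batching and BEGIN ... COMMIT points above, here is a rough sketch that sends rows in multi-row INSERT statements of 100 rows each inside a single transaction. It runs against Python's built-in sqlite3 only so that it is self-contained; the `items` table and its `id`/`attr` columns are made up for the example, and the batch size of 100 is simply the figure quoted above, not something I have measured.

```python
import sqlite3
from itertools import islice

def batches(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def insert_batched(conn, rows, batch_size=100):
    """Insert (id, attr) rows with one multi-row INSERT per batch, one transaction overall."""
    statements = 0
    with conn:  # a single BEGIN ... COMMIT around all batches
        for chunk in batches(rows, batch_size):
            # Builds "VALUES (?, ?), (?, ?), ..." with one group per row in the batch.
            placeholders = ", ".join(["(?, ?)"] * len(chunk))
            flat = [value for row in chunk for value in row]
            conn.execute(f"INSERT INTO items (id, attr) VALUES {placeholders}", flat)
            statements += 1
    return statements  # number of INSERT statements actually sent
```

With 250 rows this sends 3 statements instead of 250, which is where most of the saving over a slow connection would come from.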

Do it in the DBMS, and wrap it in a transaction.
To explain:
1. Load your data into a temporary table in MySQL in the fastest way possible. Bulk load, insert, do whatever works. Look at "load data infile".
2. Outer-join the temporary table to the target table, and INSERT those rows where the PK column of the target table is NULL.
3. Outer-join the temporary table to the target table, and UPDATE those rows where the PK column of the target table is NOT NULL.
Wrap steps 2 and 3 in a begin/commit (or start transaction/commit) pair for a transaction. The default behaviour is probably autocommit, which will mean you're doing a LOT of database work after every insert/update. Use transactions properly, and the work is only done once for each block.
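Steps 2 and 3 above can be sketched as follows, assuming the data has already been loaded into a staging table. Again this uses Python's sqlite3 as a stand-in so it can be run as-is; the `staging` and `target` tables and their `id`/`attr` columns are hypothetical. The UPDATE is written with a correlated subquery because that is portable; in MySQL you would more likely use a multi-table UPDATE with a JOIN.

```python
import sqlite3

def merge_staging(conn):
    """Apply staging rows to target: update matches, insert the rest, in one transaction."""
    with conn:  # both statements commit together or not at all
        # Step 3: overwrite rows whose id already exists in target.
        conn.execute("""
            UPDATE target
               SET attr = (SELECT s.attr FROM staging s WHERE s.id = target.id)
             WHERE id IN (SELECT id FROM staging)
        """)
        # Step 2: outer-join staging to target and insert rows with no match.
        conn.execute("""
            INSERT INTO target (id, attr)
            SELECT s.id, s.attr
              FROM staging s
              LEFT JOIN target t ON t.id = s.id
             WHERE t.id IS NULL
        """)
        # Empty the staging table so the next batch starts clean.
        conn.execute("DELETE FROM staging")
```

The update runs before the insert here so that freshly inserted rows are not touched a second time; the end result is the same either way.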

Remove over 100,000 rows from mysql table - server crashes

When I try to remove over 100,000 rows from a MySQL table, the server freezes and none of its websites can be accessed anymore!
I waited 2 hours and then restarted the server and restored the account.
I used the following query:
DELETE FROM `pligg_links` WHERE `link_id` > 10000
By contrast, this query works perfectly:
SELECT * FROM `pligg_links` WHERE `link_id` > 10000
Is there a better way to do this?

You could delete the rows in smaller sets. A quick script that deletes 1000 rows at a time should see you through.

"Delete from" can be very expensive for large data sets.
I recommend using partitioning.
This may be done slightly differently in PostgreSQL and MySQL, but in PostgreSQL you can create many tables that are "partitions" of a larger table. Queries and whatnot can be run on the larger table or on a single partition. This can greatly increase query speed, provided you partition correctly. Also, you can delete a partition by simply dropping it. This happens very quickly because it is roughly equivalent to dropping a table.
Documentation for table partitioning can be found here:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html

Make sure you have an index on the link_id column.
And try deleting in chunks of, say, 10,000 rows at a time.
Deleting from a table is a very costly operation.
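The "quick script" for deleting in chunks suggested above might look something like this. It is only a sketch: it uses Python's sqlite3 so it runs without a server, and it picks each chunk with a subquery because stock SQLite has no DELETE ... LIMIT. In MySQL you could put LIMIT directly on the DELETE statement and loop until no rows are affected. The chunk size is an arbitrary starting point.

```python
import sqlite3

def delete_in_chunks(conn, min_id, chunk_size=1000):
    """Delete rows with link_id > min_id, at most chunk_size rows per transaction."""
    total = 0
    while True:
        with conn:  # commit after every chunk so each transaction stays small
            cur = conn.execute(
                """
                DELETE FROM pligg_links
                 WHERE link_id IN (
                       SELECT link_id FROM pligg_links
                        WHERE link_id > ?
                        ORDER BY link_id
                        LIMIT ?)
                """,
                (min_id, chunk_size),
            )
        if cur.rowcount == 0:  # nothing left to delete
            break
        total += cur.rowcount
    return total
```

Because every chunk commits on its own, other queries get a chance to run between chunks, and an interrupted run can simply be restarted.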

table optimization

I am using MySQL. I have a table named cars in my dev_db database.
I inserted about 6,000,000 rows into the table (that's a large amount of data) using bulk inserts like the following:
INSERT INTO cars (cid, name, msg, date)
VALUES (1, 'blabla', 'blabla', '2001-01-08'),
(11, 'blabla', 'blabla', '2001-11-28'),
... ,
(3, 'blabla', 'blabla', '2010-06-03');
After this large data insertion into my cars table,
I decided to also optimize the table as follows:
OPTIMIZE TABLE cars;
I waited 53 minutes for the optimization. When it finally finished, the mysql console showed me the following message:
The Msg_text says this table does not support optimize... , which raises two questions:
1. Does the message above mean the 53 minutes I waited actually did nothing useful?
2. Is it necessary to optimize my table after inserting a large amount of data, and why?

Optimize is useful if you have removed or overwritten rows, or if you have changed indexes. If you have only inserted data, there is no need to optimize.
The MySQL Optimize Table command will effectively de-fragment a mysql
table and is very useful for tables which are frequently updated
and/or deleted.
Also look here: http://www.dbtuna.com/article.php?id=15
It looks like you have an InnoDB table, which doesn't support OPTIMIZE TABLE.
As you can read in the output, InnoDB does not support optimize as such.
Instead it does a recreate + optimize on the indexes.
The result is much the same and should not really bother you, you end up with optimized indexes.
However you only ever have to optimize your indexes if you delete rows or update indexed fields.
If you only ever insert then your B-trees will not get unbalanced and do not need optimization.
So:
Does the message above mean the 53 minutes I waited actually did nothing useful?
The time spent waiting was useless, but not for the reason you think.
If there is anything to optimize, MySQL will do it.
Is it necessary to optimize my table after inserting a large amount of data, and why?
No, never.
The reason is that MySQL (InnoDB) uses B-trees, which are fast only if they are balanced.
If the nodes are all on one side of the tree, the index degrades into an ordered list, which gives you O(n) worst-case time. A fully balanced tree has O(log n) time.
However the index can only become unbalanced if you delete rows, or alter the values of indexed fields.

How to manage Huge operations on MySql

I have a MySQL database with a lot of records (about 4,000,000,000 rows), and I want to process them in order to reduce them (to about 1,000,000,000 rows).
Assume I have the following tables:
table RawData: more than 5000 rows per second arrive that I want to insert into RawData
table ProcessedData: this table is processed (aggregated) storage for rows that were inserted into RawData.
minimum row count > 20,000,000
table ProcessedDataDetail: holds the details of table ProcessedData (the data that was aggregated)
Users want to view and search the ProcessedData table, which requires joining more than 8 other tables.
Inserting into RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are very slow. I used a lot of indexes. Assume my data length is 1G, but my index length is 4G :). (I want to get rid of these indexes; they slow my process down.)
How can I increase the speed of this process?
I think I need a shadow table of ProcessedData, call it ProcessedDataShadow. Then I would process RawData and aggregate it with ProcessedDataShadow, then insert the result into ProcessedDataShadow and ProcessedData. What do you think?
(I am developing the project in C++.)
Thank you in advance.

Without knowing more about what your actual application is, I have these suggestions:
Use InnoDB if you aren't already. InnoDB makes use of row locks and is much better at handling concurrent updates/inserts. It will be slower if you don't work concurrently, but the row-locking is probably a must-have for you, depending on how many sources you will have for RawData.
Indexes usually speed things up, but badly chosen indexes can make things slower. I don't think you want to get rid of them, but a lot of indexes can make inserts very slow. It is possible to disable indexes when inserting batches of data, in order to avoid updating the indexes on each insert.
If you will be selecting huge amounts of data that might disturb the data collection, consider using a replicated slave database server that you use only for reading. Even if that locks rows/tables, the primary (master) database won't be affected, and the slave will catch back up as soon as it is free to do so.
Do you need to process data in the database? If possible, maybe collect all data in the application and only insert ProcessedData.

You've not said what the structure of the data is, how it's consolidated, how promptly data needs to be available to users, or how lumpy the consolidation process can be.
However, the most immediate problem will be sinking 5000 rows per second. You're going to need a very big, very fast machine (probably a sharded cluster).
If possible I'd recommend writing a consolidating buffer (using an in-memory hash table - not in the DBMS) to put the consolidated data into - even if it's only partially consolidated - then update from this into the ProcessedData table rather than trying to populate it directly from RawData.
Indeed, I'd probably consider separating the raw and consolidated data onto separate servers/clusters (the MySQL federated engine is handy for providing a unified view of the data).
Have you analysed your queries to see which indexes you really need? (hint - this script is very useful for this).
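The in-memory consolidating buffer suggested above could look roughly like this. The question does not say how rows are aggregated, so the count-and-sum per key used here is purely an assumption, as are the class name and the flush threshold. The asker is working in C++; Python is used only to keep the sketch short, and the same shape maps onto a std::unordered_map.

```python
from collections import defaultdict

class ConsolidatingBuffer:
    """Aggregate raw events in memory and hand them over in batches."""

    def __init__(self, flush_threshold=10_000):
        self.flush_threshold = flush_threshold
        # key -> [event count, running sum]; the aggregation itself is hypothetical.
        self.totals = defaultdict(lambda: [0, 0.0])

    def add(self, key, value):
        """Fold one raw event into the buffer; return True once a flush is due."""
        entry = self.totals[key]
        entry[0] += 1
        entry[1] += value
        return len(self.totals) >= self.flush_threshold

    def drain(self):
        """Return the aggregated rows and empty the buffer."""
        rows = [(key, count, total) for key, (count, total) in self.totals.items()]
        self.totals.clear()
        return rows
```

The rows returned by drain() would then be written to ProcessedData in one batched statement, so the database sees one write per distinct key per flush rather than one per raw event.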

Approximately how long should it take to delete 10m records from an MySQL InnoDB table with 30m records?

I am deleting approximately 1/3 of the records in a table using the query:
DELETE FROM `abc` LIMIT 10680000;
The query appears in the processlist with the state "updating". There are 30m records in total. The table has 5 columns and two indexes, and when dumped to SQL the file is about 9GB.
This is the only database and table in MySQL.
This is running on a machine with 2GB of memory, a 3 GHz quad-core processor and a fast SAS disk. MySQL is not performing any reads or writes other than this DELETE operation. No other "heavy" processes are running on the machine.
This query has been running for more than 2 hours -- how long can I expect it to take?
Thanks for the help! I'm pretty new to MySQL, so any tidbits about what's happening "under the hood" while running this query are definitely appreciated.
Let me know if I can provide any other information that would be pertinent.
Update: I just ran a COUNT(*), and in 2 hours, it's only deleted 200k records. I think I'm going to take Joe Enos' advice and see how well inserting the data into a new table and dropping the previous table performs.
Update 2: Sorry, I actually misread the number. In 2 hours, it's not deleted anything. I'm confused. Any suggestions?
Update 3: I ended up using mysqldump with --where "true LIMIT 10680000,31622302" and then importing the data into a new table. I then deleted the old table and renamed the new one. This took just over half an hour.

Don't know if this would be any better, but it might be worth thinking about doing the following:
Create a new table and insert 2/3 of the original table into the new one.
Drop the original table.
Rename the new table to the original table's name.
This would prevent the log file from having all the deletes, but I don't know if inserting 20m records is faster than deleting 10m.
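A rough sketch of the copy-and-swap steps above, run against Python's sqlite3 so it is self-contained. The original DELETE has no WHERE clause, so the "keep rows with id above a cutoff" rule, the `id`/`payload` columns, and the `abc_new` name are all assumptions made for the example. In MySQL you would probably create the new table with CREATE TABLE ... LIKE and swap with RENAME TABLE.

```python
import sqlite3

def rebuild_keeping(conn, cutoff_id):
    """Copy rows with id > cutoff_id into a new table, then swap it in for abc."""
    with conn:
        # Assumption: abc has exactly these two columns.
        conn.execute("CREATE TABLE abc_new (id INTEGER PRIMARY KEY, payload TEXT)")
        conn.execute(
            "INSERT INTO abc_new SELECT id, payload FROM abc WHERE id > ?",
            (cutoff_id,),
        )
        conn.execute("DROP TABLE abc")  # the old rows disappear without row-by-row deletes
        conn.execute("ALTER TABLE abc_new RENAME TO abc")
    return conn.execute("SELECT COUNT(*) FROM abc").fetchone()[0]
```

Any secondary indexes have to be recreated on the new table, ideally after the copy so they are built in one pass.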
You should post the table definition.
Also, to find out why it is taking so much time, try enabling profiling for the delete request:
SET profiling=1;
DELETE FROM abc LIMIT 10680000;
SET profiling=0;
SHOW PROFILES;
SHOW PROFILE ALL FOR QUERY X; (X is the ID of your query shown in SHOW PROFILES)
and post what it returns (though I think the query must finish before the profiling data is returned)
http://dev.mysql.com/doc/refman/5.0/en/show-profiles.html
Also, I think you'll get more responses on ServerFault ;)
When you run this query, the InnoDB log file for the database is used to record all the details of the rows that are deleted - and if this log file isn't large enough from the outset it'll be auto-extended as and when necessary (if configured to do so) - I'm not familiar with the specifics but I expect this auto-extension is not blindingly fast. 2 hours does seem like a long time - but doesn't surprise me if the log file is growing as the query is running.
Is the table from which the records are being deleted on the end of a foreign key (i.e. does another table reference it through a FK constraint)?
I hope your query has finished by now ... :) but from what I've seen, LIMIT with large numbers (and I have never tried numbers this large) is very slow. I would try something based on the PK, like
DELETE FROM abc WHERE abc_pk < 10680000;