I have a MySql table that has about 5.5 million rows of data. Every month I need to reload this data with a new data file. Since it takes a while to remove the old data, I'm adding limit 2000000 to split the job up into chunks. Example:
DELETE FROM `list_content` WHERE list_id = 3 limit 2000000
My theory is that memory might be released after the query is done, and splitting it into chunks like this might be beneficial in not consuming resources. However I haven't found anything that supports my theory. Is there any benefit to splitting up a query like this instead of just letting it run for 20 minutes?
To answer your question directly - there is no benefit. Assuming that you're not hitting any physical limits (like the maximum amount of physical memory your OS supports) and that MySQL is a well written program, then no.
A couple things you might consider
(1)
DELETE QUICK FROM `list_content` WHERE list_id = 3
followed by a
OPTIMIZE TABLE `list_content`
DELETE QUICK will not perform any house-keeping on the index blocks as its deleting. You can then do all the house-keeping at once with the OPTIMIZE TABLE statement.
(2)
If you're wiping out the whole table, then load the new data into a new table... call it list_temp
DROP TABLE `list_content`; -- very quick since this is simple a delete file op
RENAME TABLE `list_temp` to `list_content`; -- also quick since this is a file rename op
Related
I have a table with a few billion rows of data and I am trying to build 5 indexes on it at once. The table format is MyISAM to save space. Once I build the indexes this will be a static table, I just need it to be read only.
I created the indexes using this command:
alter table links8 add index(uid,tid), add index (date), add index (tid), add index (userid), add index (updated,uid,tid,userid,date);
The command has been running for over 45 days. You read that right: 45 DAYS. I can see that the temp files are still being accessed, it isn't a dead query.
My question is: wtf? Seems like it should take a few hours at most to sort and build an index even with a few billion rows.
Since I have a static table, is there another storage engine that makes sense to use? Innodb takes up way too much space.
45 days doesn't seem right, because in that time, MySQL is bound to do something, and that something is likely either consuming RAM or storage, likely both, which means that you should have run out of either at some point.
I'd assume it's RAM, because that usually is where things get sparse ;)
Now, you're absolutely right, sorting a few billion values in memory shouldn't take ages. Sorting a few billion values that are the concatenated values in (updated,uid,tid,userid,date) though most likely doesn't happen in RAM. Assuming updated and date are of type datetime, they take 8 bytes each; uid,tid,userid would normally be 32 bit ints, but since your table has > 2**32 entries (I'm assuming that), unique ID's would be 8 byte long, too. So one value of type (updated,uid,tid,userid,date) would be 40B long.
Now throw in let's say 5 billion of these; you get 200 GB of pure row data that you'll need to sort to build an index. Assuming you're not doing this on some huge machine, you obviously need to swap out parts of these values to disk -- since you see temporary files appear, my wild guess is that this is happening, and MySQL is actively doing that itself. Now, sorting algorithms that work on parts of the rows iteratively are much slower, because first you sort all parts, then you mix up the parts in a manner that's better sorted than before, than you re-partition your data, you sort your parts ... with storing and loading from disk in between.
By the way, a 45 day lasting memory operation is likely to be prone to memory bit errors, if no correctional measures are taken (basically, use ECC for this kind of task, or you end up with indexed, corrupted data).
MySQL themselves suggest that you just build a special MD5 index that takes the hash of your search tuple and looks for that, since sorting 128bit (==16 byte) MD5 hashes might be easier than sorting 5*8Byte == 40*8 bit == 320bit long composite rows.
I found a better solution.
I created a new table with the indexes already in place then issued an insert from one table to the other. The way this works is it fills the MYD (raw data file) up and then creates the indexes after that. Once it has started creating the indexes I killed the query. Then on the filesystem I used myisamchk to repair the table manually.
That command looked like this:
myisamchk --force --fast --update-state --key_buffer_size=2000M --sort_buffer_size=2000M --read_buffer_size=10M --write_buffer_size=10M TABLE.MYI
And the whole thing took less than 12 hours and the data looks good!
UPDATE:
Here is the flow summarized.
create table2 indentical to table1 with indexes;
insert into table2 select * from table1;
once the MYD file is full and it starts on the MYI file kill the query
then shutdown mysql and run the myisamchk query and restart mysql
OR
copy table2.MYD and table2.MYI to table3.MYD and table3.MYI, then run myisamchk, then copy table2.frm to table3.frm and change the permissions, when it's all done you should be able to access table3 without a restart of mysql
I got a mysql database with approx. 1 TB of data. Table fuelinjection_stroke has apprx. 1.000.000.000 rows. DBID is the primary key that is automatically incremented by one with each insert.
I am trying to delete the first 1.000.000 rows using a very simple statement:
Delete from fuelinjection_stroke where DBID < 1000000;
This query is takeing very long (>24h) on my dedicated 8core Xeon Server (32 GB Memory, SAS Storage).
Any idea whether the process can be sped up?
I believe that you table becomes locked. I've faced same problem and find out that can delete 10k records pretty fast. So you might want to write simple script/program which will delete records by chunks.
DELETE FROM fuelinjection_stroke WHERE DBID < 1000000 LIMIT 10000;
And keep executing it until it deletes everything
Are you space deprived? Is down time impossible?
If not, you could fit in a new INT column length 1 and default it to 1 for "active" (or whatever your terminology is) and 0 for "inactive". Actually, you could use 0 through 9 as 10 different states if necessary.
Adding this new column will take a looooooooong time, but once it's over, your UPDATEs should be lightning fast as long as you do it off the PRIMARY (as you do with your DELETE) and you don't index this new column.
The reason why InnoDB takes so long to DELETE on such a massive table as yours is because of the cluster index. It physically orders your table based upon your PRIMARY (or first UNIQUE it finds...or whatever it feels like if it can't find PRIMARY or UNIQUE), so when you pull out one row, it now reorders your ENTIRE table physically on the disk for speed and defragmentation. So it's not the DELETE that's taking so long. It's the physical reordering after that row is removed.
When you create a new INT column with a default value, the space will be filled, so when you UPDATE it, there's no need for physical reordering across your huge table.
I'm not sure exactly what your schema is exactly, but using a column for a row's state is much faster than DELETEing; however, it will take more space.
Try setting values:
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT (for non-windows machine)
innodb_buffer_pool_size=25GB (currently it is close to 21GB)
innodb_doublewrite=0
innodb_support_xa=0
innodb_thread_concurrency=0...1000 (try different values, beginning with 200)
References:
MySQL docs for description of different variables.
MySQL Server Setting Tuning
MySQL Performance Optimization basics
http://bugs.mysql.com/bug.php?id=28382
What indexes do you have?
I think your issue is that the delete is rebuilding the index on every iteration.
I'd delete the indexes if any, do the delete, then re-add the indexes. It'll be far faster, (I think).
I was having the same problem, and my table has several indices that I didn't want to have to drop and recreate. So I did the following:
create table keepers
select * from origTable where {clause to retrieve rows to preserve};
truncate table origTable;
insert into origTable null,keepers.col2,...keepers.col(last) from keepers;
drop table keepers;
About 2.2 million rows were processed in about 3 minutes.
Your database may be checking for records that need to be modified in a foreign key (cascades, delete).
But I-Conica answer is a good point(+1). The process of deleting a single record and updating a lot of indexes during done 100000 times is inefficient. Just drop the index, delete all records and create it again.
And of course, check if there is any kind of lock in the database. One user or application can lock a record or table and your query will be waiting until the user release the resource or it reachs a timeout. One way to check if your database is doing real work or just waiting is lauch the query from a connection that sets the --innodb_lock_wait_timeout parameter to a few seconds. If it fails at least you know that the query is OK and that you need to find and realease that lock. Examples of locks are Select * from XXX For update and uncommited transactions.
For such long tables, I'd rather use MYISAM, specially if there is not a lot of transactions needed.
I don't know exact ans for ur que. But writing another way to delete those rows, pls try this.
delete from fuelinjection_stroke where DBID in
(
select top 1000000 DBID from fuelinjection_stroke
order by DBID asc
)
I have a MySql DataBase. I have a lot of records (about 4,000,000,000 rows) and I want to process them in order to reduce them(reduce to about 1,000,000,000 Rows).
Assume I have following tables:
table RawData: I have more than 5000 rows per sec that I want to insert them to RawData
table ProcessedData : this table is a processed(aggregated) storage for rows that were inserted at RawData.
minimum rows count > 20,000,000
table ProcessedDataDetail: I write details of table ProcessedData (data that was aggregated )
users want to view and search in ProcessedData table that need to join more than 8 other tables.
Inserting in RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are very slow. I used a lot of Indexes. assume my data length is 1G, but my Index length is 4G :). ( I want to get ride of these indexes, they make slow my process)
How can I Increase speed of this process ?
I think I need a shadow table from ProcessedData, name it ProcessedDataShadow. then proccess RawData and aggregate them with ProcessedDataShadow, then insert the result in ProcessedDataShadow and ProcessedData. What is your idea??
(I am developing the project by C++)
thank you in advance.
Without knowing more about what your actual application is, I have these suggestions:
Use InnoDB if you aren't already. InnoDB makes use of row-locks and are much better at handling concurrent updates/inserts. It will be slower if you don't work concurrently, but the row-locking is probably a must have for you, depending on how many sources you will have for RawData.
Indexes usually speeds up things, but badly chosen indexes can make things slower. I don't think you want to get rid of them, but a lot of indexes can make inserts very slow. It is possible to disable indexes when inserting batches of data, in order to prevent updating indexes on each insert.
If you will be selecting huge amount of data that might disturb the data collection, consider using a replicated slave database server that you use only for reading. Even if that will lock rows /tables, the primary (master) database wont be affected, and the slave will get back up to speed as soon as it is free to do so.
Do you need to process data in the database? If possible, maybe collect all data in the application and only insert ProcessedData.
You've not said what the structure of the data is, how its consolidated, how promptly data needs to be available to users nor how lumpy the consolidation process can be.
However the most immediate problem will be sinking 5000 rows per second. You're going to need a very big, very fast machine (probably a sharded cluster).
If possible I'd recommend writing a consolidating buffer (using an in-memory hash table - not in the DBMS) to put the consolidated data into - even if it's only partially consolidated - then update from this into the processedData table rather than trying to populate it directly from the rawData.
Indeed, I'd probably consider seperating the raw and consolidated data onto seperate servers/clusters (the MySQL federated engine is handy for providing a unified view of the data).
Have you analysed your queries to see which indexes you really need? (hint - this script is very useful for this).
I am deleting approximately 1/3 of the records in a table using the query:
DELETE FROM `abc` LIMIT 10680000;
The query appears in the processlist with the state "updating". There are 30m records in total. The table has 5 columns and two indexes, and when dumped to SQL the file about 9GB.
This is the only database and table in MySQL.
This is running on a machine with 2GB of memory, a 3 GHz quad-core processor and a fast SAS disk. MySQL is not performing any reads or writes other than this DELETE operation. No other "heavy" processes are running on the machine.
This query has been running for more than 2 hours -- how long can I expect it to take?
Thanks for the help! I'm pretty new to MySQL, so any tidbits about what's happening "under the hood" while running this query are definitely appreciated.
Let me know if I can provide any other information that would be pertinent.
Update: I just ran a COUNT(*), and in 2 hours, it's only deleted 200k records. I think I'm going to take Joe Enos' advice and see how well inserting the data into a new table and dropping the previous table performs.
Update 2: Sorry, I actually misread the number. In 2 hours, it's not deleted anything. I'm confused. Any suggestions?
Update 3: I ended up using mysqldump with --where "true LIMIT 10680000,31622302" and then importing the data into a new table. I then deleted the old table and renamed the new one. This took just over half an hour.
Don't know if this would be any better, but it might be worth thinking about doing the following:
Create a new table and insert 2/3 of the original table into the new one.
Drop the original table.
Rename the new table to the original table's name.
This would prevent the log file from having all the deletes, but I don't know if inserting 20m records is faster than deleting 10m.
You should post the table definition.
Also, to know why is it taking to much time, try to enable the profile mode on the delete request via :
SET profiling=1;
DELETE FROM abc LIMIT 10680000;
SET profiling=0;
SHOW PROFILES;
SHOW PROFILE ALL FOR QUERY X; (X is the ID of your query shown in SHOW PROFILES)
and post what it returns (But I think the query must end to return the profiling data)
http://dev.mysql.com/doc/refman/5.0/en/show-profiles.html
Also, I think you'll get more responses on ServerFault ;)
When you run this query, the InnoDB log file for the database is used to record all the details of the rows that are deleted - and if this log file isn't large enough from the outset it'll be auto-extended as and when necessary (if configured to do so) - I'm not familiar with the specifics but I expect this auto-extension is not blindingly fast. 2 hours does seem like a long time - but doesn't surprise me if the log file is growing as the query is running.
Is the table from which the records are being deleted on the end of a foreign key (i.e. does another table reference it through a FK constraint)?
I hope your query ended by now ... :) but from what I've seen, LIMIT with large numbers (and I never tried this kind of numbers) is very slow. I would try something based on the pk like
DELETE FROM abc WHERE abc_pk < 10680000;
I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on simple blank table (4 integer columns with 1 primary key). Setup as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely-decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially due to the fact that the query requires the use of the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. The table would become over 1000 times larger if I were to and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding it is when your current method is taking days), you may be able to turn off or remove the uniqueness constraints and then use a DELETE query later to re-establish uniqueness, then re-enable/add the constraints. I have used this technique when importing into an INNODB table in the past, and found even with the later delete it was overall much faster.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster then the inserts, but I can't find a reference at present) or by using it directly via the CSV storage engine, depending on your needs.
Sorry to keep throwing comments at you (last one, probably).
I just found this article which provides an example of a converting a large table from MyISAM to InnoDB, while this isn't what you are doing, he uses an intermediate Memory table and describes going from memory to InnoDB in an efficient way - Ordering the table in memory the way that InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM it might be worth a look since you already have a "correct" memory table built.
I don't use mysql but use SQL server and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the idfield into a column in the staging table. Then I update where the id field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to wrorry about adjusting any constraints or dropping indexes or any of that since I do most of my processing before I hit the prod table.
Consider if a simliar process might work better for you. Also could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 gig file I have into staging in around 16 minutes) instead of working row-by-row?