Cleaning out an insanely large table

Cleaning out an insanely large table - sql-server-2008

I have a backup table that - thanks to poorly planned management by a programmer who is bad at math - has 3.5 billion records in it. The data drive is nearly full, performance is suffering. I need to clean out this table but just a simple SELECT COUNT(1) statement takes 30 minutes or more to return.
There is no primary key on the table.
The database uses the SIMPLE recovery model. There's only 25 GB left on the drive, so I need to be mindful that whatever I do has to leave space for the database to continue functioning for everyone else. I'm waiting for confirmation as I type this, but I don't think I need to keep any of the data that's in there now.
On a table with that many records, would TRUNCATE TABLE grind the system to a halt?
Also looking into the solutions proposed here: How to delete large data of table in SQL without log?
The idea is to allow my clients to keep working while I'm doing all this.

TRUNCATE TABLE would work if you no longer need the records. It deallocates the table's data pages instead of deleting row by row, so it is fast and minimally logged even on a table this size. It will not reduce the size of the database files, though. If you need to do that, you would have to shrink the data file afterwards.
If you would rather delete the records, Aaron Bertrand has good examples and test results for deleting in chunks here: https://sqlperformance.com/2013/03/io-subsystem/chunk-deletes
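A minimal sketch of the chunked-delete idea, for illustration only. It uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere; the table name backup_log and the batch size are hypothetical. On SQL Server the equivalent loop would typically be built around DELETE TOP (n), which is not shown here. The point is that every batch is committed on its own, so each transaction stays short and other sessions get a chance to work in between.

```python
import sqlite3


def delete_in_batches(conn, table, batch_size):
    """Delete every row of `table` in small committed batches.

    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        # rowid is SQLite's implicit row identifier; it lets us pick
        # "any batch_size rows" even though the table has no primary key.
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # end the short transaction before the next batch
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Hypothetical demo data: a keyless table, as in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE backup_log (payload TEXT)")
conn.executemany("INSERT INTO backup_log VALUES (?)", [("x",)] * 10_500)
conn.commit()

deleted = delete_in_batches(conn, "backup_log", 1_000)
remaining = conn.execute("SELECT COUNT(*) FROM backup_log").fetchone()[0]
```

If none of the data is needed, a single TRUNCATE TABLE is still the simpler choice; the batched loop is mainly useful when only part of the table has to go.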

Related

Database query efficiency

My boss is having me create a database table that keeps track of some of our inventory with various parameters. It's meant to be implemented as a cron job that runs every half hour or so, but the scheduling part isn't important since we've already discussed that we're handling it later.
What I want to know is whether it's more efficient to just delete everything in the table each time the script is called and repopulate it, or to go through each record to determine if any changes were made and update each entry accordingly. It's easier to do the former, but given that we have over 700 separate records to keep track of, I don't know if the time it takes to do this would put a huge load on the server. The script is written in PHP.

700 records is far too few to cause performance concerns. Don't even think about it; do whichever is easier for you.
But if it is performance that you are after, updating rows is slower than inserting rows (especially if you are not expecting any generated keys, so an insertion is a one-way operation to the database instead of a round trip to and from the database), and TRUNCATE TABLE tends to be faster than DELETE FROM.
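As a rough illustration of the "delete everything and repopulate" approach, here is a small sketch. The question's script is PHP; this uses Python with the built-in sqlite3 module purely so the example is self-contained, and the inventory table with its sku and quantity columns is hypothetical. Wrapping both steps in one transaction means readers never see a half-empty table.

```python
import sqlite3


def reload_inventory(conn, rows):
    """Replace the whole table contents with `rows` in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM inventory")
        conn.executemany(
            "INSERT INTO inventory (sku, quantity) VALUES (?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER)")

# Two consecutive runs of the (hypothetical) half-hourly job.
reload_inventory(conn, [(f"SKU-{i}", i) for i in range(700)])
reload_inventory(conn, [(f"SKU-{i}", i + 1) for i in range(700)])

count = conn.execute("SELECT COUNT(*) FROM inventory").fetchone()[0]
first_qty = conn.execute(
    "SELECT quantity FROM inventory WHERE sku = 'SKU-0'"
).fetchone()[0]
```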

If your inventory rows have IDs in a SQL database, it is good practice to update the existing rows rather than delete and reinsert them, since in theory repeatedly generating new IDs will eventually exhaust the ID range (overflow).
Another approach would be to use a NoSQL database such as MongoDB and simply send it the JSON bodies with their existing IDs; the database then works out by itself which documents to insert and which to update.

Modify database files

I have a system that a client designed and the table was originally not supposed to get larger than 10 gigs (maybe 10 million rows) over a few years. Well, they've imported a lot more information than they were thinking and within a month, the table is now up to 208 gigs (900 million rows).
I have very little experience with MySQL and a lot more experience with Microsoft SQL. Is there anything in MySQL that would allow the client to have the database span multiple files so the queries that are run wouldn't have to use the entire table and index? There is a field on the table that could easily be split on, but I wasn't sure how to do this.
The main issue I'm trying to solve is a retrieval query from this table. Inserts aren't a big deal at all since they're all done by a back-end service. I have a test system where the table is about 2 gigs (6 million rows) and my query takes less than a second. When this same query is run on the production system, it takes 20 seconds. I have a feeling that the query itself is fine and it's just the size of the table that's causing the issue. There is an index on this table created specifically for this query, and EXPLAIN shows that the query is using it.
If you have any other suggestions/questions, please feel free to ask.

Use partitioning, and especially the part of CREATE TABLE that sets DATA DIRECTORY and INDEX DIRECTORY.
With these options you can put partitions on separate drives if needed. Usually, though, it's enough to partition on a key that you can use in every query, typically time.

In addition to partitioning, which has already been mentioned, you might also want to run the tuning-primer script to check that your MySQL configuration is optimal.

How to continuously remove anything older than the newest 10 entries of a MySQL database (Possibly in JPQL/JPA)

I'm looking for a way to continuously monitor and delete the oldest entries so that the database is never larger than a certain size. I'm only interested in the latest 10, for example, and everything past that number should be deleted. The database is updated by various programs, but the program that does the monitoring and deleting will probably be a Java EE application with JPA. I don't know at which layer of the implementation this should be done: whether MySQL has built-in management that does this, whether I'll have to write a query that does this, or whether there is a feature of Java that can do this.
Edit: I'm using an auto-incremented id that could be used to determine the threshold for deleting.

This is a complex problem, because unless your table is not linked to any other table, you might very well have the latest row in table A referencing a very old row in table B. In this case, although the table B's row is very old, you can't delete it without breaking the coherence of your database.
Doing it "continuously" is even harder (read: impossible). I would first examine whether it's really needed. Disks are cheap, and 10 entries in an enterprise database is really nothing.
If it is needed, implement some purge mechanism and execute it every now and then, when the database is not used by anyone else.

I'll have a stab without knowing anything about your table schema:
DELETE FROM MyTable WHERE Id NOT IN (SELECT Id FROM (SELECT Id FROM MyTable ORDER BY Date DESC LIMIT 10) AS newest);
(MySQL uses LIMIT rather than TOP, and it does not accept LIMIT directly inside an IN subquery, hence the extra derived table.) This is pretty inefficient to run all the time. TRUNCATE will not help here, since it removes every row. You'd probably get better performance from limiting your reads to the 10 rows you need, and actually archiving / deleting the extraneous data only periodically.
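Since the question's edit mentions an auto-incremented id, a periodic purge can order by that id instead of a date column. The following is only a sketch: it runs against SQLite through Python's built-in sqlite3 module so it is self-contained, and the entries table and its columns are hypothetical. SQLite accepts LIMIT inside the IN subquery; MySQL does not, so there the inner SELECT would need to be wrapped in a derived table.

```python
import sqlite3


def purge_all_but_newest(conn, keep):
    """Delete every row except the `keep` rows with the highest id.

    Returns the number of rows removed.
    """
    with conn:  # one transaction for the whole purge
        cur = conn.execute(
            "DELETE FROM entries WHERE id NOT IN "
            "(SELECT id FROM entries ORDER BY id DESC LIMIT ?)",
            (keep,),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)"
)
conn.executemany(
    "INSERT INTO entries (body) VALUES (?)", [(f"e{i}",) for i in range(25)]
)

removed = purge_all_but_newest(conn, 10)
kept_ids = [row[0] for row in conn.execute("SELECT id FROM entries ORDER BY id")]
```

A scheduled job (for example a timer in the Java EE application) could call the equivalent statement every so often rather than after each insert.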

MySQL locking processing large LOAD DATA INFILE when live SELECT queries still needed

Looking for some help and advice please from Super Guru MySQL/PHP pros who can spare a moment of their time.
I have a web application in PHP/MySQL which has grown over the years and gets a lot of searches. It's hitting bottlenecks now when the various daily data dumps of new rows get processed using MySQL LOAD DATA INFILE.
It's a large MyISAM table with about 1.5 million rows, and all the SELECT queries run against it. When they arrive during the LOAD DATA INFILE of about 600k rows (and the deletion of outdated data), they just get backed up and take 30+ minutes to be freed up, making any of those searches fruitless.
I need to come up with a way to get that table updated while retaining the ability to provide SELECT results in a reasonable timeframe.
I'm completely out of ideas and have not been able to come up with a solution myself, as it's the first time I've encountered this sort of issue.
Any helpful advice, solutions or pointers from similar past experiences would be greatly appreciated as I would love to learn to resolve this sort of problem.
Many thanks everyone for your time! J

You can use the CONCURRENT keyword for LOAD DATA INFILE. This way, while you load the data, the table is still able to serve SELECTs.
The delete is more complicated. Personally, I would add an INT(1) column called 'status' that defines whether the row is active or not (0 = deleted), and then partition the table with a rule based on this status column.
This way, it will be easier to delete all rows where status=0 :P I haven't tested this last solution; I may do that in the near future.
The CONCURRENT keyword will only work if your table is optimized, that is, if the data file has no free space (holes left by deleted rows) in the middle. If there is any, LOAD DATA INFILE will lock the table.

MyISAM doesn't support row-level locking, so operations like mysqldump are forced to lock the entire table to guarantee a consistent dump. Your only practical options are to switch to another table type (like InnoDB) that supports row-level locking, and/or split your dump up into smaller pieces. The small dumps will still lock the table while they're dumping/reloading, but the lock periods would be shorter.
A hairier option would be to have "live" and "backup" tables. Do the dump/load operations on the backup table. When they're complete, swap it in for the live table (rename the tables, or have your code dynamically change which table it is using). If you can live with a short window of potentially stale data, this could be a better option.
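The rename-based swap could look roughly like the sketch below. It is an illustration only, run against SQLite via Python's built-in sqlite3 module; the table names listings and listings_staging are hypothetical. In MySQL the swap itself would normally be a single RENAME TABLE statement listing both renames, which MySQL performs atomically.

```python
import sqlite3


def swap_in(conn, live, staging):
    """Promote `staging` to `live`; the previous live table becomes `<live>_old`."""
    conn.execute("BEGIN")  # explicit transaction so both renames apply together
    try:
        conn.execute(f"DROP TABLE IF EXISTS {live}_old")
        conn.execute(f"ALTER TABLE {live} RENAME TO {live}_old")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {live}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN/COMMIT above are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE listings (title TEXT)")
conn.execute("INSERT INTO listings VALUES ('stale row')")

# The slow bulk load goes into the staging table; readers keep using `listings`.
conn.execute("CREATE TABLE listings_staging (title TEXT)")
conn.executemany(
    "INSERT INTO listings_staging VALUES (?)", [("fresh 1",), ("fresh 2",)]
)

swap_in(conn, "listings", "listings_staging")
live_rows = [r[0] for r in conn.execute("SELECT title FROM listings ORDER BY title")]
old_rows = [r[0] for r in conn.execute("SELECT title FROM listings_old")]
```

Keeping the old table around until the next load makes it easy to swap back if the new data turns out to be bad.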

You should switch your table storage engine from MyISAM to InnoDB. InnoDB provides row-locking (as opposed to MyISAM's table-locking) meaning while one query is busy updating or inserting a row, another query can update a different row at the same time.

Is Postgres better than MySql when one needs to add a column to a table with millions of rows?

We're having problems with MySQL. When I search around, I see many people having the same problem.
I have joined up with a product where the database has some tables with as many as 150 million rows. One example of our problem is that one of these tables has over 30 columns and about half of them are no longer used. When trying to remove columns or renaming columns, mysql wants to copy the entire table and rename. With this amount of data, it would take many hours to do this and the site would be offline pretty much the whole time. This is just the first of several large migrations to improve the schema. These aren't intended as a regular thing. Just a lot of cleanup I inherited.
I tried searching to see if people have the same problem with Postgres, and in comparison I find almost nothing about this issue. Is this because Postgres is a lot better at it, or just that fewer people are using Postgres?

In PostgreSQL, adding a new column without default value to a table is instantaneous, because the new column is only registered in the system catalog, not actually added on disk.

When the only tool you know is a hammer, all your problems look like a nail. For this problem, PostgreSQL is much much better at handling these types of changes. And the fact is, it doesn't matter how well you designed your app, you WILL have to change the schema on a live database someday. While MySQL's various engines really are amazing for certain corner cases, here none of them help. PostgreSQL's very close integration between the various layers means that you can have things like transactional ddl that allow you to roll back anything that isn't an alter / create database / tablespace. Or very very fast alter tables. Or non-impeding create indexes. And so on. It limits PostgreSQL to the things it does well (traditional transactional db load handling is a strong point) and not so great at the things that MySQL often fills in the gaps on, like live networked clustered storage with the ndb engine.
In this case none of the different engines in MySQL allow you to easily solve this problem. The very versatility of multiple storage engines means that the lexer / parser / top layer of the DB cannot be as tightly integrated to the storage engines, and therefore a lot of the cool things pgsql can do here mysql can't.
I've got a 118 GB table in my stats db. It has 1.1 billion rows in it. It really should be partitioned, but it's not read a whole lot, and when it is we can wait on it. At 300 MB/sec (the speed at which the array it's on can read) it takes approximately 118 × ~3.3 seconds to read, or roughly 6.5 minutes. This machine has 32 GB of RAM, so it cannot hold the table in memory.
When I ran the simple statement on this table:
alter table mytable add test text;
it hung waiting for a vacuum. I killed the vacuum (select pg_cancel_backend(12345), with the vacuum's pid in place of 12345) and it finished immediately. A vacuum on this table takes a long time to run, by the way. Normally it's not a big deal, but when making changes to table structure, you have to wait on vacuums, or kill them.
Dropping a column is just as simple and fast.
Now we come to the problem with postgresql, and that is the in-heap MVCC storage. If you add that column, then do an update table set test='abc' it updates each row, and exactly doubles the size of the table. Unless HOT can update the rows in place, but then you need a 50% fill factor table which is double sized to begin with. The only way to get the space back is to either wait and let vacuum reclaim it over time and reuse it one update at a time, or to run cluster or vacuum full to shrink it back down.
You can get around this by running updates on parts of the table at a time (update where pkid between 1 and 10000000; ...) and running vacuum between each run to reclaim the space.
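The range-by-range update can be sketched as follows. This is only an illustration under stated assumptions: it runs against SQLite through Python's built-in sqlite3 module, the helper names key_ranges and backfill are made up for the example, and the VACUUM step that PostgreSQL would need between batches is indicated by a comment rather than executed.

```python
import sqlite3


def key_ranges(lo, hi, step):
    """Yield inclusive (start, end) key ranges that together cover lo..hi."""
    start = lo
    while start <= hi:
        end = min(start + step - 1, hi)
        yield start, end
        start = end + 1


def backfill(conn, step):
    """Set test = 'abc' on every row, one key range per transaction."""
    lo, hi = conn.execute("SELECT MIN(pkid), MAX(pkid) FROM mytable").fetchone()
    if lo is None:  # empty table, nothing to do
        return 0
    updated = 0
    for start, end in key_ranges(lo, hi, step):
        with conn:  # commit each range separately
            cur = conn.execute(
                "UPDATE mytable SET test = 'abc' WHERE pkid BETWEEN ? AND ?",
                (start, end),
            )
        updated += cur.rowcount
        # On PostgreSQL, a VACUUM of the table would run here so the dead
        # row versions from this range can be reused by the next one.
    return updated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (pkid INTEGER PRIMARY KEY, test TEXT)")
conn.executemany("INSERT INTO mytable (pkid) VALUES (?)", [(i,) for i in range(1, 2501)])
conn.commit()

total_updated = backfill(conn, 1000)
ranges = list(key_ranges(1, 2500, 1000))
```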
So, both systems have warts and bumps to deal with.

Maybe because this should not be a regular occurrence.
Perhaps, reading between the lines, you need to be adding a row to another table instead of adding columns to a large existing table?
