Does the space occupied by deleted rows get re-used? - mysql

I have read several times that after you delete a row in an InnoDB table in MySQL, its space is not reused, so if you make a lot of INSERTs into a table and then periodically DELETE some rows the table will use more and more space on disk, as if the rows were not deleted at all.
Recently I've been told though that the space occupied by deleted rows is re-used but only after some transactions are complete and even then - not fully. I am now confused.
Can someone please make sense of this to me? I need to do a lot of INSERTs into an InnoDB table and then every X minutes I need to DELETE records that are more than Y minutes old. Do I have a problem of ever-growing InnoDB table here, or is it paranoia?

It is paranoia :)
Databases don't grow in size unnecessarily, but for performance reasons the space is not freed either.
What you've most probably heard is that when you delete records, that space is not given back to the operating system. Instead, it's kept as empty space for the DB to re-use afterwards.
This is because:
The DB needs some disk space to save its data; if it doesn't have any free space, it reserves some empty space first.
When you insert a new row, a piece of that space is used.
When you run out of free space, a new block is reserved, and so on.
Now, when you delete some rows, in order to avoid reserving more and more blocks, their space is kept free but never given back to the operating system, so it can be used again later without having to reserve new blocks.
As you can see, space is re-used, but never given back. That's the key point to your question.
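For instance, you can see how much of this reusable free space a table is carrying by checking the DATA_FREE column in information_schema (the schema and table names below are hypothetical):

    -- Shows data size plus the space InnoDB has marked as free for
    -- re-use inside the tablespace (not returned to the OS).
    SELECT table_name,
           ROUND(data_length / 1024 / 1024, 1) AS data_mb,
           ROUND(data_free   / 1024 / 1024, 1) AS free_mb
    FROM   information_schema.tables
    WHERE  table_schema = 'mydb'
      AND  table_name   = 'events';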

In InnoDB, there is no practical way of freeing up the space.
Use a per-table ibdata file (innodb_file_per_table); that will enable you to delete records, copy the remaining data to a new table, and drop the old table, thus recovering the space.
Or use mysqldump and a whole lot of recipes to clean up the whole server. Check the following:
http://dev.mysql.com/doc/refman/5.0/en/adding-and-removing.html
All of these methods become impractical when you are using huge tables (in my case more than 250 GB) and you must keep deleting records for better performance.
You will have to think seriously about whether you have enough space on your hard disk to perform one of the above operations (in my case I do not think 1 TB is enough for all these actions).
With InnoDB tables (and MySQL itself), the options are fairly limited if you have a serious database size.
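As a rough sketch of the copy-to-a-new-table approach above: with innodb_file_per_table enabled (it is on by default in recent MySQL versions), rebuilding the table shrinks its file back to roughly the size of the live data. The table name here is hypothetical:

    -- Rebuild the table; for InnoDB this copies the rows into a fresh
    -- tablespace and drops the old one, shrinking the .ibd file on disk.
    OPTIMIZE TABLE events;

    -- Equivalent explicit form:
    ALTER TABLE events ENGINE=InnoDB;

Note that either statement needs enough free disk space to hold the rebuilt copy while it runs, which is exactly the limitation described above for very large tables.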

How to fine tune AWS R4 Aurora MySql database

I have a database currently at 6.5Gb but growing fast...
Currently on a R4L Aurora server, 15.25G Ram, 2 core CPU
I am looking at buying a Reserved Instance to cut costs, but worried that if the database grows fast, e.g. reaches over 15G within a year, I'll need to get a bigger server.
99% of the data is transactional history, this table is the biggest by far. It is written very frequently, but once a row has been written it doesn't change often (although it does on occasion).
So few questions...
1) Should I disable the cache?
2) Will I be ok with 15G ram, even if the database itself goes to (say) 30G, or will I see massive speed issues
3) The database is well indexed, but could this be improved? E.g. if (say) 1 million records belong to 1 user, is there a way to partition the data to prevent that slowing down access for other users?
Thanks
"Should I disable the cache?" -- Which "cache"?
"will I see massive speed issues" -- We need to see the queries, etc.
"The database is well indexed" -- If that means you indexed every column, then it is not well indexed. Please show us SHOW CREATE TABLE and a few of the important queries.
"partition" -- With few exceptions, partitioning does not speed up MySQL tables. Again, we need details.
"15.25G Ram" & "database...15G" -- It is quite common for the dataset size to be bigger, even much bigger, than RAM. So, this pair of numbers are not necessarily good to compare to each other.
"1 million records belong to 1 user" -- Again, details, please.
You should first quantify the data growth statistically. This can be done by running a COUNT(*) query grouped by the year of the created-date column. Once you have a count of records per year, you can understand what's going on.
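A minimal sketch of that growth query (table and column names are hypothetical):

    -- Row count per year, to see how quickly the data is growing.
    SELECT YEAR(created_at) AS yr, COUNT(*) AS row_count
    FROM   transaction_history
    GROUP  BY YEAR(created_at)
    ORDER  BY yr;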
Now you can think of possible solutions
You can remove data that is no longer relevant from a history standpoint and keep the storage limited.
If there is a large amount of data, e.g. BLOBs, you could consider storing it in S3 and keeping only a reference in the database table.
Delete any unwanted tables. Sometimes DBAs create temporary backup tables and leave them behind after the work is done. You can clean up such tables.
The memory of the instance just comes into play when the engine fetches pages into the buffer pool for page misses. It does not depend on your actual data size (except in extreme cases, for example, your records are really really huge). The rule of thumb is to make sure you always keep your working set warm in the buffer pool, and avoid pages getting flushed.
If your app does need to touch a large amount of data, then the ideal way to do that would be to have dedicated replicas for specific kinds of queries. That way, you avoid swapping out valid pages in favor of newer queries. Aurora has custom endpoints support now, and that makes this even easier to manage.
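One rough way to check whether the working set still fits in the buffer pool, assuming plain MySQL-compatible status variables, is to compare logical reads against the reads that had to go to disk:

    -- Innodb_buffer_pool_read_requests = logical (in-memory) reads,
    -- Innodb_buffer_pool_reads         = reads that missed the buffer pool.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
    -- A steadily rising share of misses suggests the working set no
    -- longer fits in RAM.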
If you need more specific guidelines, you may need to share details about your data, indices, queries etc.

Microsoft Access Lost Back-end File Size

I have a problem with MS Access. My current MS Access back-end file is 320 MB, but after I compact the database it is only 222 MB, so it seems I lost 98 MB. My question is: what is the problem? After it lost those 98 MB, why does it seem slower than before when users use it? And were any records in that file lost or not? Thank you in advance.
This is normal behaviour, and you did not lose any data. A compact + repair (C + R) is normal maintenance that you should do on your database. How often you do this kind of maintenance will depend a lot on how many users you have, how much data “churning” occurs, etc.
So some can go for weeks, or even perhaps longer without having to C + R the back end. Some, much less time.
So why does the file grow like this?
There are several reasons, but one is simply that, to allow multiple users, when you delete a record Access cannot reclaim the disk space, because you (may) have multiple users working with the data. You cannot “move” all that other data down to fill the “hole” because that would CHANGE the position of existing data.
So if I am editing record 400, and another user deletes record 200, then a “hole” exists at 200. However, if I want to reclaim the space, I would have to “move down” every single record to fill that hole. So if the database has 100,000 records, and I delete record 50, then I now have to move a MASSIVE 99950 records back down to fill that one hole! That is way too slow.
So in place of the HUGE (and slow) process of moving 99,950 records (a lot of data), Access simply leaves the “hole” in that spot.
The reason is multi-user. With say 5 users working on the system, then you can’t start moving around data WHILE users are working. The place or spot of an existing record would thus be moving all the time.
So moving records around is NOT practical if you are to allow multiple users.
The other issue that causes file growth is editing a record (again, say record 50 out of 100,000). What happens if you type in extra information and now the record is TOO LARGE for its spot at position 50?
So now your record is too large. Now we have the opposite problem of a delete – we need to expand and make the “hole” or spot 50 larger. And to do that, we might have to move 100,000 or more records to increase the size of the hole for that one record.
The “hole” or “spot” for the record is NOT large enough anymore.
So what Access does is simply mark the old record (the old spot) as deleted, and then put the too-large record we just edited at the end of the file (thus the file expands at the end). So the file grows even with just editing, and not necessarily only due to deletes.
So deleting a record does not really “remove” the hole, since reclaiming it on the spot would be too slow from a performance point of view.
And as noted, if we move records (which is way too slow), then other users working on the data would find the position of the current record they are working on NOT in the same place anymore.
So we can’t start “moving around” that data during editing.
So Access is NOT able to reclaim space during operation. It is too slow, causes way too much disk I/O for a simple delete, and, as noted, would not work multi-user when the positions of records are always changing due to some delete.
To reclaim all those “holes” and “spots”, you do a C + R. This is a scheduled type of maintenance that you do when no one is working on the data (say late at night, or after all the workers go home). This also explains why only ONE user can be connected while doing a C + R.
So you’re not losing any data: the C + R is simply reclaiming all those “holes” and “spots” of unused space, but the process is time consuming.
So it is too slow “during” operation of your application to reclaim those spots. Such reclaiming of wasted and unused space thus only occurs during a C + R, and not during the high-speed, interactive operations when your users are working.
I should point out that “most” database systems have this issue, and while “some” attempt to reclaim the unused space, it is simply better to have a separate process and a separate action to reclaim that space during system maintenance, not during use of the application.
What you are seeing is normal.
And after a C + R you should see improved performance. Often not much, but if the file is really large and full of gaps and holes, then a C + R will reduce the file size a lot, and can help performance. Access also rebuilds the indexes and orders the data by PK order; this can also increase performance, as you “more often” get to read data in PK order.

Mysql optimization for simple records - what is best?

I am developing a system that will eventually have millions of users. Each user of the system may have access to different 'tabs' in the system. I am tracking this with a table called usertabs. There are two ways to handle this.
Way 1: A single row for each user containing userid and tab1-tab10 as int columns.
The advantage of this layout is that the query to get a single row by userid is very fast, while the disadvantage is that the 'empty' columns take up space. Another disadvantage is that when I need to add a new tab, I would have to reorganize the entire table, which could be tedious if there are millions of records. But this wouldn't happen very often.
Way 2: A single row contains userid and tabid and that is all. There would be up to 10 rows per user.
The advantage of this system is easy sharding or other mechanism for optimized storage and no wasted space. Rows only exist when necessary. The disadvantage is up to 10 rows must be read every time I access a record. If these rows are scattered, they may be slower to access or maybe faster, depending on how they were stored?
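For concreteness, the two layouts might look roughly like this (column names and types are illustrative only):

    -- Way 1: one wide row per user, one column per tab.
    CREATE TABLE usertabs_wide (
        user_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        tab1 INT, tab2 INT, tab3 INT, tab4 INT, tab5 INT,
        tab6 INT, tab7 INT, tab8 INT, tab9 INT, tab10 INT
    ) ENGINE=InnoDB;

    -- Way 2: one narrow row per (user, tab) pair; rows exist only when needed.
    CREATE TABLE usertabs (
        user_id BIGINT UNSIGNED NOT NULL,
        tab_id  INT UNSIGNED    NOT NULL,
        PRIMARY KEY (user_id, tab_id)
    ) ENGINE=InnoDB;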
My programmer side is leaning towards Way 1 while my big data side is leaning towards Way 2.
Which would you choose? Why?
Premature optimization, and all that...
Option 1 may seem "easier", but you've already identified the major downside - extensibility is a huge pain.
I also really doubt that it would be faster than option 2 - databases are pretty much designed specifically to find related bits of data, and finding 10 records rather than 1 record is almost certainly not going to make a difference you can measure.
"Scattered" records don't really matter, the database uses indices to be able to retrieve data really quickly, regardless of their physical location.
This does, of course, depend on using indices for foreign keys, as #Barmar comments.
If these rows are scattered, they may be slower to access or maybe faster, depending on how they were stored?
They don't have to be scattered if you use clustering correctly.
InnoDB tables are always clustered, and if your child table's PK [1] looks similar to {user_id, tab_id} [2], this will automatically store tabs belonging to the same user physically close together, minimizing I/O when querying for "tabs of the given user".
OTOH, if your child PK is: {tab_id, user_id}, this will store users connected to the same tab physically close together, making queries such as: "give me all users connected to given tab" very fast.
Unfortunately MySQL doesn't support leading-edge index compression (a-la Oracle), so you'll still pay the storage (and cache) price for repeating all these user_ids (or tab_ids in the second case) in the child table, but despite that, I'd still go for the solution (2) for flexibility and (probably) ease of querying.
[1] Which InnoDB automatically uses as the clustering key.
[2] I.e. the user's PK is at the leading edge of the child table's PK.
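A small illustration of how the PK order affects access patterns, continuing the hypothetical usertabs table sketched in the question above:

    -- With PRIMARY KEY (user_id, tab_id), all tabs of one user sit on
    -- adjacent pages of the clustered index, so this reads very few pages.
    SELECT tab_id FROM usertabs WHERE user_id = 42;

    -- If "all users of a given tab" is also a common query, a secondary
    -- index in the reverse order covers it:
    ALTER TABLE usertabs ADD INDEX idx_tab_user (tab_id, user_id);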

diff 2 large database tables

Given 2 large tables (imagine hundreds of millions of rows), each of which has a string column, how do you get the diff?
Check out the open-source Percona Toolkit, specifically the pt-table-sync utility.
Its primary purpose is to sync a MySQL table with its replica, but since its output is the set of MySQL commands necessary to reconcile the differences between two tables, it's a natural fit for comparing the two.
What it actually does under the hood is a bit complex, and it actually uses different approaches depending on what it can tell about your tables (indexes, etc.), but one of the basic ideas is that it does fast CRC32 checksums on chunks of the indexes, and if the checksums don't match, it examines those records more closely. Note that this method is much faster than walking both indexes linearly and comparing them.
It only gets you part of the way, though. Because the generated commands are intended to sync a replica with its master, they simply replace the current contents of the replica for all differing records. In other words, the commands generated modify all fields in the record (not just the ones that have changed). So once you use pt-table-sync to find the diffs, you'd need to wrap the results in something to examine the differing records by comparing each field in the record.
But pt-table-sync does what you already knew to be the hard part: detecting diffs, really fast. It's written in Perl; the source should provide good breadcrumbs.
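For reference, a typical dry-run invocation looks roughly like this (host, database, and table names are placeholders); --print emits the reconciliation statements without executing them:

    # Print the statements needed to make tbl on host2 match tbl on host1.
    pt-table-sync --print h=host1,D=mydb,t=tbl h=host2,D=mydb,t=tbl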
I'd think about creating an index on that column in each DB, then using a program to walk through both DBs in parallel using an ordering on that column. It would advance in both as long as the records are equal, and in one or the other as you find they are out of sync (keeping track of the out-of-sequence records). The creation of the index could be very costly in terms of both time and space (at least initially). Keeping it updated, though, if you are going to continue adding records, may not add too much overhead. Once you have the index in place, you should be able to process the difference in linear time. Producing the index (assuming you have enough space) should be an O(n log n) operation.
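If both tables happen to live on the same MySQL server, the same index-driven idea can also be expressed crudely as a pair of anti-joins instead of a client-side merge walk (table and column names are hypothetical):

    -- Values present in t1 but missing from t2, and vice versa.
    -- Assumes an index on str_col in both tables so the joins stay cheap.
    SELECT 'only_in_t1' AS side, t1.str_col
    FROM   t1
    LEFT JOIN t2 ON t2.str_col = t1.str_col
    WHERE  t2.str_col IS NULL
    UNION ALL
    SELECT 'only_in_t2', t2.str_col
    FROM   t2
    LEFT JOIN t1 ON t1.str_col = t2.str_col
    WHERE  t1.str_col IS NULL;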

Is mysql UPDATE faster than INSERT INTO?

This is more of a theory question.
If I'm running 50,000 queries that insert new rows, and 50,000 queries that updates those rows, which one will take less time?
Insert would be faster because with update you need to first search for the record that you are going to update and then perform the update.
Though this hardly seems like a valid comparison, as you never really have a choice between insert and update; the two fill completely different needs.
EDIT: I should add too that this is with the assumption that there are no insert triggers or other situations that could cause potential bottlenecks.
Insert Operation : Create -> Store
Update Operation : Retrieve -> Modify -> Store
The insert operation is faster.
With an insert into the same table, you can always insert all the rows with one query, making it much faster than inserting one by one. When updating, you can update several rows at a time, but you cannot apply this to every update situation, and often you have to run one update query at a time (when updating a specific id); on a big table it is very slow to find the row and then update it every time. It is also slower even if you have indexed the table, in my experience.
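For example (hypothetical table), a single multi-row INSERT versus the per-row UPDATEs described above:

    -- One statement inserts many rows at once.
    INSERT INTO events (user_id, payload)
    VALUES (1, 'a'), (2, 'b'), (3, 'c');

    -- Updates keyed on a specific id usually run one statement per row.
    UPDATE events SET payload = 'a2' WHERE id = 1;
    UPDATE events SET payload = 'b2' WHERE id = 2;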
As an aside here, don't forget that by doing loads more inserts than updates, you have more rows when you come to select, so you'll slow down the read operation.
So the real question then becomes: what do you care about more, a quick insert or a speedy read? Again, this is dependent on certain factors, particularly (and not yet mentioned) the DB engine, such as InnoDB (which is now the default in MySQL, incidentally).
I agree with everyone else though - there's too much to consider on a case-by-case basis and therefore you really need to run your own tests and assess the situation from there based on your needs.
There are a lot of impractical answers here. Yes, theoretically updates are slower because they have to do the extra step of looking up the row. But this is not at all the full picture if you're working with a database made after 1992.
Short answer: they're the same speed. (Don't pick one operation over the other for the sake of speed, just pick the right operation).
Long answer: When updating, you're writing to memory pages and marking them as dirty. Any modern database will detect this and keep these pages in cache longer (this is opposed to a normal select statement which doesn't set this flag). This cache is also smart enough to hold on to pages that are accessed frequently (See LRU-K). So subsequent updates to the same rows will be pretty much instant, no lookups needed. This is assuming you're updating based on index'd columns such as IDs (I'll talk about that in a second).
Compare this to a rapid stream of inserts: new pages need to be made, and these pages need to be loaded into the cache. Sure, you can put multiple new rows on the same page, but as you continue to insert, that page fills up and is tossed away, never to be used again. Thus, you are not taking advantage of re-using pages in the cache. (And just as a note, "loading pages into the cache" is also known as a "page fault", which is the #1 slower-downer of database technology in most environments; MongoDB is always inclined to share this idea.)
If you're updating on the basis of a column that isn't indexed: yeah, that is WAY slower than inserting. This should be infrequent in any app. But mind you, if you DO have indexes on a table, they will speed up your updates but also slow down your inserts, because newly inserted rows have to add new index entries as well (as compared to updates, which re-use existing index entries instead of generating new ones). See here for more details on that in terms of how MySQL does it.
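A quick way to see that difference, with hypothetical table and column names; EXPLAIN on the matching SELECT shows whether the WHERE clause can use an index:

    -- Keyed on an indexed column (the PK): the row is located via the index.
    UPDATE users SET last_seen = NOW() WHERE id = 42;

    -- Keyed on a non-indexed column: requires a full table scan to find rows.
    UPDATE users SET last_seen = NOW() WHERE email = 'someone@example.com';

    -- Check the access path of the equivalent lookup.
    EXPLAIN SELECT * FROM users WHERE email = 'someone@example.com';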
Finally, multi-threaded/multi-processing environments can also turn this idea on its head, which I'm not going to get into; that's a whole other can of worms. You can do your own research on your type of database + storage engine for this, as well as gauge your app's use of a concurrent environment... Or you can just ignore all that and just use the most intuitive operation.