We run a system that, for caching purposes, currently writes and deletes about 1,000 small files (~10 KB each) every hour.
In the near future this number will rise to about 10,000-20,000 files being written and deleted every hour.
For every file that is written, a new row is added to our MySQL DB, and that row is deleted when the file is deleted an hour later.
My question:
Can this heavy write-and-delete activity eventually hurt our server performance somehow?
(BTW, we currently run this on a VPS and will soon move to a dedicated server.)
Can writing and deleting so many rows eventually slow our DB?
This depends greatly on the operating system, the file system, and how file system caching is configured. It also depends on whether your database is stored on the same disk as the files that are written and deleted.
Usually, operations that affect file system structure, such as file creation and deletion, require some synchronous disk I/O so that the operating system will not lose these changes after a power failure. However, some operating systems and file systems support a more relaxed policy for this. For example, the UFS file system on FreeBSD has a nice "soft updates" option that does exactly that, and ext3 on Linux probably has a similar feature.
Once you move to a dedicated server, I think it would be reasonable to attach several HDDs and make sure the database is stored on one disk while the massive file operations are performed on another. In that case DB performance should not be affected.
You should do some calculations and estimate the required storage throughput. In your worst case, 20,000 files × 10 KB = 200 MB per hour, or roughly 56 KB/s sustained, which is a very low requirement.
Deleting a file takes very little time on modern file systems.
In my opinion you don't have to worry, especially if your application creates and deletes files sequentially.
Consider also that modern operating systems cache parts of the file system in memory to improve performance and reduce disk access (this helps especially with multiple deletes).
Your database will grow, but storage engines are optimized for that; there is no need to worry about it.
The only downside is that handling many small files can cause disk fragmentation, if your file system is susceptible to it.
For a performance bonus, consider using separate physical storage for these files (e.g. a different disk drive or disk array) so you get the full transfer bandwidth with no interference from other workloads.
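For the hourly insert/delete pattern in the question, a table along these lines (a minimal sketch; all names are assumptions) keeps the cleanup cheap, because the DELETE can use the index instead of scanning the whole table:

    -- Hypothetical cache-tracking table:
    CREATE TABLE cache_files (
      id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      path VARCHAR(255) NOT NULL,
      created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
      KEY idx_created_at (created_at)
    ) ENGINE=InnoDB;

    -- Hourly cleanup touches only expired rows, via the index:
    DELETE FROM cache_files WHERE created_at < NOW() - INTERVAL 1 HOUR;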
I'm wondering if MySQL has any capability to specify that data belonging to a certain account (representing, e.g., a particular app or a particular corporate program) be stored at some particular place in the filesystem (such as a particular drive or RAID), instead of being bundled inside the same physical file structure that is shared by every other account, table, and data element MySQL manages for everybody on that server.
I'm aware that I can jigger MySQL to store its entire data bundle at a place other than the default place, but I was hoping there might be a way to do this by function, for "some data but not all data."
In MySQL 8.0, there are options to specify the location for each table or tablespace. See https://dev.mysql.com/doc/refman/8.0/en/innodb-create-table-external.html
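For example, a minimal sketch of the 8.0 syntax (the path is an assumption; recent 8.0 releases also require the target directory to be known to InnoDB, e.g. via innodb_directories):

    CREATE TABLE big_table (
      id INT UNSIGNED PRIMARY KEY,
      payload BLOB
    ) ENGINE=InnoDB
      DATA DIRECTORY = '/mnt/other_raid/mysql';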
In earlier versions of MySQL, these options didn't work consistently. You could specify the directory for individual table partitions, if your table was partitioned, but not for a non-partitioned table. Go figure. :-)
That said, I've never encountered a situation where it was worth the time to specify the physical location of tables. Basically, if your performance depends on the difference between one RAID filesystem vs. carefully choosing among different drives, you're already losing.
Instead, I've always done this approach:
Use one datadir on a fast RAID filesystem. Use the default configuration of all tables and logs under this datadir.
Allocate as much RAM as I can afford to the InnoDB buffer pool (up to the size of the database of course - no need to use more RAM than that). RAM is orders of magnitude faster than any disk, even an SSD. So you'd prefer to be reading data out of RAM. (See the sketch after this list.)
If that's not enough performance, there are other things you can do to optimize, like creating indexes, or modifying the application code to do more caching to reduce database reads, or using a message queue to postpone database writes.
If that's still not enough performance, then scale out to multiple database servers. In other words, sharding.
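Here's a rough sketch of the buffer pool sizing step (assuming MySQL 5.7.5+, where the buffer pool can be resized at runtime; the 8 GB figure is an arbitrary example):

    SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;

    -- Compare the pool size against total data + index size on disk:
    SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1)
      AS total_gb
    FROM information_schema.TABLES;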
There is a table in our database that takes up about 25 GB. It is no longer used by the current code.
Will it give any performance improvement (for the rest of the tables) if we archive this table, even though it's not queried or used? Please provide an explanation.
We are using MySQL with AWS Aurora.
Archiving tables will not have any impact on Aurora. Unused pages are eventually evicted from the buffer pool [1], and after that they never get pulled back onto the DB instances unless you run a query that touches those pages.
You would continue to pay storage costs (and other indirect costs, like snapshots) by keeping the data around unused. A better option would be to move the unused data to a new cluster, create a snapshot of it, and remove the cluster. You can always recover the data when you need it by restoring the snapshot. The original database can then be cleaned up by dropping the unused tables. This way you end up paying only for the snapshot, which is cheaper.
You could also export the data out of MySQL (as CSV, say) and store it in S3/Glacier. The only caveat is that when you need to access the data again, loading it back into an existing or new database cluster can be a much more time-consuming effort.
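A sketch of the export idea, using Aurora's SELECT ... INTO OUTFILE S3 feature (bucket and table names are placeholders; the cluster needs an IAM role that allows writing to the bucket):

    SELECT * FROM unused_table
    INTO OUTFILE S3 's3://my-archive-bucket/unused_table'
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

On stock MySQL you would use INTO OUTFILE '/path/file.csv' instead, subject to the server's secure_file_priv setting.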
[1] The buffer pool uses LRU for eviction. When your workload runs long enough, you will eventually evict all the pages associated with the unused table. Link: https://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.html
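If you want to verify the eviction yourself, one way is to count the table's pages still in the buffer pool (the table name is a placeholder; note that this query scans buffer pool metadata and can itself be expensive on a busy instance):

    SELECT COUNT(*) AS cached_pages
    FROM information_schema.INNODB_BUFFER_PAGE
    WHERE TABLE_NAME LIKE '%unused_table%';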
Yes, archiving will improve performance, along with reducing the size and duration of backup/recovery cycles.
I have tried it on different projects in my recent full-time job and the results are amazing. For those who doubt it, I would only say:
A reduced footprint means less disk I/O and shorter scans.
A reduced footprint means smaller buffer requirements and hence lower RAM requirements.
YES, archiving infrequently used data will ease the burden on the faster, more frequently accessed data storage systems. Older data that is unlikely to be needed often is put on systems that don't need the speed and accessibility of the systems holding data still in use.
Archived data is stored on a lower-cost tier of storage, serving as a way to reduce primary storage consumption and related costs. Typically, data deduplication is performed on data being moved to a lower storage tier, which reduces the overall storage footprint and lowers secondary storage costs.
I plan on mounting persistent disks into the Apache (/var/www) and MySQL (/var/lib/mysql) folders to avoid having to replicate information between servers.
Has anyone run tests to find out whether the I/O performance of a persistent disk attached to 100 instances is similar to attaching it to only 2? Also, is there a limit on how many instances one persistent disk can be attached to?
I'm not sure exactly what setup you're planning to use, so it's a little hard to comment specifically.
If you plan to attach the same persistent disk to all servers, note that a disk can only be attached to multiple instances in read-only mode, so you may not be able to use temporary tables, etc. in MySQL without extra configuration.
It's a bit hard to give performance numbers for a hypothetical configuration; I'd expect performance to depend on the amount of data stored (e.g. 1 TB of data will behave differently than 100 MB), instance size (larger instances have more memory for page cache and more CPU for processing I/O), and access pattern (random reads vs. sequential reads).
The best option is to set up a small test system and run an actual load test using something like apachebench, jmeter, or httperf. Failing that, you can try to construct an artificial load that's similar to your target benchmark.
Note that just running bonnie++ or fio against the disk may not tell you whether you're going to run into problems; for example, it could be that a combination of sequential reads from one machine and random reads from another causes problems, or that 500 simultaneous sequential reads of the same block cause a problem, but that your application never does that. (If you're using Apache+MySQL, it would seem unlikely that your application would do that, but it's hard to know for sure until you test it.)
It is said that there is only one spindle in a hard disk, which reads or writes data to/from the disk, so how is it possible to write or read two or more pieces of data to/from the hard disk SIMULTANEOUSLY? The operating system used is Windows XP. Example: I need to copy two different movies from a pen drive to the hard disk, so I select both movies, copy them from the pen drive, and paste them into a disk partition; the copying of both movies to the hard disk appears to happen simultaneously. How does this happen?
These operations aren't simultaneous at all, but the operating system manages both operations concurrently.
What happens is the file manager (say, Windows Explorer) tells the operating system to copy a file from one location to another, once for each of the two copy operations.
The operating system breaks this command across two parts of its own system, the "filesystem" and the "disk driver". The file system works out what blocks on what disk are associated with the particular files in question, and tells the disk driver to read or write to those blocks.
The disk driver builds up a queue of reads and writes and figures out the most efficient way to satisfy them. A desktop operating system will usually try to service those requests quickly, to make the system as responsive as possible, but a server operating system will queue up the block operations as long as possible so that it can handle them in an order that allows it to make the most efficient use of block ordering.
Once the disk driver decides to act on a block operation, it tells the disk to move its head and read or write some data. The result of the action is then passed back to the filesystem, and ultimately to the user application.
The fact that the operations appear simultaneous is only an illusion created by the multitasking facilities of the operating system. This is pretty easy to discern, since multiple file copies take a little longer than just one copy (or sometimes a LOT longer, if you're trying to do a bunch at the same time).
Of course, the OS is still able to drive two separate disks simultaneously if the operations really are on different disks.
I have replication set up for my MySQL databases. The log file location and the binlog files are all at one path, the default: my MySQL data directory.
I have read that for better performance one should store them separately.
Can anyone explain how this improves performance? Is there documentation available on this, and on the reason why one should do so?
Mainly because reads and writes can then be made almost in parallel. "Stored separately" means on different disks.
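To check where the binlogs and data files currently live (a sketch; note that log_bin cannot be changed at runtime, and the mount point below is an assumption):

    SHOW VARIABLES LIKE 'log_bin_basename';
    SHOW VARIABLES LIKE 'datadir';

    -- To move the binlogs to another disk, set something like this in
    -- my.cnf and restart the server:
    --   log-bin = /mnt/binlog-disk/mysql-bin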
Linux and H/W optimizations for MySQL is a nice presentation of ways to improve MySQL performance - it presents benchmarks and conclusions about when to use SSD disks versus SCSI disks, and which kinds of processors are better for which tasks.
Very good presentation, a must read for any DBA!!
It also can be really embarrassing to have your log files fill the file system and bring the database to a halt.
One consideration is that using a separate disk for binlogging introduces another SPOF since if MySQL cannot write the binlog it will croak the same as if it couldn't write to the data files. Otherwise, adding another disk just better separates the two tasks so that binlog writes and data file writes don't have to contend for resources. With SSDs this is much less of an issue unless you have some crazy heavy write load and are already bound by SSD performance.
It's mostly for cases where your database write traffic is so high that a single disk volume can't keep up while writing for both data files and log files. Disks have a finite amount of throughput, and you could have a very busy database server.
But it's not likely that separating data files from binlogs will give better performance for queries, because MySQL writes to the binlog at commit time, not at query time. If your disks were too slow to keep up with the traffic, you'd see COMMIT become a bottleneck.
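Two server variables govern how much disk work happens at each commit (shown here just to illustrate where that cost lands; both trade durability for speed, so change them with care):

    SHOW VARIABLES LIKE 'sync_binlog';
    SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';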
The system I currently support stores binlogs in the same directory as the datadir. The datadir is on a RAID10 volume over 12 physical drives. This has plenty of throughput to support our workload. But if we had about double our write traffic, this RAID array wouldn't be able to keep up.
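If you want a rough idea of your own binlog write volume before deciding whether to split disks, the file sizes together with the rotation times give you MB written per hour:

    SHOW BINARY LOGS;  -- lists each binlog file and its size in bytes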
You don't need to do every tip that someone says gives better performance, because any given tip might make no difference to your application's workload. You need to measure many metrics of performance and resource use, and come up with the right tuning or configuration to help the bottlenecks under your workload.
There is no magic configuration that makes everything have high performance.