What is a buffer write? - terminology

Can someone help me understand what a buffer write is? I'm particularly interested in learning its function in the context of database systems, MySQL say. Some of the following would be helpful:
What is the purpose of a buffer write
Are there any performance advantages
An example of a buffer write in a database application
I came across this term several times and I was unable to discern its meaning.
Thanks

Imagine a computer with 100M of memory, running a database. The database stores data in files, but also keeps 50M of it in memory in buffers. If there is a request for data (SELECT or INSERT), the request can be handled from buffers in memory, which is much faster than going all the way to disk.
Buffering requests for access to information in a database's files is essentially caching requests for disk I/O. If information is INSERTed and then DELETEd within a very short period of time, writing to disk may be unnecessary. Not writing (keeping the change in buffers) greatly increases performance.
If there is a request to INSERT 100M of data into the database, then all pending writes (from buffers to disk) must be done. Then at least half the new data is written to disk. Data has to be written because there isn't enough memory for the 100M of new data plus 50M of old data to all reside in memory. This necessity to write some existing buffers to disk is a performance hit. Luckily, it is only the buffers holding changed or new records that need to be written out (or flushed) to disk. Those changed buffers are referred to as "dirty."
After the aforementioned INSERT of 100M, some 50M of the new data may temporarily be held in memory until it is most convenient to write, because not writing increases performance. A convenient time to write changed records back to disk is when the system has been idle for a while. Writing (buffer writes) when the system is idle doesn't lower performance.
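To make the idea concrete, here is a minimal Python sketch of a write-back buffer with "dirty" tracking. It is only a toy model of the behaviour described above, not how MySQL is implemented; the class name `BufferedStore` and its methods are made up for illustration.

```python
class BufferedStore:
    """Toy write-back buffer: changes stay in memory until flush() is called.

    Hypothetical illustration only; a real database buffers whole pages,
    not individual rows, and also keeps a log for crash recovery.
    """

    def __init__(self):
        self.disk = {}        # stands in for the data files
        self.buffer = {}      # in-memory copies of rows
        self.dirty = set()    # keys whose buffered state differs from disk
        self.disk_writes = 0  # counts simulated disk writes

    def insert(self, key, value):
        # The change is made in memory only and the row is marked dirty.
        self.buffer[key] = value
        self.dirty.add(key)

    def delete(self, key):
        self.buffer.pop(key, None)
        if key in self.disk:
            self.dirty.add(key)      # the deletion must still reach disk
        else:
            self.dirty.discard(key)  # row never reached disk: nothing to write

    def select(self, key):
        if key in self.buffer:
            return self.buffer[key]  # served from memory
        if key in self.dirty:
            return None              # deleted in memory, not yet flushed
        value = self.disk.get(key)   # simulated (slow) disk read
        if value is not None:
            self.buffer[key] = value
        return value

    def flush(self):
        # The "buffer write": only dirty entries are written out.
        for key in self.dirty:
            if key in self.buffer:
                self.disk[key] = self.buffer[key]
            else:
                self.disk.pop(key, None)
            self.disk_writes += 1
        self.dirty.clear()
```

With this model, an insert followed quickly by a delete of the same key leaves nothing dirty, so a later `flush()` performs no disk write at all, which is the saving the answer describes.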

Related

Should unused tables be archived?

There is a table in our database that takes about 25GB. It is no longer used by the current code.
Will it give any performance improvements (for rest of the tables) if we archive this table, even though it's not queried/used? Please provide explanation.
We are using MySQL with AWS Aurora.
Archiving tables will not have any impact on Aurora performance. Unused pages are eventually evicted from the buffer pool [1], and after that they are never pulled back onto the DB instances unless you run a query that touches those pages.
You would continue to pay storage costs (and other indirect costs like snapshots) by keeping the unused tables. A better option would be to move the unused data to a new cluster, create a snapshot of it, and remove the cluster. You can always recover the data when you need it by restoring the snapshot. The original database can then be cleaned up by dropping the unused tables. This way you end up paying only for the snapshot, which is cheaper.
You could also export the data out of MySQL (as CSV, say) and store it in S3/Glacier. The only caveat is that when you need to access the data, loading it back into an existing or new database cluster can be a much more time-consuming effort.
[1] The buffer pool uses LRU for eviction. When your workload runs for long enough, you eventually end up evicting all the pages associated with the unused table. Link: https://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.html
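As a rough illustration of the footnote, here is a small Python sketch of plain LRU eviction. It is a simplification under my own assumptions: InnoDB actually uses a variation of LRU, and the class name `LRUBufferPool`, the page ids, and the tiny capacity are invented for the example.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Toy buffer pool with plain LRU eviction (hypothetical, simplified)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page id -> contents, least recently used first

    def read(self, page_id):
        if page_id in self.pages:
            # Cache hit: mark the page as most recently used.
            self.pages.move_to_end(page_id)
        else:
            # Cache miss: evict the least recently used page if the pool is full.
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)
            self.pages[page_id] = f"contents of {page_id}"
        return self.pages[page_id]


if __name__ == "__main__":
    pool = LRUBufferPool(capacity=4)
    # Pages of the table nobody queries any more.
    pool.read(("unused_table", 0))
    pool.read(("unused_table", 1))
    # The normal workload only touches the active table.
    for i in range(10):
        pool.read(("active_table", i % 5))
    print(list(pool.pages))
```

Once the active workload has touched more pages than the pool can hold, none of the unused table's pages remain in memory, so they stop competing for RAM without any archiving.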
Yes, archiving will improve performance, and it will also reduce the size and duration of backup/recovery cycles.
I have tried it on different projects in my recent full-time job and the results are amazing. For those who disagree, I would only say:
A reduced footprint reduces disk I/O and scans.
A reduced footprint reduces buffer requirements and hence RAM requirements.
YES, archiving infrequently used data will ease the burden on faster and more frequently accessed data storage systems. Older data that is unlikely to be needed often is put on systems that don't need to have the speed and accessibility of systems that contain data still in use.
Archived data is stored on a lower-cost tier of storage, serving as a way to reduce primary storage consumption and related costs. Typically, data deduplication is performed on data being moved to a lower storage tier, which reduces the overall storage footprint and lowers secondary storage costs.

Tuning a write-only master MySQL database

I have a master database where I only run write queries (inserts, deletes, updates).
I would like to know how to tune it, keeping in mind that selects are not important here.
I'm using InnoDB. Replication with 1 master and 2 slaves. Running on an Ubuntu 16.04 server. MySQL 5.6.
Disable the query cache. It's only beneficial for reads.
Disable the adaptive hash index. It's only beneficial for reads.
Increase the innodb_log_file_size. I recommend at least 2GB, unless disk space is short.
Drop indexes, except for those used by your UPDATE/DELETE statements. You can create more indexes on the slave to support SELECT queries.
Consider fine-tuning the Buffer Pool Flushing. The optimal settings depend on your workload, so you'll have to experiment.
If you want to sacrifice durability, you can make some other changes. Warning: these increase the risk of data loss.
innodb_flush_log_at_trx_commit = 2 or 0 to relax synchronous log writes.
innodb_doublewrite = OFF to disable page write protections.
sync_binlog = 0 to disable synchronous writes to the binary log.
Make sure your data directory is on fast disks, like SSD or a caching RAID array.
Never use NFS.
You may experiment with putting innodb_log_group_home_dir and innodb_undo_directory and log_bin_basename and tmpdir on different physical volumes from your data directory. But this won't give a benefit unless performance is really disk-bound.
Further tuning depends on your workload. For example, changing the thread concurrency or the number of IO write threads or the IO capacity. If you want to go to this level of tuning, get some consulting from a professional.
A comment from #spencer7593 brings up a good point: you might not be able to achieve the best optimization solely with database tuning options.
You haven't mentioned anything about the application or the type of writes, but eventually you'll have to consider changing the way you write to the database. Tuning changes alone are limited in how they improve database performance.
For example, applications could write to a queue, and a separate consumer app could take items from the queue and write the data to the database in larger batches. That means more efficient database writes, but more importantly it allows applications to "write" with much lower latency because they are only writing to a queue.
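A minimal sketch of that queue-and-batch pattern in Python follows. The assumptions are all mine: an in-process `queue.Queue` stands in for a real message queue, `sqlite3` stands in for MySQL so the example is self-contained, and the `events` table and `drain_in_batches` function are hypothetical names.

```python
import queue
import sqlite3


def drain_in_batches(q, conn, batch_size=100):
    """Move queued rows into the database, one transaction per batch.

    Returns the number of batches written. Hypothetical helper: the
    'events' table and the (payload,) row shape are assumptions.
    """
    batches = 0
    while True:
        batch = []
        # Collect up to batch_size rows without blocking.
        while len(batch) < batch_size:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return batches
        with conn:  # one commit per batch instead of one per row
            conn.executemany("INSERT INTO events (payload) VALUES (?)", batch)
        batches += 1


if __name__ == "__main__":
    q = queue.Queue()
    for i in range(250):
        q.put((f"event {i}",))  # the application only "writes" to the queue

    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE events (payload TEXT)")
    print(drain_in_batches(q, conn, batch_size=100), "batches written")
```

The application side returns as soon as `q.put()` completes, while the consumer pays the commit cost once per batch rather than once per row.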
Eventually, you may find that no single database instance can keep up with the rate of writes. At that point, you'll have to scale out, by spreading writes over multiple database instances. This is called "sharding" the data. Of course this adds more complexity to database reads, because your data is not all together. So try all the tuning changes you can try before resorting to sharding.
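To give a feel for what sharding means in application code, here is a small Python sketch that routes each key to one of several instances by hashing it. This is only one possible scheme, and the shard names and the `shard_for` function are hypothetical.

```python
import hashlib

# Hypothetical instance names; in practice these would be connection settings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]


def shard_for(key, shards=SHARDS):
    """Pick a shard for a key using a stable hash of the key.

    A stable hash (not Python's built-in hash(), which can vary between
    runs for strings) keeps the same key on the same shard every time.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


if __name__ == "__main__":
    for user_id in ("user:1", "user:2", "user:3"):
        print(user_id, "->", shard_for(user_id))
```

The extra read complexity mentioned above shows up here too: any query that is not keyed by the sharding key has to ask every shard and merge the results.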

Difference between in-memory databases and on-disk databases

Recently I heard about the concept of an in-memory database.
With any type of database we are ultimately storing the data in the computer, and our program gets the data from there. How are in-memory database operations faster than the others?
Will the in-memory database load all the data from the database into memory (RAM)?
Thanks in advance....
An in-memory database (IMDB; also main memory database system or MMDB or memory resident database) is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism. Main memory databases are faster than disk-optimized databases since the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.
Applications where response time is critical, such as those running telecommunications network equipment and mobile advertising networks, often use main-memory databases.
In reply to your query: yes, it loads the data into the RAM of your computer.
On-Disk Databases
All data stored on disk, disk I/O needed to move data into main memory when needed.
Data is always persisted to disk.
Traditional data structures like B-Trees designed to store tables and indices efficiently on disk.
Virtually unlimited database size.
Support very broad set of workloads, i.e. OLTP, data warehousing, mixed workloads, etc.
In-Memory Databases
All data stored in main memory, no need to perform disk I/O to query or update data.
Data is persistent or volatile depending on the in-memory database product.
Specialized data structures and index structures assume data is always in main memory.
Optimized for specialized workloads; i.e. communications industry-specific HLR/HSS workloads.
Database size limited by the amount of main memory.
MySQL offerings
MySQL has several "Engines". In all engines, actions are performed in RAM. The Engines differ significantly in how good they are at making sure the data "persists" on disk.
ENGINE=MEMORY -- This is not persistent; the data is found only in RAM. It is limited to some preset max size. On a power failure, all data (in a MEMORY table) is lost.
ENGINE=MyISAM -- This is an old engine; it persists data to disk, but in the case of power failure, sometimes the indexes are corrupted and need 'repairing'.
ENGINE=InnoDB -- This is the preferred engine. It not only persists to disk but 'guarantees' consistency even across power failures.
In-memory databases usually hold the whole database in memory (like the MySQL MEMORY engine).
This is a huge performance boost, but RAM is expensive and often not persistent, so you would lose data on restart.
There are some ways to mitigate that last issue, e.g. timed snapshots, or replication to a disk database.
Also there are some hybrid types, with just a part of the db in memory.
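The snapshot idea can be sketched in Python with SQLite standing in for an in-memory database (my assumption, chosen only because it ships with Python; the MySQL MEMORY engine has no equivalent call). `Connection.backup()` is a real `sqlite3` method in Python 3.7+, while `snapshot`, `demo`, and the `sessions` table are made-up names.

```python
import os
import sqlite3
import tempfile


def snapshot(mem_conn, path):
    """Copy the whole in-memory database into a file on disk."""
    disk_conn = sqlite3.connect(path)
    try:
        mem_conn.backup(disk_conn)
    finally:
        disk_conn.close()


def demo():
    """Return the users that survive a simulated restart."""
    path = os.path.join(tempfile.mkdtemp(), "snapshot.db")

    mem = sqlite3.connect(":memory:")  # data lives only in RAM
    mem.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, user TEXT)")
    mem.execute("INSERT INTO sessions (user) VALUES ('alice')")
    mem.commit()

    snapshot(mem, path)  # in a real setup this would run on a timer

    mem.execute("INSERT INTO sessions (user) VALUES ('bob')")  # after the snapshot
    mem.commit()
    mem.close()  # simulates a restart: the in-memory data is gone

    restored = sqlite3.connect(path)
    try:
        rows = restored.execute("SELECT user FROM sessions ORDER BY id")
        return [row[0] for row in rows]
    finally:
        restored.close()


if __name__ == "__main__":
    print(demo())
```

Only the row written before the snapshot comes back, which shows the trade-off: snapshots reduce the data loss on restart but anything written since the last snapshot is still lost.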
There are also in-memory databases like Tarantool that can work with data sets larger than available RAM. Tarantool is able to work with these sets because it is optimized for fast random writes, the main bottleneck that arises.

MySQL concatenating large string

I have a web crawler that saves information to a database as it crawls the web. While it does this, it also saves a log of its actions, and any errors it encounters, to a log field in a MySQL database (the field grows to anywhere from 64 KB to 100 KB). It accomplishes this by concatenating, using the MySQL CONCAT function.
This seems to work fine, but I am concerned about the CPU usage and the impact it has on the MySQL database. I've noticed that the web crawling is slower than it was before I implemented saving the log to the database.
I view this log file from a management webpage, and the current implementation seems to work fine other than the slow loading. Any recommendations for speeding this up, or implementation recommendations?
Reading 100 KB strings into memory numerous times and then writing them to disk via a database? Of course you're going to experience a slowdown! Every part of what you are doing is going to tax memory, disk, and CPU (especially if memory usage hits the system max and you start swapping to disk). Let me count some of the ways you're possibly going to decrease overall site performance:
SQL connections max out and back up, because the time needed to store 100 KB records increases the time a single process holds a connection.
Web server processes eat up the free process pool, max out, and take longer to free up because they have to wait on DB connections to free.
Web server processes begin to bloat and take more memory each, possibly more than the system can handle without swapping. This is compounded by using the maximum number of processes due to #2.
... A book could be written on your situation.
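A back-of-the-envelope calculation shows why the appends add up. The sketch below assumes (my assumption, not something measured) that every CONCAT-style append makes the server read and rewrite the entire field, and that log entries have a fixed size; both function names are made up.

```python
def bytes_rewritten_by_concat(entry_size, entries):
    """Total bytes written if every append rewrites the whole field.

    Assumption for illustration: the n-th append writes the full
    n * entry_size bytes of the field, not just the new entry.
    """
    return sum(entry_size * n for n in range(1, entries + 1))


def final_log_size(entry_size, entries):
    """Size of the finished log, i.e. the text that actually had to be stored."""
    return entry_size * entries


if __name__ == "__main__":
    entry_size, entries = 100, 1000  # 1000 entries of 100 bytes: a 100 KB log
    rewritten = bytes_rewritten_by_concat(entry_size, entries)
    final = final_log_size(entry_size, entries)
    print(f"final log: {final} bytes, total rewritten: {rewritten} bytes")
```

Under those assumptions, building a 100,000-byte log out of 1000 appends moves about 50 million bytes through the database, roughly 500 times the size of the log itself.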

Why should we store log files and bin-log files on different paths or disks in MySQL

I have a replication setup with MySQL databases. The log files and the bin-log files are all in one path, which is the default: my MySQL data directory.
I have read that for better performance one should store them separately.
Can anyone explain how this improves performance? Is there documentation available on this? What is the reason why one should do so?
Mainly because then, reads and writes can be made almost in parallel. Stored separately meaning on different disks.
Linux and H/W optimizations for MySQL is a nice presentation of ways to improve MySQL performance - it presents benchmarks and conclusions of when to use SSD disks and when to use SCSI disks, what kind of processors are better for what tasks.
Very good presentation, a must read for any DBA!!
It also can be really embarrassing to have your log files fill the file system and bring the database to a halt.
One consideration is that using a separate disk for binlogging introduces another SPOF since if MySQL cannot write the binlog it will croak the same as if it couldn't write to the data files. Otherwise, adding another disk just better separates the two tasks so that binlog writes and data file writes don't have to contend for resources. With SSDs this is much less of an issue unless you have some crazy heavy write load and are already bound by SSD performance.
It's mostly for cases where your database write traffic is so high that a single disk volume can't keep up while writing for both data files and log files. Disks have a finite amount of throughput, and you could have a very busy database server.
But it's not likely that separating data files from binlogs will give better performance for queries, because MySQL writes to the binlog at commit time, not at query time. If your disks were too slow to keep up with the traffic, you'd see COMMIT become a bottleneck.
The system I currently support stores binlogs in the same directory as the datadir. The datadir is on a RAID10 volume over 12 physical drives. This has plenty of throughput to support our workload. But if we had about double our write traffic, this RAID array wouldn't be able to keep up.
You don't need to do every tip that someone says gives better performance, because any given tip might make no difference to your application's workload. You need to measure many metrics of performance and resource use, and come up with the right tuning or configuration to help the bottlenecks under your workload.
There is no magic configuration that makes everything have high performance.