Optimizing MySQL files in database (blob) - mysql

For internal reasons (framework structure), I store many images in a table with a MEDIUMBLOB column.
Since the queries that retrieve these images run at a fairly low rate, is there a way to tell MySQL to keep this table out of memory? I don't want a 2 GB table held in memory when it is used only once in a while.
Is there any way to optimize this?
(Note: if this helps I can move this table in a new database containing only this table)
Thanks

MySQL won't create an in-memory temporary table when BLOB columns are involved, because the MEMORY storage engine doesn't support them.
http://dev.mysql.com/doc/refman/5.0/en/internal-temporary-tables.html
"Some conditions prevent the use of an in-memory temporary table, in which case the server uses an on-disk table instead:
Presence of a BLOB or TEXT column in the table"
This means you should put the BLOB column in a separate table and keep the other, frequently used data in a BLOB-free table, so that queries against that table can be optimized.
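A minimal sketch of that split, using Python's built-in sqlite3 module purely as a stand-in for MySQL so the example runs anywhere. The table and column names (image_meta, image_blob, filename, mime_type) are hypothetical and not taken from the question; in MySQL the data column would be a MEDIUMBLOB.

```python
import sqlite3


def build_schema(conn: sqlite3.Connection) -> None:
    """Create a narrow metadata table and a separate table holding only the bytes."""
    conn.executescript(
        """
        CREATE TABLE image_meta (
            id        INTEGER PRIMARY KEY,
            filename  TEXT NOT NULL,
            mime_type TEXT NOT NULL
        );
        CREATE TABLE image_blob (
            image_id INTEGER PRIMARY KEY REFERENCES image_meta(id),
            data     BLOB NOT NULL
        );
        """
    )


def add_image(conn: sqlite3.Connection, filename: str, mime_type: str, data: bytes) -> int:
    """Insert one image: metadata in one table, bytes in the other, same key."""
    cur = conn.execute(
        "INSERT INTO image_meta (filename, mime_type) VALUES (?, ?)",
        (filename, mime_type),
    )
    image_id = cur.lastrowid
    conn.execute(
        "INSERT INTO image_blob (image_id, data) VALUES (?, ?)", (image_id, data)
    )
    return image_id


def list_images(conn: sqlite3.Connection):
    """Frequent query: reads only the BLOB-free table."""
    return conn.execute("SELECT id, filename FROM image_meta ORDER BY id").fetchall()


def load_image(conn: sqlite3.Connection, image_id: int):
    """Rare query: fetches the bytes for a single image, or None if absent."""
    row = conn.execute(
        "SELECT data FROM image_blob WHERE image_id = ?", (image_id,)
    ).fetchone()
    return row[0] if row else None
```

Listing and filtering never touch image_blob, so only the occasional load_image call reads the large rows.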

Moving that sort of table into a separate database sounds like a perfectly valid approach to me. At work we have one database server for our operational content (orders, catalogue etc) and one for logs (web logs and copies of emails) and media (images, videos and other binaries). Even running separate instances on the same machine can be worthwhile since, as you mentioned, it partitions the buffer cache (or whatever your storage engine's equivalent is).

Related

Giant unpartitioned MySQL table issues

I have a MySQL table which is about 8TB in size. As you can imagine, querying is horrendous.
I am thinking about:
Create a new table with partitions
Loop through a series of queries to dump data into those partitions
But the loop will require lots of queries to be submitted & each will be REALLY slow.
Is there a better way to do this? Repartitioning a production database in situ isn't going to work. The approach above seemed like an OK option, but slow.
And is there a tool that would make life easier than a Python job that loops and submits queries?
Thanks a lot in advance
You could use pt-online-schema-change. This free tool allows you to partition the table with an ALTER TABLE statement, but it does not block clients from using the table while it's restructuring it.
Another useful tool could be pt-archiver. You would create a new table with your partitioning idea, then pt-archiver to gradually copy or move data from the old table to the new table.
Of course try out using these tools in a test environment on a much smaller table first, so you get some practice using them. Do not try to use them for the first time on your 8TB table.
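The idea of gradually copying rows from the old table into the new one can be illustrated with a small sketch. This is not how pt-archiver is implemented; it only shows the general technique of walking the primary key in small batches, with one short transaction per batch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the column names id and payload are assumptions made for the example.

```python
import sqlite3


def copy_in_batches(conn: sqlite3.Connection, src: str, dst: str, batch_size: int = 1000) -> int:
    """Copy rows from src to dst in primary-key order, committing after each batch.

    Assumes both tables have columns (id, payload), that ids are positive
    integers, and that src and dst are trusted table names (they are
    interpolated into the SQL text).
    """
    last_id = 0
    copied = 0
    while True:
        # Seek past the last copied key instead of using OFFSET, so every
        # batch is an index range scan of roughly constant cost.
        rows = conn.execute(
            f"SELECT id, payload FROM {src} WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        conn.executemany(f"INSERT INTO {dst} (id, payload) VALUES (?, ?)", rows)
        conn.commit()
        last_id = rows[-1][0]
        copied += len(rows)
```

Because progress is tracked by the last copied id, an interrupted job can resume from that value rather than starting over.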
Regardless of what solution you use, you are going to need enough storage space to store the entire dataset twice, plus binary logs. The old table will not shrink, even as you remove data from it. So I hope your filesystem is at least 24TB. Or else the new table should be stored on a different server (or ideally several other servers).
It will also take a long time no matter which solution you use. I expect at least 4 weeks, and perhaps longer if you don't have a very powerful server with direct-attached NVMe storage.
If you use remote storage (like Amazon EBS) it may not finish before you retire from your career!
In my opinion, 8TB for a single table is a problem even if you try partitioning. Partitioning doesn't magically fix performance, and could make some queries worse. Do you have experience with querying partitioned tables? Do you understand how partition pruning works, and when it doesn't work?
Before you choose partitioning as your solution, I suggest you read the whole chapter on partitioning in the MySQL manual: https://dev.mysql.com/doc/refman/8.0/en/partitioning.html, especially the page on limitations: https://dev.mysql.com/doc/refman/8.0/en/partitioning-limitations.html Then try it out with a smaller table.
A better strategy than partitioning for data at this scale is to split the data into shards, and store each shard on one of multiple database servers. You need a strategy for adding more shards as I assume the data will continue to grow.
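One possible way to route rows to shards while leaving room to add servers later is to hash each key into a fixed number of logical buckets and keep a separate bucket-to-server map. The sketch below is only an illustration of that idea: the bucket count, the server names, and the ShardMap class are all hypothetical and not part of any MySQL feature.

```python
import hashlib

NUM_BUCKETS = 1024  # fixed number of logical buckets (assumed value)


def bucket_for(key: str) -> int:
    """Map a key to a stable logical bucket using a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


class ShardMap:
    """Keeps the bucket -> server assignment separate from the hashing."""

    def __init__(self, servers):
        # Spread buckets round-robin over the initial servers.
        self.assignment = [servers[i % len(servers)] for i in range(NUM_BUCKETS)]

    def server_for(self, key: str) -> str:
        """Return the server currently responsible for this key."""
        return self.assignment[bucket_for(key)]

    def move_bucket(self, bucket: int, server: str) -> None:
        """Reassign one bucket, e.g. after copying its rows to a new server."""
        self.assignment[bucket] = server
```

Adding a server then means moving some buckets to it one at a time, rather than rehashing every key as a plain `hash % number_of_servers` scheme would require.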

How can I put regularly accessed data into a "quick access" area in a database

Very soon I will be building a database structure that will contain 2 million rows. Generally no more than 200 rows are queried per minute, and those 200 queries will hit only 10-20 distinct rows.
Given the size of the table, I'd like to "store" the queried row somewhere so that any other end users querying this row will be able to get the row data "quicker". I then want this row to be accessed via this for a while and then put back into the main table once it's no longer in use. I believe this will make access quicker and more efficient.
Using the below schema, I'll provide an example. In this case row 1 has been accessed from the application layer. The application layer queries the "accessed" table to see if the row is there. If it is, it uses this and updates the "accessed" table with any changed data. If it isn't, it is queried from the main large table and dropped into the "accessed" table until the cron runs (say 10 minutes later) when all "accessed" data is copied into the main table and deleted from the accessed table.
http://sqlfiddle.com/#!2/d76f6/2
I'm trying to work out the following:
1) Will this show an increase in efficiency (I would imagine each query against "accessed" instead of the main will be significantly faster)?
2) What technology should be used for the "accessed" data storage? It's likely the main table will be stored in MariaDB/MySQL, however I'm happy to run it in flat files, sqlite, a different instance or keep it within the same instance... I'm open to suggestions that will make this more efficient, and in theory there's no reason the application layer couldn't act as an intermediary between any technologies
Premature optimization. Overcomplex design to start with. What you want to implement is a most-frequently-accessed cache. However, it is precisely the job of a DBMS to do this kind of optimization for you. There are already caches at disk level, file system level, and database level. What you are saying is that, even before having the system in place, you already know it is not going to perform as expected.
Maybe you know more than you state in your question, but on the face of it, optimizations should be done after, with suitable profiling.
There are a lot of ways to cache data.
In MySQL you can use MEMORY tables. MEMORY tables are much faster than InnoDB or MyISAM tables.
You can use in-memory key-value stores such as Redis or Memcached.
At the application layer, you can cache your data on the filesystem.
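For the application-layer option, the scheme described in the question (read from an "accessed" store, fall back to the main table, flush everything back periodically) could be sketched roughly as follows. Plain dictionaries stand in for both the main table and the "accessed" store, and the WriteBackCache class and its method names are hypothetical. Note that in a design like this, any changes held only in the "accessed" store are lost if the process dies before the next flush.

```python
class WriteBackCache:
    """Rows are served from `accessed`; changes reach `main` only on flush()."""

    def __init__(self, main_table: dict):
        self.main = main_table  # stand-in for the large table
        self.accessed = {}      # stand-in for the "accessed" table

    def get(self, row_id):
        """Return the cached row, loading a copy from the main table on a miss."""
        if row_id not in self.accessed:
            self.accessed[row_id] = dict(self.main[row_id])
        return self.accessed[row_id]

    def update(self, row_id, **changes):
        """Apply changes to the cached copy only."""
        self.get(row_id).update(changes)

    def flush(self) -> int:
        """What the periodic cron job would do: write back, then empty the cache."""
        count = len(self.accessed)
        for row_id, row in self.accessed.items():
            self.main[row_id] = row
        self.accessed.clear()
        return count
```

Whether this is actually faster than letting the database's own buffer cache do the work is exactly what profiling would need to show.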

PostgreSQL equivalent of MySQL memory tables?

Does PostgreSQL have an equivalent of MySQL memory tables?
These MySQL memory tables can persist across sessions (i.e., different from temporary tables which drop at the end of the session). I haven't been able to find anything with PostgreSQL that can do the same.
No, at the moment they don't exist in PostgreSQL. If you truly need a memory table you can create a RAM disk, add a tablespace for it, and create tables on it.
If you only need the temporary table that is visible between different sessions, you can use an UNLOGGED table. These are not true memory tables but they'll behave surprisingly similarly when the table data is significantly smaller than the system RAM.
Global temporary tables would be another option but are not supported in PostgreSQL as of 9.2 (see comments).
I'm answering a four-year-old question because it still comes up at the top of Google search results.
There is no built in way to cache a full table in memory, but there is an extension that can do this.
In Memory Column Store is a library that acts as a drop-in extension and also as a columnar storage and execution engine. Refer to its documentation for details. There is a load function that you can use to load the entire table into memory.
The advantage is the table is stored inside postgres shared_buffers, so when executing a query postgres immediately senses that the pages are in memory and fetches from there.
The downside is that shared_buffers is not really designed to operate in such a way and instabilities might occur (usually they don't), but you can probably run this configuration on a secondary cluster/machine just to be safe.
All other usual caveats about postgres and shared_buffers still apply.

Can I use multiple servers to increase mysql's data upload performance?

I am in the process of setting up a MySQL server to store some data but realized (after reading a bit this weekend) that I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and sending it to a shared queue to process and analyze. The data is about 5 billion rows (although each row is very small: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60k to 100k rows/second, which would take over 10 hours. We need the data loaded very quickly so we can work on it that day, after which we may discard it (or archive the table to S3 or something).
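As a quick check of that estimate, the arithmetic can be written out. The row count and insert rates are the figures quoted above; the function names are just for illustration.

```python
def load_hours(total_rows: int, rows_per_second: float) -> float:
    """Hours needed to insert total_rows at a sustained insert rate."""
    return total_rows / rows_per_second / 3600


def required_rate(total_rows: int, hours: float) -> float:
    """Sustained rows/second needed to finish within the given number of hours."""
    return total_rows / (hours * 3600)


# 5 billion rows at the quoted rates:
print(round(load_hours(5_000_000_000, 100_000), 1))  # 13.9 hours at 100k rows/s
print(round(load_hours(5_000_000_000, 60_000), 1))   # 23.1 hours at 60k rows/s
```

So even the optimistic rate leaves roughly 14 hours of loading, which is why a faster bulk-load path matters here.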
What can I do? I have 8 servers at my disposal (in addition to the database server). Can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also wondering whether I can load the data onto each of them and then somehow merge all the separate data onto one server.
I was going to use MySQL with InnoDB (I can use any other settings if it helps), but it's not finalized. If MySQL doesn't work, is there something else that will? (I have used HBase before, but was looking for a MySQL solution first because it seems more widely used and easier to get help with if I have problems.)
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple MySQL server instances won't help with loading speed. What will make a difference is fast processors and a very fast disk I/O subsystem on your MySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use the MEMORY storage engine for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using MyISAM rather than InnoDB for this table; MyISAM's lack of transaction semantics makes it faster to load. MyISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the MySQL server to read a file from the MySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the MySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
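A rough sketch of what the queue consumer might produce for such a load: a tab-separated data file plus the statements to run around it. The table name, file path, and function names are hypothetical. The sketch assumes values contain no tabs, newlines, or backslashes, and that the path contains no single quotes. ALTER TABLE ... DISABLE KEYS affects only nonunique indexes on MyISAM tables, so it matches the MyISAM suggestion above rather than InnoDB.

```python
def format_tsv(rows) -> str:
    """Render rows as tab-separated lines, the default format LOAD DATA expects.

    Assumes values contain no tabs, newlines, or backslashes.
    """
    return "".join("\t".join(str(value) for value in row) + "\n" for row in rows)


def write_tsv(path: str, rows) -> None:
    """Write the rows to a file on the database server's file system."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(format_tsv(rows))


def load_statements(path: str, table: str) -> list:
    """Build the SQL to run, in order, for one bulk load.

    `path` and `table` are trusted, hypothetical values interpolated as-is.
    """
    return [
        f"ALTER TABLE {table} DISABLE KEYS",
        f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'",
        f"ALTER TABLE {table} ENABLE KEYS",
    ]
```

The returned statements would then be executed one after another through whatever MySQL client library the loader uses.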

MySQL database size

Microsoft SQL Server has a nice feature, which allows a database to be automatically expanded when it becomes full. In MySQL, I understand that a database is, in fact, a directory with a bunch of files corresponding to various objects. Does it mean that a concept of database size is not applicable and a MySQL database can be as big as available disk space allows without any additional concern? If yes, is this behavior the same across different storage engines?
It depends on the engine you're using. A list of the ones that come with MySQL can be found here.
MyISAM tables have a file per table. This file can grow to your file system's limit. As a table gets larger, you'll have to tune it, as there are index and data size optimizations that limit the default size. Also, this MyISAM documentation page says:
"There is a limit of 2^32 (~4.295E+09) rows in a MyISAM table. If you build MySQL with the --with-big-tables option, the row limitation is increased to (2^32)^2 (1.844E+19) rows. See Section 2.16.2, “Typical configure Options”. Binary distributions for Unix and Linux are built with this option."
InnoDB can operate in 3 different modes: using innodb table files, using a whole disk as a table file or using innodb_file_per_table.
Table files are pre-created per your MySQL instance. You typically create a large amount of space and monitor it. When it starts filling up, you need to configure another file and restart your server. You can also set it to autoextend, so that it will add a chunk of space to the last table file when it starts to fill up. I typically don't use this feature, as you never know when you'll take the performance hit for extending the table. This page talks about configuring it.
I've never used a whole disk as a table file, but it can be done. Instead of pointing to a file, I believe you point your InnoDB table files at the un-formatted, unmounted device.
innodb_file_per_table makes InnoDB tables act like MyISAM tables. Each table gets its own table file. Last time I used this, the table files did not shrink if you deleted rows from them. When a table is dropped or altered, the file resizes.
The Archive engine is a gzipped MyISAM table.
A memory table doesn't use disk at all. In fact, when a server restarts, all the data is lost.
Merge tables are like a poor man's partitioning for MyISAM tables. It causes a bunch of identical tables to be queried as if there were one. Aside from the FRM table definition, no files exist other than the MyISAM ones.
CSV tables are wrappers around CSV files. The usual file system limits apply here. They are not too fast, since they can't have indexes.
I don't think anyone uses BDB any more. At least, I've never used it. It uses a Berkeley database as a back end. I'm not familiar with its restrictions.
Federated tables are used to connect to and query tables on other database servers. Again, there is only an FRM file.
The Blackhole engine doesn't store anything locally. It's used primarily for creating replication logs and not for actual data storage, since there is no data storage :)
MySQL Cluster is completely different: it stores just about everything in memory (recent editions allow disk storage) and is very different from all the other engines.
What you describe is roughly true for MyISAM tables. For InnoDB tables the picture is different, and more similar to what other DBMSs do: one (or a few) big files with a complex internal structure for the whole server. To optimize it, you can use a whole disk (or partition) as a file (at least on Unix-like systems, where everything is a file).