Keep part of the data in memory and part in Disk - snappydata

I have a column table with millions of records. I would like to keep only the last 3 months in memory, the rest need to be on disk but can be consulted. Is it possible to do this in SnappyData?

Column tables automatically overflow to disk using an LRU algorithm when memory fills up. You can also explicitly configure your eviction settings. All the configuration options are listed here.

Related

Resources consumed by a simple SELECT query in MySql

There a few large tables in one of the databases of a customer (each table is ~50M rows in size and is not too wide). The intent is to infrequently read these tables (completely). As there are no reasonable CDC indices present, the plan is to read the tables by querying them
SELECT * from large_table;
The reads will be performed using a jdbc driver. With the following fetch configuration present, the intent is to read the data approximately one record at a time (it may require a significant amount of time) so that the client code is never overwhelmed.
PreparedStatement stmt = connection.prepareStatement(queryString, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
I was going through the execution path of a query in High Performance MySQL, however some questions seemed unanswered:
Without the temp tables being explicitly created and the query cache being made use of, "how" are the stream reads tracked on the server?
Is any temporary data created (in main memory or files on disk) whatsoever? If so, where is it created and how much?
If temporary data is not created, how are the rows to be returned tracked? Does the query engine keep track of all the page files to be read for this query on this connection? In case there are several such queries running on the server, are the earliest "Tracked" files purged in favor of queries submitted recently?
PS: I want to understand the effect of this approach on the MySql server (not saying that there aren't better ways of reading the tables)
That simple query will not use a temp table. It will simply fetch the rows and transfer them to the client until it finishes. Nor would any possible index be useful. (If the real query is more complex, let's see it.)
The client may wait for all the rows (faster, but memory intensive) before it hands any to the user code, or it may hand them off one at a time (much slower).
I don't know the details in JDBC on specifying it.
You may want to page through the table. If so, don't use OFFSET, but use the PRIMARY KEY and "remember where you left off". More discussion: http://mysql.rjweb.org/doc.php/pagination
Your Question #3 leads to a complex answer...
Every query brings all the relevant data (and index entries) into RAM. The data/index is read in chunks ("blocks") of 16KB from the BTree structure that is persisted on disk. For a simple select like that, it will read the blocks 'sequentially' until finished.
But, be aware of "caching":
If a block is already in RAM, no I/O is needed.
If a block is not in the cache ("buffer_pool"), it will, if necessary, bump some block out and read the desired block in. This is very normal, and very common. Do not fear it.
Because of the simplicity of the query, only a few blocks ever need to be in RAM at any moment. Hence, if your buffer pool were only a few megabytes, it could still handle, say, a 1TB table. There would be a lot of I/O, and that would impact other operations.
As for "tracking", let me use the analogy of reading a long book in a single sitting. There is nothing to track, you are simply turning pages ('blocks'). You don't even need a 'bookmark' for tracking, it is next-next-next...
Another note: InnoDB uses "B+Tree", which includes a link from one block to the "next", thereby making the page turning efficient.
Another interpretation of tracking... "Transactions" and "ACID". When any query (read or write) touches a table, there is some form of lock applied to each row touched. For SELECT the lock is rather light-weight. For writes it can cause delays or even a "deadlock". The locks are unavoidable, but sometimes actions can be taken to minimize their impact.
Logically (but not actually), a "snapshot" of all rows in all tables is taken at the instant you start a transaction. This allows you to see a consistent view of everything, even if other connections are changing rows. The underlying mechanism is very lightweight on reading, but heavier for writes. Writes will make a copy of the row so that each connection sees the snapshot that it 'should' see. Also, the copy allows for ROLLBACK and recovery from a crash (eg power failure).
(Transaction "isolation" mode allows some control over the snapshot.) To get the optimal performance for your case, do nothing special.
Here's a way to conceptualize the handling of transactions: Each row has a timestamp associated with it. Each query saves the start time of the query. The query can "see" only rows that are older than that start time. A subsequent write in another connection will be creating copies of rows with a later timestamp, hence not visible to the SELECT. Hence, the onus is on writes to do extra work; reads are cheap.

How can I limit the size of temporary tables?

I have largish (InnoDB) tables in a database; apparently the users are capable of making SELECTs with JOINs that result in temporary, large (and thus on-disk) tables. Sometimes, those are so large that they exhaust disk space, leading to all sorts of weird issues.
Is there a way to limit temp table maximum size for an on-disk table, so that the table doesn't overgrow the disk? tmp_table_size only applies to in-memory tables, despite the name. I haven't found anything relevant in the documentation.
There's no option for this in MariaDB and MySQL.
I ran into the same issue as you some months ago, I searched a lot and I finally partially solved it by creating a special storage area on the NAS for themporary datasets.
Create a folder on your NAS or a partition on an internal HDD, it will be by definition limited in size, then mount it, and in the mysql ini, assign the temporary storage to this drive: (choose either windows/linux)
tmpdir="mnt/DBtmp/"
tmpdir="T:\"
mysql service should be restarted after this change.
With this approach, once the drive is full, you still have "weird issues" with on-disk queries, but the other issues are gone.
There was a discussion about an option disk-tmp-table-size, but it looks like the commit did not make it through review or got lost for some other reason (at least the option does not exist in the current code base anymore).
I guess your next best try (besides increasing storage) is to tune MySQL to not make on-disk temp tables. There are some tips for this on DBA. Another attempt could be to create a ramdisk for the storage of the "on-disk" temp tables, if you have enough RAM and only lack disk storage.
While it does not answer the question for MySQL, MariaDB has tmp_disk_table_size and potentially also useful max_join_size settings. However, tmp_disk_table_size is only for MyISAM or Aria tables, not for InnoDB. Also, max_join_size works only on the estimated row count of the join, not the actual row count. On the bright side, the error is issued almost immediately.

MySQL cache of subset queries

I am attempting to make a query run on a large database in acceptable time. I'm looking at optimizing the query itself (e.g. Clarification of join order for creation of temporary tables), which took me from not being able to complete the query at all (with a 20 hr cap) to completing it but with time that's still not acceptable.
In experimenting, I found the following strange behavior that I'd like to understand: I want to do the query over a time range of 2 years. If I try to run it like that directly, then it still will not complete within the 10 min I'm allowing for the test. If I reduce it to the first 6 months of the range, it will complete pretty quickly. If I then incrementally re-run the query by adding a couple of months to the range (i.e. run it for 8 months, then 10 months, up to the full 2 yrs), each successive attempt will complete and I can bootstrap my way up to being able to get the full two years that I want.
I suspected that this might be possible due to caching of results by the MySQL server, but that does not seem to match the documentation:
If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again.
http://dev.mysql.com/doc/refman/5.7/en/query-cache.html
The key word there seems to be "identical," and the apparent requirement that the queries be identical was reenforced by other reading that I did. (The docs even indicate that the comparison on the query is literal to the point that logically equivalent queries written with "SELECT" vs. "select" would not match.) In my case, each subsequent query contains the full range of the previous query, but no two of them are identical.
Additionally, the tables are updated overnight. So at the end of the day yesterday we had the full, 2-yr query running in 19 sec when, presumably, it was cached since we had by that point obtained the full result at least once. Today we cannot make the query run anymore, which would seem to be consistent with the cache having been invalidated when the table was updated last night.
So the questions: Is there some special case that allows the server to cache in this case? If yes, where is that documented? If not, any suggestion on what else would lead to this behavior?
Yes, there is a cache that optimizes (general) access to the harddrive. It is actually a very important part of every storage based database system, because reading data from (or writing e.g. temporary data to) the harddrive is usually the most relevant bottleneck for most queries.
For InnoDB, this is called the InnoDB Buffer Pool:
InnoDB maintains a storage area called the buffer pool for caching data and indexes in memory. Knowing how the InnoDB buffer pool works, and taking advantage of it to keep frequently accessed data in memory, is an important aspect of MySQL tuning. For information about how the InnoDB buffer pool works, see InnoDB Buffer Pool LRU Algorithm.
You can configure the various aspects of the InnoDB buffer pool to improve performance.
Ideally, you set the size of the buffer pool to as large a value as practical, leaving enough memory for other processes on the server to run without excessive paging. The larger the buffer pool, the more InnoDB acts like an in-memory database, reading data from disk once and then accessing the data from memory during subsequent reads. See Section 15.6.3.2, “Configuring InnoDB Buffer Pool Size”.
There can be (and have been) written books about the buffer pool, how it works and how to optimize it, so I will stop there and just leave you with this keyword and refer you to the documentation.
Basically, your subsequent reads add data to the cache that can be reused until it has been replaced by other data (which in your case has happened the next day). Since (for MySQL) this can be any read of the involved tables and doesn't have to be your maybe complicated query, it might make the "prefetching" easier for you.
Although the following comes with a disclaimer because it obviously can have a negative impact on your server if you change your configuration: the default MySQL configuration is very (very) conservative, and e.g. the innodb_buffer_pool_size system setting is way too low for most servers younger than 15 years, so maybe have a look at your configuration (or let your system administrator check it).
We did some experimentation, including checking the effect from the system noted in the answer by #Solarflare. In our case, we concluded that the apparent caching was real, but it had nothing to do with MySQL at all. It was instead caused by the Linux disk cache. We were able to verify this in our case by manually flushing that cache after and before getting a result and comparing times.

MySQL Memory Engine vs InnoDB on RAMdisk

I'm writing a bit of software that needs to flatten data from a hierarchical type of format into tabular format. Instead of doing it all in a programming language every time and serving it up, I want to cache the results for a few seconds, and use SQL to sort and filter. When in use, we're talking 400,000 writes and 1 or 2 reads over the course of those few seconds.
Each table will contain 3 to 15 columns. Each row will contain from 100 bytes to 2,000 bytes of data, although it's possible that in some cases, some rows may get up to 15,000 bytes. I can clip data if necessary to keep things sane.
The main options I'm considering are:
MySQL's Memory engine
A good option, almost specifically written for my use case! But.. "MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length. MEMORY tables cannot contain BLOB or TEXT columns." - Unfortunately, I do have text fields with a length up to maybe 10,000 characters - and even that is a number that is not specifically limited. I could adjust the varchar length based on the max length of text columns as I loop through doing my flattening, but that's not totally elegant. Also, for my occasional 15,000 character row, does that mean I need to allocate 15,000 characters for every row in the database? If there was 100,000 rows, that's 1.3 gb not including overhead!
InnoDB on RAMDisk
This is meant to run on the cloud, and I could easily spin up a server with 16gb of ram, configure MySQL to write to tmpfs and use full featured MySQL. My concern for this is space. While I'm sure engineers have written the memory engine to prevent consuming all temp storage and crashing the server, I doubt this solution would know when to stop. How much actual space will my 2,000 bytes of data consume when in database format? How can I monitor it?
Bonus Questions
Indexes
I will in fact know in advance which columns need to be filtered and sorted by. I could set up an index before I do inserts, but what kind of performance gain could I honestly expect on top of a ram disk? How much extra overhead to indexes add?
Inserts
I'm assuming inserting multiple rows with one query is faster. But the one query, or series of large queries are stored in memory, and we're writing to memory, so if I did that I'd momentarily need double the memory. So then we talk about doing one or two or a hundred at a time, and having to wait for that to complete before processing more.. InnoDB doesn't lock the table but I worry about sending two queries too close to each other and confusing MySQL. Is this a valid concern? With the MEMORY engine I'd have to definitely wait for completion, due to table locks.
Temporary
Are there any benefits to temporary tables other than the fact that they're deleted when the db connection closes?
I suggest you use MyISAM. Create your table with appropriate indexes for your query. Then disable keys, load the table, and enable keys.
I suggest you develop a discipline like this for your system. I've used a similar discipline very effectively.
Keep two copies of the table. Call one table_active and the second one table_loading.
When it's time to load a new copy of your data, use commands like this.
ALTER TABLE table_loading DISABLE KEYS;
/* do your insertions here, to table_loading */
/* consider using LOAD DATA INFILE if it makes sense. */
ALTER TABLE table_loading ENABLE KEYS; /* this will take a while */
/* at this point, suspend your software that's reading table_active */
RENAME TABLE table_active TO table_old;
RENAME TABLE table_loading TO table_active;
/* now you can resume running your software */
TRUNCATE TABLE table_old;
RENAME TABLE table_old TO table_loading;
Alternatively, you can DROP TABLE table_old; and create a new table for table_loading instead of the last rename.
This two-table (double-buffered) strategy should work pretty well. It will create some latency because your software that's reading the table will work on an old copy. But you'll avoid reading from an incompletely loaded table.
I suggest MyISAM because you won't run out of RAM and blow up and you won't have the fixed-row-length overhead or the transaction overhead. But you might also consider MariaDB and the Aria storage engine, which does a good job of exploiting RAM buffers.
If you do use the MEMORY storage engine, be sure to tweak your max_heap_table_size system variable. If your read queries will use index range scans (sequential index access) be sure to specify BTREE style indexes. See here: http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html

Why data usage increases when deleting rows Mysql Cluster?

My mysql cluster data nodes data usage increased to 95%. Then i tried to delete 500000 rows and my data usage of nodes started increasing initially and reached to 99% and above and TABLE FULL error is encountered. Please can anyone tell me why data usage of existing nodes increased when deleting rows? What is the internal working of mysql cluster when deleting rows?
My MaxNoOfConcurrentOperations parameter in config.ini 600000 and it is not due to this parameter.
Deleting rows does not necessarily free up disk space. In fact, it might use up more as it keeps MVCC logs and associated replication data.
In comparison, TRUNCATE will abandon data wholesale and doesn't seem to incur the same penalty, at least with most engines.
If you're really short on disk space, you may need to drop some tables entirely to liberate some headroom.