Using In-Memory OLTP Tables instead of temporary tables - sql-server-2014

SQL Server 2014 introduced what seems to be a great feature: In-Memory OLTP (In-Memory Optimization). It reminds me of MySQL's MEMORY storage engine, which I tried several years ago and which gave me noticeably better performance.
I know this has already been asked, but I am interested in a more targeted scenario:
Several SSRS reports use large stored procedures that, for performance reasons (really ugly and slow execution plans for queries with lots of joins, grouping etc.), split the logic by using temporary tables to hold partial results.
Since temporary tables are written to disk, I expect a performance impact when dealing with large quantities of data (tens or even hundreds of thousands of rows).
Does using In-Memory OLTP tables make sense in this scenario? I am thinking about the following pros and cons.
Pros:
should be significantly faster, since all data is stored in memory
Cons:
requires explicit session separation (the table should include some sort of session identifier so data from different SPIDs is not mixed)
requires explicit data cleanup to avoid memory starvation (as opposed to temporary tables, which are dropped automatically when they go out of scope)
requires paying attention to the volume of data stored in the table

You got that right.
Note, though, that temp tables, like any other table, are written lazily (even more lazily than normal tables). The write to disk is usually not on the critical path because it happens asynchronously, except for very big temp tables that exhaust SQL Server's ability to keep dirty buffers in memory.
Also, if the temp table is deallocated quickly there will not be any page writes at all.
So the "pro" that you have listed there often does not come into (full) effect.
That said, in-memory tables tend to use less CPU even in interop mode, so the "pro" is always at least a little bit true.
You can consider using schema-only (DURABILITY = SCHEMA_ONLY) Hekaton tables; those are guaranteed not to cause any data writes.
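For illustration, a minimal sketch of such a table (the table, column names, and bucket count are made up, and the database needs a memory-optimized filegroup before this will run); the SessionId column implements the session separation mentioned in the question:

-- A non-durable, memory-optimized table that plays the role of a shared temp table.
-- DURABILITY = SCHEMA_ONLY means the rows are never written to disk (and are lost
-- on restart, which is acceptable for intermediate report data).
CREATE TABLE dbo.ReportScratch
(
    SessionId  INT           NOT NULL,   -- e.g. @@SPID, to keep concurrent sessions apart
    CustomerId INT           NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL,
    CONSTRAINT PK_ReportScratch PRIMARY KEY NONCLUSTERED HASH (SessionId, CustomerId)
        WITH (BUCKET_COUNT = 1048576)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);

-- Each procedure then works only with its own rows and cleans them up explicitly:
DELETE FROM dbo.ReportScratch WHERE SessionId = @@SPID;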

Related

Why do we have Redis when we have MySQL temporary tables?

MySQL temporary tables are stored in memory as long as the computer has enough RAM (and MySQL is configured accordingly). One can create indexes on any fields.
Redis stores data in memory, indexed by a single key at a time, and in my understanding MySQL can do this job too.
Is there anything that makes Redis better for storing a large amount (100-200k rows) of volatile data? The only explanation I can see for Redis is that not every project has MySQL in it, and some other databases probably don't support temporary tables.
If I already have MySQL in my project, does it make sense to bring in Redis as well?
Redis is like working with indexes directly. There's no ACID, no SQL parser, and none of the many other layers between you and the data.
It provides some basic data structures that are specifically optimized to be held in memory, along with specific operations to read and modify them.
On the other hand, Redis isn't designed for querying data (although you can implement very powerful and high-performance filters with SORT, SCAN, intersections and other operations) but for storing data in the shape in which it will be consumed later. If you want, for example, customers sorted by 3 different criteria, you'll need to do the work of filling 3 different sorted sets. There are many more use cases for the other data structures, but I would end up writing a book in an answer...
Also, one of the most powerful features of Redis is how easily it can be replicated, and since version 3.0 it supports data sharding out of the box.
Whether you need to use Redis instead of temporary tables in MySQL (and in other engines that have them too) is up to you. You need to study your case and check whether caching or storing data in a NoSQL store like Redis both outperforms your current approach and gives you a more elegant data architecture.
By using Redis alongside the other database, you're effectively reducing the load on it. Also, when Redis is running on a different server, scaling can be performed independently on each tier.

A persistent temporary table?

Right now, I'm using temporary tables in my select queries to speed up the execution. They are created every time I execute the query.
In my current situation, the tables are updated with new data only once per day, so I was thinking that instead of using MySQL's CREATE TEMPORARY TABLE statement, I'd create a persistent table, which in a sense would be temporary since it would be dropped and recreated after a day. I could fill it with the temporary data just after I've finished updating the main tables.
Or will InnoDB's buffer pool be smart enough to cache the data for temporary tables itself?
Or is there another way of caching the temporary tables?
I'm also sending appropriate cache headers along with the data loaded via AJAX to reduce server load, and AJAX queries make up about 70% of the read requests sent to MySQL.
Is what I'm thinking just a plain waste of disk space (and are tables never meant to be used in this fashion), or is it actually a bright idea for my situation?
I came across a similar issue recently where the CREATE TEMPORARY TABLE came at a significant cost due to continual reuse. I also used the solution that Barranka describes (create once and truncate when finished or before reuse).
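A minimal sketch of that create-once / truncate-and-refill pattern (table and column names are purely illustrative):

-- One-time setup: a persistent table that plays the role of the temporary table.
CREATE TABLE report_cache (
    customer_id INT           NOT NULL,
    total       DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (customer_id)
) ENGINE=InnoDB;

-- Daily refresh, run right after the main tables have been updated:
TRUNCATE TABLE report_cache;
INSERT INTO report_cache (customer_id, total)
SELECT customer_id, SUM(amount)
FROM orders
GROUP BY customer_id;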
To increase performance even more I used InnoDB tables that were created on a RAM disk (ramfs). This gives all the benefits of the InnoDB storage engine with very little IO cost. This is a better solution than using the MEMORY storage engine which, according to Oracle support, is only available for legacy applications and has not been improved or extended for some time.
Maybe looking at the MEMORY storage engine might help? I use it to accept data from an intensive query once a day, after which the MEMORY table is used intensively for a short period of time.
http://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html
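For reference, moving such a table to the MEMORY engine is just a matter of the ENGINE clause (names are illustrative); keep in mind that MEMORY tables lose their contents on restart and do not support TEXT/BLOB columns:

CREATE TABLE daily_summary (
    product_id INT NOT NULL,
    units_sold INT NOT NULL,
    PRIMARY KEY (product_id)   -- MEMORY supports HASH and BTREE indexes
) ENGINE=MEMORY;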

MySQL speed of a table affected by the size of other tables?

In a MySQL database with multiple tables, is the runtime of a query on one table affected by the size of all the other tables within the same database?
I can only think of two effects that other tables could have on a query, assuming that these tables are not involved in the query, no other queries are running, and there are no constraints or triggers that link the tables.
The first effect is during the compilation phase. During this phase, the SQL parser is fetching information about tables and columns from metadata tables. More tables and columns in the database could slow down the compilation of the query.
The other effect is page and disk fragmentation. If you have a clean system and start allocating and filling pages, then the pages are probably going to fill contiguous pages on the disk (no guarantees, but probably). At access time, the operating system might be pre-fetching physical pages adjacent to a requested page. In the environment I described, this pre-fetched data would probably be used in the query.
In a database with multiple tables, you have a much greater chance of the disk being fragmented. In this case, the pre-fetched hardware pages are less likely to be for the same tables in the query. This means that they don't get used, so an additional I/O request is needed to get the next page.
I've described this in terms of disk fragmentation, but a similar thing can happen with the pages themselves. The physical pages where data for a table is stored may not be contiguous, with similar results.
Fragmentation can be an issue with databases. In fact, it can be an issue regardless of the number of tables in the database. But more tables with more insert/delete activity on them tends to increase fragmentation. However, the effects are usually pretty slight, and would only in certain extreme circumstances be responsible for a significant reduction in performance.
Having many tables (and a lot of data) only requires disk space.
Processing speed depends on the optimization and size of the table plus the queries you are executing, not on the entire database.
Query run time can be affected by a wide variety of server environment concerns, so while one might say that in and of itself having other large tables in the database wouldn't affect the speed of queries on unrelated tables, the reality is that those tables might be getting queried too and consuming server resources. That, of course, affects your query speed.
It should only be a problem if that table (the large one) is related to the other table with some sort of connection (foreign key constraints would be one example)...otherwise your tables should behave independently. That said, if you have a single table that is so massive as to be causing speed problems, you might want to find other solutions for refactoring some of that data into smaller subsets.

MySQL sharding approaches?

What is the best approach for sharding MySQL tables?
The approaches I can think of are:
Application Level sharding?
Sharding at MySQL proxy layer?
Central lookup server for sharding?
Do you know of any interesting projects or tools in this area?
The best approach for sharding MySQL tables is not to do it unless it is totally unavoidable.
When you are writing an application, you usually want to do so in a way that maximizes velocity, developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.
You partition and then assign partitions to different hosts (= shard) only when the sum of all these partitions no longer fits onto a single database server instance - the reason for that being either writes or reads.
The write case is either a) the frequency of writes is permanently overloading this server's disks, or b) there are so many writes going on that replication permanently lags in this replication hierarchy.
The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.
Only when you have to shard do you do it.
The moment you shard, you are paying for that in multiple ways:
Much of your SQL is no longer declarative.
Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.
With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, on nodes A and B, and are joining data from it against data that is on nodes B and C. You start writing application-side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL Cluster), meaning you end up with a lot of SQL that is no longer declarative, but expresses SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).
You are incurring a lot of network latency.
Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.
In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHERE clause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.
But even in the best of cases this involves many more network round trips than a local situation, and it is more complicated. Especially since the MySQL optimizer knows nothing about network latency at all (OK, MySQL Cluster is slowly getting better at that, but for vanilla MySQL outside of Cluster it is still true).
You are losing a lot of expressive power of SQL.
Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.
MySQL has no API for asynchronous queries that is in working order.
When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not been logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").
The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.
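To make the cross-shard case concrete, a query like the "not logged in for 90 days" example above (table and column names here are hypothetical) has to be sent to every shard and the partial results merged by the application, because no single node holds all the users:

-- Run once per shard, then union the result sets in the application:
SELECT user_id, email
FROM users
WHERE last_login < NOW() - INTERVAL 90 DAY;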
Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.
The question is more if you want to auto-shard (determining which row goes into which node by hashing primary keys for example) or if you want to split functionally in a manual way ("The tables related to the xyz user story go to this master, while abc and def related tables go to that master").
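As a rough sketch of the auto-sharding variant (the CRC32 hash and the choice of 4 shards are arbitrary assumptions, not a recommendation), the application computes something like:

-- Map a primary key value to one of 4 shards; 12345 stands in for the key being routed.
SELECT MOD(CRC32(12345), 4) AS shard_no;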
Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.
Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.
Functional sharding has the advantage that it is relatively easy to apply to an existing codebase with a number of changes that is not overly large. Booking.com has done it multiple times over the past years and it worked well for them.
Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.
Application Level sharding: dbShards is the only product that I know of that does "application aware sharding". There are a few good articles on the website. Just by definition, application-aware sharding is going to be more efficient. If an application knows exactly where to go with a transaction without having to look it up or get redirected by a proxy, that in itself will be faster. And speed is often one of the primary concerns, if not the only concern, when someone is looking into sharding.
Some people "shard" with a proxy, but in my eyes that defeats the purpose of sharding. You are just using another server to tell your transactions where to find the data or where to store it. With application aware sharding, your application knows where to go on its own. Much more efficient.
This is the same as #2 really.
Do you know of any interesting projects or tools in this area?
Several new projects in this space:
citusdata.com
spockproxy.sourceforge.net
github.com/twitter/gizzard/
Application level of course.
The best approach I've ever read I found in this book:
High Performance MySQL
http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064
Short description: you could split your data into many parts and store ~50 parts on each server. This helps you avoid the second biggest problem of sharding - rebalancing. Just move some of the parts to the new server and everything will be fine :)
I strongly recommend buying it and reading the "MySQL scaling" part.
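One common way to implement the "many small parts per server" idea described above (the table below is purely a sketch) is a lookup table that maps each logical partition to the physical server currently hosting it; rebalancing then means moving one partition's data and updating its row:

CREATE TABLE shard_map (
    partition_no SMALLINT    NOT NULL PRIMARY KEY,  -- logical partition, e.g. 0..199
    host         VARCHAR(64) NOT NULL               -- server currently holding that partition
) ENGINE=InnoDB;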
As of 2018, there seems to be a MySQL-native solution to that. There are actually at least two: InnoDB Cluster and NDB Cluster (there is a commercial and a community version of it).
Since most people who use the MySQL Community Edition are more familiar with the InnoDB engine, this is what should be explored as a first priority. It supports replication and partitioning/sharding out of the box and is based on MySQL Router for different routing/load-balancing options.
The syntax for your table creation would need to change, for example:
CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATETIME) PARTITION BY HASH ( YEAR(col3) );
(this is only one of four partitioning types)
One very important limitation:
InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys. InnoDB tables which have or which are referenced by foreign keys cannot be partitioned.
Shard-Query is an OLAP based sharding solution for MySQL. It allows you to define a combination of sharded tables and unsharded tables. The unsharded tables (like lookup tables) are freely joinable to sharded tables, and sharded tables may be joined to each other as long as the tables are joined by the shard key (no cross shard or self joins that cross shard boundaries). Being an OLAP solution, Shard-Query usually has minimum response times of 100ms or less, even for simple queries so it will not work for OLTP. Shard-Query is designed for analyzing big data sets in parallel.
OLTP sharding solutions exist for MySQL as well. Closed source solutions include ScaleDB and DBShards. Open source OLTP solutions include JetPants, Cubrid and Flock/Gizzard (Twitter infrastructure).
Do you know of any interesting projects or tools in this area?
As of 2022, here are two tools:
Vitess (website: https://vitess.io & repo: https://github.com/vitessio/vitess)
PlanetScale (https://planetscale.com)
You can also consider this middleware:
ShardingSphere

Will a MySQL table with 20,000,000 records be fast with concurrent access?

I ran a lookup test against an indexed MySQL table containing 20,000,000 records, and according to my results, it takes 0.004 seconds to retrieve a record given an id--even when joining against another table containing 4,000 records. This was on a 3GHz dual-core machine, with only one user (me) accessing the database. Writes were also fast, as this table took under ten minutes to create all 20,000,000 records.
Assuming my test was accurate, can I expect performance to be as snappy on a production server with, say, 200 users concurrently reading from and writing to this table?
I assume InnoDB would be best?
That depends on the storage engine you're going to use and what's the read/write ratio.
InnoDB will be better if there are lots of writes. If it's mostly reads with the very occasional write, MyISAM might be faster. MyISAM uses table-level locking, so it locks up the whole table whenever you need to update. InnoDB uses row-level locking, so you can have concurrent updates on different rows.
InnoDB is definitely safer, so I'd stick with it anyhow.
BTW, remember that RAM is very cheap right now, so buy a lot.
Depends on any number of factors:
Server hardware (especially RAM)
Server configuration
Data size
Number of indexes and index size
Storage engine
Writer/reader ratio
I wouldn't expect it to scale that well. More importantly, this kind of thing is too important to speculate about. Benchmark it and see for yourself.
Regarding the storage engine, I wouldn't dare to use anything but InnoDB for a table of that size that is both read from and written to. If you run any write query that isn't a primitive insert or a single-row update, you'll end up locking the whole table with MyISAM, which yields terrible performance as a result.
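If such a table is still on MyISAM, converting it is a one-liner (the table name is a placeholder), although it rewrites the whole table and is best done in a maintenance window:

ALTER TABLE big_table ENGINE=InnoDB;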
There's no reason that MySQL couldn't handle that kind of load without any significant issues. There are a number of other variables involved though (otherwise, it's a 'how long is a piece of string' question). Personally, I've had a number of tables in various databases that are well beyond that range.
How large is each record (on average)?
How much RAM does the database server have - and how much of it is allocated to the various parts of the MySQL/InnoDB configuration?
A default configuration may only allow for an 8MB buffer between disk and client (which might work fine for a single user), but trying to fit a 6GB+ database through that is doomed to failure. That problem was real, by the way, and was causing several crashes a day of a database/website until I was brought in to troubleshoot it.
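If you suspect you are in a similar situation, checking the InnoDB buffer pool size is a quick first test (the 4G value in the comment is only an example; it is normally set in my.cnf and sized to hold the working set):

-- The value is reported in bytes; raise it with e.g.  innodb_buffer_pool_size = 4G  in my.cnf.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';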
If you are likely to do a great deal more with that database, I'd recommend getting someone with a little more experience, or at least doing what you can to give it some optimisations. Reading 'High Performance MySQL, 2nd Edition' is a good start, as is looking at some tools like Maatkit.
As long as your schema design and DAL are constructed well enough, you understand query optimization inside out, can adjust all the server configuration settings at a professional level, and have "enough" hardware properly configured, yes (except for sufficiently pathological cases).
Same answer for both engines.
You should probably perform a load test to verify, but as long as the index was created properly (meaning indexes are optimized to your query statements), the SELECT queries should perform at an acceptable speed (the INSERTS and/or UPDATES may be more of a speed issue though depending on how many indexes you have, and how large the indexes get).