MySQL Memory Table, Memcached or anything else? - mysql

A dataset currently growing past 1 million records requires constant lookup and updating of user-specific data.
I'm looking for the fastest, most scalable option with high TPS.
Memcached/MemcacheDB versus MySQL MEMORY tables is a big point of confusion for implementation and scaling options.
Can anyone provide proper scaling/TPS and performance information to decide which one to land on?

Does the integrity of this data matter? If it does, you can immediately rule out memcached and MySQL memory tables, as neither one is persisted to durable storage. memcachedb is at least persisted, but it doesn't make the same sorts of reliability guarantees that a normal (R)DBMS would.

If you've got a large dataset, you don't scale it by throwing hardware at it. Since you didn't say what your growth rate is, it's difficult to suggest anything specific.
If you need to scale writes, you partition your table (a hedged sketch follows below).
If you need to scale reads, you create a replication cluster with one master and multiple slaves.
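As a sketch of the write-scaling case, this is roughly what hash partitioning might look like (the table and column names here are made up for illustration):

-- Spread rows across 8 physical partitions by hashing the user id;
-- inserts and updates for different users hit different partitions.
CREATE TABLE user_data (
    user_id    INT NOT NULL,
    payload    VARCHAR(255),
    updated_at DATETIME,
    PRIMARY KEY (user_id)
) ENGINE=InnoDB
PARTITION BY HASH (user_id)
PARTITIONS 8;

Each partition can later be moved to its own host when a single server is no longer enough.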
Also, there's a storage engine called TokuDB available for MySQL (more info at www.tokutek.com). It's extremely fast for certain things (updates, hot index addition and the like) but not as strong at mass updates. It's worth checking out.

Related

Optimizing mysql/postgresql for create and update

As far as I know, most RDBMS packages are built with the assumption that 99% of queries will be SELECTs. However, I am in a situation where at least 50% of the queries are creates/updates. Since we also need persistence, we cannot go for NoSQL solutions. Essentially, whenever there is an update it should be immediately stored permanently. So I was wondering if MySQL performance will be hampered because of that. Our current MySQL engine is InnoDB. Is any other MySQL engine preferable? I plan to use Amazon RDS, so my focus is MySQL; but just out of curiosity I would like to know if PostgreSQL can help here.
N.B. - Just to give an idea of the scale, we are talking about create/update queries on tables with at least a million entries within a couple of months of going into production.
If your working set fits in memory, your inserts and updates will tend to be quite fast. Partitioning can help here, as others have mentioned. Most NoSQL solutions have persistence so you shouldn't exclude them outright. Cassandra has a storage model specifically tuned for writes and might be worth a look.
If you go with MySQL, there are tuning parameters to trade some durability for insert performance, and various other hardware and software settings:
https://serverfault.com/questions/118504/how-to-improve-mysql-insert-and-update-performance
You can probably expect around 100 inserts/sec with full durability on standard disks. If that's not going to cut it, set up benchmarks and start tweaking parameters, or get ready for some re-architecting. Benchmark with realistic amounts of data in your tables; it's much better to find a problem now than to discover it 6 months down the road when your tables start to fill up. Synthetic data is fine, just make sure the indexed fields are distributed similarly.
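For InnoDB specifically, the biggest durability knob is the redo log flush policy. These are real MySQL settings, but treat the values below as a sketch; the right choice depends entirely on how much data loss you can tolerate on a crash:

-- Flush the InnoDB redo log to disk once per second instead of at every commit;
-- a crash can lose up to roughly the last second of transactions.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
-- Let the OS decide when to sync the binary log (fastest, least durable).
SET GLOBAL sync_binlog = 0;

Note that the asker requires every update to be stored permanently, so these settings trade away exactly that guarantee; batching many inserts into a single transaction is a safer way to amortize the flush cost.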
Having as few indexes as possible increases the speed of inserts and updates, because every index has to be updated when rows are inserted or modified.
But of course, keep in mind that some indexes can also speed up your updates, e.g. an index on the columns in your WHERE clause helps locate the rows to change.

MySQL sharding approaches?

What is the best approach for sharding MySQL tables?
The approaches I can think of are :
Application Level sharding?
Sharding at MySQL proxy layer?
Central lookup server for sharding?
Do you know of any interesting projects or tools in this area?
The best approach for sharding MySQL tables is not to do it unless it is totally unavoidable.
When you are writing an application, you usually want to do so in a way that maximizes velocity, developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.
You partition, and then assign partitions to different hosts (= shard), only when the sum of all these partitions no longer fits onto a single database server instance - the reason for that being either writes or reads.
The write case is either a) the frequency of writes is permanently overloading this server's disks, or b) there are so many writes going on that replication permanently lags in this replication hierarchy.
The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.
Only when you have to shard do you do it.
The moment you shard, you are paying for that in multiple ways:
Much of your SQL is no longer declarative.
Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.
With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, spread across nodes A and B, and are joining data from it against data that is on nodes B and C. You start writing application-side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL Cluster), meaning you end up with a lot of SQL that is no longer declarative, but expresses SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).
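For instance, a join that used to be one statement often degenerates into per-shard batched key lookups driven by application code. A sketch (table and column names are hypothetical):

-- On shard A: fetch the driving rows
SELECT user_id, order_id FROM orders WHERE order_date >= '2012-01-01';

-- On shard B (and C, ...): probe with the keys collected above, in batches
SELECT user_id, email FROM users WHERE user_id IN (101, 205, 317 /* ... */);

The "join" itself (matching orders to users) now happens in application memory.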
You are incurring a lot of network latency.
Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.
In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHERE clause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.
But even in the best of cases this involves many more network round trips than a local setup, and it is more complicated. Especially since the MySQL optimizer knows nothing about network latency at all (OK, MySQL Cluster is slowly getting better at that, but for vanilla MySQL outside of Cluster it is still true).
You are losing a lot of expressive power of SQL.
Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.
MySQL has no asynchronous query API that is in working order.
When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not been logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").
The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.
Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.
The question is more if you want to auto-shard (determining which row goes into which node by hashing primary keys for example) or if you want to split functionally in a manual way ("The tables related to the xyz user story go to this master, while abc and def related tables go to that master").
Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.
Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.
Functional sharding has the advantage that it can be done to an existing codebase relatively easily, with a number of changes that is not overly large. Booking.com has done it multiple times over the years and it worked well for them.
Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.
Application Level sharding: dbShards is the only product I know of that does "application aware sharding". There are a few good articles on the website. Just by definition, application-aware sharding is going to be more efficient. If an application knows exactly where to go with a transaction, without having to look it up or get redirected by a proxy, that in itself will be faster. And speed is often one of the primary concerns, if not the only concern, when someone is looking into sharding.
Some people "shard" with a proxy, but in my eyes that defeats the purpose of sharding. You are just using another server to tell your transactions where to find the data or where to store it. With application aware sharding, your application knows where to go on its own. Much more efficient.
This is the same as #2 really.
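A minimal sketch of the central-lookup idea in plain SQL, assuming a hypothetical directory table that the application queries (and usually caches) to route requests:

CREATE TABLE shard_map (
    user_id_min INT NOT NULL,
    user_id_max INT NOT NULL,
    shard_host  VARCHAR(64) NOT NULL,
    PRIMARY KEY (user_id_min)
);

-- Which shard holds user 123456?
SELECT shard_host
FROM shard_map
WHERE 123456 BETWEEN user_id_min AND user_id_max;

With application-aware sharding, the same mapping lives in the application itself (e.g. a hash of the key modulo the shard count), so no extra lookup round trip is needed.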
Do you know of any interesting projects or tools in this area?
Several new projects in this space:
citusdata.com
spockproxy.sourceforge.net
github.com/twitter/gizzard/
Application level of course.
The best approach I've ever read is the one I found in this book:
High Performance MySQL
http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064
Short description: you split your data into many partitions and store ~50 partitions on each server. That helps you avoid the second biggest problem of sharding - rebalancing. Just move some of the partitions to the new server and everything will be fine :)
I strongly recommend buying it and reading the "MySQL scaling" part.
As of 2018, there seems to be a MySQL-native solution for this. There are actually at least two: InnoDB Cluster and NDB Cluster (the latter has commercial and community versions).
Since most people who use MySQL Community Edition are more familiar with the InnoDB engine, that is what should be explored as a first priority. It supports replication and partitioning/sharding out of the box, and relies on MySQL Router for different routing/load-balancing options.
The syntax for your tables creation would need to change, for example:
CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATETIME) PARTITION BY HASH ( YEAR(col3) );
(this is only one of four partitioning types)
One very important limitation:
InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys. InnoDB tables which have or which are referenced by foreign keys cannot be partitioned.
Shard-Query is an OLAP based sharding solution for MySQL. It allows you to define a combination of sharded tables and unsharded tables. The unsharded tables (like lookup tables) are freely joinable to sharded tables, and sharded tables may be joined to each other as long as the tables are joined by the shard key (no cross shard or self joins that cross shard boundaries). Being an OLAP solution, Shard-Query usually has minimum response times of 100ms or less, even for simple queries so it will not work for OLTP. Shard-Query is designed for analyzing big data sets in parallel.
OLTP sharding solutions exist for MySQL as well. Closed-source solutions include ScaleDB and dbShards. Open-source OLTP solutions include JetPants, CUBRID and Flock/Gizzard (Twitter infrastructure).
Do you know of any interesting projects or tools in this area?
As of 2022, here are two tools:
Vitess (website: https://vitess.io & repo: https://github.com/vitessio/vitess)
PlanetScale (https://planetscale.com)
You can also consider the middleware ShardingSphere.

Best Table Engine for massively updated MySQL table. MyISAM or HEAP?

I am creating an application which will store a (semi) real-time feed from a few different scales around a certain location. The weight from each scale goes into a table with only as many rows as there are scales. The scale app feeds the MySQL database a new weight every second, which a PHP web app reads every 3 seconds. It doesn't seem like enough traffic to hit the disk very hard, and the difference may be negligible, but I'm wondering whether it would be more efficient or make more sense to use a MEMORY/HEAP table instead of a normal MyISAM table.
With anything from hundreds to thousands of concurrent read/write requests (think typical OLTP usage), InnoDB will outperform MyISAM hands down.
It's not about other people's observations, and it's not just about transactional/ACID support; it's about the architecture of InnoDB, which is far superior to that of the legacy MyISAM engine.
For example, InnoDB supports clustered primary key indexes: http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html.
Additionally, InnoDB has row-level locking, which is far more performant under concurrent load than MyISAM's table-level locking.
I could keep going, but someone's already provided a really good summary of why InnoDB is a better choice for OLTP: http://tag1consulting.com/MySQL_Engines_MyISAM_vs_InnoDB
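To make the locking difference concrete, here is a hedged illustration using a hypothetical scale_weights table keyed by scale_id. Under InnoDB, two sessions can run these at the same time; under MyISAM, the second UPDATE would wait behind a table lock:

-- Session 1
START TRANSACTION;
UPDATE scale_weights SET weight = 42.5 WHERE scale_id = 1;  -- InnoDB locks only row 1

-- Session 2, concurrently
START TRANSACTION;
UPDATE scale_weights SET weight = 17.0 WHERE scale_id = 2;  -- different row, proceeds immediately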
Well, if you're expecting a large amount of data, I think you almost have to go MyISAM. You'll likely run out of memory if you store it all in a memory table. Not to mention that you'll lose all of your data upon power loss with a HEAP engine (Keep in mind, you may want that depending on your use case)...
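If you did want to try the MEMORY (HEAP) engine for a tiny, constantly overwritten table like the one in the question, a sketch (column names are made up):

CREATE TABLE scale_weights (
    scale_id INT NOT NULL PRIMARY KEY,
    weight   DECIMAL(10,2),
    read_at  TIMESTAMP
) ENGINE=MEMORY;

Just remember the caveat above: its contents vanish on restart.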
I know that this question is getting dated and you've probably made a very good solution by now but I just wanted to point out to anyone who may be reading this that perhaps a relational database isn't the best way to solve this problem. To me this clearly looks like a case where a flat file database is the ideal solution. You could have saved yourself a ton of overhead by just writing these values out to a binary file and then use simple mathematical operations to select rows and fields.

What are the limitations of implementing MySQL NDB Cluster?

I want to implement NDB Cluster for MySQL Cluster 6. I want to do it for a very large data structure with a minimum of 2 million records.
I want to know if there are any limitations of implementing NDB Cluster, for example RAM size, number of databases, or size of database.
2 million databases? I assume you meant "rows".
Anyway, concerning limitations: one of the most important things to keep in mind is that NDB/MySQL Cluster is not a general-purpose database. Most notably, join operations, but also subqueries and range operations (queries like: orders created between now and a week ago), can be considerably slower than what you might expect. This is in part because the data is distributed across multiple nodes. Although some improvements have been made, join performance can still be very disappointing.
On the other hand, if you need to deal with many (preferably small) concurrent transactions (typically single-row updates/inserts/deletes and lookups by primary key) and you manage to keep all of your data in memory, then it can be a very scalable and performant solution.
You should ask yourself why you want cluster. If you simply want your ordinary database that you have now, except with added 99.999% availability, then you may be disappointed. Certainly MySQL Cluster can provide you with great availability and uptime, but the workload of your app may not be very well suited to the things cluster is good for. Plus you may be able to use another high-availability solution to increase the uptime of your otherwise traditional database.
BTW - here's a list of limitations as per the doc: http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations.html
But whatever you do, try out cluster and see if it's good for you. MySQL Cluster is not "MySQL + 5 nines". You'll find out when you try.
NDB Cluster comes with two types of storage options:
1. In-memory storage.
2. Disk storage.
NDB started out as an in-memory data store; later versions added support for disk storage.
The current version, 7.5 (based on MySQL 5.7), supports disk storage, in which case there is no hard size constraint: the data resides on disk, and the limit depends on the disk space available to you.
Disk Storage configurations - https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-disk-data-symlinks.html
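A hedged sketch of what putting an NDB table on disk involves (file names and table names are placeholders; see the linked manual page for the full set of options):

-- Undo log for disk data tables
CREATE LOGFILE GROUP lg1
    ADD UNDOFILE 'undo1.log'
    ENGINE NDBCLUSTER;

-- Tablespace holding the on-disk data files
CREATE TABLESPACE ts1
    ADD DATAFILE 'data1.dat'
    USE LOGFILE GROUP lg1
    ENGINE NDBCLUSTER;

-- Table whose non-indexed columns live on disk
CREATE TABLE user_events (
    id     BIGINT NOT NULL PRIMARY KEY,
    detail VARCHAR(255)
) TABLESPACE ts1 STORAGE DISK
  ENGINE NDBCLUSTER;

Note that indexed columns are still kept in memory; only the non-indexed columns actually reside on disk.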
In-memory storage in NDB Cluster is also quite mature, and you can define memory usage in the management node's config.ini file.
For example:
DataMemory=3072M
IndexMemory=384M
For an average table (depending on the data stored in its columns), a total DB size of less than 1 GB fits easily within such a configuration.
Note: in my own implementation I faced one performance challenge: NDB performance degrades as the number of rows in a table grows.
Under high load, concurrent reads will degrade as the row count increases.
Make sure you don't do full table scans, and provide sufficiently selective WHERE clause predicates.
For proper performance, define secondary indexes to match your query patterns (see the example below).
Defining secondary indexes will again increase memory consumption, so plan your query patterns and memory resources accordingly.
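For example (hypothetical table and query pattern), an index shaped to the query:

-- Query pattern: "recent events for one user"
CREATE INDEX idx_user_time ON user_events (user_id, created_at);

-- Served by the index above instead of a full table scan:
SELECT * FROM user_events
WHERE user_id = 42 AND created_at > NOW() - INTERVAL 7 DAY;

In NDB, ordered indexes consume DataMemory, so each index added this way has a direct memory cost.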

Will a MySQL table with 20,000,000 records be fast with concurrent access?

I ran a lookup test against an indexed MySQL table containing 20,000,000 records, and according to my results, it takes 0.004 seconds to retrieve a record given an id--even when joining against another table containing 4,000 records. This was on a 3GHz dual-core machine, with only one user (me) accessing the database. Writes were also fast, as this table took under ten minutes to create all 20,000,000 records.
Assuming my test was accurate, can I expect performance to be as snappy on a production server with, say, 200 users concurrently reading from and writing to this table?
I assume InnoDB would be best?
That depends on the storage engine you're going to use and what the read/write ratio is.
InnoDB will be better if there are lots of writes. If it's mostly reads with very occasional writes, MyISAM might be faster. MyISAM uses table-level locking, so it locks the whole table whenever you need to update. InnoDB uses row-level locking, so you can have concurrent updates on different rows.
InnoDB is definitely safer, so I'd stick with it anyhow.
BTW, remember that RAM is very cheap right now, so buy a lot.
Depends on any number of factors:
Server hardware (Especially RAM)
Server configuration
Data size
Number of indexes and index size
Storage engine
Writer/reader ratio
I wouldn't expect it to scale that well. More importantly, this kind of thing is too important to speculate about. Benchmark it and see for yourself.
Regarding the storage engine, I wouldn't dare use anything but InnoDB for a table of that size that is both read from and written to. Under MyISAM, any write query that isn't a primitive insert or single-row update ends up locking the whole table, which yields terrible performance.
There's no reason MySQL couldn't handle that kind of load without any significant issues. There are a number of other variables involved, though (otherwise it's a "how long is a piece of string" question). Personally, I've had a number of tables in various databases that are well beyond that range.
How large is each record (on average)?
How much RAM does the database server have, and how much of it is allocated to the various MySQL/InnoDB settings?
A default configuration may only allow for a default 8MB buffer between disk and client (which might work fine for a single user), but trying to fit a 6GB+ database through that is doomed to failure. That problem was real, by the way, and was causing several crashes a day of a database/website until I was brought in to troubleshoot it.
If you are likely to do a great deal more with that database, I'd recommend getting someone with a little more experience, or at least doing what you can to apply some optimisations. Reading "High Performance MySQL, 2nd Edition" is a good start, as is looking at some tools like Maatkit.
As long as your schema design and DAL are constructed well enough, you understand query optimization inside out, can adjust all the server configuration settings at a professional level, and have "enough" hardware properly configured, yes (except for sufficiently pathological cases).
Same answer for both engines.
You should probably perform a load test to verify, but as long as the indexes were created properly (meaning they are optimized for your query statements), the SELECT queries should perform at an acceptable speed (the INSERTs and/or UPDATEs may be more of a speed issue, though, depending on how many indexes you have and how large they get).
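One quick sanity check before a full load test is to confirm the lookup actually uses the indexes, e.g. (table and column names here are hypothetical):

EXPLAIN SELECT t.*, j.name
FROM big_table t
JOIN join_table j ON j.id = t.join_id
WHERE t.id = 12345;

You want to see access types like const/eq_ref/ref with low examined-row estimates, not ALL (a full table scan).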