To separate my data based on a key, should I use multiple topics or multiple partitions within the same topic? I'm asking in terms of overhead, computation, data storage, and the load placed on the server.
I would recommend separating (partitioning) your data into multiple partitions within the same topic.
I assume the data logically belongs together (for example a stream of click events).
The advantage of partitioning your data using multiple partitions within the same topic is mainly that all Kafka APIs are implemented to be used like this.
Splitting your data across multiple topics would probably lead to much more code in the producer and consumer implementations.
As @rmetzger suggested, splitting records into multiple topics would increase complexity at the producer level; however, there are some other factors worth considering.
In Kafka the main unit of parallelism is the number of partitions in a topic: with that many partitions you can spawn the same number of consumer instances reading from the topic in parallel.
For example, if you have a separate topic per event type, each with N partitions, then while consuming you will be able to create N consumer instances, each dedicated to consuming from a specific partition concurrently. But in that case the ordering of the messages is not guaranteed, i.e. ordering is lost in the presence of parallel consumption.
On the other hand, keeping the records in separate partitions within the same topic is a lot easier to implement and lets you consume messages in order (Kafka only provides a total order over messages within a partition, not between different partitions in a topic). But you will be limited to running only one consumer process in that case.
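For illustration, here is a minimal sketch of the single-topic approach using the kafka-python client; the broker address, topic name, consumer group id, and key are assumptions, not taken from the question. Keying records means all messages with the same key land on the same partition, so per-key ordering is preserved while consumers in one group still split the partitions among themselves.

```python
# Minimal sketch (assumes a broker on localhost:9092 and a topic named
# "click-events" created with several partitions).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Kafka hashes the key to pick a partition, so records sharing a key
# stay on one partition and keep their relative order.
producer.send("click-events", key=b"user-42", value=b'{"action": "click"}')
producer.flush()

# Consumers with the same group_id divide the partitions among themselves,
# which is how consumption scales up to the number of partitions.
consumer = KafkaConsumer(
    "click-events",
    bootstrap_servers="localhost:9092",
    group_id="click-processors",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.partition, record.key, record.value)
    break  # demonstrate a single record and stop
```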
I'm learning about sharding approaches: how to achieve good horizontal scalability with a large number of shards in an IO-heavy application. Below I describe a case that I expect to see in my app. I think this would be relatively common in the wild; however, I was unable to find much info on it.
Let's say that we need to shard a table/collection where each row is associated with a client. All queries will include a single client id (uuid). Updates and reads are mostly evenly distributed among clients.
From what I've read, in this case I would want to use a hashed sharding key on the client id. Reads would touch a single shard, giving the best performance. Writes would be evenly distributed as long as clients produce roughly the same load.
But what to do if there is a very small subset of clients that produce so much IO load that a single shard would have trouble handling it?
If we instead change the sharding key to a random record ID, then writes for all clients would be distributed across all shards. But reads would have to hit every shard, which is not efficient, especially when there are a lot of them.
How do we achieve a balance: have average clients be evenly distributed, and at the same time allow large clients to occupy multiple shards? Are there any DB solutions that would be able to do this automatically? Or do we have to write custom logic for tracking DB load and redistributing large clients between shards? What should I read on the topic?
I'd suggest adding a new attribute to the client's records, for example we could call it part. Assign a single value to simple clients, and store the same value in part for all their records.
But heavy clients would be assigned multiple values for part, up to the number of shards. Every record for that client would set its part to one of these values. Assign them either randomly or round-robin, however you think is most efficient. The point being to use each part with approximately even frequency.
Your hashing algorithm for mapping clients to a shard would then use the client id + the part attribute. So each simple client would still store all their data on a single shard. But heavy clients will distribute their data over multiple shards.
This does mean that for the heavy clients, a read query would need to search multiple shards. Code your searches to loop over the part values for the client. For most clients, this loop will only need to execute once. For the heavy clients, the loop will execute once for each part value associated with that client.
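A small Python illustration of that scheme; the shard count, client ids, and part assignments below are hypothetical:

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count

def shard_for(client_id: str, part: int) -> int:
    """Deterministically map (client id, part) to a shard."""
    digest = hashlib.sha1(f"{client_id}:{part}".encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Simple clients get a single part; heavy clients get several.
parts_by_client = {"small-client": [0], "heavy-client": [0, 1, 2, 3]}

def shards_to_query(client_id: str) -> set:
    """Reads loop over the client's parts; most clients hit one shard."""
    return {shard_for(client_id, p) for p in parts_by_client[client_id]}

print(shards_to_query("small-client"))  # one shard
print(shards_to_query("heavy-client"))  # up to four shards
```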
To be honest, I've never seen a load so great that this would be necessary. It's more likely that the traffic for one client is too much for one database instance because the queries are not optimized well, or the application is running more queries than it should. It's important to make sure you analyze query efficiency before you make your sharding architecture more complex.
You've tagged your question with cockroachdb so you probably already suspect this, but CockroachDB handles sharding transparently. If your primary key is composite and the first column is the client id, data with the same client id will all fall in a contiguous key range, and therefore be generally stored on the same node. If a range gets bigger than a configurable limit, and/or gets much more traffic, CockroachDB will automatically split the range to rebalance storage and traffic across nodes. You'll mostly not have to pay attention to this, and for your pattern you won't want to do any explicit sharding. However, if you do need to inspect or tweak the behavior there are tools to do so such as SHOW RANGES.
It is best to explain my question in terms of a concrete example.
Consider an order management application that restaurants use to receive orders from their customers. I have a table called orders which stores all of them.
Now the table keeps growing in size every day, but the amount of data accessed stays roughly constant. Generally the restaurants are only interested in orders received in the last day or so. After 100 days, for example, 'interesting' data is only about 1/100 of the table size; after 1 year it's 1/365 and so on.
Of course, I want to keep all the old orders, but performance for applications that are only interested in current orders keeps reducing. So what is the best way to not have old data interfere with the data that is 'interesting'?
From my limited database knowledge, one solution that occurred to me was to have two identical tables - order_present and order_past - within the same database. New orders would come into order_present, and a cron job would transfer all processed orders older than two days to order_past, keeping the size of order_present constant.
Is this considered an acceptable solution to this problem? What other solutions exist?
Database servers are pretty good at handling volume, but performance can be limited by the physical hardware. If IO latency is what is bothering you, there are several solutions available; you really need to evaluate which fits your use case best.
For example:
you can partition the table to distribute it across multiple physical disks (see the sketch after this list)
you can shard to put data onto different physical servers
you can evaluate using another storage engine that best fits your data and application; MyISAM delivers better read performance than InnoDB at the cost of being less ACID compliant
you can use read replicas to delegate all (or most) SELECT queries to replicas (slaves) of the main database server (master)
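As a rough sketch of the partitioning option, an orders table could be range-partitioned by date so that queries on recent orders only touch the newest partitions (partition pruning). The column and partition names below are assumptions; the script only prints the DDL you would run against MySQL:

```python
# Sketch only: assumed column/partition names; run the printed DDL against
# your MySQL server (the partitioning column must be part of every unique key).
ddl = """
CREATE TABLE orders (
    id BIGINT NOT NULL AUTO_INCREMENT,
    restaurant_id INT NOT NULL,
    created_at DATETIME NOT NULL,
    total DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2023    VALUES LESS THAN (TO_DAYS('2024-01-01')),
    PARTITION p2024_h1 VALUES LESS THAN (TO_DAYS('2024-07-01')),
    PARTITION p2024_h2 VALUES LESS THAN (TO_DAYS('2025-01-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);
"""
print(ddl)
```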
Finally, MySQL Performance Blog is a great resource on this topic.
We have a product that uses different MySQL schemas for different customers, and a single Java application that uses a different persistence unit for every customer. This makes it difficult to add a customer without redeploying the application.
We are planning to use a single MySQL database schema that holds all the customers, with each table having a key field identifying the customer, so that adding a new customer is a matter of a few SQL updates/inserts.
What is the best approach to handling this kind of data in MySQL? Does MySQL provide any way of partitioning tables by key, or something like that? And what could be the performance issues of that approach?
There are a few questions here:
Schema Design Question
Partitioning question
Can MySQL handle a hash-map-style query in O(1)?
Schema Design Question:
Yes, this is much better than launching a new app per customer.
Can MySQL handle a hash-map-style query in O(1)?
Yes. If the data remains in memory and there are enough CPU cycles, MySQL can easily do 300K selects a second. Otherwise, if the workload is I/O bound and the disk subsystem is not saturated, MySQL can easily do 20-30K selects per second, depending on the traffic pattern, concurrency, and how many IOPS the database's disk subsystem can deliver.
Partitioning
Partitioning means different things in the context of MySQL. Partitioning is a layer that sits on top of a storage engine and allocates a table's data across a group of partition tables while exposing them to the calling application as a single table. Partitioning can also mean having certain databases hold a subset of all tables. In your context I think you are asking about the performance impact of federating by customer, i.e. allocating a separate database per customer, with the same schema, if necessary. That concept is closer to sharding: taking the data as a whole and allocating resources per unit of data, e.g. per customer.
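On the original question of "partitioning tables by key": MySQL does offer KEY/HASH partitioning on the customer column. A hedged sketch with made-up table and column names (the script just prints the DDL):

```python
# Sketch only: hypothetical names. With KEY partitioning, rows for a given
# customer_id always land in the same partition, and the partitioning column
# must be included in every unique key (here it is part of the primary key).
ddl = """
CREATE TABLE customer_data (
    customer_id INT NOT NULL,
    item_id BIGINT NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (customer_id, item_id)
)
PARTITION BY KEY (customer_id)
PARTITIONS 16;
"""
print(ddl)
```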
My suggestion to you
Make the schema the same for every customer. Benchmark all the queries a customer would run, i.e. the query patterns. Verify with EXPLAIN that no query produces a filesort or temporary table, or scans 100K rows at a time, and you should be able to scale with no problem. Once you run into issues with a single box or set of boxes getting close to its IOPS ceiling, think about splitting the data.
I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores them in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting, especially because of its performance, which is crucial in my application. I know that MySQL can be fast too; I just feel that implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of divisions and multiplications with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision either toward MySQL or Redis?
Thank you!
Redis is an in-memory store: all the data must fit in memory. So unless you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space efficient time series: https://github.com/antirez/redis-timeseries
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent for aggregating statistics in real time and storing the results of these calculations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language for performing offline calculations on these data. B-tree based stores that support sharding (MongoDB, for instance) are probably more convenient than Redis for storing large time series.
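For illustration, one simple (and simplistic) way to keep per-symbol quotes in Redis is a sorted set scored by timestamp. This is a sketch using redis-py with made-up key names, not the approach from the repositories linked above, and it assumes a Redis server on localhost:

```python
import time

import redis  # redis-py; assumes a server on localhost:6379

r = redis.Redis(host="localhost", port=6379)

def record_quote(symbol, price, ts=None):
    """Store one quote in a per-symbol sorted set, scored by timestamp."""
    ts = ts if ts is not None else time.time()
    # Member encodes timestamp and price so repeated prices stay distinct.
    r.zadd(f"quotes:{symbol}", {f"{ts}:{price}": ts})

def quotes_between(symbol, start, end):
    """Range query by time; about the only 'query' Redis gives you here."""
    return r.zrangebyscore(f"quotes:{symbol}", start, end)

record_quote("AAPL", 189.34)
print(quotes_between("AAPL", time.time() - 60, time.time()))
```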
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files and is space efficient. It can be used with partitioning, so even though it does not index the data, it can be efficient at retrieving a subset of data if the partition granularity is chosen carefully.
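A sketch of that combination with made-up names; note this assumes a MySQL version (5.1-5.7) where the generic partitioning handler still works with ARCHIVE, since native partitioning in MySQL 8.0 is limited to InnoDB and NDB:

```python
# Sketch only: prints the DDL; ARCHIVE supports INSERT/SELECT but not
# UPDATE/DELETE, so it suits append-only historical quotes.
ddl = """
CREATE TABLE quote_history (
    symbol CHAR(8) NOT NULL,
    quoted_at DATETIME NOT NULL,
    price DECIMAL(18,6) NOT NULL
) ENGINE=ARCHIVE
PARTITION BY RANGE (YEAR(quoted_at)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p_max VALUES LESS THAN MAXVALUE
);
"""
print(ddl)
```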
You should consider Cassandra or HBase. Both allow contiguous storage and fast appends, so when it comes to querying you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is that along one of your query dimensions (usually the ticker), you're accessing disk (SSD or spinning) contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, whereas you get it "for free" out of the box with the columnar stores, without having to do any client-side shenanigans to merge blobs together.
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative on Cassandra, at least, is that with a big cluster you sometimes don't have consistency for a few seconds, so you either need to force it, slowing it down, or accept that the very latest print will occasionally be a few seconds old. On a single machine there will be no consistency problems, and you'll get the same columnar benefits.
I'm less familiar with HBase, but it claims to be more consistent (there will be a cost elsewhere - CAP theorem); however, it's much more of a commitment to set up the HBase stack.
You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (and also removing old data regularly), check out MySQL partitioning. See e.g. the answer I wrote to What is the best way to delete old rows from MySQL on a rolling basis?
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to
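The rolling prune described in the quoted documentation then boils down to statements like these (a sketch; the table and partition names assume monthly RANGE partitions and are hypothetical, and if the table has a MAXVALUE catch-all partition you would use REORGANIZE PARTITION instead of ADD PARTITION):

```python
# Sketch only: hypothetical names; typically driven by cron or the MySQL
# Event Scheduler, as in the posts linked above.
drop_oldest = "ALTER TABLE quotes DROP PARTITION p2023_01;"
add_newest = (
    "ALTER TABLE quotes ADD PARTITION "
    "(PARTITION p2024_07 VALUES LESS THAN (TO_DAYS('2024-08-01')));"
)
print(drop_oldest)
print(add_newest)
```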
I'd like to get feedback on how to model the following:
Two main objects: collections and resources.
Each user has multiple collections. I'm not saving user information per se: every collection has a "user ID" field.
Each collection comprises multiple resources.
Any given collection belongs to only one user.
Any given resource may be associated with multiple collections.
I'm committed to using MySQL for the time being, though there is the possibility of migrating to a different database down the road. My main concern is scalability with the following assumptions:
The number of users is about 200 and will grow.
On average, each user has five collections.
About 30,000 new distinct resources are "consumed" daily: when a resource is consumed, the application associates that resource with every collection that is relevant to it. Assume that a resource is typically relevant to about half of the 1,000 collections (200 users x 5 collections each), so that's 30,000 x 500 = 15,000,000 inserts a day.
The collection and resource objects are both composed of about a half-dozen fields, some of which may reach lengths of 100 characters.
Every user has continual polling set up to periodically retrieve their collections and associated resources--assume that this happens once a minute.
Please keep in mind that I'm using MySQL. Given the expected volume of data, how normalized should the data model be? Would it make sense to store this data in a flat table? What kind of sharding approach would be appropriate? Would MySQL's NDB clustering solution fit this use case?
Given the expected volume of data, how normalized should the data model be?
Perfectly.
Your volumes are small. You're doing 10,000 to 355,000 transactions each day? Let's assume your peak usage falls in a 12-hour window (43,200 seconds); that's 0.23/sec up to about 8/sec. Until you get to rates like 30/sec (over 1 million rows in a 12-hour period), you've got little to worry about.
Would it make sense to store this data in a flat table?
No.
What kind of sharding approach would be appropriate?
Doesn't matter. Pick any one that makes you happy.
You'll need to test these empirically. Build a realistic volume of fake data, write some benchmark transactions, and run under load to benchmark the sharding alternatives.
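A throwaway harness along those lines might look like this; it uses SQLite purely as a stand-in, with made-up table names, and in practice you would point the same loop at each candidate MySQL/sharding layout:

```python
import random
import sqlite3
import time
import uuid

# Stand-in normalized schema: collections, resources, and a join table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collection (id INTEGER PRIMARY KEY, user_id TEXT);
    CREATE TABLE resource (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE collection_resource (
        collection_id INTEGER, resource_id INTEGER,
        PRIMARY KEY (collection_id, resource_id)
    );
""")

# 200 users x 5 collections each = 1,000 collections, as in the question.
conn.executemany(
    "INSERT INTO collection VALUES (?, ?)",
    [(i, str(uuid.uuid4())) for i in range(1000)],
)

start = time.time()
for rid in range(1000):  # scale up to ~30,000 for a full simulated day
    conn.execute("INSERT INTO resource VALUES (?, ?)", (rid, f"res-{rid}"))
    relevant = random.sample(range(1000), 500)  # ~half of the collections
    conn.executemany(
        "INSERT INTO collection_resource VALUES (?, ?)",
        [(cid, rid) for cid in relevant],
    )
elapsed = time.time() - start
print(f"{1000 * 501} inserts in {elapsed:.1f}s ({1000 * 501 / elapsed:.0f}/s)")
```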
Would MySQL's NDB clustering solution fit this use case?
It's doubtful. You can often create a large-enough single server to handle this load.
This doesn't sound anything like the requirements of your problem:
MySQL Cluster is designed not to have any single point of failure. In a shared-nothing system, each component is expected to have its own memory and disk, and the use of shared storage mechanisms such as network shares, network file systems, and SANs is not recommended or supported.