Database design with millions of entries


Suppose there is a messaging system. It has millions of entries to be sent and reported on, and the count is growing by 100K every hour. Two services access the db: one is the sender, the other is the reporter. What would you suggest to get maximum performance? How should the db be designed?
Also, which open source database would you suggest among MySQL, PostgreSQL, MongoDB etc. to fulfill this high-volume requirement?
Thanks

You've not really provided much information on your requirements other than a few comments about expected data volumes. Simply storing large volumes of data has no real intrinsic value; it's the ability to access that data which gives the real value, so knowing how you expect to retrieve information from the database is more important than how much data you want to store.
Do these messages really require a document db like MongoDB, or are they structured enough to use a straight RDBMS like PostgreSQL or MySQL? Do you need full text search capability? How often are queries executed against this message data, and of what type? Are you trying to write your own Twitter?
If those are your current data volumes, look to using db replication for resilience. Consider partitioning your message table, perhaps by date posted. Use master/slave (or even multi-master/multi-slave) as Konerak has suggested. Look at the possibilities of an archive table for older messages that are less likely to be queried, but which are then still available. Look at what a commercial database like Oracle can offer you. Get in a professional to help tune the db for performance, rather than simply asking for free advice on sites like SO.
Consider your hardware as well... multiple load balanced servers to help with the volumes (we have 14 dedicated servers purely for accepting new messages, and three high performance servers tuned for querying the data).
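The archive-table idea mentioned above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for MySQL or PostgreSQL, and the table names (messages, messages_archive), the columns and the cutoff date are all assumptions, not anything from the question.

```python
import sqlite3


def archive_old_messages(conn, cutoff):
    """Move messages posted before `cutoff` (ISO date string) to the archive table."""
    with conn:  # copy and delete are committed together, or rolled back together
        conn.execute(
            "INSERT INTO messages_archive (id, posted_on, body) "
            "SELECT id, posted_on, body FROM messages WHERE posted_on < ?",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM messages WHERE posted_on < ?", (cutoff,))
    return cur.rowcount  # number of rows removed from the live table


def demo():
    # Hypothetical schema, in-memory database used purely for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, posted_on TEXT, body TEXT)"
    )
    conn.execute(
        "CREATE TABLE messages_archive (id INTEGER PRIMARY KEY, posted_on TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO messages (posted_on, body) VALUES (?, ?)",
        [("2023-01-15", "old"), ("2023-02-20", "also old"), ("2024-06-01", "recent")],
    )
    moved = archive_old_messages(conn, "2024-01-01")
    live = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    archived = conn.execute("SELECT COUNT(*) FROM messages_archive").fetchone()[0]
    return moved, live, archived
```

The live table stays small for the sender and reporter services, while older rows remain queryable in the archive. On a real server you would probably run this in batches during quiet periods.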

Related

Database for handling large data

We have started a new project using MySQL, Spring Boot, and AngularJS. Initially, we did not realize that our DB was going to handle large data.
The number of tables will not be large (<130); only 10 to 20 tables will hold most of the data, and those are almost constantly inserted into, read, and updated.
The estimated amount of data in those 10 tables is going to grow by 12,00,000 (1.2 million) records a month, and we should not delete that data, so that we are able to run various reports.
There needs to be a (read-only) replicated database as a backup/failover, and maybe for offloading reports at peak times.
I don't have first-hand experience with databases that large, so I'm asking those who do which DB is the best choice in this situation. We have completed 100% of the coding and development, and only now have we realized this. I doubt whether MySQL can handle large data. I know that Oracle is the safe bet, but I'm interested in whether MySQL would do with a similar setup. I am not bound to MySQL, though; I am OK with any DB, and based on all your feedback I can make a decision.
An open source DB is preferable, but it's not mandatory; we can go for a paid DB as well.

Handling Large Data
MySQL is more than capable of handling such loads. In fact, it is capable of handling much much more load than what you are talking about. You just have to create the right kind of tables. You can do that by choosing
the correct storage engine for your use-case
the correct character set
the optimal data type for your column
the right indexing strategy - creating indexes thoughtfully
the right partitioning strategy (if the data in the table exceeds tens of millions of records)
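To make the indexing point in the list above a bit more concrete, here is a small sketch of how adding an index changes the way a lookup is executed. It uses Python's built-in sqlite3 rather than MySQL (an assumption made so the snippet runs anywhere), and the table, column and index names are invented for illustration. In MySQL you would inspect the plan with EXPLAIN instead.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan descriptions SQLite reports for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column is the readable detail


# Hypothetical table, filled with some sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, recipient_id INTEGER, status TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO messages (recipient_id, status, body) VALUES (?, ?, ?)",
    [(i % 100, "sent", f"message {i}") for i in range(1000)],
)

lookup = "SELECT id, body FROM messages WHERE recipient_id = ?"

# Without an index the only option is to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# With an index on the filtered column the engine can search instead of scan.
conn.execute("CREATE INDEX idx_messages_recipient ON messages (recipient_id)")
plan_after = query_plan(conn, lookup, (42,))

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between engines and versions, but the second plan should mention idx_messages_recipient while the first cannot.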
EDIT: You've also got to choose the right kind of data modelling and normalization strategy for your use-case. Most OLTP applications require some level of normalization. But if you want to do analytics and aggregates on heavy tables, you should either have a Data Warehouse or have highly denormalized tables to avoid joins, and/or have a column-oriented database to support such queries.
MySQL is open-source and has very strong community support, so you will find a lot of literature around any issue that you face. You can also find all the filed bugs (resolved and unresolved) in the public MySQL bug tracker.
As far as the number of tables is concerned, there's really no cap on that. According to the MySQL documentation, InnoDB permits up to 4 billion tables.
A lot of very big companies with scale use MySQL in some capacity. Facebook is one of them.
Native JSON Support
With the growing popularity of JSON as the de facto data exchange format across the internet, MySQL has also provided native JSON support in 5.7, so now you can store and query JSON from your APIs, if required.
HA and Replication
MySQL Replication works! Earlier, MySQL supported only coordinate-based replication (binary log file and position), but now it also supports GTID replication, which makes it easier to maintain replication and fix replication issues. There are third-party replicators available as well. For instance, Continuent's Tungsten is a replicator written in Java that replaces native replication. It comes with a lot of configuration options which are not available with native MySQL replication.

I agree with MontyPython: MySQL can do it, and the design is critical. Fortunately MySQL allows you to be flexible over time as needed.
I've had history tables used in daily reporting that grew to over a billion records in plain MySQL and had no problems.
I've also used MySQL MERGE tables to divide up tables with big-ish rows (100KB+) to speed things up, basically keeping the individual merge table file sizes under 30GB each. However, that solution increases the open file count (in the system) per client, which might be a bigger deal on a clustered system. That one was not.
That said, I'd like to give an honorable mention to:
MariaDB - MySQL but with contributions from Facebook, Alibaba, Google, and more.
I've moved most of my MySQL community edition projects over to MariaDB and have been very happy. It's an almost transparent upgrade.
They offer an interesting enterprise Big Data Analytics (MariaDB AX) package, but with your current requirements it's probably overkill, and the standard community edition will fulfill your needs.
For example, here's an informative tutorial on how to set up a scalable Cluster (Galera) and adding MaxScale for High Availability:
https://mariadb.com/resources/blog/getting-started-mariadb-galera-and-mariadb-maxscale-centos
Another interesting option is Vitess, developed at YouTube, which allows for sharded MySQL through a (mostly) driver-based solution. It solves the problem of needing access to huge amounts of data while always yielding good performance. As such, it goes beyond high availability and focuses on a solution wherein no single query (i.e. a report against millions of rows of historical data) can negatively impact the other queries needing to be performed.

MySQL & Memcached for large datasets?

For a customer I am currently investigating improvements to their database structure.
My customer offers holiday rentals on their website.
On their front page they have a search function which sends a query to a MySQL database architecture (Master-Master setup) that answers that query with all the holiday rentals that the customer is interested in.
Due to the growth of the company and the increasing load on their servers, the search queries are currently taking 10+ seconds to run, mainly because the queries end with an ORDER BY, which causes MySQL to create a temp table and sort all the data. An average search query can return up to 20k holiday homes.
Of course, one of the things we are doing is investigating the queries, rewriting them and adding indexes where needed. Unfortunately we are unable to get a lot more performance under these circumstances.
That's why we are looking into implementing Memcached on top of MySQL to cache these large datasets in memory for faster retrieval. Unfortunately the datasets that the queries return are quite large, which makes Memcached not that effective at this point. The arrays that MySQL returns are currently about 15k rows with about 60 values per row.
Memcached is interesting because we want to drastically improve the search function and lower the load on the MySQL platform. This would make it more scalable.
I am wondering if there is anyone who is familiar with (long-term) caching of MySQL data in Memcached and with making it more effective for large datasets?
Thanks a bunch!

Memcache is for storing key-value pairs, not for large sets of data. Will it work? Yes. Of course it will. But with how much data you guys are going to throw at it, you're going to run out of memory very soon and end up hitting the database anyway, given how often your search results may change. And remember that just because it's memcache doesn't mean it doesn't have to go through network sockets to a (most likely) different machine. Your problem seems to be that you're using MySQL for something it was never designed to do well, which is to act as a search engine. No matter how many things you optimize, all you're doing is raising the ceiling an inch at a time.
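A rough back-of-the-envelope check of the memory concern, using the figures from the question (about 15k rows with about 60 values per row). The average of 20 bytes per value is purely an assumption for the sake of the arithmetic, so measure your own serialized rows before drawing conclusions. The 1 MB figure is memcached's default maximum item size, which can be raised with a startup option.

```python
ROWS_PER_RESULT = 15_000        # from the question
VALUES_PER_ROW = 60             # from the question
ASSUMED_BYTES_PER_VALUE = 20    # assumption: average serialized size of one value
MEMCACHED_DEFAULT_ITEM_LIMIT = 1024 * 1024  # default max item size (1 MB)


def result_set_bytes(rows, values_per_row, bytes_per_value):
    """Estimate the serialized size of one cached search result."""
    return rows * values_per_row * bytes_per_value


size = result_set_bytes(ROWS_PER_RESULT, VALUES_PER_ROW, ASSUMED_BYTES_PER_VALUE)

# Ceiling division: how many default-sized items one result would need.
chunks_needed = -(-size // MEMCACHED_DEFAULT_ITEM_LIMIT)

print(f"one result set: about {size / 1e6:.0f} MB, {chunks_needed} items at the default limit")
```

Under that assumption a single search result is roughly 18 MB, so every distinct combination of search filters would cost about that much cache memory, which is why caching whole result sets does not go very far.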
I could take this post in a "you need to optimize MySQL parameters so that it doesn't have to create those temp tables" direction, but I'm going to assume you've already looked into that and keep going.
My recommendation is that you implement something on top of MySQL to handle the searching. In my own quest for fast searching, these are the solutions I gave the most weight to:
Sphinx: http://sphinxsearch.com
Solr: http://lucene.apache.org/solr
Elasticsearch: http://www.elasticsearch.org
You'll find plenty of resources here on StackOverflow for which of those is better and faster and what not. For our purposes, we picked Elasticsearch for one of our projects and Solr for another.

Suggestion for write frequently, read rarely database

I have an application server which writes frequently to a database and reads the data back in the near future, but after that the data entry is read very rarely.
What are some good databases optimised for this kind of access? I am currently using MongoDB but I think that probably isn't the best choice in this case.
I am open to relational DBs (i.e. MySQL), MongoDB, Redis, etc.
P.S. Seems it's easy to answer this question for read frequently DB access, but hard to find information on this specific case.

This is a very generic question. We need to know more details:
Size of the database
Data growth: how much, e.g. 10 GB per day or 200 GB per month?
Is it an OLTP application or an OLAP application?
What is the maximum number of concurrent transactions / users?
Apart from that, since you have mentioned that data is rarely read beyond a certain point:
You can always look at options for archival (cleaning up based on duration, on a monthly or yearly basis)
Partitioning is another option, for faster retrieval
Again, the choice between SQL and NoSQL is based on:
Consistency
If you have a fixed schema, I would suggest you go for a relational DB
Concurrency aspects: based on your needs you have to decide between SQL and NoSQL (for example, for online banking I would suggest an RDBMS; for storing product reviews/comments for a site, I am OK with NoSQL, as this does not need any concurrency handling)
You need to provide more details on your database needs in terms of functionality, data volumes, data usage and growth aspects.
Hope it helps...

Since you mention MySQL, you might want to look at the ARCHIVE storage engine.

how to do fast read data and write data in mysql?

Hi Friends
I am using a MySQL DB for one of my products. About 250 schools are signed up for it now, and it sees about 1,500,000 insertions per hour and about 12,000,000 insertions per day. I think my current setup, just a single server, may crash within hours. The read volume is about the same as the write volume. How can I make the DB server crash-free? The main problem I am facing now is that both writing and reading data are slow; how can I overcome that? It is very difficult for me to find a solution. Guys, please help me. Which model is good for solving this?

It is difficult to get both fast reads and writes simultaneously. To get fast reads you need to add indexes. To get fast writes you need to have few indexes. And to get both to be fast they must not lock each other.
Depending on your needs, one solution is to have two databases. Write new data to your live database and every so often when it is quiet you can synchronize the data to another database where you can perform queries. The disadvantage of this approach is that data you read will be a little old. This may or may not be a problem depending on what it is you need to do.
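The two-database approach described above could look roughly like this. It is a minimal sketch under several assumptions: the table name (events) and its columns are made up, sqlite3 stands in for two MySQL servers, and the data is append-only with a growing integer id, so updates or deletes on the live side would not be carried over.

```python
import sqlite3

SCHEMA = "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)"


def sync_new_rows(live, reporting):
    """Copy rows the reporting database has not seen yet; return how many were copied."""
    last_id = reporting.execute("SELECT COALESCE(MAX(id), 0) FROM events").fetchone()[0]
    rows = live.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    with reporting:  # commit the whole batch at once
        reporting.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", rows)
    return len(rows)


def demo():
    live = sqlite3.connect(":memory:")       # takes the writes, few indexes
    reporting = sqlite3.connect(":memory:")  # takes the queries, may be indexed heavily
    live.execute(SCHEMA)
    reporting.execute(SCHEMA)

    live.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)])
    first = sync_new_rows(live, reporting)

    live.executemany("INSERT INTO events (payload) VALUES (?)", [("d",), ("e",)])
    second = sync_new_rows(live, reporting)

    total = reporting.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return first, second, total
```

Anything written after the last sync is invisible to the reporting side until the next run, which is the "a little old" trade-off mentioned above.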

~500 inserts per second is nothing to sneeze at indeed.
For a flexible solution, you may want to implement some sort of sharding. Probably the easiest solution is to separate schools into groups upfront and store data for different groups of schools on different servers. E.g., data for schools 1-10 is stored on server A, schools 11-20 on server B, etc. This is almost infinitely scalable, assuming that there are few relationships between data from different schools.
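The grouping described above (schools 1-10 on server A, schools 11-20 on server B, and so on) comes down to a small routing function in the application. The server names and the group size of 10 are just the example values from the paragraph; how you actually open the connection is left out.

```python
SCHOOLS_PER_SHARD = 10
SHARD_SERVERS = ["server-A", "server-B", "server-C"]  # hypothetical host names


def shard_for_school(school_id):
    """Return the server that stores all data for the given school."""
    if school_id < 1:
        raise ValueError("school ids are expected to start at 1")
    index = (school_id - 1) // SCHOOLS_PER_SHARD
    if index >= len(SHARD_SERVERS):
        raise LookupError(f"no shard configured for school {school_id}")
    return SHARD_SERVERS[index]


print(shard_for_school(7))   # a school in the first group
print(shard_for_school(11))  # a school in the second group
```

Adding capacity then means appending a server to the list. Queries that span schools on different servers have to be run per shard and combined in the application.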
Also you could just try throwing more horsepower at the problem and invest in a RAID array of SSDs; assuming that you have enough processing power, you should be OK. Of course, if it's a huge database, the capacity of the SSDs may not be enough.
Finally, see if you can cut down on the number of insertions, for example by denormalizing the database. Say, instead of storing attendance for each student in a separate row put attendance of the entire class as a vector in a single row. Of course, such changes will heavily limit your querying capabilities.
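As a small illustration of the attendance example above: instead of one row per student, the whole class can be stored as a single value. The encoding here (one character per student, "P" for present and "A" for absent, in a fixed seating order) is an assumption chosen for readability; a bit field would be more compact.

```python
def pack_attendance(present_flags):
    """Encode one class's attendance as a string, one character per student."""
    return "".join("P" if present else "A" for present in present_flags)


def is_present(packed, seat_index):
    """Look up a single student by position in the packed value."""
    return packed[seat_index] == "P"


def absent_count(packed):
    """Count absent students in one class."""
    return packed.count("A")


# One row per class per day instead of one row per student per day.
packed = pack_attendance([True, False, True, True])
print(packed, is_present(packed, 1), absent_count(packed))
```

For a class of 30 this turns 30 inserts into one, but a question like "on which days was student X absent" can no longer be answered with a simple indexed query, which is the limitation mentioned above.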

My laid back advice is:
Build your application lightweight. Don't use a high-level database abstraction layer like Active Record. They suck at scaling.
Learn a lot about MySQL performance.
Learn about mysql replication.
Learn about load balancing.
Learn about in memory caches. (memcached)
Hire an administrator (with decent mysql knowledge) or web app performance guru/consultant.
The concrete strategy depends on your application and how it is used. MySQL replication may or may not be appropriate (the same applies to the sharding strategy mentioned above). But it's a rather simple way to achieve some scaling, because it doesn't impact your application design too much. In-memory caches can keep some load away from your databases, but they need some work to apply and come with some trade-offs. In the end you need a good overall understanding of how to handle a database-driven application under heavy load. If you have a tight deadline, add external manpower, because you won't do this right within 6 weeks without experience.

Alternatives to traditional relational databases for activity streams

I'm wondering if some other non-relational database would be a good fit for activity streams - sort of like what you see on Facebook, Flickr (http://www.flickr.com/activity), etc. Right now, I'm using MySQL but it's pretty taxing (I have tens of millions of activity records) and since they are basically read-only once written and always viewed chronologically, I was thinking that an alternative DB might work well.
The activities are things like:
6 PM: John favorited Bacon
5:30 PM: Jane commented on Snow Crash
5:15 PM: Jane added a photo of Bacon to her album
The catch is that unlike Twitter and some other systems, I can't just simply append activities to lists for each user who is interested in the activity - if I could it looks like Redis would be a good fit (with its list operations).
I need to be able to do the following:
Pull activities for a set or subset of people who you are following ("John" and "Jane"), in reverse date order
Pull activities for a thing (like "Bacon") in reverse date order
Filter by activity type ("favorite", "comment")
Store at least 30 million activities
Ideally, if you added or removed a person who you are following, your activity stream would reflect the change.
I have been doing this with MySQL. My "activities" table is as compact as I could make it, the keys are as small as possible, and it is indexed appropriately. It works, but it just feels like the wrong tool for this job.
Is anybody doing anything like this outside of a traditional RDBMS?
Update November 2009: It's too early to answer my own question, but my current solution is to stick with MySQL but augment with Redis for fast access to the fresh activity stream data. More information in my answer here: How to implement the activity stream in a social network...
Update August 2014: Years later, I'm still using MySQL as the system of record and using Redis for very fast access to the most recent activities for each user. Dealing with schema changes on a massive MySQL table has become a non-issue thanks to pt-online-schema-change
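To make two of the requirements listed above concrete (activities for a set of followed people in reverse date order, optionally filtered by type), here is a minimal read-time merge in plain Python. It is not the asker's implementation: the in-memory dict stands in for whatever store holds each person's recent activities, and the integer timestamps are invented for the example.

```python
import heapq
from itertools import islice

# Hypothetical per-person activity lists, each already sorted newest first.
# Tuple layout: (timestamp, actor, activity_type, thing)
ACTIVITIES = {
    "John": [(1800, "John", "favorite", "Bacon")],
    "Jane": [
        (1730, "Jane", "comment", "Snow Crash"),
        (1715, "Jane", "photo", "Bacon"),
    ],
}


def stream_for(following, activity_type=None, limit=20):
    """Merge the followed people's lists into one newest-first stream."""
    per_actor = (ACTIVITIES.get(name, []) for name in following)
    # heapq.merge is lazy; reverse=True expects inputs sorted largest first.
    merged = heapq.merge(*per_actor, reverse=True)
    if activity_type is not None:
        merged = (a for a in merged if a[2] == activity_type)
    return list(islice(merged, limit))


for activity in stream_for(["John", "Jane"]):
    print(activity)
```

Because the stream is computed when it is read, following or unfollowing someone is reflected immediately. The cost is one list lookup per followed person on every read.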

I'd really, really suggest staying with MySQL (or an RDBMS) until you fully understand the situation.
I have no idea how much performance you need or how much data you plan on storing, but 30M rows is not very many.
If you need to optimise certain range scans, you can do this with (for example) InnoDB by choosing a (implicitly clustered) primary key judiciously, and/or denormalising where necessary.
But like most things, make it work first, then fix performance problems you detect in your performance test lab on production-grade hardware.
EDIT: Some other points:
Key/value databases such as Cassandra, Voldemort etc. do not generally support secondary indexes
Therefore, you cannot do a CREATE INDEX
Most of them also don't do range scans (even on the main index) because they're using hashing to implement partitioning (which they mostly do).
Therefore they also don't do range expiry (DELETE FROM tbl WHERE ts < NOW() - INTERVAL 30 DAY)
Your application must do ALL of this itself or manage without it; secondary indexes are really the killer
ALTER TABLE ... ADD INDEX takes quite a long time in e.g. MySQL with a large table, but at least you don't have to write much code to do it. In a "nosql" database, it will also take a long time BUT also you have to write heaps and heaps of code to maintain the new secondary index, expire it correctly, AND modify your queries to use it.
In short... you can't use a key/value database as a shortcut to avoid ALTER TABLE.
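To give a feel for the "heaps of code" point above, here is a toy sketch of what the application has to do itself when the store only offers get/put by key. The class and field names are invented, and a plain dict stands in for the key/value database. A real system would also have to cope with failures between the two writes.

```python
from collections import defaultdict


class KeyValueStoreWithIndex:
    """Key/value store plus a hand-maintained secondary index on record['type']."""

    def __init__(self):
        self._rows = {}                   # primary key -> record
        self._by_type = defaultdict(set)  # secondary index: type -> set of keys

    def put(self, key, record):
        old = self._rows.get(key)
        if old is not None:
            # The old index entry must be removed or lookups return stale keys.
            self._by_type[old["type"]].discard(key)
        self._rows[key] = record
        self._by_type[record["type"]].add(key)

    def delete(self, key):
        old = self._rows.pop(key, None)
        if old is not None:
            self._by_type[old["type"]].discard(key)

    def find_by_type(self, activity_type):
        """Return the keys of all records with the given type, sorted."""
        return sorted(self._by_type.get(activity_type, ()))


store = KeyValueStoreWithIndex()
store.put(1, {"type": "favorite", "thing": "Bacon"})
store.put(2, {"type": "comment", "thing": "Snow Crash"})
store.put(1, {"type": "comment", "thing": "Bacon"})  # update moves key 1 in the index
print(store.find_by_type("comment"))
```

Every write path has to remember to touch the index, and every new kind of query needs another index like this one, which in an RDBMS is a single CREATE INDEX.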

I am also planning on moving away from SQL. I have been looking at CouchDB, which looks promising. Looking at your requirements, I think all can be done with CouchDB views, and the list api.

It seems to me that what you want to do -- query a large set of data in several different ways and order the results -- is exactly and precisely what RDBMSes were designed for.
I doubt you would find any other datastore that would do this as well as a modern commercial DBMS (Oracle, SQL Server, DB2 etc.) or any open source tool that would accomplish this any better than MySQL.
You could have a look at Google's BigTable, which is really a relational database but it can present an 'object'y personality to your program. It's exceptionally good for free format text searches, and complex predicates. As the whole thing (at least the version you can download) is implemented in Python I doubt it would beat MySQL in a query marathon.

For a project I once needed a simple database that was fast at doing lookups and which would do lots of lookups and just an occasional write. I just ended up writing my own file format.
While you could do this too, it is pretty complex, especially if you need to support it from a web server. With a web server, you would at least need to protect every write to the file and make sure it can be read from multiple threads. The design of this file format is something you should work out as well as possible, with plenty of testing and experiments. One minor bug could prove fatal for a web project in this style, but if you get it working, it can work really well and extremely fast.
But for 99.999% of all situations, you don't want such a custom solution. It's easier to just upgrade the hardware, move to Oracle, SQL Server or InterBase, use a dedicated database server, use faster hard disks, install more memory, upgrade to a 64-bit system. Those are the more generic tricks to improve performance with the least effort.

I'd recommend learning about message queue technology. There are several open-source options available, and also robust commercial products that would serve up the volume you describe as a tiny snack.

CouchDB is schema-free, and it's fairly simple to retrieve a huge amount of data quickly, because you are working only with indexes. You are not "querying" the database each time, you are retrieving only matching keys (which are pre-sorted making it even faster).
"Views" are re-indexed every time new data is entered into the database, but this takes place transparently to the user, so while there might be a potential delay in generating an updated view, there will virtually never be any delay in retrieving results.
I've just started to explore building an "activity stream" solution using CouchDB, and because the paradigm is different, my thinking about the process had to change from the SQL thinking.
Rather than figure out how to query the data I want and then process it on the page, I instead generate a view that keys all documents by date, so I can easily create multiple groups of data, just by using the appropriate date key, essentially running several queries simultaneously, but with no degradation in performance.
This is ideal for activity streams: I can isolate everything by date, or along with date isolation I can further filter results by a particular subtype, etc., by creating a view as needed. And because the view itself is just JavaScript and all data in CouchDB is JSON, virtually everything can be done client-side to render your page.
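The "keyed by date, pre-sorted" idea can be illustrated without CouchDB at all. The sketch below is plain Python and not CouchDB's API: it keeps documents ordered by a date key as they are added, so that a date-range read is a binary search plus a slice. The class name and the ISO date strings are assumptions for the example.

```python
import bisect


class DateKeyedView:
    """Documents kept sorted by date key, so range reads need no scan or sort."""

    def __init__(self):
        self._keys = []  # sorted date keys
        self._docs = []  # documents, in the same order as _keys

    def add(self, date_key, doc):
        # Pay the ordering cost at write time, like an index being updated.
        pos = bisect.bisect_right(self._keys, date_key)
        self._keys.insert(pos, date_key)
        self._docs.insert(pos, doc)

    def range(self, start_key, end_key):
        """Return documents with start_key <= key <= end_key, oldest first."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_right(self._keys, end_key)
        return self._docs[lo:hi]


view = DateKeyedView()
view.add("2009-11-02", "Jane commented on Snow Crash")
view.add("2009-11-01", "Jane added a photo of Bacon to her album")
view.add("2009-11-03", "John favorited Bacon")
print(view.range("2009-11-01", "2009-11-02"))
```

The work is done when data is written rather than when it is read, which matches the trade-off described above: a possible delay in updating the view, but fast retrieval.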
