I have some database with big data inside it, now I am thinking how to organize them to be more scallable.
some point as my consideration is :
Security
Performance
Cost
Generally answer is welcome, because I am still didn't expected all of my problem or possibility risk will happen, it's will help me if you can give me some suggestion.
To give a full answer to your question we will need more information on how big the data is, how complex, what your use cases are (ie. do you do many joins on multiple tables or are they mostly on a single table?). In any case, here are some good pointers that would help you get on your way.
If you are expecting your data to grow rapidly, I would recommend that you look at a cloud based database solution rather than invest on physical hardware that would need replacing every so often. Cloud based solutions provide you more freedom to scale your database both vertically and horizontally. There are specialized cloud database technologies such as Amazon RedShift and recently introduced Aurora which can be configured easily as your requirements grow.
For performance improvement within the database you can always look at indexes and changes in structures. Use the explain syntax in MySQL to analyze your queries and see if the queries use temporary tables or data scans which will slow things down. Adding indexes to columns that you use for filtering or merging data increases performance drastically.
In data warehouses, you can also denormalize and pre-join tables to improve performance. Although this will drastically increase your storage use, due to the fact that you are only working with one data table increases the performance as the time taken to do the join over and over again is taken off the equation.
If you are looking at massive datasets that will grow in structure and complexity, there are other non relational database technologies such as noSQL based Hadoop, Cassandra, etc. Moving into these environments may need you to rewrite most of your application, but is something that you should consider before you find yourself in the need for such things when the data has grown too big.
EDIT
Privacy and data security as pointed out below by #Saïd Tahali in the comments. If you can't host your data outside due to legal or security reasons, you will need to invest on your own hardware that will address all of the above in-house.
Related
Right now I'm trying to choose the most appropriate approach in order to implement Audit Trail for my entities with AWS RDS MySQL database.
I have to log all entity changes including the initiator(user) who initiated these changes. One of the main criterion is performance.
Hibernate Envers looks like the easiest and the most complete solution and can be very quickly integrated. Right now I'm worried about the possible performance slowdown after Envers introducing. I saw a few posts where developers prefer approach for Audit Trail based on database triggers.
The main issue with triggers is how to get initiator(user) who initiated these changes.
Based on your experience, could you please suggest the approach for Java/Spring/Hibernate/MySQL(AWS) in order to implement Audit Trail for historical changes.
Also, do we have any solution for Audit Trail within AWS RDS MySQL database infrastructure ?
Understand that speculation about performance without concrete evidence to support one's theory is analagous to premature optimization of code. It's almost always a waste of time.
From a simple database point of view, as a table grows to a specific limit, yes it's performance will degrade, but typcally this mainly impacts queries and less on insertion/update if the table is properly indexed and queries properly formed.
But many databases support partitioning as a means to control performance concerns, particularly on larger tables. This typically involves separating a table's data across a set of boundaries defined by a partition scheme you create. You simply define what is the most relevant data and you try and store this partition on your fastest drives/storage and the less relevant, typically older, data is stored on your slower drives/storage.
You can also elect to store database tables in differing schemas/tablespaces by specifying the envers property org.hibernate.envers.default_schema. If your database supports putting schemas in different database files on the file system, you can help increase performance by allowing your entity table reads/writes not impact the reads/writes of your audit tables.
I can't speak to MySQL's support for any of these things, but I do know that MSSQL/Oracle supports partitioning very easily and Oracle for sure allows the separation of schemas across differing database files.
Having studied about relational databases, document-stores, graph databases, and column-oriented databases, I concluded that something like Cassandra best fits my needs. In particular, the ability to add columns on the fly and no requirement to have a strict schema seals the deal for me. This seems to nicely bridge the gap between a rather novel graph db and a time-tested rdbms.
But I am concerned about how running Cassandra on a single node. Like many others, I can start only with a small amount of data, so more than one node to start with is just not practical. Based on another excellent SO question: Why don't you start off with a "single & small" Cassandra server as you usually do it with MySQL? I concluded that Cassandra can indeed be run just fine as a single node, as long as one is willing to give up benefits like availability which are derived from a multi-node setup.
There also seem to be ways of implementing dynamic adding of fields in an RDBMS for instance as discussed here on SO: How to design a database for User Defined Fields? This would, to some extent, mimic schemaless-ness.
So I would now like to understand how do Cassandra and MySQL compare - with regard to features and performance, on a single node setup? What would you advise someone in my situation - start with a simple RDBMS with the plan/intent to switch to Cassandra later on? Or start with Cassandra?
In a single node setup of Cassandra, many of the advantages of Cassandra are lost, so the main reason for doing that would be if you intended to expand to multiple nodes in the future. Performance would tend to favor RDBMS in most applications when using a single node since RDBMS is designed for that environment and can assume all data is local.
The strengths of Cassandra are scalability and availability. You can add nodes to increase capacity and having multiple nodes means you can deal with hardware failures and not have downtime. These strengths come at the cost of more difficult schema design since access is based primarily on consistent hashing. It also means you don't have full SQL available and often must rely on denormalization techniques to support fast access to data. Cassandra is also weak for ACID transactions since it is inherently difficult to coordinate atomic actions on multiple nodes.
RDBMS by contrast is a more mature technology. ACID transactions are no problem. Schema design is much simpler since you can add efficient indexes to any column to optimize queries, and you have joins available so that redundant data can be largely eliminated. By eliminating redundant data it is much easier to keep your data consistent, since there are not multiple copies of data that need to be updated when someone changes their address for example. But you run the risk of running out of space on a single machine to store all your data. And if you get a disk crash you will have downtime and need backups to restore the data, while Cassandra can often easily repair the data on a node that is out of sync. There is also no easy way to scale an RDBMS to handle higher transaction rates other than buying a faster machine.
There are a lot of other differences, but those are the major ones. Neither one is better than the other, but each one may be better suited to certain applications. So it really depends on the requirements of your use case which one will be a better fit.
In my PHP application I have a 470M rows table weighing 200GB in a MySQL MyISAM partitioned table on one server. Usage includes 70% Writes/30% Reads.
I'm trying to improve performance. Main problem currently is read/write contentions due to table-level locks. I'm trying to decide between two options:
Changing MySQL to Innodb. Pros: avoiding the table level locks. Cons: Much more disk space, need bigger HDs which might not be as fast as these (currently using RAID10 6*300GB SAS 15k).
Moving data to a NoSQL db. Main Con: Learning curve. Have never used NoSQL before.
Question is, while trying to still avoid sharding the data, and considering the fact I'm using the RDMS MySQL as a simple key-value storage, are there high differences between performances between the two approaches or is the NoSQL main advantage here comes when moving to a distributed system?
I can only answer your question partially but hopefully more than a comment.
MongoDB is not typically a key-value store and has been known to have certain performance hits when used as one.
MongoDb also has a locking problem here that could come back to haunt you. It has a DB level lock atm which means it could (would need testing) cause write lock saturation.
It is also heavily designed for a 80% read app (which is said to be the most common setup for websites now-a-days) so the more writes you do the more you will notice a performance drop over time. That being said you can tweak MongoDB to be more write friendly and the distributed nature does help to stop write lock saturation a little.
However that being said my personal opinion the learning curve of MongoDB from SQL:
Was next to null
More natural and simpler to implement into my app than SQL
Query language is simple making it dead easy to get to grips with
Query language has a lot of similarities to SQL
The drivers are standardised so that the syntax you see in the Docs for the JS driver in the console is consistent across the board.
My personal opinion on the general matter is the distributed notion of it. If you get a NoSQL solution designed for key-value stores then it could be really good. A quick search on Google pulled out a small list of NoSQL key-value stores on Wikipedia: http://en.wikipedia.org/wiki/NoSQL#Key-value_stores_on_solid_state_or_rotating_disk
In our (currently MySQL) database there are over 120 million records, and we make frequent use of complex JOIN queries and application-level logic in PHP that touch the database. We're a marketing company that does data mining as our primary focus, so we have many large reports that need to be run on a daily, weekly, or monthly basis.
Concurrently, customer service operates on a replicated slave of the same database.
We would love to be able to make these reports happen in real time on the web instead of having to manually generate spreadsheets for them. However, many of our reports take a significant amount of time to pull data for (in some cases, over an hour).
We do not operate in the cloud, choosing instead to operate using two physical servers in our server room.
Given all this, what is our best option for a database?
I think you're going the wrong way about the problem.
Thinking if you drop in NoSQL that you'll get better performance is not really true. At the lowest level, you're writing and retrieving a fair chunk of data. That implies your bottleneck is (most likely) HDD I/O (which is the common bottleneck).
Sticking to the hardware you have momentarily and using a monolithic data storage isn't scalable and as you noticed - has implications when wanting to do something in real-time.
What are your options? You need to scale your server and software setup (which is what you'd have to do with any NoSQL anyway, stick in faster hard drives at some point).
You also might want to look into alternative storage engines (other than MyISAM and InnoDB - for example, one of better engines that seemingly turn random I/O to sequential I/O is TokuDB).
Implementing faster HDD subsystem would also aid to your needs (FusionIO if you have the resources to get it).
Without more information on your end (what the server setup is, what MySQL version you're using and what storage engines + data sizes you're operating with), it's all speculation.
Cassandra still needs Hadoop for MapReduce, and MongoDB has limited concurrency with regard to MapReduce...
... so ...
... 120 mio records is not that much, and MySQL should easily be able to handle that. My guess is an IO bottleneck, or you're doing lots of random reads instead of sequential reads. I'd rather hire a MySQL techie for a month or so to tune your schema and queries, instead of investing into a new solution.
If you provide more information about your cluster, we might be able to help you better. "NoSQL" by itself is not the solution to your problem.
As much as I'm not a fan of MySQL once your data gets large, I have to say that you're nowhere near needing to move to a NoSQL solution. 120M rows is not a big deal: the database I'm currently working with has ~600M in one table alone and we query it efficiently. Managing that much data from an ops perspective is the problem; querying it isn't.
It's all about proper indexes and the correct use of them when joining, and secondarily memory settings. Find your slow queries (mysql slow query log FTW!), and learn to use the explain keyword to understand whey they are slow. Then tweak your indexes so your queries are efficient. Further, make sure you understand MySQL's memory settings. There are great pages in the docs explaining how they work, and they aren't that hard to understand.
If you've done both of those things and you're still having problems, make sure disk I/O isn't an issue. Then you should look in to another solution for querying your data if it is.
NoSQL solutions like Cassandra have a lot of benefits. Cassandra is fantastic at writing data. Scaling your writes is very easy--just add more nodes! But the tradeoff is that it's harder to get the data back out. From a cost perspective, if you have expertise in MySQl, it's probably better to leverage that and scale your current solution until it hits a limit before completely switching your underlying architecture.
So I have a website that could eventually get some pretty high traffic. My DB implementation is in SQL Server 2008 at the moment. I really only have 2 tables and a few stored procs. Most of the DB could be re-designed to work without joining (although it wouldn't make sense when I can join so easily within SQL Server).
I heard that sites like Digg and Facebook use NoSQL databases for a lot of their basic data access. Is this something worth looking into, or will SQL Server not really slow me down that bad?
I use paging on my site (although this might change in the future), and I also use AJAX'd data access for most of the "live" stuff, so it doesn't really seem to be a performance hindrance at the moment, but I'm afraid it will be as the data starts expanding exponentially.
Am I going to gain a lot of performance my moving to NoSQL? Honestly, right now I don't even completely understand NoSQL, so any tips on how this will help me improve the better.
Thanks guys.
Actually Facebook use a relational database at its core, see SOCC Keynote Address: Building Facebook: Performance at Massive Scale. And so do many other web-scale sites, see Why does Quora use MySQL as the data store instead of NoSQLs such as Cassandra, MongoDB, CouchDB etc?. There is also a discussion of how to scale SQL Server to web-scale size, see How do large-scale sites and applications remain SQL-based? which is based on MySpace's architecture (more details at Scale out SQL Server by using Reliable Messaging). I'm not saying that NoSQL doesn't have its use cases, I just want to point out that there are many shades of gray between white and black.
If you're afraid that your current solution will not scale then perhaps you should look at what are the factors that prevent scalability with your current solution. Test data is cheap to produce, load the 'exponentially increased' data volume and run your test harness, see where it cracks. None of the NoSQL solutions will bring magic off-the-shelf scalability, they all require you to understand how to use them effectively and deploy them correctly. And they also require you to test with large volumes if you want to ensure success at scale. Same for traditional relational solutions.
Sql Server scales pretty well. For example, Stack Overflow used it to serve you this very page. Facebook and Google might use a form of nosql, but even if you make it really big you're unlikely to rise to that level.
With a simple table structure and data that fits on one server, it doesn't matter much what platform you use. There are a several possible reasons to need to move to NoSQL:
Data scaling - SQL works best when all the data fits on one server (up to a few TB). The reason a lot of NoSQL stores don't have join is that they were designed not to require all the objects to be on one server.
Performance scaling - NoSQL stores do tend to be faster at handling high traffic, but not necessarily by enough to matter. You can improve SQL performance quite a lot with replication and caching as long as you aren't running into data size issues. Writes generally do have to run on the one server, but in most cases you will need to improve read performance long before write performance becomes an issue.
Complex data access - some types of queries simply don't fit well into a relational model. Graph and set stores work quite differently from relational databases so are a better fit for some applications.
Easier development - If you don't already have a SQL database and all the code to support it, using a schemaless datastore can save quite a bit of development time.
I don't think so you have to move your database from SQL to NoSQL unless and untill you are serving thousands of TB data. If you properly normalize your tables and serve the data and also need to set proper archive mechanism it should work.
If you still have question what to choose and how, than check this. Let's assume that you have decided to move on to NoSQL database than there are lot of market player. Just have a look at the list which is again depending upon your need and type of data you have.
Am I going to gain a lot of performance my moving to NoSQL?
It depends.
Check out this article for 7 reasons when you DON'T want to use NoSQL. If none is your case, then read further.
The main advantage of Document-based NoSQL for the traditional enterprise needs is cheaper hosting at high scale due to lower CPU usage on querying denormalised data (the most often request). Key points:
The CPU is going nuts on JOINs and GROUP BYs in the SQL queries, when a denormilised data structure implies no/less JOINs, hence less stress on CPU.
CPU is the most expensive resource in the cloud, then storage is the cheapest. And denormalised data trades higher storage for lower CPU.
How to get there?
Master the DDD (Domain-Driven Design).
Gain good understanding of CQRS (Command Query Responsibility Segregation) and Eventual consistency.
Understand your domain and business processes.
Design model, which is tuned to the access patterns.
Review.
Repeat steps 3 - 5.