I'm trying to store buses schedules into database and I'm wondering which database model is suitable for my case.
I have bus operators, each operator has several routes, each route has several turns, each turn has stops, etc. Turns are generated from something called "turn master" where the scheduling is defined (frequency, stops, etc.) within next N days.
I expect to deliver a very fast searching for bus when user tries to search a bus from city to city on given date.
I'm using MySQL, the number of stops reach around 100.000 records and the searching speed is fast but I'm not sure if it's still fast when data gets really big (thousand operators, each operator has hundreses turns, each turn has around 10 stops, turns are generated for around next 30 days).
Basically, performing a search is to look into stops (city/town/place, time) and check if it matches user search criteria.
So, my question is: Is relational database best in this case? Or using some kind of NoSQL will be better when the data get really bigs?
Thanks in advance,
NoSQL databases are designed to work with unstructured data or data which is structured in various or unpredictable ways. Your data is structured in a very well understood and predictable way.
What makes you think that relational database isn't the right answer for your application? Having a lot of rows doesn't mean your relational queries are going to be slow. The performance of your application will depend on having proper indexing, but even more importantly, it will depend on your application logic. What heuristic are you using for solving the travelling salesman problem? How you do your routing could potentially have a bigger impact on system performance than your data storage choices.
Related
I'm just wondering what solution to chose to implement a follower system?
In MySQL i would have a table
userID INT PRIMARY,
followID INT PRIMARY
And in Redis I would just use a SET and add to the UserID all the followIDs.
What would be faster for lets say someone having 2000 followers and you want to list all the followers?(in a table that has about 1M entries)
What would be faster to find out if two Users follow each other?
Thank you very much!
By modern standards, 1M items are nothing. Any database or NoSQL system will work fine with such volume, so you just have to pick the one you are the most comfortable with.
In term of absolute performance, Redis will be faster than MySQL on this use case, because:
the whole dataset will be in memory
hash tables are faster than btrees
there is no SQL query to parse or execute
However, please note a relational database is far more flexible than a key/value store like Redis. If you can anticipate all the access paths to your data, then Redis is a good solution. Otherwise you will be better served by a more traditional database.
In my opinion, go with MySQL.
The two biggest points you will think about when making the decision are:
1) Have you thought about your use-cases?
You said you want to implement a follower system. If you're only going to be displaying a list of followers which each user has, then the Redis SET will be enough.
But what if you want to get a list of "A list of users which you are currently following"? You can't dig that up easily from your Redis SET, right? Or how about if you wanted to know if User-X is following User-A ? If User-A had 10,000 followers, this wouldn't be easy either would it?
MySQL is much more flexible when querying different types of results in different scenes.
2) Do you really need the performance difference?
As you know, Redis IS faster than MySQL in these kinds of cases.
It is a simple Key-Value system, so it will exceed the performance of MySQL.
Checking out performance results like these:
http://colinhowe.wordpress.com/2009/04/27/redis-vs-mysql/
http://ruturaj.net/redis-memcached-tokyo-tyrant-and-mysql-comparision/
But the performance difference between Redis and MySQL really starts to kick in
only after about 5,000request/sec .
Otherwise you'd wouldn't be seeing a difference of more than 50ms.
Performance difference will not be an issue until you have a VERY large traffic.
So, after thinking about these two points, MySQL would be a better answer.
Redis will be good only if:
1) The purpose of the set/list is specific, and there is no need for flexibility in the future
2) You feel that the performance difference will actually have an effect on your architecture.
It depends on what you want to do with the data. You gave some examples but it does not sound as though you are really giving a full definition of what the product needs to do. If all you really want to do is show users if they follow each other? Then either is fine as you are just talking about 2 simple queries. However, what if you want to show two users the intersection of users they share or you want to make suggestions off of the data based on profile data for the users. Then, it becomes more interesting as Redis has functionality to easily give you the intersection of sets very very quickly (we're talking magnitude differences in terms of speed not just milliseconds - and the difference gets exponentially larger as there are more users/relationships to parse as the sql joins required to get the data can become prohibitive if you want to give the data in real time).
sadd friends:alex george paul bart
sadd friends:alice mary sarah bart
sinterstore friends:alex_alice friends:alex friends:alice
Note that the above can be done with mysql as well, but your performance will suffer and it would be something that you are more likely to run as a batch job and then store the results for future use. On the other hand, keep in mind that the largest "friends" network in the world, Facebook, started with mysql to store relationships. The graphs of those relationships were batched and heavily denormalized for storage in thousands of memcached servers to get decent performance.
Then if you are looking for more options beyond mysq1 or redis, you might want to read what Michael Stonebaker has to say (he helped create Postgres and Ingres) about using an RDBMS system for graph data such as friend relationships. http://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate-worse-than-death/. Of course, he's trying to sell his new VoltDB but it is interesting food for thought.
So I think you really need to map out the requirements for the app (as I assume it will do more than just show you who your friends are) in terms of both expected load (did you just throw out 2000 or is that really what you expect to handle) and features and budget. Then really examine many of the different options on the market.
First of all I wanted to clarify that I am learning about Hive and Hadoop (and big data in general), so excuse the lack of proper vocabulary.
I am embarking myself in a huge (at least for me) project which requires dealing with enormous quantities of data which I am not use to deal in the past as I always worked mostly with MySQL.
For this project a series of sensors will produce approximately 125.000.000 data points 5 times an hour (15.000.000.000 a day) which is several times more that everything I have ever inserted into every MySQL table combined.
I understand that one approach would be using Hadoop MapReduce and Hive to query and analyze the data.
The problem I am facing is that for what I could learn I understand Hive runs mostly like "cron jobs" and not with real time queries which may take many hours and require a different infrastructure.
I thought of creating MySQL tables based on the results of Hive queries as at most the data which will be needed to be queried in real time would be approximately 1.000.000.000 rows but I was wondering if this is the right way to go or I should look into some other technology.
Is there any technology I should study which is specifically created for real time queries on big data?
Any tip will be much appreciated!
This is a complicated question. Let's start by addressing the technologies that you mention in your question, and go from there:
MySQL: It should be obvious to anyone who has used MySQL (or any other relational DB) that a traditional out-of-the-box installation of MySQL will never support the volumes that you are talking about. the back of the envelope calculations are enough to tell us that- assuming that your sensor inserts are only 100 bytes, you are talking about 15 billion x 100 bytes = 1.5 trillion bytes or 1.396 terabytes per day. That's truly big data, especially if you are planning on storing it for more than a day or two.
Hive: Hive can certainly handle that kind of data volume (I and many others have done it), but as you point out, you don't get real-time queries. Every query will be in batch, and if you need fast queries you'll need to pre-aggregate data.
Now that brings us to the real question- what kind of queries do you need to run? If you need to run arbitrary, real-time queries and can never predict what those queries might be, then you probably need to look towards comparatively expensive, proprietary data stores like Vertica, Greenplum, Microsoft PDW, etc. These will cost a lot of money, but they and others can handle the load you are talking about.
If on the other hand you can predict with some degree of accuracy the type of queries that will be run, then something like Hive might make sense. Store the raw data there, and use the batch query capabilities to do the heavy lifting and periodically create aggregated data tables in MySQL or another relational database to support your needs for low-latency queries.
One more alternative is something like HBase. HBase gives you low-latency access to distributed data, but you lose two critical items that you are probably accustomed to- a query language (HBase doesn't have SQL) and the ability to aggregate data. To do aggregations in HBase, you need to run a MapReduce job, though that job can then go and store it's results back into HBase for low-latency access again.
I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores it in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting. It looks good especially because of its performance, which is crucial in my application. I know that MySQL can be fast too, I just feel like implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of divisions and multiplications with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision either toward MySQL or Redis?
Thank you!
Redis is an in-memory store. All the data must fit in memory. So except if you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space efficient time series: https://github.com/antirez/redis-timeseries
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent to aggregate statistics in real time and store the result of these caclulations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on these data. Btree based stores supporting sharding (MongoDB for instance) are probably more convenient than Redis to store large time series.
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files. It is space efficient. It can be used with partitioning, so despite it does not index the data, it can be efficient at retrieving a subset of data if the partition granularity is carefully chosen.
You should consider Cassandra or Hbase. Both allow contiguous storage and fast appends, so that when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is along one of your query dimensions (usually by ticker), you're accessing disk (ssd or spinning), contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar guys, without having to do any client side shenanigans to merge blobs together.
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative on Cassandra at least is that you don't have consistency for a few seconds sometimes if you have a big cluster, so you need either to force it, slowing it down, or you accept that the very very latest print sometimes will be a few seconds old. On a single machine there will be zero consistency problems, and you'll get the same columnar benefits.
Less familiar with Hbase but it claims to be more consistent (there will be a cost elsewhere - CAP theorem), but it's much more of a commitment to setup the Hbase stack.
You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (also removing regularly), check out MySQL partitioning. See e.g. the answer I wrote to What is the best way to delete old rows from MySQL on a rolling basis?
>
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to
I am trying to apply for a job, which asks for the experiences on handling large scale data sets using relational database, like mySQL.
I would like to know which specific skill sets are required for handling large scale data using MySQL.
Handling large scale data with MySQL isn't just a specific set of skills, as there are a bazillion ways to deal with a large data set. Some basic things to understand are:
Column Indexes, how, why, and when they're used, and the pros and cons of using them.
Good database structure to balance between fast writes and easy reads.
Caching, leveraging several layers of caching and different caching technologies (memcached, redis, etc)
Examining MySQL queries to identify bottlenecks and understanding the MySQL internals to see how queries get planned an executed by the database server in order to increase query performance
Configuring the MySQL server to be able to handle a lot of concurrent connections, and access it's data fast. Hardware bottlenecks, and the advantages to using different technologies to speed up your hardware (for example, storing your MySQL data on a RAID5 Array to increase IO performance))
Leveraging built-in MySQL technology (like Replication) to off-load read traffic
These are just a few things that get thought about in regards to big data in MySQL. There's a TON more, which is why the company is looking for experience in the area. Knowing what to do, or having experience with things that have worked or failed for you is an absolutely invaluable asset to bring to a company that deals with high traffic, high availability, and high volume services.
edit
I would be remis if I didn't mention a source for more information. Check out High Performance MySQL. This is an incredible book, and has a plethora of information on how to make MySQL perform in all scenarios. Definitely worth the money, and the time spent reading it.
edit -- good structure for balanced writes and reads
With this point, I was referring to the topic of normalization / de-normalization. If you're familiar with DB design, you know that normalization is the separation of data as to reduce (eliminate) the amount of duplicate data you have about any single record. This is generally a fantastic idea, as it makes tables smaller, faster to query, easier to index (individually) and reduces the number of writes you have to do in order to create/update a new record.
There are different levels of normalization (as #Adam Robinson pointed out in the comments below) which are referred to as normal forms. Almost every web application I've worked with hasn't had much benefit beyond the 3NF (3rd Normal Form). Which the definition of, if you were to read that wikipedia link above, will probably make your head hurt. So in lamens (at the risk of dumbing it down too far...) a 3NF structure satisfies the following rules:
No duplicate columns within the same table.
Create different tables for each set related data. (Example: a Companies table which has a list of companies, and an Employees table which has a list of each companies' employees)
No sub-sets of columns which apply to multiple rows in a table. (Example: zip_code, state, and city is a sub-set of data which can be identified uniquely by zip_code. These 3 columns could be put in their own table, and referenced by the Employees table (in the previous example) by the zip_code). This eliminates large sets of duplication within your tables, so any change that is required to the city/state for any zip code is a single write operation instead of 1 write for every employee who lives in that zip code.
Each sub-set of data is moved to it's own table and is identified by it's own primary key (this is touched/explained in the example for #3).
Remove columns which are not fully dependent on the primary key. (An example here might be if your Employees table has start_date, end_date, and years_employed columns. The start_date and end_date are both unique and dependent on any single employee row, but the years_employed can be derived by subtracting start_date from end_date. This is important because as end-date increases, so does years_employed so if you were to update end_date you'd also have to update years_employed (2 writes instead of 1)
A fully normalized (3NF) database table structure is great, if you've got a very heavy write-load. If your server is doing a lot of writes, it's very easy to write small bits of data, especially when you're running fewer of them. The drawback is, all your reads become much more expensive, because you have to (typically) run a lot of JOIN queries when you're pulling data out. JOINs are typically expensive and harder to create proper indexes for when you're utilizing WHERE clauses that span the relationship and when sorting the result-sets If you have to perform a lot of reads (SELECTs) on your data-set, using a 3NF structure can cause you some performance problems. This is because as your tables grow you're asking MySQL to cram more and more table data (and indexes) into memory. Ideally this is what you want, but with big data-sets you're just not going to have enough memory to fit all of this at once. This is when MySQL starts to create temporary tables, and has to use the disk to load data and manipulate it. Once MySQL becomes reliant on the hard disk to serve up query results you're going to see a significant performance drop. This is less-so the case with solid state disks, but they are super expensive, and (imo) are not mature enough to use on mission critical data sets yet (i mean, unless you're prepared for them to fail and have a very fast backup recovery system in place...then use them and gonuts!).
This is the balancing part. You have to decide what kind of traffic the data you're reading/writing is going to be serving more of, and design that to be fast. In some instances, people don't mind writes being slow because they happen less frequently. In other cases, writes have to be very fast, and the reads don't have to be fast because the data isn't accessed that often (or at all, or even in real time).
Workloads that require a lot of reads benefit the most from a middle-tier caching layer. The idea is that your writes are still fast (because you're 'normal') and your reads can be slow because you're going to cache it (in memcached or something competitive to it), so you don't hit the database very frequently. The drawback here is, if your cache gets invalidated quickly, then the cache is not reducing the read load by a meaningful amount and that results in no added performance (and possibly even more overhead to check/invalidate the caches).
With workloads that have the requirement for high throughput in writes, with data that is read frequently, and can't be cached (constantly changes), you have to come up with another strategy. This could mean that you start to de-normalize your tables, by removing some of the normalization requirements you choose to satisfy, or something else. Instead of making smaller tables with less repetitive data, you make larger tables with more repetitive / redundant data. The advantage here is that your data is all in the same table, so you don't have to perform as many (or, any) JOINs to pull the data out. The drawback...writes are more expensive because you have to write in multiple places.
So with any given situation the developer(s) have to identify what kind of use the data structure is going to have to serve, and balance between any number of technologies and paradigms to achieve an acceptable solution that meets their needs. No two systems or solutions are the same which is why the employer is looking for someone with experience on how to deal with these large datasets. Finding these solutions is not something that can really be learned out of a book, it typically takes some experience in the field and experience with how different solutions performed.
I hope that helps. I know I rambled a bit, but it's really a lot of information. This is why DBAs make the big dollars (:
You need to know how to process the data in "chunks". That means instead of simply trying to manipulate the entire data set, you need to break it into smaller more manageable pieces. For example, if you had a table with 1 Billion records, a single update statement against the entire table would likely take a long time to complete, and may possibly bring the server to it's knees.
You could, however, issue a series of update statements within a loop that would update 20,000 records at a time. Each iteration of the loop you would increment your range/counters/whatever to identify the next set of records.
Also, you commit your changes at the end of each loop, thereby allowing you to stop the process and continue where you left off.
This is just one aspect of managing large data sets. You still need to know:
how to perform backups
proper indexing
database maintenance
You can raed/learn how to handle large dataset with MySQL But it is not equivalent to having actual experiences.
Straight and simple answer: Study about partitioned database and find appropriate MySQL data structure types for large scale datasets similar with the partitioned database architecture.
Here are the facts:
We have a lot (L O T) of data coming in everyday.
Each file we receive is in a csv format and while there are a couple of headers that reoccur more often than others, there is not really a standard.
The normalization of each file to be uploaded into a mySQL database is highly time consuming and often pushes us to change the schema (new field appeared in on file that was not existing before..).
While the primary key is unique, anything else can be duplicated
These are customers records (i.e.: email,firstname,lastname,city,state,address...etc)
We could have multiple emails for the same individual ..
We read 70% of the time and we write 30% of the time
Scalability could be a concern but it is not right now, though availability is key
Speed is what we are looking for. Mysql is too slow to answer queries where tables are over 50 million records. Even well optimized we have too many speed issue. Breaking down the tables has become an organizational concern. Schema less noSQL seemed attractive. What would you recommend, what did you implement? (Please do not answer to optimize mysql .. pointless and off topic)
--
Let's go over the points:
We have a lot (L O T) of data coming in everyday.
NoSQL solutions are basically all created to scale to large numbers (Riak, MongoDB, Cassandra, etc.)
... headers that reoccur more often than others, there is not really a standard... The normalization of each file to be uploaded into a mySQL database is highly time consuming and often pushes us to change the schema
NoSQL definitely fits this model many of them are "schema-less" so it's easy to store those extra fields. This will however cost you extra space as the field names are typically stored with the document.
While the primary key is unique, anything else can be duplicated
"Document-oriented" and "Key-Value" databases are a good fit for this as long as the key is provided. If you have to run duplicate checks, then most key-value database are ill-equipped. The "document-oriented" database might be slightly better equipped, but not by much.
We could have multiple emails for the same individual
Most of these databases have some notion of "arrays as a basic type". CouchDB and MongoDB both store objects as JSON, so it's easy to see how a customer could have an array of e-mails without the need for a "join table". MongoDB also provides "atomic update" features like "$addToSet" that plays nicely with arrays.
We read 70% of the time and we write 30% of the time
Scalability could be a concern but it is not right now, though availability is key
The major NoSQL DBs are all designed to scale. (both reads and writes)
The only way to availability is through hardware and locational redundancy (no different that MySQL or other databases). Despite their low version numbers, many of these Databases are being used in production environments by very big companies, so many of the simple cases are covered. It's still virgin territory, but we're also past the "randomly crashes when nothing has changed" phase.
Speed is what we are looking for... Schema less noSQL seemed attractive. What would you recommend, what did you implement?
We have 100s of M of flexible user records in MongoDB. Performance on individual seeks is really awesome.
However, you have to wary about the type of queries you're running.
If you need to run queries that bring back several Users at once, you're going to have speed issues with basically any of these Key-Value or Document-Oriented database. You may want to look at Graph database or some other fancy solution. However, if your use cases all center around one user at a time then take a look at MongoDB.
MongoDB also supports native map-reduce so you'll be able to scale "non-real time" queries.