Hadoop for MySQL use cases - mysql

I have a database with ~4 million records of US stock, mutual fund and ETF prices covering 5 years, and every day I add the daily price for each security.
For one feature I am working on, I need to fetch the latest price for each security (a groupwise maximum) and do some calculations with other financial metrics.
The securities count is ~40K.
But the groupwise maximum over this amount of data is heavy and takes minutes to execute.
Of course my tables use indexes, but the task involves fetching and processing nearly 7GB of data in real time.
So I am wondering: is this a task for Big Data tools and algorithms, or is it a small amount of data? In the examples I have seen, those tools work on thousands or even millions of GBs.
My database is MySQL and I want to use Hadoop to process data.
Is that good practice, or should I rely only on MySQL optimizations (is my data small?), and if it is wrong to use Hadoop on this amount of data, what would you advise for this case?
NOTE that my data is growing every day, and the project involves many analyses that need to be done in real time, based on user requests.
NOTE I don't know whether this question is OK to ask on Stack Overflow, so my apologies if it is off-topic.
Thanks in advance!

In Hadoop terms, your data is definitely small. Recent computers have 16+ GB of RAM, so your dataset can fit entirely into the memory of a single machine.
However, that doesn't mean you can't at least attempt to load the data into HDFS and perform some operations over it. Sqoop and Hive would be the tools you would use to load the data and run SQL processing.
Since I brought up the point about memory, though, it is entirely feasible that you don't need Hadoop (HDFS & YARN) at all, and can instead use Apache Spark with SparkSQL to hit MySQL directly over a distributed JDBC connection.

For MySQL, you can take advantage of indexes and achieve the goal in O(M), where M is the number of securities (40K), instead of O(N), where N is the number of rows in the table.
Here is an example that needs adapting to your schema.
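As a concrete illustration of that index-driven groupwise maximum, here is a minimal sketch; the table and column names are assumptions, not your actual schema:

    -- Hypothetical schema: one row per security per day, with a composite
    -- primary key that supports a loose index scan for the groupwise MAX.
    CREATE TABLE prices (
        security_id INT NOT NULL,
        price_date  DATE NOT NULL,
        close_price DECIMAL(12,4) NOT NULL,
        PRIMARY KEY (security_id, price_date)
    );

    -- Latest price per security: the subquery touches roughly one index
    -- entry per security (M = 40K) instead of scanning all N rows.
    SELECT p.security_id, p.price_date, p.close_price
    FROM prices AS p
    JOIN (
        SELECT security_id, MAX(price_date) AS latest_date
        FROM prices
        GROUP BY security_id
    ) AS latest
      ON  latest.security_id = p.security_id
      AND latest.latest_date = p.price_date;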

Related

Big quantity of data with MySql

I have a question about mass storage. I'm working with 5 sensors, each sending a lot of data at its own frequency, and I'm using a MySQL database.
So here are my questions:
1) Is MySQL the right solution?
2) If not, is there another solution for storing this big quantity of data in a database?
3) I'm using threads together with mutexes, and I'm afraid this can cause problems. Actually, it seems to.
I hope I will get an answer to this question.
MySQL is a good solution for OLTP scenarios where you are storing transactions to serve web or mobile apps. But it does not scale well (despite its cluster abilities).
There are many options out there based on what is important to you:
File system: You can devise your own write-ahead-log solution to solve multi-threading problems and achieve "eventual consistency". That way you don't have to lock the data to one thread at a time. You can use schema-full file formats like CSV, Avro or Parquet. You can also use S3 or WSB for cloud-based block storage, or HDFS for plain replicated block storage.
NoSQL: You can store each entry as a document in a NoSQL document store. If you want to keep data in memory for faster reads, explore Memcached or Redis. If you want to perform searches on the data, use Solr or ElasticSearch. MongoDB is popular, but it has scalability issues similar to MySQL's; instead I would choose Cassandra or HBase if you need more scalability. With some NoSQL stores, you might have to parse your "documents" at read time, which may hurt analytics performance.
RDBMS: As MySQL is not scalable enough, you might explore Teradata and Oracle. The latest version of Oracle offers petabyte query capabilities and in-memory caching.
Using a database can add extra computational overhead if you have a "lot of data". Another question is what you do with the data: if you only accumulate it, an in-memory map/vector can be enough.
The first step might be to use a map/vector that you serialize to a file when needed. You can add the database later if you wish.
About mutexes: if you share code between different threads and, in that code, you work on the same data at the same time, then you need them. Otherwise remove them. By the way, if you can separate read and write operations, then you don't need a mutex/semaphore mechanism.
You can store data anywhere, but the choice of data storage structure depends on the use cases (the things you want to do with the data).
It could be HDFS files, an RDBMS, a NoSQL DB, etc.
For example, your common use cases could be:
1. to save the sensor data very quickly;
2. to retrieve the sensor data for a specific date.
Then, you can use MongoDB or Cassandra.
If you want to do deeper analytics (e.g. to get monthly average sensor data), you should definitely think about other solutions.
As for MySQL, it can also be used for reasonably big data storage, as it supports sharding. It fits some scenarios well, others not.
But I repeat: everything depends on the use cases, i.e. the things you want to do with the data.
So you could update the question with more details (define the desired use cases), or ask again.
There are several Questions that discuss "lots of data" and [mysql]. They generally say "yes, but it depends on what you will do with it".
Some general statements (YMMV):
a million rows -- no problem.
a billion rows or a terabyte of data -- You will run into problems, but they are not insurmountable.
100 inserts per second on spinning disk -- probably no problem
1000 rows/second inserted can be done; troubles are surmountable
creating "reports" from huge tables is problematical until you employ Summary Tables.
Two threads storing into the same table at the "same" time? Every RDBMS (MySQL included) solves that problem before the first release. The Mutexes (or whatever) are built into the code; you don't have to worry.
"Real time" -- If you are inserting 100 sensor values per second and comparing each value to one other value: No problem. Comparing to a million other values: big problem with any system.
"5 sensors" -- Read each hour? Yawn. Each minute? Yawn. Each second? Probably still Yawn. We need more concrete numbers to help you!

Best Way to log sensor data with MySQL

I am new to SQL, please advise.
I wish to log incoming data from a sensor every 5 seconds for future graph plotting.
What is the best way to design database in MySQL?
Could I log with a timestamp and use AVG functions when I want to display a graph by hour, day, week, or month?
Or could I log and compute an average every minute, hour, and day to reduce the database size?
Is it possible to use a trigger to compute the average once data has been collected for over 1 minute?
The answer is that it depends on how much data you are actually going to be logging, how often you are going to be querying it, and how fast your response time needs to be. If it's just one sensor, every 5 seconds, you could probably go on for eternity without running into too many problems with regular sql queries to pull out averages, sums, etc. in a reasonable period of time.
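For a single sensor at that rate, logging raw rows and aggregating at query time is usually enough; here is a minimal sketch, with all names assumed:

    -- Raw log, one row every 5 seconds:
    CREATE TABLE sensor_log (
        ts    DATETIME NOT NULL,
        value DOUBLE NOT NULL,
        PRIMARY KEY (ts)
    );

    -- Hourly averages over the last 7 days, ready for plotting:
    SELECT DATE_FORMAT(ts, '%Y-%m-%d %H:00:00') AS hour_bucket,
           AVG(value) AS avg_value
    FROM sensor_log
    WHERE ts >= NOW() - INTERVAL 7 DAY
    GROUP BY hour_bucket
    ORDER BY hour_bucket;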
I will say that from experience, you can do a lot with SQL and time series data, but you have to be very careful how you design your queries. I've worked with time series tables with billions of rows and tens of thousands of individual sensors among those rows; it's possible to achieve very fast execution over that many time series rows, but you might spend a week trying to fine-tune the database. It's definitely a trade-off between flexibility and speed.
Again, for your purposes, it probably is not going to make very much difference if you are just talking about one sensor; just write a regular SQL query. However, if you anticipate adding several hundred more sensors or increasing the sample rate, you may want to consider doing periodic "rollup" functions as you suggest. And in that case, I would be more inclined to write a custom solution using a NoSQL database (e.g. Cassandra, Couchbase, etc.) and using a program that runs periodically to do the rollup. If you are interested, I can provide details, but I really don't think you will need to go that far.
This post has a pretty good discussion on storing time series data in SQL vs NOSQL: https://dba.stackexchange.com/questions/7634/timeseries-sql-or-nosql
You should read about RRDtool.
From RRDtool website:
RRDtool is the OpenSource industry standard, high performance data
logging and graphing system for time series data.
http://oss.oetiker.ch/rrdtool/
If you don't want to use it (it may be too complicated or too big for your application, etc.), take a look at how it is made and how the information is stored.

Using Hive for real time queries

First of all I wanted to clarify that I am learning about Hive and Hadoop (and big data in general), so excuse the lack of proper vocabulary.
I am embarking on a huge (at least for me) project which requires dealing with enormous quantities of data, which I am not used to, as in the past I always worked mostly with MySQL.
For this project a series of sensors will produce approximately 125,000,000 data points 5 times an hour (15,000,000,000 a day), which is several times more than everything I have ever inserted into every MySQL table combined.
I understand that one approach would be using Hadoop MapReduce and Hive to query and analyze the data.
The problem I am facing is that, from what I have learned, Hive runs mostly like "cron jobs" rather than serving real-time queries; jobs may take many hours and require a different infrastructure.
I thought of creating MySQL tables based on the results of Hive queries, as the data that would need to be queried in real time is at most approximately 1,000,000,000 rows, but I was wondering if this is the right way to go or whether I should look into some other technology.
Is there any technology I should study which is specifically created for real time queries on big data?
Any tip will be much appreciated!
This is a complicated question. Let's start by addressing the technologies that you mention in your question, and go from there:
MySQL: It should be obvious to anyone who has used MySQL (or any other relational DB) that a traditional out-of-the-box installation of MySQL will never support the volumes you are talking about. The back-of-the-envelope calculation is enough to tell us that: assuming your sensor inserts are only 100 bytes each, you are talking about 15 billion x 100 bytes = 1.5 trillion bytes, or roughly 1.4 terabytes per day. That's truly big data, especially if you are planning on storing it for more than a day or two.
Hive: Hive can certainly handle that kind of data volume (I and many others have done it), but as you point out, you don't get real-time queries. Every query will be in batch, and if you need fast queries you'll need to pre-aggregate data.
Now that brings us to the real question- what kind of queries do you need to run? If you need to run arbitrary, real-time queries and can never predict what those queries might be, then you probably need to look towards comparatively expensive, proprietary data stores like Vertica, Greenplum, Microsoft PDW, etc. These will cost a lot of money, but they and others can handle the load you are talking about.
If on the other hand you can predict with some degree of accuracy the type of queries that will be run, then something like Hive might make sense. Store the raw data there, and use the batch query capabilities to do the heavy lifting and periodically create aggregated data tables in MySQL or another relational database to support your needs for low-latency queries.
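The MySQL side of that pattern can be as simple as a narrow aggregate table that the batch job reloads periodically; a sketch, where every name and the file path are hypothetical:

    -- Aggregates produced by the Hive batch job land here for low-latency reads:
    CREATE TABLE agg_hourly (
        sensor_id BIGINT NOT NULL,
        hr        DATETIME NOT NULL,
        avg_value DOUBLE NOT NULL,
        PRIMARY KEY (sensor_id, hr)
    );

    -- Bulk-load the batch output exported as CSV, overwriting duplicate keys:
    LOAD DATA INFILE '/data/exports/hourly.csv'
    REPLACE INTO TABLE agg_hourly
    FIELDS TERMINATED BY ',';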
One more alternative is something like HBase. HBase gives you low-latency access to distributed data, but you lose two critical things you are probably accustomed to: a query language (HBase doesn't have SQL) and the ability to aggregate data. To do aggregations in HBase, you need to run a MapReduce job, though that job can then store its results back into HBase for low-latency access again.

Redis vs MySQL for Financial Data?

I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores them in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting, especially because of its performance, which is crucial in my application. I know that MySQL can be fast too; I just feel like implementing heavy caching is going to be a pain.
The data I am saving consists mostly of decimal values. I am also doing a significant amount of division and multiplication with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key-quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision, either toward MySQL or Redis?
Thank you!
Redis is an in-memory store: all the data must fit in memory. So unless you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space efficient time series: https://github.com/antirez/redis-timeseries
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent for aggregating statistics in real time and storing the results of these calculations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on that data. Btree-based stores supporting sharding (MongoDB, for instance) are probably more convenient than Redis for storing large time series.
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files and is space efficient. It can be used with partitioning, so even though it does not index the data, it can be efficient at retrieving a subset of the data if the partition granularity is carefully chosen.
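As an illustration of that combination, here is a sketch; all names are hypothetical, and note that partitioned ARCHIVE tables are a MySQL 5.x feature (MySQL 8.0 restricts partitioning to InnoDB and NDB):

    -- Compressed, insert-only storage; queries are pruned to one partition.
    CREATE TABLE quote_archive (
        symbol VARCHAR(12) NOT NULL,
        ts     DATETIME NOT NULL,
        price  DOUBLE NOT NULL
    ) ENGINE=ARCHIVE
    PARTITION BY RANGE (YEAR(ts)) (
        PARTITION p2011 VALUES LESS THAN (2012),
        PARTITION p2012 VALUES LESS THAN (2013),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- No indexes, but this scan only touches the p2012 partition:
    SELECT AVG(price)
    FROM quote_archive
    WHERE ts >= '2012-01-01' AND ts < '2013-01-01';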
You should consider Cassandra or HBase. Both allow contiguous storage and fast appends, so when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is that along one of your query dimensions (usually the ticker), you're accessing disk (SSD or spinning) contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar stores, without having to do any client-side shenanigans to merge blobs together.
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative on Cassandra, at least, is that with a big cluster you sometimes lose consistency for a few seconds, so you either need to force it (slowing things down) or accept that the very latest print will sometimes be a few seconds old. On a single machine there will be zero consistency problems, and you'll get the same columnar benefits.
I'm less familiar with HBase, but it claims to be more consistent (there will be a cost elsewhere -- CAP theorem), though it's much more of a commitment to set up the HBase stack.
You should first check the features that Redis offers in terms of data selection and aggregation; compared to an SQL database, Redis is limited.
In fact, "Redis vs MySQL" is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (and also removing data regularly), check out MySQL partitioning. See e.g. the answer I wrote to What is the best way to delete old rows from MySQL on a rolling basis?
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to
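Applied to this use case, the rolling pattern looks roughly like this; a sketch that assumes a table range-partitioned by year, with all table and partition names hypothetical:

    -- Dropping a partition removes all of its rows almost instantly,
    -- with none of the overhead of a row-by-row DELETE:
    ALTER TABLE quote_history DROP PARTITION p2011;

    -- Splitting the catch-all partition makes room for the next year:
    ALTER TABLE quote_history REORGANIZE PARTITION pmax INTO (
        PARTITION p2013 VALUES LESS THAN (2014),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );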

Can MySQL Cluster handle a terabyte database

I have to look into solutions for providing a MySQL database that can handle data volumes in the terabyte range and be highly available (five nines). Each database row is likely to have a timestamp and up to 30 float values. The expected workload is up to 2500 inserts/sec. Queries are likely to be less frequent but could be large (maybe involving 100Gb of data) though probably only involving single tables.
I have been looking at MySQL Cluster given that is their HA offering. Due to the volume of data I would need to make use of disk based storage. Realistically I think only the timestamps could be held in memory and all other data would need to be stored on disk.
Does anyone have experience of using MySQL Cluster on a database of this scale? Is it even viable? How does disk based storage affect performance?
I am also open to other suggestions for how to achieve the desired availability for this volume of data. For example, would it be better to use a third-party library like Sequoia to handle the clustering of standard MySQL instances? Or a more straightforward solution based on MySQL replication?
The only condition is that it must be a MySQL based solution. I don't think that MySQL is the best way to go for the data we are dealing with but it is a hard requirement.
Speed-wise, it can be handled. Size-wise, the question is not the size of your data, but rather the size of your index, as the indices must fit fully within memory.
I'd be happy to offer a better answer, but high-end database work is very task-dependent. I'd need to know a lot more about what's going on with the data to be of further help.
Okay, I did read the part about MySQL being a hard requirement.
So with that said, let me first point out that the workload you're talking about -- 2500 inserts/sec, rare queries, queries likely to have result sets of up to 10 percent of the whole data set -- is just about pessimal for any relational database system.
(This rather reminds me of a project, long ago, where I had a hard requirement to load 100 megabytes of program data over a 9600 baud RS-422 line (also a hard requirement) in less than 300 seconds (also a hard requirement). The fact that 1 kbyte/sec x 300 seconds = 300 kbytes, nowhere near 100 megabytes, didn't seem to communicate.)
Then there's the part about "contain up to 30 floats." The phrasing at least suggests that the number of samples per insert is variable, which in turn suggests some normalization issues -- or else the need to make each row 30 entries wide and use NULLs.
But with all that said, okay, you're talking about 300Kbytes/sec and 2500 TPS (assuming this really is a sequence of unrelated samples). This set of benchmarks, at least, suggests it's not out of the realm of possibility.
This article is really helpful in identifying what can slow down a large MySQL database.
Possibly try out Hibernate Shards and run MySQL on 10 nodes with 1/2 terabyte each, so you can handle 5 terabytes ;) well over your limit, I think?