Is a MySQL relational database scalable enough to hold 80 GB of additional logs per day?

I am currently deciding on a long-term architecture for storing DNS logs. The amount of data in question is some 80 GB of logs per day at peak. Currently I am looking at NoSQL databases such as MongoDB, as well as relational ones such as MySQL. I want to structure a solution that meets three requirements:
Storage: This is a long-term project, so I need the capacity to store 80 GB of logs per day (~30 TB a year!). I realize this is pretty ridiculous, so I'm willing to have a retention period (keep 6 months' worth of logs = about 15 TB at steady state).
Scalability: As it is a long-term solution, this is a big issue. I've heard that MongoDB is horizontally scalable, while MySQL is not? Any elaboration on this would be very well received.
Query speed: As close to instantaneous querying as possible.
It should be noted that our logs are stored on an intermediary server, so we do not need to forward logs from our DNS servers.
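For the retention requirement, one option I'm considering on the MySQL side is partitioning the log table by day, so that expiring old data means dropping a whole partition rather than deleting hundreds of millions of rows. A rough sketch (all table and column names here are placeholders, not our real schema):

    -- Hypothetical daily-partitioned log table.
    CREATE TABLE dns_log (
        logged_at  DATETIME          NOT NULL,
        client_ip  VARBINARY(16)     NOT NULL,
        query_name VARCHAR(255)      NOT NULL,
        qtype      SMALLINT UNSIGNED NOT NULL,
        KEY idx_name_time (query_name, logged_at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE COLUMNS (logged_at) (
        PARTITION p20240101 VALUES LESS THAN ('2024-01-02'),
        PARTITION p20240102 VALUES LESS THAN ('2024-01-03')
    );

    -- Each day: add tomorrow's partition and drop the one that just aged
    -- past the 6-month retention window. Dropping a partition is far cheaper
    -- than a DELETE over a multi-terabyte table.
    ALTER TABLE dns_log ADD PARTITION (PARTITION p20240103 VALUES LESS THAN ('2024-01-04'));
    ALTER TABLE dns_log DROP PARTITION p20240101;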

Related

Sharding vs Partitioning database

Sorry for the dumb question, I am not a developer. I want a solution for my database. I have zero user information in my database at the moment, but I am thinking about what happens if I have 1 million users' information in the future.
Please explain in simple words. I have a database on a dedicated server.
And I want to copy the database to 10 databases on 10 dedicated servers.
When a user fills in his information (name, date of birth, address, etc.), the first 100k user records should go to the first database and server, the next 100k to the second database and server, and so on.
I also want to search user information, e.g. find Sam Smith with a given date of birth across all 10 databases at the same time.
How can that be achieved?
Which one is the solution: MySQL auto-sharding,
MySQL partitioning,
MongoDB sharding,
or clustering?
Thanks in advance.
1M rows in a table -- no problem. Even 1 billion rows may not need any of those fancy actions.
Replication -- needed if you have 1000 reads per second. Or you want a separate backup machine.
Partitioning -- won't help the use case you described.
Sharding -- only if you need 1000 writes per second.
1M WordPress "users", each owning a database with something like 26 tables and thousands of "posts"? Then you have a problem. (But mostly that is because of WP's choice of EAV schema design.)
Bottom line -- re-ask your question after you have a thousand rows of "user information", and provide SHOW CREATE TABLE. With 1K rows, the dataset should be big enough to project what issues growth may present, but not so big that it will necessitate a long downtime.
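To illustrate why none of the fancy options are needed yet: a single table with a composite index handles the "Sam Smith with date of birth" search easily. A rough sketch (table and column names are hypothetical):

    -- Hypothetical single-server schema with a composite index.
    CREATE TABLE user_information (
        id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        first_name    VARCHAR(50)  NOT NULL,
        last_name     VARCHAR(50)  NOT NULL,
        date_of_birth DATE         NOT NULL,
        address       VARCHAR(255) NOT NULL,
        KEY idx_name_dob (last_name, first_name, date_of_birth)
    ) ENGINE=InnoDB;

    -- With idx_name_dob this lookup touches only a handful of index pages,
    -- whether the table holds 1 million or 100 million rows.
    SELECT *
    FROM user_information
    WHERE last_name = 'Smith'
      AND first_name = 'Sam'
      AND date_of_birth = '1990-05-01';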
For MongoDB, here is some info on both options, since they provide different functionality:
sharding -> horizontal read/write performance scalability (if a single server cannot provide enough performance and you cannot upgrade it any further as the data grows over time, then you need to shard)
replication -> high availability and redundancy (if you want to reduce maintenance downtime and increase redundancy, then you use replication to add more replica-set members that hold the same data on different storage)

Hadoop for MySQL use cases

I have a database with ~4 million records of US stock, mutual fund and ETF prices covering 5 years, and every day I add the daily price for each security.
For one feature that I am working on I need to fetch latest price for each security (groupwise max) and do some calculation with other financial metrics.
The securities count is ~40K.
But the groupwise maximum with this amount of data is heavy and takes minutes to execute.
Of course my tables use indexes, but the task involves fetching and processing nearly 7 GB of data in real time.
So I am interested: is this a task for Big Data tools and algorithms, or is it a small amount of data? In the examples I noticed they are working on datasets of thousands or millions of GBs.
My database is MySQL and I want to use Hadoop to process data.
Is that good practice, or should I stick to MySQL optimizations only (is my data small?)? If it is wrong to use Hadoop on this amount of data, what can you advise for this case?
NOTE that my data is increasing every day, and the project involves many analyses that need to be done in real time, based on user requests.
NOTE: I don't know whether this question is OK to ask on Stack Overflow, so sorry if it is off-topic.
Thanks in advance!
In Hadoop terms, your data is definitely small. Recent computers have 16+ GB of RAM, so your dataset can fit entirely into the memory of a single machine.
However, that doesn't mean you can't at least attempt to load the data into HDFS and perform some operations over it. Sqoop and Hive would be the tools you would use to load it and run SQL processing.
Since I brought up the point about memory, though, it is entirely feasible that you don't need Hadoop (HDFS & YARN) at all, and can instead use Apache Spark with Spark SQL to hit MySQL directly over a distributed JDBC connection.
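As a rough sketch of that route, Spark SQL can register the MySQL table as a view over its built-in JDBC data source and query it without HDFS or YARN (all connection details and names below are placeholders):

    -- Spark SQL: expose a MySQL table through the JDBC data source.
    CREATE TEMPORARY VIEW prices
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url      "jdbc:mysql://mysql-host:3306/market",
      dbtable  "prices",
      user     "reader",
      password "secret"
    );

    -- The query runs in Spark against data read from MySQL over JDBC.
    SELECT security_id, MAX(trade_date) AS latest_trade_date
    FROM prices
    GROUP BY security_id;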
For MySQL, you can take advantage of indexes and achieve the goal in O(M), where M is the number of securities (40K), instead of O(N), where N is the number of rows in the table.
Here is an example of the pattern; it will need adapting to your schema.
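A minimal sketch of that groupwise-max approach, assuming a prices table with security_id, trade_date and close_price columns (all names are assumptions):

    -- A composite index lets the inner GROUP BY be answered from the index,
    -- touching roughly one entry per security (~40K) instead of all ~4M rows.
    ALTER TABLE prices ADD INDEX idx_sec_date (security_id, trade_date);

    SELECT p.security_id, p.trade_date, p.close_price
    FROM prices AS p
    JOIN (
        SELECT security_id, MAX(trade_date) AS trade_date
        FROM prices
        GROUP BY security_id
    ) AS latest USING (security_id, trade_date);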

Can I have several 'similar' database tables to reduce retrieval time

It is best to explain my question in terms of a concrete example.
Consider an order management application that restaurants use to receive orders from their customers. I have a table called orders which stores all of them.
Now the table keeps growing in size every day, but the amount of data accessed stays roughly constant. Generally the restaurants are only interested in orders received in the last day or so. After 100 days, for example, 'interesting' data is only about 1/100 of the table size; after 1 year it's 1/365, and so on.
Of course, I want to keep all the old orders, but performance for applications that are only interested in current orders keeps reducing. So what is the best way to not have old data interfere with the data that is 'interesting'?
From my limited database knowledge, one solution that occurred to me was to have two identical tables, order_present and order_past, within the same database. New orders would come into order_present, and a cron job would transfer all processed orders older than two days to order_past, keeping the size of order_present constant.
Is this considered an acceptable solution to deal with this problem? What other solutions exist?
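For concreteness, the cron job I have in mind would be roughly the following (the status and created_at columns are just examples):

    -- Nightly job: move processed orders older than two days into the archive
    -- table, then remove them from the hot table. One transaction, so a reader
    -- never sees an order in both tables (or in neither).
    START TRANSACTION;

    INSERT INTO order_past
    SELECT *
    FROM order_present
    WHERE status = 'processed'
      AND created_at < NOW() - INTERVAL 2 DAY;

    DELETE FROM order_present
    WHERE status = 'processed'
      AND created_at < NOW() - INTERVAL 2 DAY;

    COMMIT;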
Database servers are pretty good at handling volume, but performance can be limited by the physical hardware. If it is the IO latency that is bothering you, there are several solutions available; you really need to evaluate what fits your use case best.
For example:
you can Partition the table to distribute it onto multiple physical disks.
you can do Sharding to put data onto different physical servers.
you can evaluate using another Storage Engine which best fits your data and application. MyISAM delivers better read performance than InnoDB, at the cost of being less ACID-compliant.
you can use Read Replicas to delegate all (or most) "select" queries to replicas (slaves) of the main database server (master).
Finally, MySQL Performance Blog is a great resource on this topic.

Is MySQL suitable for storing a data stream coming in at a 250 ms update interval, with 30 bytes of information per update, and serving it to the web?

The title says most of it: I'm wondering if MySQL is suitable (and if not, what else would do better) for storing this data?
It would be most likely 3 to 6 floating point numbers delivered every quarter second. (Somewhere between 1-3 GB per year)
At the same time this data needs to be accessible from MySQL in real time (or close to it), so the ingest can't miss a beat while some (potentially large) queries are running on the database.
Can MySQL handle this? I have no concept of how this kind of database scales or what constitutes "a lot" of information at one time for these databases.
If the database were to receive the same information but clumped into one large update every 5-10 minutes would that change the answer?
It depends on the table and the hardware. A decent MySQL server would have no problem with this, especially if equipped with an SSD drive.
One large query inserting a large amount of data would be more efficient, as it means a single index update and so on. There will be a sweet spot somewhere between the 0.25-second updates and a 5-minute bulk load; testing would show where.
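As a rough illustration of the bulk option (table and column names are hypothetical):

    -- One statement per sample (every 250 ms).
    INSERT INTO sensor_readings (recorded_at, v1, v2, v3)
    VALUES ('2024-01-01 12:00:00.000', 1.23, 4.56, 7.89);

    -- The same samples batched into one multi-row INSERT: one round trip,
    -- one commit, and one pass over the indexes for many rows.
    INSERT INTO sensor_readings (recorded_at, v1, v2, v3) VALUES
      ('2024-01-01 12:00:00.000', 1.23, 4.56, 7.89),
      ('2024-01-01 12:00:00.250', 1.24, 4.57, 7.90),
      ('2024-01-01 12:00:00.500', 1.25, 4.58, 7.91);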
If you are doing replication, keep in mind that the slave is single-threaded, and these updates would ultimately take longer there. You would also want to go with InnoDB as opposed to MyISAM to avoid table-level locks during writes.

Can MySQL Cluster handle a terabyte database

I have to look into solutions for providing a MySQL database that can handle data volumes in the terabyte range and be highly available (five nines). Each database row is likely to have a timestamp and up to 30 float values. The expected workload is up to 2500 inserts/sec. Queries are likely to be less frequent but could be large (maybe involving 100Gb of data) though probably only involving single tables.
I have been looking at MySQL Cluster given that is their HA offering. Due to the volume of data I would need to make use of disk based storage. Realistically I think only the timestamps could be held in memory and all other data would need to be stored on disk.
Does anyone have experience of using MySQL Cluster on a database of this scale? Is it even viable? How does disk based storage affect performance?
I am also open to other suggestions for how to achieve the desired availability for this volume of data. For example, would it be better to use a third-party library like Sequoia to handle the clustering of standard MySQL instances? Or a more straightforward solution based on MySQL replication?
The only condition is that it must be a MySQL based solution. I don't think that MySQL is the best way to go for the data we are dealing with but it is a hard requirement.
Speed-wise, it can be handled. Size-wise, the question is not the size of your data, but rather the size of your index, as the indexes must fit fully in memory.
I'd be happy to offer a better answer, but high-end database work is very task-dependent. I'd need to know a lot more about what's going on with the data to be of further help.
Okay, I did read the part about MySQL being a hard requirement.
So with that said, let me first point out that the workload you're talking about -- 2500 inserts/sec, rare queries, queries likely to have result sets of up to 10 percent of the whole data set -- is just about pessimal for any relational database system.
(This rather reminds me of a project, long ago, where I had a hard requirement to load 100 megabytes of program data over a 9600 baud RS-422 line (also a hard requirement) in less than 300 seconds (also a hard requirement.) The fact that 1kbyte/sec × 300 seconds = 300kbytes didn't seem to communicate.)
Then there's the part about "contain up to 30 floats." The phrasing at least suggests that the number of samples per insert is variable, which suggests in turn some normalization issues -- or else needing to make each row 30 entries wide and use NULLs.
But with all that said, okay, you're talking about 300Kbytes/sec and 2500 TPS (assuming this really is a sequence of unrelated samples). This set of benchmarks, at least, suggests it's not out of the realm of possibility.
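To make the two row layouts concrete, here is a rough sketch (all table and column names are invented for illustration):

    -- Option 1: one wide row per sample; slots without a reading stay NULL.
    CREATE TABLE samples_wide (
        sample_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        sample_ts DATETIME(6) NOT NULL,
        v01 FLOAT, v02 FLOAT, v03 FLOAT, /* ... */ v30 FLOAT,
        KEY idx_ts (sample_ts)
    );

    -- Option 2: normalized; one row per value actually present, so a sample
    -- with only 12 readings stores 12 rows instead of 18 NULL columns.
    CREATE TABLE samples_narrow (
        sample_id BIGINT UNSIGNED NOT NULL,
        slot      TINYINT UNSIGNED NOT NULL,
        value     FLOAT NOT NULL,
        PRIMARY KEY (sample_id, slot)
    );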
This article is really helpful in identifying what can slow down a large MySQL database.
Possibly try out Hibernate Shards and run MySQL on 10 nodes with half a terabyte each, so you can handle 5 terabytes then ;) Well over your limit, I think?