handling/compressing large datasets in multiple tables

handling/compressing large datasets in multiple tables - mysql

In an application at our company we collect statistical data from our servers (load, disk usage and so on). Since there is a huge amount of data and we don't need all data at all times we've had a "compression" routine that takes the raw data and calculates min. max and average for a number of data-points, store these new values in the same table and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine and the new routine must keep all uncompressed data we have for one year in one table and "compressed" data in another table. My main concerns now are how to handle the data that is continuously written to the database and whether or not to use a "transaction table" (my own term since I cant come up with a better one, I'm not talking about the commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just use ovak_result as a "temporary storage"? ovak_result would be kept minimal which would be good for the compressing routine, but I would need to shuffle all data from one table into another continually, and there would be constant reading, writing and deleting in ovak_result.
Are there any mechanisms in MySQL to handle these kind of things?
(Please note: We are talking about quite large datasets here (about 100 M rows in the uncompressed table and about 1-10 M rows in the compressed table). Also, I can do pretty much what I want with both software and hardware configurations so if you have any hints or ideas involving MySQL configurations or hardware set-up, just bring them on.)

Try reading about the ARCHIVE storage engine.
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.

Related

One database or multiple databases for statistical architecture

I currently already have a website running using CodeIgniter and MySQL. The MySQL database is around 110 tables big and contains mainly website specific data, like user data, vacancy data, etc.
Now I want to extend this website to include a complete statistical module as well. We would capture a lot of user actions and other aggregations from the data gather on our own website, and would also pull in some data from google analytics API to use in our statistics (we will generate a report in Excel but also show statistical graphs and numbers on a page (using chart.js)).
We are not thinking (in a forseeable future) to use this data in other programs, but we need to be able to open some data to the public using an API.
We expect to start with about 300.000-350.000 data points gathered per day, but this amount will keep on growing every day of course, the more users we get.
Using multiple databases in CodeIgniter seems to not be an issue, so the main problem I am left with is how I should create the architecture for this statistical module.
I have a couple of idea's on how to start doing this, but I am not aware if there is performance impact from one to another solution or other things to take into consideration.
My main idea boils down to having a table containing all "events", which just insert in that table every time an action is performed, eg "user is registered", "user put account on private", "user clicked on X", ...
Then once a day (probably at around midnight), a CRON job would run over that table for the past day and aggregate all the values into a format usable for our statistical metrics. Those aggregated values would be stored in a new table. This way we can clean up the "event" table quite regularly since that will become very big very fast.
Idea 1: Extend the current MySQL database architecture with new tables to incorporate the statistics. I would keep on using the current database architecture and add 2 new tables for the events and the aggregated values.
Idea 2: Create a new database, separate from the current existing one, and use this to insert all the events in a table there and the aggregated values in a new table there.
Note: we already have quite a few CRONS running on our current database, updating statusses and dates, sending emails, ...
Note2: sync issues between databases is not an issue since we will never be storing statistics on a per-user level.

MySQL does not care whether tables are in the same database or separate databases. It is just a convenience for the user. Some things:
You might need db1.tbla JOIN db2.tblb to talk across dbs.
It is convenient to have different GRANTs for different databases, but clumsy to have different GRANTs for 110 tables.
I can't think of any performance differences.
Nightly aggregation is a middle-of-the road approach. Using IODKU gives you 'immediate' aggregation, but is probably more burden on the system.
My blog on Summary Tables .
350K rows inserted per day is about 5/second, which is comfortably low, so I don't think we need to discuss performance issues there.
"Summarize and toss" (for events) -- Yes. I like that approach. (Most people fail to think of this option.)
Do the math. Which table is the largest after a year? How many GB will it be? Then think about whether you can shrink any of the columns in it: SMALLINT instead of INT, normalization of long, oft-repeated, strings, etc.

JPA - Hibernate : Select query on the continuously growing table

I have a Mysql table which holds about 10 million records currently. Records are inserted by another batch application on continues basis and keep on growing.
On the front end user can search the data on this table based on different criterion. I am using query DSL and JPA repository to create dynamic queries and getting data from the table. But the performance of the query with pagination is very slow. I have tried indexing, InnoDB related tweaks,session management by HikariCP and ehcahe solutions but still it is taking about 100 seconds to get the data.
Also entities are simple POJO with no relation with other entities.
What is the best way/technology/framework to implement this scenario?

In a table of this size, dynamic query is a really, REALLY bad idea, you need to really control the access to the table and avoid table scans at all cost.
Ultimately, this sounds like a data warehouse solution, whereas the data is ETL'ed into a report-like format and not raw transactional data. Even so, you'll still need to define the access patterns you need and design your DWH to support that.
If you decide that the raw data is still the best format, another approach would be to define support metadata tables that could be queried to more quickly reduce the number of rows returned.
Could also look at clustering the data if you can find some way to logically break out data into chunks. However, when you say dynamic queries, this might not be possible.

My suggestion would be to create a dedicated cache and the web app should query the cache instead of the DB. If the ETL batch to your main table is at a defined period, you can keep the cache hot by triggering a loading from the main table to the cache. This can any in memory cache like Ignite or Infinispan.
However, this is not a sustainable solution and eventually you would need to restrict your users into seeing data within a manageable date range only and will have to either discard or send the old data asynchronous via flat file generated reports.
Not the entire history of the huge dataset can be made available in the UI to the user.
You could also try data virtualization tools to figure out what users are more comfortable with before deciding on the partition strategy in production.

Redis vs MySQL for Financial Data?

I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores it in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting. It looks good especially because of its performance, which is crucial in my application. I know that MySQL can be fast too, I just feel like implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of divisions and multiplications with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision either toward MySQL or Redis?
Thank you!

Redis is an in-memory store. All the data must fit in memory. So except if you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space efficient time series: https://github.com/antirez/redis-timeseries
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent to aggregate statistics in real time and store the result of these caclulations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on these data. Btree based stores supporting sharding (MongoDB for instance) are probably more convenient than Redis to store large time series.
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files. It is space efficient. It can be used with partitioning, so despite it does not index the data, it can be efficient at retrieving a subset of data if the partition granularity is carefully chosen.

You should consider Cassandra or Hbase. Both allow contiguous storage and fast appends, so that when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is along one of your query dimensions (usually by ticker), you're accessing disk (ssd or spinning), contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar guys, without having to do any client side shenanigans to merge blobs together.
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative on Cassandra at least is that you don't have consistency for a few seconds sometimes if you have a big cluster, so you need either to force it, slowing it down, or you accept that the very very latest print sometimes will be a few seconds old. On a single machine there will be zero consistency problems, and you'll get the same columnar benefits.
Less familiar with Hbase but it claims to be more consistent (there will be a cost elsewhere - CAP theorem), but it's much more of a commitment to setup the Hbase stack.

You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (also removing regularly), check out MySQL partitioning. See e.g. the answer I wrote to What is the best way to delete old rows from MySQL on a rolling basis?
>
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to

handling large dataset using MySQL

I am trying to apply for a job, which asks for the experiences on handling large scale data sets using relational database, like mySQL.
I would like to know which specific skill sets are required for handling large scale data using MySQL.

Handling large scale data with MySQL isn't just a specific set of skills, as there are a bazillion ways to deal with a large data set. Some basic things to understand are:
Column Indexes, how, why, and when they're used, and the pros and cons of using them.
Good database structure to balance between fast writes and easy reads.
Caching, leveraging several layers of caching and different caching technologies (memcached, redis, etc)
Examining MySQL queries to identify bottlenecks and understanding the MySQL internals to see how queries get planned an executed by the database server in order to increase query performance
Configuring the MySQL server to be able to handle a lot of concurrent connections, and access it's data fast. Hardware bottlenecks, and the advantages to using different technologies to speed up your hardware (for example, storing your MySQL data on a RAID5 Array to increase IO performance))
Leveraging built-in MySQL technology (like Replication) to off-load read traffic
These are just a few things that get thought about in regards to big data in MySQL. There's a TON more, which is why the company is looking for experience in the area. Knowing what to do, or having experience with things that have worked or failed for you is an absolutely invaluable asset to bring to a company that deals with high traffic, high availability, and high volume services.
edit
I would be remis if I didn't mention a source for more information. Check out High Performance MySQL. This is an incredible book, and has a plethora of information on how to make MySQL perform in all scenarios. Definitely worth the money, and the time spent reading it.
edit -- good structure for balanced writes and reads
With this point, I was referring to the topic of normalization / de-normalization. If you're familiar with DB design, you know that normalization is the separation of data as to reduce (eliminate) the amount of duplicate data you have about any single record. This is generally a fantastic idea, as it makes tables smaller, faster to query, easier to index (individually) and reduces the number of writes you have to do in order to create/update a new record.
There are different levels of normalization (as #Adam Robinson pointed out in the comments below) which are referred to as normal forms. Almost every web application I've worked with hasn't had much benefit beyond the 3NF (3rd Normal Form). Which the definition of, if you were to read that wikipedia link above, will probably make your head hurt. So in lamens (at the risk of dumbing it down too far...) a 3NF structure satisfies the following rules:
No duplicate columns within the same table.
Create different tables for each set related data. (Example: a Companies table which has a list of companies, and an Employees table which has a list of each companies' employees)
No sub-sets of columns which apply to multiple rows in a table. (Example: zip_code, state, and city is a sub-set of data which can be identified uniquely by zip_code. These 3 columns could be put in their own table, and referenced by the Employees table (in the previous example) by the zip_code). This eliminates large sets of duplication within your tables, so any change that is required to the city/state for any zip code is a single write operation instead of 1 write for every employee who lives in that zip code.
Each sub-set of data is moved to it's own table and is identified by it's own primary key (this is touched/explained in the example for #3).
Remove columns which are not fully dependent on the primary key. (An example here might be if your Employees table has start_date, end_date, and years_employed columns. The start_date and end_date are both unique and dependent on any single employee row, but the years_employed can be derived by subtracting start_date from end_date. This is important because as end-date increases, so does years_employed so if you were to update end_date you'd also have to update years_employed (2 writes instead of 1)
A fully normalized (3NF) database table structure is great, if you've got a very heavy write-load. If your server is doing a lot of writes, it's very easy to write small bits of data, especially when you're running fewer of them. The drawback is, all your reads become much more expensive, because you have to (typically) run a lot of JOIN queries when you're pulling data out. JOINs are typically expensive and harder to create proper indexes for when you're utilizing WHERE clauses that span the relationship and when sorting the result-sets If you have to perform a lot of reads (SELECTs) on your data-set, using a 3NF structure can cause you some performance problems. This is because as your tables grow you're asking MySQL to cram more and more table data (and indexes) into memory. Ideally this is what you want, but with big data-sets you're just not going to have enough memory to fit all of this at once. This is when MySQL starts to create temporary tables, and has to use the disk to load data and manipulate it. Once MySQL becomes reliant on the hard disk to serve up query results you're going to see a significant performance drop. This is less-so the case with solid state disks, but they are super expensive, and (imo) are not mature enough to use on mission critical data sets yet (i mean, unless you're prepared for them to fail and have a very fast backup recovery system in place...then use them and gonuts!).
This is the balancing part. You have to decide what kind of traffic the data you're reading/writing is going to be serving more of, and design that to be fast. In some instances, people don't mind writes being slow because they happen less frequently. In other cases, writes have to be very fast, and the reads don't have to be fast because the data isn't accessed that often (or at all, or even in real time).
Workloads that require a lot of reads benefit the most from a middle-tier caching layer. The idea is that your writes are still fast (because you're 'normal') and your reads can be slow because you're going to cache it (in memcached or something competitive to it), so you don't hit the database very frequently. The drawback here is, if your cache gets invalidated quickly, then the cache is not reducing the read load by a meaningful amount and that results in no added performance (and possibly even more overhead to check/invalidate the caches).
With workloads that have the requirement for high throughput in writes, with data that is read frequently, and can't be cached (constantly changes), you have to come up with another strategy. This could mean that you start to de-normalize your tables, by removing some of the normalization requirements you choose to satisfy, or something else. Instead of making smaller tables with less repetitive data, you make larger tables with more repetitive / redundant data. The advantage here is that your data is all in the same table, so you don't have to perform as many (or, any) JOINs to pull the data out. The drawback...writes are more expensive because you have to write in multiple places.
So with any given situation the developer(s) have to identify what kind of use the data structure is going to have to serve, and balance between any number of technologies and paradigms to achieve an acceptable solution that meets their needs. No two systems or solutions are the same which is why the employer is looking for someone with experience on how to deal with these large datasets. Finding these solutions is not something that can really be learned out of a book, it typically takes some experience in the field and experience with how different solutions performed.
I hope that helps. I know I rambled a bit, but it's really a lot of information. This is why DBAs make the big dollars (:

You need to know how to process the data in "chunks". That means instead of simply trying to manipulate the entire data set, you need to break it into smaller more manageable pieces. For example, if you had a table with 1 Billion records, a single update statement against the entire table would likely take a long time to complete, and may possibly bring the server to it's knees.
You could, however, issue a series of update statements within a loop that would update 20,000 records at a time. Each iteration of the loop you would increment your range/counters/whatever to identify the next set of records.
Also, you commit your changes at the end of each loop, thereby allowing you to stop the process and continue where you left off.
This is just one aspect of managing large data sets. You still need to know:
how to perform backups
proper indexing
database maintenance

You can raed/learn how to handle large dataset with MySQL But it is not equivalent to having actual experiences.

Straight and simple answer: Study about partitioned database and find appropriate MySQL data structure types for large scale datasets similar with the partitioned database architecture.

Database architecture for millions of new rows per day

I need to implement a custom-developed web analytics service for large number of websites. The key entities here are:
Website
Visitor
Each unique visitor will have have a single row in the database with information like landing page, time of day, OS, Browser, referrer, IP, etc.
I will need to do aggregated queries on this database such as 'COUNT all visitors who have Windows as OS and came from Bing.com'
I have hundreds of websites to track and the number of visitors for those websites range from a few hundred a day to few million a day. In total, I expect this database to grow by about a million rows per day.
My questions are:
1) Is MySQL a good database for this purpose?
2) What could be a good architecture? I am thinking of creating a new table for each website. Or perhaps start with a single table and then spawn a new table (daily) if number of rows in an existing table exceed 1 million (is my assumption correct). My only worry is that if a table grows too big, the SQL queries can get dramatically slow. So, what is the maximum number of rows I should store per table? Moreover, is there a limit on number of tables that MySQL can handle.
3) Is it advisable to do aggregate queries over millions of rows? I'm ready to wait for a couple of seconds to get results for such queries. Is it a good practice or is there any other way to do aggregate queries?
In a nutshell, I am trying a design a large scale data-warehouse kind of setup which will be write heavy. If you know about any published case studies or reports, that'll be great!

If you're talking larger volumes of data, then look at MySQL partitioning. For these tables, a partition by data/time would certainly help performance. There's a decent article about partitioning here.
Look at creating two separate databases: one for all raw data for the writes with minimal indexing; a second for reporting using the aggregated values; with either a batch process to update the reporting database from the raw data database, or use replication to do that for you.
EDIT
If you want to be really clever with your aggregation reports, create a set of aggregation tables ("today", "week to date", "month to date", "by year"). Aggregate from raw data to "today" either daily or in "real time"; aggregate from "by day" to "week to date" on a nightly basis; from "week to date" to "month to date" on a weekly basis, etc. When executing queries, join (UNION) the appropriate tables for the date ranges you're interested in.
EDIT #2
Rather than one table per client, we work with one database schema per client. Depending on the size of the client, we might have several schemas in a single database instance, or a dedicated database instance per client. We use separate schemas for raw data collection, and for aggregation/reporting for each client. We run multiple database servers, restricting each server to a single database instance. For resilience, databases are replicated across multiple servers and load balanced for improved performance.

Some suggestions in a database agnostic fashion.
The most simplest rational is to distinguish between read intensive and write intensive tables. Probably it is good idea to create two parallel schemas daily/weekly schema and a history schema. The partitioning can be done appropriately. One can think of a batch job to update the history schema with data from daily/weekly schema. In history schema again, you can create separate data tables per website (based on the data volume).
If all you are interested is in the aggregation stats alone (which may not be true). It is a good idea to have a summary tables (monthly, daily) in which the summary is stored like total unqiue visitors, repeat visitors etc; and these summary tables are to be updated at the end of day. This enables on the fly computation of stats with out waiting for the history database to be updated.

You should definitely consider splitting the data by site across databases or schemas - this not only makes it much easier to backup, drop etc an individual site/client but also eliminates much of the hassle of making sure no customer can see any other customers data by accident or poor coding etc. It also means it is easier to make choices about partitionaing, over and above databae table-level partitioning for time or client etc.
Also you said that the data volume is 1 million rows per day (that's not particularly heavy and doesn't require huge grunt power to log/store, nor indeed to report (though if you were genererating 500 reports at midnight you might logjam). However you also said that some sites had 1m visitors daily so perhaps you figure is too conservative?
Lastly you didn't say if you want real-time reporting a la chartbeat/opentracker etc or cyclical refresh like google analytics - this will have a major bearing on what your storage model is from day one.
M

You really should test your way forward will simulated enviroments as close as possible to the live enviroment, with "real fake" data (correct format & length). Benchmark queries and variants of table structures. Since you seem to know MySQL, start there. It shouldn't take you that long to set up a few scripts bombarding your database with queries. Studying the results of your database with your kind of data will help you realise where the bottlenecks will occur.
Not a solution but hopefully some help on the way, good luck :)

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008