Best way to log sensor data with MySQL - mysql

I am new to SQL, please advise.
I want to log incoming data from a sensor every 5 seconds for future graph plotting.
What is the best way to design the database in MySQL?
Could I log each reading with a timestamp and use AVG functions when I want to display a graph by hour, day, week, or month?
Or could I log and compute averages every minute, hour, and day to reduce the database size?
Is it possible to use a trigger to compute the average once more than 1 minute of data has been collected?

The answer is that it depends on how much data you are actually going to be logging, how often you are going to be querying it, and how fast your response time needs to be. If it's just one sensor, every 5 seconds, you could probably go on for eternity without running into too many problems with regular sql queries to pull out averages, sums, etc. in a reasonable period of time.
I will say that from experience, you can do a lot with SQL and time series data, but you have to be very careful how you design your queries. I've worked with time series tables with billions of rows and tens of thousands of individual sensors among those rows; it's possible to achieve very fast execution over that many time series rows, but you might spend a week trying to fine-tune the database. It's definitely a trade-off between flexibility and speed.
Again, for your purposes, it probably is not going to make very much difference if you are just talking about one sensor; just write a regular SQL query. However, if you anticipate adding several hundred more sensors or increasing the sample rate, you may want to consider doing periodic "rollup" functions as you suggest. And in that case, I would be more inclined to write a custom solution using a NoSQL database (e.g. Cassandra, Couchbase, etc.) and using a program that runs periodically to do the rollup. If you are interested, I can provide details, but I really don't think you will need to go that far.
This post has a pretty good discussion on storing time series data in SQL vs. NoSQL: https://dba.stackexchange.com/questions/7634/timeseries-sql-or-nosql
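For the single-sensor case, the "log with a timestamp and average at query time" approach can be sketched as follows. This is only an illustration: in-memory SQLite stands in for MySQL so the snippet runs on its own, and the table name `readings` and its columns are hypothetical. In MySQL the grouping expression would be something like `DATE_FORMAT(ts, '%Y-%m-%d %H:00')` instead of `strftime`.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    " ts TEXT NOT NULL,"       # UTC timestamp, 'YYYY-MM-DD HH:MM:SS'
    " value REAL NOT NULL)"
)
# An index on the timestamp keeps range scans and grouping cheap as the table grows.
conn.execute("CREATE INDEX idx_readings_ts ON readings (ts)")

# Four made-up samples spread over two hours.
conn.executemany(
    "INSERT INTO readings (ts, value) VALUES (?, ?)",
    [
        ("2024-01-01 10:00:00", 20.0),
        ("2024-01-01 10:00:05", 22.0),
        ("2024-01-01 11:00:00", 30.0),
        ("2024-01-01 11:00:05", 34.0),
    ],
)


def hourly_averages(connection):
    """Return (hour, average) rows, one per hour that has data."""
    return connection.execute(
        "SELECT strftime('%Y-%m-%d %H:00', ts) AS hour, AVG(value)"
        " FROM readings GROUP BY hour ORDER BY hour"
    ).fetchall()


result = hourly_averages(conn)
print(result)
```

Changing the format string to a day or month granularity gives the other graph resolutions from the same raw table.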

You should read about RRDtool.
From RRDtool website:
RRDtool is the OpenSource industry standard, high performance data
logging and graphing system for time series data.
http://oss.oetiker.ch/rrdtool/
If you don't want to use it (it may be too complicated or too big for your application, etc.), take a look at how it is built and how it stores information.

Related

Recurring data demand - automated query, or store data directly in SQL?

This is a simple question even though the title sounds complicated.
Let's say I'm storing data from a bunch of applications into one central database/ data warehouse. This is data at a pretty fine level -- say, daily summaries of various metrics.
HOWEVER, I know in the front-end I will be frequently displaying weekly and monthly aggregates of this data as well.
One idea would be to have scripting language do this for me after querying the SQL database - but that seems horribly inefficient, perhaps.
The second idea would be to have views in the database that represent business weeks and months -- this might be the best way to do it.
But my final idea is -- couldn't a SQL client simply run a query that aggregates all the daily data into weeks (or months) and store them in a separate table? The advantage of this is that it would reduce querying time of any user, since all the query work is done before a website or button is even loaded/ pushed. Even with a view, I guess that aggregation calculation would have to be done as soon as the view was queried.
The only downside to having the queries aggregated from the weeks/ months perhaps even once a day (instead of every time the website is loaded) -- is that it won't be up-to-date/ may reflect inconsistencies.
I'm not really an expert when it comes to this bigger picture stuff -- anyone have any thoughts? thanks
It depends on the user experience you're trying to create.
Is the user base expecting to watch monthly aggregates with one finger on the F5 key when watching this month's statistics? To cover this scenario, you might want a view whose criteria always present a window relative to getdate(). Keep in mind that good indexing strategies and query design should reduce the impact of this sort of approach to nearly nothing.
Is the user expecting informational data that doesn't include today's data? More performance might be seen out of a nightly job that does the aggregation into a new table.
Of all the scenarios, though, I would not recommend manual aggregation. Down that road are unexpected bugs and exceptions that a good SQL statement can handle for you. Aggregates are a big part of every DBMS; let the database software handle them and work on the rest of your application.
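The "nightly job that aggregates into a new table" option might look roughly like this. It is a sketch under assumptions: in-memory SQLite stands in for the real warehouse, and the table names `daily_metrics` and `weekly_metrics` are made up. The aggregation itself is still a single SQL statement, so the database does the work rather than the scripting language.

```python
import sqlite3

# In-memory SQLite as a stand-in; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_metrics (day TEXT PRIMARY KEY, visits INTEGER NOT NULL);
CREATE TABLE weekly_metrics (week_start TEXT PRIMARY KEY, visits INTEGER NOT NULL);
""")
conn.executemany(
    "INSERT INTO daily_metrics (day, visits) VALUES (?, ?)",
    [("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-08", 5)],
)


def refresh_weekly(connection):
    """Rebuild the weekly table from the daily rows inside one transaction."""
    with connection:
        connection.execute("DELETE FROM weekly_metrics")
        connection.execute(
            # 'weekday 0' moves forward to Sunday; minus 6 days gives that week's Monday.
            "INSERT INTO weekly_metrics (week_start, visits)"
            " SELECT date(day, 'weekday 0', '-6 days'), SUM(visits)"
            " FROM daily_metrics GROUP BY 1"
        )


refresh_weekly(conn)  # a scheduler (cron or similar) would call this once a night
weekly = conn.execute(
    "SELECT week_start, visits FROM weekly_metrics ORDER BY week_start"
).fetchall()
print(weekly)
```

The front end then reads `weekly_metrics` directly; the cost is that the numbers are only as fresh as the last run.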

Using Hive for real time queries

First of all I wanted to clarify that I am learning about Hive and Hadoop (and big data in general), so excuse the lack of proper vocabulary.
I am embarking on a huge (at least for me) project that requires dealing with enormous quantities of data, which I am not used to, as I have mostly worked with MySQL in the past.
For this project a series of sensors will produce approximately 125.000.000 data points 5 times an hour (15.000.000.000 a day), which is several times more than everything I have ever inserted into every MySQL table combined.
I understand that one approach would be using Hadoop MapReduce and Hive to query and analyze the data.
The problem I am facing is that, from what I have learned, Hive runs mostly like "cron jobs" rather than serving real-time queries; its queries may take many hours and require a different infrastructure.
I thought of creating MySQL tables based on the results of Hive queries, since at most approximately 1.000.000.000 rows would need to be queried in real time, but I was wondering whether this is the right way to go or whether I should look into some other technology.
Is there any technology I should study which is specifically created for real time queries on big data?
Any tip will be much appreciated!
This is a complicated question. Let's start by addressing the technologies that you mention in your question, and go from there:
MySQL: It should be obvious to anyone who has used MySQL (or any other relational DB) that a traditional out-of-the-box installation of MySQL will never support the volumes that you are talking about. A back-of-the-envelope calculation is enough to tell us that: assuming that your sensor inserts are only 100 bytes each, you are talking about 15 billion x 100 bytes = 1.5 trillion bytes, or about 1.36 TiB (1.5 TB), per day. That's truly big data, especially if you are planning on storing it for more than a day or two.
Hive: Hive can certainly handle that kind of data volume (I and many others have done it), but as you point out, you don't get real-time queries. Every query will be in batch, and if you need fast queries you'll need to pre-aggregate data.
Now that brings us to the real question: what kind of queries do you need to run? If you need to run arbitrary, real-time queries and can never predict what those queries might be, then you probably need to look towards comparatively expensive, proprietary data stores like Vertica, Greenplum, Microsoft PDW, etc. These will cost a lot of money, but they and others can handle the load you are talking about.
If on the other hand you can predict with some degree of accuracy the type of queries that will be run, then something like Hive might make sense. Store the raw data there, and use the batch query capabilities to do the heavy lifting and periodically create aggregated data tables in MySQL or another relational database to support your needs for low-latency queries.
One more alternative is something like HBase. HBase gives you low-latency access to distributed data, but you lose two critical items that you are probably accustomed to: a query language (HBase doesn't have SQL) and the ability to aggregate data. To do aggregations in HBase, you need to run a MapReduce job, though that job can then go and store its results back into HBase for low-latency access again.
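The volume estimate in the MySQL paragraph above can be checked with a few lines of arithmetic. The 100-byte row size is the answer's own assumption; the variable names are just for illustration.

```python
ROWS_PER_DAY = 125_000_000 * 5 * 24   # 125 million points, 5 times an hour, 24 hours
BYTES_PER_ROW = 100                   # assumed row size, taken from the answer

bytes_per_day = ROWS_PER_DAY * BYTES_PER_ROW
tb_per_day = bytes_per_day / 10**12   # decimal terabytes
tib_per_day = bytes_per_day / 2**40   # binary tebibytes

print(f"{ROWS_PER_DAY:,} rows/day")                                 # 15,000,000,000 rows/day
print(f"{tb_per_day:.2f} TB/day, {tib_per_day:.2f} TiB/day")        # 1.50 TB/day, 1.36 TiB/day
```

Either way the raw feed is well over a terabyte a day before indexes or replication are counted.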

MySQL performance: views vs. functions vs. stored procedures

I have a table that contains some statistic data which is collected per hour.
Now I want to be able to quickly get statistics per day / week / month / year / total.
What is the best way to do so performance-wise? Creating views? Functions? Stored procedures? Or normal tables that I have to write to simultaneously when updating data? (I would like to avoid the latter.)
My current idea would be to create a view_day which sums up the hours, then a view_week and view_month and view_year which sum up data from view_day, and view_total which sums up view_year. Is it good or bad?
You essentially have two systems here: One that collects data and one that reports on that data.
Running reports against your frequently-updated, transactional tables will likely result in read-locks that block writes from completing as quickly as they can and therefore possibly degrade performance.
It is generally HIGHLY advisable to run a periodic "gathering" task that gathers information from your (probably highly normalized) transactional tables and stuffs that data into denormalized reporting tables forming a "data warehouse". You then point your reporting engine / tools at the denormalized "data warehouse", which can be queried without impacting the live transactional database.
This gathering task should only run as often as your reports need to be "accurate". If you can get away with once a day, great. If you need to do this once an hour or more, then go ahead, but monitor the performance impact on your writing tasks when you do.
Remember, if the performance of your transactional system is important (and it generally is), avoid running reports against it at all costs.
Yes, having the tables that store already aggregated data is a good practice.
Views, as well as SPs and functions, will just run queries over the big tables, which is not that efficient.
The only really fast and scalable solution is, as you put it, "normal tables where you have to write to simultaneously when updating data", with proper indexes. You can automate updating such tables using triggers.
My view is that complex calculations should happen only once, as the data changes, not every time you query. Create an aggregate table and populate it either through a trigger (if no lag is acceptable) or through a job that runs once a day or once an hour or at whatever lag time is acceptable for reporting. If you go the trigger route, test, test, test. Make sure it can handle multiple-row inserts/updates/deletes as well as the more common single-row ones. Make sure it is as fast as possible and has no bugs whatsoever. Triggers add a bit of processing to every data action, so you have to make sure it adds the smallest possible bit and that no bugs will ever happen that will prevent users from inserting/updating/deleting data.
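A trigger-maintained aggregate table, as suggested in the answers above, could be sketched like this. It is an illustration only: SQLite (in memory) stands in for MySQL, the names `stats_hourly`, `stats_daily` and `stats_hourly_ai` are hypothetical, and only inserts are covered. In MySQL the trigger body would more likely be a single `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

# In-memory SQLite as a stand-in; all table and trigger names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stats_hourly (ts TEXT NOT NULL, hits INTEGER NOT NULL);
CREATE TABLE stats_daily (day TEXT PRIMARY KEY, hits INTEGER NOT NULL);

-- Row-level trigger: runs once for every row inserted into stats_hourly.
CREATE TRIGGER stats_hourly_ai AFTER INSERT ON stats_hourly
BEGIN
    -- Make sure the day's row exists, then add the new hits to it.
    INSERT OR IGNORE INTO stats_daily (day, hits) VALUES (date(NEW.ts), 0);
    UPDATE stats_daily SET hits = hits + NEW.hits WHERE day = date(NEW.ts);
END;
""")

# A multi-row insert, to check that the trigger copes with more than one row.
conn.executemany(
    "INSERT INTO stats_hourly (ts, hits) VALUES (?, ?)",
    [
        ("2024-01-01 00:00:00", 3),
        ("2024-01-01 01:00:00", 4),
        ("2024-01-02 00:00:00", 5),
    ],
)

daily = conn.execute("SELECT day, hits FROM stats_daily ORDER BY day").fetchall()
print(daily)
```

Updates and deletes on `stats_hourly` would need their own triggers to keep the totals correct, which is exactly where the "test, test, test" advice applies.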
We have a similar problem, and what we do is utilize a master/slave relationship. We do transactional data (both reads and writes, since in our case some reads need to be ultra fast and can't wait for replication for the transaction), on the master. The slave is quickly replicating data, but then we run every non-transactional query off that, including reporting.
I highly suggest this method as it's simple to put into place as a quick and dirty data warehouse if your data is granular enough to be useful in the reporting layers/apps.

Database architecture for millions of new rows per day

I need to implement a custom-developed web analytics service for a large number of websites. The key entities here are:
Website
Visitor
Each unique visitor will have a single row in the database with information like landing page, time of day, OS, browser, referrer, IP, etc.
I will need to do aggregated queries on this database such as 'COUNT all visitors who have Windows as OS and came from Bing.com'
I have hundreds of websites to track, and the number of visitors for those websites ranges from a few hundred a day to a few million a day. In total, I expect this database to grow by about a million rows per day.
My questions are:
1) Is MySQL a good database for this purpose?
2) What could be a good architecture? I am thinking of creating a new table for each website. Or perhaps starting with a single table and then spawning a new table (daily) if the number of rows in an existing table exceeds 1 million (is my assumption correct?). My only worry is that if a table grows too big, the SQL queries can get dramatically slow. So, what is the maximum number of rows I should store per table? Moreover, is there a limit on the number of tables that MySQL can handle?
3) Is it advisable to do aggregate queries over millions of rows? I'm ready to wait for a couple of seconds to get results for such queries. Is it a good practice or is there any other way to do aggregate queries?
In a nutshell, I am trying to design a large-scale, data-warehouse kind of setup which will be write heavy. If you know about any published case studies or reports, that'll be great!
If you're talking larger volumes of data, then look at MySQL partitioning. For these tables, a partition by date/time would certainly help performance. There's a decent article about partitioning here.
Look at creating two separate databases: one for all raw data for the writes with minimal indexing; a second for reporting using the aggregated values; with either a batch process to update the reporting database from the raw data database, or use replication to do that for you.
EDIT
If you want to be really clever with your aggregation reports, create a set of aggregation tables ("today", "week to date", "month to date", "by year"). Aggregate from raw data to "today" either daily or in "real time"; aggregate from "by day" to "week to date" on a nightly basis; from "week to date" to "month to date" on a weekly basis, etc. When executing queries, join (UNION) the appropriate tables for the date ranges you're interested in.
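The "UNION the appropriate tables" idea can be sketched as a query that takes closed days from a roll-up table and counts only today's rows from the raw table. This is a sketch under assumptions: in-memory SQLite stands in for MySQL, and `visits_raw`, `visits_by_day` and the function name are invented for the example.

```python
import sqlite3

# In-memory SQLite as a stand-in; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits_raw (ts TEXT NOT NULL, site_id INTEGER NOT NULL);
CREATE TABLE visits_by_day (
    day TEXT NOT NULL,
    site_id INTEGER NOT NULL,
    visits INTEGER NOT NULL,
    PRIMARY KEY (day, site_id)
);
""")
# Closed days have already been rolled up by a batch job.
conn.executemany(
    "INSERT INTO visits_by_day (day, site_id, visits) VALUES (?, ?, ?)",
    [("2024-01-01", 1, 100), ("2024-01-02", 1, 150)],
)
# Today's visits exist only as raw rows so far.
conn.executemany(
    "INSERT INTO visits_raw (ts, site_id) VALUES (?, ?)",
    [("2024-01-03 08:00:00", 1), ("2024-01-03 09:30:00", 1)],
)


def visits_per_day(connection, site_id, today):
    """Pre-aggregated rows for past days plus a live count for today."""
    return connection.execute(
        "SELECT day, visits FROM visits_by_day WHERE site_id = ? AND day < ?"
        " UNION ALL"
        " SELECT date(ts), COUNT(*) FROM visits_raw"
        " WHERE site_id = ? AND ts >= ? GROUP BY date(ts)"
        " ORDER BY 1",
        (site_id, today, site_id, today),
    ).fetchall()


report = visits_per_day(conn, 1, "2024-01-03")
print(report)
```

Only the small "today" slice is aggregated at query time; everything older is a cheap primary-key lookup.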
EDIT #2
Rather than one table per client, we work with one database schema per client. Depending on the size of the client, we might have several schemas in a single database instance, or a dedicated database instance per client. We use separate schemas for raw data collection, and for aggregation/reporting for each client. We run multiple database servers, restricting each server to a single database instance. For resilience, databases are replicated across multiple servers and load balanced for improved performance.
Some suggestions in a database agnostic fashion.
The simplest rationale is to distinguish between read-intensive and write-intensive tables. It is probably a good idea to create two parallel schemas: a daily/weekly schema and a history schema. Partitioning can then be done appropriately. One can think of a batch job to update the history schema with data from the daily/weekly schema. In the history schema, again, you can create separate data tables per website (based on the data volume).
If all you are interested in is the aggregate stats (which may not be true), it is a good idea to have summary tables (monthly, daily) in which summaries such as total unique visitors, repeat visitors, etc. are stored; these summary tables are updated at the end of each day. This enables on-the-fly computation of stats without waiting for the history database to be updated.
You should definitely consider splitting the data by site across databases or schemas. This not only makes it much easier to back up, drop, etc. an individual site/client but also eliminates much of the hassle of making sure no customer can see any other customer's data by accident or poor coding, etc. It also means it is easier to make choices about partitioning, over and above database table-level partitioning by time or client, etc.
Also, you said that the data volume is 1 million rows per day. That's not particularly heavy and doesn't require huge grunt power to log/store, nor indeed to report on (though if you were generating 500 reports at midnight you might logjam). However, you also said that some sites had 1m visitors daily, so perhaps your figure is too conservative?
Lastly you didn't say if you want real-time reporting a la chartbeat/opentracker etc or cyclical refresh like google analytics - this will have a major bearing on what your storage model is from day one.
You really should test your way forward with simulated environments as close as possible to the live environment, with "real fake" data (correct format & length). Benchmark queries and variants of table structures. Since you seem to know MySQL, start there. It shouldn't take you that long to set up a few scripts bombarding your database with queries. Studying the results of your database with your kind of data will help you realise where the bottlenecks will occur.
Not a solution but hopefully some help on the way, good luck :)

Storing affiliate leads and conversions

I've created an affiliate system that tracks leads and conversions. The lead and conversion records will go into the millions, so I need a good way to store them. Users will need to track the stats hourly, daily, weekly and monthly.
What's the best way to store the leads and conversions?
For this type of system, you need to keep all of the detail records. The reason is that at some point someone is going to contest an invoice.
However, you should have some roll up tables. Each hour, compute current hours work and store the results. Do the same for daily, weekly, and monthly.
If some skewing is okay, you can compute the daily amounts off of the 24 hourly computed records, and the weekly amounts off of the last 7 daily records. For monthly you might want to compute back off of the hourly records, because a month doesn't quite add up to 4 full weeks. Also, it helps reduce noise from any averaging you might be doing.
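One possible source of the skew mentioned above is averaging values that are already averages. A common way around it, sketched here with hypothetical names and made-up numbers, is to store a sum and a count in each roll-up row so that coarser roll-ups can be rebuilt exactly from finer ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollup:
    """One roll-up row: keep the sum and the count, not the average."""
    total: float
    count: int

    @property
    def average(self) -> float:
        return self.total / self.count


def merge(rollups):
    """Combine finer-grained roll-ups (e.g. hourly) into one coarser roll-up (e.g. daily)."""
    rollups = list(rollups)
    return Rollup(
        total=sum(r.total for r in rollups),
        count=sum(r.count for r in rollups),
    )


# Two made-up hourly roll-ups with very different volumes.
busy_hour = Rollup(total=900.0, count=90)    # average 10.0
quiet_hour = Rollup(total=200.0, count=10)   # average 20.0

daily = merge([busy_hour, quiet_hour])
naive = (busy_hour.average + quiet_hour.average) / 2

print(daily.average)  # 11.0, the true average over all 100 records
print(naive)          # 15.0, the average of the two hourly averages
```

With sums and counts stored, daily, weekly and monthly figures derived from the hourly rows stay exact for totals and averages alike.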
I'd recommend a two step archival process. The first one should run once a day and move the records into a separate "hot" database. Try to keep 3 months hot for any type of research queries you need to do.
The second archive process is up to you. You could simply move any records older than 3 months into some type of csv file and simply back it up. After some period of time (a year?) delete them depending on your data retention agreements.
Depending on the load, you may need to have multiple web servers handling the lead and conversion pixels firing. One option is to store the raw data records on each web/mysql server, and then run an archival process every 5-10 minutes that stores them in a highly normalized table structure, and which performs any required roll-ups to achieve the performance you are looking for.
Make sure you keep row size as small as possible: store IPs as unsigned ints, store referrers as INTs that reference lookup tables, etc.
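Storing an IPv4 address as an unsigned integer takes 4 bytes instead of up to 15 characters. A minimal sketch of the conversion on the application side, using Python's standard `ipaddress` module (the helper names are made up; MySQL's `INET_ATON()` and `INET_NTOA()` perform the same conversion on the server side):

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Convert dotted-quad IPv4 text to the integer stored in an unsigned INT column."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(number: int) -> str:
    """Convert the stored integer back to dotted-quad text for display."""
    return str(ipaddress.IPv4Address(number))


packed = ip_to_int("192.168.1.10")
print(packed)             # 3232235786
print(int_to_ip(packed))  # 192.168.1.10
```

This covers IPv4 only; IPv6 addresses do not fit in a 32-bit column and would need a different representation.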