I have a log table that contains a large number of user transactions (logs). I am trying to create a webpage that displays statistics (count, average, and some complex calculations...) of the user transactions, but want to fetch from a Statistics table instead of querying the original transaction table because of the performance concern. One possible way might be updating the Statistics table whenever a row is inserted. And, another way can be updating the Statistics table periodically.
Both options sound inefficient, so I am wondering there is any particular method to achieve it in common database systems?
If you don't need statistics in real time (if near real time is ok for you, usually it is for most people), one thing that reports that need some complex calculations usually do is to geneate these reports in a periodic manner (let's say every X minutes, depends on how big is your data of course).
This way your users can access static data, which is pretty much easy to serve, and you won't push too much load into your analytics server.
Related
I am looking for a solution to sync data from multiply small instances to one big cloud instance.
I have many devices gathering data logs, every device has there own database, so I need a solution to sync data from them to one instance. The delay is not important but I want to sync the data with a max delay of 5-10 min.
Is there any ready solution for it?
Assuming all the data is independent, INSERT all the data into a single table. That table would, of course, have a device_id column to distinguish where the numbers are coming from.
What is the total number of rows per second you need to handle? If less than 1000/second, there should be no problem inserting the rows into the same table as the arrive.
Are you using HTTP? Or something else to do the INSERTs? PHP? Java?
With this, you will rarely see more than a 1 second delay between the reading being taken and the table having the value.
I recommend
PRIMARY KEY(device_id, datetime)
And the use of Summary tables rather than slogging through that big Fact table to do graphs and reports.
Provide more details if you would like further advice.
I currently already have a website running using CodeIgniter and MySQL. The MySQL database is around 110 tables big and contains mainly website specific data, like user data, vacancy data, etc.
Now I want to extend this website to include a complete statistical module as well. We would capture a lot of user actions and other aggregations from the data gather on our own website, and would also pull in some data from google analytics API to use in our statistics (we will generate a report in Excel but also show statistical graphs and numbers on a page (using chart.js)).
We are not thinking (in a forseeable future) to use this data in other programs, but we need to be able to open some data to the public using an API.
We expect to start with about 300.000-350.000 data points gathered per day, but this amount will keep on growing every day of course, the more users we get.
Using multiple databases in CodeIgniter seems to not be an issue, so the main problem I am left with is how I should create the architecture for this statistical module.
I have a couple of idea's on how to start doing this, but I am not aware if there is performance impact from one to another solution or other things to take into consideration.
My main idea boils down to having a table containing all "events", which just insert in that table every time an action is performed, eg "user is registered", "user put account on private", "user clicked on X", ...
Then once a day (probably at around midnight), a CRON job would run over that table for the past day and aggregate all the values into a format usable for our statistical metrics. Those aggregated values would be stored in a new table. This way we can clean up the "event" table quite regularly since that will become very big very fast.
Idea 1: Extend the current MySQL database architecture with new tables to incorporate the statistics. I would keep on using the current database architecture and add 2 new tables for the events and the aggregated values.
Idea 2: Create a new database, separate from the current existing one, and use this to insert all the events in a table there and the aggregated values in a new table there.
Note: we already have quite a few CRONS running on our current database, updating statusses and dates, sending emails, ...
Note2: sync issues between databases is not an issue since we will never be storing statistics on a per-user level.
MySQL does not care whether tables are in the same database or separate databases. It is just a convenience for the user. Some things:
You might need db1.tbla JOIN db2.tblb to talk across dbs.
It is convenient to have different GRANTs for different databases, but clumsy to have different GRANTs for 110 tables.
I can't think of any performance differences.
Nightly aggregation is a middle-of-the road approach. Using IODKU gives you 'immediate' aggregation, but is probably more burden on the system.
My blog on Summary Tables .
350K rows inserted per day is about 5/second, which is comfortably low, so I don't think we need to discuss performance issues there.
"Summarize and toss" (for events) -- Yes. I like that approach. (Most people fail to think of this option.)
Do the math. Which table is the largest after a year? How many GB will it be? Then think about whether you can shrink any of the columns in it: SMALLINT instead of INT, normalization of long, oft-repeated, strings, etc.
I have a Mysql table which holds about 10 million records currently. Records are inserted by another batch application on continues basis and keep on growing.
On the front end user can search the data on this table based on different criterion. I am using query DSL and JPA repository to create dynamic queries and getting data from the table. But the performance of the query with pagination is very slow. I have tried indexing, InnoDB related tweaks,session management by HikariCP and ehcahe solutions but still it is taking about 100 seconds to get the data.
Also entities are simple POJO with no relation with other entities.
What is the best way/technology/framework to implement this scenario?
In a table of this size, dynamic query is a really, REALLY bad idea, you need to really control the access to the table and avoid table scans at all cost.
Ultimately, this sounds like a data warehouse solution, whereas the data is ETL'ed into a report-like format and not raw transactional data. Even so, you'll still need to define the access patterns you need and design your DWH to support that.
If you decide that the raw data is still the best format, another approach would be to define support metadata tables that could be queried to more quickly reduce the number of rows returned.
Could also look at clustering the data if you can find some way to logically break out data into chunks. However, when you say dynamic queries, this might not be possible.
My suggestion would be to create a dedicated cache and the web app should query the cache instead of the DB. If the ETL batch to your main table is at a defined period, you can keep the cache hot by triggering a loading from the main table to the cache. This can any in memory cache like Ignite or Infinispan.
However, this is not a sustainable solution and eventually you would need to restrict your users into seeing data within a manageable date range only and will have to either discard or send the old data asynchronous via flat file generated reports.
Not the entire history of the huge dataset can be made available in the UI to the user.
You could also try data virtualization tools to figure out what users are more comfortable with before deciding on the partition strategy in production.
In an application at our company we collect statistical data from our servers (load, disk usage and so on). Since there is a huge amount of data and we don't need all data at all times we've had a "compression" routine that takes the raw data and calculates min. max and average for a number of data-points, store these new values in the same table and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine and the new routine must keep all uncompressed data we have for one year in one table and "compressed" data in another table. My main concerns now are how to handle the data that is continuously written to the database and whether or not to use a "transaction table" (my own term since I cant come up with a better one, I'm not talking about the commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just use ovak_result as a "temporary storage"? ovak_result would be kept minimal which would be good for the compressing routine, but I would need to shuffle all data from one table into another continually, and there would be constant reading, writing and deleting in ovak_result.
Are there any mechanisms in MySQL to handle these kind of things?
(Please note: We are talking about quite large datasets here (about 100 M rows in the uncompressed table and about 1-10 M rows in the compressed table). Also, I can do pretty much what I want with both software and hardware configurations so if you have any hints or ideas involving MySQL configurations or hardware set-up, just bring them on.)
Try reading about the ARCHIVE storage engine.
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.
I need to implement a custom-developed web analytics service for large number of websites. The key entities here are:
Website
Visitor
Each unique visitor will have have a single row in the database with information like landing page, time of day, OS, Browser, referrer, IP, etc.
I will need to do aggregated queries on this database such as 'COUNT all visitors who have Windows as OS and came from Bing.com'
I have hundreds of websites to track and the number of visitors for those websites range from a few hundred a day to few million a day. In total, I expect this database to grow by about a million rows per day.
My questions are:
1) Is MySQL a good database for this purpose?
2) What could be a good architecture? I am thinking of creating a new table for each website. Or perhaps start with a single table and then spawn a new table (daily) if number of rows in an existing table exceed 1 million (is my assumption correct). My only worry is that if a table grows too big, the SQL queries can get dramatically slow. So, what is the maximum number of rows I should store per table? Moreover, is there a limit on number of tables that MySQL can handle.
3) Is it advisable to do aggregate queries over millions of rows? I'm ready to wait for a couple of seconds to get results for such queries. Is it a good practice or is there any other way to do aggregate queries?
In a nutshell, I am trying a design a large scale data-warehouse kind of setup which will be write heavy. If you know about any published case studies or reports, that'll be great!
If you're talking larger volumes of data, then look at MySQL partitioning. For these tables, a partition by data/time would certainly help performance. There's a decent article about partitioning here.
Look at creating two separate databases: one for all raw data for the writes with minimal indexing; a second for reporting using the aggregated values; with either a batch process to update the reporting database from the raw data database, or use replication to do that for you.
EDIT
If you want to be really clever with your aggregation reports, create a set of aggregation tables ("today", "week to date", "month to date", "by year"). Aggregate from raw data to "today" either daily or in "real time"; aggregate from "by day" to "week to date" on a nightly basis; from "week to date" to "month to date" on a weekly basis, etc. When executing queries, join (UNION) the appropriate tables for the date ranges you're interested in.
EDIT #2
Rather than one table per client, we work with one database schema per client. Depending on the size of the client, we might have several schemas in a single database instance, or a dedicated database instance per client. We use separate schemas for raw data collection, and for aggregation/reporting for each client. We run multiple database servers, restricting each server to a single database instance. For resilience, databases are replicated across multiple servers and load balanced for improved performance.
Some suggestions in a database agnostic fashion.
The most simplest rational is to distinguish between read intensive and write intensive tables. Probably it is good idea to create two parallel schemas daily/weekly schema and a history schema. The partitioning can be done appropriately. One can think of a batch job to update the history schema with data from daily/weekly schema. In history schema again, you can create separate data tables per website (based on the data volume).
If all you are interested is in the aggregation stats alone (which may not be true). It is a good idea to have a summary tables (monthly, daily) in which the summary is stored like total unqiue visitors, repeat visitors etc; and these summary tables are to be updated at the end of day. This enables on the fly computation of stats with out waiting for the history database to be updated.
You should definitely consider splitting the data by site across databases or schemas - this not only makes it much easier to backup, drop etc an individual site/client but also eliminates much of the hassle of making sure no customer can see any other customers data by accident or poor coding etc. It also means it is easier to make choices about partitionaing, over and above databae table-level partitioning for time or client etc.
Also you said that the data volume is 1 million rows per day (that's not particularly heavy and doesn't require huge grunt power to log/store, nor indeed to report (though if you were genererating 500 reports at midnight you might logjam). However you also said that some sites had 1m visitors daily so perhaps you figure is too conservative?
Lastly you didn't say if you want real-time reporting a la chartbeat/opentracker etc or cyclical refresh like google analytics - this will have a major bearing on what your storage model is from day one.
M
You really should test your way forward will simulated enviroments as close as possible to the live enviroment, with "real fake" data (correct format & length). Benchmark queries and variants of table structures. Since you seem to know MySQL, start there. It shouldn't take you that long to set up a few scripts bombarding your database with queries. Studying the results of your database with your kind of data will help you realise where the bottlenecks will occur.
Not a solution but hopefully some help on the way, good luck :)