Using CDC tables to recreate DB state as of a previous date - sql-server-2008

I have a complex data model consisting of hundreds of tables. I have CDC enabled for all tables and have that data in the corresponding CDC tables. I need a generic mechanism whereby, given an arbitrary SELECT query, I can return results corresponding to a point in time in the past.
I did not find any online recipes or blogs about this. I have managed to work out on my own so far that, in order to convert a normal SELECT query into its CDC-aware equivalent, it is important to consider the cardinality of JOINs and to have some logic around choosing which transactions matter. But it seems too complex and error-prone to write an equivalent query by hand on a per-query basis. Is there a tool out there which does that, or is this a market gap?

Change Data Capture in general would allow you to replicate the changes on the original database tables to a new target table.
One approach is to use LiveAudit or similar auditing functionality to keep a full audit trail of the changes on each table (insert all changes into an audit table for ongoing history). You would still need some fairly complex queries to identify the latest version of a row, by primary key, up to the time in question.
(https://www.ibm.com/support/knowledgecenter/SSTRGZ_11.4.0/com.ibm.cdcdoc.mcadminguide.doc/concepts/mappingliveaudit.html) Note: this is an IBM product, not MS SQL CDC.
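For illustration, the per-table pattern with SQL Server CDC looks roughly like the following. This is only a sketch: it assumes a hypothetical capture instance dbo_Orders keyed on OrderID with made-up columns, and it only covers rows that have at least one entry in the change table; rows untouched since capture was enabled would still have to come from the base table.

```sql
DECLARE @AsOf datetime = '2012-06-01T00:00:00';

-- Map the wall-clock time to the last LSN committed at or before it
DECLARE @ToLsn binary(10) =
    sys.fn_cdc_map_time_to_lsn('largest less than or equal', @AsOf);

WITH versions AS (
    SELECT ct.*,
           ROW_NUMBER() OVER (PARTITION BY ct.OrderID
                              ORDER BY ct.__$start_lsn DESC, ct.__$seqval DESC) AS rn
    FROM cdc.dbo_Orders_CT AS ct           -- change table of the dbo_Orders capture instance
    WHERE ct.__$start_lsn <= @ToLsn
      AND ct.__$operation <> 3             -- ignore update "before" images
)
SELECT OrderID, CustomerID, Amount         -- hypothetical columns
FROM versions
WHERE rn = 1
  AND __$operation <> 1;                   -- drop rows whose last change before @AsOf was a delete
```

Multiplying that by hundreds of tables and arbitrary JOINs is exactly where it gets complex, which is the gap the question describes.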
Temporal Tables may be a better approach, but they have their own design complexities.
Both approaches could be limited by how much history you retain.
(https://learn.microsoft.com/en-us/azure/sql-database/sql-database-temporal-tables-retention-policy )
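For comparison, with a system-versioned temporal table the point-in-time read itself is a one-liner (this needs SQL Server 2016+ or Azure SQL Database, not 2008; table and column names below are hypothetical):

```sql
-- Current state
SELECT OrderID, CustomerID, Amount
FROM dbo.Orders;

-- State as of a past point in time; rows are pulled from the history table automatically
SELECT OrderID, CustomerID, Amount
FROM dbo.Orders
    FOR SYSTEM_TIME AS OF '2019-06-01T00:00:00';
```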

Related

One database or multiple databases for statistical architecture

I currently have a website running on CodeIgniter and MySQL. The MySQL database contains around 110 tables, mainly website-specific data such as user data, vacancy data, etc.
Now I want to extend this website with a complete statistical module as well. We would capture a lot of user actions and other aggregations from the data gathered on our own website, and would also pull in some data from the Google Analytics API to use in our statistics (we will generate a report in Excel, but also show statistical graphs and numbers on a page using chart.js).
We are not planning (in the foreseeable future) to use this data in other programs, but we need to be able to open some data to the public using an API.
We expect to start with about 300,000-350,000 data points gathered per day, and this amount will of course keep growing as we get more users.
Using multiple databases in CodeIgniter seems to not be an issue, so the main problem I am left with is how I should create the architecture for this statistical module.
I have a couple of ideas on how to start doing this, but I am not sure whether one solution has a performance impact over the other, or whether there are other things to take into consideration.
My main idea boils down to having a table containing all "events", into which we insert a row every time an action is performed, e.g. "user registered", "user put account on private", "user clicked on X", ...
Then once a day (probably at around midnight), a CRON job would run over that table for the past day and aggregate all the values into a format usable for our statistical metrics. Those aggregated values would be stored in a new table. This way we can clean up the "event" table quite regularly since that will become very big very fast.
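A rough sketch of that layout (all table and column names are just illustrative):

```sql
-- Raw event log: one row per user action, purged after aggregation
CREATE TABLE events (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    event_type  VARCHAR(50)  NOT NULL,    -- e.g. 'user_registered', 'account_set_private'
    user_id     INT UNSIGNED NULL,
    created_at  DATETIME     NOT NULL,
    KEY idx_created (created_at)
);

-- Aggregated values kept long-term
CREATE TABLE daily_stats (
    stat_date   DATE         NOT NULL,
    event_type  VARCHAR(50)  NOT NULL,
    event_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (stat_date, event_type)
);

-- Nightly cron job: roll up yesterday's events, then clean them out
INSERT INTO daily_stats (stat_date, event_type, event_count)
SELECT DATE(created_at), event_type, COUNT(*)
FROM events
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY DATE(created_at), event_type;

DELETE FROM events
WHERE created_at < CURDATE();
```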
Idea 1: Extend the current MySQL database architecture with new tables to incorporate the statistics. I would keep on using the current database architecture and add 2 new tables for the events and the aggregated values.
Idea 2: Create a new database, separate from the current existing one, and use this to insert all the events in a table there and the aggregated values in a new table there.
Note: we already have quite a few cron jobs running on our current database, updating statuses and dates, sending emails, ...
Note 2: sync issues between databases are not a concern, since we will never be storing statistics on a per-user level.
MySQL does not care whether tables are in the same database or separate databases. It is just a convenience for the user. Some things:
You might need db1.tbla JOIN db2.tblb to talk across dbs.
It is convenient to have different GRANTs for different databases, but clumsy to have different GRANTs for 110 tables.
I can't think of any performance differences.
Nightly aggregation is a middle-of-the-road approach. Using IODKU (INSERT ... ON DUPLICATE KEY UPDATE) gives you 'immediate' aggregation, but is probably more of a burden on the system.
My blog on Summary Tables.
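A minimal IODKU sketch, assuming a summary table keyed on (date, event type) like the one described in the question:

```sql
-- 'Immediate' aggregation: each event bumps its summary row in one statement,
-- so there is no nightly roll-up, but every insert pays the extra write.
INSERT INTO daily_stats (stat_date, event_type, event_count)
VALUES (CURDATE(), 'user_registered', 1)
ON DUPLICATE KEY UPDATE event_count = event_count + 1;
```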
350K rows inserted per day is about 4 per second, which is comfortably low, so I don't think we need to discuss performance issues there.
"Summarize and toss" (for events) -- Yes. I like that approach. (Most people fail to think of this option.)
Do the math. Which table is the largest after a year? How many GB will it be? Then think about whether you can shrink any of the columns in it: SMALLINT instead of INT, normalization of long, oft-repeated, strings, etc.
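For a rough sense of scale: 350K rows per day is on the order of 128 million rows per year; at, say, 50 bytes per row that is already around 6-7 GB per year before indexes, which is why shrinking column types (and summarizing-and-tossing the raw events) pays off.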

How big should a MySQL table be before breaking it down into multiple tables?

Problem: We have a very big table, and it is growing. Most of its entries (say 80%) are historical data (with a "DATE" field before the current date) that are seldom queried, while a small part (say 20%) are current data ("DATE" field on or after the current date), and most queries target these current entries.
Consider two possible scenarios; which one would be better (considering overall implementation difficulty, performance, ...)?
A. Breaking the big table into two tables, historical and current data, and on a daily basis moving records with an expired date from the current table to the historical table.
B. Keeping the records in one table (with the DATE field indexed).
Scenario A means more hassle in implementation and maintenance, plus the daily overhead of moving data between tables, while scenario B means searching one big table (though indexed). Does it impose memory problems? Which scenario is recommended? Are there any other recommendations?
You usually don't want to break a big table into multiple tables, although having a current and historical table is totally reasonable. Your process makes sense. You can then optimize the current table for your query needs. I would probably go for two tables (given the limited information you provide), because it allows such optimization.
However, don't split the historical data. Instead, use partitioning. See the documentation. One caveat: queries need to specify the partitioning key in the where clause to take advantage of the partitions. With a large table, this is typical anyway.
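A minimal sketch of RANGE partitioning on the date column (illustrative names only; note that in MySQL the partitioning column has to be part of every unique key, including the primary key):

```sql
CREATE TABLE records (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    event_date DATE            NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (id, event_date)              -- partition column included in the PK
)
PARTITION BY RANGE (TO_DAYS(event_date)) (
    PARTITION p2017 VALUES LESS THAN (TO_DAYS('2018-01-01')),
    PARTITION p2018 VALUES LESS THAN (TO_DAYS('2019-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Queries that filter on event_date are pruned to the matching partitions
SELECT *
FROM records
WHERE event_date >= '2018-06-01'
  AND event_date <  '2018-07-01';
```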
Question: is the historical data necessary for system functionality or are these records stored for other purposes (e.g. audits)? It may be time to clean house by moving the historical data to an archive.
In my experience, most systems with big data have historical tables. In most cases I have seen, the current data and the historical data have different user groups. The current data are used by front-end users dealing with customers and their current or recent transactions. The historical data are usually used by user groups who do not have to talk with customers/clients directly.
Do not worry much about implementation and maintenance, as I think your main consideration is performance. Implementation is a one-time deal; the job then runs on a specified frequency (weekly, monthly or yearly archival) after you move the program(s) into production. Maintenance is very small, and you can mostly forget about it once it is implemented. You just have to make sure that you test the programs thoroughly.
For normalized historical tables, the tables have the same structure and field names, which makes copying the data much easier. This way, one can simply do a join between the tables.
If you choose to not split the data, you will continue to add index after index. But somewhere down the road, you will still encounter the same issue again.

Creating a MySQL Database Schema for large data set

I'm struggling to find the best way to build out a structure that will work for my project. The answer may be simple but I'm struggling due to the massive number of columns or tables, depending on how it's set up.
We have several tools, each that can be run for many customers. Each tool has a series of questions that populate a database of answers. After the tool is run, we populate another series of data that is the output of the tool. We have roughly 10 tools, all populating a spreadsheet of 1500 data points. Here's where I struggle... each tool can be run multiple times, and many tools share the same data point. My next project is to build an application that can begin data entry for a tool, but allow import of data that shares the same datapoint for a tool that has already been run.
A simple example:
Tool 1 - company, numberofusers, numberoflocations, cost
Tool 2 - company, numberofusers, totalstorage, employeepayrate
So if the same company completed tool 1, I need to be able to populate "numberofusers" (or offer to populate) when they complete tool 2 since it already exists.
I think what it boils down to is, would it be better to create a structure that has 1500 tables, 1 for each data element with additional data around each data element, or to create a single massive table - something like...
customerID(FK), EventID(fk), ToolID(fk), numberofusers, numberoflocations, cost, total storage, employee pay,.....(1500)
If I go this route and have one large table I'm not sure how that will impact performance. Likewise - how difficult it will be to maintain 1500 tables.
Another dimension is that it would be nice to have a description of each field:
numberofusers,title,description,active(bool). I assume this is only possible if each element is in its own table?
Thoughts? Suggestions? Sorry for the lengthy question, new here.
Build a main table with all the common data: company, # users, .. other stuff. Give each row a unique id.
Build a table for each unique tool with the company id from above and any data unique to that implementation. Give each table a primary (unique) key covering 'tool use' and 'company'.
This covers the common data in one place, identifies each 'customer' and provides for multiple uses of a given tool for each customer. Every use and customer is trackable and distinct.
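A minimal sketch of that shape (all names hypothetical), with a small metadata table covering the per-field title/description/active flag mentioned in the question:

```sql
-- Common data points shared across tools, one row per company
CREATE TABLE company (
    company_id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name                VARCHAR(100) NOT NULL,
    number_of_users     INT UNSIGNED,
    number_of_locations INT UNSIGNED
);

-- One table per tool: one row per run, holding only that tool's own data points
CREATE TABLE tool1_run (
    run_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    company_id INT UNSIGNED NOT NULL,
    run_date   DATE NOT NULL,
    cost       DECIMAL(12,2),
    FOREIGN KEY (company_id) REFERENCES company (company_id)
);

CREATE TABLE tool2_run (
    run_id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    company_id        INT UNSIGNED NOT NULL,
    run_date          DATE NOT NULL,
    total_storage     BIGINT UNSIGNED,
    employee_pay_rate DECIMAL(8,2),
    FOREIGN KEY (company_id) REFERENCES company (company_id)
);

-- Description of each data point (title, description, active flag)
CREATE TABLE data_point_meta (
    column_name VARCHAR(64)  NOT NULL PRIMARY KEY,
    title       VARCHAR(100),
    description TEXT,
    active      BOOLEAN      NOT NULL DEFAULT TRUE
);
```

Pre-filling a shared data point such as numberofusers when a second tool is run then becomes a simple lookup against the common table.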
More about normalization here.
I agree with etherbubunny on normalization, but with larger datasets there are performance considerations that quickly become important. Joins, which are often required in normalized databases to display human-readable information, can be performance killers on even medium-sized tables, which is why a lot of data warehouse models use de-normalized datasets for reporting. This essentially means pre-building the joined reporting data into new tables, with heavy use of indexing, archiving and partitioning.
In many cases smart use of partitioning on its own can also effectively help reduce the size of the datasets being queried. This usually takes quite a bit of maintenance unless certain parameters remain fixed though.
Ultimately in your case (and most others) I highly recommend building it the way you are able to maintain and understand what is going on and then performing regular performance checks via slow query logs, explain, and performance monitoring tools like percona's tool set. This will give you insight into what is really happening and give you some data to come back here or the MySQL forums with. We can always speculate here but ultimately the real data and your setup will be the driving force behind what is right for you.

Is it better to create tables based on content or views?

I'm learning MySQL and was working on a database for work. Everything's fine so far, but I had a question. I am organizing financial statements for firms (balance sheet table, income statement table, cash flow table, etc.), and most companies have quarterly statements (which are unaudited) and annual statements (which are audited). Right now, for each statement I have a column that flags it as annual or quarterly.
It's not likely that someone will be running a report on an audited and an unaudited statement at the same time, so I was wondering whether it was worth creating one table for audited statements and one for unaudited. My reasoning was that the data will eventually get fairly large, and I thought the smaller the tables, the faster the performance.
So when I design a database, should I be designing based on the content (i.e. group everything that's the same, regardless) or should I be grouping based on how people will access it?
Another question this raises: should I be grouping financial statements by country, since 90% of the analysis done at our firm is within the same country?
This is impossible to answer definitively without knowing the whole problem.
However, usually you want a single table to represent each logical entity in your system. From the sound of it, quarterly and annual statements represent the same logical entity, but differ by a single category column/field. The same holds true for the country question--if the only difference is the country (a categorization), then they likely should all be stored in the same table.
If you were to split your data into separate tables by category, your data would be scattered across multiple tables, and would be very hard to query. For example, if you wanted a count of all statements in the system, you would have to query ALL country tables and add the results together.
Edit: Joe Celko calls this anti-pattern "Attribute Splitting".
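For instance, a simple "count all statements" against per-country tables (hypothetical names) already turns into something like this, and every new country means touching every such query:

```sql
SELECT SUM(cnt) AS total_statements
FROM (
    SELECT COUNT(*) AS cnt FROM statements_us
    UNION ALL
    SELECT COUNT(*) FROM statements_uk
    UNION ALL
    SELECT COUNT(*) FROM statements_de
    -- ...one branch per country table
) AS totals;

-- versus, with a single table and a country column:
SELECT COUNT(*) AS total_statements FROM statements;
```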
First of all I have to point out, I'm not a professional DB designer.
But if I were you, in this case I would create one table, as the entities are basically the same.
If you are worried about MySQL's performance on larger datasets, maybe it would be better to start building your app on Postgres. You can boost MySQL's performance with stored functions/procedures, or maybe views if you have to run complicated queries, and of course you can use memcache or any NoSQL store to let the SQL database rest a bit.
If you are sure that users will mainly search only for this or that type of record, you can build three tables: one for all of the records, and one each for the audited and unaudited ones. You can keep them synchronized with triggers (ON UPDATE/DELETE/INSERT). They would work like views, but I think (not tested) they would be faster than views. In this case you only have to manage the first "large" table: if you insert an audited record, the trigger fires and puts a record into the audited table, and so on...
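A rough sketch of such a trigger (table and column names are made up; the same idea applies to the UPDATE and DELETE triggers):

```sql
DELIMITER //

CREATE TRIGGER statements_after_insert
AFTER INSERT ON statements
FOR EACH ROW
BEGIN
    -- Copy the new row into the matching side table based on the audit flag
    IF NEW.audited = 1 THEN
        INSERT INTO statements_audited (statement_id, firm_id, period, total_assets)
        VALUES (NEW.statement_id, NEW.firm_id, NEW.period, NEW.total_assets);
    ELSE
        INSERT INTO statements_unaudited (statement_id, firm_id, period, total_assets)
        VALUES (NEW.statement_id, NEW.firm_id, NEW.period, NEW.total_assets);
    END IF;
END//

DELIMITER ;
```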
Best wishes!
I agree with Phil and Damien - one table is better. What you want is one table per type of real business thing. If you design your tables to resemble real things, even abstract or conceptual things, then your data design is more likely to stand the test of time. Once you've sketched out a schema based on the real things you have data about, then you can go back and apply the rules of normalization to formalize your design.
As a rule, it is a bad idea to design for a performance problem you are worried about, but haven't actually seen. Your intuition about big tables being slower might actually be wrong. Most DBMS systems like bigger tables, at least to a point. When tables are big the query optimizers choose to use indexes. When tables are small they often end up getting full table scans, which can really slow down concurrent access. If your tables get so big that they are beyond the capabilities of your DBMS then it is time to consider either archiving out old data that you aren't using anymore or buying a more scalable DBMS.

Strategy for dealing with large db tables

I'm looking at building a Rails application which will have some pretty large tables with upwards of 500 million rows. To keep things snappy, I'm currently looking into how a large table can be split into more manageable chunks. I see that as of MySQL 5.1 there is a partitioning option, and that's a possibility, but I don't like the way the column that determines the partitioning has to be part of the primary key on the table.
What I'd really like to do is split the table that an AR model writes to based upon the values written, but as far as I am aware there is no way to do this - does anyone have any suggestions as to how I might implement this, or any alternative strategies?
Thanks
Arfon
Partition columns in MySQL are not limited to primary keys. In fact, a partition column does not have to be a key at all (though one will be created for it transparently). You can partition by RANGE, HASH, KEY and LIST (which is similar to RANGE, only with a set of discrete values). Read the MySQL manual for an overview of the partitioning types.
There are alternative solutions, such as HScale - a middleware plug-in that transparently partitions tables based on certain criteria. HiveDB is an open-source framework for horizontal partitioning of MySQL.
In addition to sharding and partitioning, you should employ some sort of clustering. The simplest setup is a replication-based one that helps you spread the load over several physical servers. You should also consider more advanced clustering solutions such as MySQL Cluster (probably not an option due to the size of your database) and clustering middleware such as Sequoia.
I actually asked a relevant question regarding scaling with MySQL here on Stack Overflow some time ago, which I ended up answering myself several days later, after collecting a lot of information on the subject. It might be relevant for you as well.
If you want to split your data by time, the following solution may fit your needs. You can probably use MERGE tables:
Let's assume your table is called MyTable and that you need one table per week
Your app always logs into the same table.
A weekly job atomically renames your table and recreates an empty one: MyTable is renamed to MyTable-Year-WeekNumber, and a fresh empty MyTable is created
Merge tables are dropped and recreated.
If you want to get all the data from the past three months, you create a merge table which includes only the tables from the last three months. Create as many merge tables as you need for different periods. If you don't need to include the table currently being written to (MyTable in our example), even better: you won't have any read/write concurrency.
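A sketch of the weekly rotation and of one merge table (illustrative names; MERGE requires all underlying tables to be identical MyISAM tables):

```sql
-- Weekly job: atomically swap in a fresh table to log into
CREATE TABLE MyTable_new LIKE MyTable;
RENAME TABLE MyTable     TO MyTable_2008_23,   -- e.g. year 2008, week 23
             MyTable_new TO MyTable;

-- Rebuild a merge table covering the period you want to query
DROP TABLE IF EXISTS MyTable_last3months;
CREATE TABLE MyTable_last3months (
    -- column definitions must match the underlying weekly tables exactly
    id   BIGINT UNSIGNED NOT NULL,
    ts   DATETIME        NOT NULL,
    data VARCHAR(255)
) ENGINE=MRG_MYISAM
  UNION=(MyTable_2008_21, MyTable_2008_22, MyTable_2008_23)   -- list every weekly table in the window
  INSERT_METHOD=NO;
```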
You can handle this entirely in Active Record using DataFabric.
It's not that complicated to implement similar behavior yourself if that's not suitable. Google "sharding" for a lot of discussion of the architectural pattern of handling table partitioning within the app tier. It has the advantage of avoiding middleware and of not depending on DB-vendor-specific features. On the other hand, it is more code in your app that you're responsible for.