OLTP to OLAP with CQRS or SSIS - sql-server-2008

Background Information
Our application's data is read and written by three components:
ASP.NET MVC 3 customer front end website (write actions)
Winform verification tool at stores (write actions)
Silverlight Dashboard for tenant (95% aggregate reads 5% write actions)
(3) is the only piece that could use some performance improvements.
Our storage is a SQL Server Standard OLTP database with stored procedures that aggregate the data consumed by the Silverlight app.
When using the Database Engine Tuning Advisor or looking at execution plans we don't see any critical indexes missing, and we rebuild indexes with a SQL Agent job.
Most of the widgets are sparklines
x = time selected by interval (day, week, month, year)
y = aggregate (sum, avg, etc.)
Currently we return about 14-20 points per widget. Our dashboard opens with 10 widgets initially.
Our dimensions would be: tenant, store, (day,week,month,year)
Our facts: completed, incomplete, redeemed, score ...
I know a denormalized table would save SQL Server from recalculating these aggregates every time a store manager, franchise owner, or corporate user views the data (~50 simultaneous users).
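To make that concrete, here is a minimal sketch of what such a denormalized table could look like for the dimensions and facts listed above, assuming a daily grain that is rolled up to week/month/year at query time. All object names here are made up for illustration, not taken from the actual system.

```sql
-- Hypothetical daily summary table (names are illustrative only).
CREATE TABLE dbo.WidgetDailySummary
(
    TenantId        INT           NOT NULL,
    StoreId         INT           NOT NULL,
    SummaryDate     DATE          NOT NULL,              -- roll up to week/month/year at query time
    CompletedCount  INT           NOT NULL DEFAULT 0,
    IncompleteCount INT           NOT NULL DEFAULT 0,
    RedeemedCount   INT           NOT NULL DEFAULT 0,
    ScoreSum        DECIMAL(18,2) NOT NULL DEFAULT 0,
    ScoreCount      INT           NOT NULL DEFAULT 0,    -- keep sum + count so averages stay recomputable
    CONSTRAINT PK_WidgetDailySummary PRIMARY KEY (TenantId, StoreId, SummaryDate)
);

-- A sparkline query then becomes a cheap range scan, e.g. the last 14 days for one store:
SELECT SummaryDate,
       CompletedCount,
       CAST(ScoreSum / NULLIF(ScoreCount, 0) AS DECIMAL(18,2)) AS AvgScore
FROM   dbo.WidgetDailySummary
WHERE  TenantId = @TenantId
  AND  StoreId  = @StoreId
  AND  SummaryDate >= DATEADD(DAY, -14, CAST(GETDATE() AS DATE))
ORDER BY SummaryDate;
```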
I'll be honest: if we go with OLAP, it will be my first hands-on experience with it.
Questions
What is the long term solution for a rich reporting dashboard?
I would assume OLAP. If so, how would you keep it up to date so that it stays near real-time, like the dashboard we have today?
Putting a maintenance page while OLAP rebuilds itself is not an option.
Ideally we would want to do this incrementally, and we see NServiceBus (which we already use today) as a great bridge for updating these denormalized views. Do we put these denormalized views in the OLTP database as just more tables, or is there a way to incrementally update an OLAP data source?
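If the denormalized view does live in the OLTP database as just another table, a message handler can keep it current with one upsert per event. A rough sketch, assuming the hypothetical summary table from earlier and an event that carries tenant, store, timestamp, and score (none of these names come from the original system):

```sql
-- Hypothetical incremental update, e.g. executed from an NServiceBus handler
-- whenever a "verification completed" event arrives.
MERGE dbo.WidgetDailySummary AS target
USING (SELECT @TenantId AS TenantId,
              @StoreId  AS StoreId,
              CAST(@OccurredAtUtc AS DATE) AS SummaryDate) AS src
      ON  target.TenantId    = src.TenantId
      AND target.StoreId     = src.StoreId
      AND target.SummaryDate = src.SummaryDate
WHEN MATCHED THEN
    UPDATE SET CompletedCount = target.CompletedCount + 1,
               ScoreSum       = target.ScoreSum + @Score,
               ScoreCount     = target.ScoreCount + 1
WHEN NOT MATCHED THEN
    INSERT (TenantId, StoreId, SummaryDate, CompletedCount, ScoreSum, ScoreCount)
    VALUES (src.TenantId, src.StoreId, src.SummaryDate, 1, @Score, 1);
```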
References
http://www.udidahan.com/2009/12/09/clarified-cqrs/
http://www.udidahan.com/2011/10/02/why-you-should-be-using-cqrs-almost-everywhere%E2%80%A6/

“Putting a maintenance page while OLAP rebuilds itself is not an option.”
Why would you say that? The OLAP cube is available while it’s rebuilding.
There are several ways you can configure how the refresh works: ROLAP, HOLAP and MOLAP. You can have automatic refreshes every X hours or even make the data available in near real time. Try reading about proactive caching in SSAS; it may give you some ideas.

Related

Kingswaysoft SSIS create Contacts Slow

I’m currently doing some testing for an upcoming data migration project and came across Kingswaysoft, which seemed like it would be ideal for this purpose.
However, I’m currently testing an import of 225,000 contact records into a new sandbox Dynamics 365 instance, and it is on course to take somewhere between 10 and 13 hours.
Is this typical of the speeds I should expect or am I doing something silly?
I am setting only a few out-of-the-box fields such as first name, last name, DOB and address data.
I have a staging contact SQL database holding the 225k records to be uploaded.
I have the CRM Destination Component set up to use multi-threading with a batch size of 250 and up to 16 threads.
I have tested using both Create and Upsert, and both are very slow.
Am I doing something wrong? I would have expected it to be much quicker.
When it comes to loading data into Dynamics 365 Online, the most important aspect affecting your data load performance is network latency. You should try to put the data migration solution as close as possible to the Dynamics 365 Online server. If you have the configuration right, you should be able to achieve something like 1m to 2m records per hour. The speed that you are getting is too slow; there must be something wrong. There are many other things that can affect data load performance, but start with network latency first. We have some other tips shared at https://www.kingswaysoft.com/products/ssis-integration-toolkit-for-microsoft-dynamics-365/help-manual/crm/advanced-topics#MaximizedPerformance, which you should check out.

Query on database performance and software reliability

Situation:
The client is running a web-based finance application whose primary functionality involves a huge volume of financial transactions, both incoming and outgoing.
The processes are automated.
We run several cron job tasks at midnight to split the payments for appropriate customers.
On average we gain 2,000 to 3,000 new customers per month, with a total of 30,000 customers currently.
Our transactional tables have almost 900,000 records so far, and we expect a drastic increase in the coming months.
Technologies: Initially we used a LAMP environment with the CodeIgniter framework, Laravel's Eloquent ORM for querying, and MySQL.
Hosting: Hosted on AWS, a t2.small instance, no load balancer implemented.
This application was developed three years ago.
Problem:
Currently our client faces downtime during peak hours, and their customers face load-time issues while reviewing their transaction archives and stats.
They also fear that if the cron job tasks fail, they will not be able to handle the situation (vast calculations are made and amounts are inserted across a huge number of customers).
Our plan:
Right now we plan to rework the application from scratch, with performance and fault tolerance as our primary goals. The application has to remain reliable for at least another six to eight years.
Technologies: Node (Sails.js), Angular 5, AWS with a load balancer, AWS RDS (MySQL)
Our approach: From our analysis we found a few straightforward reasons for the performance loss. Primarily, there are many customer stats that hit heavy tables.
Most of the stats are for the current month, so we plan to add log tables for those and keep only the current month's data in the main tables.
So there are going to be many such log tables, which will only ever be read.
Queries:
Is it good to split the read-only tables into a separate database, or can we keep them within a single database?
How does the MySQL buffer cache differ from Redis/memcached? Are there memory-consumption problems as more traffic flows in?
What is the best approach to truncating a few tables at the end of every month (as I mentioned for the log tables)?
Am I proceeding in the right direction?
A million rows is a modest size, not "huge". Since you are having performance problems, I have to believe that it stems from poor indexing and/or poor query formulation.
Find out what queries are having the most trouble. See this for suggestions on using mysqldumpslow -s t or pt-query-digest to locate them.
Provide SHOW CREATE TABLE and EXPLAIN SELECT ... for discussion of how to improve them. It may be as simple as adding a "composite" index.
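For illustration only (table and column names below are assumptions, not taken from the question), this is the kind of thing that analysis usually leads to:

```sql
-- Hypothetical example: a per-customer, current-month stats query.
EXPLAIN SELECT SUM(amount)
FROM   transactions
WHERE  customer_id = 123
  AND  created_at >= '2018-04-01';

-- If EXPLAIN shows a full table scan, a composite index that matches the
-- WHERE clause (equality column first, range column second) usually fixes it:
ALTER TABLE transactions
  ADD INDEX idx_customer_created (customer_id, created_at);
```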
Another possible performance bottleneck may be repeatedly summarizing old data. If this is the case, then consider the data warehousing technique of building and maintaining summary tables.
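A minimal sketch of that technique, with hypothetical table and column names: a summary table keyed by customer and month, with the current month's bucket rebuilt on a schedule, so the dashboards read a handful of rows instead of re-aggregating the raw transaction table.

```sql
-- Hypothetical monthly summary table.
CREATE TABLE transaction_monthly_summary (
  customer_id  INT           NOT NULL,
  yr_month     CHAR(7)       NOT NULL,   -- e.g. '2018-04'
  txn_count    INT           NOT NULL,
  total_amount DECIMAL(18,2) NOT NULL,
  PRIMARY KEY (customer_id, yr_month)
);

-- Nightly refresh of the current month's bucket (idempotent thanks to REPLACE).
REPLACE INTO transaction_monthly_summary (customer_id, yr_month, txn_count, total_amount)
SELECT customer_id,
       DATE_FORMAT(created_at, '%Y-%m'),
       COUNT(*),
       SUM(amount)
FROM   transactions
WHERE  created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
GROUP BY customer_id, DATE_FORMAT(created_at, '%Y-%m');
```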
As for your 4 questions, I tentatively say "no" to each.
The various frameworks tend to make small applications easy to develop, but they start to give trouble when you scale. Still, there are things that can be fixed without abandoning (yet) the frameworks.
AWS, etc, give you lots of reliability and read scaling. But, I repeat, the likely place to look is at the slow queries, not the various ideas you presented.
As for periodic truncation, let's discuss that after seeing what the data looks like and what the business requirements are for data retention.

Google Cloud SQL Timeseries Statistics

I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though we have quite a powerful one) and takes 20 seconds on some large graph/chart queries. So long, in fact, that our daemon often intervenes to kill the queries.
Currently we have a Google Compute instance on the frontend and a Google SQL instance on the backend.
So my question is this: is there some better way of storing and querying time-series data using the Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to my php application.
Elasticsearch is awesome for time series data.
You can run it on compute engine, or they have a hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it better to understand their query language that way).
https://www.elastic.co
They also have an automated graphing interface for time series data. It's called Kibana.
Enjoy!!
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used ElasticSearch for storing events and profiling information from a web site. I even wrote a statsd backend storing stat information in elasticsearch.
After Elasticsearch moved Kibana from version 3 to 4, I found the interface extremely bad for looking at stats. You can only chart one metric from each query, so if you want to chart time, average time, and 90th-percentile time, you must run three queries instead of one that returns three values. (The same issue existed in 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a time-series database that is supported by Grafana, a time-series charting front end. OpenTSDB stores information in a Hadoop-like format, so it will be able to scale out massively. Most of the others store events as something closer to row-based information.
For capturing statistics, you can use either statsd or Riemann (or Riemann and then statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a DB.
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/

Why do we need SSIS and star schema of Data Warehouse?

If SSAS in MOLAP mode stores data, what is the application of SSIS and why do we need a Data Warehouse and the ETL process of SSIS?
I have a SQL Server OLTP database. I am using SSIS to transfer my SQL Server data from OLTP database to a Data Warehouse database that contains fact and dimension tables.
After that I want to create cubes using SSAS from the Data Warehouse data.
I know that MOLAP stores data. Do I need any Data warehouse with Fact and Dimension tables?
Isn't it better to avoid creating a Data Warehouse and create cubes directly from the OLTP database?
This might be a candidate for "Too Broad" but I'll give it a go.
Why would I want to store my data 3 times?
I have my data in my OLTP (online, transaction processing system), why would I want to move that data into a completely new structure (data warehouse) and then move it again into an OLAP system?
Let's start simple. You have only one system of record and it's not amazingly busy. Maybe you can get away with an abstraction layer (views in the database or named queries in SSAS) and skip the data warehouse.
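As a rough illustration of that abstraction layer (all object names here are made up), the cube's dimensions can be fed from views over the OLTP schema rather than from warehouse tables; an SSAS data source view or named query can do the same job:

```sql
-- Hypothetical view presenting OLTP tables in a dimension-like shape.
CREATE VIEW dbo.vDimCustomer
AS
SELECT c.CustomerId  AS CustomerKey,
       c.FirstName,
       c.LastName,
       r.RegionName
FROM   dbo.Customer AS c
JOIN   dbo.Region   AS r ON r.RegionId = c.RegionId;
```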
So you build out your cubes and dimensions, people start using it, and they love it.
"You know what'd be great? If we could correlate our Blats to the Foos and Bars we already have in there" Now you need to integrate your simple app with data from a completely unrelated app. Customer id 10 in your app is customer id {ECA67697-1200-49E2-BF00-7A13A549F57D} in the CRM app. Now what? You're going to need to present a single view of the Customer to your users or they will not use the tool.
Maybe you rule with an iron fist and say No, you can't have that data in the cube and your users go along with it.
"Do people's buying habits change after having a child?" We can't answer that because our application only stores the current version of a customer. Once they have a child, they've always had a child so you can't cleanly identify patterns before or after an event.
"What were our sales like last year" We can't answer that because we only keep a rolling 12 weeks of data in the app to make it manageable.
"The data in the cubes is stale, can you refresh it?" Egads, it's the middle of the day. The SSAS processing takes table locks and would essentially bring our app down until it's done processing.
Need I go on with these scenarios?
Summary
The data warehouse serves as an integration point for diverse systems. It has conformed dimensions (everyone has a common definition of what a thing is). The data in the warehouse may outlive the data in the source systems. The business needs might drive the tracking of data that the source application does not support. The data in the DW supports business activities, while your OLTP system supports itself.
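To make a couple of those points concrete (all names below are hypothetical): a conformed customer dimension maps the keys of every source system onto one surrogate key, and Type 2 history columns let you see the customer as they were before and after an event such as having a child.

```sql
-- Hypothetical conformed customer dimension: one row per customer version,
-- carrying every source system's key plus SCD Type 2 history columns.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey      INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key referenced by the facts
    AppCustomerId    INT              NULL,          -- e.g. customer id 10 in the home-grown app
    CrmCustomerGuid  UNIQUEIDENTIFIER NULL,          -- e.g. the GUID in the CRM system
    CustomerName     NVARCHAR(200)    NOT NULL,
    HasChild         BIT              NOT NULL,
    EffectiveFrom    DATETIME         NOT NULL,
    EffectiveTo      DATETIME         NULL,           -- NULL = current version
    IsCurrent        BIT              NOT NULL
);
```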
SSIS is just a tool for moving data. There are plenty out there, some better, some worse.
So No, generally speaking, it is not better to avoid creating a DW and build your cubes based on your OLTP database.

Pattern for updating slave SQL Server 2008 databases from a master whilst minimising disruption

We have an ASP.NET web application hosted by a web farm of many instances using SQL Server 2008 in which we do aggregation and pre-processing of data from multiple sources into a format optimised for fast end user query performance (producing 5-10 million rows in some tables). The aggregation and optimisation is done by a service on a back end server which we then want to distribute to multiple read only front end copies used by the web application instances to facilitate maximum scalability.
My question is about the best way to get this data from a back end database out to the read only front end copies in such a way that does not kill their performance during the process. The front end web application instances will be under constant high load and need to have good responsiveness at all times.
The backend database is constantly being updated so I suspect that transactional replication will not be the best approach, as the constant stream of updates to the copies will hurt their performance.
Staleness of data is not a huge issue so snapshot replication might be the way to go, but this will result in poor performance during the periods of replication.
Doing a drop and bulk insert will result in periods with no data for user queries.
I don't really want to get into writing a complex cluster approach where we drop copies out of the cluster during updating - is there something along these lines that we can do without too much effort, or is there a better alternative?
There is actually a technology built into SQL Server 2005 (and 2008) that is designed to address exactly this kind of issue: Service Broker (which I'll refer to as SSB from here on). The problem is that it has a very steep learning curve.
MySpace has gone public about how it uses SSB to manage its fleet of SQL Servers: MySpace Uses SQL Server Service Broker to Protect Integrity of 1 Petabyte of Data. I know of several more (major) sites that use similar patterns, but unfortunately they have not gone public, so I cannot name them. I was personally involved with some projects around this technology (I am a former member of the SQL Server team).
Now bear in mind that SSB is not a dedicated data transfer technology like Replication. As such, you will not find anything similar to Replication's publishing wizards and simple deployment options (check a table and it gets transferred). SSB is a reliable messaging technology, and as such its primitives stop at the level of message exchange: you would have to write the code that captures the data changes, packs them into messages, and also unpacks the messages into relational tables at the destination.
Why some companies still prefer SSB over Replication for a task like the one you describe is that SSB has a far better story when it comes to reliability and scalability. I know of projects that exchange data between 1500+ sites, far beyond the capabilities of Replication. SSB is also abstracted from the physical topology: you can move databases, rename machines, and rebuild servers, all without changing the application. Because data flow occurs over logical routes, the application can adapt on the fly to new topologies. SSB is also resilient to long periods of disconnect and downtime, being capable of resuming the data flow after hours, days, and even months of disconnection. High throughput achieved by engine integration (SSB is part of the SQL engine itself, not a collection of satellite applications and processes like Replication) means that the backlog of changes can be processed in reasonable time (I know of sites that go through half a million transactions per minute). SSB applications typically rely on internal Activation to process the incoming data. SSB also has some unique features, like built-in load balancing (via routes) with sticky-session semantics, support for deadlock-free application-specific correlated processing, priority data delivery, specific support for database mirroring, certificate-based authentication for cross-domain operations, built-in persisted timers, and many more.
This is not a specific answer to 'how to move data from table T on server A to server B'. It is more a generic technology for 'exchanging data between server A and server B'.
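To give a feel for what 'writing the code yourself' means here, a minimal sketch of the SSB plumbing (all names are hypothetical; the receiving side would also need an activated procedure that unpacks the messages into its tables):

```sql
-- On both databases: a message type, a contract, a queue and a service.
CREATE MESSAGE TYPE [//DataSync/ChangeBatch] VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT [//DataSync/ChangeContract] ([//DataSync/ChangeBatch] SENT BY INITIATOR);
CREATE QUEUE dbo.ChangeQueue;
CREATE SERVICE [//DataSync/FrontEndService] ON QUEUE dbo.ChangeQueue ([//DataSync/ChangeContract]);

-- On the back end, your code packs captured changes into messages and sends them:
DECLARE @dialog UNIQUEIDENTIFIER;
BEGIN DIALOG CONVERSATION @dialog
    FROM SERVICE [//DataSync/BackEndService]
    TO SERVICE   '//DataSync/FrontEndService'
    ON CONTRACT  [//DataSync/ChangeContract]
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @dialog
    MESSAGE TYPE [//DataSync/ChangeBatch] (N'<changes>...</changes>');
```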
I've never had to deal with this scenario before but did come up with a possible solution for this. Basically, it would require a change in your main database structure. Instead of storing the data, you would keep records of modifications of this data. Thus, if a record is added, you store "Table X, inserted new record with these values: ..." With modifications, just store the table, field and changed value. With deletions, just store which record is deleted. Every modification will be stored with a timestamp.
Your client systems would keep their local copies of the database and will regularly ask for all database modifications after a certain date/time. You then execute those modifications on the local database and it will be up-to-date again.
And the back-end? Well, it would just keep a list of modifications and perhaps a table with the base data. Keeping just the modifications also means you're keeping track of history, allowing you to ask the system what it looked like a year ago.
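A rough sketch of what that could look like, with hypothetical names (note that SQL Server 2008's built-in Change Tracking and Change Data Capture features cover similar ground and could replace the hand-rolled log):

```sql
-- Hypothetical change-log table on the back end; every insert/update/delete
-- against the real tables also writes one row here.
CREATE TABLE dbo.ChangeLog
(
    ChangeId   BIGINT IDENTITY(1,1) PRIMARY KEY,
    TableName  SYSNAME       NOT NULL,
    Operation  CHAR(1)       NOT NULL,   -- 'I', 'U' or 'D'
    RecordKey  NVARCHAR(100) NOT NULL,
    Payload    XML           NULL,       -- new column values for inserts/updates
    ChangedAt  DATETIME      NOT NULL DEFAULT GETUTCDATE()
);

-- Each front-end copy remembers the last change it applied and polls for the rest:
SELECT ChangeId, TableName, Operation, RecordKey, Payload
FROM   dbo.ChangeLog
WHERE  ChangeId > @LastAppliedChangeId
ORDER BY ChangeId;
```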
How well this would perform depends on the number of modifications on the back-end database. But if you request the changes every 15 minutes, it shouldn't be that much data every time.
But again, I never had the chance to work this out in a real application, so it's still a theoretical principle for me. It seems fast, but a lot of work would be required.
Option 1: Write an app to transfer the data using row level transactions. It might take longer but would result in no interruption of the site using the data because the rows are there before and after the read occurs, just with new data. This processing would happen on a separate server to minimize load.
In SQL Server 2008 you can set READ_COMMITTED_SNAPSHOT to ON so that the rows being updated do not block readers.
But basically all this app does is read the new data as it becomes available from one database and write it into the other.
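For reference, turning that setting on is a one-time database-level change (the database name here is hypothetical); it needs exclusive access at the moment it is switched:

```sql
-- Readers see the last committed version of a row instead of blocking on writers.
ALTER DATABASE FrontEndReporting SET READ_COMMITTED_SNAPSHOT ON;
```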
Option 2: Move the data (tables or entire database) from the aggregation server to the front-end server. Automate this if possible. Then switch your web application to point to the new database or tables for future requests. This works but requires control over the web app, which you may not have.
Option 3: If you were talking about a single table (though this could work with many), what you can do is a view swap. You write your code against a SQL view which points to Table A. You do your work on Table B and, when it's ready, you update the view to point to Table B. You can even write a function that determines the active table and automate the whole swap.
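A minimal sketch of the view swap (table and view names are made up): readers always query the view, the loader fills the inactive table, then re-points the view in a single metadata operation.

```sql
-- Readers query the view, which initially points at table A.
CREATE VIEW dbo.ActiveSales
AS
SELECT * FROM dbo.Sales_A;
GO

-- Load/refresh dbo.Sales_B in the background, then swap:
ALTER VIEW dbo.ActiveSales
AS
SELECT * FROM dbo.Sales_B;
GO
```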
Option 4: You might be able to use something like byte-level replication of the server, which is basically copying the server from point A to point B exactly, down to the very bytes. That sounds scary, though. It's mostly used in DR situations; this sounds like it could be a kinda/sorta DR situation, but not really.
Option 5: Give up and learn how to sell insurance. :)