Should proactive caching be used instead of processing dimensions? - ssis

I am confused about best practice for updating cube data throughout the day. We have a small order processing environment, where I would like to update a dashboard containing Order statuses. I am able to get this to work by creating an SSIS package and scheduling it to run every 4 minutes. This works.
But when I disable the SSIS job above and instead turn on Real-time ROLAP on all the dimensions and the cube, nothing ever changes in the dashboard. Do I misunderstand the purpose of proactive caching?
I'm using SQL Server Standard for our production data, but our Analysis Server is Enterprise, in case that makes a difference. I'd also be willing to use Automatic or Scheduled MOLAP if that works.

No, you did not. I think you have configuration issues.
I assume the job you disabled was copying data from your database to your data warehouse, right?
And your cube reads from your data warehouse, right?
So now your OLTP database is being updated (by your application), but the changes are not reaching the data warehouse, and therefore not the cube, because the job is off.
Proactive caching (especially with ROLAP) is a way to get your data live without having to schedule a cube refresh every x minutes. But the job that populates your DW must still be running.
My guess is that the package you disabled was, besides updating the DW, also refreshing the cube. Check its source.

Related

AWS DMS to replicate transactional data to data warehouse on an ongoing basis

I'm hoping someone can tell me if I'm absolutely crazy before I go too far down this path. I have an application with MySQL as the backend. I needed to create more robust reporting and I opted to build a data warehouse in pgsql. The challenge is that I don't want the DW to be updated just once or twice a day. I'd like it to be near real time (some lag is expected and not a problem).
I looked at AWS glue and a few other options and finally settled on DMS as a method of replicating the data from the MySQL source to the pgsql target db for staging. I then set up trigger functions that will basically manipulate the inserted/updated data in the pgsql db, landing it in the data warehouse. The application is also connected to the DW and can pull reports and dashboard metrics from the DW as needed.
I've built a proof of concept and it seems to work, but it's really only me hitting the application at the moment, so I'm not sure if it will hold up if I were to proceed with this idea and put it in production.
I currently have a dms.t2.small replication instance (engine version 2.4.4) running at about 15-20% CPU utilization. I don't have it configured for Multi AZ currently.
I'm seeing combined CDCLatencyTarget/CDCLatencySource values averaging about 9 seconds. I think if that holds true it wouldn't be unbearable, although the less time the better. I'd say if it gets up over a minute we may start to see complaints.
I know that DMS is more meant for migrations, so I'd like to know if I'm just doing this in a really stupid way, or if this is a more or less valid use case? Are there issues with DMS that I am unaware of that will cause me to later regret this decision?
Also, I'd love any ideas you have for how I can put safeguards in place to ensure that source and target stay synced or if they don't that I'm made aware of it, or something that would allow it to self-heal.

SSIS Transfer Object Task Timeout

I can see I'm not the only person who's experienced an issue with the SSIS Transfer Database Object Task and timeouts, however, people using this for the extract phase of an ETL must be something fairly common, so I'm trying to establish what is the usual/accepted way to do this.
I have a web application that uses Entity Framework to generate ~250 tables, some of which occasionally have schema updates.
The bulk of the transform and load portion of our ETL is handled by a series of stored procedures, however, these read from a copy of the application's tables that are initially loaded in the Transfer Database Objects task.
Initially, we set up an SSIS package that simply ran the Transfer Database Objects task, and then kicked off the stored proc. That meant that the job was fairly resilient to change, and the only changes required were changes to the stored proc, if and when a schema update affected the tables that were used therein.
Unfortunately, as one of our application instances has grown over time, the Transfer Database Objects task is reaching the point where I'm regularly seeing Timeout errors. Those don't appear to be connection timeouts, or anything I can control on the server side, and from what I can see, I can't amend the CommandTimeout on the underlying SMO stuff within that Task.
I can see that some people manually craft their extract, such that they run a separate Data Flow task to pull the information from each table, which has the obvious bonus that these can be run in parallel, however, in my case, that's going to mean an initial chunk of work to craft 250ish of these, and a maintenance task whenever the schema changes on the source database, no matter how minor.
I've come across Biml, which looked like a possible way to at least ease that overhead, however, it doesn't appear this can run on VS2017 yet.
Does anyone have any particular patterns they follow for this, or if I do need individual data flow tasks, is there some way to automate the schema update, perhaps using some kind of SSIS automation and something from the entity framework?
It turns out the easiest way around this is to write a clone of the Transfer task, but with appropriate additions to allow more control over batching and timeouts etc. Details are available in this article: https://blogs.msdn.microsoft.com/mattm/2007/04/18/roll-your-own-transfer-sql-server-objects-task/

Why does SSRS need to recycle the application domain

I'm working with MS Reporting Services 2016. I noticed that the application domain is set by default to recycle every 12 hours. Now the impact on users after a recycle is either slow response from reporting services or a failed report. Both disappear after a refresh of the report, but this is not ideal.
I have come across a SO answer where people suggest that you can turn off the scheduled recycle by setting the configuration attribute RecycleTime to zero.
I have also read about writing a script to manually restart Reporting Services, which also recycles the app domain, followed by a script that simply loads a report at a controlled time to remove the first-load issues. However, this all seems like a workaround to me and I would rather not have to do this.
My concern is that there must be a logical reason for having the scheduled recycle time, but I cannot find any information explaining this. Does anyone know if there is a negative impact from turning off the scheduled application domain recycle?
The RecycleTime is a function aimed at making sure SSRS isn't consuming RAM it doesn't need and potentially starving the rest of the machine. Disabling the refresh essentially removes the ability to claw back any memory used for a brief period of intensive processing.
If you are confident your machine is suitably resourced, you can turn the refresh off. If not, schedule the refresh for an out-of-hours time and define a Cache Refresh Plan to cache any critical reports immediately afterwards, to minimise user impact.
Further reading here: https://www.mssqltips.com/sqlservertip/2735/prevent-sql-server-reporting-services-slow-startup/
I guess I'm possibly oversimplifying this, but SSRS was designed to recycle every 12 hours (default) for a reason. If it ain't broke, don't fix it. In my case, I wanted to control when the recycle occurred. I execute a one-line PowerShell script from a SQL Agent job at 6:50 am, then generate a subscription report at 7 am, which kick-starts SSRS so that users do not see any performance degradation.
restart-service 'ReportServer'
Leaving the SSRS config file setting at 720 minutes lets the recycle occur again at 6:50 pm. Subscription reports generate throughout the night, so if a human gets on SSRS after hours there should be no performance issue because the system is already running.
Are we possibly overthinking it?

Replicating database changes

I want to "replicate" a database to an external service. For doing so I could just copy the entire database (SELECT * FROM TABLE).
If some changes are made (INSERT, UPDATE, DELETE), do I need to upload the entire database again, or is there a log file describing these operations?
Thanks!
It sounds like your "external service" is not just another database, so traditional replication might not work for you. More details on that service would be great so we can customize answers. Depending on how long you have to get data to your external service and performance demands of your application, some main options would be:
Triggers: add INSERT/UPDATE/DELETE triggers that update your external service's data when your data changes (this could be rough on your app's performance but provide near real-time data for your external service)
Log Processing: you can parse changes from the logs and use some level of ETL to make sure they'll run properly on your external service's data storage. I wouldn't recommend getting into this if you're not familiar with their structure for your particular DBMS.
Incremental Diffs: you could run diffs on some interval (maybe 3x a day, for example) and have a cron job or scheduled task run a script that moves all the data in a big chunk. This prioritizes your app's performance over the external service.
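To make the incremental diff option a bit more concrete, here is a minimal Python sketch. It assumes you can load the previous and current snapshot of a table into dictionaries keyed by primary key; the function name `diff_snapshots` and the row layout are made up for illustration and are not tied to any particular DBMS or external service.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def diff_snapshots(
    old: Dict[int, Row], new: Dict[int, Row]
) -> Tuple[List[Row], List[Row], List[int]]:
    """Compare two snapshots keyed by primary key.

    Returns (rows to insert, rows to update, primary keys to delete).
    """
    # Keys present only in the new snapshot are inserts.
    inserts = [new[k] for k in sorted(new.keys() - old.keys())]
    # Keys present only in the old snapshot are deletes.
    deletes = sorted(old.keys() - new.keys())
    # Keys present in both are updates only if the row content changed.
    updates = [
        new[k] for k in sorted(new.keys() & old.keys()) if new[k] != old[k]
    ]
    return inserts, updates, deletes
```

A scheduled job would call this with the last exported snapshot and the current table contents, then push only the three result lists to the external service instead of the whole table.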
If you choose triggers, you may be able to tweak an existing trigger-based replication solution to update your external service. I haven't used these so I have no idea how crazy that would be, just an idea. Some examples are Bucardo and Slony.
There are many ways to replicate a PostgreSQL database. In the current version, 9.0, the PostgreSQL Global Development Group introduced two new features called Hot Standby and Streaming Replication, taking PostgreSQL to a new level and providing a built-in solution.
On the wiki, there is a complete review of the new PostgreSQL 9.0 features:
http://wiki.postgresql.org/wiki/PostgreSQL_9.0
There are other applications like Bucardo, Slony-I, Londiste (Skytools), etc., which you can use too.
Now, what do you want to do with log processing? What exactly do you want? Regards

Pattern for updating slave SQL Server 2008 databases from a master whilst minimising disruption

We have an ASP.NET web application hosted by a web farm of many instances using SQL Server 2008 in which we do aggregation and pre-processing of data from multiple sources into a format optimised for fast end user query performance (producing 5-10 million rows in some tables). The aggregation and optimisation is done by a service on a back end server which we then want to distribute to multiple read only front end copies used by the web application instances to facilitate maximum scalability.
My question is about the best way to get this data from a back end database out to the read only front end copies in such a way that does not kill their performance during the process. The front end web application instances will be under constant high load and need to have good responsiveness at all times.
The backend database is constantly being updated so I suspect that transactional replication will not be the best approach, as the constant stream of updates to the copies will hurt their performance.
Staleness of data is not a huge issue so snapshot replication might be the way to go, but this will result in poor performance during the periods of replication.
Doing a drop and bulk insert will result in periods with no data for user queries.
I don't really want to get into writing a complex cluster approach where we drop copies out of the cluster during updating - is there something along these lines that we can do without too much effort, or is there a better alternative?
There is actually a technology built into SQL Server 2005 (and 2008) that is designed to address this kind of issue: Service Broker (which I'll refer to as SSB from here on). The problem is that it has a very steep learning curve.
I know MySpace went public about how it uses SSB to manage its fleet of SQL Servers: MySpace Uses SQL Server Service Broker to Protect Integrity of 1 Petabyte of Data. I know of several more (major) sites that use similar patterns, but unfortunately they have not gone public so I cannot name them. I was personally involved with some projects around this technology (I am a former member of the SQL Server team).
Now bear in mind that SSB is not a dedicated data transfer technology like Replication. As such you will not find anything similar to the publishing wizards and simple deployment options of Replication (check a table and it gets transferred). SSB is a reliable messaging technology, and as such its primitives stop at the level of message exchange. You would have to write the code that captures the data changes, packs them as messages and unpacks the messages into relational tables at the destination.
The reason some companies still prefer SSB over Replication for a task like the one you describe is that SSB has a far better story when it comes to reliability and scalability. I know of projects that exchange data between 1500+ sites, far beyond the capabilities of Replication. SSB is also abstracted from the physical topology: you can move databases, rename machines and rebuild servers, all without changing the application. Because data flow occurs over logical routes, the application can adapt on-the-fly to new topologies. SSB is also resilient to long periods of disconnection and downtime, being capable of resuming the data flow after hours, days and even months of disconnect. High throughput achieved by engine integration (SSB is part of the SQL engine itself, not a collection of satellite applications and processes like Replication) means that the backlog of changes can be processed in reasonable time (I know of sites that are going through half a million transactions per minute). SSB applications typically rely on internal Activation to process the incoming data. SSB also has some unique features like built-in load balancing (via routes) with sticky session semantics, support for deadlock-free application-specific correlated processing, priority data delivery, specific support for database mirroring, certificate-based authentication for cross-domain operations, built-in persisted timers and many more.
This is not a specific answer to 'how to move data from table T on server A to server B'. It is more a generic technology for how to 'exchange data between server A and server B'.
I've never had to deal with this scenario before but did come up with a possible solution for this. Basically, it would require a change in your main database structure. Instead of storing the data, you would keep records of modifications of this data. Thus, if a record is added, you store "Table X, inserted new record with these values: ..." With modifications, just store the table, field and changed value. With deletions, just store which record is deleted. Every modification will be stored with a timestamp.
Your client systems would keep their local copies of the database and will regularly ask for all database modifications after a certain date/time. You then execute those modifications on the local database and it will be up-to-date again.
And the back-end? Well, it would just keep a list of modifications and perhaps a table with the base data. Keeping just the modifications also means you're keeping track of history, allowing you to ask the system what it looked like a year ago.
How well this would perform depends on the number of modifications on the back-end database. But if you request the changes every 15 minutes, it shouldn't be that much data every time.
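The modification-log idea above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses Python with an in-memory SQLite database standing in for the back end, a single hypothetical `change_log` table, ISO-8601 timestamp strings, and a plain dictionary as the client's local copy. None of these names come from the original setup.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple

# (changed_at, op, record_id, value)
Change = Tuple[str, str, int, Optional[str]]


def make_backend() -> sqlite3.Connection:
    """Create the back-end store: a table holding modifications, not data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE change_log (
               seq        INTEGER PRIMARY KEY AUTOINCREMENT,
               changed_at TEXT NOT NULL,     -- ISO-8601 timestamp
               op         TEXT NOT NULL,     -- 'insert', 'update' or 'delete'
               record_id  INTEGER NOT NULL,
               value      TEXT               -- NULL for deletes
           )"""
    )
    return conn


def log_change(conn: sqlite3.Connection, changed_at: str, op: str,
               record_id: int, value: Optional[str] = None) -> None:
    """Record one modification together with its timestamp."""
    conn.execute(
        "INSERT INTO change_log (changed_at, op, record_id, value) "
        "VALUES (?, ?, ?, ?)",
        (changed_at, op, record_id, value),
    )
    conn.commit()


def changes_since(conn: sqlite3.Connection, since: str) -> List[Change]:
    """Return all modifications strictly after `since`, in logged order."""
    cur = conn.execute(
        "SELECT changed_at, op, record_id, value FROM change_log "
        "WHERE changed_at > ? ORDER BY seq",
        (since,),
    )
    return cur.fetchall()


def apply_changes(local: Dict[int, str], changes: List[Change],
                  last_seen: str) -> str:
    """Replay modifications on the local copy; return the new high-water mark."""
    for changed_at, op, record_id, value in changes:
        if op == "delete":
            local.pop(record_id, None)
        else:  # inserts and updates both set the current value
            local[record_id] = value
        last_seen = changed_at
    return last_seen
```

Each client would remember the timestamp returned by `apply_changes` and pass it to `changes_since` on its next poll. One caveat of comparing on timestamps alone: two changes logged with the same timestamp across a poll boundary could be missed, so a real implementation might track the `seq` column instead.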
But again, I never had the chance to work this out in a real application, so it's still a theoretical principle for me. It seems fast, but a lot of work will be required.
Option 1: Write an app to transfer the data using row level transactions. It might take longer but would result in no interruption of the site using the data because the rows are there before and after the read occurs, just with new data. This processing would happen on a separate server to minimize load.
In SQL Server 2008 you can set READ_COMMITTED_SNAPSHOT to ON so that readers are not blocked by the rows being updated.
But basically all this app does is read the new data as it is available out from one database and into the other.
Option 2: Move the data (tables or entire database) from the aggregation server to the front-end server. Automate this if possible. Then switch your web application to point to the new database or tables for future requests. This works but requires control over the web app, which you may not have.
Option 3: If you were talking about a single table (or this could work with many), what you can do is a view swap. You write your code against a SQL view which points to Table A. You do your work on Table B and, when it's ready, you update the view to point to Table B. You can even write a function that determines the active table and automate the whole swap.
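A rough sketch of the view swap, using Python with an in-memory SQLite database purely so that it is self-contained. The table names `orders_a` / `orders_b`, the view name `orders` and both helper functions are hypothetical. SQLite has no ALTER VIEW, so the sketch drops and recreates the view inside one transaction; on SQL Server you would use ALTER VIEW instead.

```python
import sqlite3

TABLES = ("orders_a", "orders_b")  # the two physical copies (hypothetical names)


def make_db() -> sqlite3.Connection:
    """Create both physical tables; queries should only ever use the view."""
    # isolation_level=None puts the connection in autocommit mode so that
    # the explicit BEGIN/COMMIT below control the transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    for table in TABLES:
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, status TEXT)"
        )
    return conn


def point_view_at(conn: sqlite3.Connection, table: str) -> None:
    """Repoint the 'orders' view at the given physical table."""
    if table not in TABLES:
        raise ValueError(f"unknown table: {table}")
    conn.execute("BEGIN")
    try:
        conn.execute("DROP VIEW IF EXISTS orders")
        conn.execute(f"CREATE VIEW orders AS SELECT * FROM {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def active_table(conn: sqlite3.Connection) -> str:
    """Work out which physical table the view currently points at."""
    (sql,) = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = 'orders'"
    ).fetchone()
    return next(t for t in TABLES if sql.endswith(t))
```

The load process would call `active_table`, rebuild the other table, and then call `point_view_at` with it, so readers of the view see the old data until the moment of the swap.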
Option 4: You might be able to use something like byte-level replication of the server, which basically copies the server from point A to point B exactly, down to the very bytes. That sounds scary, though. It's mostly used in DR situations, and this sounds like it could be a kinda/sorta DR situation, but not really.
Option 5: Give up and learn how to sell insurance. :)