Handle data changes in BI and Data warehouse systems - data-analysis

We are adopting Splunk as a BI solution with the aim of ingesting SQL data directly into Splunk and bringing powerful reports/dashboards to the user.
The question that I have in general is how to handle data changes in a BI system and ensure that we always have the same state with the source.
How do you handle data changes:
Data changes that happen in the source system?
Data deletions that happen in the source system?
Is there a general approach for tackling these kinds of issues?

Let me say up front that Splunk is not a replacement for SQL databases. Each has its own role to play.
Splunk supports ingestion of SQL data using its DB Connect add-on. DB Connect executes SQL queries and indexes the results. New and changed DB rows become new events in Splunk indexes.
Splunk does not remove indexed data when the corresponding rows disappear from the source. Events are only removed from Splunk when they age out or have to be deleted to make room for new events.

Related

Copying data of one database to another in Oracle

I am using an Oracle database but am open to using another database, so I am tagging all of them.
I am designing a system in which I have to load all the data of an existing database table into a new database, and whatever changes happen in the existing database should be reflected in the new database on a daily basis. My approach is:
I will copy all the data of the existing database to the new database.
Then I will create a trigger which will record all the changes to the table (all the DML operations) and store them in another table.
Once a day my API will read the data generated by the trigger and copy it into the new system. I don't need live data, so I will schedule the job to copy data into the new database only once a day.
Is this the proper approach? Any suggestions?
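The three steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with two in-memory databases instead of Oracle, and the `customers` table, the `customers_log` table, the trigger names and the `apply_changes` helper are all hypothetical. The target starts empty here, so the log replay also stands in for the initial bulk copy.

```python
import sqlite3

# Source database: the table to mirror plus a change-log table filled by triggers.
src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customers_log (
    seq  INTEGER PRIMARY KEY AUTOINCREMENT,
    op   TEXT NOT NULL,      -- 'I', 'U' or 'D'
    id   INTEGER NOT NULL,
    name TEXT
);
CREATE TRIGGER customers_ai AFTER INSERT ON customers BEGIN
    INSERT INTO customers_log (op, id, name) VALUES ('I', NEW.id, NEW.name);
END;
CREATE TRIGGER customers_au AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_log (op, id, name) VALUES ('U', NEW.id, NEW.name);
END;
CREATE TRIGGER customers_ad AFTER DELETE ON customers BEGIN
    INSERT INTO customers_log (op, id, name) VALUES ('D', OLD.id, NULL);
END;
""")

# Target database: same table layout, no triggers.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")


def apply_changes(src, dst, last_seq):
    """Replay log rows with seq > last_seq on dst; return the new high-water mark."""
    rows = src.execute(
        "SELECT seq, op, id, name FROM customers_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, op, row_id, name in rows:
        if op == "D":
            dst.execute("DELETE FROM customers WHERE id = ?", (row_id,))
        else:
            # Inserts and updates are both applied as an upsert.
            dst.execute(
                "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
                (row_id, name),
            )
        last_seq = seq
    dst.commit()
    return last_seq


# Simulate a day of activity on the source, then run the daily job once.
src.execute("INSERT INTO customers VALUES (1, 'Ann')")
src.execute("INSERT INTO customers VALUES (2, 'Bob')")
src.execute("UPDATE customers SET name = 'Anna' WHERE id = 1")
src.execute("DELETE FROM customers WHERE id = 2")
src.commit()

last_seq = apply_changes(src, dst, last_seq=0)
target_rows = dst.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

The daily job only needs to remember `last_seq` between runs. The sketch does not handle updates that change the primary key, and in a real system the triggers add write overhead on the source table.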
Common practice would be to back up your primary instance and restore it on the secondary once a day.
You could schedule the backup and restore in sequence as a daily job.
If your copy database is SQL Server, then I suggest you use a linked server. According to the documentation:
Linked servers enable you to implement distributed databases that can
fetch and update data in other databases. They are a good solution in
the scenarios where you need to implement database sharding without
need to create a custom application code or directly load from remote
data sources. Linked servers offer the following advantages:
The ability to access data from outside of SQL Server.
The ability to issue distributed queries, updates, commands, and
transactions on heterogeneous data sources across the enterprise.
The ability to address diverse data sources similarly.
You can find more information in the documentation:
https://learn.microsoft.com/en-us/sql/relational-databases/linked-servers/linked-servers-database-engine?view=sql-server-ver15

Connecting 3rd party reporting tools to MySQL

I have an application that runs on a MySQL database. The application is somewhat resource intensive on the DB.
My client wants to connect QlikView to this DB for reporting. I was wondering if someone could point me to a white paper or URL describing the best way to do this without causing locks etc. on my DB.
I have searched Google to no avail.
QlikView is an in-memory tool that works on preloaded data, so your client only has to fetch data during periodic reloads, not all the time.
The best way is for your client to schedule a reload once per night and make it incremental. If your tables only ever receive new records, load each night only the records whose primary key is greater than the last one loaded.
If your tables have modified records, you need to add a last_modified_time field in MySQL, and probably an index on that field:
last_modified_time TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
If rows get deleted, it is best to mark them as deleted=1 in MySQL instead; otherwise your client will need to reload everything from those tables to find out which rows were deleted.
Additionally, to save resources, your client should load data in a really simple style, one table at a time and without JOINs:
SELECT [fields] FROM TABLE WHERE `id` > $(vLastId);
QlikView is really good and fast at data modelling and joins, so your client can build the whole data model in QlikView.
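The incremental reload described above can be sketched like this. It is an illustration only, with several assumptions: it uses Python's built-in sqlite3 module instead of MySQL and QlikView, the `orders` table is hypothetical, `last_modified_time` is a plain integer set by hand (SQLite has no `ON UPDATE CURRENT_TIMESTAMP`), and a Python dict stands in for the data already loaded on the reporting side. One simple query without joins picks up new, modified and soft-deleted rows.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    amount REAL,
    last_modified_time INTEGER NOT NULL,  -- stand-in for the MySQL TIMESTAMP column
    deleted INTEGER NOT NULL DEFAULT 0    -- soft-delete flag
);
CREATE INDEX idx_orders_modified ON orders (last_modified_time);
""")


def incremental_reload(conn, cache, since):
    """Pull rows changed after `since`, fold them into `cache`, return the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, last_modified_time, deleted FROM orders "
        "WHERE last_modified_time > ? ORDER BY last_modified_time",
        (since,),
    ).fetchall()
    for row_id, amount, modified, deleted in rows:
        if deleted:
            cache.pop(row_id, None)   # row was soft-deleted at the source
        else:
            cache[row_id] = amount    # new or modified row
        since = max(since, modified)
    return since


cache = {}

# Night 1: two rows exist.
src.execute("INSERT INTO orders VALUES (1, 10.0, 10, 0)")
src.execute("INSERT INTO orders VALUES (2, 20.0, 11, 0)")
watermark = incremental_reload(src, cache, since=0)

# Day 2: one update, one soft delete, one new row.
src.execute("UPDATE orders SET amount = 15.0, last_modified_time = 20 WHERE id = 1")
src.execute("UPDATE orders SET deleted = 1, last_modified_time = 21 WHERE id = 2")
src.execute("INSERT INTO orders VALUES (3, 7.5, 22, 0)")
watermark = incremental_reload(src, cache, since=watermark)
```

After the second reload the cache holds rows 1 and 3 only, and just three rows were read from the source instead of the whole table.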
Reporting can indeed cause problems on a busy transactional database.
One approach you might want to examine is to have a replica (slave) of your database. MySQL supports this very well and your replica data can be as up to date as you require. You could then attach any reporting system to your replica to run heavy reports that won't affect your main database. This also gives you a backup (2nd copy) and the backup can further be used to create offline backups of your data also without affecting your main database.
There's lots of information on the setup of MySQL replicas so that's not too hard.
I hope that helps.

How to migrate SQL Server database from my server to client server

I have a Transaction database with 10,000+ entries inserted on a daily basis.
My client's requirement is that he can download reports from his own server, so we make a copy of the Transaction database on his server.
The problem now is: how do we move the data to his server at a specific time so that it picks up the latest entries?
There are at least a couple of options in SQL Server.
If you can connect to your customer's database, Change Data Capture (CDC) with SSIS is one option. CDC collects all changes in a queryable store which SSIS then reads and pushes to your target. You can be as selective as you want about what to move over, since you write the ETL process in SSIS. One downside to CDC is that it's available in Enterprise edition only. See detailed instructions at https://technet.microsoft.com/en-us/library/bb895315(v=sql.105).aspx
Transactional replication is another option, which is available in both Enterprise and Standard editions. It has been around a long time and is used by a lot of organizations to do exactly what you described: incrementally move data to another database. It is not as flexible as CDC, but you can still apply filters to which rows/columns get moved. Not needing Enterprise edition is helpful for many customers. There is lots of detail about the technology here https://msdn.microsoft.com/en-us/library/ms151198(v=sql.105).aspx but I highly encourage you to check out Kendra Little's most excellent article that covers transactional replication and compares it with CDC http://www.brentozar.com/archive/2013/09/transactional-replication-change-tracking-data-capture/
If you can't connect directly to the customer database, CDC with SSIS still works but the output target will be some flat file which then gets transferred to the customer and loaded using another SSIS package or some other bulk load job (TSQL, BCP, etc...). Do be careful with how the flat file gets moved since anybody can see its contents.
I'd avoid any manual methods like creating triggers or running some (usually expensive) query to find the changed rows. Apart from the maintenance efforts, you're very likely to encounter tough performance issues.

How to convert MS SQL tables to DynamoDB tables?

I am new to Amazon DynamoDB and I have eight (8) MS SQL tables that I want to migrate to DynamoDB.
What process should I use for converting and migrating the database schema and data?
I faced the same problem a year ago when I started migrating our app from SQL to DynamoDB. I am not sure if there are automated tools, but I can share what we did for the migration:
Check whether your existing data types can be mapped directly or need to change in DynamoDB. You can merge some of the tables that require fewer updates into a single item using List and Map types, or use a Set if required.
The most important thing is to check all your existing queries. This will be the core information you will need when you will design DynamoDB tables.
Make sure you distribute Hash keys properly.
Use GSI and LSI for searching and sorting purposes (project only those attributes that will be needed; this will save money).
Some points that will save some money:
If your tables are read-heavy, try using some caching mechanism, otherwise be ready to increase throughput of the tables.
If your table is write-heavy, then implement a queuing mechanism, such as SQS.
Keep checking the status of all your important tables in the Management Console. It provides different metrics that will help you manage the throughput of the tables.
I have written a blog post that covers all the challenges we faced while moving from a relational database to a NoSQL database.
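The idea of merging related tables into a single item can be illustrated with a small sketch. Everything here is hypothetical: the `orders` and `order_lines` tables, the attribute names, and the choice of `customer_id` as hash key with `order_id` as range key, which assumes the dominant query is "orders for a customer". The code only shapes plain Python dicts and makes no AWS calls; the resulting items could then be written with a client library such as boto3.

```python
from collections import defaultdict

# Rows as they might come out of two relational tables (hypothetical data).
orders = [
    {"order_id": 100, "customer_id": "C1", "status": "shipped"},
    {"order_id": 101, "customer_id": "C2", "status": "open"},
]
order_lines = [
    {"order_id": 100, "sku": "A-1", "qty": 2},
    {"order_id": 100, "sku": "B-7", "qty": 1},
    {"order_id": 101, "sku": "A-1", "qty": 5},
]


def build_items(orders, order_lines):
    """Fold child rows into their parent so each order becomes one item."""
    lines_by_order = defaultdict(list)
    for line in order_lines:
        lines_by_order[line["order_id"]].append(
            {"sku": line["sku"], "qty": line["qty"]}   # one Map per order line
        )
    items = []
    for order in orders:
        items.append({
            "customer_id": order["customer_id"],  # hash (partition) key, assumed
            "order_id": order["order_id"],        # range (sort) key, assumed
            "status": order["status"],
            "lines": lines_by_order.get(order["order_id"], []),  # List of Maps
        })
    return items


items = build_items(orders, order_lines)
```

Whether this shape is right depends entirely on the existing queries, which is why checking them comes first: an item is read and written as a whole, so child rows that change often are poor candidates for merging.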

Pattern for updating slave SQL Server 2008 databases from a master whilst minimising disruption

We have an ASP.NET web application hosted by a web farm of many instances using SQL Server 2008 in which we do aggregation and pre-processing of data from multiple sources into a format optimised for fast end user query performance (producing 5-10 million rows in some tables). The aggregation and optimisation is done by a service on a back end server which we then want to distribute to multiple read only front end copies used by the web application instances to facilitate maximum scalability.
My question is about the best way to get this data from a back end database out to the read only front end copies in such a way that does not kill their performance during the process. The front end web application instances will be under constant high load and need to have good responsiveness at all times.
The backend database is constantly being updated so I suspect that transactional replication will not be the best approach, as the constant stream of updates to the copies will hurt their performance.
Staleness of data is not a huge issue so snapshot replication might be the way to go, but this will result in poor performance during the periods of replication.
Doing a drop and bulk insert will result in periods with no data for user queries.
I don't really want to get into writing a complex cluster approach where we drop copies out of the cluster during updating - is there something along these lines that we can do without too much effort, or is there a better alternative?
There is actually a technology built into SQL Server 2005 (and 2008) that is designed to address this kind of issue: Service Broker (which I'll refer to as SSB from here on). The problem is that it has a very steep learning curve.
I know MySpace has gone public about how it uses SSB to manage its park of SQL Servers: "MySpace Uses SQL Server Service Broker to Protect Integrity of 1 Petabyte of Data". I know of several more (major) sites that use similar patterns, but unfortunately they have not gone public so I cannot name them. I was personally involved with some projects around this technology (I am a former member of the SQL Server team).
Now bear in mind that SSB is not a dedicated data transfer technology like Replication. As such you will not find anything similar to the publishing wizards and simple deployment options of Replication (check a table and it gets transferred). SSB is a reliable messaging technology and as such its primitives stop at the level of message exchange. You would have to write the code that captures the data changes and packs them as messages, and also the code that unpacks the messages into relational tables at the destination.
The reason some companies still prefer SSB over Replication for a task like the one you describe is that SSB has a far better story when it comes to reliability and scalability. I know of projects that exchange data between 1500+ sites, far beyond the capabilities of Replication. SSB is also abstracted from the physical topology: you can move databases, rename machines and rebuild servers, all without changing the application. Because data flow occurs over logical routes, the application can adapt on the fly to new topologies. SSB is also resilient to long periods of disconnection and downtime, being capable of resuming the data flow after hours, days and even months of disconnection. The high throughput achieved through engine integration (SSB is part of the SQL engine itself, not a collection of satellite applications and processes like Replication) means that the backlog of changes can be processed in a reasonable time (I know of sites that are going through half a million transactions per minute). SSB applications typically rely on internal Activation to process the incoming data. SSB also has some unique features like built-in load balancing (via routes) with sticky session semantics, support for deadlock-free application-specific correlated processing, priority data delivery, specific support for database mirroring, certificate-based authentication for cross-domain operations, built-in persisted timers and many more.
This is not a specific answer to 'how to move data from table T on server A to server B'. It is more a generic technology for how to 'exchange data between server A and server B'.
I've never had to deal with this scenario before but did come up with a possible solution for this. Basically, it would require a change in your main database structure. Instead of storing the data, you would keep records of modifications of this data. Thus, if a record is added, you store "Table X, inserted new record with these values: ..." With modifications, just store the table, field and changed value. With deletions, just store which record is deleted. Every modification will be stored with a timestamp.
Your client systems would keep their local copies of the database and will regularly ask for all database modifications after a certain date/time. You then execute those modifications on the local database and it will be up-to-date again.
And the back-end? Well, it would just keep a list of modifications and perhaps a table with the base data. Keeping just the modifications also means you're keeping track of history, allowing you to ask the system what it looked like a year ago.
How well this would perform depends on the number of modifications on the back-end database. But if you request the changes every 15 minutes, it shouldn't be that much data every time.
But again, I never had the chance to work this out in a real application, so it's still a theoretical principle for me. It seems fast, but a lot of work will be required.
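A minimal sketch of this modification-log idea, under assumptions: the `Modification` record layout, the helper names and the integer timestamps are all hypothetical, and plain Python dicts stand in for the back-end log and the front-end copies. It shows a client catching up from its last sync time, and the history question ("what did it look like at time T?") answered by replaying the log up to T.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Modification:
    ts: int                      # timestamp of the change
    table: str
    op: str                      # "insert", "update" or "delete"
    key: int                     # primary key of the affected record
    field: Optional[str] = None  # only used for updates
    value: Any = None            # dict of values for insert, single value for update


def changes_since(log, ts):
    """What a client would request: all modifications after its last sync time."""
    return [m for m in log if m.ts > ts]


def apply_modifications(db, mods):
    """Replay modifications in timestamp order on a dict-of-dicts 'database'."""
    for m in sorted(mods, key=lambda m: m.ts):
        table = db.setdefault(m.table, {})
        if m.op == "insert":
            table[m.key] = dict(m.value)
        elif m.op == "update":
            table[m.key][m.field] = m.value
        elif m.op == "delete":
            table.pop(m.key, None)
    return db


log = [
    Modification(1, "X", "insert", 1, value={"name": "a", "qty": 1}),
    Modification(2, "X", "insert", 2, value={"name": "b", "qty": 5}),
    Modification(3, "X", "update", 1, field="qty", value=3),
    Modification(4, "X", "delete", 2),
]

# A front-end copy that last synced at ts=2 catches up with the later changes.
front = apply_modifications({}, [m for m in log if m.ts <= 2])
front = apply_modifications(front, changes_since(log, 2))

# History: rebuild the state as it was at ts=2.
state_at_2 = apply_modifications({}, [m for m in log if m.ts <= 2])
```

Each client only has to store the timestamp of its last sync. The cost is that every write on the back end becomes a log append, and replay time grows with the number of modifications unless base snapshots are kept.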
Option 1: Write an app to transfer the data using row level transactions. It might take longer but would result in no interruption of the site using the data because the rows are there before and after the read occurs, just with new data. This processing would happen on a separate server to minimize load.
In SQL Server 2008 you can set READ_COMMITTED_SNAPSHOT to ON so that readers are not blocked by the rows being updated.
But basically all this app does is read the new data as it is available out from one database and into the other.
Option 2: Move the data (tables or entire database) from the aggregation server to the front-end server. Automate this if possible. Then switch your web application to point to the new database or tables for future requests. This works but requires control over the web app, which you may not have.
Option 3: If you were talking about a single table (although this could work with many), what you can do is a view swap. You write your code against a SQL view which points to Table A. You do your work on Table B and, when it's ready, you update the view to point to Table B. You can even write a function that determines the active table and automate the whole swap.
Option 4: You might be able to use something like byte-level replication of the server, which basically means copying the server from point A to point B exactly, down to the very bytes. That sounds scary though. It's mostly used in DR situations, and this sounds like it could be a kinda/sorta DR situation, but not really.
Option 5: Give up and learn how to sell insurance. :)
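The view swap from Option 3 can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module rather than SQL Server, and the `report` view, the `report_a`/`report_b` tables and the helper functions are hypothetical. SQLite has no `ALTER VIEW`, so the sketch drops and recreates the view inside one transaction; on SQL Server a single `ALTER VIEW` statement would play that role.

```python
import sqlite3

# isolation_level=None gives autocommit mode, so transactions are explicit below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE report_a (id INTEGER PRIMARY KEY, total REAL);
CREATE TABLE report_b (id INTEGER PRIMARY KEY, total REAL);
INSERT INTO report_a VALUES (1, 10.0);
CREATE VIEW report AS SELECT id, total FROM report_a;
""")


def active_table(conn):
    """Work out which table the view currently points to from its definition."""
    sql = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = 'report'"
    ).fetchone()[0]
    return "report_a" if "report_a" in sql else "report_b"


def swap(conn, rows):
    """Load fresh rows into the standby table, then repoint the view at it."""
    standby = "report_b" if active_table(conn) == "report_a" else "report_a"
    # Readers keep using the active table while the standby is rebuilt.
    conn.execute(f"DELETE FROM {standby}")
    conn.executemany(f"INSERT INTO {standby} VALUES (?, ?)", rows)
    # The swap itself is a short transaction.
    conn.execute("BEGIN")
    conn.execute("DROP VIEW report")
    conn.execute(f"CREATE VIEW report AS SELECT id, total FROM {standby}")
    conn.execute("COMMIT")
    return standby


before = conn.execute("SELECT id, total FROM report ORDER BY id").fetchall()
now_active = swap(conn, [(1, 12.5), (2, 4.0)])
after = conn.execute("SELECT id, total FROM report ORDER BY id").fetchall()
```

The application always queries `report`, so there is never a period with no data: the expensive load happens on the table nobody is reading, and only the brief view change touches what users see.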