In a data warehouse project, how do I verify that the fact table loaded into the warehouse database through my SSIS ETL load matches my staging table, so that I don't end up with incorrect reporting later?
Good question. People build different systems for this, and this kind of check/reconciliation process is one of the more complex things developers have to build. I'll give you three ways to do it; I would recommend the first one because it is the easiest and most efficient.
You can -
Post-load reports: create reports that reconcile the data after the load. Write SQL to compare source and target data - compare counts, amounts, null values, daily data, etc. If the comparison raises a flag/alert, it means something went wrong in the load (see the sketch after this list).
Check as you go: create a reusable function or mapping that compares incoming source data with target data - counts, amounts, null values, daily data, etc. - and stores the results in a table. A script keeps checking those values and notifies the support team if there is any issue.
Pre-process check: before starting any ETL, check the source data - counts, null values, daily counts, missing files, etc. - to verify that the data looks right.
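As a rough illustration of the post-load approach, here is a minimal T-SQL sketch of a reconciliation query. The table and column names (stg.Sales, dw.FactSales, LoadDate, Amount) are placeholders for whatever your staging and fact tables actually contain.

```sql
-- Hypothetical staging-vs-fact reconciliation by load date.
-- stg.Sales, dw.FactSales, LoadDate and Amount are placeholder names.
WITH StagingTotals AS (
    SELECT LoadDate, COUNT(*) AS RowCnt, SUM(Amount) AS TotalAmount
    FROM   stg.Sales
    GROUP  BY LoadDate
),
FactTotals AS (
    SELECT LoadDate, COUNT(*) AS RowCnt, SUM(Amount) AS TotalAmount
    FROM   dw.FactSales
    GROUP  BY LoadDate
)
SELECT COALESCE(s.LoadDate, f.LoadDate) AS LoadDate,
       s.RowCnt      AS StagingRows,
       f.RowCnt      AS FactRows,
       s.TotalAmount AS StagingAmount,
       f.TotalAmount AS FactAmount
FROM   StagingTotals s
FULL   OUTER JOIN FactTotals f ON f.LoadDate = s.LoadDate
WHERE  s.RowCnt IS NULL OR f.RowCnt IS NULL
   OR  s.RowCnt <> f.RowCnt
   OR  s.TotalAmount <> f.TotalAmount;
```

Any row a query like this returns is a discrepancy worth flagging or alerting on.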
I am looking for a way to store auto-generated reports. There are about 10-15 columns and 100-3000 rows depending on the report, but each report is consistent in column count.
I am looking for a way to organise and store these reports as one large group without creating an entirely new database and thousands of tables to store each individual report.
The reports need to be queryable so they can be subdivided by team/area/person etc., as each report can be a combination of 3-4 different sub-reports depending on how you split/sort the data.
I am using Python to collect and sort the data from the database, so MariaDB/MySQL would be preferred, but I'm happy to use something else if there is a pre-existing connection library for it.
To sum up, I need something similar to an Excel spreadsheet, with each table being a sheet and the sheet name being the date it was generated, so I can select by the date generated.
Think through the goals.
Is this a legal issue -- do you need to produce an unalterable report as something "official", à la a non-editable .pdf?
(at the opposite extreme) Be able to generate (or regenerate) any report for any timeframe.
Is performance an issue? (Either perceived or real)
I like to build and maintain Summary Table(s) for any "Data Warehouse" application, and build "reports" that take a date range and a small number of other things as parameters, and make report generation fast enough that it does not matter if multiple people pull reports at random times.
15 columns and 3000 rows is usually excessive. If pulling a report is trivial enough, it can be less 'massive'; just get the parts you want, without such bulk.
http://mysql.rjweb.org/doc.php/summarytables
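For context, a summary table along the lines the linked article describes could look something like this in MySQL. The table, column and metric names below are purely illustrative, not taken from your schema:

```sql
-- Illustrative daily summary table; all names are placeholders.
CREATE TABLE report_daily_summary (
    report_date  DATE          NOT NULL,
    team         VARCHAR(50)   NOT NULL,
    person       VARCHAR(50)   NOT NULL,
    metric_count INT UNSIGNED  NOT NULL,
    metric_total DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (report_date, team, person)
);

-- Roll yesterday's raw rows up into the summary (run once per day).
INSERT INTO report_daily_summary (report_date, team, person, metric_count, metric_total)
SELECT DATE(created_at), team, person, COUNT(*), SUM(amount)
FROM   raw_report_rows
WHERE  created_at >= CURDATE() - INTERVAL 1 DAY
  AND  created_at <  CURDATE()
GROUP  BY DATE(created_at), team, person;
```

A "report" then becomes a trivial SELECT against the summary table filtered by date range and team/person, which is why regeneration stays fast.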
I'm trying to develop a new reporting module for a resource management tool (PHP+Mysql).
I am trying to extract data in the following format from MySQL:
I have a table that consists of the date and location of multiple people (i.e. Office, Home or Client).
Sample data as in the DB:
Here date_plotted means the date on which the user is engaged, and plotting_date represents when this particular entry was made in the system. So the user was plotted to be in the office on 30th Oct, and the same entry was made on 30th Oct.
Data as in resource table
The resource table represents the user table.
Any suggestions on how to do this in MySQL?
These are the primary tables which need to be used.
The table above is done in Excel for now to represent the desired outcome.
I'm new to SQL so haven't tried anything yet.
There is a tool for Windows that might simplify this operation. It's made by MySQL and called MySQL for Excel. In theory it should allow you to structure and make changes to MySQL databases as well as perform queries that result in spreadsheets.
Without knowing more about your data - for example, an actual CSV file to work with and the parameters of the actual pull (whether it is always fixed dates or a dynamic pull based on a range) - this question could result in 100 different implementations that visually return similar results but have massively different overhead requirements in implementation.
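That said, if the goal is a grid with one row per user and one column per date showing the location, the usual MySQL technique is conditional aggregation. This is only a sketch under assumptions: the table names (resources, bookings) and columns (user_id, date_plotted, location) are guesses based on the description above, and in practice the date columns would be generated dynamically in PHP.

```sql
-- Sketch: one row per user, one column per date, value = location.
-- Table and column names are assumptions; adjust them to the real schema.
SELECT r.name AS resource,
       MAX(CASE WHEN b.date_plotted = '2018-10-30' THEN b.location END) AS `30 Oct`,
       MAX(CASE WHEN b.date_plotted = '2018-10-31' THEN b.location END) AS `31 Oct`,
       MAX(CASE WHEN b.date_plotted = '2018-11-01' THEN b.location END) AS `01 Nov`
FROM   resources r
LEFT   JOIN bookings b ON b.user_id = r.id
GROUP  BY r.name;
```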
I have multiple archive tables storing similar kinds of data, archived in a month-wise format. Now the requirement is to get all the archived data into one table instead of multiple tables.
I am doing this with a Union All in SSIS; however, the rows seem to be inserted into the destination table in random order.
Attached is the route taken for the transformation.
I want to prioritize the inserts, please suggest!
You can add an extra "Priority" column to each of the OLE DB sources with the corresponding priority for each source, and then after the union add a Sort component that sorts the data by Priority. But if you have a lot of data, that would be really inefficient, because the Sort component waits until all of the source data has been read.
I would suggest writing a proper source SQL statement that does the union/prioritization/sort for you and then inserting into the target.
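For example, a source query along these lines keeps the prioritization in the database engine instead of an SSIS Sort component (the archive table names are illustrative):

```sql
-- Illustrative: union the monthly archive tables with an explicit priority
-- and let the database do the ordering.
SELECT 1 AS Priority, a.* FROM dbo.Archive_Jan a
UNION ALL
SELECT 2 AS Priority, a.* FROM dbo.Archive_Feb a
UNION ALL
SELECT 3 AS Priority, a.* FROM dbo.Archive_Mar a
ORDER BY Priority;
```

You would use this as the query of a single OLE DB Source instead of one source per table followed by a Union All.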
Also, if the sources are on different servers, you can create a Foreach Loop container that iterates through the source tables and inserts all of them into the target table. You can use this article for reference.
I'm going to do my best to try to explain this. I currently have a data flow task with an OLE DB Source transferring data from a table in one database to a table in another database. It works fine, but the issue I'm having is that I keep adding duplicate data to the destination table.
So a CustomerID of '13029' with an amount of '$56.82' on Date '11/30/2012' is seen in that table multiple times. How do I make it so I can only have unique data transferring over to that destination table?
In the data flow task where you transfer the data, you can insert a Lookup transformation. In the Lookup you can specify a data source (a table or a query, whatever serves you best). Once you have chosen the data source, go to the Columns view and create a mapping connecting the CustomerID, Date and Amount of both tables.
In the General view you can configure what happens with matched/non-matched rows. Simply take the no-match output and direct it to the DB destination.
You will need to identify what makes that data unique in the table. If it's a customer table, then it's probably the customerid of 13029. However if it's a customer order table, then maybe it's the combination of CustomerId and OrderDate (and maybe not, I have placed two unique orders on the same date). You will know the answer to that based on your table's design.
Armed with that knowledge, you will want to write a query to pull back the keys from the target table, e.g. SELECT CO.CustomerId, CO.OrderId FROM dbo.CustomerOrder CO. If you know the process only transfers data from the current year, add a filter to that query to restrict the number of rows returned. The reason for this is memory conservation - you want SSIS to run fast, so don't bring back extraneous columns or rows it will never need.
Inside your dataflow, add a Lookup Transformation with that query. You don't specify 2005, 2008 or 2012 as your SSIS version and they have different behaviours associated with the Lookup Transformation. Generally speaking, what you are looking to do is identify the unmatched rows. By definition, unmatched means they don't exist in the target database so those are the rows that are new. 2005 assumes every row is going to match or it errors. You will need to click the Configure Error Output... button and select "Redirect Rows". 2008+ has an option under "Specify how to handle rows with no matching entries" and there you'll want "Redirect rows to no match output."
Now take the No match output branch (2008+) or the error output branch (2005) and plumb that into your destination.
What this approach doesn't cover is detecting and handling when the source system reports $56.82 and the target system has $22.38 (updates). If you need to handle that, then you need to look at some change detection system. Look at Andy Leonard's Stairway to Integration Services series of articles to learn about options for detecting and handling changes.
Have you considered using the T-SQL MERGE statement? http://technet.microsoft.com/en-us/library/bb510625.aspx
It will compare both tables on the fields you define, and take an action depending on whether a row matches or not.
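A hedged sketch of what that MERGE could look like, assuming the combination of CustomerID and date identifies a row; the table and column names below are placeholders based on the example in the question:

```sql
-- Sketch: upsert from source into target on CustomerID + OrderDate.
-- Table and column names are assumptions, not the actual schema.
MERGE INTO dbo.TargetOrders AS tgt
USING dbo.SourceOrders AS src
      ON  tgt.CustomerID = src.CustomerID
      AND tgt.OrderDate  = src.OrderDate
WHEN MATCHED AND tgt.Amount <> src.Amount THEN
    UPDATE SET tgt.Amount = src.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, OrderDate, Amount)
    VALUES (src.CustomerID, src.OrderDate, src.Amount);
```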
I am importing a FoxPro table into SQL Server 2008 using SSIS. The source data is a proprietary database that I have no control over. Let's call the table I am importing Customers.
Sometimes, the structure for Customers looks like this:
ID (int)
NAME (char(30))
ADDRESS (char(30))
CITY (char(20))
STATE (char(2))
ZIP (char(10))
CCNUM (char(16))
Other times, it looks like this:
ID (int)
NAME (char(30))
ADDRESS (char(30))
CITY (char(20))
STATE (char(2))
ZIP (char(10))
CCPTR (char(100))
This proprietary database basically comes in 2 different versions. The older version had a field called CCNUM (credit card #) that was a basic 16 character field. The newer version replaced that field with a field called CCPTR, a 100 character field that represents a card pointer (an encrypted value for the actual credit card number).
The problem is that every time I have to switch back and forth between the 2 datasets with these different table structures, SSIS blows up and I have to go in and manually refresh the metadata.
My question is: is there any way I can have SSIS dynamically look for one of these fields at runtime and, based on which one is there, load the data into the correct table structure in SQL?
Forgive me if this has been asked before. I am still fairly new to SSIS and I tried searching for this answer but to no avail.
Thanks,
Mark
The short answer is no. SSIS expects that there are no significant changes to the metadata of its source and destination components. There are ways to programmatically influence this with .NET, but that kind of misses the point.
A well-designed solution to this problem is to create 2 separate data flows that copy the data into a shared staging table. Use this staging table as the source to transform your data and push it into its final data structure.
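As an illustration, the shared staging table could simply carry both variants of the credit card field, with each data flow populating the column its source version knows about (the staging table name is a placeholder):

```sql
-- Sketch of a shared staging table that accepts both source versions.
-- Either CCNUM or CCPTR is populated, depending on which data flow loaded the row.
CREATE TABLE dbo.Staging_Customers (
    ID      INT       NOT NULL,
    NAME    CHAR(30)  NULL,
    ADDRESS CHAR(30)  NULL,
    CITY    CHAR(20)  NULL,
    STATE   CHAR(2)   NULL,
    ZIP     CHAR(10)  NULL,
    CCNUM   CHAR(16)  NULL,  -- filled by the old-version data flow
    CCPTR   CHAR(100) NULL   -- filled by the new-version data flow
);
```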
If you build your package based on the length (100) and run it on the (16), you should get only a warning. Are you getting an error?