I have been trying to set up a data warehouse using Ralph Kimball's technique, but I am having difficulty actually understanding how to load data into my tables. I have a sales_filev1.csv that contains the columns:
CUST_CITY_NM
CUST_STREET_ADD
CUST_POSTAL_CD
CUST_STATE_CD
CUST_NM
CUST_NO
CUST_PHONE_NO
PROD_CAT_CD
PROD_LN_CD
PROD_NM
PROD_PACKAGE_SIZE_NO
SLS_PROMO_IN
SLS_QTY_NO
SLS_UNIT_PRICE_AM
STORE_CITY_NM
STORE_ESTABLISH_DT
STORE_ID
STORE_LVL_CD
STORE_MGR_NM
STORE_MGR_PHONE_NO
STORE_NM
STORE_NO
STORE_POSTAL_CD
STORE_STATE_CD
STORE_STREET_AD
SALES_DT
Then I have a CUST_LOOKUP.csv containing:
CUST_NO
CUST_ID
CUST_INCOME_AM
CUST_CD
Then the last file is a product lookup:
PROD_NM
PROD_SKU_NO
SLS_UNIT_COST_AM
PROD_INTRO_DT
PROD_ID
I understand that I need to have a sales_fact table as well. However, would my sales_filev1 not be the sales fact, since it contains all of the information about the customers, store, and products purchased, and when? Then would I just use a join and insert to add the data together?
You need to model your data dimensionally (if you are going to use a star schema): decide the grain of your data, determine the measures that will go into the fact table(s), and decide the attributes that will reside in the dimension table(s).
A data warehouse is not all of your data joined together into one table; it is modelled so that it is optimal for storage and reporting.
Have a read about Dimensional Modelling and perhaps purchase Ralph Kimball's excellent book, The Data Warehouse Toolkit.
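For example, a first cut (only a sketch; the surrogate keys, data types and names below are assumptions, not the one right answer) might split the sales_filev1.csv columns into customer, product, store and date dimensions plus a sales fact at the grain of one row per product per sale:

CREATE TABLE dim_customer (
    customer_key    INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    cust_no         VARCHAR(20),                    -- natural key from the file
    cust_nm         VARCHAR(100),
    cust_street_add VARCHAR(200),
    cust_city_nm    VARCHAR(100),
    cust_state_cd   CHAR(2),
    cust_postal_cd  VARCHAR(10),
    cust_phone_no   VARCHAR(20)
);
-- dim_product, dim_store and dim_date would be built the same way from the PROD_*, STORE_* and SALES_DT columns.

CREATE TABLE sales_fact (
    customer_key      INT NOT NULL REFERENCES dim_customer (customer_key),
    product_key       INT NOT NULL,   -- REFERENCES dim_product
    store_key         INT NOT NULL,   -- REFERENCES dim_store
    date_key          INT NOT NULL,   -- REFERENCES dim_date
    sls_qty_no        INT,            -- measure
    sls_unit_price_am DECIMAL(10, 2), -- measure
    sls_promo_in      CHAR(1)         -- promo flag
);

You would then populate the dimensions first (enriching them from CUST_LOOKUP.csv and the product lookup) and load the fact afterwards by looking up each sales row's natural keys to get the surrogate keys.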
I am currently studying databases, and I have a question.
The professor told us to create 2 different databases, and then move all the data to a star schema data model.
Here is the diagram for the first database; I have already filled the tables with data.
This is the diagram for the second database, also with data.
This is my star schema model.
The problem I am facing is that I do not know how to start doing the mapping when adding my OLE DB source and OLE DB destination.
I have already searched the web, but I only find examples where they move just one database into the star schema data model.
The task you have is to consolidate two transactional/OLTP systems into a data warehouse. The reason you find only examples of moving/mapping one system into a data warehouse is that you simply repeat the process for each additional source that feeds into the DW.
In your example, you are integrating two sales systems to produce a unified sales business process report.
My approach would be to copy all the tables as-is into your DW and put them into a staging schema (stage_d1, stage_d2). This way, you have all the data locally and it's likely consistent as of the start of your data extract, i.e. as of 9:30 this morning.
Now that you have the data locally, you need to transform and enrich the data to populate your dimensions and then your fact table.
Let's analyze Dim_Customer. This table is probably a little light in terms of providing value but the methodology is what you should focus on. System 1 supplies a first and last name and a city, state and zipcode. System 2 gives us a Company name, contact name, city, state, postal code and phone.
Given the usage of both postal code and zip code, that would have me wondering whether we're dealing with international addresses versus US-centric (zip code) data. I'd also notice that we don't have an actual address line for this data. The point is: analyze your data so you know that you're modeling something that solves the problem (reporting on sales across all systems).
The next question I'd wonder about is how we populate the Customer dimension. If a Mario Almaguer had a purchase in both system 1 and system 2, are they the "same" person? Does it matter for this business process? If we sold to a person in TX and that name also exists in ME, does it matter if the name is in there twice?
I'll assume we only care about unique customer names. If it's a bad assumption, we go back and model it differently.
In my source, I'll write a query.
SELECT DISTINCT CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName FROM stage_d1.Customer AS C;
Run that and see that it returns the data I want. I'll then use an OLE DB Source in an SSIS data flow and pick the third drop-down option, a source query.
If I run the package, we'll get all the unique customer names from the source system. But we only want the names we don't already have, so we need something that checks our reference table for existing matches. That's the Lookup component.
The source for the Lookup will be the DW's Dim_Customer table. You'll match based on CustomerName. The Lookup component tells us whether an incoming row matched, and we can get two output streams: match and no-match. We're only interested in the no-match path, because that's the new data. Andy Leonard has an excellent Stairway to Integration Services series; in particular, we're talking about an Incremental Load.
From the Lookup, we'll drag the no-match branch to an OLE DB Destination pointed at the Dim_Customer table.
Run that and Dim_Customer is populated. Run it again, and no new rows should be added, since we're loading new data only.
Now we need to integrate the second system's staged customer data. Fortunately, it's the same steps, except this time our query is easier.
SELECT DISTINCT C.ContactName AS CustomerName FROM stage_d2.Customers AS C;
Lather, rinse, repeat for all of your other dimensions.
You could also skip the data flows and simply execute a query to do the same.
INSERT INTO dbo.Dim_Customer(CustomerName)
SELECT DISTINCT CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName
FROM stage_d1.Customer AS C
WHERE NOT EXISTS (SELECT * FROM dbo.Dim_Customer AS DC WHERE DC.CustomerName = CONCAT(C.FirstName, ' ', C.LastName));
Lather, rinse, repeat for the remaining dimensions.
Loading the fact is similar, except we will use Lookup components to find matches (since we need to translate our data into the dimensions' ids). Here I'll show how we'd populate a simplified version of your fact table:
SELECT O.Price AS UnitPrice, BO.OrderDate AS [Date], 1 AS Quantity, 0 AS Discount, CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName
FROM stage_d1.Ordering AS O
INNER JOIN stage_d1.Book_Order AS BO
ON BO.OrderID = O.OrderID
INNER JOIN stage_d1.Customer AS C
ON C.CustomerID = BO.Cus_CustomerID;
That's my source query. The customer Lookup will again match on Dim_Customer's CustomerName, but this time we'll retrieve the CustomerID from the Lookup component.
The destination then uses the UnitPrice, Date (depending on how you model it), Quantity and Discount directly from our source. The remaining dimension keys we populate through our Lookups.
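If you'd rather do the fact with pure T-SQL as well, the same Lookup logic is just a join against the dimension (a sketch; Fact_Sales and its columns stand in for your simplified fact table):

INSERT INTO dbo.Fact_Sales (CustomerID, UnitPrice, [Date], Quantity, Discount)
SELECT DC.CustomerID,
       SRC.UnitPrice,
       SRC.[Date],
       SRC.Quantity,
       SRC.Discount
FROM (
    SELECT O.Price AS UnitPrice, BO.OrderDate AS [Date], 1 AS Quantity, 0 AS Discount,
           CONCAT(C.FirstName, ' ', C.LastName) AS CustomerName
    FROM stage_d1.Ordering AS O
    INNER JOIN stage_d1.Book_Order AS BO ON BO.OrderID = O.OrderID
    INNER JOIN stage_d1.Customer AS C ON C.CustomerID = BO.Cus_CustomerID
) AS SRC
INNER JOIN dbo.Dim_Customer AS DC
    ON DC.CustomerName = SRC.CustomerName;   -- the join replaces the Lookup component: natural key in, surrogate key out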
The standard approach would be to do the following:
Copy your source data into staging tables in your target database.
Write the queries necessary to populate the star schema against the staging tables.
Populate all your dimension tables first and then all your fact tables.
I know that SQL can't save arrays (correct me if I'm wrong).
Why?
I know this is a stupid question, but arrays are just structured data. Why can't SQL save that?
Can I rewrite my MySQL database or download an add-on for SQL so I can save arrays?
Thanks in advance
Relational database management systems (RDBMS), such as MySQL, SQL Server, Oracle and PostgreSQL usually store data in tables. This is a very good way to store related data.
Let's say there are three entities: customers, orders, and products, and the orders contain multiple products. Four tables hence:
customers(customer_no, name)
products(product_no, name, price)
orders(order_no, customer_no, date)
order_details(order_no, product_no, amount)
We would provide indexes (i.e. search trees) to easily find orders of a customer or products in an order. Now let's say, we want to know how many orders have been made for product 123:
select count(distinct order_no)
from order_details
where product_no = 123;
The DBMS will quickly find the order_detail records for the product, because looking up an index is like searching by last name in a telephone book (binary search). And then it's mere counting. So only a few records get read and the whole query is really fast.
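The "telephone book" here is simply an index on the column you search by; for example (a sketch, using the tables above):

CREATE INDEX ix_orders_customer_no ON orders (customer_no);              -- supports "find a customer's orders"
CREATE INDEX ix_order_details_product_no ON order_details (product_no);  -- supports "which orders contain product 123"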
Now the same with arrays. Something like:
products(product_no, name, price)
customers
(
customer_no,
name,
array of orders
(
order_no,
date,
array of products
(
product_no,
amount
)
)
)
Well, the order details are now hidden inside an order element, which itself is inside a customer object. To get the number of orders for product 123, the only approach seems to be to read all customer records, loop through all orders and see whether they contain the product. This can take an awfully long time. Moreover, without foreign key constraints for the relations between the entities, the arrays may contain product numbers that don't even exist.
Well, there may be ways to kind of index array data and there may be ways to guarantee data consistency for them, but the relational approach with tables has proven to solve these things extremely well. So we would avoid arrays and rather build our relations with tables instead. This is what a relational database is made for.
(Having said this, arrays may come in handy every now and then, e.g. in a recursive query where you want to remember which records have already been visited, but these occasions are rare.)
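To illustrate that rare case, here is a sketch in PostgreSQL syntax (the employees table and its columns are made up), where an array column carries the rows already visited so the recursion can avoid cycles:

WITH RECURSIVE chain AS (
    SELECT e.employee_no, e.manager_no, ARRAY[e.employee_no] AS visited
    FROM employees AS e
    WHERE e.manager_no IS NULL               -- start at the top of the hierarchy
    UNION ALL
    SELECT e.employee_no, e.manager_no, c.visited || e.employee_no
    FROM employees AS e
    JOIN chain AS c ON e.manager_no = c.employee_no
    WHERE e.employee_no <> ALL (c.visited)   -- the array remembers which rows were already visited
)
SELECT * FROM chain;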
To answer my own question, I first want to say thank you for the comments.
THANK YOU!
Back to the question: ordinary SQL can't save arrays, and doesn't want to, because of normalization issues.
You can store arrays another way:
A SQL table is like an array. Link a new table to act as the array; create the table manually or, if the array can change, with code. There is no need for arrays in SQL.
If you have to, or want to, you can use NoSQL, or PostgreSQL (which has an array type), or store the data as JSON or XML (e.g. in Oracle).
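For example (a sketch with made-up names): instead of an array of phone numbers stored inside a person row, you link a child table and store one row per array element:

CREATE TABLE person (
    person_id INT PRIMARY KEY,
    name      VARCHAR(100)
);

CREATE TABLE person_phone (
    person_id INT NOT NULL,
    position  INT NOT NULL,          -- the "array index", if order matters
    phone_no  VARCHAR(20) NOT NULL,
    PRIMARY KEY (person_id, position),
    FOREIGN KEY (person_id) REFERENCES person (person_id)
);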
How do I use Power Pivot to summarize data in groups which are defined in a separate, non-relatable table
I'm analyzing a database that has the following tables:
Sales (Store, Category, Units, Sales)
Stores (Store, address, etc.)
StoreGroups (Store, Group)
A store can be in multiple groups (e.g. store B762 is in the NW group and the control_group), hence the StoreGroups table, where the two fields together make the primary key. Therefore, I can't relate StoreGroups to my Sales table, because both have duplicate Store values.
Right now all stores are being reported in each group:
[PivotTable screenshot]
To confirm: if a store is in two groups, its sales should get counted for BOTH groups, i.e. control_group and NWRegion.
I've tried to adapt this DAX example mentioned below but have not been successful:
http://www.daxpatterns.com/dynamic-segmentation/
You have a many-to-many relationship between stores and groups.
You should be able to create a relationship from the Store column in StoreGroups to the Store column in Stores (StoreGroups is a bridge table).
If you can post a link to some sample data, that would be helpful.
After doing that, you can start to read about writing DAX formulas for many-to-many scenarios here. Be sure to also read the comments, especially the one from Marco Russo.
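If it helps to see the intended result in plain SQL terms first, counting each store's sales once per group it belongs to is just a join through the bridge table (a sketch using the table and column names from the question):

SELECT sg.[Group],
       SUM(s.Sales) AS TotalSales,
       SUM(s.Units) AS TotalUnits
FROM Sales AS s
INNER JOIN StoreGroups AS sg ON sg.Store = s.Store   -- a store in two groups contributes to both totals
GROUP BY sg.[Group];

That double counting is the behaviour the many-to-many DAX patterns in the link aim to reproduce once the bridge relationship is in place.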
I'm new to multidimensional data warehousing and have been tasked by my workplace with developing a data warehousing solution for reporting purposes, so this might be a stupid question, but here it goes ...
Each record in my fact table has FK columns that link out to their respective dimension tables (e.g. dimCustomer, dimGeography, dimProduct).
When loading the data warehouse during the ETL process, I first loaded up the dimension tables with the details, then I loaded the fact table and did lookup transformations to find the FK values to put in the fact table. In doing so, it seems each row in the fact table has FKs of the same value (e.g. row 1 has an FK of 1 in each column across the board, row 2 has value 2, etc.).
I'm just wondering if this is typical or if I need to rethink the design of the warehouse and ETL process.
Any suggestions would be greatly appreciated.
Thanks
Based on your comments, it sounds like there's a missed step in your ETL process.
For a call center / contact center, I might start out with a fact table like this:
CallFactID - unique key just for ETL purposes only
AssociateID - call center associate who initially took the call
ProductID - product that the user is calling about
CallTypeID - General, Complaint, Misc, etc
ClientID - company / individual that is calling
CallDateID - linked to your Date (by day) Dimension
CallTimeOfDayID - bucketed id for call time based on business rules
CallStartTimestamp - ANSI timestamp of start time
CallEndTimestamp - ANSI timestamp of end time
CallDurationTimestamp - INTERVAL data type, or integer in seconds, call duration
Your dimension tables would then be:
AssociateDim
ProductDim
CallTypeDim
ClientDim
DateDim
TimeOfDayDim
Your ETL will need to build the dimensions first. If you have a relational model in your source system, you would typically just go to the "lookup" tables for various things, such as the "Products" table or "Associates" table, and denormalize any relationships that make sense to be included as attributes. For example, a relational product table might look like:
PRODUCTS: ProductKey,
ProductName,
ProductTypeKey,
ProductManufacturerKey,
SKU,
UPC
You'd denormalize this into a general product dimension by looking up the product types and manufacturer to end up with something like:
PRODUCTDIM: PRODUCTID (DW surrogate key),
ProductKey,
ProductName,
ProductTypeDesc,
ManufacturerDesc,
ManufacturerCountry,
SKU,
UPC
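That denormalizing lookup is just a join against the relational lookup tables; for example (a sketch only; the ProductTypes and Manufacturers names are assumptions based on the keys above):

INSERT INTO ProductDim (ProductKey, ProductName, ProductTypeDesc, ManufacturerDesc, ManufacturerCountry, SKU, UPC)
SELECT P.ProductKey,
       P.ProductName,
       PT.ProductTypeDesc,       -- assumed description column on the product-type lookup
       M.ManufacturerDesc,       -- assumed columns on the manufacturer lookup
       M.ManufacturerCountry,
       P.SKU,
       P.UPC
FROM Products AS P
INNER JOIN ProductTypes AS PT ON PT.ProductTypeKey = P.ProductTypeKey
INNER JOIN Manufacturers AS M ON M.ManufacturerKey = P.ProductManufacturerKey;
-- ProductDim's PRODUCTID surrogate key is assumed to be generated automatically (e.g. an IDENTITY column).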
For attributes that exist only on your transaction (call record) tables but are low-cardinality, you can create dimensions by doing a SELECT DISTINCT on those tables.
Once you have loaded all the dimensions, you then load the fact by doing a lookup against each of the dimensions based on the natural keys (which you've preserved in the dimensions), and assigning those keys to the fact row.
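In set-based terms, the fact load looks something like this (a sketch; the staging table, the natural-key column names, and the fact table name are assumptions, and only a few dimensions are shown):

INSERT INTO CallFact (AssociateID, ProductID, CallDateID, CallStartTimestamp, CallEndTimestamp, CallDurationTimestamp)
SELECT A.AssociateID,
       P.ProductID,
       D.CallDateID,
       S.CallStartTimestamp,
       S.CallEndTimestamp,
       DATEDIFF(SECOND, S.CallStartTimestamp, S.CallEndTimestamp)       -- duration in seconds
FROM StagedCalls AS S                                                   -- hypothetical staging table of raw call records
INNER JOIN AssociateDim AS A ON A.AssociateCode = S.AssociateCode       -- natural keys preserved in the dimensions
INNER JOIN ProductDim   AS P ON P.ProductKey    = S.ProductKey
INNER JOIN DateDim      AS D ON D.CalendarDate  = CAST(S.CallStartTimestamp AS DATE);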
For a more detailed guide on ETL with DW Star Schemas, I highly recommend Ralph Kimball's book The Data Warehouse ETL Toolkit.
We have a requirement in our application where we need to store references for later access.
Example: a user commits an invoice at some point in time, and everything the invoice references (customer address, calculated amount of money, product descriptions) and its calculations should be stored over time.
We need to hold the references somehow, but what if, for example, the product name changes? So somehow we need to copy everything so it is documented for later and not affected by future changes. Even when products are deleted, they need to remain viewable later, because the invoice that references them is stored.
What is the best practice here regarding database design? And what is the most flexible approach, e.g. when the user wants to edit his invoice later and restore it from the DB?
Thank you!
Here is one way to do it:
Essentially, we never modify or delete the existing data. We "modify" it by creating a new version. We "delete" it by setting the DELETED flag.
For example:
If a product's price changes, we insert a new row into PRODUCT_VERSION, while old orders stay connected to the old PRODUCT_VERSION and the old price.
When a buyer changes their address, we simply insert a new row into CUSTOMER_VERSION and link new orders to that, while keeping the old orders linked to the old version.
If a product is deleted, we don't really delete it; we simply set the PRODUCT.DELETED flag, so all the orders historically made for that product stay in the database.
If a customer is deleted (e.g. because they requested to be unregistered), we set the CUSTOMER.DELETED flag.
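As a rough sketch, the product side of such a model could look like this (MySQL-style DDL; the names and the simplified order-line table are my own choices to match the description):

CREATE TABLE PRODUCT (
    PRODUCT_ID INT NOT NULL,
    DELETED    BOOLEAN NOT NULL DEFAULT FALSE,   -- "deleting" a product only sets this flag
    PRIMARY KEY (PRODUCT_ID)
);

CREATE TABLE PRODUCT_VERSION (
    PRODUCT_ID INT NOT NULL,
    VERSION_NO INT NOT NULL,                     -- each change inserts a new version
    NAME       VARCHAR(100) NOT NULL,
    PRICE      DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (PRODUCT_ID, VERSION_NO),        -- identifying relationship
    FOREIGN KEY (PRODUCT_ID) REFERENCES PRODUCT (PRODUCT_ID)
);

CREATE TABLE ORDER_ITEM (
    ORDER_ID   INT NOT NULL,                     -- FK to the order-header table omitted for brevity
    PRODUCT_ID INT NOT NULL,
    VERSION_NO INT NOT NULL,                     -- the order line points at a specific version, freezing name and price
    QUANTITY   INT NOT NULL,
    PRIMARY KEY (ORDER_ID, PRODUCT_ID),
    FOREIGN KEY (PRODUCT_ID, VERSION_NO) REFERENCES PRODUCT_VERSION (PRODUCT_ID, VERSION_NO)
);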
Caveats:
If the product name needs to be unique, that can't be enforced declaratively in the model above. You'll either need to "promote" NAME from PRODUCT_VERSION to PRODUCT, make it a key there and give up the ability to "evolve" a product's name, or enforce uniqueness on only the latest PRODUCT_VERSION (probably through triggers).
There is a potential problem with customer privacy. If a customer is deleted from the system, it may be desirable to physically remove their data from the database, and just setting CUSTOMER.DELETED won't do that. If that's a concern, either blank out the privacy-sensitive data in all the customer's versions, or alternatively disconnect existing orders from the real customer and reconnect them to a special "anonymous" customer, then physically delete all the customer's versions.
This model uses a lot of identifying relationships. This leads to "fat" foreign keys and could be a bit of a storage problem since MySQL doesn't support leading-edge index compression (unlike, say, Oracle), but on the other hand InnoDB always clusters the data on PK and this clustering can be beneficial for performance. Also, JOINs are less necessary.
An equivalent model could instead be built with non-identifying relationships and surrogate keys.
You could add a column to the product table indicating whether or not it is still being sold. Then, when the product is "deleted", you just set the flag so that it is no longer available as a new product, but you retain the data for future lookups.
To deal with name changes, you should be using IDs to refer to products rather than using the name directly.
You've opened up an eternal debate between the purist and practical approach.
From a normalization standpoint of your database, you "should" keep all the relevant data. In other words, say a product name changes, save the date of the change so that you could go back in time and rebuild your invoice with that product name, and all other data as it existed that day.
A "de"normalized approach is to view that invoice as a "moment in time", recording in the relevant tables data as it actually was that day. This approach lets you pull up that invoice without any dependancies at all, but you could never recreate that invoice from scratch.
The problem you're facing is, as I'm sure you know, a result of Database Normalization. One of the approaches to resolving this can be taken from Business Intelligence techniques: archiving the data in a de-normalized state in a Data Warehouse.
Normalized data:
Orders table
OrderId
CustomerId
Customers Table
CustomerId
Firstname
etc
Items table
ItemId
Itemname
ItemPrice
OrderDetails Table
ItemDetailId
OrderId
ItemId
ItemQty
etc
When queried and stored de-normalized, the data warehouse table looks like:
OrderId
CustomerId
CustomerName
CustomerAddress
(other Customer Fields)
ItemDetailId
ItemId
ItemName
ItemPrice
(Other OrderDetail and Item Fields)
Typically, there is either some sort of scheduled job that pulls data from the normalized tables into the Data Warehouse on a scheduled basis, OR, if your design allows, it could be done when an order reaches a certain status (such as shipped). It could also be that the records are stored at each change of status (with a field called OrderStatus tracking the current status), so the fully de-normalized data is available for each step of the order/fulfillment process. When and how to archive the data into the warehouse will vary based on your needs.
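Such a job is essentially a de-normalizing INSERT ... SELECT over the normalized tables; a sketch (the OrderWarehouse table name and the OrderStatus filter are assumptions):

INSERT INTO OrderWarehouse            -- hypothetical name for the de-normalized warehouse table
    (OrderId, CustomerId, CustomerName, ItemDetailId, ItemId, ItemName, ItemPrice, ItemQty)
SELECT o.OrderId,
       c.CustomerId,
       c.Firstname,                   -- plus whatever other customer fields you carry across
       od.ItemDetailId,
       i.ItemId,
       i.Itemname,
       i.ItemPrice,
       od.ItemQty
FROM Orders AS o
INNER JOIN Customers AS c ON c.CustomerId = o.CustomerId
INNER JOIN OrderDetails AS od ON od.OrderId = o.OrderId
INNER JOIN Items AS i ON i.ItemId = od.ItemId
WHERE o.OrderStatus = 'Shipped';      -- assumes an OrderStatus column; alternatively run on a schedule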
There is a lot of overhead involved in the above, but the other common approach I'm aware of carries even MORE overhead.
The other approach would be to make the tables read-only. If a customer wants to change their address, you don't edit their existing address; you insert a new record.
So if my address is AddressId 12 when I first order on your site in January, then I move on July 4, I get a new AddressId tied to my account (say AddressId 123123, because your site is very successful and has attracted a ton of customers).
Orders I placed before July 4 would have AddressId 12 associated with them, and orders placed on or after July 4 would have AddressId 123123.
Repeat that pattern with every table that needs to retain historical data.
I do have a third approach, but it is difficult to search for. I use it in one app only, and it actually works out pretty well in this single instance, which had some pretty specific business needs for reconstructing the data exactly as it was at a specific point in time. I wouldn't use it unless I had similar business needs.
At a specific status, serialize the data into an XML document, or some other document format you can use to reconstruct the data. This allows you to save the data as it was at the time it was serialized, retaining the original table structure and relations.
When you have time-sensitive data, you use things like the Product and Customer tables as lookup tables and store the information directly in your Orders/OrderDetails tables.
So the order table might contain the customer name and address, and the details would contain all relevant information about the product, especially the price (you never want to rely on the product table for price information beyond the initial lookup at the time of the order).
This is NOT denormalizing; the data changes over time, but you need the historical value, so you must store it at the time the record is created or you will lose data integrity. You don't want your financial reports to suddenly indicate you sold 30% more last year because of price updates. That's not what you sold.
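A sketch of what that capture could look like at order time (the variable and column names are placeholders, not a prescribed schema):

INSERT INTO OrderDetails (OrderId, ProductId, Quantity, UnitPrice, ProductName)
SELECT @OrderId,
       p.ProductId,
       @Quantity,
       p.UnitPrice,    -- price copied from the lookup table at the time of the order
       p.ProductName   -- name copied as well, so a later rename does not rewrite history
FROM Products AS p
WHERE p.ProductId = @ProductId;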