How to import data into a star schema data warehouse - mysql

I have searched everywhere on the web to find out how I can import data into a star schema data warehouse. A lot of the material online explains the design of the star schema and data warehouse, but none of it explains how exactly data is loaded into the DW. Here is what I've done so far:
I am trying to make an application of high school basketball statistics for each player.
I have:
A list of all of the players' names, heights, positions and numbers
A list of all of the high schools
A list of all of the schedules
A list of conferences
Statistics (points, rebounds, steals, games played, etc.) for each player for the current year.
I assume the stats would be my fact table and the rest would be my dim tables.
Now the million dollar question -- how in the world do I get the data into that format appropriately?
I tried simply importing them into their respective tables but I don't know how they connect.
Example: there are 800 players and 400 schools. Each school has a unique id (primary key). I upload the players into dim players and the schools into dim schools. Now how do I connect them?
Please help. Thanks in advance. Sorry for the rambling :)

There are many ways to import data into a database: using built-in loaders, scripts or, what is most commonly used in DW environments, an ETL tool.
About your fact table: I think the stats are metrics (measures), not the transaction itself. In other words, you measure a transaction, not a metric itself.

Using an ETL tool (E - extract your data from your sources, T - transform your data, i.e. manipulate it into the shape you want, L - load the data into your DW), you can safely and reliably get your data loaded into your DW.
You can use ETL tools like SSIS, Talend, etc.

Yes, "star", "dim", "fact", and "data warehouse" are appropriate terms, but I would rather approach it from "entities" and "relationships"...
You have essentially defined 5 "Entities". Each Entity is (usually) manifested as one database table. Write the CREATE TABLEs. Be sure to include a PRIMARY KEY for each; it will uniquely identify each row in the table.
Now think about relationships. Think about 1:many, such as 1 high school has 'many' players. Think about many:many.
For 1:many, you put, for example, the id of the high school as a column in the player table.
For many:many you need an extra table. Write the CREATE TABLEs for any of those you may need.
Now, read the data, and do INSERTs into the appropriate table.
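To make the 1:many link concrete, here is a minimal sketch in MySQL. The table and column names (dim_school, dim_player, school_name, and so on) are only assumptions for illustration; the point is that each player row carries the id of its school, which you look up by the school's name (or another natural key) while inserting.
-- Schools: one row per school, with a surrogate primary key.
CREATE TABLE dim_school (
    school_id   INT AUTO_INCREMENT PRIMARY KEY,
    school_name VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_school_name (school_name)
);

-- Players: one row per player; school_id links each player to a school (1:many).
CREATE TABLE dim_player (
    player_id   INT AUTO_INCREMENT PRIMARY KEY,
    player_name VARCHAR(100) NOT NULL,
    height_cm   INT,
    position    VARCHAR(20),
    number      INT,
    school_id   INT NOT NULL,
    FOREIGN KEY (school_id) REFERENCES dim_school (school_id)
);

-- Load schools first, then players, looking up each player's school id by name.
INSERT INTO dim_school (school_name) VALUES ('Central High');

INSERT INTO dim_player (player_name, height_cm, position, number, school_id)
SELECT 'John Smith', 188, 'PG', 23, s.school_id
FROM dim_school AS s
WHERE s.school_name = 'Central High';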
After that, you can think about the SELECTs to extract interesting data. At the same time, decide what INDEX(es) will be useful. But that is another discussion.
When you are all finished, you will have learned a bunch about SQL, and may realize that some things should have been done a different way. So, be ready to start over. Think of it as a learning exercise.

You can use SQL Server Data Tools for this project.
SQL Server Data Tools consists of SSIS, SSAS and SSRS.
Use SSIS to create an ETL process for the data in your database.
Use SSAS to create dimensions, fact tables and cubes (you can do a lot more with this).
Use SSRS to present the data in a user-friendly way.
Lots of videos are available on YouTube.

Related

Is Reservation System Suitable for Amazon DynamoDB / NoSQL?

I'm working on a basic restaurant reservation system and was thinking about using Amazon DynamoDB for this project. That being said, I'm not even sure if DynamoDB is suitable for something like this, or if I should stick to MySQL RDS, since some of the queries may be quite complex.
Functionality I need:
User will submit a "Find Table" form with date, time and party size.
Check the RESTAURANT table to see if the date and party size are even allowed.
Check the BLOCKED table for blocked dates (holidays and other closures).
Check the HOURS table to make sure the restaurant is even open.
Check the TABLEINFO table for a table based on party size AND compare with the RESERVATION table to make sure the table is not already reserved for another guest during the same time.
Any suggestions or tips on the DynamoDB database design especially hash & range use for something like this?
Or do you think MySQL database is better suited for this kind of app?
This is a quick DB design to give you a better idea of what I'm trying to do.
I've done a lot with relational databases, and a bit with NoSQL databases (just so you know where I'm coming from). IMHO, NoSQL databases are best suited to scenarios where one or more of the following are true:
The data is essentially flat (not a lot of relations, almost like an old flat-file).
There's a definite "parent" type record with "child" records which are small enough/accessed frequently enough with the parent to justify embedding them right in the record.
You need the freedom to add/populate fields within reason. I like to think of it like inheritance, where every item in the table shares some common traits (ID, Name), but different records might have different traits. For example, an online product catalog might have books, bikes, and MP3 songs in it. A record for a "book" item would have stuff like ISBN, number of pages, author, etc. A "bike" might have wheel size and color, and an "MP3" would have length, artist, genre, etc. You'd never get all of those things in an "item" table in an RDS without some serious overloading or leaving fields empty. A NoSQL database would allow you to store all of that info in the table, and only for the items that need it.
You can definitely build the schema you included with your question using the indexing abilities of Dynamo, but you'd be trying to make a NoSQL database act like an RDS.
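If the relational route wins out, here is a rough sketch of the core availability check in MySQL, purely for illustration; the table names follow the question (TABLEINFO, RESERVATION), but the columns (seats, table_id, reserved_at) are assumptions, not a definitive schema:
-- Find tables big enough for the party that do not already have a
-- reservation for the requested time slot.
SELECT t.table_id
FROM TABLEINFO AS t
WHERE t.seats >= 4                                    -- requested party size
  AND NOT EXISTS (
        SELECT 1
        FROM RESERVATION AS r
        WHERE r.table_id = t.table_id
          AND r.reserved_at = '2024-06-01 19:00:00'   -- requested date/time
      );
The BLOCKED and HOURS checks would be similar single-table lookups keyed on the requested date and time.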
That said: I myself would try it with Dynamo first as a learning experience. :)

Mongo vs MySql Search Optimization

So I'm in the process of designing a system that is going to store document-type data (i.e. transcribed documents). Immediately, I thought this would be a great opportunity to leverage a NoSQL implementation like MongoDB. However, given that I have zero experience with Mongo, I'm wondering: on each of these documents, I have a number of metadata tags I want to be able to search across: things like date, author, keywords, etc. If I were to use an RDBMS like MySQL, I'd probably store these items in a separate table linked by a foreign key and then index the items that were most likely to be searched on. Then I could run queries against that table and only pull back the full text results for the items that matched (which saves on disk reads by not having to read through a row that contains a lot of text or BLOB information).
Would something similar be possible with Mongo? I know in Mongo I could simply create 1 document that would have all the metadata AND the actual transcription but is it easy and highly performant to search the various fields in the metadata if the document is stored like that? Is there a best practice when needing to perform searches across various items in a document in Mongo? Or is this type of scenario more suited for an RDBMS rather than a NOSQL implementation?
You can add indexes for individual fields in mongodb documents. Only when the indexes get larger than your memory may the performance of index-based searches become a problem.
When you decide whether to go with mongodb, keep in mind that there is no join operation. This has to be done in your db layer or above.
If your primary concern is searching, there is an ElasticSearch river for mongodb, so you can utilize ElasticSearch on your dataset.
The NoSQL model is geared towards storing data at large scale. Yes, you can create indexes and run the queries you want, but instead of having related entities spread across tables, you have a single complete entity that embeds all the entities that depend on it.
When you have to extract complex reports with many joins from a relational database containing millions of rows, it becomes impractical, because you may end up compromising other applications.
For example:
We have Room and Student entities.
Each room has many students. In the relational model we would write the following:
SELECT * FROM ROOM R
INNER JOIN STUDENT S
ON S.ID = R.STUDENTID
Imagine doing that with some 20 tables, each holding thousands of rows. The performance would be horrible.
With MongoDB you would do this:
db.sala.find(null)
and you would get back all the rooms with their students embedded.
MongoDB is a database that scales horizontally.
You can read:
http://openmymind.net/mongodb.pdf
That site also has an interactive tutorial that uses the book's examples. Very nice.
There is also a site where you can try mongodb online and test your commands; search for "try mongodb".
Also read about sharding with replica sets. I believe it will open your mind greatly.
You can install Robomongo, which is a graphical interface for tinkering with mongodb.
http://robomongo.org/

Creating a MySQL Database Schema for large data set

I'm struggling to find the best way to build out a structure that will work for my project. The answer may be simple but I'm struggling due to the massive number of columns or tables, depending on how it's set up.
We have several tools, each that can be run for many customers. Each tool has a series of questions that populate a database of answers. After the tool is run, we populate another series of data that is the output of the tool. We have roughly 10 tools, all populating a spreadsheet of 1500 data points. Here's where I struggle... each tool can be run multiple times, and many tools share the same data point. My next project is to build an application that can begin data entry for a tool, but allow import of data that shares the same datapoint for a tool that has already been run.
A simple example:
Tool 1 - company, numberofusers, numberoflocations, cost
Tool 2 - company, numberofusers, totalstorage, employeepayrate
So if the same company completed tool 1, I need to be able to populate "numberofusers" (or offer to populate) when they complete tool 2 since it already exists.
I think what it boils down to is, would it be better to create a structure that has 1500 tables, 1 for each data element with additional data around each data element, or to create a single massive table - something like...
customerID(FK), EventID(fk), ToolID(fk), numberofusers, numberoflocations, cost, total storage, employee pay,.....(1500)
If I go this route and have one large table, I'm not sure how that will impact performance. Likewise, how difficult will it be to maintain 1500 tables?
Another dimension is that it would be nice to have a description of each field:
numberofusers,title,description,active(bool). I assume this is only possible if each element is in its own table?
Thoughts? Suggestions? Sorry for the lengthy question, new here.
Build a main table with all the common data: company, # users, .. other stuff. Give each row a unique id.
Build a table for each unique tool, with the company id from above and any data unique to that implementation. Give each table a primary (unique) key on 'tool use' and 'company'.
This covers the common data in one place, identifies each 'customer' and provides for multiple uses of a given tool for each customer. Every use and customer is trackable and distinct.
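A minimal sketch of that layout in MySQL, purely to show the shape (table and column names such as company and tool1_run are assumptions, not a prescribed design):
-- Common data shared by every tool, one row per customer/company.
CREATE TABLE company (
    company_id          INT AUTO_INCREMENT PRIMARY KEY,
    company_name        VARCHAR(100) NOT NULL,
    number_of_users     INT,
    number_of_locations INT
);

-- One table per tool, holding only the data unique to that tool.
-- Each run is a separate row, so a company can run the same tool many times.
CREATE TABLE tool1_run (
    run_id     INT AUTO_INCREMENT PRIMARY KEY,
    company_id INT NOT NULL,
    run_date   DATE,
    cost       DECIMAL(10,2),
    FOREIGN KEY (company_id) REFERENCES company (company_id)
);
Data points shared between tools (like number_of_users) live once on the company row, so a later tool can offer to pre-populate them; tool-specific points live only in that tool's table.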
More about normalization here.
I agree with etherbubunny on normalization, but with larger datasets there are performance considerations that quickly become important. Joins, which are often required in normalized databases to display human-readable information, can be performance killers on even medium-sized tables, which is why a lot of data warehouse models use de-normalized datasets for reporting. This is essentially pre-building the joined reporting data into new tables, with heavy use of indexing, archiving and partitioning.
In many cases smart use of partitioning on its own can also effectively help reduce the size of the datasets being queried. This usually takes quite a bit of maintenance unless certain parameters remain fixed though.
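As a rough illustration of date-based partitioning in MySQL (the table and column names here are invented for the example):
-- A reporting table partitioned by year: queries that filter on result_date
-- only scan the relevant partitions, and old years can be dropped cheaply.
CREATE TABLE tool_results (
    result_id   BIGINT NOT NULL,
    company_id  INT NOT NULL,
    result_date DATE NOT NULL,
    value       DECIMAL(12,2),
    PRIMARY KEY (result_id, result_date)
)
PARTITION BY RANGE (YEAR(result_date)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
Dropping an old year is then a quick ALTER TABLE ... DROP PARTITION instead of a slow DELETE.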
Ultimately in your case (and most others) I highly recommend building it the way you are able to maintain and understand what is going on and then performing regular performance checks via slow query logs, explain, and performance monitoring tools like percona's tool set. This will give you insight into what is really happening and give you some data to come back here or the MySQL forums with. We can always speculate here but ultimately the real data and your setup will be the driving force behind what is right for you.

Best practice to Load Fact table in MS SSIS

I am new to SSIS in data warehousing. I am using Microsoft Business Intelligence Studio.
I have 5 dimensions, each with its own PK.
I have a fact table that contains all the PKs of the dimensions, meaning a foreign key relationship exists (as in a star schema).
Now what is the best practice to load the fact table?
What I have done is write a cross join query between the 5 dimensions and dump the resultant set into the fact table. But I don't think this is a good practice.
I am completely new to MS SSIS, so please describe suggestions in detail.
Thanks.
Take a look at the Microsoft Project Real examples. Also get a Kimball book and read up on loading fact tables -- the topic covers several chapters.
I would echo @Damir's points about Project Real and Kimball. I am a fan of both.
To give you some more thoughts and answer your question:
load your date dimension and other "static" dimensions as a one off load
load records into all your dimensions to take care of NULL and UNKNOWN values
load your dimensions. For your dimensions, decide on a column by column basis what you want as type 1 or type 2 changing dimension columns. Be cautious and choose them mostly as type 1 unless there is a good reason.
load your fact table by joining your staging transaction data (which will go into the fact table) to your new dimension tables using the business keys, thus looking up the dimensions' foreign keys as you go. E.g. sales transactions will have a store number (the business key), which you would look up in DimStore (already loaded in the previous step); that gives you the kStore of DimStore, and you would then record kStore against that transaction in FactSalesTransaction.
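Here is a sketch of that lookup step written as plain SQL, whether you run it in an Execute SQL task or model it with SSIS Lookup components. DimStore, kStore and FactSalesTransaction follow the naming above; StagingSales, DimDate and the remaining columns are assumptions made for the example:
-- Load the fact table by resolving each business key to its surrogate key.
INSERT INTO FactSalesTransaction (kStore, kDate, SalesAmount)
SELECT COALESCE(ds.kStore, -1),       -- -1 = the UNKNOWN member loaded earlier
       COALESCE(dd.kDate, -1),
       st.SalesAmount
FROM StagingSales AS st
LEFT JOIN DimStore AS ds ON ds.StoreNumber  = st.StoreNumber   -- business key lookup
LEFT JOIN DimDate  AS dd ON dd.CalendarDate = st.SaleDate;     -- business key lookup
The LEFT JOINs plus the UNKNOWN members from the earlier step mean fact rows are never silently dropped when a lookup fails.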
Other general things you should consider (not related to your question, but worth thinking about if you are starting out):
Data archiving. How long will you keep data online? / when will it be deleted?
Table partitioning. If you have very large fact table(s), you should consider partitioning on a date or subject area basis. Date is quite nice, as you can do some interesting things with regard to dropping old partitions when the data is too old, as part of the standard load process.
Having the DWH as a snowflaked schema, then using a set of views to flatten the snowflake into a star. This is particularly useful when putting an OLAP cube on top of a SQL DWH, as it simplifies the cube design.
How are you going to manage different environments (Dev/Test/etc/Prod)? Using one of the SQL Server configuration styles is imperative.
Build a template SSIS package with all the variables you need and the configuration/connection strings you want. It will save loads of time to do that now, rather than having to rework packages when you discover new things. Do trivial prototypes initially to prove your methodology!

First-time database design: am I overengineering? [closed]

Background
I'm a first year CS student and I work part time for my dad's small business. I don't have any experience in real world application development. I have written scripts in Python, some coursework in C, but nothing like this.
My dad has a small training business and currently all classes are scheduled, recorded and followed up via an external web application. There is an export/"reports" feature but it is very generic and we need specific reports. We don't have access to the actual database to run the queries. I've been asked to set up a custom reporting system.
My idea is to create the generic CSV exports and import (probably with Python) them into a MySQL database hosted in the office every night, from where I can run the specific queries that are needed. I don't have experience in databases but understand the very basics. I've read a little about database creation and normal forms.
We may start having international clients soon, so I want the database to not explode if/when that happens. We also currently have a couple big corporations as clients, with different divisions (e.g. ACME parent company, ACME healthcare division, ACME bodycare division)
The schema I have come up with is the following:
From the client perspective:
Clients is the main table
Clients are linked to the department they work for
Departments can be scattered around a country: HR in London, Marketing in Swansea, etc.
Departments are linked to the division of a company
Divisions are linked to the parent company
From the classes perspective:
Sessions is the main table
A teacher is linked to each session
A statusid is given to each session. E.g. 0 - Completed, 1 - Cancelled
Sessions are grouped into "packs" of an arbitrary size
Each pack is assigned to a client
I "designed" (more like scribbled) the schema on a piece of paper, trying to keep it normalised to the 3rd form. I then plugged it into MySQL Workbench and it made it all pretty for me: (Click here for full-sized graphic)
Example queries I'll be running
Which clients with credit still left are inactive (those without a class scheduled in the future)
What is the attendance rate per client/department/division (measured by the status id in each session)
How many classes has a teacher had in a month
Flag clients who have low attendance rate
Custom reports for HR departments with attendance rates of people in their division
Question(s)
Is this overengineered or am I headed the right way?
Will the need to join multiple tables for most queries result in a big performance hit?
I have added a 'lastsession' column to clients, as it is probably going to be a common query. Is this a good idea or should I keep the database strictly normalised?
Thanks for your time
Some more answers to your questions:
1) You're pretty much on target for someone who is approaching a problem like this for the first time. I think the pointers from others on this question thus far pretty much cover it. Good job!
2 & 3) The performance hit you will take will largely be dependent on having and optimizing the right indexes for your particular queries / procedures and more importantly the volume of records. Unless you are talking about well over a million records in your main tables you seem to be on track to having a sufficiently mainstream design that performance will not be an issue on reasonable hardware.
That said, and this relates to your question 3, with the start you have you probably shouldn't really be overly worried about performance or hyper-sensitivity to normalization orthodoxy here. This is a reporting server you are building, not a transaction based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. Not only does a report server function have more tolerance for complex and lengthy queries, but the strategies to improve performance are much different.
For example, in a transaction based application environment your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this but you can have an even greater impact on performance by introducing a snapshot mechanism where a scheduled process runs and stores pre-configured reports and your users access the snapshot data with no stress on your db tier on a per request basis.
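For example, a minimal version of that snapshot idea in MySQL could be a pre-aggregated table rebuilt overnight; the table and column names below are invented for the sketch and assume the sessions/packs layout from the question:
-- Pre-aggregated attendance snapshot, rebuilt nightly (e.g. from cron or a
-- MySQL scheduled EVENT) so report queries hit this small table instead of
-- joining the full schema on every request.
CREATE TABLE report_attendance_snapshot (
    client_id      INT,
    sessions_total INT,
    sessions_done  INT,
    snapshot_taken DATETIME
);

TRUNCATE TABLE report_attendance_snapshot;

INSERT INTO report_attendance_snapshot
SELECT p.client_id,
       COUNT(*),
       SUM(s.status_id = 0),   -- 0 = Completed, per the status list in the question
       NOW()
FROM sessions AS s
JOIN packs AS p ON p.pack_id = s.pack_id
GROUP BY p.client_id;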
All of this is a long-winded rant to illustrate that what design principles and tricks you employ may differ given the role of the db you're creating. I hope that's helpful.
You've got the right idea. You can however clean it up, and remove some of the mapping (has*) tables.
What you can do is in the Departments table, add CityId and DivisionId.
Besides that, I think everything is fine...
The only changes I would make are:
1- Change your VARCHARs to NVARCHAR; if you might be going international, you may want Unicode.
2- Change your int ids to GUIDs (uniqueidentifier) if possible (this might just be my personal preference). Assuming you eventually get to the point where you have multiple environments (dev/test/staging/prod), you may want to migrate data from one to the other. Having GUID ids makes this significantly easier.
3- Three layers for your Company -> Division -> Department structure may not be enough. Now, this might be over-engineering, but you could generalize that hierarchy so that it supports n levels of depth (see the sketch after this list). This will make some of your queries more complex, so it may not be worth the trade-off. Further, any client that has more layers would then fit easily into this model.
4- You also have a Status in the Client table that is a VARCHAR and has no link to the Statuses table. I'd expect a little more clarity there as to what the Client Status represents.
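A common way to do that generalization is a single self-referencing table. This is only a sketch with invented names; the three-table version in the current design is perfectly fine if the depth never changes:
-- One row per organizational unit; parent_id points at the unit above it,
-- so Company -> Division -> Department (or deeper) is just a chain of rows.
CREATE TABLE org_unit (
    org_unit_id INT AUTO_INCREMENT PRIMARY KEY,
    parent_id   INT NULL,               -- NULL for the top-level company
    unit_name   VARCHAR(100) NOT NULL,
    unit_type   VARCHAR(20)  NOT NULL,  -- 'Company', 'Division', 'Department', ...
    FOREIGN KEY (parent_id) REFERENCES org_unit (org_unit_id)
);
Clients would then link to org_unit_id; walking the chain up or down takes either repeated self-joins or, on MySQL 8+, a recursive CTE.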
No. It looks like you're designing at a good level of detail.
I think that Countries and Companies are really the same entity in your design, as are Cities and Divisions. I'd get rid of the Countries and Cities tables (and Cities_Has_Departments) and, if necessary, add a boolean flag IsPublicSector to the Companies table (or a CompanyType column if there are more choices than simply Private Sector / Public Sector).
Also, I think there's an error in your usage of the Departments table. It looks like the Departments table serves as a reference to the various kinds of departments that each customer division can have. If so, it should be called DepartmentTypes. But your clients (who are, I assume, attendees) do not belong to a department TYPE, they belong to an actual department instance in a company. As it stands now, you will know that a given client belongs to an HR department somewhere, but not which one!
In other words, Clients should be linked to the table that you call Divisions_Has_Departments (but that I would call simply Departments). If this is so, then you must collapse Cities into Divisions as discussed above if you want to use standard referential integrity in the database.
By the way, it's worth noting that if you're generating CSVs already and want to load them into a mySQL database, LOAD DATA LOCAL INFILE is your best friend: http://dev.mysql.com/doc/refman/5.1/en/load-data.html . Mysqlimport is also worth looking into, and is a command-line tool that's basically a nice wrapper around load data infile.
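As an example of what that nightly CSV load could look like (the file path and column list are placeholders for whatever the export actually contains):
-- Load the nightly CSV export into a staging table, skipping the header row.
LOAD DATA LOCAL INFILE '/path/to/sessions_export.csv'
INTO TABLE staging_sessions
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(session_date, teacher_name, client_name, status);
From the staging table you can then INSERT ... SELECT into the normalised tables, resolving names to ids as you go.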
Most things have already been said, but I feel that I can add one thing: it is quite common for younger developers to worry about performance a little bit too much up-front, and your question about joining tables seems to go into that direction. This is a software development anti-pattern called 'Premature Optimization'. Try to banish that reflex from your mind :)
One more thing: Do you believe you really need the 'cities' and 'countries' tables? Wouldn't having a 'city' and 'country' column in the departments table suffice for your use cases? E.g. does your application need to list departments by city and cities by country?
The following comments are based on my role as a Business Intelligence/Reporting specialist and strategy/planning manager:
I agree with Larry's direction above. IMHO, it's not so much over-engineered; some things just look a little out of place. To keep it simple, I would tag the client directly to a Company ID, Department Description, Division Description, Department Type ID and Division Type ID. Use Department Type ID and Division Type ID as references to lookup tables and internal reporting/analysis fields for long-term consistency.
The Packs table contains a "Credit" column; shouldn't that actually be tied to the Client base table, so that if they have many packs you can see how much credit is left for future classes? The application can take care of the calculation and store it centrally in the Client table.
Company info could use many more fields, including the obvious address/phone/etc. information. I'd also be prepared to add D&B "DUNS" columns (Site/Branch/Ultimate) long term; Dun and Bradstreet (D&B) has a huge catalog of companies and you'll find later down the road that their information is very helpful for reporting/analysis. This will take care of the multiple-division issue you mention, and allow you to roll up the hierarchy for subsidiaries/divisions/branches/etc. of large corps.
You don't mention how many records you'll be working with, which could mean setting yourself up for a large development initiative that could have been done quicker and with far fewer headaches with prepackaged "reporting" software. If you're not dealing with a large database (< 65000 rows), make sure MS Access, OpenOffice (Base) or related report/app dev solutions couldn't do the trick. I use Oracle's free APEX software quite a bit myself; it comes with their free database Oracle XE, just download it from their site.
FYI - Reporting insight: for large databases, you typically have two database instances: a) a transaction database for recording each detailed record, and b) a reporting database (data mart/data warehouse) housed on a separate machine. For more information, search Google for both Star Schema and Snowflake Schema.
Regards.
I want to address only the concern that joining multiple tables will cause a performance hit. Do not be afraid to normalize because you will have to do joins. Joins are normal and expected in relational databases, and they are designed to handle them well. You will need to set up PK/FK relationships (for data integrity; this is important to consider in the design), but in many databases FKs are not automatically indexed. Since they will be used in the joins, you will definitely want to start by indexing the FKs. PKs generally get an index on creation, as they have to be unique.
It is true that data warehouse design reduces the number of joins, but usually one doesn't get to the point of data warehousing until one has millions of records that need to be accessed in one report. Even then, almost all data warehouses start with a transactional database to collect the data in real time, and the data is then moved to the warehouse on a schedule (nightly or monthly or whatever the business need is). So this is a good start, even if you need to design a data warehouse later to improve report performance.
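For instance, adding and indexing a foreign key in MySQL looks like this; sessions, teachers and the column names are placeholders matching the question's schema:
-- Index the FK column first, since it will drive the joins
-- (some databases do not index FK columns automatically).
CREATE INDEX idx_sessions_teacher ON sessions (teacher_id);

-- Then declare the relationship for data integrity.
ALTER TABLE sessions
  ADD CONSTRAINT fk_sessions_teacher
  FOREIGN KEY (teacher_id) REFERENCES teachers (teacher_id);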
I must say your design is impressive for a first year CS student.
It isn't over-engineered; this is how I would approach the problem. Joining is fine, there won't be much of a performance hit (it's completely necessary unless you de-normalise the database, which isn't recommended!). For statuses, see if you can use an enum datatype instead to optimise that table away.
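For instance, assuming the session table has a status column, the enum suggestion would look something like:
-- Replace the separate statuses lookup table with an inline enum.
ALTER TABLE sessions
  MODIFY status ENUM('Completed', 'Cancelled') NOT NULL DEFAULT 'Completed';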
I've worked in the training / school domain and I thought I'd point out that there's generally a M:1 relationship between what you call "sessions" (instances of a given course) and the course itself. In other words, your catalog offers the course ("Spanish 101" or whatever), but you might have two different instances of it during a single semester (Tu-Th taught by Smith, Wed-Fri taught by Jones).
Other than that, it looks like a good start. I bet you'll find that the client domain (graphs leading to "clients") is more complex than you've modeled, but don't go overboard with that until you've got some real data to guide you.
A few things came to mind:
The tables seemed geared to reporting, but not really running the business. I would think when a client signs up, there's essentially an order being placed for the client attending a list of sessions, and that order might be for multiple employees in one company. It would seem an "order" table would really be at the center of your system and driving your data capture and eventual reporting. (Compare the paper documents you've been using to run the business with your database design to see if there's a logical match.)
Companies often don't have divisions. Employees sometimes change divisions/departments, maybe even mid-session. Companies sometimes add/delete/rename divisions/departments. Make sure the possibility of your tables' contents changing in real time doesn't make subsequent reporting/grouping difficult. With so much contact data split over so many tables, you might have to enforce very strict data entry validation to keep your reports meaningful and inclusive. E.g., when a new client is added, make sure his company/division/department/city match the same values as his coworkers.
The "packs" concept isn't clear at all.
Since you indicate it's a small business, it would be surprising if performance would be an issue, considering the speed and capacity of current machines.