I'm new to BO Web Intelligence, so I'll try to explain my problem as clearly as I can. I created a simple table below to represent my report. "Tom Smith" is a section break, which is followed by Ann Marie, etc. There is also a break on "Name", which represents clients. Client names are listed more than once since each has multiple "goals" and each goal was touched multiple times (time entered working on goals).
Basically, I want to find all the unique goals for each client in the "Goal" column of the report and see if there is a match in the "GoalTime" column. Goals in the "GoalTime" column are supposed to be the same goals as the "Goal" column, so if there is a goal that doesn't match, or an extra goal, this needs to be flagged and counted to produce totals, percentages, etc.
Is there a way to display only the unique values for each client in "Goal"? I know there are ways of playing with the query, universes, filters, etc., to build the goal column so that each goal type is displayed only once, but I'm looking for a simpler formula-based solution. I'm hoping there is some creative use of a "unique" or "distinct" function, or a creative way of using "match" to achieve this, which can then be combined with flags to create a count.
Any help would be greatly appreciated!
Tom Smith:
Name:  Goal:            GoalTime:
Tim    Buy a House      Buy a House
Tim    Find Employment  Find Employment
Tim    Buy a House      Buy a Car
Tim    Find Employment  Find Employment
Chris  Find Employment  Find Employment
Chris  Buy a House      Buy a Car
Chris  Buy a Car        Buy a House
Chris  Buy a Car        Buy a Car
Ann Marie:
Name:  Goal:            GoalTime:
Tom    Buy a House      Buy a Car
Tom    Find Employment  Find Employment
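To make the intended check concrete, here is a rough sketch of the logic in Python rather than as a Web Intelligence formula. It assumes that a "GoalTime" value should be flagged when it does not appear among that client's unique "Goal" values; the function name and the tuple layout are made up for illustration.

```python
from collections import defaultdict

# Rows copied from the sample report: (section, client name, Goal, GoalTime).
rows = [
    ("Tom Smith", "Tim", "Buy a House", "Buy a House"),
    ("Tom Smith", "Tim", "Find Employment", "Find Employment"),
    ("Tom Smith", "Tim", "Buy a House", "Buy a Car"),
    ("Tom Smith", "Tim", "Find Employment", "Find Employment"),
    ("Tom Smith", "Chris", "Find Employment", "Find Employment"),
    ("Tom Smith", "Chris", "Buy a House", "Buy a Car"),
    ("Tom Smith", "Chris", "Buy a Car", "Buy a House"),
    ("Tom Smith", "Chris", "Buy a Car", "Buy a Car"),
    ("Ann Marie", "Tom", "Buy a House", "Buy a Car"),
    ("Ann Marie", "Tom", "Find Employment", "Find Employment"),
]


def flag_unmatched_goal_times(rows):
    """Return {(section, client): [GoalTime values missing from the client's goals]}."""
    goals = defaultdict(set)
    goal_times = defaultdict(set)
    for section, name, goal, goal_time in rows:
        goals[(section, name)].add(goal)            # unique goals per client
        goal_times[(section, name)].add(goal_time)  # unique GoalTime values per client
    flags = {}
    for client in goals:
        extra = goal_times[client] - goals[client]
        if extra:
            flags[client] = sorted(extra)
    return flags


flags = flag_unmatched_goal_times(rows)
flagged_share = len(flags) / len({(s, n) for s, n, _, _ in rows})
```

The per-client sets are the "unique values" part of the question, and the set difference is the flag that can then be counted or turned into a percentage.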
Looking for any tricks using SQL or Excel to clean up a ~100k record table that has no clear pattern. The data generally looks like this all blocked together in separate rows but the same column:
JENNIFER SMITH
Accountant - Senior
Day Shift
jsmith#mail.com
AMBER Jones
Professional
Pro Status
amberj#mail.com
Abby Stone
Receptionist
Analytics
123 Main St
123-456-7890
abby#mail.com
REBECCA MILLER
Media
Building 2
millerr#mail.com
Sarah M Myers
Executive
BRADBURY SCHOOL
456 Main St
The big problem is that some records have three lines of additional data beneath them and some have five. Maybe they have an email and/or phone number, maybe they don't; some have extra lines after the record, some don't; etc.
I'm looking for ideas either using code or formulas to attempt to clean this up to look like below without going through every line manually:
Name            Job                  Info             Email             Phone         Address
JENNIFER SMITH  Accountant - Senior  Day Shift        jsmith#mail.com
AMBER Jones     Professional         Pro Status       amberj#mail.com
Abby Stone      Receptionist         Analytics        abby#mail.com     123-456-7890  123 Main St
REBECCA MILLER  Media                Building 2       millerr#mail.com
Sarah M Myers   Executive            BRADBURY SCHOOL                                  456 Main St
Hoping people might have ideas using scenarios they've had to use in the past on really messy datasets that come in like this. If it's in Excel, it could be some combination of using SEARCH() or LEN() to try and identify when each record's data is over.
I know it's not the most pointed question -- but if anybody has any tips it'd really help me out. It also doesn't have to end up being perfect -- if it ends up looking mildly like above, I'll be able to clean it manually from that point on, just not from the start.
Any help using any method would be greatly appreciated!
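One possible starting point, sketched in Python under some loud assumptions: the first line of each block is the name, an email-looking line (containing "#" or "@") is usually the last line of a record, phone numbers look like 123-456-7890, and addresses start with a house number. The regular expressions and the `parse_records` helper are illustrative guesses based only on the sample above, so expect to tune them against the real data.

```python
import re

# Heuristic patterns; all of these are assumptions drawn from the sample rows.
EMAIL = re.compile(r"^\S+[@#]\S+\.\S+$")
PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
ADDRESS = re.compile(r"^\d+\s+\S+")

FIELDS = ["Name", "Job", "Info", "Email", "Phone", "Address"]


def parse_records(lines):
    """Group a flat list of lines into one dict per person."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if current is None:
            # First line of a new block is assumed to be the name.
            current = dict.fromkeys(FIELDS, "")
            current["Name"] = line
        elif EMAIL.match(line):
            # An email is assumed to close the record.
            current["Email"] = line
            records.append(current)
            current = None
        elif PHONE.match(line):
            current["Phone"] = line
        elif ADDRESS.match(line):
            current["Address"] = line
        elif not current["Job"]:
            current["Job"] = line
        elif not current["Info"]:
            current["Info"] = line
        else:
            # Unexpected extra text is kept rather than dropped.
            current["Info"] += "; " + line
    if current is not None:
        records.append(current)
    return records


sample = """JENNIFER SMITH
Accountant - Senior
Day Shift
jsmith#mail.com
AMBER Jones
Professional
Pro Status
amberj#mail.com
Abby Stone
Receptionist
Analytics
123 Main St
123-456-7890
abby#mail.com
REBECCA MILLER
Media
Building 2
millerr#mail.com
Sarah M Myers
Executive
BRADBURY SCHOOL
456 Main St""".splitlines()

records = parse_records(sample)
```

A record with no email that is followed by another record would swallow the next person's lines under this rule, so those cases would still need a manual pass or an extra boundary heuristic.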
You would spend more time jerking around with this with code than it would be worth. Such badly formatted data can't possibly be accurate. How on earth would you know Jennifer Smith still uses the email address indicated or has the job listed? If you are somehow forced to work on this data then you would be better off paying a human to key it in. That shouldn't take more than a week and you can probably get someone to do it for a few hundred bucks. Even so, that data is certainly crap so I can't see the point in bothering.
What are the options for building a data cleansing process (deduplication/matching) when dealing with MS SQL Server 2008 R2?
Or better yet, how can I weight the scores of a matching process across the columns of a row?
The situation is the following: I have a persons table in my database, and their associated addresses and documents are in other database tables.
How can I make the best match decision based on name, document serial number and address? As I understand it, SSIS fuzzy grouping won't support this feature: weighted scoring.
I do not have much experience with SSIS at the moment - so this answer is focused on the de-duping/matching/scoring aspect of your question.
There are many ways to approach a Data Quality strategy such as this, all of which have pros and cons, and I think a lot of it comes down to your existing data management strategies: how clean and standardised is the data you are trying to dedupe?
Even 'simple' items like telephone numbers can be difficult to dedupe if you have not got this correct - for example all of these are different representations of the same number:
+1 (888) 707-8822
1-888-707-8822
18887078822
001 888 7078822
888-7078822
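As a small illustration of why standardisation matters, here is a Python sketch that reduces all five representations above to one comparable key. It assumes North American numbers only (country code 1, and "00" or "+" as an international prefix); `normalize_nanp_phone` is a made-up helper name, not part of any library.

```python
import re


def normalize_nanp_phone(raw):
    """Reduce a North American phone number to its 10 national digits."""
    digits = re.sub(r"\D", "", raw)      # drop spaces, dashes, brackets and '+'
    if digits.startswith("00"):          # assumed international dialling prefix
        digits = digits[2:]
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]              # strip the country code
    return digits


variants = [
    "+1 (888) 707-8822",
    "1-888-707-8822",
    "18887078822",
    "001 888 7078822",
    "888-7078822",
]
keys = {normalize_nanp_phone(v) for v in variants}
```

Numbers from other countries would need their own rules, which is part of why this gets expensive to build yourself.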
The more complex structures such as addresses get even more interesting: are 'flat 2' and 'apartment 2' the same thing or different?
You have two choices: make it yourself or trust a third party.
Make it yourself
Advantages
Lots of fun logical problems to work through
Will be able to tweak and improve at will 'forever' as your solution grows
Disadvantages
It will take a lot of time.
Each country you use will need looking at separately - there are no high quality 'global' rules that you can apply (but there are of course snippets that can be reused)
Third Party
Advantages
If de-duplication is not your specialty - let the experts do it
Ready to go and deliver value immediately
Disadvantages
Cost
Whether you go your own route or third party I suggest you start by creating a clear goal.
What are your inputs:
How 'clean' is your data?
How standardised is your data?
How do the records link together?
Are the address records just from one country or are they from several?
What are your workflows:
How often do you need to run this process?
Do you want to stop duplicates entering your system in the first place or just run periodic bulk runs?
What do you want from the project?
To what level (document, person, household, organisation - see below) do you want to identify duplicates?
What do you want to do with those duplicates?
Delete duplicates and keep one record
Merge duplicates to create one master record
This stage is sometimes referred to as creating the 'Golden' record: deciding which information to keep, and which information to disregard.
To go into a bit more detail about some of those choices, consider the following dummy addresses:
Are you trying to dedupe to household level:
Ann Smith, 1 main st, DupeVille, MA, 12345
Bob Smith, 1 main street, DupeVille, MA, 12345
become
Ann and Bob Smith, 1 Main St, DupeVille, MA, 12345-6789
Person Level
Robert Smith, 1 main st, DupeVille, MA, 12345
Bob Smith, 1 main street, DupeVille, MA, 12345
become
Robert Smith, 1 Main St, DupeVille, MA, 12345-6789
or even by the IDs in your document database.
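To show what household-level grouping could look like if you build it yourself, here is a minimal Python sketch. It assumes a tiny hand-written abbreviation table and that the address alone defines a household; real address matching needs far more rules than this, and the helper names are invented for the example.

```python
from collections import defaultdict

# Tiny illustrative abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}


def address_key(address):
    """Lower-case the address, drop commas and standardise street types."""
    tokens = address.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def group_by_household(records):
    """Group (name, address) pairs that share a normalised address."""
    households = defaultdict(list)
    for name, address in records:
        households[address_key(address)].append(name)
    return dict(households)


people = [
    ("Ann Smith", "1 main st, DupeVille, MA, 12345"),
    ("Bob Smith", "1 main street, DupeVille, MA, 12345"),
]
households = group_by_household(people)
```

Person-level matching would additionally need something like a nickname table (Bob/Robert), which is not attempted here.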
Once you have that plan, it may help you make up your mind about the best route to take. If you want to create it yourself, the links you have found certainly put you in the right mindset. If you want to go third party - there are a good range of suppliers out there. Just make sure you choose someone you can trust - they're going to be changing your data!
Google around for the various suppliers - Experian Data Quality are one of them (my company!) and depending upon where in the world you are, you can find your best contact details and more info here: http://www.qas.com/contact/office-locations.htm . We have tools that can integrate with SQL Server 2008 R2, which can score differing input types and then automatically dedupe these for you or return the clusters of potential duplicates for you to look after yourself.
Take your plan, and clear idea of what you need from them and discuss it with them. Whoever you choose will be able to talk you through your plan, discuss your goals and tell you if they are the right people for the job.
Think I went on a bit there :-) but hopefully that points you in the right direction - Good luck!
If you do fuzzy grouping with multiple columns you will get _similarity information for every column you choose as input. With this similarity information you can calculate your own weighted scores, thresholds, etc.
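A rough sketch of that idea in Python: combine per-column similarities into one weighted score and compare it against a threshold of your choosing. The per-column similarity here comes from Python's `difflib.SequenceMatcher` purely as a stand-in for whatever the fuzzy grouping step returns, and the column names, weights and sample records are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical weights: the document serial number counts most, the address least.
WEIGHTS = {"name": 0.3, "serial_no": 0.5, "address": 0.2}


def similarity(a, b):
    """Case-insensitive similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of the per-column similarities."""
    total = sum(weights.values())
    return sum(
        weight * similarity(rec_a[col], rec_b[col])
        for col, weight in weights.items()
    ) / total


a = {"name": "Robert Smith", "serial_no": "AB123456", "address": "1 main st"}
b = {"name": "Bob Smith", "serial_no": "AB123456", "address": "1 main street"}
c = {"name": "Jane Doe", "serial_no": "ZZ999999", "address": "42 elm road"}

likely_duplicate = weighted_score(a, b) > weighted_score(a, c)
```

The threshold above which two rows count as the same person is a business decision, so it is left out of the sketch.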
Suppose we want to store university courses and their entry requirements in a database. So for example BSc Mathematics, BSc Fine Art, MSc Computer Science etc.
Each course has its own set of requirements, and might have a different number of non-shared requirements. For example, to be eligible for the BSc Maths you might need an A in Maths and a B in Physics, whereas the BSc Fine Art might require an A in Art and that the user has a portfolio. The MSc might have a minimum age of 25, etc.
Suppose we then have a student who has their own set of attributes. So they might have an A in Maths, a B in Physics and a C in Chemistry, and be 19 years old.
How can we structure our database such that it is geared towards efficient lookups? And given a student's attributes, how can we retrieve all the courses which they are eligible for?
On an abstract level we are looking for all items whose requirements are a subset of the attributes given.
I'd like to implement this in MySQL. The schema could be:
courses
    id
    name
requirements
    course_id
    subject
    grade
But then how do I query the tables to get all eligible courses when the student has an A in Subject 1, a B in Subject 2, etc.?
For a course to be a match, each one of its requirements must be satisfied.
Thanks in advance, I hope my explanation isn't too confusing.
Okay. I think you need a subject list, with a bunch of subject IDs.
Now, for the query, you will start with (or generate) a bunch of subject IDs and grades. The easiest thing to do is to make this into a temporary table with those columns.
Now you can do your query, joining on the subject IDs and adding a "where temp.grade <= course.grade" condition.
The trick to this is to count the rows. If the join returns the same number of rows as the course has requirements, then you have a successful match.
Is that enough to get you going?
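Here is one way the counting trick could look, sketched with Python's built-in `sqlite3` instead of MySQL so that it runs anywhere. Several things are assumptions: subjects are stored by name rather than by ID to keep it short, letter grades compare alphabetically ('A' <= 'B' means A is at least as good), the "BSc Chemistry" course is invented to show a failing case, and non-grade requirements such as a portfolio or a minimum age are not modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE courses (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE requirements (course_id INTEGER, subject TEXT, grade TEXT);
-- The student's grades go into a temporary table, as suggested above.
CREATE TEMP TABLE student_grades (subject TEXT, grade TEXT);

INSERT INTO courses VALUES
    (1, 'BSc Mathematics'), (2, 'BSc Fine Art'), (3, 'BSc Chemistry');
INSERT INTO requirements VALUES
    (1, 'Maths', 'A'), (1, 'Physics', 'B'),
    (2, 'Art', 'A'),
    (3, 'Chemistry', 'B');
INSERT INTO student_grades VALUES
    ('Maths', 'A'), ('Physics', 'B'), ('Chemistry', 'C');
""")

# A course matches when every requirement row found a good-enough grade,
# i.e. the number of satisfied rows equals the number of requirement rows.
ELIGIBLE_SQL = """
SELECT c.name
FROM courses AS c
JOIN requirements AS r ON r.course_id = c.id
LEFT JOIN student_grades AS s
       ON s.subject = r.subject AND s.grade <= r.grade
GROUP BY c.id, c.name
HAVING COUNT(s.subject) = COUNT(r.subject)
ORDER BY c.name
"""

eligible = [row[0] for row in conn.execute(ELIGIBLE_SQL)]
```

The same `GROUP BY ... HAVING` shape should carry over to MySQL, though that has not been tested here.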
I'm creating a DB for my office. We have about 200 employees. Each employee was required to complete at least 1 of 12 courses within 2 years of being hired to become qualified (so there are different completion/qualification dates for every course; some people have been here 20 years, some just 1 year). Some have completed multiple courses. Each course has to be refreshed periodically (each refresh period is different and based on the last refresher date). I'm having trouble with the layout of the table. Here's what I have as an idea, but I'm trying to see if there is a less busy way to lay out the data. I want to be able to run a query that tells me which person has completed which class (so it would have to look at all 3 class columns). I also want to be able to tell when their qualification has lapsed or is about to lapse. So far I've created an employee data table that looks like the table below.
ID  Name    Class1   Class2   Class3   QualDt-Cl1  QualDt-Cl2  QualDt-Cl3  LstRequal1 ...
1   Bob     Art      Spanish           3/17/1989   9/12/2010               3/8/2012
2   Sally   Math                       8/31/2012
3   George  Physics  History           2/6/2005    7/6/1996
4   Casey   History                    6/8/2000
5   Joe     English  Sports   Physics  12/10/1993  10/15/2001  4/22/2006
The classes are listed in their own table and each class column pulls from that. The qual date refresher will be a calculated column in the query based on the last refresher date.
Is there a way to put all the classes one person is qualified for in one column and have the associated requalification date for each particular course in another column?
I think it would be less confusing if you had a table per subject and register the people's names under each one with the date passed.
Also, it would probably help to declutter the table by removing unnecessary info like the exact date the exam was passed; you could store month and year, or maybe just the year. If the leeway is 2 years that would probably make more sense, and it would also make the qualification calculation easier.
The query would work if you searched per subject, maybe, or for who would qualify to do what subject this current year and then the next.
This is not really the kind of question you would ask on here, by the way, but I hope the answer helps.
When designing a database, any time you find yourself adding columns with names like Class1, Class2, Class3 you should immediately stop and think about whether it makes more sense to put those columns in a separate child table called Classes with a link (relation) to the parent. There are several reasons for this, including:
What happens when somebody takes a fourth course? Saying "that will never happen" ignores the fact that "never is a very long time" and none of us can predict the future.
When checking whether or not someone has taken a course you really need to check (Class1 IS NOT NULL) OR (Class2 IS NOT NULL) OR (Class3 IS NOT NULL) and that can get really tedious. It also means that if you do have to add Class4 then all of that SQL code has to be corrected.
Similarly, if you want to find someone who took "CPR" you'd have to look for people with (Class1 = 'CPR') OR (Class2 = 'CPR') OR (Class3 = 'CPR'). Yuck.
So, save yourself some trouble (a lot of trouble, really) and create a Classes table:
ID
ClassName
QualDate
(etc.)
...where ID is the ID number from the main table (what is called a "foreign key"). From your sample data, your Classes table would look something like this:
ID  ClassName  QualDate
1   Art        3/17/1989
1   Spanish    9/12/2010
2   Math       8/31/2012
3   Physics    2/6/2005
3   History    7/6/1996
...
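To show how much simpler the lookup becomes with a child table, here is a small sketch using Python's built-in `sqlite3`. The table and column names are adapted from the layout above (lower-cased, with dates rewritten in ISO format), the rows for Casey and Joe are taken from the question's sample data, and `who_took` is just an illustrative helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
-- One row per employee per class; id is the foreign key to employees.
CREATE TABLE classes (
    id INTEGER REFERENCES employees(id),
    class_name TEXT,
    qual_date TEXT
);

INSERT INTO employees VALUES
    (1, 'Bob'), (2, 'Sally'), (3, 'George'), (4, 'Casey'), (5, 'Joe');
INSERT INTO classes VALUES
    (1, 'Art', '1989-03-17'), (1, 'Spanish', '2010-09-12'),
    (2, 'Math', '2012-08-31'),
    (3, 'Physics', '2005-02-06'), (3, 'History', '1996-07-06'),
    (4, 'History', '2000-06-08'),
    (5, 'English', '1993-12-10'), (5, 'Sports', '2001-10-15'),
    (5, 'Physics', '2006-04-22');
""")


def who_took(class_name):
    """Names of everyone with a row for the given class, in name order."""
    rows = conn.execute(
        """
        SELECT e.name
        FROM employees AS e
        JOIN classes AS c ON c.id = e.id
        WHERE c.class_name = ?
        ORDER BY e.name
        """,
        (class_name,),
    )
    return [row[0] for row in rows]


history_people = who_took("History")
```

There is a single condition on `class_name` no matter how many classes a person has taken, so a fourth or fifth course needs no query changes.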
I am about to develop an online hotel reservation system using PHP and MySQL. I have some doubts about my current database schema and the business logic to get the hotels in which rooms are free between two particular dates.
Does anyone know of a tutorial where I can get some idea about a hotel reservation schema and the business logic that should be used in such a system?
Thanks for your suggestions....
Edit :
I've figured out most of the logic. The points I am not clear about are the following:
If a user selects more than one room in a particular hotel between two particular dates, how can I represent that in the following reservation table?
Table : Reservation
Field 1 : reservation_id
Field 2 : room_id
Field 3 : no. of Rooms
Field 4 : check-in date
Field 5 : check-out date
Field 6 : Customer id
How can I check which rooms are available between two dates based on the reservation table and the following rooms table?
Table : Room
Field 1 : hotel_id
Field 2 : room_id
Field 3 : total_num_rooms
Note: The DB contains more than one hotel, so a user can select a city and look for rooms available in hotels in that area between two particular dates.
Also, if there are 10 rooms of a particular type in a hotel, I need to show only the number of rooms that are free in that particular time period.
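For what it's worth, the availability check described above can be sketched as a date-overlap query. The example below uses Python's built-in `sqlite3` rather than PHP and MySQL so it is self-contained; the column names are adapted from the two tables listed in the question, the sample rows are invented, and the city filter is left out because no city table was described.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE room (
    hotel_id INTEGER,
    room_id INTEGER PRIMARY KEY,
    total_num_rooms INTEGER
);
CREATE TABLE reservation (
    reservation_id INTEGER PRIMARY KEY,
    room_id INTEGER,
    num_rooms INTEGER,
    check_in TEXT,
    check_out TEXT,
    customer_id INTEGER
);

-- Invented sample data: room type 101 has 10 rooms, type 102 has 4.
INSERT INTO room VALUES (1, 101, 10), (1, 102, 4);
INSERT INTO reservation VALUES
    (1, 101, 3, '2024-05-01', '2024-05-05', 7),
    (2, 101, 2, '2024-05-04', '2024-05-08', 8),
    (3, 101, 4, '2024-05-10', '2024-05-12', 9);
""")

# Two stays overlap when each one starts before the other one ends.
FREE_SQL = """
SELECT r.hotel_id, r.room_id,
       r.total_num_rooms - COALESCE(SUM(v.num_rooms), 0) AS free_rooms
FROM room AS r
LEFT JOIN reservation AS v
       ON v.room_id = r.room_id
      AND v.check_in < :check_out
      AND v.check_out > :check_in
GROUP BY r.hotel_id, r.room_id, r.total_num_rooms
ORDER BY r.room_id
"""


def free_rooms(check_in, check_out):
    """Rows of (hotel_id, room_id, free_rooms) for the requested stay."""
    params = {"check_in": check_in, "check_out": check_out}
    return conn.execute(FREE_SQL, params).fetchall()


availability = free_rooms("2024-05-03", "2024-05-06")
```

Summing every reservation that touches the requested period is deliberately conservative: two bookings that overlap the period but not each other are both subtracted, so the result may understate availability but should never overbook.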
As a general thought, apply divide and conquer. Always.
For example, why do you think a specific customer should be able to have 'number of rooms' for a certain time span associated? What if, for example, I'm on a business trip and have my family follow me a few days later. Now, for the given time span the number of rooms is no longer a constant.
That doesn't really matter? True, you could just add another entry for the same customer. But then again, you could have done that in the first place and simplified your logic by saying that a customer can only have one room at a time in a single row, but there can be rows that create overlaps in time spans for a given customer.
Also, make sure you separate Reservation and a ReservationRequest. The latter is comprised of a set of Reservations I think - because I want that room for me and my family and both criteria must be matched.
Just a few ideas. Note that this is the ivory tower approach and it can lead to massively overblown solutions. In the RealWorld (TM), stick to Marc's suggestions: analyze the actual customer's needs. If handling 1% of the requests increases development time by 200%, he's not gonna like (or need) it, and vice versa.
There isn't a perfect way of representing something like a hotel reservation system.
Try talking to your client or people working in hotels to understand what they are doing now and base your system on this.
I'd guess:
A Room has a RoomType
A Customer can Book 1..n Room(s)
A RoomType has a name and a price
... and so on.
If you just use a tutorial, you might end up creating a system that doesn't fit the requirements of possible users. So talk to these future end users, figure out the business logic and start coding :)