What are the options for making a data cleansing process (deduplication/matching)
when dealing with MS SQL Server 2008 R2?
Or better yet how can I weight scores on a matching process on columns of a row?
The situation is the following: I have a persons table on my database and their associated addresses and documents in other database tables?
How can I make the best decision of match based on Name, Serial no of the document and address? As I understood SSIS fuzzy groping won't support this feature: weighted scoring.
I do not have much experience with SSIS at the moment - so this answer is focused on the de-duping/matching/scoring aspect of your question.
There are many ways to approach a Data Quality strategy such as this, all of which have Pro's and Cons and I think a lot of it comes down to your existing data management strategies - how clean and standardised is the data you are trying to dedupe?
Even 'simple' items like telephone numbers can be difficult to dedupe if you have not got this correct - for example all of these are different representations of the same number:
+1 (888) 707-8822
1-888-707-8822
18887078822
001 888 7078822
888-7078822
The more complex structures such as addresses get even more interesting: are 'flat 2' and 'apartment 2' the same thing or different?
You have two choices - make it your self or trust a third party
Make it yourself
Advantages
Lots of fun logical problems to work through
Will be able to tweak and improve at will 'forever' as your solution grows
Disadvantages
It will take a lot of time.
Each country you use will need looking at separately - there are no high quality 'global' rules that you can apply (but there of of course snippets that can be reused)
Third Party
Advantages
If de-duplication is not your specialty - let the experts do it
Ready to go and deliver value immediately
Disadvantages
Cost
Whether you go your own route or third party I suggest you start by creating a clear goal.
What are your inputs:
How 'clean' is your data?
How standardised is your data?
How do the records link together.
Are the address records just from one country or are they from several.
What are your workflows:
How often do you need to run this process?
Do you want to stop duplicates entering your system in the first place or just run periodic bulk runs?
What do you want from the project?
To what level (document, person, household, organisation - see below) do you want to identify duplicates
What do you want to do with those duplicates
Delete duplicates and keep one record
Merge duplicates to create one master record
This stage is sometimes refereed to as creating the 'Golden' record. Deciding which information to keep, and which information to disregard.
To go into a bit more detail about some of those choices, consider the following dummy addresses:
Are you trying to dedupe to household level:
Ann Smith, 1 main st, DupeVille, MA, 12345
Bob Smith, 1 main street, DupeVille, MA, 12345
become
Ann and Bob Smith, 1 Main St, DupeVille, MA, 12345-6789
Person Level
Robert Smith, 1 main st, DupeVille, MA, 12345
Bob Smith, 1 main street, DupeVille, MA, 12345
become
Robert Smith, 1 Main St, DupeVille, MA, 12345-6789
or even by the ID's in your document database.
Once you have that plan, it may help you make up your mind about the best route to take. If you want to create it yourself, the links you have found certainly put you in the right mindset. If you want to go third party - there are a good range of suppliers out there. Just make sure you choose someone you can trust - they're going to be changing your data!
Google around for the various suppliers - Experian Data Quality are one of them (my company!) and depending upon where in the world you are, you can find your best contact details and more info here: http://www.qas.com/contact/office-locations.htm . We have tools that can integrate with SQL Server 2008 R2 which can score differing input types and then automatically dedupe these for you or return the clusters of potentially groups for your to look after yourself.
Take your plan, and clear idea of what you need from them and discuss it with them. Whoever you choose will be able to talk you through your plan, discuss your goals and tell you if they are the right people for the job.
Think I went on a bit there :-) but hopefully that points you in the right direction - Good luck!
If you do fuzzy grouping with multiple columns you will get _similarity information for every column you choose as input. With this similarity information you can calculate your own tresholds etc.
Related
Suppose I have some sample data like that shown below (with a lot more entries), and my main use case is to look up a specific aliment and provide a list of waiting times for different hospitals which offer that treatment.
Not being very experienced at all with DB design, I don't know whether in this example there is an advantage to using separate tables with links between then or if a simple import of the CSV to a single table will suffice.
If I used separate tables, I'm guessing they would be for hospital and ailment perhaps?
I would be very grateful if someone tell me the best approach for this.
ID,Main Department,Specific Complaint,Hospital ,Waiting time
1,Cardiology,general,Hospital 1,7
2,Cardiology,general,Hospital 2,7
3,Cardiology,general,Hospital 3,7
4,Cardiology,general,Hospital 4,21
5,Cardiology,traumatology,Hospital 1,8
6,Cardiology,traumatology,Hospital 2,7
7,Dermatology,general,Hospital 1,21
8,Dermatology,general,Hospital 2,14
9,Dermatology,general,Hospital 3,21
10,Dermatology,erysipelas,Hospital 1,7
11,Dermatology,erysipelas,Hospital 3,7
...
One detail you must understand, SO is not a teaching site, tutorials abound for that. It is more to address specific problems that arise when developing solutions. That being said, I like this type of question, so here goes.
The type of solution to implement (simple CSV or complete database) depends on the volume of data, and type type of reports you require.
CSV is quick to implement.
Database takes more time, but will allow you to produce more complex reports than CSV, through the use of queries.
CSV is often used as a medium to load or extract data, but as for queries it is not as powerful.
A database can be expanded. Ex. today you only consider the name of the hospital. You could expand your table to include the address, phone number, ... You could also expand your model to add insurance company links, doctors, ...
Basic modeling:
Identify your objects. Ex. here I would consider ailment, hospital, complaint.
Identify relations between objects, and their type. Ex. ailment and hospital are linked, the that link is n-n. Meaning 1 ailment can be treated in many hospitals, and 1 hospital can treat many ailments.
I am not certain what to do with complaint. In your question you do not specify if all hospitals treat all (ailment - complaint) duos or not. More on that later.
As you define your structure, make sure you apply the normal forms. In most cases, forms 1-3 are enough.
1NF: atomic values and no repeating groups. Ex. you would create table with columns hospital and ailments separated by commas. 1 line == 1 hospital <-> 1 ailment.
2NF: 1NF is achieved and all the non-key attributes are dependent on the primary key. Ex. you should not create a table linking ailment and wait time. The wait time is not dependent on the ailment, it is dependent on the combination of ailment and hospital.
3NF: 2NF is achieved and there are no transitive functional dependencies. So A is dependent on B, B is dependant on C, so A is transitively dependent on C.
Some critical questions must be answered before you can model your data:
A hospital can treat a certain ailment. In all cases?
Can you have: hospital 1 can tread ailment 1 when the complaint is A and B, but not C?
Ex. all hospitals can provide primary care for cardiac patients, but cardiac surgery can only be performed as some hospitals.
In that case, you cannot link ailment and hospital together directly. A combination of (ailment,complaint) can. And this will impact wait time.
Based on reality, I will link (ailment and complaint) and link this duo to hospital.
Here is my first model, "for fun", which might need to be modified for your needs:
Wait time is in table Hospital_Treads_Ailment_has_Complaint. In my model, an hospital can only estimate the wait time once they know which ailment and which complaint the patient has.
A final exercise I do to test my model is try the main queries I need. If one query cannot be done with the model, it needs to be changed.
Which hospital treats cardiac problems? Ok, select hospital where ailment == cardiology, complaint == *.
Which hospital can accept patients who have trauma. Ok, select hospital where ailment == *, complaint == trauma.
and so on...
We presently use a pen/paper based roster to manage table games staff at the casino. Each row is an employee, each column is a 20 minute block of time and each cell represents what table the employee is assigned to, or alternatively they've been assigned to a break. The start and end time of shifts for employees vary as do the games/skills they can deal. We need to keep a copy of the rosters for 7 years, with paper this is fairly easy, I'm wanting to develop a digital application and am having difficulty how to store the data in a database for archiving.
I'm fairly new to working with databases, I think I understand how to model the data for a graph database like neo4j, but I had difficulty when it came to working with time. I've tried to learn about RDBMS databases like MySQL, below is how I think the data should be modelled. Please point out if I'm going in the wrong direction or if a different database type would be more appropriate, it would be greatly appreciated!
Basic Data
Here is some basic data to work with before we factor in scheduling/time.
Employee
- ID Number
- Name
- Skills (Blackjack, Baccarat, Roulette, etc)
Table
- ID Number
- Skill/Type (Can only be one skill)
It may be better to store the roster data as a file like JSON instead? Time sensitive data wouldn't be so much of a problem then. The benefit of going digital with a database would be queries, these could help assist time consuming tasks where human error is common.
Possible Queries
Note: Staff that are on shift are either on a break or on the floor (assigned to a table), Skills have a major or minor type based on difficulty to learn.
What staff have been on the floor for 80 minutes or more? (They are due for a break)
What open tables can I assign this employee to based on their skillset?
I need an employee that has Baccarat skill but is not already been assigned to a Baccarat table.
What employee(s) was on this table during this period of time?
Where was this employee at this point in time?
Who is on shift right now?
How many staff on shift can deal Blackjack?
How many staff have 3 major skills?
What staff have had the Baccarat skill for at least 3 months?
These queries could also be sorted by alphabetical order or time, skill etc.
I'm pretty sure I know how to perform these queries with cypher for neo4j provided I model the data right. I'm not as knowledgeable with SQL queries, I've read it can get a bit complicated depending on the query and structure.
----------------------------------------------------------------------------------------
MYSQL Specific
An employee table could contain properties such as their ID number and Name, but am I right that for their skills and shifts these would be separate tables that reference the employee by a unique integer(I think this is called a foreign key?).
Another table could store the gaming Tables, these would have their own ID and reference a skill/gametype with a foreign key.
To record data like the pen/paper roster, each day could have a table with columns starting from 0000 increasing by 20 in value going all the way to 2340? Prior to the time columns I could have one for staff where each employee is represented with their foreign key, the time columns would then have foreign keys to the assigned gaming Tables, the row data is bound to have many cells that aren't populated since the employee shift won't be 24/7. If I'm using foreign keys to reference gaming Tables I now have a problem when the employee is on break? Unless I treat say the first gaming Table entry as a break?
I may need to further complicate things though, management will over time try different gaming Table layouts, some of the gaming Tables can be converted from say Blackjack to Baccarat. this is bound to happen quite a bit over 7 years, would I want to be creating new gaming Table entries or add a column to use a foreign key and refer to a new table that stores the history of game types during periods of time? Employees will also learn to deal new games during their career, very rarely they may also have the skill removed.
----------------------------------------------------------------------------------------
Neo4j Specific
With this data would I have an Employee and a Table node that have "isA" relationship edges mapping to actual employees or tables?
I imagine with the skills for the two types I would be best with a Skill node and establish relationships like so?: Blackjack->isA->Skill, Employee->hasSkill->Blackjack, Table->typeIs->Blackjack?
TIME
I find difficulty when I want this database to now work with a timeline. I've come across the following suggestions for connecting nodes with time:
Unix Epoch seems to be a common recommendation?
Connecting nodes to a year/month/day graph?
Lucene timeline? (I don't know much about this or how to work with it, have seen some mention it)
And some cases with how time and data relate:
Staff have varied days and start/end times from week to week, this could be shift node with properties {shiftStart,shiftEnd,actualStart,actualEnd}, staff may arrive late or get sick during shift. Would this be the right way to link each shift to an employee? Employee(node)->Shifts(groupNode)->Shift(node)
Tables and Staff may have skill data modified, with archived data this could be an issue, I think the solution is to have time property on the relationship to the skill?
We open and close tables throughout the day, each table has open/close times for each day, this could change in a month depending on what management wants, in addition the times are not strict, for various reasons a manager may open or close tables during the shift. The open/closed status of a table node may only be relevant for queries during the shift, which confuses me as I'd want this for queries but for archiving with time it might not make sense?
It's with queries that I have trouble deciding when to use a node or add a property to a node. For an Employee they have a name and ID number, if I wanted to find an employee by their ID number would it be better to have that as a node of it's own? It would be more direct right, instead of going through all employees for that unique ID number.
I've also come across labels just recently, I can understand that those would be useful for typing employee and table nodes rather than grouping them under a node. With the shifts for an employee I think should continue to be grouped with a shifts node, If I were to do cypher queries for employees working shifts through a time period a label might be appropriate, however should it be applied to individual shift nodes or the shifts group node that links back to the employee? I might need to add a property to individual shift nodes or the relationship to the shifts group node? I'm not sure if there should be a shifts group node, I'm assuming that reducing the edges connecting to the employee node would be optimal for queries.
----------------------------------------------------------------------------------------
If there are any great resources I can learn about database development that'd be great, there is so much information and options out there it's difficult to know what to begin with. Thanks for your time :)
Thanks for spending the time to put a quality question together. Your requirements are great and your specifications of your system are very detailed. I was able to translate your specs into a graph data model for Neo4j. See below.
Above you'll see a fairly explanatory graph data model. In case you are unfamiliar with this, I suggest reading Graph Databases: http://graphdatabases.com/ -- This website you can get a free digital PDF copy of the book but in case you want to buy a hard copy you can find it on Amazon.
Let's break down the graph model in the image. At the top you'll see a time indexing structure that is (Year)->(Month)->(Day)->(Hour), which I have abbreviated as Y M D H. The ellipses indicate that the graph is continuing, but for the sake of space on the screen I've only showed a sub-graph.
This time index gives you a way to generate time series or ask certain questions on your data model that are time specific. Very useful.
The bottom portion of the image contains your enterprise data model for your casino. The nodes represent your business objects:
Game
Table
Employee
Skill
What's great about graph databases is that you can look at this image and semantically understand the language of your question by jumping from one node to another by their relationships.
Here is a Cypher query you can use to ask your questions about the data model. You can just tweak it slightly to match your questions.
MATCH (employee:Employee)-[:HAS_SKILL]->(skill:Skill),
(employee)<-[:DEALS]-(game:Game)-[:LOCATION]->(table:Table),
(game)-[:BEGINS]->(hour:H)<-[*]-(day:D)<-[*]-(month:M)<-[*]-(year:Y)
WHERE skill.type = "Blackjack" AND
day.day = 17 AND
month.month = 1 AND
year.year = 2014
RETURN employee, skill, game, table
The above query finds the sub-graph for all employees who have the skill Blackjack and their table and location on a specific date (1/17/14).
To do this in SQL would be very difficult. The next thing you need to think about is importing your data into a Neo4j database. If you're curious on how to do that please look at other questions here on SO and if you need more help, feel free to post another question or reach out to me on Twitter #kennybastani.
Cheers,
Kenny
How could the following database schema be drawn using E/R diagrams? (A sketch or final image would be helpful). I would also appreciate if you could guide me to a easy-to-understand tutorial on entity-relationships so I could learn how to draw them on paper first.
A CD has a title, a year of production and a CD type. (CD type could be anything: mini-CD, CD-R, CD-RW, DVD-R, DVD-RW...)
A CD usually has multiple songs on different tracks. Each song has a name, an artist and a track number. Entity set Song is considered to be weak and needs support from entity set CD.
A CD is produced by a producer which has a name and an address.
A CD may be supplied by multiple suppliers, each has a name and an address.
A customer may rent multiple CDs. Customer information such as Social Security Number (SSN), name, telephone needs to be recorded. The date and period of renting (in days) should also be recorded.
A customer may be a regular member and a VIP member. A VIP member has additional information such as the starting date of VIP status and percentage of discount.
Is this Entity diagram correct? This is so fracking confusing. I've built this diagram on just intuition rather a systematic approach they teach in a textbook. I still can't wrap my head around the many-to-one relation, weak entities, foreign keys.
There's a fair article on ERDs on Wikipedia.
When you're starting a new ERD - whether it's hand-drawn or computer-drawn - you should focus first on the entities (entity sets). Add the relationships in and then worry about fleshing out your non-key predicates. When you get some experience with ERDs you'll get to the point where you won't need much more work to achieve normalization. It will start to come naturally to you.
There are probably quite a few changes that you'll want to make to your diagram. Since this may be homework, I'll give you an alternative diagram to consider:
This model takes a more sophisticated view of your rules, for example:
Songs can appear many times on the same CD and on different CDs.
A song can be performed by multiple artists within a given track.
Producers can cooperate on a CD.
None of these are necessarily right for your model. It depends on your business rules.
Compare your model with this one and ask yourself what is different and why you might want to take one approach or the other.
take all the major concepts, draw a box for each
in the box put the name of the major concept, like SONG then an underline
under the major concept, list all the attributes like NAME
draw lines from one box to another where those concepts are linked (usually through an attribute) like line from CD to SONG
I'm the one-man dev team on a fledgling military history website. One aspect of the site is a catalog of ~1,200 individual battles, including the nations & formations (regiments, divisions, etc) which took part.
The formation information (as well as the other battle info) was manually imported from a series of books by a 10-man volunteer team. The formations were listed in groups with varying formatting and abbreviation patterns. At the time I set up the data collection forms I couldn't think of a good way to process that data... and elected to store it all as strings in the MySQL database and sort it out later.
Well, "later" - as it tends to happen - has arrived. :-)
Each battle has 2+ records in the database - one for each nation that participated. Each record has a formations text string listing the formations present as the volunteer chose to add them.
Some real examples:
39th Grenadier Rgmt, 26th Volksgrenadier Division
2nd Luftwaffe Field Division, 246th Infantry Division
247th Rifle Division, 255th Tank Brigade
2nd Luftwaffe Field Division, SS Cavalry Division
28th Tank Brigade, 158th Rifle Division, 135th Rifle Division, 81st Tank Brigade, 242nd Tank Brigade
78th Infantry Division
3rd Kure Special Naval Landing Force, Tulagi Seaplane Base personnel
1st Battalion 505th Infantry Regiment
The ultimate goal is for each individual force to have an ID, so that its participation can be traced throughout the battle database. Formation hierarchy, such as the final item above 1st Battalion (of the) 505th Infantry Regiment also needs to be preserved. In that case, 1st Battalion and 505th Infantry Regiment would be split, but 1st Battalion would be flagged as belonging to the 505th.
In database terms, I think I want to pull the formation field out of the current battle info table and create three new tables:
FORMATION
[id] [name]
FORMATION_HIERARCHY
[id] [parent] [child]
FORMATION_BATTLE
[f_id] [battle_id]
It's simple to explain, but complicated to enact.
What I'm looking for from the SO community is just some tips on how best to tackle this problem. Ideally there's some sort of method to solving this that I'm not aware of. However, as a last resort, I could always code a classification framework and call my volunteers back to sort through 2,500+ records...
You've tagged your question as PHP related - but it's not.
You are proposing substituting the real identifiers with surrogate keys (ids) however the real identifiers are intrinsically unique - so you're just making your data structure more complicated than it needs to be. Having said that, the leaf part of the hierarchy may only be unique within the scope of the parent node.
The most important question you need to address is whether the formation tree is always going to be two levels. I suspect that sometimes it may be one and sometimes it may be more than 2. The structure you propose is not going to work very well with variable depth trees.
This may help:
http://articles.sitepoint.com/article/hierarchical-data-database
C.
I am working on a reviews website. Basically you can choose a location and business type and optionally filter your search results by various business attribures. There are five tables at play here:
Businesses
ID
Name
LocationID
Locations
LocationID
LocationName
State
Attributes
AttributeID
AttributeName
AttributeValues
AttributeValueID
ParentAttributeID
AttributeValue
BusinessAttributes
ID
AttributeID
AttributeValueID
So what I need is to work out the query to use (joins?) to get a business in a particular location based on attribute values.
For example, I want to find a barber in Santa Monica with these attributes:
Price: Cheap
Open Weekends: Yes
Cuts Womens Hair: Yes
These attributes are stored in the Attributes and AttributeValues tables and are linked to the business in the BusinessAttributes table.
So let's say I have these details from the search form:
LocationID=5&Price=Cheap&Open_Weekends=Yes&Customs_Womens_Hair=Yes
I need to build the query to return the businesses that match this location and attributes.
Thank you in advance for your help and I think StackOverflow is awesome.
Thinking about your data needs, you may be a perfect candidate for a schema-free document oriented database. On a recent episode of .Net Rocks (link to show), Michael Dirolf talked about his project MongoDB.
From what I understand, you could take each Business entity and store it in the database with all its associated attributes (LocationID, Price, Open_Weekends, Customs_Womens_Hair, Etc.). Each entity stored in the store can have different combinations of attributes because there is no schema. This natively accomplishes what you are trying to do with an Attribute and Attribute_Value table.
To search the database, just ask it for all entities that have the particular set of keys and values you need. No complex joins and no loss of performance. What you are doing is exactly what schema-free, document based databases are designed for.
Michael Dirolf: Yes, I think that a lot of the people who are switching are people who have sort of got themselves into corners where they are using relational database the way that we use MongoDB.
Richard Campbell: Right.
Michael Dirolf: So having columns that, a column key and a separate column value and inserting stuff that way so that they get done in schema and all sorts of crazy stuff like that…
Richard Campbell: Yeah, now in reflection I suddenly realized I just describe your perfect customer, a guy who has taken, you know, abusing SQL Server as they say. We’re going down this funny path and you just shouldn’t be here in the first place.
If you keep going down the path of building a relational attribute/value store, your performance will suffer with the combonatoric explosion that results.