string categorization strategies

I'm the one-man dev team on a fledgling military history website. One aspect of the site is a catalog of ~1,200 individual battles, including the nations & formations (regiments, divisions, etc) which took part.
The formation information (as well as the other battle info) was manually imported from a series of books by a 10-man volunteer team. The formations were listed in groups with varying formatting and abbreviation patterns. At the time I set up the data collection forms I couldn't think of a good way to process that data... and elected to store it all as strings in the MySQL database and sort it out later.
Well, "later" - as it tends to happen - has arrived. :-)
Each battle has 2+ records in the database - one for each nation that participated. Each record has a formations text string listing the formations present as the volunteer chose to add them.
Some real examples:
39th Grenadier Rgmt, 26th Volksgrenadier Division
2nd Luftwaffe Field Division, 246th Infantry Division
247th Rifle Division, 255th Tank Brigade
2nd Luftwaffe Field Division, SS Cavalry Division
28th Tank Brigade, 158th Rifle Division, 135th Rifle Division, 81st Tank Brigade, 242nd Tank Brigade
78th Infantry Division
3rd Kure Special Naval Landing Force, Tulagi Seaplane Base personnel
1st Battalion 505th Infantry Regiment
The ultimate goal is for each individual force to have an ID, so that its participation can be traced throughout the battle database. Formation hierarchy, such as the final item above 1st Battalion (of the) 505th Infantry Regiment also needs to be preserved. In that case, 1st Battalion and 505th Infantry Regiment would be split, but 1st Battalion would be flagged as belonging to the 505th.
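A first pass at the split described above could be sketched in Python along these lines. This is only a heuristic under stated assumptions: the UNIT_RANK table, the regular expression and the function names are hypothetical, and the real abbreviation list would have to be built from the volunteers' data.

```python
import re

# Hypothetical size ranking (smallest first); extend it with the
# abbreviations that actually occur in the data.
UNIT_RANK = {"battalion": 1, "regiment": 2, "rgmt": 2, "brigade": 3, "division": 4}

# Matches the whitespace before an ordinal ("505th") that follows a word,
# e.g. between "1st Battalion" and "505th Infantry Regiment".
_NESTED = re.compile(r"(?<=[a-z])\s+(?=\d+(?:st|nd|rd|th)\b)", re.IGNORECASE)


def unit_rank(name):
    """Rank a formation by its last word, or None if the word is unknown."""
    return UNIT_RANK.get(name.split()[-1].lower())


def parse_formations(text):
    """Return (names, links); links holds (parent, child) name pairs."""
    names, links = [], []
    for item in (part.strip() for part in text.split(",")):
        if not item:
            continue
        pieces = [p.strip() for p in _NESTED.split(item)]
        ranks = [unit_rank(p) for p in pieces]
        # Only split when every piece is a known unit type and the sizes
        # strictly increase from left to right (child first, parent last).
        nested = (
            len(pieces) > 1
            and None not in ranks
            and all(a < b for a, b in zip(ranks, ranks[1:]))
        )
        if not nested:
            names.append(item)
            continue
        names.extend(pieces)
        links.extend((parent, child) for child, parent in zip(pieces, pieces[1:]))
    return names, links


names, links = parse_formations("1st Battalion 505th Infantry Regiment")
# names -> ['1st Battalion', '505th Infantry Regiment']
# links -> [('505th Infantry Regiment', '1st Battalion')]
```

Anything the heuristic cannot classify is kept as one unsplit name, so those records can be queued for manual review instead of being guessed at.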
In database terms, I think I want to pull the formation field out of the current battle info table and create three new tables:
FORMATION
[id] [name]
FORMATION_HIERARCHY
[id] [parent] [child]
FORMATION_BATTLE
[f_id] [battle_id]
It's simple to explain, but complicated to enact.
What I'm looking for from the SO community is just some tips on how best to tackle this problem. Ideally there's some sort of method to solving this that I'm not aware of. However, as a last resort, I could always code a classification framework and call my volunteers back to sort through 2,500+ records...

You've tagged your question as PHP related - but it's not.
You are proposing substituting the real identifiers with surrogate keys (ids) however the real identifiers are intrinsically unique - so you're just making your data structure more complicated than it needs to be. Having said that, the leaf part of the hierarchy may only be unique within the scope of the parent node.
The most important question you need to address is whether the formation tree is always going to be two levels. I suspect that sometimes it may be one and sometimes it may be more than 2. The structure you propose is not going to work very well with variable depth trees.
This may help:
http://articles.sitepoint.com/article/hierarchical-data-database
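One common way to handle a variable-depth tree is an adjacency list, where each row points at its parent. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the parent_id column and the ancestors helper are assumptions, and the "A Company" row is a hypothetical third level added to show depth. The recursive query form needs MySQL 8.0 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE formation (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES formation(id)  -- NULL for a top-level formation
);
INSERT INTO formation (id, name, parent_id) VALUES
    (1, '505th Infantry Regiment', NULL),
    (2, '1st Battalion', 1),
    (3, 'A Company', 2);   -- hypothetical third level, not from the source data
""")


def ancestors(conn, formation_id):
    """Return the formation's name followed by each parent up to the root."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, parent_id, depth) AS (
            SELECT id, name, parent_id, 0 FROM formation WHERE id = ?
            UNION ALL
            SELECT f.id, f.name, f.parent_id, chain.depth + 1
            FROM formation AS f
            JOIN chain ON f.id = chain.parent_id
        )
        SELECT name FROM chain ORDER BY depth
        """,
        (formation_id,),
    ).fetchall()
    return [name for (name,) in rows]


path = ancestors(conn, 3)
# path -> ['A Company', '1st Battalion', '505th Infantry Regiment']
```

Because "1st Battalion" is only unique within its parent, a surrogate id per row is what keeps two different 1st Battalions apart in this layout.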
C.

Related

Database structure for simple waiting times project with CSV data and MySql

Suppose I have some sample data like that shown below (with a lot more entries), and my main use case is to look up a specific ailment and provide a list of waiting times for the different hospitals which offer that treatment.
Not being very experienced at all with DB design, I don't know whether in this example there is an advantage to using separate tables with links between them or if a simple import of the CSV to a single table will suffice.
If I used separate tables, I'm guessing they would be for hospital and ailment perhaps?
I would be very grateful if someone could tell me the best approach for this.
ID,Main Department,Specific Complaint,Hospital ,Waiting time
1,Cardiology,general,Hospital 1,7
2,Cardiology,general,Hospital 2,7
3,Cardiology,general,Hospital 3,7
4,Cardiology,general,Hospital 4,21
5,Cardiology,traumatology,Hospital 1,8
6,Cardiology,traumatology,Hospital 2,7
7,Dermatology,general,Hospital 1,21
8,Dermatology,general,Hospital 2,14
9,Dermatology,general,Hospital 3,21
10,Dermatology,erysipelas,Hospital 1,7
11,Dermatology,erysipelas,Hospital 3,7
...

One detail you must understand, SO is not a teaching site, tutorials abound for that. It is more to address specific problems that arise when developing solutions. That being said, I like this type of question, so here goes.
The type of solution to implement (simple CSV or complete database) depends on the volume of data and the type of reports you require.
CSV is quick to implement.
Database takes more time, but will allow you to produce more complex reports than CSV, through the use of queries.
CSV is often used as a medium to load or extract data, but as for queries it is not as powerful.
A database can be expanded. Ex. today you only consider the name of the hospital. You could expand your table to include the address, phone number, ... You could also expand your model to add insurance company links, doctors, ...
Basic modeling:
Identify your objects. Ex. here I would consider ailment, hospital, complaint.
Identify relations between objects, and their type. Ex. ailment and hospital are linked, and that link is n-n. Meaning 1 ailment can be treated in many hospitals, and 1 hospital can treat many ailments.
I am not certain what to do with complaint. In your question you do not specify if all hospitals treat all (ailment - complaint) duos or not. More on that later.
As you define your structure, make sure you apply the normal forms. In most cases, forms 1-3 are enough.
1NF: atomic values and no repeating groups. Ex. you would not create a table with a hospital column and an ailments column holding a comma-separated list. 1 line == 1 hospital <-> 1 ailment.
2NF: 1NF is achieved and all the non-key attributes are dependent on the whole primary key. Ex. you should not create a table linking ailment and wait time. The wait time is not dependent on the ailment alone, it is dependent on the combination of ailment and hospital.
3NF: 2NF is achieved and there are no transitive functional dependencies, i.e. no case where A is dependent on B and B is dependent on C, so that A is transitively dependent on C.
Some critical questions must be answered before you can model your data:
A hospital can treat a certain ailment. In all cases?
Can you have: hospital 1 can treat ailment 1 when the complaint is A and B, but not C?
Ex. all hospitals can provide primary care for cardiac patients, but cardiac surgery can only be performed at some hospitals.
In that case, you cannot link ailment and hospital together directly. A combination of (ailment,complaint) can. And this will impact wait time.
Based on reality, I will link (ailment and complaint) and link this duo to hospital.
Here is my first model, "for fun", which might need to be modified for your needs:
Wait time is in table Hospital_Treads_Ailment_has_Complaint. In my model, a hospital can only estimate the wait time once they know which ailment and which complaint the patient has.
A final exercise I do to test my model is try the main queries I need. If one query cannot be done with the model, it needs to be changed.
Which hospital treats cardiac problems? Ok, select hospital where ailment == cardiology, complaint == *.
Which hospital can accept patients who have trauma. Ok, select hospital where ailment == *, complaint == trauma.
and so on...
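To try those test queries against a concrete model, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table and column names (hospital, treatment, hospital_treatment) and the waiting_times helper are assumptions chosen for illustration, not the names from the model above; the rows are taken from the sample CSV in the question.

```python
import csv
import io
import sqlite3

# A few rows from the question's sample data, with simplified headers.
SAMPLE_CSV = """ailment,complaint,hospital,waiting_time
Cardiology,general,Hospital 1,7
Cardiology,general,Hospital 4,21
Cardiology,traumatology,Hospital 1,8
Cardiology,traumatology,Hospital 2,7
Dermatology,erysipelas,Hospital 1,7
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hospital (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE treatment (
    id        INTEGER PRIMARY KEY,
    ailment   TEXT NOT NULL,
    complaint TEXT NOT NULL,
    UNIQUE (ailment, complaint)
);
-- The wait time depends on the (hospital, treatment) pair, so it lives here.
CREATE TABLE hospital_treatment (
    hospital_id  INTEGER NOT NULL REFERENCES hospital(id),
    treatment_id INTEGER NOT NULL REFERENCES treatment(id),
    waiting_time INTEGER NOT NULL,
    PRIMARY KEY (hospital_id, treatment_id)
);
""")

for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    conn.execute("INSERT OR IGNORE INTO hospital (name) VALUES (?)", (row["hospital"],))
    conn.execute(
        "INSERT OR IGNORE INTO treatment (ailment, complaint) VALUES (?, ?)",
        (row["ailment"], row["complaint"]),
    )
    conn.execute(
        """
        INSERT INTO hospital_treatment (hospital_id, treatment_id, waiting_time)
        SELECT h.id, t.id, ?
        FROM hospital AS h, treatment AS t
        WHERE h.name = ? AND t.ailment = ? AND t.complaint = ?
        """,
        (int(row["waiting_time"]), row["hospital"], row["ailment"], row["complaint"]),
    )


def waiting_times(ailment, complaint):
    """List (hospital, waiting_time) pairs for one treatment, shortest wait first."""
    return conn.execute(
        """
        SELECT h.name, ht.waiting_time
        FROM hospital_treatment AS ht
        JOIN hospital  AS h ON h.id = ht.hospital_id
        JOIN treatment AS t ON t.id = ht.treatment_id
        WHERE t.ailment = ? AND t.complaint = ?
        ORDER BY ht.waiting_time, h.name
        """,
        (ailment, complaint),
    ).fetchall()


result = waiting_times("Cardiology", "traumatology")
# result -> [('Hospital 2', 7), ('Hospital 1', 8)]
```

If every query you need turns out to be this kind of single lookup, that is a sign the one-table CSV import may be enough for now.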

What would be the cardinality between Artist vs ArtWork vs Group?

You set up a database company, ArtBase, that builds a product for art galleries. The core of this product is a database with a schema
that captures all the information that galleries need to maintain.
Galleries keep information about artists, their names (which are
unique), birthplaces, age, and style of art.
For each piece of artwork, the artist, the year it was made, its
unique title, its type of art (e.g., painting, lithograph, sculpture,
photograph), and its price must be stored.
Pieces of artwork are also classified into groups of various kinds,
for example, portraits, still lifes, works by Picasso, or works of the
19th century; a given piece may belong to more than one group. Each
group is identified by a name (like those just given) that describes
the group.
Finally, galleries keep information about customers. For each
customer, galleries keep that person’s unique name, address, total
amount of dollars spent in the gallery (very important!), and the
artists and groups of art that the customer tends to like.
Draw the ER diagram for the database.
Is the following ERD correct?
Is it possible that a group has zero Artworks?
Is it possible that the Artist didn't produce any artwork but still sits in the database?

1) You used ID as a PK in Artist and Artwork. This is a good thing as the use of an unique name (as requested in the business model) is wrong: after all, two pieces of art or two artists may bear the same name. However, you did respect the business model for the Customer entity whose PK is Name.
You can choose to make a good ERD and use ID as a surrogate PK for Artwork, Artist, and Customer; or respect the business model you were given and use Name as a PK for these three entities. Personally, I'd go with the former.
The following two questions can't be answered given the business model only; the answers below reflect the cardinality in the specific ERD you designed.
2) Yes, because according to the ERD a Group includes from 0 to N Artworks;
3) Yes, because according to the ERD although an Artist makes from 1 to N Artworks (and therefore there wouldn't be the need to insert an Artist in the database if he didn't do any Artwork) there is still a relationship between Customer and Artist in the sense that a Customer likes from 1 to N Artists.
Therefore an Artist can be in the database even if he didn't produce any Artwork (yet), provided that he is liked by at least one Customer. If an Artist didn't do any Artwork and is not liked by any Customer, he won't be in the database.
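What a "0 to N" side looks like in an actual schema can be sketched as follows. This uses Python's built-in sqlite3 module; the table names, column names and sample rows are all hypothetical and are not taken from your ERD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artist    (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE artwork   (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL UNIQUE,
    artist_id INTEGER NOT NULL REFERENCES artist(id)  -- every artwork has one artist
);
CREATE TABLE art_group (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Junction table for the many-to-many link between artworks and groups.
CREATE TABLE artwork_group (
    artwork_id INTEGER NOT NULL REFERENCES artwork(id),
    group_id   INTEGER NOT NULL REFERENCES art_group(id),
    PRIMARY KEY (artwork_id, group_id)
);
-- Hypothetical sample rows.
INSERT INTO artist    VALUES (1, 'Artist A'), (2, 'Artist B');
INSERT INTO artwork   VALUES (1, 'Untitled 1', 1);
INSERT INTO art_group VALUES (1, 'portraits'), (2, 'still lifes');
INSERT INTO artwork_group VALUES (1, 1);
""")

# Artists that sit in the database without any artwork.
artists_without_artworks = [name for (name,) in conn.execute("""
    SELECT a.name FROM artist AS a
    LEFT JOIN artwork AS w ON w.artist_id = a.id
    WHERE w.id IS NULL
""")]

# Groups that contain zero artworks.
empty_groups = [name for (name,) in conn.execute("""
    SELECT g.name FROM art_group AS g
    LEFT JOIN artwork_group AS ag ON ag.group_id = g.id
    WHERE ag.artwork_id IS NULL
""")]
# artists_without_artworks -> ['Artist B']
# empty_groups             -> ['still lifes']
```

Note that foreign keys on their own only give you the optional (0 to N) side; a mandatory "at least one" rule has to be enforced elsewhere, for example in application code.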

Some context information is missing here, especially some cardinality information. Notice that you yourself are asking questions about the context:
Is it possible that a group has zero Artworks?
Is it possible that the Artist didn't produce any artwork but still
sits in the database?
This information should be given by you (or by the problem statement). If this is coursework, your instructor needs to explain the context better. If you are already working as a DBA or data modeler, please look for more information about this problem. It is hard to overstate the importance of context when developing an ER diagram. Keep this in mind: without a well-defined context, the problem (the situation) is uncertain, and information needed to reflect the real-world situation is missing. In short:
No complete context, no diagram (without a diagram, there is no system!).
I will make this diagram with you step by step, but I'll make some assumptions due to the lack of information (context) here. I will give my opinion on certain constructs used in ER diagrams, but that does not mean that I'm saying you're a layman. I am just showing my reasoning, which reflects how I learned this here in my country. I believe that you are as capable as I am, ok? Well, let's begin...
Entities in ER-Diagram are defined when we have attributes / properties. According to your description, we can see immediately 3 entities here:
Customers
Artists
Artworks
Relationships exist to express links between entities. The most obvious relationship here is between Artists and Artworks, don't you agree?
For each piece of artwork, the artist...
In accordance with the context revealed, all artwork has a unique artist (always), but it is uncertain if an artist always has one, multiple, or zero artworks. I SUPPOSE that an artist can have many or no artwork. That being said, we see that artists to artworks have a cardinality 0 to N, because, again, an artist may have made several or no artwork at all.
So far we have defined three entities, and linked two of them. Let's continue...
...its type of art (e.g., painting, lithograph, sculpture, photograph)...
If an artwork has only a single type of art, and an art type is defined only by its name, then we have here what is called a Functional Redundancy (translated from the Portuguese term "Redundância Funcional"). In short, Functional Redundancies are like relationships between two entities, and serve to save you the trouble of repeating the same field in multiple columns in a table (which would be susceptible to errors). In a Conceptual Model, they are represented as a field in an entity with the suffix "(R)" (without the double-quotes).
If an entity has a field (column) like a Functional Redundancy, but with different values (multiple), then we have what is called Multivalued Field (also translated from the Portuguese term "Campo Multivalorado"). These are fields in entities that have the suffix "*" (also without the double-quotes).
This is not the case for the type of artwork, but so far it would be the case for the groups of each artwork:
Pieces of artwork are also classified into groups of various kinds,
for example, portraits, still lifes, works by Picasso, or works of the
19th century; a given piece may belong to more than one group.
This would be true if groups only possess names, and no other entity relate to them. But then you said:
and groups of art that the customer tends to like.
This changes things a bit. Groups is no longer a Multivalued Field in the Artworks entity; it becomes an entity with two relationships, one to Customers and one to Artworks. The relationship between Groups and Customers reveals the art groups preferred by customers. The relationship between Groups and Artworks shows which art groups an artwork is related to. Now let's talk about the cardinalities of these relationships.
...a given piece may belong to more than one group. [...]
...and groups of art that the customer tends to like. [...]
Concerning Groups and Artworks, the word "may" says a lot to me. It says that something may or may not be effective. Still, it is uncertain whether an artwork can exist without at least one related group. Because of this, I see a 1 to N relationship from Artworks to Groups.
Conversely, the opposite process is not clear. I believe that there may be groups unrelated to artworks, perhaps because they are new groups created in a given time. So I see a relationship of 0 to N from Groups to Artworks.
Let's talk about Groups and Customers. It seems to me that a customer likes at least one group of art. So I see a 1 to N relationship from Customers to Groups. On the opposite side, as already said, it would be possible to add new groups without automatically tying at least one customer to them. I think there may be new groups unrelated to customers. So guess what? We have a relationship of 0 to N from Groups to Customers.
So far we have identified another entity, a Functional Redundancy, and two relationships with their respective cardinalities. Let's keep going...
and the artists ... that the customer tends to like.
There is a close connection here between two entities, Customers, and Artists. This relationship tells us what artists the customers like. If a customer must like at least one artist, then we have a 1 to N relationship from Customers to Artists. If a customer may or may not like an artist, then we have a relationship 0 to N.
If an artist has zero or more customers who appreciate their work, then we have a relationship 0 to N from Artists to Customers. If an artist has at least one customer who appreciates their work, then we have a 1 to N relationship from Artists to Customers.
Lastly...
Galleries keep information about artists, [...] and style of art.
If multiple artists can share a single same art style, then we have a Functional Redundancy here. If several artists have various art styles, then we have a Multivalued Field.
After much talk, I came up with an ER-Diagram presented by your context and assumptions made by me:
NOTE: The green points highlight major assumptions.
Is this right? Is this the correct diagram? The correct answer would be (from me to you):
I do not know...
Without a concrete context, we cannot finalize a diagram correctly. My tip is that you complete your context. Only then will you have a correct diagram.
Oh, one more thing. What would be this "money spent" attribute? If customers can buy artworks, it would represent a new relationship between Artworks and Customers. This relationship would represent the purchase of artworks from customers (called "ORDERS", for instance). If not so, skip this paragraph.
If I have forgotten something, please say so. If you have questions feel free to ask, I'm here to help you.

Expressing relational calculus & algebra queries in plain English re passengers, flights & trips in economy

I have this statement:
And this one:
How do I go about converting these to plain English?
Here is the extent of my understanding:
For the first one, I think it's selecting p_id where there exists f_no1, f_date and f_no2 from the Flight and Trip tables joined.
The second one is confusing; I know roughly what it's doing but I don't know how to convert it to plain English. It's natural joining the trip, flight and passenger tables, then it's selecting the rows from that resulting table where the class is business. From the rows where the class is business, it is then selecting only the rows where the final destination is Los Angeles, and then from those rows, it is selecting the passenger id and name. So I guess the English translation will be something along the lines of "Get the name and id of passengers going to Los Angeles in business class" but I'm not sure.

Relational Calculus
You're on the right track.
Free variable: p_id (determines your output structure)
Bound variables: f_no1, f_no2, f_date
You can see that there are two lines that look very similar, but differ significantly. Each line is pairing information across two relationships with the intention to find values that satisfy the conditions.
Notice that the f_date and p_id variables are the same on both lines, whereas f_no differs. This indicates that there are two separate flights which occur on the same day with the same passenger on both. The first line indicates a journey from Rapanui to Papeete and the second line indicates a journey from Papeete to Auckland. Both of these journeys must also satisfy the requirement of being traveled via Economy class.
Bringing this information together, this query is asking for the p_id's where that p_id travels from Rapanui to Auckland via Papeete on the same day, with both flights being in Economy class.
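The same reading can be checked by evaluating the condition over a few tuples. The sketch below is an assumption-heavy illustration: the original statement is not reproduced here, so the schemas Flight(f_no, origin, destination) and Trip(p_id, f_no, f_date, class), the function name and all sample rows are hypothetical.

```python
# Flight(f_no, origin, destination) -- assumed schema, hypothetical rows
flights = {
    ("F1", "Rapanui", "Papeete"),
    ("F2", "Papeete", "Auckland"),
}

# Trip(p_id, f_no, f_date, class) -- assumed schema, hypothetical rows
trips = {
    (1, "F1", "2024-01-05", "Economy"),
    (1, "F2", "2024-01-05", "Economy"),   # same day, both Economy: qualifies
    (2, "F1", "2024-01-05", "Economy"),
    (2, "F2", "2024-01-06", "Economy"),   # second leg on another day
    (3, "F1", "2024-01-05", "Economy"),
    (3, "F2", "2024-01-05", "Business"),  # second leg not in Economy
}


def rapanui_to_auckland_same_day(flights, trips):
    """p_id is free; f_no1, f_no2 and f_date are existentially bound."""
    return {
        p1
        for (p1, f_no1, date1, class1) in trips
        for (p2, f_no2, date2, class2) in trips
        if p1 == p2                      # same passenger on both legs
        and date1 == date2               # same f_date on both legs
        and class1 == class2 == "Economy"
        and (f_no1, "Rapanui", "Papeete") in flights
        and (f_no2, "Papeete", "Auckland") in flights
    }


result = rapanui_to_auckland_same_day(flights, trips)
# result -> {1}
```

Sharing p_id and f_date between the two halves of the condition is what forces "same passenger, same day", while the separate f_no1 and f_no2 allow two different flights.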
Relational Algebra
You pretty much have it there. The query selects p_id and p_name of all passengers who have flown to Los Angeles in Business class.
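To see the join, select and project steps one at a time, here is a small Python sketch. The three helper functions, the attribute names other than p_id and p_name, and every sample row are assumptions made for illustration.

```python
def natural_join(left, right):
    """Pair rows that agree on every attribute name the two relations share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[k] == r_row[k] for k in shared):
                joined.append({**l_row, **r_row})
    return joined


def select(rows, predicate):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, *columns):
    """Relational projection: keep the named columns and drop duplicates."""
    seen = []
    for row in rows:
        values = tuple(row[c] for c in columns)
        if values not in seen:
            seen.append(values)
    return seen


# Hypothetical sample relations.
passenger = [{"p_id": 1, "p_name": "Ana"}, {"p_id": 2, "p_name": "Ben"}]
flight = [
    {"f_no": "F3", "destination": "Los Angeles"},
    {"f_no": "F2", "destination": "Auckland"},
]
trip = [
    {"p_id": 1, "f_no": "F3", "class": "Business"},
    {"p_id": 2, "f_no": "F3", "class": "Economy"},
    {"p_id": 2, "f_no": "F2", "class": "Business"},
]

joined = natural_join(natural_join(trip, flight), passenger)
business = select(joined, lambda r: r["class"] == "Business")
to_la = select(business, lambda r: r["destination"] == "Los Angeles")
result = project(to_la, "p_id", "p_name")
# result -> [(1, 'Ana')]
```

Ben is excluded even though he flew to Los Angeles and also flew Business, because no single trip row of his satisfies both conditions at once.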

Data matching/ deduplication Sql server 2008 R2

What are the options for making a data cleansing process (deduplication/matching)
when dealing with MS SQL Server 2008 R2?
Or better yet how can I weight scores on a matching process on columns of a row?
The situation is the following: I have a persons table in my database, with the associated addresses and documents in other database tables.
How can I make the best match decision based on name, document serial number and address? As I understand it, SSIS Fuzzy Grouping does not support this feature: weighted scoring.

I do not have much experience with SSIS at the moment - so this answer is focused on the de-duping/matching/scoring aspect of your question.
There are many ways to approach a Data Quality strategy such as this, all of which have pros and cons, and I think a lot of it comes down to your existing data management strategies - how clean and standardised is the data you are trying to dedupe?
Even 'simple' items like telephone numbers can be difficult to dedupe if you have not got this correct - for example all of these are different representations of the same number:
+1 (888) 707-8822
1-888-707-8822
18887078822
001 888 7078822
888-7078822
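Standardising before comparing is what makes those five strings match. A minimal sketch, assuming North American numbers only and treating a leading "00" as an international dialling prefix (the function name and both rules are assumptions, not a general-purpose phone parser):

```python
import re


def normalize_nanp(raw):
    """Reduce a North American number to its 10 national digits (assumed rules)."""
    digits = re.sub(r"\D", "", raw)          # drop "+", spaces, brackets, dashes
    if digits.startswith("00"):              # assumed international prefix
        digits = digits[2:]
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # strip the country code
    return digits


samples = [
    "+1 (888) 707-8822",
    "1-888-707-8822",
    "18887078822",
    "001 888 7078822",
    "888-7078822",
]
normalized = {normalize_nanp(s) for s in samples}
# normalized -> {'8887078822'}
```

Storing the normalised value in its own column lets an exact-match comparison do most of the work before any fuzzy matching is needed.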
The more complex structures such as addresses get even more interesting: are 'flat 2' and 'apartment 2' the same thing or different?
You have two choices - make it yourself or trust a third party
Make it yourself
Advantages
Lots of fun logical problems to work through
Will be able to tweak and improve at will 'forever' as your solution grows
Disadvantages
It will take a lot of time.
Each country you use will need looking at separately - there are no high quality 'global' rules that you can apply (but there are of course snippets that can be reused)
Third Party
Advantages
If de-duplication is not your specialty - let the experts do it
Ready to go and deliver value immediately
Disadvantages
Cost
Whether you go your own route or third party I suggest you start by creating a clear goal.
What are your inputs:
How 'clean' is your data?
How standardised is your data?
How do the records link together?
Are the address records just from one country or are they from several?
What are your workflows:
How often do you need to run this process?
Do you want to stop duplicates entering your system in the first place or just run periodic bulk runs?
What do you want from the project?
To what level (document, person, household, organisation - see below) do you want to identify duplicates?
What do you want to do with those duplicates?
Delete duplicates and keep one record
Merge duplicates to create one master record
This stage is sometimes referred to as creating the 'Golden' record: deciding which information to keep, and which information to disregard.
To go into a bit more detail about some of those choices, consider the following dummy addresses:
Are you trying to dedupe to household level:
Ann Smith, 1 main st, DupeVille, MA, 12345
Bob Smith, 1 main street, DupeVille, MA, 12345
become
Ann and Bob Smith, 1 Main St, DupeVille, MA, 12345-6789
Person Level
Robert Smith, 1 main st, DupeVille, MA, 12345
Bob Smith, 1 main street, DupeVille, MA, 12345
become
Robert Smith, 1 Main St, DupeVille, MA, 12345-6789
or even by the ID's in your document database.
Once you have that plan, it may help you make up your mind about the best route to take. If you want to create it yourself, the links you have found certainly put you in the right mindset. If you want to go third party - there are a good range of suppliers out there. Just make sure you choose someone you can trust - they're going to be changing your data!
Google around for the various suppliers - Experian Data Quality are one of them (my company!) and depending upon where in the world you are, you can find your best contact details and more info here: http://www.qas.com/contact/office-locations.htm . We have tools that can integrate with SQL Server 2008 R2 which can score differing input types and then automatically dedupe these for you or return the clusters of potential duplicates for you to look after yourself.
Take your plan, and clear idea of what you need from them and discuss it with them. Whoever you choose will be able to talk you through your plan, discuss your goals and tell you if they are the right people for the job.
Think I went on a bit there :-) but hopefully that points you in the right direction - Good luck!

If you do fuzzy grouping with multiple columns you will get _similarity information for every column you choose as input. With this similarity information you can calculate your own thresholds etc.
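Combining per-column similarities into one weighted score could look like the sketch below. It is not SSIS: Python's difflib.SequenceMatcher stands in for the fuzzy similarity values, and the weights, column names and sample records are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical weights: the document number counts most, the address least.
WEIGHTS = {"name": 0.3, "document_no": 0.5, "address": 0.2}


def similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of the per-column similarities."""
    total = sum(weights.values())
    return sum(
        weight * similarity(rec_a[col], rec_b[col]) for col, weight in weights.items()
    ) / total


a = {"name": "Robert Smith", "document_no": "AB123", "address": "1 main st"}
b = {"name": "Bob Smith", "document_no": "AB123", "address": "1 main street"}
c = {"name": "Bob Smith", "document_no": "ZZ999", "address": "1 main street"}

same_document = match_score(a, b)
different_document = match_score(a, c)
# same_document is higher than different_document because only the
# heavily weighted document_no column differs between b and c.
```

A threshold on the combined score then decides which pairs are treated as duplicates, and it is worth tuning that threshold against a manually checked sample.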

How could the following database schema be drawn using E/R diagrams?

How could the following database schema be drawn using E/R diagrams? (A sketch or final image would be helpful). I would also appreciate it if you could point me to an easy-to-understand tutorial on entity-relationships so I could learn how to draw them on paper first.
A CD has a title, a year of production and a CD type. (CD type could be anything: mini-CD, CD-R, CD-RW, DVD-R, DVD-RW...)
A CD usually has multiple songs on different tracks. Each song has a name, an artist and a track number. Entity set Song is considered to be weak and needs support from entity set CD.
A CD is produced by a producer which has a name and an address.
A CD may be supplied by multiple suppliers, each has a name and an address.
A customer may rent multiple CDs. Customer information such as Social Security Number (SSN), name, telephone needs to be recorded. The date and period of renting (in days) should also be recorded.
A customer may be a regular member and a VIP member. A VIP member has additional information such as the starting date of VIP status and percentage of discount.
Is this Entity diagram correct? This is so fracking confusing. I've built this diagram on just intuition rather than the systematic approach they teach in a textbook. I still can't wrap my head around many-to-one relations, weak entities and foreign keys.

There's a fair article on ERDs on Wikipedia.
When you're starting a new ERD - whether it's hand-drawn or computer-drawn - you should focus first on the entities (entity sets). Add the relationships in and then worry about fleshing out your non-key predicates. When you get some experience with ERDs you'll get to the point where you won't need much more work to achieve normalization. It will start to come naturally to you.
There are probably quite a few changes that you'll want to make to your diagram. Since this may be homework, I'll give you an alternative diagram to consider:
This model takes a more sophisticated view of your rules, for example:
Songs can appear many times on the same CD and on different CDs.
A song can be performed by multiple artists within a given track.
Producers can cooperate on a CD.
None of these are necessarily right for your model. It depends on your business rules.
Compare your model with this one and ask yourself what is different and why you might want to take one approach or the other.
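Since weak entities and foreign keys were the confusing part, here is how the rule from the question (Song is weak and depends on CD) might translate into tables. This is a sketch using Python's built-in sqlite3 module; the table names, column names, the add_song helper and the sample rows are assumptions, and it follows the question's simpler rule rather than the alternative model above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE cd (
    id      INTEGER PRIMARY KEY,
    title   TEXT NOT NULL,
    year    INTEGER,
    cd_type TEXT
);
-- Weak entity: a song is identified by its owning CD plus its track number.
CREATE TABLE song (
    cd_id    INTEGER NOT NULL REFERENCES cd(id) ON DELETE CASCADE,
    track_no INTEGER NOT NULL,
    name     TEXT NOT NULL,
    artist   TEXT,
    PRIMARY KEY (cd_id, track_no)
);
INSERT INTO cd (id, title) VALUES (1, 'CD One'), (2, 'CD Two');  -- hypothetical
""")


def add_song(cd_id, track_no, name, artist=None):
    """Insert a song; return False if a key or foreign-key rule rejects it."""
    try:
        conn.execute(
            "INSERT INTO song (cd_id, track_no, name, artist) VALUES (?, ?, ?, ?)",
            (cd_id, track_no, name, artist),
        )
        return True
    except sqlite3.IntegrityError:
        return False


first = add_song(1, 1, "Song A")       # True
other_cd = add_song(2, 1, "Song A")    # True: track 1 may exist on another CD
duplicate = add_song(1, 1, "Song B")   # False: track 1 of CD 1 is taken
orphan = add_song(99, 1, "Song C")     # False: there is no CD 99

conn.execute("DELETE FROM cd WHERE id = 1")
left_on_cd1 = conn.execute("SELECT COUNT(*) FROM song WHERE cd_id = 1").fetchone()[0]
# left_on_cd1 -> 0, because the songs are deleted along with their CD
```

The composite primary key is the table-level counterpart of the double-bordered weak entity box: the track number means nothing without the CD it belongs to.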

take all the major concepts, draw a box for each
in the box put the name of the major concept, like SONG then an underline
under the major concept, list all the attributes like NAME
draw lines from one box to another where those concepts are linked (usually through an attribute) like line from CD to SONG
