Duplicate elimination of similar company names

I have a table with company names. There are many duplicates because of human input errors. There are differing opinions on whether the subdivision should be included, plus typos, etc. I want all of these duplicates to be marked as one company, "1c":
+------------------+
| company          |
+------------------+
| 1c               |
| 1c company       |
| 1c game studios  |
| 1c wireless      |
| 1c-avalon        |
| 1c-softclub      |
| 1c: maddox games |
| 1c:inoco         |
| 1cc games        |
+------------------+
I identified Levenshtein distance as a good way to catch typos. However, when a subdivision name is added, the Levenshtein distance increases dramatically, so distance alone is no longer a good measure. Is this correct?
In general I have barely any experience in Computational Linguistics so I am at a loss what methods I should choose.
What algorithms would you recommend for this problem? I want to implement it in java. Pure SQL would also be okay. Links to sources would be appreciated. Thanks.

This is a difficult problem. A magic search keyword that might help you is "normalization" - while it sometimes means very different things ("database normalization" is unrelated, for example), you are effectively trying to normalize your input here.
A simple solution is to use Levenshtein distance with token awareness. The Python library FuzzyWuzzy does this, and this blog post introduces how it works with motivating examples. The basic idea is simple enough that you should be able to implement it in Java without much difficulty.
At a high level, the idea is to split the input into tokens on whitespace and maybe punctuation, then sort the tokens and treat them as a set, then use the set intersection size - allowing for fuzzy matching - as a metric.
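For illustration, here is a minimal Java sketch of that token-sort idea (the class and method names are my own, and this simplified scoring is an assumption - FuzzyWuzzy's real scoring differs in details):

import java.util.Arrays;

public class NameMatcher {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Lower-case, split on whitespace/punctuation, sort, rejoin:
    // "1c: maddox games" and "maddox games 1c" normalize identically.
    static String tokenSort(String name) {
        String[] tokens = name.toLowerCase().split("[\\s\\p{Punct}]+");
        Arrays.sort(tokens);
        return String.join(" ", tokens).trim();
    }

    // Similarity in [0,1]; 1.0 means the sorted token strings are identical.
    static double tokenSortRatio(String a, String b) {
        String sa = tokenSort(a), sb = tokenSort(b);
        int maxLen = Math.max(sa.length(), sb.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(sa, sb) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(tokenSortRatio("1c company", "company 1c"));   // 1.0
        System.out.println(tokenSortRatio("1c-softclub", "1c softclub")); // 1.0
        System.out.println(tokenSortRatio("1c", "1cc games"));            // low
    }
}

A set-based variant (FuzzyWuzzy's token set ratio) scores the shared tokens separately from the leftover tokens, which tolerates an appended subdivision name even better.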
Some related links:
Are there any good libraries available for doing normalization of company names? - Open Data Stack Exchange
NEMO: Extraction and normalization of organization names from PubMed affiliation strings
Automatic gazetteer enrichment with user-geocoded data - For place names, this basically creates a list of "true" names and then uses fuzzy lookup.
Normalizing company names with SPARQL and DBpedia - bobdc.blog - Uses Wikipedia redirect information.

Most efficient implementation of a 1-to-n map in MySQL

I have a database of information organized by US zipcode. I am building an algorithm that crawls along adjacent zipcodes to determine the size of a 'city' based on density, job characteristics, or whatever. I use the location and area of any zipcode to estimate which other zipcodes are adjacent to it. I am becoming aware that this algorithm is eating up most of my processing time when I run tests of my program.
So what I want to do is have the map (as in a data structure map) of adjacent zipcodes in a table in my database.
My current implementation is a table with two fields, source and target. Each time my algorithm determines that two zipcodes are adjacent, the two codes are inserted both ways into the table, like so:
+-----------+------------+
| source    | target     |
+-----------+------------+
| 02139     | 02138      |
| 02138     | 02139      |
+-----------+------------+
That way I can search for all adjacent zip codes with
SELECT target FROM adjacent WHERE source = '02139';
and get all the zip codes that are adjacent to '02139'.
Now strictly speaking, my implementation is just fine. For a set of fewer than 50,000 total zip codes, doing it the way I did doesn't really impose any computational penalties. However, the table is not indexed, and having each relationship inserted twice seems non-optimal. Since I'm just doing this for funzies and learning, I should put in the effort to optimize. So I'm trying to find out how to more efficiently simulate a mapping using a MySQL table.
So the question is: what is the most efficient way to represent 1-to-n mappings using MySQL?
In your application, the concept of adjacency seems to be bidirectional (aka symmetric). That is,
A adj B if and only if B adj A
So you can consider "canonicalizing" the relationship: always store the zip with the smaller numerical value in the first column and the one with the larger value in the second column. That is, using your example, you now have only one row:
+-----------+------------+
| zipLower  | zipHigher  |
+-----------+------------+
| 02138     | 02139      |
+-----------+------------+
And then when you need to find all the neighboring zips of, say, 02139, your query may look like this (assuming the new table is called adjHigher):
SELECT zipHigher AS zip
FROM adjHigher
WHERE zipLower = '02139'
UNION
SELECT zipLower AS zip
FROM adjHigher
WHERE zipHigher = '02139'
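As a sketch of how the canonical form might be enforced at the table level (the CHECK clause is an assumption, and MySQL only enforces CHECK constraints from 8.0.16 on):

-- Hypothetical DDL for the canonicalized adjacency table.
CREATE TABLE adjHigher (
    zipLower  CHAR(5) NOT NULL,
    zipHigher CHAR(5) NOT NULL,
    PRIMARY KEY (zipLower, zipHigher),  -- doubles as the index for the first branch
    CHECK (zipLower < zipHigher)        -- parsed but ignored before MySQL 8.0.16
);

-- LEAST/GREATEST canonicalize the pair regardless of input order.
INSERT INTO adjHigher (zipLower, zipHigher)
VALUES (LEAST('02139', '02138'), GREATEST('02139', '02138'));

An additional index on zipHigher keeps the second branch of the UNION from scanning the whole table.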
Pros and cons
Is this really a more optimal design? It depends. This design uses half the storage space, and insertion may be more efficient (only one row, not two, needs to be inserted per adjacency relationship). However, as you can see, the lookup query becomes more complicated, and if you have to JOIN this table with other tables, this design may make your JOINs more complicated as well.
I guess the intention of this discussion is to explore different design options before committing to one. So here it is.

Hierarchical Query Advantages

I have some huge database tables filled with scientific names, in a parent-child relationship, like this...
TAXON       | PARENT
Mammalia    | Chordata
Carnivora   | Mammalia
Canidae     | Carnivora
Canis       | Canidae
Canis-lupus | Canis
I installed PostgreSQL and started working on a hierarchical query, but it's far more complex than I thought. So I'm thinking of sticking with MySQL and going back to my original scheme, which looks like this:
TAXON       | PARENT    | FAMILY  | ORDER
Mammalia    | Chordata  | (NULL)  | (NULL)
Carnivora   | Mammalia  | (NULL)  | Carnivora
Canidae     | Carnivora | Canidae | Carnivora
Canis       | Canidae   | Canidae | Carnivora
Canis-lupus | Canis     | Canidae | Carnivora
It looks amateurish, but I was surprised to discover that the Catalogue of Life apparently uses the same scheme, with more columns and over a million rows.
With this scheme, I can count children and grandchildren by simply counting the number of species that match Family = 'Canidae', for example. And I can use a series of "stairstep" queries to figure out the names of the great-grandparents, etc.
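(For concreteness, the counting and stairstep queries I mean look something like this, with taxa as a hypothetical table name:)

SELECT COUNT(*) FROM taxa WHERE FAMILY = 'Canidae';   -- everything filed under Canidae
SELECT PARENT FROM taxa WHERE TAXON = 'Canidae';      -- one stairstep up: Carnivora
SELECT PARENT FROM taxa WHERE TAXON = 'Carnivora';    -- next step up: Mammalia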
So I wondered what the benefits of hierarchical queries are. They're more elegant, and you can presumably do everything with just one or two queries, rather than a series of queries. I also assume they're faster, though my original query, with the two extra fields, is fast enough.
Do hierarchical queries have some additional significant advantage that would justify me hiring someone to set one up, or is it primarily just a matter of speed?
A recursive / hierarchical query is often actually slower. It varies - there are many more rows, but on the other hand each row is much smaller.
The main advantage is flexibility, rather than performance. In your table you have a set number of columns... but what if there was any number of possible steps between ultimate parent (root) and ultimate child (leaf)? Or branches that join as well as open, so that one object has two parents? That's when hierarchical queries become more useful.
If by hierarchical queries you mean PostgreSQL Common Table Expressions, the answer is that they are a wonderful feature that allows you to write more readable queries, and in some (but not all) cases they lead to improved performance.
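For example, a recursive CTE can walk the whole ancestry in one statement. This sketch assumes a table named taxa with the TAXON and PARENT columns from the question; the same syntax works in PostgreSQL and in MySQL 8.0+:

WITH RECURSIVE lineage AS (
    -- anchor: start at the species of interest
    SELECT taxon, parent
    FROM taxa
    WHERE taxon = 'Canis-lupus'
    UNION ALL
    -- recursive step: climb one level toward the root per iteration
    SELECT t.taxon, t.parent
    FROM taxa t
    JOIN lineage l ON t.taxon = l.parent
)
SELECT * FROM lineage;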
Is it really worth hiring someone to install PostgreSQL for you? Maybe, maybe not. It's hard to say without benchmarks.
What you really ought to try is Modified Preorder Tree Traversal. Now, that sounds complicated, but it isn't:
We’ll start by laying out our tree in a horizontal way. Start at the root node (‘Food’), and write a 1 to its left. Follow the tree to ‘Fruit’ and write a 2 next to it. In this way, you walk (traverse) along the edges of the tree while writing a number on the left and right side of each node. The last number is written at the right side of the ‘Food’ node. In this image, you can see the whole numbered tree, and a few arrows to indicate the numbering order.
Here is another excellent article on it: http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
This approach can be used in both PostgreSQL and MySQL, and the existing data can be converted without too much difficulty.
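As a minimal sketch of what the numbered tree looks like as a table, using the lft/rgt column names from the linked article (the taxon example is mine):

-- Each node stores the numbers written to its left and right during the walk.
CREATE TABLE taxon_tree (
    name VARCHAR(64) PRIMARY KEY,
    lft  INT NOT NULL,
    rgt  INT NOT NULL
);

-- All descendants of Carnivora fall strictly between its lft and rgt,
-- so subtree queries are a single range scan with no recursion.
SELECT child.name
FROM taxon_tree AS parent
JOIN taxon_tree AS child
  ON child.lft > parent.lft AND child.rgt < parent.rgt
WHERE parent.name = 'Carnivora';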

How to reference groups of records in relational databases?

Humans
| HumanID | FirstName | LastName | Gender |
|---------+-----------+----------+--------|
| 1       | Isaac     | Newton   | M      |
| 2       | Marie     | Curie    | F      |
| 3       | Tim       | Duncan   | M      |
Animals
| AnimalID | Species | NickName |
|----------+---------+----------|
| 4        | Tiger   | Ronnie   |
| 5        | Dog     | Snoopy   |
| 6        | Dog     | Bear     |
| 7        | Cat     | Sleepy   |
How do I reference a group of records in other tables?
For example:
Foods
| FoodID | FoodName | EatenBy |
|--------+----------+---------|
| 8      | Rice     | ???     |
What I want to store in EatenBy may be:
a single record in the Humans and Animals tables (e.g. Tim Duncan)
a group of records in a table (e.g. all dogs, all males, all females)
a whole table (e.g. all humans)
A simple solution is to use a concatenated string, which includes primary keys
from different tables and special strings such as 'Humans' and 'M'.
The application could parse the concatenated string.
Foods
| FoodID | FoodName | EatenBy      |
|--------+----------+--------------|
| 8      | Rice     | Humans, 6, 7 |
Using a concatenated string is a bad idea from the perspective of
relational database design.
Another option is to add another table and use a foreign key.
Foods
| FoodID | FoodName |
|--------+----------|
| 8      | Rice     |
EatenBy
| FoodID | EatenBy |
|--------+---------|
| 8      | Humans  |
| 8      | 6       |
| 8      | 7       |
It's better than the first solution. The problem is that the EatenBy field stores values of different meanings. Is that a problem? How do I model this requirement? How do I achieve 3NF?
The example tables here are a bit contrived, but I do run into situations like
this at work. I have seen quite a few tables just use a concatenated string. I think it is bad but can't think of a more relational way.
This Answer is laid out in chronological order. The Question progressed in terms of detail, noted as Updates, and there is a series of matching Responses.
The progression from the initial question to the final answer stands as a learning experience, especially for OO/ORM types. Major headings mark Responses, minor headings mark subjects.
The Answer exceeds the maximum post length, so I provide the later Responses as links in order to overcome that.
Response to Initial Question
You might have seen something like that at work, but that doesn't mean it was right, or acceptable. CSVs break 2NF. You can't search that field easily. You can't update that field easily. You have to manage the content (eg. avoid duplicates; ordering) manually, via code. You don't have a database or anything resembling one; you have a grand Record Filing System that you have to write mountains of code to "process". Just like the bad old days of 1970s ISAM data processing.
The problem is that you seem to want a relational database. Perhaps you have heard of the data integrity, the Relational power (Join power for you, at this stage), and the speed. A Record Filing System has none of that.
If you want a Relational database, then you are going to have to:
think about the data relationally, and apply Relational Database Methods, such as modelling the data, as data, and nothing but data (not as data values).
Then classifying the data (no relation whatever to the OO class or classifier concept).
Then relating the classified data.
The second problem is, and this is typical of OO types, that they concentrate on, obsess over, the data values, rather than on the meaning of the data: how it is classified, how it relates to other data, etc.
No question, you did not think that concept up yourself; your "teachers" fed it to you. I see it all the time. And they love the Record Filing Systems. Notice that instead of giving table definitions, you state that you give "structure", but what you actually list is data values.
In case you don't appreciate what I am saying, let me assure you that this is a classic problem in the OO world, and the solution is easy, if you apply the principles. Otherwise it is an endless mess in the OO stack. Recently I completely eliminated an OO proposal + solution that a very well known mathematician, who supports the OO monolith, proposed. It is a famous paper.
I relationalised the data (ie. I simply placed the data in the Relational context: modelled and Normalised it, which took a grand total of ten minutes), and the problem disappeared, the proposal + solution was not required. Read the Hidders Response. Note, I was not attempting to destroy the paper, I was trying to understand the data, which was presented in schizophrenic form, and the easiest way to do that is to erect a Relational data model. That simple act destroyed the paper.
Please note that the link is an extract of a formal report of a paid assignment for a customer, a large Australian bank, who has kindly given me permission to publish the extract with a view to educating the public about the dangers of ignoring Relational database principles, especially by OO proponents.
The exact same process happened with a second, more famous paper Kohler Response. This response is much smaller, less formal, it was not paid work for a customer. That author was theorising about yet another abnormal "normal form".
Therefore, I would ask you to:
forget about "table structures" or definitions
forget about what you want
forget about implementation options
forget ID columns, completely and totally
forget EatenBy
think about what you have in terms of data, the meaning of the data, not as data values or example data, not as what you want to do with it
think about how that data is classified, and how it can be classified.
how the data relates to other data. (You may think that your EatenBy is that but it isn't, because the data has no organisation yet, to form relationships upon.)
If I look at my crystal ball, most of it is dark, but from the little flecks of light that I can see, it looks like you want:
Things
Groups of Things
Relationships between Things and ThingGroups
The Things are nouns, subjects. Eventually we will be doing something between those subjects, and that will be verbs or action statements, which will form Predicates (First Order Logic). But not now; for now, we want only the Things.
Now if you can modify your question and tell me more about your Things, and what they mean, I can give you a complete data model.
Response to Update 1 re Hierarchy
Record IDs are Physical, Non-relational
If you want a Relational Database, you need Relational Keys, not Record IDs. Additionally, starting the Data Modelling exercise with an ID stamped on every file cripples the exercise.
Please read this Answer.
Hierarchies Exist in the Data
If you want a full discourse, please ask a new question. Here is a quick summary.
Hierarchies occur naturally in the world, they are everywhere. That results in hierarchies being implemented in many databases. The Relational Model was founded on, and is a progression of, the Hierarchical Model. It supports hierarchies brilliantly. Unfortunately the famous writers do not understand the RM, they teach only pre-1970s Record Filing Systems badged as "relational". Likewise, they do not understand hierarchies, let alone hierarchies as supported in the RM, so they suppress it.
The result of that is, the hierarchies that are everywhere, that have to be implemented, are not recognised as such, and thus they are implemented in a grossly incorrect and massively inefficient manner.
Conversely, if the hierarchy that occurs in the data that is being modelled, is modelled correctly, and implemented using genuine Relational constructs (Relational Keys, Normalisation, etc) the result is an easy-to-use and easy-to-code database, as well as being devoid of data duplication (in any form) and extremely fast. It is quite literally Relational at its best.
There are three types of Hierarchies that occur in data.
Hierarchy Formed in Sequence of Tables
This requirement, the need for Relational Keys, occurs in every database, and conversely, the lack of it cripples the database and produces a Record Filing System, with none of the integrity, power, or speed of a Relational Database.
The hierarchy is plainly visible in the form of the Relational Key, which progresses in compounding, in any sequence of tables: father, son, grandson, etc. This is essential for ordinary Relational data integrity, the kind that Hidders and 95% of the database implementations do not have.
The Hidders Response has a great example of Hierarchies:
a. that exist naturally in the data
b. that OO types are blind to [as Hidders evidently is]
c. they implement RFS with no integrity, and then they try to "fix" the problem in the object layers, adding even more complexity.
Whereas I implemented the hierarchy in a classic Relational form, and the problem disappeared entirely, eliminating the proposed "solution", the paper. Relational-isation eliminates theory.
The two hierarchies in those four tables are:
Domain::Animal::Harvest
Domain::Activity::Harvest
Note that Hidders is ignorant of the fact that the data is an hierarchy; that his RFS doesn't have integrity precisely because it is not Relational; that placing the data in the Relational context provides the very integrity he is seeking outside it; that the Relational Model eliminates all such "problems", and makes all such "solutions" laughable.
Here's another example, although the modelling is not yet complete. Please make sure to examine the Predicates, and page 2 for the actual Keys. The hierarchies are:
Subject::CategorySubject::ExaminationResult
Category::CategorySubject::ExaminationResult
Person::Registrant::Candidate::ExaminationResult
Note that the last one is a progression of state of the business instrument, thus the Key does not compound.
Hierarchy of Rows within One Table
Typically a tree structure of some sort, there are literally millions of them. For any given Node, this supports a single ancestor or parent, and unlimited children. Done properly, there is no limit to the number of levels, or the height of the tree (ie. unlimited ancestor and progeny generations).
The terms ancestor and descendant used here are plain technical terms; they do not have the OO connotations and limitations.
You do need recursion in the server, in order to traverse the tree structure, so that you can write simple procs and functions that are recursive.
Here is one for Messages. Please read both the question and the Answer, and visit the linked Message Data Model. Note that the seeker did not mention Hierarchy or tree, because the knowledge of Hierarchies in Relational Databases is suppressed, but (from the comments) once he saw the Answer and the Data Model, he recognised it for the hierarchy that it is, and that it suited him perfectly. The hierarchy is:
Message::Message[Message]::Message[::Message[Message]] ...
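As a generic sketch of this shape (the names are illustrative, and the single-column key is a simplification - the linked Data Model uses full Relational Keys):

-- One table, one nullable self-reference: unlimited depth.
CREATE TABLE Message (
    MessageNo   INT          NOT NULL PRIMARY KEY,
    ParentNo    INT          NULL,
    MessageText VARCHAR(255) NOT NULL,
    FOREIGN KEY (ParentNo) REFERENCES Message (MessageNo)
);

-- Recursive traversal: the whole subtree under message 1.
WITH RECURSIVE thread AS (
    SELECT MessageNo, ParentNo, MessageText
    FROM Message
    WHERE MessageNo = 1
    UNION ALL
    SELECT m.MessageNo, m.ParentNo, m.MessageText
    FROM Message m
    JOIN thread t ON m.ParentNo = t.MessageNo
)
SELECT * FROM thread;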
Hierarchy of Rows within One Table, Via an Associative Table
This hierarchy provides an ancestor/descendant structure for multiple ancestors or parents. It requires two relationships, therefore an additional Associative Table is required. This is commonly known as the Bill of Materials structure. Unlimited height, recursively traversed.
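A sketch of that structure (names and types are illustrative; the linked model is the authority):

-- Each Part can act as an Assembly containing many Components, and as a
-- Component used in many Assemblies: two roles, one Associative Table.
CREATE TABLE Part (
    PartCode VARCHAR(12) NOT NULL PRIMARY KEY,
    Name     VARCHAR(64) NOT NULL
);

CREATE TABLE Assembly (
    AssemblyCode  VARCHAR(12) NOT NULL,
    ComponentCode VARCHAR(12) NOT NULL,
    Quantity      INT         NOT NULL,
    PRIMARY KEY (AssemblyCode, ComponentCode),
    FOREIGN KEY (AssemblyCode)  REFERENCES Part (PartCode),
    FOREIGN KEY (ComponentCode) REFERENCES Part (PartCode)
);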
The Bill of Materials Problem was a limitation of Hierarchical DBMS, that we overcame partially in Network DBMS. It was a burning issue at the time, and one of IBM's specific problems that Dr E F Codd was explicitly charged to overcome. Of course he met those goals, and exceeded them spectacularly.
Here is the Bill of Materials hierarchy, modelled and implemented correctly.
Please excuse the preamble, it is from an article, skip the top two rows, look at the bottom row.
Person::Progeny is also given.
The hierarchies are:
Part[Assembly]::Part[Component] ...
Part[Component]::Part[Assembly] ...
Person[Parent]::Person[Child] ...
Person[Child]::Person[Parent] ...
Ignorance Of Hierarchy
Separate from the fact that hierarchies commonly exist in the data, that they are not recognised as such due to the suppression, and that therefore they are not implemented as hierarchies: when they are recognised, they are implemented in the most ridiculous, ham-fisted ways.
Adjacency List
The suppressors hilariously state that "the Relational Model doesn't support hierarchies", in denial of the fact that it is founded on the Hierarchical Model (which is plain evidence that they are ignorant of the basic concepts in the RM they allege to be postulating about). So they can't use that name, and this is the stupid name they use instead.
Generally, the implementation will have recognised that there is an hierarchy in the data, but the implementation will be very poor, limited by physical Record IDs, etc., absent Relational Integrity, etc.
And they are clueless as to how to traverse the tree, namely that one needs recursion.
Nested Sets
An abortion, straight from hell. A Record Filing System within a Record Filing System. Not only does this generate masses of duplication and break Normalisation rules, it fixes the records in the filing system in concrete.
Moving a single node requires the entire affected part of the tree to be re-written. Beloved of the Date, Darwen and Celko types.
The MS HIERARCHYID datatype does the same thing. It gives you a mass of concrete that has to be jack-hammered and poured again every time a node changes.
Ok, it wasn't so short.
Response to Update 2
Response to Update 3
Response to Update 4
For each category that eats the food, you should add one table. For example, if a food may be eaten only by a specific gender, you would have:
Food_Gender(FoodID,GenderID)
for humans you would have:
Food_Human(FoodID,HumanID)
for animals species:
Food_AnimalSpc(FoodID,Species)
for an entire table:
Food_Table(FoodID,TableID)
and so on for other categories.
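As a sketch in DDL (the types, and the Foods/Humans tables referenced, are assumptions based on the question's examples):

-- One association table per eater category.
CREATE TABLE Food_Human (
    FoodID  INT NOT NULL,
    HumanID INT NOT NULL,
    PRIMARY KEY (FoodID, HumanID),
    FOREIGN KEY (FoodID)  REFERENCES Foods  (FoodID),
    FOREIGN KEY (HumanID) REFERENCES Humans (HumanID)
);

CREATE TABLE Food_AnimalSpc (
    FoodID  INT         NOT NULL,
    Species VARCHAR(32) NOT NULL,
    PRIMARY KEY (FoodID, Species),
    FOREIGN KEY (FoodID) REFERENCES Foods (FoodID)
);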

Incremental MySQL database design where future needs are unknown

I am using MySQL, InnoDB, and running it on Ubuntu 13.04.
My general question is: If I don't know how my database is going to evolve or what my needs will eventually be, should I not worry about redundancy and relationships now?
Here is my situation:
I'm currently building a baseball database from scratch, but I am unsure how I should proceed. Right now, I'm approaching the design in a modular fashion. For example, I am currently writing a python script to parse the XML feed of a sports betting website which tells me the money line and the over/under. Since I need to start recording the information, I am wondering if I should just go ahead and populate the tables and worry about keys and such later.
So for example, my Python sports-odds scraping script would populate three tables (Game, Money Line, Over/Under) like so:
DateTime = Date and time of observation
Game
+-----------+-----------+--------------+
| Home Team | Away Team | Date of Game |
+-----------+-----------+--------------+
Money Line
+-----------+-----------+--------------+-----------+-----------+----------+
| Home Team | Away Team | Date of Game | Home Line | Away Line | DateTime |
+-----------+-----------+--------------+-----------+-----------+----------+
Over/Under
+-----------+-----------+--------------+-----------+-----------+----------+----------+
| Home Team | Away Team | Date of Game | Total     | Over      | Under    | DateTime |
+-----------+-----------+--------------+-----------+-----------+----------+----------+
I feel like I should be doing something with the redundant (home team, away team, date of game) columns of information, but I don't really know how my database is going to expand, and in what ways I will be linking everything together. I'm basically building a database so I can answer complicated questions such as:
How does weather in Detroit affect the betting lines when Justin Verlander is pitching against teams who have averaged 5 or fewer runs per game for 20 games prior to the appearance against Verlander? (As you can see, complex questions create complex relationships and queries.)
So is it alright if I go ahead and start collecting data as shown above, or is this going to create a big headache for me down the road?
The topic of future proofing a database is a large one. In general, the more successful a database is, the more likely it is to be subjected to mission creep, and therefore to have new requirements.
One very basic question is this: who will be providing the new requirements? From the way you wrote the question, it sounds like you have built the database to fit your own requirements, and you will also be inventing or discovering the new requirements down the road. If this is not true, then you need to study the evolving pattern of your client(s) needs, so as to at least guess where mission creep is likely to lead you.
Normalization is part of the answer, and this aspect has been dealt with in a prior answer. In general, a partially denormalized database is less future proofed than a fully normalized database. A denormalized database has been adapted to present needs, and the more adapted something is, the less adaptable it is. But normalization is far from the whole answer. There are other aspects of future proofing as well.
Here's what I would do. Learn the difference between analysis and design, especially with regard to databases. Learn how to use ER modeling to capture the present requirements WITHOUT including the present design. Warning: not all experts in ER modeling use it to express requirements analysis. In particular, you omit foreign keys from an analysis model because foreign keys are a feature of the solution, not a feature of the problem.
In parallel, maintain a relational model that conforms to the requirements of your ER model and also conforms to rules of normalization, and other rules of simple sound design.
When a change comes along, first see if your ER model needs to be updated. Sometimes the answer is no. If the answer is yes, first update your ER model, then update your relational model, then update your database definitions.
This is a lot of work. But it can save you a lot of work, if the new requirements are truly crucial.
Try normalizing your data (so that you do not have redundant info) like:
Game
+---+-----------+-----------+--------------+
|ID | Home Team | Away Team | Date of Game |
+---+-----------+-----------+--------------+
Money Line
+-----------+-----------+--------------+-----------+
| Game_ID   | Home Line | Away Line    | DateTime  |
+-----------+-----------+--------------+-----------+
Over/Under
+-----------+-----------+--------------+-----------+-----------+
| Game_ID   | Total     | Over         | Under     | DateTime  |
+-----------+-----------+--------------+-----------+-----------+
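In DDL, that might look like the following sketch (column types and constraints are assumptions):

CREATE TABLE Game (
    ID       INT AUTO_INCREMENT PRIMARY KEY,
    HomeTeam VARCHAR(40) NOT NULL,
    AwayTeam VARCHAR(40) NOT NULL,
    GameDate DATE        NOT NULL,
    UNIQUE (HomeTeam, AwayTeam, GameDate)   -- one row per matchup per day
);

CREATE TABLE MoneyLine (
    Game_ID  INT          NOT NULL,
    HomeLine DECIMAL(7,2) NOT NULL,
    AwayLine DECIMAL(7,2) NOT NULL,
    DateTime DATETIME     NOT NULL,         -- time of observation
    PRIMARY KEY (Game_ID, DateTime),        -- one observation per game per instant
    FOREIGN KEY (Game_ID) REFERENCES Game (ID)
);

-- Over/Under follows the same pattern with Total, Over, and Under columns.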
You can read more on normalization here.

Storing multi-language geodata in MySQL

My application needs to use geodata for displaying location names. I'm very familiar with large-scale complex geodata generally (e.g. Geonames.org) but not so much with the possible MySQL implementation.
I have a custom dataset of four layers, including lat/lon data for each:
- Continents (approx 10)
- Countries (approx 200)
- Regions/States (approx 100)
- Cities (approx 10K)
In relation to all other tables, I'm referencing four properly normalized tables of location names, allowing me to expand these separately from the rest of the data.
So far so good... in English!
However, I wish to add other languages to my application which means that some location names will also need translations (e.g. London > Londres > Londre etc). It won't be OTT, perhaps 6 languages and no more. UTF-8 will be needed.
I'll be using Symfony framework's cultures for handling interface translations, but I'm not sure how I should deal with location names, as they don't really belong in massive XML files. The solution I have in mind so far is to add a column to each of the location tables for "language" to allow the system to recognise what language the location name is in.
If anyone has experience of a cleaner solution or any good pointers, I would be grateful. Thanks.
EDIT:
After further digging, I found a Symfony-assisted solution to this. In case someone finds this question, here's the reference: http://www.symfony-project.org/book/1_0/13-I18n-and-L10n
I'm not familiar with what Symfony has to offer in terms of specific functions in that department. But for a framework-independent approach, how about having one database column in the locality table holding the default name for quick lookup - depending on your preference, the English name of the locality (Copenhagen) or the local name (København) - and a 1:n translation table for the rest, linked to each locality?
locality ID | language (ISO 639-2) | UTF-8 translation
      12345 | fin                  | Kööpenhamina
      12345 | fra                  | Copenhague
      12345 | eng                  | Copenhagen
      12345 | dan                  | København
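A sketch of that translation table in MySQL (the names, types, and the locality table's columns are assumptions):

CREATE TABLE locality_translation (
    locality_id INT          NOT NULL,
    language    CHAR(3)      NOT NULL,   -- ISO 639-2 code, e.g. 'fra'
    name        VARCHAR(120) NOT NULL,
    PRIMARY KEY (locality_id, language),
    FOREIGN KEY (locality_id) REFERENCES locality (id)
) DEFAULT CHARSET = utf8mb4;

-- Look up the French name, falling back to the default column.
SELECT COALESCE(t.name, l.default_name) AS display_name
FROM locality l
LEFT JOIN locality_translation t
       ON t.locality_id = l.id AND t.language = 'fra'
WHERE l.id = 12345;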
That would leave your options open to add unlimited languages, and it may be easier to maintain than having to add a column to each table whenever a new language comes up.
But of course, the multi-column approach is way easier to query programmatically, and no table relations are needed - if the number of languages is very likely to remain limited, I would probably tend towards that out of sheer laziness. :)