Storing multi-language geodata in MySQL

My application needs to use geodata for displaying location names. I'm very familiar with large-scale complex geodata generally (e.g. Geonames.org) but not so much with the possible MySQL implementation.
I have a custom dataset of four layers, including lat/lon data for each:
- Continents (approx 10)
- Countries (approx 200)
- Regions/States (approx 100)
- Cities (approx 10K)
All other tables properly reference four normalized tables of location names, allowing me to expand these separately from the rest of the data.
So far so good... in English!
However, I wish to add other languages to my application, which means that some location names will also need translations (e.g. London > Londres > Londra etc). It won't be excessive: perhaps six languages and no more. UTF-8 will be needed.
I'll be using Symfony framework's cultures for handling interface translations, but I'm not sure how I should deal with location names, as they don't really belong in massive XML files. The solution I have in mind so far is to add a column to each of the location tables for "language" to allow the system to recognise what language the location name is in.
If anyone has experience of a cleaner solution or any good pointers, I would be grateful. Thanks.
EDIT:
After further digging, found a symfony-assisted solution to this. In case someone finds this question, here's the reference: http://www.symfony-project.org/book/1_0/13-I18n-and-L10n

I'm not familiar with what Symfony has to offer in specific functions in that department. But for a framework-independent approach, how about having one column in the locality table holding the default name for quick lookup - depending on your preference, either the English name of the locality (Copenhagen) or the local name (København) - and a 1:n translation table for the rest, linked to each locality?
| locality ID | language (ISO 639-2) | UTF-8 translation |
|-------------|----------------------|-------------------|
| 12345       | fin                  | Kööpenhamina      |
| 12345       | fra                  | Copenhague        |
| 12345       | eng                  | Copenhagen        |
| 12345       | dan                  | København         |
That would leave your options open to add unlimited languages, and may be easier to maintain than having to add a column to each table whenever a new language comes up.
But of course, the multi-column approach is way easier to query programmatically, and no table relations are needed - if the number of languages is very likely to remain limited, I would probably tend towards that out of sheer laziness. :)
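For reference, here is a minimal DDL sketch of the translation-table design; the table and column names are assumptions, and utf8mb4 is used to cover the UTF-8 requirement:

CREATE TABLE locality (
    locality_id  INT          NOT NULL PRIMARY KEY,
    default_name VARCHAR(128) NOT NULL
) DEFAULT CHARSET = utf8mb4;

CREATE TABLE locality_name (
    locality_id INT          NOT NULL,
    language    CHAR(3)      NOT NULL,    -- ISO 639-2 code, e.g. 'fra'
    name        VARCHAR(128) NOT NULL,
    PRIMARY KEY (locality_id, language),
    FOREIGN KEY (locality_id) REFERENCES locality (locality_id)
) DEFAULT CHARSET = utf8mb4;

-- Fetch the French name, falling back to the default:
SELECT COALESCE(n.name, l.default_name) AS display_name
FROM locality l
LEFT JOIN locality_name n
       ON n.locality_id = l.locality_id AND n.language = 'fra'
WHERE l.locality_id = 12345;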

Related

Duplicate elimination of similar company names

I have a table with company names. There are many duplicates because of human input errors. There are different perceptions of whether the subdivision should be included, typos, etc. I want all these duplicates to be marked as one company, "1c":
+-------------------+
| company           |
+-------------------+
| 1c                |
| 1c company        |
| 1c game studios   |
| 1c wireless       |
| 1c-avalon         |
| 1c-softclub       |
| 1c: maddox games  |
| 1c:inoco          |
| 1cc games         |
+-------------------+
I identified Levenshtein distance as a good way to eliminate typos. However, when the subdivision is added the Levenshtein distance increases dramatically and is no longer a good algorithm for this. Is this correct?
In general I have barely any experience in Computational Linguistics so I am at a loss what methods I should choose.
What algorithms would you recommend for this problem? I want to implement it in java. Pure SQL would also be okay. Links to sources would be appreciated. Thanks.
This is a difficult problem. A magic search keyword that might help you is "normalization" - while sometimes it means very different things ("database normalization" is unrelated, for example), you are effectively trying to normalize your input here.
A simple solution is to use Levenshtein distance with token awareness. The Python library FuzzyWuzzy does this, and this blog post introduces how it works with motivating examples. The basic idea is simple enough that you should be able to implement it in Java without much difficulty.
At a high level, the idea is to split the input into tokens on whitespace and maybe punctuation, then sort the tokens and treat them as a set, then use the set intersection size - allowing for fuzzy matching - as a metric.
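Since the question says pure SQL would also be okay, here is a crude SQL-only approximation of that idea (the full token-set matching is much easier to implement in Java). The companies table and the list of subdivision words to strip are assumptions:

-- Derive a normalized grouping key: lower-case, turn punctuation into
-- spaces, and strip known subdivision words.
SELECT company,
       TRIM(REPLACE(REPLACE(REPLACE(REPLACE(
           LOWER(company), ':', ' '), '-', ' '),
           ' company', ''), ' games', '')) AS norm_key
FROM companies
ORDER BY norm_key;   -- near-duplicates now sort next to each other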
Some related links:
Are there any good libraries available for doing normalization of company names? - Open Data Stack Exchange
NEMO: Extraction and normalization of organization names from PubMed affiliation strings
Automatic gazetteer enrichment with user-geocoded data - For place names, this basically creates a list of "true" names and then uses fuzzy lookup.
Normalizing company names with SPARQL and DBpedia - bobdc.blog - Uses Wikipedia redirect information.

How to reference groups of records in relational databases?

Humans
| HumanID | FirstName | LastName | Gender |
|---------+-----------+----------+--------|
| 1       | Isaac     | Newton   | M      |
| 2       | Marie     | Curie    | F      |
| 3       | Tim       | Duncan   | M      |
Animals
| AnimalID | Species | NickName |
|----------+---------+----------|
| 4        | Tiger   | Ronnie   |
| 5        | Dog     | Snoopy   |
| 6        | Dog     | Bear     |
| 7        | Cat     | Sleepy   |
How do I reference a group of records in other tables?
For example:
Foods
| FoodID | FoodName | EatenBy |
|--------+----------+---------|
| 8      | Rice     | ???     |
What I want to store in EatenBy may be:
a single record in the Humans and Animals tables (e.g. Tim Duncan)
a group of records in a table (e.g. all dogs, all males, all females)
a whole table (e.g. all humans)
A simple solution is to use a concatenated string, which includes primary keys from different tables and special strings such as 'Humans' and 'M'. The application could parse the concatenated string.
Foods
| FoodID | FoodName | EatenBy      |
|--------+----------+--------------|
| 8      | Rice     | Humans, 6, 7 |
Using a concatenated string is a bad idea from the perspective of relational database design.
Another option is to add another table and use a foreign key.
Foods
| FoodID | FoodName |
|--------+----------|
| 8      | Rice     |
EatenBy
| FoodID | EatenBy |
|--------+---------|
| 8      | Humans  |
| 8      | 6       |
| 8      | 7       |
It's better than the first solution. The problem is that the EatenBy field stores values of different meanings. Is that a problem? How do I model this requirement? How do I achieve 3NF?
The example tables here are a bit contrived, but I do run into situations like this at work. I have seen quite a few tables just use a concatenated string. I think it is bad but can't think of a more relational way.
This Answer is laid out in chronological order. The Question progressed in terms of detail, noted as Updates. There is a series of matching Responses.
The progression from the initial question to the final answer stands as a learning experience, especially for OO/ORM types. Major headings mark Responses, minor headings mark subjects.
The Answer exceeds the maximum length allowed, so the later Responses are provided as links in order to overcome that.
Response to Initial Question
You might have seen something like that at work, but that doesn't mean it was right, or acceptable. CSVs break 1NF. You can't search that field easily. You can't update that field easily. You have to manage the content (eg. avoid duplicates; ordering) manually, via code. You don't have a database or anything resembling one, you have a grand Record Filing System that you have to write mountains of code to "process". Just like the bad old days of 1970s ISAM data processing.
The problem is that you seem to want a relational database. Perhaps you have heard of the data integrity, the relational power (Join power for you, at this stage), and the speed. A Record Filing System has none of that.
If you want a Relational database, then you are going to have to:
think about the data relationally, and apply Relational Database Methods, such as modelling the data, as data, and nothing but data (not as data values).
Then classifying the data (no relation whatever to the OO class or classifier concept).
Then relating the classified data.
The second problem is, and this is typical of OO types, they concentrate on, obsess on, the data values, rather than on the meaning of the data; how it is classified; how it relates to other data; etc.
No question, you did not think that concept up yourself; your "teachers" fed it to you, I see it all the time. And they love the Record Filing Systems. Notice that instead of giving table definitions, you state that you give "structure", but you list data values.
In case you don't appreciate what I am saying, let me assure you that this is a classic problem in the OO world, and the solution is easy, if you apply the principles. Otherwise it is an endless mess in the OO stack. Recently I completely eliminated an OO proposal + solution that a very well known mathematician, who supports the OO monolith, proposed. It is a famous paper.
I relationalised the data (ie. I simply placed the data in the Relational context: modelled and Normalised it, which took a grand total of ten minutes), and the problem disappeared, the proposal + solution was not required. Read the Hidders Response. Note, I was not attempting to destroy the paper, I was trying to understand the data, which was presented in schizophrenic form, and the easiest way to do that is to erect a Relational data model. That simple act destroyed the paper.
Please note that the link is an extract of a formal report of a paid assignment for a customer, a large Australian bank, who has kindly given me permission to publish the extract with a view to educating the public about the dangers of ignoring Relational database principles, especially by OO proponents.
The exact same process happened with a second, more famous paper Kohler Response. This response is much smaller, less formal, it was not paid work for a customer. That author was theorising about yet another abnormal "normal form".
Therefore, I would ask you to:
forget about "table structures" or definitions
forget about what you want
forget about implementation options
forget ID columns, completely and totally
forget EatenBy
think about what you have in terms of data, the meaning of the data, not as data values or example data, not as what you want to do with it
think about how that data is classified, and how it can be classified.
how the data relates to other data. (You may think that your EatenBy is that but it isn't, because the data has no organisation yet, to form relationships upon.)
If I look at my crystal ball, most of it is dark, but from the little flecks of light that I can see, it looks like you want:
Things
Groups of Things
Relationships between Things and ThingGroups
The Things are nouns, subjects. Eventually we will be doing something between those subjects, that will be verbs or action statements. That will form Predicates (First Order Logic). But not now; for now, we want only the Things.
Now if you can modify your question and tell me more about your Things, and what they mean, I can give you a complete data model.
Response to Update 1 re Hierarchy
Record IDs are Physical, Non-relational
If you want a Relational Database, you need Relational Keys, not Record IDs. Additionally, starting the Data Modelling exercise with an ID stamped on every file cripples the exercise.
Please read this Answer.
Hierarchies Exist in the Data
If you want a full discourse, please ask a new question. Here is a quick summary.
Hierarchies occur naturally in the world, they are everywhere. That results in hierarchies being implemented in many databases. The Relational Model was founded on, and is a progression of, the Hierarchical Model. It supports hierarchies brilliantly. Unfortunately the famous writers do not understand the RM, they teach only pre-1970s Record Filing Systems badged as "relational". Likewise, they do not understand hierarchies, let alone hierarchies as supported in the RM, so they suppress it.
The result of that is, the hierarchies that are everywhere, that have to be implemented, are not recognised as such, and thus they are implemented in a grossly incorrect and massively inefficient manner.
Conversely, if the hierarchy that occurs in the data that is being modelled, is modelled correctly, and implemented using genuine Relational constructs (Relational Keys, Normalisation, etc) the result is an easy-to-use and easy-to-code database, as well as being devoid of data duplication (in any form) and extremely fast. It is quite literally Relational at its best.
There are three types of Hierarchies that occur in data.
Hierarchy Formed in Sequence of Tables
This requirement, the need for Relational Keys, occurs in every database, and conversely, the lack of it cripples the database and produces a Record Filing System, with none of the integrity, power or speed of a Relational Database.
The hierarchy is plainly visible in the form of the Relational Key, which progresses in compounding, in any sequence of tables: father, son, grandson, etc. This is essential for ordinary Relational data integrity, the kind that Hidders and 95% of the database implementations do not have.
The Hidders Response has a great example of Hierarchies:
a. that exist naturally in the data
b. that OO types are blind to [as Hidders evidently is]
c. they implement RFS with no integrity, and then they try to "fix" the problem in the object layers, adding even more complexity.
Whereas I implemented the hierarchy in a classic Relational form, and the problem disappeared entirely, eliminating the proposed "solution", the paper. Relational-isation eliminates theory.
The two hierarchies in those four tables are:
Domain::Animal::Harvest
Domain::Activity::Harvest
Note that Hidders is ignorant of the fact that the data is an hierarchy; that his RFS doesn't have integrity precisely because it is not Relational; that placing the data in the Relational context provides the very integrity he is seeking outside it; that the Relational Model eliminates all such "problems", and makes all such "solutions" laughable.
Here's another example, although the modelling is not yet complete. Please make sure to examine the Predicates, and page 2 for the actual Keys. The hierarchies are:
Subject::CategorySubject::ExaminationResult
Category::CategorySubject::ExaminationResult
Person::Registrant::Candidate::ExaminationResult
Note that last one is a progression of state of the business instrument, thus the Key does not compound.
Hierarchy of Rows within One Table
Typically a tree structure of some sort, there are literally millions of them. For any given Node, this supports a single ancestor or parent, and unlimited children. Done properly, there is no limit to the number of levels, or the height of the tree (ie. unlimited ancestor and progeny generations).
The terms ancestor and descendant used here are plain technical terms; they do not have the OO connotations and limitations.
You do need recursion in the server in order to traverse the tree structure, so that you can write simple procs and functions that are recursive.
Here is one for Messages. Please read both the question and the Answer, and visit the linked Message Data Model. Note that the seeker did not mention Hierarchy or tree, because the knowledge of Hierarchies in Relational Databases is suppressed, but (from the comments) once he saw the Answer and the Data Model, he recognised it for the hierarchy that it is, and that it suited him perfectly. The hierarchy is:
Message::Message[Message]::Message[::Message[Message]] ...
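As an illustration of that server-side recursion using the Message example, here is a recursive CTE sketch (MySQL 8.0+; on older versions you would write a recursive proc, as noted above). The table and column names here are assumptions, not the actual Keys of the linked Data Model:

-- Walk a message thread from its root down to all descendants.
WITH RECURSIVE thread AS (
    SELECT MessageId, ParentMessageId, 0 AS depth
    FROM Message
    WHERE MessageId = 1               -- the root of the thread
    UNION ALL
    SELECT m.MessageId, m.ParentMessageId, t.depth + 1
    FROM Message m
    JOIN thread t ON m.ParentMessageId = t.MessageId
)
SELECT * FROM thread ORDER BY depth;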
Hierarchy of Rows within One Table, Via an Associative Table
This hierarchy provides an ancestor/descendant structure for multiple ancestors or parents. It requires two relationships, therefore an additional Associative Table is required. This is commonly known as the Bill of Materials structure. Unlimited height, recursively traversed.
The Bill of Materials Problem was a limitation of Hierarchical DBMS, that we overcame partially in Network DBMS. It was a burning issue at the time, and one of IBM's specific problems that Dr E F Codd was explicitly charged to overcome. Of course he met those goals, and exceeded them spectacularly.
Here is the Bill of Materials hierarchy, modelled and implemented correctly.
Please excuse the preamble, it is from an article, skip the top two rows, look at the bottom row.
Person::Progeny is also given.
The hierarchies are:
Part[Assembly]::Part[Component] ...
Part[Component]::Part[Assembly] ...
Person[Parent]::Person[Child] ...
Person[Child]::Person[Parent] ...
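A minimal sketch of the Bill of Materials structure under assumed names: a single Part table, plus an Associative Table that pairs an assembly Part with each of its component Parts:

CREATE TABLE Part (
    PartCode VARCHAR(16) NOT NULL PRIMARY KEY,
    Name     VARCHAR(64) NOT NULL
);

CREATE TABLE Assembly (
    AssemblyCode  VARCHAR(16) NOT NULL,
    ComponentCode VARCHAR(16) NOT NULL,
    PRIMARY KEY (AssemblyCode, ComponentCode),
    FOREIGN KEY (AssemblyCode)  REFERENCES Part (PartCode),
    FOREIGN KEY (ComponentCode) REFERENCES Part (PartCode)
);

Traversing in either direction (exploding an assembly into components, or finding where a component is used) is the same recursive walk over Assembly.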
Ignorance Of Hierarchy
Separate from the fact that hierarchies commonly exist in the data, that they are not recognised as such due to the suppression, and that therefore they are not implemented as hierarchies: when they are recognised, they are implemented in the most ridiculous, ham-fisted ways.
Adjacency List
The suppressors hilariously state that "the Relational Model doesn't support hierarchies", in denial of the fact that it is founded on the Hierarchical Model (which is plain evidence that they are ignorant of the basic concepts in the RM they allege to be postulating about). So they can't use the name. This is the stupid name they use instead.
Generally, the implementation will have recognised that there is an hierarchy in the data, but the implementation will be very poor, limited by physical Record IDs, etc, absent of Relational Integrity, etc.
And they are clueless as to how to traverse the tree, that one needs recursion.
Nested Sets
An abortion, straight from hell. A Record Filing System within a Record Filing System. Not only does this generate masses of duplication and break Normalisation rules, it fixes the records in the filing system in concrete.
Moving a single node requires the entire affected part of the tree to be re-written. Beloved of the Date, Darwen and Celko types.
The MS HIERARCHYID Datatype does the same thing. Gives you a mass of concrete that has to be jack-hammered and poured again, every time a node changes.
Ok, it wasn't so short.
Response to Update 2
Response to Update 3
Response to Update 4
For each category that eats the food, you should add one table. For example, if a food may be eaten by a specific gender, you would have:
Food_Gender(FoodID,GenderID)
for humans you would have:
Food_Human(FoodID,HumanID)
for animal species:
Food_AnimalSpc(FoodID,Species)
for an entire table:
Food_Table(FoodID,TableID)
and so on for other categories.
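A minimal DDL sketch of two of these junction tables, with assumed key types (the composite primary keys prevent duplicate pairings):

CREATE TABLE Food_Human (
    FoodID  INT NOT NULL,
    HumanID INT NOT NULL,
    PRIMARY KEY (FoodID, HumanID),
    FOREIGN KEY (FoodID)  REFERENCES Foods  (FoodID),
    FOREIGN KEY (HumanID) REFERENCES Humans (HumanID)
);

CREATE TABLE Food_Gender (
    FoodID INT     NOT NULL,
    Gender CHAR(1) NOT NULL,          -- 'M' / 'F', matching Humans.Gender
    PRIMARY KEY (FoodID, Gender),
    FOREIGN KEY (FoodID) REFERENCES Foods (FoodID)
);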

Structuring a database to handle unknown name/value pairs

Here's the idea: I expect to be receiving thousands of queries, each containing a certain number of name/value pairs; these start off as associative arrays, so I have fairly good control over what can happen to the data. These NVPs vary depending on the source. For example, if the source is "A", I could receive the array (in JSON for ease of explanation): {'Key1':'test1','key2':'test2'}, but if the source was "B", I could receive {'DifferentKey1':'test1','DifferentKey2':'test2'}. I'm selecting which keys I want to store in my database, so in this case I might want to select only DifferentKey1 from source B's array and discard the rest.
My main issue here is that these arrays could technically be completely unrelated content-wise. They have a very general association (they're both arrays containing stats), but they're very different (in that the sources are different, i.e. different games/sports).
I was thinking SQL: storing a table filled with games and their respective ids would be a good way of linking general NVP strings. For example:
Games table:
| id | name   |
|----|--------|
| 1  | golf   |
| 2  | soccer |
NVP table
| id | game_id | nvp                                                       |
|----|---------|-----------------------------------------------------------|
| 1  | 1       | team1score=87;team2score=94;team3score=73;               |
| 2  | 2       | team1score=2;team2score=1;extratime=200;numyellowcards=4; |
Hope that's clear enough. Do you see what I mean though? If there's an indeterminate amount of data that I may use, how can I structure a table? Thanks.
Edit: I guess I should note that obviously this setup WOULD work; however, is it the best performance-wise? Maybe not? I'm not sure - let's see what you guys can come up with!
SQL databases are great for highly relational data - but in a case like this where the data is not relational and there is no fixed schema, you might be better off using a NoSQL solution. There are a lot and I haven't used them enough to be sure what would work best for you. If your data can fit in RAM, then redis is great.
The common way of storing name/value pairs in a relational database is known as "Entity/Attribute/Value". You'll find a lot of discussion on Stack Overflow.
It all depends on what your application wants to do with the data. Storing it is easy - querying is much harder.
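For illustration, here is a minimal EAV sketch applied to the question's example; all table and column names are assumptions:

-- One row per name/value pair instead of a packed string.
CREATE TABLE nvp (
    entry_id INT          NOT NULL,   -- one incoming record/result set
    game_id  INT          NOT NULL,   -- which game/sport it came from
    name     VARCHAR(64)  NOT NULL,
    value    VARCHAR(255) NOT NULL,
    PRIMARY KEY (entry_id, name)
);

-- Querying is where it hurts: pivoting pairs back into columns costs
-- one self-join per attribute.
SELECT a.entry_id, a.value AS team1score, b.value AS team2score
FROM nvp a
JOIN nvp b ON b.entry_id = a.entry_id AND b.name = 'team2score'
WHERE a.name = 'team1score';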
If you're building a sports application, you are likely to have domain concepts you want to support - for football, show a league position based on games played. For golf, show the number of birdies or eagles. You will probably want to show all the games a particular team/player has played in a season.
Some things are easy to build in a relational database, and have amazing performance over huge data sets. Find the highest-scoring game ever, find the last game in the 1998 season, find all the games featuring player x - all a great fit, as long as you can build a schema that represents those domain concepts.
From what you write, it does sound like you will have a fixed number of sports; the data coming into your system sounds like it's not particularly structured, but you should be able to map it to a domain model. If that's true, I recommend building a relational schema that reflects the domain logic of each sport.
If that's not true - if you can't reason about the domain in advance - the relational model is a bad fit, and NoSQL is probably better. But you will run into the same problem - extracting meaning from name/value pairs is going to be hard!

What is the best normalization for street address?

Today I have a table containing:
Table a
--------
name
description
street1
street2
zipcode
city
fk_countryID
I'm having a discussion what is the best way to normalize this in terms of quickest search. E.g. find all rows filtered by city or zipcode. Suggested new structure is this:
Table A
--------
name
description
fk_streetID
streetNumber
zipcode
fk_countryID
Table Street
--------
id
street1
street2
fk_cityID
Table City
----------
id
name
Table Country
-------------
id
name
The discussion is about having only one field for street name instead of two.
My argument is that having two fields is considered normal for supporting international addresses.
The counter-argument is that two fields will come at the cost of search performance and possible duplication.
I'm wondering what is the best way to go here.
UPDATE
I'm aiming at having 15,000 brands associated with 50,000 stores, where 1,000 users will do multiple searches each day via web and iPhone. In addition, third parties will be fetching data from the DB for their sites.
The site is not launched yet, so we have no idea of the workload. We'll only have around 1,000 brands associated with around 4,000 stores when we start.
My standard advice (from years of data warehouse /BI experience) here is:
always store the lowest level of broken out detail, i.e. the multiple fields option.
In addition to that, depending on your needs you can add indexes or even a compound field that is the other two fields concatenated - though make sure to maintain it with a trigger and not manually, or you will have data synchronization and quality problems.
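As a sketch of that trigger approach against the question's Table a (the full_street column name is an assumption; on MySQL 5.7+ a generated column would achieve the same without triggers):

-- Keep the concatenated search field in sync on every write.
ALTER TABLE a ADD COLUMN full_street VARCHAR(255);

CREATE TRIGGER a_before_insert BEFORE INSERT ON a
FOR EACH ROW SET NEW.full_street = CONCAT_WS(' ', NEW.street1, NEW.street2);

CREATE TRIGGER a_before_update BEFORE UPDATE ON a
FOR EACH ROW SET NEW.full_street = CONCAT_WS(' ', NEW.street1, NEW.street2);

CONCAT_WS skips NULLs, so a missing street2 doesn't leave stray separators.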
Part of the correct answer for you will always depend on your actual use. Can you ever anticipate needing the address in a standard (2-line) format for mailing, or for exchange with other entities? Or is this really a pure 'read-only' database that is just set up for inquiries and not used for more standard address needs such as mailings?
At the end of the day, if you have issues with query performance, you can add additional structures such as compound fields, indexes, and even other tables with the same data in a different form. There are also options for caching at the server level if performance is slow. If you're building a complex or traffic-intensive site, chances are you will end up with a product to help anyway - for example, in the Ruby programming world people use Thinking Sphinx. If query performance is still an issue and your data is growing, you may ultimately need to consider non-SQL solutions like MongoDB.
One final principle that I also adhere to: think about people updating data, if that will occur in this system. When people input data initially and then subsequently go to edit that information, they expect the information to be "the same", so any manipulation done internally that actually changes the form or content of the user's input will become a major headache when trying to allow them to do a simple edit. I have seen insanely complicated algorithms for encoding and decoding data in this fashion, and they frequently have issues.
I think the topmost example is the way to go, maybe with a third free-form field:
name
description
street1
street2
street3
zipcode
city
fk_countryID
The only things you can normalize half-way sanely for international addresses are zip code (needs to be a free-form field, though) and city. Street addresses vary way too much.
Note that high normalisation means more joins, so it won't yield faster searches in every case.
As others have mentioned, address normalization (or "standardization") is most effective when the data is together in one table but the individual pieces are in separate columns (like your first example). I work in the address verification field (at SmartyStreets), and you'll find that standardizing addresses is a really complex task. There's more documentation on this task here: https://www.smartystreets.com/Features/Standardization/
With the volume of requests you'll be handling, I highly suggest you make sure the addresses are correct before you deploy. Process your list of addresses and remove duplicates, standardize formats, etc. A CASS-Certified vendor (such as SmartyStreets, though there are others) will provide such a service.

Which of these MySQL database designs (attached) is best for mostly read high performance?

I am a database admin and developer in MySQL. I have been working with MySQL for a couple of years. I recently acquired and studied O'Reilly's High Performance MySQL, 2nd Edition to improve my skills in MySQL's advanced features, high performance and scalability, because I have often been frustrated by the lack of advanced MySQL knowledge I had (and in large part still have).
Currently, I am working on an ambitious web project. In this project, we will have quite a lot of content and users from the beginning. I am the designer of the database, and this database must be very fast (some inserts, but mostly and more importantly READS).
I want to discuss these requirements:
There will be several kinds of items
The items have some fields and relations in common
The items also have some special fields and relations that make them different from each other
Those items will have to be listed all together, ordered or filtered by common fields or relations
The items will also have to be listed by type only (for example, item_specialA)
I have some basic design doubts, and I would like you to help me decide and learn which design approach would be better for a high-performance MySQL database.
Classical approach
The following diagram shows the classical approach, which is the first you may think of with a database mindset: Database diagram
Centralized approach
But maybe we can improve it with a somewhat pseudo-object-oriented paradigm, centralizing the common items and relations in one common item table. It would also be useful for listing all kinds of items: Database diagram
Advantages and disadvantages of each one?
Which approach would you choose, or which changes would you apply, given the requirements above?
Thanks all in advance!!
What you have are two distinct data mapping strategies. The one you called "classical" is "one table per concrete class" in other sources, and the one you called "centralized" is "one table per class" (Mapping Objects to Relational Databases: O/R Mapping In Detail). They both have their advantages and disadvantages (follow the link above). Queries in the first strategy will be faster (you need to join only 2 tables vs 3 in the second strategy).
I think you should explore the classic supertype/subtype pattern. Here are some examples from SO.
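A minimal sketch of that supertype/subtype pattern, with assumed names: common fields live in the supertype table, each special kind gets its own subtype table, and a discriminator column plus a composite foreign key keeps each subtype row attached to a supertype row of the right kind:

CREATE TABLE item (
    item_id   INT          NOT NULL PRIMARY KEY,
    item_type CHAR(1)      NOT NULL,          -- discriminator: 'A', 'B', ...
    title     VARCHAR(128) NOT NULL,          -- a common field
    UNIQUE (item_id, item_type)               -- referenced by the subtypes
);

CREATE TABLE item_special_a (
    item_id   INT     NOT NULL PRIMARY KEY,
    item_type CHAR(1) NOT NULL DEFAULT 'A',   -- must stay 'A' here
    field_a   VARCHAR(128),                   -- a field only type A has
    FOREIGN KEY (item_id, item_type) REFERENCES item (item_id, item_type)
);

Listing all items together reads only the item table; listing one type joins exactly two tables.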
If you're looking mostly for speed, consider selective use of MyISAM tables, use a centralized "object" table, and just one additional table with correct indexes, of this form:
| object_type | object_id | property_name | property_value |
|-------------|-----------|---------------|----------------|
| user        | 1         | photos        | true           |
| city        | 2         | photos        | true           |
| user        | 5         | single        | true           |
| city        | 2         | metro         | true           |
| city        | 3         | population    | 135000         |
and so on... Lookups on primary keys or indexed keys (object_type, object_id, property_name), for example, will be blazing fast. Also, you reduce the need to end up with 457 tables as new properties appear.
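A sketch of that table with the composite key the answer describes (names and types are assumptions):

CREATE TABLE object_property (
    object_type    VARCHAR(16)  NOT NULL,
    object_id      INT          NOT NULL,
    property_name  VARCHAR(64)  NOT NULL,
    property_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (object_type, object_id, property_name)
);

-- A point lookup resolved entirely by the primary key:
SELECT property_value
FROM object_property
WHERE object_type = 'city' AND object_id = 2 AND property_name = 'photos';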
It isn't exactly a well-designed or perfectly-normalized database, and if you are looking at a long-term, big site, you should consider caching, or at least a denormalized paradigm: denormalized MySQL tables like this one, Redis, or maybe MongoDB.