Incremental MySQL database design where future needs are unknown

I am using MySQL, InnoDB, and running it on Ubuntu 13.04.
My general question is: If I don't know how my database is going to evolve or what my needs will eventually be, should I not worry about redundancy and relationships now?
Here is my situation:
I'm currently building a baseball database from scratch, but I am unsure how I should proceed. Right now, I'm approaching the design in a modular fashion. For example, I am currently writing a Python script to parse the XML feed of a sports betting website, which tells me the money line and the over/under. Since I need to start recording the information, I am wondering if I should just go ahead and populate the tables and worry about keys and such later.
So, for example, my Python sports-odds scraping script would populate three tables (Game, Money Line, Over/Under) like so:
DateTime = Date and time of observation
Game
+-----------+-----------+--------------+
| Home Team | Away Team | Date of Game |
+-----------+-----------+--------------+
Money Line
+-----------+-----------+--------------+-----------+-----------+----------+
| Home Team | Away Team | Date of Game | Home Line | Away Line | DateTime |
+-----------+-----------+--------------+-----------+-----------+----------+
Over/Under
+-----------+-----------+--------------+-----------+-----------+----------+----------+
| Home Team | Away Team | Date of Game | Total | Over | Under | DateTime |
+-----------+-----------+--------------+-----------+-----------+----------+----------+
I feel like I should be doing something with the redundant (home team, away team, date of game) columns of information, but I don't really know how my database is going to expand, and in what ways I will be linking everything together. I'm basically building a database so I can answer complicated questions such as:
How does weather in Detroit affect the betting lines when Justin Verlander is pitching against teams who have averaged 5 or fewer runs per game for 20 games prior to the appearance against Verlander? (As you can see, complex questions create complex relationships and queries.)
So is it alright if I go ahead and start collecting data as shown above, or is this going to create a big headache for me down the road?

The topic of future proofing a database is a large one. In general, the more successful a database is, the more likely it is to be subjected to mission creep, and therefore to have new requirements.
One very basic question is this: who will be providing the new requirements? From the way you wrote the question, it sounds like you have built the database to fit your own requirements, and you will also be inventing or discovering the new requirements down the road. If this is not true, then you need to study the evolving pattern of your client(s) needs, so as to at least guess where mission creep is likely to lead you.
Normalization is part of the answer, and this aspect has been dealt with in a prior answer. In general, a partially denormalized database is less future proofed than a fully normalized database. A denormalized database has been adapted to present needs, and the more adapted something is, the less adaptable it is. But normalization is far from the whole answer. There are other aspects of future proofing as well.
Here's what I would do. Learn the difference between analysis and design, especially with regard to databases. Learn how to use ER modeling to capture the present requirements WITHOUT including the present design. Warning: not all experts in ER modeling use it to express requirements analysis. In particular, you omit foreign keys from an analysis model because foreign keys are a feature of the solution, not a feature of the problem.
In parallel, maintain a relational model that conforms to the requirements of your ER model and also conforms to rules of normalization, and other rules of simple sound design.
When a change comes along, first see if your ER model needs to be updated. Sometimes the answer is no. If the answer is yes, first update your ER model, then update your relational model, then update your database definitions.
This is a lot of work. But it can save you a lot of work, if the new requirements are truly crucial.

Try normalizing your data (so that you do not have redundant info) like:
Game
+---+-----------+-----------+--------------+
|ID | Home Team | Away Team | Date of Game |
+---+-----------+-----------+--------------+
Money Line
+---------+-----------+-----------+----------+
| Game_ID | Home Line | Away Line | DateTime |
+---------+-----------+-----------+----------+
Over/Under
+---------+-------+------+-------+----------+
| Game_ID | Total | Over | Under | DateTime |
+---------+-------+------+-------+----------+
You can read more on NORMALIZATION in any database design reference.
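The normalized layout above can be sketched as DDL. Here is a minimal, illustrative sketch using SQLite from Python (MySQL would use AUTO_INCREMENT and DATETIME types; the table and column names are assumptions for the example), showing that a join recovers the original flat view without storing the teams and game date three times:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game (
    id         INTEGER PRIMARY KEY,   -- AUTO_INCREMENT in MySQL
    home_team  TEXT NOT NULL,
    away_team  TEXT NOT NULL,
    game_date  TEXT NOT NULL
);
CREATE TABLE money_line (
    game_id     INTEGER NOT NULL REFERENCES game(id),
    home_line   REAL,
    away_line   REAL,
    observed_at TEXT NOT NULL          -- DateTime of the observation
);
CREATE TABLE over_under (
    game_id     INTEGER NOT NULL REFERENCES game(id),
    total       REAL,
    over        REAL,
    under       REAL,
    observed_at TEXT NOT NULL
);
""")
conn.execute("INSERT INTO game (home_team, away_team, game_date) VALUES (?, ?, ?)",
             ("Tigers", "Yankees", "2013-07-04"))
game_id = conn.execute("SELECT id FROM game").fetchone()[0]
conn.execute("INSERT INTO money_line VALUES (?, ?, ?, ?)",
             (game_id, -120, 110, "2013-07-03T12:00"))

# Joining back on game_id recovers the original flat view
row = conn.execute("""
    SELECT g.home_team, g.away_team, m.home_line
    FROM money_line m JOIN game g ON g.id = m.game_id
""").fetchone()
```

Repeated odds observations then only add rows to `money_line`, never duplicate team/date data.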


Table Schema for storing various dates of availability

First off, I am new to database design, so apologies for any incorrect terminology.
For a university assignment I have been tasked with creating the database schema for a website. In part of the website, a user selects their availability for hosting an event. The event can be at any time, so for example ranges such as 12/12/2015 - 15/12/2015 and 16/01/2016 - 22/12/2016, as well as single dates such as 05/01/2016. They also have the option of hosting the event all the time.
So I am unsure how to store all these kinds of values in a database table without using a lot of rows. The example below is a basic one that would store each date of availability, but that is a lot of records, and that is just for one event. Is there a better method of storing these values, or would this be stored elsewhere, outside of a database?
calendar_id | event_id | available_date
---------------------------------------
492 | 602 | 12/12/2015
493 | 602 | 13/12/2015
494 | 602 | 14/12/2015
495 | 602 | 15/12/2015
496 | 602 | 05/01/2016
497 | 602 | 16/01/2016
498 | 602 | 17/01/2016
etc...
This definitely requires a database. I don't think you should be concerned about the number of records in a database... that is what databases do best. However, from a university perspective there is something called Normalization. In simple terms normalization is about minimizing data repetition.
Steps to design a schema
Identify entities
As the first step of designing a database schema I tend to identify all the entities in the system. Looking at your example I see (1) Events and (2) EventTimes (event occurrences/bookings) with a one-to-many relation since one Event might have multiple EventTimes. I would suggest that you keep these two entities separate in the database. That way an Event can be extended with more attributes/fields without affecting its EventTimes. Most importantly you can add many EventTimes on an Event without repeating all the event's fields (which would be the case if you use a single table).
Identify attributes
The second step for me is to identify all the attributes/fields of each entity. Additionally, I always suggest an auto-increment id in every table to uniquely identify a row.
Identify constraints
This might be a bit more advanced, but most of the time you have constraints on what data values are acceptable, or on what uniquely identifies a row in real life. For example, Event.id might identify the row in the database, but you might also require that each event has a unique title.
Example schema
This has to be adjusted to the assignment or, in a real application, to the system's requirements
Events table
id int auto-increment
title varchar unique: Event's title
always_on boolean/enum: If 'Y' then the event is on all the time
... more fields here ... (category, tags, notes, description, venue,...)
EventTimes
id int auto-increment
event_id foreign key pointing to Event.id
start_datetime datetime or int (int if you go for a unix timestamp)
end_datetime : as above
... more fields again... (recursion below is a hard one! avoid it if you can)
recursion enum/int : Is the event repeated? Weekly, Monthly, etc.
recursion_interval int: Every x days, months, years, etc
A note on dates/times: as a rule of thumb, whenever you deal with dates and times in a database, always store them in UTC. You probably don't want/need to mess with timezones in an assignment... but keep it in mind.
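The two entities above can be sketched in code. This is a minimal, illustrative version using SQLite from Python (MySQL would use AUTO_INCREMENT and DATETIME; names are assumptions); note that storing one row per availability *range*, with a start and end, also answers the original "lot of rows" worry:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    id        INTEGER PRIMARY KEY,        -- auto-increment id
    title     TEXT NOT NULL UNIQUE,       -- unique-title constraint
    always_on INTEGER NOT NULL DEFAULT 0  -- 1 = event is on all the time
);
CREATE TABLE event_times (
    id             INTEGER PRIMARY KEY,
    event_id       INTEGER NOT NULL REFERENCES events(id),
    start_datetime TEXT NOT NULL,         -- store UTC
    end_datetime   TEXT NOT NULL
);
""")
conn.execute("INSERT INTO events (title) VALUES ('Hosting slot')")
event_id = conn.execute(
    "SELECT id FROM events WHERE title = 'Hosting slot'").fetchone()[0]

# One row per availability range, not one row per day;
# a single date is a range whose start equals its end.
for start, end in [("2015-12-12", "2015-12-15"),
                   ("2016-01-05", "2016-01-05"),
                   ("2016-01-16", "2016-12-22")]:
    conn.execute(
        "INSERT INTO event_times (event_id, start_datetime, end_datetime) "
        "VALUES (?, ?, ?)", (event_id, start, end))

# Is the event available on a given date?
available = conn.execute("""
    SELECT COUNT(*) FROM event_times
    WHERE event_id = ? AND ? BETWEEN start_datetime AND end_datetime
""", (event_id, "2015-12-13")).fetchone()[0] > 0
```

The availability check in the application would also consult `always_on` first and skip the range query entirely when it is set.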
Possible extensions to the example
Designing a complete system, one might add tables: Venues, Organizers, Locations, etc... this can go on forever! I do try to think of future requirements when designing, but do not overdo it, because you end up with a lot of fields that you don't use and increased complexity.
Conclusion
Normalization is something you have to keep in mind when designing a database; however, you can see that the more you normalize your schema, the more complex your selects and joins become. There is a trade-off there between data efficiency and query efficiency... That is the reason I said "from a university perspective" earlier. In a real-life system with complex data structures (for example, graphs!) you might need to under-normalize the tables to make your queries more efficient/faster or easier. There are other approaches to deal with such issues (functions in the database, temporary/staging tables, views, etc.), but it always depends on the specific case.
Another really useful thing to keep in mind is: requirements always change! Design your databases taking for granted that fields will be added/removed, more tables will be added, new constraints will appear, etc., and thus make the schema as extensible and easy to modify as possible... (now we are scratching a bit at "Agile" methodologies)
I hope this helps and does not confuse things more. I am not a DBA per se, but I have designed a few schemas. All the above comes from experience rather than a book, and it may not be 100% accurate. It is definitely not the only way to design a database... this job is kind of an art :)

What is the optimal way to lay out stock tables in MySQL?

I am building a less-than-traditional stock system to power a browser/mobile game. The basic principle is that a building has stock of certain resources. Each building has an hourly production cycle that consumes imports and produces exports. Production is based on a structure (the type of building) and that building's level and capacity.
My dilemma is how to lay out these stock tables in a scalable way. I am able to build tables so that each column is a resource. Example:
building_id | structure_id | energy | food | water
--------------------------------------------------
1 | 1 | 459 | 19 | 0
The benefit of this method is that I can write a few handy views and events and power this logic completely from MySQL. I can fire one big UPDATE statement every hour to transact production.
The downside of this method is that I have to write each resource as a column, and every resource will be present as a column on the relevant tables in the database. I am projecting only 150 or so resources.
The other option I have been playing with is building this like a basic inventory system. So, having a stock table that looks like this:
stock_id | building_id | resource_id | qty
-------------------------------------------
1 | 1 | 3 | 19
4 | 1 | 2 | 0
5 | 1 | 1 | 459
The benefit of this method is scalability in code, allowing easy entry of new resources to enhance gameplay.
The downside of this method is that I will have to do multiple UPDATE and SELECT statements for one building's production, and for every building. I plan to have a server limit of 250k buildings, so this can become taxing.
All in all, I am looking for the best way of doing this. I will have a finite set of resources, and I have the ability to use query-building code to create upgrade classes that handle adding a resource. But that also becomes a large body of code just to build the database.
Anyone have any thoughts on this?
Edit:
I am adding how the production sequence works, for clarity.
The building has to check what it needs to import from stock and how much space that will free up in capacity.
The building has to check what it needs to export into stock and how much space that will take up in capacity.
Building imports and exports come from the structures table and are multiplied by the building's level.
If we do not exceed capacity and we have all the needed resources, the building will transform the stock.
Right now this all runs correctly from one single UPDATE statement on all buildings, and does it quite quickly (not tested on sets larger than 100 yet). But this is based on the design with each resource as a column. I can achieve the same structure I have now with proper inventory-system-style tables, but I would need 150 LEFT JOINs (there are 150 resources).
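For what it's worth, the row-per-resource design does not actually require 150 LEFT JOINs to get a flat view: conditional aggregation pivots the inventory table in a single pass over the data. A sketch (SQLite from Python here; the same SQL works in MySQL, and table/column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock (
    building_id INTEGER NOT NULL,
    resource_id INTEGER NOT NULL,
    qty         INTEGER NOT NULL,
    PRIMARY KEY (building_id, resource_id)
);
INSERT INTO stock VALUES (1, 1, 459), (1, 2, 0), (1, 3, 19);
""")

# One aggregate expression per resource instead of one join per resource;
# the table is scanned once, grouped by building.
row = conn.execute("""
    SELECT building_id,
           SUM(CASE WHEN resource_id = 1 THEN qty ELSE 0 END) AS energy,
           SUM(CASE WHEN resource_id = 2 THEN qty ELSE 0 END) AS food,
           SUM(CASE WHEN resource_id = 3 THEN qty ELSE 0 END) AS water
    FROM stock
    GROUP BY building_id
""").fetchone()
```

The pivot query itself can be generated by code from the resource list, so adding a resource means adding one row to a resources table plus regenerating one view, not altering every table.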
Ditch the 150-resource-columns notion. Force joins to behave with index hints after an ANALYZE TABLE call.
Verify the plan with EXPLAIN. Make calls through stored procedures.
I realize this is a game you are constructing. I did a large-map-gameplay MMOG with similar structure/item/state data. The data layer was highly optimized, or else it would have ruined the user experience. Lots of memcache.
Data is only important as needed. You do not approach a building and fetch every attribute about it. Why is that?
1) Not needed now. Who cares that the antenna is blown? It is irrelevant; you are 90 feet from water, how would you use it anyway?
2) Slow.
3) Becomes stale.
That is all pull technology: the client manually pulls it.
As for push from the server (we had the benefit of an open socket), these are critical and need to be near real-time (<80 ms):
1) Player positions and how they are equipped.
2) Base status. This is important: what is where, and its state, in the base. This is constantly grabbed by users from mini-maps.
3) Your player, stats in particular, partly to prevent hacks.
These pushes, 90% of the time, resided in memcache, in the structure most friendly to the client side. You cannot get anywhere near this performance otherwise.
Push also covers stuff not in memcache that is happening right in front of the player's face, or behind it, like getting shot in the head. Naturally the player isn't pulling that; it occurs independently of walking or zooming.
Obviously a row with all the info and no joins is nice. It wasn't suitable for us.

How to reference groups of records in relational databases?

Humans
| HumanID | FirstName | LastName | Gender |
|---------+-----------+----------+--------|
| 1 | Isaac | Newton | M |
| 2 | Marie | Curie | F |
| 3 | Tim | Duncan | M |
Animals
| AnimalID | Species | NickName |
|----------+---------+----------|
| 4 | Tiger | Ronnie |
| 5 | Dog | Snoopy |
| 6 | Dog | Bear |
| 7 | Cat | Sleepy |
How do I reference a group of records in other tables?
For example:
Foods
| FoodID | FoodName | EatenBy |
|--------+----------+---------|
| 8 | Rice | ??? |
What I want to store in EatenBy may be:
a single record in the Humans and Animals tables (e.g. Tim Duncan)
a group of records in a table (e.g. all dogs, all males, all females)
a whole table (e.g. all humans)
A simple solution is to use a concatenated string, which includes primary keys
from different tables and special strings such as 'Humans' and 'M'.
The application could parse the concatenated string.
Foods
| FoodID | FoodName | EatenBy |
|--------+----------+--------------|
| 8 | Rice | Humans, 6, 7 |
Using a concatenated string is a bad idea from the perspective of
relational database design.
Another option is to add another table and use a foreign key.
Foods
| FoodID | FoodName |
|--------+----------|
| 8 | Rice |
EatenBy
| FoodID | EatenBy |
|--------+---------|
| 8 | Humans |
| 8 | 6 |
| 8 | 7 |
It's better than the first solution. The problem is that the EatenBy field stores values of different meanings. Is that a problem? How do I model this requirement? How do I achieve 3NF?
The example tables here are a bit contrived, but I do run into situations like
this at work. I have seen quite a few tables just use a concatenated string. I think it is bad but can't think of a more relational way.
This Answer is laid out in chronological order. The Question progressed in terms of detail, noted as Updates. There is a series of matching Responses.
The progression from the initial question to the final answer stands as a learning experience, especially for OO/ORM types. Major headings mark Responses, minor headings mark subjects.
The Answer exceeds the maximum post length; I provide the later Responses as links in order to overcome that.
Response to Initial Question
You might have seen something like that at work, but that doesn't mean it was right, or acceptable. CSVs break 2NF. You can't search that field easily. You can't update that field easily. You have to manage the content (eg. avoid duplicates; ordering) manually, via code. You don't have a database or anything resembling one, you have a grand Record Filing System that you have to write mountains of code to "process". Just like the bad old days of the 1970's ISAM data processing.
The problem is, that you seem to want a relational database. Perhaps you have heard of the data integrity, the relational power (Join power for you, at this stage), and speed. A Record Filing System has none of that.
If you want a Relational database, then you are going to have to:
think about the data relationally, and apply Relational Database Methods, such as modelling the data, as data, and nothing but data (not as data values).
Then classifying the data (no relation whatever to the OO class or classifier concept).
Then relating the classified data.
The second problem is, and this is typical of OO types, they concentrate on, obsess on, the data values, rather than on the meaning of the data; how it is classified; how it relates to other data; etc.
No question, you did not think that concept up yourself; your "teachers" fed it to you. I see it all the time. And they love the Record Filing Systems. Notice that instead of giving table definitions, you state that you give "structure", but you actually list data values.
In case you don't appreciate what I am saying, let me assure you that this is a classic problem in the OO world, and the solution is easy, if you apply the principles. Otherwise it is an endless mess in the OO stack. Recently I completely eliminated an OO proposal + solution that a very well known mathematician, who supports the OO monolith, proposed. It is a famous paper.
I relationalised the data (ie. I simply placed the data in the Relational context: modelled and Normalised it, which took a grand total of ten minutes), and the problem disappeared, the proposal + solution was not required. Read the Hidders Response. Note, I was not attempting to destroy the paper, I was trying to understand the data, which was presented in schizophrenic form, and the easiest way to do that is to erect a Relational data model. That simple act destroyed the paper.
Please note that the link is an extract of a formal report of a paid assignment for a customer, a large Australian bank, who has kindly given me permission to publish the extract with a view to educating the public about the dangers of ignoring Relational database principles, especially by OO proponents.
The exact same process happened with a second, more famous paper Kohler Response. This response is much smaller, less formal, it was not paid work for a customer. That author was theorising about yet another abnormal "normal form".
Therefore, I would ask you to:
forget about "table structures" or definitions
forget about what you want
forget about implementation options
forget ID columns, completely and totally
forget EatenBy
think about what you have in terms of data, the meaning of the data, not as data values or example data, not as what you want to do with it
think about how that data is classified, and how it can be classified.
how the data relates to other data. (You may think that your EatenBy is that but it isn't, because the data has no organisation yet, to form relationships upon.)
If I look at my crystal ball, most of it is dark, but from the little flecks of light that I can see, it looks like you want:
Things
Groups of Things
Relationships between Things and ThingGroups
The Things are nouns, subjects. Eventually we will be doing something between those subjects; that will be verbs or action statements, which will form Predicates (First Order Logic). But not now; for now, we want only the Things.
Now if you can modify your question and tell me more about your Things, and what they mean, I can give you a complete data model.
Response to Update 1 re Hierarchy
Record IDs are Physical, Non-relational
If you want a Relational Database, you need Relational Keys, not Record IDs. Additionally, starting the Data Modelling exercise with an ID stamped on every file cripples the exercise.
Please read this Answer.
Hierarchies Exist in the Data
If you want a full discourse, please ask a new question. Here is a quick summary.
Hierarchies occur naturally in the world, they are everywhere. That results in hierarchies being implemented in many databases. The Relational Model was founded on, and is a progression of, the Hierarchical Model. It supports hierarchies brilliantly. Unfortunately the famous writers do not understand the RM, they teach only pre-1970s Record Filing Systems badged as "relational". Likewise, they do not understand hierarchies, let alone hierarchies as supported in the RM, so they suppress it.
The result of that is, the hierarchies that are everywhere, that have to be implemented, are not recognised as such, and thus they are implemented in a grossly incorrect and massively inefficient manner.
Conversely, if the hierarchy that occurs in the data that is being modelled, is modelled correctly, and implemented using genuine Relational constructs (Relational Keys, Normalisation, etc) the result is an easy-to-use and easy-to-code database, as well as being devoid of data duplication (in any form) and extremely fast. It is quite literally Relational at its best.
There are three types of Hierarchies that occur in data.
Hierarchy Formed in Sequence of Tables
This requirement, the need for Relational Keys, occurs in every database, and conversely, the lack of it cripples the database and produces a Record Filing System, with none of the integrity, power, or speed of a Relational Database.
The hierarchy is plainly visible in the form of the Relational Key, which progresses in compounding, in any sequence of tables: father, son, grandson, etc. This is essential for ordinary Relational data integrity, the kind that Hidders and 95% of the database implementations do not have.
The Hidders Response has a great example of Hierarchies:
a. that exist naturally in the data
b. that OO types are blind to [as Hidders evidently is]
c. they implement RFS with no integrity, and then they try to "fix" the problem in the object layers, adding even more complexity.
Whereas I implemented the hierarchy in a classic Relational form, and the problem disappeared entirely, eliminating the proposed "solution", the paper. Relational-isation eliminates theory.
The two hierarchies in those four tables are:
Domain::Animal::Harvest
Domain::Activity::Harvest
Note that Hidders is ignorant of the fact that the data is an hierarchy; that his RFS doesn't have integrity precisely because it is not Relational; that placing the data in the Relational context provides the very integrity he is seeking outside it; that the Relational Model eliminates all such "problems", and makes all such "solutions" laughable.
Here's another example, although the modelling is not yet complete. Please make sure to examine the Predicates, and page 2 for the actual Keys. The hierarchies are:
Subject::CategorySubject::ExaminationResult
Category::CategorySubject::ExaminationResult
Person::Registrant::Candidate::ExaminationResult
Note that last one is a progression of state of the business instrument, thus the Key does not compound.
Hierarchy of Rows within One Table
Typically a tree structure of some sort, there are literally millions of them. For any given Node, this supports a single ancestor or parent, and unlimited children. Done properly, there is no limit to the number of levels, or the height of the tree (ie. unlimited ancestor and progeny generations).
The terms ancestor and descendant used here are plain technical terms; they do not have the OO connotations and limitations.
You do need recursion in the server, in order to traverse the tree structure, so that you can write simple procs and functions that are recursive.
Here is one for Messages. Please read both the question and the Answer, and visit the linked Message Data Model. Note that the seeker did not mention Hierarchy or tree, because the knowledge of Hierarchies in Relational Databases is suppressed, but (from the comments) once he saw the Answer and the Data Model, he recognised it for the hierarchy that it is, and that it suited him perfectly. The hierarchy is:
Message::Message[Message]::Message[::Message[Message]] ...
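As noted, traversing such a tree needs recursion in the server. One illustrative way to express that traversal, in engines that support recursive CTEs (MySQL 8.0+, SQLite; the original answer predates these and used recursive procs), is sketched below. The `message` table and its columns are assumptions for the example, not the linked Message Data Model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES message(id),  -- NULL for a root message
    body      TEXT NOT NULL
);
INSERT INTO message VALUES
    (1, NULL, 'root'),
    (2, 1,    'reply'),
    (3, 2,    'reply to reply');
""")

# Recursive traversal from the root down, tracking depth at each level
rows = conn.execute("""
    WITH RECURSIVE thread(id, body, depth) AS (
        SELECT id, body, 0 FROM message WHERE parent_id IS NULL
        UNION ALL
        SELECT m.id, m.body, t.depth + 1
        FROM message m JOIN thread t ON m.parent_id = t.id
    )
    SELECT id, depth FROM thread ORDER BY depth
""").fetchall()
```

Each iteration of the recursive member joins children onto the rows produced so far, so the height of the tree is unlimited.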
Hierarchy of Rows within One Table, Via an Associative Table
This hierarchy provides an ancestor/descendant structure for multiple ancestors or parents. It requires two relationships, therefore an additional Associative Table is required. This is commonly known as the Bill of Materials structure. Unlimited height, recursively traversed.
The Bill of Materials Problem was a limitation of Hierarchical DBMS, that we overcame partially in Network DBMS. It was a burning issue at the time, and one of IBM's specific problems that Dr E F Codd was explicitly charged to overcome. Of course he met those goals, and exceeded them spectacularly.
Here is the Bill of Materials hierarchy, modelled and implemented correctly.
Please excuse the preamble, it is from an article, skip the top two rows, look at the bottom row.
Person::Progeny is also given.
The hierarchies are:
Part[Assembly]::Part[Component] ...
Part[Component]::Part[Assembly] ...
Person[Parent]::Person[Child] ...
Person[Child]::Person[Parent] ...
Ignorance Of Hierarchy
Separate to the fact that hierarchies commonly exist in the data, that they are not recognised as such, due to the suppression, and that therefore they are not implemented as hierarchies, when they are recognised, they are implemented in the most ridiculous, ham-fisted ways.
Adjacency List
The suppressors hilariously state that "the Relational Model doesn't support hierarchies", in denial that it is founded on the Hierarchical Model (each of which provides plain evidence that they are ignorant of the basic concepts in the RM, which they allege to be postulating about). So they can't use the name. This is the stupid name they use.
Generally, the implementation will have recognised that there is an hierarchy in the data, but the implementation will be very poor, limited by physical Record IDs, etc, absent of Relational Integrity, etc.
And they are clueless as to how to traverse the tree, that one needs recursion.
Nested Sets
An abortion, straight from hell. A Record Filing System within a Record Filing system. Not only does this generate masses of duplication and break Normalisation rules, this fixes the records in the filing system in concrete.
Moving a single node requires the entire affected part of the tree to be re-written. Beloved of the Date, Darwen and Celko types.
The MS HIERARCHYID Datatype does the same thing. Gives you a mass of concrete that has to be jack-hammered and poured again, every time a node changes.
Ok, it wasn't so short.
Response to Update 2
Response to Update 3
Response to Update 4
For each category that eats the food, add one table. For example, if a food may be eaten by a specific gender, you would have:
Food_Gender(FoodID, GenderID)
For individual humans:
Food_Human(FoodID, HumanID)
For animal species:
Food_AnimalSpc(FoodID, Species)
For an entire table:
Food_Table(FoodID, TableID)
And so on for other categories.

Which of these MySQL database designs (attached) is best for mostly read high performance?

I am a database admin and developer in MySQL, and have been working with MySQL for a couple of years. I recently acquired and studied O'Reilly's High Performance MySQL, 2nd Edition, to improve my skills in MySQL's advanced features, high performance, and scalability, because I have often been frustrated by the lack of advanced MySQL knowledge I had (and in large part still have).
Currently, I am working on an ambitious web project. In this project, we will have quite a lot of content and users from the beginning. I am the designer of the database, and this database must be very fast (some inserts, but mostly, and more importantly, READs).
I want to discuss these requirements here:
There will be several kinds of items.
The items have some fields and relations in common.
The items also have some special fields and relations that make them different from each other.
Those items will have to be listed all together, ordered or filtered by common fields or relations.
The items will also have to be listed by type only (for example, item_specialA).
I have some basic design doubts, and I would like you to help me decide and learn which design approach would be better for a high-performance MySQL database.
Classical approach
The following diagram shows the classical approach, which is the first you might think of with a database mindset: Database diagram
Centralized approach
But maybe we can improve on it with a pseudo-object-oriented paradigm, centralizing the common items and relations in one common item table. It would also be useful for listing all kinds of items: Database diagram
Advantages and disadvantages of each one?
Which approach would you choose, or what changes would you apply, given the requirements above?
Thanks all in advance!
Thanks all in advance!!
What you have are two distinct data-mapping strategies. What you called "classical" is "one table per concrete class" in other sources, and what you called "centralized" is "one table per class" (Mapping Objects to Relational Databases: O/R Mapping In Detail). They both have their advantages and disadvantages (follow the link above). Queries in the first strategy will be faster (you will need to join only 2 tables vs. 3 in the second strategy).
I think that you should explore classic supertype/subtype pattern. Here are some examples from the SO.
If you're looking mostly for speed, consider selective use of MyISAM tables: use a centralized "object" table and just one additional table, with correct indexes, of this form:
object_type | object_id | property_name | property_value
user | 1 | photos | true
city | 2 | photos | true
user | 5 | single | true
city | 2 | metro | true
city | 3 | population | 135000
and so on... lookups on primary keys or indexed keys (object_type, object_id, property_name) for example will be blazing fast. Also, you reduce the need to end with 457 tables as new properties appear.
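The lookup pattern on that property table can be sketched as follows (SQLite from Python; names are illustrative). The point is that the composite primary key on (object_type, object_id, property_name) turns every lookup into a single index seek:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE object_property (
    object_type    TEXT NOT NULL,
    object_id      INTEGER NOT NULL,
    property_name  TEXT NOT NULL,
    property_value TEXT,
    PRIMARY KEY (object_type, object_id, property_name)
);
INSERT INTO object_property VALUES
    ('user', 1, 'photos', 'true'),
    ('city', 2, 'photos', 'true'),
    ('city', 3, 'population', '135000');
""")

# Full key given => index seek, not a scan
value = conn.execute("""
    SELECT property_value FROM object_property
    WHERE object_type = 'city' AND object_id = 3
      AND property_name = 'population'
""").fetchone()[0]
```

The trade-off, as the answer notes, is that every value is stored untyped (here as TEXT), so the application must cast `'135000'` back to a number itself.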
It isn't exactly a well-designed or perfectly normalized database and, if you are planning for a long-term big site, you should consider caching, or at least a denormalized paradigm: denormalized MySQL tables like this one, Redis, or maybe MongoDB.

Storing multi-language geodata in MySQL

My application needs to use geodata for displaying location names. I'm very familiar with large-scale complex geodata generally (e.g. Geonames.org) but not so much with the possible MySQL implementation.
I have a custom dataset of four layers, including lat/lon data for each:
- Continents (approx 10)
- Countries (approx 200)
- Regions/States (approx 100)
- Cities (approx 10K)
From all other tables, I'm properly referencing four normalized tables of location names, allowing me to expand these separately from the rest of the data.
So far so good... in English!
However, I wish to add other languages to my application which means that some location names will also need translations (e.g. London > Londres > Londre etc). It won't be OTT, perhaps 6 languages and no more. UTF-8 will be needed.
I'll be using Symfony framework's cultures for handling interface translations, but I'm not sure how I should deal with location names, as they don't really belong in massive XML files. The solution I have in mind so far is to add a column to each of the location tables for "language" to allow the system to recognise what language the location name is in.
If anyone has experience of a cleaner solution or any good pointers, I would be grateful. Thanks.
EDIT:
After further digging, found a symfony-assisted solution to this. In case someone finds this question, here's the reference: http://www.symfony-project.org/book/1_0/13-I18n-and-L10n
I'm not familiar with what Symfony has to offer in specific functions in that department. But for a framework-independent approach, how about having one database column in the locality table holding the default name for quick lookup (depending on your preference, the English name of the locality (Copenhagen) or the local name (København)), and a 1:n translation table for the rest, linked to each locality:
locality ID | language (ISO 639-1) | UTF-8 translation
12345 | fin | Kööpenhamina
12345 | fra | Copenhague
12345 | eng | Copenhagen
12345 | dan | København
That would leave your options open to add unlimited languages, and may be easier to maintain than having to add a column to each table whenever a new language comes up.
But of course, the multi-column approach is way easier to query programmatically, and no table relations are needed. If the number of languages is very likely to remain limited, I would probably tend towards that out of sheer laziness. :)
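The translation-table approach queries cleanly with a fallback to the default name. A minimal sketch (SQLite from Python; table and column names are assumptions), using a LEFT JOIN plus COALESCE so a missing translation falls back rather than returning no row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locality (
    id           INTEGER PRIMARY KEY,
    default_name TEXT NOT NULL
);
CREATE TABLE locality_name (
    locality_id INTEGER NOT NULL REFERENCES locality(id),
    language    TEXT NOT NULL,   -- ISO 639 code
    name        TEXT NOT NULL,   -- UTF-8 translation
    PRIMARY KEY (locality_id, language)
);
INSERT INTO locality VALUES (12345, 'Copenhagen');
INSERT INTO locality_name VALUES
    (12345, 'fra', 'Copenhague'),
    (12345, 'dan', 'København');
""")

def localized_name(conn, locality_id, lang):
    # Fall back to the default name when no translation exists
    row = conn.execute("""
        SELECT COALESCE(n.name, l.default_name)
        FROM locality l
        LEFT JOIN locality_name n
               ON n.locality_id = l.id AND n.language = ?
        WHERE l.id = ?
    """, (lang, locality_id)).fetchone()
    return row[0]
```

Note the language condition lives in the ON clause, not the WHERE clause; moving it to WHERE would silently turn the LEFT JOIN into an inner join and break the fallback.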