How important is database normalization in a very simple database? - mysql

I am making a very simple database (mysql) with essentially two types of data, always with a 1 to 1 relationship:
Events: Sponsor, Time (Optional), Location (City, State), Venue (Optional), Details URL
Sponsors: Name, URL
Cities will be duplicated often, but is there really much value in having a cities table for such a simple database schema?
The database is populated by screen-scraping a website. On this site the city field is populated by selecting from a dropdown, so there will be no typos, and it would be easy to match the records up with a city table. I'm just not sure there would be much of a point, even if the users of my database will be searching by city frequently.

Normalize the database now.
It's a lot easier to optimize queries on normalized data than it is to normalize a pile of data.
You say it's simple now - these things have a tendency to grow. Design it right and you'll get the experience of proper design and some future proofing.

I think you are looking at things the wrong way - you should always normalize unless you have a good reason not to.
Trusting your application to maintain data integrity is a needless risk. You say the data is made uniform because it is selected from a dropdown. What if someone hacks on the form and modifies the data, or if your code inadvertently allows a querystring param with the same name?

Where will the city data come from that populates your dropdown box for the user? Wouldn't you want a table for that?
It looks like you are treating Location as one attribute including city and state. Suppose you want to sort or analyse events by state alone rather than city and state? That could be hard to do if you don't have an attribute for state. Logically I would expect state to belong in a city table - although that may depend on exactly how you want to identify cities.

Direct answer: Just because a problem is relatively simple is no reason to not do things to keep it simple. It's a lot easier to walk on my feet than on my hands. I don't recall ever saying, "Oh, I only have to go half a mile, that's a short distance so I might as well walk on my hands."
Longer answer: If you don't keep any information about a city other than its name, and you don't have a pre-set list of cities (e.g. to build a drop-down), then your schema is already normalized. What would be in a City table other than the city name? (I presume State cannot be dependent on City because you could have two cities with the same name in different states, e.g. Dayton OH and Dayton TN.) The relevant rule of normalization is "no non-key dependencies", that is, you cannot have data that depends on data that is not a key. If you had, say, latitude and longitude of each city, then this data would be repeated in every record that referenced the same city. In that case you would certainly want to break out a separate city table to hold the latitude and longitude. You could, of course, create a "city code" that is an integer or abbreviation that links to a city table. But if there's no other data about a city, I don't see how this gains anything.
Technically, I would assume that City depends on Venue. If the venue is "Rockefeller Center", that implies that the city must be New York. But if venue is optional, this creates problems. One possibility is to have a Venue table that lists venue name, city, and state, and for cases where you don't specify a venue, have an "unspecified" venue for each city. This would be more textbook correct, but in practice, if in most cases you do not specify a venue, it would gain little. If most of the time you DO specify a venue, it would probably be a good idea.
Oh, and, is there really a 1:1 relation between event and sponsor? I can believe that an event cannot have more than one sponsor. (In real life, there are plenty of events with multiple sponsors, but maybe for your purposes you only care about a "primary sponsor" or some such.) But does a sponsor never hold more than one event? That seems unlikely.
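To make the points above concrete, here is a minimal sketch of one way the two tables could look. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table names, column names and sample rows are my own assumptions, not the asker's actual schema. City and state are separate columns, and a sponsor may have many events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE sponsor (
    sponsor_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    url        TEXT
);
CREATE TABLE event (
    event_id    INTEGER PRIMARY KEY,
    sponsor_id  INTEGER NOT NULL REFERENCES sponsor(sponsor_id),
    event_time  TEXT,            -- optional
    city        TEXT NOT NULL,
    state       TEXT NOT NULL,   -- kept apart from city so it can be queried alone
    venue       TEXT,            -- optional
    details_url TEXT NOT NULL
);
""")

# Hypothetical sample data: one sponsor holding several events.
conn.execute(
    "INSERT INTO sponsor (sponsor_id, name, url) VALUES (1, 'Acme', 'http://acme.example')"
)
conn.executemany(
    "INSERT INTO event (sponsor_id, city, state, details_url) VALUES (?, ?, ?, ?)",
    [
        (1, "Dayton", "OH", "http://example.com/e/1"),
        (1, "Dayton", "TN", "http://example.com/e/2"),
        (1, "Columbus", "OH", "http://example.com/e/3"),
    ],
)

# Analysing by state alone is trivial once state is its own attribute.
events_per_state = conn.execute(
    "SELECT state, COUNT(*) FROM event GROUP BY state ORDER BY state"
).fetchall()
print(events_per_state)  # [('OH', 2), ('TN', 1)]
```

Note that the two Daytons stay distinct because the state is stored next to the city, and nothing here needs a separate city table until there is more to say about a city than its name.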

Why not go ahead and normalize? You write as if there are significant costs of normalizing that outweigh the benefits. It's easier to set it up in a normal form before you populate it than to try and normalize it later.
Also, I wonder about your 1-to-1 relationship. Naively, I would imagine that an event might have multiple sponsors, or that a sponsor might be involved in more than one event. But I don't know your business logic...
ETA:
I don't know why I didn't notice this before, but if you are really averse to normalizing your database, and you know that you will always have a 1-to-1 relationship between the events and sponsors, then why would you have the sponsors in a separate table?
It sounds like you may be a little confused about what normalization is and why you would do it.

The answer hinges, IMO, on whether you want to prevent errors during data-entry. If you do, you will need a VENUES table:
VENUES
City
State
VenueName
as well as a CITIES and STATES table. (Note: I've seen situations where the same city occurs multiple times in the same state, usually smaller towns, so CITY/STATE do not comprise a unique dyad. Normally there's a zipcode to disambiguate.)
To prevent situations where the data-entry operator enters a venue for NY NY which is actually in SF CA, you'd need to validate the venue entry to see if such a venue exists in the city/state supplied on the record.
Then you'd need to make CITY/STATE mandatory, and have to write code to rollback the transaction and handle the error.
If you are not concerned about enforcing this sort of accuracy, then you don't really need to have CITY and STATES tables either.
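As a rough illustration of the validation described above, the sketch below lets the database reject a venue that does not exist in the given city/state. It uses Python's sqlite3 module as a stand-in for MySQL; the table layout and the add_event helper are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE venue (
    city       TEXT NOT NULL,
    state      TEXT NOT NULL,
    venue_name TEXT NOT NULL,
    PRIMARY KEY (city, state, venue_name)
);
CREATE TABLE event (
    event_id   INTEGER PRIMARY KEY,
    city       TEXT NOT NULL,   -- mandatory
    state      TEXT NOT NULL,   -- mandatory
    venue_name TEXT,            -- optional: a NULL here skips the foreign-key check
    FOREIGN KEY (city, state, venue_name)
        REFERENCES venue (city, state, venue_name)
);
INSERT INTO venue VALUES ('New York', 'NY', 'Rockefeller Center');
""")

def add_event(city, state, venue_name):
    """Hypothetical helper: insert an event, rolling back if the venue is invalid."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO event (city, state, venue_name) VALUES (?, ?, ?)",
                (city, state, venue_name),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = add_event("New York", "NY", "Rockefeller Center")            # venue exists there
wrong_city = add_event("San Francisco", "CA", "Rockefeller Center")  # rejected
no_venue = add_event("San Francisco", "CA", None)                 # venue left out
print(ok, wrong_city, no_venue)  # True False True
```

The composite foreign key does the checking, so the application code only has to handle the error.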

If you are interested in learning about normalization, you should learn what happens when you don't normalize. For each normal form (beyond 1NF) there is an update anomaly that will occur as a consequence of harmful redundancy.
Often it's possible to program around the update anomalies, and sometimes that's more practical than always normalizing to the ultimate degree.
Sometimes, it's possible for a database to get into an inconsistent state due to failure to normalize, and failure to program the application to compensate.
In your example, the best I can come up with is a sort of lame hypothetical. What if the name of a city got misspelled in one row, but spelled correctly in all the others? What if you summarized by city and sponsor? Your output would reflect the error, and divide one group into two groups. Maybe it would be better if the city were only spelled out once in the database, for better or for worse. At least the grouping for the summary would be correct, even if the name were misspelled.
Is this worth normalizing for? Hey, it's your project, not mine. You decide.
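Here is a small sketch of that hypothetical, using Python's sqlite3 module as a stand-in for MySQL. The table names and rows are invented for the illustration. In the flat table one misspelled row splits the summary into two groups; in the normalized layout the city is spelled once, so the grouping stays intact even though the name is wrong.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Flat layout: the city name is repeated in every row.
CREATE TABLE event_flat (event_id INTEGER PRIMARY KEY, sponsor TEXT, city TEXT);
INSERT INTO event_flat (sponsor, city) VALUES
    ('Acme', 'Cincinnati'),
    ('Acme', 'Cincinnati'),
    ('Acme', 'Cincinatti');   -- misspelled in one row only

-- Normalized layout: the city name is stored exactly once.
CREATE TABLE city (city_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE event_norm (
    event_id INTEGER PRIMARY KEY,
    sponsor  TEXT,
    city_id  INTEGER NOT NULL REFERENCES city(city_id)
);
INSERT INTO city (city_id, name) VALUES (1, 'Cincinatti');  -- spelled once, even if wrongly
INSERT INTO event_norm (sponsor, city_id) VALUES ('Acme', 1), ('Acme', 1), ('Acme', 1);
""")

flat_groups = conn.execute(
    "SELECT city, sponsor, COUNT(*) FROM event_flat GROUP BY city, sponsor ORDER BY city"
).fetchall()
norm_groups = conn.execute("""
    SELECT c.name, e.sponsor, COUNT(*)
    FROM event_norm AS e
    JOIN city AS c ON c.city_id = e.city_id
    GROUP BY c.name, e.sponsor
""").fetchall()

print(flat_groups)  # [('Cincinatti', 'Acme', 1), ('Cincinnati', 'Acme', 2)]
print(norm_groups)  # [('Cincinatti', 'Acme', 3)]
```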

Related

Is there a more efficient way to handle multi-valued attributes other than creating a relationship table?

I have three tables, tbl_school, tbl_courses and tbl_branches.
Each course can be taught in one or more branches of a school.
tbl_school has got:
id
school_name
total_branches
...
tbl_courses:
id
school_id
course_title
....
tbl_branches:
id
school_id
city
area
address
When I want to list all the branches of a school, it is a pretty straight forward JOIN.
However, each course will be taught in one or more branches or all the branches of the school and I need to store this information. Since there is a one-to-many relationship between tbl_courses and tbl_branches, I will have to create a new relationship table that maps each course record to its respective branches.
When my users want to filter a course by city or area, this relationship table will be used.
I would like to know if this is the right approach or is there something better for my problem?
I was planning to store a JSON of branches of courses which would eliminate the relationship table and query would be much easier to find the city or area pattern in JSON string.
I am new to design patterns so kindly bear with me.
Issues
The table description you have given has a few errors, which need to be corrected first, after which my proposal will make more sense.
The use of a table prefix, especially tbl_, is incorrect. All the tables are tbl_s. If you do use a prefix, it is to group tables by Subject Area. Further, SQL allows a table qualifier when referring to any table in the code:
`... WHERE table_name.column_name = "something" ...`
If you would like some advice re Naming Convention, please review this Answer.
Use singular, because the table name is supposed to refer to a row (relation), not to the content (we know it contains many rows). Then all the English used re the table_name makes sense. (Eg. refer my Predicates.)
You have some incorrect or extraneous columns. It is easier to give you a Data Model, than to explain each item. A couple of items do need explanation:
school.total_branches is a duplicate, because that value can easily be derived (by COUNT() of the Branches). It breaks Normalisation rules, and introduces an Update Anomaly, which can get "out of synch".
course.school_id is incorrect, given that each Branch may or may not teach a Course. That relation is 1 Course to many Branches, it should be in the new table you are contemplating.
By JSON, if you mean construct an array on the client instead of keeping the relations in the database, then no, definitely not. Data and relationships to data, should be implemented in the database. For many reasons, the most important of which is Integrity. Following that, you may easily drag it into the client, and keep it there for stream-performance purposes.
The table you are thinking about is an Associative Table, an ordinary Relational construct to relate ("map", "link") two parent tables, here Courses to Branches.
Data duplication is not prevented. Refer to the Keys in the Data Model.
ID columns you have do not provide row uniqueness, which the Relational Model demands. If that is not clear to you please read this Answer.
Solution
Here is the model.
Proposed School Data Model
Please review and comment.
I need to ensure that you understand the notation in IDEF1X models, that unlike non-standard diagrams: every little notch, tick and line means something very specific. If not, please go to the IDEF1X Notation link at the bottom right of the model.
Please check the Predicates carefully, they (a) explain the model, and (b) are used to verify it. It is a feedback loop. They have two separate benefits.
If you would like more information on Predicates, why they are relevant, please go to this Answer and read the Predicate section.
If you wish to thoroughly understand Predicates, with a view to understanding Data Modelling, consider that Data Model (latest version is linked at the top of the Answer) against those Predicates. Ie. see if you understand a database that you have never seen before, via the model plus Predicates.
The Relational Keys I have given provide the row uniqueness that is required for Relational databases, duplicate data must be prevented. Note that ID columns are simply not needed. The Relational Keys provide:
Data Integrity
Relational access to data (notice the ease of, and unlimited, joins)
Relational speed
None of which a Record Filing System (characterised by ID columns) has.
Column description:
I have implemented two address_lines. Obviously, that should not include city because that is a separate column.
I presume area means something like borough or county or the area that the school branch operates in. If it is a fixed geographic administrative region (my first two descriptors) then it requires a formal structure. If not (my third descriptor), ie. it is loose, or (eg) it spans counties, then a simple Lookup table is enough.
If you use formal administrative regions, then city must move into that structure.
Your approach with an additional table seems the simplest and most straightforward to me. I would not mix JSON in this.
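For what it's worth, a minimal sketch of the additional (associative) table might look like the following. It uses Python's sqlite3 module as a stand-in for MySQL and keeps the asker's id columns for simplicity (the longer answer above argues for natural keys instead). The names course_branch and courses_in_city, the dropped tbl_ prefix, and the sample rows are all assumptions for illustration. total_branches is left out and derived with COUNT instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE school (id INTEGER PRIMARY KEY, school_name TEXT NOT NULL);
CREATE TABLE branch (
    id        INTEGER PRIMARY KEY,
    school_id INTEGER NOT NULL REFERENCES school(id),
    city      TEXT NOT NULL,
    area      TEXT,
    address   TEXT
);
CREATE TABLE course (
    id           INTEGER PRIMARY KEY,
    school_id    INTEGER NOT NULL REFERENCES school(id),
    course_title TEXT NOT NULL
);
-- Associative table: one row per (course, branch) pair. The composite
-- primary key stops the same pair from being stored twice.
CREATE TABLE course_branch (
    course_id INTEGER NOT NULL REFERENCES course(id),
    branch_id INTEGER NOT NULL REFERENCES branch(id),
    PRIMARY KEY (course_id, branch_id)
);
INSERT INTO school VALUES (1, 'Alpha Academy');
INSERT INTO branch VALUES (1, 1, 'Pune', 'Kothrud', '1 Main Rd'),
                          (2, 1, 'Mumbai', 'Andheri', '2 Station Rd');
INSERT INTO course VALUES (1, 1, 'Algebra'), (2, 1, 'Biology');
INSERT INTO course_branch VALUES (1, 1), (1, 2), (2, 2);
""")

def courses_in_city(city):
    """Return the titles of courses taught in at least one branch in the city."""
    rows = conn.execute("""
        SELECT DISTINCT c.course_title
        FROM course AS c
        JOIN course_branch AS cb ON cb.course_id = c.id
        JOIN branch AS b ON b.id = cb.branch_id
        WHERE b.city = ?
        ORDER BY c.course_title
    """, (city,)).fetchall()
    return [r[0] for r in rows]

# The branch total is derived on demand rather than stored in school.
branch_count = conn.execute(
    "SELECT COUNT(*) FROM branch WHERE school_id = 1"
).fetchone()[0]

print(courses_in_city("Pune"))    # ['Algebra']
print(courses_in_city("Mumbai"))  # ['Algebra', 'Biology']
print(branch_count)               # 2
```

Filtering by city or area is an ordinary indexed join here, whereas a JSON string would have to be pattern-matched row by row.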

Restructure Inventory Management Database (2 to 3 Tables; Development Stage)

I’m developing a database. I’d appreciate some help restructuring 2 to 3 tables so the database is both compliant with the first 3 normal forms; and practical to use and to expand on / add to in the future. I want to invest time now to reduce effort / and confusion later.
PREAMBLE
Please be aware that I'm both a newbie and an amateur, though I have a certain amount of experience and skill and an abundance of enthusiasm!
BACKGROUND TO PROJECT
I am writing a small (though ambitious!) web application (using PHP and AJAX to a MySQL database). It is essentially an inventory management system, for recording and viewing the current location of each individual piece of equipment, and its maintenance history. If relevant, transactions will be very low (probably less than 100 a day, but with a possibility of simultaneous connections / operations). Row count will also be very low (maybe a few thousand).
It will deal with many completely different categories of equipment, eg bikes and lamps (to take random examples). Each unit of equipment will have its details or specifications recorded in the database. For a bike, an important specification might be frame colour, whereas a lamp it might require information regarding lampshade material.
Since the categories of equipment have so little in common, I think the most logical way to store the information is 1 table per category. That way, each category can have columns specific to that category.
I intend to store a list of categories in a separate table. Each category will have an id which is unique to that category. (Depending on the final design, this may function as a lookup table and / or as a table to run queries against.) There are likely to be very few categories (perhaps 10 to 20), unless the system is particularly successful and it expands.
A list of bikes will be held in the bikes table.
Each bike will have an id which is unique to that bike (eg bike 0001).
But the same id will exist in the lamp table (ie lamp 0001).
With my application, I want the user to select (from a dropdown list) the category type (eg bike).
They will then enter the object's numeric id (eg 0001).
The combination of these two ids is sufficient information to uniquely identify an object.
Images:
Current Table Design
Proposed Additional Table
PROBLEM
My gut feeling is that there should be an “overarching table” that encompasses every single article of equipment no matter what category it comes from. This would be far simpler to query against than god knows how many mini tables. But when I try to construct it, it seems like it will break various normal forms. Eg introducing redundancy, possibility of inconsistency, referential integrity problems etc. It also begins to look like a domain table.
Perhaps the overarching table should be a query or view rather than an entity?
Could you please have a look at the screenshots and let me know your opinion. Thanks.
For various reasons, I’d prefer to use surrogate keys rather than natural keys if possible. Ideally, I’d prefer to have that surrogate key in a single column.
Currently, the bike (or lamp) table uses just the first column as its primary key. Should I expand this to a composite key including the Equipment_Category_ID column too? Then make the Equipment_Article table into a view joining on these two columns (iteratively for each equipment category). Optionally Bike_ID and Lamp_ID columns could be renamed to something generic like Equipment_Article_ID. This might make the query simpler, but is there a risk of losing specificity? It would / could still be qualified by the table name.
Speaking of redundancy, the Equipment_Category_ID in the current lamp or bike tables seems a bit redundant (if every item / row in that table has the same value in that column).
It all still sounds messy! But surely this must be very common problem for eg online electronics stores, rental shops, etc. Hopefully someone will say oh that old chestnut! Fingers crossed! Sorry for not being concise, but I couldn't work out what bits to leave out. Most of it seems relevant, if a bit chatty. Thanks in advance.
UPDATE 27/03/2014 (Reply to #ElliotSchmelliot)
Hi Elliot.
Thanks for your reply and for pointing me in the right direction. I studied OOP (in Java) but wasn't aware that something similar was possible in SQL. I read the link you sent with interest, and the rest of the site/book looks like a great resource.
Does MySQL InnoDB Support Specialization & Generalization?
Unfortunately, after 3 hours searching and reading, I still can't find the answer to this question. Keywords I'm searching with include: MySQL + (inheritance | EER | specialization | generalization | parent | child | class | subclass). The only positive result I found is here: http://en.wikipedia.org/wiki/Enhanced_entity%E2%80%93relationship_model. It mentions MySQL Workbench.
Possible Redundancy of Equipment_Category (Table 3)
Yes and No. Because this is a lookup table, it currently has a function. However because every item in the Lamp or the Bike table is of the same category, the column itself may be redundant; and if it is then the Equipment_Category table may be redundant... unless it is required elsewhere. I had intended to use it as the RowSource / OptionList for a webform dropdown. Would it not also be handy to have Equipment_Category as a column in the proposed Equipment parent table. Without it, how would one return a list of all Equipment_Names for the Lamp category (ignoring distinct for the moment).
Implementation
I have no way of knowing what new categories of equipment may need to be added in future, so I’ll have to limit attributes included in the superclass / parent to those I am 100% sure would be common to all (or allow nulls I suppose); sacrificing duplication in many child tables for increased flexibility and hopefully simpler maintenance in the long run. This is particularly important as we will not have professional IT support for this project.
Changes really do have to be automated. So I like the idea of the stored procedure. And the CreateBike example sounds familiar (in principle if not in syntax) to creating an instance of a class in Java.
Lots to think about and to teach myself! If you have any other comments, suggestions etc, they'd be most welcome. And, could you let me know what software you used to create your UML diagram. Its styling is much better than those that I've used.
Cheers!
You sound very interested in this project, which is always awesome to see!
I have a few suggestions for your database schema:
You have individual tables for each Equipment entity i.e. Bike or Lamp. Yet you also have an Equipment_Category table, purely for identifying a row in the Bike table as a Bike or a row in the Lamp table as a Lamp. This seems a bit redundant. I would assume that each row of data in the Bike table represents a Bike, so why even bother with the category table?
You mentioned that your "gut" feeling is telling you to go for an overarching table for all Equipment. Are you familiar with the practice of generalization and specialization in database design? What you are looking for here is specialization (also called "top-down".) I think it would be a great idea to have an overarching or "parent" table that represents Equipment. Then, each sub-entity such as Bike or Lamp would be a child table of Equipment. A parent table only has the fields that all child tables share.
With these suggestions in mind, here is how I might alter your schema:
In the above schema, everything starts as Equipment. However, each Equipment can be specialized into Lamp, Bike, etc. The Equipment entity has all of the common fields. Lamp and Bike each have fields specific to their own type. When creating an entity, you first create the Equipment, then you create the specialized entity. For example, say we are adding the "BMX 200 Ultra" bike. We first create a record in the Equipment table with the generic information (equipmentName, dateOfPurchase, etc.) Then we create the specialized record, in this case a Bike record with any additional bike-specific fields (wheelType, frameColor, etc.) When creating the specialized entities, we need to make sure to link them back to the parent. This is why both the Lamp and Bike entities have a foreign key for equipmentID.
An easy and effective way to add specialized entities is to create a stored procedure. For example, let's say we have a stored procedure called CreateBike that takes in parameters bikeName, dateOfPurchase, wheelType, and frameColor. The stored procedure knows we are creating a Bike, and therefore can easily create the Equipment record, insert the generic equipment data, create the bike record, insert the specialized bike data, and maintain the foreign key relationship.
Using specialization will make your transactional life very simple. For example, if you want all Equipment purchased before 1/1/14, no joins are needed. If you want all Bikes with a frameColor of blue, no joins are needed. If you want all Lamps made of felt, no joins are needed. The only time you will need to join a specialized table back to the Equipment table is if you want data both from the parent entity and the specialized entity. For example, show all Lamps that use 100 Watt bulbs and are named "Super Lamp."
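The parent/child layout and the CreateBike idea can be sketched roughly as follows. This is not the stored procedure itself: it uses Python's sqlite3 module as a stand-in (SQLite has no stored procedures), so create_bike is a plain function wrapping both inserts in one transaction. Column names follow the ones mentioned above; the bikeID column, the second bike and the dates are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE equipment (            -- parent: fields shared by every category
    equipmentID    INTEGER PRIMARY KEY,
    equipmentName  TEXT NOT NULL,
    dateOfPurchase TEXT NOT NULL    -- ISO dates so string comparison sorts correctly
);
CREATE TABLE bike (                 -- child: bike-specific fields only
    bikeID      INTEGER PRIMARY KEY,
    equipmentID INTEGER NOT NULL UNIQUE REFERENCES equipment(equipmentID),
    wheelType   TEXT,
    frameColor  TEXT
);
""")

def create_bike(name, date_of_purchase, wheel_type, frame_color):
    """Create the Equipment row, then the Bike row linked back to it."""
    with conn:  # both inserts commit together or roll back together
        cur = conn.execute(
            "INSERT INTO equipment (equipmentName, dateOfPurchase) VALUES (?, ?)",
            (name, date_of_purchase),
        )
        equipment_id = cur.lastrowid
        conn.execute(
            "INSERT INTO bike (equipmentID, wheelType, frameColor) VALUES (?, ?, ?)",
            (equipment_id, wheel_type, frame_color),
        )
    return equipment_id

create_bike("BMX 200 Ultra", "2013-06-15", "20 inch", "blue")
create_bike("Road King", "2014-02-01", "700c", "red")

# Parent-only question: no join needed.
old_equipment = [r[0] for r in conn.execute(
    "SELECT equipmentName FROM equipment WHERE dateOfPurchase < '2014-01-01'"
)]
# Child-only question: no join needed.
blue_bikes = conn.execute(
    "SELECT COUNT(*) FROM bike WHERE frameColor = 'blue'"
).fetchone()[0]
# Needs data from both levels, so this one joins.
blue_bike_names = [r[0] for r in conn.execute("""
    SELECT e.equipmentName
    FROM bike AS b
    JOIN equipment AS e ON e.equipmentID = b.equipmentID
    WHERE b.frameColor = 'blue'
""")]

print(old_equipment, blue_bikes, blue_bike_names)
# ['BMX 200 Ultra'] 1 ['BMX 200 Ultra']
```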
Hope this helps and best of luck!
Edit
Specialization and Generalization, as mentioned in your provided source, is part of an Enhanced Entity Relationship (EER) which helps define a conceptual data model for your schema. As such, it does not need to be "supported" per se, it is more of a design technique. Therefore any database schema naturally supports specialization and generalization as long as the designer implements it.
As far as your Equipment_Category table goes, I see where you are coming from. It would indeed make it easy to have a dropdown of all categories. However, you could simply have a static table (only contains Strings that represent each category) to help with this population, and still keep your Equipment tables separate. You mentioned there will only be around 10-20 categories, so I see no reason to have a bridge between Equipment and Equipment_Category. The fewer joins the better. Another option would be to include an "equipmentCategory" field in the Equipment table instead of having a whole table for it. Then you could simply query for all unique equipmentCategory values.
I agree that you will want to keep your Equipment table to guaranteed common values between all children. Definitely. If things get too complicated and you need more defined entities, you could always break entities up again down the road. For example maybe half of your Bike entities are RoadBikes and the other half are MountainBikes. You could always continue the specialization break down to better get at those unique fields.
Stored Procedures are great for automating common queries. On top of that, parametrization provides an extra level of defense against security threats such as SQL injections.
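To show what parametrization buys you, here is a small sketch using Python's sqlite3 module as a stand-in. The table and the hostile input string are invented for the example. The same idea applies to stored procedure parameters and to prepared statements from PHP: the value is passed separately from the SQL text, so it is never parsed as SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE equipment (equipmentID INTEGER PRIMARY KEY, equipmentName TEXT NOT NULL);
INSERT INTO equipment (equipmentName) VALUES ('Super Lamp'), ('BMX 200 Ultra');
""")

hostile = "x' OR '1'='1"  # hypothetical input typed into a search box

# Unsafe: the input is pasted into the SQL text and changes the WHERE clause.
unsafe_sql = (
    "SELECT equipmentName FROM equipment WHERE equipmentName = '" + hostile + "'"
)
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is bound as a parameter and compared as a plain string.
safe_rows = conn.execute(
    "SELECT equipmentName FROM equipment WHERE equipmentName = ?",
    (hostile,),
).fetchall()

print(len(unsafe_rows))  # 2 (every row leaks)
print(len(safe_rows))    # 0 (no equipment has that literal name)
```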
I use SQL Server. The diagram I created is straight out of SQL Server Management Studio (SSMS). You can simply expand a database, right click on the Database Diagrams folder, and create a new diagram with your selected tables. SSMS does the rest for you. If you don't have access to SSMS I might suggest trying out Microsoft Visio or if you have access to it, Visual Paradigm.

Database Design For Tournament Management Software

I'm currently designing a web application using php, javascript, and MySQL. I'm considering two options for the databases.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables with each record being linked to the appropriate tournament, (or matches to brackets, brackets to divisions etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on the fly for me, and seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!
Do not create tables for each tournament. A table is a type of an entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
Don't prematurely optimize. It adds complexity without any actual reason.
First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or group of columns) that uniquely identify a row, so you can use that as primary key. However, usually it's best just to have a column named ID or something similar of numeric type. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to the First normal form, since that assures data atomicity. This is important because it will save you a lot of headache later on. By doing this correctly, you will also have the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows; RDBMSes these days handle tables with hundreds of millions of rows, and they're designed to do that efficiently.
Besides compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
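As a rough sketch of the advice above (one table per entity type, numeric IDs, foreign keys, indexes on the columns you filter by), here is a cut-down version using Python's sqlite3 module as a stand-in for MySQL. Table and index names, and the sample rows, are assumptions for the example; bracket_match is used instead of match only to stay clear of SQL keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE tournament (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE division (
    id            INTEGER PRIMARY KEY,
    tournament_id INTEGER NOT NULL REFERENCES tournament(id),
    name          TEXT NOT NULL
);
CREATE TABLE bracket (
    id          INTEGER PRIMARY KEY,
    division_id INTEGER NOT NULL REFERENCES division(id),
    name        TEXT NOT NULL
);
CREATE TABLE bracket_match (
    id         INTEGER PRIMARY KEY,
    bracket_id INTEGER NOT NULL REFERENCES bracket(id),
    played_on  TEXT NOT NULL
);
-- Index the foreign keys that the common lookups filter and join on.
CREATE INDEX idx_division_tournament ON division(tournament_id);
CREATE INDEX idx_bracket_division    ON bracket(division_id);
CREATE INDEX idx_match_bracket       ON bracket_match(bracket_id);

INSERT INTO tournament VALUES (1, 'Spring Open'), (2, 'Summer Cup');
INSERT INTO division VALUES (1, 1, 'Juniors'), (2, 2, 'Seniors');
INSERT INTO bracket VALUES (1, 1, 'A'), (2, 2, 'A');
INSERT INTO bracket_match VALUES (1, 1, '2014-03-01'), (2, 1, '2014-03-02'),
                                 (3, 2, '2014-03-02');
""")

def match_count(tournament_id):
    """Count matches for one tournament out of the shared tables."""
    return conn.execute("""
        SELECT COUNT(*)
        FROM bracket_match AS m
        JOIN bracket  AS b ON b.id = m.bracket_id
        JOIN division AS d ON d.id = b.division_id
        WHERE d.tournament_id = ?
    """, (tournament_id,)).fetchone()[0]

spring = match_count(1)
summer = match_count(2)
print(spring, summer)  # 2 1

# Inspect how the engine plans a lookup (MySQL has EXPLAIN for the same purpose).
for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM bracket_match WHERE bracket_id = ?", (1,)
):
    print(row)
```

Adding a new tournament is just another row in each table; nothing in the schema or the queries changes.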
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX" and that search is done in the DBMS's data dictionary which itself is a bunch of tables. So, you are replacing a search within your tables with a search within data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have similar performance characteristics as real tables, but I bet these performance characteristics are unlikely to be better than "normal" tables for large numbers of rows. Also, performance of data dictionary is unlikely to be documented and you really shouldn't rely on undocumented features.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would present the additional pressure on performance.
The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
In case you decide to add or revise a table structure later (e.g. adding a new field) you will have to add it to hundreds of tables, which will be cumbersome, error prone and a big maintenance headache.
An RDBMS is built to scale in terms of rows, not tables and associated (indexes, triggers, constraints) elements - so you are working against your tool and not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"?
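To illustrate that last point, here is a sketch of the "played on a Sunday" request against a single shared table, using Python's sqlite3 module as a stand-in for MySQL. The tournament_match table and its rows are invented for the example, and strftime('%w', ...) is SQLite's weekday function (MySQL would use something like DAYOFWEEK instead). With one table per tournament, the same question would need a UNION over every table that exists at that moment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tournament_match (
    id            INTEGER PRIMARY KEY,
    tournament_id INTEGER NOT NULL,
    played_on     TEXT NOT NULL      -- ISO date, e.g. 2014-03-02
);
INSERT INTO tournament_match (tournament_id, played_on) VALUES
    (1, '2014-03-01'),   -- Saturday
    (1, '2014-03-02'),   -- Sunday
    (2, '2014-03-02'),   -- Sunday
    (2, '2014-03-03');   -- Monday
""")

# One query answers the question for every tournament at once.
# In SQLite, strftime('%w', date) returns '0' for Sunday.
sunday_matches = conn.execute("""
    SELECT tournament_id, played_on
    FROM tournament_match
    WHERE strftime('%w', played_on) = '0'
    ORDER BY tournament_id
""").fetchall()

print(sunday_matches)  # [(1, '2014-03-02'), (2, '2014-03-02')]
```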
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?

How to handle properties that exist "between" entities (in a many-to-many relationship in this case)?

I've found a few questions on modelling many-to-many relationships, but nothing that helps me solve my current problem.
Scenario
I'm modelling a domain that has Users and Challenges. Challenges have many users, and users belong to many challenges. Challenges exist, even if they don't have any users.
Simple enough. My question gets a bit more complicated as users can be ranked on the challenge. I can store this information on the challenge, as a set of users and their rank - again not too tough.
Question
What scheme should I use if I want to query the individual rank of a user on a challenge (without getting the ranks of all users on the challenge)? At this stage, I don't care how I make the call in data access, I just don't want to return hundreds of rank data points when I only need one.
I also want to know where to store the rank information; it feels like it's dependent upon both a user and a challenge. Here's what I've considered:
The obvious: when instantiating a Challenge, just get all the rank information; slower but works.
Make a composite UserChallenge entity, but that feels like it goes against the domain (we don't go around talking about "user-challenges").
Third option?
I want to go with number two, but I'm not confident enough to know if this is really the DDD approach.
Update
I suppose I could call UserChallenge something more domain appropriate like Rank, UserRank or something?
The DDD approach here would be to reason in terms of the domain and talk with your domain expert/business analyst/whoever about this particular point to refine the model. Don't forget that the names of your entities are part of the ubiquitous language and need to be understood and used by non-technical people, so maybe "UserChallenge" is not the most appropriate term here.
What I'd first do is try to determine if that "middle entity" deserves a place in the domain model and the ubiquitous language. For instance, if you're building a website and there's a dedicated Rankings page where the user can see a list of all his challenges with the associated ranks, chances are ranks are a key matter in the application and a Ranking entity will be a good choice to represent that. You can talk with your domain expert to see if Rankings is a good name for it, or go for another name.
On the other hand, if there's no evidence that such an entity is needed, I'd stick to option 1. If you're worried about performance issues, there are ways of reducing the multiplicity of the relationship. Eric Evans calls that qualifying the association (DDD, p.83-84). Technically speaking, it could mean that the Challenge has a map - or a dictionary of ranks with the User as a key.
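As a rough illustration of a qualified association, here is a minimal Python sketch in which the Challenge holds a dictionary of ranks keyed by User. The class and method names (`record_rank`, `rank_of`) are made up for this example and are not from the question; how the dictionary gets loaded (eagerly or lazily) is left to the data access layer.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class User:
    # frozen=True makes User hashable, so it can serve as a dictionary key
    user_id: int
    name: str


@dataclass
class Challenge:
    title: str
    # Qualified association: ranks are keyed by user, so a single rank
    # can be looked up without iterating over every participant.
    _ranks: Dict[User, int] = field(default_factory=dict)

    def record_rank(self, user: User, rank: int) -> None:
        self._ranks[user] = rank

    def rank_of(self, user: User) -> Optional[int]:
        # Returns None when the user has no rank on this challenge
        return self._ranks.get(user)


alice = User(1, "alice")
bob = User(2, "bob")
challenge = Challenge("100 push-ups")
challenge.record_rank(alice, 1)
challenge.record_rank(bob, 2)
```

Callers then ask the challenge for one user's rank, e.g. `challenge.rank_of(bob)`, instead of receiving the whole collection.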
I would go with Option 2. You don't have to "go around talkin about user-challenges", but you do have to go around grabbin all them Users for a given challenge and sorting them by rank and this model provides you a great way to do it!
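A minimal sketch of what Option 2 could look like, assuming a `Ranking` entity and a `RankingRepository` (both names hypothetical) backed by an in-memory SQLite table so the example is self-contained. One method fetches a single user's rank; the other returns everyone on a challenge sorted by rank.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Ranking:
    user_id: int
    challenge_id: int
    user_rank: int


class RankingRepository:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        # One row per (user, challenge) pair
        conn.execute(
            """CREATE TABLE IF NOT EXISTS ranking (
                   user_id INTEGER NOT NULL,
                   challenge_id INTEGER NOT NULL,
                   user_rank INTEGER NOT NULL,
                   PRIMARY KEY (user_id, challenge_id))"""
        )

    def save(self, ranking: Ranking) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO ranking VALUES (?, ?, ?)",
            (ranking.user_id, ranking.challenge_id, ranking.user_rank),
        )

    def rank_for(self, user_id: int, challenge_id: int) -> Optional[Ranking]:
        # Fetches exactly one row instead of every rank on the challenge
        row = self.conn.execute(
            "SELECT user_id, challenge_id, user_rank FROM ranking "
            "WHERE user_id = ? AND challenge_id = ?",
            (user_id, challenge_id),
        ).fetchone()
        return Ranking(*row) if row else None

    def leaderboard(self, challenge_id: int) -> List[Ranking]:
        rows = self.conn.execute(
            "SELECT user_id, challenge_id, user_rank FROM ranking "
            "WHERE challenge_id = ? ORDER BY user_rank",
            (challenge_id,),
        ).fetchall()
        return [Ranking(*row) for row in rows]


repo = RankingRepository(sqlite3.connect(":memory:"))
repo.save(Ranking(user_id=2, challenge_id=10, user_rank=2))
repo.save(Ranking(user_id=1, challenge_id=10, user_rank=1))
repo.save(Ranking(user_id=1, challenge_id=11, user_rank=5))
```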

What is the best normalization for street address?

Today I have a table containing:
Table a
--------
name
description
street1
street2
zipcode
city
fk_countryID
I'm having a discussion about the best way to normalize this for the quickest search, e.g. finding all rows filtered by city or zipcode. The suggested new structure is this:
Table A
--------
name
description
fk_streetID
streetNumber
zipcode
fk_countryID
Table Street
--------
id
street1
street2
fk_cityID
Table City
----------
id
name
Table Country
-------------
id
name
The discussion is about having only one field for the street name instead of two.
My argument is that having two fields is considered normal for supporting international addresses.
The argument for a single field is that two fields come at the cost of search performance and possible duplication.
I'm wondering what is the best way to go here.
UPDATE
I'm aiming at having 15,000 brands associated with 50,000 stores, where 1,000 users will do multiple searches each day by web and iPhone. In addition, I will have third parties fetching data from the DB for their sites.
The site is not launched yet, so we have no idea of the workload. And we'll only have around 1,000 brands associated with around 4,000 stores when we start.
My standard advice (from years of data warehouse /BI experience) here is:
always store the lowest level of broken out detail, i.e. the multiple fields option.
In addition to that, depending on your needs you can add indexes or even a compound field that is the other two fields concatenated - though make sure to maintain it with a trigger and not manually, or you will have data synchronization and quality problems.
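A minimal sketch of a trigger-maintained compound field. It uses Python's built-in `sqlite3` module so that it runs on its own; MySQL trigger syntax differs, and the `store` table and `street_full` column are names assumed for this illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE store (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        street1 TEXT NOT NULL,
        street2 TEXT NOT NULL DEFAULT '',
        street_full TEXT  -- compound field, written only by the triggers
    );

    -- Fill the compound field whenever a row is inserted
    CREATE TRIGGER store_street_full_insert
    AFTER INSERT ON store
    BEGIN
        UPDATE store
        SET street_full = TRIM(NEW.street1 || ' ' || NEW.street2)
        WHERE id = NEW.id;
    END;

    -- Refresh it whenever either source column changes
    CREATE TRIGGER store_street_full_update
    AFTER UPDATE OF street1, street2 ON store
    BEGIN
        UPDATE store
        SET street_full = TRIM(NEW.street1 || ' ' || NEW.street2)
        WHERE id = NEW.id;
    END;

    CREATE INDEX idx_store_street_full ON store (street_full);
    """
)


def add_store(name: str, street1: str, street2: str = "") -> int:
    cur = conn.execute(
        "INSERT INTO store (name, street1, street2) VALUES (?, ?, ?)",
        (name, street1, street2),
    )
    return cur.lastrowid


def street_full(store_id: int) -> str:
    row = conn.execute(
        "SELECT street_full FROM store WHERE id = ?", (store_id,)
    ).fetchone()
    return row[0]


store_id = add_store("Acme", "Main St 1", "Suite 4")
```

The application only ever writes `street1` and `street2`; the database keeps `street_full` in step, which is the point of using a trigger rather than manual maintenance.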
Part of the correct answer for you will always depend on your actual use. Can you ever anticipate needing the address in a standard (2-line) format for mailing... or exchange with other entities? Or is this a really pure 'read-only' database that is just set up for inquiries and not used for more standard address needs such as mailings?
At the end of the day, if you have issues with query performance, you can add additional structures such as compound fields, indexes and even other tables with the same data in a different form. Then there are also options for caching at the server level if performance is slow. If you are building a complex or traffic-intensive site, chances are you will end up with a product to help anyway; for example, in the Ruby programming world people use Thinking Sphinx. If query performance is still an issue and your data is growing, you may ultimately need to consider non-SQL solutions like MongoDB.
One final principle that I also adhere to: think about people updating data, if that will occur in this system. When people input data initially and then subsequently go to edit that information, they expect the information to be "the same", so any manipulation done internally that actually changes the form or content of the user's input will become a major headache when trying to allow them to do a simple edit. I have seen insanely complicated algorithms for encoding and decoding data in this fashion, and they frequently have issues.
I think the topmost example is the way to go, maybe with a third free-form field:
name
description
street1
street2
street3
zipcode
city
fk_countryID
The only things you can normalize half-way sanely for international addresses are zip code (which needs to be a free-form field, though) and city. Street addresses vary way too much.
Note that high normalisation means more joins, so it won't yield faster searches in every case.
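To make the no-join option concrete, here is a small sketch of the flat table with plain indexes on `city` and `zipcode`. It uses SQLite through Python's `sqlite3` module so that it runs standalone; the table name `store`, the index names and the sample rows are invented for illustration, and MySQL's planner may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE store (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        street1 TEXT,
        street2 TEXT,
        zipcode TEXT,   -- free-form, since formats differ by country
        city TEXT,
        fk_countryID INTEGER
    );
    CREATE INDEX idx_store_city ON store (city);
    CREATE INDEX idx_store_zipcode ON store (zipcode);
    """
)
conn.executemany(
    "INSERT INTO store (name, street1, zipcode, city, fk_countryID) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Shop B", "Storgata 1", "0155", "Oslo", 1),
        ("Shop A", "Karl Johans gate 5", "0154", "Oslo", 1),
        ("Shop C", "Bryggen 3", "5003", "Bergen", 1),
    ],
)


def stores_in_city(city: str) -> list:
    # Single-table lookup: no joins are needed to filter by city
    rows = conn.execute(
        "SELECT name FROM store WHERE city = ? ORDER BY name", (city,)
    )
    return [row[0] for row in rows]


def query_plan(sql: str, params: tuple) -> list:
    # The last column of EXPLAIN QUERY PLAN holds the human-readable detail
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


plan = query_plan("SELECT name FROM store WHERE city = ?", ("Oslo",))
```

Inspecting `plan` shows whether the city index is picked up; the same check works for a `zipcode` filter.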
As others have mentioned, address normalization (or "standardization") is most effective when the data is together in one table but the individual pieces are in separate columns (like your first example). I work in the address verification field (at SmartyStreets), and you'll find that standardizing addresses is a really complex task. There's more documentation on this task here: https://www.smartystreets.com/Features/Standardization/
With the volume of requests you'll be handling, I highly suggest you make sure the addresses are correct before you deploy. Process your list of addresses and remove duplicates, standardize formats, etc. A CASS-Certified vendor (such as SmartyStreets, though there are others) will provide such a service.