Data Modeling: hierarchy of geographic locations - mysql

I want my users to be able to specify their locations so that I can plot them on a map. Given an address, I use Google Maps API to get their lat/long coordinates and store that in the database.
Additionally, I want to allow users to search for other users based on location. Using Google Maps API, I can also get, say, country/state/city for address (or lat/long coordinates). My problem is that I don't know how to store country/state/city in such a way that:
The data is associated to a particular user
There is no data redundancy
The problem, I think, is this: if User-1 and User-2 both enter an address that results in the country being "USA", I need to know that User-1 and User-2 are both from the USA -- AND that "USA" is only stored once in the DB.
When users search for other users, I think I should only let them search for users in the USA if I actually have users from USA. Meaning, assume User-1 and User-2 are the only 2 users from the USA, if User-1 and User-2 delete their profiles, I shouldn't allow searches for users in the USA anymore.
Are my ideas conceptually wrong? In any case, how should I model this information? I'm using MySQL.

Your goals and intentions are correct (don't give them up!); you may just need a bit of help getting over the line. The more understanding and experience you have with data modelling and Normalisation, the easier it will be. So do as much research and as many exercises as you can (SO or the web is not a good way to learn anything).
For loading and maintenance purposes, you are better off ensuring that you have a normalised, top-down structure for geographic locations. You can load it from info provided (usually free) by your council or county or Post Office or whatever. That will eliminate manual data entry for the External Reference tables.
This is a highly Normalised Data Model, at 5NF.
But there's more Normalisation that can be done; more duplication that can be removed. It is not worth it unless you are really interested, eg. you need to search on LastNames, etc.
This one is for a Utility company, to ensure that false Street locations are not provided by prospective customers.
An Address is a specific house or unit (apartment), and it is not duplicated. Two People living at the same address will use one Address row.
This structure handles "any" geographic location. Note that some countries do not have States; some States do not have Counties; some Towns do not have Suburbs; etc. Rather than building all those exceptions into the Data Model, I have kept the hierarchy "clean". You will still need to handle the exceptions in your SQL either way (whether the model is simple with no exceptions, or whether it builds them in), because that is the real world, and not display the State for a State-less Country. All you need is a row for the non-State with a StateCode of CHAR(0) that identifies the condition.
I have placed Longitude & Latitude at the Suburb level, assuming that is what your users can choose easily via Google Maps, and because the Street level has limitations (it might be too fine-grained and would cause duplication, or not fine-grained enough for cities with very long Streets). Easy to change.
For now, I suggest you do not worry about identifying users in the same country; first see if you can handle the SQL to identify users in the same Suburb (not Street -- that is easy). After that, we can deal with City, County, Country, etc.
I think the other searches you identify are effortless; see if you agree.
Anyway, this is just something to get you started; there is some interaction to be had before it is finalised.
Link to GLS Data Model (plus the answer to your other question)
Link to IDEF1X Notation for those who are unfamiliar with the Relational Modelling Standard.
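To make the hierarchy and the "same Suburb" query concrete, here is a minimal sketch. It uses SQLite via Python for a self-contained demo (MySQL DDL has the same shape); all table and column names are illustrative assumptions, not taken from the linked Data Model:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE country (country_code TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE state   (country_code TEXT REFERENCES country,
                      state_code   TEXT,    -- '' row stands in for State-less countries
                      name TEXT,
                      PRIMARY KEY (country_code, state_code));
CREATE TABLE suburb  (suburb_id INTEGER PRIMARY KEY,
                      country_code TEXT, state_code TEXT, name TEXT,
                      latitude REAL, longitude REAL,   -- lat/long kept at Suburb level
                      FOREIGN KEY (country_code, state_code) REFERENCES state);
CREATE TABLE user    (user_id INTEGER PRIMARY KEY, name TEXT,
                      suburb_id INTEGER REFERENCES suburb);
""")
con.execute("INSERT INTO country VALUES ('US', 'USA')")
con.execute("INSERT INTO state VALUES ('US', 'CA', 'California')")
con.execute("INSERT INTO suburb VALUES (1, 'US', 'CA', 'Noe Valley', 37.75, -122.43)")
con.executemany("INSERT INTO user VALUES (?,?,?)",
                [(1, "User-1", 1), (2, "User-2", 1)])

# Users in the same suburb as User-1 (excluding User-1 himself):
rows = con.execute("""
    SELECT u2.name FROM user u1
    JOIN user u2 ON u2.suburb_id = u1.suburb_id AND u2.user_id <> u1.user_id
    WHERE u1.user_id = 1
""").fetchall()
print(rows)  # [('User-2',)]

# Searchable countries are derived from existing users, never stored separately,
# so when the last USA user deletes his profile, "USA" drops out automatically:
countries = con.execute("""
    SELECT DISTINCT c.name
    FROM user u
    JOIN suburb s  ON s.suburb_id = u.suburb_id
    JOIN country c ON c.country_code = s.country_code
""").fetchall()
print(countries)  # [('USA',)]
```

Note that "USA" appears exactly once in the database (one `country` row), and both users reach it through the hierarchy, which satisfies the no-redundancy requirement in the question.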

Related

mysql tables redesign ideas (relationships)

I have tables with relationships like the below:
I have cascading drop down boxes linked to each other, i.e. when you select countries, the regions under that country will be loaded in the regions drop down. But now I want to change the drop downs to Ajax based auto-complete text-boxes.
My question is: how many text-boxes should I have? Should I have one text-box for everything, like "search by location" (in which case I would need to change the table design), or one text-box each for country, region, city, etc.?
If I have text-boxes like these, users may not know, for a few places, whether they are a region or a city; for example, Auckland, New Zealand is a region, not a city.
They may search for regions in the city text-box and for cities in the region text-box. With the current dropdowns they can at least see their region listed ("Auckland will be there in the region list for sure").
With individual text-boxes, users may not find what they want.
I need some suggestions on redesigning from both the database & interface point of view.
Your schema is fine. But it sounds like what the user wants at a minimum is:
1. A google-style free-form text field which they can just type in words, but...
2. Which brings up a subset of matching results in a combo-style fashion.
So here's the deal: Search-like capability isn't what relational databases are designed for, and that's basically the problem you're running into. That said, MySQL, while not my domain of expertise, does seem to have reasonable full-text search support (MySQL Full Text Search).
Perhaps you could have FULLTEXT indices on each of the description fields and issue five different queries. Or if you're willing to go with a dirty solution, have a separate BUSINESS_SEARCH(business_id, concat_description) where concat_description is just all of the related "description" fields munged together; though you'll need to account for description updates.
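The separate-search-table idea can be sketched as follows. This demo uses SQLite via Python with a plain LIKE match so it is self-contained; in MySQL you would put a FULLTEXT index on `concat_description` and query it with MATCH ... AGAINST. Table and column names follow the hypothetical BUSINESS_SEARCH example above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE business (business_id INTEGER PRIMARY KEY,
                       country TEXT, region TEXT, city TEXT);
-- the "dirty" denormalised search table: all description fields munged together
CREATE TABLE business_search (business_id INTEGER PRIMARY KEY,
                              concat_description TEXT);
""")
con.execute("INSERT INTO business VALUES (1, 'New Zealand', 'Auckland', 'Auckland')")
# Rebuild the search row from the source row (in production this is the part
# you must keep in sync when descriptions are updated):
con.execute("""INSERT INTO business_search
               SELECT business_id, country || ' ' || region || ' ' || city
               FROM business""")

# One free-form query against a single column instead of five separate queries:
hits = con.execute(
    "SELECT business_id FROM business_search WHERE concat_description LIKE ?",
    ("%auckland%",)).fetchall()
print(hits)  # [(1,)]
```

The user never has to know whether "Auckland" is a region or a city; the munged column matches either way, which is exactly the ambiguity problem described in the question.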
But I have no idea what the performance implications are with FULLTEXT. If it's non-trivial, I'd offload these queries to a separate copy of the server.
My personal feeling--completely without evidence to back it up--is that you'll run into performance problems down the road. Have you considered an add-on? A quick Google search turns up Google-like Search Engine in PHP/mySQL. The big downside is that you're introducing all the pitfalls of an unproven/unfamiliar technology.
For either approach, I think you have some research cut out for you.
Good luck!

How to handle properties that exist "between" entities (in a many-to-many relationship in this case)?

I've found a few questions on modelling many-to-many relationships, but nothing that helps me solve my current problem.
Scenario
I'm modelling a domain that has Users and Challenges. Challenges have many users, and users belong to many challenges. Challenges exist, even if they don't have any users.
Simple enough. My question gets a bit more complicated as users can be ranked on the challenge. I can store this information on the challenge, as a set of users and their rank - again not too tough.
Question
What scheme should I use if I want to query the individual rank of a user on a challenge (without getting the ranks of all users on the challenge)? At this stage, I don't care how I make the call in data access, I just don't want to return hundreds of rank data points when I only need one.
I also want to know where to store the rank information; it feels like it's dependent upon both a user and a challenge. Here's what I've considered:
The obvious: when instantiating a Challenge, just get all the rank information; slower but works.
Make a composite UserChallenge entity, but that feels like it goes against the domain (we don't go around talking about "user-challenges").
Third option?
I want to go with number two, but I'm not confident enough to know if this is really the DDD approach.
Update
I suppose I could call UserChallenge something more domain appropriate like Rank, UserRank or something?
The DDD approach here would be to reason in terms of the domain and talk with your domain expert/business analyst/whoever about this particular point to refine the model. Don't forget that the names of your entities are part of the ubiquitous language and need to be understood and used by non-technical people, so maybe "UserChallenge" is not the most appropriate term here.
What I'd first do is try to determine if that "middle entity" deserves a place in the domain model and the ubiquitous language. For instance, if you're building a website and there's a dedicated Rankings page where the user can see a list of all his challenges with the associated ranks, chances are ranks are a key matter in the application and a Ranking entity will be a good choice to represent that. You can talk with your domain expert to see if Rankings is a good name for it, or go for another name.
On the other hand, if there's no evidence that such an entity is needed, I'd stick to option 1. If you're worried about performance issues, there are ways of reducing the multiplicity of the relationship. Eric Evans calls that qualifying the association (DDD, p.83-84). Technically speaking, it could mean that the Challenge has a map - or a dictionary of ranks with the User as a key.
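A minimal sketch of that qualified association, with illustrative names (not taken from the question): the Challenge holds a dict of ranks keyed by user id, so one user's rank is a direct lookup rather than a scan over every participant.

```python
class Challenge:
    def __init__(self, name):
        self.name = name
        self._ranks = {}  # user_id -> rank; the User qualifies the association

    def set_rank(self, user_id, rank):
        self._ranks[user_id] = rank

    def rank_of(self, user_id):
        # One user's rank without loading every other participant's rank.
        return self._ranks.get(user_id)

challenge = Challenge("Spring Marathon")
challenge.set_rank(42, 3)
print(challenge.rank_of(42))  # 3
```

The qualifier reduces the many-to-many to effectively one-to-one from the Challenge's point of view, which is the multiplicity-reduction Evans describes.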
I would go with Option 2. You don't have to "go around talkin about user-challenges", but you do have to go around grabbin all them Users for a given challenge and sorting them by rank and this model provides you a great way to do it!
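At the persistence level, option 2 can be sketched as an associative table (called "ranking" here purely for illustration) keyed by (challenge_id, user_id), so fetching one user's rank is a single indexed lookup and the leaderboard is a single ORDER BY. This demo uses SQLite via Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ranking (challenge_id INTEGER, user_id INTEGER, rank INTEGER,
                      PRIMARY KEY (challenge_id, user_id));
""")
con.executemany("INSERT INTO ranking VALUES (?,?,?)",
                [(1, 10, 2), (1, 11, 1), (1, 12, 3)])

# One user's rank on one challenge -- no need to load the whole leaderboard:
(rank,) = con.execute(
    "SELECT rank FROM ranking WHERE challenge_id=? AND user_id=?",
    (1, 10)).fetchone()
print(rank)  # 2

# All users of a challenge sorted by rank, for the leaderboard view:
board = con.execute(
    "SELECT user_id FROM ranking WHERE challenge_id=? ORDER BY rank",
    (1,)).fetchall()
print(board)  # [(11,), (10,), (12,)]
```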

What is the best normalization for street address?

Today I have a table containing:
Table a
--------
name
description
street1
street2
zipcode
city
fk_countryID
I'm having a discussion what is the best way to normalize this in terms of quickest search. E.g. find all rows filtered by city or zipcode. Suggested new structure is this:
Table A
--------
name
description
fk_streetID
streetNumber
zipcode
fk_countryID
Table Street
--------
id
street1
street2
fk_cityID
Table City
----------
id
name
Table Country
-------------
id
name
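The suggested split can be exercised end-to-end as follows. This sketch uses SQLite via Python for a self-contained demo (MySQL DDL differs only in column types), and the sample data is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE city    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE street  (id INTEGER PRIMARY KEY, street1 TEXT, street2 TEXT,
                      fk_cityID INTEGER REFERENCES city(id));
CREATE TABLE a       (name TEXT, description TEXT,
                      fk_streetID INTEGER REFERENCES street(id),
                      streetNumber TEXT, zipcode TEXT,
                      fk_countryID INTEGER REFERENCES country(id));
""")
con.execute("INSERT INTO country VALUES (1, 'Norway')")
con.execute("INSERT INTO city VALUES (1, 'Oslo')")
con.execute("INSERT INTO street VALUES (1, 'Karl Johans gate', NULL, 1)")
con.execute("INSERT INTO a VALUES ('Shop', 'demo', 1, '22', '0159', 1)")

# "Find all rows filtered by city" now requires two joins:
rows = con.execute("""
    SELECT a.name FROM a
    JOIN street s ON s.id = a.fk_streetID
    JOIN city   c ON c.id = s.fk_cityID
    WHERE c.name = 'Oslo'
""").fetchall()
print(rows)  # [('Shop',)]
```

Note the two joins needed for the city filter: this is the join cost that the normalised variant trades against its reduced duplication.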
The discussion is about having only one field for the street name instead of two.
My argument is that having two fields is considered normal for supporting international addresses.
The counter-argument is that it will come at the cost of search performance and possible duplication.
I'm wondering what is the best way to go here.
UPDATE
I'm aiming at having 15,000 brands associated with 50,000 stores, where 1,000 users will do multiple searches each day by web and iPhone. In addition, I will have third parties fetching data from the DB for their sites.
The site is not launched yet, so we have no idea of the workload. And we'll only have around 1,000 brands associated with around 4,000 stores when we start.
My standard advice (from years of data warehouse /BI experience) here is:
always store the lowest level of broken out detail, i.e. the multiple fields option.
In addition to that, depending on your needs you can add indexes or even a compound field that is the other two fields concatenated -- though make sure to maintain it with a trigger and not manually, or you will have data synchronization and quality problems.
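The trigger-maintained compound field can be sketched like this. The demo uses SQLite trigger syntax via Python (MySQL trigger syntax differs slightly), and the table and column names are illustrative. The derived column is rebuilt automatically on insert and update, so it can never drift out of sync with the source fields:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE address (id INTEGER PRIMARY KEY,
                      street1 TEXT, street2 TEXT,
                      street_full TEXT);          -- derived; maintained by triggers only
CREATE TRIGGER address_ins AFTER INSERT ON address BEGIN
    UPDATE address
    SET street_full = NEW.street1 || ' ' || COALESCE(NEW.street2, '')
    WHERE id = NEW.id;
END;
CREATE TRIGGER address_upd AFTER UPDATE OF street1, street2 ON address BEGIN
    UPDATE address
    SET street_full = NEW.street1 || ' ' || COALESCE(NEW.street2, '')
    WHERE id = NEW.id;
END;
""")
con.execute("INSERT INTO address (street1, street2) VALUES ('10 Main St', 'Apt 4')")
(full,) = con.execute("SELECT street_full FROM address").fetchone()
print(full)  # 10 Main St Apt 4
```

Because the application never writes `street_full` directly, the synchronization and quality problems mentioned above cannot occur.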
Part of the correct answer for you will always depend on your actual use. Can you ever anticipate needing the address in a standard (2-line) format for mailing... or exchange with other entities? Or is this a really pure 'read-only' database that is just set up for inquiries and not used for more standard address needs such as mailings.
At the end of the day, if you have issues with query performance, you can add additional structures such as compound fields, indexes, and even other tables with the same data in a different form. There are also options for caching at the server level if performance is slow. If building a complex or traffic-intensive site, chances are you will end up with a product to help anyway; for example, in the Ruby programming world people use Thinking Sphinx. If query performance is still an issue and your data is growing, you may ultimately need to consider non-SQL solutions like MongoDB.
One final principle that I also adhere to: think about people updating data, if that will occur in this system. When people input data initially and then subsequently go to edit that information, they expect the information to be "the same", so any manipulation done internally that actually changes the form or content of the user's input will become a major headache when trying to allow them to do a simple edit. I have seen insanely complicated algorithms for encoding and decoding data in this fashion, and they frequently have issues.
I think the topmost example is the way to go, maybe with a third free-form field:
name
description
street1
street2
street3
zipcode
city
fk_countryID
The only things you can normalize half-way sanely for international addresses are the zip code (which needs to be a free-form field, though) and the city. Street addresses vary way too much.
Note that high normalisation means more joins, so it won't yield faster searches in every case.
As others have mentioned, address normalization (or "standardization") is most effective when the data is together in one table but the individual pieces are in separate columns (like your first example). I work in the address verification field (at SmartyStreets), and you'll find that standardizing addresses is a really complex task. There's more documentation on this task here: https://www.smartystreets.com/Features/Standardization/
With the volume of requests you'll be handling, I highly suggest you make sure the addresses are correct before you deploy. Process your list of addresses and remove duplicates, standardize formats, etc. A CASS-Certified vendor (such as SmartyStreets, though there are others) will provide such a service.

Implementing search on medical link list/table that allows for synonyms/abbreviations- and importing such a thing

I'm making a simple searchable list which will end up containing about 100,000 links on various medical topics- mostly medical conditions/diseases.
Now on the surface of things this sounds easy... in fact I've set my tables up in the following way:
Links: id, url, name, topic
Topics (eg cardiology, paediatrics etc): id, name
Conditions (eg asthma, influenza etc): id, name, aliases
And possibly another table:
Link & condition (since 1 link can pertain to multiple conditions): link id, condition id
So basically since doctors (including myself) are super fussy, I want to make it so that if you're searching for a condition- whether it be an abbreviation, british or american english, or an alternative ancient name- you get relevant results (eg "angiooedema", "angioedema", "Quincke's edema" etc would give you the same results; similarly with "gastroesophageal reflux" "gastro-oesophageal reflux disease", GERD, GORD, GOR). Additionally, at the top of the results it would be good to group together links for a diagnosis that matches the search string, then have matches to link name, then finally matches to the topic.
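The tiered ordering described above (condition matches first, then link-name matches, then topic matches) can be sketched with a CASE-based priority column. This demo uses SQLite via Python and the table layout from the question; the sample data and the `tier` column name are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE links  (id INTEGER PRIMARY KEY, url TEXT, name TEXT, topic TEXT);
CREATE TABLE conditions (id INTEGER PRIMARY KEY, name TEXT, aliases TEXT);
CREATE TABLE link_condition (link_id INTEGER, condition_id INTEGER);
""")
con.execute("INSERT INTO conditions VALUES (1, 'angioedema', 'angiooedema;Quincke''s edema')")
con.executemany("INSERT INTO links VALUES (?,?,?,?)", [
    (1, 'http://example.org/a', 'Angioedema review', 'immunology'),
    (2, 'http://example.org/b', 'Urticaria overview', 'angioedema clinic'),
])
con.execute("INSERT INTO link_condition VALUES (1, 1)")

term = 'angioedema'
rows = con.execute("""
    SELECT l.name,
           CASE WHEN l.id IN (SELECT lc.link_id FROM link_condition lc
                              JOIN conditions c ON c.id = lc.condition_id
                              WHERE c.name = :t OR c.aliases LIKE '%'||:t||'%')
                THEN 1                                  -- diagnosis match
                WHEN l.name  LIKE '%'||:t||'%' THEN 2   -- link-name match
                WHEN l.topic LIKE '%'||:t||'%' THEN 3   -- topic match
                ELSE 4 END AS tier
    FROM links l
    ORDER BY tier
""", {"t": term}).fetchall()
print(rows)  # [('Angioedema review', 1), ('Urticaria overview', 3)]
```

The alias column does the synonym work here; in practice you would normalise the aliases into their own table rather than a delimited string, but the ranking idea is the same.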
My main problem is that there are thousands if not tens of thousands of conditions, each with up to 20 synonyms/spellings etc. One option is to get data from MeSH, which happens to be a sort of medical thesaurus (but in American English only, so there would have to be a way of converting from British English). The trouble is that the XML they provide is INSANE and about 250 MB. To help, they have a guide to what the data elements are.
Honestly, I am at a loss as to how to tackle this most effectively as I've just started programming and working with databases and most of the possibilities of what to do seem difficult/suboptimal.
Was wondering if anyone could give me a hand? Happy to clarify anything that is unclear.
Your problem is well suited to a document-oriented store such as Lucene. For example you can design a schema such as
Link
Topic
Conditions
Then you can write a Lucene query such as Topic:edema and you should get all results.
You can do wildcard search for more.
To match British spellings (or even misspellings) you can use the ~ query, which finds terms within a certain string distance. For example edema~0.5 matches oedema, oedoema and so on...
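The ~ operator is Lucene-specific, but the underlying idea (match terms within a small edit distance) can be sketched with Python's stdlib difflib, e.g. to map a British spelling onto the stored American term. The vocabulary here is an invented sample:

```python
import difflib

# Stored (American-English) terms, e.g. from MeSH:
vocabulary = ["edema", "tumor", "anemia", "pediatrics"]

# British spellings resolve to their closest stored term by similarity ratio:
for query in ("oedema", "anaemia", "paediatrics"):
    match = difflib.get_close_matches(query, vocabulary, n=1, cutoff=0.7)
    print(query, "->", match)
# oedema -> ['edema']
# anaemia -> ['anemia']
# paediatrics -> ['pediatrics']
```

This works for the ae/oe-style variants because they differ by only a character or two; true synonyms like "Quincke's edema" still need the alias table, since no string distance will bridge them.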
Apache Lucene is a Java library with ports available for most major languages. Apache Solr is a full-fledged search server built using the Lucene library and easily integrable into your platform of choice because it has a RESTful API.
Summary: my recommendation is to use Apache Solr as an adjunct to your MySQL db.
It's hard. Your best bet is to use MeSH and then perhaps soundex to match on British English terms.

How important is database normalization in a very simple database?

I am making a very simple database (mysql) with essentially two types of data, always with a 1 to 1 relationship:
Events
Sponsor
Time (Optional)
Location (City, State)
Venue (Optional)
Details URL
Sponsors
Name
URL
Cities will be duplicated often, but is there really much value in having a cities table for such a simple database schema?
The database is populated by screen-scraping a website. On this site the city field is populated via selecting from a dropdown, so there will not be mistypes, etc and it would be easy to match the records up with a city table. I'm just not sure there would be much of a point even if the users of my database will be searching by city frequently.
Normalize the database now.
It's a lot easier to optimize queries on normalized data than it is to normalize a pile of data.
You say it's simple now - these things have a tendency to grow. Design it right and you'll get the experience of proper design and some future proofing.
I think you are looking at things the wrong way - you should always normalize unless you have a good reason not to.
Trusting your application to maintain data integrity is a needless risk. You say the data is made uniform because it is selected from a dropdown. What if someone hacks on the form and modifies the data, or if your code inadvertently allows a querystring param with the same name?
Where will the city data come from that populates your dropdown box for the user? Wouldn't you want a table for that?
It looks like you are treating Location as one attribute including city and state. Suppose you want to sort or analyse events by state alone rather than city and state? That could be hard to do if you don't have an attribute for state. Logically I would expect state to belong in a city table - although that may depend on exactly how you want to identify cities.
Direct answer: Just because a problem is relatively simple is no reason to not do things to keep it simple. It's a lot easier to walk on my feet than on my hands. I don't recall ever saying, "Oh, I only have to go half a mile, that's a short distance so I might as well walk on my hands."
Longer answer: If you don't keep any information about a city other than its name, and you don't have a pre-set list of cities (e.g. to build a drop-down), then your schema is already normalized. What would be in a City table other than the city name? (I presume State cannot be dependent on City because you could have two cities with the same name in different states, e.g. Dayton OH and Dayton TN.) The relevant rule of normalization is "no non-key dependencies", that is, you cannot have data that depends on data that is not a key. If you had, say, latitude and longitude of each city, then this data would be repeated in every record that referenced the same city. In that case you would certainly want to break out a separate city table to hold the latitude and longitude. You could, of course, create a "city code" that is an integer or abbreviation that links to a city table. But if there's no other data about a city, I don't see how this gains anything.
Technically, I would assume that City depends on Venue. If the venue is "Rockefeller Center", that implies that the city must be New York. But if venue is optional, this creates problems. One possibility is to have a Venue table that lists venue name, city, and state, and for cases where you don't specify a venue, have an "unspecified" venue for each city. This would be more textbook-correct, but in practice, if in most cases you do not specify a venue, it would gain little. If most of the time you DO specify a venue, it would probably be a good idea.
Oh, and, is there really a 1:1 relation between event and sponsor? I can believe that an event cannot have more than one sponsor. (In real life, there are plenty of events with multiple sponsors, but maybe for your purposes you only care about a "primary sponsor" or some such.) But does a sponsor never hold more than one event? That seems unlikely.
Why not go ahead and normalize? You write as if there are significant costs of normalizing that outweigh the benefits. It's easier to set it up in a normal form before you populate it than to try and normalize it later.
Also, I wonder about your 1-to-1 relationship. Naively, I would imagine that an event might have multiple sponsors, or that a sponsor might be involved in more than one event. But I don't know your business logic...
ETA:
I don't know why I didn't notice this before, but if you are really averse to normalizing your database, and you know that you will always have a 1-to-1 relationship between the events and sponsors, then why would you have the sponsors in a separate table?
It sounds like you may be a little confused about what normalization is and why you would do it.
The answer hinges, IMO, on whether you want to prevent errors during data-entry. If you do, you will need a VENUES table:
VENUES
City
State
VenueName
as well as a CITIES and STATES table. (Note: I've seen situations where the same city occurs multiple times in the same state, usually smaller towns, so CITY/STATE do not comprise a unique dyad. Normally there's a zipcode to disambiguate.)
To prevent situations where the data-entry operator enters a venue for NY NY which is actually in SF CA, you'd need to validate the venue entry to see if such a venue exists in the city/state supplied on the record.
Then you'd need to make CITY/STATE mandatory, and write code to roll back the transaction and handle the error.
If you are not concerned about enforcing this sort of accuracy, then you don't really need to have CITY and STATES tables either.
If you are interested in learning about normalization, you should learn what happens when you don't normalize. For each normal form (beyond 1NF) there is an update anomaly that will occur as a consequence of harmful redundancy.
Often it's possible to program around the update anomalies, and sometimes that's more practical than always normalizing to the ultimate degree.
Sometimes, it's possible for a database to get into an inconsistent state due to failure to normalize, and failure to program the application to compensate.
In your example, the best I can come up with is a sort of lame hypothetical. What if the name of a city were misspelled in one row, but spelled correctly in all the others? What if you summarized by city and sponsor? Your output would reflect the error, and divide one group into two groups. Maybe it would be better if the city were only spelled out once in the database, for better or for worse. At least the grouping for the summary would be correct, even if the name were misspelled.
Is this worth normalizing for? Hey, it's your project, not mine. You decide.