Database Normalization with user input - mysql

I am developing a MySQL database that will contain the country, city, and occupation of each user.
While I can use a "country" table and then insert the id of the country into the user table, I still have to look for the perfect method for the other two tables.
The problem is that the city and occupation of each user are taken from an input field, meaning that users can type "NYC" or "New York" or "New York City" and millions of other combinations for each town, for example.
Is it a good idea to disregard this issue, create a separate "town" table containing all the towns entered by users, and then put the id of the town entry into the user table? Or would it be more appropriate to use a VARCHAR column "town" in the user table and not normalize the database for this relation?
I want to display the data from the three tables on user profile pages.
I am concerned about normalization because I don't want too much redundant data in my database: it consumes a lot of space, and as far as I know the queries will be slower if I use a VARCHAR index instead of an integer index, for example.
Thanks

We had this problem. Our solution was to collect the various synonyms and typo-containing versions that people use and explicitly map them to a known canonical city name. This allowed us to correctly guess the name from user input in 99% of cases.
For the remaining 1%, we created a new city entry and marked it as non-canonical. Periodically we looked through the non-canonical entries. For recognizable known cities, we remapped the non-canonical entry to the canonical one (updating the FKs of linked records and adding a synonym). For a genuinely new city name that we didn't know about, we kept the created entry and made it canonical.
So we had something like this:
CREATE TABLE city (
  id INTEGER PRIMARY KEY,
  name VARCHAR(255) NOT NULL, -- the canonical name
  ...
);
CREATE TABLE city_synonym (
  name VARCHAR(255) PRIMARY KEY, -- the primary key gives us the unique index we want
  city_id INTEGER,
  FOREIGN KEY (city_id) REFERENCES city (id)
);
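The lookup-and-fallback flow described above could be sketched roughly like this. It uses Python's built-in sqlite3 module instead of MySQL so it runs standalone. The `is_canonical` column, the `resolve_city` helper, and the choice to register the unknown spelling as its own synonym are assumptions for illustration, not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,                       -- the canonical name
    is_canonical INTEGER NOT NULL DEFAULT 1   -- assumed flag for the review queue
);
CREATE TABLE city_synonym (
    name TEXT PRIMARY KEY,                    -- unique index on the typed form
    city_id INTEGER NOT NULL REFERENCES city(id)
);
INSERT INTO city (id, name) VALUES (1, 'New York');
INSERT INTO city_synonym (name, city_id) VALUES
    ('new york', 1), ('nyc', 1), ('new york city', 1);
""")


def resolve_city(conn, user_input):
    """Return the city id for user_input, creating a non-canonical city if unknown."""
    key = " ".join(user_input.lower().split())  # normalize case and whitespace
    row = conn.execute(
        "SELECT city_id FROM city_synonym WHERE name = ?", (key,)
    ).fetchone()
    if row:
        return row[0]
    # Unknown spelling: store it as a non-canonical city for later manual review.
    cur = conn.execute(
        "INSERT INTO city (name, is_canonical) VALUES (?, 0)", (user_input.strip(),)
    )
    city_id = cur.lastrowid
    # Assumption: also register the spelling so repeated input reuses the same row.
    conn.execute(
        "INSERT INTO city_synonym (name, city_id) VALUES (?, ?)", (key, city_id)
    )
    return city_id


nyc_id = resolve_city(conn, "  NYC ")
new_id = resolve_city(conn, "Springfeild")
```

During the periodic review, a reviewer would either repoint the synonym at an existing canonical city or flip `is_canonical` to 1.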

Data normalization usually helps you work with data and keep it simple. If a normalized schema does not fit your needs, you can use denormalized data as well, so it depends on the queries you want to run.
There is no good way to group cities without creating a separate table that keeps all the names for each city under a single id. So it would be good to have three tables: user(user_id, city_id), city(city_id, correct name), city_alias(alias_id, city_id, name).

It would be better to store the data in a normalized design, containing the actual, government recognized city names.
@Varela's suggestion of an 'alias' for the city would probably work well in this situation. But you have to return a message along the lines of "You typed in 'Now Yerk'. Did you perhaps mean 'New York'?". Actually, you want to get these kinds of corrections regardless...
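A "did you mean" prompt like that could be sketched with Python's standard difflib module. The city list, the `suggest_city` helper, and the 0.6 cutoff are all assumptions for illustration; a real list would come from the city table.

```python
import difflib

# Hypothetical canonical names; in practice these would be read from the city table.
CANONICAL_CITIES = ["New York", "Newark", "Buffalo", "Bellevue"]


def suggest_city(user_input, cutoff=0.6):
    """Return the closest canonical city name, or None if nothing is similar enough."""
    by_lower = {name.lower(): name for name in CANONICAL_CITIES}
    matches = difflib.get_close_matches(
        user_input.strip().lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


suggestion = suggest_city("Now Yerk")
if suggestion is not None:
    message = f"You typed in 'Now Yerk'. Did you perhaps mean '{suggestion}'?"
```

The cutoff is a trade-off: lower values catch more typos but also produce more wrong suggestions, so it is worth tuning against real input.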
Of course, what you should probably actually store isn't the city, but the postal/zip code. Table design is along these lines:
State:
Id  State
============
AL  Alabama
NY  New York
City:
Id  State_Id  City
========================
1   NY        New York
2   NY        Buffalo
Zip_Code:
Id  Code        City_Id
=========================
1   00001-0001  1
And then store a reference to Zip_Code.Id whenever you have an address. You want to know exactly which zip code a user has (claimed) to be a part of. Reasons include:
Taxes for retail (regardless of how Amazon plays out).
Addresses for delivery (There is a Bellevue in both Washington and New York, for example. Zip codes are different).
Social mapping. If you store it as 'user input' cities, you will not be able to (easily) analyze the data to find out things like which users live near each other, much less in the same city.
There are a number of other things that can be done about address verification, including geo-location, but this is a basic design that should help you in most of your needs (and prevent most of the possible 'invalid' anomalies).
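As a rough sketch of how the State / City / Zip_Code design could be queried, here is a version using Python's sqlite3 module (not MySQL, so it runs standalone). The `app_user` table, the `users_in_city` helper, and the second zip code are made up for illustration; the codes are placeholders, not real postal data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state (id TEXT PRIMARY KEY, state TEXT NOT NULL);
CREATE TABLE city (
    id INTEGER PRIMARY KEY,
    state_id TEXT NOT NULL REFERENCES state(id),
    city TEXT NOT NULL
);
CREATE TABLE zip_code (
    id INTEGER PRIMARY KEY,
    code TEXT NOT NULL UNIQUE,
    city_id INTEGER NOT NULL REFERENCES city(id)
);
-- Hypothetical user table: each user points at one zip code row.
CREATE TABLE app_user (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    zip_code_id INTEGER NOT NULL REFERENCES zip_code(id)
);
INSERT INTO state VALUES ('AL', 'Alabama'), ('NY', 'New York');
INSERT INTO city VALUES (1, 'NY', 'New York'), (2, 'NY', 'Buffalo');
-- Placeholder codes, not real postal data.
INSERT INTO zip_code VALUES (1, '00001-0001', 1), (2, '00002-0001', 2);
INSERT INTO app_user VALUES (1, 'alice', 1), (2, 'bob', 1), (3, 'carol', 2);
""")


def users_in_city(conn, city_name, state_id):
    """List user names whose zip code belongs to the given city and state."""
    rows = conn.execute(
        """
        SELECT u.name
        FROM app_user AS u
        JOIN zip_code AS z ON z.id = u.zip_code_id
        JOIN city AS c ON c.id = z.city_id
        WHERE c.city = ? AND c.state_id = ?
        ORDER BY u.name
        """,
        (city_name, state_id),
    ).fetchall()
    return [name for (name,) in rows]


new_yorkers = users_in_city(conn, "New York", "NY")
```

Because the state is part of the lookup, two cities that share a name in different states stay distinct, which is the point of the Bellevue example.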

Related

SQL (MySQLi) best fit for text comparison

I have the following scenario:
A DB table of addresses linked with a region ID. Based on the address, the workers sort the packets (by QR scanning) onto shelves and redistribute them to warehouses all around the capital city. So far so good, everything seems OK, but there is a problem:
My DB table (MySQL) has the following fields:
ID (auto increment, PK)
STREET_NAT (local name of the street, Cyrillic) UTF8
STREET_EN (English name of the street, Latin) UTF8
REGION_ID (number from 1 to 116 that describes which part of the town (warehouse) the package will be distributed to)
The problem is that the addresses are sometimes not written correctly and, as a bonus, they are sometimes in Cyrillic and sometimes in Latin.
I need to create a sorting system that finds the best fit for the street address and decides which part of the city the package will travel to. But people make mistakes: for example, they enter "Jul Vern st." instead of "Jules Verne str.", or they even write it in Cyrillic with mistakes.
So my question: is there some procedure or method in MySQL to guess the best fit for the address? I am thinking of a point system based on
php:
$query = "
SELECT REGION_ID FROM ADDRESSES WHERE STREET_NAT LIKE '%{$scanned_address}%'
OR STREET_EN LIKE '%{$scanned_address}%' ";
This system works in approximately 55% of cases, when the sender of the package does not make a mistake. I need to improve this SELECT to add something like "points" for how close the scanned address is to the database field value. The best fit wins, and its region ID will be shown so the package can be sorted to the corresponding shelf. I am talking about thousands of packets per day.
Any ideas?
Thanks
I have an idea: create a table with the correct street names in it (in Israel we have this data for free). Then you can compare what the user types against the correct street names in your database table, so it works as autocomplete.
You don't need to insert the values yourself. Just find a data source that can provide this data, and it will be autocomplete for the user.
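For input that has already been scanned, the "points" idea from the question could be sketched on the application side by scoring each known street against the scanned text. The sketch below uses Python's standard difflib rather than a MySQL feature. The sample rows, the `best_region` helper, and the 0.6 threshold are assumptions for illustration.

```python
import difflib

# Hypothetical sample rows: (STREET_NAT, STREET_EN, REGION_ID).
ADDRESSES = [
    ("ул. Жул Верн", "Jules Verne str.", 12),
    ("ул. Виктор Юго", "Victor Hugo str.", 47),
    ("ул. Пушкин", "Pushkin str.", 3),
]


def similarity(a, b):
    """Score from 0.0 to 1.0 for how alike two strings are, ignoring case."""
    return difflib.SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def best_region(scanned_address, min_score=0.6):
    """Return (region_id, score) for the closest street, or (None, score) if too weak."""
    best_id, best_score = None, 0.0
    for street_nat, street_en, region_id in ADDRESSES:
        # Compare against both spellings and keep the better of the two scores.
        score = max(
            similarity(scanned_address, street_nat),
            similarity(scanned_address, street_en),
        )
        if score > best_score:
            best_id, best_score = region_id, score
    if best_score < min_score:
        return None, best_score
    return best_id, best_score


region, score = best_region("Jul Vern st.")
```

Packets that fall below the threshold could go to a manual-sorting pile instead of a guessed shelf. Scanning every row per packet is a linear pass, so for a large street table some pre-filtering would probably be needed.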

Addresses stored in a database: should you normalize?

quick question.
consider the following table (UK):
CustomerID (PK)
First Name
Surname
House_No/name
street
City
Postcode
Would you split off address into another table?
basic business assumption is that a customer cannot have more than one address.
Originally I separated this off to look something like this:
Customer Table
CustomerID (PK)
FirstName
Surname
AddressID (FK)
Address Table
AddressID(PK)
Postcode(FK)
House_Number_name
Postcode Table:
Postcode (PK)
StreetName
CityID(FK)
City Table
CityID (PK)
CityName
Unless my assumption that a postcode uniquely identifies a street name and city is wrong, is this not in 3NF?
Personally, I would put the address in another table and link them together.
The business assumption/rule may change, and when you split these things you have the best chance of accommodating any possible business rule without a major redo.
For instance: oops, the customer has a different billing address than their shipping address, or oops, we need to know where something actually shipped last year even though the customer changed their address for this year, etc.
"basic business assumption is that a customer cannot have more than one address."
If this is an actual rule and not an assumption, I'd just keep them in the one table.
However, assume puts the 'ass' in 'u' and 'me'.
So play it safe and separate the address into another table.
But it looks like you are taking normalisation too far in your example.
Yes, I would split off the address into a separate table.
However, the reason is not normalization per se (under most circumstances). The primary reason is that it is a slowly changing dimension and it might be useful to look up previous addresses.
Whether you go ahead and normalize things like postal code is a matter of taste. In a more "amateur" database, I don't think it is necessary. However, for a large database of real customers, I would be inclined to split it off. It helps ensure that the postal codes are accurate. Also, they change over time. And you might be purchasing additional information at the postal code level, for instance.
It all depends on your requirements. As you mentioned above, a customer can't have more than one address, so there is no need for a separate one-to-one relationship: you can keep it in the same relation. But I suggest you break it out into a one-to-many relationship because of future requirements.
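The "look up a previous address" case from the answers above could be sketched like this, using Python's sqlite3 module so it runs standalone. The `valid_from` / `valid_to` columns, the table and helper names, and the sample addresses are assumptions for illustration; the postcodes are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    surname TEXT NOT NULL
);
-- One row per address a customer has had; valid_to IS NULL marks the current one.
CREATE TABLE customer_address (
    address_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    house_no_name TEXT NOT NULL,
    street TEXT NOT NULL,
    city TEXT NOT NULL,
    postcode TEXT NOT NULL,
    valid_from TEXT NOT NULL,   -- ISO dates compare correctly as text
    valid_to TEXT
);
INSERT INTO customer VALUES (1, 'Ann', 'Example');
""")


def change_address(conn, customer_id, house, street, city, postcode, on_date):
    """Close the current address (if any) and insert the new one."""
    conn.execute(
        "UPDATE customer_address SET valid_to = ? "
        "WHERE customer_id = ? AND valid_to IS NULL",
        (on_date, customer_id),
    )
    conn.execute(
        "INSERT INTO customer_address "
        "(customer_id, house_no_name, street, city, postcode, valid_from) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (customer_id, house, street, city, postcode, on_date),
    )


def address_on(conn, customer_id, on_date):
    """Return (street, city) that was valid for the customer on the given ISO date."""
    return conn.execute(
        "SELECT street, city FROM customer_address "
        "WHERE customer_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (customer_id, on_date, on_date),
    ).fetchone()


change_address(conn, 1, "12", "High Street", "Leeds", "LS1 1AA", "2023-01-01")
change_address(conn, 1, "3", "Mill Lane", "York", "YO1 1AA", "2024-06-01")
last_year = address_on(conn, 1, "2023-09-15")
current = address_on(conn, 1, "2024-07-01")
```

The "one address per customer" rule still holds at any point in time; the separate table only adds history.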

Best way to store users' addresses in a European database?

I would like to store users' addresses (country-region-town) in a database. This database is the core of a European app. I have a table of countries, and for each country I have two tables, one for regions and one for towns/cities.
So, what's the best way to store a user's address in the database?
I could create a different table for each country, named "users_italian", "users_french" and so on, with columns "idUsers" and "idTown", and connect each one to the respective town table. It's simple, but this approach does not convince me because there would be too many tables.
Alternatively, I could create a table "users_country" with "idUsers" and "idCountry" (foreign key) and store the region and town in another table "users_address" with "idUsers", "region" and "town", but with a string data type, for example:
users(idUsers:1, email:useremail@mail.com, psw:secrethashedpsw, token:secrettoken)
country (idCountry:1, Country: "United Kingdom")
users_country (idUsers:1, idCountry:1(refer to United Kingdom))
users_address (idUsers:1, region:"England", town:"London", zip: SW1H 0TL (example))
This approach does not convince me either, because the database would not be normalized, but it's simple to implement and the SELECT query is simple too. I could create separate tables "users_region", "users_town" and "users_zip", but storing these values as strings does not convince me.
Can anyone help me?
One table per country? NO!
Have a single users table. It would have either:
Plan A: address plus city+region+country.
Plan B: address plus an id linking to a (normalized) table of cities. Cities contains city+region+country.
Normalizing country, for example, adds complexity without adding any benefit.
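Plan B might look roughly like the following sketch, written with Python's sqlite3 module so it runs standalone. The column names beyond those in the question (`idCity`, `address`) and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Plan B: one cities table for all countries, holding city + region + country.
CREATE TABLE cities (
    idCity INTEGER PRIMARY KEY,
    city TEXT NOT NULL,
    region TEXT NOT NULL,
    country TEXT NOT NULL,
    UNIQUE (city, region, country)
);
-- A single users table for every country, linking to cities by id.
CREATE TABLE users (
    idUsers INTEGER PRIMARY KEY,
    email TEXT NOT NULL,
    address TEXT NOT NULL,
    idCity INTEGER NOT NULL REFERENCES cities(idCity)
);
INSERT INTO cities VALUES (1, 'London', 'England', 'United Kingdom');
INSERT INTO cities VALUES (2, 'Milano', 'Lombardia', 'Italy');
INSERT INTO users VALUES (1, 'useremail@mail.com', '1 Example Road', 1);
""")

# One join is enough to rebuild the full address for a profile page.
profile = conn.execute(
    """
    SELECT u.email, u.address, c.city, c.region, c.country
    FROM users AS u
    JOIN cities AS c ON c.idCity = u.idCity
    WHERE u.idUsers = ?
    """,
    (1,),
).fetchone()
```

Adding a new country then means adding rows to `cities`, not creating new tables.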

Database design structure

I'm new to database design and never took a class on it. I have a problem with structuring my database and assigning primary keys.
I have a list of cities, and each city has 5 types of public transport. In every city, each type of public transport has a different ticket price, main station, and CSV file with route coordinates etc. Then I need to calculate daily the average cost of transportation in every city for each type of public transport, based on route coordinates (distances), price, the time it takes etc.
Table cities:
city (Primary key)
Table public transport:
city, type of transport, ticket price, main station, file1, file2
Table results:
city, type of transport, date, cost
How should I connect these tables (assuming their structure is right)? In the public transport table, I think city should be a foreign key, but the type of transport will repeat for every city, so I don't think it can be the primary key of this table. The same goes for the results table.
The main idea is that you don't want to repeat yourself. Not only is it an overhead, it is also quite error prone when you want to change multiple entries that represent the same thing.
There are guidelines on database normalization which help you to ensure that your data is on a form that's easy to maintain and work with.
You don't need to become an expert in understanding which form does what, but being able to identify what should be kept separated is a must when it comes to database designing.
You should list what you know:
Different cities.
Different type of transport.
Different ticket prices.
Different stations.
If you create a separate table for each of those, then it'll be easy to link them together as rows in a table that represents something on a larger scale. Every entry should have a separate id that will be your primary key. You need to allow for, e.g., multiple cities with the same name, so the name cannot hold a unique value and cannot be the primary key.
E.g. now it would be easy to identify routes for a city, there can be multiple routes in a city
route_id | city_id | route_name
1        | 2       | test1
2        | 2       | test2
You then could add another table that represent which kind of transport is tied with this specific kind of route.
route_id | transport_id
1        | 3
2        | 4
You're then able to create a new table that holds points of stations that are a part of your route and you can even identify whether it's a main route or not.
route_connected_id | route_id | station_id | main_route
1                  | 1        | 2          | 1  // a main route
2                  | 1        | 3          | 0  // not a main route
And it goes on and on, separating the most simple entries allows you to create complex relationships where all you have to do is link ids.
This is the basic idea, which should hopefully get you started. Whether you find it helpful or not, I recommend that you take a look at the reading material I suggested, i.e. database normalization.
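Applied to the tables from the question, one possible layout is sketched below with Python's sqlite3 module so it runs standalone. It uses surrogate ids for cities and transport types, as suggested above, and a composite primary key on the linking tables, which is one way to answer the original question about keys rather than the only one. All table names, column names, and sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (city_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE transport_type (
    transport_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- One row per (city, transport type) pair; the pair itself is the primary key.
CREATE TABLE city_transport (
    city_id INTEGER NOT NULL REFERENCES city(city_id),
    transport_id INTEGER NOT NULL REFERENCES transport_type(transport_id),
    ticket_price REAL NOT NULL,
    main_station TEXT NOT NULL,
    PRIMARY KEY (city_id, transport_id)
);
-- One result per city, transport type, and day.
CREATE TABLE result (
    city_id INTEGER NOT NULL,
    transport_id INTEGER NOT NULL,
    day TEXT NOT NULL,
    cost REAL NOT NULL,
    PRIMARY KEY (city_id, transport_id, day),
    FOREIGN KEY (city_id, transport_id)
        REFERENCES city_transport(city_id, transport_id)
);
-- Two different cities may share a name because the key is city_id.
INSERT INTO city VALUES (1, 'Springfield'), (2, 'Springfield');
INSERT INTO transport_type VALUES (1, 'bus'), (2, 'tram');
INSERT INTO city_transport VALUES (1, 1, 2.0, 'Central'), (1, 2, 3.0, 'North');
INSERT INTO result VALUES (1, 1, '2024-01-01', 4.0), (1, 1, '2024-01-02', 6.0);
""")

# Average daily cost for one transport type in one city.
avg_bus_cost = conn.execute(
    "SELECT AVG(cost) FROM result WHERE city_id = ? AND transport_id = ?",
    (1, 1),
).fetchone()[0]
```

The transport type repeats across cities, but the pair (city_id, transport_id) never does, so the pair can serve as the key.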

Many Bool columns in database table

I recently took over a website where people can register to help tutor kids. Part of the user's details is which areas they could work, represented by postal codes. The problem is, my predecessor designed the site such that in the database there is a Boolean column for every postal code. As such, the user table has almost 270 columns and can be quite slow at times (plus it's a nightmare to administer).
Most users select only a few postal codes so there is surely a better way to do it. I was thinking about a varchar that could save the selected areas comma separated, e.g. 6043,8811,1234
Any advice from somebody who's had the same problem?
Both your predecessor's solution and yours are... strange.
You should simply have a relationship table between users and localities (assuming you have a locality table with a postalCode field and a surrogate key (id)).
UserLocality(userId int, localityId int)
So a locality could have many users, and a user could have many localities.
Comma-separated fields are a really bad idea when query time comes.
You should throw that entire idea out of your head and look into properly normalized data.
A possible solution to this problem would be a table for tutors, which has an id column to uniquely identify one tutor.
Then you would have a table for just Postal Codes (each with unique ids as well) and finally a tutor_availability table that creates one record of (t_id, pc_id) for each postal code a tutor wishes to offer their services, again with a unique id to avoid duplication risks in the case they can select the same location twice.
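A rough sketch of that three-table layout, using Python's sqlite3 module so it runs standalone. The tutor names, the `tutors_for` helper, and the UNIQUE constraint on (t_id, pc_id) are assumptions added for illustration; the postal codes are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tutor (t_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE postal_code (pc_id INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE);
-- One row per (tutor, postal code); the UNIQUE constraint blocks duplicate picks.
CREATE TABLE tutor_availability (
    id INTEGER PRIMARY KEY,
    t_id INTEGER NOT NULL REFERENCES tutor(t_id),
    pc_id INTEGER NOT NULL REFERENCES postal_code(pc_id),
    UNIQUE (t_id, pc_id)
);
INSERT INTO tutor VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO postal_code VALUES (1, '6043'), (2, '8811'), (3, '1234');
INSERT INTO tutor_availability (t_id, pc_id) VALUES (1, 1), (1, 2), (2, 2);
""")


def tutors_for(conn, code):
    """Return the names of tutors who offer their services in a postal code."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM tutor AS t
        JOIN tutor_availability AS ta ON ta.t_id = t.t_id
        JOIN postal_code AS pc ON pc.pc_id = ta.pc_id
        WHERE pc.code = ?
        ORDER BY t.name
        """,
        (code,),
    ).fetchall()
    return [name for (name,) in rows]


tutors_8811 = tutors_for(conn, "8811")
```

A tutor who covers three areas costs three small rows here instead of roughly 270 Boolean columns, and the lookup by postal code can use an index, which a comma-separated VARCHAR cannot.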