Best strategy for storing orders' addresses - MySQL

I have a 'strategy' question.
Thing is, we have a table of customers' addresses and customer orders. The structure is something like this (just an example, ignore field types etc.):
Address
id INT
line1 TEXT
line2 TEXT
state TEXT
zip TEXT
countryid INT
To preserve the historical validity of the data we are storing those addresses in a text field with the orders (previously it was done by reference, but this is wrong because if the address changes, all old orders change their delivery address too, which is wrong). E.g.:
Orders
id INT
productid INT
quantity INT
delivery_address TEXT
The delivery address is something akin to CONCAT_WS("\n", line1, line2, state, zip, country_name).
Everything is nice and dandy; however, it turns out customers need access to the historical data and want to export it in XML format with those lines split up properly again. Since sometimes there is no line2 or state or zip or whatever, how can we store this information in a way that lets us later decipher the 'label' of each line?
Storing it as a JSON-encoded array was suggested, but is that the best way? I thought about storing it as XML... or maybe creating those 6-10 extra columns and storing the address data with every order? Perhaps some of you have more experience dealing with this kind of thing and can point me in the right direction.
Thanks in advance!

Personally, I would model the addresses as a single table; every update to an address would generate a new row, which would be marked as the current address.
I guess you could allow deletes if there are no related orders, but it would be simpler to mark the old record as inactive.
This will allow you to preserve the relationship between orders and addresses, and to easily query the historic data at a later date.
See the Wikipedia entry for slowly changing dimensions.
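For illustration, a minimal sketch of that approach in MySQL (table and column names here are assumptions, not your actual schema):

CREATE TABLE address (
  id          INT AUTO_INCREMENT PRIMARY KEY,
  customer_id INT NOT NULL,
  line1       TEXT,
  line2       TEXT,
  state       TEXT,
  zip         TEXT,
  countryid   INT,
  is_current  TINYINT(1) NOT NULL DEFAULT 1   -- 1 = current version, 0 = historical
);

-- "updating" an address = retire the old row, then insert the new current one
UPDATE address SET is_current = 0 WHERE id = 42;
INSERT INTO address (customer_id, line1, line2, state, zip, countryid, is_current)
VALUES (7, '12 New Street', NULL, 'NY', '10001', 1, 1);

Orders then keep a plain foreign key to the exact address row they were shipped to, so historical orders stay untouched when the customer moves.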

The best way IMHO is to add history to the address table. This adds extra elements to its key (say, address_id plus {start_of_validity, end_of_validity}). The customer id then becomes a foreign key into the customer table. The orders table references only the address_id field (which is "stable" in time). New orders would reference the "current" row in address.
NB: I don't know JSON.
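A rough sketch of that keying scheme, just to illustrate (valid_from/valid_to and orders.created_at are assumed names):

CREATE TABLE address (
  address_id  INT NOT NULL,        -- "stable" identifier, shared by all versions
  valid_from  DATETIME NOT NULL,   -- start_of_validity
  valid_to    DATETIME NULL,       -- end_of_validity, NULL for the current row
  customer_id INT NOT NULL,        -- foreign key into the customer table
  line1       TEXT,
  line2       TEXT,
  state       TEXT,
  zip         TEXT,
  countryid   INT,
  PRIMARY KEY (address_id, valid_from)
);

-- the address as it was when a given order was placed
SELECT a.*
FROM orders o
JOIN address a ON a.address_id = o.delivery_address_id
WHERE o.id = 123
  AND o.created_at >= a.valid_from
  AND (a.valid_to IS NULL OR o.created_at < a.valid_to);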

You should store those as 6-10 extra fields, just like in the current address table. That way you have every piece of information at hand without having to parse anything.
Any other approach (concatenation, JSON, XML) will force you to parse the data whenever you need to access it.
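As a sketch (column names are just an assumption), that is nothing more than a handful of nullable columns copied onto the order at checkout time:

ALTER TABLE orders
  ADD COLUMN delivery_line1   TEXT,
  ADD COLUMN delivery_line2   TEXT,
  ADD COLUMN delivery_state   TEXT,
  ADD COLUMN delivery_zip     TEXT,
  ADD COLUMN delivery_country VARCHAR(64);

Missing parts simply stay NULL, so the XML export can label or skip each line without any parsing.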

When you say "previously it was done by reference, but this is wrong because if the address changes, all old orders change their delivery address too, which is wrong", it was not that wrong...
Funny, isn't it?
So, as proposed by others, addresses should (must?) be stored in an independent table. You'll then have different address types (invoicing, delivery), an address status (active, inactive) and a de facto address history log...

In order to be able to utilize the address data in the future you will definitely want to retain as much metadata as possible (meaning fields such as Address, City, State, and ZIP). Losing this data by pulling it all into a single line looks simpler and may conserve a small amount of space, but in the end it is not the best method. In fact, breaking it apart again is very difficult--much like separating first and last names out of a generic, one-size-fits-all "name" column. Storing the data in complete entries, using the 6-10 new fields (as mentioned), is the best way to go.
Even better would be standardizing the addresses (at least the US addresses) when they are first entered. That would ensure that the address is real and deliverable and eliminate shipping issues in the future. My thoughts: always retain as much of the data as possible, because storage is cheap and data is valuable.
In the interest of full disclosure, I am the founder of SmartyStreets. We provide street address verification.

Related

How to perform inexact matches on two data sets

I'm trying to compare two data sets (vendor masters) from two systems. We are moving to one system, so we want to avoid duplication. The issue is that the names, addresses, etc. could be slightly different. For example, the name might end in 'Inc' or 'Inc.', or the address could be 'St' or 'Street'. The vendor masters have been dumped to Excel, so I was thinking about pulling them into Access to compare them, but I'm not sure how to handle the inexact matches. The data fields I need to compare are: name, address, telephone number, federal tax ID (if populated), and contact name.
Here is how I would proceed. You will rarely get answers like this on Stack Exchange, since your question is not focused enough; this is a rather generic set of steps not specific to a particular tool (i.e. database or spreadsheet). As I said in my comments, you'll need to search for specific answers (or ask new questions) about the particular tools you use as you go. Without knowing all the details, Access can certainly be useful for doing some preliminary matching, but you could also use Excel directly, or even Oracle SQL since you have it as a resource.
Back up your data.
Make a copy of your data for matching purposes.
Ensure that each record for both sets of data have a unique key (i.e. AutoNumber field or similar), so that until you have a confirmed match the records can always be separately identified.
Create new matched-key table and/or fields containing the list of matched unique key values.
Create new "matching" fields and copy your key fields into these new fields.
Scrub the data in all possible matching fields (see the SQL sketch after these steps) by:
Removing periods and other punctuation.
Choosing standard abbreviations and replacing all variations with the same value in all records. Example: replace "Incorporation" and "Inc." with "Inc".
Trimming excess spaces from the ends and between terms.
Formatting all phone numbers exactly the same way, or better yet removing all spaces and punctuation for comparison purposes, excluding extension information: ##########.
Parsing and splitting multi-term fields into separate fields: Name -> First, Middle, Last Name fields; Address -> street number, street name, extra address info.
The parsing process itself can identify and reconcile formatting differences.
This allows easier matching on terms separately.
Etc., etc.
Once the matching fields are sufficiently scrubbed, match on the different fields.
Define matching priorities, that is, which field or fields are likely to produce reliable matches with the least amount of uncertainty.
For records containing tax ID numbers, that seems like the most logical place to start, since an exact match on that number should either be valid or indicate mistakes in your data.
For each type of match, update the matched-key fields mentioned above.
For each successive matching query, exclude records that already have a match in the matched-key table/fields.
Refine and repeat all these steps until you are satisfied that all matches have been found.
Add all non-matched records to your final merged record set.
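Purely as an illustration of the scrubbing step above, here is what the punctuation, abbreviation and phone-number cleanup could look like in SQL; the staging table and column names are made up, and the same transformations can be done with Excel formulas or Access update queries instead:

-- normalize the copied matching fields in a staging table
UPDATE vendor_staging SET
  match_name  = REPLACE(REPLACE(UPPER(TRIM(name)), '.', ''), ',', ''),
  match_phone = REPLACE(REPLACE(REPLACE(REPLACE(phone, '-', ''), ' ', ''), '(', ''), ')', '');

-- standardize one abbreviation across all records
UPDATE vendor_staging
SET match_name = REPLACE(match_name, 'INCORPORATED', 'INC');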
You never said how many records you have. If possible, it may be worth your organization's time to manually verify the automated matches by listing them side by side and manually tweaking them when needed.
But even if you successfully pair non-exact matches, someone still needs to decide which record to keep for the merged system. I imagine you might have matches on company name and tax ID--essentially verifying the match--but still have different addresses and/or contact names. There is no technical answer that will help you know which data to keep or discard. Once again, human review should be done to finalize the merged records. If you set this up correctly, a couple of human eyeballs could probably go through thousands of records in just a day.

Many Bool columns in database table

I recently took over a website where people can register to help tutor kids. Part of the user's details is which areas they can work in, represented by postal codes. The problem is, my predecessor designed the site such that the database has a Boolean column for every postal code. As a result, the user table has almost 270 columns and can be quite slow at times (plus it's a nightmare to administer).
Most users select only a few postal codes, so there is surely a better way to do it. I was thinking about a varchar that could store the selected areas comma-separated, e.g. 6043,8811,1234.
Any advice from somebody who's had the same problem?
Both your predecessor's solution and yours are... strange.
You should simply have a relationship table between users and localities (assuming you have a locality table with a postalCode field and a surrogate key (id)):
UserLocality(userId int, localityId int)
so a locality could have many users, and a user could have many localities.
Comma-separated fields are a really bad idea when query time comes.
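A minimal sketch of that junction table and a typical lookup (assuming user and locality tables that each have an id primary key):

CREATE TABLE UserLocality (
  userId     INT NOT NULL,
  localityId INT NOT NULL,
  PRIMARY KEY (userId, localityId),
  FOREIGN KEY (userId)     REFERENCES user(id),
  FOREIGN KEY (localityId) REFERENCES locality(id)
);

-- all users available in a given postal code
SELECT u.*
FROM user u
JOIN UserLocality ul ON ul.userId = u.id
JOIN locality l      ON l.id = ul.localityId
WHERE l.postalCode = '6043';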
You should throw that entire idea out of your head and look into properly normalized data.
A possible solution to this problem would be a table for tutors, which has an id column to uniquely identify each tutor.
Then you would have a table for just postal codes (each with a unique id as well) and finally a tutor_availability table that holds one (t_id, pc_id) record for each postal code in which a tutor wishes to offer their services, with a unique key over that pair to avoid duplicate rows in case they select the same location twice.
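For example (all names hypothetical), the no-duplicates rule can be enforced by the table itself with a unique key over the pair:

CREATE TABLE tutor_availability (
  id    INT AUTO_INCREMENT PRIMARY KEY,
  t_id  INT NOT NULL,     -- references tutor.id
  pc_id INT NOT NULL,     -- references postal_code.id
  UNIQUE KEY uq_tutor_postcode (t_id, pc_id)
);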

How should an address from the user be stored in the database?

Can anyone explain how an address taken from the user is generally stored in the database? Will it be one of these, or something else?
Take each line as a separate parameter from the user using different text boxes and store them in different columns of a table.
Take it as one text area and store it in a single column of a table with the field name 'address'.
Take it as a text area and store it by parsing that text into different columns.
I assume the address contains door no, street name, area, city, country, and zip code.
Also, please tell me which of the above is the preferable way to store it...
Think about what you want to do with the information to begin with. If you have no real need to ever use the address, but you are displaying it purely for informational purposes, then you can just provide a single textarea.
If, however, you are going to be providing some geocoding service which needs to be able to pinpoint their address, then you will most certainly need: postcode, city, town, etc.
When I store address information, this is what my schema looks like:
Address
City
StateProvince
zipPostCode
countryId
woeId (as per http://developer.yahoo.com/geo/geoplanet/data/ )
It is obviously just up to you. Think about how you need to best make use of the data, and make it as easy to enter as you require.
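Turned into a table definition, that schema might look roughly like this (types and lengths are guesses; adjust to your data):

CREATE TABLE user_address (
  id            INT AUTO_INCREMENT PRIMARY KEY,
  user_id       INT NOT NULL,
  address       VARCHAR(255),
  city          VARCHAR(100),
  stateProvince VARCHAR(100),
  zipPostCode   VARCHAR(20),
  countryId     INT,
  woeId         INT           -- Yahoo GeoPlanet "Where On Earth" id
);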

Should I normalize this MySQL table?

I'm designing a website for a friend and I'm not sure what the best way to go is with regard to one of my database tables.
To give you an idea, this is roughly what I have:
Table: member_profile
`UserID`
`PlanID`
`Company`
`FirstName`
`LastName`
`DOB`
`Phone`
`AddressID`
`website`
`AllowNonUserComments`
`AllowNonUserBlogComments`
`RequireCaptchaForNonUserComments`
`DisplayMyLocation`
The last four
AllowNonUserComments
AllowNonUserBlogComments
RequireCaptchaForNonUserComments
DisplayMyLocation
(and possibly more such boolean fields to be added in the future) will control certain website functionality based on user preference.
Basically, I'm not sure if I should move those fields to a
new table: member_profile_settings
`UserID`
`AllowNonUserComments`
`AllowNonUserBlogComments`
`RequireCaptchaForNonUserComments`
`DisplayMyLocation`
or if I should just leave them as part of the member_profile table, since every member is going to have their own settings.
The target is roughly 100,000 members in the long run and 10k to 20k in the short run. My main concern is database performance.
And while I'm at it, question #2: would it make sense to move the member's contact information (street address, city, state, zip, phone, etc.) into the member_profile table instead of having an address table and the AddressID reference like I currently have?
Thank you
I would say "no" and "yes, but" as the answers to 1) and 2). For #1, your queries are going to be a lot easier to manage if you create columns for each preference. The best systems I've worked with were done that way. Moving the preferences into a separate table with "user, preference, value" triples leads to complex queries that join multiple tables just to check a setting.
For #2: there's no reason to put the address in another table, because the single "AddressID" column means there's just one address per member anyway, and again, it's just going to complicate the queries. If you turn it around and have an address table that embeds user IDs, then that might make sense; it makes even more sense to do phone numbers that way, since people often have multiple phone numbers.
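If you did turn it around, the phone-number case could look something like this (a sketch with assumed names): a child table keyed by the user, so one member can have several numbers.

CREATE TABLE member_phone (
  id        INT AUTO_INCREMENT PRIMARY KEY,
  UserID    INT NOT NULL,          -- points back to member_profile.UserID
  PhoneType VARCHAR(20),           -- e.g. 'home', 'mobile', 'work'
  Phone     VARCHAR(30) NOT NULL
);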
If each member in the database has exactly ONE value for each of the attributes you have listed, then your database is already normalized and thus in a quite convenient form. So, to answer #1, moving these fields to a different table would improve nothing and just make querying more difficult.
As for #2, if you wanted to contemplate the possibility of a member having multiple addresses or phone numbers, you should definitely put those in different tables, allowing many-to-one relationships. This might also make sense if you expect that a number of users will share the same address; this way, you will not be duplicating information by having to store all the same address information for multiple users, you would just reference an addresses table that would have the relevant information one time per address.
However, if you need neither multiple addresses per member nor multiple members per address, then putting the addresses information in another table is just unnecessary complexity. Which solution is more convenient depends on the needs of your specific application.
Since each member has exactly one value in this table, it's already normalized. However, for query efficiency, some denormalization is sometimes worth considering.
Apart from the ID field, the other columns could be separated into two groups: a profile group and a settings group. If your website usually uses these two groups of data separately, you should consider having a new table for each usage.
For example, if the profile fields only show on the profile page while the settings fields are used across the whole site, it's not necessary to look up the profile fields all the time.
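A rough sketch of that split (a 1:1 relationship on UserID, using the settings table you already proposed; the default values are just placeholders):

CREATE TABLE member_profile_settings (
  UserID                           INT PRIMARY KEY,
  AllowNonUserComments             TINYINT(1) NOT NULL DEFAULT 1,
  AllowNonUserBlogComments         TINYINT(1) NOT NULL DEFAULT 1,
  RequireCaptchaForNonUserComments TINYINT(1) NOT NULL DEFAULT 0,
  DisplayMyLocation                TINYINT(1) NOT NULL DEFAULT 0,
  FOREIGN KEY (UserID) REFERENCES member_profile(UserID)
);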

Shall I put contact information in a separate table?

I'm planning a database that has a couple of tables containing plenty of address information: city, zip code, email address, phone #, fax #, and so on (about 11 columns' worth of it). One of them is an organizations table containing (up to) 2 addresses (the legal contact and the contact that should actually be used), plus every user has the same information tied to them.
We are also going to have to run some geolocation stuff on those addresses (like finding every address that's within X kilometers of another address).
I have a bunch of options, each with its own problem:
I could put all the information inside every table, but that would make for tables with a very large number of columns which I'd have problems indexing, and if I change my address format it'll take a while to fix.
I could put all the information inside an array and serialize it, then store the serialized information in one field; this has the same problem as the previous method, with slightly fewer columns but much less accessibility through MySQL queries.
I could create a separate table with address information and link it to the other tables either by
putting an address_id column in the users and organizations table
putting related_id and related_table columns in the addresses table
That should keep stuff tidier, but it might create some unforeseen problems with excessive joining or whatever.
Personally I think that solution 3.2 is the best, but I'm not too confident about it, so I'm asking for opinions.
Option 2 is definitely out, as it would put the filtering logic into your code instead of letting the DBMS handle it.
Choosing between options 1 and 3 will depend on your needs.
If you need fast access to all the data, and you usually access both addresses along with the organization information, then you might consider option 1. But this will make querying difficult (i.e. slow) if the table gets too big in MySQL.
Option 3 is good provided you index the tables correctly.
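For instance, with variant 3.2 (the related_id / related_table approach you lean towards), a composite index on the linking columns keeps the joins cheap; a sketch with assumed column names:

CREATE TABLE addresses (
  id            INT AUTO_INCREMENT PRIMARY KEY,
  related_table VARCHAR(32) NOT NULL,   -- e.g. 'users' or 'organizations'
  related_id    INT NOT NULL,
  line1         VARCHAR(255),
  city          VARCHAR(100),
  zip           VARCHAR(20),
  phone         VARCHAR(30),
  KEY idx_related (related_table, related_id)
);

-- all addresses for organization 42
SELECT * FROM addresses
WHERE related_table = 'organizations' AND related_id = 42;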