Designing a quick-lookup address database - mysql

If I were to design a database in MySQL to the following specifications:
1) Over 25 million records
2) Columns of house number, street, town, city, postcode
3) street, town, city and postcode need to be fulltext-searchable (on the front-end side, the search will be running on AJAX off a text-input field with immediate drop-down results)
How would I design the above?
I was thinking of working with a single table - is this a bad idea? I am not sure whether to normalize across different tables, given this is address data. I am also thinking that if I work with a single table, I would put a FULLTEXT index across the searchable fields.
I have not worked with such a large DB before. Is the above a bad idea?
UPDATE #1:
Decided to normalize the street and postcode columns, which are the only ones actually being searched on (re-checked the original specification). Did some quick math: street-name cardinality is 2% and postcode cardinality 6% of the total data set, so I think this is the best way forward.
Currently running the import of 29 million rows - will take about 5 hours. Will update again later on performance tests for the sake of wrapping up this question.

Your design sounds reasonable. But. Are you sure that the addresses in the database all conform to the "house number, street, town, city, postcode" format? What about "c/o" ("care of") addresses? Unit/apartment/floor/suite numbers? What about specific building names ("Barack Obama, White House, Washington, DC")?
In the United States, there are various exceptions to this address layout. For example, there is something called "Rural Routes", whose format is "RR [number] BOX [number]". There are PO Boxes and military addresses. In fact, I just learned that the US Post Office has a publication describing all sorts of different address formats.
A more general form is something like "Address Line 1", "Address Line 2", "City", "Post Code". There are services that standardize addresses for much of the world, and even software available for this purpose.
Your idea of using full text search is a good idea. When looking for a partial match on a street name, for instance, it will be much faster.
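To make the idea concrete: what the AJAX dropdown ultimately needs from the index is word-prefix lookup over the searchable columns. Here is a minimal sketch in Python (the sample rows are invented) of the behaviour a FULLTEXT or prefix index provides:

```python
from collections import defaultdict

def build_prefix_index(addresses, fields=("street", "town", "city", "postcode")):
    """Toy stand-in for a FULLTEXT index: map each word prefix to row ids."""
    index = defaultdict(set)
    for row_id, row in enumerate(addresses):
        for f in fields:
            for word in row[f].lower().split():
                for i in range(1, len(word) + 1):
                    index[word[:i]].add(row_id)
    return index

addresses = [
    {"street": "High Street", "town": "Camden", "city": "London", "postcode": "NW1 8QP"},
    {"street": "High Road",   "town": "Ilford", "city": "London", "postcode": "IG1 1TR"},
    {"street": "Mill Lane",   "town": "Leeds",  "city": "Leeds",  "postcode": "LS1 4DY"},
]
idx = build_prefix_index(addresses)
print(sorted(idx["high"]))  # -> [0, 1]: rows with a word starting "high"
```

At 29 million rows you would of course not build this in application memory - it is only meant to show the lookup shape the dropdown query performs on each keystroke.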

Related

SQL (MySQLi) best fit for text comparison

I have the following scenario:
A DB table of addresses linked to a region ID. Based on the address, workers sort packets (via QR scanning) onto shelves and redistribute them to warehouses all around the capital city. So far so good, everything seems OK, but there is a problem:
My DB table (MySQL) has the following fields:
ID (auto increment, PK)
STREET_NAT (local name of the street - Cyrillic) UTF8
STREET_EN (English name of the street - Latin) UTF8
REGION_ID (a number from 1 to 116 that describes in which part of the town (warehouse) the package will be distributed)
The problem is, sometimes the addresses are not correctly written, and as a bonus they are sometimes in Cyrillic and sometimes in Latin.
I need to create a sorting system that analyzes the best fit for the street address and decides to which part of the city the package will travel. But people make mistakes (for example, they enter not "Jules Verne str." but "Jul Vern st.", or the Cyrillic equivalent with mistakes).
So my question: does there exist some procedure/method in MySQL to guess the best fit for the address? I am thinking of a point system, based on this PHP:
$query = "
    SELECT REGION_ID FROM ADDRESSES
    WHERE STREET_NAT LIKE '%{$scanned_address}%'
       OR STREET_EN LIKE '%{$scanned_address}%'";
// note: $scanned_address should be escaped, or the query parameterized,
// before being interpolated, to avoid SQL injection
This system works in approximately 55% of cases, i.e. when the sender of the package makes no mistake. I need to improve this select to add something like "points" for how close the scanned address is to the database field value. The best fit wins, and the region ID is shown so the packet is sorted to the corresponding shelf. I am talking about thousands of packets per day.
Any ideas?
Thanks
I have an idea: create a table with the correct street names in it (in Israel we have this data for free), then compare what the user types against the correct street names in that table. So it becomes an autocomplete.
You don't need to insert the values yourself - just find a data source that can provide this data, and it becomes autocomplete for the user.
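If such a canonical street table is available, the "points" idea from the question can be sketched roughly like this (Python for illustration; the street list and region IDs are invented, and difflib's ratio stands in for a proper Levenshtein function, which MySQL does not provide out of the box):

```python
import difflib
import unicodedata

# Hypothetical canonical street table: (street_en, region_id).
STREETS = [
    ("Jules Verne str.", 14),
    ("Victor Hugo blvd.", 7),
    ("Liberty sq.", 101),
]

def normalize(s):
    """Lowercase, strip accents, drop punctuation - reduces spelling noise."""
    s = unicodedata.normalize("NFKD", s).lower()
    return "".join(ch for ch in s if ch.isalnum() or ch.isspace())

def best_region(scanned, min_score=0.6):
    """Score every known street against the scanned text; best match wins."""
    scored = [
        (difflib.SequenceMatcher(None, normalize(scanned), normalize(street)).ratio(),
         region)
        for street, region in STREETS
    ]
    score, region = max(scored)
    return region if score >= min_score else None

print(best_region("Jul Vern st."))  # -> 14, despite the misspelling
```

The min_score cutoff routes truly unrecognizable addresses to manual handling instead of guessing. For Cyrillic input you would transliterate to Latin (or score against STREET_NAT too) before comparing.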

Multiple LIKE queries in Access

I am using Access.
Scenario
At work I've got a table with around 300k rows which maps Person IDs to House IDs along with the associated information (First Name, Last Name, City, "Street + StreetNumber", Postal Code). Each person can live in n houses, and each house can house n of those people.
When I am visited by various persons I get a table. This table is filled in by a human being, so there are no IDs in it, and it unfortunately often has spelling errors and missing information. It should contain "First Name", "Last Name", "Street & Nr", "City", "Postal Code".
To integrate the data I need the IDs from the persons. To counter the spelling error problem I want to build a table which gives me results ordered by "matching priority".
The hand-filled table is called tbl_to_fill and has an empty Person-ID column, an indexed autonumber, and First Name, Last Name, Street & Nr, City and Postal Code. The table with the relation information is called tbl_all.
So if I find an exact match (with a join query) from tbl_to_fill to tbl_all for "First Name", "Last Name" and "Postal Code" or "First Name", "Last Name", "Street & Nr", "City" it gets "matching priority" 1. If I find an exact match with only "Last Name", "Postal Code" or "Last Name", "City", "Street & Nr" I get a "matching priority" 2. And there are few more levels.
Then comes the tricky part:
Now I build a "tbl_filter" from "tbl_to_fill" with tweaked information: the street numbers are cut off, common spelling variations are replaced with a '*' (for example a common variation in German names: ph - f, as in Stefan and Stephan), city names are shortened after the last space " " found, and some more.
With this table I look for the same criteria as stated above, but with a "LIKE '*' & tbl_filter.Field & '*'" - query. And they get matching priority same as above + 10.
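As a sketch of what that tbl_filter transformation might look like (Python used purely for illustration; the exact regexes and the ph/f rule are assumptions based on the description above):

```python
import re

def filter_value(street):
    """Build the wildcard pattern used in the LIKE query: cut the trailing
    house number, then wildcard the common ph/f spelling variation."""
    s = re.sub(r"\s*\d+\w*\s*$", "", street)  # cut trailing street number
    s = re.sub(r"ph|f", "*", s, flags=re.I)   # ph/f spelling variants
    return s

print(filter_value("Stephanstrasse 12"))  # -> Ste*anstrasse
```

Matching with LIKE "*" & filter_value(...) & "*" then catches both "Stephanstrasse" and "Stefanstrasse".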
Now those join queries and the Like queries are all aggregated via a union query, let's call this Query 001 quni All rows.
I got this to work exactly like I want to, but it takes AGES, every time I run the last query.
My Question
Has someone done something like this? What can I do to speed up the process?
As many of my matching criteria expect First Name and Last Name to fit, plus some more fields, should I first extract only the matching rows from "tbl_all" via a make-table query and then run the corresponding queries against that?
Should I use regex instead of like queries on a field which contains all information concatenated by a "-"?
Is there a better way to assign those priorities? Maybe all in one query via an IIf function?
SELECT ...,
       IIf(tbl_all.[First Name] = tbl_to_fill.[First Name], 1,
           IIf(...)) AS matching_priority
FROM tbl_all;
I am a self-taught Access developer, so I often have trouble knowing which approach is the most optimized.
I use VBA regularly and don't shy away from it, so if you've got a solution via VBA, let me know.
I think you might simplify your approach a bit if you were to use fuzzy text search. The common way of doing this is the Levenshtein distance: the number of single-character edits it takes to turn one string into another. A nice implementation of Levenshtein is here:
Levenshtein Distance in Excel
In this manner, you can find the closest possible match on city, street, first name, last name, etc. You can even set up a "reasonable" limit, such as treating any record where the Levenshtein distance > 10 as "unreasonable". I threw out 10 as an example; the right cutoff will vary depending on your data.
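For reference, here is a compact Python version of the same dynamic-programming recurrence the linked Excel implementation uses (it ports to VBA in a straightforward way):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Stefan", "Stephan"))  # -> 2 (substitute f->p, insert h)
```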
Some optimization notes:
Based on the fact that you have 300,000 rows, I would go so far as to say you still need to narrow your results a bit. Reading all 300,000 records for each match is unreasonable. For example, if you had state (which I see you don't), then a reasonable limit is to say the state must match. That takes your 300,000 down to a much lower number. You may want to also assume that the first letter of the last name must match. That will narrow it down further. Etc, etc.
If you can, I would use a full RDBMS instead of Access to store the data and let the DB server do the heavy lifting. PostgreSQL, in particular, has nice fuzzy-search capabilities available through extensions such as fuzzystrmatch and pg_trgm.
One thing I have done in similar situations is to extract the first few characters of the last name, the first one or two characters of the first name, and the postal code, and write it from both tables to temp tables, and do the matching query on the truncated tables. After some tinkering with how many characters to extract, I can usually find a balance between speed and false positive matches that I can live with, then have a human review the resulting list. The speed difference can be significant - if instead of matching Schermerstien, Stephan to Schermerstien, Ste*an, you are matching Scher, St to Scher, St, you can see the processing advantage. But it only works if there is a small intersection between the tables and you can tolerate a human review step.
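The truncated-key idea in that last paragraph can be sketched like this (Python for illustration; the rows are invented, and the key lengths 5 and 2 are the part you tinker with):

```python
def block_key(last_name, first_name, postal_code):
    """Truncated blocking key: a cheap equality match that narrows candidates
    before any expensive fuzzy comparison. Lengths (5, 2) are tunable."""
    return (last_name[:5].lower(), first_name[:2].lower(), postal_code.strip())

# Hypothetical rows from tbl_all and tbl_to_fill: (last, first, postal code).
master = [("Schermerstien", "Stephan", "10115"), ("Miller", "Anna", "20095")]
filled = [("Schermerstien", "Stefan", "10115")]

master_keys = {block_key(*row): row for row in master}
candidates = [master_keys.get(block_key(*row)) for row in filled]
print(candidates)  # the Schermerstien row matches despite Stefan/Stephan
```

Joining on this key in Access (instead of LIKE against every row) is what produces the big speed-up; the human review then only sees the short candidate list.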

Grouping by part of a postcode

I am developing a php/mysql system for a legal advice agency. Clients are recorded in a table ‘clients’ which contains (amongst others) the columns clientid and postcode. (Note this is in the UK)
A client then registers for advice and a matter is opened. The matters table contains the columns matterid, clientid and legalid.
Legalid refers to a table ‘legal’ which has columns legalid and legal (legal is the type of legal advice e.g. employment, housing etc)
What I need to be able to do is to count the number of people receiving advice in particular areas grouped by the first part of the UK postcode. I think I can do this except I don’t know how to do the postcode grouping as the first part could be 2, 3 or 4 characters. For example, the postcodes might be E2 6TY or E14 7YU - I want to group by E2 and E14 etc. Also, in some cases a client doesn't want to provide the whole postcode so only the 1st part is entered.
Has anybody any guidance as to how I might be able to do this grouping?
Many thanks
You can do
group by SUBSTRING_INDEX(postcode, ' ', 1)
This takes everything before the first space. Conveniently, when a value contains no space - as with your clients who entered only the first part of the postcode - SUBSTRING_INDEX returns the whole string, so those rows group correctly too. There are lots of posts on Stack Overflow where people use and give examples of SUBSTRING_INDEX, so just do a search for it.
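For clarity, the same splitting logic as SUBSTRING_INDEX, shown in Python (illustration only):

```python
def outward_code(postcode):
    """First part of a UK postcode: everything before the space,
    or the whole string when only the first part was entered."""
    return postcode.strip().upper().split(" ")[0]

for pc in ["E2 6TY", "E14 7YU", "E14"]:
    print(outward_code(pc))
# -> E2, E14, E14: E2 and E14 clients group separately, and a
#    partial postcode falls into the right group automatically
```

One caveat for both versions: a postcode entered without any space (e.g. "E26TY") would come back whole, so it may be worth validating or reformatting postcodes on input.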

Method to Match Multiple Columns Dynamically in one Table to another Table (e.g. Address, City, State, Zip)

I'm using SQL Server 2008 with T-SQL syntax. I know a little bit of C# and Python, so I could possibly venture down those paths to make this work.
The Problem:
I have multiple databases where I have to match customers within each Database to a "Master Customer" file.
It's basically mapping the customers at those distributors to the supplier level.
There are 3-8 million customers in each database (8 of them) that have to be matched to the Supplier Table (1,800 customers).
I'd rather not play an Excel "matching game" for 3-4 weeks (30 million customers). I need some shortcuts, as doing this by hand is exhausting.
This is one Distributor Table:
select
master_cust_Num,
master_cust_name,
cust_shipto_name,
cust_shipto_address,
cust_shipto_address_2,
cust_shipto_city,
cust_shipto_state,
cust_shipto_zip
from
Distributor.sales
group by
master_cust_Num,
master_cust_name,
cust_shipto_name,
cust_shipto_address,
cust_shipto_address_2,
cust_shipto_city,
cust_shipto_state,
cust_shipto_zip
The original post included a screenshot of a few sample rows from this query, and a second screenshot of the supplier table they have to be matched against.
Basically I need a function that will search the address lines in the distributor DBs for matches or the closest match(es). I could do a case when address like '%birch%' to find all 'Birch' street matches where distributor.zip_code = supplier.zip_code, but I'd rather not have to write each of those strings into the LIKE statements by hand.
Is there an XML trick that will pick out parts within the multiple distributor address columns to match that of the supplier address column? (or maybe 1 at a time, if that's easier)
Can I do a like '%[0-9][A-Z]%' type of search? I'm open to best practices (will award pts) on how to tackle this beast. I guess I'm not even sure how to tackle this other than brute force: grouping by zip codes and working through street matches from there.
The matching/searching LIKE function (or XML, or whatever) would have to dynamically take one value, say "Birch St", from the supplier address column and find all matches such as "Birch Ave", "Birch St" and "Birch" that have the same zip.
Do you have SQL Server Integration Services? If so, you could use the SSIS Fuzzy Lookup to look up customers in the master customer table by address.
If you're not familiar with that feature of SSIS, it will let you specify columns to compare, how many potential matches to return, and a threshold that a comparison has to meet to be reported to you.
Here's an article specifically about matching addresses: Using Fuzzy Lookup Transformations in SQL Server Integration Services
If you don't have it available ... well, then we start getting into implementing your own fuzzy-matching algorithm. Which is quite possible: it's just a bit more laborious.
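A bare-bones version of such a roll-your-own matcher - block on zip first, then rank the survivors by token overlap - might look like this (Python for illustration; the supplier and distributor rows are invented):

```python
# Hypothetical supplier and distributor rows: (name, address, zip).
suppliers = [("Acme Corp", "120 Birch St", "30301"),
             ("Bolt Ltd",  "9 Oak Ave",    "30301")]
distributor_rows = [("ACME CORPORATION", "120 Birch Street", "30301")]

def tokens(s):
    return set(s.upper().replace(",", " ").split())

def score(a, b):
    """Jaccard overlap of address tokens - crude but surprisingly effective."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

for name, addr, zip_ in distributor_rows:
    # Block on zip first, then rank the surviving suppliers by token overlap.
    best = max((s for s in suppliers if s[2] == zip_),
               key=lambda s: score(addr, s[1]))
    print(name, "->", best[0])  # ACME CORPORATION -> Acme Corp
```

Because the zip block shrinks each comparison to a handful of suppliers out of 1,800, this stays tractable even across 30 million distributor rows; a normalization step (St/Street, Ave/Avenue) would tighten the scores further.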

Should I "quick list" my drop-down list of countries?

My members can choose from a list of countries.
The A-Z list starts at Afghanistan and goes through many obscure countries.
Should I get the top ten countries and "quick-list" them at the top of the list?
Or is this seen as some sort of cultural superiority yadda yadda?
I'm using PHP/MySQL (trying to get a programming angle there)
I think it makes it harder to find your country.
Germany is (I think) one of those top-10 countries, and I always have the problem that I don't know where to look:
at the top of the quick list, under Germany, under Deutschland, ...
I think the easiest is a plain alphabetically ordered country list.
If the list is very long, you can start typing the first letter to jump close to your country.
Another solution is to have the list only show countries that have been given as answers in the past, plus an "Other" option that expands the list (or shows a second list) with the full set.
Thus, if you've never had a visitor from, say, Kyrgyzstan, it wouldn't appear in the list at all. The first time a Kyrgyzstani user comes to the site, they'd choose "Other" on the list, and only then would you show the full list. After that, though, since Kyrgyzstan had been answered, you would show it in the initial list. (The threshold for that doesn't have to be 1 ... it can be any number you like, and you'd want to set it so that on balance, many more people are helped by the omission than are hurt by having to choose "Other".)
You could also include a population (or internet-using population) metric and automatically show all countries above a certain size, so the big ones like Germany would be included even before their first users start showing up. Or, if you know you'll have a lot of users from certain countries, for whatever reason, you can have a list of countries that are manually included as well.
Overall: don't underestimate the benefit you'll get by trimming down the list. It's little things like that that make a user interface "great" rather than "ok".
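The threshold logic described above is small enough to sketch directly (Python, with illustrative names and counts):

```python
def visible_countries(all_countries, answer_counts, threshold=1):
    """Countries answered at least `threshold` times, alphabetical, plus Other."""
    shown = sorted(c for c in all_countries if answer_counts.get(c, 0) >= threshold)
    return shown + ["Other"]

counts = {"Germany": 240, "France": 31, "Kyrgyzstan": 0}
print(visible_countries(["Germany", "France", "Kyrgyzstan"], counts))
# -> ['France', 'Germany', 'Other']
```

Bumping the threshold above 1 (or seeding counts from a population metric, as suggested) trades a shorter list against more users having to click "Other" once.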
You could track the most-picked countries from your list and put them at the beginning. After this sublist, add a separator line and then the alphabetically ordered list of the rest of the countries.
After some time you could remove the logic and 'freeze' the sublist of the n most popular countries.
Why not keep the list intact, but pre-select the country your visitor is in using geolocation?
Determine the user's physical location based on their IP address.
You can get started at http://www.ip2location.com/, but there are other free choices out there.
Do a Google search for city from "ip address", or country from "ip address".
Be aware that there are some physical divisions that are not apparent if you select a simple country. For example, France includes the French Caribbean, and if you are calculating shipping you could be thrown off.
Apart from using geolocation, it is important to use the same list (and it is the same standardized list) used by Amazon, Google, Apple, etc. To see it, just go start ordering a product on Amazon and change your shipping address country.
The reason is that people who live in a given country are already used to selecting their country from this particular list, and know how to do it quickly. Any modifications that you make to the list, while well-meaning, will just slow them down.
Remember, people spend 99.99% of their time at other web sites. They know how to be efficient using the tools they've already come across. You should emulate those other sites whenever there is a standard way of doing things -- anything else will confuse your users.
I can also recommend using the country names in the language that your site displays. I am always annoyed by a country list on an English website that uses "Deutschland" for Germany. When I am on an English website, I am intuitively looking for the English country name.
An interesting question - I too have wondered about this usability issue many times. Why not create a continent category, so users pick a continent before selecting a country? Would that be faster (or more convenient) for users? Of course your list would then be a level deeper. Or, with Ajax, this opens up many opportunities for new usability ideas.
Personally, I don't like it. I'd prefer that countries were just listed alphabetically instead of "United States, Canada, Afghanistan, ..."
Perhaps if you know that 95% of your traffic is coming from one particular country then it might be defensible.