SQL (MySQLi) best fit for text comparison - mysql

I have the following scenario:
A DB table of addresses linked to a region ID. Based on the address, workers sort the packets (QR scanning) onto shelves and redistribute them to warehouses all around the capital city. So far so good, everything seems OK, but there is a problem:
My DB table (MySQL) has the following fields:
ID (*auto increment, PK*)
STREET_NAT (*local name of the street, Cyrillic*) UTF8
STREET_EN (*English name of the street, Latin*) UTF8
REGION_ID (*a number from 1 to 116 that describes in which part of town (which warehouse) the package will be distributed*)
The problem is that the addresses are sometimes not written correctly, plus, as a bonus, they are sometimes in Cyrillic and sometimes in Latin.
I need to create a sorting system that analyzes the best fit for the street address and decides to which part of the city the package will travel. But people make mistakes (for example they enter "Jul Vern st." instead of "Jules Verne str.", or the Cyrillic version with typos).
So my question: does there exist some procedure/method in MySQL to guess the best fit for the address? I am thinking of a point system based on
php:
// note: $scanned_address is user input and should be escaped or bound
// as a parameter to avoid SQL injection
$query = "
    SELECT REGION_ID FROM ADDRESSES
    WHERE STREET_NAT LIKE '%{$scanned_address}%'
       OR STREET_EN LIKE '%{$scanned_address}%'";
this system works in approx. 55% of cases, i.e. when the sender of the package does not make a mistake. I need to improve this select to add something like "points" for how close the scanned address is to the database field value. The best fit wins, and the region ID is shown so the packet can be sorted to the corresponding shelf. I am talking about thousands of packets a day.
Any ideas?
Thanks

I have an idea: create a table with the correct street names in it (in Israel we get this data for free), then compare what the user types against the correct street names in your database table. So it's autocomplete.
You don't need to insert the values yourself; just find a data source that can provide this data, and it will be autocomplete for the user.
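For the "points" idea in the question, one common approach is to pull candidate rows with LIKE (or a FULLTEXT search) and score them in application code. Below is a minimal sketch using Python's standard-library difflib; the column layout follows the ADDRESSES table above, but the sample rows and region numbers are invented:

```python
from difflib import SequenceMatcher

def score(candidate, scanned):
    """Similarity 'points' between 0.0 and 1.0, case-insensitive."""
    return SequenceMatcher(None, candidate.lower(), scanned.lower()).ratio()

def best_region(scanned, rows):
    """rows: (street_nat, street_en, region_id) tuples fetched from ADDRESSES.
    Each row is scored against both the Cyrillic and the Latin name;
    the best-scoring row wins."""
    nat, en, region = max(
        rows, key=lambda r: max(score(r[0], scanned), score(r[1], scanned)))
    return region

rows = [
    ("ул. Жул Верн", "Jules Verne str.", 17),   # invented sample data
    ("ул. Централна", "Main str.", 3),
]
print(best_region("Jul Vern st.", rows))  # → 17
```

The same kind of ratio could be computed in PHP with similar_text() or levenshtein(); MySQL itself only offers SOUNDEX() out of the box, which does not cope with the Cyrillic/Latin mixing described here.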

Related

Method to Match Multiple Columns Dynamically in one Table to another Table (e.g. Address, City, State, Zip)

I'm using SQL Server 2008 with T-SQL syntax. I know a little bit of C# and Python, so I could possibly venture down those paths to make this work.
The Problem:
I have multiple databases where I have to match the customers within each database to a "Master Customer" file.
It's basically mapping the customers at those distributors to the supplier level.
There are 3-8 million customers in each database (8 of them) that have to be matched to the Supplier Table (1800 customers).
I'd rather not do an Excel "matching game" for 3-4 weeks (30 million customers). I need some shortcuts, as this is exhausting.
This is one Distributor Table:
select
    master_cust_Num,
    master_cust_name,
    cust_shipto_name,
    cust_shipto_address,
    cust_shipto_address_2,
    cust_shipto_city,
    cust_shipto_state,
    cust_shipto_zip
from
    Distributor.sales
group by
    master_cust_Num,
    master_cust_name,
    cust_shipto_name,
    cust_shipto_address,
    cust_shipto_address_2,
    cust_shipto_city,
    cust_shipto_state,
    cust_shipto_zip
This is a small snippet of what the table yields, and I'd have to match that to the supplier table. (Sample rows for both tables were shown as screenshots in the original post.)
Basically I need a function that will search the address lines in the Distributor DBs for a match or the closest match(es). I could do a case when address like '%birch%' to find all 'birch' street matches when distributor.zip_code = supplier.zip_code, but I'd rather not have to write each of those strings into the like statements by hand.
Is there an XML trick that will pick out parts within the multiple distributor address columns to match that of the supplier address column? (or maybe 1 at a time, if that's easier)
Can I do a like '%[0-9][A-Z]%' type of search? I'm open to best practices (will award pts) on how to tackle this beast. I guess I'm not even sure how to approach it other than brute force: grouping by zip codes and working on street matches from there.
The matching/searching 'like' function (or XML or whatever) would have to dynamically match one column, say "Birch St" in the Supplier Address column, to find all matches of "Birch Ave", "Birch St", "Birch" that have that same zip.
Do you have SQL Server Integration Services? If so, you could use the SSIS Fuzzy Lookup to look up customers in the master customer table by address.
If you're not familiar with that feature of SSIS, it will let you specify columns to compare, how many potential matches to return, and a threshold that a comparison has to meet to be reported to you.
Here's an article specifically about matching addresses: Using Fuzzy Lookup Transformations in SQL Server Integration Services
If you don't have it available ... well, then we start getting into implementing your own fuzzy-matching algorithm. Which is quite possible: it's just a bit more laborious.
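If it comes to rolling your own, the usual starting point is Levenshtein edit distance: the number of single-character insertions, deletions and substitutions needed to turn one string into another. Here is a minimal sketch (Python for brevity; the same dynamic-programming loop translates directly to C# or even T-SQL):

```python
def levenshtein(a, b):
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a                     # keep b as the shorter string
    prev = list(range(len(b) + 1))      # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))    # → 3
print(levenshtein("Birch St", "Birch Ave"))  # → 3
```

With a threshold on the distance (or a ratio normalized by string length), this gives you the same "closest match wins" behaviour the Fuzzy Lookup provides, at the cost of comparing within candidate groups (e.g. per zip code) to keep the workload manageable.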

Designing a quick-lookup address database

If I were to design a database in MySQL to the following specifications:
1) Over 25mil records
2) Columns of house number, street, town, city, postcode
3) street, town, city and postcode need to be fulltext-searchable (on the front-end side, the search will be running on AJAX off a text-input field with immediate drop-down results)
How would I design the above?
I was thinking of working with a single table - is this a bad idea? I'm not sure whether to normalize across different tables, given that this is address data. I am also thinking that if I work with a single table, I would put a FULLTEXT index across the searchable fields.
I have not worked with such a large DB before. Is the above a bad idea?
UPDATE #1:
Decided to normalize the street and postcode columns, which are the only ones actually being searched on (I re-checked the original specification). Did some quick math: the cardinality of street names is 2% and of postcodes 6% of the total data set, so I think this is the best way forward.
Currently running the import of 29 million rows - will take about 5 hours. Will update again later on performance tests for the sake of wrapping up this question.
Your design sounds reasonable. But. Are you sure that the addresses in the database all conform to the "street, town, city, postcode" format? What about "c/o" addresses ("care of")? Unit/apartment/floor/suite numbers? What about specific building names ("Barack Obama, White House, Washington, DC")?
In the United States, there are various exceptions to this address layout. For example, there is something called "Rural Routes", whose format is "RR [number] BOX [number]" (described here). There are PO Boxes and military addresses. In fact, I just learned that the US Post Office has a publication describing all sorts of different address formats (here).
A more general form is something like "Address Line 1", "Address Line 2", "City", "Post Code". There are services that standardize addresses for much of the world, and even software available for this purpose.
Your idea of using full text search is a good idea. When looking for a partial match on a street name, for instance, it will be much faster.

Best strategy for storing order's addresses

I have a 'strategy' question.
Thing is, we have a table of customers' addresses and customer orders. The structure is something like this (just an example; ignore field types etc.):
Address
id INT
line1 TEXT
line2 TEXT
state TEXT
zip TEXT
countryid INT
To preserve the historical validity of the data we are storing those addresses in a text field with the orders (previously it was done by reference, but that is wrong because if the address changes, the delivery address of all old orders changes too). E.g.:
Orders
id INT
productid INT
quantity INT
delivery_address TEXT
The delivery address is something akin to CONCAT_WS("\n", line1, line2, state, zip, country_name).
Everything is nice and dandy; however, it seems that customers need access to the historical data and want to export it in XML format, with those lines properly split up again. Because sometimes there is no line2, state, zip or whatever, how can we store this information in a way that lets us decipher the 'label' of each line?
Storing it as a JSON-encoded array was suggested, but is this the best way? I thought about storing it as XML... or maybe creating those 6-10 extra columns and storing the address data with every order? Perhaps some of you have more experience dealing with this kind of stuff and can point me in the right direction.
Thanks in advance!
Personally, I would model the addresses as a single table; every update to the address would generate a new row, which would be marked as the current address.
I guess you could allow deletes if there are no related orders; however, it would be simpler to mark the old record as inactive.
This will allow you to preserve the relationship between orders & addresses, and to easily query the historic data at a later date.
See the Wikipedia entry for slowly changing dimensions.
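The "new row per update, marked as current" idea can be sketched as follows (SQLite in Python for brevity; the table and column names here are illustrative, not the poster's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    line1 TEXT, line2 TEXT, state TEXT, zip TEXT,
    is_current INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    product_id INTEGER,
    address_id INTEGER REFERENCES address(id)
);
""")

def update_address(customer_id, **fields):
    """Mark the old row inactive and insert a fresh 'current' row."""
    conn.execute(
        "UPDATE address SET is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1", (customer_id,))
    cur = conn.execute(
        "INSERT INTO address (customer_id, line1, line2, state, zip) "
        "VALUES (?, ?, ?, ?, ?)",
        (customer_id, fields.get("line1"), fields.get("line2"),
         fields.get("state"), fields.get("zip")))
    return cur.lastrowid

old_id = update_address(1, line1="1 Old Rd", state="NY", zip="10001")
conn.execute("INSERT INTO orders (id, product_id, address_id) VALUES (1, 42, ?)",
             (old_id,))
update_address(1, line1="2 New Ave", state="NY", zip="10002")

# The historical order still shows the address as it was when placed.
row = conn.execute("""SELECT a.line1 FROM orders o
                      JOIN address a ON a.id = o.address_id
                      WHERE o.id = 1""").fetchone()
print(row[0])  # → 1 Old Rd
```

The order keeps pointing at the row that was current when it was placed, so no parsing of concatenated text is ever needed.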
The best way IMHO is to add history to the address table. This will cause extra elements to be added to its key (say address_id and {start_of_validity, end_of_validity}). The customer id then becomes a foreign key into the customer table. The orders table references only the address_id field (which is "stable" in time). New orders would reference the "current" row in address.
NB: I don't know JSON.
You should store those as 6-10 extra fields, just like you do for the current address. That way you have every piece of information at hand, without having to parse anything.
Any other approach (concatenation, JSON, XML) will force you to parse whenever you need to access the info.
when you say "previously it was done by reference, but this is wrong because if address changes all old orders change delivery address too, which is wrong", it was not that wrong ...
Funny, isn't it?
So, as proposed by others, addresses should (must?) be stored in an independent table. You'll then have different address types (invoicing, delivery), address statuses (active, inactive) and a de facto address history log ...
In order to be able to use the address data for future purposes, you will definitely want to retain as much metadata as possible (meaning fields such as Address, City, State, and ZIP). Losing this data by pulling it all into a single line looks simpler and may conserve a small amount of space, but in the end it is not the best method. In fact, breaking it apart afterwards is very difficult, much like separating first and last names out of a generic, one-size-fits-all "name" column. Storing the data as complete entries, using the 6-10 new fields (as mentioned), is the best way to go.
Even better would be standardizing the addresses (at least the US addresses) when they are first entered. That would ensure that the address is real and deliverable and eliminate shipping issues in the future. My thoughts, always retain as much of the data as possible because storage is cheap and data is valuable.
In the interest of full disclosure, I am the founder of SmartyStreets. We provide street address verification.

Database Normalization with user input

I am developing a MySQL database that will contain the country, city and occupation of each user.
While I can use a "country" table and then insert the id of the country into the user table, I am still looking for the right method for the other two tables.
The problem is that the city and occupation of each user are taken from an input field, meaning that users can type "NYC" or "New York" or "New York City" and millions of other combinations for each town, for example.
Is it a good idea to disregard this issue, create my own "town" table containing all the towns inserted by users and put the id of the town entry into the user table, or would it be more appropriate to use a VARCHAR column "town" in the user table and not normalize the database for this relation?
I want to display the data from the three tables on user profile pages.
I am concerned about normalization because I don't want too much redundant data in my database: it consumes a lot of space, and the queries will be slower if I use a varchar index instead of an integer index, for example (as far as I know).
Thanks
We had this problem. Our solution was to collect the various synonyms and typo-containing versions that people use and explicitly map them to a known canonical city name. This allowed us to correctly guess the name from user input in 99% of cases.
For the remaining 1%, we created a new city entry and marked it as non-canonical. Periodically we looked through the non-canonical entries. For recognizable known cities, we remapped the non-canonical entry to the canonical one (updating the FKs of linked records and adding a synonym). For a genuinely new city name we didn't know about, we kept the created entry as canonical.
So we had something like this:
create table city(
    id integer primary key,
    name varchar not null, -- the canonical name
    ...
);
create table city_synonym(
    name varchar primary key, -- we want a unique index
    city_id integer references city(id)
);
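A rough illustration of the lookup against that synonym table (SQLite in Python here; the concrete types and sample data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE city_synonym (
    name TEXT PRIMARY KEY,              -- unique index on the synonym
    city_id INTEGER REFERENCES city(id)
);
""")
conn.execute("INSERT INTO city (id, name) VALUES (1, 'New York')")
conn.executemany("INSERT INTO city_synonym (name, city_id) VALUES (?, 1)",
                 [("nyc",), ("new york",), ("new york city",)])

def resolve_city(user_input):
    """Return (city_id, canonical_name), or None if unrecognized."""
    return conn.execute("""SELECT c.id, c.name FROM city_synonym s
                           JOIN city c ON c.id = s.city_id
                           WHERE s.name = ?""",
                        (user_input.strip().lower(),)).fetchone()

print(resolve_city("NYC"))  # → (1, 'New York')
```

An unrecognized input (the 1% case) would then trigger inserting a new non-canonical city row, to be reviewed later as described above.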
Usually data normalization helps you work with the data and keep it simple. If the normalized schema doesn't fit your needs, you can use denormalized data as well. So it depends on the queries you want to run.
There is no good way to group cities without creating a separate table where you keep all the names for each city under a single id. So it would be good to have 3 tables: user(user_id, city_id), city(city_id, correct_name), city_alias(alias_id, city_id, name).
It would be better to store the data in a normalized design, containing the actual, government recognized city names.
@Varela's suggestion of an 'alias' for the city would probably work well in this situation. But you have to return a message along the lines of "You typed in 'Now Yerk'. Did you perhaps mean 'New York'?". Actually, you want to offer these kinds of corrections regardless...
Of course, what you should probably actually store isn't the city, but the postal/zip code. Table design is along these lines:
State:
Id State
============
AL Alabama
NY New York
City:
Id State_Id City
========================
1 NY New York
2 NY Buffalo
Zip_Code:
Id Code City_Id
=========================
1 00001-0001 1
And then store a reference to Zip_Code.Id whenever you have an address. You want to know exactly which zip code a user has (claimed) to be a part of. Reasons include:
Taxes for retail (regardless of how Amazon plays out).
Addresses for delivery (There is a Bellevue in both Washington and New York, for example. Zip codes are different).
Social mapping. If you store it as 'user input' cities, you will not be able to (easily) analyze the data to find out things like which users live near each other, much less in the same city.
There are a number of other things that can be done about address verification, including geo-location, but this is a basic design that should help you in most of your needs (and prevent most of the possible 'invalid' anomalies).

How to design this "bus stations" database?

I want to design a database about bus stations. There are about 60 buses in the city, and each of them has the following information:
BusID
BusName
Lists of stations along the route (forward and back)
This database must be efficient for searching; for example, when a user wants to list the buses which go through stations A and B, it must run quickly.
My first thought was to put the stations in a separate table, with StationId and Station, and then the lists of stations would contain those StationIds. I guess it may work, but I'm not sure that it's efficient.
How can I design this database?
Thank you very much.
Have you looked at Database Answers to see if there is a schema that fits your requirements?
I had to solve this problem and I used this :
Line
number
name
Station
name
latitude
longitude
is_terminal
Holiday
date
description
Route
line_id
from_terminal : station_id
to_terminal : station_id
Route schedule
route_id
is_holiday_schedule
starting_at
Route stop
route_id
station_id
elapsed_time_from_start : in minutes
Does it look good to you?
Some random thoughts based on travel on London buses In My Youth, because this could be quite complex I think.
You might need entities for the following:
Bus -- the physical entity, with a particular model (ie. particular seating capacity and disabled access, and dimensions etc) and VIN.
Bus stop -- the location at which a bus stops. Usually bus stops come in pairs, one for each side of the road, but sometimes they are on a one-way road.
Route -- a sequence of bus stops and the route between them (multiple possible roads exist). Sometimes buses do not run the entire route, or skip stops (fast service). Is a route just one direction, or is it both? Maybe a route is actually a loop, not a there-and-back.
Service -- a bus following a certain route
Scheduled Run -- an event when a bus on a particular service follows a particular route. It starts at some part of the route, ends at another part, and maybe skips certain stops (see 3).
Actual Run -- a particular bus followed a particular scheduled run. What time did it start, what time did it get to particular stops, how many people got on and off, what kind of ticket did they have?
(This sounds like homework, so I won't give a full answer.)
It seems like you just need a many-to-many relationship between buses and stops using 3 tables. A query with two inner joins will give you the buses that stop at two specific stops.
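That three-table design and the double inner join might look like this (SQLite in Python; all table names and sample data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bus     (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE station (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE bus_station (                -- junction table
    bus_id     INTEGER REFERENCES bus(id),
    station_id INTEGER REFERENCES station(id),
    stop_order INTEGER                    -- position of the stop on the route
);
""")
conn.execute("INSERT INTO bus VALUES (1, 'Bus 1'), (2, 'Bus 2')")
conn.execute("INSERT INTO station VALUES (10, 'A'), (20, 'B'), (30, 'C')")
conn.executemany("INSERT INTO bus_station VALUES (?, ?, ?)",
                 [(1, 10, 1), (1, 20, 2), (1, 30, 3),   # Bus 1: A -> B -> C
                  (2, 10, 1), (2, 30, 2)])              # Bus 2: A -> C

# Buses that stop at both A (10) and B (20): join the junction table twice.
rows = conn.execute("""
    SELECT b.name
    FROM bus b
    JOIN bus_station sa ON sa.bus_id = b.id AND sa.station_id = 10
    JOIN bus_station sb ON sb.bus_id = b.id AND sb.station_id = 20
""").fetchall()
print(rows)  # → [('Bus 1',)]
```

With indexes on bus_station(station_id, bus_id), this query stays fast even for far more than 60 buses.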
I'd hack it.
bus_id int
path varchar(max)
If a bus goes through the following stations (in this order):
01
03
09
17
28
Then I'd put in a record where path was set to
'-01-03-09-17-28-'
When someone wants to find a bus to get from station 03 to 28, then my select statement is
select * from buses where path like '%-03-%-28-%'
Not scalable, not elegant, but dead simple, and it won't churn through tables like mad when trying to find a route. Of course, it only works if there's a single bus that goes through the two stations in question.
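The ordering check that the LIKE pattern performs can be sketched in application code like this (Python; the helper name is made up):

```python
def path_matches(path, frm, to):
    """Rough equivalent of SQL:  path LIKE '%-frm-%-to-%'
    i.e. station frm must appear somewhere before station to.
    (Unlike the SQL pattern, this also matches adjacent stations,
    where the two delimiters share a '-'.)"""
    i = path.find("-%s-" % frm)
    return i != -1 and path.find("-%s-" % to, i + 1) != -1

path = "-01-03-09-17-28-"
print(path_matches(path, "03", "28"))  # → True
print(path_matches(path, "28", "03"))  # → False (wrong direction)
```

The surrounding '-' delimiters matter: without them, searching for station "1" would also match "17" and "28".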
What you have thought of is good; in some cases it may or may not be efficient. I think that you should create tables such as table1(BusID, BusName) and table2(BusID, StationList). Try using joins between these two tables to get the result. One more thing: if possible, normalize the tables; that would help you.
I'd go for 3 tables :
bus
stations
bus_stations
"bus" for what the name stands for, "stations" for the station ids and names, and "bus_stations" to connect those other 2 tables, which would have bus_id, station_id_from, station_id_to.
This is probably more complex than you really need, but if, in the future, you need to know the full trajectory of a bus, and also from which station a bus comes when it goes to "station B", it will be useful.
60 buses will not make that much of an impact on performance, though.