Normalization makes joins across multiple tables difficult - MySQL

I had a table for stores containing the store name and address. After some discussion, we are now normalizing the table, putting the address in separate tables. This is done for two reasons:
Increase search speed when looking up stores by location / address
Reduce execution time when checking for misspelled street names with the Levenshtein algorithm during store imports.
The new structure looks like this (ignore typos):
country;
+--------------------+--------------+------+-----+---------+-------+
| Field              | Type         | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id                 | varchar(2)   | NO   | PRI | NULL    |       |
| name               | varchar(45)  | NO   |     | NULL    |       |
| prefix             | varchar(5)   | NO   |     | NULL    |       |
+--------------------+--------------+------+-----+---------+-------+
city;
+--------------------+--------------+------+-----+---------+-------+
| Field              | Type         | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id                 | int(11)      | NO   | PRI | NULL    |       |
| city               | varchar(50)  | NO   |     | NULL    |       |
+--------------------+--------------+------+-----+---------+-------+
street;
+--------------------+--------------+------+-----+---------+-------+
| Field              | Type         | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id                 | int(11)      | NO   | PRI | NULL    |       |
| street             | varchar(50)  | YES  |     | NULL    |       |
| fk_cityID          | int(11)      | NO   |     | NULL    |       |
+--------------------+--------------+------+-----+---------+-------+
address;
+--------------------+--------------+------+-----+---------+-------+
| Field              | Type         | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id                 | int(11)      | NO   | PRI | NULL    |       |
| streetNum          | varchar(10)  | NO   |     | NULL    |       |
| street2            | varchar(50)  | NO   |     | NULL    |       |
| zipcode            | varchar(10)  | NO   |     | NULL    |       |
| fk_streetID        | int(11)      | NO   |     | NULL    |       |
| fk_countryID       | int(11)      | NO   |     | NULL    |       |
+--------------------+--------------+------+-----+---------+-------+
*street2 is for a secondary reference or secondary address, e.g. in the US.
store;
+--------------------+--------------+------+-----+---------+-------+
| Field              | Type         | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id                 | int(11)      | NO   | PRI | NULL    |       |
| name               | varchar(50)  | YES  |     | NULL    |       |
| street             | varchar(50)  | YES  |     | NULL    |       |
| fk_addressID       | int(11)      | NO   |     | NULL    |       |
+--------------------+--------------+------+-----+---------+-------+
*I've left out the other address columns in this table to shorten the code.
The new tables have been populated with correct data; the only thing remaining is to fill fk_addressID in the store table with the matching address.id.
The following code lists all street names correctly:
select a.id, b.street, a.street2, a.zipcode, c.city, a.fk_countryID
from address a
left join street b on a.fk_streetID = b.id
left join city c on b.fk_cityID = c.id
How can I update fk_addressID in store table?
How can I list all stores with correct address?
Is this bad normalization considering the reasons given above?
UPDATE
It seems like the following code lists all stores with correct address - however it is a bit slow (I have about 2000 stores):
select a.id, a.name, b.id, c.street
from sl_store a
join sl_street c on a.street = c.street
join sl_address b on b.fk_streetID = c.id
group by a.name
order by a.id
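A possible way to populate fk_addressID directly, sketched under the same assumption the query above makes, namely that a store's street name identifies exactly one address (if the same street name exists in several cities, you would need extra join conditions, e.g. on zipcode, to disambiguate):
-- Sketch: match each store to an address through the street name
-- and copy the address id into the store row.
update sl_store s
join sl_street st on st.street = s.street
join sl_address a on a.fk_streetID = st.id
set s.fk_addressID = a.id;
Once fk_addressID is populated, listing stores with their addresses becomes a plain chain of joins on the foreign keys instead of a join on street names.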

I'm not going to speak to misspellings. Since you're importing the data, misspellings are better handled in a staging table.
Let's look at this slightly simplified version.
create table stores
(
  store_name varchar(50) primary key,
  street_num varchar(10) not null,
  street_name varchar(50) not null,
  city varchar(50) not null,
  state_code char(2) not null,
  zip_code char(5) not null,
  iso_country_code char(2) not null,
  -- Depending on what kind of store you're talking about, you *could* have
  -- two of them at the same address. If so, drop this constraint.
  unique (street_num, street_name, city, state_code, zip_code, iso_country_code)
);
insert into stores values
('Dairy Queen #212', '232', 'N 1st St SE', 'Castroville', 'CA', '95012', 'US'),
('Dairy Queen #213', '177', 'Broadway Ave', 'Hartsdale', 'NY', '10530', 'US'),
('Dairy Queen #214', '7640', 'Vermillion St', 'Seneca Falls', 'NY', '13148', 'US'),
('Dairy Queen #215', '1014', 'Handy Rd', 'Olive Hill', 'KY', '41164', 'US'),
('Dairy Mart #101', '145', 'N 1st St SE', 'Castroville', 'CA', '95012', 'US'),
('Dairy Mart #121', '1042', 'Handy Rd', 'Olive Hill', 'KY', '41164', 'US');
Although a lot of people firmly believe that ZIP code determines city and state in the US, that's not the case. ZIP codes have to do with how carriers drive their routes, not with geography. Some cities straddle the borders between states; single ZIP code routes can cross state lines. Even Wikipedia knows this, although their examples might be out of date. (Delivery routes change constantly.)
So we have a table that has two candidate keys,
{store_name}, and
{street_num, street_name, city, state_code, zip_code, iso_country_code}
It has no non-key attributes. I think this table is in 5NF. What do you think?
If I wanted to increase the data integrity for street names, I might start with something like this.
create table street_names
(
  street_name varchar(50) not null,
  city varchar(50) not null,
  state_code char(2) not null,
  iso_country_code char(2) not null,
  primary key (street_name, city, state_code, iso_country_code)
);
insert into street_names
select distinct street_name, city, state_code, iso_country_code
from stores;
alter table stores
add constraint streets_from_street_names
foreign key (street_name, city, state_code, iso_country_code)
references street_names (street_name, city, state_code, iso_country_code);
-- I don't cascade updates or deletes, because in my experience
-- with addresses, that's almost never the right thing to do when a
-- street name changes.
You could (and probably should) repeat this process for city names, state names (state codes), and country names.
Some problems with your approach
You can apparently enter a street id number for a street that's in the US, along with the country id for Croatia. (The "full name" of a city, so to speak, is the kind of fact you probably want to store in order to increase data integrity. That's probably also true of the "full name" of a street.)
Using id numbers for every bit of data greatly increases the number of joins required. Using id numbers doesn't have anything to do with normalization. Using id numbers without corresponding unique constraints on the natural keys--an utterly commonplace mistake--allows duplicate data.
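To illustrate that last point with a hypothetical table (not from the question):
-- With only a surrogate key, nothing stops the same city from being stored twice:
create table city_bad
(
  id int auto_increment primary key,
  city varchar(50) not null
);
insert into city_bad (city) values ('Castroville'), ('Castroville'); -- both rows accepted

-- A unique constraint on the natural key rejects the duplicate:
create table city_good
(
  id int auto_increment primary key,
  city varchar(50) not null,
  unique (city)
);
insert into city_good (city) values ('Castroville'), ('Castroville'); -- fails: Duplicate entry 'Castroville'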


MySQL auto_increment does not increment

I have this table:
mysql> desc Customers;
+------------+------------------+------+-----+---------+-------+
| Field      | Type             | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+-------+
| CustomerID | int(10) unsigned | NO   | PRI | NULL    |       |
| Name       | char(50)         | NO   |     | NULL    |       |
| Address    | char(100)        | NO   |     | NULL    |       |
| City       | char(30)         | NO   |     | NULL    |       |
+------------+------------------+------+-----+---------+-------+
Now, If I want to insert sample data:
mysql> insert into Customers values(null, 'Julia Smith', '25 Oak Street', 'Airport West');
ERROR 1048 (23000): Column 'CustomerID' cannot be null
I know I cannot make the ID null, but it should be MySQL's job to assign the numbers and increment them. So I try simply not specifying the id:
mysql> insert into Customers (Name, Address, City) values('Julia Smith', '25 Oak Street', 'Airport West');
Field 'CustomerID' doesn't have a default value
Now I am trapped. I cannot make the id NULL (which tells MySQL "increment my ID"), and I cannot omit it, because there is no default value. So how can I make MySQL handle the ids for me on new insertions?
Primary key means that every CustomerID has to be unique, and you defined it as NOT NULL, so an INSERT of NULL is not permitted. What the column is missing is AUTO_INCREMENT. Instead of
| CustomerID | int(10) unsigned | NO | PRI | NULL | |
make it
CustomerID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT
and you can enter your data without problem. Since the column is already the primary key, leave PRIMARY KEY out of the MODIFY (repeating it would raise a "Multiple primary key defined" error):
ALTER TABLE Customers MODIFY CustomerID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
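After that change, both inserts from the question should work (a quick check; MySQL now generates the id itself):
INSERT INTO Customers VALUES (NULL, 'Julia Smith', '25 Oak Street', 'Airport West');
INSERT INTO Customers (Name, Address, City) VALUES ('Julia Smith', '25 Oak Street', 'Airport West');
SELECT LAST_INSERT_ID(); -- returns the auto-generated CustomerID of the last insert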
@Milan,
Delete the CustomerID column from the table, then add the field again with the following details:
Field: CustomerID,
Type: BIGINT(10),
Default: None,
Auto_increment: tick the checkbox
Click the SAVE button to save the new field in the table. Hopefully inserting a new record will now work. Thanks.

Optimizing searches for big MySQL table

I'm working with a MariaDB (MySQL) table which mainly contains information about cities across the whole world: their latitude/longitude and the country code (2 characters) of the country each city is in. The table is big: over 2.5 million rows.
show columns from Cities;
+---------+--------------+------+-----+---------+----------------+
| Field   | Type         | Null | Key | Default | Extra          |
+---------+--------------+------+-----+---------+----------------+
| id      | int(11)      | NO   | PRI | NULL    | auto_increment |
| city    | varchar(255) | YES  |     | NULL    |                |
| lat     | float        | NO   |     | NULL    |                |
| lon     | float        | NO   |     | NULL    |                |
| country | varchar(255) | YES  |     | NULL    |                |
+---------+--------------+------+-----+---------+----------------+
I want to implement a city searcher, so I have to optimize the SELECTs, not the INSERTs or UPDATEs (the information never changes).
I thought that I should:
create an index (by city? by city and country?)
create partitions (by country?)
Should I do both? If so, how? Could anyone give me some advice? I'm a little bit lost.
PS. I tried this to create an index by city and country (I don't know if I am doing it right...):
CREATE INDEX idx_cities ON Cities(city (30), country (2));
Do not use "prefix indexing". Simply use INDEX(city, country) This will work very well for either of these:
WHERE city = 'London' -- 26 results, half in the US
WHERE city = 'London' AND country = 'CA' -- one result
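That advice as a statement (a sketch; the index name is arbitrary, and the DROP assumes you already created the prefix index from the question):
ALTER TABLE Cities
  DROP INDEX idx_cities,                      -- remove the prefix index
  ADD INDEX idx_city_country (city, country); -- full-column composite index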
Do not use Partitions. The table is too small, and there is no performance benefit.
Since there are only 2.5M rows, use id MEDIUMINT UNSIGNED to save 2.5MB.
What other queries will you have? If you need to "find the 10 nearest cities to a given lat/lng", then see this.
Your table, including index(es), might be only 300MB.

Duplicate removal not working on table with many NULLs

Perhaps I've been staring at the screen too long but I have the following [legacy] table I'm messing with:
describe t3_test;
+--------------------+------------------+------+-----+---------+----------------+
| Field              | Type             | Null | Key | Default | Extra          |
+--------------------+------------------+------+-----+---------+----------------+
| provnum            | varchar(24)      | YES  | MUL | NULL    |                |
| trgt_mo            | datetime         | YES  |     | NULL    |                |
| mcare              | varchar(2)       | YES  |     | NULL    |                |
| bed2prsn_asst      | varchar(2)       | YES  |     | NULL    |                |
| trnsfr2prsn_asst   | varchar(2)       | YES  |     | NULL    |                |
| tlt2prsn_asst      | varchar(2)       | YES  |     | NULL    |                |
| hygn2prsn_asst     | varchar(2)       | YES  |     | NULL    |                |
| bath2psrn_asst     | varchar(2)       | YES  |     | NULL    |                |
| ampmcare2prsn_asst | varchar(2)       | YES  |     | NULL    |                |
| any2prsn_asst      | varchar(2)       | YES  |     | NULL    |                |
| n                  | float            | YES  |     | NULL    |                |
| pct                | float            | YES  |     | NULL    |                |
| trgt_qtr           | varchar(12)      | YES  |     | NULL    |                |
| recno              | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| enddate            | date             | YES  |     | NULL    |                |
+--------------------+------------------+------+-----+---------+----------------+
15 rows in set (0.00 sec)
I have data that looks like this:
"555223","2008-10-01 00:00:00",NULL,"1",NULL,NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","5767343","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"1",NULL,NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","4075309","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"0",NULL,NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","4075308","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,"0",NULL,NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","5767342","2008-12-31"
"555223","2008-10-01 00:00:00","N",NULL,"1",NULL,NULL,NULL,NULL,NULL,"36","83.7209","2008Q4","4075327","2008-12-31"
"555223","2008-10-01 00:00:00","N","1",NULL,NULL,NULL,NULL,NULL,NULL,"36","83.7209","2008Q4","4075323","2008-12-31"
"555223","2008-10-01 00:00:00","Y","1",NULL,NULL,NULL,NULL,NULL,NULL,"4","9.30233","2008Q4","4075325","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,NULL,"0",NULL,NULL,NULL,NULL,NULL,"3","6.97674","2008Q4","4075310","2008-12-31"
"555223","2008-10-01 00:00:00",NULL,NULL,"1",NULL,NULL,NULL,NULL,NULL,"40","93.0233","2008Q4","4075311","2008-12-31"
The first two lines of the table clearly appear to be dupes (minus the auto-increment index "recno"). I've tried half a dozen dupe-removal routines and the rows are not removed.
At this point I am not sure what exactly is wrong? Is it possible there's an invisible character somewhere? Is it possible a letter is in a different character encoding? When I dump the data to CSV as is listed, it doesn't look any different.
Do you have a delete routine that would work on this file structure and remove anything that is a dupe (minus the recno field)? I have been staring at this for two days and for some reason it escapes me. (btw, I am aware of the column name anomaly for bath2psrn_asst - that's not it)
The original table has over 13 million records and is over 3GB in size, so I'm looking for the most efficient way to kill dupes. Any ideas?
Here's an example of one of the dupe-killing techniques I used that did not work:
DELETE a FROM t3_test as a, t3_test as b WHERE
(a.provnum=b.provnum)
AND (a.trgt_mo=b.trgt_mo OR a.trgt_mo IS NULL AND b.trgt_mo IS NULL)
AND (a.mcare=b.mcare OR a.mcare IS NULL AND b.mcare IS NULL)
AND (a.bed2prsn_asst=b.bed2prsn_asst OR a.bed2prsn_asst IS NULL AND b.bed2prsn_asst IS NULL)
AND (a.trnsfr2prsn_asst=b.trnsfr2prsn_asst OR a.trnsfr2prsn_asst IS NULL AND b.trnsfr2prsn_asst IS NULL)
AND (a.tlt2prsn_asst=b.tlt2prsn_asst OR a.tlt2prsn_asst IS NULL AND b.tlt2prsn_asst IS NULL)
AND (a.hygn2prsn_asst=b.hygn2prsn_asst OR a.hygn2prsn_asst IS NULL AND b.hygn2prsn_asst IS NULL)
AND (a.bath2psrn_asst=b.bath2psrn_asst OR a.bath2psrn_asst IS NULL AND b.bath2psrn_asst IS NULL)
AND (a.ampmcare2prsn_asst=b.ampmcare2prsn_asst OR a.ampmcare2prsn_asst IS NULL AND b.ampmcare2prsn_asst IS NULL)
AND (a.any2prsn_asst=b.any2prsn_asst OR a.any2prsn_asst IS NULL AND b.any2prsn_asst IS NULL)
AND (a.n=b.n OR a.n IS NULL AND b.n IS NULL)
AND (a.pct=b.pct OR a.pct IS NULL AND b.pct IS NULL)
AND (a.trgt_qtr=b.trgt_qtr OR a.trgt_qtr IS NULL AND b.trgt_qtr IS NULL)
AND (a.enddate=b.enddate OR a.enddate IS NULL AND b.enddate IS NULL)
AND (a.recno>b.recno);
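As an aside, MySQL's NULL-safe equality operator <=> (which returns TRUE when both sides are NULL) can express the same conditions more compactly; a sketch of the same delete:
DELETE a FROM t3_test AS a, t3_test AS b
WHERE a.provnum <=> b.provnum
  AND a.trgt_mo <=> b.trgt_mo
  AND a.mcare <=> b.mcare
  -- ...repeat for the remaining columns...
  AND a.enddate <=> b.enddate
  AND a.recno > b.recno;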
For such a large table, delete can be quite inefficient -- all the logging needed for the deletes is very cumbersome.
I might recommend that you try the truncate/insert approach:
create table temp_t3_test as (
    select provnum, trgt_mo, . . .,
           min(recno) as recno,
           enddate
    from t3_test
    group by provnum, trgt_mo, . . ., enddate
);

truncate table t3_test;

insert into t3_test(provnum, trgt_mo, . . . , recno, enddate)
    select *
    from temp_t3_test;
Try:
CREATE TABLE t3_new AS
(
  SELECT provnum,
         trgt_mo,
         mcare,
         bed2prsn_asst,
         trnsfr2prsn_asst,
         tlt2prsn_asst,
         hygn2prsn_asst,
         bath2psrn_asst,
         ampmcare2prsn_asst,
         any2prsn_asst,
         n,
         pct,
         trgt_qtr,
         Min(recno) AS recno,
         enddate
  FROM t3_test
  GROUP BY provnum,
           trgt_mo,
           mcare,
           bed2prsn_asst,
           trnsfr2prsn_asst,
           tlt2prsn_asst,
           hygn2prsn_asst,
           bath2psrn_asst,
           ampmcare2prsn_asst,
           any2prsn_asst,
           n,
           pct,
           trgt_qtr,
           enddate
)
When you use min(recno), you don't actually select just one row: within each group you select the minimum recno and use that same value for all the rows. To remove fewer rows, you can use DISTINCT, or GROUP BY as I have done. I would say you can drop recno from the temp table and add a new auto-increment column when you recreate the table, to avoid gaps in the ids.
This is to be used with the method suggested by Gordon Linoff.
In this scenario, the problem was not with the SQL statement. It was a problem with the DATA, but it was not visible.
The two fields of type "float" held hidden decimal values that were slightly different from each other. Converting those fields to the DECIMAL(a,b) type made the dupes show up and let them be deleted by conventional means.
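A sketch of that conversion (the precision and scale here are assumptions; choose them to fit the data):
ALTER TABLE t3_test
  MODIFY n   DECIMAL(10,4),  -- was float
  MODIFY pct DECIMAL(10,4);  -- was float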
Special thanks to Gordon Linoff for suggesting looking into this.

MySQL - Index optimization

We have an analytics product. We give each of our customers a JavaScript snippet that they put on their web sites. When a user visits a customer's site, the JavaScript code hits our server, and we store the page visit on behalf of that customer. Each customer has a unique domain name, which means a customer is identified by domain name.
Database server: MySQL 5.6
Table rows: 400 million
Following is our table schema.
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| id            | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| domain        | varchar(50)      | NO   | MUL | NULL    |                |
| guid          | binary(16)       | YES  |     | NULL    |                |
| sid           | binary(16)       | YES  |     | NULL    |                |
| url           | varchar(2500)    | YES  |     | NULL    |                |
| ip            | varbinary(16)    | YES  |     | NULL    |                |
| is_new        | tinyint(1)       | YES  |     | NULL    |                |
| ref           | varchar(2500)    | YES  |     | NULL    |                |
| user_agent    | varchar(255)     | YES  |     | NULL    |                |
| stats_time    | datetime         | YES  |     | NULL    |                |
| country       | char(2)          | YES  |     | NULL    |                |
| region        | char(3)          | YES  |     | NULL    |                |
| city          | varchar(80)      | YES  |     | NULL    |                |
| city_lat_long | varchar(50)      | YES  |     | NULL    |                |
| email         | varchar(100)     | YES  |     | NULL    |                |
+---------------+------------------+------+-----+---------+----------------+
In the table above, guid identifies a visitor to a customer's site and sid identifies that visitor's session. That means for every sid there should be an associated guid.
We need queries like the following:
Query 1 : Find unique,total visitors
SELECT count(DISTINCT guid) AS count, count(guid) AS total FROM page_views WHERE domain = 'abc' AND stats_time BETWEEN '2015-10-04 00:00:00' AND '2015-10-04 23:59:59'
composite index planning : domain, stats_time, guid
Query 2 : Find unique,total sessions
SELECT count(DISTINCT sid) AS count, count(sid) AS total FROM page_views WHERE domain = 'abc' AND stats_time BETWEEN '2015-10-04 00:00:00' AND '2015-10-04 23:59:59'
composite index planning : domain, stats_time, sid
Query 3: Find visitors and sessions by country, by region, by city; indexes sketched below
composite index planning : domain,country
composite index planning : domain,region
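For reference, the first two of those plans as actual statements (a sketch assuming the table is named page_views, as in the queries above):
ALTER TABLE page_views
  ADD INDEX idx_domain_time_guid (domain, stats_time, guid),
  ADD INDEX idx_domain_time_sid  (domain, stats_time, sid);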
Each combination requires a new composite index. That means a huge index file; we can't keep it in memory, so query performance is low.
Is there any way to optimize these index combinations to reduce the index size and improve performance?
Just for grins, run this to see what type of spread you have...
select country, region, city,
       DATE_FORMAT(colName, '%Y-%m-%d') DATEONLY, count(*)
from yourTable
group by country, region, city,
         DATE_FORMAT(colName, '%Y-%m-%d')
order by count(*) desc
and then see how many rows it returns. Also, what sort of range does the COUNT column generate? Instead of just an index, does it make sense to create a separate aggregation table on the key elements you are trying to provide for data mining?
If so, I would recommend looking at a similar post here on Stack Overflow. It shows a SAMPLE of how, but I would first look at the counts before suggesting further. If you have it broken down on a daily basis, what MIGHT this be reduced to?
Additionally, you might want to create pre-aggregate tables ONCE to get started, then have a nightly procedure that builds any new records based on a day just completed. This way it is never running through all 400M records.
If your pre-aggregate tables store data based on just the date (y, m, d only), your queries rolled up per day would shorten querying requirements. The COUNT(*) is just an example basis; you could add count(distinct whateverColumn) as needed. Then you could query SUM(aggregateColumn) based on domain, date range, etc. If your 400M records get reduced down to 7M records, I would also have a minimal index on (domain, dateOnlyField, and maybe country) to optimize your domain, date-range queries. Once you get something narrowed down to whatever level makes sense, you can always drill into the raw data for the granular level. A sketch of that idea follows (the roll-up table, its columns, and the job schedule are assumptions, not part of the question):
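-- Hypothetical daily roll-up table, keyed by customer domain, day, and country.
CREATE TABLE page_views_daily (
  domain     varchar(50)  NOT NULL,
  stats_date date         NOT NULL,
  country    char(2)      NOT NULL DEFAULT '',
  visits     int unsigned NOT NULL,
  visitors   int unsigned NOT NULL,  -- count(distinct guid)
  sessions   int unsigned NOT NULL,  -- count(distinct sid)
  PRIMARY KEY (domain, stats_date, country)
);

-- Nightly job: aggregate only the day just completed, never all 400M rows.
INSERT INTO page_views_daily
SELECT domain, DATE(stats_time), COALESCE(country, ''),
       COUNT(*), COUNT(DISTINCT guid), COUNT(DISTINCT sid)
FROM page_views
WHERE stats_time >= CURDATE() - INTERVAL 1 DAY
  AND stats_time <  CURDATE()
GROUP BY domain, DATE(stats_time), COALESCE(country, '');
Reporting queries then SUM() over page_views_daily by domain and date range instead of scanning the raw table.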

MySQL CONCAT multiple unique rows

So, here's basically the problem:
For starters, I am not asking anyone to do my homework, just to give me a nudge in the right direction.
I have two tables containing names and contact data for practice.
Let's call these tables people and contact.
Create Table for people:
CREATE TABLE `people` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `fname` tinytext,
  `mname` tinytext,
  `lname` tinytext,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Create Table for contact:
CREATE TABLE `contact` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `person_id` int(10) unsigned NOT NULL DEFAULT '0',
  `tel_home` tinytext,
  `tel_work` tinytext,
  `tel_mob` tinytext,
  `email` text,
  PRIMARY KEY (`id`,`person_id`),
  KEY `fk_contact` (`person_id`),
  CONSTRAINT `fk_contact` FOREIGN KEY (`person_id`) REFERENCES `people` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
When getting the contact information for each person, the query I use is as follows:
SELECT p.id, CONCAT_WS(' ', p.fname, p.mname, p.lname) name,
       c.tel_home, c.tel_work, c.tel_mob, c.email
FROM people p
JOIN contact c ON c.person_id = p.id;
This creates a response like:
+----+----------+---------------------+----------+---------+---------------------+
| id | name     | tel_home            | tel_work | tel_mob | email               |
+----+----------+---------------------+----------+---------+---------------------+
| 1  | Jane Doe | 1500 (xxx-xxx 1500) | NULL     | NULL    | janedoe#example.com |
| 2  | John Doe | 1502 (xxx-xxx 1502) | NULL     | NULL    | NULL                |
| 2  | John Doe | NULL                | NULL     | NULL    | johndoe#example.com |
+----+----------+---------------------+----------+---------+---------------------+
The problem with this view is that rows 1 and 2 (counting from 0) could have been grouped into a single row.
Even though this non-pretty result is due to corrupt data, it is likely that this will occur in a multi-node database environment.
The targeted result would be something like
+----+----------+---------------------+----------+---------+---------------------+
| id | name     | tel_home            | tel_work | tel_mob | email               |
+----+----------+---------------------+----------+---------+---------------------+
| 1  | Jane Doe | 1500 (xxx-xxx 1500) | NULL     | NULL    | janedoe#example.com |
| 2  | John Doe | 1502 (xxx-xxx 1502) | NULL     | NULL    | johndoe#example.com |
+----+----------+---------------------+----------+---------+---------------------+
where rows with the same id and name are grouped while still showing the effective data.
Side notes:
innodb_version: 5.5.32
version: 5.5.32-0ubuntu-.12.04.1-log
version_compile_os: debian_linux-gnu
You could use GROUP_CONCAT(), which "returns a string result with the concatenated non-NULL values from a group":
SELECT p.id,
       CONCAT_WS(' ', p.fname, p.mname, p.lname) name,
       GROUP_CONCAT(c.tel_home) tel_home,
       GROUP_CONCAT(c.tel_work) tel_work,
       GROUP_CONCAT(c.tel_mob ) tel_mob,
       GROUP_CONCAT(c.email   ) email
FROM people p
JOIN contact c ON c.person_id = p.id
GROUP BY p.id
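Note that if a person has more than one non-NULL value in the same column (say, two different emails), GROUP_CONCAT returns them comma-separated in that cell. In the sample data each column has at most one non-NULL value per person, so the output matches the targeted result exactly.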