Automated normalization of mySQL database - how to do it? - mysql

I have a MySQL database filled with one huge table of 80 columns and 10 million rows. The data may have inconsistencies.
I would like to normalize the database in an automated and efficient way.
I could do it using Java/C++/..., but I would like to do as much as possible inside the database. I suspect that any work outside the database will slow things down considerably.
Any suggestions on how to do it? What are good resources/tutorials to start with?
I am not looking for hints on what normalization is (I found plenty of that using Google)!

You need to study the columns to identify 'like' entities and break them out into separate tables. At best an automated tool might identify groups of rows with identical values for some of the columns, but a person who understands the data would have to decide whether those truly belong together as a separate entity.
Here's a contrived example - suppose your columns were first name, last name, address, city, state, zip. An automated tool might identify rows of people who were members of the same family with the same last name, address, city, state, and zip and incorrectly conclude that those five columns represented an entity. It might then split the tables up:
First Name, ReferenceID
and another table
ID, Last Name, Address, City, State, Zip
See what I mean?

I can't think of any way you can automate it. You would have to create the tables that you want, and then go through and replace each piece of data with manual queries.
e.g.,
INSERT INTO contact (first_name, last_name, phone)
SELECT DISTINCT first_name, last_name, phone
FROM massive_table;
Then you could drop those columns from the massive table and replace them with a contact_id column.
You would have a similar process when pulling out rows that go into a one-to-many table.
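The two steps above (copy the distinct values out, then point the big table at the new rows) can be sketched end to end. This is only an illustration: it uses Python's sqlite3 module with an in-memory database as a stand-in for MySQL, and the contact.id key, the order_total column and the sample rows are assumptions, not part of the original schema. It also assumes the three contact columns contain no NULLs, since a plain = comparison never matches NULL.

```python
import sqlite3


def normalize_contacts(conn):
    """Split the contact columns of massive_table out into a contact table."""
    cur = conn.cursor()
    # Hypothetical target table; the surrogate key is generated automatically.
    cur.execute("""
        CREATE TABLE contact (
            id INTEGER PRIMARY KEY,
            first_name TEXT, last_name TEXT, phone TEXT
        )""")
    # Step 1: one row per distinct contact.
    cur.execute("""
        INSERT INTO contact (first_name, last_name, phone)
        SELECT DISTINCT first_name, last_name, phone FROM massive_table""")
    # Step 2: add the reference column and fill it by matching the old values.
    # Assumes the matched columns are never NULL.
    cur.execute("ALTER TABLE massive_table ADD COLUMN contact_id INTEGER")
    cur.execute("""
        UPDATE massive_table SET contact_id = (
            SELECT c.id FROM contact c
            WHERE c.first_name = massive_table.first_name
              AND c.last_name = massive_table.last_name
              AND c.phone = massive_table.phone)""")
    conn.commit()


# Made-up sample data standing in for the real 80-column table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE massive_table "
    "(first_name TEXT, last_name TEXT, phone TEXT, order_total REAL)")
conn.executemany("INSERT INTO massive_table VALUES (?, ?, ?, ?)", [
    ("Ann", "Lee", "555-0100", 10.0),
    ("Ann", "Lee", "555-0100", 25.5),
    ("Bob", "Ray", "555-0199", 7.0),
])
normalize_contacts(conn)
contact_count = conn.execute("SELECT COUNT(*) FROM contact").fetchone()[0]
```

Once every row has a contact_id, the old first_name, last_name and phone columns can be dropped from massive_table. On 10 million rows you would probably want an index on the matched contact columns before running the UPDATE.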

When cleaning up messy data, I like to create user-defined MySQL functions for typical data-scrubbing tasks, so that you can reuse them later. This approach also lets you look for existing UDFs that you can use (with or without modification), for example at mysqludf.org.
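To show the idea of a reusable scrubbing function that is called from inside an UPDATE, here is a small sketch. It is not a MySQL UDF: it registers a Python function with SQLite through sqlite3's create_function, purely as a runnable analogy. The function name scrub_phone, the 10-digit rule and the sample values are all assumptions.

```python
import re
import sqlite3


def scrub_phone(raw):
    """Keep digits only; return None unless exactly 10 digits remain."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 10 else None


conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL under the same name.
conn.create_function("scrub_phone", 1, scrub_phone)
conn.execute("CREATE TABLE massive_table (phone TEXT)")
conn.executemany(
    "INSERT INTO massive_table VALUES (?)",
    [("(555) 010-0100",), ("555.010.0100",), ("n/a",)])
# The cleanup itself is a single statement that runs inside the database.
conn.execute("UPDATE massive_table SET phone = scrub_phone(phone)")
cleaned = [row[0] for row in
           conn.execute("SELECT phone FROM massive_table ORDER BY rowid")]
```

In MySQL the equivalent would be a stored function or a compiled UDF, but the calling pattern (one UPDATE that applies the function to a column) is the same.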

Related

sql query to check many interests are matched

So I am building a swingers site. The users can search other users by their interests. This is only part of a number of parameters used to search a user. The thing is there are like 100 different interests. When searching another user they can select all the interests the user must share. While I can think of ways to do this, I know it is important the search be as efficient as possible.
The backend uses jdbc to connect to a mysql database. Java is the backend programming language.
I have debated using multiple columns for interests, but then generating the SQL query gets awkward: the query need not check every column if those columns are not mentioned in the JSON object sent to the server with the search criteria. I also worry that I may have to make painful modifications to the table later if I add new columns.
Another thing I thought about was having some kind of long byte array, or number (used like a byte array) stored in a single column. I could & this with another number corresponding to the interests the user is searching for but I read somewhere this is actually quite inefficient despite it making good sense to my mind :/
And all of this has to be part of one big sql query with multiple tables joined into it.
One of the issues with using multiple columns would be the computing power needed to run statement.setBoolean on what could be 40 columns.
I thought about generating an xml string in the client then processing that in the sql query.
Any suggestions?
I think the correct term is a bitmask. I could maybe have one table that maps each user's id to a bitmask for querying users' interests, and another with one entry per interest per user id, for efficiently looking up which user has which interests if I later need that?
Basically, it would be great to have a separate table with all the interests, 2 columns: id and interest.
Then, have a table that links the user to the interests: user_interests which would have the following columns: id,user_id,interest_id. Here some knowledge about many-to-many relations would help a lot.
Hope it helps!
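With that layout, "find users who share all of the selected interests" becomes a GROUP BY with a HAVING count. The sketch below runs against an in-memory SQLite database through Python's sqlite3 module rather than JDBC/MySQL, but the SQL should carry over largely unchanged. The sample interests and user ids are made up, and the UNIQUE constraint on (user_id, interest_id) is an assumption added here.

```python
import sqlite3


def users_with_all_interests(conn, interest_ids):
    """Return the ids of users who have every interest in interest_ids."""
    wanted = sorted(set(interest_ids))
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT user_id
        FROM user_interests
        WHERE interest_id IN ({placeholders})
        GROUP BY user_id
        HAVING COUNT(DISTINCT interest_id) = ?
        ORDER BY user_id"""
    # A user qualifies only if the number of matched interests equals
    # the number of interests that were asked for.
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interests (id INTEGER PRIMARY KEY, interest TEXT NOT NULL);
    CREATE TABLE user_interests (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        interest_id INTEGER NOT NULL REFERENCES interests(id),
        UNIQUE (user_id, interest_id)
    );
    INSERT INTO interests (id, interest)
        VALUES (1, 'hiking'), (2, 'chess'), (3, 'cooking');
    INSERT INTO user_interests (user_id, interest_id) VALUES
        (10, 1), (10, 2), (10, 3),
        (11, 1), (11, 3),
        (12, 2);
""")
matches = users_with_all_interests(conn, [1, 3])
```

Only the interests actually present in the search criteria are bound as parameters, so there is no need to set 40 booleans per query, and adding a new interest is an INSERT into interests rather than a schema change.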

Best way to implement many-to-many relationship in MySQL

I have 2 tables, users (~1000 users) and country (~50 countries). A user can support many countries, so I am planning to create a mapping table, user_country. However, since I have 1000 users and 50 countries, I will have a maximum of 50000 entries in this table. Is this the best way to implement this, or is there a more appropriate method?
If this is the best way, how can I add a user supporting many countries to this table using only one SQL statement?
For example:
INSERT INTO user_country(userid, countrycode)
VALUES ('user001','US'),
('user001','PH'),
('user001','KR'),
('user001','JP')
The above SQL statement will be too long if a user supports all 50 countries, and I have to do it for 1000 users. Does anyone have ideas on the most efficient way to implement this?
From the point of view of database design, a table like your user_country is the only sensible way to go. 50000 records are a breeze for MySQL, and having them together with the appropriate indexes will open up all ways of future use for those data.
As far as I can see, this is unrelated to the problem of many large SQL insert statements. No matter how you represent the data in the database, you will have to write statements containing, for each user, a list of countries.
This is a one-time action, right? So it doesn't need to be a masterpiece in software engineering. What I sometimes do is load the raw data in Excel, line by line, then write a formula that "calculates" the appropriate SQL statement for the first line, and copy this formula for all lines. Then throw all these statements at the database. Even if there are tens of thousands of them, it's not much effort.
Personally I'd do the insert based on a select:
INSERT INTO user_country (userid, countrycode) SELECT 'user001', countryid FROM countries WHERE countryid IN ('US', 'PH', 'KR', 'JP');
You need to adapt to your column names.
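As a rough sketch of how the insert-from-select scales to "all 50 countries": leaving out the WHERE clause links the user to every row in the countries table with one short statement. The example below uses Python's sqlite3 module with an in-memory database in place of MySQL. The helper name add_supported_countries, the table definitions and the five sample countries are assumptions for illustration.

```python
import sqlite3


def add_supported_countries(conn, userid, country_codes=None):
    """Link userid to the listed countries, or to every country if None."""
    sql = ("INSERT INTO user_country (userid, countrycode) "
           "SELECT ?, countryid FROM countries")
    params = [userid]
    if country_codes is not None:
        if not country_codes:
            return 0  # nothing to insert; an empty IN () is not valid in MySQL
        placeholders = ", ".join("?" for _ in country_codes)
        sql += f" WHERE countryid IN ({placeholders})"
        params.extend(country_codes)
    inserted = conn.execute(sql, params).rowcount
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (countryid TEXT PRIMARY KEY);
    CREATE TABLE user_country (
        userid TEXT NOT NULL,
        countrycode TEXT NOT NULL REFERENCES countries(countryid),
        PRIMARY KEY (userid, countrycode)
    );
    INSERT INTO countries (countryid)
        VALUES ('US'), ('PH'), ('KR'), ('JP'), ('FR');
""")
some = add_supported_countries(conn, "user001", ["US", "PH", "KR", "JP"])
everything = add_supported_countries(conn, "user002")
```

A side effect of selecting from countries is that unknown codes are silently skipped instead of being inserted, which may or may not be what you want.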
The alternative of storing the list of countries in a single column, usercountries varchar(255), as US,FR,KR and so on would be possible as well, but it makes selecting users by the country they support awkward. You don't lose that ability entirely, but
SELECT * FROM users WHERE usercountries LIKE '%KR%';
is not a good query in terms of index usage: a pattern with a leading wildcard cannot use an ordinary index. With only 1000 users, though, a table scan will be mighty quick as well.

Can you produce a dynamically generated field in MySQL at the server level?

We have an older system that's being replaced piecemeal. The people who originally designed it broke US telephone numbers for our clients up into three fields: phone_part_1, phone_part_2, and phone_part_3, corresponding to US Areacodes, Exchanges, and Phone Numbers respectively.
We're transitioning to use a single field, phone_number, to hold all 10 digits. But, because some pieces of the system will continue to reference the older fields, we've been forced to double up for the moment.
I'm wondering if it's possible to use MySQL built-in features to reroute requests for the old fields (both on read and write) to the newer field without having to change the old code (which is in a language nobody here is comfortable in anyhow.) So that:
SELECT phone_part_1 FROM users;
Would end up the same as
SELECT SUBSTRING( phone_number, 1, 3 ) FROM users;
To be clear, I want to do this without manipulating the individual queries. Is it possible? How?
You could define a VIEW:
CREATE VIEW users AS
SELECT SUBSTRING( phone_number, 1, 3 ) AS phone_part_1, ... FROM real_users;
Then you can query it as if it were a table:
SELECT phone_part_1 FROM users;
But that would require your "real" table to be stored with a distinct table name. You can't make a view with the same name as an existing table.
When you're ready to really replace the table with the new structure, then you can use RENAME TABLE to change tables as a quick action (no table restructure required).
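A minimal, runnable sketch of the view idea, covering the read path only. It uses Python's sqlite3 module and SQLite's substr() in place of MySQL's SUBSTRING(). The real_users name follows the answer above, while the id column, the sample number and the split into 3/3/4 digits are assumptions based on the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The "real" table keeps only the new single field.
    CREATE TABLE real_users (
        id INTEGER PRIMARY KEY,
        phone_number TEXT NOT NULL
    );
    INSERT INTO real_users (id, phone_number) VALUES (1, '2125550123');

    -- The view takes over the old table name and derives the legacy fields.
    CREATE VIEW users AS
    SELECT id,
           phone_number,
           substr(phone_number, 1, 3) AS phone_part_1,
           substr(phone_number, 4, 3) AS phone_part_2,
           substr(phone_number, 7, 4) AS phone_part_3
    FROM real_users;
""")
# Old code can keep selecting the legacy columns unchanged.
legacy_row = conn.execute(
    "SELECT phone_part_1, phone_part_2, phone_part_3 FROM users").fetchone()
```

Note that this does not cover the write side of the question: columns computed by an expression in a view are, as far as I know, not updatable in MySQL, so old code that writes to phone_part_1 would still need another solution.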
Have you looked into views? A view will take the place of a new table for now, providing a way to have your new structure, but still access the data in the original tables. Once you are ready for your final move, you can implement new tables and do a mass conversion of any remaining data you haven't done yet. Or you can go in reverse, which is what it sounds like you really would prefer.
Create your new table, convert your data, and set up a view that mimics the old structure.
Views in MySQL: http://dev.mysql.com/doc/refman/5.0/en/create-view.html

Is it good practice to consolidate small static tables in a database?

I am developing a database to store test data. Each piece of data has 11 tags of metadata. Currently I have a separate table for each of the metadata options. I have seen a few questions on here regarding best practices for numerous small tables, but I thought I'd pose the question for my own project because I didn't get a clear answer from the other questions asked.
Here is my table list, with the fields in each table:
Source Type - id, name, description
For Flight - id, name, description
Site - id, name, abrv, description
Stand - id, site (FK site table), name, abrv, description
Sensor Type - id, name, channels, description
Vehicle - id, name, abrv, description
Zone - id, vehicle (FK vehicle table), name, abrv, description
Event Type - id, name, description
Event - id, event type (FK to event type table), name, description
Analysis - id, name, description
Bandwidth - id, name, description
You can see the fields are more or less the same in each of these tables. There are three tables that reference another table.
Would it be better to have just one large table called something like Meta with the following fields:
Meta: id, metavalue, name, abrv, FK, value, description
where metavalue = one of the above table names
and FK = a reference to another row in the Meta table in place of a foreign key?
I am new to databases and multiple tables seem most intuitive, but one table makes the programming easier.
So my questions are:
Is it good practice to reduce the number of tables and put all static values in one table?
Is it bad to have a self-referencing table?
FYI, I am building this web database using Django and MySQL on a Windows server with NTFS formatting.
Tips and best practices appreciated.
Thanks.
"Would it be better to have just one large table" - emphatically and categorically, NO!
This anti-pattern is sometimes referred to as "the one table to rule them all"!
Ten Common Database Design Mistakes: One table to hold all domain values.
Using the data in a query is much easier
Data can be validated using foreign key constraints very naturally, something not feasible for the other solution unless you implement ranges of keys for every table – a terrible mess to maintain.
If it turns out that you need to keep more information about a ShipViaCarrier than just the code, 'UPS', and description, 'United Parcel Service', then it is as simple as adding a column or two. You could even expand the table to be a full blown representation of the businesses that are carriers for the item.
All of the smaller domain tables will fit on a single page of disk. This ensures a single read (and likely a single page in cache). If the other case, you might have your domain table spread across many pages, unless you cluster on the referring table name, which then could cause it to be more costly to use a non-clustered index if you have many values.
You can still have one editor for all rows, as most domain tables will likely have the same base structure/usage. And while you would lose the ability to query all domain values in one query easily, why would you want to? (A union query could easily be created of the tables easily if needed, but this would seem an unlikely need.)
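The point about foreign key validation can be illustrated with the Site and Stand tables from the question. This is a sketch using Python's sqlite3 module rather than Django/MySQL. The sample rows and the helper try_insert_stand are made up, and the PRAGMA line is SQLite-specific (in MySQL you would need InnoDB tables for foreign keys to be enforced).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE site (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL, abrv TEXT, description TEXT
    );
    CREATE TABLE stand (
        id INTEGER PRIMARY KEY,
        site INTEGER NOT NULL REFERENCES site(id),
        name TEXT NOT NULL, abrv TEXT, description TEXT
    );
    INSERT INTO site (id, name, abrv) VALUES (1, 'North Range', 'NR');
""")


def try_insert_stand(conn, stand_id, site_id, name):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO stand (id, site, name) VALUES (?, ?, ?)",
            (stand_id, site_id, name))
        return True
    except sqlite3.IntegrityError:
        return False


valid_ok = try_insert_stand(conn, 1, 1, "Stand A")     # site 1 exists
invalid_ok = try_insert_stand(conn, 2, 99, "Stand B")  # site 99 does not
```

With a single Meta table, the FK column could only say "some other row in Meta", so nothing would stop a stand from pointing at a bandwidth or an event type.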
Most of these look like they won't do anything but expand codes into descriptions. Do you even need the tables? Just define a bunch of constants, or codes, and then have a dictionary of long descriptions for the codes.
The field in the referring table just stores the code. eg: "SRC_FOO", "EVT_BANG" etc.
This is also often known as the One True Lookup Table (OTLT) - see my old blog entry OTLT and EAV: the two big design mistakes all beginners make.

Shall I put contact information in a separate table?

I'm planning a database with a couple of tables that contain plenty of address information: city, zip code, email address, phone #, fax #, and so on (about 11 columns' worth). One is an organizations table containing up to 2 addresses (the legal contact and the contact that should actually be used), and every user has the same information tied to them.
We are also going to have to run some geolocation queries on those addresses (like finding every address within X kilometers of another address).
I have a bunch of options, each with its own problem:
1. I could put all the information inside every table, but that would make for tables with a very large number of columns, which I'd have problems indexing, and if I change my address format it'll take a while to fix.
2. I could put all the information inside an array and serialize it, then store the serialized information in one field. This has the same problems as the previous method, with slightly fewer columns and much less availability through MySQL queries.
3. I could create a separate table with address information and link it to the other tables either by:
3.1. putting an address_id column in the users and organizations tables
3.2. putting related_id and related_table columns in the addresses table
That should keep stuff tidier, but it might create some unforeseen problems with excessive joining or whatever.
Personally I think that solution 3.2 is the best, but I'm not too confident about it, so I'm asking for opinions.
Option 2 is definitely out, as it would put the filtering logic into your code instead of letting the DBMS handle it.
Whether to pick option 1 or 3 depends on your needs.
If you need fast access to all the data, and you usually access both addresses along with the organization information, then you might consider option 1. But this will make the table difficult (i.e. slow) to query if it gets too big in MySQL.
Option 3 is good provided you index the tables correctly.
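As an illustration of "index the tables correctly" for option 3.2, here is one possible layout, run through Python's sqlite3 module as a stand-in for MySQL. The kind column (legal vs. actual contact), the index name, the helper addresses_for and the sample rows are assumptions, and only a few of the 11 or so address columns are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY,
        related_table TEXT NOT NULL,     -- 'users' or 'organizations'
        related_id INTEGER NOT NULL,     -- id of the row in that table
        kind TEXT NOT NULL,              -- hypothetical: 'legal' or 'actual'
        city TEXT, zip_code TEXT, email TEXT
    );
    -- Lookups always go through the owner, so index both columns together.
    CREATE INDEX idx_addresses_owner ON addresses (related_table, related_id);

    INSERT INTO organizations (id, name) VALUES (1, 'Acme');
    INSERT INTO users (id, name) VALUES (1, 'Ann');
    INSERT INTO addresses (related_table, related_id, kind, city) VALUES
        ('organizations', 1, 'legal', 'Rome'),
        ('organizations', 1, 'actual', 'Milan'),
        ('users', 1, 'actual', 'Turin');
""")


def addresses_for(conn, table, row_id):
    """Return (kind, city) pairs for one owner row, ordered by kind."""
    return conn.execute(
        "SELECT kind, city FROM addresses "
        "WHERE related_table = ? AND related_id = ? ORDER BY kind",
        (table, row_id)).fetchall()


org_addresses = addresses_for(conn, "organizations", 1)
```

One trade-off to be aware of: with related_table/related_id the database cannot enforce a foreign key on the owner, whereas option 3.1 (an address_id column in users and organizations) allows ordinary foreign key constraints.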