storing email as three attributes versus one (for statistical purposes) - mysql

I have a database with a table email_eml that stores 3 attributes: name_eml, host_eml and domain_eml. These store the email name, the name of the website, and the domain name (like .com, .net, etc.). It doesn't store the @ or the . in any of the variables.
This gives me some flexibility (for example, checking the average name length (before the @ symbol) will be faster). I can collect some statistics on email names, and I can also create usernames from the name_eml attribute.
However, it is also a burden to handle when people are submitting their email or when I have to compare a whole email address.
Storing the address in one piece would mean storing the additional @ and . symbols and separating the name in a script whenever I want to collect statistics.
I wonder if it's better to store the email in a single column instead of the 3 columns.
Is one of these approaches more proper, or more normalized?
I would like the answer to include the pros and cons of both approaches to storing email addresses (even if storing the emails in 3 columns doesn't have many pros).

It doesn't store the @ or the . in any of the variables.
Well, it should; cat.call@somedomain.com is a legal email address.
I wonder if it's better to store the email in a single column instead of the 3 columns. Is one of these approaches more proper, or more normalized?
This doesn't have anything to do with normalization. It has to do with complex data types.
The relational model allows arbitrarily complex data types. A commonly used complex data type is a timestamp, which typically includes year, month, day, hour, minute, second, and microsecond.
Given a timestamp, sometimes you might need to know only the date, and sometimes you might need to know only the year or only the hour. The relational model imposes a specific burden on the dbms when dealing with complex data types. For a complex data type, the dbms is required either to return it in its entirety, or to provide functions that return its various parts. The point is that, if a user wants only the hour out of a timestamp, the user doesn't write code to get it.
SQL DBMSs have good support for timestamps; every DBMS I'm familiar with provides functions that return the various parts of a timestamp. None of them has native support for email addresses.
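For instance, this is roughly what that built-in support looks like in MySQL (the orders table and created_at column here are purely illustrative):
-- the DBMS hands back the parts; the user writes no parsing code
SELECT
    YEAR(created_at) AS created_year,
    DATE(created_at) AS created_date,
    HOUR(created_at) AS created_hour
FROM orders;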
On a SQL platform, you have at least two alternatives to keep your database close to the relational model. You can write functions that can be incorporated into the database server (if your dbms and your programming skills allow for that), or you can split up the data type into pieces so each can be addressed in its entirety like any other value.
While there are probably some data types that make sense to split like that (street addresses might be one of them), I don't really see any compelling reason to split an email address.
This gives me some flexibility (for example, checking the average name length (before the @ symbol) will be faster). I can collect some statistics on email names, and I can also create usernames from the name_eml attribute.
While that's true, right now I can't imagine anything at all interesting about the average length of a username. I don't find any of your reasons compelling, but you know more about your application than I do.
If you really need to do a lot of operations on the pieces, it makes more sense to keep the pieces separate. More "normal" client code should access the email addresses through a view that concatenates the pieces. (Concatenation is a lot easier than parsing an email address at run time.)
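A minimal sketch of such a view in MySQL, using the three columns from the question (the separator handling is simplified and assumes host_eml holds no dots):
CREATE VIEW email_addresses AS
SELECT
    name_eml,
    host_eml,
    domain_eml,
    CONCAT(name_eml, '@', host_eml, '.', domain_eml) AS email
FROM email_eml;
Client code reads whole addresses from the view, while the statistical queries keep working on the individual columns.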

It's extremely rare to store email addresses in three columns. If you want to do something like search on the part of the email before the @ symbol, you could just use a LIKE query...
SELECT email FROM people WHERE email LIKE 'john.smith@%';
I'd be interested to hear of any real-life examples that aren't possible to do with an SQL query.
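Even pulling the individual pieces out of a single column is straightforward with MySQL's string functions. A sketch, reusing the people table from the query above:
-- local part and domain, split at run time
SELECT
    SUBSTRING_INDEX(email, '@', 1)  AS name_part,
    SUBSTRING_INDEX(email, '@', -1) AS domain_part
FROM people;
-- the "average name length" statistic from the question
SELECT AVG(CHAR_LENGTH(SUBSTRING_INDEX(email, '@', 1))) AS avg_name_length
FROM people;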

In terms of normalization, once you break apart common aspects (such as host and especially top-level domain), they should be modeled as foreign relationships. So you end up with three tables:
emailNames
emailHosts
emailTLDs
emailNames then has three columns:
emailName
hostID
tldID
Note that I used "TLD", as this is likely the only part with significant overlap in the host name, and you can expect the "." character in hostnames before the start of the TLD.
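A rough sketch of that layout in MySQL (the column types, lengths, and keys beyond the names given above are assumptions):
CREATE TABLE emailTLDs (
    tldID INT AUTO_INCREMENT PRIMARY KEY,
    tld   VARCHAR(63) NOT NULL UNIQUE      -- e.g. 'com', 'co.uk'
);
CREATE TABLE emailHosts (
    hostID INT AUTO_INCREMENT PRIMARY KEY,
    host   VARCHAR(255) NOT NULL UNIQUE    -- e.g. 'gmail', 'mail.example'
);
CREATE TABLE emailNames (
    emailName VARCHAR(64) NOT NULL,
    hostID    INT NOT NULL,
    tldID     INT NOT NULL,
    PRIMARY KEY (emailName, hostID, tldID),
    FOREIGN KEY (hostID) REFERENCES emailHosts (hostID),
    FOREIGN KEY (tldID)  REFERENCES emailTLDs (tldID)
);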

Related

Database issue: 2 tables with identical structure because of the quality of the data

I have a database with one table where I store two different types of data.
I store a Quote and a Booking in a single table named Booking.
At first, I thought a quote and a booking were the same thing, since they had the same fields.
But a quote is not related to a user, whereas a booking is.
We have a lot of quotes in our database, which pollutes the Booking table with less important data.
I guess it makes sense to have two different tables so they can also evolve independently:
Quote
Booking
The objective is to split the data into junk data (quote) and the actual data (booking).
Does it make sense in the relational-database theory?
I'd start by looking for the domain model to tie this to - is a "quote" the same logical thing as a "booking"? Quotes typically have a different lifecycle to bookings, and bookings typically represent financial commitments. The fact they share some attributes is a hint that they are similar domain concepts, but it's not conclusive. Cars and goldfish share some attributes - age, location, colour - but it's hard to think of them as "similar concepts" at any fundamental level.
In database design, it's best to try to represent the business domain as far as is possible. It makes your code easy to understand, which makes it less likely you'll introduce bugs. It often makes the code simpler, too, which may make it faster.
If you decide they are related in the domain model, it may be a case of trying to model an inheritance hierarchy in the relational database. This question discusses this extensively.
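If they do turn out to be related concepts, one common way to model the hierarchy (a sketch only; every table and column name here is an assumption) is a shared parent table with one child table per subtype:
CREATE TABLE enquiry (
    enquiry_id  INT AUTO_INCREMENT PRIMARY KEY,
    created_at  DATETIME NOT NULL,
    total_price DECIMAL(10,2)            -- attributes shared by quotes and bookings
);
CREATE TABLE quote (
    enquiry_id  INT PRIMARY KEY,
    valid_until DATE,
    FOREIGN KEY (enquiry_id) REFERENCES enquiry (enquiry_id)
);
CREATE TABLE booking (
    enquiry_id INT PRIMARY KEY,
    user_id    INT NOT NULL,             -- only bookings are tied to a user
    FOREIGN KEY (enquiry_id) REFERENCES enquiry (enquiry_id)
);
If they are not related in the domain, two independent tables are simpler and keep the quote 'junk' out of the booking data, as the question suggests.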

How to sanitize or randomize sensitive database fields

What's the most efficient method or tool to randomize a list of database table columns to obscure sensitive information?
I have a Django application used by several clients, and I need to onboard some development contractors to do work on the application. When they work on bugs (e.g. page /admin/model/123 has an error), ideally they'd need a snapshot of the client database in order to reproduce and fix the bug. However, because they're off-site contractors, I'd like to mitigate risk in the event they expose the client database (unintentionally or otherwise). I don't want to have to explain to a client why all their data's been published online because a foreign contractor left his laptop in an unlocked car.
To do this, I'd like to find or write a tool to "randomize" sensitive fields in the database, like usernames, email addresses, account numbers, company names, phone numbers, etc., so that the structure of the data is maintained, but all personally identifiable information is removed.
Presumably, this is a task that many other people have had to do, but I'm not sure what the technical term is, so I'm not finding much through Google. Are there any existing tools to do this with a Django application running a MySQL or PostgreSQL backend?
Anonymize and sanitize are good words for this chore.
It's relatively easy to do. Use queries like
UPDATE person
SET name  = CONCAT('Person', person_id),
    email = CONCAT('Person', person_id, '@example.com');
and so forth, to stomp actual names and emails and all that. It's helpful to preserve the uniqueness of entries, and the autoincrementing IDs of various tables can help you do that.
(Adding this as an answer, as I am not allowed to comment yet.)
As Cerin said, O. Jones's approach to anonymizing/sanitizing works for simple fields, but not for more complicated ones like addresses, phone numbers or account numbers that need to match a specific format. However, the method can be modified to handle these too.
Let's take a phone number with the format aaa-bbbb-ccc as an example and use the auto-incrementing person_id as the source of unique numbers. For the ccc part of the phone number, use MOD(person_id, 1000). This gives the remainder of person_id divided by 1000. For bbbb, take MOD((person_id - MOD(person_id, 1000)) / 1000, 10000). It looks complicated, but all it does is take person_id, remove the last three digits (which were used for ccc), and then divide by 1000. The last four digits of the resulting number are used as bbbb. I think you'll be able to figure out how to calculate aaa.
The three parts of the phone number can then be concatenated to give the complete phone number: CONCAT(aaa, '-', bbbb, '-', ccc)
(You might have to explicitly convert the numbers to string, I'm not sure)
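Put together, a sketch of the whole thing in MySQL (the person table and phone column are assumptions; LPAD both converts to a string and keeps leading zeros, which also answers the conversion question above):
UPDATE person
SET phone = CONCAT(
    LPAD(MOD(FLOOR(person_id / 10000000), 1000), 3, '0'), '-',  -- aaa
    LPAD(MOD(FLOOR(person_id / 1000), 10000), 4, '0'), '-',     -- bbbb
    LPAD(MOD(person_id, 1000), 3, '0')                          -- ccc
);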

Matching 2 databases of names, given first, last, gender and DOB?

I collect a list of Facebook friends from my users including First, Last, Gender and DOB. I am then attempting to compare that database of names (stored as a table in MySQL) to another database comprised of similar information.
What would be the best way to conceptually link these results, with the second database being the much larger set of records (>500k rows)?
Here was what I was proposing:
Iterate through Facebook names
Search Last + DOB - if they match, assume a "confident" match
Search Last + First - if they match, assume a "probable" match
Search Last + Levenshtein(First) above a certain similarity threshold - if so, assume a "possible" match
Are there distributed computing concepts that I am missing that may make this faster than a sequential MySQL approach? What other pitfalls may spring up, noting that it is much more important not to have a false positive than to miss a record?
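In MySQL terms, the "confident" pass above might look something like this (the table and column names are purely illustrative), ideally with an index on (last_name, dob) in the larger table:
SELECT fb.id AS fb_id, other.id AS other_id
FROM facebook_friends AS fb
JOIN other_people AS other
  ON  other.last_name = fb.last_name
  AND other.dob       = fb.dob;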
Yes, your idea seems like a better algorithm.
Assuming performance is your concern, you can use caching to store the values that have just been searched. You can also index the results in a NoSQL database, so that reads will be much faster and you get better read performance. If you have to use MySQL, read about polyglot persistence.
Assuming simplicity is your concern, you can still use indexing in a NoSQL database, so that over time you don't have to do a myriad of joins that would spoil the experience of the user and the developer.
There could be many more concerns, but it all depends on where you would like to use it: on a website, or for data-analytics purposes.
If you want to operate on the entire set of data (as opposed to some interactive thing), this data set size might be small enough to simply slurp into memory and go from there. Use a List to hold the data, then create a Map<String, List<Integer>> that, for each unique last name, points (via integer index) to all the places in the list where it appears. You'll also set yourself up to be able to perform more complex matching logic without getting caught up trying to coerce SQL into doing it. Especially since you are spanning two different physical databases...

What is the best normalization for street address?

Today I have a table containing:
Table a
--------
name
description
street1
street2
zipcode
city
fk_countryID
I'm having a discussion about the best way to normalize this in terms of the quickest search, e.g. finding all rows filtered by city or zipcode. The suggested new structure is this:
Table A
--------
name
description
fk_streetID
streetNumber
zipcode
fk_countryID
Table Street
--------
id
street1
street2
fk_cityID
Table City
----------
id
name
Table Country
-------------
id
name
The discussion is about having only one field for the street name instead of two.
My argument is that having two fields is considered normal for supporting international addresses.
The argument for a single field is that two fields come at the cost of search performance and possible duplication.
I'm wondering what is the best way to go here.
UPDATE
I'm aiming at having 15,000 brands associated with 50,000 stores, where 1,000 users will do multiple searches each day via web and iPhone. In addition, I will have 3rd parties fetching data from the DB for their sites.
The site is not launched yet, so we have no idea of the workload, and we'll only have around 1,000 brands associated with around 4,000 stores when we start.
My standard advice (from years of data warehouse /BI experience) here is:
always store the lowest level of broken out detail, i.e. the multiple fields option.
In addition to that, depending on your needs, you can add indexes or even a compound field that is the other two fields concatenated - though make sure to maintain it with a trigger and not manually, or you will have data synchronization and quality problems.
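A rough sketch of that trigger approach in MySQL (the address table and street_full column are assumptions; on MySQL 5.7+ a stored generated column achieves the same thing declaratively):
CREATE TRIGGER address_street_full_ins
BEFORE INSERT ON address
FOR EACH ROW
SET NEW.street_full = CONCAT_WS(' ', NEW.street1, NEW.street2);
-- a matching BEFORE UPDATE trigger keeps the compound field in sync on edits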
Part of the correct answer for you will always depend on your actual use. Can you ever anticipate needing the address in a standard (2-line) format for mailing... or exchange with other entities? Or is this really a pure 'read-only' database that is just set up for inquiries and not used for more standard address needs such as mailings?
At the end of the day, if you have issues with query performance, you can add additional structures such as compound fields, indexes and even other tables with the same data in a different form. There are also options for caching at the server level if performance is slow. If you're building a complex or traffic-intensive site, chances are you will end up with a product to help anyway; for example, in the Ruby programming world people use Thinking Sphinx. If query performance is still an issue and your data is growing, you may ultimately need to consider non-SQL solutions like MongoDB.
One final principle that I also adhere to: think about people updating data, if that will occur in this system. When people input data initially and then subsequently go to edit that information, they expect the information to be "the same", so any manipulation done internally that actually changes the form or content of the user's input will become a major headache when you try to allow them to do a simple edit. I have seen insanely complicated algorithms for encoding and decoding data in this fashion, and they frequently have issues.
I think the topmost example is the way to go, maybe with a third free-form field:
name
description
street1
street2
street3
zipcode
city
fk_countryID
The only things you can normalize halfway sanely for international addresses are the zip code (which needs to be a free-form field, though) and the city. Street addresses vary way too much.
Note that high normalization means more joins, so it won't yield faster searches in every case.
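For example, with the normalized layout suggested in the question, a simple search by city already needs two joins that the flat layout avoids (names follow the question's sketch):
SELECT a.name
FROM TableA a
JOIN Street s ON s.id = a.fk_streetID
JOIN City   c ON c.id = s.fk_cityID
WHERE c.name = 'Berlin';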
As others have mentioned, address normalization (or "standardization") is most effective when the data is together in one table but the individual pieces are in separate columns (like your first example). I work in the address verification field (at SmartyStreets), and you'll find that standardizing addresses is a really complex task. There's more documentation on this task here: https://www.smartystreets.com/Features/Standardization/
With the volume of requests you'll be handling, I highly suggest you make sure the addresses are correct before you deploy. Process your list of addresses and remove duplicates, standardize formats, etc. A CASS-Certified vendor (such as SmartyStreets, though there are others) will provide such a service.

Database denormalization opportunity

I'm looking for a strategy for stopping the repetitive problem of branching out tables. As a fictitious use case, say I have a table of users that contains their name, login, password and other metadata. In this particular scenario, say the user is restricted to logging in from a specific subset of IP(s). Thus, we have a 1:M relationship. Every time a use case such as this comes up, the normal workflow is to have a 'users' table and a table such as 'user_ips', in which case you'd have something such as pk(ip_id), fk(user_id) and the IP on the user_ips side.
For similar situations, do you folks normally fan out in the fashion as above? Is there an opportunity to denormalize effectively here? Perhaps store the IPs in a BLOB column in some CSV delimited fashion? What are some strategies you folks are deploying today?
Opportunity to denormalize? I think you may have misunderstood conventional wisdom - denormalization is an optimization technique. Not something you go out looking for.
I would suspect that any normalized solution when the number of potential related items is large is going to out perform a denormalized solution if properly indexed. My strategy is to normalize the database then provide views or table-based functions that take advantage of indexed joins to make the cost bearable. I'd let performance demands dictate the move to a denormalized form.
Keep this in mind. If you need to implement role-based security access to parts of the information, table-based security is MUCH easier to implement than column-based, especially at the database or data layer level.
I would strongly suggest against putting multiple IP addresses in a field. Never mind 3NF; this breaks 1NF.
Tvanfsson is right in that if you index the FKEY you'll get pretty comparable performance, unless there are going to be millions of records in the 'user_ips' table.
What's even better is that by keeping these tables normalized you can actually report on this information in the future, so that when users are confused as to why they can't log in from certain LANs, writing the app (or SQL) to troubleshoot and do user IP lookups will be A LOT easier.
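A minimal sketch of that normalized layout in MySQL (column types and lengths are assumptions; the unique key also indexes the foreign key for fast lookups):
CREATE TABLE users (
    user_id  INT AUTO_INCREMENT PRIMARY KEY,
    login    VARCHAR(64) NOT NULL UNIQUE,
    password VARCHAR(255) NOT NULL          -- store a hash, not the plain password
);
CREATE TABLE user_ips (
    ip_id   INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    ip      VARCHAR(45) NOT NULL,           -- wide enough for IPv6 in text form
    UNIQUE KEY uq_user_ip (user_id, ip),    -- each IP listed once per user
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);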
One option would be to store your IP addresses as an XML string. I think this would be better than a comma-separated list and would give you the flexibility to add other elements to the string should you need them (port comes to mind) without database changes.
That said, I think the normalized fashion is better in most cases.
As with any denormalization question, you need to consider the costs associated with it. In particular, if you list the IP addresses in the main table, how are you going to be able to answer the question "which users can be associated with IP address w.x.y.z?". With the fully normalized form, that is easy and symmetric with "which IP addresses can be associated with user pqr?". With denormalized forms, the questions have very different answers. Also, ensuring that the correct integrity rules are applied is much harder in the denormalized version, in general.
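With normalized users and user_ips tables (names assumed), both questions are the same shape:
-- which users can be associated with IP address w.x.y.z?
SELECT u.login FROM users u JOIN user_ips i ON i.user_id = u.user_id WHERE i.ip = 'w.x.y.z';
-- which IP addresses can be associated with user pqr?
SELECT i.ip FROM user_ips i JOIN users u ON u.user_id = i.user_id WHERE u.login = 'pqr';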
You may want to consider a user-attribute table and an attribute-type table, where you can define what types of attributes a user can have. Each new use case would become an attribute type, and the data would simply be added to the user-attribute table.
With your example of the IP addresses, you would have an attribute type of IP and store the respective IPs in the user-attribute table. This gives you the flexibility to add another type, such as MAC address, and does not require that you create a new table to support the new data types. For each new use case you do not have to add anything but data.
The down side is that your queries will be a little more complex given this attribute structure.
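A sketch of that attribute structure (an entity-attribute-value style layout; all names and types here are assumptions):
CREATE TABLE attribute_types (
    type_id INT AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(64) NOT NULL UNIQUE    -- e.g. 'IP', 'MAC'
);
CREATE TABLE user_attributes (
    user_id INT NOT NULL,
    type_id INT NOT NULL,
    value   VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id, type_id, value),
    FOREIGN KEY (type_id) REFERENCES attribute_types (type_id)
);
-- e.g. all IP addresses for user 42 (a slightly more complex query, as noted above):
SELECT ua.value
FROM user_attributes ua
JOIN attribute_types t ON t.type_id = ua.type_id
WHERE ua.user_id = 42 AND t.name = 'IP';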
IMHO, it's all about cost/benefit analysis. All depends on requirements (including probable ones) and capabilities of the platform you are using.
For example, if you have a requirement like "Show all unique IP addresses recorded in the system", then you'd better "branch" now and create a separate table to store IP addresses. Or if you need certain constraints on IP addresses (like "all IP addresses of a given user must be unique"), then you might greatly benefit from having a separate table and proper constraints applied to it. (Please note that you could meet both requirements even if you used a de-normalized design and proper XML-related machinery; however, a RelDB-based solution to these requirements seems to be much cheaper to implement and maintain.)
Obviously, these are not the only examples of requirements that would dictate a normalized solution.
At the same time, I think that requirements like "Show all IP addresses of a user" or "Show all users associated with a given IP address" may not be sufficient to justify a normalized solution.
You could try to perform a deeper analysis (in search of requirements of the first type), or just rely on your understanding of the project's context (current and future) and on your gut feeling.
My own gut feeling in this particular case is that requirements of the first type (pro-normalization requirements) are extremely likely, so you'd be better off with a normalized solution from the very beginning. However, you've said that this use case is fictitious, so in your real situation the conclusion may be exactly the opposite.
Never say "never": 3NF is not always the best answer.