I was asked to use a database in which most of the primary keys, and other fields as well, use char(n) to store numeric values with padding, for example:
product_id: char(8) [00005677]
user_id: char(6) [000043]
category_id: char(2) [05]
The reason they want to use it like that is to be able to use characters (in the far future) if they want. However, they have many rules based on numbers: for example, category_id values from 01 to 79 correspond to a general category, 80 to 89 to a special category, and 90 to 99 to a user-defined category.
I personally think that using char(n) to store numbers is a bad practice. My reasons are:
Using char, " " != 0, 0 != 00, 05 != 5, 00043 != 000043, and so on. For that reason, the values have to be constantly checked (to prevent data corruption).
If I pad a number (0 -> 00), then I have to take care not to pad a character (A -> 0A).
If characters are used, then ranges become strange, something like: from 01 to 79 and AB and RX and TZ and S, etc.
Indexing numbers instead of chars results in a performance gain.
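To make the first and third points concrete, here is a minimal Python sketch. The function name classify_category is hypothetical; the range boundaries are the ones from the rules described above.

```python
# Strings that "look like" the same number are different values.
print("05" == "5")            # False
print("00043" == "000043")    # False
print(int("05") == int("5"))  # True: as integers they are the same value


def classify_category(category_id: int) -> str:
    """Map a numeric category id to its kind (ranges taken from the question)."""
    if 1 <= category_id <= 79:
        return "general"
    if 80 <= category_id <= 89:
        return "special"
    if 90 <= category_id <= 99:
        return "user-defined"
    raise ValueError(f"category id out of range: {category_id}")


print(classify_category(int("05")))  # general
print(classify_category(85))         # special
```

With a numeric type, each range rule is a plain comparison. Once values such as AB or RX are allowed, there is no single ordering that keeps such a rule this simple.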
I'm proposing to change it to decimal(n) with zerofill to make it more "error-proof", as this information is modified by different sources (web, windows client, upload csv). If they want to add more categories, for example, then updating from decimal(2) to decimal(3) will be easier.
My question then is: am I wrong? Can char(n) be trusted for this task? If "chars" are evil with numbers, then which other disadvantages am I missing in the above list (I may need better reasons if I want to win my case)?
TIA (any comment/answer will be appreciated).
If this was SQL Server or Oracle or any other RDBMS, I would recommend enforcing a check constraint on those columns so that the data always matched the full capacity of the column - this would ensure your identifiers are uniform.
Unfortunately, MySQL doesn't enforce this: versions before 8.0.16 parse CHECK constraints but silently ignore them.
While it wouldn't stop the annoyance of having to pad things coming into the database or in search routines, on the client or in procs in the database, it would guarantee you that the fields were clean at the lowest level.
I find that using constraints like this helps keep things from getting badly out of hand.
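As a rough illustration of the kind of check constraint meant here, the following Python sketch uses the standard-library sqlite3 module as a stand-in for an engine that enforces CHECK. The table name, column name and the helper try_insert are made up, and GLOB is SQLite-specific, so the constraint would have to be rewritten for another RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE category (
        category_id CHAR(2) NOT NULL PRIMARY KEY
            -- exactly two characters, both of them digits
            CHECK (length(category_id) = 2 AND category_id GLOB '[0-9][0-9]')
    )
    """
)


def try_insert(value: str) -> bool:
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO category VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("05"))  # True
print(try_insert("5"))   # False: not padded to the full width
print(try_insert("0A"))  # False: not numeric
```

The padding still has to happen somewhere before the insert, but an unpadded or malformed value can no longer reach the table.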
As far as the optimization by using numbers, if they have to accommodate non-numeric characters in the future, that's not going to be an option.
It is very common to have natural keys (which could be candidates for a primary key) with varchar/char data, but yet instead enforce referential integrity on surrogate keys (usually some kind of autonumbering integer which is simply an internal reference, and often the clustered index and primary key).
Quoting your question:
...store numeric values with padding...
You did not show any examples of numeric data, only character data that happens to consist of numbers. If you had said that their OrderTotal column was a char(10), then I'd start to worry.
Just treat this as character data and you will be fine. I can see no business or technical case to change the database (unless you are beginning a near-total rewrite).
Regarding performance... If this is actually a concern, then you most likely have far bigger issues to deal with. MySQL is fast and accurate.
--
Write a function somewhere that will zero-fill user-entered IDs for the purpose of querying. Use this function everywhere you need to accept user input. NEVER EVER use a numeric data type to store your data (if PHP, never use +, always use . to concat, etc.).
Remember, this is no different than Item_Number = "SHIRT123" or any other string ID you may encounter.
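Such a zero-fill helper could look roughly like the Python sketch below. The name normalize_id and the decision to leave non-numeric IDs unpadded are my assumptions, not something from the schema in the question.

```python
def normalize_id(raw: str, width: int) -> str:
    """Left-pad a purely numeric ID with zeros; leave any other ID untouched."""
    value = raw.strip()
    if not value:
        raise ValueError("empty ID")
    if len(value) > width:
        raise ValueError(f"ID {value!r} is longer than {width} characters")
    # Only pad digit-only input, so a future ID such as "A" is not turned into "0A".
    if value.isascii() and value.isdigit():
        return value.zfill(width)
    return value


print(normalize_id("5677", 8))  # 00005677
print(normalize_id(" 43 ", 6))  # 000043
print(normalize_id("A", 2))     # A
```

Everything stays a string from input to query, so "+" never gets a chance to turn the ID into a number.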
Take care
I'm taking the Meta Data Engineer Professional Certificate and I was just given this prompt in a lab:
Mr. Carl needs to have a new table to store the contact details of each customer including customer account number, customer phone number and customer email address.
You are required to choose a relevant data type for each of the columns.
Solution:
Account number: INTEGER
Phone number: INTEGER
Email: VARCHAR
Prior to reading the solution I selected VARCHAR(10) as the datatype for storing phone numbers as I thought they should be treated as string data. My reasoning is that there's no reason to perform any sort of mathematical operation on a phone number, and they're often typed with other characters like "(" or "-".
Is there any compelling reason for storing a phone number as an INT? Do you agree with the solution to this prompt? What is the best practice for storing phone numbers?
Is "Meta Data Engineer Professional Certificate" aimed at MySQL?
General Professional: If not MySQL-specific, then you need to understand that "INTEGER" is implemented in different ways by different database engines.
MySQL Professional: INTEGER, in MySQL, maps to INT SIGNED, which is limited to about 2 billion. That guarantees only 9 digits. I don't know what the maximum phone number length is worldwide, but I know that 10 digits are needed.
BIGINT gives you about 18 digits (in 8 bytes), but that seems silly. For the reasons already mentioned VARCHAR(...) is reasonable. (Perhaps a limit of 20 would be quite sufficient.) In that case, a 10-digit number would take 11 bytes (1 for length, plus 10 for the number.)
Arguably, you could say, for example DECIMAL(15) to allow up to 15 digits in a 7-byte column.
(I prefer VARCHAR, in spite of it taking the most space.)
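The digit limits above can be checked with a few lines of arithmetic. This is only a sketch in Python; the sample number is an arbitrary 10-digit US-style number and the helper name is made up.

```python
INT_SIGNED_MAX = 2**31 - 1     # 2147483647
BIGINT_SIGNED_MAX = 2**63 - 1  # 9223372036854775807

phone = 8005431212  # arbitrary 10-digit example, digits only


def fits_every_n_digit_number(limit: int, digits: int) -> bool:
    """True if every number with `digits` decimal digits is <= limit."""
    return 10**digits - 1 <= limit


print(phone <= INT_SIGNED_MAX)                           # False
print(fits_every_n_digit_number(INT_SIGNED_MAX, 9))      # True
print(fits_every_n_digit_number(INT_SIGNED_MAX, 10))     # False
print(fits_every_n_digit_number(BIGINT_SIGNED_MAX, 18))  # True
print(fits_every_n_digit_number(BIGINT_SIGNED_MAX, 19))  # False
```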
Either way: It is a bad test question if it does not understand the two cases I present here.
Non digits: 'typed with other characters like "(" or "-"' -- That brings up a different issue. It comes under the general heading of GIGO. Cleanse the data before storing it into the database.
If you ever needed to compare two phone numbers for equality, you would wish you had removed all non-digits. (Or added them in some canonical way, such as US: "(800)543-1212".)
User input: If you ever create a UI for entering phone numbers, dates, SSNs (or other numbers with some structure), DO NOT require the user to follow some punctuation rules. DO allow a variety of typical formats. (OK, dates are tricky because there are incompatible orderings. But if I type "1-1-2021", will you spit at me for not having the leading zeroes?)
Indexing: VARCHAR, DECIMAL, INT, etc are all indexable. Any speed difference is not significant.
Extensions: Without VARCHAR, how would you represent the "extension" in "(800)543-1212x543"? Might this point be the deciding factor in favor of VARCHAR? And you should write a bug report against that 'Certification' test?
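A cleansing step along these lines might look like the following Python sketch. The function name, the accepted extension markers ("x" or "ext") and the canonical "digits, then x, then extension" form are all assumptions chosen for illustration.

```python
import re

# An extension at the end of the input, introduced by "x" or "ext" / "ext.".
_EXTENSION = re.compile(r"\s*(?:ext\.?|x)\s*([0-9]+)\s*$", re.IGNORECASE)


def normalize_phone(raw: str) -> str:
    """Reduce a user-typed phone number to digits, keeping an optional extension."""
    extension = ""
    match = _EXTENSION.search(raw)
    if match:
        extension = match.group(1)
        raw = raw[: match.start()]
    digits = re.sub(r"[^0-9]", "", raw)  # drop "(", ")", "-", spaces, ...
    if not digits:
        raise ValueError("no digits in phone number")
    return f"{digits}x{extension}" if extension else digits


print(normalize_phone("(800)543-1212"))      # 8005431212
print(normalize_phone("800-543-1212"))       # 8005431212
print(normalize_phone("(800)543-1212x543"))  # 8005431212x543
```

Two inputs that differ only in punctuation now compare equal, and the result still needs a string column because of the extension.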
Duplicate?: Which is best data type for phone number in MySQL and what should Java type mapping for it be? covers most of what I have said, and hints that [perhaps] VARCHAR(20) is sufficient. (The quoted 15, excludes the international prefix.)
In my opinion, there is no absolute best choice here. Both have pros and cons. Personally, I'm in favor of using varchar. Special characters like hyphens can cause dupes when mishandled (a rare case, and the application is to blame, since input should be verified before it is submitted), but they have the merit of formatting the phone number, which improves readability, e.g. area_code-tel: xxx-xxxxxxxx (without the hyphen it's nearly impossible to separate the area code from the phone number, as both can vary in length). About indexing: though numerics do have advantages over strings, I'm not sure a phone number would be used as an index. There are more worthy candidates, such as ID or date, but what would a phone number do? Usually we look up the phone based on an indexed column such as ID, but how often do we fetch something based on a phone number? Unless we want to list all phones from a particular area, we don't really need it to be indexed. In that case it would actually be more fitting to use special characters like hyphens to help determine the area part.
P.S Like Ken White kindly suggested, there are cases when phone numbers should be indexed, especially when they are more suitable to be an identifier.
Storing phone numbers as strings can be a disaster. The first things that come to my mind are:
You can get dupes easily: maybe someone types the number with ( and/or - and another user types the same number without those characters. Long story short, you end up with a duplicate.
Normalizing the phone number to an integer makes more sense in terms of normalization and avoiding duplicates.
Also think about a search in the scenario above: what would you use? A LIKE? A numeric operator? Casts spread everywhere? Messy...
Now comes the important thing, and it is related to indexing: the int will be faster. The longer the varchar, the slower it gets, although you are limiting its length.
The validation can be done in the UI with a field mask, or with a regex in the logic, whichever makes more sense for you.
Hope I helped a little bit :)
I'm working with some database abstraction layers, and most of them use attributes like "String", which is VARCHAR(250), or INTEGER, which has a length of 11 digits. But suppose I have something that will be less than 250 characters long. Should I make the column shorter? Does it really make any valuable difference?
Thanks in advance!
INT length does nothing. All INTs are 4 bytes. The number you can set, is only used for zerofill (and who uses that!?).
VARCHAR length does more. It's the maximum length of the field. VARCHAR is saved so that only the actual data is stored, so the length doesn't matter for storage. These days, you can have VARCHARs bigger than 255 bytes (up to 256^2-1). The difference is in the bytes that are used for the field length. VARCHAR(100), VARCHAR(8) and VARCHAR(255) use 1 byte to save the field length. VARCHAR(1000) uses 2.
Hope that helps =)
edit
I almost always make my VARCHARs 250 long. Actual length should be checked in the app anyway. For bigger fields I use TEXT (and those are stored differently, so can be much much longer).
edit
I don't know how current this is, but it used to help me (understand): http://help.scibit.com/Mascon/masconMySQL_Field_Types.html
First, remember that the database is meant to store facts and is designed to protect itself against bad data. Thus, the reason you do not want to allow a user to enter 250 characters for a first name is that a user will put all kinds of data in there that is not a first name. They'll put their whole name, their underwear size, a novel about what they did last summer and so on. Thus, you want to strive to enforce that the data is as correct as possible. It is a mistake to assume that the application is the sole protector against bad data. You want users to tell you that they had a problem stuffing War and Peace into a given column.
Thus, the most important question is, "What is the most appropriate value for the data being stored?" Ideally, you would use an int and a check constraint to ensure that the values have an appropriate range (e.g. greater than zero, less than a billion, etc.). Unfortunately, this is one of MySQL's greatest weaknesses: it does not honor check constraints (before version 8.0.16 they are parsed but ignored). That simply means you must implement those integrity checks in triggers, which admittedly is more cumbersome.
Will the difference between an int (4 bytes) and a tinyint (1 byte) be appreciable? Obviously, it depends on the amount of data. If you will have no more than 10 rows, the answer is obviously "No". If you will have 10 billion rows, the answer is obviously "Yes". However, IMO, this is premature optimization. It is far better to focus on ensuring correctness first.
For text, you should ask whether your data should support Chinese, Japanese or non-ANSI values (i.e., should you use nvarchar or varchar)? Does this value represent a real world code like a currency code, or bank code which has a specific specification?
Not so sure in MySQL, but in MS SQL it only makes a difference for sufficiently large databases. Typically, I like to use smaller fields for a) the space saving (it never hurts to practice good habits) and b) for the implied validation (if you know a certain field should never be more than 10 characters, why allow eleven, let alone 250?).
I think Rudie is wrong: not all INTs are 4 bytes. In MySQL you have:
tinyint = 1 byte,
smallint = 2 bytes,
mediumint = 3 bytes,
int = 4 bytes,
bigint = 8 bytes.
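The value range of each type follows directly from its byte size. A small Python sketch (the dictionary and function names are mine, the sizes are the ones listed above):

```python
MYSQL_INT_BYTES = {"TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8}


def int_range(n_bytes: int, unsigned: bool = False) -> tuple:
    """Return (min, max) for a two's-complement integer of n_bytes bytes."""
    bits = 8 * n_bytes
    if unsigned:
        return 0, 2**bits - 1
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for name, size in MYSQL_INT_BYTES.items():
    print(name, int_range(size), int_range(size, unsigned=True))
```

For example, int_range(1) gives (-128, 127) and int_range(4) gives (-2147483648, 2147483647).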
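I think Rudie refers to the "display width", that is, the number you put between parentheses when you are creating a column, e.g.:
age INT(3)
You're only giving the RDBMS a display width of 3 digits (used for padding with zerofill); it changes neither the storage size nor the range of values the column accepts.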
And VARCHARs are variable-length character strings, so if you declare, let's say, name varchar(5000) and you store a name like "Mario", you are only using 7 bytes (5 for the data and 2 for the length of the value).
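The byte count can be reproduced with a short Python sketch. It assumes the usual rule of a 1-byte length prefix for columns of up to 255 bytes and a 2-byte prefix above that; the function name is made up, and max_bytes means the column's maximum length in bytes (equal to n for a single-byte character set).

```python
def varchar_storage_bytes(value: str, max_bytes: int, encoding: str = "utf-8") -> int:
    """Approximate bytes used by one VARCHAR value: data plus length prefix."""
    data = len(value.encode(encoding))
    if data > max_bytes:
        raise ValueError("value does not fit in the column")
    prefix = 1 if max_bytes <= 255 else 2
    return data + prefix


print(varchar_storage_bytes("Mario", 5000))  # 7
print(varchar_storage_bytes("Mario", 100))   # 6
```

Row and page overhead of the storage engine is not included, so treat the result as a lower bound.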
The correct field size serves to limit the bad data that can be put in. For instance suppose you have a phone number field. If you allow 250 characters, you will often end up with things like the following in the phone field (an example not taken at random):
Call the good-looking blonde secretary instead.
So first limiting the length is part of how we enforce data integrity rules. As such it is critical.
Second, there is only so much space on a data page, and while some databases will allow you to create tables where the potential record is longer than the width of the data page, they often will not allow you to actually exceed it when storing the data. This can lead to some very hard-to-find bugs when suddenly one record can't be saved. I don't know whether MySQL does this, but I know SQL Server does, and it is very hard to figure out what is wrong. So making data the correct size can be critical to preventing bugs.
I'm learning about the usage of datatypes for databases.
For example:
Which is better for email? varchar[100], char[100], or tinyint (joking)
Which is better for username? should I use int, bigint, or varchar?
Explain. Some of my friends say that if we use int, bigint, or another numeric datatype it will be better (Facebook does it). For example, u=123400023 refers to user 123400023, rather than user=thenameoftheuser, since numbers take less time to fetch.
Which is better for phone numbers? Posts (like in blogs or announcements)? Or maybe dates (I use datetime for that)? Maybe some of you have done research that you would like to share.
Product price (I use decimal(11,2), don't know about you guys)?
Or anything else that you have in mind, like, "I use serial datatype for blablabla".
Why do I mention innodb specifically?
Unless you are using the InnoDB table types (see Chapter 11, "Advanced MySQL," for more information), CHAR columns are faster to access than VARCHAR.
InnoDB has some difference that I don't know about.
I read that from here.
Brief Summary:
(just my opinions)
for email address - VARCHAR(255)
for username - VARCHAR(100) or VARCHAR(255)
for id_username - use INT (unless you plan on over 2 billion users in your system)
phone numbers - INT or VARCHAR or maybe CHAR (depends on if you want to store formatting)
posts - TEXT
dates - DATE or DATETIME (definitely include times for things like posts or emails)
money - DECIMAL(11,2)
misc - see below
As far as using InnoDB because VARCHAR is supposed to be faster, I wouldn't worry about that, or speed in general. Use InnoDB because you need to do transactions and/or you want to use foreign key constraints (FK) for data integrity. Also, InnoDB uses row level locking whereas MyISAM only uses table level locking. Therefore, InnoDB can handle higher levels of concurrency better than MyISAM. Use MyISAM to use full-text indexes and for somewhat less overhead.
More importantly for speed than the engine type: put indexes on the columns that you need to search on quickly. Always put indexes on your ID/PK columns, such as the id_username that I mentioned.
More details:
Here's a bunch of questions about MySQL datatypes and database design (warning, more than you asked for):
What DataType should I pick?
Table design question
Enum datatype versus table of data in MySQL?
mysql datatype for telephne number and address
Best mysql datatype for grams, milligrams, micrograms and kilojoule
MySQL 5-star rating datatype?
And a couple questions on when to use the InnoDB engine:
MyISAM versus InnoDB
When should you choose to use InnoDB in MySQL?
I just use tinyint for almost everything (seriously).
Edit - How to store "posts:"
Below are some links with more details, but here's the short version. For storing "posts," you need room for a long text string. CHAR max length is 255, so that's not an option, and of course CHAR would waste unused characters versus VARCHAR, which is variable length CHAR.
Prior to MySQL 5.0.3, VARCHAR max length was 255, so you'd be left with TEXT. However, in newer versions of MySQL, you can use VARCHAR or TEXT. The choice comes down to preference, but there are a couple of differences. VARCHAR and TEXT max length is now both 65,535, but you can set your own max on VARCHAR. Let's say you think your posts will only need to be 2000 max; you can set VARCHAR(2000). If you ever run into the limit, you can ALTER your table later and bump it to VARCHAR(3000). On the other hand, TEXT actually stores its data in a BLOB (1). I've heard that there may be performance differences between VARCHAR and TEXT, but I haven't seen any proof, so you may want to look into that more, but you can always change that minor detail in the future.
More importantly, searching this "post" column using a Full-Text Index instead of LIKE would be much faster (2). However, you have to use the MyISAM engine to use a full-text index because InnoDB doesn't support it (this changed in MySQL 5.6, which added InnoDB full-text indexes). In a MySQL database, you can have a heterogeneous mix of engines for each table, so you would just need to make your "posts" table use MyISAM. However, if you absolutely need "posts" to use InnoDB (for transactions), then set up a trigger to update the MyISAM copy of your "posts" table and use the MyISAM copy for all your full-text searches.
See bottom for some useful quotes.
MySQL Data Type Chart (outdated)
MySQL Datatypes (outdated)
Chapter 10. Data Types (better details)
The BLOB and TEXT Types (1)
11.9. Full-Text Search Functions (2)
10.4.1. The CHAR and VARCHAR Types (3)
(3) "Values in VARCHAR columns are variable-length strings. The length can be specified as a value from 0 to 255 before MySQL 5.0.3, and 0 to 65,535 in 5.0.3 and later versions. Before MySQL 5.0.3, if you need a data type for which trailing spaces are not removed, consider using a BLOB or TEXT type.
When CHAR values are stored, they are right-padded with spaces to the specified length. When CHAR values are retrieved, trailing spaces are removed.
Before MySQL 5.0.3, trailing spaces are removed from values when they are stored into a VARCHAR column; this means that the spaces also are absent from retrieved values."
Lastly, here's a great post about the pros and cons of VARCHAR versus TEXT. It also speaks to the performance issue:
VARCHAR(n) Considered Harmful
There are multiple angles to approach your question.
From a design POV it is always best to choose the datatype which best expresses the quantity you want to model. That is, get the data domain and data size right so that illegal data cannot be stored in the database in the first place. But that is not where MySQL is strong, especially not with the default sql_mode (http://dev.mysql.com/doc/refman/5.1/en/server-sql-mode.html). If it works for you, try the TRADITIONAL sql_mode, which is a shorthand for many desirable flags.
From a performance POV, the question is entirely different. For example, regarding the storage of email bodies, you might want to read http://www.mysqlperformanceblog.com/2010/02/09/blob-storage-in-innodb/ and then think about that.
Removing redundancies and having short keys can be a big win. For example, in a project that I have seen, a log table has been storing http User-Agent information. By simply replacing each user agent string in the log table with a numeric id of a user agent string in a lookup table, data set size was considerably (more than 60%) reduced. By parsing the user agent further and then storing a bunch of ids (operating system, browser type, version index) data set size was reduced to 1% of the original size.
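The replacement of repeated strings by ids can be sketched in a few lines of Python. The user agent strings and the function name below are invented for the example; in a real schema the dictionary would be the lookup table and the list of ids would be the column in the log table.

```python
def build_lookup(values):
    """Assign a small integer id to each distinct string, in first-seen order."""
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids) + 1
        encoded.append(ids[value])
    return ids, encoded


log = [
    "Mozilla/5.0 (X11; Linux x86_64)",
    "curl/8.0",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
lookup, rows = build_lookup(log)
print(lookup)  # {'Mozilla/5.0 (X11; Linux x86_64)': 1, 'curl/8.0': 2}
print(rows)    # [1, 2, 1]
```

Each long string is stored once, and every log row carries only a short integer key.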
Finally, there are a number of rules that can help you spot errors in schema design.
For example, anything that has id in the name and is not an unsigned integer type is probably a bug (especially in the context of innodb).
For example, anything that has price or cost in the name and is not unsigned is a potential source of fraud (fraudster creates article with negative price, and buys that).
For example, anything that works on monetary data and is not using the DECIMAL data type of the appropriate size is probably doing math wrong (DECIMAL is doing BCD, decimal paper math with correct precision and rounding, DOUBLE and FLOAT do not).
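The difference between decimal and binary floating-point math is easy to see outside the database as well. In this sketch Python's decimal module stands in for the SQL DECIMAL type (it is not the same implementation), and the price is an arbitrary example value.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly.
print(0.1 + 0.2 == 0.3)                                      # False

# Decimal arithmetic behaves like paper math.
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True

price = Decimal("19.99")
total = price * 3
print(total)  # 59.97
```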
SQLyog has a "Calculate optimal datatype" feature, which helps find the optimal datatype based on the records inserted in a table.
It uses the query
SELECT * FROM `table_name` PROCEDURE ANALYSE(1, 10);
to find the optimal datatype.
In Access 2003 I need to display numbers like this while keeping the leading zeroes:
080000
090000
070000
What data type should I use for this?
Use a string (or text, or varchar, or whatever string variant your particular RDBMS uses) and pad it with whatever character you need ("0").
Key question:
Are the leading zeros meaningful data, or just formatting?
For instance, 07086 is my zip code, and the leading zero is meaningful, so US zip codes have to be stored as text.
Are the values '1', '01', '001' and '0001' considered to be unique, legal values or are they considered to be duplicates?
If the leading zero is not meaningful in your table, and is just there for formatting, then store the data as a number and format with leading zeros as needed for display purposes.
You can use the Format() function to do your formatting, as in this example query:
SELECT Format(number_field, "000000") AS number_with_leading_zeroes
FROM YourTable;
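The same "store the number, pad only for display" idea looks like this in application code. This is a Python sketch rather than Access; the function name is made up.

```python
def with_leading_zeros(number: int, width: int = 6) -> str:
    """Format an integer for display, left-padded with zeros to `width` digits."""
    return f"{number:0{width}d}"


for stored in (80000, 90000, 70000):
    print(with_leading_zeros(stored))
# 080000
# 090000
# 070000
```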
Also, number storage and indexing in all database engines I know of are more efficient than text storage and indexing, so with large data sets (100s of thousands of records and more), the performance drag of using text data type for numeric data can be quite large.
Last of all, if you need to do calculations on the data, you want them to be stored as numbers.
The key is to start from how the data is going to be used and choose your data type accordingly. One should worry about formatting only at presentation time (in forms and reports).
Appearance should never drive the choice of data types in the fields in your table.
If your real data looks like your examples and has a fixed number of digits, just store the data in a numeric field and use the format/input mask attributes of the column in Access table design to display them with the padded zeros.
Unless you have a variable number of leading zeros there is no reason to store them, and it is generally a bad idea. Unnecessarily using a text type can hurt performance, make it easier to introduce anomalous data, and make it harder to query the database.
Fixed width character with Unicode compression with a CHECK constraint to ensure exactly six numeric characters e.g. ANSI-92 Query Mode syntax:
CREATE TABLE IDs
(
ID CHAR(6) WITH COMPRESSION NOT NULL
CONSTRAINT uq__IDs UNIQUE,
CONSTRAINT ID__must_be_six_numeric_chars
CHECK (ID ALIKE '[0-9][0-9][0-9][0-9][0-9][0-9]')
);
Do you need to retain them as numbers within the table (i.e. do you think you will need to do aggregations within queries, such as SUM etc.)?
If not then a text/string datatype will suffice.
If you DO, then perhaps you need 2 fields:
one to store the number [i.e. 80000], and
one to store some meta-data about how the value needs to be displayed, perhaps some sort of mask or formatting pattern [e.g. '000000'].
You can then use the above pattern string to format the display of the number
If you're using a .NET language you can use System.String.Format() or System.Object.ToString().
If you're using Access forms/reports, then Access uses very similar string formatting patterns in its UI controls.
When I'm setting up a MySQL table, it asks me to define the name of the column, type of input, and length. My assumption, without having read anything about it, is that it's for minimization. Specify the smallest possible int/smallint/tinyint for your needs, and it will reduce overhead of some sort. If it's all positives, make it unsigned to double your space, etc.
What happens if I just make every field a varchar-200 characters? When/why is this bad, what will I miss out on, and when will any inefficiencies manifest themselves? 100k records?
I think about this every time I set up a DB, but I haven't built anything to scale enough where I've ever had my scheme setup inappropriately, either too "strict/small" or "loose/big". Can someone confirm that I'm making good assumptions about speed and efficiency?
Thanks!
Data types not only optimize storage, but how data is indexed. As your databases get bigger, it will become apparent that it's quicker to search for all the records that have a 1 in an integer field than those that have a "1" in a varchar field. This becomes especially important when you're joining data from more than one table and your database engine is having to do this sort of thing repeatedly. (Daren also rightly points out below that it's important that the types of the fields you're matching on are identical as well.)
The level at which these inefficiencies become an issue depends greatly on your hardware and your application design. We have big enough iron these days that if you're building moderate-scale apps, you may not see an appreciable difference. (Aside from feeling a little bit guilty about your database design!) But establishing good habits on small projects makes the bigger ones easier when they come along.
If you have two columns as varchar and put in the values 10 and 20 and add them, you'll get 1020 instead of 30 which you'd likely expect.
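The same effect shows up in any language once numbers are held as strings. A small Python illustration (not database code), including the related surprise with sort order:

```python
a, b = "10", "20"
print(a + b)            # 1020 (string concatenation)
print(int(a) + int(b))  # 30   (numeric addition)

# Strings are ordered character by character, numbers by value.
print(sorted(["9", "10", "2"]))  # ['10', '2', '9']
print(sorted([9, 10, 2]))        # [2, 9, 10]
```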
Sure, you could save everything as VARCHAR strings. But you'd be giving up a lot of functionality provided by the database engine.
You should choose the database type that most closely matches the intended use of the column. For example, using DATE or DATETIME to store dates provides you with all sorts of date/time functions that you don't get with basic VARCHAR types.
Likewise, fields used to count things or provide simple unique IDs should be INT or one of its related types. Also bear in mind that an INT occupies only 4 bytes, whereas a 9-digit string uses at least 9 bytes.
For character data, it's wise to use NVARCHAR for internationalized values that users in any locale are going to enter (esp. names and locations). If you know the text is limited to US or internal use only, VARCHAR is safe.