I have a "best practices" question, concerning indexing.
I have to index phone numbers, and I would normally define the column as an integer. I could split the number into multiple columns: country code, area code, prefix, suffix. But since I have to account for international numbers, and the numbers get kinda funny in certain countries, I prefer to keep one column.
So my question is, should I define the column as an integer, a char, or a varchar?
I do strip out anything that isn't a digit, so varchar is probably not needed.
I have to provide the searching ability for my clients, thus I need to index the number.
If all the phone numbers were from the US, then I'd separate the columns, but I'm catering to the international too.
So I'm curious about the indexing part, and other people's practices in this arena. Is it best to index with integers (for something like this), or does it even matter?
As a side note, the phone numbers are not going to be all the same length. Which is why I ask about formatting the column structure in char or varchar.
thanks guys!!
How large is the table expected to be? The reason I ask is that indexes on ints are going to be smaller, obviously, but on a small table this isn't a major consideration. Using varchar gives you more flexibility to do things like "... WHERE phonenumber LIKE '415%'", at the cost of a larger index. If the table is quite large, and the box it runs on is at all memory-constrained, you can run into the situation where your index doesn't fit into memory, sending queries against that index into swap hell. This can be exacerbated by your choice of storage engine: InnoDB stores the primary key columns in every secondary index entry, for example, which can bloat your indexes if your PK is on a wide field or fields.
Phone numbers can include # and *, so I would recommend against using integers.
Also, the international prefix is written as +. This supports international dialing regardless of the country you are in.
e.g. in South Africa you need to prefix the country code with 09; in Europe the prefix is 00.
To make the numbers work everywhere you replace the international prefix with + and your cellphone will replace this with the local prefix for dialing abroad.
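The "+" convention described above can be sketched in a few lines of Python. This is only a minimal sketch, not a full phone number parser: the function name `to_plus_form` and its `dial_out_prefixes` parameter are made up for illustration, and the right prefix list depends on where the number was typed.

```python
import re

def to_plus_form(raw, dial_out_prefixes=("00",)):
    """Rewrite a number typed with a local dial-out prefix into '+' form.

    dial_out_prefixes is a hypothetical parameter: the prefixes the
    caller's country uses to dial abroad (an assumption, not a standard list).
    """
    # Keep digits plus the characters +, # and *; drop spaces, dashes, parentheses.
    cleaned = re.sub(r"[^\d+#*]", "", raw)
    if cleaned.startswith("+"):
        return cleaned
    for prefix in dial_out_prefixes:
        if cleaned.startswith(prefix):
            return "+" + cleaned[len(prefix):]
    # No international prefix found: treat it as a national number, unchanged.
    return cleaned

print(to_plus_form("00 44 20 7946 0018"))   # +442079460018
print(to_plus_form("+44 (20) 7946-0018"))   # +442079460018
```

Both spellings end up as the same string, which is what makes the "+" form convenient to store and compare.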
I'd use a varchar for phone numbers.
Furthermore, I'd use an integer auto_increment as primary key and use the phone number as secondary key, in order to keep performance on InnoDB snappy.
Also remember that people can 'share' a phone number, so it's not guaranteed to be unique.
I'm taking the Meta Data Engineer Professional Certificate and I was just given this prompt in a lab:
Mr. Carl needs to have a new table to store the contact details of each customer including customer account number, customer phone number and customer email address.
You are required to choose a relevant data type for each of the columns.
Solution:
Account number: INTEGER
Phone number: INTEGER
Email: VARCHAR
Prior to reading the solution I selected VARCHAR(10) as the datatype for storing phone numbers as I thought they should be treated as string data. My reasoning is that there's no reason to perform any sort of mathematical operation on a phone number, and they're often typed with other characters like "(" or "-".
Is there any compelling reason for storing a phone number as an INT? Do you agree with the solution to this prompt? What is the best practice for storing phone numbers?
Is "Meta Data Engineer Professional Certificate" aimed at MySQL?
General Professional: If not MySQL-specific, then you need to understand that "INTEGER" is implemented in different ways by different database engines.
MySQL Professional: INTEGER, in MySQL, maps to INT SIGNED, which tops out at about 2 billion, so only 9 digits are guaranteed to fit. I don't know what the longest phone number is worldwide, but I know that 10 digits are needed.
BIGINT gives you about 18 digits (in 8 bytes), but that seems silly. For the reasons already mentioned VARCHAR(...) is reasonable. (Perhaps a limit of 20 would be quite sufficient.) In that case, a 10-digit number would take 11 bytes (1 for length, plus 10 for the number.)
Arguably, you could say, for example DECIMAL(15) to allow up to 15 digits in a 7-byte column.
(I prefer VARCHAR, in spite of it taking the most space.)
Either way: It is a bad test question if it does not understand the two cases I present here.
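The size limits discussed above are easy to check with plain arithmetic. The sketch below uses Python integers to stand in for the MySQL types; the sample phone numbers are made-up examples, and the helper `fits` is hypothetical.

```python
INT_SIGNED_MAX = 2**31 - 1        # 2147483647, upper bound of a 4-byte signed integer
BIGINT_SIGNED_MAX = 2**63 - 1     # upper bound of an 8-byte signed integer

def fits(number_text, upper_bound):
    """Return True if the digit string, read as an integer, is within the bound."""
    return int(number_text) <= upper_bound

print(fits("4155551212", INT_SIGNED_MAX))      # False
print(fits("4155551212", BIGINT_SIGNED_MAX))   # True
print(len(str(BIGINT_SIGNED_MAX)))             # 19
print(int("0201234567"))                       # 201234567
```

The last line shows a second problem with integer storage: a leading zero disappears as soon as the text is read as a number.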
Non digits: 'typed with other characters like "(" or "-"' -- That brings up a different issue. It comes under the general heading of GIGO. Cleanse the data before storing it into the database.
If you ever needed to compare two phone numbers for equality, you would wish you had removed all non-digits. (Or added them in some canonical way, such as US: "(800)543-1212".)
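As a minimal sketch of that cleansing step, assuming a digits-only canonical form is what you want (the helper name `digits_only` is made up for illustration):

```python
import re

def digits_only(raw):
    """Drop everything except the digits 0-9."""
    return re.sub(r"[^0-9]", "", raw)

a = "(800)543-1212"
b = "800 543 1212"
print(a == b)                             # False
print(digits_only(a) == digits_only(b))   # True
```

Note that this would also fold an extension such as "x543" into the main number, so extensions need to be split off before cleansing.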
User input: If you ever create a UI for entering phone numbers, dates, SSNs (or other numbers with some structure), DO NOT require the user to follow some punctuation rules. DO allow a variety of typical formats. (OK, dates are tricky because there are incompatible orderings. But if I type "1-1-2021", will you spit at me for not having the leading zeroes?)
Indexing: VARCHAR, DECIMAL, INT, etc are all indexable. Any speed difference is not significant.
Extensions: Without VARCHAR, how would you represent the "extension" in "(800)543-1212x543"? Might this point be the deciding factor in favor of VARCHAR? And perhaps you should file a bug report against that 'Certification' test.
Duplicate?: "Which is best data type for phone number in MySQL and what should Java type mapping for it be?" covers most of what I have said, and hints that [perhaps] VARCHAR(20) is sufficient. (The quoted 15 excludes the international prefix.)
In my opinion, there is no absolute best choice here. Both have pros and cons. Personally, I'm in favor of using varchar. Though special characters like the hyphen can cause dupes when mishandled (it's a rare case, and the user is to blame, as the input should be verified before submitting), it does have the merit of formatting the phone number, which improves readability, e.g. area_code-tel: xxx-xxxxxxxx (without it, it's nearly impossible to separate the area code from the phone number, as both can vary in length). About indexing: though numerics do have advantages over strings, I'm not sure a phone number would be used as an index. There are more worthy candidates such as ID or date, but what would a phone number do? Usually we look up the phone based on an indexed column such as ID, but how often do we fetch something based on the phone number? Unless we want to list all phones from a particular area, we don't really need it to be indexed. In that case it would actually be more fitting to use special characters like the hyphen to help determine the area part.
P.S Like Ken White kindly suggested, there are cases when phone numbers should be indexed, especially when they are more suitable to be an identifier.
Storing phone numbers as strings can be a disaster. The first things that come to mind are:
You can get dupes easily: maybe someone types the number with ( and/or - and another user types the same number without those characters. Long story short, you end up with a duplicate.
Storing the phone number as an integer makes much more sense in terms of normalization and avoiding duplicates.
Also think about a search in the scenario above: what would you use? A LIKE? A numeric operator? Casts spread all over? Messy...
Now comes the important part, which is related to indexing: the int will be faster. The longer the varchar, the slower it gets, although you are limiting its length.
The validation can be done in the UI with a field mask, or with a regex in the logic layer, whichever makes more sense for you.
Hope I helped a little bit :)
I have to create a booking software and I've started to design the database.
The room may be anywhere in the world, so I would like advice on storing addresses.
After reading a lot of similar questions I designed this schema. What do you think of it? Is it enough to store all types of addresses?
I followed this structure.
Don't use a 4-byte INT for "country_id", use the standard 2-char "country codes"; it is smaller and avoids a JOIN.
There is not really much need to have "city" split out from address -- it does not save enough space to matter, nor is it useful for "normalization".
So, combine Country and City into Address.
It is rarely useful to have 1:1 relationships; it is almost always better to merge the two tables into one. I am thinking about Building and Address.
Might you have a billion "floors"? Don't use 4-byte INT when 1-byte TINYINT UNSIGNED is more appropriate. See also SMALLINT UNSIGNED (2 bytes, range of 0..65535).
DECIMAL(6,4)/(7,4) is all that is needed for lat/long. 8 decimal places is getting into microscopic distances.
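To put numbers on that, here is a rough back-of-the-envelope calculation. It assumes about 40,075 km for the Earth's circumference at the equator, so the figures are approximations, and the helper `resolution_m` is hypothetical.

```python
# Assumption: roughly 40,075 km around the equator, i.e. about 111 km per degree.
METERS_PER_DEGREE = 40_075_000 / 360

def resolution_m(decimal_places):
    """Approximate ground distance represented by the last stored decimal place."""
    return METERS_PER_DEGREE * 10 ** -decimal_places

print(resolution_m(4))          # roughly 11 meters
print(resolution_m(8) * 1000)   # roughly 1 millimeter
```

Four decimal places already pin a location down to about the size of a building, which is why more precision rarely pays for itself in an address table.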
Pick the appropriate CHARACTER SET. country_code and (I suspect) postal_code can be ascii anywhere in the world. (I could be wrong -- Bangladesh uses non-Arabic numerals on license plates.) Other VARCHARs should probably be utf8mb4 if you are truly international.
It is good that you read a lot of similar questions to come up with the schema. It is unfortunate that they were novices.
I'm new to SQL, and thinking about my datasets relationally instead of hierarchically is a big shift for me. I'm hoping to get some insight on the performance (both in terms of storage space and processing speed) versus design complexity of using numeric row IDs as a primary key instead of string values which are more meaningful.
Specifically, this is my situation. I have one table ("parent") with a few hundred rows, for which one column is a string identifier (10-20 characters) which would seem to be a natural choice for the table's primary key. I have a second table ("child") with hundreds of thousands (or possibly millions or more) of rows, where each row refers to a row in the parent table (so I could create a foreign key constraint on the child table). (Actually, I have several tables of both types with a complex set of references among them, but I think this gets the point across.)
So I need a column in the child table that gives an identifier to rows in the parent table. Naively, it seems like creating the column as something like VARCHAR(20) to refer to the "natural" identifier in the first table would lead to a huge performance hit, both in terms of storage space and query time, and therefore I should include a numeric (probably auto_increment) id column in the parent table and use this as the reference in the child. But, as the data that I'm loading into MySQL don't already have such numeric ids, it means increasing the complexity of my code and more opportunities for bugs. To make matters worse, since I'm doing exploratory data analysis, I may want to muck around with the values in the parent table without doing anything to the child table, so I'd have to be careful not to accidentally break the relationship by deleting rows and losing my numeric id (I'd probably solve this by storing the ids in a third table or something silly like that.)
So my question is, are there optimizations I might not be aware of that mean a column with hundreds of thousands or millions of rows that repeats just a few hundred string values over and over is less wasteful than it first appears? I don't mind a modest compromise of efficiency in favor of simplicity, as this is for data analysis rather than production, but I'm worried I'll code myself into a corner where everything I want to do takes a huge amount of time to run.
Thanks in advance.
I wouldn't be concerned about space considerations primarily. An integer key would typically occupy four bytes. The varchar will occupy between 1 and 21 bytes, depending on the length of the string. So, if most are just a few characters, a varchar(20) key will occupy more space than an integer key. But not an extraordinary amount more.
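A quick calculation shows the scale involved. This sketch counts raw column bytes only and ignores index and row overhead; the sample identifier and the row count are made up, and the one-byte length prefix assumes a short VARCHAR with single-byte characters.

```python
INT_BYTES = 4

def varchar_bytes(value):
    """Bytes for a short VARCHAR value: 1 length byte plus the encoded characters."""
    return 1 + len(value.encode("utf-8"))

rows = 1_000_000
key = "customer_segment_07"          # hypothetical 19-character identifier
extra = (varchar_bytes(key) - INT_BYTES) * rows

print(varchar_bytes(key))            # 20
print(extra / 1_000_000)             # 16.0 (MB of extra raw column data)
```

Sixteen extra megabytes for a million child rows is real but small, which matches the point that space is not the main argument here.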
Both, by the way, can take advantage of indexes. So speed of access is not particularly different (of course, longer/variable length keys will have marginal effects on index performance).
There are better reasons to use an auto-incremented primary key.
You know which values were most recently inserted.
If duplicates appear (which shouldn't happen for a primary key of course), it is easy to determine which to remove.
If you decide to change the "name" of one of the entries, you don't have to update all the tables that refer to it.
You don't have to worry about leading spaces, trailing spaces, and other character oddities.
You do pay for the additional functionality with four more bytes in a record devoted to something that may not seem useful. However, such efficiencies are premature and probably not worth the effort.
Gordon is right (which is no surprise).
Here are the considerations for you not to worry about, in my view.
When you're dealing with dozens of megarows or less, storage space is basically free. Don't worry about the difference between INT and VARCHAR(20), and don't worry about the disk space cost of adding an extra column or two. It just doesn't matter when you can buy decent terabyte drives for about US$100.
INTs and VARCHARS can both be indexed quite efficiently. You won't see much difference in time performance.
Here's what you should worry about.
There is one significant pitfall in index performance, that you might hit with character indexes. You want the columns upon which you create indexes to be declared NOT NULL, and you never want to do a query that says
WHERE colm IS NULL /* slow! */
or
WHERE colm IS NOT NULL /* slow! */
This kind of thing defeats indexing. In a similar vein, your performance will suffer bigtime if you apply functions to columns in search. For example, don't do this, because it too defeats indexing.
WHERE SUBSTR(colm,1,3) = 'abc' /* slow! */
One more question to ask yourself. Will you uniquely identify the rows in your subsidiary tables, and if so, how? Do they have some sort of natural compound primary key? For example, you could have these columns in a "child" table.
parent      varchar(20)   pk   fk to parent table
birthorder  int           pk
name        varchar(20)
Then, you could have rows like...
parent   birthorder   name
homer    1            bart
homer    2            lisa
homer    3            maggie
But, if you tried to insert a fourth row here like this
homer 1 badbart
you'd get a primary key collision because (homer,1) is occupied. It's probably a good idea to work out how you'll manage primary keys for your subsidiary tables.
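The collision is easy to reproduce. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs anywhere) and leaves out the foreign key to the parent table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE child (
           parent     VARCHAR(20) NOT NULL,
           birthorder INTEGER     NOT NULL,
           name       VARCHAR(20),
           PRIMARY KEY (parent, birthorder)
       )"""
)
rows = [("homer", 1, "bart"), ("homer", 2, "lisa"), ("homer", 3, "maggie")]
conn.executemany("INSERT INTO child VALUES (?, ?, ?)", rows)

try:
    # (homer, 1) already exists, so the compound primary key rejects this row.
    conn.execute("INSERT INTO child VALUES (?, ?, ?)", ("homer", 1, "badbart"))
    collided = False
except sqlite3.IntegrityError:
    collided = True

count = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
print(collided)   # True
print(count)      # 3
```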
Character strings containing numbers sort funny. For example, '2' comes after '101'. You need to be on the lookout for this.
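Python strings compare character by character in the same way, so the effect can be shown with a short sketch (the sample values are arbitrary):

```python
values = ["2", "101", "15"]

print(sorted(values))            # ['101', '15', '2']
print(sorted(values, key=int))   # ['2', '15', '101']

# Zero-padding to a fixed width makes string order agree with numeric order.
padded = [v.zfill(3) for v in values]
print(sorted(padded))            # ['002', '015', '101']
```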
The main benefit you get from numeric values is that they are easier to 'index'. Indexing is a process that MySQL uses to make it easier to find a value.
Typically, if you want to find a value in a group, you have to loop through the group looking for your value. That is slow and has a worst case of O(n). If instead your data were in a nice, searchable format, like a binary search tree, it could be found in O(log n), which is much faster.
Indexing is the process MySQL uses to prepare data to be searched: it generates search trees and other clever do-bobs that make finding data quick. It makes many searches much faster. However, to do it, it has to compare the value you are searching for to various 'key' values to determine if your value is greater than or less than the key.
This comparison can be done on non-numeric values. However, comparing non-numeric values is much slower. If you want to be able to quickly look up data, your best bet is to have an integer 'key' that you use.
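The difference between scanning and searching sorted data can be sketched with Python's bisect module. A sorted list stands in for the index here, which is a simplification of what a database engine actually builds, and the function names are made up for illustration.

```python
from bisect import bisect_left

def linear_find(items, target):
    """O(n): look at every item until the target turns up."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_find(sorted_items, target):
    """O(log n): halve the search range on every comparison."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

int_keys = list(range(0, 1000, 2))              # sorted integer keys
str_keys = sorted(str(k) for k in int_keys)     # the same keys as sorted strings

print(binary_find(int_keys, 500) == linear_find(int_keys, 500))       # True
print(binary_find(str_keys, "500") == linear_find(str_keys, "500"))   # True
print(binary_find(str_keys, "501"))                                   # -1
```

The same search works on both integers and strings; only the cost of each individual comparison differs.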
Numeric row IDs have many advantages over a string-based ID.
Most of them are mentioned in other answers.
1. One of them is indexing. Primary keys are indexed by default in a relational database, so having a numeric key is more efficient.
2. Numeric fields are stored much more efficiently.
3. Joins are much faster with numeric keys.
4. A row ID could be a foreign key. Numeric IDs are compact to store, making them efficient.
5. I think using auto-increment on the primary key has its advantages too.
-Thanks
_san
I'm working with tons of phone numbers, and many are international.
I've changed my phone numbers table structure to have 5 columns:
`phonenumbers`.`phoneID`
`phonenumbers`.`countrycode`
`phonenumbers`.`areacode`
`phonenumbers`.`phonenumber`
`phonenumbers`.`ext`
At the moment the phoneID is the only column that's an INT, since it's the primary key.
Should I change the other columns to integers? I've heard indexes work best with numeric values, and I'm only storing numbers in each of the columns (no dashes, parentheses, spaces, etc.).
I'm still learning how MySQL works with indexes, so I'm curious how others work with searching for numbers. In this case, I'm sure I'll be searching for numbers that start with a certain known areacode and part of a known phonenumber, or an entire phonenumber.
The part that gets me with indexing table columns like phone numbers is that I don't always know how long a phonenumber will be, since countries have different lengths for areacodes and phonenumbers.
In summary, INT vs VARCHAR indexing with numbers.
Phone numbers are not integers, so don't store them as one, it'll just cause you trouble. The obvious cases are when you have to handle phone numbers too big to fit in an int, or phone numbers starting with a 0.
Moreover, as you want to do prefix matches (phonenumber like '800%'), mysql will be able to use indexes if you're using varchar columns.
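Why a prefix match works well on sorted strings can be sketched as follows. A sorted Python list stands in for the index (a simplification), the helper `prefix_range` is hypothetical, and it assumes a non-empty prefix.

```python
from bisect import bisect_left

def prefix_range(sorted_numbers, prefix):
    """Return all entries starting with prefix, using two binary searches."""
    lo = bisect_left(sorted_numbers, prefix)
    # Smallest string that sorts after every string starting with prefix:
    # bump the last character of the prefix by one code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_numbers, upper)
    return sorted_numbers[lo:hi]

numbers = sorted(["8005431212", "4155550100", "8001234567", "0800123456", "80"])
print(prefix_range(numbers, "800"))   # ['8001234567', '8005431212']
```

All strings sharing a prefix sit next to each other in sorted order, so the match becomes one contiguous range. Integers of varying length have no such property.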
You have to figure out how you're querying this data. If you're frequently doing queries like where countrycode='1' and areacode='123' and phonenumber like '2%', you'd want a compound index on (countrycode,areacode,phonenumber). If you're also often doing queries on only the phonenumber, you'd want an additional index on just the phonenumber column. This is something you have to work out depending on the amount of data you have and the queries you run. Work with EXPLAIN to learn how your indexes are used and where they are needed.
Use varchar for representing phone numbers NOT integers. Otherwise you will find your design decision will come back to bite you.
Also: "I've heard indexes work best with numeric values" - well, that's not strictly accurate: yes the index will take up less space, and more rows will fit per page etc, but an index on a varchar column works perfectly well.
Worry about index size and performance when (1) you have a huge amount of data and (2) when you have measured a performance problem.
In my opinion you have a lot of attributes that you don't need. For phone numbers I usually use an auto-increment key for the id and store the phone number as a varchar. This makes validation in a programming language easier. That's my opinion...
Use a BIGINT UNSIGNED simply because this forces you to normalize your data. Force your user to store the phonenumber at root level, that is, at country level. You could store the country prefix in a separate column to ease the usage.
Everybody types phone-numbers in different ways and this makes it almost impossible to search the data.
E.g. %020123456% will not match 02 0123456. Are you going to search all combinations or just parse it?
This I know from experience: we had to manually fix about 1,000 phonenumbers that we could not script out when installing an auto-dialer.
I was asked to use a database in which most of the primary keys, and other fields as well, use char(n) to store numeric values with padding, for example:
product_id: char(8) [00005677]
user_id: char(6) [000043]
category_id: char(2) [05]
The reason they want to use it like that is to be able to use characters (in the far future) if they want. However, they have many rules based on numbers: for example, category_id from 01 to 79 corresponds to a general category, from 80 to 89 to a special category, and from 90 to 99 to a user-defined category.
I personally think that using char(n) to store numbers is a bad practice. My reasons are:
Using char, " " != 0, 0 != 00, 05 != 5, 00043 != 000043, and so on. For that reason, the values have to be constantly checked (to prevent data corruption).
If I pad a number (0 -> 00), then I have to pay attention not to pad a character (A -> 0A).
If characters are used, then ranges become strange, something like: from 01 to 79 and AB and RX and TZ and S, etc...
Indexing numbers instead of chars results in a performance gain.
I'm proposing to change it to decimal(n) with zerofill to make it more "error-proof", as this information is modified by different sources (web, windows client, upload csv). If they want to add more categories, for example, then updating from decimal(2) to decimal(3) will be easier.
My question then is: Am I wrong? can char(n) be trusted for this task? If "chars" are evil with numbers, then which other disadvantages am I missing in the above list (I may need better reasons if I want to win my case)?
TIA (any comment/answer will be appreciated).
If this was SQL Server or Oracle or any other RDBMS, I would recommend enforcing a check constraint on those columns so that the data always matched the full capacity of the column - this would ensure your identifiers are uniform.
Unfortunately, MySQL versions before 8.0.16 parse CHECK constraints but silently ignore them; 8.0.16 and later enforce them.
While it wouldn't stop the annoyance of having to pad things coming into the database or in search routines, on the client or in procs in the database, it would guarantee you that the fields were clean at the lowest level.
I find using constraints like this help avoid things getting badly out of hand.
As far as the optimization by using numbers, if they have to accommodate non-numeric characters in the future, that's not going to be an option.
It is very common to have natural keys (which could be candidates for a primary key) with varchar/char data, but yet instead enforce referential integrity on surrogate keys (usually some kind of autonumbering integer which is simply an internal reference, and often the clustered index and primary key).
Quoting your question:
...store numeric values with padding...
You did not show any examples of numeric data, only character data that happens to consist of numbers. If you had said that their OrderTotal column was a char(10), then I'd start to worry.
Just treat this as character data and you will be fine. I can see no business or technical case to change the database (unless you are beginning a near-total rewrite).
Regarding performance... If this is actually a concern, then you most likely have far bigger issues to deal with. MySQL is fast and accurate.
--
Write a function somewhere that will zerofill user inputted ID's for the purpose of querying. Use this function everywhere you need to accept user input. NEVER EVER use a numeric data type to store your data (if PHP, never use +, always use . to concat, etc...)
Remember, this is no different than Item_Number = "SHIRT123" or any other string ID you may encounter.
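A minimal sketch of such a zerofill function, written in Python. The name `zerofill_id` and the rule of leaving IDs that contain letters unpadded are assumptions for illustration; the widths come from the column definitions in the question.

```python
def zerofill_id(user_input, width):
    """Pad a user-typed ID to the column width without treating it as a number."""
    text = user_input.strip()
    if not text or len(text) > width:
        raise ValueError(f"ID must be 1 to {width} characters: {user_input!r}")
    # Only all-digit IDs are padded; an ID containing letters is left as typed.
    if text.isascii() and text.isdigit():
        return text.zfill(width)
    return text

print(zerofill_id("5677", 8))    # 00005677
print(zerofill_id(" 43 ", 6))    # 000043
print(zerofill_id("5", 2))       # 05
```

Because the value stays a string from input to query, leading zeros are never lost along the way.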
Take care