I'm working with tons of phone numbers, and many are international.
I've changed my phone numbers table structure to have 5 columns:
`phonenumbers`.`phoneID`
`phonenumbers`.`countrycode`
`phonenumbers`.`areacode`
`phonenumbers`.`phonenumber`
`phonenumbers`.`ext`
At the moment the phoneID is the only column that's an INT, since it's the primary key.
Should I change the other columns to integers? I've heard indexes work best with numeric values, and I'm only storing numbers in each of the columns (no dashes, parentheses, spaces, etc.).
I'm still learning how MySQL works with indexes, so I'm curious how others work with searching for numbers. In this case, I'm sure I'll be searching for numbers that start with a certain known areacode and part of a known phonenumber, or an entire phonenumber.
The part that gets me with indexing table columns like phone numbers is that I don't always know how long a phonenumber will be, since countries have different lengths for areacodes and phonenumbers.
In summary, INT vs VARCHAR indexing with numbers.
Phone numbers are not integers, so don't store them as one, it'll just cause you trouble. The obvious cases are when you have to handle phone numbers too big to fit in an int, or phone numbers starting with a 0.
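Both failure modes can be shown with a small Python sketch. The sample numbers are made up for illustration, and the limits are the documented ranges of MySQL's 4-byte INT type:

```python
# MySQL's signed INT tops out at 2**31 - 1; unsigned INT at 2**32 - 1.
INT_SIGNED_MAX = 2**31 - 1      # 2147483647
INT_UNSIGNED_MAX = 2**32 - 1    # 4294967295

def survives_as_int(digits: str, max_value: int = INT_SIGNED_MAX) -> bool:
    """True only if the digit string fits the range and round-trips unchanged."""
    value = int(digits)
    return value <= max_value and str(value) == digits

# Hypothetical sample numbers, chosen only to illustrate the two failure modes.
print(survives_as_int("2025550123"))   # True: fits and has no leading zero
print(survives_as_int("02079460000"))  # False: the leading 0 is dropped
print(survives_as_int("8005551212"))   # False: larger than 2147483647
```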
Moreover, as you want to do prefix matches (phonenumber like '800%'), mysql will be able to use indexes if you're using varchar columns.
You have to figure out how you're querying this data. If you're frequently running queries like where countrycode='1' and areacode='123' and phonenumber like '2%', you'd want a compound index on (countrycode,areacode,phonenumber). If you're also often querying on only the phonenumber, you'd want an additional index on just the phonenumber column. This is something you have to work out depending on the amount of data you have and the queries you run; work with EXPLAIN to learn how your indexes are used and where they are needed.
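To see why a prefix match can use a compound index, it may help to model the index as a sorted list of tuples. This is only a rough Python analogy, not how MySQL implements its B-trees; the rows and the `lookup` helper are invented for the illustration:

```python
from bisect import bisect_left

# Rows kept in the order a compound index on
# (countrycode, areacode, phonenumber) would keep them. Sample data is made up.
index = sorted([
    ("1", "123", "2005550"),
    ("1", "123", "2995551"),
    ("1", "123", "5550000"),
    ("1", "415", "2005550"),
    ("44", "20", "79460000"),
])

def lookup(countrycode: str, areacode: str, phone_prefix: str) -> list:
    """Find rows matching countrycode = ?, areacode = ?, phonenumber LIKE 'prefix%'."""
    lo = bisect_left(index, (countrycode, areacode, phone_prefix))
    # Appending the largest code point yields an upper bound for the prefix range.
    hi = bisect_left(index, (countrycode, areacode, phone_prefix + chr(0x10FFFF)))
    return index[lo:hi]

print(lookup("1", "123", "2"))
```

Two binary searches locate the matching range because the rows are ordered by the leading columns first. A search on phonenumber alone gets no help from this ordering, which is why a separate index on that column is suggested for such queries.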
Use varchar for representing phone numbers NOT integers. Otherwise you will find your design decision will come back to bite you.
Also: "I've heard indexes work best with numeric values" - well, that's not strictly accurate: yes the index will take up less space, and more rows will fit per page etc, but an index on a varchar column works perfectly well.
Worry about index size and performance when (1) you have a huge amount of data and (2) you have measured a performance problem.
In my opinion you have a lot of attributes that you don't need. For phone numbers I usually use an auto-increment key for the id and store the phone number as a varchar. This makes validation in a programming language easier. That's just my opinion...
Use a BIGINT UNSIGNED, simply because this forces you to normalize your data. Force your user to store the phone number at root level, that is, at country level. You could store the country prefix in a separate column to ease usage.
Everybody types phone-numbers in different ways and this makes it almost impossible to search the data.
E.g. %020123456% will not match 02 0123456. Are you going to search all combinations or just parse it?
This I know from experience: when installing an auto-dialer, we had to manually fix about 1,000 phone numbers that we could not clean up by script.
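The matching problem goes away if every number is reduced to digits before it is stored or searched. A minimal Python sketch, where `digits_only` is a hypothetical helper name:

```python
import re

def digits_only(raw: str) -> str:
    """Strip everything that is not a digit (spaces, dashes, parentheses, ...)."""
    return re.sub(r"\D", "", raw)

stored = digits_only("02 0123456")
query = digits_only("020-123456")

print("020123456" in "02 0123456")  # False: the raw substring search misses
print(query in stored)              # True: both sides were normalized first
```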
I'm taking the Meta Data Engineer Professional Certificate and I was just given this prompt in a lab:
Mr. Carl needs to have a new table to store the contact details of each customer including customer account number, customer phone number and customer email address.
You are required to choose a relevant data type for each of the columns.
Solution:
Account number: INTEGER
Phone number: INTEGER
Email: VARCHAR
Prior to reading the solution I selected VARCHAR(10) as the datatype for storing phone numbers as I thought they should be treated as string data. My reasoning is that there's no reason to perform any sort of mathematical operation on a phone number, and they're often typed with other characters like "(" or "-".
Is there any compelling reason for storing a phone number as an INT? Do you agree with the solution to this prompt? What is the best practice for storing phone numbers?
Is "Meta Data Engineer Professional Certificate" aimed at MySQL?
General Professional: If not MySQL-specific, then you need to understand that "INTEGER" is implemented in different ways by different database engines.
MySQL Professional: INTEGER, in MySQL, maps to INT SIGNED, which is limited to about 2 billion. That guarantees only 9 digits. I don't know what the maximum phone number length is worldwide, but I know that at least 10 digits are needed.
BIGINT gives you about 18 digits (in 8 bytes), but that seems silly. For the reasons already mentioned VARCHAR(...) is reasonable. (Perhaps a limit of 20 would be quite sufficient.) In that case, a 10-digit number would take 11 bytes (1 for length, plus 10 for the number.)
Arguably, you could say, for example DECIMAL(15) to allow up to 15 digits in a 7-byte column.
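The digit counts and byte sizes quoted above can be rechecked with a short Python sketch. The `decimal_bytes` helper follows the packing rule in the MySQL manual (4 bytes per 9 digits, fewer bytes for leftover digits) and assumes a scale of 0. Both function names are made up for the illustration:

```python
def full_digits(max_value: int) -> int:
    """Largest d such that every d-digit number fits in the range.

    Valid for limits that are not themselves all 9s, which holds for 2**n - 1.
    """
    return len(str(max_value)) - 1

def decimal_bytes(digits: int) -> int:
    """Storage for a MySQL DECIMAL(digits, 0): 4 bytes per 9 digits plus leftovers."""
    leftover_bytes = [0, 1, 1, 2, 2, 3, 3, 4, 4]  # index = leftover digit count
    full_groups, leftover = divmod(digits, 9)
    return full_groups * 4 + leftover_bytes[leftover]

print(full_digits(2**31 - 1))   # INT SIGNED
print(full_digits(2**63 - 1))   # BIGINT SIGNED
print(decimal_bytes(15))        # DECIMAL(15)
print(1 + len("8005431212"))    # VARCHAR: 1 length byte + 10 digits
```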
(I prefer VARCHAR, in spite of it taking the most space.)
Either way: It is a bad test question if it does not understand the two cases I present here.
Non digits: 'typed with other characters like "(" or "-"' -- That brings up a different issue. It comes under the general heading of GIGO. Cleanse the data before storing it into the database.
If you ever needed to compare two phone numbers for equality, you would wish you had removed all non-digits (or added them in some canonical way, such as US: "(800)543-1212").
User input: If you ever create a UI for entering phone numbers, dates, SSNs, or other numbers with some structure, DO NOT require the user to follow some punctuation rules. DO allow a variety of typical formats. (OK, dates are tricky because there are incompatible orderings. But if I type "1-1-2021", will you spit at me for not having the leading zeroes?)
Indexing: VARCHAR, DECIMAL, INT, etc are all indexable. Any speed difference is not significant.
Extensions: Without VARCHAR, how would you represent the "extension" in "(800)543-1212x543"? Might this point be the deciding factor in favor of VARCHAR? And you should write a bug report against that 'Certification' test?
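One way to keep an extension while still cleansing the input is to split it off before stripping punctuation. The Python sketch below assumes the extension is introduced by an "x", as in the example above; `split_extension` is a hypothetical helper, and real input may use other markers such as "ext.":

```python
import re

def split_extension(raw: str) -> tuple:
    """Return (main_digits, extension_digits); the extension may be empty."""
    main, _, ext = raw.lower().partition("x")
    return re.sub(r"\D", "", main), re.sub(r"\D", "", ext)

print(split_extension("(800)543-1212x543"))
print(split_extension("800 543 1212"))
```

The two parts can then go into separate string columns, or be rejoined in one canonical format before storage.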
Duplicate?: "Which is best data type for phone number in MySQL and what should Java type mapping for it be?" covers most of what I have said, and hints that [perhaps] VARCHAR(20) is sufficient. (The quoted 15 excludes the international prefix.)
In my opinion, there is no absolute best choice here. Both have pros and cons. Personally, I'm in favor of using varchar. Special characters like hyphens can cause dupes when mishandled (it's a rare case, and the input should be verified before it is submitted), but they have the merit of formatting the phone number, which improves readability, e.g. area_code-tel: xxx-xxxxxxxx. Without the separator it's nearly impossible to split the area code from the phone number, as both can vary in length. About indexing: though numerics do have advantages over strings, I'm not sure a phone number would be used as an index. There are more worthy candidates such as ID or date, but what would a phone number do? Usually we look up the phone based on an indexed column such as ID; how often do we fetch something based on the phone number? Unless we want to list all phones from a particular area, we don't really need it indexed. And in that case it would actually be more fitting to use special characters like hyphens to help identify the area part.
P.S Like Ken White kindly suggested, there are cases when phone numbers should be indexed, especially when they are more suitable to be an identifier.
Storing phone numbers as strings can be a disaster. The first things that come to mind are:
You can get dupes easily: one user types the number with ( and/or -, another types the same number without those characters, and you end up with a duplicate.
Normalizing the phone number as an integer makes much more sense in terms of normalization and avoiding duplication.
Also think about a search in the scenario above. What would you use? A LIKE? A numeric operator? Casts spread everywhere? Messy...
Now comes the important part, which relates to indexing: the int will be faster. The longer the varchar, the slower it gets, although you are limiting its length.
The validation can be on the UI with a field mask, or using a regex on the logic whatever makes more sense for you.
Hope I helped a little bit :)
I have an online form where users have to submit a few multiple choice answers, and have the option to insert their email address (to be kept up to date on the results). However, only a few people actually do this.
So currently I have a table with 3 columns: submission_id INT, encoded_answers varchar(20) , and email VARCHAR(50). However, given that 95% of the email entries are NULL, this is quite wasteful.
Of course I could use two tables: a large one with submission_id and encoded_answers, and a smaller one with submission_id and email. But is there also a solution within 1 table? Sort of a sparse-type column, that would only take space if the field is not NULL?
Why is it wasteful? Have you done any tests to confirm this? A column with no value doesn't really take up much space, perhaps a byte per column per row. That's what VARCHAR is all about, being variable length.
Further, arbitrarily limiting your fields to short lengths is actually considered harmful. It is not uncommon for an email address to exceed 50 characters. Note that the storage requirements of VARCHAR(50) and VARCHAR(255) are the same for strings of equal length. It is only for columns of length 256 and beyond where you'll pay in the form of an additional length byte.
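The storage rule can be sketched in a few lines of Python. This assumes a single-byte character set, so the declared length equals the maximum byte length (with multi-byte character sets the 255-byte threshold is reached at a shorter declared length); `varchar_storage_bytes` and the sample address are made up:

```python
def varchar_storage_bytes(value: str, declared_length: int) -> int:
    """Length prefix (1 byte up to 255, otherwise 2) plus the bytes of the value."""
    # Assumption: single-byte character set, e.g. latin1.
    length_prefix = 1 if declared_length <= 255 else 2
    return length_prefix + len(value.encode("latin-1"))

email = "someone@example.com"
print(varchar_storage_bytes(email, 50))    # same cost as VARCHAR(255)
print(varchar_storage_bytes(email, 255))
print(varchar_storage_bytes(email, 256))   # one extra length byte
print(varchar_storage_bytes("", 50))       # an empty value costs only the prefix
```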
Remember that, unless strict SQL mode is enabled, MySQL will silently truncate your data if it doesn't fit in the field. This is really bad for important data like email addresses.
Not in SQL, no. You should think about using a NoSQL engine for such task.
I've been inserting some numbers as INT UNSIGNED in a MySQL database. I search on this column using "SELECT ... FROM tablename WHERE A LIKE 'B'". I'm coming across some number formats that are either too long for an unsigned integer or have dashes in them, like 123-456-789.
What are some good options for modifying the table here? I see two options (are there others?):
Make another column (VARCHAR(50)) to store numbers with dashes. When a search query detects numbers with dashes, look in this new column.
Recreate the table using a VARCHAR(50) instead of unsigned integer for this column in question.
I'm not sure which way is the better in terms of (a) database structure and (b) search speed. I'd love some inputs on this. Thank you.
Update: I guess I should have included more info.
These are order numbers. The numbers without dashes are for one store (A), and the one with dashes are for Amazon (B; 13 or 14 digits I think with two dashes). A's order numbers should be sortable. I'm not sure if B has to be since the numbers don't mean anything to me really (just a unique number).
If I remove the dashes and put them all together as big int, will there be any decrease in performance in the search queries?
The most important question is how you would like to use the data. What do you need? If you make it a varchar and then want to sort it as a number, you will not be able to, since it will be treated as a string.
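The difference between string ordering and numeric ordering is easy to see in a small Python sketch (the order numbers are made up):

```python
order_numbers = ["9", "10", "100", "25"]

as_strings = sorted(order_numbers)           # compared character by character
as_numbers = sorted(order_numbers, key=int)  # compared by numeric value

print(as_strings)  # ['10', '100', '25', '9']
print(as_numbers)  # ['9', '10', '25', '100']
```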
You can always consider BIGINT, but the question is: do you need the dashes, or can you just ignore them at the application level? If you need them, you need a varchar. In that case it might make sense to have two columns if you want to be able to, for example, sort them as numbers or perform calculations. Otherwise one column probably makes more sense.
You should really provide more context about the problem.
MySQL has PROCEDURE ANALYSE, which suggests column types based on your existing data set (note that it was deprecated in MySQL 5.7 and removed in 8.0).
Given you are running query WHERE A LIKE 'B' mainly. You can also try full text search if "A" varies a lot.
I think option 2 makes the most sense. Just add a new column as varchar(50), put everything in the int column into that varchar, and drop the int. Having 2 separate columns to maintain just isn't a good idea.
I have a "best practices" question, concerning indexing.
I have to index phone numbers, and I normally define the column as an integer. I could separate the number into multiple columns: areacode, suffix, prefix, country code. But since I have to account for international numbers, and the numbers get kinda funny in certain countries, I prefer to keep one column.
So my question is, should I keep the column data saved for integers, characters, or varchars?
I do strip out anything non-int related, so varchar is probably not needed.
I have to provide the searching ability for my clients, thus I need to index the number.
If all the phone numbers were from the US, then I'd separate the columns, but I'm catering to the international too.
So I'm curious about the indexing part, and other people's practices in this arena. Is it best to index with integers (for something like this), or does it even matter?
As a side note, the phone numbers are not all going to be the same length, which is why I ask about defining the column as char or varchar.
thanks guys!!
How large is the table expected to be? The reason I ask is that indexes on ints are going to be smaller, obviously, but on a small table this isn't a major consideration. Using varchar gives you more flexibility to do things like "... WHERE phonenumber LIKE '415%'", at the cost of a larger index. If the table is quite large and the box it runs on is at all memory-constrained, your index may no longer fit into memory, sending queries against that index into swap hell. This can be exacerbated by your choice of storage engine: InnoDB stores the primary key in every secondary index entry, for example, which can bloat your indexes if your PK is on a wide field or fields.
Phone numbers can include # and *, so I would recommend against using integers.
Also, the international prefix is written as +, which supports international dialing regardless of the country you are in.
e.g. in South Africa you need to prefix the country code with 09; in Europe the prefix is 00.
To make the numbers work everywhere you replace the international prefix with + and your cellphone will replace this with the local prefix for dialing abroad.
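The replacement step could look roughly like the Python sketch below. `to_plus_form` is a hypothetical helper, the caller has to know which local international prefix applies, and the sample numbers are invented. A number that does not start with that prefix is returned unchanged, because it is probably a national number:

```python
def to_plus_form(dialled: str, international_prefix: str) -> str:
    """Replace a local international dialling prefix (e.g. 00 or 09) with '+'."""
    if dialled.startswith("+"):
        return dialled  # already in the portable form
    if dialled.startswith(international_prefix):
        return "+" + dialled[len(international_prefix):]
    return dialled  # no international prefix found; leave as-is

print(to_plus_form("00442079460000", "00"))
print(to_plus_form("09442079460000", "09"))
print(to_plus_form("+442079460000", "00"))
```

All three calls produce the same stored value, which is the point of normalizing to the + form.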
I'd use a varchar for phone numbers.
Furthermore, I'd use an integer auto_increment as primary key and use the phone number as secondary key, in order to keep performance on InnoDB snappy.
Also remember that people can 'share' a phone number, so it's not guaranteed to be unique.
When I'm setting up a MySQL table, it asks me to define the name of the column, type of input, and length. My assumption, without having read anything about it, is that it's for minimization. Specify the smallest possible int/smallint/tinyint for your needs, and it will reduce overhead of some sort. If it's all positives, make it unsigned to double your space, etc.
What happens if I just make every field a varchar-200 characters? When/why is this bad, what will I miss out on, and when will any inefficiencies manifest themselves? 100k records?
I think about this every time I set up a DB, but I haven't built anything to scale enough where I've ever had my scheme setup inappropriately, either too "strict/small" or "loose/big". Can someone confirm that I'm making good assumptions about speed and efficiency?
Thanks!
Data types not only optimize storage, but how data is indexed. As your databases get bigger, it will become apparent that it's quicker to search for all the records that have a 1 in an integer field than those that have a "1" in a varchar field. This becomes especially important when you're joining data from more than one table and your database engine is having to do this sort of thing repeatedly. (Daren also rightly points out below that it's important that the types of the fields you're matching on are identical as well.)
The level at which these inefficiencies become an issue depends greatly on your hardware and your application design. We have big enough iron these days that if you're building moderate-scale apps, you may not see an appreciable difference. (Aside from feeling a little bit guilty about your database design!) But establishing good habits on small projects makes the bigger ones easier when they come along.
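If you have two columns as varchar, put in the values 10 and 20, and combine them in an engine where + concatenates strings (SQL Server, for example), you'll get 1020 instead of the 30 you'd likely expect. MySQL implicitly converts the strings to numbers here, but that behavior is not portable.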
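The same pitfall exists wherever digits are held as strings. A small Python analogy (not SQL; the variable names are arbitrary):

```python
a, b = "10", "20"

concatenated = a + b          # string concatenation
summed = int(a) + int(b)      # explicit conversion, then arithmetic

print(concatenated)  # 1020
print(summed)        # 30
```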
Sure, you could save everything as VARCHAR strings. But you'd be giving up a lot of functionality provided by the database engine.
You should choose the database type that most closely matches the intended use of the column. For example, using DATE or DATETIME to store dates provides you with all sorts of date/time functions that you don't get with basic VARCHAR types.
Likewise, fields used to count things or provide simple unique IDs should be INT or one of its related types. Also bear in mind that an INT occupies only 4 bytes, whereas a 9-digit string uses at least 9 bytes.
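The size difference can be checked by packing the same value both ways. This Python sketch uses the standard `struct` module to produce a 4-byte integer, which mirrors the width of an INT, though not MySQL's exact on-disk format:

```python
import struct

order_id = 123456789  # arbitrary 9-digit example value

as_int = struct.pack("<i", order_id)      # 4-byte signed integer
as_text = str(order_id).encode("ascii")   # one byte per digit

print(len(as_int))   # 4
print(len(as_text))  # 9
```

A VARCHAR column would add its length prefix on top of the 9 bytes.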
For character data, it's wise to use NVARCHAR for internationalized values that users in any locale are going to enter (esp. names and locations). If you know the text is limited to US or internal use only, VARCHAR is safe.