Facebook user_id : big_int, int or string? - mysql

Facebook's user IDs go up to 2^32, which by my count is 4294967296.
MySQL's unsigned int range is 0 to 4294967295 (which is 1 short, or my math is wrong)
and its unsigned big int's range is 0 to 18446744073709551615
int = 4 bytes, bigint = 8 bytes
OR
Do I store it as a string?
varchar(10) = ? bytes
How will it affect efficiency? I heard that MySQL handles numbers far better than strings (performance-wise). So what do you guys recommend?

Because Facebook assigns the IDs, and not you, you must use BIGINTs.
Facebook does not assign the IDs sequentially, and I suspect they have some regime for assigning numbers.
I recently fixed exactly this bug, so it is a real problem.
I would make it UNSIGNED, simply because that is what it is.
I would not use a string. That makes comparisons painful and your indexes clunkier than they need to be.

You can't use INT any more. Last night I had two user ids that maxed out INT(10).

I use a bigint to store the Facebook id, because that's what it is.
But internally, for the primary and foreign keys of the tables, I use a smallint, because it is smaller. It also means that if the bigint should ever have to become a string (to find users by username instead of id), I can easily change it.
So I have a table that looks like this:
profile
- profile_key smallint primary key
- profile_name varchar
- fb_profile_id bigint
and one that looks like this
something_else
- profile_key smallint foreign key
- something_else_key smallint primary key
- something_else_name varchar
and my queries for a single page could be something like this:
select profile_key, profile_name
from profile
where fb_profile_id = ?
Now I take the profile_key and use it in the next query:
select something_else_key, something_else_name
from something_else
where profile_key = ?
The profile table gets queried for almost any request anyway, so I don't consider it an extra step.
And of course it is also quite easy to cache the first query for some extra performance.
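The two-step lookup above can be sketched end to end. This is only an illustration: it uses Python's built-in sqlite3 module instead of MySQL, so the column types are SQLite's, and the sample rows, the helper names profile_for and items_for, and the Facebook id are all made up.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile (
        profile_key   INTEGER PRIMARY KEY,   -- small internal surrogate key
        profile_name  TEXT,
        fb_profile_id INTEGER UNIQUE         -- external Facebook id, indexed
    );
    CREATE TABLE something_else (
        something_else_key  INTEGER PRIMARY KEY,
        profile_key         INTEGER REFERENCES profile(profile_key),
        something_else_name TEXT
    );
    INSERT INTO profile VALUES (1, 'alice', 100000123456789);
    INSERT INTO something_else VALUES (1, 1, 'first item'), (2, 1, 'second item');
""")

def profile_for(fb_id):
    """Step 1: translate the external Facebook id into the internal key."""
    return conn.execute(
        "SELECT profile_key, profile_name FROM profile WHERE fb_profile_id = ?",
        (fb_id,),
    ).fetchone()

def items_for(profile_key):
    """Step 2: every other table is queried by the small internal key only."""
    return conn.execute(
        "SELECT something_else_key, something_else_name "
        "FROM something_else WHERE profile_key = ? ORDER BY something_else_key",
        (profile_key,),
    ).fetchall()

profile = profile_for(100000123456789)
print(profile, items_for(profile[0]))
```

Only the profile table knows about the external id, so changing its type later touches one column and one query.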

If you are reading this in 2015: Facebook has upgraded their API to version 2.0 and added a note to their documentation stating that user ids will change and will be app-scoped. So there is a real possibility that they might later change all the ids to alphanumeric.
https://developers.facebook.com/docs/apps/upgrading#upgrading_v2_0_user_ids
So I would suggest keeping the type as varchar to avoid any future migration pains.

Your math is a little wrong... remember that the largest number you can store in N bits is 2^N - 1, not 2^N. There are 2^N possible values, but because they start at 0, the largest one is 1 less than that.
If Facebook uses an unsigned big int, then you should use that. They probably don't assign them sequentially.
Yes, you could get away with a varchar... however it would be slower (but probably not as much as you are thinking).
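To make the range arithmetic concrete, here is a small sketch. The byte widths are the ones MySQL documents for its integer types; the helper names unsigned_max and smallest_unsigned_type are made up for this example.

```python
# Byte widths of MySQL's integer types.
INT_BYTES = {"TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8}

def unsigned_max(num_bytes: int) -> int:
    """Largest value an unsigned integer of num_bytes bytes can hold: 2^bits - 1."""
    return 2 ** (8 * num_bytes) - 1

def smallest_unsigned_type(value: int) -> str:
    """Name of the narrowest unsigned MySQL integer type that can hold value."""
    for name, width in sorted(INT_BYTES.items(), key=lambda kv: kv[1]):
        if 0 <= value <= unsigned_max(width):
            return name
    raise ValueError("value does not fit in any MySQL integer type")

print(unsigned_max(4))                   # 4294967295
print(unsigned_max(8))                   # 18446744073709551615
print(smallest_unsigned_type(2 ** 32))   # BIGINT
```

So an id of exactly 2^32 is one past what INT UNSIGNED can hold, which is why BIGINT UNSIGNED is the next step up.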

Store them as strings.
The Facebook Graph API returns ids as strings, so if you want comparisons to work without having to cast, you should use strings. IMO this trumps other considerations.
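A quick illustration of the casting point, using a made-up JSON payload shaped the way this answer describes (the id arrives as a string). The field names and the id value are assumptions for the demo, not taken from Facebook's documentation.

```python
import json

# Hypothetical response body: the id is a JSON string, not a JSON number.
payload = json.loads('{"id": "100000123456789", "name": "Example User"}')

stored_as_int = 100000123456789      # what a BIGINT column would hand back
stored_as_str = "100000123456789"    # what a VARCHAR column would hand back

print(payload["id"] == stored_as_int)        # False: a str never equals an int
print(payload["id"] == stored_as_str)        # True
print(int(payload["id"]) == stored_as_int)   # True, but only after an explicit cast
```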

I would just stick with INT. It's easy, it's small, it works and you can always change the column to a larger size in the future if you need to.
FYI:
VARCHAR(n) ==> variable, up to n + 1 bytes
CHAR(n) ==> fixed, n bytes

Unless you expect more than 60% of the world's population to sign up, int should do?

Related

Limit of SQL 'auto increment' as primary key

I want to create an online billboard system, as my project, where everyone can post a topic.
I am designing the database in SQL to store the information for each topic, including the topic's id as primary key.
At first I designed the id as an integer with auto-increment, as I think it's the simplest way. Then I thought about it and realized that an integer has a limit (the number may be high, but it is there), so I'm here looking for another method.
Now I'm thinking of some pseudo-random algorithm, or of using a hash of the topic's name, but it's still not clear.
I also found GUIDs while researching here, but I'm not sure whether they can be used.
I'd appreciate suggestions on how to deal with the size limit of an integer primary key, or any keywords for further research.
This answer assumes MySQL/MariaDB, because it uses the terminology "auto-increment" for such columns (as opposed to other databases that use identity or serial).
If int isn't big enough, you can use bigint.
Although I consider it unlikely that you'll exceed the threshold for int (it works for many applications), exceeding the maximum value of bigint would require great effort on the part of you and your computer for a long, long time.
This is explained in the documentation.
With int, the maximum value supported by SQL Server is 2,147,483,647.
Just for completeness, I will also add that yet another option is to change the data type of the column to bigint (maximum value 9,223,372,036,854,775,807 - this will allow you to insert a million rows per second, for almost 300,000 years in a row).
Or if you fear that you will overflow even that, you can consider using decimal(38,0) - the maximum here is a number consisting of 38 9's (which will allow you to maintain that same pace for roughly 3 x 10^24 years).
http://sqlblog.com/blogs/hugo_kornelis
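The "how long until it runs out" arithmetic is easy to check yourself. A rough sketch, assuming the million-rows-per-second pace from the quote and 365-day years; the constant and function names are made up here.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignoring leap years
ROWS_PER_SECOND = 1_000_000             # the pace assumed in the quote

INT_MAX = 2 ** 31 - 1                   # signed int
BIGINT_MAX = 2 ** 63 - 1                # signed bigint
DECIMAL_38_MAX = 10 ** 38 - 1           # 38 nines

def years_until_exhausted(max_value: int) -> int:
    """Whole years of inserts at ROWS_PER_SECOND before max_value is reached."""
    return max_value // (ROWS_PER_SECOND * SECONDS_PER_YEAR)

# At this pace a plain int lasts well under an hour.
print(INT_MAX / ROWS_PER_SECOND / 60, "minutes for int")
print(years_until_exhausted(BIGINT_MAX), "years for bigint")
print(years_until_exhausted(DECIMAL_38_MAX), "years for decimal(38,0)")
```

Swap in your own expected insert rate to see which type is realistic for your table.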

How to store social id in a MySQL DB

I need to store some social ids (facebook/google/twitter user id, facebook place id, etc.) in my MySQL DB. I found a lot of questions about this here on stackoverflow, but I didn't find a satisfactory answer. For example, you can't be 100% sure that a facebook id will always be an unsigned bigint; in the facebook documentation the id is described as a "numeric string". A Google id seems to be one digit bigger than bigint.
I believe that an index on a varchar column is slower than an index on a bigint column, so I thought that using bigint, when possible, would be better than varchar. But I realized that you can store a varchar as binary with the appropriate attribute.
For this reason I was thinking about using a varchar for all these social ids and (since the ordering is not an issue) storing it as binary (attribute=binary), because I need a fast index on them.
What are your thoughts: is this a good and fast solution?
Thanks
I use varchar. You are right about the differences, but even more importantly, there is no guarantee that the current type will stay the same. For instance, Facebook changed the size in the past, and they mentioned somewhere that they may include chars in it.
An index is an index, if done right, there is no need to worry about its performance. No real difference between an index on numbers or varchar.
bigint + INDEX KEY + INNODB = Fast

Mysql TINYINT & VARCHAR performance

I have a column in a MySQL db that stores the value 10000 in a TINYINT. What if I change it to 10k in a VARCHAR; which one will perform better?
ex. 10000, 20000, 30000.... or 10k, 20k, 30k...
If you are just selecting the value then it doesn't matter.
But if you are using it in a where condition then the performance will be better using an int. Example:
select * from your_table
where your_column > 1000
will only work if the column is an int and you don't need to convert it back to a number.
Generally - if it is a number - store it as number.
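The same trap shows up outside the database. A tiny sketch in Python (the sample values are made up) of why range comparisons and sorting go wrong once the numbers are stored as text like "10k":

```python
as_strings = ["10k", "9k", "200k"]
as_numbers = [10000, 9000, 200000]

# Strings compare character by character, so "9k" sorts after "200k".
print(sorted(as_strings))   # ['10k', '200k', '9k']
print(sorted(as_numbers))   # [9000, 10000, 200000]

# A "greater than" test on the text form gives the wrong answer.
print("9k" > "10k")         # True
print(9000 > 10000)         # False
```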
The bigger question is: would it help your project? With TINYINT you can perform math operations on it. You can't do that with VARCHAR, since your numbers would be strings. Sure, you could convert the string back into a number, but that will cost you in performance and require extra code (which can unnecessarily complicate the SQL or other language).
If you're going to make a value be a number, then by all means make it an integer. There's a reason really, really smart people made a distinction between strings and ints.

SHA1 sum as a primary key?

I am going to store filenames and other details in a table, and I am planning to use the sha1 hash of the filename as PK.
Q1. SHA1 PK will not be a sequentially increasing/decreasing number, so will it be more resource consuming for the database to maintain/search into an index on that key? If I decide to keep it in database as 40 char value.
Q2. I read here: https://stackoverflow.com/a/614483/986818 about storing the data as a binary(20) field. Can someone advise me in this regard:
a) do I have to create this column as: TYPE=integer, LENGTH=20, COLLATION=binary, ATTRIBUTES=binary?
b) how to convert the sha1 value in MySQL or Perl to store into the table?
c) is there a danger of duplicacy for this 20 char value?
---------UPDATE-------------
The requirement is to search the table by filename. The user supplies a filename, I search the table, and if the filename is not there I add it. So either I index a varchar(100) filename field, or I generate a column with the sha1 of the filename, hoping it would be easier for MySQL to index than a varchar field. I can also search using the sha1 value from my program against the sha1 column. What do you say? Primary key or just an indexed key: I chose PK because DBIx likes using a PK, and PK or INDEX+UNIQ would be the same amount of overhead for the system (so I thought).
Ok, then use a very -short- hash on the filename and accept collisions. Use an integer type for it (that's much faster!!!). E.g. you can use md5(filename), take the first 8 hex characters and convert them to an integer. The SQL could look like this:
CREATE TABLE files (
id INT auto_increment,
hash INT unsigned,
filename VARCHAR(100),
PRIMARY KEY(id),
INDEX(hash)
);
Then you can use:
SELECT id FROM files WHERE hash=<hash> AND filename='<filename>';
The hash is then used to rule out most other files (normally all other files), and the filename then selects the right entry from the few hash collisions.
For generating an integer hash-key in perl I suggest using md5() and pack().
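The answer points at Perl's md5() and pack(); the same idea can be sketched in Python instead (a swapped-in language, not the answerer's code). The function name short_hash and the sample filename are made up.

```python
import hashlib

def short_hash(filename: str) -> int:
    """First 8 hex characters of md5(filename) as an unsigned 32-bit integer."""
    hex_digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return int(hex_digest[:8], 16)

h = short_hash("report-2012.pdf")
print(h)

# The lookup then pairs the cheap integer match with an exact filename match,
# so collisions in the short hash are harmless:
#   SELECT id FROM files WHERE hash = %s AND filename = %s
```

Eight hex characters are 32 bits, so the result always fits in the INT unsigned column from the table definition above.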
If i decide to keep it in database as 40 char value.
Using a character sequence as a key will degrade performance for obvious reasons.
Also the PK is supposed to be unique. Although it is unlikely that you will end up with collisions, in theory using a hash function to create the PK seems inappropriate.
Additionally anyone knowing the filename and the hash you use, would know all your database ids. I am not sure if this is something not to consider.
Q1: Yes, it will need to build up a B-Tree of nodes that contain not just one integer (4 bytes) but a CHAR(40). Speed would be approximately the same, as long as the INDEX is kept in memory. As the entries are about 10 times bigger, you need 10 times more memory to keep it in memory. BUT: You probably want to look up by the hash anyway. So you'll need to have it either as primary key OR as an index.
Q2: Just create a table field like CREATE TABLE test (ID BINARY(20), ...); later you can use INSERT INTO test (ID, ..) VALUES (UNHEX('4D7953514C'), ...);
-- Regarding: Is there a danger of duplicacy for this 20 char value?
The chance is 1 in 2^(8*20) = 2^160, which is about 1 in 1.46 * 10^48. So the chance for that is very, very improbable.
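The size of the 20-byte hash space is easy to compute exactly, since Python integers are arbitrary precision. A minimal check (the variable name space is made up):

```python
# Number of distinct values a 20-byte (160-bit) hash can take.
space = 2 ** (8 * 20)

print(space)             # the exact 49-digit number
print(f"{space:.3e}")    # 1.462e+48
```

Note this is the chance that two specific filenames collide; with many rows the chance that any pair collides is higher, though still negligible at this size.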
There is no reason to use a cryptographically secure hash here. Instead, if you do this, use an ordinary hash. See here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
The hash is NOT a 40 char value! It's a 160 bit number, and you should store it that way (as a 20 byte binary field). Edit: I see you mentioned that in comment 2. Yes, you should definitely do that. But I can't tell you how since I don't know what programming language you are using. Edit2: I see it's perl - sorry, I don't know how to convert it in perl, but look for "pack" functions.
No, do not create it as type integer. MySQL's largest integer type, BIGINT, is 64 bits, which doesn't hold the entire 160-bit hash. You could truncate the hash to fit, at the cost of a higher collision chance.
It's better to use a simpler hash anyway. You could risk it and ignore collisions, but if you do it properly you kind of have to handle them.
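To show the difference between the 40-character form and the 20-byte form, here is a sketch in Python rather than Perl (a swapped-in language; the filename is made up). The raw digest is what would go into a BINARY(20) column.

```python
import hashlib

filename = "example.txt"  # hypothetical input

hex_form = hashlib.sha1(filename.encode("utf-8")).hexdigest()  # 40 hex characters
raw_form = hashlib.sha1(filename.encode("utf-8")).digest()     # 20 raw bytes

# Both forms carry the same 160 bits; the hex form just spends two
# characters per byte, which is the same conversion UNHEX() does in MySQL.
assert bytes.fromhex(hex_form) == raw_form

print(len(hex_form), len(raw_form))   # 40 20
```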
I would stick with the standard auto-incrementing integer for the primary key. If uniqueness of file names is important (which it sounds like it is), then you can add a UNIQUE constraint on the file name itself or some derived, canonical version of the file name. Most languages/frameworks have some sort of method for getting a canonical version of a path (relative to absolute, standardized case, etc).
If you implement my suggestion or pursue your original plan, then you should be aware that multiple strings can map to the same filename/path. Both versions will have different hashes/pass the uniqueness constraint but will actually both refer to the same file. This depends on operating system and may or may not be a problem for you. Just something to keep in mind.

MySQL: Fields length. Does it really matter?

I'm working with some database abstraction layers and most of them are using attributes like "String" which is VARCHAR 250 or INTEGER which has length of 11 digits. But for example I have something that will be less than 250 characters long. Should I go and make it less? Does it really makes any valuable difference?
Thanks in advance!
INT length does nothing. All INTs are 4 bytes. The number you can set, is only used for zerofill (and who uses that!?).
VARCHAR length does more. It's the max length of the field. VARCHAR is saved so that only the actual data is stored, so the declared length doesn't matter for storage. These days, you can have VARCHARs bigger than 255 bytes (up to 256^2-1 = 65,535). The difference is the number of bytes used to store the field length. VARCHAR(100), VARCHAR(8) and VARCHAR(255) use 1 byte to save the field length. VARCHAR(1000) uses 2.
Hope that helps =)
edit
I almost always make my VARCHARs 250 long. Actual length should be checked in the app anyway. For bigger fields I use TEXT (and those are stored differently, so can be much much longer).
edit
I don't know how current this is, but it used to help me (understand): http://help.scibit.com/Mascon/masconMySQL_Field_Types.html
First, remember that the database is meant to store facts and is designed to protect itself against bad data. Thus, the reason you do not want to allow a user to enter 250 characters for a first name is that a user will put all kinds of data in there that is not a first name. They'll put their whole name, their underwear size, a novel about what they did last summer and so on. Thus, you want to strive to enforce that the data is as correct as possible. It is a mistake to assume that the application is the sole protector against bad data. You want users to tell you that they had a problem stuffing War and Peace into a given column.
Thus, the most important question is, "What is the most appropriate value for the data being stored?" Ideally, you would use an int and a check constraint to ensure that the values have an appropriate range (e.g. greater than zero, less than a billion etc.). Unfortunately, this is one of MySQL's greatest weaknesses: before version 8.0.16 it parses but does not enforce check constraints. On those versions you must implement the integrity checks in triggers, which admittedly is more cumbersome.
Will the difference between an int (4 bytes) and a tinyint (1 byte) be appreciable? Obviously, it depends on the amount of data. If you will have no more than 10 rows, the answer is obviously no. If you will have 10 billion rows, the answer is obviously "Yes". However, IMO, this is premature optimization. It is far better to focus on ensuring correctness first.
For text, you should ask whether your data should support Chinese, Japanese or non-ANSI values (i.e., should you use nvarchar or varchar)? Does this value represent a real world code like a currency code, or bank code which has a specific specification?
Not so sure in MySQL, but in MS SQL it only makes a difference for sufficiently large databases. Typically, I like to use smaller fields for a) the space saving (it never hurts to practice good habits) and b) for the implied validation (if you know a certain field should never be more than 10 characters, why allow eleven, let alone 250?).
I think Rudie is wrong, not all INTs are 4 bytes... in MySQL you have:
tinyint = 1 byte,
smallint = 2 bytes,
mediumint = 3 bytes,
int = 4 bytes,
bigint = 8 bytes.
I think Rudie refers to the "display width", that is, the number you put between parentheses when you are creating a column, e.g.:
age INT(3)
You're just telling the RDBMS how many digits to SHOW (the display width); it does not limit the values the column can store.
And VARCHARs are variable-length character strings, so if you declare, let's say, name varchar(5000) and you store a name like "Mario", you are only using 7 bytes (5 for the data and 2 for the length of the value).
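The length-prefix rule can be written down as a small function. This is a sketch that assumes a single-byte character set such as latin1 (multi-byte charsets change the byte counts), and the name varchar_storage_bytes is made up.

```python
def varchar_storage_bytes(value: str, declared_chars: int) -> int:
    """Bytes used to store value in a VARCHAR(declared_chars) column.

    Assumes a single-byte charset (latin1): the data takes one byte per
    character, plus a 1-byte length prefix if the column can never exceed
    255 bytes, or a 2-byte prefix otherwise.
    """
    data = value.encode("latin-1")
    prefix = 1 if declared_chars <= 255 else 2
    return len(data) + prefix

print(varchar_storage_bytes("Mario", 5000))  # 7
print(varchar_storage_bytes("Mario", 100))   # 6
print(varchar_storage_bytes("Mario", 8))     # 6
```

So shrinking VARCHAR(250) to VARCHAR(8) saves nothing on disk; only crossing the 255-byte boundary changes the per-value overhead.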
The correct field size serves to limit the bad data that can be put in. For instance suppose you have a phone number field. If you allow 250 characters, you will often end up with things like the following in the phone field (an example not taken at random):
Call the good-looking blonde secretary instead.
So first limiting the length is part of how we enforce data integrity rules. As such it is critical.
Second, there is only so much space on a datapage and while some databases will allow you to create tables where the potential record is longer than the width of the data page, they often will not allow you to actually exceed it when storing the data. This can lead to some very hard to find bugs when suddenly one record can't be saved. I don't know about MySql and whether it does this but I know SQL Server does and it is very hard to figure out what is wrong. So making data the correct size can be critical to preventing bugs.