How to store URLs in MySQL - mysql

I need to store potentially hundreds of millions of URLs in a database. Every URL should be unique, so I will use ON DUPLICATE KEY UPDATE and count the duplicate URLs.
However, I am not able to create an index on the URL field because my varchar field is 400 characters. MySQL complains with "#1071 - Specified key was too long; max key length is 767 bytes". (With utf8 at 3 bytes per character, VARCHAR(400) can take up to 1200 bytes.)
What is the best way to do this if you need to process at least 500,000 URLs per day on a single server?
We are already thinking using MongoDB for the same application, so we can simply query MongoDB and find the duplicate URL, and update the row. However, I am not in favor of solving this problem using MongoDB , and I'd like to use just MySQL at this stage as I'd like to be as lean as possible in the beginning and finish this section of the project much faster. (We haven't played with MongoDB yet and don't want to spend time at this stage)
Is there any other way of doing this using fewer resources and less time? I was thinking of computing the MD5 hash of the URL and storing it as well, then making that field UNIQUE instead. I know there will be collisions, but it is OK to have 5, 10 or 20 duplicates in the 100 million URLs, if that's the only problem.
Do you have any suggestions? I also don't want to spend 10 seconds to insert just one URL, as it will process 500k URLs per day.
What would you suggest?
Edit: As per the request this is the table definition. (I am not using MD5 at the moment, it is for testing)
mysql> DESC url;
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
| url_id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| url_text | varchar(400) | NO | | | |
| md5 | varchar(32) | NO | UNI | | |
| insert_date | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| count | mediumint(9) unsigned | NO | | 0 | |
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
5 rows in set (0.00 sec)

According to the DNS spec the maximum length of the domain name is :
The DNS itself places only one restriction on the particular labels
that can be used to identify resource records. That one restriction
relates to the length of the label and the full name. The length of
any one label is limited to between 1 and 63 octets. A full domain
name is limited to 255 octets (including the separators).
255 * 3 = 765 < 767 (Just barely :-) )
However notice that each component can only be 63 characters long.
So I would suggest chopping the url into the component bits.
Using http://foo.example.com/a/really/long/path?with=lots&of=query&parameters=that&goes=on&forever&and=ever
Probably this would be adequate:
protocol flag ["http" -> 0 ] ( store "http" as 0, "https" as 1, etc. )
subdomain ["foo" ] ( 255 - 63 = 192 characters : I could subtract 2 more because min tld is 2 characters )
domain ["example"], ( 63 characters )
tld ["com"] ( 4 characters to handle "info" tld )
path [ "a/really/long/path" ] ( as long as you want -store in a separate table)
queryparameters ["with=lots&of=query&parameters=that&goes=on&forever&and=ever" ] ( store in a separate key/value table )
portnumber / authentication stuff that is rarely used can be in a separate keyed table if actually needed.
This gives you some nice advantages:
The index is only on the parts of the url that you need to search on (smaller index! )
queries can be limited to the various url parts ( find every url in the facebook domain for example )
any URL that has too long a subdomain/domain is bogus
easy to discard query parameters.
easy to do case insensitive domain name/tld searching
discard the syntax sugar ( "://" after protocol, "." between subdomain/domain, domain/tld, "/" between tld and path, "?" before query, "&" "=" in the query)
Avoids the major sparse table problem. Most urls will not have query parameters, nor long paths. If these fields are in a separate table then your main table will not take the size hit. When doing queries more records will fit into memory, therefore faster query performance.
(more advantages here).
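A minimal sketch of the chopping step described above, using Python's standard urllib.parse. The PROTOCOL_FLAGS mapping and the split_url helper are hypothetical names for illustration, and the "last label is the tld" rule is a simplification that would mis-split hosts such as example.co.uk.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical flag values, following the "http" -> 0, "https" -> 1 idea above.
PROTOCOL_FLAGS = {"http": 0, "https": 1}

def split_url(url):
    """Chop a URL into the component bits that would go into separate columns."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return {
        "protocol_flag": PROTOCOL_FLAGS[parts.scheme],
        # Naive rule: last label is the tld, the one before it is the domain.
        "subdomain": ".".join(labels[:-2]),
        "domain": labels[-2],
        "tld": labels[-1],
        # The syntax sugar ("/", "?") is discarded.
        "path": parts.path.lstrip("/"),
        # keep_blank_values keeps bare keys such as "forever".
        "query": parse_qsl(parts.query, keep_blank_values=True),
    }

row = split_url(
    "http://foo.example.com/a/really/long/path"
    "?with=lots&of=query&parameters=that&goes=on&forever&and=ever"
)
print(row)
```

The query pairs would go to the separate key/value table, and the path to its own table, so the main table only carries the short, indexable host parts.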

To index a field up to 767 characters wide, its charset must be ascii or similar. It can't be utf8, because utf8 uses up to 3 bytes per character, so the maximum width for an indexed utf8 field is 255 characters.
Of course, a 767-character ascii URL field exceeds your initial 400-character spec, and some URLs exceed the 767 limit. Perhaps you can store and index the first 735 characters plus the md5 hash. You can also have a TEXT full_url field to preserve the original value.
Notice that ascii charset is good enough for urls
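The arithmetic behind this suggestion can be sketched in Python. The 767-byte limit comes from the error message in the question; max_indexable_chars and index_key are hypothetical helper names, and the sketch assumes the URL is already pure ASCII.

```python
import hashlib

MAX_KEY_BYTES = 767  # limit quoted in the #1071 error message

def max_indexable_chars(bytes_per_char):
    """How many characters fit in the key for a given bytes-per-character charset."""
    return MAX_KEY_BYTES // bytes_per_char

def index_key(url, prefix_len=735):
    """First prefix_len characters of the URL followed by its 32-character MD5 hex digest."""
    digest = hashlib.md5(url.encode("ascii")).hexdigest()
    return url[:prefix_len] + digest

long_url = "http://example.com/" + "a" * 1000  # hypothetical over-long URL
key = index_key(long_url)
print(max_indexable_chars(3), max_indexable_chars(1), len(key))
```

With a 735-character prefix and a 32-character digest, the combined key never exceeds 767 single-byte characters, and two URLs that share a prefix still get different keys unless their MD5 digests collide.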

A well formed URL can only contain characters within the ASCII range - other characters need to be encoded. So assuming the URLs you intend to store are well formed (and if they are not, you may want to fix them prior to inserting them to the database), you could define your url_text column with a single-byte character set (MySQL offers both ascii and latin1). With a single-byte character set, one char is one byte, and you will be able to index the whole 400 characters like you want.

The odds of a spurious collision with MD5 (128 bits) can be phrased this way:
"If you have 9 Trillion different items, there is only one chance in 9 Trillion that two of them have the same MD5."
To phrase it another way, it is more likely to be hit by a meteor while winning the mega-lottery.
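As a rough check of that figure, the usual birthday-bound approximation can be evaluated directly. This is a sketch, assuming MD5 output behaves like a uniformly random 128-bit value; collision_probability is a hypothetical helper name.

```python
def collision_probability(n_items, hash_bits=128):
    """Birthday-bound approximation p ~ n(n-1)/2 / 2**bits, valid while p is small."""
    return n_items * (n_items - 1) / 2 / 2 ** hash_bits

p_quote = collision_probability(9 * 10**12)  # the 9 trillion items from the quote
p_urls = collision_probability(100 * 10**6)  # the 100 million URLs from the question

print(f"9 trillion items: about 1 in {1 / p_quote:.2e}")
print(f"100 million URLs: about 1 in {1 / p_urls:.2e}")
```

For 9 trillion items the estimate comes out at one chance in several trillion, in line with the quote, and for 100 million URLs it is many orders of magnitude smaller.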

You can change the url_text from VarChar(400) to Text, then you can add a full text index against it allowing you to search for the existence of the URL before you insert it.

Related

Is it better to have many columns or a single column bit string for many checkboxes

I have the following scenario:
A form with many checkboxes, around 100.
I have 2 ideas on how to save them in database:
1. Multicolumn
I create a table looking like this:
id | box1 | box2 | ... | box100 | updated| created
id: int
box1: bit(1)
SELECT * FROM table WHERE box1 = 1 AND box22 = 1 ...
2. Single data column
Table is simply:
id | data | updated | created
data: varchar(100)
SELECT * FROM table WHERE data LIKE '_______1___ ... ____1____1'
where data looks like 0001100101010......01 each character representing if value was checked or not.
Considering that the table will have 200k+ rows, which is a more scalable solution?
3. Single data column of type JSON
I have no good information about this yet.
Or...
4. A few SETs
5. A few INTs
These are much more compact: about 8 checkboxes per byte.
They are a bit messy to set/test.
Since they are limited to 64 bits, you would need more than one SET or INT. I recommend grouping the bits in some logical way, based on the app.
Be aware of FIND_IN_SET().
Be aware of (1 << $n) for creating the value 2^n.
Be aware of | and & Operators.
Which of the 5 is best? That depends on the queries you need to run -- for searching (if necessary?), for inserting, for updating (if necessary?), and for selecting.
An example: For INTs, WHERE (bits & 0x2C08) = 0x2C08 would simultaneously check for 4 flags being 'ON'. That constant could either be constructed in app code, or written as ((1<<13) | (1<<11) | (1<<10) | (1<<3)) for bits 3, 10, 11, 13. Meanwhile, the other flags are ignored. If you need the other flags to be 'OFF', the test would be WHERE bits ^ 0x2C08 = 0, which is the same as bits = 0x2C08. If either of these kinds of test is your main activity, then Choice 5 is probably the best for both performance and space, though it is somewhat cryptic to read.
When adding another option, SET requires an ALTER TABLE. INT usually has some spare bits (TINYINT UNSIGNED has 8 bits, ... BIGINT UNSIGNED has 64). So, about one time in 8, you would need an ALTER to get a bigger INT or add another INT. Deleting an option: suggest just abandoning that SET element or bit of INT.
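The two tests from the example above behave the same way in application code. A small Python sketch, where mask_for, has_all and has_exactly are hypothetical helper names:

```python
def mask_for(*bit_positions):
    """Build a mask with the given bit positions set, like (1<<13) | (1<<11) | ..."""
    mask = 0
    for n in bit_positions:
        mask |= 1 << n
    return mask

MASK = mask_for(3, 10, 11, 13)

def has_all(bits, mask):
    """Equivalent of WHERE (bits & mask) = mask: the other flags are ignored."""
    return (bits & mask) == mask

def has_exactly(bits, mask):
    """Equivalent of WHERE bits ^ mask = 0: the other flags must be off."""
    return (bits ^ mask) == 0

print(hex(MASK))
print(has_all(MASK | 1, MASK), has_exactly(MASK | 1, MASK))
```

Building the constant from named bit positions in app code keeps the SQL short and makes the cryptic hex value easier to audit.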

Which type has INET6_ATON? / Max length of INET6_ATON?

I wanted to know how to save an INET6_ATON result to MYSQL. So I've read the MYSQL-Help article and it says, I should use VARBINARY(16). But now, with an IPv4-Address it has the content 0x7F000001 and I'm unable to get results using SQL. My idea was to use CHAR, but in this case I don't know what's the maximum length of an INET6_ATON-result.
So: How to get MYSQL-results if the result is saved as VARBINARY?
Or otherwise: What's the maximum length of an INET6_ATON-result?
I'm converting the IP-Addresses using this SQL-Statement:
SELECT HEX(INET6_ATON("FE80:0000:0000:0000:0202:B3FF:FE1E:8329"))
Thanks a lot.
Note that the humanReadable column below is for human consumption only.
The inet6 column holds raw bytes, most of which are not printable characters, so it will not make much sense to humans. Gobbledygook, if you will.
create table myFriends
( id int auto_increment primary key,
friendlyName varchar(100) not null,
inet6 binary(16) not null,
humanReadable char(32) not null
);
insert myFriends (friendlyName,inet6,humanReadable) values
('Kathy Higgins',INET6_ATON("FE80:0000:0000:0000:0202:B3FF:FE1E:8329"),HEX(INET6_ATON("FE80:0000:0000:0000:0202:B3FF:FE1E:8329")));
select * from myFriends;
+----+---------------+------------------+----------------------------------+
| id | friendlyName | inet6 | humanReadable |
+----+---------------+------------------+----------------------------------+
| 1 | Kathy Higgins | ■Ç ☻☻│ ■▲â) | FE800000000000000202B3FFFE1E8329 |
+----+---------------+------------------+----------------------------------+
FE80 represents 2 bytes. FE 80. Hexadecimal. Each byte ranges from 00 to FF (255).
Check out my answer Here on different formats. Often nearly duplicate info is used in one table.
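On the length question: INET6_ATON returns 4 bytes for an IPv4 address and 16 bytes for an IPv6 address, which is why the IPv4 value in the question shows up as 0x7F000001. The same packing can be reproduced with Python's standard ipaddress module as a sanity check; the packed helper is a hypothetical name.

```python
import ipaddress

def packed(address):
    """Return the binary form of an IP address: 4 bytes for IPv4, 16 for IPv6."""
    return ipaddress.ip_address(address).packed

v4 = packed("127.0.0.1")
v6 = packed("FE80:0000:0000:0000:0202:B3FF:FE1E:8329")

print(len(v4), v4.hex().upper())
print(len(v6), v6.hex().upper())
```

So a VARBINARY(16) column is wide enough for both families, and the stored value should be compared against another INET6_ATON result (or an UNHEX'd string), not against text.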

mysql bitwise operations with string columns

I need to store some flags for user records in a MySQL table (I'm using InnoDB):
---------------------------
| UserId | Mask |
| -------------------------
| 1 | 00000...001 |
| 2 | 00000...010 |
---------------------------
The number of flags is bigger than 64, so I can't use a BIGINT or BIT type to store the value.
I don't want to use many-to-many association tables, because each user can have more than one profile, each one with its set of flags and it would grow too big very quickly.
So, my question is, is it possible to store these flags in a VARCHAR, BLOB or TEXT type column and still do bitwise operations on them? If yes, how?
For now I just need one operation: given a mask A with X bits set to 1 what users have at least those X bits set to 1.
Thanks!
EDIT
To anyone reading this, I've found a solution (for me, at least). I'm using a VARCHAR for the mask field and when searching for a specific mask I use this query:
select * from my_table where mask like '__1__1'
Every record that has the 3rd and last bit set to on will be returned. The "_" symbol is a SQL placeholder for "any single character" (mySQL only, perhaps?).
In terms of speed it is doing fine right now; I will have to check later when my user base grows.
Anyway, thanks for your input. Other ideas welcomed, of course.
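A sketch of how such a LIKE pattern could be generated in application code instead of typed by hand. like_pattern and like_matches are hypothetical helpers; like_matches only emulates LIKE for patterns made of "_" and literal characters.

```python
def like_pattern(width, required_positions):
    """Build a LIKE pattern; required_positions are 1-based, counted from the left."""
    chars = ["_"] * width
    for pos in required_positions:
        chars[pos - 1] = "1"
    return "".join(chars)

def like_matches(pattern, value):
    """Emulate LIKE for patterns that contain only '_' and literal characters."""
    return len(pattern) == len(value) and all(
        p == "_" or p == v for p, v in zip(pattern, value)
    )

pattern = like_pattern(6, [3, 6])
print(pattern)
print(like_matches(pattern, "001001"), like_matches(pattern, "001000"))
```

The pattern starts with a wildcard, so an ordinary index on the mask column is unlikely to help; the query scans every row, which fits the caveat about rechecking speed as the table grows.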

Optimized datatypes + simple database design

I am using a simple database design, and I think the best database example is e-commerce, because it has a lot of problems and it's familiar to CMS.
USERS TABLE
UID |int | PK NN UN AI
username |varchar(45) | UQ INDEX
password |varchar(100) | 100 varchar for $6$rounds=5000$ crypt php sha512
name |varchar(100) | 45 for first name 45 for last 10 for spaces
gender |bit | UN ,0 for women 1 for men, lol.
phone |varchar(30) | see [2]
email |varchar(255) | see RFC 5322 [1]
verified |tinyint | UN INDEX
timezone |tinyint | -128 to 127 just engough for +7 -7 or -11 +11 UTC
timeregister |int | 31052010112030 for 31-05-2010 11:20:30
timeactive |int | 01062010110020 for 1-06-2010 11:00:20
COMPANY TABLE
CID |int | PK NN UN AI
name |varchar(45) |
address |varchar(100) | not quite sure about 100.
email |varchar(255) | see users.email, this is for the offcial email
phone |varchar(30) | see users.phone
link |varchar(255) | for www.website.com/companylink 255 is good.
imagelogo |varchar(255) | for the retrieving image logo & storing
imagelogosmall |varchar(255) | not quite good nameing huh? let see the comments
yahoo |varchar(100) | dont know
linkin |varchar(100) | dont know
twitter |varchar(100) | twitter have 100 max username? is that true?
description |TEXT | or varchar[30000] for company descriptions
shoutout |varchar(140) | status that companies can have.
verified |tinyint | UN INDEX
PRODUCT TABLE
PID |int | PK NN UN AI
CID |int | from who?santa? FK: company.cid cascade delete
name |varchar(100) | longest productname maybe hahaha.
description |TEXT | still confused useing varchar[30000]
imagelarge |varchar(255) | for the retrieving product image & storing
imagesmall |varchar(255) | for the retrieving small product image & storing
tag |varchar(45) | for tagging like stackoverflow ( index )
price |decimal(11,2) | thats in Zimbabwe dollar.
[1] see Using a regular expression to validate an email address
[2] see What's the longest possible worldwide phone number I should consider in SQL varchar(length) for phone
Why InnoDB specific?
Please see the quote in How to choose optimized datatypes for columns [innodb specific]?
That question was getting off topic, and people didn't understand what I was trying to say, or maybe I couldn't explain what I wanted there, so I had to create another question. This time it is datatypes + database design.
So again, please copy the example above and put in your changes + comments, just like the example, and give advice about it.
Remember this is for InnoDB in MySQL. Read the quote at the link above. Thanks.
I'm going to answer this as if you're asking for advice on column definitions. I will not copy and paste your tables, because that's completely silly.
Don't store dates and times as integers. Use a DATETIME column instead.
Keep in mind that MySQL stores TIMESTAMP values as UTC and presents them in whatever time zone the connection is configured to use, while DATETIME values are stored and returned as-is with no conversion. It may be worth setting the connection time zone to GMT so that your separate time zone storage will work.
Keep in mind that not all time zones are full hour offsets from GMT. Daylight Saving Time can throw a monkey wrench in hour-based calculations as well. You may want to store the time zone string (e.g. "America/Los_Angeles") and figure out the proper offset at runtime.
You do not need to specify a character count for integer columns.
Don't be afraid of TEXT columns. You have a lot of VARCHAR(255)s for data that can easily be longer than 255 characters, like a URL.
Keep in mind that optimizing for a specific database engine, or optimizing for storage on disk is the very last thing you should do. Make your column definitions fit the data. Premature optimization is one of the many roots of all evil in this world. Don't worry about tinyint vs smallint vs int vs bigint, or char vs varchar vs text vs longtext. It won't matter for you 95% of the time as long as the data fits.
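To illustrate the date advice, here is a sketch in Python that parses the integer-style value from the question, normalizes it to UTC for storage, and converts it back for display. The UTC+05:30 offset is an assumed user time zone, chosen because it is not a full-hour offset.

```python
from datetime import datetime, timedelta, timezone

raw = "31052010112030"  # the question's integer-style value for 31-05-2010 11:20:30
local = datetime.strptime(raw, "%d%m%Y%H%M%S")

# Assumption for illustration: the user is at UTC+05:30.
user_tz = timezone(timedelta(hours=5, minutes=30))

# Normalize to UTC before storing in a DATETIME column.
as_utc = local.replace(tzinfo=user_tz).astimezone(timezone.utc)
stored = as_utc.strftime("%Y-%m-%d %H:%M:%S")

# Convert back to the user's offset only for display.
shown = as_utc.astimezone(user_tz).strftime("%Y-%m-%d %H:%M:%S")

print(stored)
print(shown)
```

A fixed offset like this ignores Daylight Saving Time; storing the zone name and resolving the offset at runtime, as suggested above, covers that case.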
You should store all your dates/times in GMT. It's a best practice to convert them to 0 offset and then convert them back to whatever local time zone offset the user is currently in for display.
The maximum length of a URL in Internet Explorer (the lowest common denominator) is about 2,000 characters (2,083 to be exact), so just use TEXT.
You don't set lengths on INT types (take them off!). INT is 32 bits (-2147483648 to 2147483647), BIGINT is 64 bits, TINYINT is 8 bits.
For bool/flags you can use BIT which is 1 or 0 (your "verified" for example)
VARCHAR(255) might be too small for "imagelarge" and "imagesmall" if they are to include the image file name and path (see above for max URL length).
If you are confused on how big a VARCHAR is too big and when to start using TEXT, just use TEXT!
10.2. Numeric Types
USERS TABLE
UID |int(11) | PK as primary key ? NN as not null ? UN AI
username |varchar(45) | UQ
password |varchar(200) | 200 is better.
name |varchar(100) | ok
gender |blob | f and M
phone |varchar(30) |
email |varchar(300) | that's 256, put 300 instead
verified |tinyint(1) | UN
timezone(delete) |datetime | this should be a php job
timeregister |datetime |
timeactive |datetime |
COMPANY TABLE
CID |int(11) | PK NN UN AI
name |varchar(45) |
address |varchar(100) | 100 is fine
email |varchar(255) |
phone |varchar(30) |
link |varchar(255) |
imagelogo |varchar(255) |
imagelogosmall |varchar(255) | the nameing is just fine for me
yahoo |varchar(100) | see 3.
linkin |varchar(100) | linkin use real names, maybe 100 is great
twitter |varchar(20) | its 20 ( maybe 15 )
description |TEXT |
shoutout |varchar(140) | seems ok.
verified |tinyint(1) | UN
PRODUCT TABLE
PID |int(11) | PK NN UN AI
CID |int(11) | FK: company.cid cascade delete & update
name |varchar(100) |
description |TEXT |
imagelarge |varchar(255) |
imagesmall |varchar(255) |
tag |varchar(45) |
price |decimal(11,2) |
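In PHP, see php.net/manual/en/function.date-default-timezone-set.php
// date_default_timezone_set() returns a bool, so its result is not assigned to $time
date_default_timezone_set('Region/Town'); // placeholder; use a real identifier such as 'Europe/London'
$time = date( 'Y-m-d H:i:s' );
echo $time;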
http://www.eph.co.uk/resources/email-address-length-faq/
For what it's worth, the integer argument (e.g. INT(11)) is not meaningful for storage or optimization in any way. The argument does not indicate a max length or max range of values, it's only a hint for display. This confuses a lot of MySQL users, perhaps because they're used to CHAR(11) indicating max length. Not so with integers. TINYINT(1) and TINYINT(11) and TINYINT(255) are stored identically as an 8-bit integer, and they have the same range of values.
The max length of an email address is 320 characters: 64 for the local part, 1 for @, and 255 for the domain.
I am not a fan of using VARCHAR(255) as a default string declaration. Why is 255 the right length? Is 254 just not long enough and 256 seems too much? The answer is that people believe that the length of each string is stored somewhere, and by limiting the length to 255 they can ensure that the length only takes 1 byte. They've "optimized" by allowing as long a string as they can while still keeping the length to 1 byte.
In fact, the length of the field is not stored in InnoDB. The offset of each field in the row is stored (see MySQL Internals InnoDB). If your total row length is 255 or less, the offsets use 1 byte. If your total row length could be longer than 255, the offsets use 2 bytes. Since you have several long fields in your row, it's almost certain to store the offsets in two bytes anyway. The ubiquitous value 255 may be optimized for some other RDBMS implementation, but not InnoDB.
Also, MySQL converts rows to a fixed-length format in some cases, padding variable-length fields as necessary. For example, when copying rows to a sort buffer, or storing in a MEMORY table, or preparing the buffer for a result set, it has to allocate memory based on the maximum length of the columns, not the length of usable data on a per-row basis. So if you declare VARCHAR columns far longer than you ever actually use, you're wasting memory in those cases.
This points out the hazard of trying to optimize too finely for a particular storage format.

Common MySQL fields and their appropriate data types

I am setting up a very small MySQL database that stores first name, last name, email and phone number, and am struggling to find the 'perfect' datatype for each field. I know there is no such thing as a perfect answer, but there must be some sort of common convention for commonly used fields such as these. For instance, I have determined that an unformatted US phone number is too big to be stored as an unsigned int, it must be at least a bigint.
Because I am sure other people would probably find this useful, I don't want to restrict my question to just the fields I mentioned above.
What datatypes are appropriate for common database fields? Fields like phone number, email and address?
Someone's going to post a much better answer than this, but just wanted to make the point that personally I would never store a phone number in any kind of integer field, mainly because:
You don't need to do any kind of arithmetic with it, and
Sooner or later someone's going to try to (do something like) put brackets around their area code.
In general though, I seem to almost exclusively use:
INT(11) for anything that is either an ID or references another ID
DATETIME for time stamps
VARCHAR(255) for anything guaranteed to be under 255 characters (page titles, names, etc)
TEXT for pretty much everything else.
Of course there are exceptions, but I find that covers most eventualities.
Here are some common datatypes I use (I am not much of a pro though):
| Column | Data type | Note |
| ---------------- | ------------- | ------------------------------------- |
| id | INTEGER | AUTO_INCREMENT, UNSIGNED |
| uuid | CHAR(36) | or CHAR(16) binary |
| title | VARCHAR(255) | |
| full name | VARCHAR(70) | |
| gender | TINYINT | UNSIGNED |
| description | TINYTEXT | often may not be enough, use TEXT instead |
| post body | TEXT | |
| email | VARCHAR(255) | |
| url | VARCHAR(2083) | MySQL version < 5.0.3 - use TEXT |
| salt | CHAR(x) | randomly generated string, usually of fixed length (x) |
| digest (md5) | CHAR(32) | |
| phone number | VARCHAR(20) | |
| US zip code | CHAR(5) | Use CHAR(10) if you store extended codes |
| US/Canada p.code | CHAR(6) | |
| file path | VARCHAR(255) | |
| 5-star rating | DECIMAL(3,2) | UNSIGNED |
| price | DECIMAL(10,2) | UNSIGNED |
| date (creation) | DATE/DATETIME | usually displayed as initial date of a post |
| date (tracking) | TIMESTAMP | can be used for tracking changes in a post |
| tags, categories | TINYTEXT | comma separated values * |
| status | TINYINT(1) | 1 – published, 0 – unpublished, … You can also use ENUM for human-readable values |
| json data | JSON | or LONGTEXT |
In my experience, first name/last name fields should be at least 48 characters -- there are names from some countries such as Malaysia or India that are very long in their full form.
Phone numbers and postcodes you should always treat as text, not numbers. The normal reason given is that there are postcodes that begin with 0, and in some countries, phone numbers can also begin with 0. But the real reason is that they aren't numbers -- they're identifiers that happen to be made up of numerical digits (and that's ignoring countries like Canada that have letters in their postcodes). So store them in a text field.
In MySQL you can use VARCHAR fields for this type of information. Whilst it sounds lazy, it means you don't have to be too concerned about the right minimum size.
Since you're going to be dealing with data of a variable length (names, email addresses), then you'd be wanting to use VARCHAR. The amount of space taken up by a VARCHAR field is [field length] + 1 bytes, up to max length 255, so I wouldn't worry too much about trying to find a perfect size. Take a look at what you'd imagine might be the longest length might be, then double it and set that as your VARCHAR limit. That said...:
I generally set email fields to be VARCHAR(100) - i haven't come up with a problem from that yet. Names I set to VARCHAR(50).
As the others have said, phone numbers and zip/postal codes are not actually numeric values, they're strings containing the digits 0-9 (and sometimes more!), and therefore you should treat them as a string. VARCHAR(20) should be well sufficient.
Note that if you were to store phone numbers as integers, many systems will assume that a number starting with 0 is an octal (base 8) number! Therefore, the perfectly valid phone number "0731602412" would get put into your database as the decimal number "124192010"!!
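The octal figure above is easy to verify. A short Python check, where the base-8 parse stands in for what an octal-aware system would do with the leading zero:

```python
phone = "0731602412"

as_octal = int(phone, 8)     # what a parser that treats a leading 0 as octal produces
as_decimal = int(phone, 10)  # a plain integer parse silently drops the leading zero

print(as_octal)
print(as_decimal)
print(str(as_decimal) == phone)
```

Either way the original string cannot be recovered from the stored number, which is the practical argument for keeping phone numbers in a VARCHAR.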
Any Table ID
Use: INT(11).
MySQL indexes will be able to parse through an int list fastest.
Anything Security
Use: BINARY(x), or BLOB(x).
You can store security tokens, etc., as hex directly in BINARY(x) or BLOB(x). To retrieve from binary-type, use SELECT HEX(field)... or SELECT ... WHERE field = UNHEX("ABCD....").
Anything Date
Use: DATETIME, DATE, or TIME.
Always use DATETIME if you need to store both date and time (instead of a pair of fields), as a DATETIME indexing is more amenable to date-comparisons in MySQL.
Anything True-False
Use: BIT(1). Otherwise, use BOOLEAN.
BOOLEAN is actually just an alias of TINYINT(1), which actually stores -128 to 127, or 0 to 255 if UNSIGNED (not exactly a true/false, is it?).
Anything You Want to call `SUM()`, `MAX()`, or similar functions on
Use: INT(11).
VARCHAR or other types of fields won't work with the SUM(), etc., functions.
Anything Over 1,000 Characters
Use: TEXT.
Max limit is 65,535.
Anything Over 65,535 Characters
Use: MEDIUMTEXT.
Max limit is 16,777,215.
Anything Over 16,777,215 Characters
Use: LONGTEXT.
Max limit is 4,294,967,295.
FirstName, LastName
Use : VARCHAR(255).
UTF-8 text can take up several bytes per visible character, and some cultures do not distinguish firstname and lastname. Additionally, cultures may have disagreements about which name is first and which name is last. You should name these fields Person.GivenName and Person.FamilyName.
Email Address
Use : VARCHAR(256).
The definition of an e-mail path is set in RFC821 in 1982. The maximum limit of an e-mail was set by RFC2821 in 2001, and these limits were kept unchanged by RFC5321 in 2008. (See the section: 4.5.3.1. Size Limits and Minimums.) RFC3696, published 2004, mistakenly cites the email address limit as 320 characters, but this was an "info-only" RFC that explicitly "defines no standards" according to its intro, so disregard it.
Phone
Use: VARCHAR(255).
You never know when the phone number will be in the form of "1800...", or "1-800", or "1-(800)", or if it will end with "ext. 42", or "ask for susan".
ZipCode
Use: VARCHAR(10).
You'll get data like 12345 or 12345-6789. Use validation to cleanse this input.
URL
Use: VARCHAR(2000).
Official standards support URL's much longer than this, but few modern browsers support URL's over 2,000 characters. See this SO answer: What is the maximum length of a URL in different browsers?
Price
Use: DECIMAL(11,2).
It goes up to 11.
I am doing about the same thing, and here's what I did.
I used separate tables for name, address, email, and numbers, each with a NameID column that is a foreign key on everything except the Name table, on which it is the primary clustered key. I used MainName and FirstName instead of LastName and FirstName to allow for business entries as well as personal entries, but you may not have a need for that.
The NameID column gets to be a smallint in all the tables because I'm fairly certain I won't make more than 32000 entries. Almost everything else is varchar(n) ranging from 20 to 200, depending on what you wanna store (Birthdays, comments, emails, really long names). That is really dependent on what kind of stuff you're storing.
The Numbers table is where I deviate from that. I set it up to have five columns labeled NameID, Phone#, CountryCode, Extension, and PhoneType. I already discussed NameID. Phone# is varchar(12) with a check constraint looking something like this: CHECK (Phone# like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'). This ensures that only what I want makes it into the database and the data stays very consistent. The extension and country codes I called nullable smallints, but those could be varchar if you wanted to. PhoneType is varchar(20) and is not nullable.
Hope this helps!
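The same format rule as the CHECK constraint above can also be enforced in application code before the insert. A sketch in Python; PHONE_RE and is_valid_phone are hypothetical names, and the pattern only mirrors the ddd-ddd-dddd shape from the constraint.

```python
import re

# Mirrors the CHECK pattern: three digits, dash, three digits, dash, four digits.
PHONE_RE = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

def is_valid_phone(value):
    """True only if the whole string has the ddd-ddd-dddd shape."""
    return PHONE_RE.fullmatch(value) is not None

print(is_valid_phone("555-123-4567"))
print(is_valid_phone("5551234567"))
```

Validating in both places is a design choice: the application check gives a friendly error, while the database constraint keeps the data consistent if another client writes to the table.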