index on url or hashing considering RAM - mysql

I am working on a project which needs to add/update around 1 million urls daily. Some days are mostly updates and some days are mostly add and some days are mix.
So, on every query there is need to look up uniqueness of url in url table.
How look up for url can be made really fast because at the moment index is set at url column and it works good but in coming weeks RAM would not be enough if index are kept on same column and new records will be added in millions.
That's why I am looking for a solution so that when there will be 150+ million urls in total then its look up should be fast. I am thinking of creating indexing on md5 but then worries about collision chances. A friend tipped me to calculate crc32 hash also and concatenate with md5 to make collision possibility to zero and store it in binary(20) that way only 20 bytes would be taken as index instead of 255 currently varchar(255) set as url column data type.
Currently there are total around 50 million urls and with 8GB ram its working fine.
Yesterday, I asked a question url text compression (not shortening) and storing in mysql related to the same project.
[Edit]
I have thought of another solution of putting crc32 hash only in decimal form to speed up look up. And at application level porting a check on how many records are returned. If more than 1 record is returned then exact url should also be matched.
That way collision would also be avoided while keep low load on RAM and disk space by storing 4 bytes for each row instead of 20 bytes (md5+crc32). What you say?

After reading all your questions ( unique constraint makes hashes useless? , 512 bit hash vs 4 128bit hash and url text compression (not shortening) and storing in mysql), I understood that your problem is more or less the following:
"I need to store +150M URLs in mySQL, using 8GB of RAM, and still have a good performance on writing them all and retrieving them, because daily I'll update them, so I'll retrive a lot of URLs, check them against the database. Actually it has 50M URLs, and will grow about 1M each day in the following 3 monts."
Is that it?
The following points are important:
How is the format of the URL that you'll save? Will you need to read the URL back, or just update informations about it, but never search based in partial URLs, etc?
Assuming URL = "http://www.somesite.com.tv/images/picture01.jpg" and that you want to store everything, inclusing the filename. If it's different, please provide more details or correct my answer assumptions.
If can save space by replacing some group of characters in the URL. Not all ASCII characters are valid in an URL, as you can see here: RFC1738, so you can use those to represent (and compress) the URL. For example: using character 0x81 to represent "http://" can make you save 6 characters, 0x82 to represent ".jpg" can save you another 3 bytes, etc.
Some words might be very common (like "image", "picture", "video", "user"). If you choose to user characters 0x90 up to 0x9f + any other character (so, 0x90 0x01, 0x90 0x02, 0x90 0xfa) to encode such words, you can have 16 * 256 = 4,096 "dictionary entries" to encode the most used words. You'll use 2 bytes to represent 4 - 8 characters.
Edit: as you can read in the mentioned RFC, above, in the URL you can only have the printable ASCII characters. This means that only characters 0x20 to 0x7F should be used, with some observations made in the RFC. So, any character after 0x80 (hexadecimal notation, would be character 128 decimal in the ASCII table) shouldn't be used. So, if can choose one character (let's say the 0x90) to be one flag to indicate "the following byte is a indication in the dictionary, the index that I'll use". One character (0x90) * 256 characters (0x00 up to 0xFF) = 256 entries in the dictionary. But you can also choose to use characters 0x90 to 0x9f (or 144 to 159 in decimal) to indicate that they are a flag to the dictionary, thus giving you 16 *256 possibilities...
These 2 methods can save you a lot of space in your database and are reversible, without the need to worry about collisions, etc. You'll simple create a dictionary in your application and go encode/decode URLs using it, very fast, making your database much lighter.
Since you already have +50M URLs, you can generate statistics based on them, to generate a better dictionary.
Using hashes : Hashes, in this case, are a tradeoff between size and security. How bad will it be if you get a collision?
And in this case you can use the birthday paradox to help you.
Read the article to understand the problem: if all inputs (possible characters in the URL) were equivalent, you could stimate the probability of a collision. And could calculate the opposite: given your acceptable collision probability, and your number of files, how broad should your range be? And since your range is exactlly related to the number of bits generated by the hash function...
Edit: if you have a hash function that gives you 128 bits, you'll have 2^128 possible outcomes. So, your "range" in the birthday paradox is 2^128: it's like your year have 2^128 days, instead of 365. So, you calculate the probabilities of collision ("two files being born in the same day, with a year that have 2^128 days instead of 365 days). If you choose to use a hash that gives you 512 bits, your range would go from 0 to 2^512...
And, again, have the RFC in mind: not all bytes (256 characters) are valid in the internet / URL world. So, the probabillity of collisions decrease. Better for you :).

Related

Is PayPal's "PNREF" always 12 characters?

Does anyone know if PayPal's "PNREF" (returned from zero-dollar authorizations) is always 12 characters?
This I ask because I want to optimize my mySQL storage.
And also, I trust SO's answer more than PP's :-D
Don't "optimize" your storage. Not only do server-grade terabyte sized drives cost just a few hundred dollars, making the cost of storing a handful of bytes nearly zero, but VARCHAR(255) columns only take up as much space as you have content because they are variable length.
If you ran a million transactions and saved ten bytes on each, you've saved all of ten megabytes of data, or about $0.0001 worth of storage. I'm presuming if you've run a million transactions you can afford the bytes. The PayPal fees will be literally several quadrillion times higher.
In actuality there's zero savings between 12 characters in VARCHAR(12) and VARCHAR(255). Internally these are represented as a single length byte plus N bytes for the content. For regular 7-bit values that means 13 bytes per entry.
The only difference is you're arbitrarily limiting the former to 12 characters and will get truncation errors (if this flag is set, as it is on newer versions of MySQL) if you insert longer values, or you'll lose data and have no idea until it's probably too late to fix it.
Just use VARCHAR(255) so that your code doesn't explode when PayPal decides today's the day to use 14 characters. These things can change without warning and without any logical reason.

Sorting a AS3 ByteArray

I am designing an Air application that needs to store thousands of records in memory, and needs to sort them efficiently, by various keys.
I thought of using a ByteArray, since that would avoid all the overhead of normal AS3 objects, and would allow me to use memory more efficiently.
However, the challenge is how to sort the records inside the ByteArray. I have thought of two possibilities:
1- Implement quick-sort or heap-sort in AS3, and sort the array this way. However, I am not sure this will be performant enough. For example, ByteArrays do not have methods to copy chunks of memory around; it has to be done byte-by-byte.
2- Create an Air Native Extension (ANE) that takes the ByteArray and sorts it, using C. the drawback of this is that it will be harder to implement for all the platforms it needs to run on.
What would you recommend? Do you have any previous experience doing something similar?
I'd say use Array or Vector objects, there's a possibility to sort Arrays on whatever key you want via sortOn(), and Vectors via sort(), so you can achieve whatever behavior you need, as the latter accepts a function as its parameter, check here. And I believe you won't get anywhere with ByteArrays, since what is actually done in sorting objects is sorting links in there, while a ByteArray will contain actual data.
You should never design anything that HAS to have hundreds of thousands of anything in the memory at once. Offload stuff while you don't need it. Do you know how much 100,000 is? Taking a single byte and multiplying by 100,000 gives you a MB. For every 1 byte of data in a record, you will generate 1MB of memory. Recording 100,000 ints takes 4MB.
If your records have 2 20 character strings (a first and last name), a String character is represented with 8 bytes, so you have just filled the memory with 640 MB of nothing more than first and last names. Most 'bad' computers have like, what... 2GB of memory? Good Job taking up 1/4 of that. Even if you managed to truncate this down to ByteArray level with superhuman uber bitshifing, you're still talking about reducing data by a factor of 8. So now you have 80MB for just first and last names and no other data. You could survive on that- except I suspect your records have more data then 2 strings. 20 strings? You're eating 800MB of data. Offload everything but 100 records at a time, and you're down to 640KB of memory for those names. And yes, you can load and offload while sorting.
Chunks of memory don't copy faster than bytes. It's all the same. The reasons Vectors of Objects are performant when switching is because they switch references/pointers/one single 32 bit/64 bit number instead of copying chunks of memory.
It's not clear what you're sorting. Bytes only go up to values of 256, so clearly you're using more bytes than 1 for each record. You want to evaluate each set of... like 2000+ bytes against each other set of 2000+ bytes? Like "Ah, last name is bytes 32-83, so extract those bytes, for every group of 4 bytes, bit shift them 0, 8, 16, 32 bits respectively, add them together, concatinate their integer values into a a String, do a comparison, now compare bytes 84-124 against the bytes in the next option, now transfer bytes 0-2000 to location 443530-441530 and....... Do these records have variable length strings or arrays in them? Oh lord.
Flash is not the place to write in assembly!
Use objects and test the speed & memory consumption. If either makes you cry, use more conventional methods of reducing load; like offloading materials temporarily into text files. The ugliest you should be getting is avoiding objects by storing each individual property in a different Vector. Vector., etc and having the same index refer to one record across the board.

url text compression (not shortening) and storing in mysql

I have url table in mysql which has only two fields id and varchar(255) for url. There are currently more than 50 million urls there and my boss has just given my clue about the expansion of our current project which will result in more urls to be added in that url table and expected numbers are well around 150 million in the mid of the next year.
Currently database size is about 6GB so I can safely say that if things are left same way then it will cross 20GB which is not good. So, I am thinking of some solution which can reduce the disk space of url storage.
I also want to make it clear that this table is not a busy table and there are not too many queries at the momen so I am just looking to save disk space and more importantly I am looking to explore new ideas of short text compression and its storage in mysql
BUT in future that table can also be accessed heavily so its better to optimize the table well before the time come.
I worked quite a bit to change the url into numeric form and store using BIGINT but as it has limitations of 64 bits so it didn't work out quite well. And same is the problem with BIT data type and imposes the limit of 64 bits too.
My idea behind converting to numeric form is basically as 8byte BIGINT stores 19 digits so if each digit points a character in a character set of all possible characters then it can store 19 characters in 8 bytes if all characters are ranged from 1-10 but as in real world scenario there are 52 characters of English and 10 numeric plus few symbols so its well around 100 character set. So, in worst case scenario BIGINT can still point to 6 characters and yes its not a final verdict it still needs some workout to know exactly what each digit is point to it is 10+ digit or 30+ digit or 80+ digit but you have got pretty much the idea of what I am thinking about.
One more important thing is that as url are of variable length so I am also trying to save disk space of small urls so I don't want to give a fixed length column type.
I have also looked into some text compression algo like smaz and Huffman compression algo but not pretty much convinced because they use some sort of dictionary words but I am looking for a clean method.
And I don't want to use binary data type because it also take too many space like varchars in bytes.
Another idea to try might be to identify common strings and represent them with a bitmap. For instance, have two bits to represent the protocol (http, https, ftp or something else), another bit to indicate whether the domain starts with "wwww", two bits to indicate whether the domain ends with ".com", ".org", ".edu" or something else. You'd have to do some analysis on your data and see if these make sense, and if there are any other common strings you can identify.
If you have a lot of URLs to the same site, you could also consider splitting your table into two different ones, one holding the domain and the other containing the domain-relative path (and query string & fragment id, if present). You'd have a link table that had the id of the URL, the id of the domain and the id of the path, and you'd replace your original URL table with a view that joined the three tables. The domain table wouldn't have to be restricted to the domain, you could include as much of the URL as was common (e.g., 'http://stackoverflow.com/questions'). This wouldn't take too much code to implement, and has the advantage of still being readable. Your numeric encoding could be more efficient, once you get it figured out, you'll have to analyze your data to see which one makes more sense.
If you are looking for 128 bit integers then you can use binary(16) here 16 is bytes. And you can extend it to 64 bytes (512 bits) so it doesn't take more space than bit data type. You can say Binary data type as an expansion of BIT data type but its string variant.
Having said that I would suggest dictionary algorithms to compress URLs and short strings but with the blend of techniques used by url shortening services like using A-Z a-z 0-9 combination of three words to replace large dictionary words and you would have more combinations available than available words 62 X 62 X 62.
Though I am not sure what level of compression you would achieve but its not a bad idea to implement url compression this way.

Can I use part of MD5 hash for data identification?

I use MD5 hash for identifying files with unknown origin. No attacker here, so I don't care that MD5 has been broken and one can intendedly generate collisions.
My problem is I need to provide logging so that different problems are diagnosed easier. If I log every hash as a hex string that's too long, inconvenient and looks ugly, so I'd like to shorten the hash string.
Now I know that just taking a small part of a GUID is a very bad idea - GUIDs are designed to be unique, but part of them are not.
Is the same true for MD5 - can I take say first 4 bytes of MD5 and assume that I only get collision probability higher due to the reduced number of bytes compared to the original hash?
The short answer is yes, you can use the first 4 bytes as an id. Beware of the birthday paradox though:
http://en.wikipedia.org/wiki/Birthday_paradox
The risk of a collision rapidly increases as you add more files. With 50.000 there's roughly 25% chance that you'll get an id collision.
EDIT: Ok, just read the link to your other question and with 100.000 files the chance of collision is roughly 70%.
Here is a related topic you may refer to
What is the probability that the first 4 bytes of MD5 hash computed from file contents will collide?
Another way to shorten the hash is to convert it to something more efficient than HEX like Base64 or some variant there-of.
Even if you're determined to take on 4 characters, taking 4 characters of base64 gives you more bits than hex.

1-1 mappings for id obfuscation

I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x mult 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x mult 1113) mod 65536 to get back to the internal id=3.
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.
I find that simple XOR encryption is best suited for URL obfuscation. You can continue using whatever serial number you are using without change. Further XOR encryption doesn't increase the length of source string. If your text is 22 bytes, the encrypted string will be 22 bytes too. It's not easy enough as to be guessed like rot 13 but not heavy weight like DSE/RSA.
Search the net for PHP XOR encryption to find some implementation. The first one I found is here.
I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number scrambling algorithm, involving mixed radices. Basically I have a function that maps a number between 0-N to another number in the 0-N range. For URLS I then map that number to a couple of english words. (words are easier to remember).
A simplified version of what I do, without mixed radices: You have a number that is 32 bits, so ahead of time, have a passkey which is 32-bits long, and XOR the passkey with your input number. Then shuffle the bits around in a determinate reordering. (possibly based on your passkey).
The nice thing about this is
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDS internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
if you're up for the mixed radix version, it's basically the same, except that I add the steps of converting the input to a mixed raddix number, using the maximum range's prime factors as the digit's bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs - the "Type 1" algorithm uses both the MAC address of the PC, and the current date/time as inputs. This guarantees that collisions are simply impossible.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.
I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularily clever)
it uses Python
def to_base(q, alphabet):
if q < 0: raise ValueError, "must supply a positive integer"
l = len(alphabet)
converted = []
while q != 0:
q, r = divmod(q, l)
converted.insert(0, alphabet[r])
return "".join(converted) or '0'
def to36(q):
return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')
Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
Swiped from here.
UPDATED As with using the GUID aproach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different that using a GUID, except that the customer and Customer Service will have an easier time referring to the order.
I've found a much simpler way. Say you want to map N digits, pseudorandomly to N digits. you find the next highest prime from N, and you make your function
prandmap(x) return x * nextPrime(N) % N
this will produce a function that repeats (or has a period) every N, no number is produced twice until x=N+1. It always starts at 0, but is pseudorandom thereafter.
I honestly thing encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with querystring data, you have to resort to some javascript hacks to send data by POST (keep accessibility in mind for users with Javascript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive from being indexed by search engines, assuming the data you're trying to hide really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.
From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
the day that you're processing so many invoices a second that AES encrypting them in transit is too performance expensive, go out and buy yourself a big fat server with lots of CPUs to celebrate.
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".