Choosing a magic byte least likely to appear in real data - language-agnostic

I hope this isn't too opinionated for SO; it may not have a good answer.
In a portion of a library I'm writing, I have a byte array that gets populated with values supplied by the user. These values might be of type Float, Double, Int (of different sizes), etc. with binary representations you might expect from C, say. This is all we can say about the values.
I have an opportunity for an optimization: I can initialize my byte array with the byte MAGIC, and then whenever no byte of the user-supplied value is equal to MAGIC I can take a fast path, otherwise I need to take the slow path.
So my question is: what is a principled way to go about choosing my magic byte, such that it will be reasonably likely not to appear in the (variously-encoded and distributed) data I receive?
Part of my question, I suppose, is whether there's something like a Benford's law that can tell me something about the distribution of bytes in many sorts of data.

Capture real-world data from a diverse set of inputs that would be used by applications of your library.
Write a quick and dirty program to analyze the dataset. It sounds like what you want to know is which byte values are most often absent altogether. So the output of the program would say, for each byte value, how many inputs do not contain it.
This is not the same as least frequent byte. In data analysis you need to be careful to mind exactly what you're measuring!
Use the analysis to define your architecture. If there is no byte value that is regularly absent, you can abandon the optimization entirely.
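A minimal sketch of that analysis in Python, assuming each captured input is available as a bytes object. The function names absence_counts and best_magic_candidates are hypothetical, and the sample inputs are made up.

```python
def absence_counts(inputs):
    """For each byte value 0..255, count how many inputs do not contain it."""
    counts = [0] * 256
    for data in inputs:
        present = set(data)  # distinct byte values seen in this input
        for value in range(256):
            if value not in present:
                counts[value] += 1
    return counts


def best_magic_candidates(inputs, top=5):
    """Byte values ordered from most often absent to least often absent."""
    counts = absence_counts(inputs)
    return sorted(range(256), key=lambda v: counts[v], reverse=True)[:top]


if __name__ == "__main__":
    # Made-up inputs; real measurements need data captured from real users.
    samples = [b"\x00\x00\x80\x3f", b"\xff\xff\xff\xff", b"hello"]
    print(best_magic_candidates(samples))
```

Note that this counts inputs that exclude a byte entirely, which is the quantity the fast path depends on, rather than overall byte frequency.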

I was inclined to use byte 255 but I discovered that is also prevalent in MSWord files. So I use byte 254 now, for EOF code to terminate a file.


Cython: string to list of strings

in pure Python it is easy:
in_string = 'abc,def,ghi,jklmnop,, '
out = in_string.lower().rstrip().split(',') # too slow!!!
out -> ['abc','def','ghi','jklmnop','']
In my case this is called several million times and I need to speed it up a little. I am already using Cython but I do not know how to speed up this particular portion of code.
There can be up to 300 substrings. Pure ASCII. Letters, numbers and some other printable characters. There can be no comma "," in a substring. So a comma is the separator.
Edit:
OK, I see that a simple question turns into a big one. So the data comes from files which have a CSV-like format (no ready to run software works on this) and in total can be 100GB in size. The method reads the file line by line, needs to get the substrings and then sends the substrings to a SQlite database (I am already using executemany). The whole is done in multiprocessing manner, so each file is processed by its own process. The whole is already fast, but I want to squeeze out the last bit of performance. Additionally I want to learn more about Cython. So I have picked this (simple) part of Python code and have run it with "cython -a" which produces a big amount of generated code. So I think this is the best part to start optimizing.
Profiling the code is not that easy because of multiprocessing and cython is being used.
So once someone answers my question, I could implement this code and make a test run. So even if I do not improve the speed of my code, I will for sure learn a lot. Unfortunately I am a C noob.
Yes, you can do this in Cython; the larger question is whether you should.
Where does the input come from?
Is it a file? Then other optimisations are possible, e.g. you could map the file into memory.
Is it a database or network connection? In that case your runtime is probably dominated by waiting for disk/network.
What do you plan to do with the output?
Does the output have to be a string, or can it be a buffer?
"abc,def" -> "abc\0def\0"
buffer1 ------^ ^
buffer2 -----------!
You mentioned that the string splitting fragment is called millions of times. Processing the string is not that slow; what probably kills performance is allocating all the small strings and an array to hold the result, and then collecting the garbage once the substrings are no longer used.
If you could give out pointers to existing data instead, you could speed things up a bit.
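As a rough illustration of handing out positions instead of new strings, here is a sketch in plain Python. The name field_spans is hypothetical. In pure Python the offset tuples are themselves allocations, so whether this helps would need measuring; in Cython the same idea could use C integers indexing into a char buffer.

```python
def field_spans(line, sep=","):
    """Return (start, end) offsets of each field without creating substrings."""
    spans = []
    start = 0
    while True:
        end = line.find(sep, start)
        if end == -1:
            # Last field runs to the end of the line.
            spans.append((start, len(line)))
            return spans
        spans.append((start, end))
        start = end + 1


line = "abc,def,ghi"
spans = field_spans(line)
# A substring is only materialised when a field is actually needed.
second = line[spans[1][0]:spans[1][1]]
```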
How often are these substrings used? If split is called millions of times, it seems to suggest that most substrings are discarded (or you'd run out of memory).
For example, consider the problem "split into substrings and return numbers only"
filter(str.isdigit, "dfasdf,6785,2,dhs,dfgsd,dsg,dsffg".split(","))
If you know in advance that most substrings are not numbers, you'd want to optimise this larger problem as a single block.
How many substrings are there in a typical input?
If there are 4, like in your example, it's not worth it. If there are millions, or even thousands, you may get somewhere.
Is there unicode?
.lower() on an ASCII string is trivial, but not so on unicode. I'd stick to Python if you expect unicode.

Is it useful to compress localStorage?

I'll start off with a solid example: I have a function that generates hashes (32-bit integers) and saves them in localStorage. This is to implement a "don't show me again" feature for common notifications: if the hash is in the list, don't show the notification.
After my first attempt at coding this solution, my localStorage entry looked like this:
616845040,796177849,848184043,1133088406,1205053317,1478518197,1525440546,1686606993,1753347541,1908577591,2056496592,-864967541,-1185668678,-835401591,-1017499054,-559563441,-1842092814,-1069291933,-1887162563
19 hashes, 210 bytes of data.
A little later, I revisited the code. Instead of just dumping the integers as decimal strings, I converted them into actual binary data. In other words, each hash is now a string of four characters in length representing the binary value of the integer. My localStorage entry now looks like this:
$ÄNð/tµ¹2BëCGÓ§X eµZì`"dhõÕqÂ7z¥Ðᅩq¤ᄍT!ºᅫ4ÈᅢZ2R￞¥½Oメ3äò￀Cæcマ/=
19 hashes, 76 bytes of data (There's some non-printable characters in there)
That's a savings of 63.8%.
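The arithmetic can be checked with a short Python sketch (the question itself concerns JavaScript; Python is used here only for illustration). It assumes big-endian signed 32-bit packing, which is one possible layout and not necessarily the one used above, and it counts bytes in Python rather than how a browser accounts for localStorage strings.

```python
import struct

hashes = [
    616845040, 796177849, 848184043, 1133088406, 1205053317,
    1478518197, 1525440546, 1686606993, 1753347541, 1908577591,
    2056496592, -864967541, -1185668678, -835401591, -1017499054,
    -559563441, -1842092814, -1069291933, -1887162563,
]

# First method: comma-separated decimal strings.
decimal = ",".join(str(h) for h in hashes)

# Second method: four bytes per hash (big-endian, signed 32-bit).
packed = struct.pack(">%di" % len(hashes), *hashes)

# The packed form decodes back to the same integers.
restored = list(struct.unpack(">%di" % (len(packed) // 4), packed))

saving = 1 - len(packed) / len(decimal)
print(len(decimal), len(packed), round(saving * 100, 1))
```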
Now, I am well aware that localStorage provides, by default, 5MB of storage space. I could easily store tens of thousands of hashes with the first method with no issues at all. But I like being efficient. I certainly wouldn't want a 5MB file on my computer when I could have the same data in 1.8MB (same compression ratio as above). That's why I save all my PNGs as indexed-palette when possible.
Is this a good mentality to have? Or am I just being pedantic? I guess this question could be summarised as: Should I compress, or just not care due to having more resources than I'll ever need?
Pedantic is good when coming to code. Compress when you can, but be sure that when reading your code, it's still readable and understandable that hashes are kept in whatever way.
What I mean is, don't sacrifice your code readability and maintainability for efficiency.

Does using binary numbers in code improve performance?

I've seen quite a few examples where binary numbers are being used in code, like 32,64,128 and so on (for instance, very well known example - minecraft)
I want to ask, does using binary numbers in such high level languages as Java / C++ help anything?
I know assembly and that you would always rather use these because in low level language it overcomplicates things if you go above register limit.
Will programs run any faster/save up more memory if you use binary numbers?
As with most things, "it depends".
In compiled languages, the better compilers will deduce that slow machine instructions can sometimes be done with different faster machine instructions (but only for special values, such as powers of two). Sometimes coders know this and program accordingly. (e.g. multiplying by a power of two is cheap)
Other times, algorithms are suited towards representations involving powers of two (e.g. many divide and conquer algorithms like the Fast Fourier Transform or a merge sort).
Yet other times, it's the most compact way to represent boolean values (like a bitmask).
And on top of that, other times it's more efficient for memory purposes (typically because it's so fast to do multiply and divide logic with powers of two, the OS/hardware/etc. will use cache line sizes, page sizes, etc. that are powers of two, so you'd do well to have nice power-of-two sizes for your important data structures).
And then, on top of that, other times.. programmers are just so used to using powers of two that they simply do it because it seems like a nice number.
There are some benefits of using powers of two numbers in your programs. Bitmasks are one application of this, mainly because bitwise operators (&, |, <<, >>, etc) are incredibly fast.
In C++ and Java, this is done a fair bit- especially with GUI applications. You could have a field of 32 different menu options (such as resizable, removable, editable, etc), and apply each one without having to go through convoluted addition of values.
In terms of raw speedup or any performance improvement, that really depends on the application itself. GUI packages can be huge, so getting any speedup out of those when applying menu/interface options is a big win.
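A small sketch of the bitmask idea, written in Python for brevity; the same operators exist in C++ and Java. The option names are taken from the example above, and the bit positions assigned to them are arbitrary assumptions.

```python
# Each option occupies its own bit, so each constant is a power of two.
RESIZABLE = 1 << 0
REMOVABLE = 1 << 1
EDITABLE = 1 << 2


def has_option(options, flag):
    """True if the given flag bit is set in the options mask."""
    return (options & flag) != 0


# Combine options with OR, test them with AND, clear them with AND NOT.
options = RESIZABLE | EDITABLE
options_without_edit = options & ~EDITABLE
```

All options fit in one integer, and setting, testing or clearing any of them is a single bitwise operation.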
From the title of your question, it sounds like you mean, "Does it make your program more efficient if you write constants in binary?" If that's what you meant, the answer is emphatically, No. The compiler translates all your constants to binary at compile time, so by the time the program runs, it makes no difference. I don't know if the compiler can interpret binary constants faster than decimal, but the difference would surely be trivial.
But the body of your question seems to indicate that you mean, "use constants that are round number in binary" rather than necessarily expressing them in binary digits.
For most purposes, the answer would be no. If, say, the computer has to add two numbers together, adding a number that happens to be a round number in binary is not going to be any faster than adding a not-round number.
It might be slightly faster for multiplication. Some compilers are smart enough to turn multiplication by powers of 2 into a bit shift operation rather than a hardware multiply, and bit shifts are usually faster than multiplies.
Back in my assembly-language days I often made elements in arrays have sizes that were powers of 2 so I could index into the array with a bit-shift rather than a multiply. But in a high-level language that would be hard to do, as you'd have to do some research to find out just how much space your primitives take in memory, whether the compiler adds padding bytes between them, etc. And if you did add some bytes to an array element to pad it out to a power of 2, the entire array is now bigger, and so you might generate an extra page fault, i.e. the operating system runs out of memory and has to write a chunk of your data to the hard drive and then read it back when it needs it. One extra hard drive write takes more time than 1000 multiplications.
In practice, (a) the difference is so trivial that it would almost never be worth worrying about; and (b) you don't normally know everything happening at the low level, so it would often be hard to predict whether a change with its attendant ramifications would help or hurt.
In short: Don't bother. Use the constant values that are natural to the problem.
The reason they're used is probably different - e.g. bitmasks.
If you see them in array sizes, it doesn't really increase performance, but usually memory is allocated by power of 2. E.g. if you wrote char x[100], you'd probably get 128 allocated bytes.
No, your code will run the same way, no matter what number you use.
If by binary numbers you mean numbers that are powers of 2, like 2, 4, 8, 16, 1024, they are common mostly because of space optimization. For example, an 8-bit pointer can address 256 locations (a power of 2), so if you use fewer than 256 you are wasting part of your pointer's range, and so normally you allocate a 256-entry buffer. The same reasoning applies to all other powers of 2.
In most cases the answer is almost always no, there is no noticeable performance difference.
However, there are certain cases (very few) when NOT using binary numbers for array/structure sizes/lengths will give noticeable performance benefits. These are cases where you're looping over a structure that fills the cache in such a way that you get cache collisions every time you loop through your array/structure. This case is very rare, and shouldn't be pre-optimized unless your code is performing much more slowly than theoretical limits say it should. Also, this case is very hardware dependent and will change from system to system.

What are important points when designing a (binary) file format? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
When designing a file format for recording binary data, what attributes would you think the format should have? So far, I've come up with the following important points:
have some "magic bytes" at the beginning, to be able to recognize the files (in my specific case, this should also help to distinguish the files from "legacy" files)
have a file version number at the beginning, so that the file format can be changed later without breaking compatibility
specify the endianness and size of all data items; or: include some space to describe endianness/size of data (I would tend towards the former)
possibly reserve some space for further per-file attributes that might be necessary in the future?
What else would be useful to make the format more future-proof and minimize headache in the future?
Take a look at the PNG spec. This format has some very good rationale behind it.
Also, decide what's important for your future format: compactness, compatibility, the ability to embed other formats (different compression algorithms) inside it. Another interesting example would be Google's protocol buffers, where the size of the transferred data is king.
As for endianness, I'd suggest you pick one option and stick with it, not allowing different byte orders. Otherwise, reading and writing libraries will only get more complex and slower.
I agree that these are good ideas:
Magic numbers at the beginning. Pretty much required in *nix:
File version number for backwards compatibility.
Endianness specification.
But your fourth one ("possibly reserve some space for further per-file attributes that might be necessary in the future?") is overkill, because #2 lets you add fields as long as you change the version number (and as long as you don't need forward compatibility).
Also, the idea of imposing a block-structure on your file, expressed in many other answers, seems less like a universal requirement for binary files than a solution to a problem with certain kinds of payloads.
In addition to 1-3 above, I'd add these:
simple checksum or other way of detecting that the contents are intact. Otherwise you can't trust magic bytes or version numbers. Be careful to spec which bytes are included in the checksum. Typically you would include all bytes in the file that don't already have error detection.
version of your software (including the most granular number you have, e.g. build number) that wrote the file. You're going to get a bug report with an attached file from someone who can't open it and they will have no clue when they wrote the file because the error didn't occur then. But the bug is in the version that wrote it, not in the one trying to read it.
Make it clear in the spec that this is a binary format, i.e. all values 0-255 are allowed for all bytes (except the magic numbers).
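A minimal sketch of a header carrying magic bytes, a format version, the writer's build number and a checksum. The magic value, field widths and the names write_file and read_file are hypothetical. For simplicity the CRC32 here covers only the payload, so the header fields themselves are unprotected; a real spec would have to state exactly which bytes are covered.

```python
import struct
import zlib

MAGIC = b"\x89MYF"  # hypothetical magic bytes
# Layout (little-endian): magic, format version, writer build, CRC32 of payload.
HEADER = struct.Struct("<4sHHI")


def write_file(payload, version=1, build=1234):
    """Return header + payload as one bytes object."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(MAGIC, version, build, crc) + payload


def read_file(blob):
    """Validate magic and checksum, then return (version, build, payload)."""
    magic, version, build, crc = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("unrecognised magic bytes")
    payload = blob[HEADER.size:]
    if (zlib.crc32(payload) & 0xFFFFFFFF) != crc:
        raise ValueError("payload checksum mismatch")
    return version, build, payload
```

The build number is not needed to read the file; it is there so that a bug report with an attached file identifies the version that wrote it.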
And here are some optional ones:
If you do need forward compatibility, you need some way of expressing which "chunks" are "optional" (like png does), so that a previous version of your software can skip over them gracefully.
If you expect these files to be found "in the wild", you might consider embedding some clue to find the spec. Imagine how helpful it would be to find the string http://www.w3.org/TR/PNG/ in a png file.
It all depends on the purpose of the format, of course.
One flexible approach is to structure entire file as TLV (Tag-Length-Value) triplets.
For example, make your file comprised of records, each record beginning with a 4-byte header:
1 byte = record type
3 bytes = record length
followed by record content
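The record layout above could be sketched as follows. The byte order of the 3-byte length is not specified above, so big-endian is assumed here, and the function names are hypothetical.

```python
def pack_record(rec_type, content):
    """Build one record: 1 byte type, 3 bytes length (big-endian), content."""
    if not 0 <= rec_type <= 0xFF:
        raise ValueError("record type must fit in one byte")
    if len(content) > 0xFFFFFF:
        raise ValueError("record content too long for a 3-byte length")
    return bytes([rec_type]) + len(content).to_bytes(3, "big") + content


def iter_records(blob):
    """Yield (type, content) for each record in a buffer of packed records."""
    pos = 0
    while pos < len(blob):
        if pos + 4 > len(blob):
            raise ValueError("truncated record header")
        rec_type = blob[pos]
        length = int.from_bytes(blob[pos + 1:pos + 4], "big")
        start = pos + 4
        end = start + length
        if end > len(blob):
            raise ValueError("truncated record content")
        yield rec_type, blob[start:end]
        pos = end  # the length field lets a reader skip unknown types
```

A reader that does not recognise a record type can simply ignore the yielded content and move on, since the length is enough to find the next header.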
Regarding the endianness, if you store an endianness indicator in the file, all your applications will have to support all endianness formats. On the other hand, if you specify a particular endianness for your files, only applications on platforms with non-matching endianness will have to do additional work, and it can be decided at compile time (using conditional compilation).
Another point, taken from the .xz file spec (http://tukaani.org/xz/xz-file-format.txt): one of the first few bytes should be a non-character, "to prevent applications from misdetecting the file as a text file.". Not sure how many header bytes are usually inspected by editors and other tools, but using a non-character byte in the first four or eight bytes seems useful.
One of the most important things to know before even starting is how your file will be used.
Will random or sequential access be the norm?
How often will the data be read?
How often will the data be written?
Will you write out the file in one go or will you be slowly writing it as data comes in?
Will the file need to be portable? Not all formats need to be.
Does it need to be compatible with other versions? Maybe updating the file is sufficient.
Does it need to be easy to read/write?
Size/Speed/Complexity tradeoff.
Most answers here give good advice on the portability/compatibility front so I am not going to add more. But consider the following (often overlooked) things.
Some files are often written and rarely read (backups, logs, ...) and you may want to focus on filesize and easy-writing.
Converting endianness is (relatively) slow. If your file will never leave the host, or leaves rarely enough that conversion is a good option, you can get a significant performance boost by writing in native byte order. Consider writing a number such as 0x1234 as part of the header so that you can detect a mismatch (and instruct the user to convert) if this is the case.
Sometimes easy reading is really useful. If you are doing logs or text documents, consider compressing all in one go rather than per-entry so that you can zcat | strings the file and see what is inside.
There are many things to keep in mind and designing a good format takes a lot of planning and foresight. The little things such as zcating a file and getting useful information or the small performance boost from using native integers can give your product an edge, however you need to be careful that you don't sacrifice something important to get it.
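The 0x1234 header marker mentioned above could look like this in Python. This is a sketch: the 2-byte width and the function names are assumptions, and a real header would contain more than the marker.

```python
import struct
import sys

MARKER = 0x1234


def write_marker():
    """Write the marker in the byte order native to the writing host."""
    return struct.pack("=H", MARKER)


def detect_byte_order(header):
    """Return 'little' or 'big' depending on how the marker was written."""
    if struct.unpack("<H", header[:2])[0] == MARKER:
        return "little"
    if struct.unpack(">H", header[:2])[0] == MARKER:
        return "big"
    raise ValueError("byte order marker not recognised")


header = write_marker()
# True would mean the file was written on a host with the other byte order.
needs_conversion = detect_byte_order(header) != sys.byteorder
```

The marker works because 0x1234 reads as 0x3412 when interpreted in the wrong byte order, so the two cases cannot be confused.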
One way to future proof the file would be to provide for blocks. Straight after your file header data, you can begin the first block. The block could have a byte or word code for the type of block, then a size in bytes. Now you can arbitrarily add new block types, and you can skip to the end of a block.
I would consider defining a substructure that higher levels use to store data, a little like a mini file system inside the file.
For example, even though your file format is going to store application-specific data, I would consider defining records / streams etc. inside the file in such a way that application-agnostic code is able to understand the layout of the file, but not of course understand the opaque payloads.
Let's get a little more concrete. Consider the usual ways of storing data in memory: generally they can be boiled down to either contiguous expandable arrays / lists, pointer/reference-based graphs, and binary blobs of data in particular formats.
Thus, it may be fruitful to define the binary file format along similar lines. Use record headers which indicate the length and composition of the following data, whether it's in the form of an array (a list of identically-typed records), references (offsets to other records in the file), or data blobs (e.g. string data in a particular encoding, but not containing any references).
If carefully designed, this can permit the file format to be used not just for persisting data in and out all in one go, but on an incremental, as-needed basis. If the substructure is properly designed, it can be application agnostic yet still permit e.g. a garbage collection application to be written, which understands the blobs, arrays and reference record types, and is able to trace through the file and eliminate unused records (i.e. records that are no longer pointed to).
That's just one idea. Other places to look for ideas are in general file system designs, or relational database physical storage strategies.
Of course, depending on your requirements, this may be overkill. You may simply be after a binary format for persisting in-memory data, in which case an approach to consider is tagged records.
In this approach, every piece of data is prefixed with a tag. The tag indicates the type of the immediately following data, and possibly its length and name. Lists may be suffixed with an "end-list" tag that has no payload. The tag may have an embedded identifier, so tags that aren't understood can be ignored by the serialization mechanism when it's reading things in. It's a bit like XML in this respect, except using binary idioms instead.
Actually, XML is a good place to look for long-term longevity of a file format. Look at its namespacing capabilities. If you construct your reading and writing code carefully, it ought to be possible to write applications that preserve the location and content of tagged (recursively) data they don't understand, possibly because it's been written by a later version of the same application.
Make sure that you reserve a tag code (or better yet reserve a bit in each tag) that specifies a deleted/free block/chunk.
Blocks can then be deleted by simply changing a block's current tag code to the deleted tag code or set the tag's deleted bit.
This way you don't need to right away completely restructure your file when you delete a block.
Reserving a bit in the tag provides the option of possibly undeleting the block
(if you leave the block's data unchanged).
For security, however you might want to zero out the deleted block's data, in this case you would use a special deleted/free tag.
I agree with Stepan that you should choose an endianness, but I would also have an endianness indicator in the file.
If you use an endianness indicator you might consider using
one of the Unicode Byte Order Marks, which can also serve as an indicator of the Unicode text encoding used for any text blocks. The BOM is usually the first few bytes of Unicode text files, so if your BOM is the first entry in your file there might be a problem of some utility identifying your file as Unicode text (I don't think this is much of an issue).
I would treat/reserve the BOM as one of your normal tags (using either the UTF-16 BOM if using 16-bit tags or the UTF-32 BOM if using 32-bit tags) with a 0-length block/chunk.
See also http://en.wikipedia.org/wiki/File_format
I agree with atzz's suggestion of using a Tag Length Value system. For future compatibility, you could store a set of "pointers" to TLV entries at the start (or maybe Tag,Pointer and have the pointer point to a Length,Value; or perhaps Tag,Length,Pointer and then have all the data together elsewhere?).
So, my file could look something like:
magic number/file id
version
tag for first data entry
pointer to first data entry --------+
tag for second data entry |
pointer to second data entry |
... |
length of first data entry <--------+
value for first data entry
...
magic number, version, tags, pointers and lengths would all be a predefined set length, for easy decoding. Say, 2 bytes. Or 4, depending on what you need. They don't all need to be the same (eg, all tags are 1 byte, pointers are 4 etc).
The tag lets you know what is being stored. The pointer tells you where (either an offset or absolute value, in bytes), the length tells you how large the data is, and the value is length bytes of data of type tag.
If you use a MyFileFormat v1 decoder on a MyFileFormat v2 file, the pointers allow you to skip sections which the v1 decoder doesn't understand. If you simply skip invalid tags, you can probably simply use TLV instead of TPLV.
I would either hand code something like that, or maybe define my format in ASN.1 and generate a codec (I work in telecommunications, so ASN.1/TLV makes sense to me :-D)
If you're dealing with variable-length data, it's much more efficient to use pointers: Have an array of pointers to your data, ideally near the start of the file, rather than storing the data in an array directly.
Indirection is preferable in this instance because it allows random access, which is otherwise only possible if all items are the same size. If the data were stored directly in an array, without specifying the locations of any records, data access would take O(n) time in the worst case; in order for your file-reading code to access a particular element it would have to know the length of all previous elements, and the only way to find that out is to look at each one. If you're reading the entire file at once, then you'd be doing this anyway, so it wouldn't be a problem. But if you only want one thing, then this isn't the way to go.
Whereas with an array of pointers, it's O(1) time all around: all you need is an index number, and you can retrieve and follow the pointer to get at your data.
When writing a file using this method, you would of course have to build up your table in memory before doing any writing.
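A sketch of the pointer-table layout, assuming little-endian 4-byte offsets measured from the start of the file and a 4-byte item count in front. These widths and the function names are illustrative choices only.

```python
import struct


def build_file(items):
    """Layout: item count, then one offset per item, then the item data."""
    table_size = 4 + 4 * len(items)
    offsets = []
    pos = table_size
    for item in items:
        # The table is built in memory before anything is written.
        offsets.append(pos)
        pos += len(item)
    header = struct.pack("<I", len(items))
    header += b"".join(struct.pack("<I", off) for off in offsets)
    return header + b"".join(items)


def read_item(blob, index):
    """Fetch one variable-length item using only its table entry."""
    (count,) = struct.unpack_from("<I", blob, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    (start,) = struct.unpack_from("<I", blob, 4 + 4 * index)
    if index + 1 < count:
        (end,) = struct.unpack_from("<I", blob, 4 + 4 * (index + 1))
    else:
        end = len(blob)  # last item runs to the end of the data
    return blob[start:end]
```

Reading item i touches two table entries and the item itself, regardless of how many items precede it.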

1-1 mappings for id obfuscation

I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x mult 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x mult 1113) mod 65536 to get back to the internal id=3.
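The example transformation can be written out as below. Python is used for the sketch; the same integer arithmetic carries over to PHP. The function names are hypothetical. The inverse works because 1001 * 1113 = 1114113 = 17 * 65536 + 1.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
MODULUS = 65536
MULTIPLIER = 1001
INVERSE = 1113  # 1001 * 1113 leaves remainder 1 when divided by 65536


def to_base36(n):
    """Render a non-negative integer in base 36."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(DIGITS[r])
    return "".join(reversed(out))


def encode_id(internal_id):
    """Internal id -> public id string."""
    return to_base36(internal_id * MULTIPLIER % MODULUS)


def decode_id(public_id):
    """Public id string -> internal id."""
    return int(public_id, 36) * INVERSE % MODULUS
```

Any odd multiplier has an inverse modulo 65536, so other constants could be substituted as long as the pair multiplies to 1 modulo the range size.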
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.
I find that simple XOR encryption is best suited for URL obfuscation. You can continue using whatever serial number you are using without change. Further, XOR encryption doesn't increase the length of the source string. If your text is 22 bytes, the encrypted string will be 22 bytes too. It's not as easy to guess as rot13, but not as heavyweight as DES/RSA.
Search the net for PHP XOR encryption to find some implementation. The first one I found is here.
I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number scrambling algorithm, involving mixed radices. Basically I have a function that maps a number between 0-N to another number in the 0-N range. For URLS I then map that number to a couple of english words. (words are easier to remember).
A simplified version of what I do, without mixed radices: You have a number that is 32 bits, so ahead of time, have a passkey which is 32-bits long, and XOR the passkey with your input number. Then shuffle the bits around in a determinate reordering. (possibly based on your passkey).
The nice thing about this is
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDS internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
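A sketch of the simplified XOR-and-shuffle scheme. The passkey value and the particular bit permutation (bit i moves to position (7 * i + 3) mod 32) are arbitrary assumptions chosen so that the example is deterministic; they are not the values used in the answer. This is obfuscation only, not encryption.

```python
BITS = 32
MASK = (1 << BITS) - 1
KEY = 0x5DEECE66  # hypothetical 32-bit passkey

# Fixed reordering of bit positions. Multiplying by 7 modulo 32 is a
# bijection because 7 and 32 share no common factor.
PERM = [(7 * i + 3) % BITS for i in range(BITS)]
INV = [0] * BITS
for src, dst in enumerate(PERM):
    INV[dst] = src


def _permute(value, perm):
    """Move bit i of value to position perm[i]."""
    out = 0
    for src, dst in enumerate(perm):
        if (value >> src) & 1:
            out |= 1 << dst
    return out


def obfuscate(n):
    """XOR with the passkey, then shuffle the bits."""
    return _permute((n ^ KEY) & MASK, PERM)


def deobfuscate(n):
    """Undo the shuffle, then undo the XOR."""
    return _permute(n, INV) ^ KEY
```

Because both steps are bijections on 32-bit values, distinct ids always map to distinct outputs and nothing extra needs to be stored.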
If you're up for the mixed radix version, it's basically the same, except that I add the steps of converting the input to a mixed radix number, using the maximum range's prime factors as the digits' bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs - the "Type 1" algorithm uses both the MAC address of the PC, and the current date/time as inputs. This guarantees that collisions are simply impossible.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.
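As a small illustration in Python (the helper name is made up), the standard `uuid` module can produce both kinds. Note that a version 1 value embeds the timestamp and node ID, so values generated close together look alike; version 4 is built from random bits instead.

```python
import uuid

def new_public_id(time_based=False):
    """Return a GUID as a 36-character string."""
    # uuid1(): version 1, built from the timestamp and the node ID (usually the MAC address).
    # uuid4(): version 4, built from random bits.
    value = uuid.uuid1() if time_based else uuid.uuid4()
    return str(value)

print(new_public_id(time_based=True))
print(new_public_id())
```

Either way, keeping the unique constraint on the column is cheap insurance.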
I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularly clever).
It uses Python:
def to_base(q, alphabet):
    # Convert a non-negative integer q to a string, using alphabet as the digits.
    if q < 0:
        raise ValueError("must supply a non-negative integer")
    l = len(alphabet)
    converted = []
    while q != 0:
        # Peel off the least significant digit and prepend it.
        q, r = divmod(q, l)
        converted.insert(0, alphabet[r])
    return "".join(converted) or '0'

def to36(q):
    return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')
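To go back from the URL to the numeric ID you need the inverse. A minimal sketch follows; `from_base` and `from36` are names I made up, they are not part of the quoted code.

```python
def from_base(s, alphabet):
    """Read s as a number whose digits are drawn from alphabet (inverse of to_base)."""
    base = len(alphabet)
    value = 0
    for ch in s:
        # alphabet.index raises ValueError if ch is not a valid digit.
        value = value * base + alphabet.index(ch)
    return value

def from36(s):
    return from_base(s, '0123456789abcdefghijklmnopqrstuvwxyz')

print(from36('rs'))
```

For this particular alphabet the built-in `int(s, 36)` does the same job. Keep in mind that this is only a change of notation: consecutive IDs still produce consecutive-looking strings.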
Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
Swiped from here.
UPDATED As with the GUID approach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different than using a GUID, except that the customer and Customer Service will have an easier time referring to the order.
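A sketch of the unique-column variant, using Python and an in-memory SQLite database purely for illustration (the table layout and function name are assumptions). Instead of check-then-update, it lets the unique constraint reject duplicates and retries on failure, which also covers the race between two threads. I've used the `secrets` module rather than a time-seeded generator so the numbers are harder to predict; that is my substitution, not part of the original recipe.

```python
import secrets
import sqlite3

def create_order(conn, customer, max_attempts=10):
    """Insert an order with a random, zero-padded 10-digit order_number."""
    for _ in range(max_attempts):
        # Random integer in 1..9999999999, padded with zeros to 10 characters.
        order_number = f"{secrets.randbelow(9_999_999_999) + 1:010d}"
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO orders (customer, order_number) VALUES (?, ?)",
                    (customer, order_number),
                )
            return order_number
        except sqlite3.IntegrityError:
            continue  # duplicate order_number: pick another one
    raise RuntimeError("could not find a free order number")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "customer TEXT, "
    "order_number CHAR(10) UNIQUE)"
)
print(create_order(conn, "alice"))
```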
I've found a much simpler way. Say you want to map the N values 0..N-1 pseudorandomly onto the same N values. Find the next prime above N, and make your function
prandmap(x) return x * nextPrime(N) % N
This produces a function with period N: no number is produced twice until x = N, where the sequence starts over. It always starts at 0 (prandmap(0) is 0), but is pseudorandom thereafter.
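A runnable sketch of that function in Python (3.8 or later, for `math.isqrt` and the modular inverse via `pow`). The helper `next_prime` and the inverse function are my additions for illustration.

```python
import math

def next_prime(n):
    """Smallest prime strictly greater than n (trial division, fine for small n)."""
    candidate = max(n + 1, 2)
    while any(candidate % d == 0 for d in range(2, math.isqrt(candidate) + 1)):
        candidate += 1
    return candidate

def prandmap(x, n):
    """Map x in 0..n-1 to another value in 0..n-1 without collisions."""
    return (x * next_prime(n)) % n

def prandmap_inverse(y, n):
    """Undo prandmap by multiplying with the modular inverse of the prime."""
    return (y * pow(next_prime(n), -1, n)) % n

print([prandmap(x, 20) for x in range(20)])
```

The mapping is collision-free because a prime larger than N shares no factor with N. One caveat worth checking before relying on it: only the prime modulo N matters, so when N + 1 happens to be prime (N = 10 or N = 100, for example) the multiplier is effectively 1 and the function returns its input unchanged.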
I honestly think encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with query string data, you have to resort to some JavaScript hacks to send data by POST (keep accessibility in mind for users with JavaScript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive data from being indexed by search engines, assuming the data you're trying to hide is really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.
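For example, a one-line sketch in Python (the function name is made up; it assumes the customer ID is an integer and that one invoice is issued per customer per month):

```python
from datetime import date

def invoice_key(invoice_month: date, customer_id: int) -> str:
    """Build a "yyyyMM[customerID]" key, e.g. March 2009 for customer 4711."""
    return f"{invoice_month:%Y%m}{customer_id}"

print(invoice_key(date(2009, 3, 1), 4711))
```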
From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
the day that you're processing so many invoices a second that AES encrypting them in transit costs too much performance, go out and buy yourself a big fat server with lots of CPUs to celebrate.
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".
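To make the 128-byte plaintext layout from the first option concrete, here is a sketch in Python instead of Java. It only builds and checks the buffer; the AES step is deliberately left out, because Python's standard library has no AES and I don't want to guess at a third-party API. SHA-512 is my choice for the hash simply because its digest is exactly 64 bytes, and the function names are made up.

```python
import hashlib
import hmac
import os
import struct

def build_plaintext(key_id: int) -> bytes:
    """4-byte key + 60 random bytes + 64-byte hash of those 64 bytes = 128 bytes."""
    head = struct.pack(">I", key_id) + os.urandom(60)
    return head + hashlib.sha512(head).digest()

def parse_plaintext(buf: bytes) -> int:
    """Return the key if the hash checks out, otherwise raise ValueError."""
    if len(buf) != 128:
        raise ValueError("wrong length")
    head, digest = buf[:64], buf[64:]
    # Constant-time comparison of the expected and received hash.
    if not hmac.compare_digest(hashlib.sha512(head).digest(), digest):
        raise ValueError("hash mismatch")
    return struct.unpack(">I", head[:4])[0]

buf = build_plaintext(1042)  # this is what would be AES-encrypted before sending
print(len(buf), parse_plaintext(buf))
```

On the receiving side you would decrypt first, then call something like `parse_plaintext` and only touch the database if it returns without raising.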