Generating unique account numbers - recursive call

Hi, I need to generate unique 9-digit account numbers. Here is my pseudocode:
function generateAccNo()
generate an account number between 100,000,000 and 999,999,999
if the account number already exists in the DB
call generateAccNo() /* recursive call */
else
return new account number
end if
end function
The function seems to be working well, however I am a bit worried about the recursive call.
Will this cause any memory leaks (PHP 5 under apache)?
Is this an acceptable way to tackle this problem?
Thanks for your input.

You realize this could very well cause a stack overflow, right? As the number of customers increases, the probability of not finding an acceptable account number increases.
Also, why can't you just use sequential account numbers, increasing by one every time? With this approach, you'd only have to read the max id currently in the database and increment it.
Sorry to be so blunt, but your solution is a terrible way to tackle the problem. It will use tons of memory (as the stack can grow without bound) and it will make tons of expensive calls to the database.
You should really consider some other approach:
I strongly recommend just incrementing the customer number every time you create a customer. In fact, if you set up your db properly (with auto increment on the id column), you won't even have to set the id. The id will be set for you whenever you insert a new customer.
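For instance, with MySQL and PDO the whole flow reduces to an insert plus lastInsertId() (a sketch only; the accounts table, its columns, and the connection details are assumptions, not from the question):
<?php
// Sketch: let MySQL assign the account number via an AUTO_INCREMENT id column.
$pdo = new PDO('mysql:host=localhost;dbname=bank', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO accounts (name) VALUES (?)');
$stmt->execute(array('Jane Doe'));
$accountNumber = $pdo->lastInsertId(); // the id MySQL just assigned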

I really don't think it comes down to recursion vs. looping; both are prone to problems as the dataset grows and if the random number generation is not correctly implemented. Two ideas come to mind:
1. GUID
If a truly unique id is required with as little effort as possible, consider a GUID. Your DB will most likely be able to assign one for you on insert; if not, create one in code. It is guaranteed to be unique, although it is not very user friendly. However, in combination with a sequential AccountRecordId generated by the DB on insert, you would have a solid combination.
2. Composite Key: Random + Sequential
One way to address all the needs, although at the surface it feels a bit kludgy, is to create a composite account number from a sequential db key of 5 digits (or more) plus another 5 digits of randomness. If the random number were duplicated it would not matter, as the sequential id would guarantee the uniqueness of the entire account number (see the sketch after this list).
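A rough PHP sketch of the composite idea, assuming the sequential part comes back from an auto-increment column on insert (function and variable names here are illustrative, not from the answer):
<?php
// Sketch: composite account number = 5-digit (or longer) sequential id + 5 random digits.
// $sequentialId is assumed to be the value of an auto-increment column returned on insert.
function buildAccountNumber($sequentialId) {
    $sequentialPart = str_pad($sequentialId, 5, '0', STR_PAD_LEFT); // grows past 5 digits eventually
    $randomPart = str_pad(mt_rand(0, 99999), 5, '0', STR_PAD_LEFT);
    // Uniqueness comes entirely from the sequential part; the random part only obscures ordering.
    return $sequentialPart . $randomPart;
}
echo buildAccountNumber(42); // e.g. "0004271593"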

There's no need to use a recursive call here. Run a simple while loop in the function testing against non-existence as the conditional, e.g.
function generateAccNo()
generate an account number between 100,000,000 and 999,999,999
while ( the account number already exists in the DB ) {
generate new account number;
}
return new account number
end function
Randomly generating-and-testing is a sub-optimal approach to generating unique account numbers, though, if this code is for anything other than a toy.
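A PHP rendering of that loop could look like the following sketch; accountExists() is an assumed helper that runs a SELECT against your accounts table and returns true or false:
<?php
// Sketch: loop instead of recursion. accountExists() is an assumed helper, not a built-in.
function generateAccNo() {
    do {
        $candidate = mt_rand(100000000, 999999999);
    } while (accountExists($candidate)); // retry on collision
    return $candidate;
}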

It seems fine, but I think you need some sort of die condition: how many times are you going to let this run before you give up?
I know this seems unlikely with the huge number range, but something could go wrong that just drops you back to the previous call, which will call itself again, ad nauseam.

Generating account numbers sequentially is a security risk - you should find some other algorithm to do it.

Alternatively, you can maintain a separate table containing a buffer of generated account numbers that are known to be unique. This table should have an auto-incrementing integer id. When you want an account number, simply pull the record with the lowest id in the buffer and remove it from that table. Have some process that runs regularly which replenishes the buffer and makes sure its capacity is >> normal usage. The advantage is that the time the end user spends waiting for an account number to be created will be essentially constant.
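A sketch of that buffer approach with MySQL/InnoDB and PDO (table name, columns, and connection details are assumptions):
<?php
// Sketch: pull the next pre-generated account number from a buffer table.
// Assumed schema: account_number_buffer(id INT AUTO_INCREMENT PRIMARY KEY,
//                                       account_number INT UNSIGNED UNIQUE)
$pdo = new PDO('mysql:host=localhost;dbname=bank', 'user', 'pass');
$pdo->beginTransaction();
// Lock the lowest-id row so two concurrent requests cannot grab the same number.
$row = $pdo->query('SELECT id, account_number FROM account_number_buffer
                    ORDER BY id LIMIT 1 FOR UPDATE')->fetch(PDO::FETCH_ASSOC);
if ($row === false) {
    $pdo->rollBack();
    throw new RuntimeException('Buffer is empty; the replenishing job has fallen behind');
}
$delete = $pdo->prepare('DELETE FROM account_number_buffer WHERE id = ?');
$delete->execute(array($row['id']));
$pdo->commit();
$accountNumber = $row['account_number'];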
Also, I should note that it is not really about the processing overhead or risks of recursion versus iteration; the real issue is determinism and the overhead of repeated database queries. I like TheZenker's solution of random + sequential: it is guaranteed to generate a unique id without adding unnecessary overhead.

You do not need to use recursion here. A simple loop would be just as fast and consume less stack space.

You could put it in a while loop:
function generateAccNo()
while (true) {
generate an account number between 100,000,000 and 999,999,999
if the account number already exists in the DB
/* do nothing */
else
return new account number
end if
}
end function

Why not:
lock_db
do
account_num <= generate number
while account_num in db
put row with account_num in db
unlock_db
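In MySQL terms that pseudocode might translate to something like this sketch (table/column names and the use of LOCK TABLES are assumptions; the lock should only be held briefly):
<?php
// Sketch of the lock / generate-until-unused / insert / unlock idea.
$pdo = new PDO('mysql:host=localhost;dbname=bank', 'user', 'pass');
$pdo->exec('LOCK TABLES accounts WRITE');       // lock_db
$check = $pdo->prepare('SELECT 1 FROM accounts WHERE account_number = ?');
do {
    $accountNumber = mt_rand(100000000, 999999999);
    $check->execute(array($accountNumber));
} while ($check->fetchColumn() !== false);      // while account_num in db
$insert = $pdo->prepare('INSERT INTO accounts (account_number) VALUES (?)');
$insert->execute(array($accountNumber));        // put row with account_num in db
$pdo->exec('UNLOCK TABLES');                    // unlock_db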

Why not have the database handle this? In SQL Server, you can just have an identity column that starts at 100000000. Or you could use SQL in whatever DB you have. Just get the max id plus 1.
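If the database is MySQL rather than SQL Server, the equivalent would be an AUTO_INCREMENT column with a starting value, roughly like this sketch (table and column names are assumptions):
<?php
// Sketch: MySQL equivalent of an identity column starting at 100000000.
$pdo = new PDO('mysql:host=localhost;dbname=bank', 'user', 'pass');
$pdo->exec('CREATE TABLE accounts (
                account_number INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                name VARCHAR(255) NOT NULL
            ) ENGINE=InnoDB AUTO_INCREMENT=100000000');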

Related

What would be the most efficient way to generate a Discord-like username discriminator?

What's the most efficient way to find a random, yet unique, username discriminator, similar to the ones Discord and other services have begun using?
For instance, 1000 users may have the username JohnSmith but they'll all have distinct discriminators. So one user may be JohnSmith#3482 while another is JohnSmith#4782. When a new user registers (or changes their username) to JohnSmith, what would be the most efficient way to find an available discriminator?
For this example, let's assume a discriminator is numeric and between 0000-9999, and always 4 digits.
One method would be to fetch the discriminators of all users with the name JohnSmith and loop over them with an incremental counter until it found a number not occupied. However, this would be loading a lot of rows and wouldn't result in a truly random number.
Another option would be the same as the first, but generate a random number and check it against the results until an opening is found. This, however, could result in a very long process if only 1 opening exists, and would require tracking already-tested numbers to know when all options have been exhausted.
A third option, a hybrid of the two, would be to find all unused discriminators, persist them to an array, then randomly select one from the array.
Is there an easier way or more efficient manner than these?
Fetching all would be the slowest and most expensive option. This kind of application should already be optimized to look up a user by username for many reasons, such as login. So the best way is to generate a random number and look up whether it exists.
If you are expecting a lot of collisions then that could get expensive as well. For that I would suggest increasing the total range of the discriminator so that collisions are less likely: 100 users => 1000-9999, 1000 users => 10000-99999.
You can try some other hacky ways, like picking the last JohnSmith, adding a small random number to it mod the total number, and then checking for a collision. That should give a similar random distribution. But having a large range is the best option.
Something similar is how git determines the short SHA: https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection
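A sketch of that generate-and-check approach in PHP (the users table, its columns, and the helper name are assumptions):
<?php
// Sketch: pick a random 4-digit discriminator and retry on collision.
// Assumes a unique index on (username, discriminator) so concurrent inserts still fail safely.
function pickDiscriminator(PDO $pdo, $username, $maxAttempts = 50) {
    $check = $pdo->prepare('SELECT 1 FROM users WHERE username = ? AND discriminator = ?');
    for ($i = 0; $i < $maxAttempts; $i++) {
        $candidate = str_pad(mt_rand(0, 9999), 4, '0', STR_PAD_LEFT);
        $check->execute(array($username, $candidate));
        if ($check->fetchColumn() === false) {
            return $candidate; // free slot found
        }
    }
    return null; // the name is (nearly) full; widen the range or reject the username
}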

Can I check the size of a record with Rails?

Is there a Rails method to return the data size in bytes of a record?
Let's say I have a table called Item. Is there a method something like @item.data_size that would return "xx bytes"?
I have a mysql database.
Not sure if there's a native way of doing it like in C, but try this (it might include the size of the class, which is different from the single SQL row):
require 'objspace'
ObjectSpace.memsize_of(@my_item)
first way
require "rubygems"
require "knjrbfw"
analyzer = Knj::Memory_analyzer::Object_size_counter.new(my_hash_object)
puts "Size: #{analyzer.calculate_size}"
second way
require 'objspace'
h = {"a"=>1, "b"=>2}
p ObjectSpace.memsize_of(h)
Measure the memory taken by a Ruby object (by Robert Klemme)
Fortunately, no. As far as I know it's impossible to determine record size in MySQL, as well as in most databases. This is due to the following reasons; I'll list only the most obvious ones:
A record may include an association, i.e. a link to another record in another table, and it's completely unclear how to count this; moreover, it's unclear how to interpret the result of such a calculation.
A record has some overhead, such as indexes; should the calculation include it or not?
So, this means such a record size would be very approximate and average by nature. If such a method existed it could cause lots of confusion. However, that doesn't mean this can't be done at all. Referring to this SO answer, it is possible to get the table size. You could try to seed your database with millions of typical records of fake data, e.g. using the ffaker gem, get the size and divide by the number of records. This should give a very good number for your particular situation.
As a next step you may check whether the average record size correlates with the object size in memory. This may be pretty interesting.
Cheers!
Yes, you can count the total number of records by accessing your model. As an example you can try out this code:
@items = Item.all
@items.size, @items.count or @items.length will return the total number of records held in the @items variable. Or you can use count directly on the model: Item.count will return the total number of records in the database.

Assigning random integer to selected rows within certain range

I want to assign an integer to a set of rows in a MySQL Database table. I want the integer to be unique for each row from 1 to the number of rows in the table. So if there are twenty rows matching specific criteria in the table I want to assign them each a number between 1 and 20 so that if selected by this number they will be returned randomly and not in the order they were entered.
Can anybody recommend a solution for this?
Thanks,
Mick
There's a naive way to solve this and also a very cool and efficient way to solve this.
The naive way would be to make a list in memory of 0-20, shuffle it using a random number generator, and store the results back in the table.
The better way involves something called format preserving encryption, which is used in the real world for things like generating a new, cryptographically secure "random" credit card number and being guaranteed that the number hasn't been used before.
Basically, using format preserving encryption, you'd transform each value 0-N into a new number that will also be 0-N using cryptographic techniques. Due to the nature of encryption (mainly, that it's reversible) you are guaranteed not to get duplicates (unlike a hash function). This gets to be amazingly more efficient than actually shuffling when N gets very large.
I suggest reading up on the topic. In case it helps, check out my blog post on the subject: http://blog.demofox.org/2013/07/06/fast-lightweight-random-shuffle-functionality-fixed/
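The naive version is easy to sketch in PHP against MySQL (table and column names are assumptions):
<?php
// Sketch of the naive approach: shuffle the positions 1..N in memory,
// then write them back to the matching rows.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$ids = $pdo->query('SELECT id FROM items WHERE matches_criteria = 1')
           ->fetchAll(PDO::FETCH_COLUMN);
$positions = range(1, count($ids));
shuffle($positions); // PHP's built-in pseudo-random shuffle
$update = $pdo->prepare('UPDATE items SET random_order = ? WHERE id = ?');
foreach ($ids as $i => $id) {
    $update->execute(array($positions[$i], $id));
}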

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
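In code that is just a hex-to-binary conversion before storage; PHP shown purely for illustration (the surrounding app is Rails, but the idea is the same in any language):
<?php
// Sketch: store the 20 raw bytes of a SHA1 (e.g. in a BINARY(20) column)
// instead of its 40-character hex representation.
$hex = sha1('some object identifier'); // 40 hex characters
$raw = pack('H*', $hex);               // 20 raw bytes
$backToHex = bin2hex($raw);            // convert back for display or Solr queries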
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.

Unique, numeric, incremental identifier

I need to generate unique, incremental, numeric transaction id's for each request I make to a certain XML RPC. These numbers only need to be unique across my domain, but will be generated on multiple machines.
I really don't want to have to keep track of this number in a database and deal with row locking etc on every single transaction. I tried to hack this using a microsecond timestamp, but there were collisions with just a few threads - my application needs to support hundreds of threads.
Any ideas would be appreciated.
Edit: What if each transaction id just has to be larger than the previous request's?
If you're going to be using this from hundreds of threads, working on multiple machines, and require an incremental ID, you're going to need some centralized place to store and lock the last generated ID number. This doesn't necessarily have to be in a database, but that would be the most common option. A central server that did nothing but serve IDs could provide the same functionality, but that probably defeats the purpose of distributing this.
If they need to be incremental, any form of timestamp won't be guaranteed unique.
If you don't need them to be incremental, a GUID would work. Potentially doing some type of merge of the timestamp + a hardware ID on each system could give unique identifiers, but the ID number portion would not necessarily be unique.
Could you use a pair of Hardware IDs + incremental timestamps? This would make each specific machine's IDs incremental, but not necessarily be unique across the entire domain.
---- EDIT -----
I don't think using any form of timestamp is going to work for you, for 2 reasons.
First, you'll never be able to guarantee that 2 threads on different machines won't try to schedule at exactly the same time, no matter what resolution of timer you use. At a high enough resolution, it would be unlikely, but not guaranteed.
Second, to make this work, even if you could resolve the collision issue above, you'd have to get every system to have exactly the same clock with microsecond accuracy, which isn't really practical.
This is a very difficult problem, particularly if you don't want to create a performance bottleneck. You say that the IDs need to be 'incremental' and 'numeric' -- is that a concrete business constraint, or one that exists for some other purpose?
If these aren't necessary you can use UUIDs, which most common platforms have libraries for. They allow you to generate many (millions!) of IDs in very short timespans and be quite comfortable with no collisions. The relevant article on wikipedia claims:
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.
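As a rough illustration only (a library is the usual choice in practice, and random_bytes assumes PHP 7+), a version-4 UUID can be built from 16 random bytes:
<?php
// Sketch: random (version 4) UUID from 16 random bytes.
function uuid4() {
    $bytes = random_bytes(16);                        // PHP 7+; use a library on older versions
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);  // set the version field to 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);  // set the RFC 4122 variant bits
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}
echo uuid4();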
If you remove 'incremental' from your requirements, you could use a GUID.
I don't see how you can implement incremental across multiple processes without some sort of common data.
If you target a Windows platform, did you try the Interlocked API?
Google for GUID generators for whatever language you are looking for, and then convert that to a number if you really need it to be numeric. It isn't incremental though.
Or have each thread "reserve" a thousand (or million, or billion) transaction IDs and hand them out one at a time, and "reserve" the next bunch when it runs out. Still not really incremental.
I'm with the GUID crowd, but if that's not possible, could you consider using db4o or SQLite over a heavyweight database?
If each client can keep track of its own "next id", then you could talk to a central server and get a range of IDs, perhaps 1000 at a time. Once a client runs out of IDs, it will have to talk to the server again.
This would give your system a central source of IDs, and still avoid having to talk to the database for every id.
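One way to sketch that range reservation with MySQL/InnoDB (the counter table and block size are assumptions): keep a single counter row and atomically bump it by the block size inside a transaction.
<?php
// Sketch: reserve a block of 1000 ids from a central counter table, then hand them out locally.
// Assumed schema: id_counter(next_id BIGINT UNSIGNED NOT NULL) containing exactly one row.
function reserveIdBlock(PDO $pdo, $blockSize = 1000) {
    $pdo->beginTransaction();
    $row = $pdo->query('SELECT next_id FROM id_counter FOR UPDATE')->fetch(PDO::FETCH_ASSOC);
    $start = (int) $row['next_id'];
    $bump = $pdo->prepare('UPDATE id_counter SET next_id = ?');
    $bump->execute(array($start + $blockSize));
    $pdo->commit();
    return range($start, $start + $blockSize - 1); // ids this client may use without further DB calls
}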