mysql random generated value - mysql

I need to generate a random alphanumeric code to give to users, which they then enter when they come to the site. I don't know much about random numbers; I know there are seeding issues, but I'm not sure what they are.
So, I used this:
SELECT SUBSTRING(MD5(CONCAT_WS('-', MD5(username_usr),
    MD5(zip_usr), MD5(id_usr),
    MD5(created_usr))), -12) FROM users_usr
Is this safe? I used concat_ws because sometimes zip is null, but the others never are.
And yes, I know this is kind of short, but 1. they have to enter the last 4 of their social, 2. it's one-time use, 3. there's no private data displayed back in the application, and 4. I may use a captcha, but since there's no private data, that's probably overkill.
Thanks

Maybe using the Universal Unique Identifier would suffice? Just to keep it simple?

If you need a random alphanumeric value, why are you using so many variables? Something like the following should be perfectly enough:
md5(rand())
--Flavor: MySql

It'd help to know the purpose of the "random" string. This isn't random: it's repeatable, and fairly easily repeatable at that. You're not exposing any sensitive information in a way that's easily reversible, but I'm guessing you're really looking for a way to generate a UUID (universally unique ID). Not coincidentally, recent MySQL versions have a function called UUID.
http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid
That might better solve the problem you're trying to address. If you really want a random number (which can definitely have collisions, by the way) for some reason, don't worry about seeding. If you don't specify a seed, it'll self-seed in a way that's probably better than a fixed seed anyway. You'd then map that random number (or a series of random numbers) to a character (possibly by casting the integer to a char), and repeat that until you have a string of chars long enough. But it bears repeating that a random number is not a guaranteed unique number...
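The mapping described above (draw random numbers, turn each into a character, repeat until the string is long enough) can be sketched on the application side. This is a minimal Python sketch, not MySQL code; the function name random_code, the 12-character length and the alphabet are assumptions chosen for illustration. It uses the standard secrets module, which reads from the operating system's random source, so there is no manual seeding.

```python
import secrets
import string

# Assumed alphabet: lowercase letters and digits (36 symbols).
ALPHABET = string.ascii_lowercase + string.digits

def random_code(length=12):
    """Return a random alphanumeric string of the given length."""
    # secrets.choice picks one character per position from the OS random source.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

code = random_code()
```

As the answer says, nothing here guarantees uniqueness; if the codes are stored, a unique index on that column would still be needed to catch the occasional collision.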

Someone in the deleted duplicate of this question suggested using UUID(), which I think is a good idea. I don't think there's anything greatly wrong with using MD5(RAND()) either.
You'd have to store those, of course, which you don't have to do with your example.

SELECT MD5(RAND() + CURRENT_TIMESTAMP())

Related

Rails String Unique ID

I want to create a model that has an attribute that holds a string-based unique identifier.
I only want the unique string to be 3 characters long and consist of letters of the alphabet (lower case only) and numbers.
How do I implement something like the above? How do I avoid collisions? I have looked into MD5, and that seems along the lines of what I want to accomplish, but shorter. I am also willing to seed it with a time if that makes the approach deterministic.
I would love any feedback or pointers on this topic. Thanks!
EDIT:
One solution that has been on my mind is creating a table full of every single permutation, then randomly selecting as needed from the table, and deleting once used. Is this a bad approach?
Check out this SO thread; it's got plenty of good suggestions. Especially the last answer by Simone Carletti which points to this post.
There are quite a few options in the above post. The one I liked, and which might be useful for you, is the rufus-mnemo gem.
So the solution I decided to roll with after reading some of the questions & answers is quite different than what anyone had suggested.
I created a table to store codes. I wrote a ruby script to populate this table with every 3 letter combo based on the characters I wanted to use. Then on my model I have a before_save method assign a code to the instance if a code has not yet been assigned.
This approach ensures that I will never have a collision when assigning a code in the before_save. The slowest part is the generation of the table, but since I only have to do this once I can deal with this.
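The pre-generated table can be illustrated with a small sketch. This is Python rather than the Ruby script the poster wrote, and the in-memory CodePool class is a hypothetical stand-in for the codes table; only the idea (enumerate every 3-character code once, then hand each out at most once) comes from the post.

```python
import itertools
import random
import string

ALPHABET = string.ascii_lowercase + string.digits  # 36 symbols

def all_codes(length=3):
    """Every string of the given length over ALPHABET (36**length of them)."""
    return ["".join(chars) for chars in itertools.product(ALPHABET, repeat=length)]

class CodePool:
    """In-memory stand-in for the codes table: hands out each code at most once."""

    def __init__(self, codes):
        self._unused = list(codes)
        random.shuffle(self._unused)  # so codes are assigned in random order

    def take(self):
        # pop() removes the code from the pool, so it cannot be handed out again.
        # Raises IndexError once every code has been used.
        return self._unused.pop()

pool = CodePool(all_codes())
first, second = pool.take(), pool.take()
```

With 36 symbols and 3 positions there are only 46,656 codes, so the pool can run out. In a real database, taking a code would also need to happen inside a transaction so that two concurrent saves cannot claim the same row.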
This gem called alphadecimal might be able to help you.

Organize address cache

I need to organize a cache in a MySQL database that maps addresses to coordinates. What is the best practice for storing the address? Do I need to compress the address string or use it as is?
edit:
OK, let me restate my question.
How should I store a long string (up to 512 characters) in the database if I need to search by exactly this string in the future?
If you are absolutely certain that your search string can be normalized to avoid ambiguity (e.g. stripping all the extra spaces, forcing lower case, etc.) and that you only need to search for a full match (i.e. you either find exactly the normalized string or you don't, and you don't need to search by substring, soundex or partial match, sort by it, etc.; this is how I read your "by exactly this string"), you could consider calculating the hashcode of the string, storing it in the DB and indexing that.
If you use a hashcode function that returns a number, you will have a very efficient access index. And of course you can still keep the original string field for printing and for other access approaches.
Possible problems: while a good hash function minimizes the chance of a collision, it cannot guarantee that collisions never happen, so you should handle that case too.
Also, unless you have lots and lots of addresses, I doubt that the speedup gain will be worth the trouble.
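The hash-and-verify lookup described above can be sketched as follows. This is a minimal Python sketch under stated assumptions: CRC32 is picked only for illustration (any integer hash would do), the AddressCache class is a hypothetical stand-in for a table with an indexed integer hash column, and the coordinates in the usage lines are made-up values.

```python
import zlib

def normalize(address):
    """Collapse runs of whitespace and force lower case, as suggested above."""
    return " ".join(address.split()).lower()

def address_hash(address):
    """32-bit integer hash of the normalized address (CRC32, for illustration)."""
    return zlib.crc32(normalize(address).encode("utf-8"))

class AddressCache:
    """Toy stand-in for a table indexed on an integer hash column."""

    def __init__(self):
        self._rows = {}  # hash -> list of (normalized address, coordinates)

    def put(self, address, coords):
        key = normalize(address)
        bucket = self._rows.setdefault(address_hash(address), [])
        for i, (stored, _) in enumerate(bucket):
            if stored == key:
                bucket[i] = (key, coords)  # overwrite an existing entry
                return
        bucket.append((key, coords))

    def get(self, address):
        key = normalize(address)
        # Compare the full string too, so a hash collision cannot return the wrong row.
        for stored, coords in self._rows.get(address_hash(address), []):
            if stored == key:
                return coords
        return None

cache = AddressCache()
cache.put("1  Example Street,   Sometown", (12.5, 34.5))  # made-up coordinates
hit = cache.get("1 example street, SOMETOWN")
miss = cache.get("2 Example Street, Sometown")
```

In SQL terms the same idea would be a query that filters on both the hash column and the full string, so the index narrows the search and the string comparison settles any collision.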
MySQL can store coordinates and operate on these values; try looking at http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html
If you want something simpler, personally I usually store the city code, the city name and the rest of the address string separately. Then I can index and search on these fields (one by one, or in combination).
If you only need simple use of coordinates, you can just store the latitude/longitude and do basic comparisons.
Answer can be found here

Are there any non-obvious ways of abusing GUIDs?

GUIDs are typically used for uniquely identifying all kinds of entities: requests from external systems, files, whatever. They work like magic: you call a "GiveMeGuid()" function (UuidCreate() on Windows) and a fresh new GUID is at your service.
Given that my code really calls that "GiveMeGuid()" function each time I need a new GUID, is there any not-so-obvious way to misuse it?
Just found an answer to an old question: How deterministic Are .Net GUIDs?. Requoting it:
It's not a complete answer, but I can tell you that the 13th hex digit is always 4 because it denotes the version of the algorithm used to generate the GUID (id est, v4); also, and I quote Wikipedia:
Cryptanalysis of the WinAPI GUID generator shows that, since the sequence of V4 GUIDs is pseudo-random, given the initial state one can predict up to the next 250 000 GUIDs returned by the function UuidCreate. This is why GUIDs should not be used in cryptography, e.g., as random keys.
So, if you get lucky and hit the same seed, you'll break 250k mirrors in sequence. To quote another Wikipedia piece:
While each generated GUID is not guaranteed to be unique, the total number of unique keys (2^128 or 3.4×10^38) is so large that the probability of the same number being generated twice is extremely small.
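Both quoted facts (the version digit and the size of the key space) are easy to check. This is a minimal sketch using Python's standard uuid module, which is an assumption of convenience: it is not the WinAPI UuidCreate generator the quote analyses, but its uuid4() produces GUIDs in the same version 4 layout.

```python
import uuid

u = uuid.uuid4()        # random (version 4) UUID
hex_digits = u.hex      # 32 hex digits, no dashes

# The 13th hex digit (index 12) encodes the version of the algorithm.
version_digit = hex_digits[12]

# Size of the full 128-bit space mentioned in the quote.
total_values = 2 ** 128
```

Here version_digit is always "4" for a v4 GUID, and total_values is roughly 3.4×10^38, matching the quoted figure.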
Bottom line: one form of misuse may be to assume that a GUID is always unique.
It depends. Some implementations of GUID generation are time-dependent, so calling CreateGuid in quick succession MAY create clashing GUIDs.
edit: I now remember the problem. I was once working on some PHP code where the GUID-generating function was reseeding the RNG with the system time on each call. Don't do this.
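The reseeding problem can be reproduced in a few lines. This is a Python sketch of the same mistake rather than the original PHP; the function names and the fixed timestamp are hypothetical, and the 128-bit hex string is only a GUID-like value, not a properly formatted GUID.

```python
import random

def bad_guid_like(now):
    """Anti-pattern: reseed the generator from the clock on every call."""
    random.seed(now)                       # same timestamp -> same internal state
    return "%032x" % random.getrandbits(128)

def better_guid_like(rng):
    """Reuse one generator that was seeded once."""
    return "%032x" % rng.getrandbits(128)

same_second = 1_700_000_000                # hypothetical clock value, whole seconds
a = bad_guid_like(same_second)
b = bad_guid_like(same_second)             # second call within the same second

rng = random.Random(same_second)           # seeded once, then reused
c = better_guid_like(rng)
d = better_guid_like(rng)
```

The two calls that reseed from the same clock value return the identical string (a equals b), while the generator that is seeded once keeps advancing and yields different values.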
The only way I can see of misusing a Guid is trying to interpret the value in some logical manner. Not that it really invites you to do so, which is one of the characteristics of Guids that I really like.
Some GUIDs include an identifier of the machine they were generated on, so they can be used in client/server environments, but some don't. If yours doesn't, be sure not to use them in, for instance, a database that multiple clients access.
Maybe the entropy could be manipulated by playing with some parameters used to generate the GUIDs in the first place (e.g. interface identifiers).

MySQL: is SELECT with LIKE expensive?

The following question is regarding the speed between selecting an exact match (example: INT) vs a "LIKE" match with a varchar.
Is there much difference? The main reason I'm asking this is because I'm trying to decide if it's a good idea to leave IDs out of my current project.
For example, instead of:
http://mysite.com/article/391239/this-is-an-entry
Change to:
http://mysite.com/article/this-is-an-entry
Do you think I'll experience any performance problems in the long run? Should I keep the IDs?
Note:
I would use LIKE to make URLs easier for users to remember. For example, if they type "http://mysite.com/article/this-is-an", it would redirect to the correct article.
Regarding the number of pages, let's say I'm at around 79,230 and the app is growing fast, at something like 1,640 entries per day.
An INT comparison will be faster than a string (varchar) comparison. A LIKE comparison is even slower as it involves at least one wildcard.
Whether this is significant in your application is hard to tell from what you've told us. Unless it's really intensive, i.e. you're doing gazillions of these comparisons, I'd go with clarity for your users.
Another thing to think about: are users always going to type the URL? Or are they simply going to use a search engine? These days I simply search, rather than try to remember a URL, which would make this a non-issue for me as a user. What are your users like? Can you tell from your application how they access your site?
Firstly, I think it doesn't really matter either way. Yes, it will be slower, as a LIKE clause involves more work than a direct comparison, but the difference is negligible on normal sites.
This is easy to test by measuring how long your query takes to execute; there are plenty of examples to help you with that.
To move away slightly from your question, you have to ask yourself whether you even need to use a LIKE for this query, because 'this-is-an-entry' should be unique, right?
SELECT id, friendly_url, name, content FROM articles WHERE friendly_url = 'this-is-an-entry';
A "SELECT * FROM x WHERE id = 391239" query is going to be faster than "SELECT * FROM x WHERE slug = 'some-key'", which in turn is going to be faster than "SELECT * FROM x WHERE slug LIKE '%some-key%'" (the presence of wildcards isn't going to make a heap of difference).
How much faster? Twice as fast? Quite likely. Ten times as fast? Stretching it, but possible. The real questions here are 1) does it matter and 2) should you even be using LIKE in the first place.
1) Does it matter
I'd probably say not. If you indeed have 391,239+ unique articles/pages - and assuming you get a comparable level of traffic, then this is probably just one of many scaling problems you are likely to encounter. However, I'd warrant this is not the case, and therefore you shouldn't worry about a million page views until you get to 1 million and one.
2) Should you even be using LIKE
No. If the page/article title/name is part of the URL "slug", it has to be unique. If it's not, then you are shooting yourself in the foot in terms of SEO and creating a maintenance nightmare for yourself. If the title/name is unique, then you can just use "WHERE title = 'some-page'" and make sure the title column has a unique index on it.
Edit
Your plan of using LIKE for the URLs is utterly, utterly crazy. What happens if someone visits
yoursite.com/articles/the
Do you return a list of all the pages starting "the" ? What then happens if:
Author A creates
yoursite.com/articles/stackoverflow-is-massive
2 days later Author B creates
yoursite.com/articles/stackoverflow-is-massively-flawed
Not only will A be pretty angry that his article has been hijacked, but all the permalinks he may have sent out will be broken, and Google is never going to give your articles any reasonable page rank, because the content keeps changing and effectively diluting itself.
Sometimes there is a pretty good reason you've never seen your amazing new "idea/feature/invention/time-saver" anywhere else before.
INT is much faster.
In the string case, I think you should not query with LIKE but just with =, because you are looking for this-is-an-entry, not for this-is-an-entry-and-something.
There are a few things to consider:
The type of search performed on the database will, most of the time, be an "index seek": a search for a single row using an index.
This type of exact-match operation on a single row is not significantly faster using ints than strings; for any practical purpose they cost basically the same.
What you can do is the following optimization: search the database using an exact match (no wildcards), which is as fast as using an int index. If there is no match, do a fuzzy search (using wildcards); this is more expensive, but it is also rarer and can produce more than one result. Some form of result ranking is needed if you want to go for the best match.
Pseudocode:
Search for an exact match using a string: Article = 'entry'
if (match is found) display page
if (match is not found) search using wildcards
    if (one appropriate match is found) display page
    if (several relevant matches) display a "Did you try to find ..." page
    if (no matches) display error page
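The pseudocode above could look roughly like this in practice. This is a minimal sketch with several assumptions: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, the table and column names (articles, slug) and the function find_article are hypothetical, and the "fuzzy" step is implemented as a simple prefix match rather than any ranking scheme.

```python
import sqlite3

def find_article(conn, slug):
    """Return ('exact', row), ('single', row), ('suggest', rows) or ('missing', None)."""
    # 1. Exact match: an index seek on the unique slug column.
    row = conn.execute(
        "SELECT id, slug FROM articles WHERE slug = ?", (slug,)
    ).fetchone()
    if row:
        return "exact", row

    # 2. Fallback: prefix search. Escape LIKE wildcards in the user input first.
    escaped = slug.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT id, slug FROM articles WHERE slug LIKE ? ESCAPE '\\' ORDER BY id",
        (escaped + "%",),
    ).fetchall()
    if len(rows) == 1:
        return "single", rows[0]
    if rows:
        return "suggest", rows      # caller shows a "Did you try to find ..." page
    return "missing", None          # caller shows an error page

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, slug TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO articles (id, slug) VALUES (?, ?)",
    [(1, "this-is-an-entry"),
     (2, "stackoverflow-is-massive"),
     (3, "stackoverflow-is-massively-flawed")],
)
```

Because the exact match runs first, a request for stackoverflow-is-massive still resolves to the first author's article even after the longer slug exists; only an incomplete slug such as stackoverflow-is falls through to the suggestion list.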
Note: keep in mind that fuzzy URLs are not recommended from a SEO perspective, because people can link your site using multiple URLs which will split your page rank instead of increase it.
If you put an index on the varchar field it should be OK (performance-wise); it really depends on how many pages you are going to have. Also, you have to be more careful and sanitize the string to prevent SQL injection, e.g. only allow a-z, 0-9, -, _, etc. in your query.
I would still prefer an integer ID, as it is faster and safer. Change the format to something nicer, like:
http://mysite.com/article/21-this-is-an-entry.html
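With a URL of that shape, the application only needs the leading integer for the lookup, and the slug is there for readers. A minimal Python sketch of the parsing step follows; the function name parse_article_path and the exact character set allowed in the slug are assumptions.

```python
import re

# Matches paths such as /article/21-this-is-an-entry.html
ARTICLE_PATH = re.compile(r"^/article/(\d+)-([a-z0-9_-]+)\.html$")

def parse_article_path(path):
    """Return (article_id, slug), or None if the path does not match."""
    m = ARTICLE_PATH.match(path)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

parsed = parse_article_path("/article/21-this-is-an-entry.html")
```

The integer can then be used for an exact primary-key lookup, and anything that does not match the whitelist pattern is rejected before it reaches the query.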
As said, comparing INTs is faster than comparing VARCHARs, and if the table is indexed on the field you're searching, that will help too, as the server won't have to scan the whole table.
One thing which will help validate your queries for speed and sense is EXPLAIN. You can use this to show which indexes your query is using, as well as an estimate of how many rows it will examine.
To answer your question: if it's possible to build your system using exact matches on the article ID (i.e. an INT), then it'll be much "lighter" than if you're trying to match the whole URL using a LIKE statement. LIKE will obviously work, but I wouldn't want to run a large, high-traffic site on it.

Do you use particular conventions for naming complementary variables?

I often find myself trying to come up with good names for complementary pairs of variables; where two variables denote opposing concepts, two participants in some sort of duologue, and so on.
This might be better explained by a counter-example - I maintain an app that prints two graphics as part of a print advertisement. They're stored in the database as TopLogo and LowerLogo, which I have to stop and double-check every time I use them because I'm expecting top to complement bottom, and lower should complement upper.
There are some obvious examples that I think work well:
client / server
source / target for copying/moving data or files from one variable to another
minimum / maximum
but there are some concepts that just don't lend themselves to such neat naming schemes. For example, when paging through records, does 'last' mean 'final' or 'previous'? I recently saw some code that used firstPage, previousPage, nextPage and finalPage to avoid the ambiguous lastPage completely, which I thought was very neat, hence this question.
Do you have any particularly neat variable name pairs you'd care to share with us? (Bonus points if they're the same length, which makes the code so much neater in monospaced fonts.)
Like with all kinds of code style conventions, consistency is what you should strive for.
I would have the development team agree on "standard" pairs of prefixes for common scenarios like "source/destination" or "from/to" and then stick with them for the whole project. As long as every developer is aware of what is meant with a particular prefix in the codebase, it is easier to avoid misunderstandings.
Exceptions to the rule should be clarified in the documentation if the variable is part of a public API, or in comments within the code if its visibility is restricted to a single class or method.
In my databases you'll find many valid-state temporal ("history") tables containing a pair of columns named start_date and end_date. No bonus points for me, then, because I'd rather use the commonly used 'end' than try to come up with an intuitive alternative with the same number of characters as the word 'start'.
I tend to prefer these generic terms even when more context-specific terms may be viable, e.g. preferring employee_start_date over employee_hire_date (what if their employment started for a reason other than being formally hired, e.g. their company was the subject of an acquisition?). That said, I'd prefer person_birth_date over person_start_date :)
While one does try to be semantically coherent in obvious cases -- e.g., maximum goes with minimum, and not "lowest" -- in well-structured OO code (which isn't all code, I know) the problem disappears with a good IDE. Classes are short, methods are short, and variables are few in each method. So it doesn't matter what you call the variable pairs so long as they're clear. Your code might not look professional, but real quality is in the code, not in the look of your code.
The problem further disappears if there is good JavaDoc (or whatever the documentation system is) and if you have good class names to go with it. For instance, if you have an instance of a Connection class and it has a method called setDestination, that's okay, but if you know that setDestination takes one parameter called destination and that it's of the Server class, you're cool... even though you might prefer to call it target, aimHere, placeToSendTheData, or whatever (and the corresponding names source, comingFromHere, and placeToGetTheDataFrom). Plus the doc system says what the thing is for, and that is priceless.
This next thing might sound stupid and I'm sure I'll get voted down here on StackOverflow, but unique, non-professional-sounding variable names have a great advantage: I know that my variables have names like placeWeWantTheDataToGo (and the IDE takes care of typing it), but the "serious" guys who do the JDK would never use such silly names. So I know immediately that the variable is one of mine. Incidentally, when I worked with developers in Spain and Italy, they wrote code with Spanish variable names (not always, but usually). This has the same effect: we can quickly see that the Conexion class is ours, but the Connection class is not.
[Also, instead of typing your variable names, assign them a constant String somewhere in your code and use that, so if they called it lower or downer instead of low, you're still okay.]
Yes, I do try to name complementary sets of variables systematically so that the symmetry is clear. It is not always easy; sometimes, not even possible. Well, not possible using the rules I lay down for myself - which means I usually try to have the names the same length. The 'top' and 'lower' example would drive me batty (assuming I'm not batty already, which is far from certain); I'd probably use 'upper' and 'lower' because those are the same length; 'top' and 'bottom' would frustrate me too because of the difference in length.