I am new to SQL and I was wondering what the right way to write function names is. I know the norm for keywords like SELECT is uppercase, but what is the norm for functions? I've seen some people write them in lowercase and others in uppercase.
Thanks for the help.
There's no single norm; there are standards, but those can change from company to company.
SQL code is not case sensitive.
Usually you write SQL keywords (and reserved words) in UPPERCASE and fields and other identifiers in lowercase, but it is not necessary.
It depends a bit on the function, but there is no requirement. There is really no norm, although some vendors have specific standards or grammar which, in their eyes, aid the readability (and usability) of the code.
Usually 'built-in' functions (non vendor specific) are shown in uppercase.
/* On MySQL */
SELECT CURRENT_TIMESTAMP;
-> '2001-12-15 23:50:26'
Vendor specific functions are usually shown in lowercase.
A good read on functions in general can be found here.
SQL keywords are routinely uppercase, but there is no one "correct" way; lowercase works just as well. (N.B. Many think that uppercase keywords improve SQL's readability, which is not to say that lowercase is difficult to read.) For the most part, however, no one will be averse to something like:
SELECT * FROM `users` WHERE id="3"
Although, if you prefer, this will work as well:
select * from `users` where id='3'
Here's a list of them. Note that they are written in uppercase, but it is not required:
http://developer.mimer.com/validator/sql-reserved-words.tml
Here's another good resource that I keep in my "interesting articles to read over periodically." It elaborates a bit on some somewhat interesting instances when case should be taken into consideration:
http://dev.mysql.com/doc/refman/5.0/en/identifier-case-sensitivity.html
Related
I want to detect possible SQL injection attacks by checking the SQL query. I am using PDO and prepared statements, so hopefully I am not in danger of being attacked. However, what I want to detect is input (or a resulting query string) that may become a dangerous query. For example, my app, working properly, will never generate a "1=1" query, so I can check the generated query string for that and flag the user/IP producing it. The same goes for "drop table", though maybe I can check for that just by looping over the input array; or maybe I should check the generated query all over again. I am using MySQL, but patterns for other drivers are also appreciated.
I have read RegEx to Detect SQL Injection and some of the comments are heading in this direction. In my favor, I'm developing for users who rarely use English as input, so a simple /drop/ match on the query may be enough to log the user/query for further inspection. Some of the patterns I found while researching SQL injection are:
a semicolon in the middle of a statement (although this may be common)
a double dash or pound sign commenting out the rest of the query
quotes at the beginning and end of a value
hex literals (my target users have little chance of typing 0x into a form)
declare/exec/drop/1=1 (my app should never generate these values)
HTML tags (low probability of coming from the intended user/use case)
etc.
All of the above are easier to detect by looping over the input values before the query string is generated, because they haven't been escaped yet. But how much did I miss? (A lot, I guess.) Any other obscure patterns I should check? What about checking the generated query? Any patterns that may emerge there?
tl;dr: What patterns should I match against an SQL query (MySQL) to check for possible injection? I am using PDO with prepared statements and value binding, so the check is for logging/alerting purposes.
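For what it's worth, the patterns listed above could be sketched as a simple logging filter. The pattern set here is illustrative, not exhaustive, and it is a logging aid, not a security boundary; the prepared statements remain the actual defense.

```python
import re

# Illustrative blacklist of suspicious fragments, mirroring the list above.
# This is for logging/alerting only -- it is NOT a security boundary.
SUSPICIOUS_PATTERNS = [
    re.compile(r";\s*\S"),                      # semicolon mid-statement
    re.compile(r"--|#"),                        # SQL comment markers
    re.compile(r"^\s*['\"].*['\"]\s*$"),        # value wrapped in quotes
    re.compile(r"0x[0-9a-fA-F]+"),              # hex literals
    re.compile(r"\b(drop|declare|exec)\b|1\s*=\s*1", re.IGNORECASE),
    re.compile(r"<[a-zA-Z]+[^>]*>"),            # HTML tags
]

def flag_suspicious(value: str) -> bool:
    """Return True if the raw input value matches any suspicious pattern."""
    return any(p.search(value) for p in SUSPICIOUS_PATTERNS)
```

A value flagged here would just be logged with the user/IP for later inspection, exactly as the question proposes.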
In my shop we have two rules.
Always use parameters in SQL queries.
If for some reason you can't follow rule one, then every piece of data put into a query must be sanitized, either with intval() for integer parameters or an appropriate function to sanitize a string variable according to its application data type. For example, a personal name might be Jones or O'Brien or St. John-Smythe but will never have special characters other than apostrophe ', hyphen -, space, or dot. A product number probably contains only letters or numbers. And so forth.
If rule 2 is too hard, follow rule 1.
We inspect code to make sure we're doing these things.
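Rule 1 looks like this in practice. The original context is PHP/PDO, but the same principle is shown here with Python's sqlite3 for illustration:

```python
import sqlite3

# Parameterized query: the data travels separately from the SQL text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", (1, "O'Brien"))

# The apostrophe in "O'Brien" never touches the SQL text, so it cannot
# break out of a string literal the way concatenated input could.
row = conn.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchone()
print(row[0])  # O'Brien
```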
But how much did I miss?
You guess right. Creating a huge blacklist wouldn't make your code immune. This approach is history. The other questions follow the same idea.
Your best bets are:
Validating input data (input doesn't necessarily come from an external party)
Using prepared statements.
A few steps, but bulletproof.
Not possible.
You will spend the rest of your life in an arms race: you build a defense, they build a better weapon, then you build a defense against that, and so on.
It is probably possible to write a 'simple' SELECT that will take 24 hours to run.
Unless you lock down the tables, they can look, for example, at the encrypted passwords and re-attack with a root login.
If you allow any type of string, it will be a challenge to handle the various combinations of quoting.
There are nasty things that can be done with semi-valid utf8 strings.
And what about SET statements? And LOAD DATA? And stored procedures?
Instead, decide on the minimal set of queries you allow, then parameterize that so you can check, or escape, the pieces individually. Then build the query.
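A minimal sketch of that whitelist idea, using Python's sqlite3 for illustration (the query names are hypothetical): every query the application can run is a named template, and only the parameters vary.

```python
import sqlite3

# The application can only run queries from this fixed, named set;
# user input only ever fills in the parameters.
ALLOWED_QUERIES = {
    "user_by_id": "SELECT id, name FROM users WHERE id = ?",
}

def run_query(conn, name, params):
    try:
        sql = ALLOWED_QUERIES[name]
    except KeyError:
        raise ValueError(f"query {name!r} is not in the allowed set")
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
rows = run_query(conn, "user_by_id", (1,))
print(rows)  # [(1, 'alice')]
```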
I have noticed that on SO a lot of people seem to prefer CASE ... WHEN to other alternatives.
For example, all of the answers in this question use CASE ... WHEN, whereas I would have used a simple IF. IF is quite a bit less to type and is prevalent in all programming languages, so it seems kind of weird to me that not a single answer uses it. (I would also expect IF to be a bit faster, though I did not measure it.)
Even more interesting are the answers to this question. Two out of three answers (among them the accepted one) suggest using CASE ... WHEN, whereas from my point of view COALESCE is the better solution (after all, COALESCE was created for exactly the problem the OP has). (Also, in this case I am almost certain that COALESCE would be faster.)
So, my question is, is there any benefit to CASE ... WHEN (that offsets the additional typing) that I am missing or is it a case of "To a man with a hammer, everything looks like a nail"?
One reason, and a good one, is that a CASE WHEN expression is ANSI compliant while IF is not. Were someone to face porting a MySQL query to another database, the IF calls would probably all have to be rewritten.
MySQL, like most databases, extended ANSI by introducing the IF() function. Perhaps IF, or something similar to it, will become part of the standard some day.
CASE WHEN is in the SQL standard; IF is not. As SQL databases have vastly different dialects, it is not the worst idea to stick to code that will work on most databases, for the following reasons:
If you build the habit of using code that is specific to one database, you will have troubles when working on another.
If you use code that is specific to one database, you cannot test your queries against other databases by simply copy-pasting them. Nor can you migrate your application to another database without changing your SQL queries.
CASE WHEN is the ANSI standard expression for conditional expressions. IF() is a function specific to MySQL.
In general, I prefer ANSI standard functionality when available -- although there are occasional exceptions.
Specifically regarding IF() as a function: it is easily confused with the IF statement in MySQL. Using it as a function seems like unnecessary confusion. (Admittedly, there are other databases where the CASE expression can be confused with a CASE statement in the scripting language, but that is not an issue in MySQL.)
In addition, IF() is pretty close to control flow, which makes it different from most other functions anyway.
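The portability point is easy to demonstrate. The sketch below runs the same ANSI CASE WHEN and COALESCE expressions unchanged on SQLite (via Python's sqlite3), exactly as they would run on MySQL, whereas a query built on MySQL's IF() would not survive the move:

```python
import sqlite3

# CASE WHEN and COALESCE are ANSI standard, so this query text is portable;
# SQLite, used here, does not even have MySQL's IF() function.
conn = sqlite3.connect(":memory:")

case_result = conn.execute(
    "SELECT CASE WHEN 2 > 1 THEN 'yes' ELSE 'no' END"
).fetchone()[0]
print(case_result)  # yes

# COALESCE returns its first non-NULL argument -- the standard tool for
# the "replace NULL with a default" problem from the linked question.
coalesce_result = conn.execute("SELECT COALESCE(NULL, NULL, 42)").fetchone()[0]
print(coalesce_result)  # 42
```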
I've seen most coders use Capital letters while writing MySQL queries, something like
"SELECT * FROM `table` WHERE `id` = 1 ORDER BY `id` DESC"
I've tried writing the queries in small caps and it still works.
So is there any particular reason for not using small caps or is it just a matter of choice?
It's just a matter of readability. Keeping keywords in uppercase and table/column names in lowercase makes it easier to separate the two when scan-reading the statement, i.e. better readability.
Most SQL implementations are case-insensitive, so you could write your statement in late 90s LeEt CoDeR StYLe, if you felt so inclined, and it would still work.
Case makes no difference to the SQL engine. It is just a convention, like the coding conventions used in any programming language.
You have to have a system - there are already a few questions on the site that deal with conventions and approaches. Try:
SQL formatting standards
This is for readability sake.
Sometimes we have to write queries like that because our project's official style guide recommends it.
But it is all for readability and uniformity within a project.
For a given word I'd like to find the n closest misspellings. I was wondering if an open source spell checker like aspell would be useful in that context, unless you have other suggestions.
For example: 'health'
would give me: ealth, halth, heallth, healf, ...
Spelling correction tools take misspelled words and offer possible correctly spelled alternatives. You seem to want to go in the other direction.
Going from a correctly spelled word to a set of possible misspellings could probably be performed by applying a set of mutation heuristics to common words. These heuristics might do things like:
randomly adding or removing single characters
randomly apply transpositions of pairs of characters
changing characters to other characters based on keyboard layouts
application of common "point" misspellings; e.g. transposing "ie" to "ei", doubling or undoubling "l"s.
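These mutation heuristics could be sketched roughly like this (the keyboard neighbor map is a tiny illustrative slice of QWERTY, not a complete layout):

```python
import random

# Tiny illustrative slice of QWERTY adjacency; a real map would cover
# the whole keyboard.
QWERTY_NEIGHBORS = {"a": "qws", "e": "wrd", "h": "gjy", "l": "kop", "t": "ryg"}

def mutate(word: str, rng: random.Random) -> str:
    """Apply one random mutation heuristic to the word."""
    choice = rng.randrange(4)
    i = rng.randrange(len(word))
    if choice == 0:                          # drop a character
        return word[:i] + word[i + 1:]
    if choice == 1:                          # double a character
        return word[:i] + word[i] + word[i:]
    if choice == 2 and i + 1 < len(word):    # transpose a pair
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    neighbors = QWERTY_NEIGHBORS.get(word[i])  # keyboard slip
    if neighbors:
        return word[:i] + rng.choice(neighbors) + word[i + 1:]
    return word

rng = random.Random(0)
variants = {mutate("health", rng) for _ in range(20)} - {"health"}
print(sorted(variants))
```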
Going from a correctly spelled word to a set of common misspellings is really hard. Probably the only reliable way to do this would be to instrument a spelling checker package used by a large community of users, record the actual spelling corrections made using the spelling checker, and aggregate the results. That is probably (!) beyond the scope of your project.
On revisiting my answer, I think I've missed something.
My heuristics above are mostly for typing errors rather than misspellings. A typing error is where the user knows the correct spelling but mistyped the word. A misspelling is where the person doesn't know the correct spelling and uses either incorrect knowledge or intuition (i.e. a guess). Typical guesses are based on listening to what the word sounds like and then picking a spelling that (if correct) would most likely be pronounced that way.
So a good heuristic for predicting misspellings would need to be based on what the word actually sounds like when spoken. That requires a phonetic dictionary (to go from the actual word to its pronunciation) and a set of rules for generating plausible spellings of the phonetic word. That's more complicated than simple heuristics for typing errors.
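As a rough sketch of the phonetic direction, a classic (and admittedly crude) phonetic key like Soundex maps words that sound alike to the same code, so generated candidates could be filtered down to those sharing the original word's code:

```python
# Letter -> Soundex digit; vowels and h/w/y get no digit.
SOUNDEX_CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
    "4": "l", "5": "mn", "6": "r",
}.items() for c in letters}

def soundex(word: str) -> str:
    """Classic 4-character Soundex code, e.g. 'health' -> 'H430'."""
    word = word.lower()
    first = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    digits = []
    for c in word[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "hw":        # h/w do not break a run of equal codes
            prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("health"), soundex("helth"))  # H430 H430
```

Here "helth" shares a code with "health", so it survives a phonetic filter, while an unrelated mutation would not.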
I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this?
Please note: this site only has English content.
Some approaches I've thought of:
Look up the word in some kind of thesaurus file to determine alternate forms of a given word.
Some examples:
Searches for "car", also add "cars" to the query.
Searches for "carry", also add "carries" and "carried" to the query.
Searches for "small", also add "smaller" and "smallest" to the query.
Searches for "can", also add "can't", "cannot", "cans", and "canned" to the query.
And it should work in reverse (i.e. search for "carries" should add "carry" and "carried").
Drawbacks:
Doesn't work for many new technical words unless the dictionary/thesaurus is updated frequently.
I'm not sure about the performance of searching the thesaurus file.
Generate the alternate forms algorithmically, based on some heuristics.
Some examples:
If the word ends in "s" or "es" or "ed" or "er" or "est", drop the suffix
If the word ends in "ies" or "ied" or "ier" or "iest", convert to "y"
If the word ends in "y", convert to "ies", "ied", "ier", and "iest"
Try adding "s", "es", "er" and "est" to the word.
Drawbacks:
Generates lots of non-words for most inputs.
Feels like a hack.
Looks like something you'd find on TheDailyWTF.com. :)
Something much more sophisticated?
I'm thinking of doing some kind of combination of the first two approaches, but I'm not sure where to find a thesaurus file (or what it's called, as "thesaurus" isn't quite right, but neither is "dictionary").
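The suffix heuristics from option 2 could be sketched as follows; as the drawbacks above note, it cheerfully produces plenty of non-words:

```python
# Sketch of option 2's suffix rules; produces non-words by design,
# which is exactly the drawback described in the question.
def alternate_forms(word: str) -> set[str]:
    forms = set()
    # Drop common suffixes (length guard avoids mangling short words).
    for suffix in ("es", "ed", "er", "est", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            forms.add(word[: -len(suffix)])
    # "ies"/"ied"/"ier"/"iest" -> "y"
    for suffix in ("ies", "ied", "ier", "iest"):
        if word.endswith(suffix):
            forms.add(word[: -len(suffix)] + "y")
    # "y" -> "ies"/"ied"/"ier"/"iest"
    if word.endswith("y"):
        stem = word[:-1]
        forms.update(stem + s for s in ("ies", "ied", "ier", "iest"))
    # Try appending common suffixes.
    forms.update(word + s for s in ("s", "es", "er", "est"))
    forms.discard(word)
    return forms

print(sorted(alternate_forms("carry")))
```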
Consider including the PorterStemFilter in your analysis pipeline. Be sure to perform the same analysis on queries that is used when building the index.
I've also used the Lancaster stemming algorithm with good results. Using the PorterStemFilter as a guide, it is easy to integrate with Lucene.
Word stemming works OK for English; however, for languages where stemming is nearly impossible (like mine), option #1 is viable. I know of at least one such implementation for my language (Icelandic) for Lucene that seems to work very well.
Some of those look like pretty neat ideas. Personally, I would just add some operators to the query (query transformation) to make it fuzzy, or you can use the built-in FuzzyQuery, which uses Levenshtein edit distance and would help with misspellings.
When using fuzzy search query operators, Levenshtein distance is also used. Consider a search for 'car'. If you change the query to 'car~', it will find 'car' and 'cars' and so on. There are other transformations to the query that should handle almost everything you need.
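The metric behind FuzzyQuery is Levenshtein distance: the number of single-character insertions, deletions, or substitutions needed to turn one string into another. A minimal sketch:

```python
# Levenshtein edit distance via the classic two-row dynamic programming table.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("car", "cars"))      # 1
print(levenshtein("health", "helth"))  # 1
```

So 'car~' matching 'cars' corresponds to an edit distance of 1.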
If you're working in a specialised field (I did this with horticulture) or with a language that doesn't play nicely with normal stemming methods, you could use the query logging to create a manual stemming table.
Just create a word -> stem mapping for all the mismatches you can think of / people are searching for, then when indexing or searching replace any word that occurs in the table with the appropriate stem. Thanks to query caching this is a pretty cheap solution.
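A sketch of such a manual stemming table (the entries here are illustrative): replace any token found in the table with its stem, applied identically at index and query time.

```python
# Hand-maintained word -> stem map, built from query-log mismatches.
# Entries are illustrative, not from a real deployment.
STEM_TABLE = {
    "roses": "rose",
    "rosebushes": "rosebush",
    "seedlings": "seedling",
}

def apply_stems(tokens: list[str]) -> list[str]:
    """Replace each token that appears in the table with its stem."""
    return [STEM_TABLE.get(t.lower(), t) for t in tokens]

print(apply_stems(["Roses", "for", "seedlings"]))  # ['rose', 'for', 'seedling']
```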
Stemming is a pretty standard way to address this issue. I've found that the Porter stemmer is way too aggressive for standard keyword search: it ends up conflating words that have different meanings. Try the KStemmer algorithm.