How to use a HAVING clause with a RAND() alias - MySQL

Trying to generate a unique code to prevent duplicates, I use the following query with a HAVING clause (so I can use the alias), but I get duplicate key errors:
SELECT
FLOOR(100 + RAND() * 899) AS random_code
FROM product_codes
HAVING random_code NOT IN (values)
LIMIT 1
The code in the following answer did not work for me, although it is what I need:
https://stackoverflow.com/a/4382586
Is there a better way to accomplish this, or is there something wrong with my query?

If you want a code that is guaranteed to be unique, use the MySQL function UUID():
http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid
"A UUID is designed as a number that is globally unique in space and time. Two calls to UUID() are expected to generate two different values, even if these calls are performed on two separate computers that are not connected to each other."
If a UUID is too long (e.g. it has to be exactly a certain number of digits), then hash the UUID (with md5 or sha-256 for example), take a certain number of bits and turn that into a decimal integer. The hashing is important since it's the whole UUID that guarantees uniqueness, not any one part of it. However, you will now be able to get hash collisions, which will be likely as soon as you have more than sqrt(2^bits) entries. E.g. if you use 10 bits for 0-1023, then after about 32 entries a hash collision becomes likely. If you use this few bits, consider an incrementing sequence instead.
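For example, a minimal sketch of that approach in MySQL (the 4-hex-digit slice, i.e. 16 bits, is an illustrative choice, not a requirement):
SELECT CONV(LEFT(MD5(UUID()), 4), 16, 10) AS short_code;
-- hashes the full UUID, keeps the first 16 bits, and renders them as 0-65535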

I wanted to use a MySQL query to get one of the random codes still left between 0-999, but I tried that query (and even filled the values condition with 0 to 999) and still always got duplicate codes. Strange behaviour, so I ended up using PHP.
The steps I use now:
Create an array populated with 0 to 999; in the future, if I need more codes, I will use 0 to 9999.
$ary_1 = range(0, 999);
Create another array populated with all the codes already in the table.
$ary_2 = array(4, 5, 985, 963, 589);
Get the array of free codes using array_diff.
$ary_3 = array_diff($ary_1, $ary_2);
Get a random key from the free codes using array_rand.
$new_code_key = array_rand($ary_3, 1);
Use that key to get the code and build the MySQL query.
$ary_3[$new_code_key];
Unset that key from the free-codes array to speed up the process; later I just have to get another array_rand key and unset it again.
unset($ary_3[$new_code_key]);
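For reference, the same set-difference idea can be expressed in a single query, assuming a helper table all_codes pre-populated with 0-999 and a code column in product_codes (both names are hypothetical here):
SELECT ac.code
FROM all_codes ac
LEFT JOIN product_codes pc ON pc.code = ac.code
WHERE pc.code IS NULL
ORDER BY RAND()
LIMIT 1;
-- the anti-join keeps only unused codes, so the random pick can never return a duplicate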

Related

SQL Query giving wrong results

The executed query should match story_id against the provided string, but when I execute it, it gives a wrong result.
The story_id column in your case is of an INT (or numeric) datatype. MySQL does automatic typecasting in this case, so 5bff82... gets typecast to 5, and thus you get the row corresponding to story_id = 5.
Type Conversion in Expression Evaluation
When an operator is used with operands of different types, type conversion occurs to make the operands compatible. Some conversions occur implicitly. For example, MySQL automatically converts strings to numbers as necessary, and vice versa.
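You can see the conversion directly (the literal here is shortened for illustration):
SELECT '5bff82' + 0; -- returns 5: only the leading numeric prefix is used
SELECT '5bff82' = 5; -- returns 1 (true) for the same reason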
Now, ideally your application code should be robust enough to handle this input. If you expect the input to be numeric only, your application code can validate the data (ensuring that it is only a number, without typecasting) before sending it to the MySQL server.
Another way would be to explicitly cast story_id to a string datatype and then perform the comparison. However, this is not a recommended approach, as it would not be able to utilize indexing.
SELECT * FROM story
WHERE CAST(story_id AS CHAR(12)) = '5bff82...'
If you run the above query, you would get no results.
You can also use something like this:
SELECT * FROM story
WHERE REGEXP_LIKE(story_id, '^[1-5]{1}(.*)$');
This matches any story_id starting with one of those digits followed by any number of characters, so it won't collapse to story_id = 5 the way the implicit cast does; use it if you explicitly want to match against a string.

Strange behaviour of BIT(64) column in MySQL

Can anyone help me understand the following problem with a BIT(64) column in MySQL (5.7.19)?
This simple example works fine and returns the record from the temporary table:
CREATE TEMPORARY TABLE test (v bit(64));
INSERT INTO test values (b'111');
SELECT * FROM test WHERE v = b'111';
-- Returns the record as expected
When using all 64 bits of the column, it no longer works:
CREATE TEMPORARY TABLE test (v bit(64));
INSERT INTO test values (b'1111111111111111111111111111111111111111111111111111111111111111');
SELECT * FROM test WHERE v = b'1111111111111111111111111111111111111111111111111111111111111111';
-- Does NOT return the record
This only happens when using a value with 64 bits. But I would expect that to be possible.
Can anyone explain this to me?
Please do not respond by advising me not to use BIT columns. I am working on a database tool that should be able to handle all the data types of MySQL.
The problem seems to be that the value b'11..11' in the WHERE clause is treated as a SIGNED BIGINT, which is -1, and is compared to the value in your table, which is treated as an UNSIGNED BIGINT, which is 18446744073709551615. This is always an issue when the first of the 64 bits is 1. IMHO this is a bug or a design flaw, because I would expect an expression in the WHERE clause to match a row if the same expression was used in the INSERT statement (at least in this case).
One workaround would be to cast the value to UNSIGNED:
SELECT *
FROM test
WHERE v = CAST(b'1111111111111111111111111111111111111111111111111111111111111111' as UNSIGNED);
Or (if your application language supports it) convert it to something like long uint or decimal:
SELECT * FROM test WHERE v = 18446744073709551615;
Bits are returned as binary, so to display them, either add 0 or use a function such as HEX(), OCT() or BIN() to convert them (https://mariadb.com/kb/en/library/bit/). As the MySQL manual puts it: "Bit values in result sets are returned as binary values, which may not display well. To convert a bit value to printable form, use it in numeric context or use a conversion function such as BIN() or HEX(). High-order 0 digits are not displayed in the converted value." (https://dev.mysql.com/doc/refman/8.0/en/bit-value-literals.html)
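A quick sketch of those conversions against the test table above:
SELECT v + 0 AS as_number,
HEX(v + 0) AS as_hex,
BIN(v + 0) AS as_binary
FROM test;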

Creating a table of duplicates from SAS data set with over 50 variables

I have a large SAS data set (54 variables and over 10 million observations) I need to load into Teradata. There are duplicates that must also come along, and my machine is not configured for MultiLoad. I want to simply create a table of the 300,000 duplicates that I can append to the original load that did not accept them. The logic I've read in other posts seems good for tables with just a few variables, but is there another way to create a new table in which every observation having the same combination of all 54 variables is listed? I'm trying to avoid the proc sort...by logic using 54 variables. The query builder method seemed inefficient as well. Thanks.
Using proc sort is a good way to do it; you just need to create a nicer key to sort by.
Create some test data.
data have;
x = 1;
y = 'a';
output;
output;
x = 2;
output;
run;
Create a new field that is basically equivalent to appending all of the fields in the row together and then running them through the md5() (hashing) algorithm. This will give you a nice short field that will uniquely identify the combination of all the values on that row.
data temp;
length hash $16;
set have;
hash = md5(cats(of _all_));
run;
Now use proc sort and our new hash field as the key. Output the duplicate records to the table named 'want':
proc sort data=temp nodupkey dupout=want;
by hash;
run;
You can do something like this:
proc sql;
create table rem_dups as
select <key_fields>, count(*) from duplicates
group by <key_fields>
having count(*) > 1;
quit;
proc sql;
create table target as
select dp.* from duplicates dp
left join rem_dups rd
on <Key_fields>
where <key_fields> is null;
quit;
If there are more than 300K duplicates, this option does not work. And also, I am afraid to say that I don't know about Teradata or the way you load tables.
First, a few sort related suggestions, then the core 'fast' suggestion after the break.
If the table is entirely unsorted (ie, the duplicates can appear anywhere in the dataset), then proc sort is probably your simplest option. If you have a key that will guarantee putting duplicate records adjacent, then you can do:
proc sort data=have out=uniques noduprec dupout=dups;
by <key>;
run;
That will put the duplicate records (note noduprec, not nodupkey - noduprec requires all 54 variables to be identical) in a secondary dataset (dups in the above). However, if the full duplicates are not physically adjacent (ie, you have 4 or 5 duplicates by the key but only two are completely duplicated), it may not catch them; you would need a second sort, or you would need to list all variables in your by statement (which might be messy). You could also use Rob's md5 technique to simplify this some.
If the table is not 'sorted' but the duplicate records will be adjacent, you can use by with the notsorted option.
data uniques dups;
set have;
by <all 54 variables> notsorted;
if not (first.<last variable in the list>) then output dups;
else output uniques;
run;
That tells SAS not to complain if things aren't in proper order, but lets you use first/last. Not a great option though particularly as you need to specify everything.
The fastest way to do this is probably to use a hash table for this, iff you have enough RAM to handle it, or you can break your table up in some fashion (without losing your duplicates). 10m rows times 54 (say 10 byte) variables means 5.4GB of data, so this only works if you have 5.4GB of RAM available to SAS to make a hash table with.
If you know that a subset of your 54 variables is sufficient for verifying uniqueness, then the unq hash only has to contain that subset of variables (ie, it might only be four or five index variables). The dup hash table does have to contain all variables (since it will be used to output the duplicates).
This works by using modify to quickly process the dataset without rewriting the majority of the observations, using remove to remove the duplicates, and using the hash table output method to write them to a new dataset. The unq hash table is only used for lookup - so, again, it could contain a subset of variables.
I also use a technique here for getting the full variable list into a macro variable so you don't have to type 54 variables out.
data class; *make some dummy data with a few true duplicates;
set sashelp.class;
if age=15 then output;
output;
run;
proc sql;
select quote(name)
into :namelist separated by ','
from dictionary.columns
where libname='WORK' and memname='CLASS'
; *note UPCASE names almost always here;
quit;
data class;
if 0 then set class;
if _n_=1 then do; *make a pair of hash tables;
declare hash unq();
unq.defineKey(&namelist.);
unq.defineData(&namelist.);
unq.defineDone();
declare hash dup(multidata:'y'); *the latter allows this to have dups in it (if your dups can have dups);
dup.defineKey(&namelist.);
dup.defineData(&namelist.);
dup.defineDone();
end;
modify class end=eof;
rc_c = unq.check(); *check to see if it is in the unique hash;
if rc_c ne 0 then unq.add(); *if it is not, add it;
else do; *otherwise add it to the duplicate hash and mark to remove it;
dup.add();
delete_check=1;
end;
if eof then do; *if you are at the end, output the dups;
rc_d = dup.output(dataset:'work.dups');
end;
if delete_check eq 1 then remove; *actually remove it from unique dataset;
run;
Instead of trying to avoid proc sort, I would recommend you use proc sort with an index.
Read the documentation about indexes.
I am sure there must be identifier(s) to distinguish observations other than _n_, and with the help of an index, sorting with noduprecs or nodupkey dupout = dataset would be an efficient choice. Furthermore, indexing could also facilitate other operations such as merging / reporting.
Anyway, I do not think a dataset with 10 million observations is a good dataset to work with, not to mention the 54 variables.

Set same random value twice in one query for encryption in MySQL

I need to update an encrypted table (the whole table) without using a stored procedure.
The encryption requires both a key and a random byte array; the random byte array will be stored in the same table.
update employee set name = aes_encrypt(name,'key',#RANDOM_BYTES), random_bytes = #RANDOM_BYTES;
The first and second #RANDOM_BYTES should match, so we encrypt the value and store the random value in the same table for later decryption.
I wonder if that is possible at all.
I can use multiple queries, but not stored procedures.
I am not 100% absolutely, positively sure, but I am fairly certain that MySQL evaluates SETs left to right, so this slight modification should work:
UPDATE employee
SET random_bytes = #RANDOM_BYTES
, name = aes_encrypt(name,'key',random_bytes)
;
Edit: You can actually use this aspect to do stuff like swap integer values between two columns in the same row (though, obviously, the SET for that is much more complicated.)
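A sketch of that swap using a user variable (swap_test, x, and y are illustrative names; this relies on the same left-to-right evaluation, with the user variable preserving the original value of x):
UPDATE swap_test
SET x = (@tmp := x)
, x = y
, y = @tmp;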
IMPORTANT NOTE: I'm not a cryptographer. I'm not recommending this approach to securing your data by "encrypting" it in this way.
As far as updating a row, you may be able to take advantage of a MySQL deviation from (or "extension to"?) the SQL standard.
With a statement like this:
UPDATE mytable
SET col1 = 'foo'
, col2 = col1
col1 and col2 end up with the same value, 'foo', because the reference to col1 in the second assignment returns the value that was just assigned to col1 by the previous assignment. (The order is important here!)
I think the MySQL AES_ENCRYPT function takes two arguments, not three, so I'm a little confused by the example you show.
As an example of the approach to the update (setting aside for a moment how you would replace #RANDOM_BYTES with something that gets you a value you can use):
UPDATE employee
SET random_bytes = #RANDOM_BYTES
, encrypted_name = AES_ENCRYPT(name,random_bytes)
I strongly recommend you test this, and verify it does what you want, on test data in a test environment, before relying on it someplace important.
Be sure that the encrypted_name column is defined as "binary" type, e.g. VARBINARY (of sufficient length to hold the encrypted/padded value) or BLOB. Either that, or, you'd need to encode the binary value, using for example HEX or base64 encoding, so it can be stored in a character column.
As far as returning a value in place of #RANDOM_BYTES, that's going to look a little complicated.
We can use repeated calls to RAND() to return pseudo-random values, but we need to get that into a 128-bit binary value, or whatever key length is required for AES_ENCRYPT.
I'm thinking out loud here...
FLOOR(RAND()*256.0) will get us an integer value between 0 and 255; that'd be enough for a random byte.
The MD5() function will return us a 128-bit (16-byte) value, represented as 32 hex digits.
The SHA2() function returns values longer than that; we can certainly trim that to whatever length we need.
The trick is going to be getting enough randomness into those hash functions; I'm thinking a single float value isn't going to be enough randomness. We might be able to make use of some other value(s) from the row as a "salt", in combination with some random bytes.
We need to be careful about which functions return hex character strings, and which return BINARY.
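For instance, MD5() returns 32 hex characters, not bytes, so UNHEX() is needed to get 16 actual binary bytes (the particular mix of inputs here is just an illustration of the salting idea):
SELECT UNHEX(MD5(CONCAT(RAND(), UUID(), NOW(6)))) AS sixteen_random_bytes;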
We could get two pseudo-random binary bytes like this:
CHAR(FLOOR(RAND()*256*256) USING BINARY)
If we concatenate eight repetitions of that together, that would get us 16 bytes of binary data. We can get a little less randomness, generating 4 bytes at a time, and concatenating a series of four of these:
CHAR(FLOOR(RAND()*256*256*256*256) USING BINARY)
I'm thinking something like this, though we definitely want to test that the expression returns a random 16-byte binary value as intended:
random_binary VARBINARY(16)
UPDATE employee
SET random_bytes = -- 16-byte pseudo-random binary string
CONCAT( CHAR(FLOOR(RAND()*256*256*256*256) USING BINARY)
, CHAR(FLOOR(RAND()*256*256*256*256) USING BINARY)
, CHAR(FLOOR(RAND()*256*256*256*256) USING BINARY)
, CHAR(FLOOR(RAND()*256*256*256*256) USING BINARY)
)
, encrypted_name = AES_ENCRYPT(name,random_bytes)
With this approach, it is critically important that the random_bytes column be defined a "binary" type, e.g. VARBINARY(16).
Also, the value stored in encrypted_name will be BINARY, so that needs to be defined as BINARY as well... the AES encryption adds padding (per the MySQL docs, the result length is 16 * (trunc(string_length / 16) + 1) bytes), so that column will certainly need to be longer than name.
If we think that storing BINARY might present a problem somewhere along the way (e.g. a characterset translation gets inadvertently applied later), we can encode the binary as hex digits and store that instead:
random_hex VARCHAR(32)
UPDATE employee
SET random_hex = -- 16-byte pseudo-random binary as 32 hex digits
HEX(
CONCAT( CHAR(FLOOR(RAND()*256*256*256*256) USING BINARY)
, CHAR(FLOOR(RAND()*256*256*256*256) USING BINARY)
, CHAR(FLOOR(RAND()*256*256*256*256) USING BINARY)
, CHAR(FLOOR(RAND()*256*256*256*256) USING BINARY)
)
)
, encrypted_name_hex = HEX(AES_ENCRYPT(name,UNHEX(random_hex)))
NOTE: I'm not sure if AES_ENCRYPT returns binary or hex; that may depend on the version of MySQL that you're running. I'm also not sure about the key length.
A longer key length would be "more secure" than a shorter one, but it depends on the encryption algorithm. Also, the RAND() function isn't "truly" random...
I think the much bigger issue security-wise is storing the decryption key on the same row with encrypted value. The encrypted value is only secure if the key is secured.
You can use cursors to loop through each row. Please see https://dev.mysql.com/doc/refman/5.0/en/cursors.html
All you will need to do is define a variable inside the loop scope; that way it will be generated once per row:
SET Bytes = #RANDOM_BYTES
From that point on, you can simply update the whole table with the same random bytes for both columns.
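A rough sketch of that cursor approach (note that MySQL cursors only exist inside stored programs, which the question wanted to avoid; the procedure name, the id column, the 16-byte length, and RANDOM_BYTES(), available from MySQL 5.6.17, are all illustrative assumptions):
DELIMITER //
CREATE PROCEDURE encrypt_all_names()
BEGIN
DECLARE done INT DEFAULT 0;
DECLARE emp_id INT;
DECLARE cur CURSOR FOR SELECT id FROM employee;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
OPEN cur;
read_loop: LOOP
FETCH cur INTO emp_id;
IF done THEN LEAVE read_loop; END IF;
SET @bytes = RANDOM_BYTES(16); -- generated once, then used for both columns
UPDATE employee
SET random_bytes = @bytes,
name = AES_ENCRYPT(name, 'key', @bytes) -- three-arg form as in the question
WHERE id = emp_id;
END LOOP;
CLOSE cur;
END //
DELIMITER ;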

MySQL - FULLTEXT in BOOLEAN mode + Relevance using views field

I have the following table:
CREATE TABLE IF NOT EXISTS `search`
(
`id` BIGINT(16) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`string` TEXT NOT NULL,
`views` BIGINT(16) NOT NULL,
FULLTEXT(string)
) ENGINE=MyISAM;
It has a total of 5,395,939 entries. To perform a search on a term (like 'a'), I use the query:
SELECT * FROM `search` WHERE MATCH(string) AGAINST('+a*' IN BOOLEAN MODE) ORDER BY `views` DESC LIMIT 10
But it's really slow =(. The query above took 15.4423 seconds to perform. Obviously, it's fast without sorting by views, which takes less than 0.002s.
I'm using ft_min_word_len=1 and ft_stopword_file=
Is there any way to use the views as the relevance in the fulltext search without making it too slow? I want the search term "a b" to match "big apple", for example, but not "ibg apple" (I just need the search prefixes to match).
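In boolean mode that means issuing each search word as a required prefix term, so "a b" would become something like:
SELECT * FROM `search`
WHERE MATCH(string) AGAINST('+a* +b*' IN BOOLEAN MODE)
ORDER BY `views` DESC LIMIT 10
-- matches "big apple" ("a" prefixes "apple", "b" prefixes "big") but not "ibg apple"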
Thanks
Since no one answered my question, I'm posting my solution (not the one I would expect to find if I were googling, since it isn't as easy to apply as a simple database-design fix would be, but it's still a solution to this problem).
I couldn't really solve it with any engine or function used by MySQL. Sorry =/.
So, I decided to develop my own software to do it (in C++, but you can apply it in any other language).
If what you are looking for is a method to search for prefixes of words in small strings (the average length of my strings is 15), then you can use the following algorithm:
1. Create a trie. Each word of each string is put on the trie.
Each leaf has a list of the ids that match that word.
2. Use a map/dictionary (or an array) to memorize the information
for each id (map[id] = information).
Searching for a string:
Note: The string will be in the format "word1 word2 word3...". If it has symbols such as #, $, you might treat them as " " (spaces).
Example: "Rafael Perrella"
1. Search for the prefix "Rafael" in the trie. Put all the ids you
get in a set (a Binary-Search Tree that ignores repeated values).
Let's call this set "mainSet".
2. Search for the prefix "Perrella" in the trie. For each result,
put them in a second set (secSet) if and only if they are already
in the mainSet. Then, clear mainSet and do mainSet = secSet.
3. If there are still words left to search, repeat the second step
for all of those words.
After these steps, you will have a set with all the results. Make a vector of (views, id) pairs and sort it in descending order, then just take the results you want... I've limited it to 30 results.
Note: you can sort the words first to remove those that share a prefix with another (for example, in "Jan Ja Jan Ra" you only need "Jan Ra"). I will not explain this further since the algorithm is pretty obvious.
This algorithm may perform badly sometimes (for example, if I search for "a b c d e f ... z", I will traverse nearly the entire trie...). So, I made an improvement.
1. For each "id" in your map, also create a small trie that will
contain the words of that id's string (ie, a trie for each
m[id]... m[id].trie?).
Then, to make a search:
1. Choose the longest word in the search string (it's not guaranteed,
but it is probably the word with the fewest results in the trie...).
2. Apply the step 1 of the old algorithm.
3. Make a vector with the ids in the mainSet.
4. Let's make the final vector. For each id in the vector you've created
in step 3, search in the trie of this id (m[id].trie?) for all words
in the search string. If it includes all words, it's a valid id and
you might include it in the final vector; else, just ignore this id.
5. Repeat step 4 until there are no more ids to verify. After that, just
sort the final vector by <views, id>.
Now, I use the database just as a way to easily store and load my data. All the queries on this table are asked directly of this software. When I add or remove a record, I send it both to the DB and to the software, so I always keep both updated. It costs me about 30s to load all the data, but then the queries are fast (0.03s for the slowest ones, 0.001s on average; that's on my own notebook, and I haven't tried it on dedicated hosting, where it might be much faster).