I have a table in my MySQL database named membertable. The table consists of two fields, memberid and membername. The memberid field is an integer and uses auto_increment starting from 2001. The membername field is a varchar.
The membertable has two records with the same order as described above. The records look like this :
memberid : 2001
membername : john smith
memberid : 2002
membername : will smith
I found something weird when I ran a SELECT statement against the memberid field. Running the following statement :
SELECT * FROM `membertable` WHERE `memberid` = '2001somecharacter'
It returned the first data.
Why did that happen? There's no record with memberid = 2001somecharacter. It looks like MySQL only searches on the first 4 characters (2001) and, once it finds matching data (the record returned above), it ignores the remaining characters.
How could this happen? And is there any way to turn off this behavior?
--
membertable uses innodb engine
This happens because MySQL tries to convert '2001somecharacter' into a number, which yields 2001.
Since you're comparing a number to a string, you should use
SELECT * FROM `membertable` WHERE CONVERT(`memberid`,CHAR) = '2001somecharacter';
to avoid this behavior.
Or, to do it properly, do NOT put your search variable in quotes, so that it has to be a number (otherwise the query fails with a syntax error), and make sure in the front end that it is a number before passing it into the query.
Your finding is expected MySQL behaviour.
MySQL converts a varchar to an integer starting from the beginning. As long as there are numeric characters which can be converted, they are included in the conversion. When it reaches a letter, the conversion stops and returns the integer value of the numeric string read so far.
Here's some description of this behavior on the MySQL documentation Site. Unfortunately, it's not mentioned directly in the text, but there's an example which exactly shows this behaviour.
MySQL is very liberal in converting string values to numeric values when evaluated in numeric context.
As a demonstration, adding 0 causes the string to be evaluated in a numeric context:
SELECT '2001foo' + 0 --> 2001
, '01.2-3E' + 0 --> 1.2
, 'abc567g' + 0 --> 0
When a string is evaluated in a numeric context, MySQL reads the string character by character, until it encounters a character where the string can no longer be interpreted as a numeric value, or until it reaches the end of the string.
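The prefix-reading rule described above can be modelled outside the database. The following is a rough Python sketch, not MySQL's actual parser: the function name numeric_prefix and the exact regular expression (optional leading whitespace, sign, digits, fraction, exponent) are assumptions made for illustration.

```python
import re

# Assumed shape of a numeric prefix: optional leading whitespace, optional sign,
# digits with an optional fraction (or a bare fraction), optional exponent.
_NUMERIC_PREFIX = re.compile(
    r"\s*[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?"
)

def numeric_prefix(text: str) -> float:
    """Return the value of the longest leading numeric prefix, or 0 if none."""
    match = _NUMERIC_PREFIX.match(text)
    return float(match.group(0)) if match else 0.0

if __name__ == "__main__":
    for sample in ("2001foo", "01.2-3E", "abc567g"):
        print(repr(sample), numeric_prefix(sample))
```

On the three strings from the demonstration, this model gives the same values as the '+ 0' results shown there.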
I don't know of a way to "turn off" or disable this behavior. (There may be a setting of sql_mode that changes this behavior, but that change would likely affect other SQL statements that currently work and might stop working if the change is made.)
Typically, this kind of check of the arguments is done in the application.
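A minimal sketch of such an application-side check, in Python. The function name parse_member_id is hypothetical, and the rule "ASCII digits only" is an assumption about what a valid memberid looks like:

```python
import re

# Assumption: a valid member id consists of ASCII digits only.
_MEMBER_ID = re.compile(r"[0-9]+")

def parse_member_id(raw: str) -> int:
    """Return the id as an int, or raise ValueError if anything else is present."""
    if not _MEMBER_ID.fullmatch(raw):
        raise ValueError(f"not a valid member id: {raw!r}")
    return int(raw)
```

The returned integer can then be passed to the query as a numeric parameter, so a value such as '2001somecharacter' never reaches MySQL.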
But if you need to do this in the SELECT statement, one option would be to cast/convert the column to a character string, and then do the comparison.
But that can have some significant performance consequences. If we do a cast or convert (or any function) on a column that's in a condition in the WHERE clause, MySQL will not be able to use a range scan operation on a suitable index. We're forcing MySQL to perform the cast/convert operation on every row in the table, and compare the result to the literal.
So, that's not the best pattern.
If I needed to perform a check like that within the SQL statement, I would do something like this:
WHERE t.memberid = '2001foo' + 0
AND CAST('2001foo' + 0 AS CHAR) = '2001foo'
The first line is doing the same thing as the current query. And that can take advantage of a suitable index.
The second condition is converting the same value to a numeric, then casting that back to character, and then comparing the result to the original. With the values shown here, it will evaluate to FALSE, and the query will not return any rows.
This will also not return a row if the string value has a leading space, ' 2001'. The second condition is going to evaluate as FALSE.
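The round-trip idea behind the second condition can be sketched in Python. This is a simplified, integer-only model (the name survives_round_trip and the prefix pattern are assumptions), meant only to show why '2001foo' and ' 2001' are rejected while '2001' is accepted:

```python
import re

def survives_round_trip(raw: str) -> bool:
    """Model: convert the string to a number, cast back to text, compare with the input."""
    # Assumed conversion: skip leading whitespace, read an optional sign and digits.
    match = re.match(r"\s*([+-]?[0-9]+)", raw)
    number = int(match.group(1)) if match else 0
    return str(number) == raw
```

Only inputs that are already in canonical numeric form come back unchanged, so anything with trailing characters, leading spaces, or leading zeros fails the comparison.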
When comparing an INT to a 'string', the string is converted to a number.
Converting a string to a number takes as many of the leading characters as it can and still be a number. So '2001character' is treated as the number 2001.
If you want non-numeric characters in memberid, make it VARCHAR.
If you want only numeric ids, then reject '200.1character'
I have a string like the following in the column of a hive external table
<id>^<count>^<distinct_count>|<id>^<count>^<distinct_count>|...
There are two delimiters. | on an entity level and ^ on sub-entity level
I have a metric defined as the number of entities with a non-zero distinct_count (or count). That is, given a string, I have to check for each entity whether the distinct count (or the count; I can check either) is non-zero and, if it is, mark a flag as 1. The metric is then sum(flags). I have to store this metric in an aggregated table in the next step.
Please suggest a way for me to do this in Hive
I think it's not possible. Ended up using an external python mapper for the same.
If you want to count number of non-zero count in a string s, it seems to be solved by
length(
regexp_replace(
regexp_replace(s, "[^^|]*\\^0\\^[^^|]*\\|?", ""),
"[^^|]*\\^[^^|]*\\^[^^|]*\\|?",
"1"
)
)
First regexp_replace removes parts with zero count, second regexp_replace replaces remaining parts with single symbols (it should not necessarily be "1", any symbol would suffice), and hence length returns number of parts with non-zero count.
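To check the two-step replacement on sample data outside Hive, here is a Python sketch that mirrors the same patterns. It assumes Python's re module treats these patterns the same way as the Java regex engine, and the function names are hypothetical. A split-based version is included as a cross-check.

```python
import re

def count_nonzero_regex(s: str) -> int:
    """Mirror of the two nested regexp_replace calls, followed by length()."""
    # Step 1: drop every entity whose count field is exactly 0.
    without_zero = re.sub(r"[^^|]*\^0\^[^^|]*\|?", "", s)
    # Step 2: collapse each remaining entity to a single character.
    collapsed = re.sub(r"[^^|]*\^[^^|]*\^[^^|]*\|?", "1", without_zero)
    return len(collapsed)

def count_nonzero_split(s: str) -> int:
    """Cross-check: split on '|' and '^' and test the count field directly."""
    entities = [e for e in s.split("|") if e]
    return sum(1 for e in entities if e.split("^")[1] != "0")
```

For an input such as 'a^3^2|b^0^0|c^5^1', both functions agree that two entities have a non-zero count.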
I have a table called media with a column called accounts_used in which the rows appear in the following format
68146, 67342, 60577, 61506, 67194, 67034, 63484, 49113, 61518, 66971, 67511,
67351, 63621, 67725, 63638, 68141, 66114, 67262, 67537, 67537, 61765, 63701,
67087, 62641, 61294, 67063, 67049, 67038, 67170, 67147, 67289, 61264, 67091,
63690, 63505, 63505, 49172, 52313, 67070, 66945, 67234, 62265, 61368, 67870,
67211, 67586, 49240, 67538, 67538, 67809, 67183, 67164, 62712, 67519, 66895,
67693, 60266, 60266, 67593, 67031, 67137, 62570, 60682, 61195, 67569, 67569,
67069, 62082, 67345, 61748, 61553, 52029, 66877, 62630, 67196, 67196, 67196,
67196, 67196, 67196, 66873, 63677, 68174, 67127, 63594, 67107, 60419, 66601,
68156, 67203, 68161, 60233, 66586, 52654, 63570, 66887, 67191, 60877, 52108,
67131, 61784, 67566, 67162, 67073, 67092, 67064, 60133, 66907, 67559, 66846,
60490, 60347, 66558, 48737, 61539, 67236, 68135, 67238 , 63656, 67585, 67512
If the row has a comma at the end I want to remove this, so for example if the row looks like the following
1,2,3,4,5,6,
I want to replace it with just this
1,2,3,4,5,6
Is this possible to do using just a simple query?
It is a bad idea to store lists of ids in rows. But, you are doing it. You can fix this by doing:
update media
set accounts_used = left(accounts_used, length(accounts_used) - 1)
where accounts_used like '%,';
Instead, you should have a MediaAccounts table, with one row per media/account pair.
EDIT:
Possibly, the row ends with a ', ' rather than just a comma:
update media
set accounts_used = left(accounts_used, length(accounts_used) - 2)
where accounts_used like '%, ';
We faced a similar string-replacement issue with a large dataset of bibliographic entries, where we needed to trim extraneous punctuation from a large number of strings that had been imported verbatim from another system. Many of the records also contained Unicode characters, so we needed a SQL query that would find the records to be updated and then update them in a way that was Unicode (multibyte character) compatible under MySQL.
In testing with our dataset, I found that searching for the relevant records with MySQL's LEFT() and RIGHT() substring functions performed better than a LIKE pattern match. Additionally, MySQL's LENGTH() function returns the number of bytes in a string rather than the number of characters. The distinction matters for string fields that may contain multibyte character sequences, because MySQL's substring functions count characters, not bytes. Using LENGTH() therefore did not work in our case, where many of the strings contained multibyte characters. These requirements resulted in an UPDATE query of the following form:
UPDATE media
SET accounts_used = LEFT(accounts_used, CHAR_LENGTH(accounts_used) - 1)
WHERE RIGHT(accounts_used, 1) = ',';
The query selects records in the media table where the accounts_used column ends with a comma. The filtering is done by the WHERE RIGHT(accounts_used, 1) = ',' clause, where RIGHT() returns a substring of the specified length starting from the right of the provided string/column. The trim is done by LEFT(accounts_used, CHAR_LENGTH(accounts_used) - 1), which removes the last character from the accounts_used value; LEFT() returns a substring of the specified length starting from the left of the provided string/column.
The use of the multibyte-aware CHAR_LENGTH() function, rather than the basic LENGTH() function, was important in our case because so many records in our dataset contained multibyte characters. If you are only dealing with ASCII or another single-byte character set, LENGTH() works perfectly well; in that case CHAR_LENGTH() and LENGTH() return the same count and can be used interchangeably. When dealing with data that could contain multibyte characters, or if in doubt, use CHAR_LENGTH() instead, as it returns an accurate character count in either case.
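The byte-versus-character distinction can be illustrated outside MySQL. The Python sketch below assumes the column is stored as UTF-8; the helper names are hypothetical stand-ins for LENGTH(), CHAR_LENGTH() and LEFT(), not MySQL's implementations.

```python
def byte_length(s: str) -> int:
    """Stand-in for LENGTH(): number of bytes, assuming UTF-8 storage."""
    return len(s.encode("utf-8"))

def char_length(s: str) -> int:
    """Stand-in for CHAR_LENGTH(): number of characters."""
    return len(s)

def left(s: str, n: int) -> str:
    """Stand-in for LEFT(): the first n characters."""
    return s[:max(n, 0)]

value = "caf\u00e9,"  # 'e' with acute accent takes 2 bytes in UTF-8

# Byte-based trim: 6 bytes - 1 = 5, but the string has only 5 characters,
# so nothing is removed and the trailing comma survives.
trimmed_by_bytes = left(value, byte_length(value) - 1)

# Character-based trim: 5 characters - 1 = 4, the trailing comma is removed.
trimmed_by_chars = left(value, char_length(value) - 1)
```

With a purely ASCII value the two trims give the same result, which is why the problem only shows up once multibyte characters are present.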
Please note that the column and field names used in the example query above match those noted in the original question, and should be modified as needed to suit your own dataset needs.
I have two databases, both containing phone numbers. I need to find all instances of duplicate phone numbers, but the formats of database 1 vary wildly from the format of database 2.
I'd like to strip out all non-digit characters and just compare the two 10-digit strings to determine if it's a duplicate, something like:
SELECT b.phone as barPhone, sp.phone as SPPhone FROM bars b JOIN single_platform_bars sp ON sp.phone.REGEX = b.phone.REGEX
Is such a thing even possible in a mysql query? If so, how do I go about accomplishing this?
EDIT: Looks like it is, in fact, a thing you can do! Hooray! The following query returned exactly what I needed:
SELECT b.phone, b.id, sp.phone, sp.id
FROM bars b JOIN single_platform_bars sp ON REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(b.phone,' ',''),'-',''),'(',''),')',''),'.','') = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(sp.phone,' ',''),'-',''),'(',''),')',''),'.','')
MySQL before version 8.0 (which added REGEXP_SUBSTR and REGEXP_REPLACE) doesn't support returning the "match" of a regular expression. The MySQL REGEXP operator returns 1 or 0, depending on whether an expression matched a regular expression test or not.
You can use the REPLACE function to replace a specific character, and you can nest those calls, but that becomes unwieldy for all "non-digit" characters. For example, to remove spaces, dashes, and open and close parens:
REPLACE(REPLACE(REPLACE(REPLACE(sp.phone,' ',''),'-',''),'(',''),')','')
One approach is to create a user-defined function that returns just the digits from a string. But if you don't want to create a user-defined function...
This can be done in native MySQL. This approach is a bit unwieldy, but it is workable for strings of "reasonable" length.
SELECT CONCAT(IF(SUBSTR(sp.phone,1,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,1,1),'')
,IF(SUBSTR(sp.phone,2,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,2,1),'')
,IF(SUBSTR(sp.phone,3,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,3,1),'')
,IF(SUBSTR(sp.phone,4,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,4,1),'')
,IF(SUBSTR(sp.phone,5,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,5,1),'')
) AS phone_digits
FROM single_platform_bars sp
To unpack that a bit... we extract a single character from the first position in the string, check if it's a digit, if it is a digit, we return the character, otherwise we return an empty string. We repeat this for the second, third, etc. characters in the string. We concatenate all of the returned characters and empty strings back into a single string.
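The same character-by-character logic can be sketched in Python, for example to normalise the numbers in the application before comparing them. The function names are hypothetical, and treating only the ASCII characters 0-9 as digits is an assumption:

```python
def digits_only(phone: str) -> str:
    """Keep each character that is an ASCII digit, drop everything else."""
    return "".join(ch for ch in phone if "0" <= ch <= "9")

def same_number(a: str, b: str) -> bool:
    """Compare two phone strings on their digits alone."""
    return digits_only(a) == digits_only(b)
```

Unlike the fixed-length SQL expression, this version handles strings of any length.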
Obviously, the expression above is checking only the first five characters of the string, you would need to extend this, basically adding a line for each position you want to check...
And unwieldy expressions like this can be included in a predicate (in a WHERE clause). (I've just shown it in the SELECT list for convenience.)
MySQL versions before 8.0 don't support such string operations natively. You will need either a UDF or a stored function that iterates over a string parameter, appending to its return value every digit that it encounters.
Consider the string "55,33,255,66,55"
I am looking for a way to count the number of occurrences of a specific value ("55" in this case) in this string using a MySQL select query.
Currently i am using the below logic to count
select CAST((LENGTH("55,33,255,66,55") - LENGTH(REPLACE("55,33,255,66,55", "55", ""))) / LENGTH("55") AS UNSIGNED)
But the issue with this one is, it counts all occurence of 55 and the result is = 3,
but the desired output is = 2.
Is there any way i can make this work correct? please suggest.
NOTE : "55" is the input we are giving and consider the value "55,33,255,66,55" is from a database field.
Regards,
Balan
You want to match on ',55,', but there's the first and last position to worry about. You can use the trick of adding commas to the front and back of the input to get around that:
select LENGTH('55,33,255,66,55') + 2 -
LENGTH(REPLACE(CONCAT(',', '55,33,255,66,55', ','), ',55,', 'xxx'))
Returns 2
I've used CONCAT to pre- and post-pend the commas (rather than adding a literal into the text) because I assume you'll be using this on a column not a literal.
Note also these improvements:
Removal of the cast - it is already numeric
By replacing with a string one less in length (ie ',55,' length 4 to 'xxx' length 3), the result doesn't need to be divided - it's already the correct result
2 is added to the length because of the two commas added front and back (no need to use CONCAT to calculate the pre-replace length)
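The padding trick can be modelled in Python to see how it behaves on different inputs. This is a sketch with hypothetical function names; it relies on Python's str.replace replacing non-overlapping matches from left to right, and it is worth checking whether REPLACE behaves the same way on your data.

```python
def count_padded(csv: str, needle: str) -> int:
    """Pad with commas, replace ',needle,' with a string one shorter, compare lengths."""
    padded = "," + csv + ","
    target = "," + needle + ","
    replaced = padded.replace(target, "x" * (len(target) - 1))
    return len(padded) - len(replaced)

def count_split(csv: str, needle: str) -> int:
    """Cross-check: split on commas and count exact matches."""
    return csv.split(",").count(needle)
```

In this model, two adjacent occurrences such as '55,55' share the comma between them, so the padded count reports 1 where the split-based count reports 2. If adjacent duplicates can occur in the column, that case deserves a test before relying on the query.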
Try this:
select CAST((LENGTH("55,33,255,66,55") + 2 - LENGTH(REPLACE(concat(",","55,33,255,66,55",","), ",55,", ",,"))) / LENGTH("55") AS UNSIGNED)
I would use a subselect: in it, I would replace every 255 with some other unique marker, and then count the new markers and the remaining 55's separately.
If(row = '255') then '1337'
for example.