Hive: Count non zero characters in substrings of a string - mysql

I have a string like the following in the column of a hive external table
<id>^<count>^<distinct_count>|<id>^<count>^<distinct_count>|...
There are two delimiters. | on an entity level and ^ on sub-entity level
I have a metric defined as the number of entities whose distinct_count (or count) is non-zero. That is, given a string, I have to check whether the distinct count (or the count; I can check either) is non-zero and, if it is, set a flag to 1. The metric is then sum(flags). I have to store this metric in an aggregated table in the next step.
Please suggest a way for me to do this in Hive

I think it's not possible. Ended up using an external python mapper for the same.

If you want to count the number of parts with a non-zero count in a string s, this seems to solve it:
length(
regexp_replace(
regexp_replace(s, "[^^|]*\\^0\\^[^^|]*\\|?", ""),
"[^^|]*\\^[^^|]*\\^[^^|]*\\|?",
"1"
)
)
The first regexp_replace removes the parts with a zero count; the second regexp_replace replaces each remaining part with a single symbol (it need not be "1"; any symbol would do), so length returns the number of parts with a non-zero count.
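To check the two-step idea outside Hive, here is a minimal Python sketch that applies the same two patterns with `re.sub`. The function name `count_nonzero_entities` and the sample strings are made up for illustration; the patterns are copied from the answer above, without the extra backslash that a Hive string literal needs.

```python
import re


def count_nonzero_entities(s: str) -> int:
    """Count <id>^<count>^<distinct_count> entities whose count is not 0."""
    # Step 1: drop every entity whose middle field (count) is exactly 0.
    without_zero = re.sub(r"[^^|]*\^0\^[^^|]*\|?", "", s)
    # Step 2: collapse each remaining entity into a single marker character.
    markers = re.sub(r"[^^|]*\^[^^|]*\^[^^|]*\|?", "1", without_zero)
    # One marker per surviving entity, so the length is the metric.
    return len(markers)


if __name__ == "__main__":
    print(count_nonzero_entities("a^3^2|b^0^0|c^5^1"))
```

A count such as 10 is not removed in step 1, because the pattern requires the whole middle field to be the single character 0.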

Related

MySQL Query conditional find nth element in column string

I have a MySQL table setup where one column's values are a string of comma-separated True/False values (1s or 0s). For example, in the column, one field's value may be "0,1,0,0,0,0,1,1,0" and another may be "1,0,0,1,1,1,0,0,0" (note: these are NOT 9 separate columns, but a string in one column). I need to QUERY the MySQL table for elements that are "true"(1) for the "nth element" of that column's value/string.
So, if I were looking for rows where the 3rd element of that column's value is 1, the query should return the list of matching rows. In this case, I would only be searching for "1" in the 5th place (12345 = X,X,X...) of the string (X,X,1,X,X,X,X,X,X,X). How can I query this?
This is a crude example of what I am trying to do ...
"SELECT tfcolumn FROM mytable WHERE substr({tfcolumn}, 0, 5)=1"
{tfcolumn} represents the column value
5 represents the 5th position of the string
=1 represents what I need that position to equal to.
Please help. Thanks
You can't. Once you put a serialized data type into a column in SQL (like comma separated lists, or JSON objects) you are preventing yourself from performing any query on the data in those columns. You have to pull the data in a different way and then use a program like python, VB, etc to get the comma separated values you are looking for.
Unless you want to deal with trying to make this mess of a query work...
I would recommend changing your table structure before it's too late. Although it is possible, it is not optimized in a format that a DBMS recognizes. Because of that the DBMS will spend a significant amount of time going through every record to parse the csv values which is something that it was not meant to be doing. Doing the query in SQL will take as much time (if not more time) than just pulling all the records and searching with a tool that can do it properly.
If the column contains values exactly like the ones you posted, then the Nth element is at the 2 * N - 1 position in the comma separated list.
So do this:
SELECT tfcolumn
FROM tablename
WHERE substr(tfcolumn, 2 * 5 - 1, 1) = '1'
Replace 5 with the index that you search for.
See the demo.
Or remove all commas and get the Nth char:
SELECT tfcolumn
FROM tablename
WHERE substr(replace(tfcolumn, ',', ''), 5, 1) = '1'
See the demo.
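The two lookups above can be mirrored in Python to see why they agree. This is only a sketch of the arithmetic, not MySQL itself: the function names are hypothetical, and like the SQL it assumes every value is a single character followed by exactly one comma.

```python
def nth_flag_by_position(flags: str, n: int) -> str:
    """Element n (1-based) sits at 1-based character position 2*n - 1."""
    # Python strings are 0-based, hence the extra -1 on the index.
    return flags[2 * n - 1 - 1]


def nth_flag_by_stripping(flags: str, n: int) -> str:
    """Remove the commas first, then take the n-th character (1-based)."""
    return flags.replace(",", "")[n - 1]


if __name__ == "__main__":
    row = "0,1,0,0,0,0,1,1,0"
    print(nth_flag_by_position(row, 7), nth_flag_by_stripping(row, 7))
```

Both approaches stop working as soon as a value is longer than one character, which is where a delimiter-based approach is the safer choice.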
Try this
if substring_index(substring_index('0,1,0,0,0,0,1,1,0',',',3),',',-1)='1'
The first argument can be your column name. The second argument (',') tells the function that the string is comma-separated. The third argument takes the first 3 elements of the string. So, the output of inner substring_index is '0,1,0'.
The outer substring_index has -1 as its last argument, so it counts from the right and takes only 1 element, starting from the end of the string.
For example, if the value in a particular row is '2,682,7003,14,185', then the value of substring_index(substring_index('2,682,7003,14,185',',',3),',',-1) is '7003'.
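To make the nesting easier to follow, here is a small Python emulation of the behaviour described above. The helper `substring_index` below is a hypothetical stand-in written for this illustration, not a MySQL API; it assumes a non-empty delimiter.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Rough emulation of the SUBSTRING_INDEX behaviour described above."""
    if count == 0:
        return ""
    parts = s.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    # Negative count: everything to the right of the count-th delimiter,
    # counting from the end of the string.
    return delim.join(parts[count:])


def nth_element(s: str, n: int) -> str:
    """Inner call keeps the first n elements, outer call keeps the last one."""
    return substring_index(substring_index(s, ",", n), ",", -1)


if __name__ == "__main__":
    print(nth_element("2,682,7003,14,185", 3))
```

Unlike the fixed-position approach, this does not care how long each element is.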

Select statement returns data although given value in the where clause is false

I have a table in my MySQL db named membertable. The table consists of two fields, memberid and membername. The memberid field is an integer and uses auto_increment starting from 2001. The membername field is a varchar.
The membertable has two records with the same order as described above. The records look like this :
memberid : 2001
membername : john smith
memberid : 2002
membername : will smith
I found something weird when I ran a SELECT statement against the memberid field. Running the following statement :
SELECT * FROM `membertable` WHERE `memberid` = '2001somecharacter'
It returned the first data.
Why did that happen? There's no record with memberid = 2001somecharacter. It looks like MySQL only searches on the first 4 characters (2001) and, once it finds matching data (the row returned above), ignores the remaining characters.
How could this happen? And is there any way to turn off this behavior?
--
membertable uses innodb engine
This happens because MySQL tries to convert "2001somecharacter" into a number, which yields 2001.
Since you're comparing a number to a string, you should use
SELECT * FROM `membertable` WHERE CONVERT(`memberid`,CHAR) = '2001somecharacter';
to avoid this behavior.
Or, to do it properly, do NOT put your search value in quotes, so that it has to be a number (otherwise the query fails with a syntax error), and make sure in the front end that it is a number before passing it into the query.
What you found is expected MySQL behaviour.
MySQL converts a varchar to an integer starting from the beginning. As long as there are numeric characters which can be converted, they are included in the conversion. At the first letter, the conversion stops and returns the integer value of the numeric string read so far.
This behaviour is described on the MySQL documentation site. Unfortunately, it's not stated directly in the text, but there is an example that shows exactly this behaviour.
MySQL is very liberal in converting string values to numeric values when evaluated in numeric context.
As a demonstration, adding 0 causes the string to be evaluated in a numeric context:
SELECT '2001foo' + 0 --> 2001
, '01.2-3E' + 0 --> 1.2
, 'abc567g' + 0 --> 0
When a string is evaluated in a numeric context, MySQL reads the string character by character, until it encounters a character where the string can no longer be interpreted as a numeric value, or until it reaches the end of the string.
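The character-by-character rule can be approximated in Python with a regular expression that grabs the longest numeric-looking prefix. This is only a sketch of the behaviour described above, not MySQL's actual parser: the helper name `numeric_prefix` is made up, and skipping leading whitespace is an assumption.

```python
import re

# Optional sign, digits with an optional fraction, optional exponent.
_NUMERIC_PREFIX = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def numeric_prefix(value: str) -> float:
    """Return the number formed by the leading numeric part, or 0.0 if none."""
    match = _NUMERIC_PREFIX.match(value.lstrip())
    return float(match.group()) if match else 0.0


if __name__ == "__main__":
    for text in ("2001foo", "01.2-3E", "abc567g"):
        print(repr(text), "->", numeric_prefix(text))
```

The three sample strings are the ones from the demonstration above; the digits in 'abc567g' are ignored because the very first character is already non-numeric.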
I don't know of a way to "turn off" or disable this behavior. (There may be a setting of sql_mode that changes it, but that change would likely affect other SQL statements that currently work and might stop working once it is made.)
Typically, this kind of check of the arguments is done in the application.
But if you need to do this in the SELECT statement, one option would be cast/convert the column as a character string, and then do the comparison.
But that can have some significant performance consequences. If we do a cast or convert (or any function) on a column that's in a condition in the WHERE clause, MySQL will not be able to use a range scan operation on a suitable index. We're forcing MySQL to perform the cast/convert operation on every row in the table, and compare the result to the literal.
So, that's not the best pattern.
If I needed to perform a check like that within the SQL statement, I would do something like this:
WHERE t.memberid = '2001foo' + 0
AND CAST('2001foo' + 0 AS CHAR) = '2001foo'
The first line is doing the same thing as the current query. And that can take advantage of a suitable index.
The second condition is converting the same value to a numeric, then casting that back to character, and then comparing the result to the original. With the values shown here, it will evaluate to FALSE, and the query will not return any rows.
This will also not return a row if the string value has a leading space, ' 2001'. The second condition is going to evaluate as FALSE.
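The round-trip idea behind the two conditions can be sketched in Python as follows. The function name `strict_id_match` is hypothetical, and the sketch is simplified to integer ids: it converts the leading digits the way the first condition would, then rejects the input unless converting back to text reproduces it exactly.

```python
import re


def strict_id_match(member_id: int, raw: str) -> bool:
    """True only if raw is exactly the text form of member_id."""
    # Lenient conversion: take the leading integer, or 0 if there is none.
    prefix = re.match(r"\s*([+-]?\d+)", raw)
    converted = int(prefix.group(1)) if prefix else 0
    # Condition 1: numeric comparison (what the original query already did).
    # Condition 2: the round trip back to text must equal the input.
    return member_id == converted and str(converted) == raw


if __name__ == "__main__":
    for raw in ("2001", "2001foo", " 2001"):
        print(repr(raw), strict_id_match(2001, raw))
```

Only the first condition involves the stored id, which is why the SQL version can still use an index on the column.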
When comparing an INT to a 'string', the string is converted to a number.
Converting a string to a number takes as many of the leading characters as it can and still be a number. So '2001character' is treated as the number 2001.
If you want non-numeric characters in member_id, make it VARCHAR.
If you want only numeric ids, then reject '200.1character'

MySQL REPLACE string with regex

I have a table with about 50,000 records. One of the fields is an "imploded" field consisting of a variable number of parameters, from 1 to 800. I need to replace all parameters with 0.
Example:
1 parameter 3.45 should become 0.00
2 parameters 2.27^11.03 should become 0.00^0.00
3 parameters 809.11^0.12^3334.25 should become 0.00^0.00^0.00
and so on.
Really I need to replace anything between ^ with 0.00 ( for 1 parameter it should be just 0.00 without ^).
Or I need somehow count number of ^, generate string like 0.00^0.00^0.00 ... and replace it. The only tool available is MySqlWorkbench.
I would appreciate any help.
There is no regex replace capability built in to MySQL before version 8.0 (REGEXP_REPLACE() was added in MySQL 8.0).
You can, however, accomplish your purpose by doing what you suggested -- counting the number of ^ and crafting a string of replacement values, with this:
TRIM(TRAILING '^' FROM REPEAT('0.00^',(LENGTH(column) - LENGTH(REPLACE(column,'^','')) + 1)));
From inside to outside, we calculate the number of values by counting the number of delimiters, and adding 1 to that count. We count the delimiters by comparing the length of the original string, against the length of the same string with the delimiters stripped out using REPLACE(...,'^','') to replace every ^ with nothing.
The REPEAT() function builds a string by repeating a string expression n number of times.
This results in a spurious ^ at the end of the string, which we remove easily enough with TRIM(TRAILING '^' FROM ...).
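The same count-repeat-trim logic can be checked quickly in Python. This is a sketch of the steps described above, with a made-up function name, and it assumes the field always holds at least one parameter.

```python
def zero_out(params: str) -> str:
    """Replace every ^-separated value with 0.00, keeping the value count."""
    # Number of values = number of delimiters + 1, where the delimiters are
    # counted by comparing lengths before and after stripping them out.
    value_count = len(params) - len(params.replace("^", "")) + 1
    # Repeating "0.00^" leaves one spurious trailing ^, which is trimmed.
    return ("0.00^" * value_count).rstrip("^")


if __name__ == "__main__":
    for sample in ("3.45", "2.27^11.03", "809.11^0.12^3334.25"):
        print(sample, "->", zero_out(sample))
```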
First run SELECT t1.*, ... the expression above ... FROM table_name t1 against your table to verify the results of this logic (replacing column with the actual name of the column). Once you are confident in the logic, you can run UPDATE table SET column = ... to modify the values.
Note, of course, that this is indicative of a problematic database design. Each column should contain a single atomic value, not a "list" of values, as this question seems to suggest.

SQL Query Counting Instances of a Word in a Record

I want to find out how many times a word occurs in a single row.
For Example:
I have a table sentences with only one column, called words, which is a string data type. The table has only one row, with the value "The mans interest in raising the flag flagged."
I want to get the number of times 'the' occurs which is 2
And if I want to get the number of times 'flag' appears it would be 2
There is no built-in MySQL function that counts occurrences of a substring in a string, but you can compare the length of the string with the length of the same string after replacing your word with an empty string, since REPLACE() replaces every occurrence.
SELECT
(CHAR_LENGTH(sentence)-CHAR_LENGTH(REPLACE(LOWER(sentence),'the','')))/CHAR_LENGTH('the')
AS occurences
FROM yourtable;
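The length-difference trick is easy to verify in Python. A minimal sketch, with a hypothetical function name; like the query it counts substring matches, so 'flag' is also counted inside 'flagged'.

```python
def count_occurrences(text: str, word: str) -> int:
    """Count case-insensitive, non-overlapping occurrences of word in text."""
    lowered = text.lower()
    needle = word.lower()
    # Each removed occurrence shortens the string by len(needle) characters.
    removed_chars = len(lowered) - len(lowered.replace(needle, ""))
    return removed_chars // len(needle)


if __name__ == "__main__":
    sentence = "The mans interest in raising the flag flagged."
    print(count_occurrences(sentence, "the"), count_occurrences(sentence, "flag"))
```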

Finding number of occurence of a specific string in MYSQL

Consider the string "55,33,255,66,55"
I am looking for a way to count the number of occurrences of a specific value ("55" in this case) in this string using a MySQL SELECT query.
Currently i am using the below logic to count
select CAST((LENGTH("55,33,255,66,55") - LENGTH(REPLACE("55,33,255,66,55", "55", ""))) / LENGTH("55") AS UNSIGNED)
But the issue with this is that it counts every occurrence of 55, including the one inside 255, so the result is 3,
but the desired output is 2.
Is there any way I can make this work correctly? Please suggest.
NOTE : "55" is the input we are giving and consider the value "55,33,255,66,55" is from a database field.
Regards,
Balan
You want to match on ',55,', but there's the first and last position to worry about. You can use the trick of adding commas to the front and back of the input to get around that:
select LENGTH('55,33,255,66,55') + 2 -
LENGTH(REPLACE(CONCAT(',', '55,33,255,66,55', ','), ',55,', 'xxx'))
Returns 2
I've used CONCAT to pre- and post-pend the commas (rather than adding a literal into the text) because I assume you'll be using this on a column not a literal.
Note also these improvements:
Removal of the cast - it is already numeric
By replacing with a string one less in length (ie ',55,' length 4 to 'xxx' length 3), the result doesn't need to be divided - it's already the correct result
2 is added to the length because of the two commas added front and back (no need to use CONCAT to calculate the pre-replace length)
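A Python sketch of the same padding trick is given below, next to a split-based count for comparison. Both function names are invented for this illustration. One caveat that the sample data does not trigger: a single replace pass cannot reuse the comma shared by two adjacent matches, so for an input such as "55,55" the padded version reports 1, while splitting reports 2.

```python
def count_token_padded(csv: str, token: str) -> int:
    """Length-difference count of ,token, inside the comma-padded string."""
    padded = "," + csv + ","
    # The replacement is one character shorter than ",token,", so every
    # match shortens the padded string by exactly 1.
    replaced = padded.replace("," + token + ",", "x" * (len(token) + 1))
    return len(csv) + 2 - len(replaced)


def count_token_split(csv: str, token: str) -> int:
    """Reference count: split on commas and compare whole elements."""
    return csv.split(",").count(token)


if __name__ == "__main__":
    data = "55,33,255,66,55"
    print(count_token_padded(data, "55"), count_token_split(data, "55"))
```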
Try this:
select CAST((LENGTH("55,33,255,66,55") + 2 - LENGTH(REPLACE(concat(",","55,33,255,66,55",","), ",55,", ",,"))) / LENGTH("55") AS UNSIGNED)
I would use a subselect: in it, I would replace every 255 with some other unique token, and then count the new tokens and the remaining 55s.
For example:
If(row = '255') then '1337'