contains function for String in presto Athena - mysql

I have a table with ORC Serde in Athena. The table contains a string column named greeting_message. It can contain null values as well. I want to find how many rows in the table have a particular text as the pattern.
Let's say my sample data looks like below:
|greeting_message |
|-----------------|
|hello world |
|What's up |
| |
|hello Sam |
| |
|hello Ram |
|good morning, hello |
| |
|the above row has null |
| Good morning Sir |
The table above has 10 rows in total: 7 of them have non-null values and 3 of them are null/empty.
I want to know what percentage of rows contain a specific word.
For example, consider the word hello. It is present in 4 rows, so the percentage of such rows is 4/10 which is 40 %.
Another example: the word morning is present in 2 messages. So the percentage of such rows is 2/10 which is 20 %.
Note that I am considering null also in the count of the denominator.

SELECT SUM(greeting_message LIKE '%hello%') / COUNT(*) AS hello_percentage,
SUM(greeting_message LIKE '%morning%') / COUNT(*) AS morning_percentage
FROM tablename

The syntax of PrestoDB (the Amazon Athena engine) is different from MySQL's. The following example defines a common table expression (WITH greetings AS ...) holding the sample data and then SELECTs from it:
WITH greetings AS
(SELECT 'hello world' as greeting_message UNION ALL
SELECT 'Whats up' UNION ALL
SELECT '' UNION ALL
SELECT 'hello Sam' UNION ALL
SELECT '' UNION ALL
SELECT 'hello Ram' UNION ALL
SELECT 'good morning, hello' UNION ALL
SELECT '' UNION ALL
SELECT 'the above row has null' UNION ALL
SELECT 'Good morning Sir')
SELECT count_if(regexp_like(greeting_message, '.*hello.*')) / cast(COUNT(1) as real) AS hello_percentage,
count_if(regexp_like(greeting_message, '.*morning.*')) / cast(COUNT(1) as real) AS morning_percentage
FROM greetings
will give the following results:
|hello_percentage |morning_percentage |
|-----------------|-------------------|
|0.4 |0.2 |
The regexp_like function supports the usual regex syntax, including whitespace classes (\s), so it can cover other string matching requirements as well.
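To make the arithmetic explicit, here is a minimal Python sketch of the same calculation, not Athena code. It assumes the empty rows are real NULLs (represented as None), whereas the CTE above uses empty strings; the function name share_matching is made up for illustration. The point it shows is that NULL rows never match but still count in the denominator.

```python
import re

# Sample data from the question; None stands in for the NULL rows
# (assumption: the CTE above models them as empty strings instead).
greetings = [
    "hello world", "What's up", None, "hello Sam", None,
    "hello Ram", "good morning, hello", None,
    "the above row has null", "Good morning Sir",
]

def share_matching(rows, pattern):
    """Fraction of ALL rows (NULLs included) whose text contains the pattern."""
    if not rows:
        return 0.0
    # re.search looks for the pattern anywhere in the string,
    # so no leading or trailing '.*' is needed here.
    hits = sum(1 for r in rows if r is not None and re.search(pattern, r))
    return hits / len(rows)

print(share_matching(greetings, "hello"))    # 4 matching rows out of 10
print(share_matching(greetings, "morning"))  # 2 matching rows out of 10
```

Multiply the result by 100 if you want a percentage rather than a fraction.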

Related

MySQL - how to get count of a single item frequency in a table of CSV values

I have a mysql table called "projects" with a single field containing CSV lists of project Ids. Assume that I cannot change the table structure.
I need a query that will allow me to quickly retrieve a count of rows that contain a particular project id, for example:
select count(*) from projects where '4' in (project_ids);
This returns just 1 result, which is incorrect (should be 3 results), but I think that it illustrates what I'm attempting to do.
CREATE TABLE `projects` (
`project_ids` varchar(255) DEFAULT NULL
);
INSERT INTO `projects` (`project_ids`)
VALUES
('1,2,4'),
('1,2'),
('4'),
('4,5,2'),
('1,2,5');
I was hoping that there might be a simple MySQL function that would achieve this, so that I don't have to do anything complex SQL-wise.
You could use this approach:
SELECT COUNT(*)
FROM projects
WHERE CONCAT(',', project_ids, ',') LIKE '%,4,%';
Or use FIND_IN_SET for a built-in way:
SELECT COUNT(*)
FROM projects
WHERE FIND_IN_SET('4', project_ids) > 0;
But, as Gordon's comment alludes to, a much better table design would be a junction table which relates a primary key in one table to all projects in another table. Based on your sample data, that junction table would look like this:
PK | project_id
1 | 1
1 | 2
1 | 4
2 | 1
2 | 2
3 | 4
4 | 4
4 | 5
4 | 2
5 | 1
5 | 2
5 | 5
With this design, if you wanted to find the count of PK's having a project_id of 4, you would only need a much simpler (and sargable) query:
SELECT COUNT(*)
FROM junction_table
WHERE project_id = 4;
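As a rough illustration of the junction-table design, the sketch below loads the sample CSV rows into an in-memory SQLite database (used here only as a stand-in for MySQL, which is an assumption of this example) and runs the simple count. The index name idx_project_id is hypothetical.

```python
import sqlite3

# Sample data from the question: one CSV string per original row.
csv_rows = ["1,2,4", "1,2", "4", "4,5,2", "1,2,5"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE junction_table (pk INTEGER, project_id INTEGER)")
# An index on project_id is what lets the equality lookup avoid a full scan.
conn.execute("CREATE INDEX idx_project_id ON junction_table (project_id)")

# Split every CSV list into one (pk, project_id) row per id.
for pk, ids in enumerate(csv_rows, start=1):
    conn.executemany(
        "INSERT INTO junction_table (pk, project_id) VALUES (?, ?)",
        [(pk, int(p)) for p in ids.split(",")],
    )

count = conn.execute(
    "SELECT COUNT(*) FROM junction_table WHERE project_id = 4"
).fetchone()[0]
print(count)  # 3
```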
You would need to use a LIKE condition as follows:
select count(*)
from projects
where concat(',',project_ids,',') like '%,4,%';
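The reason for wrapping both sides in commas is easier to see outside SQL. This is a small Python sketch of the same idea (the helper name contains_id and the extra test values are made up for illustration): the commas stop the id 4 from matching inside 14 or 40.

```python
def contains_id(project_ids, wanted):
    """Mimic CONCAT(',', project_ids, ',') LIKE '%,<wanted>,%'."""
    if project_ids is None:
        # A NULL column never matches, as in SQL.
        return False
    # Wrapping both sides in commas makes every id, including the first
    # and last one in the list, look like ',id,'.
    return f",{wanted}," in f",{project_ids},"

rows = ["1,2,4", "1,2", "4", "4,5,2", "1,2,5"]
matches = sum(contains_id(r, "4") for r in rows)
print(matches)  # 3

# Without the wrapping commas, '4' would also be found inside '14' or '40'.
print(contains_id("14,40", "4"))  # False
```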

How to extract strings occurring after a certain character in MySQL?

If I have a string:
'#name#user#user2#laugh#cry'
I would like to print,
name
user
user2
laugh
cry
All the strings are different and have a different number of '#'.
I have tried using Regex but it's not working. What logic has to be applied for this query?
The first thing to say is that storing delimited lists of values in text columns is, in many ways, not a good database design. You should basically rework your database structure, or prepare for a potential world of pain.
A quick and dirty solution is to use a numbers table, or an inline subquery, and to cross join it with the table; REGEXP_SUBSTR() (available in MySQL 8.0) lets you select a given occurrence of a particular pattern.
Here is a query that will extract up to 10 values from the column:
SELECT
REGEXP_SUBSTR(t.val, '[^#]+', 1, numbers.n) name
FROM
mytable t
INNER JOIN (
SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7
UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10
) numbers
ON REGEXP_SUBSTR(t.val, '[^#]+', 1, numbers.n) IS NOT NULL
Regexp [^#]+ means: as many consecutive characters as possible other than #.
This demo on DB Fiddle, when given input string '#name#user#user2#laugh#cry', returns:
| name |
| ----- |
| name |
| user |
| user2 |
| laugh |
| cry |
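For comparison, the same pattern can be tried outside the database. This is only a Python sketch of what the query does, not MySQL code; the function name split_tokens and the max_parts parameter (mirroring the 10-row numbers subquery) are assumptions for illustration.

```python
import re

def split_tokens(val, max_parts=10):
    """Return each run of non-'#' characters, like the n-th REGEXP_SUBSTR match."""
    # findall returns every non-overlapping match of [^#]+ in order,
    # which corresponds to occurrence 1, 2, 3, ... in the SQL query.
    return re.findall(r"[^#]+", val)[:max_parts]

tokens = split_tokens("#name#user#user2#laugh#cry")
print(tokens)  # ['name', 'user', 'user2', 'laugh', 'cry']
```

As in the SQL version, anything beyond max_parts values is silently dropped, so the numbers list has to be at least as long as the longest string you expect.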

SQL query to select columns with exact like?

Consider this SQL table
id | name | numbers
------------------------
1 | bob | 1 3 5
2 | joe | 7 2 15
This query returns the whole table as its result:
SELECT * FROM table WHERE numbers LIKE '%5%'
Is there an SQL operator so that it only returns row 1 (only columns with the number 5)?
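Use REGEXP with word boundaries (but you should ideally follow Gordon's comment). Note that the [[:<:]] and [[:>:]] markers below work in MySQL versions before 8.0.4; later versions use the ICU regex library, where a word boundary is written \\b instead.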
where numbers REGEXP '[[:<:]]5[[:>:]]'
It's a pity that you are not using the comma as a separator in your numbers column, because then you could use the FIND_IN_SET function directly. You can still use it together with REPLACE, like this:
SELECT * FROM table WHERE FIND_IN_SET(5, REPLACE(numbers, ' ', ','));
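To see why LIKE '%5%' returns both rows while the other two approaches return only row 1, here is a small Python sketch of the three matching rules applied to the sample table. It is an illustration only; the helper names are made up.

```python
import re

# Sample table from the question: (id, name, numbers)
rows = [(1, "bob", "1 3 5"), (2, "joe", "7 2 15")]

def like_match(numbers, wanted):
    # LIKE '%5%' is a plain substring test, so '15' matches too.
    return wanted in numbers

def word_match(numbers, wanted):
    # Word boundaries: the value must not be glued to other digits.
    return re.search(rf"\b{re.escape(wanted)}\b", numbers) is not None

def find_in_set(numbers, wanted):
    # FIND_IN_SET(5, REPLACE(numbers, ' ', ',')): exact match on one list item.
    return wanted in numbers.replace(" ", ",").split(",")

like_ids = [r[0] for r in rows if like_match(r[2], "5")]
word_ids = [r[0] for r in rows if word_match(r[2], "5")]
set_ids = [r[0] for r in rows if find_in_set(r[2], "5")]
print(like_ids, word_ids, set_ids)  # [1, 2] [1] [1]
```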

Generating "Fake" Records Within A Query

I have a very basic statement, e.g.:
SELECT pet, animal_type, number_of_legs
FROM table
However, where table currently is, I want to insert some fake data, along the lines of:
rufus cat 3
franklin turtle 1
norm dog 5
Is it possible to "generate" these fake records, associating each value with the corresponding field, from within a query so that they are returned as the result of the query?
SELECT pet, animal_type, number_of_legs FROM table
union select 'rufus', 'cat', 3
union select 'franklin', 'turtle', 1
union select 'norm', 'dog', 5
This gives you the content of table plus the 3 records you want, avoiding duplicates. If duplicates are OK, replace union with union all.
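The difference between union and union all only shows up when a fake row duplicates a real one. Here is a quick sketch using an in-memory SQLite database as a stand-in (an assumption; the table name pets and its single real row are made up so that the duplicate exists):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pets (pet TEXT, animal_type TEXT, number_of_legs INTEGER)"
)
# Hypothetical real row that happens to equal one of the fake rows.
conn.execute("INSERT INTO pets VALUES ('rufus', 'cat', 3)")

template = (
    "SELECT pet, animal_type, number_of_legs FROM pets "
    "{op} SELECT 'rufus', 'cat', 3 "
    "{op} SELECT 'franklin', 'turtle', 1 "
    "{op} SELECT 'norm', 'dog', 5"
)

union_rows = conn.execute(template.format(op="UNION")).fetchall()
union_all_rows = conn.execute(template.format(op="UNION ALL")).fetchall()

print(len(union_rows))      # 3: the duplicate 'rufus' row is collapsed
print(len(union_all_rows))  # 4: every row is kept
```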
edit: per your comment, for tsql, you can do:
select top 110 'franklin', 'turtle', 1
from sysobjects a, sysobjects b -- this cross join gives n^2 records
Be sure to choose a table where n^2 is greater than the number of records needed, or cross join again and again.
I'm not entirely sure what you're trying to do, but MySQL is perfectly capable of selecting "mock" data and printing it in a table:
SELECT "Rufus" AS "Name", "Cat" as "Animal", "3" as "Number of Legs"
UNION
SELECT "Franklin", "Turtle", "1"
UNION
SELECT "Norm", "Dog", "5";
Which would result in:
+----------+--------+----------------+
| Name | Animal | Number of Legs |
+----------+--------+----------------+
| Rufus | Cat | 3 |
| Franklin | Turtle | 1 |
| Norm | Dog | 5 |
+----------+--------+----------------+
Doing this query this way prevents actually having to save information in a temporary table, but I'm not sure if it's the correct way of doing things.

sql find rows partially matching a string

I want to find the rows in a table whose value is contained in a given string.
For example, I have these rows in a column named 'atext' in a table named 'testing':
test
a
cool
another
Now I want to select the rows that hold a word from the string 'this is a test', using this SQL:
select * from testing where instr(atext, 'this is a test') >0;
but this is not selecting any row.
Reverse the arguments to INSTR.
WHERE INSTR('this is a test', atext)
With a full-text index:
select * from anti_spam where match (atext) against ("this is a test" in boolean mode);
This is a 'reversed' like:
select * from testing where 'this is a test' LIKE CONCAT('%',atext,'%');
It can be slow on tables having a lot of records.
This returns the rows where the value of the atext column can be found in the given string
(for example, it matches when atext = 'is a t', because that can be found in the given string).
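In plain Python terms, the reversed LIKE is just a substring test with the roles swapped. Here is a tiny sketch (the value 'is a t' is added to the question's sample data to show the partial-word match mentioned above):

```python
haystack = "this is a test"

# Column values from the question, plus 'is a t' from the remark above.
atext_values = ["test", "a", "cool", "another", "is a t"]

# 'this is a test' LIKE CONCAT('%', atext, '%')  <=>  atext is a substring
matched = [v for v in atext_values if v in haystack]
print(matched)  # ['test', 'a', 'is a t']
```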
Or you can write a regex.
select * from testing where atext REGEXP '^(this|is|a|test)$';
This matches all rows that contain exactly one of the specified words.
In your scripting or programming language, you only need to replace the spaces with |, wrap the result in ^( and )$, and use REGEXP instead of an equality comparison.
("this is a test" -> ^(this|is|a|test)$ )
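Building that pattern could look roughly like this in Python. It is a sketch, and the function name words_pattern is made up; re.escape is added as a precaution in case a word contains regex metacharacters. The last lines show why the parentheses matter: without them, ^ and $ only apply to the first and last alternative.

```python
import re

def words_pattern(sentence):
    """Turn 'this is a test' into '^(this|is|a|test)$'."""
    words = [re.escape(w) for w in sentence.split()]
    return "^(" + "|".join(words) + ")$"

pattern = words_pattern("this is a test")
print(pattern)  # ^(this|is|a|test)$

atext_values = ["test", "a", "cool", "another"]
exact = [v for v in atext_values if re.search(pattern, v)]
print(exact)  # ['test', 'a']

# Without the group, the bare alternative 'a' matches anywhere in the value.
loose = [v for v in atext_values if re.search("^this|is|a|test$", v)]
print(loose)  # ['test', 'a', 'another']
```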
If you have a lot of records in the table, these queries can be slow, because the SQL engine does not use indexes in REGEXP queries.
So if you have a lot of rows in your table and no more than 4 000 000 words, I recommend making an indexing table. Example:
originalTable:
tid | atext (text)
1 | this is
2 | a word
3 | a this
4 | this word
5 | a is
....
indexTable:
wid | word (varchar)
1 | this
2 | is
3 | a
4 | word
switchTable:
tid | wid
1 | 1
1 | 2
2 | 3
2 | 4
3 | 1
3 | 3
...
You should put indexes on the tid, wid and word fields.
Then the query is:
SELECT o.*
FROM originalTable as o
JOIN switchTable as s ON o.tid = s.tid
JOIN indexTable as i on i.wid=s.wid
WHERE i.word = 'this' or i.word='is' or i.word='a' or i.word='test'
This query can be much faster if your originalTable has a lot of records, because here the SQL engine can do indexed searches. But there is a bit more work on insert: when you insert a row into the original table, you must also insert rows into the other two tables.
Which of the 3 queries runs fastest depends on the size of your table, and on whether you want to optimize for insertions or for selections (the ratio between insert/update and select queries).
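To give an idea of the extra work on insert, here is a minimal sketch of the three-table layout using an in-memory SQLite database as a stand-in for MySQL (an assumption, as are the helper names insert_text and rows_with_any; INSERT OR IGNORE is SQLite syntax). It also adds DISTINCT to the lookup, since a row that contains several of the searched words would otherwise come back once per word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE originalTable (tid INTEGER PRIMARY KEY, atext TEXT);
    CREATE TABLE indexTable (wid INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE switchTable (tid INTEGER, wid INTEGER, PRIMARY KEY (tid, wid));
""")

def insert_text(tid, atext):
    """Insert a row and keep the word index in sync (the extra work on insert)."""
    conn.execute(
        "INSERT INTO originalTable (tid, atext) VALUES (?, ?)", (tid, atext)
    )
    for word in set(atext.split()):
        # Add the word only if it is not indexed yet, then look up its id.
        conn.execute("INSERT OR IGNORE INTO indexTable (word) VALUES (?)", (word,))
        wid = conn.execute(
            "SELECT wid FROM indexTable WHERE word = ?", (word,)
        ).fetchone()[0]
        conn.execute("INSERT INTO switchTable (tid, wid) VALUES (?, ?)", (tid, wid))

sample = ["this is", "a word", "a this", "this word", "a is"]
for tid, atext in enumerate(sample, start=1):
    insert_text(tid, atext)

def rows_with_any(words):
    """Rows of originalTable containing at least one of the given words."""
    marks = ",".join("?" * len(words))
    sql = f"""
        SELECT DISTINCT o.tid, o.atext
        FROM originalTable AS o
        JOIN switchTable AS s ON o.tid = s.tid
        JOIN indexTable AS i ON i.wid = s.wid
        WHERE i.word IN ({marks})
        ORDER BY o.tid
    """
    return conn.execute(sql, list(words)).fetchall()

print(rows_with_any(["word"]))  # [(2, 'a word'), (4, 'this word')]
print(len(rows_with_any(["this", "is", "a", "test"])))  # 5
```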