Concatenate DISTINCT string values in Pentaho Data Integration - MySQL

I am new to Pentaho Data Integration. How can I concatenate distinct string values?
bse_id values
100 A1
100 A1
100 A2
150 A1
150 B1
150 C1
150 C1
Output should be
bse_id values
100 A1,A2
150 A1,B1,C1
In MySQL, I can use
select bse_id,group_concat(distinct values) from table group by 1;
In Spoon, I have tried the Group by step and the Memory Group by step.
Both produce duplicate values.
I'm getting this output:
bse_id values
100 A1,A1,A2
150 A1,B1,C1,C1
Please help me in removing the duplicates.

You can do this easily with a Group by step. Make sure the input to the step is sorted on the bse_id field, then select values as the subject of an aggregate field and set the type to 'Concatenate strings separated by ,'. That should give you what you want.

You need two Group by steps. Try the following three steps after the input:
Step 1: Sort rows on BOTH 'bse_id' and 'values'
Step 2: Group by BOTH 'bse_id' and 'values' (no aggregation here); this removes the duplicates
Step 3: Group by 'bse_id'; aggregate 'values' with type "Concatenate strings separated by ,"
Output is:
bse_id; values
100; A1, A2
150; A1, B1, C1
This should work fine.
Bye
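For readers who want to see why the three steps remove the duplicates, here is a minimal Python sketch of the same sort, de-duplicate, concatenate sequence. It is not Pentaho code; the function name concat_distinct is made up for illustration, and the rows are the sample data from the question.

```python
from itertools import groupby
from operator import itemgetter

def concat_distinct(rows, sep=","):
    """rows: iterable of (bse_id, value) pairs (hypothetical helper, not a Pentaho API)."""
    # Step 1: sort on both fields, as the Sort rows step would.
    ordered = sorted(rows)
    # Step 2: group on both fields with no aggregate; each group collapses
    # to a single (bse_id, value) pair, which drops the duplicates.
    unique = [pair for pair, _ in groupby(ordered)]
    # Step 3: group on bse_id alone and join the surviving values.
    return [
        (bse_id, sep.join(value for _, value in group))
        for bse_id, group in groupby(unique, key=itemgetter(0))
    ]

rows = [(100, "A1"), (100, "A1"), (100, "A2"),
        (150, "A1"), (150, "B1"), (150, "C1"), (150, "C1")]
result = concat_distinct(rows)
print(result)  # [(100, 'A1,A2'), (150, 'A1,B1,C1')]
```

The sort matters: like the Group by step, groupby only merges rows that are adjacent.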

Related

How to properly format overlapping MySQL IN and NOT IN conditions

I have the following MySQL table:
data
1
2
3
4
5
6
7
8
9
I would like to supply my select statement with two separate lists
Exclude List:
1,4,5,7
Include List:
1,2,3,4,5,6,7
I tried the following statement:
Select * FROM table WHERE data NOT IN ('1,4,5,7') AND data IN ('1,2,3,4,5,6,7')
Expecting the following output:
data
2
3
6
But I received no results. I realize I passed an impossible condition but I don't know how to format my query to return the expected results.
Can anyone tell me what I'm doing wrong here?
IN takes a list of values, not a string that holds a delimited list of values.
Examples:
x IN (1, 2, 3)
x IN ('a', 'b', 'c')
Use IN (1,2,3) and not IN ('1,2,3') as the former compares to individual values 1, 2 and 3 while the latter is against the literal string 1,2,3.
SELECT * FROM (SELECT * FROM `table` WHERE data NOT IN (1,4,5,7)) AS t WHERE data IN (1,2,3,4,5,6,7);
Try again with this.
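To see the difference between a value list and a single delimited string, here is a small sketch that runs both forms of the query. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example is self-contained), and the table name numbers is made up. MySQL converts the string differently, but the string form does not return the expected rows in either engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (data INTEGER)")  # hypothetical table name
conn.executemany("INSERT INTO numbers VALUES (?)", [(n,) for n in range(1, 10)])

def fetch(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]

# Each element is its own value, so both conditions behave as set membership.
as_lists = fetch(
    "SELECT data FROM numbers "
    "WHERE data NOT IN (1,4,5,7) AND data IN (1,2,3,4,5,6,7) "
    "ORDER BY data"
)

# Each IN receives exactly one value: the whole string, commas included.
as_strings = fetch(
    "SELECT data FROM numbers "
    "WHERE data NOT IN ('1,4,5,7') AND data IN ('1,2,3,4,5,6,7') "
    "ORDER BY data"
)

print(as_lists)    # [2, 3, 6]
print(as_strings)  # []
```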

SQL: count the number of rows in a specific range

There are two tables in my database:
Table A
Column_A1 column_A2
A1 10
A2 20
A3 30
Table B
Column_B1 column_B2
B1 11
B2 21
B3 31
B4 29
B5 30.5
I want to calculate how many rows of table B match the following condition:
range:
A1±1,
A2±1,
A3±1,
...
For example:
B1∈[A1-1,A1+1]
Count these rows; the returned value is 1.
B2∈[A2-1,A2+1]
Count these rows; the returned value is 1.
B3∈[A3-1,A3+1]
B4∈[A3-1,A3+1]
B5∈[A3-1,A3+1]
Count these rows; the returned value is 3.
Result should be like this:
Column_A1 column_A2 num_match
A1 10 1
A2 20 1
A3 30 3
It's easy to do this with a loop in other programming languages, but what's the simplest way to do it in SQL? Thanks.
I would do this with a correlated subquery:
select a.*,
(select count(*)
from b
where b.column_b2 between a.column_a2 - 1 and a.column_a2 + 1
) as num_match
from a;
Note: BETWEEN is inclusive, so the bounds are part of the range. If that is not the intention, use explicit < and > comparisons instead.
Many databases would be able to take advantage of an index on b(column_b2) for this query. You can test on MySQL to see if this is the case.
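The correlated subquery can be tried against the sample data from the question. This is a sketch that uses Python's built-in sqlite3 module in place of MySQL (an assumption for the sake of a self-contained example); the column types are guesses, since the question does not state them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (column_a1 TEXT, column_a2 INTEGER);  -- types are assumed
    CREATE TABLE b (column_b1 TEXT, column_b2 REAL);
    INSERT INTO a VALUES ('A1', 10), ('A2', 20), ('A3', 30);
    INSERT INTO b VALUES ('B1', 11), ('B2', 21), ('B3', 31), ('B4', 29), ('B5', 30.5);
""")

# For every row of a, count the rows of b that fall within +/- 1 of column_a2.
matches = conn.execute("""
    SELECT a.column_a1,
           a.column_a2,
           (SELECT COUNT(*)
            FROM b
            WHERE b.column_b2 BETWEEN a.column_a2 - 1 AND a.column_a2 + 1
           ) AS num_match
    FROM a
    ORDER BY a.column_a1
""").fetchall()

print(matches)  # [('A1', 10, 1), ('A2', 20, 1), ('A3', 30, 3)]
```

Because the count is computed per row of a, a row with no matches in b would still appear, with num_match equal to 0.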
You can use a GROUP BY statement and filter on inequalities:
SELECT Column_A1,Column_A2,COUNT(*)
FROM A JOIN B ON column_A2-1 <= column_B2 AND column_B2 <= column_A2+1
GROUP BY Column_A1,Column_A2
A simple query that matches with the way the OP expressed the goal of the statement:
SELECT
a.`Column_A1`,
COUNT(*) as `NumMatch`
FROM `Table_A` a
JOIN `Table_B` b
ON b.`column_b2` BETWEEN a.`column_A2` - 1 AND a.`column_A2` + 1
GROUP BY a.`Column_A1`;

Group by values ignoring commas

I have a table in which most rows have the name of a single country assigned. Unfortunately, at some point the “country” field held multiple values separated by commas. Most of the rows now have a single value, but there are residual commas left in some of the fields. For instance, some rows that pertain to Afghanistan have “Afghanistan” and some have “,Afghanistan”. My current SELECT query treats those values as two separate groups. I am not allowed to fiddle with the database to get rid of the commas.
What do I do to make my SELECT query disregard the commas and group the country values together? As an added complication, there are a few rows that have multiple country values, which, again, I can’t edit. Ideally I would like to exclude those entirely from the SELECT query (as well as rows that have a negative value in another field).
Example data of what my current query gives me:
,Afghanistan 66
,Albania 1
,Angola 25
,Bangladesh 2225
,Bolivia 824
,Bosnia 1
,Bosnia And Herzegovina 291
,Bosnia and Herzogovina 181
,France, Germany 1
Afghanistan 32
Albania 3
Bangladesh 132
Bolivia 295
Bosnia and Herzegovina 79
Botswana 2
Here is my query:
/* Group by country and count instances selecting the resources has a positive number in the ref ID */
SELECT field3 "Country", COUNT(field3) FROM `resource` WHERE ref > 0 GROUP BY field3;
SELECT REPLACE(field3, ",", "") AS Country, COUNT(field3)
FROM `resource`
WHERE ref > 0
GROUP BY Country;
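The REPLACE query merges “,Afghanistan” with “Afghanistan”, but it still counts rows that hold several countries. One possible way to drop those as well is to filter out any value that has a comma after its first character. The sketch below tries this on a few made-up rows, using Python's built-in sqlite3 module in place of MySQL (an assumption); the NOT LIKE '_%,%' filter is a suggestion, not part of the query above, and it would also exclude a value with a trailing comma.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (field3 TEXT, ref INTEGER)")
# Hypothetical sample rows; only the shape of the values follows the question.
conn.executemany(
    "INSERT INTO resource VALUES (?, ?)",
    [
        (",Afghanistan", 1),
        ("Afghanistan", 2),
        ("Afghanistan", 3),
        (",Albania", 4),
        (",France, Germany", 5),  # multiple countries: should be excluded
        ("Albania", -1),          # non-positive ref: should be excluded
    ],
)

grouped = conn.execute("""
    SELECT REPLACE(field3, ',', '') AS Country, COUNT(*)
    FROM resource
    WHERE ref > 0
      AND field3 NOT LIKE '_%,%'   -- a comma anywhere after the first character
    GROUP BY REPLACE(field3, ',', '')
    ORDER BY Country
""").fetchall()

print(grouped)  # [('Afghanistan', 3), ('Albania', 1)]
```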

Collect all non-numeric values from a column after GROUP BY clause in MYSQL

I have a MySQL Table that defines causality relation between the columns. Here Event_B happens because of Event_A. Any value present in Event_A is not present in Event_B for that Row_ID.
Row_ID Event_A Event_B
-------------------------
1 A1 B1
2 A2 B3
3 A1 B2
4 A4 A1
When considering A1 from Event_A, its values will be all those values from Event_B {B1, B2}, but shall never include A1 in any case.
The values in the Event_A and Event_B columns are repetitive.
When applying a GROUP BY clause on Event_A, I would like to collect all values of the Event_B column for each Event_A into a variable/collection/set.
I need some direction on the SQL to proceed.
[EDIT]:
The solution would be like :
A1 -- {B1, B2}
A2 -- {B3}
A4 -- {A1}
Do you just want group_concat()?
select event_a, group_concat(event_b) as event_bs
from table t
group by event_a;
Try this:
SELECT Event_A, GROUP_CONCAT(Event_B)
FROM your_table
GROUP BY Event_A;
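As a quick check against the sample rows from the question, here is a sketch that runs the GROUP_CONCAT query through Python's built-in sqlite3 module, used here in place of MySQL (an assumption). The table name events is made up. The joined string is split back into a set because the order of concatenated values is not guaranteed without an explicit ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (row_id INTEGER, event_a TEXT, event_b TEXT);  -- hypothetical name
    INSERT INTO events VALUES
        (1, 'A1', 'B1'), (2, 'A2', 'B3'), (3, 'A1', 'B2'), (4, 'A4', 'A1');
""")

rows = conn.execute(
    "SELECT event_a, GROUP_CONCAT(event_b) FROM events GROUP BY event_a"
).fetchall()

# Turn each comma-separated string into a set of Event_B values.
collected = {event_a: set(joined.split(",")) for event_a, joined in rows}
print(collected)
```

For the sample data this maps A1 to B1 and B2, A2 to B3, and A4 to A1.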

Populate column with number of substrings in another column

I have two tables "A" and "B". Table "A" has two columns "Body" and "Number." The column "Number" is empty, the purpose is to populate it.
Table A: Body / Number
ABABCDEF /
IJKLMNOP /
QRSTUVWKYZ /
Table "B" only has one column:
Table B: Values
AB
CD
QR
Here is what I am looking for as a result:
ABABCDEF / 3
IJKLMNOP / 0
QRSTUVWKYZ / 1
In other words, I want to create a query that looks up, for each string in the "Body" column, how many times the substrings in the "Values" column appear.
How would you advise me to do that?
Here's the finished query; explanation will follow:
SELECT
Body,
SUM(
CASE WHEN Value IS NULL THEN 0
ELSE (LENGTH(Body) - LENGTH(REPLACE(Body, Value, ''))) / LENGTH(Value)
END
) AS Val
FROM (
SELECT TableA.Body, TableB.Value
FROM TableA
LEFT JOIN TableB ON INSTR(TableA.Body, TableB.Value) > 0
) CharMatch
GROUP BY Body
There's a SQL Fiddle here.
Now for the explanation...
The inner query matches TableA strings with TableB substrings:
SELECT TableA.Body, TableB.Value
FROM TableA
LEFT JOIN TableB ON INSTR(TableA.Body, TableB.Value) > 0
Its results are:
BODY VALUE
-------------------- -----
ABABCDEF AB
ABABCDEF CD
IJKLMNOP
QRSTUVWKYZ QR
If you just count these you'll only get a value of 2 for the ABABCDEF string because it just looks for the existence of the substrings and doesn't take into consideration that AB occurs twice.
MySQL doesn't appear to have an OCCURS type function, so to count the occurrences I used the workaround of comparing the length of the string to its length with the target string removed, divided by the length of the target string. Here's an explanation:
REPLACE('ABABCDEF', 'AB', '') ==> 'CDEF'
LENGTH('ABABCDEF') ==> 8
LENGTH('CDEF') ==> 4
So removing all AB occurrences shortens the string by 8 - 4, or 4 characters. Divide the 4 by 2 (LENGTH('AB')) to get the number of AB occurrences: 2
String IJKLMNOP needs special handling. It matches none of the target values, so the LEFT JOIN gives it a NULL Value and the length arithmetic would return NULL instead of 0. The CASE inside the SUM protects against this.
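The whole query can be run against the sample tables to confirm the counts. This sketch uses Python's built-in sqlite3 module in place of MySQL (an assumption); SQLite does integer division here, so the counts come back as integers, whereas MySQL would return decimal values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (Body TEXT);
    CREATE TABLE TableB (Value TEXT);
    INSERT INTO TableA VALUES ('ABABCDEF'), ('IJKLMNOP'), ('QRSTUVWKYZ');
    INSERT INTO TableB VALUES ('AB'), ('CD'), ('QR');
""")

counts = conn.execute("""
    SELECT Body,
           SUM(CASE WHEN Value IS NULL THEN 0
                    -- characters removed, divided by the substring length
                    ELSE (LENGTH(Body) - LENGTH(REPLACE(Body, Value, ''))) / LENGTH(Value)
               END) AS Val
    FROM (SELECT TableA.Body, TableB.Value
          FROM TableA
          LEFT JOIN TableB ON INSTR(TableA.Body, TableB.Value) > 0) AS CharMatch
    GROUP BY Body
    ORDER BY Body
""").fetchall()

print(counts)  # [('ABABCDEF', 3), ('IJKLMNOP', 0), ('QRSTUVWKYZ', 1)]
```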
You want an update query:
update A
set cnt = (select sum((length(a.body) - length(replace(a.body, b.value, ''))) / length(b.value))
from b
);
This uses a little trick for counting the number of occurrences of b.value in a given string. It replaces each occurrence with an empty string and measures the difference in length between the two strings. That difference is then divided by the length of the string being replaced.
If you just wanted the number of matches (so the first value would be "2" instead of "3"):
update A
set cnt = (select count(*)
from b
where a.body like concat('%', b.value, '%')
)