MySQL 8 update with REPLACE() or regex

How can I update data using REPLACE or a regex-like method from
id | jdata
---------------
01 | {"name1":["number","2"]}
02 | {"val1":["number","12"],"val2":["number","22"]}
to
id | jdata
---------------
01 | {"name1":2 }
02 | {"val1": 12,"val2":22 }
I need to make a proper JSON entry for numbers, replacing each array with the number from that array. Column "jdata" can have any number of attributes similar to the example. Something like this would do:
UPDATE table SET jdata = REPLACE(jdata, '["number","%d"]', %d);

Two ways:
The long, more clumsy way, using JSON_OBJECT:
UPDATE table1,
(
SELECT
id,
JSON_EXTRACT(jdata, "$.name1[0]") as A,
JSON_EXTRACT(jdata, "$.name1[1]") as B,
JSON_EXTRACT(jdata, "$.val1[0]") as C,
JSON_EXTRACT(jdata, "$.val1[1]") as D,
JSON_EXTRACT(jdata, "$.val2[0]") as E,
JSON_EXTRACT(jdata, "$.val2[1]") as F
FROM table1
) x
SET jdata = CASE WHEN table1.id=1 THEN JSON_OBJECT("name1",x.B)
                 ELSE JSON_OBJECT("val1",x.D,"val2",x.F) END
WHERE x.id=table1.id;
Or using JSON_REPLACE:
update table1
set jdata = JSON_REPLACE(jdata, "$.name1",JSON_EXTRACT(jdata,"$.name1[1]"))
where id=1;
update table1
set jdata = JSON_REPLACE(jdata, "$.val1",JSON_EXTRACT(jdata,"$.val1[1]"),
"$.val2",JSON_EXTRACT(jdata,"$.val2[1]"))
where id=2;
see: DBFIDDLE for both options
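Closer to the pattern-style REPLACE the question asks for, MySQL 8's REGEXP_REPLACE can rewrite all such arrays in one pass, since its ICU regex engine supports $1 references to capture groups in the replacement string. This is only a sketch under assumptions: every array has the exact shape ["number","<digits>"], the JSON column may print a space after the comma (hence the \s*), and the result must still parse as valid JSON to be assigned back to a JSON column:
UPDATE table1
SET jdata = REGEXP_REPLACE(
    CAST(jdata AS CHAR),
    '\\["number",\\s*"([0-9]+)"\\]',
    '$1'
);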
EDIT: To get more depth in the query, you can start with the query below, and create a new JSON message from its output without the "number" tag:
WITH RECURSIVE cte1 as (
    select 0 as x
    union all
    select x+1 from cte1 where x<10
)
select
    id,
    x,
    JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(jdata),CONCAT("$[",x,"]"))) j,
    JSON_EXTRACT(jdata,CONCAT("$.",JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(jdata),CONCAT("$[",x,"]"))))) v,
    JSON_UNQUOTE(JSON_EXTRACT(jdata,CONCAT("$.",JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(jdata),CONCAT("$[",x,"]"))),"[0]"))) v1,
    JSON_UNQUOTE(JSON_EXTRACT(jdata,CONCAT("$.",JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(jdata),CONCAT("$[",x,"]"))),"[1]"))) v2
from table1
cross join cte1
where x<JSON_DEPTH(jdata)
  and not JSON_EXTRACT(JSON_KEYS(jdata),CONCAT("$[",x,"]")) is null
order by id,x;
output:
id | x | j     | v                | v1     | v2
------------------------------------------------
1  | 0 | name1 | ["number", "2"]  | number | 2
2  | 0 | val1  | ["number", "12"] | number | 12
2  | 1 | val2  | ["number", "22"] | number | 22
This should take care of JSON messages that also contain values like val3, val4, etc., up to a maximum depth, which is currently fixed at 10 in cte1.
EDIT2: When you just need to remove the "number" tag from the JSON message, you can also repeat this UPDATE until all "number" tags are removed (you can repeat it in a stored procedure; I am not going to write the stored procedure for you 😉):
update
    table1,
    ( WITH RECURSIVE cte1 as (
        select 0 as x
        union all
        select x+1 from cte1 where x<10
    ) select * from cte1 ) x
set jdata = JSON_REMOVE(table1.jdata, CONCAT("$.",JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(jdata),CONCAT("$[",x,"]"))),"[0]"))
where JSON_UNQUOTE(JSON_EXTRACT(jdata,CONCAT("$.",JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(jdata),CONCAT("$[",x,"]"))),"[0]"))) = "number"
An example where I run the update twice is in this DBFIDDLE
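For completeness, a minimal sketch of what such a procedure could look like (strip_number_tags is a made-up name here); it simply wraps the UPDATE above in a REPEAT loop that runs until no more rows change:
DELIMITER //
CREATE PROCEDURE strip_number_tags()
BEGIN
  REPEAT
    UPDATE
        table1,
        ( WITH RECURSIVE cte1 as (
            select 0 as x
            union all
            select x+1 from cte1 where x<10
        ) select * from cte1 ) x
    SET jdata = JSON_REMOVE(table1.jdata, CONCAT("$.",JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(jdata),CONCAT("$[",x,"]"))),"[0]"))
    WHERE JSON_UNQUOTE(JSON_EXTRACT(jdata,CONCAT("$.",JSON_UNQUOTE(JSON_EXTRACT(JSON_KEYS(jdata),CONCAT("$[",x,"]"))),"[0]"))) = "number";
  UNTIL ROW_COUNT() = 0 END REPEAT;
END//
DELIMITER ;

CALL strip_number_tags();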

Related

Dynamically fetch columns as JSON in Oracle

I have to get some columns as-is and some columns from a query as a JSON document. The as-is column names are known to me, but the rest are dynamic columns, so they are not known beforehand.
The query looks like this:
select col1,col2,col3,
sum(col4) as col4,
sum(col5) as col5
from my_table
group by col1,col2,col3;
Here the col4, col5 names are unknown to me, as they are fetched dynamically.
Suppose my_table data looks like
The expected result is like below
I tried
select JSON_OBJECT(*) from
(
select col1,col2,col3,
sum(col4) as col4,
sum(col5) as col5
from my_table
group by col1,col2,col3
);
But obviously it does not yield expected output.
I'm on 19c DB version 19.17
Any help or suggestion would be a great help!
It's kinda hacky, but you could:
Use json_object(*) to convert the whole row to json
Pass the result of this to json_transform*, which you can use to remove unwanted attributes
So you could do something like:
with rws as (
select mod ( level, 2 ) col1, mod ( level, 3 ) col2,
level col3, level col4
from dual
connect by level <= 10
), grps as (
select col1,col2,
sum(col3) as col3,
sum(col4) as col4
from rws
group by col1,col2
)
select col1,col2,
json_transform (
json_object(*),
remove '$.COL1',
remove '$.COL2'
) json_data
from grps;
COL1 COL2 JSON_DATA
---------- ---------- ------------------------------
1 1 {"COL3":8,"COL4":8}
0 2 {"COL3":10,"COL4":10}
1 0 {"COL3":12,"COL4":12}
0 1 {"COL3":14,"COL4":14}
1 2 {"COL3":5,"COL4":5}
0 0 {"COL3":6,"COL4":6}
json_transform is a 21c feature that's been backported to 19c in 19.10.
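If your 19c patch level is below 19.10, json_mergepatch (available throughout 19c) may serve as a fallback: per RFC 7396, a null value in the patch document removes that key. A sketch against the same grps subquery (untested here):
select col1,col2,
       json_mergepatch(
         json_object(*),
         '{"COL1":null,"COL2":null}'
       ) json_data
from grps;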
You may achieve this by using a Polymorphic Table Function (PTF), available since 18c.
Define the function that will project only specific columns and serialize others into JSON. An implementation is below.
PTF package (function implementation).
create package pkg_jsonify as
/*Package to implement PTF.
Functions below implement the API
described in the DBMS_TF package*/
function describe(
tab in out dbms_tf.table_t,
keep_cols in dbms_tf.columns_t
) return dbms_tf.describe_t
;
procedure fetch_rows;
end pkg_jsonify;
/
create package body pkg_jsonify as
function describe(
tab in out dbms_tf.table_t,
keep_cols in dbms_tf.columns_t
) return dbms_tf.describe_t
as
add_cols dbms_tf.columns_new_t;
new_col_cnt pls_integer := 0;
begin
for i in 1..tab.column.count loop
/*Initially remove column from the output*/
tab.column(i).pass_through := FALSE;
/*and keep it in the row processing (to be available for serialization)*/
tab.column(i).for_read := TRUE;
for j in 1..keep_cols.count loop
/*If column is in a projection list, then remove it
from processing and pass it as is*/
if tab.column(i).description.name = keep_cols(j)
then
tab.column(i).pass_through := TRUE;
/*Skip column in the row processing (JSON serialization)*/
tab.column(i).for_read := FALSE;
end if;
end loop;
end loop;
/*Define new output column*/
add_cols := dbms_tf.columns_new_t(
1 => dbms_tf.column_metadata_t(
name => 'JSON_DOC_DATA',
type => dbms_tf.type_clob
)
);
/*Return the list of new cols*/
return dbms_tf.describe_t(
new_columns => add_cols
);
end;
procedure fetch_rows
/*Process rowset and serialize cols*/
as
rowset dbms_tf.row_set_t;
num_rows pls_integer;
new_col dbms_tf.tab_clob_t;
begin
/*Get rows*/
dbms_tf.get_row_set(
rowset => rowset,
row_count => num_rows
);
for rn in 1..num_rows loop
/*Calculate new column value in the same row*/
new_col(rn) := dbms_tf.row_to_char(
rowset => rowset,
rid => rn,
format => dbms_tf.FORMAT_JSON
);
end loop;
/*Put column to output*/
dbms_tf.put_col(
columnid => 1,
collection => new_col
);
end;
end pkg_jsonify;
/
PTF function definition based on the package.
create function f_cols_to_json(tab in table, cols in columns)
/*Function to serialize into JSON using PTF*/
return table pipelined
row polymorphic using pkg_jsonify;
/
Demo.
create table sample_tab
as
select
trunc(level/10) as id
, mod(level, 3) as id2
, level as val1
, level * level as val2
from dual
connect by level < 40;
with prep as (
select
id
, id2
, sum(val1) as val1_sum
, max(val2) as val2_max
from sample_tab
group by
id
, id2
)
select *
from table(f_cols_to_json(prep, columns(id, id2)));
ID  ID2  JSON_DOC_DATA
----------  ----------  ----------------------------------
0           1           {"VAL1_SUM":12, "VAL2_MAX":49}
0           2           {"VAL1_SUM":15, "VAL2_MAX":64}
0           0           {"VAL1_SUM":18, "VAL2_MAX":81}
1           1           {"VAL1_SUM":58, "VAL2_MAX":361}
1           2           {"VAL1_SUM":42, "VAL2_MAX":289}
1           0           {"VAL1_SUM":45, "VAL2_MAX":324}
2           2           {"VAL1_SUM":98, "VAL2_MAX":841}
2           0           {"VAL1_SUM":72, "VAL2_MAX":729}
2           1           {"VAL1_SUM":75, "VAL2_MAX":784}
3           0           {"VAL1_SUM":138, "VAL2_MAX":1521}
3           1           {"VAL1_SUM":102, "VAL2_MAX":1369}
3           2           {"VAL1_SUM":105, "VAL2_MAX":1444}
with prep as (
select
id
, id2
, sum(val1) as val1_sum
, max(val2) as val2_max
from sample_tab
group by
id
, id2
)
select *
from table(f_cols_to_json(prep, columns(id)));
ID  JSON_DOC_DATA
----------  ------------------------------------------
0           {"ID2":1, "VAL1_SUM":12, "VAL2_MAX":49}
0           {"ID2":2, "VAL1_SUM":15, "VAL2_MAX":64}
0           {"ID2":0, "VAL1_SUM":18, "VAL2_MAX":81}
1           {"ID2":1, "VAL1_SUM":58, "VAL2_MAX":361}
1           {"ID2":2, "VAL1_SUM":42, "VAL2_MAX":289}
1           {"ID2":0, "VAL1_SUM":45, "VAL2_MAX":324}
2           {"ID2":2, "VAL1_SUM":98, "VAL2_MAX":841}
2           {"ID2":0, "VAL1_SUM":72, "VAL2_MAX":729}
2           {"ID2":1, "VAL1_SUM":75, "VAL2_MAX":784}
3           {"ID2":0, "VAL1_SUM":138, "VAL2_MAX":1521}
3           {"ID2":1, "VAL1_SUM":102, "VAL2_MAX":1369}
3           {"ID2":2, "VAL1_SUM":105, "VAL2_MAX":1444}
fiddle

Compare JSON values in MariaDB

How can I compare two JSON values in MariaDB? Two values such as {"b": 1, "a": 2} and {"a": 2, "b": 1} should be equal. Does MariaDB contain a function to reorder the elements of a JSON value?
If you expect to need this (uncommon) kind of comparison, build the JSON in some canonical way before storing it. The obvious way for a simple JSON like yours is to alphabetize the keys. How to do that will depend on the "encode" library you are using for JSON.
Just use JSON_EXTRACT; JSON_EXTRACT doesn't care about the position of a key/value pair within a JSON string.
Query
SELECT
  JSON_EXTRACT(@json_string_1, '$.a') AS a1
, JSON_EXTRACT(@json_string_2, '$.a') AS a2
, JSON_EXTRACT(@json_string_1, '$.b') AS b1
, JSON_EXTRACT(@json_string_2, '$.b') AS b2
FROM (
  SELECT
    @json_string_1 := '{"b":1,"a":2}'
  , @json_string_2 := '{"a":2,"b":1}'
)
AS json_strings
Result
a1 a2 b1 b2
------ ------ ------ --------
2 2 1 1
Now use this result as a derived table so we can check whether a1 equals a2 and b1 equals b2.
Query
SELECT
  1 AS json_equal
FROM (
  SELECT
    JSON_EXTRACT(@json_string_1, '$.a') AS a1
  , JSON_EXTRACT(@json_string_2, '$.a') AS a2
  , JSON_EXTRACT(@json_string_1, '$.b') AS b1
  , JSON_EXTRACT(@json_string_2, '$.b') AS b2
  FROM (
    SELECT
      @json_string_1 := '{"b":1,"a":2}'
    , @json_string_2 := '{"a":2,"b":1}'
  )
  AS json_strings
)
AS json_data
WHERE
  json_data.a1 = json_data.a2
AND
  json_data.b1 = json_data.b2
Result
json_equal
------------
1
Disclaimer: I work for MariaDB
See my answer at https://dba.stackexchange.com/a/300235/208895 for an example of how to use JSON_EQUALS, available as of MariaDB 10.7.
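For quick reference, a minimal example of that function; it compares by value, so key order and insignificant whitespace don't matter:
SELECT JSON_EQUALS('{"b": 1, "a": 2}', '{"a": 2, "b": 1}');
-- returns 1 (equal)
SELECT JSON_EQUALS('{"a": 2}', '{"a": "2"}');
-- returns 0 (a number and a string are not equal)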

MySQL BETWEEN: show matched value

I have a table with columns showing ranges, like
id from to
1 10 100
2 200 300
I have a query which will be a list of values, like 17, 20, 44, 288 etc.
Is it possible to have a result set which would include the where condition, so I get:
id from to input
1 10 100 7
1 10 100 20
1 10 100 144
2 200 300 288
Right now the code runs one query per where-value and it works; I'm looking to increase performance by combining it into one large query with multiple where clauses, like
SELECT *
FROM table
WHERE (`from`<=7 AND `to`>=7)
OR (`from`<=20 AND `to`>=20)
OR (`from`<=144 AND `to`>=144)
OR (`from`<=288 AND `to`>=288)
What you want makes no sense regarding ranges.
7 and 144 have no matching range, yet you want to put them into the first range.
With a long list of values you will probably end up with too many conditions.
What you can do is show the values that aren't in any range without a correspondence, like this:
With the structure being:
create table test (
id integer,
vfrom integer,
vto integer
);
insert into test values
(1, 10, 100),
(2, 200, 300);
create table vals(
val integer
);
insert into vals values (7), (20), (144), (288);
You can use this query:
select val, id, vfrom, vto
from vals v
left join test t on ( t.vfrom <= v.val and t.vto >= v.val )
It will bring you:
7 null null null
20 1 10 100
144 null null null
288 2 200 300
see it here on fiddle: http://sqlfiddle.com/#!2/f68fd/8
Maybe it isn't what you want but it is more logical.
Sure, there is a query for this. The trouble is that we need a table for the specific values to show up, and then there are sub-queries and union selects:
SELECT t.*, vals.val AS input
FROM (SELECT 7 AS val UNION SELECT 20 UNION SELECT 144 UNION SELECT 288) AS vals
JOIN `table` t ON t.`from` <= vals.val AND t.`to` >= vals.val
This should do the trick. Note that you only have to specify the column name in the first SELECT within a UNION SELECT.
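On MySQL 8.0.19 or later, a table value constructor plus a derived-table column list can replace the UNION chain. A sketch, with `t` standing in for the question's table and the from/to columns backtick-quoted since both are reserved words:
SELECT t.*, v.val AS input
FROM (VALUES ROW(7), ROW(20), ROW(144), ROW(288)) AS v(val)
JOIN t ON t.`from` <= v.val AND t.`to` >= v.val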
I will suppose you are using Java as your application language. You could build your query this way:
public String buildQuery(int[] myList) {
    StringBuilder queryToReturn = new StringBuilder();
    for (int queryIndex = 0; queryIndex < myList.length; queryIndex++) {
        if (queryIndex > 0) {
            queryToReturn.append(" union ");
        }
        queryToReturn.append("(select `id`, `from`, `to`, ")
                .append(myList[queryIndex]).append(" as input")
                .append(" from MyTable")
                .append(" where `from` < ").append(myList[queryIndex])
                .append(" and ").append(myList[queryIndex]).append(" < `to`)");
    }
    return queryToReturn.toString();
}
Then run the returned query.

SQL Server: calculate field data from fields in the same table but a different set of data

I was looking around and found no solution to this. I'd be glad if someone could help me out here:
I have a table that has, among others, the following columns:
Vehicle_No, Stop1_depTime, Segment_TravelTime, Stop_arrTime, Stop_Sequence
The data might look something like this:
Vehicle_No  Stop1_DepTime  Segment_TravelTime  Stop_Sequence  Stop_arrTime
201         13000          60                  1
201         13000          45                  2
201         13000          120                 3
201         13000                              4
202         13300          240                 1
202         13300          60                  2
...
and I need to calculate the arrival time at each stop from the departure time at the first stop and the travel times in between for each vehicle. What I need in this case would look like this:
Vehicle_No  Stop1_DepTime  Segment_TravelTime  Stop_Sequence  Stop_arrTime
201         13000          60                  1
201         13000          45                  2              13060
201         13000          120                 3              13105
201         13000                              4              13225
202         13300          240                 1
202         13300          60                  2              13540
...
I have tried to find a solution for some time but was not successful - Thanks for any help you can give me!
Here is the query that still does not work - I am sure I did something wrong with getting the table from the database into this but don't know where. Sorry if this is a really simple error, I have just begun working with MSSQL.
Also, I have implemented the solution provided below and it works. At this point I mainly want to understand what went wrong here to learn about it. If it takes too much time, please do not bother with my question for too long. Otherwise - thanks a lot :)
;WITH recCTE
AS
(
SELECT ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id, ZAEHL_2011.dbo.L32.PlanAbfahrtStart, ZAEHL_2011.dbo.L32.Fahrzeit, ZAEHL_2011.dbo.L32.Sequenz, ZAEHL_2011.dbo.L32.PlanAbfahrtStart AS Stop_arrTime
FROM ZAEHL_2011.dbo.L32
WHERE ZAEHL_2011.dbo.L32.Sequenz = 1
UNION ALL
SELECT t. ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id, t. ZAEHL_2011.dbo.L32.PlanAbfahrtStart, t. ZAEHL_2011.dbo.L32.Fahrzeit,t. ZAEHL_2011.dbo.L32.Sequenz, r.Stop_arrTime + r. ZAEHL_2011.dbo.L32.Fahrzeit AS Stop_arrTime
FROM recCTE AS r
JOIN ZAEHL_2011.dbo.L32 AS t
ON t. ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id = r. ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id
AND t. ZAEHL_2011.dbo.L32.Sequenz = r. ZAEHL_2011.dbo.L32.Sequenz + 1
)
SELECT ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id, ZAEHL_2011.dbo.L32.PlanAbfahrtStart, ZAEHL_2011.dbo.L32.Fahrzeit, ZAEHL_2011.dbo.L32.Sequenz, ZAEHL_2011.dbo.L32.PlanAbfahrtStart,
CASE WHEN Stop_arrTime = ZAEHL_2011.dbo.L32.PlanAbfahrtStart THEN NULL ELSE Stop_arrTime END AS Stop_arrTime
FROM recCTE
ORDER BY ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id, ZAEHL_2011.dbo.L32.Sequenz
A recursive CTE solution - this assumes that each (Vehicle_No, Stop_Sequence) combination appears in the table only once:
DECLARE @t TABLE
(Vehicle_No INT
,Stop1_DepTime INT
,Segment_TravelTime INT
,Stop_Sequence INT
,Stop_arrTime INT
)
INSERT @t (Vehicle_No,Stop1_DepTime,Segment_TravelTime,Stop_Sequence)
VALUES(201,13000,60,1),
(201,13000,45,2),
(201,13000,120,3),
(201,13000,NULL,4),
(202,13300,240,1),
(202,13300,60,2)
;WITH recCTE
AS
(
SELECT Vehicle_No, Stop1_DepTime, Segment_TravelTime,Stop_Sequence, Stop1_DepTime AS Stop_arrTime
FROM @t
WHERE Stop_Sequence = 1
UNION ALL
SELECT t.Vehicle_No, t.Stop1_DepTime, t.Segment_TravelTime,t.Stop_Sequence, r.Stop_arrTime + r.Segment_TravelTime AS Stop_arrTime
FROM recCTE AS r
JOIN @t AS t
ON t.Vehicle_No = r.Vehicle_No
AND t.Stop_Sequence = r.Stop_Sequence + 1
)
SELECT Vehicle_No, Stop1_DepTime, Segment_TravelTime,Stop_Sequence, Stop1_DepTime,
CASE WHEN Stop_arrTime = Stop1_DepTime THEN NULL ELSE Stop_arrTime END AS Stop_arrTime
FROM recCTE
ORDER BY Vehicle_No, Stop_Sequence
EDIT
Corrected version of OP's query - note that it's not necessary to fully qualify the column names:
;WITH recCTE
AS
(
SELECT Zaehl_Fahrt_Id, PlanAbfahrtStart, Fahrzeit, L32.Sequenz, PlanAbfahrtStart AS Stop_arrTime
FROM ZAEHL_2011.dbo.L32
WHERE Sequenz = 1
UNION ALL
SELECT t.Zaehl_Fahrt_Id, t.PlanAbfahrtStart, t.Fahrzeit,t.Sequenz, r.Stop_arrTime + r.Fahrzeit AS Stop_arrTime
FROM recCTE AS r
JOIN ZAEHL_2011.dbo.L32 AS t
ON t.Zaehl_Fahrt_Id = r.Zaehl_Fahrt_Id
AND t.Sequenz = r.Sequenz + 1
)
SELECT Zaehl_Fahrt_Id, PlanAbfahrtStart, Fahrzeit, Sequenz, PlanAbfahrtStart,
CASE WHEN Stop_arrTime = PlanAbfahrtStart THEN NULL ELSE Stop_arrTime END AS Stop_arrTime
FROM recCTE
ORDER BY Zaehl_Fahrt_Id, Sequenz
I'm quite sure this works:
SELECT a.Vehicle_No, a.Stop1_DepTime,
       a.Segment_TravelTime, a.Stop_Sequence,
       a.Stop1_DepTime + (SELECT SUM(b.Segment_TravelTime)
                          FROM your_table b
                          WHERE b.Vehicle_No = a.Vehicle_No
                            AND b.Stop_Sequence < a.Stop_Sequence) AS Stop_arrTime
FROM your_table a
ORDER BY a.Vehicle_No
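On SQL Server 2012 and later, a windowed running sum avoids both the recursion and the correlated subquery. A sketch against the @t sample data above (run in the same batch); the empty frame at stop 1 yields NULL, matching the blank arrival times in the expected output:
SELECT Vehicle_No, Stop1_DepTime, Segment_TravelTime, Stop_Sequence,
       Stop1_DepTime + SUM(Segment_TravelTime) OVER (
           PARTITION BY Vehicle_No
           ORDER BY Stop_Sequence
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
       ) AS Stop_arrTime
FROM @t
ORDER BY Vehicle_No, Stop_Sequence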

Perl (or R, or SQL): Count how often a string appears across columns

I have a text file that looks like this:
gene1 gene2 gene3
a d c
b e d
c f g
d g
h
i
(Each column is a human gene, and each contains a variable number of proteins (strings, shown as letters here) that can bind to those genes).
What I want to do is count how many columns each string is represented in, then output that number and all the matching column headers, like this:
a 1 gene1
b 1 gene1
c 2 gene1 gene3
d 3 gene1 gene2 gene3
e 1 gene2
f 1 gene2
g 2 gene2 gene3
h 1 gene2
i 1 gene2
I have been trying to figure out how to do this in Perl and R, but without success so far. Thanks for any help.
This solution seems like a bit of a hack, but it gives the desired output. It relies on using both the plyr and reshape packages, though I'm sure you could find base R alternatives. The trick is that the melt function lets us flatten the data out into a long format, which allows for easy(ish) manipulation from that point forward.
library(reshape)
library(plyr)
#Recreate your data
dat <- data.frame(gene1 = c(letters[1:4], NA, NA),
gene2 = letters[4:9],
gene3 = c("c", "d", "g", NA, NA, NA)
)
#Melt the data. You'll need to update this if you have more columns
dat.m <- melt(dat, measure.vars = 1:3)
#Tabulate counts
counts <- as.data.frame(table(dat.m$value))
#I'm not sure what to call this column since it's a smooshing of column names
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " "))
#Merge the two together. You could fix the column names above, or just deal with it here
merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Gives:
> merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Var1 Freq V1
1 a 1 gene1
2 b 1 gene1
3 c 2 gene1 gene3
4 d 3 gene1 gene2 gene3
....
In Perl, assuming the proteins in each column don't have duplicates that need to be removed (if they do, a hash of hashes should be used instead):
use strict;
use warnings;

my $header = <>;
my %column_genes;
while ($header =~ /(\S+)/g) {
    $column_genes{$-[1]} = "$1";
}

my %proteins;
while (my $line = <>) {
    while ($line =~ /(\S+)/g) {
        if (exists $column_genes{$-[1]}) {
            push @{ $proteins{$1} }, $column_genes{$-[1]};
        }
        else {
            warn "line $. column $-[1] unexpected protein $1 ignored\n";
        }
    }
}

for my $protein (sort keys %proteins) {
    print join("\t",
        $protein,
        scalar @{ $proteins{$protein} },
        join(' ', sort @{ $proteins{$protein} } )
    ), "\n";
}
Reads from stdin, writes to stdout.
A one-liner (or rather a three-liner):
ddply(na.omit(melt(dat, m = 1:3)), .(value), summarize,
len = length(variable),
var = paste(variable, collapse = " "))
If it's not a lot of columns, you can do something like this in SQL. You basically flatten the data out into a two-column derived table of protein/gene and then summarize it as needed.
;with cte as (
  select gene1 as protein, 'gene1' as gene from your_table
  union select gene2 as protein, 'gene2' as gene from your_table
  union select gene3 as protein, 'gene3' as gene from your_table
)
select protein, count(*) as cnt, group_concat(gene) as gene
from cte
where protein is not null
group by protein
In MySQL, like so:
select protein, count(*), group_concat(gene order by gene separator ' ') from gene_protein group by protein;
assuming data like:
create table gene_protein (gene varchar(255) not null, protein varchar(255) not null);
insert into gene_protein values ('gene1','a'),('gene1','b'),('gene1','c'),('gene1','d');
insert into gene_protein values ('gene2','d'),('gene2','e'),('gene2','f'),('gene2','g'),('gene2','h'),('gene2','i');
insert into gene_protein values ('gene3','c'),('gene3','d'),('gene3','g');