Compare two DNA-like strings with MySQL

Compare two DNA-like strings with MySQL - mysql

I'm trying to find a way to compare two DNA-like strings with MySQL, stored functions are no problem. Also the string may be changed, but needs to have the following format: [code][id]-[value] like C1-4. (- may be changed aswell)
Example of the string:
C1-4,C2-5,C3-9,S5-2,S8-3,L2-4
If a value not exists in the other string, for example S3-1 it will score 10 (max value). If the asked string has C1-4 and the given string has C1-5 the score has to be 4 - 5 = -1 and if the asked string is C1-4 and the given string has C1-2 the score has to be 4 - 2 = 2.
The reason for a this is that my realtime algorithm is getting slow with 10.000 results. (already optimized with stored functions, indexes, query optimalizations) Because 10.000 x small and quick queries will make a lot.
And the score has to be calculated before I can order my query and get the right limit.
Thanks and if you have any questions let me know by comment.
** EDIT **
I'm thinking that it's also possible to not use a string but a table where the DNA-bits are stored as a 1-n relation table.
ID | CODE | ID | VALUE
----------------------
1. | C... | 2. | 4....

Related

Unexpected result in WHERE clause on AI ID field

I have a table which's name is users in my MySQL database, and I am using this DB with Ruby on Rails application with ORM structure for years. The table has id field and this field is configured as AI (auto-increment), BIGINT.
Example of my users table;
+----+---------+
| id | name |
+----+---------+
| 1 | John |
| 2 | Tommy |
| 3 | ... |
| 4 | ... |
| 5 | ... |
| 6 | ... |
+----+---------+
The problem I am facing is when I execute the following query I get unexpected rows.
SELECT * FROM users WHERE id = '1AW3F4SEFR';
This query is returning the exact same value with the following query,
SELECT * FROM users WHERE id = 1;
I do not know why SQL let me use strings in WHERE clause on a data type INT. And as we can see from the example, my DB converts the strings I gave to the integer at position 0. I mean, I search for 1AW3F4SEFR and I expect not to get any result. But SQL statement returns the results for id = 1.
In Oracle SQL, the behavior of this exact same query is completely different. So, I believe there is something different on MySQL. But I am not sure about what causes this.

As has been explained in the request comments, MySQL has a weird way of converting strings to numbers. It simply takes as much of a string from the left as is numeric and ignores the rest. If the string doesn't start with a number the conversion defaults to 0.
Examples: '123' => 123, '12.3' => 12.3, '.123' => 0.123, '12A3' => 12, 'A123' => 0, '.1A1.' => 0.1
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=55cd18865fad4738d03bf28082217ca8
That MySQL doesn't raise an error here as other DBMS do, can easily lead to undesired query results that get a long time undetected.
The solution is easy though: Don't let this happen. Don't compare a numeric column with a string. If the ID '1AW3F4SEFR' is entered in some app, raise an error in the app or even prevent this value from being entered. When running the SQL query, make sure to pass a numeric value, so '1AW3F4SEFR' cannot even make it into the DBMS. (Look up how to use prepared statements and pass parameters of different types to the database system in your programming language.)
If for some reason you want to pass a string for the ID instead (I cannot think of any such reason though) and want to make your query fail-safe by not returning any row in case of an ID like '1AW3F4SEFR', check whether the ID string represents an integer value in the query. You can use REGEXP for this.
SELECT * FROM users WHERE id = #id AND #id REGEXP '^[0-9]+$';
Thus you only consider integer ID strings and still enable the DBMS to use an index when looking up the ID.
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=56f8ee902342752933c20b8762f14dbb

Is there a way in MySQL to use aggregate functions in a sub section of binary column?

Suppose we have 2 numbers of 3 bits each attached together like '101100', which basically represents 5 and 4 combined. I want to be able to perform aggregation functions like SUM() or AVG() on this column separately for each individual 3-bit column.
For instance:
'101100'
'001001'
sum(first three column) = 6
sum(last three column) = 5
I have already tried the SUBSTRING() function, however, speed is the issue in that case as this query will run on millions of rows regularly. And string matching will slow the query.
I am also open for any new databases or technologies that may support this functionality.

You can use the function conv() to convert any part of the string to a decimal number:
select
sum(conv(left(number, 3), 2, 10)) firstpart,
sum(conv(right(number, 3), 2, 10)) secondpart
from tablename
See the demo.
Results:
| firstpart | secondpart |
| --------- | ---------- |
| 6 | 5 |

With the current understanding I have of your schema (which is next to none), the best solution would be to restructure your schema so that each data point is its own record instead of all the data points being in the same record. Doing this allows you to have a dynamic number of data points per entry. Your resulting table would look something like this:
id | data_type | value
ID is used to tie all of your data points together. If you look at your current table, this would be whatever you are using for the primary key. For this answer, I am assuming id INT NOT NULL but yours may have additional columns.
Data Type indicates what type of data is stored in that record. This would be the current tables column name. I will be using data_type_N as my values, but yours should be a more easily understood value (e.g. sensor_5).
Value is exactly what it says it is, the value of the data type for the given id. Your values appear to be all numbers under 8, so you could use a TINYINT type. If you have different storage types (VARCHAR, INT, FLOAT), I would create a separate column per type (val_varchar, val_int, val_float).
The primary key for this table now becomes a composite: PRIMARY KEY (id, data_type). Since your previously single record will become N records, the primary key will need to adjust to accommodate that.
You will also want to ensure that you have indexes that are usable by your queries.
Some sample values (using what you placed in your question) would look like:
1 | data_type_1 | 5
1 | data_type_2 | 4
2 | data_type_1 | 1
2 | data_type_2 | 1
Doing this, summing the values now becomes trivial. You would only need to ensure that data_type_N is summed with data_type_N. As an example, this would be used to sum your example values:
SELECT data_type,
SUM(value)
FROM my_table
WHERE id IN (1,2)
GROUP BY data_type
Here is an SQL Fiddle showing how it can be used.

MySQL select from INT column

I was doing some system testing and expecting empty results from MySQL(5.7.21) but got surprised to get results.
My transactions table looks like this:
Column Data type
----------------------------
id | INT
fullnames | VARCHAR(40)
---------------------------
And I have some records
--------------------------------
id | fullnames
--------------------------------
20 | Mutinda Boniface
21 | Boniface M
22 | Some-other Guy
-------------------------------
My sample queries:
select * from transactions where id = "20"; -- gives me 1 record which is fine
select * from transactions where id = 20; -- gives me 1 record - FINE as well
Now it gets interesting when I try with these:
select * from transactions where id = "20xxx"; -- gives me 1 record - what is happening here?
What does MySQL do here??

MySQL plays fast and loose with type conversions. When implicitly converting a char to a number, it will take characters from the beginning of the string as long as they are digits, and ignore the rest. In your example, xxx aren't digits, so MySQL only takes the initial "20".
One way around this (which is horrible for performance, since you lose the usage on the index you may have on your column), is to explicitly cast the numeric side to a character:
SELECT * FROM transactions WHARE (CAST id AS CHAR) = 20;
EDIT:
Referencing the discussion about performance from the comments - performing the cast to a number on the client-side is probably the best approach, as it will allow you to avoid sending queries to the database when you know no rows should be returned (i.e., when your input is not a valid number, such as "20x").
An alternative hack could be to cast the input to a number and back again to a string, and compare the lengths. If the lengths are the same it means the input string was fully converted into a number and no characters were omitted. This should be OK WRT performance, since this comparison is performed on an inputted string, not on a value from the column, and the column's index can still be used if the condition passes the short-circuit evaluation of the input:
SELECT *
FROM transactions
WHERE LENGTH(:input) = LENGTH(CAST(:input AS SIGNED)) AND id = :input;

Postgresql 9.2 trigger to separate subfields in a stored string

Postgresql 9.2 DB which automatically collects data from various machines.
The DB stores all the data including the machine id, the firmware, the manufacturer id etc as well as the actual result data. In one stored field (varchar) there are 5 sub fields which are separated by the ^ character.
ACT18!!!8246-EN-2.00013151!1^7.00^F5260046959^H1P1O1R1C1Q1L1^1 (Machine 1)
The order of this data seems to vary from one machine to another. Eg machine 1 2 and 3. The string above shows the firmware version, in this case "7.0" and it appears in sub-field 2. However, another machine sends the data in a different sub-field - in this case sub-field 3 and the value is "1"
BACT/ALERT^A.00^1^^ (Machine 2)
I want to store the values "7.0" and "1" in a different field in a separate table using a CREATE TRIGGER t_machine_id AFTER INSERT function where I can choose which sub-field is used depending on the machine the data has come from.
Is split_part the best function to do this? Can anyone supply an example code that will do this? I can't find anything in the documentation.

You need to (a) split the data using something like regexp_split_to_table then (b) match which parts are which using some criteria, since you don't have field position-order to rely on. Right now I don't see any reliable rule to decide what's the firmware version and what's the machine number; you can't really say where field <> machine_number because if machine 1 had firmware version 1 you'd get no results.
Given dummy data:
CREATE TABLE machine_info(data text, machine_no integer);
INSERT INTO machine_info(data,machine_no) (VALUES
('ACT18!!!8246-EN-2.00013151!1^7.00^F5260046959^H1P1O1R1C1Q1L1^1',1),
('BACT/ALERT^A.00^1^^',2)
);
Something like:
SELECT machine_no, regexp_split_to_table(data,'\^')
FROM machine_info;
will give you a table of split data elements with machine number, but then you need to decide which fields are which:
machine_no | regexp_split_to_table
------------+------------------------------
1 | ACT18!!!8246-EN-2.00013151!1
1 | 7.00
1 | F5260046959
1 | H1P1O1R1C1Q1L1
1 | 1
2 | BACT/ALERT
2 | A.00
2 | 1
2 |
2 |
(10 rows)
You may find the output of substituting regexp_split_to_array more useful, depending on whether you can get any useful info from field order and how you intend to process the data.
regress=# SELECT machine_no, regexp_split_to_array(data,'\^')
FROM machine_info;
machine_no | regexp_split_to_array
------------+------------------------------------------------------------------
1 | {ACT18!!!8246-EN-2.00013151!1,7.00,F5260046959,H1P1O1R1C1Q1L1,1}
2 | {BACT/ALERT,A.00,1,"",""}
(2 rows)
Say there are two firmware versions; version 1 sends code^blah^fwvers^^ and version 2 and higher sends code^fwvers^blah^blah2^machineno. You can then differentiate between the two because you know that version 1 leaves the last two fields blank:
SELECT
machine_no,
CASE WHEN info_arr[4:5] = ARRAY['',''] THEN info_arr[3] ELSE info_arr[2] END AS fw_vers
FROM (
SELECT machine_no, regexp_split_to_array(data,'\^')
FROM machine_info
) string_parts(machine_no, info_arr);
results:
machine_no | fw_vers
------------+---------
1 | 7.00
2 | 1
(2 rows)
Of course, you've only provided two sample data, so the real matching rules are likely to be more complex. Consider writing an SQL function to extract the desired field(s) and return them from the array passed.

Compare MySQL string and get the allowed values

I am facing a problem regarding a string comparison in MySQL.
I have the following table,
res_id | image_min_allowed_dimension | canvas_dimension
1 400x500 8x10
2 800x600 11x14
As you can see in this table,
image_min_allowed_dimension column has 2 sets of record. Ans also canvas_dimension has 2 sets
Now, my goal is to get these 2 sets of record with a given value for image_min_allowed_dimension.
Say, if I give 1024x768 for image_min_allowed_dimension in the PHP script it will give me the 2 sets of record from canvas_dimension field.
The probable algo would be,
Fetch All Records as canvas_dimension
IF image_min_allowed_dimension is Less than or equal to a given value(i.e, 1024x768)
ELSE IF the given value is greater than image_min_allowed_dimension then return nothing.
But as the fields are varchar, how can I achieve that.?
Please help.

Refactor your schema to store your resolutions in a sane manner.
res_id | image_min_allowed_width | image_min_allowed_height | canvas_width | canvas_height
Your future self will thank you for the extra effort.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008