MySQL logic for prioritizing matching pattern/text - mysql

I have a table of incidents that have a short_description. I'm trying to assign them to a category based on the text in that short_description (not ideal, I know, but I'm working with an existing system and don't have much control). So I created a lookup table with search_text to look for, and the category value that should be assigned to the incident. In some cases more than one search_text value matches the short_description. I want to use the priority field to choose the highest priority (lowest priority value, such as 1) when this happens. I feel like maybe this involves a window function or something, but I'm not sure how to approach it.
Can someone help me with the changes needed in the logic below?
Thanks!
The query below returns two results, but I want it to just return one (Cluster) because Cluster is priority 1, and Disk is priority 2. I only want one record per snumber (incident).
CREATE TABLE snow_incident_s_2 (
SNUMBER varchar(40) DEFAULT NULL,
SHORT_DESCRIPTION varchar(200) DEFAULT NULL
);
INSERT INTO snow_incident_s_2 (snumber, short_description) values ('INC15535802','Prognosis::ADMINCLUSTER [5251]::CmaDiskPartitionNearFull, PROGNOSIS:ADMINCLUSTER');
CREATE TABLE lkp_incident_category_2 (
incident_category_id smallint(6) NOT NULL AUTO_INCREMENT,
incident_type varchar(10) DEFAULT NULL,
category varchar(100) NOT NULL,
search_text varchar(200) NOT NULL,
priority smallint(6) DEFAULT NULL,
PRIMARY KEY (incident_category_id)
);
INSERT INTO lkp_incident_category_2 (incident_type, category, search_text, priority) values ('INC','Cluster','Cluster',1);
INSERT INTO lkp_incident_category_2 (incident_type, category, search_text, priority) values ('INC','Disk','Disk',2);
SELECT
inc.snumber,
inc.short_description,
ic.search_text,
ic.category
FROM
snow_incident_s_2 inc
LEFT JOIN
lkp_incident_category_2 ic ON inc.short_description LIKE CONCAT('%', ic.search_text, '%')
AND ic.incident_type = 'INC'

Window functions don't exist in MySQL prior to 8.0, but you can mimic them using user variables.
I have been trying to test this but SQLFiddle is not cooperating, so here is what I believe should work for you; give it a try.
SELECT *
FROM (
SELECT
inc.snumber
,inc.short_description
,ic.search_text
,ic.category
,(@rownum := IF(@incnum=inc.snumber,@rownum+1,1)) as IncidentRowNum
,@incnum := inc.snumber as incnum
FROM
snow_incident_s_2 inc
LEFT JOIN lkp_incident_category_2 ic
ON inc.short_description LIKE CONCAT('%', ic.search_text, '%')
AND ic.incident_type = 'INC'
CROSS JOIN (SELECT @rownum := 0, @incnum := '') var
ORDER BY
inc.snumber, ic.priority
) t
WHERE
t.IncidentRowNum = 1
;
Fully tested and functional; here is a SQLFiddle of it: http://sqlfiddle.com/#!9/75ff1/7

Here is how to use a window function.
I have to say that ON inc.short_description LIKE CONCAT('%', ic.search_text, '%') will be slow. If this is done once in a while (like every day) as an ad-hoc query it will be fine, but if it is done often (every hour, every minute, or more often) you'll want to materialize the result of these matches, since you can't build an index for them unless you use a full-text query or a NoSQL solution (see the materialization sketch after the query below).
SELECT snumber, short_description, search_text, category
FROM (
SELECT snumber, short_description, search_text, category,
ROW_NUMBER() OVER (partition by snumber order by priority asc) as rn
FROM (
SELECT
inc.snumber,
inc.short_description,
ic.search_text,
ic.category,
ic.priority
FROM snow_incident_s_2 inc
LEFT JOIN lkp_incident_category_2 ic
ON inc.short_description LIKE CONCAT('%', ic.search_text, '%')
AND ic.incident_type = 'INC'
) X
) Y
WHERE RN = 1
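As a rough illustration of that materialization idea (the snow_incident_category table and its columns are my own invention, not part of the original schema), you could run the prioritized match on a schedule and store the winning category per incident, so routine lookups hit an indexed table instead of repeating the LIKE join:
-- Hypothetical materialization table, refreshed periodically from the
-- prioritized match above (requires MySQL 8.0 for ROW_NUMBER()).
CREATE TABLE snow_incident_category (
SNUMBER varchar(40) NOT NULL,
CATEGORY varchar(100) NOT NULL,
PRIMARY KEY (SNUMBER)
);
INSERT INTO snow_incident_category (SNUMBER, CATEGORY)
SELECT snumber, category
FROM (
SELECT inc.snumber, ic.category,
ROW_NUMBER() OVER (PARTITION BY inc.snumber ORDER BY ic.priority) AS rn
FROM snow_incident_s_2 inc
JOIN lkp_incident_category_2 ic
ON inc.short_description LIKE CONCAT('%', ic.search_text, '%')
AND ic.incident_type = 'INC'
) x
WHERE rn = 1
ON DUPLICATE KEY UPDATE CATEGORY = VALUES(CATEGORY);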

Related

Create a MySQL query

I have a table like this
http://sqlfiddle.com/#!9/052381/1
I need to create a query that will find VIN codes that meet the following conditions:
VIN starts with XTA%
I have a registration history with date_reg_last values 1306440000, 1506715200, 1555963200. I need to select only those VIN codes that have exactly these values; if there are more or fewer records, the VIN does not match.
I have an owner_type that corresponds to the values 1306440000, 1506715200, 1555963200: 2, 2, 2. I.e. for the record with 1306440000 the owner_type must be 2, for the record with 1506715200 also 2, etc. The type can be different for each entry.
Similarly to the previous point, I have regions: УЛЬЯНОВСК Г., УЛЬЯНОВСК Г., С РУНГА.
I have a year; it should be the same in all records.
I tried making a query like this:
SELECT *
FROM `ac_gibdd_shortinfo`
WHERE `vin` LIKE 'XTA%'
AND `model` LIKE '%1119%'
AND `date_reg_first` IN (0,1506715200,1555963200)
AND `date_reg_last` IN (1306440000,1506715200,1555963200)
AND `year` LIKE '2011'
AND `location` IN ('УЛЬЯНОВСК Г.','С РУНГА')
But it finds records that have a different number of registration records. The only idea I have is to get all the matching records and then filter them by count with an additional query.
Try this:
SELECT *
FROM `ac_gibdd_shortinfo` t0
WHERE `vin` LIKE 'XTA%'
AND `model` LIKE '%1119%'
AND `date_reg_first` IN (0,1506715200,1555963200)
AND `date_reg_last` IN (1306440000,1506715200,1555963200)
AND `year` LIKE '2011'
AND `location` IN ('УЛЬЯНОВСК Г.','С РУНГА')
AND NOT EXISTS ( SELECT NULL
FROM ac_gibdd_shortinfo t1
WHERE t0.vin = t1.vin
AND t1.date_reg_first NOT IN (0,1506715200,1555963200) )
AND NOT EXISTS ( SELECT NULL
FROM ac_gibdd_shortinfo t2
WHERE t0.vin = t2.vin
AND t2.date_reg_last NOT IN (1306440000,1506715200,1555963200) )
AND NOT EXISTS ( SELECT NULL
FROM ac_gibdd_shortinfo t3
WHERE t0.vin = t3.vin
AND t3.location NOT IN ('УЛЬЯНОВСК Г.','С РУНГА') )
P.S. Appropriate indexes will improve performance.
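For example (a sketch; the index name is mine, and this assumes the correlated NOT EXISTS subqueries are driven by vin):
-- An index on vin helps both the LIKE 'XTA%' prefix filter and the
-- t0.vin = tN.vin lookups inside the NOT EXISTS subqueries.
ALTER TABLE ac_gibdd_shortinfo ADD INDEX idx_vin (vin);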
And the VIN should have exactly 3 records in total (1306440000, 1506715200, 1555963200) – blood73
SELECT vin, model, date_reg_first, date_reg_last, `year`, location
FROM `ac_gibdd_shortinfo` t0
WHERE `vin` LIKE 'XTA%'
AND `model` LIKE '%1119%'
AND `date_reg_first` IN (0,1506715200,1555963200)
AND `date_reg_last` IN (1306440000,1506715200,1555963200)
AND `year` LIKE '2011'
AND `location` IN ('УЛЬЯНОВСК Г.','С РУНГА')
AND 3 = ( SELECT COUNT(*)
FROM ac_gibdd_shortinfo t1
WHERE t0.vin = t1.vin );

Quickly Select Random Rows With Where Condition

Is it possible to quickly select random rows from a table, while also using a where condition?
Example:
SELECT * FROM geo WHERE placeRef = 1 ORDER BY RAND() LIMIT 1
This can take 10+ seconds.
I found this, which is sometimes quick, sometimes very slow:
(SELECT *
FROM geo
INNER JOIN ( SELECT RAND() * ( SELECT MAX( nameRef ) FROM geo ) AS ID ) AS t ON geo.nameRef >= t.ID
WHERE geo.placeRef = 1
ORDER BY geo.nameRef
LIMIT 1)
This provides a quick result only if there is no extra WHERE condition.
This is the create table:
CREATE TABLE `geo` (
`nameRef` int(8) DEFAULT NULL,
`placeRef` mediumint(7) unsigned DEFAULT NULL,
`category` enum('continent','country','region','subregion') COLLATE utf8_bin DEFAULT NULL,
`parentRef` mediumint(7) DEFAULT NULL,
`incidence` int(9) unsigned NOT NULL,
`percent` decimal(11,9) unsigned DEFAULT NULL,
`ratio` int(11) NOT NULL,
`rank` mediumint(7) unsigned DEFAULT NULL,
KEY `placeRef_rank` (`placeRef`,`rank`),
KEY `nameRef_category` (`nameRef`,`category`),
KEY `nameRef_parentRef` (`nameRef`,`parentRef`),
KEY `nameRef_placeRef` (`nameRef`,`placeRef`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin
N.B. this table has around 550 million rows.
Desired query: query the table where placeRef = x; and then quickly return one row.
Issue: a query like SELECT * FROM geo WHERE placeRef = 1 can provide up to about 15 million results. So selecting a single random row is slow.
That technique has variable performance because it depends on where the matching rows happen to lie in the table.
The quick fix may be to add this index, assuming that nameRef is the PRIMARY KEY for the table:
INDEX(placeRef, nameRef)
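In ALTER TABLE form (the index name is illustrative):
ALTER TABLE geo ADD INDEX placeRef_nameRef (placeRef, nameRef);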
Let's discuss this further after you provide SHOW CREATE TABLE geo and you read http://mysql.rjweb.org/doc.php/random
There are (currently) 3 indexes that make this subquery very fast (because of the leading nameRef):
( SELECT MAX( nameRef ) FROM geo )
After that, my suggestion of (placeRef, nameRef) will kick in for these:
WHERE geo.placeRef = 1
geo.nameRef >= t.ID
I think the resulting query should be consistently fast.
This is pulling a result in 1/100th of a second:
SELECT * FROM geo where placeRef = 1 AND nameRef >= CEIL( RAND() * ( SELECT MAX( nameRef ) FROM geo ) ) LIMIT 1
This works well if you have an index on both the columns you would like to query. However, you may need to make a new table that is randomly ordered. In my table the nameRefs tend to be grouped by country. This causes the random results to be selected from a handful of rows, as most of the results are clustered around the same Id. I needed to create a new table, ordered randomly with ORDER BY RAND(), where each row had a unique Id. Now I search this much smaller summary table with:
SELECT * FROM geoSummary where placeRef = 1 AND nameRef >= CEIL( RAND() * ( SELECT MAX( id ) FROM geoSummary ) ) LIMIT 1
Though to cut that SELECT MAX query running all the time I have saved the maximum Id in the server-side code, generate the random number there and run:
SELECT * FROM geoSummary where placeRef = 1 AND nameRef >= :random_number LIMIT 1
This provides truly random results.
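For reference, a hedged sketch of how a geoSummary table like the one described might be built; the column list and index are guesses based on the queries above, not the original definition:
-- Copy only the columns needed for lookups, shuffled once with ORDER BY RAND();
-- the AUTO_INCREMENT id then assigns dense, gap-free values in that random order.
CREATE TABLE geoSummary (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
nameRef INT,
placeRef MEDIUMINT UNSIGNED,
PRIMARY KEY (id),
KEY placeRef_nameRef (placeRef, nameRef)
);
INSERT INTO geoSummary (nameRef, placeRef)
SELECT nameRef, placeRef FROM geo ORDER BY RAND();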

Efficient way to list ranges of consecutive records

I have a table set up like so:
CREATE TABLE `cn` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`type` int(3) unsigned NOT NULL,
`number` int(10) NOT NULL,
`desc` varchar(64) NOT NULL,
`datetime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
number is usually but not necessarily unique.
Most of the table consists of rows with consecutive number entries.
e.g.
101010, 101011, 101012, etc.
I've been trying to find an efficient way to list ranges of consecutive numbers so I can find out where numbers are "missing" easily. What I'd like to do is list the start number, end number, and number of consecutive rows. Since there can be duplicates, I am using SELECT DISTINCT(number) to avoid duplicates.
I've not been having much luck - most of the questions of this type deal with dates and have been hard to generalize. One query was executing forever, so that was a no go. This answer is sort of close but not quite. It uses a CROSS JOIN, which sounds like a recipe for disaster when you have millions of records.
What would be the best way to do this? Some answers use joins, which I'm skeptical of performance-wise. Right now there are only 50,000 rows, but it will be millions of records within a few days, so every ounce of performance matters.
The eventual pseudoquery I have in mind is something like:
SELECT DISTINCT(number) FROM cn WHERE type = 1 GROUP BY [consecutive...] ORDER BY number ASC
This is a gaps-and-islands problem. You can solve it by using the difference between row_number() and number to define groups; gaps are identified by changes in the difference:
select type, min(number) first_number, max(number) last_number, count(*) no_records
from (
select cn.*, row_number() over(order by number) rn
from cn
where type = 1
) c
group by type, number - rn
Note: window functions are available in MySQL 8.0 and MariaDB 10.3 onwards.
In earlier versions, you can emulate row_number() with a session variable:
select type, min(number) first_number, max(number) last_number, count(*) no_records
from (
select c.*, @rn := @rn + 1 rn
from (select * from cn where type = 1 order by number) c
cross join (select @rn := 0) r
) c
group by type, number - rn
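To make the grouping concrete, here is a small hypothetical example (values invented for illustration):
-- Suppose cn contains these type = 1 rows:
INSERT INTO cn (type, number, `desc`) VALUES
(1, 101010, 'a'), (1, 101011, 'b'), (1, 101012, 'c'),
(1, 101015, 'd'), (1, 101016, 'e');
-- row_number() over (order by number) assigns 1..5, so number - rn is
-- 101009 for the first three rows and 101011 for the last two.
-- Grouping on that difference, the query returns:
-- type | first_number | last_number | no_records
--    1 |       101010 |      101012 |          3
--    1 |       101015 |      101016 |          2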

Get random single row from mysql table with enumerated rows

I've got a table with an auto-incremented ID in MySQL. I am always adding to this table, never deleting, and always inserting with the ID set to NULL (so auto-increment assigns it), so I am pretty sure there are no holes. This is the table structure:
CREATE TABLE mytable (
id smallint(5) unsigned NOT NULL AUTO_INCREMENT,
data1 varchar(200) DEFAULT NULL,
data2 varchar(30) DEFAULT NULL,
PRIMARY KEY (id),
UNIQUE KEY data (data1,data2)
)
I want to pick up a random row from the table. I am using this:
select * from mytable where id=(select floor(1 + rand() * ((select max(id) from mytable) - 1)));
But sometimes I get nothing, sometimes one row, sometimes two. Replacing max(id) with count(*) or count(id) did not help. I understand it may be because rand() is evaluated for each row. As suggested in a similar question, I used this query:
select * from mytable cross join (select #rand := rand()) const where id=floor(1 + #rand*((select count(*) from mytable)-1));
But I still get an empty set sometimes. Same goes for this:
select * from mytable cross join (select #rand := rand()) const where id=floor(#rand*(select count(*) from mytable)+1);
I am looking for a fast way to do this, so that it won't take long on big tables. ORDER BY rand() LIMIT 1 is not an option for me. Can this be done with a single query?
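One common workaround (a sketch, not from the original thread): evaluate RAND() once into a variable, then use >= with ORDER BY ... LIMIT 1 so rounding never produces an empty set:
-- Pick a random target id once, then take the first row at or above it.
SET @rid := FLOOR(1 + RAND() * (SELECT MAX(id) FROM mytable));
SELECT * FROM mytable WHERE id >= @rid ORDER BY id LIMIT 1;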

MySQL query optimization with group by clause

I want to calculate total and unique clickouts based on country, partner, and retailer.
I have achieved the desired result, but I think it's not an optimal solution, and for larger data sets it will take longer. How can I improve this query?
Here is my test table, the query I designed, and the expected output:
"country_id","partner","retailer","id_customer","id_clickout"
"1","A","B","100","XX"
"1","A","B","100","XX"
"2","A","B","100","XX"
"2","A","B","100","GG"
"2","A","B","100","XX"
"2","A","B","101","XX"
DROP TABLE IF EXISTS x;
CREATE TEMPORARY TABLE x AS
SELECT test1.country_id, test1.partner,test1.retailer, test1.id_customer,
SUM(CASE WHEN test1.id_clickout IS NULL THEN 0 ELSE 1 END) AS clicks,
CASE WHEN test1.id_clickout IS NULL THEN 0 ELSE 1 END AS unique_clicks
FROM test1
GROUP BY 1,2,3,4
;
SELECT country_id,partner,retailer, SUM(clicks), SUM(unique_clicks)
FROM x
GROUP BY 1,2,3
Output:
"country_id","partner","retailer","SUM(clicks)","SUM(unique_clicks)"
"1","A","B","2","1"
"2","A","B","4","2"
And here is DDL and input data:
CREATE TABLE test1 (
country_id INT(11) DEFAULT NULL,
partner VARCHAR(256) CHARACTER SET utf8 DEFAULT NULL,
retailer VARCHAR(256) CHARACTER SET utf8 DEFAULT NULL,
id_customer BIGINT(20) DEFAULT NULL,
id_clickout VARCHAR(256) CHARACTER SET utf8 DEFAULT NULL)
ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO test1 VALUES(1,'A','B','100','XX'),(1,'A','B','100','XX'),
(2,'A','B','100','XX'),(2,'A','B','100','GG'),
(2,'A','B','100','XX'),(2,'A','B','101','XX');
SELECT
country_id,
partner,
retailer,
COUNT(id_clickout) AS clicks,
COUNT(DISTINCT CASE WHEN id_clickout IS NOT NULL THEN id_customer END) AS unique_clicks
FROM
test1
GROUP BY
1,2,3
;
COUNT(a_field) won't count any NULL values.
So, COUNT(id_clickout) will only count the number of times that it is NOT NULL.
Equally, the CASE WHEN statement in the unique_clicks only returns the id_customer for records where they clicked, otherwise it returns NULL. This means that the COUNT(DISTINCT CASE) only counts distinct customers, and only when they clicked.
EDIT :
I just realised, it's potentially even simpler than that...
SELECT
country_id,
partner,
retailer,
COUNT(*) AS clicks,
COUNT(DISTINCT id_customer) AS unique_clicks
FROM
test1
WHERE
id_clickout IS NOT NULL
GROUP BY
1,2,3
;
The only material difference in the results will be that any country_id, partner, retailer that previously showed up with 0 clicks will now not appear in the results at all.
With an INDEX on country_id, partner, retailer, id_clickout, id_customer or country_id, partner, retailer, id_customer, id_clickout, however, this query should be significantly faster.
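For example (the index name is mine; depending on MySQL version and row format, the utf8 VARCHAR(256) columns may need prefix lengths to stay within the index key-length limit):
ALTER TABLE test1
ADD INDEX idx_clicks (country_id, partner, retailer, id_clickout, id_customer);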
I think this is what you are after:
SELECT country_id, partner, retailer,
COUNT(retailer) as `sum(clicks)`,
count(distinct id_clickout) as `SUM(unique_clicks)`
FROM test1
GROUP BY country_id,partner,retailer
Result:
COUNTRY_ID PARTNER RETAILER SUM(CLICKS) SUM(UNIQUE_CLICKS)
1 A B 2 1
2 A B 4 2
See result in SQL Fiddle.