I have a LONGTEXT column that I use to store paragraphs of text. Let's say that within these paragraphs there is something like:
Text text text text
COUNT: 75 text text
text text text text
text text text text
Would it be possible to run a comparison query against the small string "COUNT: 75" within all that text?
My first thought was something like
SELECT * FROM `books`
WHERE `paragraphs` LIKE '%COUNT: >0%'
Is this even possible?
Your SELECT will only find rows where the text contains exactly the bit between the wildcard characters: you can't combine a LIKE with comparative logic like that.
What you can do, though, is to strip out the relevant sections of text using a regular expression and then analyse that.
Bear in mind, though, that combining
large amounts of text
textual content logic
regex
all at once will not provide the best performance! I would suggest the following:
use a trigger to strip out a subsection of text so that you have something manageable (i.e. 50 characters or so) to work with, inserting this subtext into a separate table
use MySQL regex or fulltext functions to analyse your COUNTs
So your trigger would have something like:
select
ltrim(rtrim(substring(paragraphs, instr(paragraphs, 'count:') + 6, 10)))
from books
where instr(paragraphs, 'count:') > 0
which would get you the next 10 characters after 'count:', trimmed of whitespace. You could then further refine that by e.g.
-- MySQL has no isnumeric(), so a regexp check on the first token is used instead
select substring_index(text_snippet, ' ', 1) as count_value
from
(
select
ltrim(rtrim(substring(paragraphs, instr(paragraphs, 'count:') + 6, 10)))
as text_snippet
from books
where instr(paragraphs, 'count:') > 0
) x
where substring_index(text_snippet, ' ', 1) regexp '^[0-9]+$'
to get rows where a numerical value follows the COUNT bit.
You can then extract numerical values next to COUNT, saving them as numbers in a separate table, and then use a JOIN like this:
select b.*
from books b inner join books_count_values v
on b.id = v.books_id
where v.count_value > 0
See here
http://dev.mysql.com/doc/refman/5.1/en/string-functions.html
for the functions at your disposal.
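If the extraction ends up happening in application code rather than in a trigger, the same idea can be sketched in Python. This is only an illustration: the `extract_count` helper and the sample rows are made up, and it assumes the marker always looks like `COUNT: <digits>`.

```python
import re

# Hypothetical pattern: the literal marker, optional whitespace, then digits.
COUNT_RE = re.compile(r"COUNT:\s*(\d+)")


def extract_count(paragraph):
    """Return the number after 'COUNT:' as an int, or None if there is none."""
    match = COUNT_RE.search(paragraph)
    return int(match.group(1)) if match else None


# Made-up rows standing in for (id, paragraphs) pairs read from the books table.
rows = [
    (1, "Text text text text\nCOUNT: 75 text text\ntext text text text"),
    (2, "no marker in this one"),
    (3, "intro COUNT: 0 outro"),
]

# Keep only the ids whose extracted count is greater than zero.
matching_ids = [
    row_id for row_id, text in rows
    if (extract_count(text) or 0) > 0
]
print(matching_ids)
```

The extracted integers are what you would store in the separate table, so that the comparison itself runs against a numeric column.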
I am doing a large union select and all of the union selects will retrieve a result except for one. One of them has a collation error. For example
SELECT word FROM words WHERE english = 'hello'
UNION
SELECT word FROM words WHERE english = 'no'
UNION
SELECT word FROM words WHERE english = 'пыук';
The last one would produce a collation error and therefore the whole select fails. Is there a way that I can include the select for the one that will return the error while still getting the results for the rest of them?
One way to do this would be to CONVERT() the string you're looking up to the character set of the column:
SELECT word FROM words WHERE english = CONVERT('пыук' USING utf8);
Side note: be sure to change utf8 to the character set of your english column.
This is suboptimal, but it will do what you need.
I have a column in which is stored nothing but text separated by one space. There may be one to maybe 5 words in each field of the column. I need a query to return all the distinct words in that column.
Tried:
SELECT DISTINCT tags FROM documents ORDER BY tags
but does not work.
To Elaborate.
I have a column called tags. In it I may have the following entries:
Row 1 Red Green Blue Yellow
Row 2 Red Blue Orange
Row 3 Green Blue Brown
I want to select all the DISTINCT words in the entire column - all fields. It would return:
Red Green Blue Yellow Orange Brown
If I counted each it would return:
2 Red
2 Green
3 Blue
1 Yellow
1 Brown
1 Orange
To fix this I ended up creating a second table where all keywords were inserted, each on its own row, along with a record key that tied them back to the original record in the main data table. I then just have to SELECT DISTINCT to get all tags, or I can SELECT DISTINCT with a WHERE clause specifying the original record to get the tags associated with a unique record. Much easier.
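For anyone wanting to see the second-table approach end to end, here is a rough sketch in Python. It uses an in-memory SQLite database purely as a stand-in for the real server, and the `document_tags` table and `document_id` column are names I made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, tags TEXT)")
# Hypothetical second table: one row per (document, tag) pair.
conn.execute("CREATE TABLE document_tags (document_id INTEGER, tag TEXT)")
conn.executemany(
    "INSERT INTO documents (id, tags) VALUES (?, ?)",
    [
        (1, "Red Green Blue Yellow"),
        (2, "Red Blue Orange"),
        (3, "Green Blue Brown"),
    ],
)

# Split each space-separated field and store one row per tag.
for doc_id, tags in conn.execute("SELECT id, tags FROM documents").fetchall():
    conn.executemany(
        "INSERT INTO document_tags (document_id, tag) VALUES (?, ?)",
        [(doc_id, tag) for tag in tags.split()],
    )

# All distinct tags, and how often each one occurs.
distinct_tags = [
    row[0]
    for row in conn.execute("SELECT DISTINCT tag FROM document_tags ORDER BY tag")
]
tag_counts = dict(
    conn.execute("SELECT tag, COUNT(*) FROM document_tags GROUP BY tag")
)
print(distinct_tags)
print(tag_counts)
```

Once the tags live in their own rows, both the distinct list and the per-tag counts are ordinary one-line queries.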
There is not a good solution for this. You can achieve it with the JSON functions (available since MySQL 5.7, I think), but it's a little tricky before 8.0, when MySQL added the JSON_TABLE function, which converts JSON data to a table-like object that you can select from. How well it performs depends on your actual data. Here's a working example:
CREATE TABLE t(raw varchar(100));
INSERT INTO t (raw) VALUES ('this is a test');
You will need to strip the symbols (commas, periods, maybe others) from your text, then replace any whitespace with ",", then wrap the whole thing in [" and "] to format it as JSON. I'm not going to give a full-featured example, because you know better than I do what your data looks like, but something like this (in its simplest form):
SELECT CONCAT('["', REPLACE(raw, ' ', '","'), '"]') FROM t;
With JSON_TABLE, you can do something like this:
SELECT CONCAT('["', REPLACE(raw, ' ', '","'), '"]') INTO @delimited FROM t;
SELECT *
FROM JSON_TABLE(
@delimited,
"$[*]"
COLUMNS(Value varchar(50) PATH "$")
) d;
See this fiddle: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=7a86fcc77408ff5dfec7a805c6e4117a
At this point you have a table of the split words, and you can replace SELECT * with whatever counting query you want, probably SELECT Value, count(*) as vol. You will also need to use group_concat to handle multiple rows. Like this:
insert into t (raw) values ('this is also a test'), ('and you can test it');
select concat(
'["',
replace(group_concat(raw SEPARATOR '","'), ' ', '","'),
'"]'
) into @delimited from t;
SELECT Value, count(*) as vol
FROM JSON_TABLE(
@delimited,
"$[*]"
COLUMNS(Value varchar(50) PATH "$")
) d
GROUP BY Value ORDER BY count(*) DESC;
If you are running <8.0, you can still accomplish this, but it will take some hackiness, like generating an arbitrary list of numbers and constructing the paths dynamically from that.
(BIDS on SQL Server 2008)
I have a flat file (pipe-delimited) which I have successfully parsed to the following format:
AccountID  FreeText1  FreeText2  FreeText3   FreeText4
1          Some text  More text  Other text  Different Text
2          Some text  More text  Other text  Different Text
3          Some text  More text  Other text  Different Text
I need the end result to look like this:
AccountID  Title      TheData
1          FreeText1  Some text
1          FreeText2  More text
1          FreeText3  Other text
1          FreeText4  Different Text
2          FreeText1  Some text
2          FreeText2  More text
2          FreeText3  Other text
2          FreeText4  Different Text
3          FreeText1  Some text
3          FreeText2  More text
3          FreeText3  Other text
3          FreeText4  Different Text
I am still rather new to SSIS so learning as I go. Everything I found on the Unpivot transformation seems to be what I need, but I haven't been able to figure out how to get it to Unpivot based on the NAME of the column ("FreeText1", etc), nor have I been able to fully grasp how to set up the Unpivot transform to even get close to the desired results.
I haven't yet found any SSIS formulas I could use in a Derived Column to get the column name programmatically, thinking maybe I could generate the column names in a Derived Column and then Merge Join the two together... but that doesn't seem like a very efficient method and I couldn't make it work anyway. I have tried setting up a Derived Column to return the column names in hard code (using "FreeText1" as a formula, for example), however I remain unsure as to how to combine this with the Unpivoted results.
Any input would be greatly appreciated!
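You could use the Unpivot transformation, which should look something like this: check FreeText1 to FreeText4 as the input columns, give all four the same Destination Column (TheData), keep each Pivot Key Value as the column's own name, and enter Title as the pivot key value column name.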
Or you could load the data to a staging table and use the T-SQL UNPIVOT operator:
SELECT upvt.AccountID, upvt.Title, upvt.TheData
FROM dbo.StagingTable AS t
UNPIVOT (TheData FOR Title IN (FreeText1, FreeText2, FreeText3, FreeText4)) AS upvt;
Or slightly longer winded, but more flexible is to use CROSS APPLY along with a table value constructor to unpivot data. e.g.
SELECT t.AccountID, upvt.Title, upvt.TheData
FROM dbo.StagingTable AS t
CROSS APPLY
(VALUES
('FreeText1', FreeText1),
('FreeText2', FreeText2),
('FreeText3', FreeText3),
('FreeText4', FreeText4)
) AS upvt (Title, TheData);
I have found that for some people UnPivot is more confusing to learn than it is worth so I have always avoided it when possible. If you want to take a different approach, you can do this:
Load the data into a temp table and run the following query against it:
Select * From
(
Select AccountID, 'FreeText1' Title, FreeText1 TheData From TableA
Union All
Select AccountID, 'FreeText2' Title, FreeText2 TheData From TableA
Union All
Select AccountID, 'FreeText3' Title, FreeText3 TheData From TableA
Union All
Select AccountID, 'FreeText4' Title, FreeText4 TheData From TableA
) A
Order By AccountID, Title
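To make it concrete what all three queries are doing, here is the same reshaping sketched in plain Python. The `unpivot` function and the sample rows are only illustrative, not part of any SSIS or SQL Server API.

```python
def unpivot(rows, id_column, value_columns):
    """Turn one wide row per account into one (id, title, value) row per column."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            # The column NAME becomes the Title, its content becomes TheData.
            long_rows.append((row[id_column], column, row[column]))
    return long_rows


# Sample rows mirroring the parsed flat file.
wide_rows = [
    {"AccountID": 1, "FreeText1": "Some text", "FreeText2": "More text",
     "FreeText3": "Other text", "FreeText4": "Different Text"},
    {"AccountID": 2, "FreeText1": "Some text", "FreeText2": "More text",
     "FreeText3": "Other text", "FreeText4": "Different Text"},
]

long_rows = unpivot(
    wide_rows, "AccountID", ["FreeText1", "FreeText2", "FreeText3", "FreeText4"]
)
for account_id, title, the_data in long_rows:
    print(account_id, title, the_data)
```

Each input row produces one output row per listed column, which is exactly what the UNION ALL version spells out by hand.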
What would be the right SQL statement so that when I search for two words, for example 'text field' in a text box, it returns all results that have both 'text' and 'field' in them, using the LIKE operator? I can't find the right terms to search for this. If possible, I want to make it dynamic: if a user searches for 5 words, all 5 words would be in the LIKE statement. I am trying to achieve a statement something like this.
SELECT *
FROM TABLE
WHERE SEARCH (LIKE %searchterm1%)
OR (LIKE %searchterm2%)
OR (LIKE %searchterm3%) ....
Try this, which matches rows containing any one of the terms: http://dev.mysql.com/doc/refman/5.1/en/regexp.html#operator_regexp
SELECT *
FROM TABLE
WHERE SEARCH
REGEXP 'searchterm1|searchterm2|searchterm3'
Here's an example of a SQL SELECT statement that uses the LIKE comparison operator
SELECT t.*
FROM mytable t
WHERE t.col LIKE CONCAT('%','cdef','%')
AND t.col LIKE CONCAT('%','hijk','%')
AND t.col LIKE CONCAT('%','mnop','%')
Only rows that have a value in the col column that contains all of the strings 'cdef', 'hijk', and 'mnop' will be returned.
You specifically asked about the LIKE comparison operator. There's also a REGEXP operator that matches regular expressions. And the Full-Text search feature may be a good fit for your use case.
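For the dynamic part of the question, the statement is usually assembled in application code. Below is a minimal sketch in Python; `build_like_query` is a hypothetical helper, SQLite stands in for the real database, and the `?` placeholder style is an assumption that depends on the driver (many MySQL drivers use `%s` instead). It does not escape `%` or `_` inside the search words.

```python
import sqlite3


def build_like_query(table, column, search_text):
    """Build a parameterized query requiring every word to appear in `column`."""
    words = search_text.split()
    if not words:
        raise ValueError("search_text must contain at least one word")
    # Identifiers cannot be bound as parameters, so `table` and `column`
    # must come from trusted code, never from user input.
    where = " AND ".join(f"{column} LIKE ?" for _ in words)
    params = [f"%{word}%" for word in words]
    return f"SELECT * FROM {table} WHERE {where}", params


# Small demo against an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col TEXT)")
conn.executemany(
    "INSERT INTO mytable (col) VALUES (?)",
    [("a text field",), ("text only",), ("field of text",)],
)

sql, params = build_like_query("mytable", "col", "text field")
results = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(results)
```

Joining with " OR " instead of " AND " would return rows that contain any of the words rather than all of them.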
First off, there seems to be no way to get an exact match using a full-text search. This seems to be a highly discussed issue when using the full-text search method, and there are lots of different solutions to achieve the desired result, but most seem very inefficient. Since I'm forced to use full-text search due to the volume of my database, I recently had to implement one of these solutions to get more accurate results.
I could not use the ranking results from the full-text search because of how it works. For instance, if you searched for a movie called Toy Story and there was also a movie called The Story Behind Toy Story, that would come up instead of the exact match because it found the word Story twice and Toy.
I do track my own rankings, which I call "Popularity": each time a user accesses a record the number goes up. I use this data point to weight my results to help determine what the user might be looking for.
I also have the issue where I sometimes need to fall back to a LIKE search and not return an exact match, i.e. searching Goonies should return The Goonies (most popular result).
So here is an example of my current stored procedure for achieving this:
DECLARE @Title varchar(255)
SET @Title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @Title2 varchar(255)
SET @Title2 = REPLACE(@title, '"', '')
--get top 100 results using full-text search and sort them by popularity
SELECT TOP(100) id, title, popularity As Weight into #TempTable FROM movies WHERE CONTAINS(title, @Title) ORDER BY [Weight] DESC
--check if exact match can be found
IF EXISTS(select * from #TempTable where Title = @title2)
--return exact match
SELECT TOP(1) * from #TempTable where Title = @title2
ELSE
--no exact match found, try using like with wildcards
SELECT TOP(1) * from #TempTable where Title like '%' + @title2 + '%'
DROP TABLE #TEMPTABLE
This stored procedure is executed about 5,000 times a minute, and crazily enough it's not bringing my server to its knees. But I really want to know if there is a more efficient approach to this. Thanks.
You should use full text search CONTAINSTABLE to find the top 100 (possibly 200) candidate results and then order the results you found using your own criteria.
It sounds like you'd like to ORDER BY
exact match of the phrase (=)
the fully matched phrase (LIKE)
higher value for the Popularity column
the Rank from the CONTAINSTABLE
But you can toy around with the exact order you prefer.
In SQL that looks something like:
DECLARE @title varchar(255)
SET @title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @title2 varchar(255)
SET @title2 = REPLACE(@title, '"', '')
SELECT
m.ID,
m.title,
m.Popularity,
k.Rank
FROM Movies m
INNER JOIN CONTAINSTABLE(Movies, title, @title, 100) as [k]
ON m.ID = k.[Key]
ORDER BY
CASE WHEN m.title = @title2 THEN 0 ELSE 1 END,
CASE WHEN m.title LIKE '%' + @title2 + '%' THEN 0 ELSE 1 END,
m.popularity desc,
k.rank desc
See SQLFiddle
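The ordering logic itself is just a multi-level sort key. Here is a rough sketch of it in Python, assuming the candidates have already come back from the full-text search; `rank_candidates`, the sample movies and their numbers are invented, and the comparison is lowercased on the assumption that the column uses a case-insensitive collation. Higher popularity and higher rank are treated as better.

```python
def rank_candidates(candidates, phrase):
    """Sort by: exact title match, phrase contained in title, popularity, rank."""
    needle = phrase.lower()

    def sort_key(movie):
        title = movie["title"].lower()
        return (
            0 if title == needle else 1,   # exact match first
            0 if needle in title else 1,   # then titles containing the phrase
            -movie["popularity"],          # then most popular
            -movie["rank"],                # then best full-text rank
        )

    return sorted(candidates, key=sort_key)


# Made-up candidate rows, as if returned by the full-text search.
candidates = [
    {"title": "The Story Behind Toy Story", "popularity": 900, "rank": 80},
    {"title": "Toy Story", "popularity": 500, "rank": 60},
    {"title": "Toy Story 2", "popularity": 700, "rank": 70},
]

ordered = [movie["title"] for movie in rank_candidates(candidates, "Toy Story")]
print(ordered)
```

The exact title wins even though it is the least popular of the three, which is the behaviour the question was after.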
This will give you the movies that contain the exact phrase "Toy Story", ordered by their popularity.
SELECT
m.[ID],
m.[Popularity],
k.[Rank]
FROM [dbo].[Movies] m
INNER JOIN CONTAINSTABLE([dbo].[Movies], [Title], N'"Toy Story"') as [k]
ON m.[ID] = k.[Key]
ORDER BY m.[Popularity]
Note the above would also give you "The Goonies Return" if you searched "The Goonies".
I've got the feeling you don't really like the fuzzy part of the full-text search but you do like the performance part.
Maybe this is a path: if you insist on getting the EXACT match before a weighted match, you could try to hash the value. For example 'Toy Story' -> bring to lowercase -> toy story -> hash into 4de2gs5sa (with whatever hash you like) and perform a search on the hash.
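A minimal sketch of that idea in Python, just to show the shape of it. The choice of SHA-1, the whitespace normalization and the in-memory dictionary are all assumptions made for the example; in the database this would more likely be an indexed column holding the hash.

```python
import hashlib


def title_hash(title):
    """Lowercase, collapse whitespace, then hash the normalized title."""
    normalized = " ".join(title.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Build a lookup from hash to movie ids (stand-in for an indexed hash column).
movies = [
    (1, "Toy Story"),
    (2, "The Story Behind Toy Story"),
    (3, "The Goonies"),
]
index = {}
for movie_id, title in movies:
    index.setdefault(title_hash(title), []).append(movie_id)

# An exact-match lookup that ignores case and stray spaces in the input.
exact = index.get(title_hash("  toy   STORY "), [])
print(exact)
```

A lookup on the hash only ever finds exact (normalized) titles, so a search for "Goonies" would still need the full-text or LIKE fallback.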
In Oracle I've used UTL_MATCH for similar purposes. (http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/u_match.htm)
Even though using the Jaro Winkler algorithm, for instance, might take a while if you compare the title column from table 1 and table 2, you can improve performance if you partially join the 2 tables. In some cases I have compared person names in table 1 with table 2 using Jaro Winkler, but limited the results not just to those above a certain Jaro Winkler threshold, but also to names between the 2 tables where the first letter is the same. For instance I would compare Albert with Aden, Alfonzo, and Alberto using Jaro Winkler, but not Albert and Frank (limiting the number of situations where the algorithm needs to be used).
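The first-letter restriction can be sketched like this in Python. Note that `difflib.SequenceMatcher` is used purely as a stand-in similarity score because it ships with the standard library; it is not Jaro Winkler, and the `similar_names` helper and the 0.7 threshold are invented for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similar_names(names_a, names_b, threshold=0.7):
    """Compare only names that share a first letter; keep pairs above threshold."""
    # Group the second list by initial so unrelated names are never scored.
    by_initial = defaultdict(list)
    for name in names_b:
        if name:
            by_initial[name[0].lower()].append(name)

    pairs = []
    for a in names_a:
        if not a:
            continue
        for b in by_initial.get(a[0].lower(), []):
            # Stand-in similarity measure, NOT Jaro Winkler.
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 3)))
    return pairs


pairs = similar_names(["Albert"], ["Aden", "Alfonzo", "Alberto", "Frank"])
print(pairs)
```

Albert is scored against Aden, Alfonzo and Alberto but never against Frank, which is where the saving comes from when the real metric is expensive.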
Jaro Winkler may actually be suitable for movie titles as well. Although you are using SQL server (can't use the utl_match package) it looks like there is a free library called "SimMetrics" which has the Jaro Winkler algorithm among other string comparison metrics. You can find detail on that and instructions here: http://anastasiosyal.com/POST/2009/01/11/18.ASPX?#simmetrics