searching some words in a paragraph using mysql query - mysql

I am working on a project. I have a paragraph and some tags such as C#, mysql, .net, ajax, etc. I want to check whether the paragraph contains any of these tags and, if so, which ones and how many. Depending on the number of matching tags, I have to assign a score. I can't work out how to do this: I can't use an IN clause here, nor can I use FIND_IN_SET(). How should I achieve this?

You might want to look at the REGEXP feature of mysql:
CREATE TABLE texts (
    paragraph TEXT,
    FULLTEXT INDEX (paragraph)
) ENGINE = MyISAM;
INSERT INTO texts (paragraph) VALUES
    ("this is a very uninteresting paragraph "),
    ("MySQL can be fun and useful, with or without PHP!"),
    (" I have misspelled phpone, but I didn't mean the programming language!");
SELECT paragraph FROM texts
WHERE paragraph REGEXP "[[:<:]]php[[:>:]]";
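It's not as efficient as a FULLTEXT search, but it may fit your needs better, depending on your data. Note that the [[:<:]] and [[:>:]] word-boundary markers are not supported from MySQL 8.0.4 onwards; there, write the boundary as \\b inside the string literal instead.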

MySQL's full-text search seems to be exactly what you're looking for, although by default it may have trouble searching for C#, because of both the short length and the special character.
Alternatively, you can just use LIKE, to search for paragraph LIKE '%PHP%'.
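If doing the matching in application code is an option, here is a rough Python sketch of the tag counting and scoring the question asks for. The function name match_tags and the sample paragraph are made up for illustration, and the score is assumed to be simply the number of matched tags. It uses lookarounds instead of \b because tags like C# and .net start or end with non-word characters.

```python
import re

def match_tags(paragraph, tags):
    """Return the tags that occur in paragraph as whole tokens (case-insensitive)."""
    found = []
    for tag in tags:
        # \b does not work next to '#' or '.', so require "no word character"
        # immediately before and after the escaped tag instead.
        pattern = r"(?<!\w)" + re.escape(tag) + r"(?!\w)"
        if re.search(pattern, paragraph, flags=re.IGNORECASE):
            found.append(tag)
    return found

# Hypothetical sample data
paragraph = "We build .NET services in C# and store data in MySQL."
tags = ["C#", "mysql", ".net", "ajax"]

matched = match_tags(paragraph, tags)
score = len(matched)  # assumption: one point per matched tag
print(matched, score)
```

A tag such as "C" would still match inside "C#" with this approach, so the tag list may need ordering or extra rules if it contains tags that are prefixes of each other.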

Related

use regex to select words between html tags

Thanks for visiting my question. I'm trying to match sentences between tags, for example:
<h1> Most flavors, except the ones discussed below, have only one
metacharacter that matches both before a word and after a word. <p>
This is because any position between characters can never be both at
the start and at the end of a word. Using only one operator makes
things easier for you.<p>Word boundaries, as described above, are
supported by most regular expression flavors.
I'm trying to get 10 words from each tag.
output:
Most flavors, except the ones discussed below, have only one
This is because any position between characters can never be
Word boundaries, as described above, are supported by most regular
I find it's so tricky. Thanks for your help here!!!
As has already been linked in the comments, one of the most well-known answers of all time on this site explains why using regular expressions to parse HTML is probably not a good idea. For a more detailed and balanced overview of when it is and isn't a good idea to do so, check out this question as well.
But briefly, the answer depends on what you're trying to do. It's likely that you'll be better off finding an HTML/XML-parsing library for whatever language you're using, and extracting the text with that.
I'm a bit confused as to what your task actually is, as your code as shown isn't valid HTML, since <h1> at least requires a closing tag. But if you do need to use regex to do this, you will want to look at word boundaries and interval operators for limiting to 10, and perhaps lookbehind (or just capture groups) to match the tag without returning it.
But again: if you're trying to parse actual HTML, you'd be better off using an HTML parser to get the tag content, and then getting the first 10 words using string operations. An example in JavaScript, which is a bit of a cheat because you get the HTML parsing for free, but it makes for an easy example:
for (const tag of document.querySelectorAll('body *')) {
  // Split on runs of whitespace and keep the first 10 words of each element's text
  console.log(`${tag.tagName}: ${tag.innerText.trim().split(/\s+/).slice(0, 10).join(' ')}`)
}
<h1>This is an h1 tag with a bunch of text in it that is really long</h1>
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
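Outside the browser, the same idea can be sketched with Python's standard html.parser module, which copes with unclosed tags like the ones in the question. The class name FirstWords and its limit parameter are invented for this example; it treats everything between one opening tag and the next as that tag's text, which is an assumption about what "between tags" should mean here.

```python
from html.parser import HTMLParser

class FirstWords(HTMLParser):
    """Collect the text that follows each opening tag and keep its first words."""

    def __init__(self, limit=10):
        super().__init__()
        self.limit = limit
        self.chunks = []  # one [tag, text] pair per opening tag

    def handle_starttag(self, tag, attrs):
        self.chunks.append([tag, ""])

    def handle_data(self, data):
        # Text before the first tag is ignored; otherwise append to the latest tag.
        if self.chunks:
            self.chunks[-1][1] += data

    def results(self):
        return [" ".join(text.split()[: self.limit]) for _, text in self.chunks]

html = (
    "<h1> Most flavors, except the ones discussed below, have only one "
    "metacharacter that matches both before a word and after a word. <p> "
    "This is because any position between characters can never be both at "
    "the start and at the end of a word.<p>Word boundaries, as described "
    "above, are supported by most regular expression flavors."
)

parser = FirstWords(limit=10)
parser.feed(html)
parser.close()  # flush any text still buffered after the last tag
first_words = parser.results()
for line in first_words:
    print(line)
```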

MySQL query to return sentence containing a search word from a text column

I need a MySQL query to return a full sentence from a text column that contains a specified search word.
Currently I am able to get the 20 characters before and after the search word using this query:
select id, MID(body,(LOCATE('search_word', body)-20),40) from content where body like "%search_word%" limit 1
, but that's as far as I've got.
I want to get an entire sentence (between two dots) which contains my search word.
Any ideas? Regex? How do I go about doing this?
Why don't you just get the whole field with MySQL and filter out the sentence in an actual programming language?
A javascript example would look like this: https://jsfiddle.net/n0wfgjoc/
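var text = "Lorem Ipsum is simply dummy text ... versions of Lorem Ipsum.";
var search = "popularised in the";
// Backslashes must be doubled inside a string literal; a single '\.' collapses to '.', which matches any character
var pattern = new RegExp('\\. ([^.]*' + search + '[^.]*\\.)', 'i');
var match = text.match(pattern);
// match is null when the search phrase is not found
document.getElementsByTagName('body')[0].innerHTML = match ? match[1] : '';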
You should not have a problem adapting it to your needs and your language.
It should be much more performant than doing it in pure SQL.
EDIT:
As @David pointed out, it might be a problem if dots are used in the text in other contexts, for abbreviations or dates maybe.
Solving that would be a hard task. My example does not cover that use case.
In PostgreSQL you can do this using regexp_matches, and I believe in MySQL this would be REGEXP_SUBSTR, see also: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-substr.
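As a rough illustration of the "fetch the field, then filter in application code" approach, here is a small Python sketch. The function name sentence_containing and the sample text are hypothetical, and it makes the same naive assumption as the answers above: a sentence ends at '.', '!' or '?' followed by whitespace, so abbreviations and dates will break it.

```python
import re

def sentence_containing(text, search):
    """Return the first sentence of text that contains search, or None."""
    # Naive split: cut after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if search.lower() in sentence.lower():
            return sentence
    return None

# Hypothetical value of the body column for one row
body = (
    "Lorem Ipsum is dummy text. It was popularised in the 1960s by Letraset. "
    "It survives today."
)
print(sentence_containing(body, "popularised in the"))
```

Unlike the JavaScript pattern, this also finds a match in the first sentence of the text, since it does not require a preceding dot.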

MySQL: Search for words that may have interfering characters in between

I store lyrics of songs and also allow chords to be added by putting them between square brackets (e.g: [Dm7]). Here's an example of lyrics stored in my database:
Left my fear [Dm7]by the side of the [B]road
Hear You[C] speak won't let[E] go
Fall to my knees
...
What I want to do is search for lyrics in songs. For example, I might want to search for the lyrics fear by the side. The problem is that the [Dm7] in my example above does not allow a simple LIKE search.
Is it possible to do a search (REGEX?) that excludes text such as [Dm7] from a query? If so how? Please note that the chords between the square brackets can vary.
You might like to consider a fulltext index, and then use match() against() in your where clause. Example:
create fulltext index ftx on songs(lyrics);
select *
from songs
where match(lyrics) against('fear by the side');
The matching is a little fuzzy, and you can't use the boolean mode matching because the chords don't have whitespace on both sides, but the normal mode should be sufficient.
The 'fuzziness' of the match can be used to provide a match ranking, which works best on English-language text, which this seems to be. For example:
-- "rank" is a reserved word from MySQL 8.0.2 onwards, hence the alias match_rank
select match(lyrics) against('fear by the side') as match_rank,
lyrics from songs
where match(lyrics) against('fear by the side')
order by match(lyrics) against('fear by the side') desc;
Would sort the results by best match, and also return the matching rank.
The fulltext index also has a boolean mode, which, as the name suggests, can be used to force the results to include or exclude certain words, like so:
match(column) against('+word -otherword' in boolean mode) would return all rows for which column contains word but does not have otherword.
Your fulltext index can also be multi-column, if you desire.
Thanks to @SvenB and his suggestion of this post, this was my answer.
REPLACE(col, SUBSTRING(col, (LOCATE('[', col)), LOCATE(']', col) - (LOCATE('[', col)) + 1), '') LIKE '%fear by the side%'
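It's a bit messy but works! Note that it only strips the first bracketed chord in the column (plus any identical repeats of it), so later, different chords are left in place. I think in the long term FULLTEXT search is the way to go, based on others' comments.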
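If the search can run in application code instead, stripping every chord before comparing is straightforward. This is only a sketch: the function name lyrics_contain is invented, and it assumes chords are always written in square brackets and never nested.

```python
import re

# Anything in square brackets is treated as a chord, e.g. [Dm7] or [B]
CHORD = re.compile(r"\[[^\]]*\]")

def lyrics_contain(lyrics, phrase):
    """Case-insensitive phrase search that ignores [chord] markers."""
    plain = CHORD.sub("", lyrics)
    # Collapse whitespace so line breaks and double spaces do not block a match
    plain = " ".join(plain.split())
    return phrase.lower() in plain.lower()

lyrics = (
    "Left my fear [Dm7]by the side of the [B]road\n"
    "Hear You[C] speak won't let[E] go\n"
    "Fall to my knees"
)
print(lyrics_contain(lyrics, "fear by the side"))
```

Storing a chord-free copy of the lyrics in a second column, produced the same way at write time, would let a plain LIKE or FULLTEXT search work on the database side.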

Regex to match text longer than x characters between html tags?

I have the task of migrating THE worst HTML product descriptions you will ever encounter. It consists of a mixture of tables and paragraphs. The majority are not even 100% valid HTML and there are plenty of Microsoft tags courtesy of MS Word. It is littered with in line style tags and the most of it relies on the most bonky set of css rules you will ever see.
Essentially I have come to the realisation that the only thing of use is the paragraphs of text. I cannot just grab the <p> tags, as sometimes the paragraphs do not use them and sometimes titles or single words have their own <p> tag.
So my question is: can I match text that is longer than x characters between HTML tags?
Ideally it would also ignore <br/> and <br>
Here is a link to an example of the html I am dealing with
Note it is just the description I am processing, not the whole page.
Group 1 of this regex will match n+ chars between tags (n = 100 in this example):
<[^>]+>([^<]{100,})<[^>]+>
Notes:
I have deliberately not matched for a matching closing tag (<([^>]+)>([^<]{100,})</\1>) because of OP's sloppy HTML - a tag is a tag
I have avoided using a lookbehind ((?<=<[^>]+>)) because the match is of arbitrary length, which can cause backtracking problems (some languages, like Java, do not even support it).
Scanning through the site a little, it looks like many of the descriptions fall short of 100 characters. You might try a multi-pass approach, where in the first iteration, you capture all content from the first table following 'div id="tab1"'. From that starting point, it may be easier to identify and eliminate the parts you don't want, rather than extracting the parts you do want.
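To try the regex out, here is a small Python sketch. It is a slight variation on the pattern above rather than the exact same expression: the trailing tag is checked with a lookahead instead of being consumed, so that two long runs separated by a single tag such as <br> are both found. The function name long_text_runs and the sample markup are made up, and the threshold is lowered to 20 to keep the example short.

```python
import re

def long_text_runs(html, min_len=100):
    """Return runs of at least min_len non-tag characters that sit between two tags."""
    # Group 1 is the text; (?=...) checks for the following tag without consuming it.
    pattern = re.compile(r"<[^>]+>([^<]{%d,})(?=<[^>]+>)" % min_len)
    return [text.strip() for text in pattern.findall(html)]

# Hypothetical messy description
html = (
    "<p>Short</p>"
    "<div>This paragraph is long enough to keep.<br>"
    "So is this second run of text here.</div>"
)
runs = long_text_runs(html, min_len=20)
print(runs)
```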

MySQL Select Like

I've got a field in my table for tags on some content, separated by spaces, and at the moment I'm using:
SELECT * FROM content WHERE tags LIKE '%$selectedtag%'
But if the selected tag is elephant, it will select content tagged with bigelephant and elephantblah etc...
How do I get it to just select what I want precisely?
SELECT * FROM content WHERE tags RLIKE '[[:<:]]elephant[[:>:]]'
If your table is MyISAM, you can use this:
SELECT * FROM content WHERE MATCH(tags) AGAINST ('+elephant' IN BOOLEAN MODE)
, which can be drastically improved by creating a FULLTEXT index on it:
CREATE FULLTEXT INDEX ix_content_tags ON content (tags)
, but will work even without the index.
For this to work, you should adjust @@ft_min_word_len to index tags less than 4 characters long, if you have any.
You could use MySQL's regular expressions.
SELECT * FROM content WHERE tags REGEXP '(^| )selectedtag($| )'
Be aware, though, that the use of regular expressions adds an overhead and might perform poorly in some circumstances.
Another simple way, if you can alter your database data, is to ensure that there is an empty space before the first tag and after the last one; A little like: " elephant animal ". That way you can use wildcards.
SELECT * FROM content WHERE tags LIKE '% selectedtag %'
I would consider a different design here. This constitutes a many-to-many relationship, so you could have a tags table and a join table. In general, atomicity of data saves you from a lot of headaches.
An added bonus of this approach is that you can rename a tag without having to edit every entry containing that tag.
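A minimal sketch of that many-to-many design, using Python's built-in sqlite3 module as a stand-in for MySQL so it can be run anywhere. The table and column names (content, tags, content_tags, name) and the sample rows are assumptions made for this example, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- Join table: one row per (content, tag) pair
    CREATE TABLE content_tags (
        content_id INTEGER REFERENCES content(id),
        tag_id INTEGER REFERENCES tags(id),
        PRIMARY KEY (content_id, tag_id)
    );
    INSERT INTO content VALUES (1, 'Savannah trip'), (2, 'Cartoon review');
    INSERT INTO tags VALUES (1, 'elephant'), (2, 'bigelephant');
    INSERT INTO content_tags VALUES (1, 1), (2, 2);
""")

def content_with_tag(conn, tag):
    """Return (id, title) rows for content carrying exactly this tag."""
    return conn.execute(
        """
        SELECT c.id, c.title
        FROM content AS c
        JOIN content_tags AS ct ON ct.content_id = c.id
        JOIN tags AS t ON t.id = ct.tag_id
        WHERE t.name = ?
        ORDER BY c.id
        """,
        (tag,),
    ).fetchall()

before = content_with_tag(conn, "elephant")  # exact match, 'bigelephant' is excluded

# Renaming a tag touches a single row in the tags table
conn.execute("UPDATE tags SET name = 'pachyderm' WHERE name = 'elephant'")
after = content_with_tag(conn, "pachyderm")
print(before, after)
```

Because the tag name is compared with plain equality, there is no substring problem to work around, and the lookup can use an ordinary index.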
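A: Create a separate tags table that points to the content via contentid and contains a keyword column, then:
select a.*
from content a, tags b
where a.id = b.contentid
and b.keyword = 'elephant' -- filter on the selected tag (column name "keyword" assumed from the description above)
group by a.id;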
B: Put a comma between the tags and also before and after them, like ",bigelephant,elephant,", then use LIKE "%,elephant,%"
WHERE tags LIKE '% {$selectedtag} %'
OR tags LIKE '{$selectedtag} %'
OR tags LIKE '% {$selectedtag}'
OR tags = '{$selectedtag}'
The '%' is a wildcard in SQL. Remove the wildcards, and you will get precisely what you ask for.
Redo the table design but for now if you use spaces to delimit between tags you COULD do this:
SELECT * FROM content WHERE tags LIKE '% elephant %';
Just make sure that you lead and end with a space as well (or replace the spaces with commas if you're doing it that way)
Again though, the best option is to set up a many-to-many relationship in your database but I suspect you're looking for a quick and dirty one-off fix.