Calculate similarity between two records in MySQL - mysql

I am trying to figure out how to calculate the similarity between two records. The first record comes from a deactivated advertisement, and I want to find, for example, the 10 most similar advertisements based on how closely some VARCHAR fields match.
What I can't figure out is whether there is a MySQL function that can help with this, or whether I need to compare the strings in some other way.
EDIT #1
Similarity would be defined by these fields:
Title (weight: 50 %)
Content (weight: 40 %)
Category (weight: 10 %)
EDIT #2
I want the calculation to be like this:
Title: Words that match in the title field (only words >2 letters are matched).
Description: Words that match in the description field (only words >2 letters are matched).
Category: Match the category, and if that doesn't match, match the parent category with less weight :)
An equation of this could be:
#1 is the old, inactive post, #2 is the active post:
#2 title matches #1 title in 3 words out of #2's total of 10 words.
That gives 30 % match = 30 points.
#2 description matches #1 description in 10 words out of #2's total
of 400 words. That gives a 4 % match = 4 points.
#2 category doesn't match #1's category, therefore 0 % match. That
gives 0 points.
Then the sum would be 34 points for #2. :)
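The arithmetic in this example can be sketched outside MySQL. The following is a minimal Python sketch, not a definitive implementation: the function names (`words`, `match_percent`, `similarity`) and the dictionary keys are hypothetical, the total is assumed to count only words longer than two letters, and neither the parent-category fallback nor the 50/40/10 weights from EDIT #1 are applied, because the worked example sums the raw percentages.

```python
import re

def words(text):
    """Lowercase words longer than two letters, in order of appearance."""
    return [w for w in re.findall(r"\w+", text.lower()) if len(w) > 2]

def match_percent(old_text, new_text):
    """Percentage of the active ad's words that also occur in the old ad."""
    old_words = set(words(old_text))
    new_words = words(new_text)
    if not new_words:
        return 0.0
    matched = sum(1 for w in new_words if w in old_words)
    return 100.0 * matched / len(new_words)

def similarity(old_ad, new_ad):
    """Sum of the per-field match percentages, as in the worked example."""
    title = match_percent(old_ad["title"], new_ad["title"])
    description = match_percent(old_ad["description"], new_ad["description"])
    # Assumption: an exact category match counts as a 100 % match.
    category = 100.0 if old_ad["category"] == new_ad["category"] else 0.0
    return title + description + category
```

To apply the weights from EDIT #1, each term could be multiplied by 0.5, 0.4 and 0.1 respectively before summing.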
EDIT #3
Here's my query. It doesn't return different rows; it returns the same row many times.
SELECT
a.AdvertisementID as A_AdvertisementID,
IF(a.Topic LIKE a2.Topic, 50, 0) + IF(a.Description LIKE a2.Description, 40, 0) + IF(a.Cate_CategoryID LIKE a2.Cate_CategoryID, 10, 0) as A_Score,
a.AdvertisementID as A_AdvertisementID,
a.Topic as A_Topic,
LEFT(a.Description, 300) as A_Description,
a.Price as A_Price,
a.Type as A_Type
FROM
".DB_PREFIX."A_Advertisements a2,
".DB_PREFIX."A_Advertisements a
WHERE
a2.AdvertisementID <> a.AdvertisementID
AND
a.AdvertisementID = :a_id
ORDER BY
A_Score DESC

If you can compare the fields you are interested in literally, you could have MySQL perform a simple scoring calculation using the IF() function, for example:
select
foo.id,
if (foo.title='wantedtitle', 50, 0) +
if (foo.content='wantedcontent', 40, 0) +
if (foo.category='wantedcategory', 10, 0) as score
from foo
order by score desc
limit 10
A basic 'find a fragment' match could be achieved using LIKE:
select
foo.id,
if (foo.title like '%wantedtitlefragment%', 50, 0) +
if (foo.content like '%wantedcontentfragment%', 40, 0) +
if (foo.category like '%wantedcategoryfragment%', 10, 0) as score
from foo
order by score desc
limit 10
There are other techniques, but they might be slow to implement in MySQL. For example, you could calculate the Levenshtein distance between two strings; see this post for an example implementation.
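As a rough illustration of the edit-distance idea, here is a minimal sketch in Python rather than MySQL. It is not the implementation from the linked post; the function name `levenshtein` is my own, and it uses the standard dynamic-programming approach, keeping only two rows of the table.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A smaller distance means more similar strings, so the ten most similar titles would be those with the lowest distance to the old title.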

Related

occurrence to score

I have a table of word frequencies and would like to convert number_of_occurrence to a score between 0 and 10.
word    number_of_occurrence    score
and     200                     10
png     2                       1
where   50                      6
news    120                     7
If you want to rate term frequencies in a corpus, I suggest reading the Wikipedia article on term frequency–inverse document frequency.
There are many ways to count term frequency.
As I understand it, you want to rate it between 0 and 10.
I didn't understand how you calculated the score values in your example.
Anyway, I suggest a common method: the log function.
    import math

    # Assumes `sentence` (a string) and `stop_words` (a set of lowercase words) are already defined.
    # Count the occurrences of your terms.
    freq_table = {}
    for word in sentence.split():
        word = word.lower()
        # Stem the word here if you can, e.g. using NLTK.
        if word in stop_words:  # do you really want to count the occurrences of 'and'?
            continue
        freq_table[word] = freq_table.get(word, 0) + 1

    # Log-normalize the occurrences; dividing by the log of the largest count keeps scores within 0-10.
    max_count = max(freq_table.values())
    log_scores = {w: 10 * math.log(1 + c) / math.log(1 + max_count) for w, c in freq_table.items()}

Of course, instead of log normalization you can normalize by the maximum.

    # Ratio-normalize the occurrences by the maximum count.
    ratio_scores = {w: 10 * c / max_count for w, c in freq_table.items()}
Or, if you need a threshold effect, you can use a sigmoid function, which you can customize.
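One possible form of such a sigmoid, as a minimal sketch: the function name `sigmoid_score` and the `midpoint` and `steepness` defaults are assumptions chosen for illustration, not values from the question. The score is 5 at the midpoint, approaches 10 for much larger counts and approaches 0 for much smaller ones.

```python
import math

def sigmoid_score(count, midpoint=50.0, steepness=0.1):
    """Map an occurrence count to a score in the open interval (0, 10).

    midpoint is the count that scores exactly 5; steepness controls how
    sharp the threshold around the midpoint is. Both are tunable guesses.
    """
    return 10.0 / (1.0 + math.exp(-steepness * (count - midpoint)))

print(sigmoid_score(50))  # 5.0
```

Raising `steepness` makes the transition more abrupt; moving `midpoint` shifts the count at which words start to score highly.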
For more word processing, check out the Natural Language Toolkit (NLTK). For a good term frequency count, stemming is a good choice (removing stop words is also useful)!
The score is between 0 and 10. The maximum score is 10 for an occurrence count of 50, so anything higher than that should also have a score of 10. On the other hand, the minimum score is 0, and the score is 1 for an occurrence count of 5, so assume anything lower than that has a score of 0.
The interpolation is based only on your stated condition:
If a word appears 50 times it should be closer to 10, and if a word
appears 5 times it should be closer to 1.
df['score'] = df['number_of_occurrence'].apply(lambda x: x/5 if 5<=x<=50 else (0 if x< 5 else 10))
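The same rule can be checked without pandas. This is a small sketch that applies it to the four rows from the question; the function name `score` is my own. Note that the result differs from the scores listed in the question for "where" and "news", because this rule caps everything at or above 50 occurrences at 10.

```python
def score(occurrences):
    """Linear score: 0 below 5 occurrences, 10 above 50, x / 5 in between."""
    if occurrences < 5:
        return 0
    if occurrences > 50:
        return 10
    return occurrences / 5

counts = {"and": 200, "png": 2, "where": 50, "news": 120}
scores = {word: score(n) for word, n in counts.items()}
print(scores)  # {'and': 10, 'png': 0, 'where': 10.0, 'news': 10}
```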

Return rows matching one condition and if there aren't any then another in MySQL

I have the following table as an example:
number  type
--------------
1       1
5       2
6       1
8       2
9       3
14      2
3       1
From this table I would like to select the closest number that is less than or equal to 5 AND of type 1. If there is no such row, then (and only then) I would like to return the closest number larger than 5 of type 2.
I can solve this by running two queries:
SELECT number FROM numbers WHERE number <= 5 AND type = 1 ORDER BY number DESC LIMIT 1
and if above query returns 0 results, I simply run the second query:
SELECT number FROM numbers WHERE number > 5 AND type = 2 ORDER BY number LIMIT 1
But is it possible, to achieve the same result by only using one query?
I was thinking something like
SELECT number FROM numbers WHERE (number <= 5 AND type = 1) OR (number > 5 AND type = 2) ORDER BY number LIMIT 1
But that would only work if MySQL first checks the first parenthesized condition against all rows, returns a match if it finds one, and only otherwise checks all rows against the second parenthesized condition. It will not work if it checks each row against both conditions before moving on to the next row, which is how I suspect it works.
This query will do what you want. It selects all numbers that match your two query constraints, and orders the results first by type (so that if there is a result for type 1 it will appear first) and then by either -number or number dependent on type (so that numbers <= 5 sort in descending order but numbers > 5 sort in ascending order):
SELECT number
FROM numbers
WHERE ( number <= 5 AND type = 1 )
OR ( number > 5 AND type = 2 )
ORDER BY type, CASE WHEN type = 1 THEN -number ELSE number END
LIMIT 1
Output:
3
Demo on dbfiddle
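The query can also be checked locally against the sample table. The sketch below uses Python's built-in sqlite3 module, so it runs on SQLite rather than MySQL; the helper name `closest` is hypothetical, and the second call simply removes the type 1 rows to exercise the fallback.

```python
import sqlite3

QUERY = """
SELECT number
FROM numbers
WHERE (number <= 5 AND type = 1)
   OR (number > 5 AND type = 2)
ORDER BY type, CASE WHEN type = 1 THEN -number ELSE number END
LIMIT 1
"""

def closest(rows):
    """Load (number, type) rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE numbers (number INTEGER, type INTEGER)")
    conn.executemany("INSERT INTO numbers VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchone()
    conn.close()
    return result[0] if result else None

sample = [(1, 1), (5, 2), (6, 1), (8, 2), (9, 3), (14, 2), (3, 1)]
print(closest(sample))                                # 3
print(closest([r for r in sample if r[1] != 1]))      # 8 (fallback to type 2)
```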
Combine the two conditions, and always prefer type 1 over type 2, hence the ORDER BY and LIMIT. The ABS means that, within whichever type sorts first, the row closest to the number 5 is returned.
SELECT number, type
FROM numbers
WHERE (number <=5 AND type=1) OR
(number > 5 AND type=2)
ORDER BY type ASC, ABS(number-5) ASC
LIMIT 1

MySQL recursive subtracting and multiplying grouped values

I couldn't really explain my problem in words, but an example shows it clearly:
I have a table like this:
id  num  val  flag
0   3    10   1
1   5    12   2
2   7    12   1
3   11   15   2
And I want to go through all the rows, and calculate the increase of the "num", and multiply that difference with the "val" value. And when I calculated all of these, I want to add these results together, but grouped based on the "flag" values.
This is the mathematical equation, that I want to run on the table:
Result_1 = (3-0)*10 + (7-3)*12
Result_2 = (5-0)*12 + (11-5)*15
78 = Result_1
150 = Result_2
Thank you.
Interesting question. Unfortunately, MySQL versions before 8.0 don't support recursive queries (or window functions such as LAG), so you'll need to be a little creative here. Something like this, using user-defined variables, could work:
select flag,
    sum(calc)
from (
    select flag,
        (num-if(@prevflag=flag,@prevnum,0))*val calc,
        @prevnum:=num prevnum,
        @prevflag:=flag prevflag
    from yourtable
        join (select @prevnum := 0, @prevflag := 0) t
    order by flag, num
) t
group by flag
SQL Fiddle Demo
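To sanity-check the expected totals independently of MySQL, here is a small Python sketch of the same calculation. The function name `grouped_increase_sum` is hypothetical, and it assumes that rows within each flag should be processed in ascending `num` order, starting from a previous value of 0.

```python
from collections import defaultdict

def grouped_increase_sum(rows):
    """rows: iterable of (num, val, flag) tuples.

    For each flag, sums (num - previous num with the same flag) * val,
    treating the previous num as 0 for the first row of each flag.
    """
    previous_num = defaultdict(int)
    totals = defaultdict(int)
    for num, val, flag in sorted(rows, key=lambda r: (r[2], r[0])):
        totals[flag] += (num - previous_num[flag]) * val
        previous_num[flag] = num
    return dict(totals)

rows = [(3, 10, 1), (5, 12, 2), (7, 12, 1), (11, 15, 2)]
print(grouped_increase_sum(rows))  # {1: 78, 2: 150}
```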

MySQL query to assign values to a field based in an iterative manner

I am using a MySQL table with 500,000 records. The table contains a field (abbrevName) which stores the first two letters of another field, name.
For example AA, AB, AC and so on.
What I want to achieve is to set the value of another field (pgNo), which stores a page number, based on the value of that record's abbrevName.
So a record with an abbrevName of 'AA' might get a page number of 1, 'AB' might get a page number of 2, and so on.
The catch is that although multiple records may have the same page number (after all, multiple entities might have a name beginning with 'AA'), once the number of records with the same page number reaches 250, the page number must increment by one. So after 250 'AA' records with a page number of 1, we must assign further 'AA' records a page number of 2, and so on.
My Pseudocode looks something like this:
-Count distinct abbrevNames
-Count distinct abbrevNames with more than 250 records
-For the above abbrevNames count the sum of each divided by 250
-Output a temporary table sorted by abbrevName
-Use the total number of distinct page numbers with 250 or less records to assign page numbers incrementally
I am really struggling to put anything together in a query that comes close to this, can anyone help with my logic or some code ?
Please have a try with this one:
SELECT abbrevName, CAST(pagenumber AS signed) as pagenumber FROM (
    SELECT
    abbrevName
    , IF(@prev = abbrevName, @rows_per_abbrev:=@rows_per_abbrev + 1, @pagenr:=@pagenr + 1)
    , @prev:=abbrevName
    , IF(@rows_per_abbrev % 250 = 0, @pagenr:=@pagenr + 1, @pagenr) AS pagenumber
    , IF(@rows_per_abbrev % 250 = 0, @rows_per_abbrev := 1, @rows_per_abbrev)
    FROM
    yourTable
    , (SELECT @pagenr:=0, @prev:=NULL, @rows_per_abbrev:=0) variables_initialization
    ORDER BY abbrevName
) subquery_alias
UPDATE: I had misunderstood the question a bit. Now it should work
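The intended numbering rule may be easier to verify outside SQL first. Below is a minimal Python sketch of the rule as I read it from the question: records are sorted by abbrevName, each new abbrevName starts a new page, and a page is closed once it holds 250 records. The function name `assign_pages` and the `page_size` parameter are hypothetical, and this is not a translation of the query above.

```python
def assign_pages(abbrev_names, page_size=250):
    """Return (abbrev_name, page_number) pairs in sorted order.

    A new page starts whenever the abbreviation changes or the current
    page already holds page_size records.
    """
    result = []
    page = 0
    previous = None
    on_page = 0
    for name in sorted(abbrev_names):
        if name != previous or on_page == page_size:
            page += 1
            on_page = 0
            previous = name
        on_page += 1
        result.append((name, page))
    return result

# Small page size so the rollover is visible.
print(assign_pages(["AA", "AB", "AA", "AA"], page_size=2))
# [('AA', 1), ('AA', 1), ('AA', 2), ('AB', 3)]
```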

How can I fetch "partial matches" with mysql?

I need to find the best matches in a MySQL table given a set of attributes.
For example, given ATTRIBUTE1, ATTRIBUTE2 and ATTRIBUTE3, I want to get the results as follows:
rows with all attributes matched
rows with 2 attributes matched
rows with 1 attribute matched
So far I only know how to accomplish the first case:
SELECT * FROM Users
WHERE ATTRIBUTE1="aValue", ATTRIBUTE2="aValue", ATTRIBUTE3="aValue"
LIMIT 20
N.B. I need 2 lists. A list with fully matching rows and a list with partial matches
You can consider building a composite index on ATTRIBUTE{1..3}.
This will benefit List A:
SELECT *
FROM Users
WHERE ATTRIBUTE1="aValue" AND ATTRIBUTE2="aValue" AND ATTRIBUTE3="aValue"
LIMIT 20
and might help some rows in List B:
SELECT *,
IF (ATTRIBUTE1="aValue", 1, 0) as a1,
IF (ATTRIBUTE2="aValue", 1, 0) as a2,
IF (ATTRIBUTE3="aValue", 1, 0) as a3
FROM Users
WHERE ATTRIBUTE1="aValue" OR ATTRIBUTE2="aValue" OR ATTRIBUTE3="aValue"
ORDER BY (a1+a2+a3) DESC
LIMIT 20
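If the two lists are built in application code instead, the same match count can be used to split the rows. This is a hedged Python sketch: the function name `split_matches` and the dictionary-based row format are assumptions for illustration, not part of the question.

```python
def split_matches(rows, wanted):
    """Split rows into (full_matches, partial_matches).

    rows: list of dicts, one per table row.
    wanted: dict mapping attribute name to the value being searched for.
    Partial matches are ordered by number of matching attributes, descending;
    rows with no matching attribute are dropped.
    """
    scored = []
    for row in rows:
        matched = sum(1 for key, value in wanted.items() if row.get(key) == value)
        if matched > 0:
            scored.append((matched, row))
    # sorted() is stable, so rows with equal counts keep their input order.
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    full = [row for matched, row in scored if matched == len(wanted)]
    partial = [row for matched, row in scored if matched < len(wanted)]
    return full, partial
```

The full list corresponds to the first query above and the partial list to the second one with the fully matching rows removed.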