occurrence to score - deep-learning

I have word frequencies and I would like to convert number_of_occurrence to a number between 0 and 10.
word    number_of_occurrence    score
and     200                     10
png     2                       1
where   50                      6
news    120                     7

If you want to rate term frequencies in a corpus, I suggest you read this Wikipedia article: Term frequency–inverse document frequency.
There are many ways to count the term frequency.
I understood you want to rate it between 0 and 10.
I didn't get how you calculated your example score values.
Anyway, I suggest a usual method: the log function.
# count the occurrences of your terms
import math
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stopWords = set(stopwords.words('english'))
freq_table = {}
words = word_tokenize(sentence)  # 'sentence' is your input text
for word in words:
    word = word.lower()
    # stem the word if you can, using nltk
    if word in stopWords:  # do you really want to count the occurrences of 'and'?
        continue
    if word in freq_table:
        freq_table[word] += 1
    else:
        freq_table[word] = 1

# log normalize the occurrences (write the scores back into the table)
for word, count in freq_table.items():
    freq_table[word] = 10 * math.log(1 + count)
Of course, instead of log normalization you can normalize by the maximum.
# ratio max normalize the occurrences (an alternative to the log normalization)
max_count = max(freq_table.values())
for word, count in freq_table.items():
    freq_table[word] = 10 * count / max_count
Or, if you need a threshold effect, you can use a sigmoid function that you can customize:
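A minimal sketch of such a sigmoid-based score (the midpoint and steepness values below are arbitrary assumptions you would tune to your data):
import math

def sigmoid_score(count, midpoint=25, steepness=0.2):
    # counts well below 'midpoint' approach 0, counts well above it approach 10
    return 10 / (1 + math.exp(-steepness * (count - midpoint)))

# an alternative to the normalizations above, applied to the raw counts
for word, count in freq_table.items():
    freq_table[word] = sigmoid_score(count)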
For more word processing check the Natural Language Toolkit. For a good term frequency count, stemming is a good choice (stopwords are also useful)!
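As a rough sketch of the stemming idea with NLTK's PorterStemmer (just one possible stemmer choice):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# the stems, not the raw words, would then be the frequency-table keys,
# so inflected variants of a word collapse into a single count
print(stemmer.stem("running"), stemmer.stem("runs"), stemmer.stem("run"))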

The score is between 0 and 10. The maximum score of 10 corresponds to an occurrence count of 50, therefore anything higher than that should also have a score of 10. On the other hand, the minimum score is 0: the score is 1 for an occurrence count of 5, so assume anything lower than that has a score of 0.
Interpolation is based on your given condition only:
If a word appears 50 times it should be closer to 10, and if a word appears 5 times it should be closer to 1.
df['score'] = df['number_of_occurrence'].apply(lambda x: x/5 if 5 <= x <= 50 else (0 if x < 5 else 10))
Output:
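A minimal reproduction sketch, assuming df holds the four rows from the question's table:
import pandas as pd

df = pd.DataFrame({'word': ['and', 'png', 'where', 'news'],
                   'number_of_occurrence': [200, 2, 50, 120]})
df['score'] = df['number_of_occurrence'].apply(
    lambda x: x / 5 if 5 <= x <= 50 else (0 if x < 5 else 10))
print(df)
# 'and' (200) and 'news' (120) are capped at 10, 'where' (50) gets 50/5 = 10.0,
# and 'png' (2) falls below the lower threshold, so it gets 0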


Min and Max, but there are also letters

I'll keep this as simple as I can. I'm new to MySQL and I have values in my column such as "10", "5", "3" and "n/v". The code is
SELECT MIN(InsertColumnHere), MAX(InsertColumnHere)
FROM diplomaeval
and the result is:
MIN = 10 and MAX = N/V
I need them to be "switched" so that each goes to its respective place, as N/V is the lowest and 10 is the highest, not the other way around.

How can I make my select statement deterministically match only 1/n of my dataset?

I'm processing data from a MySQL table where each row has a UUID associated with it. EDIT: the "UUID" is in fact an MD5 hash (VARCHAR) of the job text.
My select query looks something like:
SELECT * FROM jobs ORDER BY priority DESC LIMIT 1
I am only running one worker node right now, but would like to scale it out to several nodes without altering my schema.
The issue is that the jobs take some time, and scaling out beyond one right now would introduce a race condition where several nodes are working on the same job before it completes and the row is updated.
Is there an elegant way to effectively "shard" the data on the client-side, by specifying some modifier config value per worker node? My first thought was to use the MOD function like this:
SELECT * FROM jobs WHERE UUID MOD 2 = 0 ORDER BY priority DESC LIMIT 1
and SELECT * FROM jobs WHERE UUID MOD 2 = 1 ORDER BY priority DESC LIMIT 1
In this case I would have two workers configured as "0" and "1". But this isn't giving me an even distribution (not sure why) and feels clunky. Is there a better way?
The problem is you're storing the ID as a hex string like acbd18db4cc2f85cedef654fccc4a4d8. MySQL will not convert the hex for you. Instead, if it starts with a letter you get 0; if it starts with a digit, you get the leading digits.
select '123abc' + 0  -- 123
select 'abc123' + 0  -- 0
6 of the 16 hex digits are letters, so 6/16 of the strings will start with a letter and convert to 0, and 0 mod anything is 0. The remaining 10 of 16 start with a digit and convert to some number, so they will be distributed properly: 5 of 16 will be 0 and 5 of 16 will be 1. 6/16 + 5/16 = 11/16 ≈ 69% will be 0, which is very close to your observed 72%.
To do this right we need to convert the 128-bit hex string into a 64-bit unsigned integer:
1) Slice off 64 bits with either left(uuid, 16) or right(uuid, 16).
2) Convert the hex (base 16) into decimal (base 10) using conv.
3) Cast the result to an unsigned bigint. If we skip this step MySQL appears to use a float, which loses accuracy.
select cast(conv(right(uuid, 16), 16, 10) as unsigned) mod 2
Beautiful.
That will only use 64 bits of the 128 bit checksum, but for this purpose that should be fine.
Note this technique works with an MD5 checksum because it is pseudorandom. It will not work with the default MySQL uuid() function which is a UUID version 1. UUIDv1 is a timestamp + a fixed ID and will always mod the same.
UUIDv4, which is a random number, will work.
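As a sketch of how a worker node might plug this into the original query, with its slot coming from per-worker configuration (WORKER_ID and NUM_WORKERS are hypothetical config values, and cursor comes from whichever MySQL driver you already use):
import os

# each worker node is configured with its own slot, e.g. WORKER_ID=0 or WORKER_ID=1
WORKER_ID = int(os.environ.get("WORKER_ID", "0"))
NUM_WORKERS = int(os.environ.get("NUM_WORKERS", "2"))

query = (
    "SELECT * FROM jobs "
    "WHERE CAST(CONV(RIGHT(uuid, 16), 16, 10) AS UNSIGNED) MOD %s = %s "
    "ORDER BY priority DESC LIMIT 1"
)
cursor.execute(query, (NUM_WORKERS, WORKER_ID))
job = cursor.fetchone()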
Convert the hex string to decimal before modding:
where CONV(substring(uuid, 1, 8), 16, 10) mod 2 = 1
A reasonable hashing function should distribute evenly enough for this purpose.
Use substring to convert only a small part so that conv doesn't overflow the decimal range and possibly behave badly. Any subset of the bits should also be well distributed.

MySQL: Optimized query to find matching strings from set of strings

I have 10 sets of strings, each set having 9 strings. Of these 10 sets, all strings in the first set have length 10, those in the second set have length 9, and so on. Finally, all strings in the 10th set have length 1.
There is a common prefix of (length - 2) characters in each set, and the prefix length reduces by 1 in the next set. Thus, the first set has 8 characters in common, the second has 7, and so on.
Here is what a sample of the 10 sets looks like:
pu3q0k0vwn
pu3q0k0vwp
pu3q0k0vwr
pu3q0k0vwq
pu3q0k0vwm
pu3q0k0vwj
pu3q0k0vtv
pu3q0k0vty
pu3q0k0vtz
pu3q0k0vw
pu3q0k0vy
pu3q0k0vz
pu3q0k0vx
pu3q0k0vr
pu3q0k0vq
pu3q0k0vm
pu3q0k0vt
pu3q0k0vv
pu3q0k0v
pu3q0k0y
pu3q0k1n
pu3q0k1j
pu3q0k1h
pu3q0k0u
pu3q0k0s
pu3q0k0t
pu3q0k0w
pu3q0k0
pu3q0k2
pu3q0k3
pu3q0k1
pu3q07c
pu3q07b
pu3q05z
pu3q0hp
pu3q0hr
pu3q0k
pu3q0m
pu3q0t
pu3q0s
pu3q0e
pu3q07
pu3q05
pu3q0h
pu3q0j
pu3q0
pu3q2
pu3q3
pu3q1
pu3mc
pu3mb
pu3jz
pu3np
pu3nr
pu3q
pu3r
pu3x
pu3w
pu3t
pu3m
pu3j
pu3n
pu3p
pu3
pu9
pud
pu6
pu4
pu1
pu0
pu2
pu8
pu
pv
0j
0h
05
pg
pe
ps
pt
p
r
2
0
b
z
y
n
q
Requirement:
I have a table PROFILES having columns SRNO (type bigint, primary key) and UNIQUESTRING (type char(10), unique key). I want to find 450 SRNOs for matching UNIQUESTRINGs from those 10 sets.
First find strings like those in the first set. If we don't get enough results (i.e. 450), find strings like those in the second set. If we still don't get enough results (450 minus the results of the first set), find strings like those in the third set. And so on.
Existing Solution:
I've written a query something like:
select srno from profiles
where ( (uniquestring like 'pu3q0k0vwn%')
or (uniquestring like 'pu3q0k0vwp%') -- all those above uniquestrings after this and finally the last one
or (uniquestring like 'n%')
or (uniquestring like 'q%')
)
limit 450
However, after getting feedback from Rick James in this answer, I realized this is not an optimized query, as it touches many more rows than it needs to.
So I plan to rewrite the query like this:
(select srno from profiles where uniquestring like 'pu3q0k0vwn%' LIMIT 450)
UNION DISTINCT
(select srno from profiles where uniquestring like 'pu3q0k0vwp%' LIMIT 450); -- and more such clauses after this for each uniquestring
I'd like to know if there are any better solutions for this.
SELECT ...
WHERE str LIKE 'pu3q0k0vw%' AND -- the 10-char set
str REGEXP '^pu3q0k0vw[nprqmj]' -- the 9 next letters
LIMIT ...
# then check for 450; if not enough, continue...
SELECT ...
WHERE str LIKE 'pu3q0k0vt%' AND -- the 10-char set
str REGEXP '^pu3q0k0vt[vyz]' -- the 9 next letters
LIMIT 450
# then check for 450; if not enough, continue...
etc.
SELECT ...
WHERE str LIKE 'pu3q0k0v%' AND -- the 9-char set
str REGEXP '^pu3q0k0v[wyzxrqmtv]' -- the 9 next letters
LIMIT ...
# check, etc; for a total of 10 SELECTs or 450 rows, whichever comes first.
This will be 10+ selects. Each select will be somewhat optimized by first picking rows with a common prefix with LIKE, then double-checking with a REGEXP.
(If you don't like splitting the inconsistent pu3q0k0vw vs. pu3q0k0vt, we can discuss things further.)
You say "prefix"; I have coded the LIKE and REGEXP to assume arbitrary text after the prefix given.
UNION is not viable, since it will (I think) gather all the rows before picking 450. Each SELECT will stop at the LIMIT if there is no DISTINCT, GROUP BY, or ORDER BY that requires gathering everything first.
REGEXP is not smart enough to avoid scanning the entire table; adding the LIKE avoids that (except when more than, say, 20% of the rows match the LIKE).
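The "check for 450; if not enough, continue" step has to live in application code; a rough driver-loop sketch in Python (the prefix/letter pairs below come from the queries above, and cursor is whatever MySQL driver you use):
# one (LIKE prefix, allowed next letters) pair per set, longest prefix first
prefix_sets = [
    ('pu3q0k0vw', 'nprqmj'),
    ('pu3q0k0vt', 'vyz'),
    ('pu3q0k0v', 'wyzxrqmtv'),
    # ... and so on, down to the single-character set
]

srnos = []
for prefix, letters in prefix_sets:
    remaining = 450 - len(srnos)
    if remaining <= 0:
        break
    cursor.execute(
        "SELECT srno FROM profiles"
        " WHERE uniquestring LIKE %s AND uniquestring REGEXP %s"
        " LIMIT %s",
        (prefix + '%', '^' + prefix + '[' + letters + ']', remaining))
    srnos.extend(row[0] for row in cursor.fetchall())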

Calculate similarity between two records in MySQL

I am currently trying to figure out how to calculate the similarity between two records. My first record would be from a deactivated advertisement, so I want to find, e.g., the 10 most similar advertisements with regard to the equality of some VARCHAR fields.
The thing I can't figure out is whether there is any MySQL function that can help me with this, or whether I need to compare the strings in some other way.
EDIT #1
Similarity would be defined by these fields:
Title (weight: 50 %)
Content (weight: 40 %)
Category (weight: 10 %)
EDIT #2
I want the calculation to be like this:
Title: Words that match in the title field (only words >2 letters are matched).
Description: Words that match in the description field (only words >2 letters are matched).
Category: Match the category, and if that doesn't match, match the parent category with less weight :)
An equation of this could be:
#1 is the old, inactive post, #2 is the active post:
#2 title matches #1 title in 3 words out of #2's total of 10 words.
That gives 30 % match = 30 points.
#2 description matches #1 description in 10 words out of #2's total
of 400 words. That gives a 2.5 % match = 2.5 points.
#2 category doesn't match #1's category, therefore 0 % match. That
gives 0 points.
Then the sum would be 32.5 points for #2. :)
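The word-matching part of that calculation is easier to express in application code than in pure SQL. A rough Python sketch of the idea (the field names and the category point values are assumptions, not part of the question):
def word_set(text):
    # only words longer than 2 letters are matched, as described above
    return {w for w in text.lower().split() if len(w) > 2}

def match_percent(old_text, new_text):
    # percentage of the new ad's words that also appear in the old ad's text
    new_words = word_set(new_text)
    if not new_words:
        return 0.0
    return 100.0 * len(word_set(old_text) & new_words) / len(new_words)

def score(old_ad, new_ad):
    points = match_percent(old_ad['title'], new_ad['title'])
    points += match_percent(old_ad['description'], new_ad['description'])
    if old_ad['category'] == new_ad['category']:
        points += 10      # assumed value for a full category match
    elif old_ad['parent_category'] == new_ad['parent_category']:
        points += 5       # assumed lower value for a parent-category match
    return points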
Edit #3
Here's my query, but it doesn't return different rows; it returns many copies of the same row.
SELECT
a.AdvertisementID as A_AdvertisementID,
IF(a.Topic LIKE a2.Topic, 50, 0) + IF(a.Description LIKE a2.Description, 40, 0) + IF(a.Cate_CategoryID LIKE a2.Cate_CategoryID, 10, 0) as A_Score,
a.AdvertisementID as A_AdvertisementID,
a.Topic as A_Topic,
LEFT(a.Description, 300) as A_Description,
a.Price as A_Price,
a.Type as A_Type
FROM
".DB_PREFIX."A_Advertisements a2,
".DB_PREFIX."A_Advertisements a
WHERE
a2.AdvertisementID <> a.AdvertisementID
AND
a.AdvertisementID = :a_id
ORDER BY
A_Score DESC
If you can literally compare the fields you are interested in, you could have MySQL perform a simple scoring calculation using the IF() function, for example
select
foo.id,
if (foo.title='wantedtitle', 50, 0) +
if (foo.content='wantedcontent', 40, 0) +
if (foo.category='wantedcategory', 10, 0) as score
from foo
order by score desc
limit 10
A basic 'find a fragment' could be achieved using like
select
foo.id,
if (foo.title like '%wantedtitlefragment%', 50, 0) +
if (foo.content like '%wantedcontentfragment%', 40, 0) +
if (foo.category like '%wantedcategoryfragment%', 10, 0) as score
from foo
order by score desc
limit 10
There are other techniques, but they might be slow to implement in MySQL. For example, you could calculate the Levenshtein distance between two strings - see this post for an example implementation.
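For completeness, a small Python sketch of the Levenshtein distance itself (the post linked above gives an example implementation for MySQL):
def levenshtein(a, b):
    # classic dynamic-programming edit distance, kept to one row at a time
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("advertisement", "advertisements"))  # 1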

Select random row from MySQL (with probability)

I have a MySQL table with a column called cur_odds, which holds the probability that that row will get selected. How do I make a query that will actually select the rows at approximately that frequency when you run, for example, 100 queries?
I tried the following, but a row that has a probability of 0.35 ends up getting selected around 60-70% of the time.
SELECT * FROM table ORDER BY RAND()*cur_odds DESC
All the values of cur_odds in the table add up to 1 exactly.
If cur_odds is changed rarely you could implement the following algorithm:
1) Create another column prob_sum, for which
prob_sum[0] := cur_odds[0]
for 1 <= i <= row_count - 1:
prob_sum[i] := prob_sum[i - 1] + cur_odds[i]
2) Generate a random number from 0 to 1:
rnd := rand(0,1)
3) Find the first row for which prob_sum > rnd (if you create a BTREE index on the prob_sum, the query should work much faster):
CREATE INDEX prob_sum_ind ON <table> (prob_sum);
SET @rnd := RAND();
SELECT MIN(prob_sum) FROM <table> WHERE prob_sum > @rnd;
Given your above SQL statement, whatever numbers you have in cur_odds are not the probabilities that each row is selected, but instead just an arbitrary weighting (relative to the "weights" of all the other rows) which is best interpreted as a relative tendency to float towards the top of the sorted table. The actual value in each row is meaningless (e.g. you could have 4 rows with values of 0.35, 0.5, 0.75 and 0.99, or you could have values of 35, 50, 75 and 99, and the results would be the same).
Update: Here's what's going on with your query. You have one row with a cur_odds value of 0.35. For the sake of illustration, I'm going to assume that the other 9 rows all have the same value (0.072). Also for the sake of illustration, let's assume RAND() returns a value from 0.0 to 1.0 (which it essentially does).
Every time you run this SELECT statement, each row is assigned a sorting value by multiplying its cur_odds value by a RAND() value from 0.0 to 1.0. This means that the row with a 0.35 will have a sorting value between 0.0 and 0.35.
Every other row (with a value of 0.072) will have sorting values ranging between 0.0 and 0.072. This means that there is an approximately 80% chance that your one row will have a sorting value greater than 0.072, which would mean that there is no possible chance that any other row could be sorted higher. This is why your row with the cur_odds value of 0.35 is coming up first more often than you expect.
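A quick simulation of that illustration (one row weighted 0.35, nine rows weighted 0.072, picking whichever row gets the largest RAND()*cur_odds) shows the heavy row winning far more often than the intended 35 %:
import random

weights = [0.35] + [0.072] * 9
trials = 100_000
wins = 0
for _ in range(trials):
    keys = [random.random() * w for w in weights]
    if keys.index(max(keys)) == 0:   # did the 0.35 row sort to the top?
        wins += 1
print(wins / trials)  # comes out around 0.8, not 0.35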
I incorrectly described the cur_odds value as a relative chance weighting. It actually functions as a maximum relative weighting, and working out the actual relative probabilities involved would take some more complex math.
I'm not sure what you need can be done with straight SQL. I've implemented a weighted probability picker many times (ironically, I was even going to ask a question about best methods for this just this morning), but always in code.
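For reference, a sketch of such a weighted picker in application code, reusing the cumulative-sum idea from the algorithm above (the row data shown here is made up):
import bisect
import random

def weighted_pick(rows, odds):
    # running totals play the role of prob_sum; pick the first one greater than rnd
    prob_sum, total = [], 0.0
    for p in odds:
        total += p
        prob_sum.append(total)
    rnd = random.random() * total  # total is 1.0 if the odds already sum to 1
    return rows[bisect.bisect_right(prob_sum, rnd)]

rows = ['ad 1', 'ad 2', 'ad 3']
print(weighted_pick(rows, [0.35, 0.60, 0.05]))  # 'ad 2' roughly 60 % of the time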