Is there a way to reduce the run time for this code to remove partial duplicates? - duplicates

So this is the code for removing partial duplicates within the same column of a data, however, i guess because of the process of matching each row with other rows, the code takes a lot of time to run on a dataset of even 2000 rows. Is there any way to reduce the run time?
Here's the code-
from fuzzywuzzy import fuzz,process
rows = ["I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients.","This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?","I have your Body Wash and I wonder if it contains animal ingredients. I prefer not to use product with animal ingredients.","I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?","Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging","Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the box."]
clean = []
threshold = 80 # this is arbitrary
for row in rows:
# score each sentence against each other sentence
# [('string', score),..]
scores = process.extract(row, rows, scorer=fuzz.token_set_ratio)
# basic idea is if there is a close second match we want to evaluate
# and keep the longer of the two
if scores[1][1] > threshold:
clean.append(max([x[0] for x in scores[:2]],key=len))
else:
clean.append(scores[0][0])
# remove dupes
clean = set(clean)

Related

Randomly select an array until a specific parameter is hit

I am trying to put together an interactive meal tracker / planner in Google Sheets.
What I have is a table with food and info about kcal and their macro nutrition values.
I have already put together a logic for how much kcal does somebody requires and this gives a kcal and macro nutrition for each meal per day, e.g.
Breakfast: 518 kcal, carbohydrates 207, protein 62, fat 19
Now I want to randomly put together foods/meals from the table mentioned above until those numbers are hit automatically.
There a ways to randomly select things from an array, but I'm stuck how to combine this with a loop and a deadline.
Any idea?
Many thanks
I do not see any update so I will just go with my assumption on what you are stuck on and say that it's building the array and making sure you stop once you reach the kCal limit.
Basically you will be working with 3 variables, foodTable as your food array, mealTable which is the meal table you will generate and some kind of kCal tracker, let's just call it kCal.
In essence you want to use a do .. while loop and your while statement is basically kCal < target where target is the value you want to reach. If you want to use each value from the foodTable only once then you will need to use selectedFood = foodTable.splice(idx,1), if you want to be able to re-use the sime item then use slice() instead.
Your idx is a randomly generated number between 0 and mealTable.length (in case you are using splice to remove an item from the possible foods list this will ensure you do not generate an invalid number)
Then add the kCal value to your tracker with kCal += selectedFood[0][col] where col is the column number where you store the kCal value. Make sure that every time before you start the while loop you have kCal = 0 to reset, if you repeat this. And keep in mind that this method does not care how much over the target kCal you will go and you might want to add some kind of if statement to make sure you do not run out of items before you reach your target.

Is it possible to use topic modeling for a single document

Is it rational to use topic modelling for a single document or to be more precise is it mathematically okay to use LDA-gibbs method for a single document.If so what should be value of k and seed.
Also what is be the role of k and seed for single as well as large set of documents.
K and SEED are variable of the function LDA (in r studio).
Also let me know if I am wrong anywhere in this question.
To tell about my project ,I am trying to find out the main topics which can be used to represent the content of a single document.
I have already tried using k=4,7,10.Part of my question also is what value of k should be better.
It really depends on the document. A document could be a 700 page book or a single sentence. Your k is also going to be dependent on the document I think you mean the number of topics? If your document is the entire Wikipedia corpus 1500 topics might be appropriate if your document is a list of comments about movies then 20 topics might be appropriate. Optimizing that number can be done using the elbow method check out 17.
Seed can be pretty random it's just a leaver so your results can be replicated - it runs if you leave it blank. I would say try it and check your coherence, eyeball your topics and if it looks right then sure you can train an LDA on one document. A single document should process pretty fast.
Here is an example in python of using seed parameters. My data set is 1,048,575 rows note the seed is much higher:
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus,
num_topics=20, alpha =.1, id2word=dictionary, iterations = 1000,
random_seed = 569356958)

How to build deep learning model that picks words from serval distinct bags and forms a meaningful sentence [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
Image of Bags and how to choose from them
Imagine I have 10 bags,Ordered one after other.ie Bag 1 , Bag 2 ......... Bag n.
Each bag has distinct set of words.
In order to understand what a bag is,
Consider we have a vocabulary of 10,000 words.
The first bag contains words Hello , India , Manager.
ie Bag 1 will have 1's at the words index present in the bag.
ex:Bag 1 will be of size 10000*1
if Hello's index was 1 India's index was 2 and Manager's was 4
It will be
[0 , 1, 1, 0 , 1 ,0,0,0,0.........]
*I dont have a model yet.
*I'm thinking to use story books,But its still kind of abstract for me.
A word has to chosen from each bag and assigned a number word 1(word from bag 1)
word 2(word from bag 2) and they must form a MEANINGFULL sentence in their numerical order.!
First, we need a way that the computer can recognise a word otherwise it cannot pick the correct one. That means at this stage, we need to decide what we are teaching the computer to begin with (ie what is a verb, noun, grammar) but I will assume we will dump a dictionary into it and give no information except the words themselves.
So that the computer can compute what sentences are, we need to convert them to numbers (one way would be to work alphabetically starting at 1, using them as keys for a dictionary (digital this time(!)) and the word as the value). Now we can apply the same linear algebra techniques to this problem as any other problem.
So we need to make generations of matrices of weights to multiply into the keys of the dictionary, then remove all the weights beyond the range of dictionary keys, the rest can be used to get the value in the dictionary and make a sentence. Optionally, you can also use a threshold value to take off of all the outputs of the matrix multiplication
Now for the hard part: learning. Once you have a few (say 100) matrices, we need to "breed" the best ones (this is where human intervention is needed) and you need to pick the 50 most meaningful sentences (might be hard at first) and use them to base your next 100 of (easiest way would be to weight the 50 matrices randomly for a weighted mean 100 times).
And the boring bit, keep running the generations over and over until you get to a point where your sentences are meaningful most of the time (of course there is no guarantee that it will always be meaningful but that's the nature of ANN's)
If you find it doesn't work, you can use more layers (more matrices) and/or I recently heard of a different technique that dynamically changed the network but I can't really help with that.
Have a database with thousands/millions of valid sentences.
Create a dictionary where each word represents a number (reserve 0 for "nothing", 1 for "start of sentence" and 2 for "end of sentence").
word_dic = { "_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
reverse_dic = {v:k for k,v in word_dic.items()}
Remember to add "_start_" and "_end_" at the beginning and end of all sentences in the database, and "_nothing_" after the end to complete the desired length capable of containing all sentences. (Ideally, work with sentences with 10 or less words, so your model wont't try to create bigger sentences).
Transform all your sentences into sequences of indices:
#supposing you have an array of shape (sentences, length) as string:
indices = []
for word in database.reshape((-1,)):
indices.append(word_dic[word])
indices = np.array(indices).reshape((sentences,length))
Transform this into categorical words with the keras function to_categorical()
cat_sentences = to_categorical(indices) #shape (sentences,length,dictionary_size)
Hint: keras has lots of useful text preprocessing functions here.
Separate training input and output data:
#input is the sentences except for the last word
x_train = cat_sentences[:,:-1,:]
y_train = cat_sentences[:,1:,:]
Let's create an LSTM based model that will predict the next words from the previous words:
model = Sequential()
model.add(LSTM(dontKnow,return_sequences=True,input_shape=(None,dictionary_size)))
model.add(.....)
model.add(LSTM(dictionary_size,return_sequences=True,activation='sigmoid'))
#or a Dense(dictionary_size,activation='sigmoid')
Compile and fit this model with x_train and y_train:
model.compile(....)
model.fit(x_train,y_train,....)
Create an identical model using stateful=True in all LSTM layers:
newModel = ......
Transfer the weights from the trained model:
newModel.set_weights(model.get_weights())
Create your bags in a categorical way, shape (10, dictionary_size).
Use the model to predict one word from the _start_ word.
#reset the states of the stateful model before you start a 10 word prediction:
newModel.reset_states()
firstWord = newModel.predict(startWord) #startword is shaped as (1,1,dictionary_size)
The firstWord will be a vector with size dictionary_size telling (sort of) the probabilities of each existing word. Compare to the words in the bag. You can choose the highest probability, or use some random selecting if the probabilities of other words in the bag are also good.
#example taking the most probable word:
firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
Do the same again, but now input firstWord in the model:
secondWord = newModel.predict(firstWord) #respect the shapes
Repeat the process until you get a sentence. Notice that you may find _end_ before the 10 words in the bag are satisfied. You may decide to finish the process with a shorter sentence then, especially if other word probabilities are low.

MYSQL masking data from update very slow on large DB

I have a DEV DB with 16 million(ish) records. I need to 'mask' columns of personal data (name, address, phone, etc.). I found a nice function that will do the data masking wonderfully Howto generate meaningful test data using a MySQL function.
The problem is, when I call the function, it is only processing about 30 records per second.
This is way to slow.
Is there anyway to speed this up. Maybe create a temp table or something.
Here is the UPDATE statement that calls the function.
UPDATE table1
SET first_name = (str_random('Cc{3}c(4)')),
last_name = (str_random('Cc{5}c(6)')),
email = (str_random('c{3}c(5)[.|_]c{8}c(8)#[google|yahoo|live|mail]".com"')),
address1 = (str_random('d{3}d{1} Cc{5} [Street|Lane|Road|Park]')),
city = (str_random('Cc{5}c(6)')),
state = (str_random('C{2}')),
zip = (str_random('d{5}-d{4}'))
Thanks!!
Instead of calling a random function 7*16m times, it would probably be faster if you operated on procedurally generated text.
I checked out the str_random function you linked to. (That's very clever btw - cool stuff)
It calls RAND() once for each random character in the string and once each time you say "choose from list". That's a lot of rands.
I think one way to improve it would be to create and cache (in a table) a large set of random characters and instead of calling rand (say) 5 times for 5 random characters, call it once to determine an offset into the big string of random crap, then just increment the index it uses to pull from the string... (if it needs a bunch in a row - it can just pull them all at once in a row and multi-increment the offset)
The str_random_character function that the parent function calls could be replaced by something that does this instead of calling rand into an array.
It's a bit beyond me for a throwaway piece of code, but it might put you (or a better mysql guru) on a path for speeding this puppy up (maybe).
A different option would be rather than random-masking all the data... can you transform the data in some way? Since you don't need the original back, you could do something like a caesar cipher on each character in their data based on a (single) rand call for the rotation count. (If you rotate the uppers, lowers, and digits in each string separately, the data will stay looking "normal" despite not being easily reversible because of the randomized rotation) -- I wouldn't slap a SECURE sticker on it but it would be a lot quicker and not easy to reverse.
I think I have a Caesar rotator that does that somewhere if it suffices.

Reconstructing state from time series data events

For a particular project, we acquire data for a number of events and collect variables about those events at the same time. After the data has been collected, we perform a user-customizable analysis on said data to determine whatever it is that the user is interested in.
The data is collected in a form similar to this:
Timestamp Event
0 x = 0
0 y = 1
3 Event A occurred
3 x = 1
4 Event A occurred
4 x = 2
9 Event B occurred
9 y = 2
9 x = 0
To understand the entire state at any time, the most straightforward approach is to walk over the entire set of data. For example, if I start at time 0, and "analyze" until timestamp 5, I know that at that point x = 2, y = 1, and Event A has occurred twice. That's a really simple example. The user might be (and often is) interested in the time between events, say from A to B, and they might specify the first occurrence of A, then B, or the last occurrence of A, then B (respectively, 9-3 = 6 or 9-4 = 5). Like I said, this is easy to analyze when you're walking over the entire set.
Now, we need to adapt the model to analyze an arbitrary window of time. If we look at 0-N, that's the easy case. But if I look at 1-5, for instance, I have no notion of y unless I begin at 0 and know that y was initially 1 and did not change in the window 1-5.
Our approach is to essentially create a dictionary of variables, and run callbacks on events. If one analysis was "What is x when Event A occurs and time is > 3" then we would run that callback on the first Event A, and it would immediately return because time is not greater than 3. It would run again at 4, and and it would report that x was 1 at t=4.
To adapt to the "time-windowing", I think I am going to (in the background) tack on additional conditions to the analysis. If their analysis is just "What is x when Event A occurs", and the current window is 1-5, then I will change it to "What is x when Event A occurs and time >= 1 and time <= 5". Then if the next window is 6-10, I can readjust the condition as necessary.
My main question is: what pattern does this fit? We are obviously not the first people to approach a problem like this, but I have not been able to find how others have approached it. I probably just don't know what exactly to search on Google. Is there any other approach besides keeping a dictionary of the entire global state for looking up a single state at a given time? Note also that the data could have several, maybe tens of thousands of records, so the fewer iterations over the data set, the better.
I think your best approach here would be to take periodic "snapshots" of the full state data, say every 1000 samples (for example), along with recording the deltas. When you're storing your data as offsets from some original value (aka deltas), you don't have any choice but to reconstruct the full data starting with the original values. Storing periodic snapshots will lessen the amount of reconstruction you have to do - the design tradeoff is between low storage requirements but long reconstruction time on the one hand, and higher storage requirements but shorter reconstruction time on the other.
MPEGs, for example, store each frame as the differences between the current frame and the previous frame. Ordinarily, this would force an MPEG to be viewed from the beginning, but the format also periodically stores full frames so that the decoder doesn't have to backtrack all the way to the beginning of the file.
You can search by time in Log(N), and you can have a feeling about how many updates ares acceptable... hence here's my solution:
Pick a number, N, of updates that are acceptable in order to return a result. 256 might be good, given the scales you've mentioned so far.
Every N records, commit an entry of all state to a dictionary, with a timestamp.
Now, you have a tradeoff, dictionary size against speed. N->\infty is regular searching. N<-1 is your current solution, N anywhere else will require less memory, but be slower.
Your implementation is now (for time X):
Log(n) search of subsampled global dictionary to timestamp before X, (timestamped as Y).
Log(n) search of eventlist to timestamp Y, and perform less than N updates.
Picking N as a power of two will even allow you to do some nice shift tricks to do a rounded-down integer divide nice and fast.