The intersection of a wordlist and top features selected based on weights - RapidMiner

In the training process for a text classification case, the wordlist generated from the Process Documents module has a length of about 15000 words. On the other side, I applied feature selection modules, i.e., Weight by Information Gain and Select by Weights, to select the top 500 features. Both the wordlist and the selected weights are stored. Is there any way to apply these 500 generated weights to the wordlist and construct a short wordlist that exactly matches the 500 weights? In other words, I would like to have the intersection of the original wordlist (about 15000 words) and the top 500 features (or the top 500 words based on the weights).
The following shows the process I am using. The stored weights (circled in red) form two columns, where the first column is the word (attribute) and the second column is the corresponding weight value; based on these, we can select the top 500 or any other number of top features. The original wordlist (also circled in red) can have 15000 words, i.e., a matrix with 15000 rows.
My question is how to generate a filtered wordlist object based on the ranked weights object.
I have posted this question on the RapidMiner forum. Please follow the updates there.

You should post a representative process. In the absence of that it's difficult to give help, but my view is that you could take the 500-word example set and process it again to make a word list from it.
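Outside RapidMiner, the operation being asked for is just filtering the wordlist by the top-weighted attributes. A minimal Python sketch of the idea, assuming both stored objects have been exported to CSV and using hypothetical column names word and weight:

import pandas as pd

# Hypothetical exports of the two stored objects
wordlist = pd.read_csv("wordlist.csv")   # ~15000 rows, one word (attribute) per row
weights = pd.read_csv("weights.csv")     # two columns: word, weight

top500 = weights.nlargest(500, "weight")["word"]           # top 500 attributes by weight
short_wordlist = wordlist[wordlist["word"].isin(top500)]   # intersection with the wordlist
short_wordlist.to_csv("short_wordlist.csv", index=False)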

Related

Pattern in residuals/outliers

I made a regression that produced an R-squared of 0.89. However, there is something wrong with the model according to the residuals plot: there is a weird pattern on the left-hand side of the graph. I managed to get rid of this problem by applying Cook's distance, so these data points were interpreted as outliers in the data. However, I'm not sure whether I should be more worried about this pattern. The data I am working with describe tree growth and the data set is very large (my sample contains 270,000 data points).
So the question is: should I be worried, or can I just rely on Cook's distance, which interprets those data points as outliers?
Results after GLM:
Results after applying Cook's distance:
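For reference, this is roughly how a Cook's-distance filter can be applied with statsmodels; a sketch on placeholder data, not the original tree-growth model:

import numpy as np
import statsmodels.api as sm

# Placeholder data standing in for the tree-growth set
rng = np.random.default_rng(0)
X = sm.add_constant(rng.random((1000, 3)))
y = X @ np.array([1.0, 2.0, -0.5, 0.3]) + rng.standard_normal(1000)

model = sm.OLS(y, X).fit()
cooks_d = model.get_influence().cooks_distance[0]   # one distance per observation

threshold = 4 / len(y)          # a common rule-of-thumb cutoff, not the only choice
kept = cooks_d < threshold
refit = sm.OLS(y[kept], X[kept]).fit()
print(model.rsquared, refit.rsquared)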

perMANOVA for small sample size

I have data from 6 groups with sample sizes of n = 2, 10, 2, 9, 3, 1, and I want to perform permutational multivariate analysis of variance (PERMANOVA) on these data.
My question is: is it correct to run PERMANOVA on these data with such small sample sizes? The results look strange to me because the group with n = 1 showed no significant difference from the other groups, although the graphical representation of the groups clearly shows a difference.
Thank you
I would not trust any result with a group of n = 1, because there is no source of variation to define differences among groups.
I have also received some answers from other platforms. I put them here for information:
The sample size is simply too small to yield a stable solution via MANOVA. Note that the n = 1 cell contributes a constant value for that cell's mean, no matter what you do by way of permutations.
Finally, note that the effective per-cell sample size with unequal cell n for one-way designs tracks well with the harmonic mean of n. For your data set as it stands, that means an "effective" per-cell n of about 2.4. Unless differences on the DV set are gigantic, no procedure (parametric or exact/permutation) will have the statistical power to detect differences with that sample size.
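For reference, the harmonic mean quoted above can be checked directly:

from statistics import harmonic_mean

group_sizes = [2, 10, 2, 9, 3, 1]
print(harmonic_mean(group_sizes))   # ~2.36, i.e. the "effective" per-cell n of about 2.4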
MANOVA emphasizes the scattering of attributes in the study group, and the logic of this analysis is based on the scattering of scores. It is not recommended to use small groups of one or only a few people (I mean fewer than 20) for parametric tests such as MANOVA. In my opinion, use non-parametric tests to examine small groups.

How to build a deep learning model that picks words from several distinct bags and forms a meaningful sentence [closed]

Image of Bags and how to choose from them
Imagine I have 10 bags, ordered one after another, i.e. Bag 1, Bag 2, ..., Bag n.
Each bag has a distinct set of words.
To understand what a bag is, consider a vocabulary of 10,000 words. The first bag contains the words Hello, India, Manager. That is, Bag 1 has 1's at the indices of the words present in it, so Bag 1 is a vector of size 10000 x 1. If Hello's index was 1, India's index was 2 and Manager's was 4, it would be
[0, 1, 1, 0, 1, 0, 0, 0, 0, ...]
I don't have a model yet. I'm thinking of using story books, but it's still kind of abstract to me.
A word has to be chosen from each bag and assigned a number: word 1 (word from bag 1), word 2 (word from bag 2), and so on, and they must form a MEANINGFUL sentence in their numerical order.
First, we need a way for the computer to recognise a word, otherwise it cannot pick the correct one. That means at this stage we need to decide what we are teaching the computer to begin with (i.e. what a verb is, what a noun is, grammar), but I will assume we dump a dictionary into it and give no information except the words themselves.
So that the computer can compute what sentences are, we need to convert the words to numbers (one way would be to work alphabetically starting at 1, using the numbers as keys for a dictionary (a digital one this time!) and the words as the values). Now we can apply the same linear algebra techniques to this problem as to any other problem.
So we need to make generations of matrices of weights to multiply into the keys of the dictionary, then remove all the weights beyond the range of dictionary keys; the rest can be used to look up values in the dictionary and make a sentence. Optionally, you can also apply a threshold value to all the outputs of the matrix multiplication.
Now for the hard part: learning. Once you have a few (say 100) matrices, we need to "breed" the best ones (this is where human intervention is needed): you pick the 50 most meaningful sentences (which might be hard at first) and use them as the basis for your next 100 (the easiest way would be to weight the 50 matrices randomly for a weighted mean, 100 times).
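A minimal numpy sketch of that breeding step, assuming the weight matrices all share one shape; this is only the weighted-mean recombination, not a full training loop:

import numpy as np

def breed(survivors, n_offspring=100, rng=None):
    """Produce new weight matrices as random weighted means of the surviving ones."""
    if rng is None:
        rng = np.random.default_rng()
    parents = np.stack(survivors)              # shape (n_parents, rows, cols)
    offspring = []
    for _ in range(n_offspring):
        w = rng.random(len(parents))
        w /= w.sum()                           # normalise so the result is a weighted mean
        offspring.append(np.tensordot(w, parents, axes=1))
    return offspring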
And the boring bit: keep running the generations over and over until you get to a point where your sentences are meaningful most of the time (of course there is no guarantee that they will always be meaningful, but that's the nature of ANNs).
If you find it doesn't work, you can use more layers (more matrices), and/or I recently heard of a different technique that dynamically changes the network, but I can't really help with that.
Have a database with thousands/millions of valid sentences.
Create a dictionary where each word is represented by a number (reserve 0 for "nothing", 1 for "start of sentence" and 2 for "end of sentence").
word_dic = { "_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
reverse_dic = {v:k for k,v in word_dic.items()}
Remember to add "_start_" and "_end_" at the beginning and end of all sentences in the database, and "_nothing_" after the end to pad them to the chosen length, which must be capable of containing all sentences. (Ideally, work with sentences of 10 or fewer words, so your model won't try to create bigger sentences.)
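For example, the padding step might look like this (a small sketch; max_len is whatever fixed sentence length you settle on):

def pad_sentence(words, max_len):
    # "_start_" + sentence + "_end_", then "_nothing_" up to the fixed length
    padded = ["_start_"] + words + ["_end_"]
    return padded + ["_nothing_"] * (max_len - len(padded))

pad_sentence(["word1", "word2"], 6)
# ['_start_', 'word1', 'word2', '_end_', '_nothing_', '_nothing_']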
Transform all your sentences into sequences of indices:
# supposing you have an array `database` of shape (sentences, length) of strings:
import numpy as np

indices = []
for word in database.reshape((-1,)):
    indices.append(word_dic[word])
indices = np.array(indices).reshape((sentences, length))
Transform this into categorical words with the keras function to_categorical()
from keras.utils import to_categorical   # or tensorflow.keras.utils in newer versions

cat_sentences = to_categorical(indices)   # shape (sentences, length, dictionary_size)
Hint: keras has lots of useful text preprocessing functions here.
Separate training input and output data:
# input is the sentences except for the last word
x_train = cat_sentences[:,:-1,:]
# output is the sentences except for the first word (shifted one step ahead)
y_train = cat_sentences[:,1:,:]
Let's create an LSTM-based model that will predict the next words from the previous words:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(dontKnow, return_sequences=True, input_shape=(None, dictionary_size)))
model.add(.....)
model.add(LSTM(dictionary_size, return_sequences=True, activation='sigmoid'))
# or a Dense(dictionary_size, activation='sigmoid')
Compile and fit this model with x_train and y_train:
model.compile(....)
model.fit(x_train,y_train,....)
Create an identical model using stateful=True in all LSTM layers:
newModel = ......
Transfer the weights from the trained model:
newModel.set_weights(model.get_weights())
Create your bags in a categorical way, shape (10, dictionary_size).
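One way to build that bag matrix, sketched with the first bag from the question (the remaining bags would be filled in the same way, reusing word_dic and dictionary_size from above):

import numpy as np

bags_words = [["Hello", "India", "Manager"]]          # one list of words per bag
bags = np.zeros((len(bags_words), dictionary_size), dtype=np.float32)
for i, words in enumerate(bags_words):
    for w in words:
        bags[i, word_dic[w]] = 1.0                    # 1 at each word's dictionary index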
Use the model to predict one word from the _start_ word.
#reset the states of the stateful model before you start a 10 word prediction:
newModel.reset_states()
firstWord = newModel.predict(startWord)   # startWord is shaped as (1, 1, dictionary_size)
The firstWord will be a vector of size dictionary_size telling (sort of) the probabilities of each existing word. Compare it to the words in the bag. You can choose the word with the highest probability, or use some random selection if the probabilities of other words in the bag are also good.
#example taking the most probable word:
firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
Do the same again, but now input firstWord in the model:
secondWord = newModel.predict(firstWord) #respect the shapes
Repeat the process until you get a sentence. Notice that you may find _end_ before all 10 bags have been used. You may decide to finish the process with a shorter sentence then, especially if the other word probabilities are low.
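Putting the prediction steps together, the whole bag-by-bag generation loop might look roughly like this; a sketch reusing newModel, bags, word_dic and reverse_dic from the steps above, with the choice restricted to each bag by masking the predicted probabilities:

import numpy as np

newModel.reset_states()                                # fresh states for a new sentence
current = np.zeros((1, 1, dictionary_size), dtype=np.float32)
current[0, 0, word_dic["_start_"]] = 1.0

sentence = []
for bag in bags:                                       # bags has shape (10, dictionary_size)
    probs = newModel.predict(current).reshape(-1)
    if probs.argmax() == word_dic["_end_"]:            # the model thinks the sentence is done
        break
    idx = int((probs * bag).argmax())                  # most probable word within this bag
    sentence.append(reverse_dic[idx])
    current = np.zeros((1, 1, dictionary_size), dtype=np.float32)
    current[0, 0, idx] = 1.0                           # feed the chosen word back in

print(" ".join(sentence))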

MySQL polygons contained by a polygon

I have a table of segments, stored as polygons.
I want to get all segments that are touched by another polygon, for example a square or a circle.
In this image: http://img.acianetmedia.com/GJ3
the small gray boxes represent segments, and the large outline is big_BOX.
With this query:
SELECT id, position, ASTEXT( value )
FROM segment
WHERE MBRCONTAINS( GEOMFROMTEXT( 'POLYGON(( 20.617202597319 -103.40838420263,20.617202597319 -103.3795955521,20.590250599403 -103.3795955521,20.590250599403 -103.40838420263,20.617202597319 -103.40838420263))' ) , value )
I got 4 segments that are 100% inside big_BOX,
but how do I get ALL segments that are touched by big_BOX?
The result should be 16 segments.
A simple solution:
Instead of MBRContains, you should use MBRIntersects, which will return any results that either fully or partially share space with your big box.
A caution and full solution:
Depending on your data and the rest of your solution (especially on how the big box is formed), it is possible that you return more than 16 segments due to the number of decimal places your coordinates use. Whilst this is quite unlikely and would only ever be possible under extreme circumstances, it is a possibility to consider.
At 7 decimal places, you're at about 1.1 cm accuracy (at the equator). If your big box appears to line up exactly with a 4x4 set of segments, it is possible (at an absolute maximum) that you actually get a result set of 36 (6x6), because the coordinates overlap into the neighbouring segments on all sides by even the most minute measurement. Anything between 16 and 36 segments could be possible.
Again, this is highly unlikely, but if you wanted to always ensure a result set of 16 you could use a combination of methods, such as Area(Intersection(#geom1, #geom2)) to calculate the intersection area between your big box and the intersecting segments, order on that column descending, and take the first 16 results.
Whilst this would guarantee the most appropriate 16 segments, it would add overhead to all queries just to cater for the most extreme scenarios.
The choice is yours. Hope it helps.

SublimeText 2: What does the number on the left side mean in command palette?

When you search for a file name, on the left it gives you numbers ranging from 0-999. What do these numbers represent? It seems like a search ranking, but I'm not sure.
They are a measure of the likelihood that the result matches your search query. This kind of scoring happens under the hood in most predictive or autocomplete searches (like Google's or the Mac's Spotlight search), but the ST2 team decided it would be neat to show you the numeric result.
It takes a few items into consideration. Each one of these criteria adds more value to that result:
Number of matching characters
How frequently the file has been used
Proximity to the top folder
Whether the letters are in sequence or dispersed through the filename
Whether the filename starts with the matched letters, or the matched letters are somewhere in between.
In the example screenshot (not reproduced here), you can see the values go up as "index.html" gradually becomes more accurate. As expected, buried files, or files that are used less frequently, get a lower value.
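To illustrate a couple of the criteria above (matches in sequence and matches near the start of the name), here is a toy scoring function; it is purely illustrative, not Sublime Text's actual algorithm, and it ignores usage frequency and folder depth:

def toy_fuzzy_score(query, filename):
    """Toy score: rewards in-order matches, consecutive runs and early positions."""
    score, pos, streak = 0, 0, 0
    name = filename.lower()
    for ch in query.lower():
        idx = name.find(ch, pos)
        if idx == -1:
            return 0                                  # a query character never appears
        streak = streak + 1 if idx == pos else 1      # consecutive-match bonus grows
        score += 10 * streak + max(0, 20 - idx)       # plus a bonus for early matches
        pos = idx + 1
    return score

print(toy_fuzzy_score("index", "index.html"))                 # higher score
print(toy_fuzzy_score("index", "js/old_index_backup.html"))   # lower score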