RapidMiner: Calculate document similarity - rapidminer

I am using RapidMiner to calculate the similarity between documents, and I run this process from my Java application.
The process calculates the similarity of each document with every other document in the dataset. I don't want to compute the similarity between every pair of documents. I only want to compute the similarity of one selected document with all the other documents.
The Process Document step gives me word vectors with their tf-idf scores.
The Data to Similarity step calculates the cosine similarity between these vectors.
So basically I need to calculate the cosine similarity of the one selected document to every other document in the dataset.
Is it possible in RapidMiner? Any insight will be helpful. Thank you.
ANSWER:

The Cross Distances operator would be a better fit. It takes two inputs, both of which are example sets. The first could be the list of features for all documents and the second could be the list of features for the single selected document. The result is a new example set containing the calculated distances. If you sort this example set by distance (the operator probably returns a sorted list already, but just in case you could use Sort) and use Filter Example Range to select the row with the minimum distance, you will get the details of the nearest document.
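To make the one-against-all computation concrete, here is a minimal sketch in plain Python, outside RapidMiner. It assumes each document is already a sparse tf-idf vector stored as a dict; the names `cosine_similarity`, `rank_against`, and the sample vectors are made up for illustration. Note that it ranks by similarity, so the best match is the largest value rather than the smallest distance.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def rank_against(query_vec, corpus):
    """Compare ONE query vector with every document; most similar first."""
    scores = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical tf-idf vectors, standing in for the Process Document output.
corpus = {
    "doc_a": {"apple": 0.5, "banana": 0.5},
    "doc_b": {"apple": 1.0},
    "doc_c": {"cherry": 1.0},
}
selected = {"apple": 1.0}
ranking = rank_against(selected, corpus)
```

The work is linear in the number of documents, instead of quadratic as with an all-pairs similarity matrix.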

Related

ANOVA and sample group associated to a variable

I have a variable X and 16 groups of samples. I would like to know which group is the most associated with this variable (the one with the lowest values, actually). I performed an ANOVA and a TukeyHSD post-hoc test, but that only highlights which groups differ for variable X.
Is there a way to determine which group is significantly associated with the lowest values of variable X?
Thanks for your help
With the post-hoc comparisons already in place, and with the information which groups differ from one another, all you need to know is the mean of X within each group.
The group means are easily calculated in standard statistical software. You already know which of those means are significantly different from one another.
Alternatively you can use dummy coding for the group variable (e.g., 5 indicator variables plus one reference group replace a 6-level factor; your 16 groups would need 15 indicators). A regression model that regresses X on the dummy variables is equivalent to the ANOVA model (in most parts) and allows for most pairwise comparisons (depending on the coding).
The regression coefficients will indicate the difference between groups, and the test for the coefficients will indicate whether or not these are significant on some level of confidence.
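As a rough illustration of both suggestions, the following Python sketch computes the group means and the coefficients a dummy-coded regression would produce. It relies on the fact that, for a single categorical predictor, the fitted intercept is the mean of the reference group and each dummy coefficient is the difference between that group's mean and the reference mean. The function names and the sample data are invented, and no significance test is performed here; that still comes from the post-hoc comparison.

```python
from statistics import mean

def group_means(samples):
    """Mean of X within each group; samples maps group name -> list of values."""
    return {group: mean(values) for group, values in samples.items()}

def dummy_coefficients(samples, reference):
    """Coefficients of a regression of X on dummy-coded groups.

    With one categorical predictor, the intercept is the reference group's
    mean and each coefficient is (group mean - reference mean).
    """
    means = group_means(samples)
    intercept = means[reference]
    coefs = {g: m - intercept for g, m in means.items() if g != reference}
    return intercept, coefs

# Made-up data with three groups instead of sixteen.
samples = {"g1": [4.0, 6.0], "g2": [1.0, 3.0], "g3": [7.0, 9.0]}
means = group_means(samples)
lowest = min(means, key=means.get)  # group with the lowest mean of X
intercept, coefs = dummy_coefficients(samples, reference="g1")
```

Choosing the group with the lowest mean as the reference makes every coefficient non-negative, so each test on a coefficient compares one group directly with the lowest one.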

Dealing with clusters when searching for points on map using mysql

I've found various questions with solutions similar to this problem but nothing quite on the money so far. Very grateful for any help.
I have a mysql (v.5.6.10) database with a single table called POSTS that stores millions upon millions of rows of lat/long points of interest on a map. Each point is classified as one of several different types. Each row is structured as id, type, coords:
id an unsigned bigint + primary key. This is auto incremented for each new row that is inserted.
type an unsigned tinyint used to encode the type of the point of interest.
coords a mysql geospatial POINT datatype representing the lat/long of the point of interest.
There is a SPATIAL index on 'coords'.
I need to find an efficient way to query the table and return up to X of the most recently-inserted points within a radius ("R") of a specific lat/long position ("Position"). The database is very dynamic so please assume that the data is radically different each time the table is queried.
If X is infinite, the problem is trivial. I just need to execute a query something like:
SELECT id, type, AsText(coords) FROM POSTS WHERE MBRContains(GeomFromText(BoundingBox), coords)
Where 'BoundingBox' is a MySQL POLYGON that perfectly encloses a circle of radius R around Position. Using a bounding box is, of course, not a perfect solution but this is not important for the particular problem that I'm trying to solve. I can order the results using "ORDER BY id DESC" to retrieve and process the most-recently-inserted points first.
If X is less than infinite then I just need to modify the above to:
SELECT id, type, AsText(coords) FROM POSTS WHERE MBRContains(GeomFromText(BoundingBox), coords) ORDER BY id DESC LIMIT X
The problem that I am trying to solve is how do I obtain a good representative set of results from a given region on the map when the points in that region are heavily clustered (for example, within cities on the map search region). For example:
In the example above, I am standing at X and searching for the 5 most-recently-inserted points of type black within the black-framed bounding box. If these points were all inserted in the cluster in the bottom right hand corner (let's assume that cluster is London) then my set of results will not include the black point that is near the top right of the search region. This is a problem for my application as I do not want users to be given the impression that there are no points of interest outside any areas where points are clustered.
I have considered a few potential solutions but I can't find one that works efficiently when the number of rows is huge (10s of millions). Approaches that I have tried so far include:
Dividing the search region into S squares (i.e., turning it into a grid) and searching for up to X/S points within each square, i.e., executing a separate MySQL query for each square in the grid. This works OK for a small number of rows but becomes inefficient when the number of rows is massive, as you need to divide the region into a large number of squares for the approach to work effectively. With only a small number of squares, you cannot guarantee that each square won't contain a densely populated cluster. A large number of squares means a large number of MySQL queries, which causes things to chug.
Adding a column to each row in the table that stores the distance to the nearest neighbour for each point. The nearest neighbour distance for a given point is calculated when the point is inserted into the table. With this structure, I can then order the search results by the nearest neighbour distance column so that any points that are in clusters are returned last. This solution only works when I'm searching for ALL points within the search region. For example, consider the situation in the diagram shown above. If I want to find the 5 most-recently-inserted points of type green, the nearest neighbour distance that is recorded for each point will not be correct. Recalculating these distances for each and every query is going to be far too expensive, even using efficient algorithms like KD trees.
In fact, I can't see any approach that requires pre-processing of data in table rows (or, put another way, 'touching' every point in the relevant search region dataset) to be viable when the number of rows gets large. I have considered algorithms like k-means / DBSCAN, etc. and I can't find anything that will work with sufficient efficiency given the use case explained above.
Any pearls? My intuition tells me this CAN be solved but I'm stumped so far.
Post-processing seems more effective in this case. Fetch the last X points of the given type. Check whether there is some clustering, for example too many points too close together relative to the distance from your point of view. Drop the oldest of them (or those that are very close together; maybe your data references the same POI several times). How many to drop is up to you. Then fetch the next X points and see whether some of them are outside the cluster. Alternatively, calculate a value for each point based on remoteness and recentness and discard points according to that value.
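One possible reading of this post-processing step is a greedy thinning pass over the rows returned by the query, sketched below in Python. It assumes the rows arrive newest first as `(id, x, y)` tuples and treats the coordinates as planar, which is only reasonable for small search regions. The function name `thin_clusters` and the `min_dist` threshold are hypothetical.

```python
import math

def thin_clusters(points, min_dist, limit):
    """Greedily keep points that are at least min_dist from every kept point.

    points: list of (id, x, y) tuples ordered newest first, so within a
    cluster the newest point survives and the older ones are dropped.
    """
    kept = []
    for pid, x, y in points:
        far_enough = all(math.hypot(x - kx, y - ky) >= min_dist
                         for _, kx, ky in kept)
        if far_enough:
            kept.append((pid, x, y))
            if len(kept) == limit:
                break
    return kept

# Made-up rows: ids 9, 8, 7 and 5 form a tight cluster, id 6 is isolated.
rows = [
    (9, 10.0, 10.0),
    (8, 10.1, 10.0),
    (7, 10.0, 10.1),
    (6, 2.0, 8.0),
    (5, 10.05, 10.05),
]
representative = thin_clusters(rows, min_dist=1.0, limit=5)
```

If fewer than `limit` points survive, the next batch of X rows can be fetched and passed through the same filter, as the answer suggests.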

Calculate conditional mean

I'm new to CUDA programming and am interested in implementing an algorithm that, when coded serially, calculates two or more means from a vector in one pass. What would be an efficient scheme for doing something like this in CUDA?
There are two vectors of length N: the element values, and indicator values identifying which subset each element belongs to.
Is there an efficient way to do this in one pass, or should this be done in M passes, where M is the number of means to be calculated, using a vector of index keys for the element values of each subset?
You can achieve this with one pass over the data with a single call to thrust::reduce_by_key. In particular, look at the "summary statistics" example, which computes several statistical properties of a single vector at once. You could generalize this method to reduce_by_key, which computes reductions over many sub-vectors in parallel. Your "indicator values" would be the "keys" that reduce_by_key uses to determine which sub-vector each element belongs to.
Partition each vector into smaller vectors and use threads to sum required elements of each sub vector. Then combine the sums and generate the global means. I would try to generate the M means at the same time rather than do M passes.
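The keyed reduction described in these answers can be illustrated serially. The Python sketch below is only an analogy, not Thrust code: like thrust::reduce_by_key, `itertools.groupby` combines runs of consecutive equal keys, so the pairs are sorted by key first (in Thrust this would typically be a sort_by_key call). Each run is reduced to a sum and a count, from which the mean follows. The function name and sample data are invented.

```python
from itertools import groupby
from operator import itemgetter

def means_by_key(keys, values):
    """Mean of the values belonging to each key, via a keyed reduction."""
    # Sort so that equal keys are adjacent, as a keyed reduction expects.
    pairs = sorted(zip(keys, values), key=itemgetter(0))
    result = {}
    for key, run in groupby(pairs, key=itemgetter(0)):
        total, count = 0.0, 0
        for _, value in run:  # reduce one run to (sum, count)
            total += value
            count += 1
        result[key] = total / count
    return result

indicators = [0, 1, 0, 1, 1]             # which subset each element belongs to
elements = [2.0, 10.0, 4.0, 20.0, 30.0]  # the element values
means = means_by_key(indicators, elements)
```

All M means come out of the same reduction, which is the point of the first answer: no separate pass per subset is needed.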

Randomly Assigning Positions

Here's my basic problem. Let's say I have 50 employees working on a certain day, and I want my program to randomly distribute them to a "position" (e.g., front desk, phones, etc.) based on what they have been trained on. The program already knows what each employee has been trained on. What is the best method programmatically to go through and assign an employee to each of the 50 positions?
P.s. I am programming this into Access using VBA, but this is more a question of process than actual code.
Hi lukewarm,
You are looking for a maximum bipartite matching. This is a problem from graph theory. It boils down to determining the maximum flow in an undirected, bipartite graph with constant edge weights of 1:
You divide all vertices in your graph into two separate sets. The first set contains all your workers, the second one all available positions.
Now you insert an edge from every worker to every position she/he is able to work on.
Insert two more vertices: a source and a sink. Connect the source with every worker vertex and the sink with every position vertex.
Determine the maximum flow from source to sink.
Hope I could help, greetings.
EDIT: Support for randomness
Since finding the maximum bipartite matching/maximum flow is a deterministic algorithm, it would always return the same result. In order to change that, you could shuffle the order of the edges in the graph before applying the algorithm.
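A compact way to try this out is the augmenting-path method for bipartite matching, which yields the same matching size as the unit-capacity maximum flow described above without building the flow network explicitly. The Python sketch below shuffles both the employee order and each employee's candidate positions to add randomness. The function name, the `trained` mapping, and the sample names are all hypothetical, and the question's actual target (Access/VBA) would need a port.

```python
import random

def random_matching(trained, rng=None):
    """Maximum bipartite matching with shuffled edges.

    trained: {employee: [positions the employee is trained on]}
    Returns {position: employee}.
    """
    rng = rng or random.Random()
    match = {}  # position -> employee currently holding it

    def try_assign(emp, seen):
        options = list(trained[emp])
        rng.shuffle(options)  # randomize which edge is tried first
        for pos in options:
            if pos in seen:
                continue
            seen.add(pos)
            # Take a free position, or move its current holder elsewhere.
            if pos not in match or try_assign(match[pos], seen):
                match[pos] = emp
                return True
        return False

    employees = list(trained)
    rng.shuffle(employees)  # randomize processing order as well
    for emp in employees:
        try_assign(emp, set())
    return match

trained = {"Ann": ["desk", "phones"], "Bob": ["desk"], "Cy": ["phones", "mail"]}
assignment = random_matching(trained, random.Random(0))
```

In this small example only one complete assignment exists, so the shuffle cannot change the result. With more overlap in training, different seeds give different valid assignments of the same size.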
In your position table have a sequence, 1, 2, 3, 4 and a count of positions to be filled. Then look at what the person did yesterday, and add 1 to the position sequence, and now they're assigned to the next position. If there are already enough people for that position today, then go to the next priority position.
Not random but maybe close enough.

Full-text search relevance is measured in?

I am making a quiz system, and when quizmakers insert questions into the Question Bank, I am to check the DB for duplicate / very highly similar questions.
Testing MySQL's MATCH() ... AGAINST(), the highest relevance I get is 30+, when I test against a 100% similar string.
So what exactly is the relevance? To quote the manual:
Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.
My problem is how to use the relevance value to test whether a string is a duplicate. If it's a 100% duplicate, prevent it from being inserted into the Question Bank. But if it is only somewhat similar, prompt the quizmaker to verify whether to insert it or not. So how do I do that? 30+ for a 100% identical string is not a percentage, so I'm stumped.
Thanks in advance.
The basic data structure for a text retrieval system is an Inverted Index. This is essentially a list of words found in the document collection with a list of the documents they occur in. It can also have metadata about the occurrence for each document, such as the number of times the word appears.
Documents containing the words can be queried by matching on the search terms. To determine relevance, a heuristic known as Cosine Ranking is calculated on the hits. This works by constructing an n-dimensional vector with one component for each of the n search terms. You can also weight the search terms if desired. This vector gives a point in n-dimensional space that corresponds to your search terms.
A similar vector based on the weighted occurrences in each document can be constructed from the inverted index, with each axis in the vector corresponding to the axis for each search term. If you calculate the dot product of these vectors and divide it by the product of their lengths, you get the cosine of the angle between them. 1.0 is equivalent to cos(0), which means the vectors lie on a common line from the origin. The closer the vectors are together, the smaller the angle and the closer the cosine is to 1.0.
If you sort the search results by the cosine (or bung them into a priority queue as mg does) you get the most relevant. Cleverer relevance algorithms tend to fiddle with the weights of the search terms, skewing the dot product in favour of terms with high relevance.
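The inverted index and cosine ranking described above can be sketched in a few lines of Python. This is an illustration of the general technique, not of how MySQL computes its relevance values. It assumes raw word counts as weights and a document length taken over all of the document's words. The function names and the sample questions are invented.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Return (index, norms); index maps word -> {doc_id: count in that doc}."""
    index = defaultdict(dict)
    norms = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        for word, count in counts.items():
            index[word][doc_id] = count
        norms[doc_id] = math.sqrt(sum(c * c for c in counts.values()))
    return index, norms

def cosine_rank(index, norms, query):
    """Rank the documents sharing at least one word with the query."""
    q_counts = Counter(query.lower().split())
    q_norm = math.sqrt(sum(c * c for c in q_counts.values()))
    dots = defaultdict(float)
    for word, q_weight in q_counts.items():
        # Only documents listed under this word contribute to the dot product.
        for doc_id, count in index.get(word, {}).items():
            dots[doc_id] += q_weight * count
    ranked = [(doc_id, dot / (q_norm * norms[doc_id]))
              for doc_id, dot in dots.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

questions = {
    1: "what is the capital of france",
    2: "what is the capital of spain",
    3: "how many legs does a spider have",
}
index, norms = build_index(questions)
ranking = cosine_rank(index, norms, "What is the capital of France")
```

Unlike the unbounded relevance value in the question, a cosine computed this way lies between 0 and 1 for non-negative weights, and an identical word multiset scores 1.0 up to floating-point rounding.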
If you want to dig a little, Managing Gigabytes by Witten, Moffat and Bell discusses the internal architecture of text retrieval systems.
andygeers is on the right track: Those numbers have no empirical meaning other than their relations to each other and cannot be used on their own to determine what is or is not an "exact match". You need to determine that yourself. Even aside from the limitations of fulltext search ranking, there's also the open question of just what you consider to constitute an "exact match". (Actual text only or do soundex matches count? Do synonyms (e.g., "couch" vs. "sofa") count as matching or as distinct? Should an attempt be made to compensate for misspellings? Etc.)
If I had the need to perform such a check, I would grab only the highest-ranked entry returned by the fulltext search, remove any designated stopwords, normalize whitespace, convert to lowercase, do the comparison, and leave it at that until I encountered a case that called for it to be refined further. It's not really all that much extra work - if you specify the language you're using for your application, you could probably find someone around here who could write the normalization function within a dozen or so lines of code.
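A normalization function along these lines might look like the following Python sketch. The stopword list is a made-up placeholder, and the function names are hypothetical. It performs only the three steps mentioned above (lowercasing, whitespace normalization, stopword removal), so punctuation differences still count as differences.

```python
# Placeholder stopword list; a real application would supply its own.
STOPWORDS = {"a", "an", "the", "is", "of"}

def normalize(text):
    """Lowercase, collapse whitespace, and drop stopwords."""
    words = text.lower().split()  # split() also collapses runs of whitespace
    return " ".join(w for w in words if w not in STOPWORDS)

def is_duplicate(candidate, best_match):
    """Compare a new question with the top-ranked fulltext hit."""
    return normalize(candidate) == normalize(best_match)

same = is_duplicate("What is the capital of France?",
                    "what  is THE capital   of france?")
different = is_duplicate("What is the capital of France?",
                         "What is the capital of Spain?")
```

Only the highest-ranked row from the fulltext search needs to be passed in as `best_match`, so the extra cost per inserted question stays small.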
I don't know the specifics of the MySQL function you're using, but I imagine it could be that there is no absolute meaning for those numbers - they're just designed to be compared with other values produced by the same function. To check for an absolute match you could select out the text itself and compare manually.