Association Rules in RapidMiner

I wanted to ask this question in the RapidMiner Community, but after 2 days there is still no answer.
It might be an easy question for you. I want to find a meaningful relation between a couple of column values.
The table looks like this:
SiteID, Number of 2MB users, Number of 4MB users, Average 2MB speed usage, Average 4MB speed usage, Congestion status
It is clear that there is a relation between the number of users at each site, their average usage, and the congestion status of the site.
But how do I find it? A step-by-step guide would be helpful.
So many thanks.

You could use the Correlation Matrix operator. It computes how the attributes correlate with each other, and its second output is a matrix showing those correlations.

Related

Managing a set of users

We have a website with many users. To track which users transacted on a given day, we use Redis and store a bit string as the value. For instance, if our system had five users, and users 2 and 5 transacted on 2nd January, the value for the 2nd January key would look like '01001'. This also lets us determine unique users over a given period, and new users, using simple bit operations. However, with a growing number of users, we are running out of memory to store all these keys.
Is there any alternative database that we can use to store the data in a similar manner? If not, how should we store the data to get similar performance?
Redis' memory usage can be affected by many parameters, so I would also try looking at the output of INFO ALL for starters.
With every user represented by a bit, 400K daily visitors should take at least 50KB per value (400,000 bits is 50,000 bytes), but because a Redis bitmap is sized by its highest set bit, a sparse bitmap can occupy much more than its number of set bits suggests. I'd also suspect that since newer users are more active, the majority of your bitmaps' "active" flags are towards the end, causing each bitmap to reach close to its maximal size (i.e. the total number of users). So the question you should be trying to answer is how to store these 400K visits efficiently without sacrificing the functionality you're using. That actually depends on what you're doing with the recorded visits.
For example, if you're only interested in total counts, you could consider using the HyperLogLog data structure to count your transacting users with a low error rate and small memory/resources footprint. On the other hand, if you're trying to track individual users, perhaps keep a per user bitmap mapped to the days since signing up with your site.
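For the counting case, a minimal sketch with the redis-py client (the key names here are illustrative, not from the original setup):

import redis

r = redis.Redis()

def record_visit(day, user_id):
    # HyperLogLog: roughly 12KB per key regardless of cardinality, ~0.81% error
    r.pfadd('visits:hll:' + day, user_id)

def unique_visitors(*days):
    # PFCOUNT over several keys returns the cardinality of their union,
    # so "unique users over a period" is a single call
    return r.pfcount(*('visits:hll:' + day for day in days))

The trade-off is that a HyperLogLog gives approximate counts only; you lose the ability to ask "did user X visit on day Y", which the bitmap approach provides.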
Furthermore, there are bitmap compression techniques that you could consider implementing in your application code/Lua scripting/hacking Redis. The best answer would depend on what you're trying to do of course.

MySQL architecture for n * (n - 1) / 2 algorithm

I'm currently developing a website where users can search for other users based on attributes (age, height, town, education, etc.). I now want to implement some kind of rating between user profiles. The rating is calculated by its own algorithm based on the similarity between the 2 given profiles. For example, User A has a "match rating" of 85 with User B and 79 with User C; B and C have a rating of 94, and so on.
The user should be able to search for certain attributes and filter the results by rating.
Since the rating differs from profile to profile and also depends on the user doing the search, I can't simply add a field to my users table and use ORDER BY. So far I have come up with 2 solutions:
My first solution was a nightly batch job that calculates the rating for every possible user combination and stores it in a separate table (user1, user2, rating). I can then join this table with the user table and order the result by rating. After doing some math I figured that this solution doesn't scale that well.
Based on the formula n * (n - 1) / 2, there are 45 possible combinations for 10 users. For 1,000 users I suddenly have to insert 499,500 rating combinations into my rating table.
The second solution was to leave MySQL alone and just calculate the rating on the fly within my application. This also doesn't scale well. Let's say the search should only return 100 results to the UI (with the highest rated on top). If I have 10,000 users and I want to search for every user living in New York sorted by rating, I have to load EVERY user living in NY into my app (let's say 3,000), apply the algorithm, and then return only the top 100 to the user. This way I have loaded 2,900 useless user objects from the DB and wasted CPU on the algorithm without ever doing anything with them.
Any ideas how I can design this in my MySQL db or web app so that a user can have an individual rating with every other user in a way that the system scales beyond a couple thousand users?
If you have to match every user against every other user, the algorithm is O(N^2), whatever you do.
If you can exploit some sort of 1-dimensional "metric", then you can try and associate each user with a single synthetic value. But that's awkward and could be impossible.
But what you can do is note which users change their profiles (i.e., whenever any of the parameters the matching is based on changes). At that point you can batch-recalculate the table for those users only, thus working in O(N): if you have 10,000 users and only 10 require recalculation, you have to examine 100,000 records instead of 100,000,000.
Other strategies would be to only run the main algorithm for records which have the greater chance of being compared (in your example, "same city"), or, when updating records (this would require storing (user_1, user_2, ranking, last_calculated)), to only recalculate records with a high ranking, those that are very old, or those never calculated. The lowest-ranked matches aren't likely to change so much that they float to the top in a short time.
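A hedged sketch of that selective batch in Python (compute_rating and the table layout are assumptions, not from the question):

# Assumes a UNIQUE key on (user1, user2) and an application-side
# compute_rating(a, b) function.
def refresh_ratings(cursor, dirty_user_ids, all_user_ids, compute_rating):
    for u in dirty_user_ids:
        for v in all_user_ids:
            if u == v:
                continue
            a, b = min(u, v), max(u, v)  # store each pair only once
            cursor.execute(
                "REPLACE INTO ratings (user1, user2, rating) VALUES (%s, %s, %s)",
                (a, b, compute_rating(a, b)))

With 10 dirty users out of 10,000, this touches 100,000 rows instead of the full 50 million pairs.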
UPDATE
The problem also entails O(N^2) storage space.
How can this space be reduced? I see two approaches. One is to not put some information in the match table at all. The "match" function is the more meaningful the more rigid and steep it is; having ten thousand "good matches" would mean that matching means very little. So we would still need lots of recalculations when User1 changes some key data, in case it brings some of User1's "no-no" matches back into the "maybe" zone. But we would keep a smaller clique of active matches for each user.
Storage would still grow quadratically, but less steeply.
Another strategy would be to recalculate the match on demand. We would then need some method for quickly selecting which users are likely to have a good match (thus limiting the number of rows retrieved by the JOIN), and some method to quickly calculate a match; this could entail somehow rewriting the match between User1 and User2 as a very simple function of a subset of DataUser1 and DataUser2 (maybe using ancillary columns).
The challenge would be to leverage MySQL's capabilities and offload some calculations to the MySQL engine.
To this purpose you might perhaps "map" some data, at input time (therefore in O(k)), to spatial information, or to strings and employ Levenshtein distance.
The storage for a single user would grow, but it would grow linearly, not quadratically, and MySQL SPATIAL indexes are very efficient.
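To make the spatial idea concrete, a speculative sketch: project two matching-relevant numeric attributes onto a POINT column carrying a SPATIAL index, and prefilter candidates with a bounding-box test before running the full algorithm in the application. The table and column names are invented, and MySQL 5.7+ WKT functions are assumed:

def nearby_candidates(cursor, age, height, radius=10, limit=500):
    # Bounding box around (age, height); MBRContains can use the
    # SPATIAL index on users.profile_point.
    box = ('POLYGON((%d %d, %d %d, %d %d, %d %d, %d %d))' %
           (age - radius, height - radius, age + radius, height - radius,
            age + radius, height + radius, age - radius, height + radius,
            age - radius, height - radius))
    cursor.execute(
        "SELECT id FROM users "
        "WHERE MBRContains(ST_GeomFromText(%s), profile_point) LIMIT %s",
        (box, limit))
    return [row[0] for row in cursor.fetchall()]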
If the search should only return the top 100 best matches, then why not just store those? It sounds like you would never want to search the bottom end of the results anyway, so just don't calculate them.
That way, your storage space is only O(n) rather than O(n^2), and updates should be as well. If someone really wants to see matches past the first 100 (and you want to let them), you have the option of running the query in real time at that point.
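A sketch of that top-100 idea (compute_rating is hypothetical):

import heapq

def top_matches(user_id, all_user_ids, compute_rating, k=100):
    # Keep only the k best matches per user: storage is O(n * k), i.e. O(n).
    candidates = ((compute_rating(user_id, other), other)
                  for other in all_user_ids if other != user_id)
    return heapq.nlargest(k, candidates)  # list of (rating, other_id) pairs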
I agree with everything @Iserni says.
If you have a web app and users need to "login", then you might have an opportunity to create that user's rankings at that time and stash them into a temporary table (or rows in an existing table).
This will work in a reasonable amount of time (a few seconds) if all the data needed for the calculation fits into memory. The database engine should then be doing a full table scan and creating all the ratings.
This should work reasonably well for one user logging in. Passably for two . . . but it is not going to scale very well if you have, say, a dozen users logging in within one second.
Fundamentally, though, your rating does not scale well. You have to do a comparison of all users to all users to get the results. Whether this is batch (at night) or real-time (when someone has a query) doesn't change the nature of the problem. It is going to use a lot of computing resources, and multiple users making requests at the same time will be a bottleneck.

Getting average or keeping temp data in db - performance concern

I am building a little app for users to create collections. I want to have a rating system in there. And now, since I want to cover all my bases, let's pretend that I have a lot of visitors. Performance comes into play, especially with rates.
Let's suppose that I have a rates table with id, game_id, user_id, and rate. The data stays simple: for every user there is one entry. Let's suppose again that 1,000 users will rate one game. And I want to print out the average rate on that game's subpage (and somewhere else, like on the games list). For now, I have two scenarios to go with:
Getting the AVG each time the game is displayed.
Creating another column in games, called temprate, and storing the rate for the game there. It would be updated every time someone votes.
Those two scenarios have obvious flaws. The first is more stressful for my host, since it will definitely consume more of the machine's power. The second means more work while rating (getting all the game data, submitting the rate, getting the new AVG).
Please advise me: which scenario should I go with? Or maybe you have some other ideas?
I work with PDO and no framework.
So I've finally managed to solve this issue. I used file caching based on dumping arrays into files. I just go with something like if (cache) { $var = cache } else { $var = db }. I am using JG Cache for now, but probably I'll write something similar myself soon; for now it's a great solution.
I'd have gone with a variation of your "number 2" solution (update a separate rating column), maybe in a separate table just for this.
If the number of writes ever becomes a problem, it will do so well after select avg(foo) from ... does, and there are lots of ways to mitigate it, such as updating the average rating only periodically or processing new votes in batches every so often.
Eventually you likely can't just do an avg() anyway, because you'll have to consider each vote for fraud, calculate a sort score, and who knows what else.
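A minimal sketch of the incremental variant of scenario 2 (the rating_sum and rating_count columns are assumptions): keep a running sum and count so the average is O(1) to read and O(1) to maintain per vote, with no AVG() scan.

def add_rating(cursor, game_id, user_id, rate):
    cursor.execute(
        "INSERT INTO rates (game_id, user_id, rate) VALUES (%s, %s, %s)",
        (game_id, user_id, rate))
    cursor.execute(
        "UPDATE games SET rating_sum = rating_sum + %s, "
        "rating_count = rating_count + 1 WHERE id = %s",
        (rate, game_id))
    # The display layer reads rating_sum / rating_count directly.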

COUNT(*) WHERE vs. SELECT(*) WHERE performance

I am building a forum and I am trying to count all of the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster to keep a record of how many posts each user has, updated each time he makes a post, and use a SELECT query to find it?
How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?
If you keep a record of how many posts a user has made, it will definitely be faster.
If you have an index on the user field of the posts table, you will get decent query speeds too, but it will hurt your database once the posts table is big enough. If you are planning to scale, I would definitely recommend keeping a record of each user's post count in a dedicated field.
Storing precalculated values is a common and simple, but very effective, sort of optimization.
So just add a column with the number of posts a user has made, and maintain it with triggers or from your application (see the sketch after this answer).
The performance difference is:
With COUNT(*) you always have an index lookup plus the counting of results.
With the additional field you have an index lookup plus the return of a single number (the answer is already there).
And there will be no significant difference between MyISAM and InnoDB in this case.
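As a sketch of the application-side variant mentioned above (the post_count column and table layout are assumptions): update the counter in the same transaction as the insert, so the two cannot drift apart on failure.

def add_post(conn, user_id, body):
    cur = conn.cursor()
    try:
        cur.execute("INSERT INTO posts (user_id, body) VALUES (%s, %s)",
                    (user_id, body))
        cur.execute("UPDATE users SET post_count = post_count + 1 "
                    "WHERE id = %s", (user_id,))
        conn.commit()
    except Exception:
        conn.rollback()  # neither the post nor the count change survives
        raise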
Store the post count. It seems that this is a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submitted a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.
If you have the proper index on user_id, then COUNT(user_id) is trivial.
It's also the correct approach, semantically.
This is really one of those "trade-off" questions.
Realistically, if your Posts table has an index on the UserID column and you truly only want to return the number of posts per user, then a query based on this column should perform perfectly well.
If you had another table, UserPosts for example, yes, it would be quicker to query that table, but the real question is: is your Posts table really so large that you can't just query it for this count? The trade-off between the two approaches is obviously this:
1) with a separate audit table, there is overhead when adding or updating a post;
2) without a separate audit table, there is overhead in querying the table directly.
My gut instinct is always to design a system that records the data in a sensibly normalised fashion. I NEVER create tables just because it might be quicker to GET some data for reporting purposes. I would only create them if the need arose and it was essential to incorporate them.
At the end of the day, unless your Posts table is ridiculously large (i.e. more than a few million records), there should be no problem in querying it for a per-user count, presuming it is indexed correctly, i.e. with an index on the UserID column.
If you're using this information purely for display purposes (i.e. user jonny has posted 73 times), then it's easy enough to get the info out of the DB once, cache it, and then update the cache when a change is detected.
Performance on posting or performance on counting? From a data-purist perspective, a recorded count is not the same as an actual count. You can watch the front door of an auditorium and add the people that come in and subtract those that leave, but what if some sneak in the back door? What if you bulk-delete a problem topic? If you record the count, then every post is slowed down in order to calculate and record it. You can never guard your data from unexpected changes or errors in your program logic. For me, data integrity is everything, and I will COUNT(*) every time. COUNT(*) is the count, and it is fast: I just did a test on a table with 31 million rows, and a COUNT(*) on an indexed column where the value matched 424,887 rows took 1.4 seconds (on my P4 2GB development machine; I intentionally underpower my development server so I get punished for slow queries; on the production 8-core 16GB server that count takes less than 0.1 seconds). If COUNT(*) is slow, you are going to have performance issues in other queries too.
There are a whole pile of trade-offs, so no one can give you the right answer. But here's an approach no one else has mentioned:
You could use the "select where" query, but cache the result in a higher layer (memcached, for example). So your code would look like:
count = memcache.get('article-count-' + user_id)
if count is None:
    count = database.execute('select count(*) from posts where user_id = %s', (user_id,))
    memcache.put('article-count-' + user_id, count)
And when a user makes a new post, you would also need:
memcache.delete('article-count-' + user_id)
This will work best when the article count is read often but updated rarely. It combines the advantage of efficient caching with the advantage of a normalized database. But it is not a good solution if the article count is needed only rarely (in which case, is the optimisation necessary at all?). Another unsuitable case is when someone's article count is needed often, but it is almost always a different person's.
A further advantage of this approach is that you don't need to add the caching now. You can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).
More generally: you don't need to cache in your database. You could also put a cache "around" your database. Something I have done with Java is to use caching at the iBATIS level, for example.

How would one use Lucene.NET to help implement search on a site like Stack Overflow?

I've asked a similar question on Meta Stack Overflow, but that deals specifically with whether or not Lucene.NET is used on Stack Overflow.
The purpose of the question here is more of a hypothetical, as to what approaches one would take if one were to use Lucene.NET as a basis for in-site search and other factors on a site like Stack Overflow [SO].
As per the entry on the Stack Overflow blog titled "SQL 2008 Full-Text Search Problems" there was a strong indication that Lucene.NET was being considered at some point, but it appears that is definitely not the case, as per the comment by Geoff Dalgas on February 19th 2010:
Lucene.NET is not being used for Stack Overflow - we are using SQL Server Full Text indexing. Search is an area where we continue to make minor tweaks.
So my question is: how would one use Lucene.NET in a site which has the same semantics as Stack Overflow?
Here is some background and what I've done/thought about so far (yes, I've been implementing most of this and search is the last aspect I have to complete):
Technologies:
ASP.NET MVC
SQL Server 2008
.NET 3.5
C# 3.0
And of course, the star of the show, Lucene.NET.
The intention is also to move to .NET/C# 4.0 ASAP. While I don't think it's a game-changer, it should be noted.
Before getting into aspects of Lucene.NET, it's important to point out the SQL Server 2008 aspects of it, as well as the models involved.
Models
This system has more than one primary model type in comparison to Stack Overflow. Some examples of these models are:
Questions: These are questions that people can ask. People can reply to questions, just like on Stack Overflow.
Notes: These are one-way projections, so as opposed to a question, you are making a statement about content. People can't post replies to this.
Events: This is data about a real-time event. It has location information, date/time information.
The important thing to note about these models:
They all have a Name/Title (text) property and a Body (HTML) property (the formats are irrelevant, as the content will be parsed appropriately for analysis).
Every instance of a model has a unique URL on the site
Then there are the things that Stack Overflow provides which, IMO, are decorators to the models. These decorators can have different cardinalities, either one-to-one or one-to-many:
Votes: Keyed on the user
Replies: Optional, as an example, see the Notes case above
Favorited: Is the model listed as a favorite of a user?
Comments: (optional)
Tag Associations: Tags are in a separate table, so as not to replicate the tag for each model. There is a link between the model and the tag associations table, and then from the tag associations table to the tags table.
And there are supporting tallies which in themselves are one-to-one decorators to the models that are keyed to them in the same way (usually by a model id type and the model id):
Vote tallies: Total positive and negative votes, plus the Wilson score interval (this is important; it's going to determine the confidence level based on the votes for an entry; for the most part, assume the lower bound of the Wilson interval).
Replies (answers) are models that have most of the decorators that other models have; they just don't have a title or URL, and whether or not a model has replies is optional. If replies are allowed, it is of course a one-to-many relationship.
SQL Server 2008
The tables pretty much follow the layout of the models above, with separate tables for the decorators, as well as some supporting tables and views, stored procedures, etc.
It should be noted that the decision to not use full-text search is based primarily on the fact that it doesn't normalize scores like Lucene.NET. I'm open to suggestions on how to utilize text-based search, but I will have to perform searches across multiple model types, so keep in mind I'm going to need to normalize the score somehow.
Lucene.NET
This is where the big question mark is. Here are my thoughts so far on Stack Overflow functionality as well as how and what I've already done.
Indexing
Questions/Models
I believe each model should have an index of its own containing a unique id to quickly look it up based on a Term instance of that id (indexed, not analyzed).
In this area, I've considered having Lucene.NET analyze each question/model and each reply individually. So if there were one question and five answers, the question and each of the answers would be indexed as six separate units.
The idea here is that the relevance score that Lucene.NET returns would be easier to compare between models that project in different ways (say, something without replies).
As an example, a question sets the subject, and then the answer elaborates on the subject.
For a note, which doesn't have replies, it handles the matter of presenting the subject and then elaborating on it.
I believe that this will help with making the relevance scores more relevant to each other.
Tags
Initially, I thought that these should be kept in a separate index with multiple fields holding the ids of the documents in the appropriate model index. Or, if that's too large, there is an index with just the tags and another index which maintains the relationship between the tags index and the questions they are applied to. This way, when you click on a tag (or use the URL structure), each lookup step is progressive, and you only pay for the next step if the current one succeeds:
If the tag exists
Which questions the tags are associated with
The questions themselves
However, in practice, doing a query of all items based on tags (like clicking on a tag in Stack Overflow) is extremely easy with SQL Server 2008. Based on the model above, it simply requires a query such as:
select
    m.Name, m.Body
from
    Models as m
    left outer join TagAssociations as ta on
        ta.ModelTypeId = <fixed model type id> and
        ta.ModelId = m.Id
    left outer join Tags as t on t.Id = ta.TagId
where
    t.Name = <tag>
And since certain properties are shared across all models, it's easy enough to do a UNION between different model types/tables and produce a consistent set of results.
This would be analogous to a TermQuery in Lucene.NET (I'm referencing the Java documentation since it's comprehensive, and Lucene.NET is meant to be a line-by-line translation of Lucene, so all the documentation is the same).
The issue that comes up with using Lucene.NET here is that of sort order. The relevance score for a TermQuery when it comes to tags is irrelevant. It's either 1 or 0 (it either has it or it doesn't).
At this point, the confidence score (Wilson score interval) comes into play for ordering the results.
This score could be stored in Lucene.NET, but in order to sort the results on this field, it would rely on the values being stored in the field cache, which is something I really, really want to avoid. For a large number of documents, the field cache can grow very large (the Wilson score is a double, and you would need one double for every document, that can be one large array).
Given that I can change the SQL statement to order based on the Wilson score interval like this:
select
    m.Name, m.Body
from
    Models as m
    left outer join TagAssociations as ta on
        ta.ModelTypeId = <fixed model type id> and
        ta.ModelId = m.Id
    left outer join Tags as t on t.Id = ta.TagId
    left outer join VoteTallyStatistics as s on
        s.ModelTypeId = ta.ModelTypeId and
        s.ModelId = ta.ModelId
where
    t.Name = <tag>
order by
    -- use Id to break ties
    s.WilsonIntervalLowerBound desc, m.Id
It seems like an easy choice to use this to handle the piece of Stack Overflow functionality "get all items tagged with <tag>".
Replies
Originally, I thought these would live in a separate index of their own, with a key back into the Questions index.
I think that there should be a combination of each model and each reply (if there is one) so that relevance scores across different models are more "equal" when compared to each other.
This would of course bloat the index. I'm somewhat comfortable with that right now.
Or, is there a way to store say, the models and replies as individual documents in Lucene.NET and then take both and be able to get the relevance score for a query treating both documents as one? If so, then this would be ideal.
There is of course the question of what fields would be stored, indexed, analyzed (all operations can be separate operations, or mix-and-matched)? Just how much would one index?
What about using special stemmers/porters for spelling mistakes (using Metaphone), as well as synonyms (there is terminology in the community I will service which has its own slang/terminology for certain things, with multiple representations)?
Boost
This is related to indexing of course, but I think it merits its own section.
Are you boosting fields and/or documents? If so, how do you boost them? Is the boost constant for certain fields? Or is it recalculated for fields where vote/view/favorite/external data is applicable?
For example, in the document, does the title get a boost over the body? If so, what boost factors do you think work well? What about tags?
The thinking here is the same as it is along the lines of Stack Overflow. Terms in the document have relevance, but if a document is tagged with the term, or it is in the title, then it should be boosted.
Shashikant Kore suggests a document structure like this:
Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined
And then using boost but not based on the raw vote value. I believe I have that covered with the Wilson Score interval.
The question is, should the boost be applied to the entire document? I'm leaning towards no on this one, because it would mean I'd have to reindex the document each time a user voted on the model.
Search for Items Tagged
I originally thought that when querying for a tag (by clicking on one or using the URL structure to look up tagged content), it would be a simple TermQuery against the tag index for the tag, then against the associations index (if necessary), then back to the questions; Lucene.NET handles this really quickly.
However, given the notes above regarding how easy it is to do this in SQL Server, I've opted for that route when it comes to searching tagged items.
General Search
So now, the most outstanding question is when doing a general phrase or term search against content, what and how do you integrate other information (such as votes) in order to determine the results in the proper order? For example, when performing this search on ASP.NET MVC on Stack Overflow, these are the tallies for the top five results (when using the relevance tab):
q votes   answers   accepted answer votes   asp.net highlights   mvc highlights
-------   -------   ---------------------   ------------------   --------------
     21        26                      51                    2                2
     58        23                      70                    2                5
     29        24                      40                    3                4
     37        15                      25                    1                2
     59        23                      47                    2                2
Note that the highlights are only in the title and abstract on the results page and are only minor indicators as to what the true term frequency is in the document, title, tag, reply (however they are applied, which is another good question).
How is all of this brought together?
At this point, I know that Lucene.NET will return a normalized relevance score, and the vote data will give me a Wilson score interval which I can use to determine the confidence score.
How should I look at combining these two scores to indicate the sort order of the result set based on relevance and confidence?
It is obvious to me that there should be some relationship between the two, but what that relationship should be evades me at this point. I know I have to refine it as time goes on, but I'm really lost on this part.
My initial thoughts are: if the relevance score is between 0 and 1 and the confidence score is between 0 and 1, then I could do something like this:
1 / ((e ^ cs) * (e ^ rs))
which is just e ^ -(cs + rs). This way, one gets a normalized value that approaches 0 the more relevant and confident the result is, and the results can be sorted on that.
The main issue with that is that if boosting is performed on the tag and or title field, then the relevance score is outside the bounds of 0 to 1 (the upper end becomes unbounded then, and I don't know how to deal with that).
Also, I believe I will have to adjust the confidence score to account for vote tallies that are completely negative. Since vote tallies that are completely negative result in a Wilson score interval with a lower bound of 0, something with -500 votes has the same confidence score as something with -1 vote, or 0 votes.
Fortunately, the upper bound decreases from 1 to 0 as negative vote tallies go up. I could change the confidence score to be a range from -1 to 1, like so:
confidence score = votetally < 0 ?
-(1 - wilson score interval upper bound) :
wilson score interval lower bound
The problem with this is that plugging in 0 into the equation will rank all of the items with zero votes below those with negative vote tallies.
To that end, I'm thinking if the confidence score is going to be used in a reciprocal equation like above (I'm concerned about overflow obviously), then it needs to be reworked to always be positive. One way of achieving this is:
confidence score = 0.5 +
(votetally < 0 ?
-(1 - wilson score interval upper bound) :
wilson score interval lower bound) / 2
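Putting the pieces above together, a small pure-Python sketch of the scoring as described (the function names are mine):

import math

def confidence_score(vote_tally, wilson_lower, wilson_upper):
    # Shift the [-1, 1] confidence range into [0, 1] as described above.
    raw = -(1.0 - wilson_upper) if vote_tally < 0 else wilson_lower
    return 0.5 + raw / 2.0

def sort_key(relevance, confidence):
    # 1 / (e^cs * e^rs) == e^-(cs + rs): approaches 0 as the scores grow,
    # so sort ascending on this key.
    return math.exp(-(confidence + relevance))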
My other concerns are how to actually perform the calculation given Lucene.NET and SQL Server. I'm hesitant to put the confidence score in the Lucene index because it requires use of the field cache, which can have a huge impact on memory consumption (as mentioned before).
An idea I had was to get the relevance score from Lucene.NET and then use a table-valued parameter to stream the scores to SQL Server (along with the ids of the items to select), at which point I'd perform the calculation with the confidence score and then return the data properly ordered.
As stated before, there are a lot of other questions I have about this; the answers have started to frame things, and I will continue to expand upon things as the question and answers evolve.
The answers you are looking for really cannot be found using Lucene alone. You need ranking and grouping algorithms to filter and understand the data and how it relates. Lucene can get you normalized data, but you need the right algorithm after that.
I would recommend you check out one or all of the following books, they will help you with the math and get you pointed in the right direction:
Algorithms of the Intelligent Web
Collective Intelligence in Action
Programming Collective Intelligence
The Lucene index will have the following fields:
Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined
All these fields are analyzed. Length normalization is disabled to get better control over the scoring.
The order of the fields above also reflects their importance, in descending order: a query match in the title is more important than one in the accepted answer, everything else remaining the same.
The number of upvotes for the question and for the top answer can be captured by boosting those fields. But the raw upvote count cannot be used as a boost value, as it could skew results dramatically. (A question with 4 upvotes would get twice the score of one with 2 upvotes.) These values need to be dampened aggressively before they can be used as boost factors. Using something like the natural logarithm (for upvotes > 3) looks good.
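One possible shape for that dampening (the exact curve is a tuning choice, not something prescribed by this answer):

import math

def vote_boost(upvotes):
    # No boost for small tallies; logarithmic growth above 3 upvotes,
    # so 400 votes is nowhere near 100x the boost of 4 votes.
    if upvotes <= 3:
        return 1.0
    return 1.0 + math.log(upvotes / 3.0)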
The title can be boosted by a value a little higher than that of the question.
Though inter-linking of questions is not very common, having a basic PageRank-like weight for a question could throw up some interesting results.
I do not consider the tags of a question to be very valuable information for search. Tags are nice when you just want to browse questions. Most of the time, the tags are part of the text anyway, so a search for the tags will match the question regardless. This is open to discussion, though.
A typical search query will be performed on all the four fields.
+(title:query question:query accepted_answer:query all_combined:query)
This is a broad sketch and will require significant tuning to arrive at the right boost values and the right weights for queries, if required. Experimentation will show the right weights for the two dimensions of quality: relevance and importance. You can make things more sophisticated by introducing recency as a ranking parameter. The idea here is that if a problem occurs in a particular version of the product and is fixed in later revisions, newer questions could be more useful to the user.
Some interesting twists to search could be added. Some form of basic synonym search could be helpful if only a "few" matching results are found. For example, "decrease java heap size" is the same as "reduce java heap size." But then it will also mean "map reduce" will start matching "map decrease." (A spell checker is obvious, but I suppose programmers would spell their queries correctly.)
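A toy sketch of that guarded expansion (the synonym table is illustrative): only rewrite the query when the original produced few hits, which limits the "map reduce" to "map decrease" failure mode.

SYNONYMS = {'decrease': ['reduce'], 'reduce': ['decrease']}

def expand_query(terms, hit_count, threshold=5):
    if hit_count >= threshold:
        return [terms]  # enough results already; don't expand
    variants = [terms]
    for i, term in enumerate(terms):
        for synonym in SYNONYMS.get(term, []):
            variants.append(terms[:i] + [synonym] + terms[i + 1:])
    return variants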
You've probably done more thinking on this subject than most folks who will try and answer you (part of the reason why it's been a day and I'm your first response, I'd imagine). I'm just going to try and tackle your final three questions, b/c there's just a lot there that I don't have time to go into, and I think those three are the most interesting (the physical implementation questions are probably going to wind up being 'pick something, and then tweak it as you learn more').
vote data: Not sure that votes make something more relevant to a search, frankly; they just make it more popular. If that makes sense, I'm trying to say that whether a given post is relevant to your question is mostly independent of whether it was relevant to other people. That said, there's probably at least a weak correlation between interesting questions and those that folks would want to find. Vote data is probably most useful in searches based purely on data, e.g. "most popular" type searches. In generic text-based searches, I'd probably not give votes any weight at first, but would consider working on an algorithm that gives them a slight weight in the sorting (so, not in the results returned, but a minor boost to the ordering of them).
replies I'd agree w/ your approach here, subject to some testing; remember that this is going to have to be an iterative process based on user feedback (so you'll need to collect metrics on whether searches returned successful results for the searcher)
other: Don't forget the user's score as well. Users get points on SO too, and that influences their default rank in the answers of each question they answer (it looks like this is mostly for tiebreaking between replies that have the same number of votes).
Determining relevance is always tricky. You need to figure out what you're trying to accomplish. Is your search trying to provide an exact match for a problem someone might have or is it trying to provide a list of recent items on a topic?
Once you've figured out what you want to return, you can look at the relative effect of each feature you're indexing. That will get a rough search going. From there, you tweak based on user feedback (I suggest using implicit feedback instead of explicit, otherwise you'll annoy the user).
As to indexing, you should try to put the data in so that each item has all the information necessary to rank it. This means you'll need to grab the data from a number of locations to build it up. Some indexing systems have the capability to add values to existing items which would make it easy to add scores to questions when subsequent answers came in. Simplicity would just have you rebuild the question every so often.
I think that Lucene is not good for this job.
You need something really fast with high availability... like SQL.
But you want open source?
I would suggest you use Sphinx - http://www.sphinxsearch.com/
It's much better, and I am speaking from experience; I have used them both.
Sphinx is amazing. It really is.