Recursive "incategory" CirrusSearch query - mediawiki

As a follow up to this question I'd like to know if there is a way to perform a recursive search within a particular MediaWiki category. Currently incategory only seems to search within the category given, not its sub-categories.
The only alternative I found was to concatenate all the sub-pages into one big query, but that does not seem to scale for really large categories, as MediaWiki (on Commons) says "Query was not understood. Please make it simpler."
What other options are there to recursively search within a particular category?

No. MediaWiki categories are not hierarchical, so you will encounter loops (Category:A < Category:B < Category:A), category trees branching and re-merging, and all kinds of other weird things. In large wikis, such as Wikipedia, it is also not very useful, because there is often no well defined ontology for categorization. If you traverse the category of the nations of the world on enwp, for instance, you will get the article about the pope somewhere in there (because the popes are categorized under the Vatican City State).
That's why Deepcat had to be developed.
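To illustrate why loops and re-merging branches matter, here is a minimal sketch of a category walk that survives them. It is not how Deepcat is implemented; the `children` mapping, the function name and the `max_depth` cap are assumptions made for the example. The idea is simply to remember every category already visited and to stop after a fixed depth.

```python
from collections import deque

def collect_subcategories(root, children, max_depth=3):
    """Breadth-first walk over a category graph that may contain loops."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the depth cap
        for child in children.get(category, []):
            if child not in seen:  # skip loops and re-merged branches
                seen.add(child)
                queue.append((child, depth + 1))
    return seen

# Hypothetical graph: A and B point at each other, and both reach C.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["D"], "D": []}
found = collect_subcategories("A", graph, max_depth=2)
```

The resulting set would still have to fit into one search query, so the depth cap also serves to keep the category list small.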


Creating more relevant results from LDA topic modeling?

I am doing a project for my degree and I have an actual client from another college. He wants me to do all this stuff with topic modeling on an SQL file of paper abstracts he's given me. I have zero experience with topic modeling, but I've been using Gensim and NLTK in a Jupyter notebook for this.
What he wants right now is for me to generate 10 or more topics and record the 10 most common words overall from the LDA results. Then, if those words are very frequent in every topic, remove them from the resulting word cloud; if their frequency varies more between topics, remove them from just the topics where they are infrequent and keep them in the more relevant topics.
He also wants me to compare the frequency of each topic with the SQL files from other years. And he wants each topic to have a name generated smartly by the computer.
I have topic models per year and overall, but of course the topics do not appear exactly the same way in each year. My biggest concern is the first thing he wants, the removal process. Is any of this possible? I need help figuring out where to look, as Google is not giving me what I want; I am probably searching for it wrong.
Thank you!
Show some of the code you use so we can give you more useful tips. Also use the nlp tag; the tags you used are kind of specific and not followed by many people, so your question might be hard to find for the relevant users.
By the whole word-removal thing do you mean stop words too? Or did you already remove those? Stop words are very common words ("the", "it", "me" etc.) which often appear high in most frequent word lists but do not really have any meaning for finding topics.
First you remove the stop words to make the most common words list more useful.
Then, as he requested, you look at which (more common) words are common in ALL the topics and remove those. I can imagine that in the case of abstracts this is stuff like hypothesis, research, paper, results etc., so stuff that is abstract-specific but not useful for determining topics within different abstracts. I can imagine that for this kind of analysis, as well as for the initial LDA, it makes sense to use all the data from all years, so the model has a large amount of data to recognize patterns in. But you should try out the variations and see whether the per-year or overall versions give you nicer results.
After you have your global word lists per topic, you go back to the original data (split up by year) and count how often the combined words from a topic occur per year. If you view this over the years you can probably see trends, like some topics that are popular now or in the last few years but weren't relevant if you go back far enough.
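The two steps above can be sketched in plain Python, independent of Gensim. The topic names, word lists and documents below are made up for illustration, and the whitespace tokenization is an assumption; in practice you would reuse whatever preprocessing fed the LDA model.

```python
from collections import Counter

def words_shared_by_all_topics(topics):
    """Words that show up in the top-word list of every topic."""
    word_sets = [set(words) for words in topics.values()]
    return set.intersection(*word_sets) if word_sets else set()

def prune_topics(topics, shared):
    """Drop the shared words from each topic's word list."""
    return {name: [w for w in words if w not in shared]
            for name, words in topics.items()}

def topic_frequency_by_year(docs_by_year, topic_words):
    """Count, per year, how often the words of each topic occur."""
    counts = {}
    for year, docs in docs_by_year.items():
        tokens = Counter(tok for doc in docs for tok in doc.lower().split())
        counts[year] = {name: sum(tokens[w] for w in words)
                        for name, words in topic_words.items()}
    return counts

# Hypothetical top words per topic and abstracts per year.
topics = {"t0": ["results", "protein", "cell"],
          "t1": ["results", "network", "graph"]}
shared = words_shared_by_all_topics(topics)
pruned = prune_topics(topics, shared)
docs_by_year = {2019: ["protein cell results", "graph results"],
                2020: ["network graph graph"]}
freq = topic_frequency_by_year(docs_by_year, pruned)
```

Here "results" is dropped because it appears in both topics, and the per-year counts are then based only on the remaining, more distinctive words.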
The last thing you mentioned (automatically assigning labels to topics) is actually something quite tricky, depending on how you go about it.
The "easy" way would be, e.g., to just use the most frequent word in each topic as its label, but the results will probably be underwhelming.
A more advanced approach is Topic Labeling. Or you can try an approach like modified text summarization using more powerful models.

SQL - Rank or Order by "human" Relevance

Looking to implement a ranking/order by feature that ranks products the way we as humans regard as relevant, not what a computer regards as relevant. Currently I have this SQL statement:
select MATCH(productName) AGAINST('xyz' IN NATURAL LANGUAGE MODE) AS relevant...
... ORDER BY relevant DESC
This seems to work well with regard to how many times a 'keyword' appears within the recordset, but it's very Yay or Nay, if you know what I mean.
However, when searching for "computer console" (in the unlikely event), I would like to see "Playstation", "xBox", "Nintendo", although I never actually typed these keywords into the search field.
When searching for "ladder", I personally would expect to see ladders for height access, not the board game "snakes and ladders" or clothing with a ladder pattern.
Same with "Iron": I would not expect "Iron man bedding" to appear within the first page.
Is there an industry-standard way of achieving such a thing, or does anyone have any ideas how this could be accomplished? E.g. a secondary table with keywords / search terms matching product_id.
Regards
This may not be exactly the same situation as yours but it may help you.
I designed a relevancy-based search results system for a large content management system I developed at my work.
Content comprises a title, the content and a hidden keywords field (words that should be used for search but are not included in the title or content). [there's lots more fields, but these three will do for demonstration of concept]
When content is added it gets indexed: some non alpha-numeric characters are removed, each word is stemmed (ie. educate, education, educator, educates, etc all get indexed as the same word), some words are converted to another based on some internal rules, and then they all get stored in an index.
When a search is done the system does the same as above to each keyword (removes unwanted characters, stemming, conversion based on internal rules).
The system then gets a list of content that has each of the parsed search keywords anywhere in any of those fields.
My code then parses each of the matching results: first it looks for all of the keywords existing consecutively in one of the fields; if it doesn't find the whole search phrase, it then iteratively [made up word] looks for smaller groups of keywords until one is found (ie. if 4 search keywords are entered it tries all 4 first, then 3, then 2, then 1 if they aren't all found together).
Based on how many of the keywords were found consecutively the system applies a score to the search result. Higher scores are given based on whether the keyword(s) were found in the title, content or keywords field [this took some fine tuning] and also on how close to the start of the field they were found.
The results are then given to the client based on this score.
The system works very well in our situation, particularly the grouped keywords part makes for good results.
You could use a similar system in your situation. A search for "ladder" would order a product like "Ladder - extra large" before "Snakes and Ladders Game".
For "computer console" you could add terms like these to a hidden keywords field.
Note that parsing the list for relevancy takes a bit of server resources so this type of system would only be suitable where you have sufficient infrastructure available or where the list of content is not large.
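A rough sketch of the scoring idea described above, not the actual code of that system. The field weights, the crude "strip a trailing s" stemmer and the position penalty are all assumptions chosen for the example; a real implementation would use a proper stemmer and tuned weights.

```python
import re

# Hypothetical weights: a hit in the title counts more than in the body.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "content": 1.0}

def tokens(text):
    """Lowercase, keep alphanumeric words, strip a trailing 's' (toy stemmer)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if w.endswith("s") else w for w in words]

def best_run(query_terms, field_terms):
    """Return (size, position) of the longest run of consecutive query
    terms found contiguously in the field, trying the longest runs first."""
    n = len(query_terms)
    for size in range(n, 0, -1):
        for start in range(n - size + 1):
            phrase = query_terms[start:start + size]
            for i in range(len(field_terms) - size + 1):
                if field_terms[i:i + size] == phrase:
                    return size, i
    return 0, 0

def score(query, doc):
    """Sum weighted run lengths, favouring matches near the field start."""
    q = tokens(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        size, position = best_run(q, tokens(doc.get(field, "")))
        total += weight * size / (1 + position)
    return total

products = [
    {"title": "Snakes and Ladders Game"},
    {"title": "Ladder - extra large"},
    {"title": "Playstation", "keywords": "computer console"},
]
ranked = sorted(products, key=lambda d: score("ladder", d), reverse=True)
```

With these toy weights the ladder product ranks above the board game because the match sits at the start of the title, and the hidden keywords field is what lets "computer console" find the Playstation entry.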

Implementing search on medical link list/table that allows for synonyms/abbreviations- and importing such a thing

I'm making a simple searchable list which will end up containing about 100,000 links on various medical topics- mostly medical conditions/diseases.
Now on the surface of things this sounds easy... in fact I've set my tables up in the following way:
Links: id, url, name, topic
Topics (eg cardiology, paediatrics etc): id, name
Conditions (eg asthma, influenza etc): id, name, aliases
And possibly another table:
Link & condition (since 1 link can pertain to multiple conditions): link id, condition id
So basically since doctors (including myself) are super fussy, I want to make it so that if you're searching for a condition- whether it be an abbreviation, british or american english, or an alternative ancient name- you get relevant results (eg "angiooedema", "angioedema", "Quincke's edema" etc would give you the same results; similarly with "gastroesophageal reflux" "gastro-oesophageal reflux disease", GERD, GORD, GOR). Additionally, at the top of the results it would be good to group together links for a diagnosis that matches the search string, then have matches to link name, then finally matches to the topic.
My main problem is that there are thousands if not tens of thousands of conditions, each with up to 20 synonyms/spellings etc. One option is to get data from MeSH, which happens to be a sort of medical thesaurus (but in American English only, so there would have to be a way of converting from British English). The trouble is that the XML they provide is INSANE and about 250 MB. To help, they have got a guide to what the data elements are.
Honestly, I am at a loss as to how to tackle this most effectively as I've just started programming and working with databases and most of the possibilities of what to do seem difficult/suboptimal.
Was wondering if anyone could give me a hand? Happy to clarify anything that is unclear.
Your problem is well suited to a document-oriented store such as Lucene. For example you can design a schema such as
Link
Topic
Conditions
Then you can write a Lucene query such as Topic:edema and you should get all results.
You can do wildcard search for more.
To match british spellings (or even misspellings) you can use the ~ query which finds terms within a certain string distance. For example edema~0.5 matches oedema, oedoema and so on...
Apache Lucene is a Java library with ports available for most major languages. Apache Solr is a full-fledged search server built on the Lucene library, and it is easy to integrate into your platform of choice because it has a RESTful API.
Summary: my recommendation is to use Apache Solr as an adjunct to your MySQL db.
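To make the alias idea concrete without setting up Solr, here is a small Python sketch. It is only a stand-in: `difflib` from the standard library plays the role of Lucene's fuzzy `~` operator, and its similarity ratio is not the same measure Lucene uses. The condition data, the function names and the 0.85 cutoff are assumptions for illustration.

```python
import difflib
import re

def normalize(term):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", term.lower())

def build_alias_index(conditions):
    """Map every normalized name and alias to its canonical condition."""
    index = {}
    for canonical, aliases in conditions.items():
        for name in [canonical, *aliases]:
            index[normalize(name)] = canonical
    return index

def find_condition(query, index, cutoff=0.85):
    """Exact alias lookup first, then a fuzzy fallback for spelling variants."""
    key = normalize(query)
    if key in index:
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None

# Hypothetical excerpt of the Conditions table (name -> aliases).
CONDITIONS = {
    "angioedema": ["Quincke's edema"],
    "gastroesophageal reflux disease": [
        "gastroesophageal reflux", "GERD", "GORD", "GOR"],
}
ALIASES = build_alias_index(CONDITIONS)
```

Abbreviations such as GORD are resolved through the explicit alias list, while British spellings such as "angiooedema" fall through to the fuzzy match.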
It's hard. Your best bet is to use MeSH and then perhaps soundex to match on British English terms.

How to get the total count of every product attribute/filter like newegg

If you go to newegg.com (just one example) you'll notice while browsing products you can see the number of items next to each product attribute in the left hand sidebar.
With so many attributes on some items and so many different configurations of product filters how do they calculate all of those totals so fast?
For newegg.com, they are using a faceted navigation technology provided by Endeca.
In a nutshell, Endeca will use the data provided in XML/CSV, or directly retrieve data from any database (not limited to just MySQL), and calculate similarity and group the results into its own format.
Endeca is not free; open-source alternatives include Sphinx and Lucene/Solr.
Newegg uses Endeca, and they were probably one of Endeca's earlier customers. In retrospect, Endeca might have been a big contributor to their success. Faceted navigation works very well on complex electronics like computer parts.
There are a few things to consider in faceted navigation:
1) Do you want just faceted navigation on category-driven queries, or do you also want it to work on search? In fact, categories are a hierarchical facet of sorts.
2) Does the de-normalized inverted index model of Solr cause you problems?
If the answer to 1) is true -- it probably is -- you'll need some inverted indices. Inverted indices are pretty much the only way to do keyword search. They will also do faceting with some caveats.
Essentially you can consider each facet as an inverted index (in fact keyword search might be considered a special facet with ranking functions). Then to do counts you'd have to intersect/and the current query and filters with all other facet values. However, this model can lead to problems if you need to represent sparse product sets (see 2).
If the answer to 2) is true, it might help more to think about facets more in terms of OLAP. I don't know if inverted indices can handle complex relationships without some abstractions.
It's fair to consider and implement faceted search/nav as a blend of fulltext (typically implemented as an inverted index) and/or OLAP.
I'm pretty sure you can pull off faceting with a column store, but you'd still need to have an inverted index at your disposal to merge with if you want keyword search.
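The "each facet is an inverted index" idea can be sketched with plain Python sets. This says nothing about how Endeca or Solr store things internally; the product data and function names are invented for the example.

```python
from collections import defaultdict

def build_facet_index(products):
    """facet -> value -> set of product ids (one inverted index per facet)."""
    index = defaultdict(lambda: defaultdict(set))
    for pid, attrs in products.items():
        for facet, value in attrs.items():
            index[facet][value].add(pid)
    return index

def facet_counts(index, all_ids, filters):
    """Intersect the active filters, then count each facet value
    within the remaining result set."""
    matching = set(all_ids)
    for facet, value in filters.items():
        matching &= index.get(facet, {}).get(value, set())
    counts = {}
    for facet, values in index.items():
        counts[facet] = {value: len(ids & matching)
                         for value, ids in values.items() if ids & matching}
    return counts

# Hypothetical catalogue.
products = {
    1: {"brand": "A", "ram": "8GB"},
    2: {"brand": "A", "ram": "16GB"},
    3: {"brand": "B", "ram": "8GB"},
}
index = build_facet_index(products)
counts = facet_counts(index, products.keys(), {"ram": "8GB"})
```

Every count is one set intersection, which is why the work grows with the number of facet values shown rather than with the number of possible filter combinations.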
@Dan Grossman:
It might seem so, BUT --
Did you think for a moment how many combinations there are of facets? You can't cache so many pages like that. There are probably more combinations on Newegg.com than stars in your sky.
Add in multiple selection and it's even worse. Game over.
You can only cache some cases like unfiltered and commonly filtered. If you try to spider Newegg.com without limiting levels of recursion, you'll kill the spider. Faceted sites cause problems for search engines in general for this very reason. See http://www.searchmarketingstandard.com/facets-navigational-seo-powerhouse-part
You do not know that they calculate them fast. You only know that they render them fast. They could spend hours calculating those totals and rendering their pages, cache the results and serve those static files until some time when they want to refresh the data.

How would one use Lucene.NET to help implement search on a site like Stack Overflow?

I've asked a similar question on Meta Stack Overflow, but that deals specifically with whether or not Lucene.NET is used on Stack Overflow.
The purpose of the question here is more of a hypothetical, as to what approaches one would take if they were to use Lucene.NET as a basis for in-site search and other factors in a site like Stack Overflow [SO].
As per the entry on the Stack Overflow blog titled "SQL 2008 Full-Text Search Problems" there was a strong indication that Lucene.NET was being considered at some point, but it appears that is definitely not the case, as per the comment by Geoff Dalgas on February 19th 2010:
"Lucene.NET is not being used for Stack Overflow - we are using SQL Server Full Text indexing. Search is an area where we continue to make minor tweaks."
So my question is, how would one utilize Lucene.NET into a site which has the same semantics of Stack Overflow?
Here is some background and what I've done/thought about so far (yes, I've been implementing most of this and search is the last aspect I have to complete):
Technologies:
ASP.NET MVC
SQL Server 2008
.NET 3.5
C# 3.0
And of course, the star of the show, Lucene.NET.
The intention is also to move to .NET/C# 4.0 ASAP. While I don't think it's a game-changer, it should be noted.
Before getting into aspects of Lucene.NET, it's important to point out the SQL Server 2008 aspects of it, as well as the models involved.
Models
This system has more than one primary model type in comparison to Stack Overflow. Some examples of these models are:
Questions: These are questions that people can ask. People can reply to questions, just like on Stack Overflow.
Notes: These are one-way projections, so as opposed to a question, you are making a statement about content. People can't post replies to this.
Events: This is data about a real-time event. It has location information, date/time information.
The important thing to note about these models:
They all have a Name/Title (text) property and a Body (HTML) property (the formats are irrelevant, as the content will be parsed appropriately for analysis).
Every instance of a model has a unique URL on the site
Then there are the things that Stack Overflow provides which IMO, are decorators to the models. These decorators can have different cardinalities, either being one-to-one or one-to-many:
Votes: Keyed on the user
Replies: Optional, as an example, see the Notes case above
Favorited: Is the model listed as a favorite of a user?
Comments: (optional)
Tag Associations: Tags are in a separate table, so as not to replicate the tag for each model. There is a link between the model and the tag associations table, and then from the tag associations table to the tags table.
And there are supporting tallies which in themselves are one-to-one decorators to the models that are keyed to them in the same way (usually by a model id type and the model id):
Vote tallies: Total positive, negative votes, Wilson Score interval (this is important, it's going to determine the confidence level based on votes for an entry, for the most part, assume the lower bound of the Wilson interval).
Replies (answers) are models that have most of the decorators that most models have, they just don't have a title or url, and whether or not a model has a reply is optional. If replies are allowed, it is of course a one-to-many relationship.
SQL Server 2008
The tables pretty much follow the layout of the models above, with separate tables for the decorators, as well as some supporting tables and views, stored procedures, etc.
It should be noted that the decision to not use full-text search is based primarily on the fact that it doesn't normalize scores like Lucene.NET. I'm open to suggestions on how to utilize text-based search, but I will have to perform searches across multiple model types, so keep in mind I'm going to need to normalize the score somehow.
Lucene.NET
This is where the big question mark is. Here are my thoughts so far on Stack Overflow functionality as well as how and what I've already done.
Indexing
Questions/Models
I believe each model should have an index of its own containing a unique id to quickly look it up based on a Term instance of that id (indexed, not analyzed).
In this area, I've considered having Lucene.NET analyze each question/model and each reply individually. So if there was one question and five answers, the question and each of the answers would be indexed as one unit separately.
The idea here is that the relevance score that Lucene.NET returns would be easier to compare between models that project in different ways (say, something without replies).
As an example, a question sets the subject, and then the answer elaborates on the subject.
For a note, which doesn't have replies, it handles the matter of presenting the subject and then elaborating on it.
I believe that this will help with making the relevance scores more relevant to each other.
Tags
Initially, I thought that these should be kept in a separate index with multiple fields which have the ids to the documents in the appropriate model index. Or, if that's too large, there is an index with just the tags and another index which maintains the relationship between the tags index and the questions they are applied to. This way, when you click on a tag (or use the URL structure), it's easy to see in a progressive manner that you only have to "buy into" if you succeed:
If the tag exists
Which questions the tags are associated with
The questions themselves
However, in practice, doing a query of all items based on tags (like clicking on a tag in Stack Overflow) is extremely easy with SQL Server 2008. Based on the model above, it simply requires a query such as:
select
    m.Name, m.Body
from
    Models as m
        left outer join TagAssociations as ta on
            ta.ModelTypeId = <fixed model type id> and
            ta.ModelId = m.Id
        left outer join Tags as t on t.Id = ta.TagId
where
    t.Name = <tag>
And since certain properties are shared across all models, it's easy enough to do a UNION between different model types/tables and produce a consistent set of results.
This would be analogous to a TermQuery in Lucene.NET (I'm referencing the Java documentation since it's comprehensive, and Lucene.NET is meant to be a line-by-line translation of Lucene, so all the documentation is the same).
The issue that comes up with using Lucene.NET here is that of sort order. The relevance score for a TermQuery when it comes to tags is irrelevant. It's either 1 or 0 (it either has it or it doesn't).
At this point, the confidence score (Wilson score interval) comes into play for ordering the results.
This score could be stored in Lucene.NET, but in order to sort the results on this field, it would rely on the values being stored in the field cache, which is something I really, really want to avoid. For a large number of documents, the field cache can grow very large (the Wilson score is a double, and you would need one double for every document, that can be one large array).
Given that I can change the SQL statement to order based on the Wilson score interval like this:
select
    m.Name, m.Body
from
    Models as m
        left outer join TagAssociations as ta on
            ta.ModelTypeId = <fixed model type id> and
            ta.ModelId = m.Id
        left outer join Tags as t on t.Id = ta.TagId
        left outer join VoteTallyStatistics as s on
            s.ModelTypeId = ta.ModelTypeId and
            s.ModelId = ta.ModelId
where
    t.Name = <tag>
order by
    -- Use Id to break ties.
    s.WilsonIntervalLowerBound desc, m.Id
It seems like an easy choice to use this to handle the piece of Stack Overflow functionality "get all items tagged with <tag>".
Replies
Originally, I thought this is in a separate index of its own, with a key back into the Questions index.
I think that there should be a combination of each model and each reply (if there is one) so that relevance scores across different models are more "equal" when compared to each other.
This would of course bloat the index. I'm somewhat comfortable with that right now.
Or, is there a way to store say, the models and replies as individual documents in Lucene.NET and then take both and be able to get the relevance score for a query treating both documents as one? If so, then this would be ideal.
There is of course the question of what fields would be stored, indexed, analyzed (all operations can be separate operations, or mix-and-matched)? Just how much would one index?
What about using special stemmers/porters for spelling mistakes (using Metaphone) as well as synonyms (the community I will service has its own slang/terminology for certain things, with multiple representations)?
Boost
This is related to indexing of course, but I think it merits its own section.
Are you boosting fields and/or documents? If so, how do you boost them? Is the boost constant for certain fields? Or is it recalculated for fields where vote/view/favorite/external data is applicable?
For example, in the document, does the title get a boost over the body? If so, what boost factors do you think work well? What about tags?
The thinking here is the same as it is along the lines of Stack Overflow. Terms in the document have relevance, but if a document is tagged with the term, or it is in the title, then it should be boosted.
Shashikant Kore suggests a document structure like this:
Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined
And then using boost but not based on the raw vote value. I believe I have that covered with the Wilson Score interval.
The question is, should the boost be applied to the entire document? I'm leaning towards no on this one, because it would mean I'd have to reindex the document each time a user voted on the model.
Search for Items Tagged
I originally thought that querying for a tag (by specifically clicking on one or using the URL structure for looking up tagged content) would be a simple TermQuery against the tag index for the tag, then a lookup in the associations index (if necessary), then back to the questions; Lucene.NET handles this really quickly.
However, given the notes above regarding how easy it is to do this in SQL Server, I've opted for that route when it comes to searching tagged items.
General Search
So now, the most outstanding question is when doing a general phrase or term search against content, what and how do you integrate other information (such as votes) in order to determine the results in the proper order? For example, when performing this search on ASP.NET MVC on Stack Overflow, these are the tallies for the top five results (when using the relevance tab):
q votes   answers   accepted answer votes   asp.net highlights   mvc highlights
-------   -------   ---------------------   ------------------   --------------
     21        26                      51                    2                2
     58        23                      70                    2                5
     29        24                      40                    3                4
     37        15                      25                    1                2
     59        23                      47                    2                2
Note that the highlights are only in the title and abstract on the results page and are only minor indicators as to what the true term frequency is in the document, title, tag, reply (however they are applied, which is another good question).
How is all of this brought together?
At this point, I know that Lucene.NET will return a normalized relevance score, and the vote data will give me a Wilson score interval which I can use to determine the confidence score.
How should I look at combining these two scores to indicate the sort order of the result set based on relevance and confidence?
It is obvious to me that there should be some relationship between the two, but what that relationship should be evades me at this point. I know I have to refine it as time goes on, but I'm really lost on this part.
My initial thoughts are that if the relevance score is between 0 and 1 and the confidence score is between 0 and 1, then I could do something like this:
1 / ((e ^ cs) * (e ^ rs))
This way, one gets a normalized value that approaches 0 the more relevant and confident the result is, and it can be sorted on that.
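As a quick sanity check of that formula, here is a small Python sketch (the function name and sample scores are made up). Note that 1 / (e^cs * e^rs) equals e^-(cs + rs), so sorting ascending on it gives the same order as sorting descending on the plain sum cs + rs.

```python
import math

def combined_score(relevance, confidence):
    """1 / (e^cs * e^rs); smaller means more relevant and more confident."""
    return 1.0 / (math.exp(confidence) * math.exp(relevance))

# Hypothetical (relevance, confidence) pairs, both in [0, 1].
results = {"a": (0.9, 0.8), "b": (0.9, 0.1), "c": (0.2, 0.3)}
ordered = sorted(results, key=lambda k: combined_score(*results[k]))
```

With both inputs in [0, 1] the value stays between e^-2 (about 0.135) and 1.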
The main issue with that is that if boosting is performed on the tag and/or title field, then the relevance score is outside the bounds of 0 to 1 (the upper end becomes unbounded then, and I don't know how to deal with that).
Also, I believe I will have to adjust the confidence score to account for vote tallies that are completely negative. Since vote tallies that are completely negative result in a Wilson score interval with a lower bound of 0, something with -500 votes has the same confidence score as something with -1 vote, or 0 votes.
Fortunately, the upper bound decreases from 1 to 0 as negative vote tallies go up. I could change the confidence score to be a range from -1 to 1, like so:
confidence score = votetally < 0 ?
-(1 - wilson score interval upper bound) :
wilson score interval lower bound
The problem with this is that plugging in 0 into the equation will rank all of the items with zero votes below those with negative vote tallies.
To that end, I'm thinking if the confidence score is going to be used in a reciprocal equation like above (I'm concerned about overflow obviously), then it needs to be reworked to always be positive. One way of achieving this is:
confidence score = 0.5 +
(votetally < 0 ?
-(1 - wilson score interval upper bound) :
wilson score interval lower bound) / 2
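For reference, a minimal Python sketch of the always-positive confidence score above, using the standard Wilson score interval. The function names, the z value of 1.96 (roughly 95% confidence) and the choice to return (0, 1) when there are no votes are assumptions, not something taken from an existing implementation.

```python
import math

def wilson_interval(positive, total, z=1.96):
    """Wilson score interval (lower, upper) for a proportion of positive votes."""
    if total == 0:
        return 0.0, 1.0  # assumption: no votes means no information
    p = positive / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    margin = z * math.sqrt(p * (1.0 - p) / total
                           + z * z / (4.0 * total * total)) / denom
    return center - margin, center + margin

def confidence_score(up, down):
    """0.5 + (tally < 0 ? -(1 - upper) : lower) / 2, always within [0, 1]."""
    lower, upper = wilson_interval(up, up + down)
    tally = up - down
    signed = -(1.0 - upper) if tally < 0 else lower
    return 0.5 + signed / 2.0
```

With this version an item with no votes scores 0.5, heavily downvoted items approach 0 and heavily upvoted items approach 1.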
My other concerns are how to actually perform the calculation given Lucene.NET and SQL Server. I'm hesitant to put the confidence score in the Lucene index because it requires use of the field cache, which can have a huge impact on memory consumption (as mentioned before).
An idea I had was to get the relevance score from Lucene.NET and then use a table-valued parameter to stream the score to SQL Server (along with the ids of the items to select), at which point I'd perform the calculation with the confidence score and then return the data properly ordered.
As stated before, there are a lot of other questions I have about this; the answers have started to frame things, and I will continue to expand on this as the question and answers evolve.
The answers you are looking for really cannot be found using Lucene alone. You need ranking and grouping algorithms to filter and understand the data and how it relates. Lucene can help you get normalized data, but you need the right algorithm after that.
I would recommend you check out one or all of the following books, they will help you with the math and get you pointed in the right direction:
Algorithms of the Intelligent Web
Collective Intelligence in Action
Programming Collective Intelligence
The Lucene index will have the following fields:
Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined
All of these fields are analyzed. Length normalization is disabled to get better control over the scoring.
The aforementioned order of the fields also reflects their importance, in descending order. That is, a query match in the title is more important than one in the accepted answer, everything else remaining the same.
The number of upvotes for the question and for the top answer can be captured by boosting those fields. But the raw upvote count cannot be used as a boost value, as it could skew results dramatically. (A question with 4 upvotes would get twice the score of one with 2 upvotes.) These values need to be dampened aggressively before they can be used as a boost factor. Using something like the natural logarithm (for upvotes > 3) looks good.
Title can be boosted by a value a little higher than that of the question.
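One possible reading of that dampening, as a Python sketch. The neutral boost of 1.0 for three or fewer upvotes is an assumption; the answer only says the logarithm looks good above 3.

```python
import math

def vote_boost(upvotes):
    """Dampened boost: neutral up to 3 upvotes, natural log above that."""
    if upvotes <= 3:
        return 1.0  # assumption: no boost for low vote counts
    return math.log(upvotes)

boosts = {votes: vote_boost(votes) for votes in (2, 4, 100, 1000)}
```

With this curve 4 upvotes give a boost of about 1.39 rather than double the boost of 2 upvotes, and going from 100 to 1000 upvotes adds only about 2.3.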
Though inter-linking of questions is not very common, having a basic pagerank-like weight for a question could throw up some interesting results.
I do not consider tags of the question as very valuable information for search. Tags are nice when you just want to browse the questions. Most of the time, tags are part of the text, so a search for the tags will match the question anyway. This is open to discussion, though.
A typical search query will be performed on all the four fields.
+(title:query question:query accepted_answer:query all_combined:query)
This is a broad sketch and will require significant tuning to arrive at the right boost values and the right weights for queries, if required. Experimentation will show the right weights for the two dimensions of quality: relevance and importance. You can make things more complicated by introducing recency as a ranking parameter. The idea here is that if a problem occurs in a particular version of the product and is fixed in later revisions, the newer questions could be more useful to the user.
Some interesting twists to search could be added. Some form of basic synonym search could be helpful if only a "few" matching results are found. For example, "decrease java heap size" is the same as "reduce java heap size." But then it will also mean "map reduce" will start matching "map decrease." (A spell checker is the obvious addition, but I suppose programmers would spell their queries correctly.)
You've probably done more thinking on this subject than most folks who will try and answer you (part of the reason why it's been a day and I'm your first response, I'd imagine). I'm just going to try and tackle your final three questions, b/c there's just a lot there that I don't have time to go into, and I think those three are the most interesting (the physical implementation questions are probably going to wind up being 'pick something, and then tweak it as you learn more').
Vote data: Not sure that votes make something more relevant to a search, frankly; they just make it more popular. If that makes sense, I'm trying to say that whether a given post is relevant to your question is mostly independent of whether it was relevant to other people. That said, there's probably at least a weak correlation between interesting questions and those that folks would want to find. Vote data is probably most useful in doing searches based purely on data, e.g. "most popular" type searches. In generic text-based searches, I'd probably not provide any weight for votes at first, but would consider working on an algorithm that perhaps provides a slight weight for the sorting (so, not the results returned, but a minor boost to the ordering of them).
Replies: I'd agree w/ your approach here, subject to some testing; remember that this is going to have to be an iterative process based on user feedback (so you'll need to collect metrics on whether searches returned successful results for the searcher).
Other: Don't forget the user's score. Users get points on SO too, and that influences their default rank in the answers of each question they answer (it looks like it's mostly for tiebreaking on replies that have the same number of bumps).
Determining relevance is always tricky. You need to figure out what you're trying to accomplish. Is your search trying to provide an exact match for a problem someone might have or is it trying to provide a list of recent items on a topic?
Once you've figured what you want to return you can look at the relative effect of each feature you're indexing. That will get a rough search going. From there you tweak based on user feedback (I suggest using implicit feedback instead of explicit otherwise you'll annoy the user).
As to indexing, you should try to put the data in so that each item has all the information necessary to rank it. This means you'll need to grab the data from a number of locations to build it up. Some indexing systems have the capability to add values to existing items which would make it easy to add scores to questions when subsequent answers came in. Simplicity would just have you rebuild the question every so often.
I think that Lucene is not good for this job.
You need something really fast with high availability... like SQL
But you want open source?
I would suggest you use Sphinx - http://www.sphinxsearch.com/
It's much better, and I am speaking from experience; I have used them both.
Sphinx is amazing. Really is.