I want to create feature 'who viewed this item also viewed' like Amazon or Ebay. I'm deciding between MySql and non-relational database like MongoDB.
Edit: It seems to be straightforward to implement this feature in MySql. My guess is creating 'viewed' table in which userId, itemId, and time of viewing are saved. So, when trying to recommend off of a current item a user is looking at, I would Sub = (SELECT userId FROM viewed WHERE itemId == currentItemId) Then, SELECT itemId FROM viewed INNER JOIN Sub on viewed.userId = Sub.userId
Wouldn't this be too much for 100,000 users who viewed 100 pages this month?
For non-relational database, I don't feel it is right to have User to embed all users or Item to embed all Users. So, I'm thinking to have each User holds a list of itemIds he looked at and each Item holds a list of userIds seen by. And I'm not sure what to do next. Am I on the right path here?
If not, could you suggest a good way to implement this feature in non-relational database? And, does this suggestion have advantage in speed compared to MySql?
Initial Response
It seems to be straightforward to implement this feature in MySql by just calling JOIN on Item and User table.
Yes.
But, how fast or slow the database call will be to gather entire viewing history of 100,000 users at once?
How long is a piece of string ?
That depends on the standards and quality of your Relational Database implementation. If you have ID fields on all your files, it won't have Relational integrity, power, or speed, it will have 1970's ISAM Record Filing System speeds.
On a Sybase ASE server, on a small Unix box, a SELECT of similar intent on a table (not a file) with 16 billion rows returns 100 rows in 12 milliseconds.
For non-relational database, I don't feel it is right to have User to embed all users or Item to embed all Users. So, I'm thinking to have each User holds a list of item ids he looked at and each Item holds a list of user ids seen by.
I can't answer regarding MongoDB.
But for a Relational Database, that is how we implement it.
with one great difference: the two lists are implemented in a single table
each row is a single fact viewed [sorry] from two sides (the fact that an User has viewed an Item, is one and the same fact that an Item has been viewed by an User)
So it appears to be Relational thinking ... implemented Mongo-style, which requires 100% data and table duplication. I have no idea whether that is good or bad in MongoDB, in the sense that it could well be what is required for the thing to "perform". Ugly as sin.
And I'm not sure what to do next. Am I on the right path here?
Right for Relational (as long as you use one table for the two "lists"). Ask a more specific question if you do not understand this point.
If not, could you suggest a good way to implement this feature in non-relational database? And, does this suggestion have advantage in speed compared to MySql?
Sorry, I can't answer that.
But it would be unlikely that a non-relational DB can store and retrieve info that is classic Relational, faster than a semi-relational Record Filing System such as MySQL. All things being equal, of course. A real SQL platform would be faster still.
Response to Comments
First you had:
So, I'm thinking to have each User holds a list of item ids he looked at and each Item holds a list of user ids seen by.
That is two lists. That is not good, because the second list is a 100% duplication of the first.
Now you have (edited in the Question, and in the new comments):
I didn't fully understand what you meant by 'use one table for the two list'. My interpretation is create 'viewed' table in which userId, itemId, and time of viewing are saved.
That is good, you now have one list.
Just to be clear about the database we are discussing, let me erect a model, and have you confirm it.
User Item Data Model
If you are not used to the standard Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.
So, when trying to recommend off of a current item a user is looking at, I would Sub = (SELECT userId FROM viewed WHERE itemId == currentItemId). Then, SELECT itemId FROM viewed INNER JOIN Sub on viewed.userId = Sub.userId. Is this what you mean?
I did make a declaration and caution about the table, but I didn't give any directions regarding non-SQL coding, so no.
I would never suggest doing something in two steps, that can be done in a single step. SQL has its problems, but difficulty in obtaining information from a set of Relational tables (ie. a derived relation) using a single SELECT is definitely not one of them.
SUB is not SQL. Although I can guess at what it does, I may well be wrong, therefore I cannot comment on that code.
Against the model I have supplied, on an ISO/IEC/ANSI Standard SQL platform, I would use:
SELECT DISTINCT ItemId          -- Items viewed by ...
    FROM UserItem
    WHERE UserId IN (
        SELECT UserId           -- Users who viewed Item
            FROM UserItem
            WHERE ItemId = #CurrentItemId
        )
You will have to translate that into the non-SQL that your platform requires.
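As a rough illustration of that single-statement approach, here is a minimal sketch using Python's built-in sqlite3 module. The UserItem table name comes from the model above; the index name, the sample rows, and the extra condition that drops the current item from its own recommendations are assumptions made for the example. The subquery is matched with IN because several users may have viewed the same item.

```python
import sqlite3

# Hypothetical schema: one row per (UserId, ItemId) pair.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE UserItem (
        UserId INTEGER NOT NULL,
        ItemId INTEGER NOT NULL,
        PRIMARY KEY (UserId, ItemId)
    )
""")
# Second index (name is an assumption) so lookups by ItemId avoid a table scan.
conn.execute("CREATE INDEX UserItem_ByItem ON UserItem (ItemId, UserId)")

# Sample data, invented for the example.
conn.executemany(
    "INSERT INTO UserItem (UserId, ItemId) VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10), (2, 12), (3, 13)],
)


def also_viewed(conn, current_item_id):
    """Items viewed by any user who viewed current_item_id, excluding that item."""
    rows = conn.execute(
        """
        SELECT DISTINCT ItemId
        FROM UserItem
        WHERE UserId IN (
            SELECT UserId FROM UserItem WHERE ItemId = ?
        )
        AND ItemId <> ?
        ORDER BY ItemId
        """,
        (current_item_id, current_item_id),
    ).fetchall()
    return [item_id for (item_id,) in rows]


print(also_viewed(conn, 10))
```

Users 1 and 2 viewed item 10, so the call returns the other items those two users viewed.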
Wouldn't it be too much for 100,000 users who viewed 100 pages this month? Sorry for long question.
I have already answered that question in my initial response. Please read again.
You are trying to solve a performance problem that you do not yet have. That is not possible, given the laws of physics, the dependencies, our inability to reverse the chronology; etc. Therefore I recommend that you cease that activity.
Meanwhile, back at the farm, the cows need to be fed. Design the database first, then code the app, then if, and only if, there are performance problems, you can address them. IT Professionals can make scientific estimates, but I cannot give you a tutorial here in SO.
10,000,000 page views per month. You have not stated the number of Items, so the large figure is scary as hell. If you inform me as to how many Items; Users; Average Items viewed per session; and the duration (eg. month) you wish to cover, I can give you more specific advice.
As I understand it, an User views 1 (one) Item. As a selling-up feature, you want the system to identify the list of Items people "who viewed this item also viewed ...". That would appear to be a small fraction of 10,000,000 views. You do have an index on each table, yes ? So the non-SQL program you are using will not read 10,000,000 views to find that fraction, it will navigate the index, and read only the pages that contain that fraction.
Some of the non-SQLs need a second index to perform what real SQL platforms perform with one index. I have given that second index in the model.
While I appreciate that it was alright that a full definition was not provided for the file you described, up to now, since I am providing a model, I have to provide a complete and correct one, not a partial one.
Since Users view Items more than once, I have given a table that allows that, and tracks the Number of Views, and the Date Last Viewed. It is one row per User::Item, ever. If you would like a table that supports one row per User::Item view, please ask, I will provide.
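To make the "one row per User::Item, ever" idea concrete, here is a small sketch in Python with sqlite3. The column names NumViews and DateLastViewed are assumptions based on the description above, not the actual model, and the update-then-insert pattern is just one portable way to maintain the counter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names are assumptions derived from the prose description.
conn.execute("""
    CREATE TABLE UserItem (
        UserId         INTEGER NOT NULL,
        ItemId         INTEGER NOT NULL,
        NumViews       INTEGER NOT NULL,
        DateLastViewed TEXT    NOT NULL,
        PRIMARY KEY (UserId, ItemId)
    )
""")


def record_view(conn, user_id, item_id, viewed_at):
    """Bump the counter on an existing (user, item) row, or create the row."""
    cur = conn.execute(
        "UPDATE UserItem SET NumViews = NumViews + 1, DateLastViewed = ? "
        "WHERE UserId = ? AND ItemId = ?",
        (viewed_at, user_id, item_id),
    )
    if cur.rowcount == 0:  # no row yet for this pair
        conn.execute(
            "INSERT INTO UserItem (UserId, ItemId, NumViews, DateLastViewed) "
            "VALUES (?, ?, 1, ?)",
            (user_id, item_id, viewed_at),
        )


# Sample views, invented for the example.
record_view(conn, 1, 10, "2012-01-01")
record_view(conn, 1, 10, "2012-01-05")
record_view(conn, 2, 10, "2012-01-03")

rows = conn.execute(
    "SELECT UserId, ItemId, NumViews, DateLastViewed FROM UserItem ORDER BY UserId"
).fetchall()
print(rows)
```

However often a user returns to an item, the table grows by at most one row per pair, which keeps the "also viewed" lookup small.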
From where I sit, on the basis of facts established thus far, the 10,000,000 figure is not a concern.
This probably depends more on how you implement this feature than on the type of database used.
If you just store a lot of viewing history (like, "user x looked at item y"), you'd have to check out the users who viewed an item, and then all the items those users looked at. That can all be done on a single database table. However, you may end up with very large result sets.
It may be easier to use a graph structure of "connected" items that is continually updated during runtime and then easily queried.
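One way to picture such a graph is sketched below in Python. It is an in-memory illustration only: the class name, the method names, and the rule of counting each user at most once per item pair are all assumptions, and a real system would need persistence and some pruning of weak edges.

```python
from collections import Counter, defaultdict


class CoViewGraph:
    """Undirected item graph; edge weight = number of users who viewed both items."""

    def __init__(self):
        self.items_by_user = defaultdict(set)
        self.co_views = defaultdict(Counter)

    def record_view(self, user_id, item_id):
        seen = self.items_by_user[user_id]
        if item_id in seen:
            return  # repeat view: edges for this user are already counted
        for other in seen:
            self.co_views[item_id][other] += 1
            self.co_views[other][item_id] += 1
        seen.add(item_id)

    def also_viewed(self, item_id, limit=5):
        """Most frequently co-viewed items first."""
        neighbours = self.co_views.get(item_id, Counter())
        return [other for other, _ in neighbours.most_common(limit)]


graph = CoViewGraph()
for user, item in [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]:
    graph.record_view(user, item)

print(graph.also_viewed("a"))
```

The query side becomes a lookup of one node's neighbours, at the cost of doing more work on every recorded view.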
Let's say we have a requirement to create a system that consumes a high-volume, real-time data stream of documents, and that matches those documents against a set of user-defined search queries as those documents become available. This is a prospective, as opposed to a retrospective, search service. What would be an appropriate persistence solution?
Suppose that users want to see a live feed of documents that match their queries--think Google Alerts--and that the feed must display certain metadata for each document. Let's assume an indefinite lifespan for matches; i.e., the system will allow the user to see all of the matches for a query from the time when the particular query was created. So the metadata for each document that comes in the stream, and the associations between the document and the user queries that matched that document, must be persisted to a database.
Let's throw in another requirement, that users want to be able to facet on some of the metadata: e.g., the user wants to see only the matching documents for a particular query whose metadata field "result type" equals "blog," and wants a count of the number of blog matches.
Here are some hypothetical numbers:
200,000 new documents in the data stream every day.
-The metadata for every document is persisted.
1000 users with about 5 search queries each: about 5000 total user search queries.
-These queries are simple boolean queries.
-As each new document comes in, it is processed against all 5000 queries to see which queries are a match.
Each feed--one for each user query--is refreshed to the user every minute. In other words, for every feed, a query to the database for the most recent page of matches is performed every minute.
Speed in displaying the feed to the user is of paramount importance. Scalability and high availability are essential as well.
The relationship between users and queries is relational, as is the relationship between queries and matching documents, but the document metadata itself are just key-value pairs. So my initial thought was to keep the relational data in a relational DB like MySQL and the metadata in a NoSQL DB, but can the faceting requirement be achieved in a NoSQL DB? Also, constructing a feed would then require making a call to two separate data stores, which is additional complexity. Or perhaps shove everything into MySQL, but this would entail lots of joins and counts. If we store all the data as key-value pairs in some other kind of data store, again, how would we do the faceting? And there would be a ton of redundant metadata for documents that match more than one search query.
What kind of database(s) would be a good fit for this scenario? I'm aware of tools such as Twitter Storm and Yahoo's S4, which could be used to construct the overall architecture of such a system, but I'd like to focus on the database, given the data storage, volume, and query/faceting requirements.
First, I disagree with Ben. 200k new records per day compares with 86,400 seconds in a day, so we are talking about two to three records per second (200,000 / 86,400 is roughly 2.3). This is not earth shattering, but it is a respectable clip for new data.
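For a rough sense of scale, the figures from the question can be turned into per-second rates. This is back-of-the-envelope arithmetic only; it assumes an even spread of documents over the day and a naive design in which every document is checked against every query.

```python
SECONDS_PER_DAY = 24 * 60 * 60

docs_per_day = 200_000   # from the question
total_queries = 5_000    # 1000 users x about 5 queries each
feed_refresh_seconds = 60

docs_per_second = docs_per_day / SECONDS_PER_DAY
# Naive matching: each incoming document is tested against all queries.
match_checks_per_second = docs_per_second * total_queries
# Each feed polls the database once per refresh interval.
feed_reads_per_second = total_queries / feed_refresh_seconds

print(f"documents/s:    {docs_per_second:.2f}")
print(f"match checks/s: {match_checks_per_second:.0f}")
print(f"feed reads/s:   {feed_reads_per_second:.1f}")
```

Under these assumptions the write rate is modest; the query-matching and feed-polling rates are the larger numbers to plan for.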
Second, I think this is a real problem that people face. I'm not going to be one that says that this forum is not appropriate for the topic.
I think the answer to the question has a lot to do with the complexity and type of user queries that are supported. If the queries consist of a bunch of binary predicates, for instance, then you can extract the particular rules from the document data and then readily apply the rules. If, on the other hand, the queries consist of complex scoring over the text of the documents, then you might need an inverted index paired with a scoring algorithm for each user query.
My approach to such a system would be to parse the queries into individual data elements that can be determined from each document (which I might call a "queries signature" since the results would contain all fields needed to satisfy the queries). This "queries signature" would be created each time a document was loaded, and it could then be used to satisfy the queries.
Adding a new query would require processing all the documents to assign new values. Given the volume of data, this might need to be more of a batch task.
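A very reduced sketch of the "queries signature" idea, in Python, might look like the following. It assumes the simplest possible query form (a set of required terms and a set of excluded terms) and whitespace tokenization; the class and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BooleanQuery:
    query_id: int
    required: frozenset                 # every term must appear in the document
    excluded: frozenset = frozenset()   # none of these terms may appear


def vocabulary(queries):
    """All terms that any registered query cares about."""
    terms = set()
    for q in queries:
        terms |= q.required | q.excluded
    return terms


def signature(text, vocab):
    """Computed once per document: the query-relevant terms it contains."""
    return vocab & set(text.lower().split())


def matching_queries(sig, queries):
    return [
        q.query_id
        for q in queries
        if q.required <= sig and not (q.excluded & sig)
    ]


queries = [
    BooleanQuery(1, frozenset({"mysql", "index"})),
    BooleanQuery(2, frozenset({"mongodb"}), frozenset({"mysql"})),
    BooleanQuery(3, frozenset({"index"})),
]
vocab = vocabulary(queries)
sig = signature("Choosing a MySQL index for view history", vocab)
print(sorted(sig), matching_queries(sig, queries))
```

Storing the signature alongside the document is what makes a later batch re-evaluation possible when a query is added, provided the new query only uses terms already in the vocabulary.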
Whether SQL is appropriate depends on the features that you need to extract from the data. This in turn depends on the nature of the user queries. It is possible that SQL is sufficient. On the other hand, you might need more sophisticated tools, especially if you are using text mining concepts for the queries.
Thinking about this, it sounds like an event-processing task rather than a regular data processing operation, so it might be worth investigating Complex Event Processing systems: rather than building everything on a regular database, use a system which processes the queries on the incoming data as it streams into the system. There are commercial systems which can hit the speed & high-availability criteria, but I haven't researched the available OSS options (luckily, people on Quora have done so).
Take a look at Elastic Search. It has a percolator feature that matches a document against registered queries.
http://www.elasticsearch.org/blog/2011/02/08/percolator.html
I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
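The size difference is easy to check in Python. Note that a plain SHA-1 hex digest is 40 characters; the 48-character IDs in the question presumably carry some extra characters, which is an assumption on my part. The input string below is arbitrary.

```python
import hashlib

digest_hex = hashlib.sha1(b"example record").hexdigest()  # text form
digest_raw = bytes.fromhex(digest_hex)                     # the underlying bytes

records = 500_000
MIB = 1024 * 1024

hex_mib = records * len(digest_hex) / MIB
raw_mib = records * len(digest_raw) / MIB

print(len(digest_hex), "hex characters ->", round(hex_mib, 1), "MiB")
print(len(digest_raw), "raw bytes      ->", round(raw_mib, 1), "MiB")
```

Converting back for display or for a Solr query is just `digest_raw.hex()`.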
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order: (search_id bigint unsigned, item_id binary(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
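The first option can be sketched with Python's sqlite3 module standing in for MySQL. The table and column names follow the description above; the helper function names, the WITHOUT ROWID clause (SQLite's way of clustering rows on the primary key), and the rule of expiring everything below a given search id are assumptions for illustration.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE saved_searches (
        search_id INTEGER NOT NULL,
        item_id   BLOB    NOT NULL,   -- raw 20-byte SHA-1, not hex text
        PRIMARY KEY (search_id, item_id)
    ) WITHOUT ROWID
""")


def save_dataset(conn, search_id, item_ids):
    """Insert the keys in sorted order so each search stays contiguous."""
    conn.executemany(
        "INSERT INTO saved_searches (search_id, item_id) VALUES (?, ?)",
        ((search_id, item_id) for item_id in sorted(item_ids)),
    )


def load_dataset(conn, search_id):
    rows = conn.execute(
        "SELECT item_id FROM saved_searches WHERE search_id = ? ORDER BY item_id",
        (search_id,),
    )
    return [row[0] for row in rows]


def expire_datasets(conn, oldest_live_search_id):
    """Drop every dataset older than the given search id in one range delete."""
    conn.execute(
        "DELETE FROM saved_searches WHERE search_id < ?",
        (oldest_live_search_id,),
    )


# Invented sample IDs.
ids = [hashlib.sha1(str(i).encode()).digest() for i in range(3)]
save_dataset(conn, 1, ids)
save_dataset(conn, 2, ids[:1])
print(len(load_dataset(conn, 1)), len(load_dataset(conn, 2)))
```

Because search ids only increase, expiry removes rows from the low end of the key range while new datasets are appended at the high end.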
I've asked a similar question on Meta Stack Overflow, but that deals specifically with whether or not Lucene.NET is used on Stack Overflow.
The purpose of the question here is more of a hypothetical, as to what approaches one would take if they were to use Lucene.NET as a basis for in-site search and other factors in a site like Stack Overflow [SO].
As per the entry on the Stack Overflow blog titled "SQL 2008 Full-Text Search Problems" there was a strong indication that Lucene.NET was being considered at some point, but it appears that is definitely not the case, as per the comment by Geoff Dalgas on February 19th 2010:
Lucene.NET is not being used for Stack Overflow - we are using SQL Server Full Text indexing. Search is an area where we continue to make minor tweaks.
So my question is, how would one utilize Lucene.NET into a site which has the same semantics of Stack Overflow?
Here is some background and what I've done/thought about so far (yes, I've been implementing most of this and search is the last aspect I have to complete):
Technologies:
ASP.NET MVC
SQL Server 2008
.NET 3.5
C# 3.0
And of course, the star of the show, Lucene.NET.
The intention is also to move to .NET/C# 4.0 ASAP. While I don't think it's a game-changer, it should be noted.
Before getting into aspects of Lucene.NET, it's important to point out the SQL Server 2008 aspects of it, as well as the models involved.
Models
This system has more than one primary model type in comparison to Stack Overflow. Some examples of these models are:
Questions: These are questions that people can ask. People can reply to questions, just like on Stack Overflow.
Notes: These are one-way projections, so as opposed to a question, you are making a statement about content. People can't post replies to this.
Events: This is data about a real-time event. It has location information, date/time information.
The important thing to note about these models:
They all have a Name/Title (text) property and a Body (HTML) property (the formats are irrelevant, as the content will be parsed appropriately for analysis).
Every instance of a model has a unique URL on the site
Then there are the things that Stack Overflow provides which IMO, are decorators to the models. These decorators can have different cardinalities, either being one-to-one or one-to-many:
Votes: Keyed on the user
Replies: Optional, as an example, see the Notes case above
Favorited: Is the model listed as a favorite of a user?
Comments: (optional)
Tag Associations: Tags are in a separate table, so as not to replicate the tag for each model. There is a link between the model and the tag associations table, and then from the tag associations table to the tags table.
And there are supporting tallies which in themselves are one-to-one decorators to the models that are keyed to them in the same way (usually by a model id type and the model id):
Vote tallies: Total positive, negative votes, Wilson Score interval (this is important, it's going to determine the confidence level based on votes for an entry, for the most part, assume the lower bound of the Wilson interval).
Replies (answers) are models that have most of the decorators that most models have, they just don't have a title or url, and whether or not a model has a reply is optional. If replies are allowed, it is of course a one-to-many relationship.
SQL Server 2008
The tables pretty much follow the layout of the models above, with separate tables for the decorators, as well as some supporting tables and views, stored procedures, etc.
It should be noted that the decision to not use full-text search is based primarily on the fact that it doesn't normalize scores like Lucene.NET. I'm open to suggestions on how to utilize text-based search, but I will have to perform searches across multiple model types, so keep in mind I'm going to need to normalize the score somehow.
Lucene.NET
This is where the big question mark is. Here are my thoughts so far on Stack Overflow functionality as well as how and what I've already done.
Indexing
Questions/Models
I believe each model should have an index of its own containing a unique id to quickly look it up based on a Term instance of that id (indexed, not analyzed).
In this area, I've considered having Lucene.NET analyze each question/model and each reply individually. So if there was one question and five answers, the question and each of the answers would be indexed as one unit separately.
The idea here is that the relevance score that Lucene.NET returns would be easier to compare between models that project in different ways (say, something without replies).
As an example, a question sets the subject, and then the answer elaborates on the subject.
For a note, which doesn't have replies, it handles the matter of presenting the subject and then elaborating on it.
I believe that this will help with making the relevance scores more relevant to each other.
Tags
Initially, I thought that these should be kept in a separate index with multiple fields which have the ids to the documents in the appropriate model index. Or, if that's too large, there is an index with just the tags and another index which maintains the relationship between the tags index and the questions they are applied to. This way, when you click on a tag (or use the URL structure), it's easy to see in a progressive manner that you only have to "buy into" if you succeed:
If the tag exists
Which questions the tags are associated with
The questions themselves
However, in practice, doing a query of all items based on tags (like clicking on a tag in Stack Overflow) is extremely easy with SQL Server 2008. Based on the model above, it simply requires a query such as:
select
    m.Name, m.Body
from
    Models as m
        left outer join TagAssociations as ta on
            ta.ModelTypeId = <fixed model type id> and
            ta.ModelId = m.Id
        left outer join Tags as t on t.Id = ta.TagId
where
    t.Name = <tag>
And since certain properties are shared across all models, it's easy enough to do a UNION between different model types/tables and produce a consistent set of results.
This would be analogous to a TermQuery in Lucene.NET (I'm referencing the Java documentation since it's comprehensive, and Lucene.NET is meant to be a line-by-line translation of Lucene, so all the documentation is the same).
The issue that comes up with using Lucene.NET here is that of sort order. The relevance score for a TermQuery when it comes to tags is irrelevant. It's either 1 or 0 (it either has it or it doesn't).
At this point, the confidence score (Wilson score interval) comes into play for ordering the results.
This score could be stored in Lucene.NET, but in order to sort the results on this field, it would rely on the values being stored in the field cache, which is something I really, really want to avoid. For a large number of documents, the field cache can grow very large (the Wilson score is a double, and you would need one double for every document, that can be one large array).
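The memory concern can be estimated with simple arithmetic. The sketch below assumes exactly one 8-byte double per document and ignores any per-segment or object overhead; the document counts are hypothetical.

```python
BYTES_PER_DOUBLE = 8


def field_cache_bytes(num_docs):
    """Lower-bound size of a one-double-per-document sort cache."""
    return num_docs * BYTES_PER_DOUBLE


for num_docs in (1_000_000, 10_000_000, 100_000_000):
    mib = field_cache_bytes(num_docs) / (1024 * 1024)
    print(f"{num_docs:>11,} documents -> {mib:8.1f} MiB")
```

The cost is linear in the number of documents and is paid per sortable field.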
Given that I can change the SQL statement to order based on the Wilson score interval like this:
select
    m.Name, m.Body
from
    Models as m
        left outer join TagAssociations as ta on
            ta.ModelTypeId = <fixed model type id> and
            ta.ModelId = m.Id
        left outer join Tags as t on t.Id = ta.TagId
        left outer join VoteTallyStatistics as s on
            s.ModelTypeId = ta.ModelTypeId and
            s.ModelId = ta.ModelId
where
    t.Name = <tag>
order by
    --- Use Id to break ties.
    s.WilsonIntervalLowerBound desc, m.Id
It seems like an easy choice to use this to handle the piece of Stack Overflow functionality "get all items tagged with <tag>".
Replies
Originally, I thought this is in a separate index of its own, with a key back into the Questions index.
I think that there should be a combination of each model and each reply (if there is one) so that relevance scores across different models are more "equal" when compared to each other.
This would of course bloat the index. I'm somewhat comfortable with that right now.
Or, is there a way to store say, the models and replies as individual documents in Lucene.NET and then take both and be able to get the relevance score for a query treating both documents as one? If so, then this would be ideal.
There is of course the question of what fields would be stored, indexed, analyzed (all operations can be separate operations, or mix-and-matched)? Just how much would one index?
What about using special stemmers/porters for spelling mistakes (using Metaphone) as well as synonyms (the community I will service has its own slang/terminology for certain things, with multiple representations)?
Boost
This is related to indexing of course, but I think it merits its own section.
Are you boosting fields and/or documents? If so, how do you boost them? Is the boost constant for certain fields? Or is it recalculated for fields where vote/view/favorite/external data is applicable.
For example, in the document, does the title get a boost over the body? If so, what boost factors do you think work well? What about tags?
The thinking here is the same as it is along the lines of Stack Overflow. Terms in the document have relevance, but if a document is tagged with the term, or it is in the title, then it should be boosted.
Shashikant Kore suggests a document structure like this:
Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined
And then using boost but not based on the raw vote value. I believe I have that covered with the Wilson Score interval.
The question is, should the boost be applied to the entire document? I'm leaning towards no on this one, because it would mean I'd have to reindex the document each time a user voted on the model.
Search for Items Tagged
I originally thought that when querying for a tag (by specifically clicking on one or using the URL structure for looking up tagged content), that's a simple TermQuery against the tag index for the tag, then in the associations index (if necessary) then back to questions, Lucene.NET handles this really quickly.
However, given the notes above regarding how easy it is to do this in SQL Server, I've opted for that route when it comes to searching tagged items.
General Search
So now, the most outstanding question is when doing a general phrase or term search against content, what and how do you integrate other information (such as votes) in order to determine the results in the proper order? For example, when performing this search on ASP.NET MVC on Stack Overflow, these are the tallies for the top five results (when using the relevance tab):
q votes  answers  accepted answer votes  asp.net highlights  mvc highlights
-------  -------  ---------------------  ------------------  --------------
     21       26                     51                   2               2
     58       23                     70                   2               5
     29       24                     40                   3               4
     37       15                     25                   1               2
     59       23                     47                   2               2
Note that the highlights are only in the title and abstract on the results page and are only minor indicators as to what the true term frequency is in the document, title, tag, reply (however they are applied, which is another good question).
How is all of this brought together?
At this point, I know that Lucene.NET will return a normalized relevance score, and the vote data will give me a Wilson score interval which I can use to determine the confidence score.
How should I look at combining these two scores to indicate the sort order of the result set based on relevance and confidence?
It is obvious to me that there should be some relationship between the two, but what that relationship should be evades me at this point. I know I have to refine it as time goes on, but I'm really lost on this part.
My initial thoughts are if the relevance score is between 0 and 1 and the confidence score is between 0 and 1, then I could do something like this:
1 / ((e ^ cs) * (e ^ rs))
This way, one gets a normalized value between e^-2 (about 0.135) and 1 that gets smaller the more relevant and confident the result is, and it can be sorted on that in ascending order.
The main issue with that is that if boosting is performed on the tag and or title field, then the relevance score is outside the bounds of 0 to 1 (the upper end becomes unbounded then, and I don't know how to deal with that).
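To see how the reciprocal formula behaves, here is a small Python sketch; the sample scores are invented. Since 1 / (e^cs * e^rs) equals e^-(cs + rs), sorting on it in ascending order gives the same ranking as sorting on the plain sum cs + rs in descending order, which may be a simpler thing to compute and also stays well defined if a boosted relevance score exceeds 1.

```python
import math


def combined_score(relevance, confidence):
    """1 / (e^cs * e^rs); smaller means more relevant and more confident."""
    return 1.0 / (math.exp(confidence) * math.exp(relevance))


# (id, relevance score, confidence score), invented sample values
results = [("a", 0.9, 0.2), ("b", 0.5, 0.9), ("c", 0.1, 0.1)]

by_formula = sorted(results, key=lambda r: combined_score(r[1], r[2]))
by_sum = sorted(results, key=lambda r: r[1] + r[2], reverse=True)

print([r[0] for r in by_formula])
print([r[0] for r in by_sum])
```

With both inputs in the range 0 to 1, the formula's output runs from 1 (both scores 0) down to e^-2 (both scores 1).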
Also, I believe I will have to adjust the confidence score to account for vote tallies that are completely negative. Since vote tallies that are completely negative result in a Wilson score interval with a lower bound of 0, something with -500 votes has the same confidence score as something with -1 vote, or 0 votes.
Fortunately, the upper bound decreases from 1 to 0 as negative vote tallies go up. I could change the confidence score to be a range from -1 to 1, like so:
confidence score = votetally < 0 ?
    -(1 - wilson score interval upper bound) :
    wilson score interval lower bound
The problem with this is that plugging in 0 into the equation will rank all of the items with zero votes below those with negative vote tallies.
To that end, I'm thinking if the confidence score is going to be used in a reciprocal equation like above (I'm concerned about overflow obviously), then it needs to be reworked to always be positive. One way of achieving this is:
confidence score = 0.5 +
    (votetally < 0 ?
        -(1 - wilson score interval upper bound) :
        wilson score interval lower bound) / 2
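The two confidence variants above can be sketched in Python as follows. The z value of 1.96 (a 95% confidence level), the choice to treat zero votes as the interval (0, 1), and the function names are assumptions for the example; the interval itself is the standard Wilson score interval for a binomial proportion.

```python
import math

Z_95 = 1.96  # assumed confidence level of 95%


def wilson_interval(positive, negative, z=Z_95):
    """Return (lower, upper) bounds of the Wilson score interval."""
    n = positive + negative
    if n == 0:
        return 0.0, 1.0  # assumption: no votes means no information
    p = positive / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)


def signed_confidence(positive, negative):
    """Range -1 to 1: negative tallies use the upper bound, others the lower."""
    lower, upper = wilson_interval(positive, negative)
    return -(1 - upper) if positive - negative < 0 else lower


def positive_confidence(positive, negative):
    """Same ordering as signed_confidence, rescaled into 0 to 1."""
    return 0.5 + signed_confidence(positive, negative) / 2


for votes in [(0, 500), (0, 1), (0, 0), (10, 0)]:
    print(votes, round(positive_confidence(*votes), 4))
```

In this sketch an item with no votes lands exactly at 0.5, heavily downvoted items approach 0, and well-supported items approach 1.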
My other concerns are how to actually perform the calculation given Lucene.NET and SQL Server. I'm hesitant to put the confidence score in the Lucene index because it requires use of the field cache, which can have a huge impact on memory consumption (as mentioned before).
An idea I had was to get the relevance score from Lucene.NET and then use a table-valued parameter to stream the score to SQL Server (along with the ids of the items to select), at which point I'd perform the calculation with the confidence score and then return the data properly ordered.
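The shape of that idea can be sketched in Python with sqlite3, where a temporary table stands in for the SQL Server table-valued parameter. VoteTallyStatistics is the table name used in the query earlier; the ConfidenceScore column, the SearchHits table, and the sample values are assumptions. The ordering uses relevance plus confidence in descending order, which ranks rows the same way as sorting 1 / (e^cs * e^rs) in ascending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE VoteTallyStatistics ("
    "ModelId INTEGER PRIMARY KEY, ConfidenceScore REAL NOT NULL)"
)
# Invented confidence scores.
conn.executemany(
    "INSERT INTO VoteTallyStatistics VALUES (?, ?)",
    [(1, 0.80), (2, 0.50), (3, 0.10)],
)


def order_hits(conn, hits):
    """hits: (model_id, relevance_score) pairs returned by the search index."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS SearchHits ("
        "ModelId INTEGER PRIMARY KEY, Relevance REAL NOT NULL)"
    )
    conn.execute("DELETE FROM SearchHits")
    conn.executemany("INSERT INTO SearchHits VALUES (?, ?)", hits)
    rows = conn.execute(
        """
        SELECT h.ModelId
        FROM SearchHits AS h
        JOIN VoteTallyStatistics AS s ON s.ModelId = h.ModelId
        ORDER BY h.Relevance + s.ConfidenceScore DESC, h.ModelId
        """
    ).fetchall()
    return [model_id for (model_id,) in rows]


print(order_hits(conn, [(1, 0.20), (2, 0.90), (3, 0.95)]))
```

Only the ids and scores of the current result page need to cross over to the database, so the confidence score never has to be stored in the search index.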
As stated before, there are a lot of other questions I have about this, and the answers have started to frame things, and will continue to expand upon things as the question and answers evolve.
The answers you are looking for really cannot be found using Lucene alone. You need ranking and grouping algorithms to filter and understand the data and how it relates. Lucene can help you get normalized data, but you need the right algorithm after that.
I would recommend you check out one or all of the following books, they will help you with the math and get you pointed in the right direction:
Algorithms of the Intelligent Web
Collective Intelligence in Action
Programming Collective Intelligence
The Lucene index will have the following fields:
Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined
All of these fields are analyzed. Length normalization is disabled to get better control over the scoring.
The aforementioned order of the fields also reflects their importance in descending order. That is, a query match in the title is more important than one in the accepted answer, everything else being equal.
The number of upvotes for the question and for the top answer can be captured by boosting those fields. But the raw upvote count cannot be used as a boost value, as it could skew results dramatically. (A question with 4 upvotes will get twice the score of one with 2 upvotes.) These values need to be dampened aggressively before they can be used as a boost factor. Using something like the natural logarithm (for upvotes > 3) looks good.
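As a rough illustration of that dampening, here is a small Python sketch. The function name, the neutral boost of 1.0 at or below the threshold, and the threshold of 3 are assumptions taken from the remark above, not tuned values; the result would be fed in as the field boost at index time.

```python
import math

def vote_boost(upvotes, threshold=3):
    """Dampen a raw upvote count into a boost factor.

    Assumption: counts at or below the threshold get a neutral boost of 1.0;
    above it, the natural logarithm of the count is used.
    """
    if upvotes <= threshold:
        return 1.0
    return math.log(upvotes)
```

With this shape, going from 2 to 4 upvotes raises the boost by well under a factor of two, and even 1000 upvotes yields a boost below 7.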
The title can be boosted by a value a little higher than that of the question.
Though inter-linking of questions is not very common, having a basic pagerank-like weight for a question could throw up some interesting results.
I do not consider the tags of a question to be very valuable information for search. Tags are nice when you just want to browse the questions. Most of the time, tags are part of the text, so a search for the tags will match the question anyway. This is open to discussion, though.
A typical search query will be performed on all the four fields.
+(title:query question:query accepted_answer:query all_combined:query)
This is a broad sketch and will require significant tuning to arrive at the right boost values and the right weights for queries, if required. Experimentation will show the right weights for the two dimensions of quality: relevance and importance. You can make things more complicated by introducing recency as a ranking parameter. The idea here is that if a problem occurs in a particular version of the product and is fixed in later revisions, the newer questions could be more useful to the user.
Some interesting twists to search could be added. Some form of basic synonym search could be helpful if only a "few" matching results are found. For example, "decrease java heap size" is the same as "reduce java heap size." But then it will also mean "map reduce" will start matching "map decrease." (A spell checker is the obvious addition, but I suppose programmers would spell their queries correctly.)
You've probably done more thinking on this subject than most folks who will try and answer you (part of the reason why it's been a day and I'm your first response, I'd imagine). I'm just going to try and tackle your final three questions, b/c there's just a lot there that I don't have time to go into, and I think those three are the most interesting (the physical implementation questions are probably going to wind up being 'pick something, and then tweak it as you learn more').
vote data Not sure that votes make something more relevant to a search, frankly; they just make it more popular. What I'm trying to say is that whether a given post is relevant to your question is mostly independent of whether it was relevant to other people. That said, there's probably at least a weak correlation between interesting questions and those that folks would want to find. Vote data is probably most useful in doing searches based purely on data, e.g. "most popular" type searches. In generic text-based searches, I'd probably not give votes any weight at first, but would consider working on an algorithm that gives them a slight weight in the sorting (so, not in which results are returned, but a minor boost to their ordering).
replies I'd agree w/ your approach here, subject to some testing; remember that this is going to have to be an iterative process based on user feedback (so you'll need to collect metrics on whether searches returned successful results for the searcher)
other Don't forget the user's score also. So, users get points on SO also, and that influences their default rank in the answers of each question they answer (looks like it's mostly for tiebreaking on replies that have the same number of bumps)
Determining relevance is always tricky. You need to figure out what you're trying to accomplish. Is your search trying to provide an exact match for a problem someone might have or is it trying to provide a list of recent items on a topic?
Once you've figured what you want to return you can look at the relative effect of each feature you're indexing. That will get a rough search going. From there you tweak based on user feedback (I suggest using implicit feedback instead of explicit otherwise you'll annoy the user).
As to indexing, you should try to put the data in so that each item has all the information necessary to rank it. This means you'll need to grab the data from a number of locations to build it up. Some indexing systems have the capability to add values to existing items which would make it easy to add scores to questions when subsequent answers came in. Simplicity would just have you rebuild the question every so often.
I think that Lucene is not good for this job.
You need something really fast with high availability... like SQL
But you want open source?
I would suggest you use Sphinx - http://www.sphinxsearch.com/
It's much better, and I am speaking from experience; I have used them both.
Sphinx is amazing. Really is.
I'm creating a forum app in php and have a question regarding database design:
I can get all the posts for a specific topic. All the posts have an auto_increment identity column as well as a timestamp.
Assuming I want to know who the topic starter was, which is the best solution?
Get all the posts for the topic and order by timestamp. But what happens if someone immediately replies to the topic? Then I have the first two posts with the same timestamp (unlikely but possible), and I can't know which one was first. This is also normalized but becomes expensive as the table grows.
Get all the posts for the topic and order by post_id. This is an auto_increment column. Can I be guaranteed that the database will use an index id by insertion order? Will a post inserted later always have a higher id than previous rows? What if I delete a post? Would my database reuse the post_id later? This is mysql I'm using.
The easiest way, of course, is to simply add a field to the Topics table with the topic_starter_id and be done with it. But it is not normalized. I believe this is also the most efficient method once the topic and post tables grow to millions of rows.
What is your opinion?
Zed's comment is pretty much spot on.
You generally want to achieve normalization, but denormalization can save potentially expensive queries.
In my experience writing forum software (five years commercially, five years as a hobby), this particular case calls for denormalization to save the single query. It's perfectly sane and acceptable to store both the first user's display name and id, as well as the last user's display name and id, just so long as the code that adds posts to topics always updates the record. You want one and only one code path here.
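A minimal sketch of that "one and only one code path" idea, in Python with an in-memory SQLite database standing in for MySQL. The table and column names (topics, posts, first_user_id, last_user_id) and the add_post helper are hypothetical, and only the ids are stored here; display names would be handled the same way.

```python
import sqlite3

def create_schema(conn):
    """Create hypothetical topics/posts tables with denormalized user ids."""
    conn.executescript("""
        CREATE TABLE topics (
            topic_id      INTEGER PRIMARY KEY,
            title         TEXT NOT NULL,
            first_user_id INTEGER,   -- denormalized: topic starter
            last_user_id  INTEGER    -- denormalized: most recent poster
        );
        CREATE TABLE posts (
            post_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            topic_id INTEGER NOT NULL REFERENCES topics(topic_id),
            user_id  INTEGER NOT NULL,
            body     TEXT NOT NULL
        );
    """)

def add_post(conn, topic_id, user_id, body):
    """The single code path that inserts a post and updates the topic row."""
    with conn:  # both statements commit together, or roll back together
        conn.execute(
            "INSERT INTO posts (topic_id, user_id, body) VALUES (?, ?, ?)",
            (topic_id, user_id, body),
        )
        conn.execute(
            "UPDATE topics "
            "SET first_user_id = COALESCE(first_user_id, ?), last_user_id = ? "
            "WHERE topic_id = ?",
            (user_id, user_id, topic_id),
        )
```

Because the first poster is written only when first_user_id is still NULL, reading the topic starter is a single-row lookup on the topics table, with no sorting of posts required.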
I must somewhat disagree with Charles on the fact that the only way to save on performance is to de-normalize to avoid an extra query.
To be more specific, there's an optimization that would work without denormalization (and attendant headaches of data maintenance/integrity), but ONLY if the user base is sufficiently small (let's say <1000 users, for the sake of argument - depends on your scale. Our apps use this approach with 10k+ mappings).
Namely, you have your application layer (code running on the web server) retrieve the list of users into a proper cache (e.g. one with data expiration facilities). Then, when you need to print the first/last user's name, you look it up in the cache on the server side.
This avoids an extra query for every page view (as you need to only retrieve the full user list ONCE per N page views, when cache expires or when user data is updated which should cause cache expiration).
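Here is a rough Python sketch of such a cache, just to show the shape of it. The class name, the load_users callback (which would run the one "fetch all users" query), and the 300-second default expiry are all assumptions for illustration; a real app would more likely use whatever caching facility its framework already provides.

```python
import time

class UserCache:
    """Caches the full user list and reloads it after a time-to-live."""

    def __init__(self, load_users, ttl_seconds=300, clock=time.monotonic):
        self._load_users = load_users  # callable returning {user_id: display_name}
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._users = {}
        self._expires_at = None        # None means "nothing loaded yet"

    def invalidate(self):
        """Call when user data changes so the next lookup reloads the list."""
        self._expires_at = None

    def display_name(self, user_id):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # One query for the whole list, amortized over many page views.
            self._users = dict(self._load_users())
            self._expires_at = now + self._ttl
        return self._users.get(user_id)
```

Page rendering then calls display_name(first_user_id) and display_name(last_user_id) without touching the database, and the code that edits user records calls invalidate().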
It adds a wee bit of CPU time and memory usage on the web server, but in Yet Another Holy War (e.g. spend more resources on the DB side or the app server side) I'm firmly in the "don't waste DB resources" camp, seeing how scaling up a DB is vastly harder than scaling up a web or app server.
And yes, if that (or another equally tricky) optimization is not feasible, I agree with Charles and Zed that you have a trade-off between normalization (fewer headaches related to data integrity) and performance gain (one less table to join in some queries). Since I'm an agnostic in that particular Holy War, I just go with what gives the better marginal benefit (e.g. how much performance loss vs. how much cost/risk from de-normalization).