Exclude duplicate results from Solr query based on highlight snippets?

The scene:
I have indexed many websites using Nutch and Solr. I've implemented result grouping by site. My results output includes the page title, highlight snippets and URL. My issue is with the page navigation/copyright/company info bits that appear on many company sites.
A query for "solder", for example, may return 200+ results for a particular site -- but only a handful of the results are actually appropriate; perhaps the company's site structure includes "solder" on every page as part of their core business description, site navigation, etc. There are relevant results to see, but they're flooded by the irrelevant, repetitive matches from the other pages on the site.
The problem:
I've seen other postings asking how to prevent Nutch and Solr from indexing site headers, footers, navigation and other boilerplate, but with such a diverse group of sites, that approach just isn't feasible. What I'm observing, however, is that although the content of each result is significantly different, the highlighted snippets returned are 90-100% identical for the results I don't want. Observe:
Products | Alloy Information || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry
http://www.--------.com/Products/AlloyInformation.aspx
Products | Chemicals & Cleaners || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Industrial Division Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales
http://www.--------.com/Products/ChemicalsCleaners.aspx
Products | Rosin Based || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Products/RosinBased.aspx
Support | Engineering Guide || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Support Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Support/EngineeringGuide.aspx
The Big Idea:
This leads me to the question of whether I can filter or group results based on the highlighted snippets that are returned. I can't just group on the content field because 1) the field is huge; and 2) the content is very different from page to page. If I could group, exclude or deduplicate results whose snippets were >85% identical, that would probably solve the problem. Perhaps some sort of post-processing step, or some kind of tokenizer factory? Or a sort of IDF computed over the search results rather than the entire document set?
This seems like it would be a fairly common problem, and perhaps I've just missed how to do it. Essentially this is Google's "To blah blah your search, we have hidden xxx similar results. Click here to show them" feature.
Thoughts?

I don't think there is any way of doing exactly what you are asking, except post-processing, which would be up to you and would not be very efficient for larger result sets.
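For what it's worth, a minimal sketch of that post-processing in Python, approximating the >85% snippet-similarity test from the question with difflib (the result/snippet shape is an assumption about how you parse the Solr response):

from difflib import SequenceMatcher

def dedupe_by_snippet(results, threshold=0.85):
    # `results` is a list of dicts with a 'snippet' key -- hypothetical
    # shape, adapt to however you read the Solr highlighting section.
    kept = []
    for result in results:
        snippet = result['snippet']
        if any(SequenceMatcher(None, snippet, k['snippet']).ratio() >= threshold
               for k in kept):
            continue  # near-duplicate of a result we already kept
        kept.append(result)
    return kept

Note this compares every candidate against everything already kept, so it is quadratic in the page size -- workable for one page of results, not for thousands.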
Maybe you should ask a different question if the documents being returned are actually quite different, even though the snippets are identical. If the documents are different, presumably there is value in showing them all, rather than de-duplicating.
You could try enhancing the search result display to show more information about the documents so that the user can discriminate amongst them - maybe not relying on highlighting, but showing some other parts of the document as well?
I really do think though that at the heart of the problem is the need to make matches found in site boilerplate less relevant than matches found elsewhere. Usually relevance ranking does a good job of this, because common terms are much less important for relevance ranking, but if you are mixing documents from a wide range of different sites you might find the effect less pronounced - since oft-repeated terms on one site could be quite rare on another. If your results are truly segmented by site, you might consider creating separate indexes (cores) for each site - this would have the effect of performing the relevance calculations in a site-specific way, and might help with this problem.

In the base Nutch distribution (not Solr) there is a clustering mechanism. I don't really know how it works, but it does something along these lines, which I had to remove. Have you looked at that?
Another idea that comes to mind: index the real content separately from the navigational snippets, and at search time apply a higher query weight to the 'real content' field.
That would pull forward pages with 'solder' in the content, as opposed to pages with 'solder' only in the navigation, and yet you keep all pages just in case.
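A rough sketch of that kind of weighted query against Solr's edismax parser (the field names content and nav, the core name, and the boost values are all assumptions about how the split index would look):

import requests

params = {
    'q': 'solder',
    'defType': 'edismax',
    'qf': 'content^10 nav^0.1',  # weight real content far above navigation text
    'hl': 'true',
    'hl.fl': 'content',          # highlight only the real-content field
    'wt': 'json',
}
resp = requests.get('http://localhost:8983/solr/collection1/select', params=params)
print(resp.json()['response']['numFound'])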
Hope I understood your problem correctly.

Related

Recursive "incategory" CirrusSearch query

As a follow-up to this question, I'd like to know if there is a way to perform a recursive search within a particular MediaWiki category. Currently that only seems to search within the category given, not its sub-categories.
The only alternative I found was to concatenate all the sub-categories into one big query, but that does not seem to scale for really large categories, as MediaWiki (on Commons) says "Query was not understood. Please make it simpler."
What other options are there to recursively search within a particular category?
No. MediaWiki categories are not hierarchical, so you will encounter loops (Category:A < Category:B < Category:A), category trees branching and re-merging, and all kinds of other weird things. In large wikis, such as Wikipedia, it is also not very useful, because there is often no well defined ontology for categorization. If you traverse the category of the nations of the world on enwp, for instance, you will get the article about the pope somewhere in there (because the popes are categorized under the Vatican City State).
That's why Deepcat had to be developed.
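If you did want to roll the traversal yourself, here is a client-side sketch in Python using the standard categorymembers API module (the endpoint and the depth limit are assumptions; result continuation via cmcontinue is omitted for brevity):

import requests

API = 'https://commons.wikimedia.org/w/api.php'  # or your wiki's endpoint

def category_pages(root, max_depth=3):
    # Breadth-first walk of `root` (e.g. 'Category:Ships') and its
    # subcategories. The visited set guards against the A < B < A
    # loops described above; max_depth bounds runaway branches.
    visited, pages, frontier = {root}, [], [(root, 0)]
    while frontier:
        cat, depth = frontier.pop(0)
        params = {'action': 'query', 'list': 'categorymembers',
                  'cmtitle': cat, 'cmtype': 'page|subcat',
                  'cmlimit': 'max', 'format': 'json'}
        data = requests.get(API, params=params).json()
        for member in data['query']['categorymembers']:
            if member['ns'] == 14:  # namespace 14 = Category
                if member['title'] not in visited and depth < max_depth:
                    visited.add(member['title'])
                    frontier.append((member['title'], depth + 1))
            else:
                pages.append(member['title'])
    return pages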

Implementing search on medical link list/table that allows for synonyms/abbreviations - and importing such a thing

I'm making a simple searchable list which will end up containing about 100,000 links on various medical topics- mostly medical conditions/diseases.
Now on the surface of things this sounds easy... in fact I've set my tables up in the following way:
Links: id, url, name, topic
Topics (eg cardiology, paediatrics etc): id, name
Conditions (eg asthma, influenza etc): id, name, aliases
And possibly another table:
Link & condition (since 1 link can pertain to multiple conditions): link id, condition id
So basically since doctors (including myself) are super fussy, I want to make it so that if you're searching for a condition- whether it be an abbreviation, british or american english, or an alternative ancient name- you get relevant results (eg "angiooedema", "angioedema", "Quincke's edema" etc would give you the same results; similarly with "gastroesophageal reflux" "gastro-oesophageal reflux disease", GERD, GORD, GOR). Additionally, at the top of the results it would be good to group together links for a diagnosis that matches the search string, then have matches to link name, then finally matches to the topic.
My main problem is that there are thousands if not tens of thousands of conditions, each with up to 20 synonyms/spellings etc. One option is to get data from MeSH which happens to be a sort of medical thesaurus (but in american english only so there would have to be a way of converting from british english). The trouble being that the XML they provide is INSANE and about 250mb. To help they have got a guide to what the data elements are.
Honestly, I am at a loss as to how to tackle this most effectively as I've just started programming and working with databases and most of the possibilities of what to do seem difficult/suboptimal.
Was wondering if anyone could give me a hand? Happy to clarify anything that is unclear.
Your problem is well suited to a document-oriented store such as Lucene. For example you can design a schema such as
Link
Topic
Conditions
Then you can write a Lucene query such as Topic:edema and you should get all results.
You can do wildcard search for more.
To match British spellings (or even misspellings) you can use the ~ (fuzzy) query, which finds terms within a certain string distance. For example, edema~0.5 matches oedema, oedoema and so on...
Apache Lucene is a Java library with ports available for most major languages. Apache Solr is a full-fledged search server built on the Lucene library and easily integrable into your platform of choice because it has a RESTful API.
Summary: my recommendation is to use Apache Solr as an adjunct to your MySQL db.
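To make the synonym side concrete, a minimal sketch of query-time expansion, assuming you have already distilled MeSH entry terms (plus hand-added British spellings and abbreviations) into flat synonym groups -- the data shape here is illustrative, not a MeSH format:

SYNONYMS = [
    {'gastroesophageal reflux', 'gastro-oesophageal reflux disease',
     'GERD', 'GORD', 'GOR'},
    {'angioedema', 'angiooedema', "Quincke's edema"},
]

def expand_query(term):
    # Return every known variant of `term` so they can all be
    # OR'ed together in the search query.
    needle = term.lower()
    for group in SYNONYMS:
        if needle in {s.lower() for s in group}:
            return sorted(group)
    return [term]

# expand_query('GORD') -> all five reflux variants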
It's hard. Your best bet is to use MeSH and then perhaps Soundex to match on British English terms.
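Soundex is simple enough to sketch directly. One caveat: it keeps the first letter verbatim, so pairs like "edema"/"oedema" still get different codes; it helps with consonant-level variation such as "paediatrics"/"pediatrics" (both code to P336).

def soundex(word):
    # Classic American Soundex, sketched from the published rules.
    codes = {'bfpv': '1', 'cgjkqsxz': '2', 'dt': '3',
             'l': '4', 'mn': '5', 'r': '6'}
    def code(c):
        for letters, digit in codes.items():
            if c in letters:
                return digit
        return ''  # vowels plus h, w, y carry no code
    word = word.lower()
    result, prev = word[0].upper(), code(word[0])
    for c in word[1:]:
        digit = code(c)
        if digit and digit != prev:
            result += digit
        if c not in 'hw':  # h and w do not reset the previous code
            prev = digit
    return (result + '000')[:4]

assert soundex('paediatrics') == soundex('pediatrics') == 'P336'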

How to get the total count of every product attribute/filter like newegg

If you go to newegg.com (just one example) you'll notice while browsing products you can see the number of items next to each product attribute in the left hand sidebar.
With so many attributes on some items and so many different configurations of product filters how do they calculate all of those totals so fast?
For newegg.com, they are using faceted navigation technology provided by Endeca.
In a nutshell, Endeca will use data provided in XML/CSV, or retrieve data directly from any database (not limited to just MySQL), calculate similarity, and group the results into its own format.
Endeca is not free; open-source alternatives include Sphinx and Lucene/Solr.
Newegg uses Endeca, and they were probably one of Endeca's earlier customers. In retrospect, Endeca might have been a big contributor to their success. Faceted navigation works very well on complex electronics like computer parts.
There are a few things to consider in faceted navigation:
1) Do you want just faceted navigation on category-driven queries, or do you also want it to work on search? In fact, categories are a hierarchical facet of sorts.
2) Does the de-normalized inverted index model of Solr cause you problems?
If the answer to 1) is true -- it probably is -- you'll need some inverted indices. Inverted indices are pretty much the only way to do keyword search. They will also do faceting with some caveats.
Essentially you can consider each facet as an inverted index (in fact keyword search might be considered a special facet with ranking functions). Then to do counts you'd have to intersect/and the current query and filters with all other facet values. However, this model can lead to problems if you need to represent sparse product sets (see 2).
If the answer to 2) is true, it might help more to think about facets more in terms of OLAP. I don't know if inverted indices can handle complex relationships without some abstractions.
It's fair to consider and implement faceted search/nav as a blend of full-text search (typically implemented as an inverted index) and OLAP.
I'm pretty sure you can pull off faceting with a column store, but you'd still need to have an inverted index at your disposal to merge with if you want keyword search.
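As a toy illustration of that intersect-and-count model (made-up facet data; a Python set stands in for a compressed posting list):

# Inverted index per facet: value -> set of matching document ids.
brand = {'Intel': {1, 2, 3, 4}, 'AMD': {5, 6, 7}}
socket = {'LGA1155': {1, 2, 5}, 'AM3': {6, 7}}

def facet_counts(current_result_ids, facet):
    # For each facet value, count how many docs in the current result
    # set would remain if that value were selected.
    return {value: len(current_result_ids & ids)
            for value, ids in facet.items()}

results = {1, 2, 5, 6}  # docs matching the current query + filters
print(facet_counts(results, brand))   # {'Intel': 2, 'AMD': 2}
print(facet_counts(results, socket))  # {'LGA1155': 3, 'AM3': 1}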
@Dan Grossman:
It might seem so, BUT --
Did you think for a moment how many combinations there are of facets? You can't cache so many pages like that. There are probably more combinations on Newegg.com than stars in your sky.
Add in multiple selection and it's even worse. Game over.
You can only cache some cases like unfiltered and commonly filtered. If you try to spider Newegg.com without limiting levels of recursion, you'll kill the spider. Faceted sites cause problems for search engines in general for this very reason. See http://www.searchmarketingstandard.com/facets-navigational-seo-powerhouse-part
You do not know that they calculate them fast. You only know that they render them fast. They could spend hours calculating those totals and rendering their pages, cache the results and serve those static files until some time when they want to refresh the data.

How would one use Lucene.NET to help implement search on a site like Stack Overflow?

I've asked a similar question on Meta Stack Overflow, but that deals specifically with whether or not Lucene.NET is used on Stack Overflow.
The purpose of the question here is more of a hypothetical, as to what approaches one would take if one were to use Lucene.NET as a basis for in-site search and other factors in a site like Stack Overflow [SO].
As per the entry on the Stack Overflow blog titled "SQL 2008 Full-Text Search Problems" there was a strong indication that Lucene.NET was being considered at some point, but it appears that is definitely not the case, as per the comment by Geoff Dalgas on February 19th 2010:
Lucene.NET is not being used for Stack Overflow - we are using SQL Server Full Text indexing. Search is an area where we continue to make minor tweaks.
So my question is, how would one utilize Lucene.NET in a site which has the same semantics as Stack Overflow?
Here is some background and what I've done/thought about so far (yes, I've been implementing most of this and search is the last aspect I have to complete):
Technologies:
ASP.NET MVC
SQL Server 2008
.NET 3.5
C# 3.0
And of course, the star of the show, Lucene.NET.
The intention is also to move to .NET/C# 4.0 ASAP. While I don't think it's a game-changer, it should be noted.
Before getting into aspects of Lucene.NET, it's important to point out the SQL Server 2008 aspects of it, as well as the models involved.
Models
Unlike Stack Overflow, this system has more than one primary model type. Some examples of these models are:
Questions: These are questions that people can ask. People can reply to questions, just like on Stack Overflow.
Notes: These are one-way projections, so as opposed to a question, you are making a statement about content. People can't post replies to this.
Events: This is data about a real-time event. It has location information, date/time information.
The important thing to note about these models:
They all have a Name/Title (text) property and a Body (HTML) property (the formats are irrelevant, as the content will be parsed appropriately for analysis).
Every instance of a model has a unique URL on the site
Then there are the things that Stack Overflow provides which, IMO, are decorators on the models. These decorators can have different cardinalities, either one-to-one or one-to-many:
Votes: Keyed on the user
Replies: Optional, as an example, see the Notes case above
Favorited: Is the model listed as a favorite of a user?
Comments: (optional)
Tag Associations: Tags are in a separate table, so as not to replicate the tag for each model. There is a link between the model and the tag associations table, and then from the tag associations table to the tags table.
And there are supporting tallies which in themselves are one-to-one decorators to the models that are keyed to them in the same way (usually by a model id type and the model id):
Vote tallies: Total positive and negative votes, plus the Wilson score interval (this is important; it's going to determine the confidence level based on votes for an entry - for the most part, assume the lower bound of the Wilson interval; a sketch of the calculation appears below).
Replies (answers) are models that have most of the decorators other models have; they just don't have a title or URL, and whether or not a model has replies is optional. If replies are allowed, it is of course a one-to-many relationship.
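For reference, a sketch of the Wilson lower bound mentioned in the vote-tallies item above (this is the standard formula; z = 1.96 corresponds to the usual 95% confidence level):

import math

def wilson_lower_bound(upvotes, downvotes, z=1.96):
    # Lower bound of the Wilson score interval for the proportion
    # of positive votes.
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)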
SQL Server 2008
The tables pretty much follow the layout of the models above, with separate tables for the decorators, as well as some supporting tables and views, stored procedures, etc.
It should be noted that the decision to not use full-text search is based primarily on the fact that it doesn't normalize scores like Lucene.NET. I'm open to suggestions on how to utilize text-based search, but I will have to perform searches across multiple model types, so keep in mind I'm going to need to normalize the score somehow.
Lucene.NET
This is where the big question mark is. Here are my thoughts so far on Stack Overflow functionality as well as how and what I've already done.
Indexing
Questions/Models
I believe each model should have an index of its own containing a unique id to quickly look it up based on a Term instance of that id (indexed, not analyzed).
In this area, I've considered having Lucene.NET analyze each question/model and each reply individually. So if there were one question and five answers, the question and each of the answers would each be indexed as its own unit.
The idea here is that the relevance score that Lucene.NET returns would be easier to compare between models that project in different ways (say, something without replies).
As an example, a question sets the subject, and then the answer elaborates on the subject.
A note, which doesn't have replies, presents the subject and then elaborates on it within a single document.
I believe that this will help with making the relevance scores more relevant to each other.
Tags
Initially, I thought that these should be kept in a separate index with multiple fields which hold the ids of the documents in the appropriate model index. Or, if that's too large, there is an index with just the tags and another index which maintains the relationship between the tags index and the questions they are applied to. This way, when you click on a tag (or use the URL structure), the lookups proceed progressively, and you only have to "buy into" each step if the previous one succeeds:
If the tag exists
Which questions the tags are associated with
The questions themselves
However, in practice, doing a query of all items based on tags (like clicking on a tag in Stack Overflow) is extremely easy with SQL Server 2008. Based on the model above, it simply requires a query such as:
select
m.Name, m.Body
from
Models as m
left outer join TagAssociations as ta on
ta.ModelTypeId = <fixed model type id> and
ta.ModelId = m.Id
left outer join Tags as t on t.Id = ta.TagId
where
t.Name = <tag>
And since certain properties are shared across all models, it's easy enough to do a UNION between different model types/tables and produce a consistent set of results.
This would be analogous to a TermQuery in Lucene.NET (I'm referencing the Java documentation since it's comprehensive, and Lucene.NET is meant to be a line-by-line translation of Lucene, so all the documentation is the same).
The issue that comes up with using Lucene.NET here is that of sort order. The relevance score for a TermQuery when it comes to tags is irrelevant. It's either 1 or 0 (it either has it or it doesn't).
At this point, the confidence score (Wilson score interval) comes into play for ordering the results.
This score could be stored in Lucene.NET, but in order to sort the results on this field, it would rely on the values being stored in the field cache, which is something I really, really want to avoid. For a large number of documents, the field cache can grow very large (the Wilson score is a double, and you would need one double for every document, which adds up to one large array).
Given that I can change the SQL statement to order based on the Wilson score interval like this:
select
m.Name, m.Body
from
Models as m
left outer join TagAssociations as ta on
ta.ModelTypeId = <fixed model type id> and
ta.ModelId = m.Id
left outer join Tags as t on t.Id = ta.TagId
left outer join VoteTallyStatistics as s on
s.ModelTypeId = ta.ModelTypeId and
s.ModelId = ta.ModelId
where
t.Name = <tag>
order by
--- Use Id to break ties.
s.WilsonIntervalLowerBound desc, m.Id
It seems like an easy choice to use this to handle the piece of Stack Overflow functionality "get all items tagged with <tag>".
Replies
Originally, I thought these would live in a separate index of their own, with a key back into the Questions index.
I think that there should be a combination of each model and each reply (if there is one) so that relevance scores across different models are more "equal" when compared to each other.
This would of course bloat the index. I'm somewhat comfortable with that right now.
Or, is there a way to store say, the models and replies as individual documents in Lucene.NET and then take both and be able to get the relevance score for a query treating both documents as one? If so, then this would be ideal.
There is of course the question of which fields would be stored, indexed, and analyzed (these can be separate operations, or mixed and matched). Just how much would one index?
What about using special stemmers/porters for spelling mistakes (using Metaphone), as well as synonyms (there is terminology in the community I will service which has its own slang/terminology for certain things, with multiple representations)?
Boost
This is related to indexing of course, but I think it merits its own section.
Are you boosting fields and/or documents? If so, how do you boost them? Is the boost constant for certain fields? Or is it recalculated for fields where vote/view/favorite/external data is applicable.
For example, in the document, does the title get a boost over the body? If so, what boost factors do you think work well? What about tags?
The thinking here is the same as it is along the lines of Stack Overflow. Terms in the document have relevance, but if a document is tagged with the term, or it is in the title, then it should be boosted.
Shashikant Kore suggests a document structure like this:
Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined
And then using boost but not based on the raw vote value. I believe I have that covered with the Wilson Score interval.
The question is, should the boost be applied to the entire document? I'm leaning towards no on this one, because it would mean I'd have to reindex the document each time a user voted on the model.
Search for Items Tagged
I originally thought that when querying for a tag (by specifically clicking on one or using the URL structure for looking up tagged content), it would be a simple TermQuery against the tag index for the tag, then a lookup in the associations index (if necessary), then back to the questions; Lucene.NET handles this really quickly.
However, given the notes above regarding how easy it is to do this in SQL Server, I've opted for that route when it comes to searching tagged items.
General Search
So now, the most outstanding question is when doing a general phrase or term search against content, what and how do you integrate other information (such as votes) in order to determine the results in the proper order? For example, when performing this search on ASP.NET MVC on Stack Overflow, these are the tallies for the top five results (when using the relevance tab):
q votes   answers   accepted answer votes   asp.net highlights   mvc highlights
-------   -------   ---------------------   ------------------   --------------
     21        26                      51                    2                2
     58        23                      70                    2                5
     29        24                      40                    3                4
     37        15                      25                    1                2
     59        23                      47                    2                2
Note that the highlights are only in the title and abstract on the results page and are only minor indicators as to what the true term frequency is in the document, title, tag, reply (however they are applied, which is another good question).
How is all of this brought together?
At this point, I know that Lucene.NET will return a normalized relevance score, and the vote data will give me a Wilson score interval which I can use to determine the confidence score.
How should I look at combining these two scores to indicate the sort order of the result set based on relevance and confidence?
It is obvious to me that there should be some relationship between the two, but what that relationship should be evades me at this point. I know I have to refine it as time goes on, but I'm really lost on this part.
My initial thoughts are if the relevance score is between 0 and 1 and the confidence score is between 0 and 1, then I could do something like this:
1 / ((e ^ cs) * (e ^ rs))
This way, one gets a normalized value that approaches 0 the more relevant and confident the result is, and it can be sorted on that.
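Concretely, since 1 / ((e ^ cs) * (e ^ rs)) = e ^ -(cs + rs), the sort key could be computed like this (values are purely illustrative):

import math

def combined_sort_key(relevance, confidence):
    # Approaches 0 as both scores approach 1, so sort ascending.
    return math.exp(-(confidence + relevance))

results = [('A', 0.9, 0.8), ('B', 0.9, 0.1), ('C', 0.2, 0.8)]
for name, rs, cs in sorted(results, key=lambda r: combined_sort_key(r[1], r[2])):
    print(name)  # A first; B and C tie at e^-1.0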
The main issue with that is that if boosting is performed on the tag and/or title field, then the relevance score falls outside the bounds of 0 to 1 (the upper end becomes unbounded, and I don't know how to deal with that).
Also, I believe I will have to adjust the confidence score to account for vote tallies that are completely negative. Since vote tallies that are completely negative result in a Wilson score interval with a lower bound of 0, something with -500 votes has the same confidence score as something with -1 vote, or 0 votes.
Fortunately, the upper bound decreases from 1 to 0 as negative vote tallies go up. I could change the confidence score to be a range from -1 to 1, like so:
confidence score = votetally < 0 ?
-(1 - wilson score interval upper bound) :
wilson score interval lower bound
The problem with this is that plugging in 0 into the equation will rank all of the items with zero votes below those with negative vote tallies.
To that end, I'm thinking if the confidence score is going to be used in a reciprocal equation like above (I'm concerned about overflow obviously), then it needs to be reworked to always be positive. One way of achieving this is:
confidence score = 0.5 +
(votetally < 0 ?
-(1 - wilson score interval upper bound) :
wilson score interval lower bound) / 2
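As runnable code, that rescaling might look like this (the Wilson bounds come from wherever you compute the interval):

def adjusted_confidence(vote_tally, wilson_lower, wilson_upper):
    # Net-negative tallies use the reflected upper bound so that
    # -500 votes ranks below -1 vote; the 0.5 offset and halving map
    # the result onto [0, 1] so the reciprocal combination stays sane.
    raw = -(1 - wilson_upper) if vote_tally < 0 else wilson_lower
    return 0.5 + raw / 2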
My other concerns are how to actually perform the calculation given Lucene.NET and SQL Server. I'm hesitant to put the confidence score in the Lucene index because it requires use of the field cache, which can have a huge impact on memory consumption (as mentioned before).
An idea I had was to get the relevance score from Lucene.NET and then use a table-valued parameter to stream the score to SQL Server (along with the ids of the items to select), at which point I'd perform the calculation with the confidence score and then return the data properly ordered.
As stated before, there are a lot of other questions I have about this; the answers have started to frame things, and will continue to expand upon things as the question and answers evolve.
The answers you are looking for really cannot be found using Lucene alone. You need ranking and grouping algorithms to filter and understand the data and how it relates. Lucene can help you get normalized data, but you need the right algorithm after that.
I would recommend you check out one or all of the following books, they will help you with the math and get you pointed in the right direction:
Algorithms of the Intelligent Web
Collective Intelligence in Action
Programming Collective Intelligence
The Lucene index will have the following fields:
Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined
All these fields are analyzed. Length normalization is disabled to get better control over the scoring.
The order of the fields above also reflects their importance, in descending order. That is, a query match in the title is more important than one in the accepted answer, everything else remaining the same.
The number of upvotes for the question and for the top answer can be captured by boosting those fields. But the raw upvote count cannot be used as a boost value, as it could skew results dramatically. (A question with 4 upvotes would get twice the score of one with 2 upvotes.) These values need to be dampened aggressively before they can be used as boost factors. Using something like the natural logarithm (for upvotes > 3) looks good.
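That dampening might look something like this (the threshold of 3 and the +1 floor are judgment calls, not prescriptions):

import math

def vote_boost(upvotes):
    # Raw counts skew scoring (4 upvotes would double 2 upvotes),
    # so grow the boost logarithmically past a small threshold.
    if upvotes <= 3:
        return 1.0
    return 1.0 + math.log(upvotes)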
The title can be boosted by a value a little higher than that of the question.
Though inter-linking of questions is not very common, having a basic pagerank-like weight for a question could throw up some interesting results.
I do not consider the tags of a question very valuable information for search. Tags are nice when you just want to browse the questions. Most of the time, the tags are part of the text anyway, so a search for the tag terms will match the question regardless. This is open to discussion, though.
A typical search query will be performed on all the four fields.
+(title:query question:query accepted_answer:query all_combined:query)
This is a broad sketch and will require significant tuning to arrive at the right boost values and the right weights for queries, if required. Experimentation will show the right weights for the two dimensions of quality - relevance and importance. You can make things more complicated by introducing recency as a ranking parameter. The idea here is, if a problem occurs in a particular version of the product and is fixed in later revisions, the newer questions could be more useful to the user.
Some interesting twists to search could be added. Some form of basic synonym search could be helpful if only a "few" matching results are found. For example, "decrease java heap size" is the same as "reduce java heap size." But then it will also mean "map reduce" will start matching "map decrease." (A spell checker is obvious, but I suppose programmers would spell their queries correctly.)
You've probably done more thinking on this subject than most folks who will try and answer you (part of the reason why it's been a day and I'm your first response, I'd imagine). I'm just going to try and tackle your final three questions, b/c there's just a lot there that I don't have time to go into, and I think those three are the most interesting (the physical implementation questions are probably going to wind up being 'pick something, and then tweak it as you learn more').
vote data Not sure that votes make something more relevant to a search, frankly; they just make it more popular. If that makes sense, I'm trying to say that whether a given post is relevant to your question is mostly independent of whether it was relevant to other people. That said, there's probably at least a weak correlation between interesting questions and those that folks would want to find. Vote data is probably most useful in doing searches based purely on data, e.g. "most popular" type searches. In generic text-based searches, I'd probably not provide any weight for votes at first, but would consider working on an algorithm that perhaps provides a slight weight for the sorting (so, not the results returned, but a minor boost to the ordering of them).
replies I'd agree w/ your approach here, subject to some testing; remember that this is going to have to be an iterative process based on user feedback (so you'll need to collect metrics on whether searches returned successful results for the searcher)
other Don't forget the user's score also. Users get points on SO too, and that influences their default rank in the answers of each question they answer (it looks like it's mostly for tiebreaking on replies that have the same number of bumps).
Determining relevance is always tricky. You need to figure out what you're trying to accomplish. Is your search trying to provide an exact match for a problem someone might have or is it trying to provide a list of recent items on a topic?
Once you've figured what you want to return you can look at the relative effect of each feature you're indexing. That will get a rough search going. From there you tweak based on user feedback (I suggest using implicit feedback instead of explicit otherwise you'll annoy the user).
As to indexing, you should try to put the data in so that each item has all the information necessary to rank it. This means you'll need to grab the data from a number of locations to build it up. Some indexing systems have the capability to add values to existing items which would make it easy to add scores to questions when subsequent answers came in. Simplicity would just have you rebuild the question every so often.
I think that Lucene is not good for this job.
You need something really fast with high availability... like SQL.
But you want open source?
I would suggest you use Sphinx - http://www.sphinxsearch.com/
It's much better, and I am speaking from experience; I have used them both.
Sphinx is amazing. Really is.

SEO for common misspellings

Here's the general question that I'm asking:
How do you optimise your website so that searches using common misspellings of your name find their way to you?
And my specific situation:
At my company, we sell online education courses. These are given a code of two letters followed by two numbers, eg: BE01, BE02, IH01.
These courses have been around for some time now (9 years, which is like 63 internet years or something), and since our target market is fairly niche, most of our marketing comes from word-of-mouth from the small community.
I was looking at our statistics to see the search keywords used to get to our site, and the highest ranked one which wasn't just our company name was "BE10", which is one of our least popular courses. This made me think that people are typing in how they hear other people refer to the courses verbally, that is: "bee-ee-oh-one" - BEO1 (not BE01).
Looking at some other questions, the consensus is that the keywords meta tag is virtually useless and that that information should go into the content of the page. I obviously don't want to perpetuate the misconception that our courses are called BEO1 by putting that into the content, so what should I do?
I'd recommend making separate informational pages on the misspellings (http://example.com/beo1.html or what-have-you) that include a brief explanation about the confusion and refer to the correct course page. Get these indexed by including them in your sitemap (presumably you have one already), and if you like, improve their likelihood of indexing and their ranking by linking them in an inconspicuous "common misspellings" section in the real course pages.