Synonym dictionary implementation? - language-agnostic

How should I approach this problem? I basically need to implement a dictionary of synonyms. It takes as input some "word/synonim" pairs and I have to be able to "query" it for the list of all synonims of a word.
For example:
Dictionary myDic;
myDic.Add("car", "automobile");
myDic.Add("car", "autovehicle");
myDic.Add("car", "vehicle");
myDic.Add("bike", "vehicle");
myDic.ListOSyns("car") // should return {"automobile","autovehicle","vehicle" ± "car"}
// but "bike" should NOT be among the words returned
I'll code this in C++, but I'm interested in an overall idea of the implementation, so the question is not exactly language-specific.
PS: The main idea is to have some groups of words (synonyms). In the example above there would be two such groups:
{"automobile","autovehicle","vehicle", "car"}
{"bike", "vehicle"}
"vehicle" belongs to both, "bike" just to the second one, the others just to the first

I would implement it as a Graph + hash table / search tree
each keyword would be a Vertex, and each connection between 2 keywords would be an edge.
a hash table or a search tree will connect from each word to its node (and vice versa).
when a query is submitted - you find the node with your hash/tree and do BFS/DFS of the required depth. (meaning you cannot continue after a certain depth)
complexity: O(E(d)+V(d)) for searching graph (d = depth) (E(d) = number of edges in the relevant depth, same for V(d))
O(1) for creating an edge (not including searching for the node, detailed below its search)
O(logn) / O(1) for finding node (for tree/hash table)
O(logn) /O(1) for adding a keyword to the tree/hash table and O(1) to add a Vertex
p.s. as mentioned: the designer should keep in mind if he needs a directed or indirected Graph, as mentioned in the comments to the question.
hope that helps...

With the clarification in the comments to the question, it's relatively simple since you're not storing groups of mutual synonyms, but rather separately defining the acceptable synonyms for each word. The obvious container is either:
std::map<std::string, std::set<std::string> >
or:
std::multi_map<std::string, std::string>
if you're not worried about duplicates being inserted, like this:
myDic.Add("car", "automobile");
myDic.Add("car", "auto");
myDic.Add("car", "automobile");
In the case of multi_map, use the equal_range member function to extract the synonyms for each word, maybe like this:
struct Dictionary {
vector<string> ListOSyns(const string &key) const {
typedef multi_map<string, string>::const_iterator constit;
pair<constit, constit> x = innermap.equal_range(key);
vector<string> retval(x.first, x.second);
retval.push_back(key);
return retval;
}
};
Finally, if you prefer a hashtable-like structure to a tree-like structure, then unordered_multimap might be available in your C++ implementation, and basically the same code works.

Related

Overwrite function only for a particular instance in LUA

I basically don't look for an answer on how to do something but I found how to do it, yet want more information. Hope this kind of question is OK here.
Since I just discovered this the code of a game I'm modding I don't have any idea what should I google for.
In Lua, I can have for example:
Account = {balance = 0}
function Account.withdraw (v)
self.balance = self.balance - v
end
I can have (in another lua file)
function Account.withdrawBetter (v)
if self.balance > v then
self.balance = self.balance - v
end
end
....
--somewhere in some function, with an Account instance:
a1.withdraw = a1.withdrawBetter
`
What's the name for this "technique" so I can find some more information about it (possible pitfalls, performance considerations vs. override/overwrite, etc)? note I'm only changing withdraw for the particular instance (a1), not for every Account instance.
Bonus question: Any other oo programming languages with such facility?
Thanks
OO in Lua
First of all, it should be pointed out that Lua does not implement Object Oriented Programming; it has no concept of objects, classes, inheritance, etc.
If you want OOP in Lua, you have to implement it yourself. Usually this is done by creating a table that acts as a "class", storing the "instance methods", which are really just functions that accept the instance as its first argument.
Inheritance is then achieved by having the "constructor" (also just a function) create a new table and set its metatable to one with an __index field pointing to the class table. When indexing the "instance" with a key it doesn't have, it will then search for that key in the class instead.
In other words, an "instance" table may have no functions at all, but indexing it with, for example, "withdraw" will just try indexing the class instead.
Now, if we take a single "instance" table and add a withdraw field to it, Lua will see that it has that field and not bother looking it up in the class. You could say that this value shadows the one in the class table.
What's the name for this "technique"
It doesn't really have one, but you should definitely look into metatables.
In languages that do support this sort of thing, like in Ruby (see below) this is often done with singleton classes, meaning that they only have a single instance.
Performance considerations
Indexing tables, including metatables takes some time. If Lua finds a method in the instance table, then that's a single table lookup; if it doesn't, it then needs to first get the metatable and index that instead, and if that doesn't have it either and has its own metatable, the chain goes on like that.
So, in other words, this is actually faster. It does use up some more space, but not really that much (technically it could be quite a lot, but you really shouldn't worry about that. Nonetheless, here's where you can read up on that, if you want to).
Any other oo programming languages with such facility?
Yes, lots of 'em. Ruby is a good example, where you can do something like
array1 = [1, 2, 3]
array2 = [4, 5, 6]
def array1.foo
puts 'bar'
end
array1.foo # prints 'bar'
array2.foo # raises `NoMethodError`

how do i decode/encode the url parameters for the new google maps?

Im trying to figure out how to extract the lat/long of the start/end in a google maps directions link that looks like this:
https://www.google.com/maps/preview#!data=!1m4!1m3!1d189334!2d-96.03687!3d36.1250439!4m21!3m20!1m4!3m2!3d36.0748342!4d-95.8040972!6e2!1m5!1s1331-1399+E+14th+St%2C+Tulsa%2C+OK+74120!2s0x87b6ec9a1679f9e5%3A0x6e70df70feebbb5e!3m2!3d36.1424613!4d-95.9736986!3m8!1m3!1d189334!2d-96.03687!3d36.1250439!3m2!1i1366!2i705!4f13.1&fid=0
Im guessing the "!" is a separator between variables followed by XY where x is a number and y is a lower case letter, but can not quite figure out how to reliably extract the coordinates as the number/order of variables changes as well as their XY prefixes.
ideas?
thanks
Well, this is old, but hey. I've been working on this a bit myself, so here's what I've figured out:
The data is an encoded javascript array, so the trick when trying to generate your own data string is to ensure that your formatting keeps the structure of the array intact. To do this, let's look at what each step represents.
As you're correctly figured out, each exclamation point defines the start of a value definition. The first character, an int value, is an inner count, and (I believe) acts as an identifier, although I'm not 100% certain on this. It seems to be pretty flexible in terms of what you can have here, as long as it's an int. The second character, however, is much more important. It defines the data type of the value. I don't know if I've found all the data types yet, but the ones I have figured out are:
m: matrix
f: float
d: double
i: integer
b: boolean
e: enum (as integer)
s: string
u: unsigned int
x: hexdecimal value?
the remaining characters actually hold the value itself, so a string will just hold the string, a boolean will be '1' or '0', and so on. However, there's an important gotcha: the matrix data type.
The value of the matrix will be an integer. This is the length of the matrix, measured in the number of values. That is, for a matrix !1mx, the next x value definitions will belong to the matrix. This includes nested matrix definitions, so a matrix of form [[1,2]] would look like !1m3!1m2!1i1!2i2 (outer matrix has three children, inner matrix has 2). this also means that, in order to remove a value from the list, you must also check it for matrix ancestors and, if they exist, update their values to reflect the now missing member.
The x data type is another anomaly. I'm going to guess it's hexdecimal encoded for most purposes, but in my particular situation (making a call for attribution info), they appear to also use the x data type to store lat/long information, and this is NOT encoded in hex, but is an unsigned long with the value set as
value = coordinate<0 ? (430+coordinate)*1e7 : coordinate*1e7
An example (pulled directly from google maps) of the x data type being used in this way:
https://www.google.com/maps/vt?pb=!1m8!4m7!2u7!5m2!1x405712614!2x3250870890!6m2!1x485303036!2x3461808386!2m1!1e0!2m20!1e2!2spsm!4m2!1sgid!2sznfCVopRY49wPV6IT72Cvw!4m2!1ssp!2s1!8m11!13m9!2sa!15b1!18m5!2b1!3b0!4b1!5b0!6b0!19b1!19u12!3m1!5e1105!4e5!18m1!1b1
For the context of the question asked, it's important to note that there are no reliable identifiers in the structure. Google reads the values in a specific order, so always keep in mind when building your own encoded data that order matters; you'll need to do some research/testing to determine that order. As for reading, your best hope is to rebuild the matrix structure, then scan it for something that looks like lat/long values (i.e. a matrix containing exactly two children of type double (or x?))
Looks like the developer tools from current browsers (I am using Chrome for that) can give you a lot of info.
Try the following:
Go to Google Maps with Chrome (or adapt the instructions for other browser);
Open Developer Tools (Ctrl + Shift + I);
Go to Network tab. Clear the current displayed values;
Drag the map until some url with encoded data appears;
Click on that url, and then go to the Preview sub-tab;
Try this.
function URLtoLatLng(url) {
this.lat = url.replace(/^.+!3d(.+)!4d.+$/, '$1');
this.lng = url.replace(/^.+!4d(.+)!6e.+$/, '$1');
return this;
}
var url = new URLtoLatLng('https://www.google.com/maps/preview#!data=!1m4!1m3!1d189334!2d-96.03687!3d36.1250439!4m21!3m20!1m4!3m2!3d36.0748342!4d-95.8040972!6e2!1m5!1s1331-1399+E+14th+St%2C+Tulsa%2C+OK+74120!2s0x87b6ec9a1679f9e5%3A0x6e70df70feebbb5e!3m2!3d36.1424613!4d-95.9736986!3m8!1m3!1d189334!2d-96.03687!3d36.1250439!3m2!1i1366!2i705!4f13.1&fid=0');
console.log(url.lat + ' ' + url.lng);

How do you return two values from a single method?

When your in a situation where you need to return two things in a single method, what is the best approach?
I understand the philosophy that a method should do one thing only, but say you have a method that runs a database select and you need to pull two columns. I'm assuming you only want to traverse through the database result set once, but you want to return two columns worth of data.
The options I have come up with:
Use global variables to hold returns. I personally try and avoid globals where I can.
Pass in two empty variables as parameters then assign the variables inside the method, which now is a void. I don't like the idea of methods that have a side effects.
Return a collection that contains two variables. This can lead to confusing code.
Build a container class to hold the double return. This is more self-documenting then a collection containing other collections, but it seems like it might be confusing to create a class just for the purpose of a return.
This is not entirely language-agnostic: in Lisp, you can actually return any number of values from a function, including (but not limited to) none, one, two, ...
(defun returns-two-values ()
(values 1 2))
The same thing holds for Scheme and Dylan. In Python, I would actually use a tuple containing 2 values like
def returns_two_values():
return (1, 2)
As others have pointed out, you can return multiple values using the out parameters in C#. In C++, you would use references.
void
returns_two_values(int& v1, int& v2)
{
v1 = 1; v2 = 2;
}
In C, your method would take pointers to locations, where your function should store the result values.
void
returns_two_values(int* v1, int* v2)
{
*v1 = 1; *v2 = 2;
}
For Java, I usually use either a dedicated class, or a pretty generic little helper (currently, there are two in my private "commons" library: Pair<F,S> and Triple<F,S,T>, both nothing more than simple immutable containers for 2 resp. 3 values)
I would create data transfer objects. If it is a group of information (first and last name) I would make a Name class and return that. #4 is the way to go. It seems like more work up front (which it is), but makes it up in clarity later.
If it is a list of records (rows in a database) I would return a Collection of some sort.
I would never use globals unless the app is trivial.
Not my own thoughts (Uncle Bob's):
If there's cohesion between those two variables - I've heard him say, you're missing a class where those two are fields. (He said the same thing about functions with long parameter lists.)
On the other hand, if there is no cohesion, then the function does more than one thing.
I think the most preferred approach is to build a container (may it be a class or a struct - if you don't want to create a separate class for this, struct is the way to go) that will hold all the parameters to be returned.
In the C/C++ world it would actually be quite common to pass two variables by reference (an example, your no. 2).
I think it all depends on the scenario.
Thinking from a C# mentality:
1: I would avoid globals as a solution to this problem, as it is accepted as bad practice.
4: If the two return values are uniquely tied together in some way or form that it could exist as its own object, then you can return a single object that holds the two values. If this object is only being designed and used for this method's return type, then it likely isn't the best solution.
3: A collection is a great option if the returned values are the same type and can be thought of as a collection. However, if the specific example needs 2 items, and each item is it's 'own' thing -> maybe one represents the beginning of something, and the other represents the end, and the returned items are not being used interchangably, then this may not be the best option.
2: I like this option the best, if 4, and 3 do not make sense for your scenario. As stated in 3, if you wanted to get two objects that represent the beginning and end items of something. Then I would use parameters by reference (or out parameters, again, depending on how it's all being used). This way your parameters can explicitly define their purpose: MethodCall(ref object StartObject, ref object EndObject)
Personally I try to use languages that allow functions to return something more than a simple integer value.
First, you should distinguish what you want: an arbitrary-length return or fixed-length return.
If you want your method to return an arbitrary number of arguments, you should stick to collection returns. Because the collections--whatever your language is--are specifically tied to fulfill such a task.
But sometimes you just need to return two values. How does returning two values--when you're sure it's always two values--differ from returning one value? No way it differs, I say! And modern languages, including perl, ruby, C++, python, ocaml etc allow function to return tuples, either built-in or as a third-party syntactic sugar (yes, I'm talking about boost::tuple). It looks like that:
tuple<int, int, double> add_multiply_divide(int a, int b) {
return make_tuple(a+b, a*b, double(a)/double(b));
}
Specifying an "out parameter", in my opinion, is overused due to the limitations of older languages and paradigms learned those days. But there still are many cases when it's usable (if your method needs to modify an object passed as parameter, that object being not the class that contains a method).
The conclusion is that there's no generic answer--each situation has its own solution. But one common thing there is: it's not violation of any paradigm that function returns several items. That's a language limitation later somehow transferred to human mind.
Python (like Lisp) also allows you to return any number of
values from a function, including (but not limited to)
none, one, two
def quadcube (x):
return x**2, x**3
a, b = quadcube(3)
Some languages make doing #3 native and easy. Example: Perl. "return ($a, $b);". Ditto Lisp.
Barring that, check if your language has a collection suited to the task, ala pair/tuple in C++
Barring that, create a pair/tuple class and/or collection and re-use it, especially if your language supports templating.
If your function has return value(s), it's presumably returning it/them for assignment to either a variable or an implied variable (to perform operations on, for instance.) Anything you can usefully express as a variable (or a testable value) should be fair game, and should dictate what you return.
Your example mentions a row or a set of rows from a SQL query. Then you reasonably should be ready to deal with those as objects or arrays, which suggests an appropriate answer to your question.
When your in a situation where you
need to return two things in a single
method, what is the best approach?
It depends on WHY you are returning two things.
Basically, as everyone here seems to agree, #2 and #4 are the two best answers...
I understand the philosophy that a
method should do one thing only, but
say you have a method that runs a
database select and you need to pull
two columns. I'm assuming you only
want to traverse through the database
result set once, but you want to
return two columns worth of data.
If the two pieces of data from the database are related, such as a customer's First Name and Last Name, I would indeed still consider this to be doing "one thing."
On the other hand, suppose you have come up with a strange SELECT statement that returns your company's gross sales total for a given date, and also reads the name of the customer that placed the first sale for today's date. Here you're doing two unrelated things!
If it's really true that performance of this strange SELECT statement is much better than doing two SELECT statements for the two different pieces of data, and both pieces of data really are needed on a frequent basis (so that the entire application would be slower if you didn't do it that way), then using this strange SELECT might be a good idea - but you better be prepared to demonstrate why your way really makes a difference in perceived response time.
The options I have come up with:
1 Use global variables to hold returns. I personally try and avoid
globals where I can.
There are some situations where creating a global is the right thing to do. But "returning two things from a function" is not one of those situations. Doing it for this purpose is just a Bad Idea.
2 Pass in two empty variables as parameters then assign the variables
inside the method, which now is a
void.
Yes, that's usually the best idea. This is exactly why "by reference" (or "output", depending on which language you're using) parameters exist.
I don't like the idea of methods that have a side effects.
Good theory, but you can take it too far. What would be the point of calling SaveCustomer() if that method didn't have a side-effect of saving the customer's data?
By Reference parameters are understood to be parameters that contain returned data.
3 Return a collection that contains two variables. This can lead to confusing code.
True. It wouldn't make sense, for instance, to return an array where element 0 was the first name and element 1 was the last name. This would be a Bad Idea.
4 Build a container class to hold the double return. This is more self-documenting then a collection containing other collections, but it seems like it might be confusing to create a class just for the purpose of a return.
Yes and no. As you say, I wouldn't want to create an object called FirstAndLastNames just to be used by one method. But if there was already an object which had basically this information, then it would make perfect sense to use it here.
If I was returning two of the exact same thing, a collection might be appropriate, but in general I would usually build a specialized class to hold exactly what I needed.
And if if you are returning two things today from those two columns, tomorrow you might want a third. Maintaining a custom object is going to be a lot easier than any of the other options.
Use var/out parameters or pass variables by reference, not by value. In Delphi:
function ReturnTwoValues(out Param1: Integer):Integer;
begin
Param1 := 10;
Result := 20;
end;
If you use var instead of out, you can pre-initialize the parameter.
With databases, you could have an out parameter per column and the result of the function would be a boolean indicating if the record is retrieved correctly or not. (Although I would use a single record class to hold the column values.)
As much as it pains me to do it, I find the most readable way to return multiple values in PHP (which is what I work with, mostly) is using a (multi-dimensional) array, like this:
function doStuff($someThing)
{
// do stuff
$status = 1;
$message = 'it worked, good job';
return array('status' => $status, 'message' => $message);
}
Not pretty, but it works and it's not terribly difficult to figure out what's going on.
I generally use tuples. I mainly work in C# and its very easy to design generic tuple constructs. I assume it would be very similar for most languages which have generics. As an aside, 1 is a terrible idea, and 3 only works when you are getting two returns that are the same type unless you work in a language where everything derives from the same basic type (i.e. object). 2 and 4 are also good choices. 2 doesn't introduce any side effects a priori, its just unwieldy.
Use std::vector, QList, or some managed library container to hold however many X you want to return:
QList<X> getMultipleItems()
{
QList<X> returnValue;
for (int i = 0; i < countOfItems; ++i)
{
returnValue.push_back(<your data here>);
}
return returnValue;
}
For the situation you described, pulling two fields from a single table, the appropriate answer is #4 given that two properties (fields) of the same entity (table) will exhibit strong cohesion.
Your concern that "it might be confusing to create a class just for the purpose of a return" is probably not that realistic. If your application is non-trivial you are likely going to need to re-use that class/object elsewhere anyway.
You should also consider whether the design of your method is primarily returning a single value, and you are getting another value for reference along with it, or if you really have a single returnable thing like first name - last name.
For instance, you might have an inventory module that queries the number of widgets you have in inventory. The return value you want to give is the actual number of widgets.. However, you may also want to record how often someone is querying inventory and return the number of queries so far. In that case it can be tempting to return both values together. However, remember that you have class vars availabe for storing data, so you can store an internal query count, and not return it every time, then use a second method call to retrieve the related value. Only group the two values together if they are truly related. If they are not, use separate methods to retrieve them separately.
Haskell also allows multiple return values using built in tuples:
sumAndDifference :: Int -> Int -> (Int, Int)
sumAndDifference x y = (x + y, x - y)
> let (s, d) = sumAndDifference 3 5 in s * d
-16
Being a pure language, options 1 and 2 are not allowed.
Even using a state monad, the return value contains (at least conceptually) a bag of all relevant state, including any changes the function just made. It's just a fancy convention for passing that state through a sequence of operations.
I will usually opt for approach #4 as I prefer the clarity of knowing what the function produces or calculate is it's return value (rather than byref parameters). Also, it lends to a rather "functional" style in program flow.
The disadvantage of option #4 with generic tuple classes is it isn't much better than returning a collection (the only gain is type safety).
public IList CalculateStuffCollection(int arg1, int arg2)
public Tuple<int, int> CalculateStuffType(int arg1, int arg2)
var resultCollection = CalculateStuffCollection(1,2);
var resultTuple = CalculateStuffTuple(1,2);
resultCollection[0] // Was it index 0 or 1 I wanted?
resultTuple.A // Was it A or B I wanted?
I would like a language that allowed me to return an immutable tuple of named variables (similar to a dictionary, but immutable, typesafe and statically checked). But, sadly, such an option isn't available to me in the world of VB.NET, it may be elsewhere.
I dislike option #2 because it breaks that "functional" style and forces you back into a procedural world (when often I don't want to do that just to call a simple method like TryParse).
I have sometimes used continuation-passing style to work around this, passing a function value as an argument, and returning that function call passing the multiple values.
Objects in place of function values in languages without first-class functions.
My choice is #4. Define a reference parameter in your function. That pointer references to a Value Object.
In PHP:
class TwoValuesVO {
public $expectedOne;
public $expectedTwo;
}
/* parameter $_vo references to a TwoValuesVO instance */
function twoValues( & $_vo ) {
$vo->expectedOne = 1;
$vo->expectedTwo = 2;
}
In Java:
class TwoValuesVO {
public int expectedOne;
public int expectedTwo;
}
class TwoValuesTest {
void twoValues( TwoValuesVO vo ) {
vo.expectedOne = 1;
vo.expectedTwo = 2;
}
}

Is it any way to implement a linked list with indexed access too?

I'm in the need of sort of a linked list structure, but if it had indexed access too it would be great.
Is it any way to accomplish that?
EDIT: I'm writing in C, but it may be for any language.
One method of achieving your goal is to implement a random or deterministic skip list. On the bottom level - you have your linked list with your items.
In order to get to elements using indexes, you'll need to add information to the inner nodes - of how many nodes are in the low most level, from this node until the next node on this level. This information can be added and maintained in O(logn).
This solution complexity is:
Add, Remove, Go to index, all work in O(logn).
The down side of this solution is that it is much more difficult to implement than the regular linked list. So using a regular linked list, you get Add, Remove in O(1), and Go to index in O(n).
You can probably use a tree for what you are aiming at. Make a binary tree that maintains the weights of each node of the tree (where the weight is equal to the number of nodes attached to that node, including itself). If you have a balancing scheme available for the tree, then insertions are still O(log n), since you only need to add one to the ancestor nodes' weights. Getting a node by index is O(log n), since you need only look at the indices of the ancestors of your desired node and the two children of each of those ancestors.
For achieving array like indexing in languages like C++, Java, Python, one would have to overload the array indexing operator [] for a class which implements the linked list data structure. The implementation would be O(n). In C since operator overloading is not possible, hence one would have to write a function which takes the linked list data structure and a position and returns the corresponding object.
In case a faster order access is required, one would have to use a different data structure like the BTree suggested by jprete or a dynamic array (which automatically grows as and when new elements are added to it). A quick example would be std::vector in C++ standard library.
SQL server row items in the clustered index are arranged like so:
.
/ \
/\ /\
*-*-*-*
The linked list is in the leaves (*-*-*). The linked list is ordered allowing fast directional scanning, and the tree serves as a `road-map' into the linked-list. So you would need a key-value pair for your items and then a data structure that encapsulates the tree and linked list.
so your data structure might look something like this:
struct ll_node
{
kv_pair current;
ll_node * next;
};
struct tree_node
{
value_type value;
short isLeaf;
union
{
tree_node * left_child;
kv_pair * left_leaf;
}
union
{
tree_node * right_child;
kv_pair * right_leaf
}
};
struct indexed_ll
{
tree_node * tree_root;
ll_node * linked_list_tail;
};

A StringToken Parser which gives Google Search style "Did you mean:" Suggestions

Seeking a method to:
Take whitespace separated tokens in a String; return a suggested Word
ie:
Google Search can take "fonetic wrd nterpreterr",
and atop of the result page it shows "Did you mean: phonetic word interpreter"
A solution in any of the C* languages or Java would be preferred.
Are there any existing Open Libraries which perform such functionality?
Or is there a way to Utilise a Google API to request a suggested word?
In his article How to Write a Spelling Corrector, Peter Norvig discusses how a Google-like spellchecker could be implemented. The article contains a 20-line implementation in Python, as well as links to several reimplementations in C, C++, C# and Java. Here is an excerpt:
The full details of an
industrial-strength spell corrector
like Google's would be more confusing
than enlightening, but I figured that
on the plane flight home, in less than
a page of code, I could write a toy
spelling corrector that achieves 80 or
90% accuracy at a processing speed of
at least 10 words per second.
Using Norvig's code and this text as training set, i get the following results:
>>> import spellch
>>> [spellch.correct(w) for w in 'fonetic wrd nterpreterr'.split()]
['phonetic', 'word', 'interpreters']
You can use the yahoo web service here:
http://developer.yahoo.com/search/web/V1/spellingSuggestion.html
However it's only a web service... (i.e. there are no APIs for other language etc..) but it outputs JSON or XML, so... pretty easy to adapt to any language...
You can also use the Google API's to spell check. There is an ASP implementation here (I'm not to credit for this, though).
First off:
Java
C++
C#
Use the one of your choice. I suspect it runs the query against a spell-checking engine with a word limit of exactly one, it then does nothing if the entire query is valid, otherwise it replaces each word with that word's best match. In other words, the following algorithm (an empty return string means that the query had no problems):
startup()
{
set the spelling engines word suggestion limit to 1
}
option 1()
{
int currentPosition = engine.NextWord(start the search at word 0, querystring);
if(currentPosition == -1)
return empty string; // Query is a-ok.
while(currentPosition != -1)
{
queryString = engine.ReplaceWord(engine.CurrentWord, queryString, the suggestion with index 0);
currentPosition = engine.NextWord(currentPosition, querystring);
}
return queryString;
}
Since no one has yet mentioned it, I'll give one more phrase to search for: "edit distance" (for example, link text).
That can be used to find closest matches, assuming it's typos where letters are transposed, missing or added.
But usually this is also coupled with some sort of relevancy information; either by simple popularity (to assume most commonly used close-enough match is most likely correct word), or by contextual likelihood (words that follow preceding correct word, or come before one). This gets into information retrieval; one way to start is to look at bigram and trigrams (sequences of words seen together). Google has very extensive freely available data sets for these.
For simple initial solution though a dictionary couple with Levenshtein-based matchers works surprisingly well.
You could plug Lucene, which has a dictionary facility implementing the Levenshtein distance method.
Here's an example from the Wiki, where 2 is the distance.
String[] l=spellChecker.suggestSimilar("sevanty", 2);
//l[0] = "seventy"
http://wiki.apache.org/lucene-java/SpellChecker
An older link http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
The Google SOAP Search APIs do that.
If you have a dictionary stored as a trie, there is a fairly straightforward way to find best-matching entries, where characters can be inserted, deleted, or replaced.
void match(trie t, char* w, string s, int budget){
if (budget < 0) return;
if (*w=='\0') print s;
foreach (char c, subtrie t1 in t){
/* try matching or replacing c */
match(t1, w+1, s+c, (*w==c ? budget : budget-1));
/* try deleting c */
match(t1, w, s, budget-1);
}
/* try inserting *w */
match(t, w+1, s + *w, budget-1);
}
The idea is that first you call it with a budget of zero, and see if it prints anything out. Then try a budget of 1, and so on, until it prints out some matches. The bigger the budget the longer it takes. You might want to only go up to a budget of 2.
Added: It's not too hard to extend this to handle common prefixes and suffixes. For example, English prefixes like "un", "anti" and "dis" can be in the dictionary, and can then link back to the top of the dictionary. For suffixes like "ism", "'s", and "ed" there can be a separate trie containing just the suffixes, and most words can link to that suffix trie. Then it can handle strange words like "antinationalizationalization".