Processing ELF relocations - understanding the relocs, symbols, section data, and how they work together

TL;DR
I tried to make this a short question but it's a complicated problem so it ended up being long. If you can answer any part of this or give any suggestions or tips or resources or anything at all, it would be extremely helpful (even if you don't directly solve all my issues). I'm banging my head against the wall right now. :)
Here are the specific issues I am having. Read on below for more information.
I'm looking for guidance on how to process relocation entries and update the unresolved symbols in the section data. I simply don't understand what to do with all the information I've pulled from the relocations and the sections, etc.
I'm also hoping to understand just what is going on when the linker encounters relocations. Trying to correctly implement the relocation equations and use all the correct values in the correct way is incredibly challenging.
When I encounter op codes and addresses and symbols, etc, I need to understand what to do with them. I feel like I am missing some steps.
I feel like I don't have a good grasp on how the symbol table entries interact with the relocations. How should I use the symbol's binding, visibility, value, and size information?
Lastly, when I output my file with the resolved data and new relocation entries used by the executable, the data is all incorrect. I'm not sure how to follow all the relocations and provide all the information necessary. What is the executable expecting from me?
My approach so far
I am trying to create a relocation file in a specific [undocumented] proprietary format that is heavily based on ELF. I have written a tool that takes an ELF file and a partially linked file (PLF) and processes them to output the fully resolved rel file. This rel file is used to load/unload data as needed in order to save memory. The platform is a 32-bit PPC. One wrinkle is that the tool is written for Windows in C#, but the data is intended for the PPC, so there are fun endian issues and the like to watch out for.
I've been trying to understand how relocations are handled when they are used to resolve unresolved symbols and so on. What I've done so far is to copy the relevant sections from the PLF, and then, for each corresponding .rela section, parse the entries and attempt to fix up the section data and generate new relocation entries as needed. But this is where my difficulty is. I'm way out of my element here, and this sort of thing seems to typically be done by linkers and loaders, so there aren't many good examples to draw upon. But I have found a few that have been of help, including THIS ONE.
So what is happening is:
Copy section data from PLF to be used for rel file. I'm only interested in the .init (no data), .text, .ctors, .dtors, .rodata, .data, .bss (no data), and another custom section we are using.
Iterate over the .rela sections in the PLF and read in the Elf32_Rela entries.
For each entry, I pull out the r_offset, r_info, and r_addend fields and extract the relevant info from r_info (the symbol index and the reloc type); a sketch of how I read one entry follows this list.
From the PLF's symbol table, I can get the symbolOffset, the symbolSection, and the symbolValue.
From the ELF, I get the symbolSection's load address.
I compute int localAddress = ( .relaSection.Offset + r_offset ).
I get the uint relocValue from the symbolSection's contents at r_offset.
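Here is the sketch mentioned above of how I read one entry (written C-style for clarity, even though the tool itself is C#; read_be32 is a stand-in for my byte-swapping helper, and rela points at the start of the 12-byte entry):
/* C-style sketch of reading one Elf32_Rela entry; the real tool does this in
 * C#.  read_be32 is an assumed byte-swapping helper (the PLF is big-endian
 * PPC data, the tool runs on little-endian Windows). */
uint32_t r_offset = read_be32(rela + 0);
uint32_t r_info   = read_be32(rela + 4);
int32_t  r_addend = (int32_t)read_be32(rela + 8);

uint32_t symIndex  = r_info >> 8;     /* ELF32_R_SYM(r_info)  */
uint32_t relocType = r_info & 0xff;   /* ELF32_R_TYPE(r_info) */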
Now I have all the info I need so I do a switch on the reloc type and process the data. These are the types I support:
R_PPC_NONE
R_PPC_ADDR32
R_PPC_ADDR24
R_PPC_ADDR16
R_PPC_ADDR16_LO
R_PPC_ADDR16_HI
R_PPC_ADDR16_HA
R_PPC_ADDR14
R_PPC_ADDR14_BRTAKEN
R_PPC_ADDR14_BRNTAKEN
R_PPC_REL24
R_PPC_REL14
R_PPC_REL14_BRTAKEN
R_PPC_REL14_BRNTAKEN
Now what?? I need to update the section data and build companion relocation entries, but I don't understand what needs to be done or how to do it.
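For what it's worth, here is a minimal sketch of what I currently think the patch step is supposed to look like for two of the simpler types (again C-style rather than my actual C#, with made-up helper names: read_be32/write_be32 are big-endian helpers, and targetSectionLoadAddress is the load address of the section being patched, i.e. the one the .rela section's sh_info points at). Please correct me if I've got this wrong:
uint8_t  *where = sectionData + r_offset;                 /* spot to patch      */
uint32_t  S = symbolSectionLoadAddress + symbolValue;     /* symbol's address   */
uint32_t  A = r_addend;
uint32_t  P = targetSectionLoadAddress + r_offset;        /* address of 'where' */

switch (relocType) {
case R_PPC_ADDR32:                       /* word32: S + A                */
    write_be32(where, S + A);
    break;
case R_PPC_REL24: {                      /* low24*: (S + A - P) >> 2     */
    uint32_t insn = read_be32(where);    /* keep opcode and AA/LK bits   */
    write_be32(where, (insn & 0xfc000003) | ((S + A - P) & 0x03fffffc));
    break;
}
default:
    /* the remaining types follow the same pattern with the ABI's masks */
    break;
}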
The whole reason I'm doing this is that there is an old, obsolete, unsupported tool that does not support custom sections, which are a key requirement for this project (for memory reasons). We have a custom section that contains a bunch of initialization code (totaling about a meg) that we want to unload after start-up. The existing tool just ignores all the data in that section.
So while making our own tool that does support custom sections is the ideal, if there are any bright ideas for another way to achieve this goal, I'm all ears! We've floated the idea of using the .dtors section for our data, since it's nearly empty anyway, but that is messy and might not work anyway if it prevents a clean shutdown.
Relocations plus example code
When I process the relocations, I'm working off of the equations and information found in the ABI docs HERE (around section 4.13, page 80ish) as well as a number of other code examples and blog posts I've dug up. But it's all so confusing and not really spelled out and all the code I've found does things a little differently.
For example,
R_PPC_ADDR16_LO --> half16: #lo(S + A)
R_PPC_ADDR14_BRTAKEN --> low14*: (S + A) >> 2
etc
So when I see this kind of code, how do I decipher it?
Here's one example (from this source)
case ELF::R_PPC64_ADDR14 : {
    assert(((Value + Addend) & 3) == 0);
    // Preserve the AA/LK bits in the branch instruction
    uint8_t aalk = *(LocalAddress+3);
    writeInt16BE(LocalAddress + 2, (aalk & 3) | ((Value + Addend) & 0xfffc));
} break;
case ELF::R_PPC64_REL24 : {
    uint64_t FinalAddress = (Section.LoadAddress + Offset);
    int32_t delta = static_cast<int32_t>(Value - FinalAddress + Addend);
    if (SignExtend32<24>(delta) != delta)
        llvm_unreachable("Relocation R_PPC64_REL24 overflow");
    // Generates a 'bl <address>' instruction
    writeInt32BE(LocalAddress, 0x48000001 | (delta & 0x03FFFFFC));
} break;
Here's some from another example (here)
case R_PPC_ADDR32: /* word32 S + A */
    addr = elf_lookup(lf, symidx, 1);
    if (addr == 0)
        return -1;
    addr += addend;
    *where = addr;
    break;

case R_PPC_ADDR16_LO: /* #lo(S) */
    if (addend != 0) {
        addr = relocbase + addend;
    } else {
        addr = elf_lookup(lf, symidx, 1);
        if (addr == 0)
            return -1;
    }
    *hwhere = addr & 0xffff;
    break;

case R_PPC_ADDR16_HA: /* #ha(S) */
    if (addend != 0) {
        addr = relocbase + addend;
    } else {
        addr = elf_lookup(lf, symidx, 1);
        if (addr == 0)
            return -1;
    }
    *hwhere = ((addr >> 16) + ((addr & 0x8000) ? 1 : 0)) & 0xffff;
    break;
And one other example (from here)
case R_PPC_ADDR16_HA:
    write_be16 (dso, rela->r_offset, (value + 0x8000) >> 16);
    break;
case R_PPC_ADDR24:
    write_be32 (dso, rela->r_offset,
                (value & 0x03fffffc) | (read_ube32 (dso, rela->r_offset) & 0xfc000003));
    break;
case R_PPC_ADDR14:
    write_be32 (dso, rela->r_offset,
                (value & 0xfffc) | (read_ube32 (dso, rela->r_offset) & 0xffff0003));
    break;
case R_PPC_ADDR14_BRTAKEN:
case R_PPC_ADDR14_BRNTAKEN:
    write_be32 (dso, rela->r_offset,
                (value & 0xfffc)
                | (read_ube32 (dso, rela->r_offset) & 0xffdf0003)
                | ((((GELF_R_TYPE (rela->r_info) == R_PPC_ADDR14_BRTAKEN) << 21)
                    ^ (value >> 10)) & 0x00200000));
    break;
case R_PPC_REL24:
    write_be32 (dso, rela->r_offset,
                ((value - rela->r_offset) & 0x03fffffc)
                | (read_ube32 (dso, rela->r_offset) & 0xfc000003));
    break;
case R_PPC_REL32:
    write_be32 (dso, rela->r_offset, value - rela->r_offset);
    break;
I really want to understand the magic these guys are doing here and why their code doesn't always look the same. I think some of the code is making assumptions that the data was already properly masked (for branches, etc), and some of the code is not. But I don't understand any of this at all.
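My best guess at the #lo/#hi/#ha helpers, reconstructed from the ABI equations (so treat it as an assumption on my part, not gospel), plus a mask-preserving patch of the kind the prelink code above does:
/* x is the fully resolved value, e.g. S + A */
static uint16_t lo(uint32_t x) { return (uint16_t)(x & 0xffff); }
static uint16_t hi(uint32_t x) { return (uint16_t)(x >> 16); }
/* #ha rounds the high half up when the low half is negative as a signed
 * 16-bit value, so that (ha(x) << 16) + (int16_t)lo(x) == x            */
static uint16_t ha(uint32_t x) { return (uint16_t)((x + 0x8000) >> 16); }

/* e.g. R_PPC_ADDR14: re-read the instruction word and keep every bit that
 * is not part of the 14-bit branch-displacement field (opcode, AA, LK).
 * read_be32/write_be32 are assumed big-endian helpers.                   */
uint32_t insn = read_be32(where);
write_be32(where, (insn & 0xffff0003) | ((S + A) & 0xfffc));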
Following the symbols/data/relocations, etc
When I look at the data in a hex editor, I see a bunch of "48 00 00 01" all over. I've figured out that this is an opcode that needs to be updated with relocation information (this one specifically is 'bl', branch and link), yet my tool doesn't operate on the vast majority of them, and the ones that I do update end up with the wrong values (compared to an example produced by the obsolete tool). Clearly I am missing some part of the process.
In addition to the section data, there are additional relocation entries that need to be added to the end of the rel file. These consist of internal and external relocations, but I haven't figured out much at all about these yet. (What's the difference between the two, and when do you use one or the other?)
If you look near the end of this file at the function RuntimeDyldELF::processRelocationRef, you'll see some Relocation Entries being created. They also make stub functions. I suspect this is the missing link for me, but it's as clear as mud and I'm not following it even a little bit.
When I output the symbols in each relocation entry, they each have a binding [Global/Weak/Local], a type [Function/Object], a value, a size, and a section. I know the section is where the symbol is located, and the value is the offset to the symbol in that section (or is it the virtual address?). The size is the size of the symbol, but is this important? Maybe the Global/Weak/Local binding is useful for determining whether it's an internal or external relocation?
Maybe this relocation table I'm talking about creating is actually a symbol table for my rel file? Maybe this table updates the symbol value from being a virtual address to being a section offset (since that's what the value is in relocatable files and the symbol table in the PLF is basically in an executable)?
Some resources:
blog on relocations: http://eli.thegreenplace.net/2011/08/25/load-time-relocation-of-shared-libraries/
mentions opcodes at the end: http://wiki.netbsd.org/examples/elf_executables_for_powerpc/
my related unanswered question: ELF Relocation reverse engineering
Whew! That's a beast of a question. Congrats if you made it this far. :) Thanks in advance for any help you can give me.

I stumbled across this question and thought it deserved an answer.
Have elf.h handy. You can find it on the internet.
Each RELA section contains an array of Elf32_Rela entries as you know, but is also tied to a certain other section. r_offset is an offset into that other section (in this case - it works differently for shared libraries). You will find that section headers have a member called sh_info. This tells you which section that is. (It's an index into the section header table as you would expect.)
The 'symbol', which you got from r_info is in fact an index into a symbol table residing in another section. Look for the member sh_link in your RELA section's header.
The symbol table tells you the name of the symbol you are looking for, in the form of the st_name member of Elf32_Sym. st_name is an offset into a string section. Which section that is, you get from the sh_link member of your symbol table's section header. Sorry if this gets confusing.
/* elf_image is a char * pointing at the start of the ELF image in memory */
Elf32_Shdr *sh_table = (Elf32_Shdr *)(elf_image + ((Elf32_Ehdr *)elf_image)->e_shoff);
Elf32_Rela *relocs = (Elf32_Rela *)(elf_image + sh_table[relocation_section_index].sh_offset);

unsigned section_to_modify_index = sh_table[relocation_section_index].sh_info;
char *to_modify = elf_image + sh_table[section_to_modify_index].sh_offset;

unsigned symbol_table_index = sh_table[relocation_section_index].sh_link;
Elf32_Sym *symbol_table = (Elf32_Sym *)(elf_image + sh_table[symbol_table_index].sh_offset);

unsigned string_table_index = sh_table[symbol_table_index].sh_link;
char *string_table = elf_image + sh_table[string_table_index].sh_offset;
Let's say we are working with relocation number i.
Elf32_Rela *rel = &relocs[i];
Elf32_Sym *sym = &symbol_table[ELF32_R_SYM(rel->r_info)];
char *symbol_name = string_table + sym->st_name;
Find the address of that symbol (let's say that symbol_name == "printf"). The final value will go in (to_modify + rel->r_offset).
As for the table on pages 79-83 of the pdf you linked, it tells us what to put at that address, and how many bytes to write. Obviously the address we just got (of printf in this case) is part of most of them. It corresponds to the S in the expressions.
r_addend is just A. Sometimes the compiler needs to add a static constant to a reloc, I guess.
B is the base address of the shared object, or 0 for executable programs because they are not moved.
So if ELF32_R_TYPE(rel->r_info) == R_PPC_ADDR32 we have S + A, and the word size is word32 so we would get:
*(uint32_t *)(to_modify + rel->r_offset) = address_of_printf + rel->r_addend;
...and we have successfully performed a relocation.
I can't help you when it comes to the #lo, #hi, etc and the word sizes like low14. I know nothing about PPC but the linked pdf seems reasonable enough.
I also don't know about the stub functions. You don't normally need to know about those when linking (dynamically at least).
I'm not sure if I've answered all of your questions but you should be able to see what your example code does now at least.

Just answering the question: What is relocation?
When you write an assembly program, if it is position-dependent, the program is assumed to be loaded at a specific memory address. When the program is instead loaded at another memory address, all of the memory address references have to be updated so that the program still works correctly from the new memory area.
For example:
load A, B
Sometimes B can be an address, and sometimes it is just constant data; only the compiler knows which when it generates the assembly. The compiler therefore constructs a table of entries, each containing the offset of such a reference and the address originally assumed for it, based on the fixed address at which the image is expected to be loaded. This is called the relocation table.
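As a rough illustration (a minimal sketch; the structure and names here are invented for the example, not taken from a real loader), fixing up an image that was linked for one base address but loaded at another just means walking that table and adjusting every recorded location by the difference:
#include <stdint.h>

struct reloc { uint32_t offset; };  /* place in the image holding an absolute address */

/* image was linked to run at link_base but has been loaded at load_base */
static void apply_relocs(uint8_t *image, uint32_t link_base, uint32_t load_base,
                         const struct reloc *table, int count)
{
    uint32_t delta = load_base - link_base;
    for (int i = 0; i < count; i++) {
        uint32_t *slot = (uint32_t *)(image + table[i].offset);
        *slot += delta;   /* rewrite the absolute address for the new base */
    }
}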
It is closely tied to the symbol table, the PLT, the GOT, etc.:
http://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-vi_18.html
https://reverseengineering.stackexchange.com/questions/1992/what-is-plt-got
What does the concept of relocation mean?
How does C++ linking work in practice?

Try to look into the ELF specification. It is only about 60 pages and greatly clarifies things, especially Part 2, the one about linking.

Related

how to make Ghidra use a function's complete/original stackframe for decompiled code

I have a case where some function allocates/uses a 404 bytes temporary structure on the stack for its internal calculations (the function is self-contained and shuffles data around within that data structure). Conceptually the respective structure seems to consist of some 32-bit counters followed by an int[15] and a byte[80] array, and then an area that might or might not actually be used. Some of the generated data in the tables seems to represent offsets that are again used by the function to navigate within the temporary structure.
Unfortunately Ghidra's decompiler makes a total mess while trying to make sense of the function: In particular it creates separate "local_.." int-vars (and then uses a pointer to that var) for what should correctly be a pointer into the function's original data-structure (e.g. pointing into one of the arrays).
undefined4 local_17f;
...
dest = &local_17f;
for (i = 0xf; i != 0; i = i + -1) {
    *dest = 0;
    dest = dest + 1;
}
Ghidra does not seem to understand that an array-based data access is actually being used at that point. Ghidra's decompiler then also generates a local auStack316[316] variable which unfortunately seems to cover only a part of the respective local data structure used by the original ASM code (at least Ghidra did notice that a temporary memory buffer is used). As a result the decompiled code basically uses two overlapping (and broken) shadow data structures that should correctly just be the same block of memory.
Is there some way to make Ghidra's decompiler use the complete 404 bytes block allocated by the function as an auStack404 thus bypassing Ghidra's flawed interpretation logic and actually preserve the original functionality of the ASM code?
I think I found something. In the "Listing" view the local-variable layout is shown as a comment under the function's header. It seems that by right-clicking on a local-var line in that comment, "Set Data Type" can be applied to the respective local variable. Ah, and then there is what I've been looking for under "Function" / "Edit Stack Frame" :-)

How to get the .plt address when I have the GOT_OFFSET_TABLE?

When I try to reconstruct the ELF executable from a process image, I get the .dynamic section and the GOT_OFFSET_TABLE, but the global offset table has been filled in with the resolved addresses of the corresponding shared library functions. I must replace these addresses with the original PLT stub addresses. So, how can I get the PLT stub addresses from a process image?
for (c = 0; dyn[c].d_tag != DT_NULL; c++) {
    switch (dyn[c].d_tag) {
    case DT_PLTGOT:
        got = (Elf64_Addr)dyn[c].d_un.d_ptr;
        printf("[+] Located plt.got vaddr: 0x%lx\n", got);
        printf("[+] Relevant GOT entries begin at 0x%lx\n", (Elf64_Addr)dyn[c].d_un.d_ptr + 24);
        got_off = dyn[c].d_un.d_ptr - h.data_vaddr;
        GOT_TABLE = (Elf64_Addr *)&pmem[h.data_offset + got_off];
        GOT_TABLE += 3;
        break;
    ...
}
Elf64_Addr PLT_VADDR = ?
gdb-peda$ x/10gx 0x601000 + 3*8
0x601018: 0x00007f3eaf557800 0x00007f3eaf522740 0x601018->GOT_OFFSET_TABLE[3]
0x601028: 0x00007f3eaf578160 0x0000000000000000
0x601038: 0x0000000000000000 0x0000000000000000
0x601048: 0x0000000000000000 0x0000000000000000
0x601058: 0x0000000000000000 0x0000000000000000
Looks like we're reading the same book, 'Learning Linux Binary Analysis', and we are stuck on the same exercise. I did not find an elegant way to retrieve the .plt address, but you may be able to find it by using ElfN_Shdr.sh_name: resolve that offset in the section header string table and strcmp() the resulting name against ".plt".
The author, elfmaster seems to be using that :
https://github.com/elfmaster/ftrace/blob/master/ftrace.c#L418
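Something like the following is what I mean (sketch only; find_plt_vaddr and its layout assumptions are mine, and it obviously only works if the section headers are present in what you reconstructed):
#include <elf.h>
#include <string.h>

/* pmem points at the start of the ELF image; returns 0 if .plt is not found */
static Elf64_Addr find_plt_vaddr(unsigned char *pmem)
{
    Elf64_Ehdr *ehdr     = (Elf64_Ehdr *)pmem;
    Elf64_Shdr *shdr     = (Elf64_Shdr *)(pmem + ehdr->e_shoff);
    char       *shstrtab = (char *)(pmem + shdr[ehdr->e_shstrndx].sh_offset);

    for (int i = 0; i < ehdr->e_shnum; i++)
        if (strcmp(shstrtab + shdr[i].sh_name, ".plt") == 0)
            return shdr[i].sh_addr;
    return 0;
}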
So I managed to find my way to the .plt address on a stripped binary compiled as a PIE (Position Independent Executable). I observed that the .plt section was always directly above the .text section.
By subtracting the rel relocation table size, stored in DT_RELSZ, from the entry point, e_entry, you can get the address of .plt. I do not know yet whether this is a coincidence and the trick is actually garbage, but it works.
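In code the trick is just the following (a guess based on the layout I observed, not something the spec guarantees; ehdr and relsz are assumed to come from your own parsing of the header and .dynamic):
/* relsz = value of the DT_RELSZ entry found while walking .dynamic above */
Elf64_Addr plt_vaddr = ehdr->e_entry - relsz;  /* assumes .plt sits right below the entry point */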
I checked multiple ELF parsers and it seems that they all use the section name (sh_name) to identify the .plt section. We're asked to recover an ELF file from memory, and it appears that section headers are only relevant for linking and are not loaded for execution, but I'm not sure and you should double-check that.

Redacted comments in MS's source code for .NET [duplicate]

The Reference Source page for stringbuilder.cs has this comment in the ToString method:
if (chunk.m_ChunkLength > 0)
{
    // Copy these into local variables so that they
    // are stable even in the presence of ----s (hackers might do this)
    char[] sourceArray = chunk.m_ChunkChars;
    int chunkOffset = chunk.m_ChunkOffset;
    int chunkLength = chunk.m_ChunkLength;
What does this mean? Is ----s something a malicious user might insert into a string to be formatted?
The source code for the published Reference Source is pushed through a filter that removes objectionable content from the source. Verboten words are one example; Microsoft programmers use profanity in their comments. The names of devs are another; Microsoft wants to hide their identity. Such a word or name is substituted with dashes.
In this case you can tell what used to be there from the CoreCLR, the open-sourced version of the .NET Framework. It is a verboten word:
// Copy these into local variables so that they are stable even in the presence of race conditions
That version was hand-edited from the original you looked at before being submitted to GitHub; Microsoft also doesn't want to accuse their customers of being hackers. The original said "races", which the filter thus turned into "----s" :)
In the CoreCLR repository you have a fuller quote:
Copy these into local variables so that they are stable even in the presence of race conditions
Github
Basically: it's a threading consideration.
In addition to the great answer by @Jeroen, this is more than just a threading consideration. It's to prevent someone from intentionally creating a race condition and causing a buffer overflow that way. Later in the code, the length of that local variable is checked. If the code were to check the length of the accessible variable instead, it could have changed on a different thread between the time the length was checked and wstrcpy was called:
// Check that we will not overrun our boundaries.
if ((uint)(chunkLength + chunkOffset) <= ret.Length && (uint)chunkLength <= (uint)sourceArray.Length)
{
    ///
    /// imagine that another thread has changed the chunk.m_ChunkChars array here!
    /// we're now in big trouble, our attempt to prevent a buffer overflow has been thwarted!
    /// oh wait, we're ok, because we're using a local variable that the other thread can't access anyway.
    fixed (char* sourcePtr = sourceArray)
        string.wstrcpy(destinationPtr + chunkOffset, sourcePtr, chunkLength);
}
else
{
    throw new ArgumentOutOfRangeException("chunkLength", Environment.GetResourceString("ArgumentOutOfRange_Index"));
}
}
chunk = chunk.m_ChunkPrevious;
} while (chunk != null);
Really interesting question though.
Don't think that this is the case - the code in question copies to local variables to prevent bad things happening if the string builder instance is mutated on another thread.
I think the ---- may relate to a four letter swear word...

What's happening when I use for(i in object) in AS3?

To iterate over the properties of an Object in AS3 you can use for(var i:String in object) like this:
Object:
var object:Object = {
    thing: 1,
    stuff: "hats",
    another: new Sprite()
};
Loop:
for(var i:String in object)
{
    trace(i + ": " + object[i]);
}
Result:
stuff: hats
thing: 1
another: [object Sprite]
The order in which the properties are selected however seems to vary and never matches anything that I can think of such as alphabetical property name, the order in which they were created, etc. In fact if I try it a few different times in different places, the order is completely different.
Is it possible to access the properties in a given order? What is happening here?
I'm posting this as an answer just to complement BoltClock's answer with some extra insight, by looking directly at the Flash Player source code. We can actually see the AVM code that specifically provides this functionality, and it's written in C++. Inside ArrayObject.cpp we can see the following code:
// Iterator support - for in, for each
Atom ArrayObject::nextName(int index)
{
AvmAssert(index > 0);
int denseLength = (int)getDenseLength();
if (index <= denseLength)
{
AvmCore *core = this->core();
return core->intToAtom(index-1);
}
else
{
return ScriptObject::nextName (index - denseLength);
}
}
As you can see when there is a legitimate property (object) to return, it is looked up from the ScriptObject class, specifically the nextName() method. If we look at those methods within ScriptObject.cpp:
Atom ScriptObject::nextName(int index)
{
    AvmAssert(traits()->needsHashtable());
    AvmAssert(index > 0);

    InlineHashtable *ht = getTable();
    if (uint32_t(index)-1 >= ht->getCapacity()/2)
        return nullStringAtom;

    const Atom* atoms = ht->getAtoms();
    Atom m = ht->removeDontEnumMask(atoms[(index-1)<<1]);
    if (AvmCore::isNullOrUndefined(m))
        return nullStringAtom;
    return m;
}
We can see that indeed, as people have pointed out here, the VM is using a hash table. However, in these functions a specific index is supplied, which would suggest, at first glance, that there must be a specific ordering.
If you dig deeper (I won't post all the code here), there is a whole slew of methods from different classes involved in the for in/for each functionality. One of them is ScriptObject::nextNameIndex(), which basically pulls up the whole hash table and just starts providing indices to valid objects within the table, incrementing the original index supplied in the argument as long as the next value points to a valid object. If I'm right in my interpretation, this is the cause of your seemingly random lookup order, and I don't believe there is any way to force a standardized/ordered map in these operations.
Sources
For those of you who might want to get the source code for the open-source portion of the Flash Player, you can grab it from the following Mercurial repositories (you can download a snapshot as a zip, like on GitHub, so you don't have to install Mercurial):
http://hg.mozilla.org/tamarin-central - This is the "stable" or "release" repository
http://hg.mozilla.org/tamarin-redux - This is the development branch. The most recent changes to the AVM will be found here. This includes the support for Android and such. Adobe is still updating and open sourcing these parts of the flash player, so it's good current and official stuff.
While I'm at it, this might be of interest as well: http://code.google.com/p/redtamarin/. It's a branched off (and rather mature) version of the AVM and can be used to write server-side actionscript. Neat stuff and has a ton of information that gives insight into the workings of the AVM so I thought I'd include it too.
This behavior is documented (emphasis mine):
The for..in loop iterates through the properties of an object, or the elements of an array. For example, you can use a for..in loop to iterate through the properties of a generic object (object properties are not kept in any particular order, so properties may appear in a seemingly random order)
How the properties are stored and retrieved appears to be an implementation detail, which isn't covered in the documentation. As ToddBFisher mentions in a comment, though, a data structure commonly used to implement associative arrays is the hash table. It's even mentioned in this page about associative arrays in AS3, and if you inspect the AVM code as shown by Ascension Systems, you'll find exactly such an implementation. As described, there is no notion of order or sorting in typical hash tables.
I don't believe there is a way to access the properties in a specific order unless you store that order somehow.

A StringToken Parser which gives Google Search style "Did you mean:" Suggestions

Seeking a method to:
Take whitespace separated tokens in a String; return a suggested Word
ie:
Google Search can take "fonetic wrd nterpreterr",
and at the top of the result page it shows "Did you mean: phonetic word interpreter"
A solution in any of the C* languages or Java would be preferred.
Are there any existing Open Libraries which perform such functionality?
Or is there a way to Utilise a Google API to request a suggested word?
In his article How to Write a Spelling Corrector, Peter Norvig discusses how a Google-like spellchecker could be implemented. The article contains a 20-line implementation in Python, as well as links to several reimplementations in C, C++, C# and Java. Here is an excerpt:
The full details of an industrial-strength spell corrector like Google's would be more confusing than enlightening, but I figured that on the plane flight home, in less than a page of code, I could write a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.
Using Norvig's code and this text as the training set, I get the following results:
>>> import spellch
>>> [spellch.correct(w) for w in 'fonetic wrd nterpreterr'.split()]
['phonetic', 'word', 'interpreters']
You can use the yahoo web service here:
http://developer.yahoo.com/search/web/V1/spellingSuggestion.html
However, it's only a web service... (i.e. there are no APIs for other languages, etc.), but it outputs JSON or XML, so... pretty easy to adapt to any language...
You can also use the Google APIs to spell check. There is an ASP implementation here (I'm not the one to credit for this, though).
First off:
Java
C++
C#
Use the one of your choice. I suspect it runs the query against a spell-checking engine with a word limit of exactly one, it then does nothing if the entire query is valid, otherwise it replaces each word with that word's best match. In other words, the following algorithm (an empty return string means that the query had no problems):
startup()
{
    set the spelling engine's word suggestion limit to 1
}

option 1()
{
    int currentPosition = engine.NextWord(start the search at word 0, querystring);
    if (currentPosition == -1)
        return empty string; // Query is a-ok.

    while (currentPosition != -1)
    {
        queryString = engine.ReplaceWord(engine.CurrentWord, queryString, the suggestion with index 0);
        currentPosition = engine.NextWord(currentPosition, querystring);
    }
    return queryString;
}
Since no one has yet mentioned it, I'll give one more phrase to search for: "edit distance" (Levenshtein distance, for example).
That can be used to find closest matches, assuming it's typos where letters are transposed, missing or added.
But usually this is also coupled with some sort of relevancy information: either simple popularity (assuming the most commonly used close-enough match is most likely the correct word), or contextual likelihood (words that follow the preceding correct word, or come before one). This gets into information retrieval; one way to start is to look at bigrams and trigrams (sequences of words seen together). Google has very extensive, freely available data sets for these.
For a simple initial solution, though, a dictionary coupled with Levenshtein-based matchers works surprisingly well.
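If you want to roll the distance function yourself, it is only a handful of lines. Here is a plain dynamic-programming sketch in C (no attempt at memory optimization; the function name is just for illustration):
#include <string.h>

/* classic edit distance: insert, delete, and substitute each cost 1 */
static int edit_distance(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int d[la + 1][lb + 1];                 /* fine for dictionary-sized words */

    for (int i = 0; i <= la; i++) d[i][0] = i;
    for (int j = 0; j <= lb; j++) d[0][j] = j;

    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int del  = d[i - 1][j] + 1;
            int ins  = d[i][j - 1] + 1;
            int sub  = d[i - 1][j - 1] + cost;
            int best = del < ins ? del : ins;
            d[i][j]  = best < sub ? best : sub;
        }
    return d[la][lb];
}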
You could plug in Lucene, which has a dictionary facility implementing the Levenshtein distance method.
Here's an example from the Wiki, where 2 is the distance.
String[] l=spellChecker.suggestSimilar("sevanty", 2);
//l[0] = "seventy"
http://wiki.apache.org/lucene-java/SpellChecker
An older link http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
The Google SOAP Search APIs do that.
If you have a dictionary stored as a trie, there is a fairly straightforward way to find best-matching entries, where characters can be inserted, deleted, or replaced.
void match(trie t, char* w, string s, int budget){
    if (budget < 0) return;
    if (*w == '\0') print s;
    foreach (char c, subtrie t1 in t){
        /* try matching or replacing c */
        match(t1, w+1, s+c, (*w == c ? budget : budget-1));
        /* try deleting c */
        match(t1, w, s, budget-1);
    }
    /* try inserting *w */
    match(t, w+1, s + *w, budget-1);
}
The idea is that first you call it with a budget of zero, and see if it prints anything out. Then try a budget of 1, and so on, until it prints out some matches. The bigger the budget the longer it takes. You might want to only go up to a budget of 2.
Added: It's not too hard to extend this to handle common prefixes and suffixes. For example, English prefixes like "un", "anti" and "dis" can be in the dictionary, and can then link back to the top of the dictionary. For suffixes like "ism", "'s", and "ed" there can be a separate trie containing just the suffixes, and most words can link to that suffix trie. Then it can handle strange words like "antinationalizationalization".