Related
I am trying to index multiple CSV files with different "schemas" in a Solr index. There's possibly some common schema elements (header columns) across these CSVs . My requirement is to be able to provide search across these CSVs amongst other items.
From what I understand, one way to index would be to treat the entire CSV as a giant text string and index that. I am not sure what searchability aspects get impacted if I index that way.
The other way is basically define a common schema and then programmatically extract the columns from the doc and index line by line with the caveat that if a file doesn't have any common schema I may not be able to index it. (BTW, this last part maybe a non-starter for me but just lets indulge the possibility for now)
Are there any other ways ? Is there any advantage to one over another?
BTW, I tried the schemaless mode but it doesn't work for me. I can index the first file but the moment I do the next file and it has some different columns, its giving back an error. Is this expected behaviour or am I doing something wrong?
Appreciate any pointers, thanks!
Update: the error with the schemaless mode is "Invalid date format". After doing some research, it seems like this is a different issue than what I'd thought, caused because Solr is autodetecting the data to be a date and it expects it to be in UTC format and its not. Is there any way for me to turn off autodetection of dates?
I'm using laravel and I have currently more than 1 json column in a table. Should I put all json stuff into one column or is it ok performance wise to use multiple columns for json stuff?
For example, I have these two:
$table->json('seo')->nullable();
$table->json('settings')->nullable();
Or would it be better to have one json column named meta and there store settings and seo?
$table->json('meta')->nullable();
Any recommendations? And if you have give me some explanations of why or why not.
There are no penalties for having multiple JSON columns. No more than having multiple BLOB columns. I mean, it takes just a few more bytes per row. But that's probably a trivial difference relative to the size of the JSON data. I wouldn't worry about it.
What you should think about, however, is defining some kind of rules of what fields you are going to store in each JSON document. What counts as seo versus settings? If you had to pass this coding project to another programmer, would they know how to use these two columns?
In MySQL 5.7 a new data type for storing JSON data in MySQL tables has been
added. It will obviously be a great change in MySQL. They listed some benefits
Document Validation - Only valid JSON documents can be stored in a
JSON column, so you get automatic validation of your data.
Efficient Access - More importantly, when you store a JSON document in a JSON column, it is not stored as a plain text value. Instead, it is stored
in an optimized binary format that allows for quicker access to object
members and array elements.
Performance - Improve your query
performance by creating indexes on values within the JSON columns.
This can be achieved with “functional indexes” on virtual columns.
Convenience - The additional inline syntax for JSON columns makes it
very natural to integrate Document queries within your SQL. For
example (features.feature is a JSON column): SELECT feature->"$.properties.STREET" AS property_street FROM features WHERE id = 121254;
WOW ! they include some great features. Now it is easier to manipulate data. Now it is possible to store more complex data in column.
So MySQL is now flavored with NoSQL.
Now I can imagine a query for JSON data something like
SELECT * FROM t1
WHERE JSON_EXTRACT(data,"$.series") IN
(
SELECT JSON_EXTRACT(data,"$.inverted")
FROM t1 | {"series": 3, "inverted": 8}
WHERE JSON_EXTRACT(data,"$.inverted")<4 );
So can I store huge small relations in few json colum? Is it good? Does it break normalization. If this is possible then I guess it will act like NoSQL in a MySQL column. I really want to know more about this feature. Pros and cons of MySQL JSON data type.
SELECT * FROM t1
WHERE JSON_EXTRACT(data,"$.series") IN ...
Using a column inside an expression or function like this spoils any chance of the query using an index to help optimize the query. The query shown above is forced to do a table-scan.
The claim about "efficient access" is misleading. It means that after the query examines a row with a JSON document, it can extract a field without having to parse the text of the JSON syntax. But it still takes a table-scan to search for rows. In other words, the query must examine every row.
By analogy, if I'm searching a telephone book for people with first name "Bill", I still have to read every page in the phone book, even if the first names have been highlighted to make it slightly quicker to spot them.
MySQL 5.7 allows you to define a virtual column in the table, and then create an index on the virtual column.
ALTER TABLE t1
ADD COLUMN series AS (JSON_EXTRACT(data, '$.series')),
ADD INDEX (series);
Then if you query the virtual column, it can use the index and avoid the table-scan.
SELECT * FROM t1
WHERE series IN ...
This is nice, but it kind of misses the point of using JSON. The attractive part of using JSON is that it allows you to add new attributes without having to do ALTER TABLE. But it turns out you have to define an extra (virtual) column anyway, if you want to search JSON fields with the help of an index.
But you don't have to define virtual columns and indexes for every field in the JSON document—only those you want to search or sort on. There could be other attributes in the JSON that you only need to extract in the select-list like the following:
SELECT JSON_EXTRACT(data, '$.series') AS series FROM t1
WHERE <other conditions>
I would generally say that this is the best way to use JSON in MySQL. Only in the select-list.
When you reference columns in other clauses (JOIN, WHERE, GROUP BY, HAVING, ORDER BY), it's more efficient to use conventional columns, not fields within JSON documents.
I presented a talk called How to Use JSON in MySQL Wrong at the Percona Live conference in April 2018. I'll update and repeat the talk at Oracle Code One in the fall.
There are other issues with JSON. For example, in my tests it required 2-3 times as much storage space for JSON documents compared to conventional columns storing the same data.
MySQL is promoting their new JSON capabilities aggressively, largely to dissuade people against migrating to MongoDB. But document-oriented data storage like MongoDB is fundamentally a non-relational way of organizing data. It's different from relational. I'm not saying one is better than the other, it's just a different technique, suited to different types of queries.
You should choose to use JSON when JSON makes your queries more efficient.
Don't choose a technology just because it's new, or for the sake of fashion.
Edit: The virtual column implementation in MySQL is supposed to use the index if your WHERE clause uses exactly the same expression as the definition of the virtual column. That is, the following should use the index on the virtual column, since the virtual column is defined AS (JSON_EXTRACT(data,"$.series"))
SELECT * FROM t1
WHERE JSON_EXTRACT(data,"$.series") IN ...
Except I have found by testing this feature that it does NOT work for some reason if the expression is a JSON-extraction function. It works for other types of expressions, just not JSON functions. UPDATE: this reportedly works, finally, in MySQL 5.7.33.
The following from MySQL 5.7 brings sexy back with JSON sounds good to me:
Using the JSON Data Type in MySQL comes with two advantages over
storing JSON strings in a text field:
Data validation. JSON documents will be automatically validated and
invalid documents will produce an error. Improved internal storage
format. The JSON data is converted to a format that allows quick read
access to the data in a structured format. The server is able to
lookup subobjects or nested values by key or index, allowing added
flexibility and performance.
...
Specialised flavours of NoSQL stores
(Document DBs, Key-value stores and Graph DBs) are probably better
options for their specific use cases, but the addition of this
datatype might allow you to reduce complexity of your technology
stack. The price is coupling to MySQL (or compatible) databases. But
that is a non-issue for many users.
Note the language about document validation as it is an important factor. I guess a battery of tests need to be performed for comparisons of the two approaches. Those two being:
Mysql with JSON datatypes
Mysql without
The net has but shallow slideshares as of now on the topic of mysql / json / performance from what I am seeing.
Perhaps your post can be a hub for it. Or perhaps performance is an after thought, not sure, and you are just excited to not create a bunch of tables.
From my experience, JSON implementation at least in MySql 5.7 is not very useful due to its poor performance.
Well, it is not so bad for reading data and validation. However, JSON modification is 10-20 times slower with MySql that with Python or PHP.
Lets imagine very simple JSON:
{ "name": "value" }
Lets suppose we have to convert it to something like that:
{ "name": "value", "newName": "value" }
You can create simple script with Python or PHP that will select all rows and update them one by one. You are not forced to make one huge transaction for it, so other applications will can use the table in parallel. Of course, you can also make one huge transaction if you want, so you'll get guarantee that MySql will perform "all or nothing", but other applications will most probably not be able to use database during transaction execution.
I have 40 millions rows table, and Python script updates it in 3-4 hours.
Now we have MySql JSON, so we don't need Python or PHP anymore, we can do something like that:
UPDATE `JsonTable` SET `JsonColumn` = JSON_SET(`JsonColumn`, "newName", JSON_EXTRACT(`JsonColumn`, "name"))
It looks simple and excellent. However, its speed is 10-20 times slower than Python version, and it is single transaction, so other applications can not modify the table data in parallel.
So, if we want to just duplicate JSON key in 40 millions rows table, we need to not use table at all during 30-40 hours. It has no sence.
About reading data, from my experience direct access to JSON field via JSON_EXTRACT in WHERE is also extremelly slow (much slower that TEXT with LIKE on not indexed column). Virtual generated columns perform much faster, however, if we know our data structure beforehand, we don't need JSON, we can use traditional columns instead. When we use JSON where it is really useful, i. e. when data structure is unknown or changes often (for example, custom plugin settings), virtual column creation on regular basis for any possible new columns doesn't look like good idea.
Python and PHP make JSON validation like a charm, so it is questionable do we need JSON validation on MySql side at all. Why not also validate XML, Microsoft Office documents or check spelling? ;)
I got into this problem recently, and I sum up the following experiences:
1, There isn't a way to solve all questions.
2, You should use the JSON properly.
One case:
I have a table named: CustomField, and it must two columns: name, fields.
name is a localized string, it content should like:
{
"en":"this is English name",
"zh":"this is Chinese name"
...(other languages)
}
And fields should be like this:
[
{
"filed1":"value",
"filed2":"value"
...
},
{
"filed1":"value",
"filed2":"value"
...
}
...
]
As you can see, both the name and the fields can be saved as JSON, and it works!
However, if I use the name to search this table very frequently, what should I do? Use the JSON_CONTAINS,JSON_EXTRACT...? Obviously, it's not a good idea to save it as JSON anymore, we should save it to an independent table:CustomFieldName.
From the above case, I think you should keep these ideas in mind:
Why MYSQL support JSON?
Why you want to use JSON? Did your business logic just need this? Or there is something else?
Never be lazy
Thanks
Strong disagree with some of things that are said in other answers (which, to be fair, was a few years ago).
We have very carefully started to adopt JSON fields with a healthy skepticism. Over time we've been adding this more.
This generally describes the situation we are in:
Like 99% of applications out there, we are not doing things at a massive scale. We work with many different applications and databases, the majority of these are capable of running on modest hardware.
We have processes and know-how in place to make changes if performance does become a problem.
We have a general idea of which tables are going to be large and think carefully about how we optimize queries for them.
We also know in which cases this is not really needed.
We're pretty good at data validation and static typing at the application layer.
Lastly,
When we use JSON for storing complex data, that data is never referenced directly by other tables. We also tend to never need to use them in where clauses in hot paths.
So with all this in mind, using a little JSON field instead of 1 or more tables vastly reduces the complexity of queries and data model. Removing this complexity makes it easier to write certain queries, makes our code simpler and just generally saves time.
Complexity and performance is something that needs to be carefully balanced. JSON fields should not be blindly applied, but for the cases where this works it's fantastic.
'JSON fields don't perform well' is a valid reason to not use JSON fields, if you are at a place where that performance difference matters.
One specific example is that we have a table where we store settings for video transcoding. The settings table has 1 'profile' per row, and the settings themselves have a maximum nesting level of 4 (arrays and objects).
Despite this being a large database overall, there's only a few hundreds of these records in the database. Suggesting to split this into 5 tables would yield no benefit and lots of pain.
This is an extreme example, but we have plenty of others (with more rows) where the decision to use JSON fields is a few years in the past, and hasn't yet caused an issue.
Last point: it is now possible to directly index on JSON fields.
Let's assume I need to store some data of unknown amount within a database table. I don't want to create extra tables, because this will take more time to get the data. The amount of data can be different.
My initial thought was to store data in a key1=value1;key2=value2;key3=value3 format, but the problem here is that some value can contain ; in its body. What is the best separator in this case? What other methods can I use to be able to store various data in a single row?
The example content of the row is like data=2012-05-14 20:07:45;text=This is a comment, but what if I contain a semicolon?;last_id=123456 from which I can then get through PHP an array with corresponding keys and values after correctly exploding row text with a seperator.
First of all: You never ever store more than one information in only one field, if you need to access them separately or search by one of them. This has been discussed here quite a few times.
Assuming you allwas want to access the complete collection of information at once, I recommend to use the native serialization format of your development environment: e.g. if it is PHP, use serialze().
If it is cross-plattform, JSON might be a way to go: Good JSON encoding/decoding libraries exist for something like all environments out there. The same is true for XML, but int his context the textual overhead of XML is going to bite a bit.
On a sidenote: Are you sure, that storing the data in additional tables is slower? You might want to benchmark that before finally deciding.
Edit:
After reading, that you use PHP: If you don't want to put it in a table, stick with serialize() / unserialize() and a MEDIUMTEXT field, this works perfectly, I do it all the time.
EAV (cringe) is probably the best way to store arbitrary values like you want, but it sounds like you're firmly against additional tables for whatever reason. In light of that, you could just save the result of json_encode in the table. When you read it back, just json_decode to get it back into an array.
Keep in mind that if you ever need to search for anything in this field, you're going to have to use a SQL LIKE. If you never need to search this field or join it to anything, I suppose it's OK, but if you do, you've totally thrown performance out the window.
it can be the quotes who separate them .
key1='value1';key2='value2';key3='value3'
if not like that , give your sql example and we can see how to do it.
Are there any open source or commercial tools available that allow for text fragment indexing of database contents and can be queried from Java?
Background of the question is a large MySQL database table with several hundred thousand records, containing several VARCHAR columns. In these columns people would like to search for fragments of the contents, so a fulltext index (which is based on word boundaries) would not help.
EDIT: [Added to make clear why these first suggestions would not solve the problem:]
This is why MySQL's built in fulltext index will not do the job, and neither will Lucene or Sphinx, all of which were suggested in the answers. I already looked at both those, but as far as I can tell, these are based on indexing words, excluding stop words and doing all sorts of sensible things for a real fulltext search. However this is not suitable, because I might be looking for a search term like "oison" which must match "Roisonic Street" as well as "Poison-Ivy". The key difference here is that the search term is just a fragment of the column content, that need not be delimited by any special characters or white space.
EDIT2: [Added some more background info:]
The requested feature that is to be implemented based on this is a very loose search for item descriptions in a merchandise management system. Users often do not know the correct item number, but only part of the name of the item. Unfortunately the quality of these descriptions is rather low, they come from a legacy system and cannot be changed easily. If for example people were searching for a sledge hammer they would enter "sledge". With a word/token based index this would not find matches that are stored as "sledgehammer", but only those listen "sledge hammer". There are all kinds of weird variances that need to be covered, making a token based approach impractical.
Currently the only thing we can do is a LIKE '%searchterm%' query, effectively disabling any index use and requiring lots of resources and time.
Ideally any such tool would create an index that allowed me to get results for suchlike queries very quickly, so that I could implement a spotlight-like search, only retrieving the "real" data from the MySQL table via the primary key when a user picks a result record.
If possible the index should be updatable (without needing a full rebuild), because data might change and should be available for search immediately by other clients.
I would be glad to get recommendations and/or experience reports.
EDIT3: Commercial solution found that "just works"
Even though I got a lot of good answers for this question, I wanted to note here, that in the end we went with a commercial product called "QuickFind", made and sold by a German company named "HMB Datentechnik". Please note that I am not affiliated with them in any way, because it might appear like that when I go on and describe what their product can do. Unfortunately their website looks rather bad and is German only, but the product itself is really great. I currently have a trial version from them - you will have to contact them, no downloads - and I am extremely impressed.
As there is no comprehensive documentation available online, I will try and describe my experiences so far.
What they do is build a custom index file based on database content. They can integrate via ODBC, but from what I am told customers rarely do that. Instead - and this is what we will probably do - you generate a text export (like CSV) from your primary database and feed that to their indexer. This allows you to be completely independent of the actual table structure (or any SQL database at all); in fact we export data joined together from several tables. Indexes can be incrementally updated later on the fly.
Based on that their server (a mere 250kb or so, running as a console app or Windows service) serves listens for queries on a TCP port. The protocol is text based and looks a little "old", but it is simple and works. Basically you just pass on which of the available indexes you want to query and the search terms (fragments), space delimited.
There are three output formats available, HTML/JavaScript array, XML or CSV. Currently I am working on a Java wrapper for the somewhat "dated" wire protocol. But the results are fantastic: I currently have a sample data set of approximately 500.000 records with 8 columns indexed and my test application triggers a search across all 8 columns for the contents of a JTextField on every keystroke while being edited and can update the results display (JTable) in real-time! This happens without going to the MySQL instance the data originally came from. Based on the columns you get back, you can then ask for the "original" record by querying MySQL with the primary key of that row (needs to be included in the QuickFind index, of course).
The index is about 30-40% the size of the text export version of the data. Indexing was mainly bound by disk I/O speed; my 500.000 records took about a minute or two to be processed.
It is hard to describe this as I found it even hard to believe when I saw an in-house product demo. They presented a 10 million row address database and searched for fragments of names, addresses and phone numbers and when hitting the "Search" button, results came back in under a second - all done on a notebook! From what I am told they often integrate with SAP or CRM systems to improve search times when call center agents just understand fragments of the names or addresses of a caller.
So anyway, I probably won't get much better in describing this. If you need something like this, you should definitely go check this out. Google Translate does a reasonably good job translating their website from German to English, so this might be a good start.
This may not be what you want to hear, because I presume you are trying to solve this with SQL code, but Lucene would be my first choice. You can also build up fairly clever ranking and boosting techniques with additional tools. Lucene is written in Java so it should give you exactly the interface you need.
If you were a Microsoft shop, the majority of what you're looking for is built into SQL Server, and wildcards can be enabled which will give you the ability to do partial word matches.
In Lucene and Lucene.Net, you can use wildcard matches if you like. However, it's not supported to use wildcards as the first symbol in a search. If you want the ability to use first character wildcards, you'll probably need to implement some sort of trie-based index on your own, since it's an expensive operation in many cases to filter the set of terms down to something reasonable for the kind of index most commonly needed for full text search applications, where suffix stemming is generally more valuable.
You can apparently alter the Query Parser instance in Lucene to override this rule by setting setAllowLeadingWildcard to true.
I'm fairly sure that wildcard-on-both-ends-of-a-word searches are inherently inefficient. Skip lists are sometimes used to improve performance on such searches with plaintext, but I think you're more likely to find an implementation like that in something like grep than a generalized text indexing tool.
There are other solutions for the problem that you describe where one word may occur spelled as two, or vice versa. Fuzzy queries are supported in Lucene, for example. Orthographic and morphological variants can be handled using either by providing a filter that offers suggestions based on some sort of Bayesian mechanism, or by indexing tricks, namely, taking a corpus of frequent variants and stuffing the index with those terms. I've even seen knowledge from structured data stuffed into the full text engine (e.g. adding city name and the word "hotel" to records from the hotel table, to make it more likely that "Paris Hotels" will include a record for the pension-house Caisse des Dépôts.) While not exactly a trivial problem, it's manageable without destroying the advantages of word-based searches.
I haven't had this specific requirement myself, but my experience tells me Lucene can do the trick, though perhaps not standalone. I'd definitely use it through Solr as described by Michael Della Bitta in the first answer. The link he gave was spot on - read it for more background.
Briefly, Solr lets you define custom FieldTypes. These consist of an index-time Analyzer and a query-time Analyzer. Analyzers figure out what to do with the text, and each consists of a Tokenizer and zero to many TokenFilters. The Tokenizer splits your text into chunks and then each TokenFilter can add, subtract, or modify tokens.
The field can thus end up indexing something quite different from the original text, including multiple tokens if necessary. So what you want is a multiple-token copy of your original text, which you query by sending Lucene something like "my_ngram_field:sledge". No wildcards involved :-)
Then you follow a model similar to the prefix searching offered up in the solrconfig.xml file:
<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
The EdgeNGramFilterFactory is how they implement prefix matching for search box autocomplete. It takes the tokens coming from the previous stages (single whitespace-delimited words transformed into lower case) and fans them out into every substring on the leading edge. sledgehammer = s,sl,sle,sled,sledg,sledge,sledgeh, etc.
You need to follow this pattern, but replace the EdgeNGramFilterFactory with your own which does all NGrams in the field. The default org.apache.solr.analysis.NGramFilterFactory is a good start, but it does letter transpositions for spell checking. You could copy it and strip that out - it's a pretty simple class to implement.
Once you have your own FieldType (call it ngram_text) using your own MyNGramFilterFactory, just create your original field and the ngram field like so:
<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_ngrams" type="ngram_text" indexed="true" stored="false"/>
Then tell it to copy the original field into the fancy one:
<copyField source="title" dest="title_ngrams"/>
Alright, now when you search "title_ngrams:sledge" you should get a list of documents that contain this. Then in your field list for the query you just tell it to retrieve the field called title rather than the field title_ngrams.
That should be enough of a nudge to allow you to fit things together and tune it to astonishing performance levels rather easily. At an old job we had a database with over ten million products with large HTML descriptions and managed to get Lucene to do both the standard query and the spellcheck in under 200ms on a mid-sized server handling several dozen simultaneous queries. When you have a lot of users, caching kicks in and makes it scream!
Oh, and incremental (though not real-time) indexing is a cinch. It can even do it under high loads since it creates and optimizes the new index in the background and autowarms it before swapping it in. Very slick.
Good luck!
If your table is MyISAM, you can use MySQL's full text search capabilites: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
If not, the "industry standard" is http://www.sphinxsearch.com/
Some ideas on what to do if you are using InnoDB: http://www.mysqlperformanceblog.com/2009/09/10/what-to-do-with-mysql-full-text-search-while-migrating-to-innodb/
Also, a good presentation that introduces Sphinx and explains architecture+usage
http://www.scribd.com/doc/2670976/Sphinx-High-Performance-Full-Text-Search-for-MySQL-Presentation
Update
Having read your clarification to the question -- Sphinx can do substring matches. You need to set "enable-star" and create an infix index with the appropriate min_infix_length (1 will give you all possible substrings, but obviously the higher the set it, the smaller your index will be, and the faster your searches). See http://sphinxsearch.com/docs/current.html for details.
I'd use Apache Solr. The indexing strategy is entirely tunable (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters), can incrementally read directly from your database to populate the index (see DataImportHandler in the same wiki), and can be queried from basically any language that speaks HTTP and XML or something like JSON.
what about using tools such as proposed above (lucene etc.) for full text indexing and having LIKE search for cases, where nothing was found? (i.e. run LIKE only after fulltext indexed search returned zero results)
What you're trying to do is unlikely to ever be all that much faster than LIKE '%searchterm%' without a great deal of custom code. The equivalent of LIKE 'searchterm%' ought to be trivial though. You could do what you're asking by building an index of all possible partial words that aren't covered by the trailing wild-card, but this would result in an unbelievably large index size, and it would be unusually slow for updates. Long tokens would result in Bad Things™. May I ask why you need this? Re: Spotlight... You do realize that Spotlight doesn't do this, right? It's token-based just like every other full-text indexer. Usually query expansion is the appropriate method of getting inexact matches if that's your goal.
Edit:
I had a project exactly like this at one point; part-numbers for all kinds of stuff. We finally settled on searchterm* in Xapian, but I believe Lucene also has the equivalent. You won't find a good solution that handles wild-card searches on either side of the token, but a trailing wild-card is usually more than good enough for what you want, and I suspect you'll find that users adapt to your system fairly quickly if they have any control over cleaning up the data. Combine it with query expansion (or even limited token expansion) and you should be pretty well set. Query expansion would convert a query for "sledgehammer" into "sledgehammer* OR (sledge* hammer*)" or something similar. Not every query will work, but people are already pretty well trained to try related queries when something doesn't work, and as long as at least one or two obvious queries come up with the results they expect, you should be OK. Your best bet is still to clean up the data and organize it better. You'd be surprised how easy this ends up being if you version everything and implement an egalitarian edit policy. Maybe let people add keywords to an entry and be sure to index those, but put limits on how many can be set. Too many and you may actually degrade the search results.
Shingle search could do the trick.
http://en.wikipedia.org/wiki/W-shingling
For example, if you use 3-character shingles, you can split "Roisonic" to: "roi", "son", "ic ", and store all three values, associating them with original entry. When searching for "oison", you first will search for "ois", "iso", "son". First you fuzzy-match all entries by shingles (finding the one with "son"), and then you can refine the search by using exact string matching.
Note that 3-character shingle require the fragment in query to be at least 5 characters long, 4-char shingle requires 7-char query and so on.
The exact answer to your question is right here Whether it will perform sufficiently well for the size of your data is another question.
I'm pretty sure Mysql offers a fulltext option, and it's probably also possible to use Lucene.
See here for related comments
Best efficient way to make a fulltext search in MySQL
A "real" full text index using parts of a word would be many times bigger than the source text and while the search may be faster any update or insert processing would be horibly slow.
You only hope is if there is some sort of pattern to the "mistakes' made. You could apply a set of "AI" type rules to the incoming text and produce cannonical form of the text which you could then apply a full text index to. An example for a rule could be to split a word ending in hammer into two words s/(\w?)(hammer)/\1 \2/g or to change "sledg" "sled" and "schledge" to "sledge". You would need to apply the same set of rules to the query text. In the way a product described as "sledgehammer" could be matched by a search for ' sledg hammer'.