How to quickly search big text line separated data file?

How to quickly search big text line separated data file? - csv

I've got a big datafile (>= 300M, csv), and want to query data and return line from it.
I can use this method:
grep pattern data.csv
But it is very slow. I need query several patterns, so maybe index this file is a good solution .
Is there any good commandline tool can do this job?
I know there are:
idutils: query is fast, but return result need to access the datafile make it slow.
solr: not that easy to use.

You are missing a lot specifics from your question that would make it easier to help you. For instance, the fields in your CSV, the pattern(s) you typically search upon, if you are searching against the same data set each time, and search frequency. Assuming you need to search against the same data set in ways not supported by grep and/or idutils, Solr makes sense. For instance if you want to do searching that can return partial matches Solr makes this much easier.
While not a command line solution, standing up Solr and loading it with CSV is a straightforward activity. Depending on the size in bytes of your CSV it make not require any tuning. The hard work is defining a Solr schema.xml definition that indexes your data in a way that supports your various search requirements. In your particular case it sounds like you want to do have some tokenization and perhaps stemming done to your searchable fields as you are already looking to do pattern matching. But that really depends on your specific search needs.

Related

What is generally faster, grepping through files or running a SQL LIKE %x% query through blobs?

Say I'm designing a tool that would save code snippets either in a PostgreSQL/MySQL database or on the file system. I want to search through these snippets. Using a search engine like Sphinx doesn't seem practical because we need exact text matches of code when searching code.
grep and ack and has always worked great, but storing stuff in a database makes a large collection of stuff more manageable in certain ways. I'm wonder what the relative performance of running grep recursively over a tree of directories is compared to running a query like SQL's LIKE or MySQL's REGEXP function over an equivalent number of records with TEXT blobs is.

If you've 1M files to grep through, you will (best I'm aware) go through each one with a regular expression.
For all intents and purposes, you're going to end up doing the same thing over table rows if you mass-query them using a LIKE operator or a regular expression.
My own experience with grep is that I seldom look for something that doesn't contain at least one full word, however, so you might be able to take advantage of a database to reduce the set in which you're searching.
MySQL has native full text search features, but I'd recommend against because they mean you're not using InnoDB.
You can read about those from Postgres here:
http://www.postgresql.org/docs/current/static/textsearch.html
After creating an index on a tsvector column, you can then do your "grep" in two steps, one to immediately find rows that might vaguely qualify, followed by another on your true criteria:
select * from docs where tsvcol ## :tsquery and (regexp at will);
That will be significantly faster than anything grep can do.

I can't compare them but both will take long. My guess is grep will be faster.
But MySQL support full text indexing and searching, which will be faster then grep -- i guess again.
Also, I did not understand, what is the problem with Sphinx or Lucene. Anyway, here's a benchmark for MySQL, Sphinx and Lucene

The internet seems guess that grep uses Boyer-Moore, which will make the query time depend additively (not multiplicatively) on the query size. This isn't that relevant though.
I think it's near-optimal for a one-time search. But in your case you can do better since you have repeated searches, which you can exploit the structure of (e.g. by indexing certain common substrings in your query), as bpgergo hints at.
Also I'm not sure the regular expression engine you are thinking of using is optimized for a non-special query, you could try it and see.
You may wish to keep all the files you're searching through in memory to avoid harddisk-based slowdown. This should work unless you are searching a staggering amount of text.

If you want a full-text index on code I would recommend Russ Cox’ codesearch tools
https://code.google.com/p/codesearch/
This is How Google Code Search Worked
http://swtch.com/~rsc/regexp/regexp4.html

Performance of MySql Xml functions?

I am pretty excited about the new Mysql XMl Functions.
Now I can finally embed something like "object oriented" documents in my oldschool relational database.
For an example use-case consider a user who sings up at your website using facebook connect.
You can fetch an object for the user using the graph api, and get nice information. This information however can vary vastly. Some fields may or may not be set, some may be added over time and so on.
Well if you are just intersted in very special fields (for example friends relations, gender, movies...), you can project them into your relational database scheme.
However using the XMl functions you could store the whole object inside a field and then your different models can access the data using the ExtractValue function. You can store everything right away without needing to worry what you will need later.
But what will the performance be?
For example I have a table with 50 000 entries which represent useres.
I have an enum field that states "male", "female" (or various other genders to be politically correct).
The performance of for example fetching all males will be very fast.
But what about something like WHERE ExtractValue(userdata, '/gender/') = 'male' ?
How will the performance vary if the object gets bigger?
Can I maby somehow put an Index on specified xpath selections?
How do field types work together with this functions/performance. Varchar/blob?
Do I need fulltext indexes?
To sum up my question:
Mysql XML functins look great. And I am sure they are really great if you just want to store structured data that you fetch and analyze further in your application.
But how will they stand battle in procedures where there are internal scans/sorting/comparision/calculations performed on them?
Can Mysql replace document oriented databases like CouchDB/Sesame?
What are the gains and trade offs of XML functions?
How and why are they better/worse than a dynamic application that stores various data as attributes?
For example a key/value table with an xpath as key and the value as value connected to the document entity.
Anyone made any other experiences with it or has noticed something mentionable?

I tend to make comments similar to Pekka's, but I think the reason we cannot laugh this off is your statement "This information however can vary vastly." That means it is not realistic to plan to parse it all and project it into the database.
I cannot answer all of your questions, but I can answer some of them.
Most notably I cannot tell you about performance on MySQL. I have seen it in SQL Server, tested it, and found that SQL Server performs in memory XML extractions very slowly, to me it seemed as if it were reading from disk, but that is a bit of an exaggeration. Others may dispute this, but that is what I found.
"Can Mysql replace document oriented databases like CouchDB/Sesame?" This question is a bit over-broad but in your case using MySQL lets you keep ACID compliance for these XML chunks, assuming you are using InnoDB, which cannot be said automatically for some of those document oriented databases.
"How and why are they better/worse than a dynamic application that stores various data as attributes?" I think this is really a matter of style. You are given XML chunks that are (presumably) documented and MySQL can navigate them. If you just keep them as-such you save a step. What would be gained by converting them to something else?
The MySQL docs suggest that the XML file will go into a clob field. Performance may suffer on larger docs. Perhaps then you will identify sub-documents that you want to regularly break out and put into a child table.
Along these same lines, if there are particular sub-docs you know you will want to know about, you can make a child table, "HasDocs", do a little pre-processing, and populate it with names of sub-docs with their counts. This would make for faster statistical analysis and also make it faster to find docs that have certain sub-docs.
Wish I could say more, hope this helps.

Search Short Fields Using Solr, Etc. or Use Straight-Forward DB Index

My website stores several million entities. Visitors search for entities by typing words contained only in the titles. The titles are at most 100 characters long.
This is not a case of classic document search, where users search inside large blobs.
The fields are very short. Also, the main issue here is performance (and not relevance) seeing as entities are provided "as you type" (auto-suggested).
What would be the smarter route?
Create a MySql table [word, entity_id], have 'word' indexed, and then query using
select entity_id from search_index where word like '[query_word]%
This obviously requires me to break down each title to its words and add a row for each word.
Use Solr or some similar search engine, which from my reading are more oriented towards full text search.
Also, how will this affect me if I'd like to introduce spelling suggestions in the future.
Thank you!

Pro's of a Database Only Solution:
Less set up and maintenance (you already have a database)
If you want to JOIN your search results with other data or otherwise manipulate them you will be able to do so natively in the database
There will be no time lag (if you periodically sync Solr with your database) or maintenance procedure (if you opt to add/update entries in Solr in real time everywhere you insert them into the database)
Pro's of a Solr Solution:
Performance: Solr handles caching and is fast out of the box
Spell check - If you are planning on doing spell check type stuff Solr handles this natively
Set up and tuning of Solr isn't very painful, although it helps if you are familiar with Java application servers
Although you seem to have simple requirements, I think you are getting at having some kind of logic around search for words; Solr does this very well
You may also want to consider future requirements (what if your documents end up having more than just a title field and you want to assign some kind of relevancy? What if you decide to allow people to search the body text of these entities and/or you want to index other document types like MS Word? What if you want to facet search results? Solr is good at all of these).
I am not sure if you would need to create an entry for every word in your database, vs. just '%[query_word]%' search if you are going to create records with each word anyway. It may be simpler to just go with a database for starters, since the requirements seem pretty simple. It should be fairly easy to scale the database performance.
I can tell you we use Solr on site and we love the performance and we use it for even very simple lookups. However, one thing we are missing is a way to combine Solr data with database data. And there is extra maintenance. At the end of the day there is not an easy answer.

Is HBase meaningful if it's not running in a distributed environment?

I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term identifiers as string values than foreign keys to other tables. I'm re-writing the software and looking for better ways of storing the data.
Looking at the way HBase works, this seems to fit the schema rather well. Instead of storing lots of triplets, I could map document to {term => weight}.
I'm doing this on a single node, so I don't care about distributed nodes etc. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is analogous to what I'm doing). My question is really how would a single HBase node compare with a single MySQL node? I'm coming from Scala, so might a direct Java API have an edge over JDBC and MySQL parsing etc each query?
My primary concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably end up putting the data back into MySQL for live-querying because I need to do some calculations which are better done within MySQL.
I will try prototyping both, but I'm sure the community can give me some valuable insight into this.

Use the right tool for the job.
There are a lot of anti-RDBMSs or BASE systems (Basically Available, Soft State, Eventually consistent), as opposed to ACID (Atomicity, Consistency, Isolation, Durability) to choose from here and here.
I've used traditional RDBMSs and though you can store CLOBs/BLOBs, they do
not have built-in indexes customized specifically for searching these objects.
You want to do most of the work (calculating the weighted frequency for
each tuple found) when inserting a document.
You might also want to do some work scoring the usefulness of
each (documentId,searchWord) pair after each search.
That way you can give better and better searches each time.
You also want to store a score or weight for each search and weighted
scores for similarity to other searches.
It's likely that some searches are more common than others and that
the users are not phrasing their search query correctly though they mean
to do a common search.
Inserting a document should also cause some change to the search weight
indexes.
The more I think about it, the more complex the solution becomes.
You have to start with a good design first. The more factors your
design anticipates, the better the outcome.

MapReduce seems like a great way of generating the tuples. If you can get a scala job into a jar file (not sure since I've not used scala before and am a jvm n00b), it'd be a simply matter to send it along and write a bit of a wrapper to run it on the map reduce cluster.
As for storing the tuples after you're done, you also might want to consider a document based database like mongodb if you're just storing tuples.
In general, it sounds like you're doing something more statistical with the texts... Have you considered simply using lucene or solr to do what you're doing instead of writing your own?

How to search for text fragments in a database

Are there any open source or commercial tools available that allow for text fragment indexing of database contents and can be queried from Java?
Background of the question is a large MySQL database table with several hundred thousand records, containing several VARCHAR columns. In these columns people would like to search for fragments of the contents, so a fulltext index (which is based on word boundaries) would not help.
EDIT: [Added to make clear why these first suggestions would not solve the problem:]
This is why MySQL's built in fulltext index will not do the job, and neither will Lucene or Sphinx, all of which were suggested in the answers. I already looked at both those, but as far as I can tell, these are based on indexing words, excluding stop words and doing all sorts of sensible things for a real fulltext search. However this is not suitable, because I might be looking for a search term like "oison" which must match "Roisonic Street" as well as "Poison-Ivy". The key difference here is that the search term is just a fragment of the column content, that need not be delimited by any special characters or white space.
EDIT2: [Added some more background info:]
The requested feature that is to be implemented based on this is a very loose search for item descriptions in a merchandise management system. Users often do not know the correct item number, but only part of the name of the item. Unfortunately the quality of these descriptions is rather low, they come from a legacy system and cannot be changed easily. If for example people were searching for a sledge hammer they would enter "sledge". With a word/token based index this would not find matches that are stored as "sledgehammer", but only those listen "sledge hammer". There are all kinds of weird variances that need to be covered, making a token based approach impractical.
Currently the only thing we can do is a LIKE '%searchterm%' query, effectively disabling any index use and requiring lots of resources and time.
Ideally any such tool would create an index that allowed me to get results for suchlike queries very quickly, so that I could implement a spotlight-like search, only retrieving the "real" data from the MySQL table via the primary key when a user picks a result record.
If possible the index should be updatable (without needing a full rebuild), because data might change and should be available for search immediately by other clients.
I would be glad to get recommendations and/or experience reports.
EDIT3: Commercial solution found that "just works"
Even though I got a lot of good answers for this question, I wanted to note here, that in the end we went with a commercial product called "QuickFind", made and sold by a German company named "HMB Datentechnik". Please note that I am not affiliated with them in any way, because it might appear like that when I go on and describe what their product can do. Unfortunately their website looks rather bad and is German only, but the product itself is really great. I currently have a trial version from them - you will have to contact them, no downloads - and I am extremely impressed.
As there is no comprehensive documentation available online, I will try and describe my experiences so far.
What they do is build a custom index file based on database content. They can integrate via ODBC, but from what I am told customers rarely do that. Instead - and this is what we will probably do - you generate a text export (like CSV) from your primary database and feed that to their indexer. This allows you to be completely independent of the actual table structure (or any SQL database at all); in fact we export data joined together from several tables. Indexes can be incrementally updated later on the fly.
Based on that their server (a mere 250kb or so, running as a console app or Windows service) serves listens for queries on a TCP port. The protocol is text based and looks a little "old", but it is simple and works. Basically you just pass on which of the available indexes you want to query and the search terms (fragments), space delimited.
There are three output formats available, HTML/JavaScript array, XML or CSV. Currently I am working on a Java wrapper for the somewhat "dated" wire protocol. But the results are fantastic: I currently have a sample data set of approximately 500.000 records with 8 columns indexed and my test application triggers a search across all 8 columns for the contents of a JTextField on every keystroke while being edited and can update the results display (JTable) in real-time! This happens without going to the MySQL instance the data originally came from. Based on the columns you get back, you can then ask for the "original" record by querying MySQL with the primary key of that row (needs to be included in the QuickFind index, of course).
The index is about 30-40% the size of the text export version of the data. Indexing was mainly bound by disk I/O speed; my 500.000 records took about a minute or two to be processed.
It is hard to describe this as I found it even hard to believe when I saw an in-house product demo. They presented a 10 million row address database and searched for fragments of names, addresses and phone numbers and when hitting the "Search" button, results came back in under a second - all done on a notebook! From what I am told they often integrate with SAP or CRM systems to improve search times when call center agents just understand fragments of the names or addresses of a caller.
So anyway, I probably won't get much better in describing this. If you need something like this, you should definitely go check this out. Google Translate does a reasonably good job translating their website from German to English, so this might be a good start.

This may not be what you want to hear, because I presume you are trying to solve this with SQL code, but Lucene would be my first choice. You can also build up fairly clever ranking and boosting techniques with additional tools. Lucene is written in Java so it should give you exactly the interface you need.
If you were a Microsoft shop, the majority of what you're looking for is built into SQL Server, and wildcards can be enabled which will give you the ability to do partial word matches.
In Lucene and Lucene.Net, you can use wildcard matches if you like. However, it's not supported to use wildcards as the first symbol in a search. If you want the ability to use first character wildcards, you'll probably need to implement some sort of trie-based index on your own, since it's an expensive operation in many cases to filter the set of terms down to something reasonable for the kind of index most commonly needed for full text search applications, where suffix stemming is generally more valuable.
You can apparently alter the Query Parser instance in Lucene to override this rule by setting setAllowLeadingWildcard to true.
I'm fairly sure that wildcard-on-both-ends-of-a-word searches are inherently inefficient. Skip lists are sometimes used to improve performance on such searches with plaintext, but I think you're more likely to find an implementation like that in something like grep than a generalized text indexing tool.
There are other solutions for the problem that you describe where one word may occur spelled as two, or vice versa. Fuzzy queries are supported in Lucene, for example. Orthographic and morphological variants can be handled using either by providing a filter that offers suggestions based on some sort of Bayesian mechanism, or by indexing tricks, namely, taking a corpus of frequent variants and stuffing the index with those terms. I've even seen knowledge from structured data stuffed into the full text engine (e.g. adding city name and the word "hotel" to records from the hotel table, to make it more likely that "Paris Hotels" will include a record for the pension-house Caisse des Dépôts.) While not exactly a trivial problem, it's manageable without destroying the advantages of word-based searches.

I haven't had this specific requirement myself, but my experience tells me Lucene can do the trick, though perhaps not standalone. I'd definitely use it through Solr as described by Michael Della Bitta in the first answer. The link he gave was spot on - read it for more background.
Briefly, Solr lets you define custom FieldTypes. These consist of an index-time Analyzer and a query-time Analyzer. Analyzers figure out what to do with the text, and each consists of a Tokenizer and zero to many TokenFilters. The Tokenizer splits your text into chunks and then each TokenFilter can add, subtract, or modify tokens.
The field can thus end up indexing something quite different from the original text, including multiple tokens if necessary. So what you want is a multiple-token copy of your original text, which you query by sending Lucene something like "my_ngram_field:sledge". No wildcards involved :-)
Then you follow a model similar to the prefix searching offered up in the solrconfig.xml file:
<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
The EdgeNGramFilterFactory is how they implement prefix matching for search box autocomplete. It takes the tokens coming from the previous stages (single whitespace-delimited words transformed into lower case) and fans them out into every substring on the leading edge. sledgehammer = s,sl,sle,sled,sledg,sledge,sledgeh, etc.
You need to follow this pattern, but replace the EdgeNGramFilterFactory with your own which does all NGrams in the field. The default org.apache.solr.analysis.NGramFilterFactory is a good start, but it does letter transpositions for spell checking. You could copy it and strip that out - it's a pretty simple class to implement.
Once you have your own FieldType (call it ngram_text) using your own MyNGramFilterFactory, just create your original field and the ngram field like so:
<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_ngrams" type="ngram_text" indexed="true" stored="false"/>
Then tell it to copy the original field into the fancy one:
<copyField source="title" dest="title_ngrams"/>
Alright, now when you search "title_ngrams:sledge" you should get a list of documents that contain this. Then in your field list for the query you just tell it to retrieve the field called title rather than the field title_ngrams.
That should be enough of a nudge to allow you to fit things together and tune it to astonishing performance levels rather easily. At an old job we had a database with over ten million products with large HTML descriptions and managed to get Lucene to do both the standard query and the spellcheck in under 200ms on a mid-sized server handling several dozen simultaneous queries. When you have a lot of users, caching kicks in and makes it scream!
Oh, and incremental (though not real-time) indexing is a cinch. It can even do it under high loads since it creates and optimizes the new index in the background and autowarms it before swapping it in. Very slick.
Good luck!

If your table is MyISAM, you can use MySQL's full text search capabilites: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
If not, the "industry standard" is http://www.sphinxsearch.com/
Some ideas on what to do if you are using InnoDB: http://www.mysqlperformanceblog.com/2009/09/10/what-to-do-with-mysql-full-text-search-while-migrating-to-innodb/
Also, a good presentation that introduces Sphinx and explains architecture+usage
http://www.scribd.com/doc/2670976/Sphinx-High-Performance-Full-Text-Search-for-MySQL-Presentation
Update
Having read your clarification to the question -- Sphinx can do substring matches. You need to set "enable-star" and create an infix index with the appropriate min_infix_length (1 will give you all possible substrings, but obviously the higher the set it, the smaller your index will be, and the faster your searches). See http://sphinxsearch.com/docs/current.html for details.

I'd use Apache Solr. The indexing strategy is entirely tunable (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters), can incrementally read directly from your database to populate the index (see DataImportHandler in the same wiki), and can be queried from basically any language that speaks HTTP and XML or something like JSON.

what about using tools such as proposed above (lucene etc.) for full text indexing and having LIKE search for cases, where nothing was found? (i.e. run LIKE only after fulltext indexed search returned zero results)

What you're trying to do is unlikely to ever be all that much faster than LIKE '%searchterm%' without a great deal of custom code. The equivalent of LIKE 'searchterm%' ought to be trivial though. You could do what you're asking by building an index of all possible partial words that aren't covered by the trailing wild-card, but this would result in an unbelievably large index size, and it would be unusually slow for updates. Long tokens would result in Bad Things™. May I ask why you need this? Re: Spotlight... You do realize that Spotlight doesn't do this, right? It's token-based just like every other full-text indexer. Usually query expansion is the appropriate method of getting inexact matches if that's your goal.
Edit:
I had a project exactly like this at one point; part-numbers for all kinds of stuff. We finally settled on searchterm* in Xapian, but I believe Lucene also has the equivalent. You won't find a good solution that handles wild-card searches on either side of the token, but a trailing wild-card is usually more than good enough for what you want, and I suspect you'll find that users adapt to your system fairly quickly if they have any control over cleaning up the data. Combine it with query expansion (or even limited token expansion) and you should be pretty well set. Query expansion would convert a query for "sledgehammer" into "sledgehammer* OR (sledge* hammer*)" or something similar. Not every query will work, but people are already pretty well trained to try related queries when something doesn't work, and as long as at least one or two obvious queries come up with the results they expect, you should be OK. Your best bet is still to clean up the data and organize it better. You'd be surprised how easy this ends up being if you version everything and implement an egalitarian edit policy. Maybe let people add keywords to an entry and be sure to index those, but put limits on how many can be set. Too many and you may actually degrade the search results.

Shingle search could do the trick.
http://en.wikipedia.org/wiki/W-shingling
For example, if you use 3-character shingles, you can split "Roisonic" to: "roi", "son", "ic ", and store all three values, associating them with original entry. When searching for "oison", you first will search for "ois", "iso", "son". First you fuzzy-match all entries by shingles (finding the one with "son"), and then you can refine the search by using exact string matching.
Note that 3-character shingle require the fragment in query to be at least 5 characters long, 4-char shingle requires 7-char query and so on.

The exact answer to your question is right here Whether it will perform sufficiently well for the size of your data is another question.

I'm pretty sure Mysql offers a fulltext option, and it's probably also possible to use Lucene.
See here for related comments
Best efficient way to make a fulltext search in MySQL

A "real" full text index using parts of a word would be many times bigger than the source text and while the search may be faster any update or insert processing would be horibly slow.
You only hope is if there is some sort of pattern to the "mistakes' made. You could apply a set of "AI" type rules to the incoming text and produce cannonical form of the text which you could then apply a full text index to. An example for a rule could be to split a word ending in hammer into two words s/(\w?)(hammer)/\1 \2/g or to change "sledg" "sled" and "schledge" to "sledge". You would need to apply the same set of rules to the query text. In the way a product described as "sledgehammer" could be matched by a search for ' sledg hammer'.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008