Is HBase meaningful if it's not running in a distributed environment? - mysql

I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term identifiers as string values rather than as foreign keys to other tables. I'm rewriting the software and looking for better ways of storing the data.
Looking at the way HBase works, this seems to fit the schema rather well. Instead of storing lots of triplets, I could map document to {term => weight}.
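For what it's worth, here is a minimal sketch of that layout with the HBase Java client: one row per document, one column per term, the weight as the cell value. The table name doc_terms, the column family t, and the sample values are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DocTermStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("doc_terms"))) {
            Put put = new Put(Bytes.toBytes("doc-42"));      // row key = document id
            // column family "t", column qualifier = term, cell value = weight
            put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("hbase"), Bytes.toBytes(0.37d));
            put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("mysql"), Bytes.toBytes(0.12d));
            table.put(put);
        }
    }
}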
I'm doing this on a single node, so I don't care about distributed nodes etc. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is analogous to what I'm doing). My question is really how a single HBase node would compare with a single MySQL node. I'm coming from Scala, so might a direct Java API have an edge over JDBC and MySQL's parsing of each query?
My primary concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably end up putting the data back into MySQL for live-querying because I need to do some calculations which are better done within MySQL.
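(Since insertion speed is the concern, one MySQL-side lever worth noting is batched inserts over JDBC. The sketch below is only illustrative; the table layout, connection details, and the Connector/J rewriteBatchedStatements option are assumptions about the setup.)

import java.sql.*;

public class TripletLoader {
    public static void load(Iterable<Object[]> triplets) throws SQLException {
        // rewriteBatchedStatements=true lets Connector/J collapse the batch into multi-row INSERTs.
        String url = "jdbc:mysql://localhost/index_db?rewriteBatchedStatements=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            conn.setAutoCommit(false);                        // commit per batch, not per row
            String sql = "INSERT INTO doc_term_weight (document, term, weight) VALUES (?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int pending = 0;
                for (Object[] t : triplets) {                 // t = {document, term, weight}
                    ps.setString(1, (String) t[0]);
                    ps.setString(2, (String) t[1]);
                    ps.setDouble(3, (Double) t[2]);
                    ps.addBatch();
                    if (++pending % 1000 == 0) ps.executeBatch();
                }
                ps.executeBatch();                            // flush the remainder
                conn.commit();
            }
        }
    }
}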
I will try prototyping both, but I'm sure the community can give me some valuable insight into this.

Use the right tool for the job.
There are a lot of anti-RDBMS or BASE systems (Basically Available, Soft state, Eventually consistent), as opposed to ACID (Atomicity, Consistency, Isolation, Durability), to choose from.
I've used traditional RDBMSs, and though you can store CLOBs/BLOBs, they do not have built-in indexes customized specifically for searching those objects.
You want to do most of the work (calculating the weighted frequency for each tuple found) when inserting a document; a sketch of that insert-time weighting follows below.
You might also want to do some work scoring the usefulness of each (documentId, searchWord) pair after each search, so that you can give better and better results each time.
You also want to store a score or weight for each search, and weighted scores for similarity to other searches. It's likely that some searches are more common than others, and that users are not phrasing their search query correctly even though they mean to do a common search.
Inserting a document should also cause some change to the search weight indexes.
The more I think about it, the more complex the solution becomes. You have to start with a good design first; the more factors your design anticipates, the better the outcome.
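As a concrete illustration of doing that weighting work at insert time, here is a minimal sketch in Java; the naive tokenizer and the plain normalized term frequency are assumptions for illustration, not a prescription.

import java.util.*;

public class TermWeights {
    // Tokenize a document once at insert time and compute a normalized term
    // frequency per term, so that searches only ever read precomputed weights.
    public static Map<String, Double> weigh(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> weights = new HashMap<>();
        counts.forEach((term, count) -> weights.put(term, (double) count / total));
        return weights;   // term -> weight, stored alongside the document id
    }
}

Each (document, term, weight) triplet can then be written out in one pass, whichever datastore ends up holding it.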

MapReduce seems like a great way of generating the tuples. If you can get a Scala job into a jar file (not sure, since I've not used Scala before and am a JVM n00b), it'd be a simple matter to send it along and write a bit of a wrapper to run it on the MapReduce cluster.
As for storing the tuples after you're done, you also might want to consider a document-based database like MongoDB if you're just storing tuples.
In general, it sounds like you're doing something more statistical with the texts... Have you considered simply using Lucene or Solr to do what you're doing instead of writing your own?

Related

MySQL - Best methods to provide fast Dynamic filter support for large-scale database record lists?

I am curious what techniques Database Developers and Architects use to create dynamic filter data response Stored Procedures (or Functions) for large-scale databases.
For example, let's take a database with millions of people in it, and we want to provide a stored procedure "get-person-list" which takes a JSON parameter. Within this JSON parameter, we can define filters such as $.filter.name.first, $.filter.name.last, $.filter.phone.number, $.filter.address.city, etc.
The frontend (web solution) allows the user to define one or more filters, so the front-end can say "Show me everyone with a First name of Ted and last name of Smith in San Diego."
The payload would look like this:
{
  "filter": {
    "name": {
      "last": "smith",
      "first": "ted"
    },
    "address": {
      "city": "san diego"
    }
  }
}
Now, what would the best technique be to write a single stored procedure capable of handling numerous (dozens or more) filter settings (dynamically) and returning the proper result set all with the best optimization/speed?
Is it possible to do this with CTE, or are prepared statements based on IF/THEN logic (building out the SQL to be executed based on filter value) the best/only real method?
How do big companies with huge databases and thousands of users write their calls to return complex dynamic lists of data as quickly as possible?
Everything Bill wrote is true, and good advice.
I'll take it a little further. You're proposing building a search layer into your system, which is fine.
You're proposing an interface in which you pass a JSON object to code inside the DBMS. That's not fine. That code will either have a bunch of canned queries handling the various search scenarios, or will have a mess of string-handling code that reads JSON, puts together appropriate queries, then uses MySQL's PREPARE statement to run them. From my experience that is, with respect, a really bad idea.
Here's why:
The stored-procedure language has very weak string-handling support compared to host languages. No sprintf. No arrays of strings. No join or implode operators. Clunky regex, and not always present on every server. You're going to need string handling to build search queries.
Stored procedures are trickier to debug, test, deploy, and maintain than ordinary application code. That work requires special skills and special access.
You will need to maintain this code, especially if your system proves successful. You'll add requirements that will require expanding your search capabilities.
It's impossible (seriously, impossible) to know what your actual application usage patterns will be at scale. You surely will, as a consequence of growth, find usage patterns that surprise you. My point is that you can't design and build a search system and then forget about it. It will evolve along with your app.
To keep up with evolving usage patterns, you'll need to refactor some queries and add some indexes. You will be under pressure when you do that work: People will be complaining about performance. See points 1 and 2 above.
MySQL / MariaDB's stored procedures aren't compiled with an optimizing compiler, unlike Oracle and SQL Server's. So there's no compelling performance win.
So don't use a stored procedure for this. Please. Ask me how I know this sometime.
If you need a search module with a JSON interface, implement it in your favorite language (php, C#, nodejs, java, whatever). It will be easier to debug, test, deploy, and maintain.
To write a query that searches a variety of columns, you would have to write dynamic SQL. That is, write code to parse your JSON payload for the filter keys and values, and format SQL expressions in a string that is part of a dynamic SQL statement. Then prepare and execute that string.
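To make that concrete, here is a minimal application-side sketch in Java: filter keys are mapped through a whitelist to column names (the person table, the column names, and the flattened filter keys are invented for illustration), and the values go in as bind parameters, so only the column list is assembled as a string.

import java.sql.*;
import java.util.*;

public class PersonSearch {
    // Whitelist mapping flattened JSON filter paths to real column names.
    private static final Map<String, String> COLUMNS = Map.of(
            "name.first", "first_name",
            "name.last", "last_name",
            "address.city", "city");

    public static void search(Connection conn, Map<String, String> filters) throws SQLException {
        StringBuilder sql = new StringBuilder("SELECT id, first_name, last_name FROM person WHERE 1=1");
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : filters.entrySet()) {
            String column = COLUMNS.get(e.getKey());
            if (column == null) continue;                      // ignore unknown filter keys
            sql.append(" AND ").append(column).append(" = ?"); // column names come from the whitelist only
            values.add(e.getValue());                          // user values are bound, never concatenated
        }
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            for (int i = 0; i < values.size(); i++) ps.setString(i + 1, values.get(i));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) System.out.println(rs.getLong("id") + " " + rs.getString("last_name"));
            }
        }
    }
}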
In general, you can't "optimize for everything." Trying to optimize when you don't know in advance which queries your users will submit is a nigh-impossible task. There's no perfect solution.
The most common method of optimizing search is to create indexes. But you need to know the types of search in advance to create indexes. You need to know which columns will be included, and which types of search operations will be used, because the column order in an index affects optimization.
For N columns, there are N-factorial permutations of columns, but clearly this is impractical because MySQL only allows 64 indexes per table. You simply can't create all the indexes needed to optimize every possible query your users attempt.
The alternative is to optimize queries partially, by indexing a few combinations of columns, and hope that these help the users' most common queries. Use application logs to determine what the most common queries are.
There are other types of indexes. You could use fulltext indexing, either the implementation built in to MySQL, or else supplement your MySQL database with ElasticSearch or similar technology. These provide a different type of index that effectively indexes everything with one index, so you can search based on multiple columns.
There's no single product that is "best." Which fulltext indexing technology meets your needs requires you to evaluate different products. This is some of the unglamorous work of software development — testing, benchmarking, and matching product features to your application requirements. There are few types of work that I enjoy less. It's a toss-up between this and resolving git merge conflicts.
It's also more work to manage copies of data in multiple datastores, making sure data changes in your SQL database are also copied into the fulltext search index. This involves techniques like ETL (extract, transform, load) and CDC (change data capture).
But you asked how big companies with huge databases do this, and this is how.
Input
I do that "all the time". The web page has a <form>. When submitted, I look for fields of that form that were filled in, then build
WHERE this = "..."
AND that = "..."
into the suitable SELECT statement.
Note: I leave out any fields that were not specified in the form; I make sure to escape the strings.
I'm walking through $_GET[] instead of JSON, so it is quite easy.
INDEXing
If you have columns for each possible field, then it is a matter of providing indexes only for the columns most likely to be searched on. (There are practical and even hard-coded limits on indexes.)
If you have stored the attributes in an EAV table structure, you have my condolences. Search the [entity-attribute-value] tag for many other poor souls who wandered into that swamp.
If you store the attributes in JSON, well, that is likely to be an order of magnitude worse than EAV.
If you throw all the information into a FULLTEXT column and use MATCH, then you can get enough speed for "millions" of rows. But it comes with various caveats (word length, stoplist, endings, surprise matches, etc.).
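A rough sketch of that FULLTEXT route over JDBC, with an invented table and column layout (the index itself would be created once, not per query):

import java.sql.*;

public class FulltextSearch {
    // Assumes a one-time: ALTER TABLE person ADD FULLTEXT INDEX ft_person (first_name, last_name, city);
    public static void search(Connection conn, String words) throws SQLException {
        String sql = "SELECT id, first_name, last_name, city FROM person "
                   + "WHERE MATCH(first_name, last_name, city) AGAINST (? IN NATURAL LANGUAGE MODE)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, words);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) System.out.println(rs.getString("last_name") + ", " + rs.getString("city"));
            }
        }
    }
}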
If you would like to discuss further, then scale back your expectations and make a list of likely search keys. We can then discuss what technique might be best.

When to consider Solr

I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3000-10000 records with 20-30 fields on each, and is all stored in MySQL. The traffic profile of the site is likely to be on the small side of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer if it is to be answered in all aspects. There are certainly specifics that may make one system superior to another for a particular use case, but I want to cover the basics here.
I will deal entirely with Solr as an example of the several search engines that function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There is a list of reasons why, but they mostly consist of missing recovery options, lack of ACID transactions, possible complications, etc. If you decide to use Solr, you need to populate your index from another source such as an SQL table. In fact, Solr is perfect for storing documents that include data from several tables and relations that would otherwise require complex joins to construct.
Solr/Lucene provides mind-blowing text analysis / stemming / full text search scoring / fuzziness functions. Things you just cannot do with MySQL. In fact, full text search in MySQL is limited to MyISAM (InnoDB only gained FULLTEXT support in MySQL 5.6), and its scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, scoring results based on phrase proximity, matching accuracy, etc. range from very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store relations and process them. Well, you can of course index the keys of other documents inside a multivalued field of some document, so you can actually store 1:n relations and do it both ways to get n:m, but it's data overhead. Don't get me wrong, it's perfectly fine and efficient for a lot of purposes (for example, a product catalog where you want to store the distributors for products and search only for parts that are available at certain distributors). But you reach the end of the possibilities with HAS / HAS NOT. You can almost not do something like "get all products that are available at at least 3 distributors".
Solr/Lucene has very nice faceting features and post-search analysis. For example: after a very broad search that had 40000 hits, you can display that you would only get 3 hits if you refined your search to the combination of having this field this value and that field that value. Stuff that would need additional queries in MySQL is done efficiently and conveniently.
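For illustration, a faceted query with the SolrJ client might look roughly like this (the core name, the query, and the field names are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            SolrQuery query = new SolrQuery("title:laptop");
            query.setFacet(true);                   // ask Solr to compute facet counts alongside the hits
            query.addFacetField("brand", "color");  // facet on these fields
            query.setRows(10);
            QueryResponse response = solr.query(query);
            System.out.println("Total hits: " + response.getResults().getNumFound());
            for (FacetField field : response.getFacetFields()) {
                for (FacetField.Count c : field.getValues()) {
                    System.out.println(field.getName() + " = " + c.getName() + " (" + c.getCount() + ")");
                }
            }
        }
    }
}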
So let's sum up
The power of Lucene is text searching and analysis. It is also mind-blowingly fast because of the inverted index structure. You can really do a lot of post-processing and satisfy other needs. Although it's document oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and to query. If your application is focused on text searching you should definitely go for Solr/Lucene, unless you have good reasons, like very complex, multi-dimensional range filter queries, to do otherwise.
If you do not have text-search but rather something where you can point and click something but not enter text, good old relational databases are probably a better way to go.
Use Solr if:
You do not want to stress your database.
You want real full text search.
You want lightning-fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magic for full text indexing, which is difficult to achieve with MySQL. A mix of MySQL and Solr can be used: MySQL for CRUD operations and Solr for searches. I have previously worked with one of India's best online real estate classifieds portals, which was using Solr for search (and had previously been using MySQL). The migration reduced the search times manifold.
Solr can be easily integrated with MySQL (a small indexing sketch follows after this list):
Solr full data import can be used for importing data from MySQL tables into Solr collections.
Solr delta import can be scheduled at short intervals to load the latest data from MySQL into Solr collections.
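As an alternative to the data import handler, rows can also be pushed from MySQL into Solr with a few lines of SolrJ. This is only a sketch; the JDBC URL, credentials, core name, table, and field names are all made up for illustration.

import java.sql.*;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class MysqlToSolr {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and schema.
        try (Connection db = DriverManager.getConnection("jdbc:mysql://localhost/news", "user", "password");
             HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, title, body FROM article")) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getLong("id"));
                doc.addField("title", rs.getString("title"));
                doc.addField("body", rs.getString("body"));
                solr.add(doc);          // buffered by Solr until the commit below
            }
            solr.commit();
        }
    }
}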

Performance of MySQL XML functions?

I am pretty excited about the new MySQL XML functions.
Now I can finally embed something like "object oriented" documents in my oldschool relational database.
For an example use-case, consider a user who signs up at your website using Facebook Connect.
You can fetch an object for the user using the Graph API and get nice information. This information, however, can vary vastly. Some fields may or may not be set, some may be added over time, and so on.
Well, if you are just interested in very specific fields (for example friends relations, gender, movies...), you can project them into your relational database schema.
However, using the XML functions you could store the whole object inside a field and then your different models can access the data using the ExtractValue function. You can store everything right away without needing to worry about what you will need later.
But what will the performance be?
For example, I have a table with 50 000 entries which represent users.
I have an enum field that states "male", "female" (or various other genders to be politically correct).
The performance of for example fetching all males will be very fast.
But what about something like WHERE ExtractValue(userdata, '/gender/') = 'male' ?
How will the performance vary if the object gets bigger?
Can I maybe somehow put an index on specific XPath selections?
How do field types work together with these functions and their performance? VARCHAR/BLOB?
Do I need fulltext indexes?
To sum up my question:
MySQL XML functions look great. And I am sure they are really great if you just want to store structured data that you fetch and analyze further in your application.
But how will they hold up in procedures where there are internal scans/sorting/comparisons/calculations performed on them?
Can Mysql replace document oriented databases like CouchDB/Sesame?
What are the gains and trade offs of XML functions?
How and why are they better/worse than a dynamic application that stores various data as attributes?
For example a key/value table with an xpath as key and the value as value connected to the document entity.
Has anyone had any other experiences with it, or noticed anything worth mentioning?
I tend to make comments similar to Pekka's, but I think the reason we cannot laugh this off is your statement "This information however can vary vastly." That means it is not realistic to plan to parse it all and project it into the database.
I cannot answer all of your questions, but I can answer some of them.
Most notably, I cannot tell you about performance on MySQL. I have seen it in SQL Server, tested it, and found that SQL Server performs in-memory XML extractions very slowly; to me it seemed as if it were reading from disk, though that is a bit of an exaggeration. Others may dispute this, but that is what I found.
"Can Mysql replace document oriented databases like CouchDB/Sesame?" This question is a bit over-broad but in your case using MySQL lets you keep ACID compliance for these XML chunks, assuming you are using InnoDB, which cannot be said automatically for some of those document oriented databases.
"How and why are they better/worse than a dynamic application that stores various data as attributes?" I think this is really a matter of style. You are given XML chunks that are (presumably) documented and MySQL can navigate them. If you just keep them as-such you save a step. What would be gained by converting them to something else?
The MySQL docs suggest that the XML file will go into a clob field. Performance may suffer on larger docs. Perhaps then you will identify sub-documents that you want to regularly break out and put into a child table.
Along these same lines, if there are particular sub-docs you know you will want to know about, you can make a child table, "HasDocs", do a little pre-processing, and populate it with names of sub-docs with their counts. This would make for faster statistical analysis and also make it faster to find docs that have certain sub-docs.
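A rough sketch of that pre-processing step: parse the XML chunk once at insert time, count the top-level sub-documents, and store the counts in a child table. The HasDocs name and the column layout are just placeholders.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class HasDocsLoader {
    public static void index(Connection conn, long docId, String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, Integer> counts = new HashMap<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE)
                counts.merge(n.getNodeName(), 1, Integer::sum);   // count each sub-document name
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO HasDocs (doc_id, sub_doc, cnt) VALUES (?, ?, ?)")) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                ps.setLong(1, docId);
                ps.setString(2, e.getKey());
                ps.setInt(3, e.getValue());
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}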
Wish I could say more, hope this helps.

Search Short Fields Using Solr, Etc. or Use Straight-Forward DB Index

My website stores several million entities. Visitors search for entities by typing words contained only in the titles. The titles are at most 100 characters long.
This is not a case of classic document search, where users search inside large blobs.
The fields are very short. Also, the main issue here is performance (and not relevance) seeing as entities are provided "as you type" (auto-suggested).
What would be the smarter route?
Create a MySql table [word, entity_id], have 'word' indexed, and then query using
select entity_id from search_index where word like '[query_word]%'
This obviously requires me to break down each title to its words and add a row for each word.
Use Solr or some similar search engine, which from my reading are more oriented towards full text search.
Also, how will this affect me if I'd like to introduce spelling suggestions in the future?
Thank you!
Pros of a Database-Only Solution:
Less set up and maintenance (you already have a database)
If you want to JOIN your search results with other data or otherwise manipulate them you will be able to do so natively in the database
There will be no time lag (if you periodically sync Solr with your database) or maintenance procedure (if you opt to add/update entries in Solr in real time everywhere you insert them into the database)
Pros of a Solr Solution:
Performance: Solr handles caching and is fast out of the box
Spell check - If you are planning on doing spell check type stuff Solr handles this natively
Set up and tuning of Solr isn't very painful, although it helps if you are familiar with Java application servers
Although you seem to have simple requirements, I think you are getting at having some kind of logic around search for words; Solr does this very well
You may also want to consider future requirements (what if your documents end up having more than just a title field and you want to assign some kind of relevancy? What if you decide to allow people to search the body text of these entities and/or you want to index other document types like MS Word? What if you want to facet search results? Solr is good at all of these).
I am not sure you would need to create an entry for every word in your database, versus just doing a '%[query_word]%' search on the titles themselves, if you are going to create records for each word anyway. It may be simpler to just go with a database for starters, since the requirements seem pretty simple; a sketch of the prefix-search option appears below. It should be fairly easy to scale the database performance.
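For what it's worth, the database-only option sketched in the question could look something like this over JDBC (the search_index table and column names come from the question; CONCAT keeps the prefix match able to use the index on word, though wildcard characters in user input are not escaped here):

import java.sql.*;
import java.util.*;

public class AutoSuggest {
    // Return entity ids whose indexed words start with the typed prefix.
    public static List<Long> suggest(Connection conn, String prefix) throws SQLException {
        List<Long> ids = new ArrayList<>();
        String sql = "SELECT DISTINCT entity_id FROM search_index WHERE word LIKE CONCAT(?, '%') LIMIT 10";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, prefix);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getLong("entity_id"));
            }
        }
        return ids;
    }
}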
I can tell you we use Solr on our site and we love the performance, and we use it even for very simple lookups. However, one thing we are missing is a way to combine Solr data with database data. And there is extra maintenance. At the end of the day there is not an easy answer.

How fast is MySQL compared to a C/C++ program running in the server?

Ok, I have a need to perform some intensive text manipulation operations.
Like concatenating huge texts (say 100 pages of standard text), searching in them, etc., so I am wondering whether MySQL would give me better performance for these specific operations compared to a C program doing the same thing?
Thanks.
Any database is always slower than a flat-file program outside the database.
A database server has overheads that a program reading and writing simple files doesn't have.
In general the database will be slower. But much depends on the type of processing you want to do, the time you can devote to coding, and your coding skills. If the database provides the tools and functionality you need out of the box, then why not give it a try? That should take much less time than coding your own tool. If the performance turns out to be an issue, then write your own solution.
But I think that MySQL will not provide the text manipulation operations you want. In the Oracle world one has Text Mining and Oracle Text.
There are several good responses that I voted up, but here are more considerations from my opinion:
No matter what path you take: indexing the text is critical for speed. There's no way around it. The only choice is how complex you need to make your index for space constraints as well as search query features. For example, a simple b-tree structure is fast and easy to implement but will use more disk space than a trie structure.
Unless you really understand all the issues, or want to do this as a learning exercise, you are going to be much better off using an application that has had years of performance tuning.
That can mean a relational database like MySQL, even though full-text search is a kludge in databases designed for tables of rows and columns. For MySQL, use the MyISAM engine to do the indexing and add a full text index on a "blob" column. (Afaik, the InnoDB engine still doesn't handle full text indexing, so you need to use MyISAM.) For PostgreSQL you can use tsearch.
For a bit more implementation difficulty, though, you'll see the best performance by integrating indexing libraries like Xapian, Hyper Estraier, or (maybe) Lucene into your C program.
Besides better performance, these apps will also give you important features that MySQL full-text searching is missing, such as word stemming, phrase searching, etc., in other words real full-text query parsers that aren't limited to an SQL mindset.
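To give a flavour of what such a library buys you, here is a minimal Lucene sketch (Java rather than C, roughly the Lucene 8.x API; the index path, field names, and sample text are made up):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/text-index"))) {
            // Index a document: tokenization, scoring, and phrase queries come for free.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new StringField("id", "doc-1", Field.Store.YES));
                doc.add(new TextField("body", "one hundred pages of standard text ...", Field.Store.NO));
                writer.addDocument(doc);
            }
            // Search it with a real query parser, not SQL LIKE.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query q = new QueryParser("body", analyzer).parse("\"standard text\"");
                for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("id") + " score=" + hit.score);
                }
            }
        }
    }
}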
Relational databases are normally not good at handling large text data. The performance strength of relational DBs is indexing and automatically generated query plans. Freeform text does not work well with this model.
If you are talking about storing plain text in one DB field and trying to manipulate the data, then C/C++ should be the faster solution. Put simply, MySQL is a much bigger C program than yours would be, so it is bound to be slower at simple tasks like string manipulation :-)
Of course you must use the correct algorithm to reach a good result. There is a useful e-book about string search algorithms with examples included: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
P.S. Benchmark and give us a report :-)
Thanks for all the answers.
I kind of thought that a DB would involve some overhead as well. But what I was thinking is that since my application requires that the text be stored somewhere in the first place anyway, wouldn't the entire process of extracting the text from the DB, passing it to the C program, and writing the result back into the DB overall be less efficient than processing it within the DB?
If you're literally talking about concatenating strings and doing a regexp match, it sounds like something that's worth doing in C/C++ (or Java or C# or whatever your favorite fast high-level language is).
Databases are going to give you other features like persistence, transactions, complicated queries, etc.
With MySQL you can take advantage of full-text indices, which will be hundreds of times faster than directly searching through the text.
MySQL is fairly efficient. You need to consider whether writing your own C program would mean more or less records need to be accessed to get the final result, and whether more or less data needs to be transferred over the network to get the final result.
If either solution will result in the same number of records being accessed, and the same amount transferred over the network, then there probably won't be a big difference either way. If performance is critical then try both and benchmark them (if you don't have time to benchmark both then you probably want to go for whichever is easiest to implement anyway).
MySQL is written in C, so it is not correct to compare it to a C program. It is itself a C program.