I am building a MySQL database that will have roughly 10,000 records. Each record will contain a textual document (a few pages of text in most cases). I want to do all sorts of n-gram counting across the entire database. I have algorithms already written in Python that will do what I want against a directory containing a large number of text files, but to use them I would need to extract 10,000 text files from the database, and I expect that to have performance issues.
I'm a rookie with MySQL, so I'm not sure if it has any built-in features that do n-gram analysis, or whether there are good plugins out there that would do it. Please note that I need to go up to at least 4-grams (preferably 5-grams) in my analysis, so the simple 2-gram plugins I've seen won't work here. I also need to have the ability to remove the stopwords from the textual documents before doing the n-gram counting.
Any ideas from the community?
Thanks,
Ron
My suggestion would be to use a dedicated full-text search index program like lucene/solr, which has much richer and extensible support for this sort of thing. It will require you to learn a bit to get it set up, but it sounds as if you want to mess around at a level that will be difficult to customize in MySQL.
If you really want to prematurely optimize ;) you could translate your python into C and then wrap it with thin mysql UDF wrapper code.
But I'd highly recommend just loading your documents one at a time and running your python scripts on them to populate a mysql table of n-grams. My hammer for every nail at the moment is Django. Its ORM makes interacting with mysql tables and optimizing those interactions a cinch. I'm using it to do statistics in python on multimillion record databases for production sites that have to return gobs of data in less than a second. And any python ORM will make it easier to switch out your database if you find something better than mysql, like PostgreSQL. The best part is that there are lots of python and django tools to monitor all aspects of your app's performance (python execution, mysql load/save, memory/swap). That way you can attack the right problem. It may be that sequential bulk mysql reads aren't what's slowing you down...
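As a rough illustration of the "load one document at a time and count" approach, here is a minimal sketch in plain Python. The names `STOPWORDS`, `tokenize`, `ngrams` and `count_ngrams` are made up for this example, and the stopword list is deliberately tiny. It only assumes that the documents arrive as an iterable of strings, which could be a generator reading rows from a database cursor.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def count_ngrams(documents, max_n=5, stopwords=STOPWORDS):
    """Count 1..max_n-grams over an iterable of document strings.

    Stopwords are dropped before the n-grams are formed, so an n-gram
    can span the gap left by a removed word.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for text in documents:
        tokens = [t for t in tokenize(text) if t not in stopwords]
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts
```

Because `documents` is consumed one string at a time, nothing has to be written out to text files first, and the resulting counters can be saved into an n-gram table afterwards.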
I've looked into Doctrine's built-in search, MySQL myisam fulltext search, Zend_Lucene, and sphinx - but all the nuances and implementation details are making it hard to sort out for me, given that I don't have experience with anything other than the myisam search.
What I really want is something simple that will work with the Zend Framework and Doctrine (MySQL back-end, probably InnoDB). I don't need complex things like word substitutions, auto-complete, and so on (not that I'd be opposed to such things, if it were easy enough and time effective enough to implement).
The main thing is the ability to search for strings across multiple database tables, and multiple fields with some basic search criteria (e.g. user.state = CA AND user.active = 1). The size of the database will start at around 50K+ records (old data being dumped in), the biggest single searchable table would be around 15K records, and it would grow considerably over time.
That said, Zend_Lucene is appealing to me because it is flexible (in case I do need my search solution to grow in the future) and because it can parse MS Office files (which will be uploaded to my application by users). But its flexibility also makes it kind of complicated to set up.
I suppose the most straightforward option would be to just use Doctrine's search capabilities, but I'm not sure if that's going to be able to handle what I need. And I don't know that there is any option out there which is going to combine my desire for simplicity & power.
What search solutions would you recommend I investigate? And why would you think that solution would work well in this situation?
I would recommend using the Solr search engine.
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface (which is really great) and many more features.
It runs in a Java servlet container such as Tomcat.
You can use the solr-php-client to handle queries in php.
Right now I'm developing the prototype of a web application that aggregates a large number of text entries from a large number of users. This data must be frequently displayed back and often updated. At the moment I store the content inside a MySQL database and use the NHibernate ORM layer to interact with the DB. I've got tables defined for users, roles, submissions, tags, notifications, etc. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once the size of our database reaches a significant number. I feel that it may struggle to perform join operations fast enough.
This has made me think about non-relational database systems such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with any of them. I've read some good reviews of MongoDB and it looks interesting. I'm happy to spend the time and learn if one turns out to be the way to go. I'd much appreciate anyone offering pointers or issues to consider when going with a non-relational DBMS.
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availability of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarefied database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (never mind startups) come close to MySQL's limits, and with all due respect (and I am just guessing here), the likelihood that your startup will ever hit the sort of data throughput that would cripple a properly structured, well resourced MySQL db is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than the choice of db is how efficiently data can be cached. An effective cache can dramatically reduce db load and speed up the general responsiveness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
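To make the caching point concrete, here is a minimal sketch of the cache-aside pattern. `TTLCache` is a hypothetical in-process stand-in written for this example, not the memcached client API, and `load_from_db` is whatever function runs the real query.

```python
import time

class TTLCache:
    """In-process stand-in for a cache such as memcached (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expiry_time, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # entry is stale, so treat it as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

    def delete(self, key):
        self._store.pop(key, None)


def cached_fetch(cache, key, load_from_db):
    """Cache-aside read: try the cache first, fall back to the database."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)  # the expensive call, e.g. a SQL query
        cache.set(key, value)
    return value
```

On a write, the application would call `cache.delete(key)` (or overwrite the entry) so that readers do not keep seeing the old value until the TTL runs out.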
So far no one has mentioned PostgreSQL as an alternative to MySQL on the relational side. Be aware that the MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although maybe someone with more legal experience could explain the implications better. On the other hand, linking to a MySQL library is not the same as just connecting to the server and issuing commands; you can do that from closed source.
PostgreSQL is usually the best free replacement for Oracle, and the BSD license should be more business friendly.
Since you prefer a non relational database, consider that the transition will be more dramatic. If you ever need to customize your database, you should also consider the license type factor.
There are three things that really have a deep impact on which one is your best database choice and you do not mention:
The size of your data or if you need to store files within your database.
A huge number of reads and very few (even restricted) writes. In that case, more than a database, you need a directory such as LDAP.
The importance of data distribution and/or replication. Most relational databases can be replicated more or less well, but because of their concept/design they do not handle data distribution as well... but will you handle so much data that it does not fit on one server, or have access rights that need special separate/extra servers?
However, most people will go for a non-relational database just because they do not like learning SQL.
What do you think is a significant amount of data? MySQL, like most relational database engines, can handle rather large amounts of data, given proper indexes and a sane database schema.
Why don't you test how MySQL behaves with a bigger amount of data in your setup? Write some scripts that generate realistic data into a MySQL test database, generate some load on the system, and see if it is fast enough.
Only when it is not fast enough should you start considering optimizing the database or changing to a different database engine.
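A minimal sketch of such a test script is below. To stay self-contained it uses Python's built-in `sqlite3` as a stand-in for the MySQL test database, and the `submissions` table, the helper names and the row sizes are assumptions made up for the example. Against MySQL the same idea applies with a MySQL driver and your real schema.

```python
import random
import sqlite3
import time

def build_test_db(n_rows, n_users=1000, seed=0):
    """Fill an in-memory table with n_rows synthetic submissions."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions "
        "(id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
    rows = ((rng.randrange(n_users), "x" * rng.randint(20, 200))
            for _ in range(n_rows))
    conn.executemany(
        "INSERT INTO submissions (user_id, body) VALUES (?, ?)", rows)
    conn.commit()
    return conn

def time_query(conn, sql, params=(), repeats=20):
    """Return the average wall-clock seconds per execution of one query."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(sql, params).fetchall()
    return (time.perf_counter() - start) / repeats
```

A typical use would be to call `time_query` on a representative query, then add an index (for example on `user_id`) and time it again, increasing `n_rows` until the numbers resemble the data volume you are worried about.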
Be careful with NHibernate: it is easy to build a solution that is nice and easy to code against but performs badly with large amounts of data. For example, whether to use lazy or eager fetching for associations should be carefully considered. I don't mean that you shouldn't use NHibernate, but make sure that you understand how it works, for example what the "n + 1 selects" problem means.
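For readers who have not met the "n + 1 selects" problem, here is a small sketch of it. It is written in Python against an in-memory SQLite database rather than NHibernate, and the `users` / `submissions` schema is invented for the example, but the query pattern an ORM produces with naive lazy loading looks much the same.

```python
import sqlite3

def make_db():
    """Create a tiny example database: three users, three submissions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE submissions (
            id INTEGER PRIMARY KEY,
            user_id INTEGER REFERENCES users(id),
            title TEXT
        );
    """)
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "ann"), (2, "bob"), (3, "cy")])
    conn.executemany("INSERT INTO submissions VALUES (?, ?, ?)",
                     [(1, 1, "a1"), (2, 1, "a2"), (3, 2, "b1")])
    return conn

def titles_lazy(conn):
    """N + 1 pattern: one query for the users, then one more per user."""
    queries = 1
    result = {}
    users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    for user_id, name in users:
        rows = conn.execute(
            "SELECT title FROM submissions WHERE user_id = ? ORDER BY id",
            (user_id,)).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries

def titles_joined(conn):
    """Eager pattern: a single LEFT JOIN fetches everything at once."""
    result = {}
    rows = conn.execute("""
        SELECT u.name, s.title
        FROM users u LEFT JOIN submissions s ON s.user_id = u.id
        ORDER BY u.id, s.id
    """)
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # users without submissions yield a NULL title
            result[name].append(title)
    return result, 1
```

Both functions return the same data, but the lazy version issues one query per user on top of the first, which is what hurts once there are thousands of parent rows and a network round trip for each query.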
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed since at the beginning developer time is more valuable than the CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables, so you can usually just directly store your objects as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because mongo is fast enough for most sites as-is, you can avoid dealing with the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
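As a small illustration of the "blog post with comments" shape, here is a sketch using plain Python dicts and JSON. It does not talk to MongoDB at all, and the field names (`_id`, `title`, `comments`, ...) and helper functions are only assumptions chosen for the example.

```python
import json

def new_post(post_id, title, body):
    """Build a post document with an empty, embedded list of comments."""
    return {"_id": post_id, "title": title, "body": body, "comments": []}

def add_comment(post, author, text):
    """Append a comment inside the post document itself (no join table)."""
    post["comments"].append({"author": author, "text": text})
    return post

post = add_comment(new_post("post-1", "Hello", "First post"), "jane", "Nice!")
serialized = json.dumps(post)  # roughly what a document store would persist
```

In a relational schema the comments would live in a second table and be joined back to the post; here the whole object is stored and loaded as one unit.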
There are a few ways to connect to MongoDB from .Net, but I don't have enough experience with that platform to know which is best:
Norm: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.
For a bit of background - this question deals with a project running on a single small EC2 instance, and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in python and java, which do the heavy lifting. The same machine is running Apache as well.
The data model looks like the following - a large amount of real time data comes in streamed from various networked sensors, and ideally, I'd like to establish a long-poll approach rather than the current poll every 15 minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.
Relational features I would need -
Order by [SliceRange in Cassandra's API seems to satisfy this]
Group by
Manytomany relations between multiple tables [Cassandra SuperColumns seem to do well for one to many]
Sphinx on top of this gives me a nice full text engine, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware on it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).
So essentially, after having read a lot about NOSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are,
On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - something to the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well). Size of tables is around 4-5 million rows, and there are about 5 such tables.
Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense to deploy a one node cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]
If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.
Would it make any sense to just use MySQL as a key value store rather than a relational engine, and go with that? That way I could utilize a large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from Friendfeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
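For illustration only, here is a loose sketch of what "a relational engine used as a key value store" can look like: one table holding opaque serialized entities, plus a separate, application-maintained index table. It uses Python's built-in `sqlite3` and JSON as stand-ins for MySQL and whatever serialization a real system would use, and the table and function names (`entities`, `index_sensor`, `put`, `find_by_sensor`) are assumptions, not taken from the linked post.

```python
import json
import sqlite3
import uuid

def open_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL);
        CREATE TABLE index_sensor (
            sensor_id TEXT, entity_id TEXT,
            PRIMARY KEY (sensor_id, entity_id)
        );
    """)
    return conn

def put(conn, entity):
    """Store the entity as a JSON blob and refresh its index row.

    Adds an "id" field to the dict if it does not have one yet.
    """
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    with conn:  # one transaction for the blob and its index entry
        # INSERT OR REPLACE is SQLite syntax; MySQL has its own upsert forms.
        conn.execute("INSERT OR REPLACE INTO entities VALUES (?, ?)",
                     (entity_id, json.dumps(entity)))
        conn.execute("DELETE FROM index_sensor WHERE entity_id = ?",
                     (entity_id,))
        if "sensor_id" in entity:
            conn.execute("INSERT INTO index_sensor VALUES (?, ?)",
                         (entity["sensor_id"], entity_id))
    return entity_id

def get(conn, entity_id):
    row = conn.execute("SELECT body FROM entities WHERE id = ?",
                       (entity_id,)).fetchone()
    return json.loads(row[0]) if row else None

def find_by_sensor(conn, sensor_id):
    """Look entities up through the index table, then decode the blobs."""
    rows = conn.execute("""
        SELECT e.body FROM index_sensor i
        JOIN entities e ON e.id = i.entity_id
        WHERE i.sensor_id = ? ORDER BY e.id
    """, (sensor_id,))
    return [json.loads(body) for (body,) in rows]
```

New attributes can be added to the stored blobs without a schema change, at the price of keeping every index table up to date in application code.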
Any insights from people who've done a shift would be greatly appreciated!
Thanks.
Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from sql. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.
That said, at a few hundred writes/minute you're going to be fine on mysql for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily) but MySQL is much better at being a relational database. :)
There is no django support for Cassandra (or other nosql database) yet. They are talking about doing something for the next version after 1.2, but based on talking to django devs at pycon, nobody is really sure what that will look like yet.
If you're a relational database developer (as I am), I'd suggest/point out:
Get some experience working with Cassandra before you commit to its use on a production system... especially if that production system has a hard deadline for completion. Maybe use it as the backend for something unimportant first.
It's proving more challenging than I'd anticipated to do simple things that I take for granted about data manipulation using SQL engines. In particular, indexing data and sorting result sets is non-trivial.
Data modelling has proven challenging as well. As a relational database developer you come to the table with a lot of baggage... you need to be willing to learn how to model data very differently.
These things said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.
Some good resources I've found include:
Dominic Williams' Cassandra blog posts
Secondary Indexes in Cassandra
More from Ed Anuff on indexing
Cassandra book (not fantastic, but a good start)
"WTF is a SuperColumn" pdf
django-cassandra is still in early beta. Also, Django wasn't made for NoSQL databases. Django's ORM is fundamentally based on SQL (Django recommends using PostgreSQL). If you need to use ONLY NoSQL (you can mix SQL and NoSQL in the same app), you would have to take the risk of using a NoSQL ORM (which is significantly slower than a traditional SQL ORM or direct use of the NoSQL store), or completely rewrite the Django ORM. But in that case I can't see why you would need Django. Maybe you could use something else, like Tornado?
I'm wondering if some other non-relational database would be a good fit for activity streams - sort of like what you see on Facebook, Flickr (http://www.flickr.com/activity), etc. Right now, I'm using MySQL but it's pretty taxing (I have tens of millions of activity records) and since they are basically read-only once written and always viewed chronologically, I was thinking that an alternative DB might work well.
The activities are things like:
6 PM: John favorited Bacon
5:30 PM: Jane commented on Snow Crash
5:15 PM: Jane added a photo of Bacon to her album
The catch is that unlike Twitter and some other systems, I can't just simply append activities to lists for each user who is interested in the activity - if I could it looks like Redis would be a good fit (with its list operations).
I need to be able to do the following:
Pull activities for a set or subset of people who you are following ("John" and "Jane"), in reverse date order
Pull activities for a thing (like "Bacon") in reverse date order
Filter by activity type ("favorite", "comment")
Store at least 30 million activities
Ideally, if you added or removed a person who you are following, your activity stream would reflect the change.
I have been doing this with MySQL. My "activities" table is as compact as I could make it, the keys are as small as possible, and it is indexed appropriately. It works, but it just feels like the wrong tool for this job.
Is anybody doing anything like this outside of a traditional RDBMS?
Update November 2009: It's too early to answer my own question, but my current solution is to stick with MySQL but augment with Redis for fast access to the fresh activity stream data. More information in my answer here: How to implement the activity stream in a social network...
Update August 2014: Years later, I'm still using MySQL as the system of record and using Redis for very fast access to the most recent activities for each user. Dealing with schema changes on a massive MySQL table has become a non-issue thanks to pt-online-schema-change
I'd really, really suggest staying with MySQL (or another RDBMS) until you fully understand the situation.
I have no idea how much performance you need or how much data you plan on storing, but 30M rows is not very many.
If you need to optimise certain range scans, you can do this with (for example) InnoDB by choosing a (implicitly clustered) primary key judiciously, and/or denormalising where necessary.
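As a rough sketch of what "choosing the primary key judiciously" could mean for an activity stream, the example below keys the table on `(actor, created_at, id)` so that one person's activities sit next to each other in time order. It uses SQLite's `WITHOUT ROWID` tables (available in reasonably recent SQLite versions) as a stand-in for an InnoDB clustered primary key, and the schema and the `stream_for` helper are assumptions made up for the example.

```python
import sqlite3

def make_activity_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE activities (
            actor      TEXT    NOT NULL,
            created_at INTEGER NOT NULL,   -- e.g. a Unix timestamp
            id         INTEGER NOT NULL,
            verb       TEXT    NOT NULL,   -- e.g. 'favorite', 'comment'
            object     TEXT    NOT NULL,
            PRIMARY KEY (actor, created_at, id)
        ) WITHOUT ROWID
    """)
    return conn

def stream_for(conn, following, verbs=None, limit=20):
    """Newest-first activities by the followed actors, optionally by verb."""
    if not following:
        return []
    sql = ("SELECT actor, verb, object, created_at FROM activities "
           "WHERE actor IN (%s)" % ",".join("?" * len(following)))
    params = list(following)
    if verbs:
        sql += " AND verb IN (%s)" % ",".join("?" * len(verbs))
        params += list(verbs)
    sql += " ORDER BY created_at DESC, id DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()
```

Because the list of followed people is supplied at query time, following or unfollowing someone changes the stream immediately, with no per-user lists to rebuild.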
But like most things, make it work first, then fix performance problems you detect in your performance test lab on production-grade hardware.
EDIT: Some other points:
Key/value databases such as Cassandra, Voldemort, etc. do not generally support secondary indexes
Therefore, you cannot do a CREATE INDEX
Most of them also don't do range scans (even on the main index) because they're using hashing to implement partitioning (which they mostly do).
Therefore they also don't do range expiry (DELETE FROM tbl WHERE ts < NOW() - INTERVAL 30 DAY)
Your application must do ALL of this itself or manage without it; secondary indexes are really the killer
ALTER TABLE ... ADD INDEX takes quite a long time in e.g. MySQL with a large table, but at least you don't have to write much code to do it. In a "nosql" database, it will also take a long time BUT also you have to write heaps and heaps of code to maintain the new secondary index, expire it correctly, AND modify your queries to use it.
In short... you can't use a key/value database as a shortcut to avoid ALTER TABLE.
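To give a feel for the extra code involved, here is a minimal sketch of a key/value store with a hand-maintained timestamp index, including the range expiry that `DELETE ... WHERE ts < ...` gives you for free in SQL. The class `KVWithTimeIndex` is an in-memory toy written for this example, not the API of Cassandra, Voldemort or any other product.

```python
import bisect

class KVWithTimeIndex:
    """Key/value store plus a hand-maintained index ordered by timestamp."""

    def __init__(self):
        self._data = {}     # key -> (timestamp, value)
        self._by_time = []  # sorted list of (timestamp, key)

    def put(self, key, timestamp, value):
        if key in self._data:
            self._remove_index_entry(key)  # keep the index in sync on overwrite
        self._data[key] = (timestamp, value)
        bisect.insort(self._by_time, (timestamp, key))

    def get(self, key):
        entry = self._data.get(key)
        return None if entry is None else entry[1]

    def _remove_index_entry(self, key):
        old_ts, _ = self._data[key]
        pos = bisect.bisect_left(self._by_time, (old_ts, key))
        del self._by_time[pos]

    def expire_before(self, cutoff):
        """Remove every entry with timestamp < cutoff; return how many."""
        # (cutoff,) sorts before any (cutoff, key), so this finds the first
        # index entry whose timestamp is >= cutoff.
        end = bisect.bisect_left(self._by_time, (cutoff,))
        for _, key in self._by_time[:end]:
            del self._data[key]
        del self._by_time[:end]
        return end
```

Even this toy has to handle overwrites and keep two structures consistent. In a real distributed store the index lives in other rows or other nodes, so partial failures have to be handled as well.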
I am also planning on moving away from SQL. I have been looking at CouchDB, which looks promising. Looking at your requirements, I think all can be done with CouchDB views, and the list api.
It seems to me that what you want to do -- query a large set of data in several different ways and order the results -- is exactly and precisely what RDBMSes were designed for.
I doubt you would find any other datastore that would do this as well as a modern commercial DBMS (Oracle, SQL Server, DB2, etc.) or any open source tool that would accomplish this any better than MySQL.
You could have a look at Google's BigTable, which is really a relational database but can present an 'object'y personality to your program. It's exceptionally good for free-format text searches and complex predicates. As the whole thing (at least the version you can download) is implemented in Python, I doubt it would beat MySQL in a query marathon.
For a project I once needed a simple database that was fast at doing lookups and which would do lots of lookups and just an occasional write. I just ended up writing my own file format.
While you could do this too, it is pretty complex, especially if you need to support it from a web server. With a web server, you would at least need to protect every write to the file and make sure it can be read from multiple threads. The design of this file format is something you should work out as well as possible, with plenty of testing and experiments. One minor bug could prove fatal for a web project in this style, but if you get it working, it can work really well and extremely fast.
But for 99.999% of all situations, you don't want such a custom solution. It's easier to just upgrade the hardware, move to Oracle, SQL Server or InterBase, use a dedicated database server, use faster hard disks, install more memory, upgrade to a 64-bit system. Those are the more generic tricks to improve performance with the least effort.
I'd recommend learning about message queue technology. There are several open-source options available, and also robust commercial products that would serve up the volume you describe as a tiny snack.
CouchDB is schema-free, and it's fairly simple to retrieve a huge amount of data quickly, because you are working only with indexes. You are not "querying" the database each time, you are retrieving only matching keys (which are pre-sorted making it even faster).
"Views" are re-indexed every time new data is entered into the database, but this takes place transparently to the user, so while there might be a delay in generating an updated view, there will virtually never be any delay in retrieving results.
I've just started to explore building an "activity stream" solution using CouchDB, and because the paradigm is different, my thinking about the process had to change from the SQL thinking.
Rather than figure out how to query the data I want and then process it on the page, I instead generate a view that keys all documents by date, so I can easily create multiple groups of data, just by using the appropriate date key, essentially running several queries simultaneously, but with no degradation in performance.
This is ideal for activity streams, and I can isolate everything by date, or along with date isolation I can further filter results of a particular subtype, etc - by creating a view as needed, and because the view itself is just using javascript and all data in CouchDB is JSON, virtually everything can be done client-side to render your page.
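The idea of a view keyed by date can be imitated in a few lines of Python. This is only an emulation of the concept (real CouchDB views are defined as JavaScript map functions and queried over HTTP), and the document fields and function names are assumptions for the example.

```python
import bisect

def build_view(docs, map_fn):
    """Run map_fn over every document and return rows sorted by key."""
    rows = []
    for doc in docs:
        for key, value in map_fn(doc):
            rows.append((key, value))
    rows.sort(key=lambda row: row[0])
    return rows

def query_view(rows, startkey, endkey):
    """Return rows with startkey <= key <= endkey using binary search."""
    keys = [key for key, _ in rows]
    lo = bisect.bisect_left(keys, startkey)
    hi = bisect.bisect_right(keys, endkey)
    return rows[lo:hi]

def by_type_and_date(doc):
    """Map function: emit a compound [type, date] key for activity docs."""
    if doc.get("kind") == "activity":
        yield [doc["type"], doc["date"]], doc["summary"]
```

With dates stored as ISO strings, `query_view(rows, ["comment", "2009-11-01"], ["comment", "2009-11-30"])` picks out one activity type for one month by slicing the pre-sorted rows, which is why reads stay cheap once the view has been built.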
Currently working on a project that is centered around a medical nomenclature known as SNOMED. At the heart of SNOMED are three relational datasets that are 350,000, 1.1 mil, and 1.3 mil records in length. We want to be able to quickly query this dataset for the data entry portion, where we would like to have some shape or form of auto-completion/suggestion.
It's currently in a MySQL MyISAM DB just for dev purposes, but we want to start playing with some in-memory options. It's currently 30MB + 90MB + 70MB in size including the indexes. The MEMORY MySQL Engine and MemCached were the obvious ones, so my question is which of these would you suggest, or is there something better out there?
We're working in Python primarily at the app level if that makes a difference. Also we're running on a single small dedicated server moving to 4GB DDR2 soon.
Edit: Additional Info
We're interested in keeping the suggesting and autocompletion fast. Something that will perform well for these types of queries is desirable. Each term in SNOMED typically has several synonyms, abbreviations, and a preferred name. We will be querying this dataset heavily (90MB in size including index). We're also considering building an inverted index to speed things up and return more relevant results (many of the terms are long, e.g. "Entire coiled artery of decidua basalis (body structure)"). Lucene or some other full text search may be appropriate.
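To illustrate the inverted-index idea, here is a minimal in-memory sketch in Python. The class name `TermSuggester`, the tokenization and the "shorter terms first" ranking are all assumptions for the example. A real deployment would more likely lean on a full text engine, but the data structure is the same in spirit.

```python
import bisect
import re
from collections import defaultdict

class TermSuggester:
    """Word-level inverted index with prefix lookup for autocompletion."""

    def __init__(self, terms):
        self.terms = list(terms)
        postings = defaultdict(set)  # word -> ids of terms containing it
        for term_id, term in enumerate(self.terms):
            for word in re.findall(r"[a-z0-9]+", term.lower()):
                postings[word].add(term_id)
        self.postings = dict(postings)
        self.words = sorted(self.postings)  # sorted vocabulary

    def _words_with_prefix(self, prefix):
        # All words starting with prefix form one contiguous sorted slice.
        start = bisect.bisect_left(self.words, prefix)
        end = bisect.bisect_left(self.words, prefix + "\uffff")
        return self.words[start:end]

    def suggest(self, query, limit=10):
        """Return terms that contain a word for every prefix in the query."""
        prefixes = re.findall(r"[a-z0-9]+", query.lower())
        if not prefixes:
            return []
        matches = None
        for prefix in prefixes:
            ids = set()
            for word in self._words_with_prefix(prefix):
                ids |= self.postings[word]
            matches = ids if matches is None else matches & ids
            if not matches:
                return []
        # Shorter terms first, as a crude relevance heuristic (assumption).
        ranked = sorted(matches, key=lambda i: (len(self.terms[i]), self.terms[i]))
        return [self.terms[i] for i in ranked[:limit]]
```

Typing "dec art" would then match the long "Entire coiled artery of decidua basalis (body structure)" term even though neither word is at the start of the string, which a plain `LIKE 'dec%'` prefix query cannot do.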
From your use case, it sounds like you want to do full-text searching; I would suggest sphinx. It's blazing fast, even on large data sets. You can integrate memcached if you need extra speed.
For how to do this with Lucene, please see:
Techniques to make autocomplete on website more responsive
How to do query auto-completion suggestions in Lucene
autocomplete server side implementation
Lucene is the closest thing to an industry-standard full-text search library. It is fast and gives quality results. However, it takes time to master Lucene - you have to handle many low-level details. An easier way may be to use Solr, a Lucene sub-project which is much easier to set up and can give JSON output that can be used for autocomplete.
As Todd said, you can also use Sphinx. I have never used it, but heard it is highly integrable with MySQL. I failed to find how to implement autocomplete using Sphinx - maybe you should post this as a separate question.