Geospatial and full text search for Rails app hosted on Heroku - mysql

I'm planning out a Rails app that will be hosted on Heroku and will need both geospatial and full text search capabilities.
I know that Heroku offers add-ons like WebSolr and IndexTank that sound like they can do the job, but I was wondering if this could be done in MySQL and/or PostgreSQL without having to pay for any add-ons?

Depending on the scale of your application you should be able to accomplish both FULLTEXT and SPATIAL indexes in MySQL with ease. Once your application gets massive, i.e hundreds of millions of rows with high concurrency and multiples of thousands of requests per second you might need to move to another solution for either FULLTEXT or SPATIAL queries. But, I wouldn't recommend optimize for that early on, since it can be very hard to do properly. For the foreseeable future MySQL should suffice.
You can read about spatial indexes in MySQL here. You can read about fulltext indexes in MySQL here. Finally, I would recommend taking the steps outlined here to make your schema.rb file and rake tasks work with these two index types.
I have only used MySQL for both, but my understanding is that PostgreSQL has a good geo-spatial index solution as well.

If you have a database at Heroku, you can use Postgres's support for Full Text Search: http://www.postgresql.org/docs/8.3/static/textsearch.html. The oldest servers Heroku runs (for shared databases) are on 8.3 and 8.4. The newest are on 9.0.
A blog post noticing this little fact can be seen here: https://tenderlovemaking.com/2009/10/17/full-text-search-on-heroku.html
Apparently, that "texticle" (heh. cute.) addon works...pretty well. It will even create the right indexes for you, as I understand it.
Here's the underlying story: postgres full-text-search is pretty fast and fuss-free (although Rails-integration may not be great), although it does not offer the bells and whistles of Solr or IndexTank. Make sure you read about how to properly set up GIN and/or GiST indexes, and use the tsvector/tsquery types.
The short version:
Create an (in this case, expression-based) index: CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));. In this case "body" is the field being indexed.
Use the ## operator: SELECT * FROM ... WHERE to_tsvector('english', pgweb.body) ## to_tsquery('hello & world') LIMIT 30
The hard part may be mapping things back into application land, the blog post previously cited is trying to do that.
The dedicated databases can also be requisitioned with PostGIS, which is a very powerful and fully featured system for indexing and querying geographical data. OpenStreetMap uses the PostgreSQL geometry types (built-in) extensively, and many people combine that with PostGIS to great effect.
Both of these (full text search, PostGIS) take advantage of the extensible data type and indexing infrastructure in Postgres, so you should expect them to work with high performance for many, many records (spend a little time carefully reviewing the situation if things look busted). You might also take advantage of fact that you are able to leverage these features in combination with transactions and structured data. For example:
CREATE TABLE products (pk bigserial, price numeric, quantity integer, description text); can just as easily be used with full text search...any text field will do, and it can be in connection with regular attributes (price, quantity in this case).

I'd use thinking sphinx, a full text search engine also deployable on heroku.
It has geo search built-in: http://freelancing-god.github.com/ts/en/geosearching.html
EDIT:
Sphynx is almost ready for heroku, see here: http://flying-sphinx.com/

IndexTank is now free up to 100k documents on Heroku, we just haven't updated the documentation. This may not be enough for your needs, but I thought I'd let you know just in case.

For full text search via Postgre I recommend pg_search, I am using it myself on heroku at the moment. I have not used texticle but from what I can see pg_search has more development activity lately and it has been built upon texticle (it will not add indexes for you, you have to do it yourself).
I cannot find the thread now but I saw that Heroku gave option for pg geo search but it was in beta.
My advice is if you are not able to find postgre solution is to host your own instance of SOLR (on EC2 instance) and use sunspot solr gem to integrate it with rails.
I have implemented my own solution and used WebSolr as well. Basically that is what they give you their own SOLR instance hassle free. Is it worth the money, in my opinion no. For integration that use sunspot solr client as well, so it is just are you going to pay somebody 20$/40$/... to host SOLR for you. I know you also get backups, maintenance etc. but call me cheap I prefer my own instance. Also WebSolr is locked on 1.4.x version of SOLR.

Related

Sphinx search or MySQL for a huge database search enginee

So I'm working on a huge eCommerce solution on top of Symfony2, Doctrine2 and MySQL (maybe a cluster since we will have a lot of people connected and working in our platform) so I'm trying to decide if will be better to use Sphinx search or MysQL for search solution since some data will need to be duplicated in MySQL tables and in Sphinx. Our main goal is performance so excellent response times is what we look for. I'm not an expert in none so I need some advice from people here based on theirs experience and so on, maybe some docs or whatever. What path did yours take on this side.
PS: The DB will grow up really fast take that into account and the platform will be for the entire worl
Sphinx is usually preferred when it comes to performance vs MySQL for high volume searches because it's easy to scale. You will have a delay on results, to allow it to sync data with mysql, but even so it's better.
You should also take a good look at the actual queries that will run and store in Sphinx
only the fields that are searchable along with their ids. Once you get the ids from sphinx,
to list them use a mysql slave to get their other, non-searchable data.
Depending on what queries you are using for search,
A better solution than sphinx is Amazon Cloudsearch. We had a hard time implementing it, but it was well worth it, both time and $$$, and it replaced our sphinx solution

Seeking clarification about mysql 5.6 memcache integration

I'm having trouble getting a clear understanding of what MySQL 5.6 is introducing w/r/t memcache.
As I understand it, memcache by itself is essentially a huge, shared, memory-resident hash table that is managed by a server, memcached. In particular, it knows nothing about a persistent data store, and offers no services in that regard. It simply knows about keys and values (like a Perl hash).
What I think mySQL 5.6 introduces is a NoSQL API, whereby mySQL clients can request data from the mySQL server by key, rather than by a SELECT statement. (And similarly, they can perform updates with key=value pairs). MySQL uses memcached to cache these in memory as a performance boost, but also takes care of things like writing updates back to the database before they age out of the cache, etc.
In other words, the use of memcached is an implementation detail of the mySQL 5.6 NoSQL feature, and is not something the application programmer needs to be aware of.
I'd welcome any corrections or amplification to my understanding.
Thanks,
Chap
I think it's quite simple (from the official documentation):
I disagree with your last sentence, the application programmer has to be really aware of the memcache plugin because having it onboard of the MySQL server means that he can decide (maybe he will be forced to) access data through a memcached language interface or via the SQL interface
To better understand the impact of this plugin onto an app design you should know that there are 3 configuration tables used by MySQL for a proper memcached management; understanding how the "cache_policies" works will shade some light to some of your doubts:
Table cache_policies specifies whether to use InnoDB as the data store of memcached (innodb_only), or to use the traditional memcached engine as the backstore (cache-only), or both (caching). In the last case, if memcached cannot find a key in memory, it searches for the value in an InnoDB table.
here is the link: innodb-memcached-internals
This quote above means that, depending on what you decided for a specific key-value, you will have different application scenarios :
innodb_only -> means that you can query the data via a sql interface or via a memcached interface, here is a link to some memcached language interface examples memcached-interfaces
cache-only -> means that you should query the data via the memchached interface only
caching -> means that you can use both the interfaces (note that the storage mechanism slightly changes)
Of course this latter configuration decision is strictly related to your specific needs
I don't really have a complete answer for you I'm afraid, as I too am struggling to find the detail I require before toying around with it.
That said however there is one important point which I have managed to uncover that you seem to have missed, namely that by accessing the InnoDB storage engine via the new plugin you are actually completely bypassing SQL and avoiding all the overhead that comes with it.
This of course makes it essentially a key/value store more akin to most NoSQL databases complete with all the drawbacks associated with them. i.e. no joins etc...
However on the flip side for many applications these days, this is exactly what we want. There has been only a handful of real world performance mentions that I have come across but all seem to point to this implementation significantly outperforming MongoDB and other similar NoSQL solutions (how much truth is in it I do not know) with even one (relatively in depth) comparison claiming as high as 700k qps on a commodity server (compared with around 100k on a well tuned MySQL setup), which is incredible if true.
Resource here:
http://yoshinorimatsunobu.blogspot.co.uk/search/label/handlersocket
Anyway, sorry I can't be any more help but its food for thought at least!

Good search solution for Zend Framework + Doctrine + MySQL?

I've looked into Doctrine's built-in search, MySQL myisam fulltext search, Zend_Lucene, and sphinx - but all the nuances and implementation details are making it hard to sort out for me, given that I don't have experience with anything other than the myisam search.
What I really want is something simple that will work with the Zend Framework and Doctrine (MySQL back-end, probably InnoDB). I don't need complex things like word substitutions, auto-complete, and so on (not that I'd be opposed to such things, if it were easy enough and time effective enough to implement).
The main thing is the ability to search for strings across multiple database tables, and multiple fields with some basic search criteria (e.g. user.state. = CA AND user.active = 1). The size of the database will start at around 50K+ records (old data being dumped in), the biggest single searchable table would be around 15K records, and it would grow considerably over time.
That said, Zend_Lucene is appealing to me because it is flexible (in case I do need my search solution to gorw in the future) and because it can parse MS Office files (which will be uploaded to my application by users). But its flexibility also makes it kind of complicated to set up.
I suppose the most straightforward option would be to just use Doctrine's search capabilities, but I'm not sure if that's going to be able to handle what I need. And I don't know that there is any option out there which is going to combine my desire for simplicity & power.
What search solutions would you recommend I investigate? And why would you think that solution would work well in this situation?
I would recomment using Solr search engine.
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface (which is really great) and many more features.
It runs in a Java servlet container such as Tomcat.
You can use the solr-php-client to handle queries in php.

How does MyISAM scale compared to Solr for Django searching?

Imagine you have a web application written in Django and Python 2.65, and MySQL 5.1 is your database of choice.
Now, imagine you will need to scale your app to handle searching 100's of thousands of document and potentially 100's of thousands of users will be using it.
Reality: Haystack 1.0 with PySolr and Solr 1.4.0 is proving slow in the above scenarios. Is MyISAM a more workable alternative or should I spend more time working with my current configuration using Solr in a "smarter" way?
Suggestions? Tips?
Thank you for any help!
Michaux
I assume you're talking about comparing Solr vs MySQL full-text search, otherwise it would be comparing apples to oranges.
I don't know about Haystack or PySolr, but Solr itself should have no problems handling documents in the order of 100000 with lots of users. Those two parameters alone are not enough to spec the problem, though. For example, frequency of updates, real frequency of requests, document size, page size, sorting, faceting, etc.
Solr is easily scalable, both vertically and horizontally, and is Apache-licensed, while the horizontal scaling solution for MySQL is GPL+commercially licensed.
I disagree with Badger's answer about Tomcat, it's a very polished, proven, stable server that's been around for over 10 years, and the Java performance myth has to be abolished once and for all.
Bottom line: it's very likely that you need to optimize your Solr instance (both the client-side querying and the server-side index and performance settings). Solr powers some of the biggest websites so it's quite likely that it can handle your load as well.
I have no expertise with Haystack or PySolr but just looking at Solr makes me think that MySQL might be a better choice. I know that MySQL can scale to very large applications if it is setup correctly.
Apache Solr just on Tomcat. Tomcat can be a bit of a resource hog and can run slowly. MySQL runs from compiled binaries. This should provide a bit of a boost. The server that you run this off of will also make a large difference. I would say that if you have the ability go ahead and try and setup the MySQL system, and see if you get any difference.

MySQL vs PostgreSQL? Which should I choose for my Django project?

My Django project is going to be backed by a large database with several hundred thousand entries, and will need to support searching (I'll probably end up using djangosearch or a similar project.)
Which database backend is best suited to my project and why? Can you recommend any good resources for further reading?
For whatever it's worth the the creators of Django recommend PostgreSQL.
If you're not tied to any legacy
system and have the freedom to choose
a database back-end, we recommend
PostgreSQL, which achives a fine
balance between cost, features, speed
and stability. (The Definitive Guide to Django, p. 15)
As someone who recently switched a project from MySQL to Postgresql I don't regret the switch.
The main difference, from a Django point of view, is more rigorous constraint checking in Postgresql, which is a good thing, and also it's a bit more tedious to do manual schema changes (aka migrations).
There are probably 6 or so Django database migration applications out there and at least one doesn't support Postgresql. I don't consider this a disadvantage though because you can use one of the others or do them manually (which is what I prefer atm).
Full text search might be better supported for MySQL. MySQL has built-in full text search supported from within Django but it's pretty useless (no word stemming, phrase searching, etc.). I've used django-sphinx as a better option for full text searching in MySQL.
Full text searching is built-in with Postgresql 8.3 (earlier versions need TSearch module). Here's a good instructional blog post: Full-text searching in Django with PostgreSQL and tsearch2
large database with several hundred
thousand entries,
This is not large database, it's very small one.
I'd choose PostgreSQL, because it has a lot more features. Most significant it this case: in PostgreSQL you can use Python as procedural language.
Go with whichever you're more familiar with. MySQL vs PostgreSQL is an endless war. Both of them are excellent database engines and both are being used by major sites. It really doesn't matter in practice.
All the answers bring interesting information to the table, but some are a little outdated, so here's my grain of salt.
As of 1.7, migrations are now an integral feature of Django. So they documented the main differences that Django developers might want to know beforehand.
Backend Support
Migrations are supported on all backends that Django ships with, as
well as any third-party backends if they have programmed in support
for schema alteration (done via the SchemaEditor class).
However, some databases are more capable than others when it comes to schema migrations; some of the caveats are covered below.
PostgreSQL
PostgreSQL is the most capable of all the databases here in terms of schema support.
MySQL
MySQL lacks support for transactions around schema alteration operations, meaning that if a migration fails to apply you will have to manually unpick the changes in order to try again (it’s impossible to roll back to an earlier point).
In addition, MySQL will fully rewrite tables for almost every schema operation and generally takes a time proportional to the number of rows in the table to add or remove columns. On slower hardware this can be worse than a minute per million rows - adding a few columns to a table with just a few million rows could lock your site up for over ten minutes.
Finally, MySQL has relatively small limits on name lengths for columns, tables and indexes, as well as a limit on the combined size of all columns an index covers. This means that indexes that are possible on other backends will fail to be created under MySQL.
SQLite
SQLite has very little built-in schema alteration support, and so
Django attempts to emulate it by:
Creating a new table with the new schema
Copying the data across
Dropping the old table
Renaming the new table to match the original name
This process generally works well, but it can be slow and occasionally
buggy. It is not recommended that you run and migrate SQLite in a
production environment unless you are very aware of the risks and its
limitations; the support Django ships with is designed to allow
developers to use SQLite on their local machines to develop less
complex Django projects without the need for a full database.
Even if Postgresql looks better, I find it has some performances issues with Django:
Postgresql is made to handle "long connections" (connection pooling, persistant connections, etc.)
MySQL is made to handle "short connections" (connect, do your queries, disconnect, has some performances issues with a lot of open connections)
The problem is that Django does not support connection pooling or persistant connection, it has to connect/disconnect to the database at each view call.
It will works with Postgresql, but connecting to a Postgresql cost a LOT more than connecting to a MySQL database (On Postgresql, each connection has it own process, it's a lot slower than just popping a new thread in MySQL).
Then you get some features like the Query Cache that can be really useful on some cases. (But you lost the superb text search of PostgreSQL)
When a migration fails in django-south, the developers encourage you not to use MySQL:
! The South developers regret this has happened, and would
! like to gently persuade you to consider a slightly
! easier-to-deal-with DBMS (one that supports DDL transactions)
Having gone down the road of MySQL because I was familiar with it (and struggling to find a proper installer and a quick test of the slow web "workbench" interface of postgreSQL put me off), at the end of the project, after a few months after deployment, while looking into back up options, I see you have to pay for MySQL's enterprise back up features. Gotcha right at the very end.
With MySql I had to write some ugly monster raw SQL queries in Django because no select distinct per group for retrieving the latest per group query. Also looking at postgreSQL's full-text search and wishing I had used postgresSQL.
I recommend PostgreSQL even if you are familiar with MySQL, but your mileage may vary.
UPDATE: DBeaver is a great equivalent of MySql Workbench gui tool but works with PostgreSQL very nicely (and many others as its a universal DB tool).
To add to previous answers :
"Full text search might be better supported for MySQL"
The FULLTEXT index in MySQL is a joke.
It only works with MyISAM tables, so you lose ACID, Transactions, Constraints, Relations, Durability, Concurrency, etc.
INSERT/UPDATE/DELETE to a largish TEXT column (like a forum post) will a rebuild a large part of the index. If it does not fit in myisam_key_buffer, then large IO will occur. I've seen a single forum post insertion trigger 100MB or more of IO ... meanwhile the posts table is exclusiely locked !
I did some benchmarking (3 years ago, may be stale...) which showed that on large datasets, basically postgres fulltext is 10-100x faster than mysql, and Xapian 10-100x faster than postgres (but not integrated).
Other reasons not mentioned are the extremely smart query optimizer, large choice of join types (merge, hash, etc), hash aggregation, gist indexes on arrays, spatial search, etc which can result in extremely fast plans on very complicated queries.
Will this application be hosted on your own servers or by a hosting company? Make sure that if you are using a hosting company, they support the database of choice.
There is a major licensing difference between the two db that will affect you if you ever intend to distribute code using the db. MySQL's client libraries are GPL and PostegreSQL's is under a BSD like license which might be easier to work with.