Use Hadoop as MySQL storage engine?

Use Hadoop as MySQL storage engine? - mysql

besides using Hive, is it a good idea in order to execute ad hoc query on large scale log data on HDFS for SQL programmers?
Is there any similar open-source implementation?

I search the question in 2014 and I found Infinidb and a blog about it. It integrate hadoop and mysql. It provide a native mysql protocol access to data stored hadoop.
I've not reading much about it, while it is questionable to me in compatibility(with existing app for mysql) and performance(compare to well tuned index and data partitioning).
But it might be the easiest solution for high-availability with really large dataset which cannot fit into few disks. (using HDFS build-in replication, no SAN or RAID will be needed)
BTW, the Infinidb website is currently affected by the Heartbleed bug. I wonder if their product secure as the yet patch the heardbleed for more then 5 months.

Technically it should not be that complicated to implement. Some conceptual problem I see with it that performance-wise behavior of the NoSQL engines is fundamentally different from what MySQL engine expect from storage engines. Specifically - they have good random access and not that efficient in the full or range scans. The question is it will be possible to translate all these costs to the optimizer. It is something applicable to any RDBMS engine. Actually many of them has a concept of pluggable storage engines and have different level of flexibility / documentation.
I think, to have such integration efficient we need to be able to push down predicates to the NoSQL engines for the full / range scans. I am not 100% sure that MySQL supports it on the level of storage engine interface.
Another serious problem I see with this approach - the fact that MySQL does not have parallel query, and thereof can not be too good for processing big data.

Related

Seeking clarification about mysql 5.6 memcache integration

I'm having trouble getting a clear understanding of what MySQL 5.6 is introducing w/r/t memcache.
As I understand it, memcache by itself is essentially a huge, shared, memory-resident hash table that is managed by a server, memcached. In particular, it knows nothing about a persistent data store, and offers no services in that regard. It simply knows about keys and values (like a Perl hash).
What I think mySQL 5.6 introduces is a NoSQL API, whereby mySQL clients can request data from the mySQL server by key, rather than by a SELECT statement. (And similarly, they can perform updates with key=value pairs). MySQL uses memcached to cache these in memory as a performance boost, but also takes care of things like writing updates back to the database before they age out of the cache, etc.
In other words, the use of memcached is an implementation detail of the mySQL 5.6 NoSQL feature, and is not something the application programmer needs to be aware of.
I'd welcome any corrections or amplification to my understanding.
Thanks,
Chap

I think it's quite simple (from the official documentation):
I disagree with your last sentence, the application programmer has to be really aware of the memcache plugin because having it onboard of the MySQL server means that he can decide (maybe he will be forced to) access data through a memcached language interface or via the SQL interface
To better understand the impact of this plugin onto an app design you should know that there are 3 configuration tables used by MySQL for a proper memcached management; understanding how the "cache_policies" works will shade some light to some of your doubts:
Table cache_policies specifies whether to use InnoDB as the data store of memcached (innodb_only), or to use the traditional memcached engine as the backstore (cache-only), or both (caching). In the last case, if memcached cannot find a key in memory, it searches for the value in an InnoDB table.
here is the link: innodb-memcached-internals
This quote above means that, depending on what you decided for a specific key-value, you will have different application scenarios :
innodb_only -> means that you can query the data via a sql interface or via a memcached interface, here is a link to some memcached language interface examples memcached-interfaces
cache-only -> means that you should query the data via the memchached interface only
caching -> means that you can use both the interfaces (note that the storage mechanism slightly changes)
Of course this latter configuration decision is strictly related to your specific needs

I don't really have a complete answer for you I'm afraid, as I too am struggling to find the detail I require before toying around with it.
That said however there is one important point which I have managed to uncover that you seem to have missed, namely that by accessing the InnoDB storage engine via the new plugin you are actually completely bypassing SQL and avoiding all the overhead that comes with it.
This of course makes it essentially a key/value store more akin to most NoSQL databases complete with all the drawbacks associated with them. i.e. no joins etc...
However on the flip side for many applications these days, this is exactly what we want. There has been only a handful of real world performance mentions that I have come across but all seem to point to this implementation significantly outperforming MongoDB and other similar NoSQL solutions (how much truth is in it I do not know) with even one (relatively in depth) comparison claiming as high as 700k qps on a commodity server (compared with around 100k on a well tuned MySQL setup), which is incredible if true.
Resource here:
http://yoshinorimatsunobu.blogspot.co.uk/search/label/handlersocket
Anyway, sorry I can't be any more help but its food for thought at least!

MySQL vs PostgreSQL Concerns w/ GIS & Speed

I'm aware there are a few threads out there addressing this issue, but I'm wondering if anything has changed since those have been published.
I'm looking to build a GIS webapp, and people are all saying that PostgreSQL is the way to go because it supports various things that have to do with mapping better, whereas MySQL's spatial extensions aren't too great.
So PostgreSQL seems like the way to go, but everywhere I go I'm reading that PostgreSQL is terribly slow compared to MySQL, is this still true?
If I want to use GeoDjango with MySQL, will I be able to do most everything?
I'm really stuck between the two, simply because people keep saying PostgreSQL is really slow, but MySQL isn't really great for dealing with GIS stuff.
What's your take SO?

No, postgresql is not slower. This myth is due to people running single threaded sequential benchmarks on myisam vs postgresql. Benchmarks that attempt to model actual usage conditions with many concurrent queries put postgresql on par with or ahead of mysql in performance, especially as you scale up in CPUs/cores.
http://www.randombugs.com/linux/mysql-postgresql-benchmarks.html
http://tweakers.net/reviews/657/5

In my opinion, it's silly to compare MySQL and PostgreSQL in terms of speed if there are variables unknown, such as - what's your budget, what's your target system output and what's your load rate?
Both RDBMSs are great, and they can be scaled. The difference is that MySQL has pluggable engine architecture, allowing it to plug in various engines. Natively, MySQL supports 9 engines if I'm not mistaken but it has a plethora of commercial engines to choose from, along with 2 popular forks (Percona's and MariaDB) that introduce various enhancements, especially for InnoDB storage engine.
Real question is, what does it mean that something is "bad" at GIS "stuff"? What does bad mean? Can't calculate something? Can't store something? I just don't get what you consider bad really.
I doubt you can go wrong by choosing either of the two databases, just beware of false benchmarks claiming one product is faster than another. Set your goal in terms of performance, install both products on your test machine and run them. If both satisfy your performance needs, use the one you feel more comfortable developing with.

Check this topic: GIS: PostGIS/PostgreSQL vs. MySql vs. SQL Server?
PostGIS is much more mature and complete, and competes with Oracle and SQL Server, not MySQL. Sorry.

When it comes to GIS capabilities, have a look at this GIS SE question:
Would PostGIS offer an advantage over MySQL for a produce farm application?
I think that from all that I read here and on GIS SE site, PostgreSQL with PostGIS is a clear winner when it comes to handling spatial data.

Upgrading from MySQL 4.1 to 5.0 - What kind of performance changes (good or bad) can we expect?

Currently have approximately 2000 simultaneouse connections. We average approximately 425 reads and writes per second. We have a read to write ration of 3:1. All of our tables are myisam. Can we expect better or worse performance when we go from mysql 4.1.22 to 5.0?

There's no way for anyone here to tell you without the schema, queries and test data.
Why not setup a dev environment on 5.0 and testing it out?

The main concern should be that the 5.0 Information Schemas, are a HUGE vulnerability and can be used to very easily gain access to the SQL server from remote locations simply by printing off the schema using injection will let an unwanted viewer, view all of the tables and capitalize off the knowledge to get passwords using the same schema for its columns.

The MySQL source tree includes a set of benchmark tests written as Perl scripts. See The MySQL Benchmark Suite for some information. You can download the source distribution for MySQL 5.0.91 at the archives.
Source distribution of MySQL 4.1 doesn't seem to be easily available anymore. You might have to check it old sources from LaunchPad unless you can find a copy of an old source distribution elsewhere on the internet.
However, the comparison that these benchmarks show is only of general interest. It may be irrelevant to how your application performs. For instance, your usage of the database may not take advantage of some performance improvements in MySQL 5.0, but it may run into some regressions in MySQL 5.0 that were necessary.
The only way to get an answer that is relevant to your application is to try the new software with a test instance of your application, using a sample of data that is a realistic model of the type and volume of data your application typically deals with. As #BenS says, no one on a site like StackOverflow can give an answer specific to your application.
You say in a comment that you're very concerned about performance, but if you don't have an instance of your application and database that you can run tests on, you aren't doing the work necessary to satisfy this concern.

I would strongly suggest moving straight to 5.1.45 with Innodb Support. Percona provides an excellent version with XtraDB that provides a number of performance related improvements. Moving off of your MyISAM tables and onto Innodb will provide a huge performance increase in almost all cases. If you are going to burn the QA/Testing time to move, do a full move now, not a half-way step.

Switching from MySQL to Cassandra - Pros/Cons?

For a bit of background - this question deals with a project running on a single small EC2 instance, and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in python and java, which do the heavy
lifting. The same machine is running Apache as well.
The data model looks like the following - a large amount of real time data comes in streamed from various networked sensors, and ideally, I'd like to establish a long-poll approach rather than the current poll every 15 minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in
MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.
Relational features I would need -
Order by [SliceRange in Cassandra's API seems to satisy this]
Group by
Manytomany relations between multiple tables [Cassandra SuperColumns seem to do well for one to many]
Sphinx on this gives me a nice full text engine, so thats a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware on it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).
So essentially, after having read a lot about NOSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are,
On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - something to the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well). Size of tables is around 4-5 million rows, and there are about 5 such tables.
Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense
to deploy a one node cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]
If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.
Would it make any sense to just use MySQL as a key value store rather than a relational engine, and go with that? That way I could utilize a large number of stable APIs available, as well as a stable engine (and go relational as needed). (Brett Taylor's post from Friendfeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
Any insights from people who've done a shift would be greatly appreciated!
Thanks.

Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from sql. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.
That said, at a few hundred writes/minute you're going to be fine on mysql for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily) but MySQL is much better at being a relational database. :)
There is no django support for Cassandra (or other nosql database) yet. They are talking about doing something for the next version after 1.2, but based on talking to django devs at pycon, nobody is really sure what that will look like yet.

If you're a relational database developer (as I am), I'd suggest/point out:
Get some experience working with Cassandra before you commit to its use on a production system... especially if that production system has a hard deadline for completion. Maybe use it as the backend for something unimportant first.
It's proving more challenging than I'd anticipated to do simple things that I take for granted about data manipulation using SQL engines. In particular, indexing data and sorting result sets is non-trivial.
Data modelling has proven challenging as well. As a relational database developer you come to the table with a lot of baggage... you need to be willing to learn how to model data very differently.
These things said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.
Some good resources I've found include:
Dominic Williams' Cassandra blog posts
Secondary Indexes in Cassandra
More from Ed Anuff on indexing
Cassandra book (not fantastic, but a good start)
"WTF is a SuperColumn" pdf

The Django-cassandra is an early beta mode. Also Django didn't made for no-sql databases. The key in Django ORM is based on SQL (Django recommends to use PostgreSQL). If you need to use ONLY no-sql (you can mix sql and no-sql in same app) you need to risky use no-sql ORM (it significantly slower than traditional SQL orm or direct use of No-SQL storage). Or you'll need to completely full rewrite django ORM. But in this case i can't presume, why you need Django. Maybe you can use something else, like Tornado?

MySQL vs PostgreSQL? Which should I choose for my Django project?

My Django project is going to be backed by a large database with several hundred thousand entries, and will need to support searching (I'll probably end up using djangosearch or a similar project.)
Which database backend is best suited to my project and why? Can you recommend any good resources for further reading?

For whatever it's worth the the creators of Django recommend PostgreSQL.
If you're not tied to any legacy
system and have the freedom to choose
a database back-end, we recommend
PostgreSQL, which achives a fine
balance between cost, features, speed
and stability. (The Definitive Guide to Django, p. 15)

As someone who recently switched a project from MySQL to Postgresql I don't regret the switch.
The main difference, from a Django point of view, is more rigorous constraint checking in Postgresql, which is a good thing, and also it's a bit more tedious to do manual schema changes (aka migrations).
There are probably 6 or so Django database migration applications out there and at least one doesn't support Postgresql. I don't consider this a disadvantage though because you can use one of the others or do them manually (which is what I prefer atm).
Full text search might be better supported for MySQL. MySQL has built-in full text search supported from within Django but it's pretty useless (no word stemming, phrase searching, etc.). I've used django-sphinx as a better option for full text searching in MySQL.
Full text searching is built-in with Postgresql 8.3 (earlier versions need TSearch module). Here's a good instructional blog post: Full-text searching in Django with PostgreSQL and tsearch2

large database with several hundred
thousand entries,
This is not large database, it's very small one.
I'd choose PostgreSQL, because it has a lot more features. Most significant it this case: in PostgreSQL you can use Python as procedural language.

Go with whichever you're more familiar with. MySQL vs PostgreSQL is an endless war. Both of them are excellent database engines and both are being used by major sites. It really doesn't matter in practice.

All the answers bring interesting information to the table, but some are a little outdated, so here's my grain of salt.
As of 1.7, migrations are now an integral feature of Django. So they documented the main differences that Django developers might want to know beforehand.
Backend Support
Migrations are supported on all backends that Django ships with, as
well as any third-party backends if they have programmed in support
for schema alteration (done via the SchemaEditor class).
However, some databases are more capable than others when it comes to schema migrations; some of the caveats are covered below.
PostgreSQL
PostgreSQL is the most capable of all the databases here in terms of schema support.
MySQL
MySQL lacks support for transactions around schema alteration operations, meaning that if a migration fails to apply you will have to manually unpick the changes in order to try again (it’s impossible to roll back to an earlier point).
In addition, MySQL will fully rewrite tables for almost every schema operation and generally takes a time proportional to the number of rows in the table to add or remove columns. On slower hardware this can be worse than a minute per million rows - adding a few columns to a table with just a few million rows could lock your site up for over ten minutes.
Finally, MySQL has relatively small limits on name lengths for columns, tables and indexes, as well as a limit on the combined size of all columns an index covers. This means that indexes that are possible on other backends will fail to be created under MySQL.
SQLite
SQLite has very little built-in schema alteration support, and so
Django attempts to emulate it by:
Creating a new table with the new schema
Copying the data across
Dropping the old table
Renaming the new table to match the original name
This process generally works well, but it can be slow and occasionally
buggy. It is not recommended that you run and migrate SQLite in a
production environment unless you are very aware of the risks and its
limitations; the support Django ships with is designed to allow
developers to use SQLite on their local machines to develop less
complex Django projects without the need for a full database.

Even if Postgresql looks better, I find it has some performances issues with Django:
Postgresql is made to handle "long connections" (connection pooling, persistant connections, etc.)
MySQL is made to handle "short connections" (connect, do your queries, disconnect, has some performances issues with a lot of open connections)
The problem is that Django does not support connection pooling or persistant connection, it has to connect/disconnect to the database at each view call.
It will works with Postgresql, but connecting to a Postgresql cost a LOT more than connecting to a MySQL database (On Postgresql, each connection has it own process, it's a lot slower than just popping a new thread in MySQL).
Then you get some features like the Query Cache that can be really useful on some cases. (But you lost the superb text search of PostgreSQL)

When a migration fails in django-south, the developers encourage you not to use MySQL:
! The South developers regret this has happened, and would
! like to gently persuade you to consider a slightly
! easier-to-deal-with DBMS (one that supports DDL transactions)

Having gone down the road of MySQL because I was familiar with it (and struggling to find a proper installer and a quick test of the slow web "workbench" interface of postgreSQL put me off), at the end of the project, after a few months after deployment, while looking into back up options, I see you have to pay for MySQL's enterprise back up features. Gotcha right at the very end.
With MySql I had to write some ugly monster raw SQL queries in Django because no select distinct per group for retrieving the latest per group query. Also looking at postgreSQL's full-text search and wishing I had used postgresSQL.
I recommend PostgreSQL even if you are familiar with MySQL, but your mileage may vary.
UPDATE: DBeaver is a great equivalent of MySql Workbench gui tool but works with PostgreSQL very nicely (and many others as its a universal DB tool).

To add to previous answers :
"Full text search might be better supported for MySQL"
The FULLTEXT index in MySQL is a joke.
It only works with MyISAM tables, so you lose ACID, Transactions, Constraints, Relations, Durability, Concurrency, etc.
INSERT/UPDATE/DELETE to a largish TEXT column (like a forum post) will a rebuild a large part of the index. If it does not fit in myisam_key_buffer, then large IO will occur. I've seen a single forum post insertion trigger 100MB or more of IO ... meanwhile the posts table is exclusiely locked !
I did some benchmarking (3 years ago, may be stale...) which showed that on large datasets, basically postgres fulltext is 10-100x faster than mysql, and Xapian 10-100x faster than postgres (but not integrated).
Other reasons not mentioned are the extremely smart query optimizer, large choice of join types (merge, hash, etc), hash aggregation, gist indexes on arrays, spatial search, etc which can result in extremely fast plans on very complicated queries.

Will this application be hosted on your own servers or by a hosting company? Make sure that if you are using a hosting company, they support the database of choice.

There is a major licensing difference between the two db that will affect you if you ever intend to distribute code using the db. MySQL's client libraries are GPL and PostegreSQL's is under a BSD like license which might be easier to work with.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008