I'd like an analysis of the MapReduce view indexer versus the GSI indexer in Couchbase. I mean, which type of indexing should be used, and when?
Read this blog post and see if it doesn't clear it up for you. It should. If not, please do come back and refine your question.
Related
MySQL was shut down in the middle of an indexing operation.
It still works but some of the queries seem much slower than before.
Is there anything particular we can check?
Is it possible that an index was left half-built?
Thanks much
As I suggested in my comment, you could try a repair on the relevant table(s).
That said, there's a section of the MySQL manual dedicated to this precise topic, which details how to use the REPAIR <table> statement and indeed dump/re-import.
If this doesn't make any difference, you may need to check the database settings (if it's an InnoDB-engined table/database, for example, it benefits greatly from being able to stay resident in memory) and perhaps see which specific indexes are being used by running EXPLAIN on the queries that are causing pain.
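For example, from a quick Python script that might look roughly like this (a minimal sketch; the connection details and the messages table are hypothetical):

    # Minimal sketch; credentials and the "messages" table are made up.
    import mysql.connector

    conn = mysql.connector.connect(user="app", password="secret", database="mydb")
    cur = conn.cursor()

    # REPAIR TABLE only applies to MyISAM/ARCHIVE/CSV tables; for InnoDB
    # you'd typically dump and re-import instead.
    cur.execute("REPAIR TABLE messages")
    for row in cur.fetchall():
        print(row)  # (table, op, msg_type, msg_text)

    # EXPLAIN shows which index (if any) a slow query is actually using.
    cur.execute("EXPLAIN SELECT * FROM messages WHERE created_at > 1234567890")
    for row in cur.fetchall():
        print(row)

    cur.close()
    conn.close()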
There are also commercial tools such as New Relic that'll show which specific queries are sluggish in quite a lot of detail, as well as monitoring other aspects of your system, which may be worth exploring if this is a commercial project/web site.
I am working on a feature and could use opinions on which database I should use to solve this problem.
We have a Rails application using MySQL. We have no issues with MySQL and it runs great. But for a new feature, we are deciding whether to stay with MySQL or not. To simplify the problem, let's assume there is a User and a Message model. A user can create messages. The message is delivered to other users based on their association with the poster.
Obviously there is an association based on friendship, but there are many, many more associations based on the user's profile. I plan to store some metadata about the poster along with the message. This way I don't have to pull the metadata each time I query the messages.
Therefore, a message might look like this:
{
  id: 1,
  message: "Hi",
  created_at: 1234567890,
  metadata: {
    user_id: 555,
    category_1: null,
    category_2: null,
    category_3: null,
    ...
  }
}
When I query the messages, I need to be able to query based on zero or more metadata attributes. This call needs to be fast and occurs very often.
Due to the number of metadata attributes, and the fact that any number of them can be included in a query, creating SQL indexes here doesn't seem like a good idea.
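To illustrate, the query would have to be assembled from whichever attributes the caller happens to supply; roughly something like this (a hypothetical sketch, the table and column names are made up):

    # Hypothetical sketch: build a parameterized query from whichever
    # metadata attributes are present. Any combination of columns can end
    # up in the WHERE clause, which is what makes fixed indexes awkward.
    def build_messages_query(filters):
        sql = "SELECT * FROM messages"
        clauses, params = [], []
        for column, value in filters.items():
            clauses.append(f"{column} = %s")  # columns come from a trusted whitelist
            params.append(value)
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return sql, params

    print(build_messages_query({"category_1": 7, "category_3": 42}))
    # ('SELECT * FROM messages WHERE category_1 = %s AND category_3 = %s', [7, 42])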
Personally, I have experience with MySQL and MongoDB. I've started research on Cassandra, HBase, Riak and CouchDB. I could use some help from people who might have done the research as to which database is the right one for my task.
And yes, the messages table can easily grow into millions of rows.
This is a very open-ended question, so all we can do is give advice based on experience. The first thing to consider is whether it's a good idea to decide on something you haven't used before, instead of MySQL, which you are familiar with. It's boring not to use shiny new things when you have the opportunity, but believe me, it's terrible when you've painted yourself into a corner because you thought the new toy would do everything it said on the box. Nothing ever works the way it says in the blog posts.
I mostly have experience with MongoDB. It's a terrible choice unless you want to spend a lot of time trying different things and realizing they don't work. Once you scale up a bit, you basically can't use things like secondary indexes, updates, and the other things that make Mongo an otherwise awesomely nice tool (most of this has to do with its global write lock and the on-disk database format; it basically sucks at concurrency and fragments really easily if you remove data).
I don't agree that HBase is out of the question; it doesn't have secondary indexes, but you can't use those anyway once you get above a certain traffic load. The same goes for Cassandra (which is easier to deploy and work with than HBase). Basically, you will have to implement your own indexing whichever solution you choose.
What you should consider is things like whether you need consistency over availability, or vice versa (e.g. how bad is it if a message is lost or delayed vs. how bad is it if a user can't post or read a message), and whether you will do updates to your data (e.g. data in Riak is an opaque blob; to change it you need to read it and write it back, while in Cassandra, HBase and MongoDB you can add and remove properties without first reading the object). Ease of use is also an important factor. Mongo is certainly easy to use from the programmer's perspective, and HBase is horrible, but just spend some time making your own library that encapsulates the nasty stuff; it will be worth it.
Finally, don't listen to me, try them out and see how they perform and how it feels. Make sure you try to load it as hard as you can, and make sure you test everything you will do. I've made the mistake of not testing what happens when you remove lots of data in MongoDB, and have paid for that dearly.
I would recommend looking at the presentation Why databases suck for messaging, which mainly addresses why you shouldn't use databases such as MySQL for messaging.
I think in this scenario CouchDB's changes feed may come in quite handy, although you would probably also have to create some more complex views based on querying message metadata. If speed is critical, also take a look at Redis, which is really fast and comes with pub/sub functionality. MongoDB, with its ad hoc query support, may also be a decent solution for this use case.
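For reference, Redis pub/sub is only a few lines from Python (a rough sketch using redis-py; the channel name is made up):

    # Rough sketch with redis-py; the "messages" channel name is made up.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Subscriber side: register interest in a channel.
    p = r.pubsub()
    p.subscribe("messages")

    # Publisher side (e.g. in another process):
    r.publish("messages", "Hi")

    # listen() first yields the subscribe confirmation, then messages.
    for item in p.listen():
        if item["type"] == "message":
            print(item["data"])  # b'Hi'
            break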
I think you're spot-on in storing metadata along with each message! Sacrificing storage for faster retrieval time is probably the way to go. Note that it could get complicated if you ever need to change a user's metadata and propagate that change to all their messages. You should consider how often that might happen, whether you'll actually need to update all the message records, and based on that, whether it's worth paying the price for the sake of fewer queries (it probably is, but that depends on the specifics of your system).
I agree with #Andrej_L that HBase isn't the right solution for this problem. Cassandra falls in with it for the same reason.
CouchDB could solve your problem, but you're going to have to define views (materialized indices) for any metadata you're going to want to query. If the whole point of not using MySQL here is to avoid indexing everything, then Couch is probably not the right solution either.
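For reference, defining such a view means storing a JavaScript map function in a design document, e.g. via CouchDB's plain HTTP API (a sketch; the database, design document and view names are made up):

    # Sketch using CouchDB's HTTP API; the "messages" database and the
    # "by_category" view names are made up.
    import requests

    design_doc = {
        "views": {
            "by_category": {
                # Emit one row per message, keyed by a metadata attribute.
                "map": "function(doc) { if (doc.metadata) { emit(doc.metadata.category_1, null); } }"
            }
        }
    }
    requests.put("http://localhost:5984/messages/_design/app", json=design_doc)

    # Query the view; CouchDB builds/updates the index on first access.
    resp = requests.get(
        "http://localhost:5984/messages/_design/app/_view/by_category",
        params={"key": "7"},
    )
    print(resp.json())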
Riak would be a much better option, since it queries your data using map-reduce. That allows you to build any query you like without the need to pre-index all your data as in Couch. Millions of rows are not a problem for Riak - no worries there. Should the need arise, it also scales very well by simply adding more nodes (and it can balance itself too, so this is really a non-issue).
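As a rough illustration, such a map-reduce job can be submitted over Riak's HTTP interface (a sketch; the bucket name and the category filter are made up):

    # Sketch against Riak's HTTP map-reduce endpoint; the "messages"
    # bucket and the category filter are made up.
    import requests

    job = {
        "inputs": "messages",  # whole-bucket input; fine for a sketch, but full scans are costly
        "query": [{
            "map": {
                "language": "javascript",
                # Riak.mapValuesJson is one of Riak's built-in JS helpers.
                "source": """
                    function (value) {
                        var msg = Riak.mapValuesJson(value)[0];
                        if (msg.metadata && msg.metadata.category_1 === 7) {
                            return [msg];
                        }
                        return [];
                    }
                """
            }
        }]
    }
    resp = requests.post("http://localhost:8098/mapred", json=job)
    print(resp.json())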
So based on my own experience, I'd recommend Riak. However, unlike you, I have no direct experience with MongoDB, so you'll have to judge it against Riak yourself (or maybe someone else here can answer that).
From my experience, HBase is not a good solution for your application.
Because:
It doesn't have secondary indexes by default (you'd have to install plugins or something like that), so you can effectively search only by primary key. I have implemented a secondary index in HBase using additional tables (see the sketch below), but you can't use that in an online application, because getting a result means running a map/reduce job, and that takes a long time over millions of records.
It's very difficult to support and tune this DB. To work effectively you will use HBase together with Hadoop, and that requires powerful machines, or several of them.
HBase is very useful when you need to build aggregation reports over big amounts of data. It seems that you don't.
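To make the first point concrete, the manual secondary index I mentioned looks roughly like this (a sketch using the happybase Python client; the table and column names are made up):

    # Sketch with happybase; the tables and column family names are made up.
    # A second table maps an attribute value back to the primary row key,
    # and the application must keep both tables in sync itself.
    import happybase

    conn = happybase.Connection("localhost")
    messages = conn.table("messages")
    by_category = conn.table("messages_by_category")  # the "index" table

    def put_message(row_key, category, body):
        # row_key, category and body are all bytes in this sketch.
        messages.put(row_key, {b"d:body": body, b"d:category": category})
        # Index row key: attribute value + original key, so a prefix scan
        # over the index table finds all matching messages.
        by_category.put(category + b"|" + row_key, {b"d:key": row_key})

    def find_by_category(category):
        for key, data in by_category.scan(row_prefix=category + b"|"):
            yield messages.row(data[b"d:key"])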
Due to the number of metadata attributes, and the fact that any number of them can be included in a query, creating SQL indexes here doesn't seem like a good idea.
It sounds like you need a join, so you can mostly forget about CouchDB till they sort out the multiview code that was being worked on (not actually sure it still is).
Riak can query as fast as you make it; it depends on the nodes.
Mongo will let you create an index on any field, even if that field is an array (see the sketch after this list).
CouchDB is very different: it builds indexes using a stored Map-Reduce (but without the reduce) that they call a "view".
RethinkDB will let you have SQL-like querying, but a little faster.
TokuDB will too.
Redis will beat them all on speed, but it's entirely stored in RAM.
Single-level relations can be done in all of them, but differently in each.
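For the Mongo point above, creating such an index is a one-liner with pymongo (a sketch; the database, collection and field names are made up):

    # Sketch with pymongo; database, collection and field names are made up.
    from pymongo import ASCENDING, MongoClient

    client = MongoClient("localhost", 27017)
    messages = client.app.messages

    # Works the same whether the field holds a scalar or an array
    # (Mongo builds a multikey index automatically for arrays).
    messages.create_index([("metadata.category_1", ASCENDING)])

    print(list(messages.find({"metadata.category_1": 7}).limit(10)))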
I'm struggling with MySQL index optimization for some queries that should be simple but are taking forever. Rather than post the specific problem, I wanted to ask if there is an automated way of dealing with these.
I searched around but couldn't find anything. Surely, if query/index optimization is just following a set of steps, then someone must have written an app to automate it for a given query... or am I not appreciating the complexities involved?
Well, I can offer a SQL indexing tutorial. Let us know if you succeed with automation ;)
Not so sure about MySQL, but there are tools for Oracle and SQL Server. They cover the trivial cases, but they tend to give a false sense of safety regarding non-trivial cases. Nor do they consider the overall workload very well; they are usually limited to suggesting indexes for particular statements.
If it were that simple, you'd have an automated index builder within MySQL.
Actually, there is a query optimizer built into MySQL, and it transparently rewrites your queries into what it finds to be the most optimal form before executing them. It doesn't always work all that well, though, and has its own quirks. Knowing these helps avoid some common pitfalls (like using dependent subqueries).
There are tools that, given the query log, can show you which indexes are not used; and by enabling logging of queries that don't use an index, you can see which ones need one. The problem is that indexes are expensive, so you cannot just index everything, and which indexes you need depends on your queries.
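For instance, MySQL can be told to log queries that don't use any index (a sketch; it assumes an account with sufficient privileges):

    # Sketch: enable MySQL's slow query log and the logging of queries
    # that use no index at all (needs appropriate privileges).
    import mysql.connector

    conn = mysql.connector.connect(user="root", password="secret")
    cur = conn.cursor()
    cur.execute("SET GLOBAL slow_query_log = 'ON'")
    cur.execute("SET GLOBAL log_queries_not_using_indexes = 'ON'")

    # The resulting log can then be summarized offline, e.g. with mysqldumpslow.
    cur.close()
    conn.close()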
I'm developing software using a MySQL database and Hibernate to access it.
The problem I am having is that looking up a single keyword already issues 40,000 queries, and the application I am developing should be able to process multiple keywords.
So basically we are dealing with a database filled with String values, and a lot of comparing has to be done. For now, using a filter, I'm loading all possible matches into memory and comparing them in the Java code. This is highly recursive and slow.
So obviously MySQL, and most of all Hibernate, are not the way to go.
Could anyone please provide some information on which database would give better performance?
I'm looking into Hypertable, MongoDb, Hbase, Graph Database, ... but I'm not sure which way to go.
Please help.
Thanks
Your approach is wrong, and you're reimplementing something MySQL does natively - it can keep the dataset in RAM and work with it from there, which is what you're doing with your algorithm.
The other thing is that for specific tasks like text searching, there are known methods and various storage engines specialized for that purpose.
For example, Sphinx is one of those.
Another thing is actually using some sort of data structure that makes searches quick, such as a trie, which is incredibly useful for things such as autocomplete (this is just an example that doesn't have to be directly connected to your question; it's just a hint that there are known data structures that work fast with strings).
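A bare-bones trie with prefix search, just to show the idea (a sketch):

    # Minimal trie sketch: insert words, then list all words under a prefix.
    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_word = False

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

        def starts_with(self, prefix):
            node = self.root
            for ch in prefix:
                if ch not in node.children:
                    return []
                node = node.children[ch]
            # Depth-first walk collecting complete words under the prefix.
            results, stack = [], [(node, prefix)]
            while stack:
                node, word = stack.pop()
                if node.is_word:
                    results.append(word)
                for ch, child in node.children.items():
                    stack.append((child, word + ch))
            return results

    t = Trie()
    for w in ["car", "card", "care", "cat"]:
        t.insert(w)
    print(t.starts_with("car"))  # e.g. ['car', 'care', 'card']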
Also, why do you think a NoSQL solution would be quicker when it comes to comparing large volumes of string data?
As others have pointed out, it seems your app design and algorithm are the culprits here, not the underlying technology. You should be more exact in your question and outline what it is that you're doing, how you're doing it, and what you'd like it to be doing. When you answer those questions, people might point you in the right direction for solving your problem, because it seems you've taken the wrong approach.
Perhaps I misunderstand your question, but ...
For now, using a filter, I'm loading all possible matches into memory and comparing them in the Java code. This is highly recursive and slow.
Sounds like you're trying to do your database's job in memory? Create an index, write a better SQL query, or something, but you're loading all possible matches and then iterating through them? At that point, why even use a database?
Basically, I don't think it's your choice of database (MySQL can handle far more than 40,000 records with no problem). I think your algorithm needs some work.
Your real problem is that you're issuing 40,000 queries.
Can you explain your problem and process that leads to so many queries?
Regardless of which database you go with, your algorithm sounds excessive, so it will always be slow.
Let's fix it first.
Will CouchDB be better than MySQL for storing the forum posts/topics, assuming there is proper caching (i.e. memcached being used)?
It seems at first glance that CouchDB is made for this; the whole document-oriented design fits perfectly. But I'm more concerned about performance.
Any suggestions?
CouchDB is fast. It will meet your needs perfectly. It's good for a forum, as each post and all related comments/posts in a thread will be self-contained. CouchDB maps are, as far as I've seen, faster than MySQL joins when MySQL has a large dataset.
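To illustrate the self-contained part, a whole thread can live in a single document created with one HTTP call (a sketch; the database and field names are made up):

    # Sketch: one CouchDB document per thread, with posts embedded; the
    # "forum" database and all field names are made up.
    import requests

    thread = {
        "type": "thread",
        "title": "Which database for messaging?",
        "posts": [
            {"author": "alice", "body": "CouchDB fits this nicely.", "created_at": 1234567890},
            {"author": "bob", "body": "What about performance?", "created_at": 1234567999},
        ],
    }
    resp = requests.post("http://localhost:5984/forum", json=thread)
    print(resp.json())  # {'ok': True, 'id': '...', 'rev': '...'}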
I would say go for it.
Edit:
If you want an example of how CouchDB can be used in a decent way, check out skinnyboard. It's an agile planning tool that keeps tasks on a story, and stories on a board, all with permissions, in one CouchDB document. The code is a little messy in some places, but it's a good example of data encapsulation using CouchDB.