High Performance Database Opinion - MySQL

I'm developing software that uses a MySQL database, accessed through Hibernate.
The problem I'm having is that looking up a single keyword already takes 40,000 queries, and the application I'm developing should be able to process multiple keywords.
So basically we are dealing with a database filled with String values, and a lot of comparing has to be done. For now, I'm using a filter to load all possible matches into memory and comparing them in the Java code. This is highly recursive and slow.
So obviously MySQL, and most of all Hibernate, are not the way to go.
Could anyone please provide some information on which database would give better performance?
I'm looking into Hypertable, MongoDB, HBase, graph databases, ... but I'm not sure which way to go.
Please help.
Thanks

Your approach is wrong: you're reimplementing something MySQL does natively. It can keep the dataset in RAM and work with it from there, which is exactly what your algorithm is doing by hand.
The other thing is that for specific tasks like text searching, there are known methods and storage engines specialized for exactly that purpose.
Sphinx, for example, is one of them.
Another option is to use a data structure that makes string searches fast, such as a trie, which is incredibly useful for things like autocomplete. (This is just an example that need not be directly connected to your question; it's a hint that there are known data structures that work fast with strings.)
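To make the trie idea concrete, here is a minimal sketch in plain Java, assuming simple lowercase keywords; the class and method names are made up for the example:

import java.util.HashMap;
import java.util.Map;

class Trie {
    // Each node maps the next character to a child node.
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a keyword ends at this node
    }

    private final Node root = new Node();

    // Insert a keyword, creating nodes along its path.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Check whether any stored keyword starts with the given prefix.
    boolean hasPrefix(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return true;
    }
}

Both operations run in time proportional to the length of the string, no matter how many keywords are stored - which is exactly the property that makes tries attractive for this kind of comparison-heavy workload.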
Also, why do you think a NoSQL solution would be quicker when it comes to comparing large volumes of string data?
As others have pointed out, it seems your app design and algorithm are the culprits here, not the underlying technology. You should be more precise in your question and outline what you're doing, how you're doing it, and what you'd like it to be doing. When you answer those questions, people may be able to point you in the right direction, because it seems you've taken the wrong approach.

Perhaps I misunderstand your question, but ...
For now, I'm using a filter to load all possible matches into memory and comparing them in the Java code. This is highly recursive and slow.
It sounds like you're trying to do your database's job in memory. Create an index, write a better SQL query, something - but instead you're loading all possible matches and then iterating through them? At that point, why even use a database?
Basically, I don't think your choice of database is the problem (MySQL can handle queries over far more than 40,000 records without trouble). I think your algorithm needs some work.
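As a rough illustration of pushing the comparison into the database rather than doing it in Java, a sketch using plain JDBC - the table, column, and index names are all invented for the example:

import java.sql.*;

public class KeywordLookup {
    public static void main(String[] args) throws SQLException {
        // Hypothetical table: keywords(id INT PRIMARY KEY, word VARCHAR(255)).
        // With an index like CREATE INDEX idx_word ON keywords (word),
        // MySQL can answer prefix queries without scanning every row.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT id, word FROM keywords WHERE word LIKE ?")) {
            ps.setString(1, "keyword%"); // prefix match can use the index
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("word"));
                }
            }
        }
    }
}

One indexed query like this replaces thousands of round trips, and the database only returns rows that already match.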

Your real problem is that you're using 40,000 queries.
Can you explain your problem and the process that leads to so many queries?
Regardless of which database you go with, your algorithm sounds excessive, so it will always be slow.
Let's fix that first.

Related

Which is the right database for the job?

I am working on a feature and could use opinions on which database I should use to solve this problem.
We have a Rails application using MySQL. We have no issues with MySQL and it runs great. But for a new feature, we are deciding whether to stay with MySQL or not. To simplify the problem, let's assume there is a User and a Message model. A user can create messages. The message is delivered to other users based on their association with the poster.
Obviously there is an association based on friendship, but there are many, many more associations based on the user's profile. I plan to store some metadata about the poster along with the message. That way I don't have to pull the metadata each time I query the messages.
Therefore, a message might look like this:
{
  id: 1,
  message: "Hi",
  created_at: 1234567890,
  metadata: {
    user_id: 555,
    category_1: null,
    category_2: null,
    category_3: null,
    ...
  }
}
When I query the messages, I need to be able to query based on zero or more metadata attributes. This call needs to be fast and occurs very often.
Due to the number of metadata attributes, and the fact that any number of them can be included in a query, creating SQL indexes here doesn't seem like a good idea.
Personally, I have experience with MySQL and MongoDB. I've started researching Cassandra, HBase, Riak and CouchDB. I could use some help from people who might have already done the research on which database is the right one for my task.
And yes, the messages table can easily grow into millions of rows.
This is a very open-ended question, so all we can do is give advice based on experience. The first thing to consider is whether it's a good idea to commit to something you haven't used before instead of MySQL, which you are familiar with. It's boring not to use shiny new things when you have the opportunity, but believe me, it's terrible when you've painted yourself into a corner because you thought the new toy would do everything it said on the box. Nothing ever works the way it says in the blog posts.
I mostly have experience with MongoDB. It's a terrible choice unless you want to spend a lot of time trying different things and realizing they don't work. Once you scale up a bit, you basically can't use things like secondary indexes, updates, and the other features that make Mongo an otherwise awesomely nice tool (most of this has to do with its global write lock and its on-disk format; it's basically bad at concurrency and fragments really easily if you remove data).
I don't agree that HBase is out of the question. It doesn't have secondary indexes, but you can't use those anyway once you get above a certain traffic load. The same goes for Cassandra (which is easier to deploy and work with than HBase). Basically, you will have to implement your own indexing whichever solution you choose.
What you should consider is whether you need consistency over availability, or vice versa (e.g. how bad is it if a message is lost or delayed, versus how bad is it if a user can't post or read a message), and whether you will update your data in place. Data in Riak, for example, is an opaque blob: to change it you need to read it and write it back, whereas in Cassandra, HBase and MongoDB you can add and remove properties without first reading the object. Ease of use is also an important factor. Mongo is certainly easy to use from the programmer's perspective, and HBase is horrible - but just spend some time making your own library that encapsulates the nasty stuff; it will be worth it.
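To make that last distinction concrete, a sketch of an in-place update with the MongoDB Java driver - the database, collection, and field names are all invented for the example; with Riak you would instead fetch the whole object, modify it, and store it back:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class PartialUpdate {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> messages =
                client.getDatabase("app").getCollection("messages");
            // Add one metadata property and drop another in a single
            // server-side operation, without reading the document first.
            messages.updateOne(
                Filters.eq("_id", 1),
                Updates.combine(
                    Updates.set("metadata.category_1", "sports"),
                    Updates.unset("metadata.category_2")));
        }
    }
}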
Finally, don't listen to me - try them out and see how they perform and how they feel. Make sure you load-test as hard as you can, and make sure you test everything you will actually do. I made the mistake of not testing what happens when you remove lots of data from MongoDB, and I have paid for that dearly.
I would recommend looking at the presentation "Why databases suck for messaging", which is mainly about why you shouldn't use databases such as MySQL for messaging.
I think CouchDB's changes feed may come in quite handy in this scenario, although you would probably also have to create some more complex views for querying message metadata. If speed is critical, also take a look at Redis, which is really fast and comes with pub/sub functionality. MongoDB, with its support for ad hoc queries, may also be a decent solution for this use case.
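For a feel of the Redis pub/sub model, a small sketch using the Jedis client - the channel name and host are placeholders for the example:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class MessageFeed {
    public static void main(String[] args) throws InterruptedException {
        // The subscriber blocks on its own thread and prints every message.
        new Thread(() -> {
            try (Jedis subscriber = new Jedis("localhost")) {
                subscriber.subscribe(new JedisPubSub() {
                    @Override
                    public void onMessage(String channel, String message) {
                        System.out.println(channel + ": " + message);
                    }
                }, "messages");
            }
        }).start();

        Thread.sleep(500); // crude wait so the subscription is live first

        // The publisher pushes a message to everyone currently subscribed;
        // note that Redis pub/sub does not persist messages for later readers.
        try (Jedis publisher = new Jedis("localhost")) {
            publisher.publish("messages", "Hi");
        }
    }
}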
I think you're spot-on in storing metadata along with each message! Sacrificing storage for faster retrieval is probably the way to go. Note that it could get complicated if you ever need to change a user's metadata and propagate that change to all their messages. Consider how often that might happen, whether you'll actually need to update all the message records, and based on that whether it's worth paying the price for the sake of fewer queries (it probably is, but that depends on the specifics of your system).
I agree with @Andrej_L that HBase isn't the right solution for this problem, and Cassandra falls with it for the same reason.
CouchDB could solve your problem, but you're going to have to define views (materialized indexes) for any metadata you want to query. If the whole point of moving off MySQL here is to avoid indexing everything, then Couch is probably not the right solution either.
Riak would be a much better option, since it queries your data using map-reduce. That lets you build any query you like without pre-indexing all your data as in Couch. Millions of rows are not a problem for Riak - no worries there. Should the need arise, it also scales very well by simply adding more nodes (and it can rebalance itself too, so this is really a non-issue).
So based on my own experience, I'd recommend Riak. However, unlike you, I have no direct experience with MongoDB, so you'll have to judge it against Riak yourself (or maybe someone else here can answer that).
From my experience, HBase is not a good solution for your application.
Because:
It has no secondary indexes by default (you would have to install plugins or something similar), so you can only search efficiently by primary key. I have implemented secondary indexes in HBase using additional tables; there is a sketch of that pattern after this list. Without them, getting results means running a map/reduce job, which takes far too long over millions of rows for an online application.
It is very difficult to support and tune. To work with it effectively you will run HBase on top of Hadoop, which requires powerful machines, or several of them.
HBase is very useful when you need to build aggregation reports over big amounts of data. It seems that you don't.
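The additional-table pattern mentioned above, as a hedged sketch with the HBase Java client - the table names, column family, and row-key scheme are invented for the example, and both tables are assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table messages = connection.getTable(TableName.valueOf("messages"));
             Table index = connection.getTable(TableName.valueOf("messages_by_category"))) {
            byte[] family = Bytes.toBytes("d");
            // Write the message row, keyed by message id.
            Put message = new Put(Bytes.toBytes("msg-1"));
            message.addColumn(family, Bytes.toBytes("category"), Bytes.toBytes("sports"));
            messages.put(message);
            // Write an index row whose key starts with the indexed value,
            // so a prefix Scan over "sports|" finds all matching message ids.
            Put entry = new Put(Bytes.toBytes("sports|msg-1"));
            entry.addColumn(family, Bytes.toBytes("ref"), Bytes.toBytes("msg-1"));
            index.put(entry);
        }
    }
}

The cost of this pattern is that every write becomes two writes, and the application is responsible for keeping the index table in sync.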
Due to the number of metadata attributes and the fact that any number can be included in a query, creating SQL indexes here doesn't seem like a good idea.
It sounds like you need a join, so you can mostly forget about CouchDB until they sort out the multiview code that was being worked on (I'm not actually sure it's still being worked on).
Riak can query as fast as you make it; it depends on the nodes.
Mongo will let you create an index on any field, even if that field is an array (see the sketch after this list).
CouchDB is very different: it builds indexes using a stored map-reduce (but without the reduce) that it calls a "view".
RethinkDB will let you have SQL, but a little faster.
TokuDB will too.
Redis will beat them all on speed, but it's entirely stored in RAM.
Single-level relations can be done in all of them, but differently in each.
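On the Mongo point, a sketch of a "multikey" index on an array field with the Java driver - the database, collection, and field names are placeholders for the example:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

import java.util.Arrays;

public class ArrayIndex {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> messages =
                client.getDatabase("app").getCollection("messages");
            // Indexing an array field makes Mongo index every element of it.
            messages.createIndex(Indexes.ascending("tags"));
            messages.insertOne(new Document("message", "Hi")
                .append("tags", Arrays.asList("sports", "news")));
            // eq on an array field matches documents where any element
            // equals the value, and the query can use the index.
            for (Document d : messages.find(Filters.eq("tags", "sports"))) {
                System.out.println(d.toJson());
            }
        }
    }
}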

XML or MySQL: which should be used for storing connected data?

I am writing code for a friend list and messaging system for my college website. I need to store interconnected data and search through it. There are about 3,500 records. So which way should I proceed: MySQL or XML? Which is fastest? Which is best, and why?
I'm going to use one of my professor's favorite answers here: "it depends."
XML and MySQL have very different applications. If you need to run lots of simultaneous queries of all sorts of sophisticated kinds, MySQL is the clear winner. MySQL can be hard to get started with because you must first create a database schema for your data to fit into; it sounds, though, like you have many records with the same structure, so it would be easy enough to put them into a database. With a SQL-based engine like MySQL, you can also construct queries in the standard SQL language, and database optimizations such as indexes and keys can increase the performance of those queries. If your data needs to be updated regularly, then MySQL will also likely perform better, since it will not have to rewrite an entire XML file on every change. And if your application needs to scale to many simultaneous connections running sophisticated queries, you definitely want some sort of SQL solution.
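To make the schema-and-index point concrete for a friend list, a hedged sketch over JDBC - the database name, credentials, and table layout are invented for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FriendSchema {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/college", "user", "password");
             Statement st = conn.createStatement()) {
            // A self-referencing join table models the friend relation;
            // the composite primary key also indexes lookups by user_id.
            st.execute("CREATE TABLE friends ("
                     + "  user_id INT NOT NULL,"
                     + "  friend_id INT NOT NULL,"
                     + "  PRIMARY KEY (user_id, friend_id))");
            // An extra index on friend_id makes the reverse lookup
            // ('who has added me?') fast as well.
            st.execute("CREATE INDEX idx_friend ON friends (friend_id)");
        }
    }
}

At 3,500 records either store would feel fast, but this layout keeps the data searchable with a single SQL query instead of walking an XML file.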
Depending on your application, though, there are sometimes other ways to store and access your data. I once needed a persistent data structure on disk that could be read very quickly but never updated, and for that I used cdb. There are also other database systems out there, such as Berkeley DB, and some NoSQL solutions such as CouchDB and MongoDB. I posed a somewhat interesting question here on Stack Overflow about the use of NoSQL solutions a little while back, which you may find interesting as well.
This is really just a sampling of the different considerations to make when choosing how to store your data. Think about questions like: how frequently will things be queried, or updated? What will the queries look like? What kinds of applications need to access the data? And so on.

Where to use MongoDB and where MySQL?

I am thinking about using one of two databases - MySQL or MongoDB. I am planning to store text and numeric data, and I will be building my app in RoR.
So I don't know which database system would be better for this purpose. Can you please help me decide - by which criteria should I choose?
Let me cast this question within more general setting and into some historical perspective.
In the 60s they were asking whether to use a hierarchical or a network database.
In the 70s the debate was relational against network.
In the 80s relational turned into SQL databases, so the question mutated to SQL vs. network.
In the 90s it was SQL against object databases.
In the 00s it was SQL against XML databases.
Today we have SQL vs. NoSQL.
Do you see a pattern here? Would you still bet money on an SQL competitor, especially if it's nothing more than a glorified hash table?
I have also used MySQL and MongoDB (with Mongoid) in my projects, and I can say that if you want to keep binary data like images, MP3s and other such stuff in your database, try Mongo; for everything else a SQL database will do. MongoDB has no fixed structure - you are effectively working with a hash, so you can dynamically add and remove keys/columns, as the sketch below shows.
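A tiny illustration of that schema-less quality using the MongoDB Java driver's Document type - the field names are arbitrary for the example:

import org.bson.Document;

public class DynamicKeys {
    public static void main(String[] args) {
        // Documents in the same collection need not share a schema:
        // each one can carry its own set of keys.
        Document first = new Document("message", "Hi")
            .append("created_at", 1234567890);
        Document second = new Document("message", "Hello")
            .append("mood", "cheerful"); // a key the first document lacks
        first.remove("created_at");      // and keys can be dropped again
        System.out.println(first.toJson());
        System.out.println(second.toJson());
    }
}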
In your case I would use MySQL.
In my opinion you should base your decision on the purpose of your application. Do you want to search through your text data? How will you define your keys? There is little point in going with MySQL if you then have to fetch each record and scan it yourself. Even if MySQL has functionality to do text scans (does it have that?), MongoDB will probably do that job more efficiently. The other way around: if you are not going to use MongoDB's strong points, you might as well go with MySQL.
Another factor might be your deadline for implementing something. If you need it fast, don't waste time learning something new. If you have time to experiment, figure out the key features your application will most likely rely on.
I think that if you need a rigid structure you should use MySQL, because that's its nature, but if you need something more dynamic, with no structure at all (schema-less), you should use MongoDB. I've never used MongoDB, but I know it's more object/document oriented.
It would be helpful if you could provide some more detail. Would your data fit easily into a schema, or do you need the flexibility that a document store offers? What about auto-sharding, etc.? Without more information, no one can give you advice that fits your needs; lacking that, you can't hope for feedback any better than people's personal preferences, which is little more than a flamewar waiting to happen.

What kinds of data queries are too hard to do on CouchDB (as opposed to SQL)? Seeking concrete examples

I think CouchDB is really cool and want to use it more. But I'd also like to know ahead of time whether there are any types of data query that are easy in MySQL but impossible or very awkward to accomplish in CouchDB.
Please answer with concrete answers or examples instead of just saying that "CouchDB is for documents and MySQL is for relational data." I don't really know what that statement means, since it seems you can do things functionally equivalent to relational MySQL joins with CouchDB views.
For example, I've read that paginating through a data set is a bit awkward in CouchDB. This is the sort of answer I'm looking for.
A problem I'm having at the moment is displaying an AJAX grid with contents from a CouchDB database. The equivalent SQL request would be:
SELECT * FROM the_table
WHERE {filter_col} = {filter_value} [ AND ... ]
ORDER BY {order_col}
LIMIT {n} OFFSET {m}
It's a pretty simple request to run on a traditional SQL database, but performing filtering, ordering and paging all at the same time is beyond what CouchDB's indexing can manage - at least, not without creating an insane number of different views.
CouchDB has a hard time with full-text search (unless external software is used); MySQL isn't particularly good at that either, but Couch is still worse.
CouchDB isn't going to do a good job when your data model implies multiple, complex relations between objects; after all, it's a document-based system, not a relational DBMS.
Other than that, IMO couch rules.
EDIT: Particularly when you need to relax, of course! :)
It all depends on the motivation behind changing data stores. What problem or architectural challenge are you trying to overcome with MySQL that CouchDB can solve? If, at the end of the day, there is no difference in functionality or performance, then refactoring to change database platforms can't be justified.
Have a look at some ORM frameworks, which, if implemented correctly, can let you swap out the back-end database easily.

Scalable 'google suggestions'-like system

I have 100,000 queries, and I need to create a Google-like 'suggestions' system.
I need it to be pretty quick, and if possible to allow for some more in-depth options (like sorting, etc.).
Can anyone recommend a database system that can search through 100K+ queries while keeping up the speed, or an existing project you think would fit my needs?
I've been looking into possibly using MongoDB, but I'm not yet sure that's the best route.
Any help is appreciated!
There are many database solutions that will easily handle this requirement; 100K rows really isn't very many for a database. Based on what you've said in your question, there isn't really a 'best' solution.
It just depends on what you have access to, and perhaps on how you see the application growing. If it's going to grow into something more complex, you might be better off with a full relational database such as MySQL or MSSQL; otherwise MongoDB will be fine.
If they're really just 100K words, I'd be tempted to load the whole thing into memory as a prefix trie. That will be blazingly fast.
Of course, that makes it slightly harder to update... what does adding to the list of options look like for you? Do you need an option added on one machine to be instantly available everywhere, or is eventual consistency good enough?
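For a sense of how simple the in-memory route can be, a sketch using a sorted set instead of a hand-rolled trie - at 100K terms, the O(log n) prefix seek of Java's TreeSet is more than fast enough; all names here are invented for the example:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class Suggestions {
    private final TreeSet<String> terms = new TreeSet<>();

    void add(String query) {
        terms.add(query.toLowerCase());
    }

    // Return up to 'limit' stored queries that start with the prefix.
    List<String> suggest(String prefix, int limit) {
        List<String> out = new ArrayList<>();
        String p = prefix.toLowerCase();
        // tailSet jumps straight to the first term >= prefix; since the
        // set is sorted, all matches are contiguous from that point on.
        for (String term : terms.tailSet(p)) {
            if (!term.startsWith(p) || out.size() == limit) break;
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        Suggestions s = new Suggestions();
        s.add("google maps");
        s.add("google mail");
        s.add("gopher");
        System.out.println(s.suggest("goo", 5)); // [google mail, google maps]
    }
}

Ranking (by popularity, say) would mean sorting the collected matches afterwards, and eventual consistency across machines could be as simple as each server rebuilding the set from a shared list periodically.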