I want data queried from several heterogeneous databases, and joined to 1k~1m rows of results each time. in order to improve the performance, I want to add a cache for the result data. because the user may do some sorting or filtering on the result data as well, I would like to cache the results to a relational-database-like system, rather than KV-cache(memcached)
Is there any good tools for this kind of usage? IMHO, mysql has good read performance, but its write performance is not suited as a cache system, is that true?
Hibernate caches automatically, but I've been in this situation before and caching isn't the way to go: It adds more problems than it solves, and usually only briefly delays the inevitable change of technology to... a NoSQL database approach.
Try lucene or mongodb, and store the data fully denormalized in there. You won't be sorry.
Related
I am interested in doing some performance query tests in MySQL and Cassandra based on the same dataset and using only one node
What I want is to check the response time to queries in Cassandra and MySQL for different types of data volume and also with multiple data access. (try stressing the database).
What better way to do this? And what is the most appropriate benchmark for this?
First, I'll try to answer your question.
Most people stress cassandra using the cassandra-stress tool, which will be more than useless at testing MySQL. You'll need to find some generic tool (say, YCSB) that services both MySQL and Cassandra, and then compare those the best you can. YCSB is at https://github.com/brianfrankcooper/YCSB/wiki , and you can probably google for more options.
That said, if you're comparing single machine performance, you're looking at the wrong thing. That's not why people use Cassandra - MySQL is probably as fast, or faster, than Cassandra, when you compare a single node at a time, and SQL is going to be far more developer friendly than CQL (JOINs can be really nice). However, Cassandra is designed for use cases where data doesn't fit on one machine, and indeed, may not fit on a dozen machines. It's designed for scenarios where you need multi-datacenter active/active HA. It's designed for use cases where you need to be able to scale up and scale down over time, adding and removing nodes to match your load. Those are all things that are very difficult to do with MySQL and nearly trivial with Cassandra.
If you're JUST comparing speed, you may not need Cassandra at all. Choosing Cassandra should be about choosing use case - mostly scalability and HA.
Let's imagine very simple task to be done on server. There are many users chatting on our site and we would like to know, if each of them is online or not.
There are two obvious approaches to do that — use MySQL database or apply memcached NoSQL solution.
But why should memcached perform faster? If I understand it correct, MySQL will also read data from memory, not from the disk (if set up and tuned correctly). Few resources for persistence, but also not too much — just few memory pages to flush on disk.
The main question. Is there a strong reason to go NoSQL for such a task or MySQL will also perform ok?
For such a trivial task, you're right, it won't significantly change the performance, as the data will remain in memory and I/O won't be an issue.
Your question seems to imply memcached is a typical NoSQL engine; let me emphasize that memcached is an entity on its own and usually not conceptualized as a NoSQL database, but more as a fast and volatile key-value store, more often than not backed by a disk-bound database.
SQL and NoSQL each have strong points and weaknesses out of scope with your question, and more info about that is available in another thread.
NoSQL in general is for analysis of Big Data. Memcached is for making a fast caching system.
A chat doesn't need analysis of Big Data, nor a cache system, because you only need to show a few data, and data are often updated. So, a relational DBMS is the best choice.
Imagine you have a complex site which rarely changes. Imagine that your pages are complex, and several complicated queries must be executed to compose each page. In that case, using memcached makes sense, because you can compose the pages, and store them in-memory.
Imagine you have enormous business intelligence data. You need to get some aggregating operations, like avg, standard deviation, sums... well, a Big Data solution may perform better than MySQL. Thought, there is a great number of caveats.
Conclusion: NoSQL is not for chats :)
In our (currently MySQL) database there are over 120 million records, and we make frequent use of complex JOIN queries and application-level logic in PHP that touch the database. We're a marketing company that does data mining as our primary focus, so we have many large reports that need to be run on a daily, weekly, or monthly basis.
Concurrently, customer service operates on a replicated slave of the same database.
We would love to be able to make these reports happen in real time on the web instead of having to manually generate spreadsheets for them. However, many of our reports take a significant amount of time to pull data for (in some cases, over an hour).
We do not operate in the cloud, choosing instead to operate using two physical servers in our server room.
Given all this, what is our best option for a database?
I think you're going the wrong way about the problem.
Thinking if you drop in NoSQL that you'll get better performance is not really true. At the lowest level, you're writing and retrieving a fair chunk of data. That implies your bottleneck is (most likely) HDD I/O (which is the common bottleneck).
Sticking to the hardware you have momentarily and using a monolithic data storage isn't scalable and as you noticed - has implications when wanting to do something in real-time.
What are your options? You need to scale your server and software setup (which is what you'd have to do with any NoSQL anyway, stick in faster hard drives at some point).
You also might want to look into alternative storage engines (other than MyISAM and InnoDB - for example, one of better engines that seemingly turn random I/O to sequential I/O is TokuDB).
Implementing faster HDD subsystem would also aid to your needs (FusionIO if you have the resources to get it).
Without more information on your end (what the server setup is, what MySQL version you're using and what storage engines + data sizes you're operating with), it's all speculation.
Cassandra still needs Hadoop for MapReduce, and MongoDB has limited concurrency with regard to MapReduce...
... so ...
... 120 mio records is not that much, and MySQL should easily be able to handle that. My guess is an IO bottleneck, or you're doing lots of random reads instead of sequential reads. I'd rather hire a MySQL techie for a month or so to tune your schema and queries, instead of investing into a new solution.
If you provide more information about your cluster, we might be able to help you better. "NoSQL" by itself is not the solution to your problem.
As much as I'm not a fan of MySQL once your data gets large, I have to say that you're nowhere near needing to move to a NoSQL solution. 120M rows is not a big deal: the database I'm currently working with has ~600M in one table alone and we query it efficiently. Managing that much data from an ops perspective is the problem; querying it isn't.
It's all about proper indexes and the correct use of them when joining, and secondarily memory settings. Find your slow queries (mysql slow query log FTW!), and learn to use the explain keyword to understand whey they are slow. Then tweak your indexes so your queries are efficient. Further, make sure you understand MySQL's memory settings. There are great pages in the docs explaining how they work, and they aren't that hard to understand.
If you've done both of those things and you're still having problems, make sure disk I/O isn't an issue. Then you should look in to another solution for querying your data if it is.
NoSQL solutions like Cassandra have a lot of benefits. Cassandra is fantastic at writing data. Scaling your writes is very easy--just add more nodes! But the tradeoff is that it's harder to get the data back out. From a cost perspective, if you have expertise in MySQl, it's probably better to leverage that and scale your current solution until it hits a limit before completely switching your underlying architecture.
i am writing code for friend list and messaging system for my college website.I need to store interconnected data.. need to search them ...It has about 3500 records..So which way I proceed MYSQL or XML ..which is fastest..which is best ?why?
I'm going to use one of my professor's favorite answers here: "it depends."
XML and MySQL have very different applications. If you need to be doing lots of simultaneous queries for all sorts of sophisticated things, MySQL is your clear winner. Sometimes MySQL can be hard to use in some applications because you must first create a database schema in which to fit your data. It sounds like though, that you have many records with the same structure, and it would be easy enough to throw them into a database. With a SQL based database engine like MySQL, you can also construct queries using the standard SQL language. Database optimizations can also help to increase the performance of these types of queries, for example, you can used indexes and keys. If your data needs to be updated regularly, than MySQL will likely provide better performance as it will not have to rewrite the XML file. If you need your application to scale to many simultaneous connections of sophisticated queries, you are definitely going to want to go with some sort of SQL solution.
Depending upon your application though, sometimes there are other ways to store and access your data. I for one once needed to create a persistent data structure on the disk which could be accessed very quickly, but never updated. For that, I used cdb. There are also other database systems out there like the Berkeley database, and some No-SQL solutions such as couchdb and mongodb. I posed a somewhat interesting question here on stackoverflow on the use of No-SQL solutions a little while back which you may find interesting as well.
This is really just a sampling of different considerations you may want to make when you are choosing how you want to store your data. Think about questions like: How frequently will things be queried? or updated? What will your queries look like? What kinds of applications do you need to access your information from? etc.
I've been using mysql (with innodb; on Amazon rds) because it's sort of universal default, but it's been ridiculously under-performing, and tweaking it only delays the inevitable.
The data is mostly relatively short (<1kB of bytes each) blobs information about 100Ms of urls. There is (or should be, mysql cannot seem to handle it) very high amount of insert / update / retrieve but few complex queries - not that complex queries wouldn't be useful, but because mysql is so slow that it's far faster to get the data out, process it locally, and cache the results somewhere.
I can keep tweaking mysql and throwing more hardware at it, but it seems increasingly futile.
So what are the options? SQL/relational model/etc. optional - anything will do as long as it's fast, networked, and language-independent.
Have you done any sort of end-to-end profiling of your application and MySQL database? To provide better advice it would also be good to understand what improvements you have tried to implement, and your database structure. You haven't given a lot of information on how your MySQL database is configured either. It provides a lot of options for tuning.
You should pick up a copy of High Performance MySQL if you haven't already to learn more about the product.
There is no point in doing anything until you know what your problem is. NoSQL solutions can offer performance benefits but you have provided little evidence that MySQL is incapable of servicing your needs.
Well "Fast, networked and language-independent" + "few complex queries" brings to mind the various NoSQL solutions. To name a few:
MongoDB
CouchDB
Cassandra
And if that's not fast enough, there are always the wicked fast Redis which is my personal favorite atm. :) It is not a database per se, but it's good enough for most scenarios.
I am sure other people can list more NoSQL databases...
and there is always http://nosql-database.org/ .
Generally speaking, databases in this category is better and faster in your scenario because they have relaxed constraints and thus is easier and faster to insert/update/retrieve frequently. But that requires that you think harder about your data model and it is generally not possible to do SQL-style complex queries directly -- you'll instead write more pre-computed data or use a more denormalized design to account for the lack of complex queries.
But since complex queries is a minor problem in your case, I think NoSQL solutions are ideal for you.
With the data you've given about your application's data and workload, it is almost impossible to determine whether the problem really is MySQL itself or something else. You seem to assume that you can throw any workload to a relational engine and it should handle it. Therefore the suggestions made by other commenters about analyzing the performance more carefully are valid in my opinion. Without more data (transactions / second etc.) any further analysis regarding other suitable engines is also futile.
I'm not sure I agree with the advice to jump ship on traditional databases. It might not be the most efficient tool, but it is the one that is FAR more widely understood and used, and a strongly doubt you have a problem that can't be handled by an efficiently set up relational database.
Obvious answers are Oracle, SQLServer, etc, but it might just be your database structure isn't right. I don't know much about MySQL but I do know it's used in some pretty big projects (eBay being noteworthy).