Percona NoSQL vs other NoSQL - MySQL

I am evaluating NoSQL stores for storing key/value pairs (for a part of the application), and came across Percona, which offers native key/value access within the MySQL world. It seems like a good solution as it allows the storage to remain in a single place (since the rest of the functionality exists in MySQL and would continue as-is). Are there any other advantages over key/value stores such as Cassandra? What are the disadvantages?

You're referring to the HandlerSocket interface, which bypasses the SQL query layer and allows you to fetch and store rows in a single InnoDB table by primary key. The idea is that avoiding the overhead of SQL allows applications to run a much higher rate of QPS.
HandlerSocket shows promise, but so far what we've found (I work for Percona) is that the bottleneck is the hastily-written client interfaces. That is, the client API for PHP, Ruby, etc. in their current state of implementation have such overhead that HandlerSocket is no faster than writing simple SQL statements for INSERT and SELECT. InnoDB is optimized for primary key access already, since the tables are really stored as clustered indexes by primary key.
Future development on writing optimized code for the HandlerSocket client libraries should improve this over time. If you want to help this process along, get involved in the open-source projects to develop those client libraries.
Another drawback of HandlerSocket is that, AFAIK, it doesn't support in-place incrementing of values, which is an optimization some other key/value stores offer. With HandlerSocket, you'd have to fetch the value, increment it in the application, then write it back to the database. This introduces a race condition, so you'd have to lock the row somehow.
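To make the contrast concrete, here is a minimal sketch (not from the original answer) using Python's mysql-connector; the kv table, column names, and connection details are made-up placeholders. A plain SQL UPDATE increments in place atomically, whereas a key/value-style client has to read, modify, and write back, and needs a row lock to stay correct:

```python
# Sketch: atomic SQL increment vs. the read-modify-write pattern a key/value
# interface like HandlerSocket forces on you. Table kv(k VARCHAR PRIMARY KEY, v INT)
# and the connection details are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="app", password="secret", database="test")
cur = conn.cursor()

# Plain SQL: the increment happens in-place inside the storage engine, no race.
cur.execute("UPDATE kv SET v = v + 1 WHERE k = %s", ("page_views",))
conn.commit()

# Key/value style: fetch, increment in the application, write back.
# Without the FOR UPDATE row lock, two concurrent clients could both read the
# same value and one increment would be lost.
cur.execute("SELECT v FROM kv WHERE k = %s FOR UPDATE", ("page_views",))
(v,) = cur.fetchone()
cur.execute("UPDATE kv SET v = %s WHERE k = %s", (v + 1, "page_views"))
conn.commit()
```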

Related

How to use MySQL efficiently for stock/time-series data?

I use Python and MySQL to ingest data via an API, generate signals, and execute orders. Currently, things are functional yet tightly coupled: a single script fetches data, stores it in MySQL, generates signals, and then executes orders. By tightly coupled I don't mean all the logic is in the same file; there are separate functions for different tasks, but if the script somehow breaks, everything halts. The DB tables are generated on the fly based on the instruments available after running a filter mechanism: the Python code creates tables with the same schema but different names based on the instrument name.
Now I want to separate the parts:
Data Ingestion (A Must)
Signal Generation
Order Execution
Reporting
I am mainly focusing on the first three. My concern is that if separate processes are running against the same tables, will they generate locks or other contention? How do I handle this smoothly? Or is MySQL good enough for this, or should I move to some other DB like Postgres?
We are already using a DigitalOcean instance; MySQL is currently installed on the same instance.
If you intend to ingest/query time-series at scale, a conventional RDBMS will fall short at one point or another. They are designed for a use case in which reads are more frequent than writes, and optimise for that.
There is a whole family of databases designed specifically for working with Time-Series data. These time-series databases can ingest data at high throughput while running queries on top, and they usually give you lifecycle capabilities so you can decide what to do when data keeps growing.
There are many options available, both open source and proprietary. Of those databases, I would recommend you try QuestDB for a few reasons:
It is open source and Apache 2.0 licensed, so you can use it anywhere for anything
It is a single binary (or docker container) to operate
You query data using SQL (with extensions for time series)
You can insert data using SQL, but you will experience locks if using concurrent clients. However, you can also ingest data using the ILP protocol, which is designed for ingestion speed (see the sketch after this list). There are official clients in 7 languages, so you don't have to deal with the low-level details
It is blazingly fast. I have seen over 2 million inserts per second on a single instance and some users report sustained workloads of over 100,000 events per second
It is well supported on Digital Ocean
There are a lot of public references (and many users who are not a reference) in the finance/trading/crypto industries
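For illustration, here is a minimal sketch of what ILP ingestion looks like at the wire level; the table and column names are invented, and in practice you would use one of the official clients rather than writing raw lines over a socket yourself (9009 is QuestDB's default ILP TCP port):

```python
# Sketch only: sending InfluxDB Line Protocol (ILP) rows to QuestDB over TCP.
# The official QuestDB clients wrap exactly this, with buffering and error
# handling; table and column names here are made up.
import socket
import time

rows = [
    ("ticks", "AAPL", 189.34, 120),
    ("ticks", "MSFT", 412.10, 80),
]

with socket.create_connection(("localhost", 9009)) as sock:
    for table, symbol, price, qty in rows:
        ts_ns = time.time_ns()  # designated timestamp, in nanoseconds
        # format: table,<symbol columns> <other columns> <timestamp>\n
        line = f"{table},symbol={symbol} price={price},qty={qty}i {ts_ns}\n"
        sock.sendall(line.encode("utf-8"))
```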

PostgreSQL vs MySQL when dealing with GeoDjango in Django

There are multiple tutorials/questions across the Internet/YouTube/Stack Overflow for finding nearby businesses given a location, for example (question on Stack Overflow):
Returning nearby locations in Django
But one thing common to them all is that they all prefer PostgreSQL (instead of MySQL) for Django's GeoDjango library
I am building a project as:
A user can register as a customer as well as a business (the customer's and the business's name/address and all other fields will be separate, even if it's the same user)
This is not exactly how the database is; it's only a rough idea of the project
Both customer and business will have their locations stored
Customers can find nearby businesses around them
I was wondering what the specific advantages of using PostgreSQL over MySQL are in the context of computing and fetching the location-related fields.
(MySQL has been a well-tested database for years and most of my data is relational, so I was planning to use MySQL or Microsoft SQL Server)
Would there be any processing disadvantages, in the context of the algorithms used to compute nearby businesses, if I choose to go with MySQL? How would it make my system slow?
But one thing common to them all is that they all prefer PostgreSQL (instead of MySQL) for Django's GeoDjango library
The reason why they suggest using Postgres is that it has better support for spatial data. It's not that MySQL doesn't support spatial data. However, there is a long list of features which Postgres supports and MySQL doesn't. You can look at this page for details. Almost every time MySQL is mentioned on that page, it is to describe a feature that it does not support, but that Postgres does.
(MySQL has been a well-tested database for years and most of my data is relational, so I was planning to use MySQL or Microsoft SQL Server)
Note that foreign key constraints are not compatible with MyISAM, which historically was the only MySQL storage engine supporting spatial indexes (InnoDB added spatial index support in MySQL 5.7). So on older MySQL versions, you need to choose between referential integrity and fast spatial lookups.
If you use Postgres, you can have both referential integrity and fast spatial lookups. Postgres is also a quite mature and widely used relational database these days.
Would there be any processing disadvantages, in the context of the algorithms used to compute nearby businesses, if I choose to go with MySQL? How would it make my system slow?
It really depends on how many businesses you're searching for. If you pick an engine that does not support spatial indexes, MySQL is forced to do a full table scan, which takes O(N) time. On the other hand, it can do bounding box comparisons to eliminate many geometries quite quickly. I have seen acceptable interactive performance for 100k points, with performance dropping off after that. In contrast, Postgres with a spatial index is fast for any number of points.
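To make that concrete, here is a rough sketch of what the nearby-business lookup looks like with GeoDjango on top of PostGIS; the model, field names, and 5 km radius are invented for illustration:

```python
# Rough sketch of a GeoDjango model and a "nearby businesses" query backed by
# PostGIS. Model and field names are made up.
from django.contrib.gis.db import models
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D


class Business(models.Model):
    name = models.CharField(max_length=200)
    location = models.PointField(geography=True, spatial_index=True)


def businesses_near(lng, lat, km=5):
    user_location = Point(lng, lat, srid=4326)
    # With PostGIS this filter can use the spatial (GiST) index instead of a full scan.
    return Business.objects.filter(location__distance_lte=(user_location, D(km=km)))
```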

Hibernate Envers performance with MySQL

Right now I'm trying to choose the most appropriate approach to implement an audit trail for my entities with an AWS RDS MySQL database.
I have to log all entity changes, including the initiator (user) who made the changes. One of the main criteria is performance.
Hibernate Envers looks like the easiest and most complete solution and can be integrated very quickly. Right now I'm worried about the possible performance slowdown after introducing Envers. I saw a few posts where developers prefer an audit-trail approach based on database triggers.
The main issue with triggers is how to get the initiator (user) who made the changes.
Based on your experience, could you please suggest an approach for Java/Spring/Hibernate/MySQL (AWS) to implement an audit trail of historical changes?
Also, is there any solution for an audit trail within the AWS RDS MySQL database infrastructure?
Understand that speculation about performance without concrete evidence to support one's theory is analogous to premature optimization of code. It's almost always a waste of time.
From a simple database point of view, as a table grows past a certain size, yes, its performance will degrade, but typically this mainly impacts queries and less so insertions/updates, provided the table is properly indexed and queries are properly formed.
But many databases support partitioning as a means to control performance concerns, particularly on larger tables. This typically involves separating a table's data across a set of boundaries defined by a partition scheme you create. You simply define which data is most relevant and store that partition on your fastest drives/storage, while the less relevant (typically older) data is stored on your slower drives/storage.
You can also elect to store database tables in differing schemas/tablespaces by specifying the Envers property org.hibernate.envers.default_schema. If your database supports putting schemas in different database files on the file system, you can help increase performance by keeping reads/writes of your entity tables from impacting the reads/writes of your audit tables.
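As a small illustration, a hypothetical configuration routing the audit tables to a separate "audit" schema might look like this (the schema name is made up; in a Spring Boot application, Hibernate properties are typically passed through the spring.jpa.properties.* prefix):

```properties
# Hypothetical: keep Envers audit tables in their own schema/tablespace
org.hibernate.envers.default_schema=audit
# Spring Boot equivalent:
# spring.jpa.properties.org.hibernate.envers.default_schema=audit
```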
I can't speak to MySQL's support for any of these things, but I do know that MSSQL/Oracle supports partitioning very easily and Oracle for sure allows the separation of schemas across differing database files.

Seeking clarification about MySQL 5.6 memcached integration

I'm having trouble getting a clear understanding of what MySQL 5.6 is introducing with respect to memcached.
As I understand it, memcache by itself is essentially a huge, shared, memory-resident hash table that is managed by a server, memcached. In particular, it knows nothing about a persistent data store, and offers no services in that regard. It simply knows about keys and values (like a Perl hash).
What I think MySQL 5.6 introduces is a NoSQL API, whereby MySQL clients can request data from the MySQL server by key, rather than by a SELECT statement. (And similarly, they can perform updates with key=value pairs.) MySQL uses memcached to cache these in memory as a performance boost, but also takes care of things like writing updates back to the database before they age out of the cache, etc.
In other words, the use of memcached is an implementation detail of the MySQL 5.6 NoSQL feature, and is not something the application programmer needs to be aware of.
I'd welcome any corrections or amplification to my understanding.
Thanks,
Chap
I think the official documentation makes it quite simple.
I disagree with your last sentence: the application programmer has to be very aware of the memcached plugin, because having it on board the MySQL server means he can decide (or may be forced) to access data either through a memcached language interface or via the SQL interface.
To better understand the impact of this plugin on an app design, you should know that there are 3 configuration tables used by MySQL for proper memcached management; understanding how "cache_policies" works will shed some light on some of your doubts:
Table cache_policies specifies whether to use InnoDB as the data store of memcached (innodb_only), or to use the traditional memcached engine as the backstore (cache-only), or both (caching). In the last case, if memcached cannot find a key in memory, it searches for the value in an InnoDB table.
Here is the link: innodb-memcached-internals
The quote above means that, depending on what you decide for a specific key-value, you will have different application scenarios:
innodb_only -> means that you can query the data via the SQL interface or via a memcached interface; here is a link to some memcached language interface examples: memcached-interfaces
cache-only -> means that you should query the data via the memcached interface only
caching -> means that you can use both interfaces (note that the storage mechanism changes slightly)
Of course, this latter configuration decision is strictly related to your specific needs.
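As a small, hypothetical illustration of the memcached-interface side mentioned above (the key, value, and mapped table are made up; 11211 is the plugin's default port), any standard memcached client works, for example python-memcached:

```python
# Sketch: once the InnoDB memcached plugin is enabled, any memcached client can
# read/write rows in the mapped InnoDB table. Uses the python-memcached package;
# key and value are purely illustrative.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

# set/get go straight to InnoDB (for the innodb_only / caching policies),
# bypassing the SQL parser and optimizer.
mc.set("user:42", "alice")
print(mc.get("user:42"))
```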
I don't really have a complete answer for you I'm afraid, as I too am struggling to find the detail I require before toying around with it.
That said, however, there is one important point I have managed to uncover that you seem to have missed, namely that by accessing the InnoDB storage engine via the new plugin you are actually completely bypassing SQL and avoiding all the overhead that comes with it.
This of course makes it essentially a key/value store, more akin to most NoSQL databases, complete with all the drawbacks associated with them, i.e. no joins etc.
However, on the flip side, for many applications these days this is exactly what we want. There have been only a handful of real-world performance mentions that I have come across, but all seem to point to this implementation significantly outperforming MongoDB and other similar NoSQL solutions (how much truth there is in it I do not know), with even one (relatively in-depth) comparison claiming as high as 700k QPS on a commodity server (compared with around 100k on a well-tuned MySQL setup), which is incredible if true.
Resource here:
http://yoshinorimatsunobu.blogspot.co.uk/search/label/handlersocket
Anyway, sorry I can't be any more help, but it's food for thought at least!

What's a good choice for handling real-time data in memory?

Clients send some real-time data to the server. The server does simple analysis on these data: it only finds data within a specific range, or sorts some of the data. Most of the data will be discarded after the analysis, so there's no need to save it to disk.
I want to use some in-memory DB to handle it. Is the MEMORY engine of MySQL a good choice? What if I use a key-value memory cache engine such as Redis? Because I need to compare the data, maybe a pure key-value store can't meet my requirements.
To me that sounds as if you would be better off without a database, but that depends on the structure of your data and what kind of operations you have to perform.
If the structure is simple and the operations easy then you should probably store the data in data structures of the programming platform you are using.
What if I use a key-value memory cache engine such as Redis?
Redis supports advanced data structures, which makes it a pretty handy key-value based data store; however, if your data requires complex relationships then you should probably check out MongoDB, OrientDB or Riak, which should all support memory-based storage engines.
If you plan to use the memory engine of MySQL, there are a few gotchas:
by default, indexes are implemented using hash tables rather than btrees. If you need to sort the data or need range support, using btrees may be preferable (see the sketch below).
locking granularity is the table. There is a R/W lock to protect against concurrent DML operations. While raw performance is not bad, scalability is not very good when you have many writers at the same time.
all rows have a fixed width (beware if you need to store varchars ...)
Furthermore, like most other RDBMSs, the MySQL protocol is synchronous: each time a client writes to the database, it waits for a reply. If you have a lot of data, batching write operations is almost mandatory to get good performance.
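Putting two of those points together, here is a rough sketch (table, columns, and connection details are invented) of a MEMORY table with an explicit BTREE index plus batched inserts via executemany():

```python
# Sketch: force a BTREE index on a MEMORY table (the default index type is HASH)
# and batch inserts instead of paying one round trip per row.
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="app", password="secret", database="rt")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        ts  BIGINT NOT NULL,
        val DOUBLE NOT NULL,
        INDEX idx_ts (ts) USING BTREE  -- BTREE so range scans / sorting on ts work well
    ) ENGINE=MEMORY
""")

batch = [(1700000000 + i, float(i)) for i in range(1000)]
cur.executemany("INSERT INTO readings (ts, val) VALUES (%s, %s)", batch)
conn.commit()
```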
It really depends on the volume, number of clients, and throughput. If the requirements are low, then any storage solution (including MySQL) will work fine. Now if more performance or more scalability are required, then other solutions will likely be better.
What you want to write is probably a DIRT (data-intensive real-time) application. Good storage solutions for this are MongoDB (upsert support, one-way protocol for write operations, etc.) and Redis (in-memory, O(1) operations, pipelining, etc.).
Depending on your needs, data modeling and processing will be arguably easier with MongoDB due to btree indexes and map/reduce support. It will probably be a bit more complex with Redis, but if you choose the correct data structure, you will end up with more deterministic performance.
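For instance, a Redis sorted set covers "find data within a specific range, or sort some data" with O(log N) inserts and ordered range reads; this is a hypothetical sketch with made-up key and member names, using the redis-py client:

```python
# Sketch: a Redis sorted set keeps readings ordered by score, so range queries
# and sorting come for free. Requires the redis-py package.
import redis

r = redis.Redis(host="localhost", port=6379)

# score = measurement value (or a timestamp), member = an id for the reading
r.zadd("readings", {"sensor1:evt1": 12.5, "sensor1:evt2": 42.0, "sensor1:evt3": 7.3})

# all readings with a value between 10 and 50, already sorted by score
in_range = r.zrangebyscore("readings", 10, 50)
print(in_range)
```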
Finally, you might also want to avoid storing the data at all by processing it on the fly. You can achieve this with a streaming engine such as the ones used on high-speed trading platforms. For instance, if you are ready to code in Java, Esper is an excellent CEP (complex event processing) solution to process data streams and/or establish correlations between streams using a SQL-like language.