Benefits of using Apache Kafka vs Beanstalkd - mysql

I am implementing a data aggregation application framework. It should consume create, update, and delete operations on MySQL databases that are sharded across multiple instances.
These aggregations should be stored in a separate database and serve as a read store for, e.g., statistical overviews.
I've come up with two solutions. One uses Kafka consumers to write this data to the aggregation database. The other creates background jobs for Beanstalkd at the application level (whenever data is persisted) and consumes them with workers that do the same job as the Kafka consumers.
I'm interested in your thoughts on the pros and cons of both approaches in terms of performance, scalability, and fault tolerance.
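For concreteness, here is a minimal sketch of what the Kafka-based variant could look like in Python, assuming the change events are already published to a Kafka topic. The topic name, the `upsert_aggregate` helper, and the use of the kafka-python package are assumptions for illustration, not part of the actual setup:

```python
# Minimal sketch of the Kafka consumer approach (assumes kafka-python;
# topic name and upsert_aggregate are hypothetical).
import json

from kafka import KafkaConsumer  # pip install kafka-python


def upsert_aggregate(event):
    """Hypothetical write of one aggregate row to the read-store database."""
    ...


consumer = KafkaConsumer(
    'mysql.orders.changes',                # hypothetical CDC topic
    bootstrap_servers='localhost:9092',
    group_id='aggregators',                # consumer groups give horizontal scaling
    enable_auto_commit=False,              # commit offsets only after a successful write
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    upsert_aggregate(message.value)        # e.g. {"op": "create", "amount": 42.0}
    consumer.commit()                      # at-least-once processing
```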

Related

How to use MySQL efficiently for stock/time-series data?

I use Python and MySQL to ingest data via an API, generate signals, and execute orders. Currently things are functional yet coupled: a single script fetches data, stores it in MySQL, generates signals, and then executes orders. By tightly coupled I don't mean all logic is in the same file; there are separate functions for different tasks. But if the script breaks, everything halts. DB tables are generated on the fly, based on the instruments available after running a filter mechanism: the Python code creates tables with the same schema but different names based on the instrument name.
Now I want to separate the parts:
Data Ingestion (A Must)
Signal Generation
Order Execution
Reporting
I am mainly focusing on the first three. My concern is: if separate processes are running against the same tables, will that cause locking or similar issues? How do I handle it smoothly? And is MySQL good enough for this, or should I move to another DB like Postgres?
We are already using a Digital Ocean instance; MySQL is currently installed on the same instance.
If you intend to ingest/query time series at scale, a conventional RDBMS will fall short at one point or another. Such databases are designed for a use case in which reads are more frequent than writes, and they optimise for that.
There is a whole family of databases designed specifically for working with time-series data. These time-series databases can ingest data at high throughput while running queries on top, and they usually give you lifecycle capabilities so you can decide what to do as data keeps growing.
There are many options available, both open source and proprietary. Of those, I would recommend trying QuestDB, for a few reasons:
It is open source and Apache 2.0 licensed, so you can use it anywhere for anything
It is a single binary (or Docker container) to operate
You query data using SQL (with extensions for time series)
You can insert data using SQL, but you will experience locks with concurrent clients. However, you can also ingest data using the ILP protocol, which is designed for ingestion speed; there are official clients in seven languages, so you don't have to deal with the low-level details (see the sketch after this list)
It is blazingly fast: I have seen over 2 million inserts per second on a single instance, and some users report sustained workloads of over 100,000 events per second
It is well supported on Digital Ocean
There are a lot of public references (and many users who are not public references) in the finance/trading/crypto industries
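As a concrete illustration of the ILP path, here is a minimal sketch using the official QuestDB Python client; the table and column names are hypothetical, and the connection string assumes a local QuestDB instance:

```python
# Minimal sketch of ILP ingestion with the official QuestDB Python client
# (pip install questdb). Table/column names are hypothetical.
from questdb.ingress import Sender, TimestampNanos

with Sender.from_conf('http::addr=localhost:9000;') as sender:
    sender.row(
        'trades',
        symbols={'instrument': 'BTC-USD', 'side': 'buy'},   # indexed tag columns
        columns={'price': 64123.5, 'qty': 0.25},
        at=TimestampNanos.now())
    sender.flush()  # rows are also flushed automatically on close
```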

Using MongoDB and Relational DB together

How do I handle transaction issues in an environment where MongoDB and MySQL databases work together?
I want to use MongoDB for scalability and MySQL for transactions. (Transactions are used in the inventory management system, but product information is stored in a MySQL database.)
There is good news: as of MongoDB version 4.2, multi-document ACID transactions are fully supported:
For situations that require atomicity of reads and writes to multiple documents (in a single or multiple collections), MongoDB supports multi-document transactions.
As a general comment on your question, there is nothing wrong with having more than one data store in your architecture. However, keep in mind that for business operations which require both Mongo and MySQL in a single logical transaction/unit of work, there is probably no way to make this atomic. If you fall into that category, you need to rethink your database design and stick with one database per business operation.
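For the Mongo-only side, a minimal sketch of a multi-document transaction with PyMongo might look like this (transactions require a replica set; the database, collection, and field names are hypothetical):

```python
# Minimal sketch of a MongoDB 4.2+ multi-document transaction via PyMongo.
# Requires a replica set; names below are hypothetical.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/?replicaSet=rs0')
db = client.shop

with client.start_session() as session:
    with session.start_transaction():
        # Both writes commit together, or neither does.
        db.inventory.update_one(
            {'sku': 'ABC-123'}, {'$inc': {'qty': -1}}, session=session)
        db.orders.insert_one(
            {'sku': 'ABC-123', 'qty': 1}, session=session)
```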

How to convert MS SQL tables to DynamoDB tables?

I am new to Amazon DynamoDB, and I have eight (8) MS SQL tables that I want to migrate to DynamoDB.
What process should I use for converting and migrating the database schema and data?
I was facing the same problem a year back when I started migrating an app from SQL to DynamoDB. I am not sure whether there are automated tools, but I can share what we did for the migration:
Check whether your existing data types map directly or need to change in DynamoDB. You can merge some of the tables that require fewer updates into a single item using List and Map types, or use a Set if required.
The most important thing is to check all your existing queries. This is the core information you will need when you design the DynamoDB tables.
Make sure you distribute hash (partition) keys properly.
Use GSIs and LSIs for searching and sorting purposes (project only those attributes that will be needed; this will save money). A sketch of such a table definition follows this list.
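As an illustration of the hash-key and GSI advice above, here is a minimal boto3 sketch of a table with a keys-only GSI; the table, attribute, and index names are hypothetical:

```python
# Minimal sketch of a DynamoDB table with a cost-conscious GSI via boto3.
# Table, attribute, and index names are hypothetical.
import boto3

dynamodb = boto3.client('dynamodb')

dynamodb.create_table(
    TableName='Orders',
    AttributeDefinitions=[
        {'AttributeName': 'CustomerId', 'AttributeType': 'S'},
        {'AttributeName': 'OrderDate', 'AttributeType': 'S'},
        {'AttributeName': 'Status', 'AttributeType': 'S'},
    ],
    KeySchema=[
        {'AttributeName': 'CustomerId', 'KeyType': 'HASH'},   # partition (hash) key
        {'AttributeName': 'OrderDate', 'KeyType': 'RANGE'},   # sort key
    ],
    GlobalSecondaryIndexes=[{
        'IndexName': 'StatusIndex',
        'KeySchema': [
            {'AttributeName': 'Status', 'KeyType': 'HASH'},
            {'AttributeName': 'OrderDate', 'KeyType': 'RANGE'},
        ],
        # Project only the keys to keep index storage (and cost) down.
        'Projection': {'ProjectionType': 'KEYS_ONLY'},
    }],
    BillingMode='PAY_PER_REQUEST',
)
```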
Some points that will save some money:
If your tables are read-heavy, try using some caching mechanism; otherwise, be ready to increase the throughput of the tables.
If your tables are write-heavy, implement a queuing mechanism such as SQS (see the sketch after this list).
Keep checking the status of all your important tables in the Management Console. Different metrics are provided there that will help you manage the throughput of the tables.
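A minimal sketch of the SQS buffering idea, with hypothetical queue and table names:

```python
# Minimal sketch of buffering write-heavy traffic through SQS before DynamoDB
# (queue/table names are hypothetical).
import json

import boto3

sqs = boto3.client('sqs')
table = boto3.resource('dynamodb').Table('Orders')
queue_url = sqs.get_queue_url(QueueName='order-writes')['QueueUrl']

# Producer: enqueue instead of writing to DynamoDB directly.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({'CustomerId': 'c1', 'OrderDate': '2024-01-01'}))

# Worker: drain the queue at a pace the table's throughput can absorb.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
for msg in resp.get('Messages', []):
    table.put_item(Item=json.loads(msg['Body']))
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])
```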
I have written a blog post that covers all the challenges we faced while moving from a relational database to a NoSQL database.

Which strategy to use for designing a log data storage?

We want to design a data store on a relational database to keep request message (HTTP/S, XMPP, etc.) logs. For generating the logs we use a solution based on the Apache Synapse ESB. Since we want to store the logs and read them only for maintenance issues, the read/write ratio will be low (the write volume will be intensive, since the system will receive many messages to be logged). We thought of using Cassandra for its distributed nature and clustering capabilities. However, with Cassandra schemas, search queries with filters are difficult, always requiring secondary indexes.
To keep it short, my question is whether we should try MySQL's clustering solutions or use Cassandra with a schema designed for filtered search queries.
If you wish to do real-time analytics over your semi-structured or unstructured data, you can go with a Cassandra + Hadoop cluster; the Cassandra wiki itself suggests the DataStax Brisk edition for this kind of architecture. It is worth giving it a try.
On the other hand, if you wish to do real-time queries over raw logs for a small set of data, e.g.:

```sql
SELECT useragent FROM raw_log_table WHERE id = 'xxx';
```
then you should do a lot of research on your row key and column key design, because that determines the complexity of the query. Have a look at the case studies from users here: http://www.datastax.com/cassandrausers1
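To make the row-key point concrete, here is a minimal sketch with the Python cassandra-driver, modelling the table around the exact lookup above (keyspace and table names are hypothetical):

```python
# Minimal sketch of a query-driven Cassandra schema using cassandra-driver
# (pip install cassandra-driver). Keyspace/table names are hypothetical.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS logs
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
# Partition by the id you look up with, so the query never needs a
# secondary index or a filtered scan.
session.execute("""
    CREATE TABLE IF NOT EXISTS logs.raw_log_table (
        id text PRIMARY KEY,
        useragent text,
        body text)
""")

row = session.execute(
    "SELECT useragent FROM logs.raw_log_table WHERE id = %s", ('xxx',)).one()
```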
Regards,
Tamil

MySQL Cluster is a NoSQL technology?

Is MySQL Cluster a NoSQL technology, or is it another way to use the relational database?
MySQL Cluster uses MySQL Servers as API nodes to provide SQL access/a relational view of the data. The data itself is stored in the data nodes, which are separate processes. The fastest way to access the data is through the C++ API (the NDB API); in fact, that is how the MySQL Server itself gets to the data.
There are a number of NoSQL access methods for getting to the data (that avoid going through the MySQL Server/relational view), including REST, Java, JPA, LDAP, and most recently the Memcached key-value store API.
It is another way to use the database, spreading it across multiple machines and allowing a simplified multi-master setup. It comes with a bit of a cost in that your indexes cannot exceed the amount of RAM available to hold them. To your application, it looks no different from regular MySQL.
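A minimal sketch of that last point, assuming the mysql-connector-python package and hypothetical host/credentials: the application connects to an SQL (API) node like any MySQL server, and only the `ENGINE=NDBCLUSTER` clause places the data on the cluster's data nodes.

```python
# Minimal sketch: MySQL Cluster looks like regular MySQL to the application.
# Assumes mysql-connector-python; host/credentials are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host='sql-node-1', user='app', password='secret')
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS demo")
cur.execute("""
    CREATE TABLE IF NOT EXISTS demo.kv (
        k VARCHAR(64) PRIMARY KEY,
        v VARCHAR(255)
    ) ENGINE=NDBCLUSTER  -- stored on the data nodes, not on the SQL node
""")
cur.execute("REPLACE INTO demo.kv (k, v) VALUES (%s, %s)", ('greeting', 'hi'))
conn.commit()
```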
Perhaps take a look at "Can MySQL Cluster handle a terabyte database".