Logback batch writing to a database

We have a requirement to record (in a database) each call to one of our methods. However, because of the very large number of times the method is called, we want to "batch" the inserts, e.g. only insert every x seconds, or only insert once you have, say, 100 entries, using JDBC batch insert (statement.executeBatch()).
I understand I can use an async DB appender, and that this uses an AsyncAppender backed by a BlockingQueue; however, this seems to be more about blocking the producing thread than about controlling how often the consuming thread performs an action.
Could someone please advise whether my requirement is possible with Logback?
Many thanks
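To make the requirement concrete, this is roughly the behaviour I am after, sketched as a hand-rolled appender (the table name, the DataSource wiring and the thresholds are invented; only AppenderBase and ILoggingEvent are actual Logback API). I am asking whether Logback can give me this flush-by-size-or-timer behaviour out of the box:

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.AppenderBase;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchingDbAppender extends AppenderBase<ILoggingEvent> {

    private final DataSource dataSource;          // hypothetical: wired up elsewhere
    private final int batchSize = 100;            // flush when this many events are buffered
    private final long flushIntervalSeconds = 5;  // ...or at least this often
    private final List<ILoggingEvent> buffer = new ArrayList<>();
    private ScheduledExecutorService scheduler;

    public BatchingDbAppender(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public void start() {
        super.start();
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::flush, flushIntervalSeconds,
                flushIntervalSeconds, TimeUnit.SECONDS);
    }

    @Override
    protected void append(ILoggingEvent event) {
        List<ILoggingEvent> toFlush = null;
        synchronized (buffer) {
            buffer.add(event);
            if (buffer.size() >= batchSize) {
                toFlush = new ArrayList<>(buffer);
                buffer.clear();
            }
        }
        if (toFlush != null) {
            write(toFlush);
        }
    }

    private void flush() {
        List<ILoggingEvent> toFlush;
        synchronized (buffer) {
            if (buffer.isEmpty()) return;
            toFlush = new ArrayList<>(buffer);
            buffer.clear();
        }
        write(toFlush);
    }

    private void write(List<ILoggingEvent> events) {
        // "method_calls" is an invented table name
        String sql = "INSERT INTO method_calls (logger, message, ts) VALUES (?, ?, ?)";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            for (ILoggingEvent e : events) {
                ps.setString(1, e.getLoggerName());
                ps.setString(2, e.getFormattedMessage());
                ps.setLong(3, e.getTimeStamp());
                ps.addBatch();
            }
            ps.executeBatch();
        } catch (Exception ex) {
            addError("Batch insert failed", ex);
        }
    }

    @Override
    public void stop() {
        if (scheduler != null) scheduler.shutdown();
        flush();
        super.stop();
    }
}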

Related

Consistent read/write on Aurora Serverless MySQL

We have a distributed serverless application based on AWS Aurora Serverless MySQL 5.6 and multiple Lambda functions. Some of the Lambdas act as writing threads, others as reading threads. To pin down the most important details, let's suppose there is only one table with the following structure:
id: bigint primary key autoincrement
key1: varchar(700)
key2: bigint
content: blob
unique(key1, key2)
Writing threads perform INSERTs in the following manner: every writing thread generates one entry with key1+key2+content, where the key1+key2 pair is unique and id is generated automatically by autoincrement. Some writing threads can fail with a DUPLICATE KEY ERROR if key1+key2 has a repeating value, but that does not matter and is okay.
There are also some reading threads, which poll the table and try to process newly inserted entries. The goal of a reading thread is to retrieve all new entries and process them in some way. The number of reading threads is uncontrolled; they do not communicate with each other and do not write anything to the table above, but they can write some state to a custom table.
At first it seems that polling is very simple: it's enough for the reading process to store the last id that has been processed and continue polling from it, e.g. SELECT * FROM table WHERE id > ${lastId}. The approach above works well under a small load, but does not work under a high load for an obvious reason: some of the entries being inserted have not yet appeared in the database, because the cluster has not been synchronized at that point.
Let's see what happens from the cluster's point of view, even if it consists of only two servers, A and B.
1) Server A accepts a write transaction with an entry insertion and acquires autoincrement number 100500.
2) Server B accepts a write transaction with an entry insertion and acquires autoincrement number 100501.
3) Server B commits its write transaction.
4) Server B accepts a read transaction and returns entries with id > 100499, which is only entry 100501.
5) Server A commits its write transaction.
6) The reading thread receives only entry 100501 and moves the lastId cursor to 100501. Entry 100500 is lost to the current reading thread forever.
QUESTION: Is there a way to solve the problem above WITHOUT hard-locking tables across the whole cluster, in some lock-free way or something similar?
The issue here is that the local state in each lambda (thread) does not reflect the global state of said table.
As a first call, I would try to always consult the table for the latest ID before reading the entry with that ID.
Have a look at the built-in function LAST_INSERT_ID() in MySQL.
The caveat:
[...] the most recently generated ID is maintained in the server on a per-connection basis
Your Lambda could be creating connections prior to the handler function/method, which would make them longer lived (it's a known trick, but it's not bombproof here), but I think a new, simultaneously executing Lambda function would be given a new connection, in which case the above solution falls apart.
Luckily, what you have to do then is wrap all WRITES and all READS in transactions, so that additional coordination takes place when reading and writing to the same table simultaneously.
In your quest you might come across transaction isolation levels; SERIALIZABLE would be the safest and least performant, but apparently AWS Aurora does not support it (I have not verified that statement).
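In JDBC terms, the reading side of that would look something like the sketch below (the table and column names follow the question, the table name "entries" is invented, and REPEATABLE READ is only a placeholder for whatever isolation level your Aurora version actually supports; whether this alone is enough on Aurora is exactly the open question above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class Poller {

    // One polling pass: read everything past lastId inside an explicit transaction.
    static long pollOnce(Connection con, long lastId) throws SQLException {
        con.setAutoCommit(false);
        con.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, key1, key2, content FROM entries WHERE id > ? ORDER BY id")) {
            ps.setLong(1, lastId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process(rs) -- application-specific work goes here
                    lastId = rs.getLong("id");
                }
            }
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
        return lastId;
    }
}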
HTH

Millions of Data to be inserted in MySQL database using Spring data JPA

Our application is based on Java 8, Spring Data JPA and MySQL. We have two different data sources in the application; our task is to fetch millions of records (text stored in a table) from one data source and insert them into the other data source after some small computation.
When I tried iterating through each record and inserting it into the other database, it took longer than expected.
Is there a standard, fast way of doing this? Do I need to use a stored procedure? If yes, how would I pass the list of entities to the procedure?
Don't use JPA. JPA's main use case is loading a non-trivial domain model, manipulating it, and then flushing it to the database with automatic detection of what changed. You don't seem to need that in your use case.
Use JDBC and batch inserts. Spring's JdbcTemplate will come in handy.
Select a batch, manipulate it as desired, insert it into the target.
For tuning the select process, consider value-based pagination.
For writing, consider removing constraints and indexes and recreating them after the process.
There might be more MySQL specific options available, but I don't know about those.
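Roughly, the select/insert loop could look like this (the table and column names are invented, and the "small computation" is just a placeholder):

import org.springframework.jdbc.core.JdbcTemplate;

import java.util.List;
import java.util.Map;

public class CopyJob {

    private final JdbcTemplate source;   // template bound to the source DataSource
    private final JdbcTemplate target;   // template bound to the target DataSource
    private static final int BATCH_SIZE = 1000;

    public CopyJob(JdbcTemplate source, JdbcTemplate target) {
        this.source = source;
        this.target = target;
    }

    public void copyAll() {
        long lastId = 0;
        while (true) {
            // value-based pagination: no OFFSET, so every page is cheap to fetch
            List<Map<String, Object>> page = source.queryForList(
                    "SELECT id, text FROM source_table WHERE id > ? ORDER BY id LIMIT ?",
                    lastId, BATCH_SIZE);
            if (page.isEmpty()) break;

            target.batchUpdate(
                    "INSERT INTO target_table (id, text) VALUES (?, ?)",
                    page,
                    BATCH_SIZE,
                    (ps, row) -> {
                        ps.setLong(1, ((Number) row.get("id")).longValue());
                        ps.setString(2, transform((String) row.get("text")));
                    });

            lastId = ((Number) page.get(page.size() - 1).get("id")).longValue();
        }
    }

    private String transform(String text) {
        return text; // the "small computation" goes here
    }
}

With MySQL Connector/J, adding rewriteBatchedStatements=true to the target JDBC URL usually makes the driver rewrite such batches into multi-row INSERTs, which helps a lot.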
You might want to split your work across three thread pools: one for reading, one for writing, and one for processing the data (roughly sketched below).
I'm not sure, but Spring Batch might help with that.
Load/save entries in batches (100 or 1000 entries in one go).
Load and/or save asynchronously.
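Very roughly, the three-pool/pipeline idea could look like this; the readInBatches/transform/batchInsert methods are placeholders for the actual JDBC work (e.g. the JdbcTemplate sketch above):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CopyPipeline {

    private static final List<String> END = new ArrayList<>();   // poison pill

    public static void main(String[] args) {
        BlockingQueue<List<String>> raw       = new ArrayBlockingQueue<>(10);
        BlockingQueue<List<String>> processed = new ArrayBlockingQueue<>(10);

        ExecutorService readerPool    = Executors.newSingleThreadExecutor();
        ExecutorService processorPool = Executors.newSingleThreadExecutor();
        ExecutorService writerPool    = Executors.newSingleThreadExecutor();

        readerPool.submit(() -> {                  // reading pool
            for (List<String> batch : readInBatches()) {
                raw.put(batch);
            }
            raw.put(END);
            return null;
        });

        processorPool.submit(() -> {               // processing pool
            for (List<String> batch = raw.take(); batch != END; batch = raw.take()) {
                processed.put(transform(batch));
            }
            processed.put(END);
            return null;
        });

        writerPool.submit(() -> {                  // writing pool
            for (List<String> batch = processed.take(); batch != END; batch = processed.take()) {
                batchInsert(batch);
            }
            return null;
        });

        readerPool.shutdown();
        processorPool.shutdown();
        writerPool.shutdown();
    }

    // Placeholders for the actual JDBC work.
    static Iterable<List<String>> readInBatches() { return List.of(); }
    static List<String> transform(List<String> batch) { return batch; }
    static void batchInsert(List<String> batch) { }
}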

MySQL, LOAD DATA or BATCH INSERT or any other better way for bulk inserts

I am trying to create a web application whose primary objective is to insert request data into a database.
Here is my problem: one request itself contains 10,000 to 100,000 data sets of information
(each data set needs to be inserted separately as a row in the database).
I may get multiple requests to this application concurrently, so it's necessary for me to make the inserts fast.
I am using a MySQL database. Which approach is better for me, LOAD DATA or batch INSERT, or is there a better way than these two?
How will your application retrieve this information?
- There will be another background, thread-based Java application that will select records from this table, process them one by one, and delete them.
Can you queue your requests (batches) so your system will handle them one batch at a time?
- For now we are thinking of inserting into the database straight away, but yes, if this approach is not feasible enough, we may think of queuing the data.
Do retrievals of information need to be concurrent with insertion of new data?
- Yes, we are keeping it concurrent.
Here are the answers to your questions, Ollie Jones.
Thank you!
Ken White's comment mentioned a couple of useful SO questions and answers for handling bulk insertion. For the record volume you are handling, you will enjoy the best success by using MyISAM tables and LOAD DATA INFILE data loading, from source files in the same file system that's used by your MySQL server.
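If the loading side is itself a Java service, the LOAD DATA statement can be issued straight through JDBC; a sketch, with the file path, table name and connection details obviously invented:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://dbhost:3306/mydb", "user", "password");
             Statement st = con.createStatement()) {
            // The file lives on the MySQL server's own file system (per the advice above);
            // the account needs the FILE privilege and secure_file_priv must allow the path.
            st.execute(
                "LOAD DATA INFILE '/var/lib/mysql-files/request_batch.csv' " +
                "INTO TABLE request_data " +
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
        }
    }
}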
What you're doing here is a kind of queuing operation. You receive these batches (you call them "requests") of records (you call them "data sets"). You put them into a big bucket (your MySQL table). Then you take them out of the bucket one at a time.
You haven't described your problem completely, so it's possible my advice is wrong.
Is each record ("data set") independent of all the others?
Does the order in which the records are processed matter? Or would you obtain the same results if you processed them in a random order? In other words, do you have to maintain an order on the individual records?
What happens if you receive two million-row batches ("requests") at approximately the same time? Assuming you can load ten thousand records a second (that's fast!) into your MySQL table, this means it will take 200 seconds to load both batches completely. Will you try to load one batch completely before beginning to load the second?
Is it OK to start processing and deleting the rows in these batches before the batches are completely loaded?
Is it OK for a record to sit in your system for 200 or more seconds before it is processed? How long can a record sit? (this is called "latency").
Given the volume of data you're mentioning here, if you're going into production with living data you may want to consider using a queuing system like ActiveMQ rather than a DBMS.
It may also make sense simply to build a multi-threaded Java app to load your batches of records, deposit them into a Queue object in RAM (a ConcurrentLinkedQueue instance may be suitable) and process them one by one. This approach will give you much more control over the performance of your system than you will have by using a MySQL table as a queue.
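A bare-bones sketch of that last idea (the incoming-request source and the per-record processing are placeholders; in a real service the worker would run for the lifetime of the application):

import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class InMemoryQueueSketch {

    // Each element is one "data set" (one row of an incoming request).
    private static final Queue<String> queue = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) {
        // Loader threads: one task per incoming request/batch.
        ExecutorService loaders = Executors.newFixedThreadPool(4);
        for (List<String> request : incomingRequests()) {
            loaders.submit(() -> { queue.addAll(request); });
        }

        // Single worker draining the queue one record at a time.
        Thread worker = new Thread(() -> {
            while (true) {
                String record = queue.poll();
                if (record == null) {
                    try { Thread.sleep(50); } catch (InterruptedException e) { return; }
                    continue;                 // nothing to do yet
                }
                process(record);              // application-specific work, then discard
            }
        });
        worker.setDaemon(true);
        worker.start();

        loaders.shutdown();
    }

    // Placeholders: where the web requests and the per-record processing plug in.
    static Iterable<List<String>> incomingRequests() { return List.of(); }
    static void process(String record) { }
}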

Async bulk (batch) insert to MySQL (or MongoDB?) via Node.js

Straight to the question:
The problem: to do async bulk inserts (not necessarily bulk, if MySQL can handle it) using Node.js (coming from a .NET and PHP background).
Example:
Assume I have 40 (adjustable) functions doing some work (async), each adding a record to the table after its single iteration. Now it is very probable that more than one function makes an insertion call at the same time. Can MySQL handle it that way directly, considering there is going to be an auto-update field?
In C# (.NET) I would have used a DataTable to collect all the rows from each function and, at the end, bulk-inserted the DataTable into the database table, launching many threads for the functions.
What approach would you suggest in this case?
Should the approach change in case I need to handle 10,000 or 4 million rows per table?
Also, the DB schema is not going to change; would MongoDB be a better choice for this?
I am new to Node and NoSQL and in the learning phase at the moment, so if you can provide some explanation with your answer, it would be awesome.
Thanks.
EDIT:
Answer: Neither MySQL nor MongoDB supports any sort of bulk insert; under the hood it is just a foreach loop.
Both of them are capable of handling a large number of connections simultaneously; the performance will largely depend on your requirements and production environment.
1) In MySQL, queries are executed sequentially per connection. If you are using one connection, your ~40 functions will result in 40 queries being enqueued (via an explicit queue in the mysql library, your own code, or a system queue based on synchronisation primitives), not necessarily in the same order you started the 40 functions. MySQL won't have any race condition problems with auto-update fields in that case.
2) If you really want to execute 40 queries in parallel, you need to open 40 connections to MySQL (which is not a good idea from a performance point of view, but again, MySQL is designed to handle auto-increments correctly for multiple clients).
3) There is no special bulk insert command in the MySQL protocol at the wire level; any library exposing a bulk insert API is in fact just issuing a long 'INSERT ... VALUES' query.
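To make point 3 concrete: a "bulk insert" amounts to one long multi-row statement. Illustrated here with JDBC rather than a Node driver, since the point is about the protocol and not the client library (the table name is invented):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class MultiRowInsert {
    // Builds and executes a single "INSERT ... VALUES (?), (?), ..." statement,
    // which is essentially what a bulk-insert API sends over the wire.
    static void insertAll(Connection con, List<String> rows) throws SQLException {
        if (rows.isEmpty()) return;
        StringBuilder sql = new StringBuilder("INSERT INTO records (content) VALUES ");
        for (int i = 0; i < rows.size(); i++) {
            sql.append(i == 0 ? "(?)" : ", (?)");
        }
        try (PreparedStatement ps = con.prepareStatement(sql.toString())) {
            for (int i = 0; i < rows.size(); i++) {
                ps.setString(i + 1, rows.get(i));
            }
            ps.executeUpdate();
        }
    }
}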

MYSQL: Parallel execution of multiple statements within a stored procedure?

I have a procedure (procedureA) that loops through a table and calls another procedure (procedureB) with variables derived from that table.
Each call to procedureB is independent of the last call.
When I run procedureA my system resources show a maximum CPU use of 50% (I assume that is 1 out of my 2 CPU cores).
However, if I open two instances of the mysql terminal and execute a query in both terminals, both CPU cores are used (CPU usage can reach close to 100%).
How can I achieve the same effect inside a stored procedure?
I want to do something like this:
BEGIN
CALL procedureB(var1); -> CPU CORE #1
SET var1 = var1+1;
CALL procedureB(var1); -> CPU CORE #2
END
I know it's not going to be that easy...
Any tips?
Within MySQL, to get something done asynchronously you'd have to use CREATE EVENT, but I'm not sure whether creating one is allowed within a stored procedure. (On a side note: async inserts can of course be done with INSERT DELAYED, but that's 1 thread, period.)
Normally, you are much better off having a couple of processes/workers/daemons which can be accessed asynchronously by your program and have their own database connections, but that of course won't be in the same procedure.
You can write your own daemon as a stored procedure, and schedule multiple copies of it to run at regular intervals, say every 5 minutes, 1 minute, 1 second, etc.
Use GET_LOCK() with N well-defined lock names to abort the event execution if another copy of the event is still running, if you only want up to N parallel copies running at a time.
Use a "job table" to list the jobs to execute, with an ID column to identify the execution order. Be sure to use good transaction and lock practices of course - this is re-entrant programming, after all.
Each row can define a stored procedure to execute and possibly the parameters. You can even have multiple types of jobs, job tables, and worker events for different tasks.
Use PREPARE and EXECUTE with the CALL statement to dynamically call stored procedures whose names are stored in strings.
Then just add rows as needed to the job table, even inserting in big batches, and let your worker events process them as fast as they can.
I've done this before, in both Oracle and MySQL, and it works well. Be sure to handle errors and log them somewhere, and log successes too, for debugging, auditing and performance tuning. N = #CPUs may not be the best fit, depending on your data and the types of jobs; I've seen N = 2x CPUs work best for data-intensive tasks, where lots of parallel disk I/O matters more than computational power.
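For completeness, the same job-table pattern can also be driven by external workers (as mentioned above) instead of events; a rough JDBC sketch, with the job_table layout, column names and lock names all invented. Events running inside MySQL would follow the same steps, just with the CALL issued through PREPARE/EXECUTE as described above.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JobWorker {

    // One pass of a worker: grab a named lock so at most N copies run,
    // pick the next pending job, CALL the stored procedure it names, mark it done.
    static void runOnce(Connection con, int workerId) throws SQLException {
        try (PreparedStatement lock = con.prepareStatement("SELECT GET_LOCK(?, 0)")) {
            lock.setString(1, "job_worker_" + workerId);
            try (ResultSet rs = lock.executeQuery()) {
                if (!rs.next() || rs.getInt(1) != 1) {
                    return;                       // another copy of this worker is running
                }
            }
        }
        try {
            con.setAutoCommit(false);
            long jobId;
            String proc;
            int param;
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT id, proc_name, param FROM job_table " +
                    "WHERE status = 'PENDING' ORDER BY id LIMIT 1 FOR UPDATE");
                 ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) { con.rollback(); return; }
                jobId = rs.getLong("id");
                proc = rs.getString("proc_name");
                param = rs.getInt("param");
            }
            // Dynamic CALL of the procedure named in the job row, e.g. procedureB(var1).
            try (CallableStatement call = con.prepareCall("{CALL " + proc + "(?)}")) {
                call.setInt(1, param);
                call.execute();
            }
            try (PreparedStatement done = con.prepareStatement(
                    "UPDATE job_table SET status = 'DONE' WHERE id = ?")) {
                done.setLong(1, jobId);
                done.executeUpdate();
            }
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            try (PreparedStatement unlock = con.prepareStatement("SELECT RELEASE_LOCK(?)")) {
                unlock.setString(1, "job_worker_" + workerId);
                unlock.executeQuery();
            }
        }
    }
}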