How does Apache Apex handle back pressure? - apache-apex

Repost from users@apex.incubator.apache.org
Apex uses the buffer server for back pressure. How does the buffer server survive application crashes? What if the buffer server itself dies? Will Apex guarantee that the downstream operator eventually catches up with the upstream operator when the buffer server is brought back up?

The buffer server is a pub-sub mechanism within the Apex platform that is used to stream data between operators. The buffer server always lives in the same container as the upstream operator (one buffer server per container, irrespective of the number of operators in the container), and the output of the upstream operator is written to the buffer server. A downstream operator subscribes to the upstream operator's buffer server when a stream is connected.
So if an operator fails, the upstream operator's buffer server retains the required data until a common checkpoint is reached. If the upstream operator fails, its own upstream operator's buffer server has the data, and so on. Finally, if an input operator fails, which has no upstream buffer server, the input operator is responsible for replaying the data. Depending on the external system, the input operator either relies on the external system for replays or maintains the data itself until a common checkpoint is reached.
If for some reason the buffer server fails, the container hosting it is treated as failed, so all the operators in that container and their downstream operators are redeployed from their last known checkpoint.
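To make the catch-up guarantee concrete, here is a toy Go sketch (not Apex's actual API; the types and names are made up for illustration): the buffer keeps published tuples keyed by streaming window, a recovering subscriber re-subscribes from its last checkpointed window, and buffered state is purged only once a common checkpoint has been reached downstream.

```go
package main

import "fmt"

// Toy illustration only -- this is not Apex's API. A publisher keeps
// tuples keyed by window id; a recovering subscriber re-subscribes
// from its last checkpointed window and replays forward to catch up.
type bufferServer struct {
	windows map[int][]string // window id -> tuples published in that window
	latest  int
}

func (b *bufferServer) publish(window int, tuple string) {
	if b.windows == nil {
		b.windows = map[int][]string{}
	}
	b.windows[window] = append(b.windows[window], tuple)
	if window > b.latest {
		b.latest = window
	}
}

// subscribeFrom replays everything after the given checkpoint.
func (b *bufferServer) subscribeFrom(checkpoint int, consume func(string)) {
	for w := checkpoint + 1; w <= b.latest; w++ {
		for _, t := range b.windows[w] {
			consume(t)
		}
	}
}

// purgeUpTo drops state once a common checkpoint is reached downstream.
func (b *bufferServer) purgeUpTo(checkpoint int) {
	for w := range b.windows {
		if w <= checkpoint {
			delete(b.windows, w)
		}
	}
}

func main() {
	var bs bufferServer
	bs.publish(1, "a")
	bs.publish(2, "b")
	bs.publish(3, "c")
	// The downstream operator failed after checkpointing window 1;
	// on redeploy it catches up from window 2 onward.
	bs.subscribeFrom(1, func(t string) { fmt.Println("replayed:", t) })
	bs.purgeUpTo(1)
}
```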

Related

Does MySQL write a requested modification to the file system before returning success to the client?

I know that database systems usually come with a buffer pool. When a request is received, the DBMS usually tries to locate the relevant data in the buffer pool; if it isn't there, it tries to load the related data into the buffer pool. If the request makes modifications to a table, the DBMS modifies the corresponding pages in the buffer pool. I am wondering: does MySQL send success back to the client immediately after the buffer pool is modified, or only after the modifications have been saved to the file system?
If the DBMS sends success back to the client immediately after the buffer pool is modified, how does it handle a failure that occurs while writing dirty pages to the hard drive?
This depends on your flush settings (for InnoDB, primarily innodb_flush_log_at_trx_commit), but by default MySQL gets a confirmation back from the OS that the change has been written through to disk before it reports success. MySQL does not just write to the in-memory buffer and call it done: durability comes from flushing the transaction (redo) log at commit, while the dirty data pages themselves can be written later and are recovered from that log after a crash.
You can alter this behaviour, and each storage engine (InnoDB in particular) has other flags of a similar nature.
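If you want to see what your server is configured to do, a minimal Go sketch (assuming the github.com/go-sql-driver/mysql driver and a placeholder DSN) could query the relevant variable:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

func main() {
	// Placeholder DSN -- adjust user, password, host and database.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// innodb_flush_log_at_trx_commit = 1 (the default) means the redo log
	// is flushed to disk at every commit before success is returned.
	var name, value string
	err = db.QueryRow("SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'").Scan(&name, &value)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s = %s\n", name, value)
}
```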

MySQL connection pool on Nodejs

If Node is single-threaded, what is the advantage of using a pool to connect to MySQL?
If there is one, when should I release a connection?
Isn't sharing a single persistent connection across the whole application enough?
Node.js is single-threaded, right. But it is also asynchronous, meaning that the single thread fires multiple SQL queries without waiting for the results; the results are only processed via callbacks. Therefore it makes sense to use a connection pool with more than one connection. The database is likely multi-threaded, which makes it possible to parallelize the queries even though they were fired consecutively. There is, however, no guarantee about the order in which the results are processed unless you take extra care to enforce one.
Addendum about connection release
If you use a connection pool, then you should acquire/release a connection from the pool for each query. There is no big overhead here, since the pool manages the underlying connections:
Get a connection from the pool.
Run the query.
In the callback, release the connection back to the pool.
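For illustration only, here is the same acquire/query/release pattern sketched in Go with database/sql, which manages the connection pool transparently; the DSN, pool size, and query are placeholders:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

func main() {
	// Placeholder DSN -- adjust credentials and database name.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	db.SetMaxOpenConns(10) // pool size; concurrent queries use separate connections

	ctx := context.Background()

	// 1. Get a connection from the pool.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// 2. Run the query on that connection.
	var now string
	if err := conn.QueryRowContext(ctx, "SELECT NOW()").Scan(&now); err != nil {
		log.Fatal(err)
	}
	fmt.Println("server time:", now)

	// 3. Release the connection back to the pool (Close does not disconnect,
	//    it just returns the connection to the pool).
	if err := conn.Close(); err != nil {
		log.Fatal(err)
	}
}
```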

Unbuffered result set in MySQL golang driver

I have a large query and I want to process the result row-by-row using the Go MySQL driver. The query is very simple but it returns a huge number of rows.
I have seen mysql_use_result() vs mysql_store_result() at the C API level. Is there an equivalent way to do an unbuffered query over a TCP connection, such as the one used by the Go MySQL driver?
This concept of buffered/unbuffered queries in database client libraries is a bit misleading, because buffering can actually occur at multiple levels. In general (i.e. neither Go- nor MySQL-specific), there are several kinds of buffers.
TCP socket buffers. The kernel associates a communication buffer with each socket. By default, the size of this buffer is dynamic and controlled by kernel parameters. Some clients change those defaults to get more control and optimize. The purpose of this buffer is to regulate the traffic in the device queues and, ultimately, decrease the number of packets on the network.
Communication buffers. Database oriented protocols are generally based on a framing protocol, meaning that frames are defined to separate the logical packets in the TCP stream. Socket buffers do not guarantee that a complete logical packet (a frame) is available for reading. Extra communication buffers are therefore required to make sure the frames are complete when they are processed. They can also help to reduce the number of system calls. These buffers are managed by the low-level communication mechanism of the database client library.
Rows buffers. Some database clients choose to keep all the rows read from the server in memory, and let the application code browse the corresponding data structures. For instance, the PostgreSQL C client (libpq) does it. The MySQL C client leaves the choice to the developer (by calling mysql_use_result or mysql_store_result).
Anyway, the Go driver you mention is not based on the MySQL C client (it is a pure Go driver). It uses only the first two kinds of buffers (socket buffers and communication buffers). Row-level buffering is not provided.
There is one communication buffer per MySQL connection. Its size is a multiple of 4 KB. It will grow dynamically if the frames are large. In the MySQL protocol, each row is sent as a separate packet (in a frame), so the size of the communication buffer is directly linked to the largest rows received/sent by the client.
The consequence is that you can run a query returning a huge number of rows without saturating memory, while still getting good socket performance. With this driver, buffering is never a problem for the developer, whatever the query.
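As a concrete sketch of row-by-row processing with that driver (assuming github.com/go-sql-driver/mysql, a placeholder DSN, and a hypothetical big_table):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // pure-Go MySQL driver
)

func main() {
	// Placeholder DSN -- adjust for your server.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The driver streams rows from the connection; each row is decoded as
	// you iterate, so a huge result set does not have to fit in memory.
	// "big_table" is a placeholder name.
	rows, err := db.Query("SELECT id, payload FROM big_table")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var payload string
		if err := rows.Scan(&id, &payload); err != nil {
			log.Fatal(err)
		}
		// Process each row here instead of accumulating rows in a slice.
		fmt.Println(id, len(payload))
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```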

What is INNODB "Pages to be Flushed"? And can I restart mysqld while there are still pages to be flushed?

I've tried reading some of the material out there, but it is a bit over my head. However, from what I understand, if you have a lot of memory allocated to the buffer pool, then writes to memory happen faster than the disk can keep up with, and therefore there are still "pages to be flushed"? Additionally, if I restart the MySQL server, will that cause any issues?
InnoDB performs certain tasks in the background, including flushing of dirty pages (those pages that have been changed but are not yet written to the database files) from the buffer pool, a task performed by the master thread.
For more information you can refer:
http://dev.mysql.com/doc/refman/5.6/en/innodb-performance.html#innodb-performance-adaptive_flushing
Having dirty pages is normal. When you update a row, MySQL updates it in the buffer pool, marking the page as dirty. The change is also written to the redo log (and to the binary log, if enabled), so in case of a crash MySQL will replay the log and data won't be lost. Writing to the log is an append-only operation, while the actual page update involves random writes, and random writes are slower. MySQL flushes dirty pages to disk when it needs to load new data into the buffer pool. So having dirty pages in InnoDB is normal - it's how it works, and it's done to improve overall performance. But if you really want to get rid of them, set innodb_max_dirty_pages_pct to 0.
If you are using MySQL 5.6, you can also enable the variable innodb_buffer_pool_dump_at_shutdown, which specifies whether to record the pages cached in the InnoDB buffer pool when the MySQL server is shut down, to shorten the warmup process at the next restart. You should use it in conjunction with innodb_buffer_pool_load_at_startup.
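A hedged Go sketch of how you might inspect the dirty-page count and turn on the dump-at-shutdown behaviour (placeholder DSN; note that innodb_buffer_pool_load_at_startup is not dynamic, so it has to be set in the option file rather than via SET GLOBAL):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

func main() {
	// Placeholder DSN -- needs a user with privileges for SET GLOBAL.
	db, err := sql.Open("mysql", "root:password@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// How many dirty pages are currently waiting to be flushed?
	var name, value string
	if err := db.QueryRow(
		"SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'").Scan(&name, &value); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s = %s\n", name, value)

	// Dump the buffer pool contents at shutdown (MySQL 5.6+, dynamic variable).
	if _, err := db.Exec("SET GLOBAL innodb_buffer_pool_dump_at_shutdown = ON"); err != nil {
		log.Fatal(err)
	}
	// innodb_buffer_pool_load_at_startup must be set in my.cnf instead,
	// since it is read-only at runtime.
}
```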

Socket throttling because client not reading data fast enough?

I have a client/server connection over a TCP socket, with the server writing to the client as fast as it can.
Looking over my network activity, the production client receives data at around 2.5 Mb/s.
A new lightweight client that I wrote just to read and benchmark the rate receives at about 5.0 Mb/s (which is probably around the max speed the server can transmit).
I was wondering what governs the rates here, since the client sends no data to the server to tell it about any rate limits.
In TCP it is the client. If the server's send window is full, it has to wait until more ACKs come back from the client. This is hidden from you inside the TCP stack, but TCP provides guaranteed delivery, which also means the server can't send data faster than the rate at which the client is processing it.
TCP has flow control and it happens automatically. Read about it at http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control
When the pipe fills up due to flow control, the server's socket write operations won't complete until the flow control is relieved.
The server can produce data at 5.0 Mb/s, but if your client is the bottleneck, the server has to wait until the data in its send buffer has been sent to the client, or until enough space is freed to put in more data.
Since the lightweight client was able to receive at 5.0 Mb/s, it is the post-receive processing in your production client that you have to check. If you receive data and then process it before reading more data, that is likely the bottleneck.
It is better to receive data asynchronously: as soon as one receive completes, ask the client socket to start receiving again, while you process the received data on a separate thread-pool thread. This way your client is always available to receive incoming data, and the server can send at full speed.
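A minimal Go sketch of the flow-control effect described above: a deliberately slow reader caps the rate at which the writer's Write calls can complete, even though the writer sends as fast as it can. The chunk sizes, sleep interval, and duration are arbitrary values chosen for illustration.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}

	// Slow reader: 64 KB every 100 ms (~640 KB/s ceiling for the writer).
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		buf := make([]byte, 64*1024)
		for {
			if _, err := conn.Read(buf); err != nil {
				return
			}
			time.Sleep(100 * time.Millisecond)
		}
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	chunk := make([]byte, 64*1024)
	start := time.Now()
	var sent int
	for time.Since(start) < 3*time.Second {
		n, err := conn.Write(chunk) // blocks once the receive window is full
		if err != nil {
			log.Fatal(err)
		}
		sent += n
	}
	fmt.Printf("sent %.1f KB/s despite writing as fast as possible\n",
		float64(sent)/1024/time.Since(start).Seconds())
}
```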