Hey, I read these JDBC docs
https://www.playframework.com/documentation/2.1.0/ScalaDatabase
and this question
Is it good to put jdbc operations in actors?
Now I have an actor class for my MySQL transactions, and this actor is instantiated several times, whenever a request comes in. So each request would instantiate a new actor. Is that safe for the connection pool?
Can I use
val connection = DB.getConnection()
Could that connection object handle async transactions?
Could I just use a singleton to handle the MySQL connection and use it in all the instantiated actors? Also, if I want to use Anorm, how do I make an implicit connection object?
Thanks
Your DB.getConnection() should return a Promise[Connection] or a Future[Connection] if you don't want to block the actor (caveats at the end of the answer).
If your DB.getConnection() is synchronous (returning the bare connection without a wrapping type), your actor will hang while processing the message until it actually gets a connection from the pool. It doesn't matter whether your DB is a singleton or not; in the end it will hit the connection pool.
That being said, you can create actors to handle the messaging and other actors to handle persistence to the database, and put them on different thread dispatchers, giving more threads to the database-intensive ones. This is also suggested in the Play Framework documentation.
Caveats:
If you run futures inside the actor, you have no guarantee of the thread/timing they will run on. I'm assuming you did something along these lines (read the comments):
def receive = {
  case aMessage =>
    val aFuture = Future(db.getConnection)
    aFuture.map { theConn =>
      // Between the previous line, where the connection is acquired, and the next line,
      // where it is used, a lot of time can pass and they may run on different threads.
      // That's why you are better off creating an actor that handles this synchronously
      // and letting Akka do the async part.
      theConn.prepareStatement(someSQL)
      // omitted code...
    }
}
So my suggestion would be:
// Actor A receives messages.
// Actor B does the DB work (run multiple instances of it because the DB is slow).
class ActorA(routerOfB: ActorRef) extends Actor {
  def receive = {
    case aMessage =>
      routerOfB ! aMessage
  }
}

class ActorB(db: DB) extends Actor {
  def receive = {
    case aMessage =>
      val conn = db.getConnection // this blocks, but we have multiple instances
                                  // and it forces the work to run in the same thread
      val ps = conn.prepareStatement(someSQL)
  }
}
You will need routing: http://doc.akka.io/docs/akka/2.4.1/scala/routing.html
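Picking up the dispatcher suggestion from above, here is a minimal sketch (not from the original answer) of how the pieces could be wired together; the dispatcher name db-dispatcher, the pool sizes, and the db handle passed to ActorB are all assumptions for illustration:
import akka.actor.{ActorSystem, Props}
import akka.routing.RoundRobinPool

// application.conf would define a dedicated dispatcher for the blocking DB work, e.g.:
//   db-dispatcher {
//     type = Dispatcher
//     executor = "thread-pool-executor"
//     thread-pool-executor { fixed-pool-size = 16 }
//   }

val system = ActorSystem("app")

// a pool of ActorB routees, all running on the dedicated dispatcher
val dbRouter = system.actorOf(
  RoundRobinPool(8).props(
    Props(classOf[ActorB], db).withDispatcher("db-dispatcher")),
  name = "db-workers")

// ActorA only forwards messages, so it can stay on the default dispatcher
val frontActor = system.actorOf(Props(classOf[ActorA], dbRouter), name = "front")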
As far as I know, you can't run multiple concurrent queries on a single connection with an RDBMS (I haven't seen any resource/reference for async/non-blocking calls for MySQL, even in the C API). To run your queries concurrently, you must have multiple connection instances.
DB.getConnection isn't expensive as long as you have multiple connection instances. The most expensive part of working with a DB is running the SQL query and waiting for its response.
To make your DB calls async, you should run them on other threads (not on the main thread pool of Akka or Play); Slick does this for you. It manages a thread pool and runs your DB calls on it, so your main threads are free to process incoming requests. Then you don't need to wrap your DB calls in actors to be async.
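For illustration, here is a rough sketch of what that looks like with Slick 3 (the config key "mydb" and the users table are placeholders, not something from the question):
import slick.jdbc.MySQLProfile.api._
import scala.concurrent.Future

// "mydb" is an assumed key in application.conf holding the JDBC URL,
// credentials and connection-pool / thread-pool settings
val db = Database.forConfig("mydb")

// db.run executes the query on Slick's own thread pool and immediately
// returns a Future, so the Play/Akka threads are never blocked
def userName(id: Long): Future[Option[String]] =
  db.run(sql"SELECT name FROM users WHERE id = $id".as[String].headOption)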
I think you should take a connection from the pool and return it when you are done. If you go with a single connection per actor, what happens if that connection gets disconnected? You would need to reinitialize it.
For transactions you may want to try:
DB.withTransaction { conn => /* do whatever you need with the connection */ }
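To answer the Anorm part of the question: the with* helpers hand the connection to your block, so marking the parameter implicit is all Anorm needs. A minimal sketch against the Play 2.1-era API (the users table is made up):
import play.api.db.DB
import play.api.Play.current
import anorm._
import anorm.SqlParser._

def countUsers(): Long =
  DB.withConnection { implicit connection =>
    // Anorm picks up the implicit java.sql.Connection from the enclosing block
    SQL("SELECT COUNT(*) FROM users").as(scalar[Long].single)
  }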
For a more functional way of doing database access, I would recommend looking into Slick, which has a nice API that can be integrated with plain actors and, going further, with streams.
I have a Lambda that uses RDS. I wanted to improve it and use Lambda connection caching. I have found several articles and implemented it on my side, to the best of my knowledge. But now I am not sure this is the right way to go.
I have a Lambda (running Node 8) which has several files used with require. I will start from the main function and follow the exact path until I reach the MySQL initializer. It will all be super simple, showing only the flow of the code that runs MySQL:
Main Lambda:
const jobLoader = require('./Helpers/JobLoader');
exports.handler = async (event, context) => {
    const emarsysPayload = event.Records[0];
    let validationSchema;
    const body = jobLoader.loadJob('JobName');
    ...
    return;
    ...//
Job Code:
const MySQLQueryBuilder = require('../Helpers/MySqlQueryBuilder');
exports.runJob = async (params) => {
    const data = await MySQLQueryBuilder.getBasicUserData(userId);
MySQLBuilder:
const mySqlConnector = require('../Storage/MySqlConnector');
class MySqlQueryBuilder {
    async getBasicUserData (id) {
        let query = `
            SELECT * from sometable WHERE id= ${id}
        `;
        return mySqlConnector.runQuery(query);
    }
}
And finally, the connector itself:
const mySqlConnector = require('promise-mysql');
const pool = mySqlConnector.createPool({
    host: process.env.MY_SQL_HOST,
    user: process.env.MY_SQL_USER,
    password: process.env.MY_SQL_PASSWORD,
    database: process.env.MY_SQL_DATABASE,
    port: 3306
});

exports.runQuery = async query => {
    const con = await pool.getConnection();
    const result = await con.query(query); // await the query before releasing the connection
    con.release();
    return result;
};
I know that measuring performance will show the actual results, but today is Friday and I will not be able to run this on Lambda until late next week... And really, it would be an awesome start to the weekend to know whether I am headed in the right direction... or not.
Thanks for the input.
The first thing would be to understand how require works in Node.js. I recommend you go through this article if you're interested in knowing more about it.
Now, once you have required your connection, you have it for good and it won't be required again. This matches what you're looking for, as you don't want to overwhelm your database by creating a new connection every time.
But, there is a problem...
Lambda Cold Starts
Whenever you invoke a Lambda function for the first time, it will spin up a container with your function inside it and keep it alive for approximately 5 mins. It's very likely (although not guaranteed) that you will hit the same container every time as long as you are making 1 request at a time. But what happens if you have 2 requests at the same time? Then another container will be spun up in parallel with the previous, already warmed up container. You have just created another connection on your database and now you have 2 containers. Now, guess what happens if you have 3 concurrent requests? Yes! One more container, which equals one more DB connection.
As long as there are new requests to your Lambda functions, by default they will scale out to meet demand (you can configure it in the console to limit execution to as many concurrent executions as you want, respecting your account limits).
You cannot safely make sure you have a fixed number of connections to your database simply by requiring your code upon a function's invocation. The good thing is that this is not your fault; this is just how Lambda functions behave.
...one other approach is to cache the data you want in a real caching system, like ElastiCache, for example. You could then have one Lambda function be triggered by a CloudWatch Event that runs at a certain frequency. This function would then query your DB and store the results in your external cache. This way you make sure your DB connection is only opened by one Lambda at a time, because it will respect the CloudWatch Event, which turns out to run only once per trigger.
EDIT: After the OP sent a link in the comments section, I have decided to add a bit more info to clarify what the mentioned article is trying to say.
From the article:
"Simple. You ARE able to store variables outside the scope of our
handler function. This means that you are able to create your DB
connection pool outside of the handler function, which can then be
shared with each future invocation of that function. This allows for
pooling to occur."
And this is exactly what you're doing. And it works! But the problem arises if you have N connections (Lambda requests) at the same time. If you don't set any limits, by default up to 1000 Lambda functions can be spun up concurrently. Now, if you then make another 1000 requests simultaneously within the next 5 minutes, it's very likely you won't be opening any new connections, because they have already been opened on previous invocations and the containers are still alive.
Adding to the answer above by Thales Minussi, but for a Python Lambda: I am using PyMySQL, and to create a connection pool I added the connection code above the handler in a Lambda that fetches data. Once I did this, I was not getting any new data that had been added to the DB after an instance of the Lambda had executed. I found bugs reported here and here that are related to this issue.
The solution that worked for me was to add a conn.commit() after the SELECT query execution in the Lambda.
According to the PyMySQL documentation, conn.commit() is supposed to commit any changes, but a SELECT does not make changes to the DB. So I am not sure exactly why this works.
There are a couple of earlier related questions, but none of them solves the issue for me:
https://dba.stackexchange.com/questions/160444/parallel-postgresql-queries-with-r
Parallel Database calls with RODBC
"foreach" loop : Using all cores in R (especially if we are sending sql queries inside foreach loop)
My use case is the following: I have a large database of data that needs to be plotted. Each plot takes a few seconds to create due to some necessary pre-processing of the data and the plotting itself (ggplot2). I need to do a large number of plots. My thinking is that I will connect to the database via dplyr without downloading all the data to memory. Then I have a function that fetches a subset of the data to be plotted. This approach works fine when using single-threading, but when I try to use parallel processing I run into SQL errors related to the connection: "MySQL server has gone away".
Now, I recently solved the same issue working in Python, in which case the solution was simply to kill the current connection inside the function, which forced the establishment of a new connection. I did this using connection.close() where connection is from Django's django.db.
My problem is that I cannot find an R equivalent of this approach. I thought I had found the solution when I found the pool package for R:
This package enables the creation of object pools for various types of objects in R, to make it less computationally expensive to fetch one. Currently the only supported pooled objects are DBI connections (see the DBI package for more info), which can be used to query a database either directly through DBI or through dplyr. However, the Pool class is general enough to allow for pooling of any R objects, provided that someone implements the backend appropriately (creating the object factory class and all the required methods) -- a vignette with instructions on how to do so will be coming soon.
My code is too large to post here, but essentially, it looks like this:
#libraries loaded as necessary
#connect to the db in some kind of way
#with dplyr
db = src_mysql(db_database, username = db_username, password = db_password)
#with RMySQL directly
db = dbConnect(RMySQL::MySQL(), dbname = db_database, username = db_username, password = db_password)
#with pool
db = pool::dbPool(RMySQL::MySQL(),
                  dbname = db_database,
                  username = db_username,
                  password = db_password,
                  minSize = 4)
#I tried all 3
#connect to a table
some_large_table = tbl(db, 'table')
#define the function
some_function = function(some_id) {
  #fetch data from table
  subtable = some_large_table %>% filter(id == some_id) %>% collect()
  #do something with the data
  something(subtable)
}
#parallel process
mclapply(vector_of_ids,
         FUN = some_function,
         mc.cores = num_of_threads)
The code you have above is not the equivalent of your Python code, and that is the key difference. What you did in Python is totally possible in R (see MWE below). However, the code you have above is not:
kill[ing] the current connection inside the function, which forced the establishment of a new connection.
What it is trying (and failing) to do is to make a database connection travel from the parent process to each child process opened by the call to mclapply. This is not possible. Database connections can never travel across process boundaries no matter what.
This is an example of the more general "rule" that a child process cannot affect the state of the parent process, period. For example, the child process cannot write to the parent's memory. You can't plot (to the parent process's graphics device) from those child processes either.
In order to do the same thing you did in Python, you need to open a new connection inside of the function FUN (the second argument to mclapply) if you want it to be truly parallel. I.e. you have to make sure that the dbConnect call happens inside the child process.
This eliminates the point of pool (though it’s perfectly safe to use), since pool is useful when you reuse connections and generally want them to be easily accessible. For your parallel use case, since you can't cross process boundaries, this is useless: you will always need to open and close the connection for each new process, so you might as well skip pool entirely.
Here's the correct "translation" of your Python solution to R:
library(dplyr)
getById <- function(id) {
  # create a connection and close it on exit
  conn <- DBI::dbConnect(
    drv = RMySQL::MySQL(),
    dbname = "shinydemo",
    host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
    username = "guest",
    password = "guest"
  )
  on.exit(DBI::dbDisconnect(conn))
  # get a specific row based on ID
  conn %>% tbl("City") %>% filter(ID == id) %>% collect()
}
parallel::mclapply(1:10, getById, mc.cores = 12)
I am writing a .NET 4.0 console app that:
Opens up a connection
Uses a data reader to cursor through a list of keys
For each key read, calls a web service
Stores the result of the web service in the database
I then spawn multiple threads of this process in order to improve the maximum number of records that I can process per second.
When I up the process beyond about 30 or so threads, I get the following error:
System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
Is there a server-side or client-side option to tweak to allow me to obtain more connections from the connection pool?
I am calling a SQL Server 2008 R2 database.
Thanks
This sounds like a design issue. What's your total record count from the database? Iterating through the reader will be really fast. Even if you have hundreds of thousands of rows, going through that reader will be quick. Here's a different approach you could take:
Iterate through the reader and store the data in a list of objects. Then iterate through your list of objects at a number of your choice (e.g. two at a time, three at a time, etc) and spawn that number of threads to make calls to your web service in parallel.
This way you won't be opening multiple connections to the database, and you're dealing with what is likely the true bottleneck (the HTTP call to the web service) in parallel.
Here's an example:
List<SomeObject> yourObjects = new List<SomeObject>();

if (yourReader.HasRows) {
    while (yourReader.Read()) {
        SomeObject foo = new SomeObject();
        foo.SomeProperty = yourReader.GetInt32(0);
        yourObjects.Add(foo);
    }
}

for (int i = 0; i < yourObjects.Count; i += 2) {
    // Kick off your web service calls in parallel. You will likely want to do something with the result.
    // The bounds check guards against an odd number of objects.
    var tasks = new List<Task> {
        Task.Factory.StartNew(() => yourService.MethodName(yourObjects[i].SomeProperty))
    };
    if (i + 1 < yourObjects.Count) {
        tasks.Add(Task.Factory.StartNew(() => yourService.MethodName(yourObjects[i + 1].SomeProperty)));
    }
    Task.WaitAll(tasks.ToArray());
}

// Now do your database INSERT.
Opening up a new connection for every request is incredibly inefficient. If you simply want to use the same connection to keep requesting things, that is more than possible. You can open a connection and then run as many SqlCommand commands as you like through that one connection. Simply keep the ONE connection around and dispose of it after all your threading is done.
Please restart IIS and you will be able to connect.
We use RabbitMQ to send jobs from a producer on one machine to a small group of consumers distributed across several machines.
The producer generates jobs and places them on the queue, and the consumers check the queue every 10ms to see if there are any unclaimed jobs, fetching one job at a time if one is available. If a particular worker takes too long to process a job (GC pauses or other transient issues), other consumers are free to remove jobs from the queue, so that overall job throughput stays high.
When we originally set up this system, we were unable to figure out how to set up a subscriber relationship for more than one consumer on the queue that would save us from having to poll and introduce that little extra bit of latency.
Inspecting the documentation has not yielded satisfying answers. We are new to using message queues and it is possible that we don't know the words that accurately describe the above scenario. This is something like a blackboard system, but in this case the "specialists" are all identical and never consume each other's results -- results are always reported back to the job producer.
Any ideas?
Getting publish-subscribe working is straightforward; I initially had the same problems, but it works well. The project now has some great help pages at http://www.rabbitmq.com/getstarted.html
RabbitMQ has timeouts and a resent (redelivered) flag which can be used as you see fit.
You can also make the workers event-driven, as opposed to checking the queue every 10ms. If you need help with this, I have a small project at http://rabbitears.codeplex.com/ which might help slightly.
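For reference, here is a minimal sketch of such an event-driven (push) consumer using the RabbitMQ Java client from Scala; the broker host, queue name, and processJob body are assumptions for illustration:
import com.rabbitmq.client.{AMQP, ConnectionFactory, DefaultConsumer, Envelope}

object PushWorker extends App {
  val factory = new ConnectionFactory()
  factory.setHost("localhost") // assumed broker location

  val connection = factory.newConnection()
  val channel = connection.createChannel()

  channel.queueDeclare("jobs", true, false, false, null) // assumed durable queue name
  channel.basicQos(1) // at most one unacked job per worker, so slow workers don't hoard jobs

  val consumer = new DefaultConsumer(channel) {
    override def handleDelivery(consumerTag: String,
                                envelope: Envelope,
                                properties: AMQP.BasicProperties,
                                body: Array[Byte]): Unit = {
      processJob(new String(body, "UTF-8"))            // the broker pushes the job; no polling
      channel.basicAck(envelope.getDeliveryTag, false) // ack only after the work is done
    }
  }
  channel.basicConsume("jobs", false, consumer) // autoAck = false

  def processJob(job: String): Unit = println(s"processing $job")
}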
Here you have to keep in mind that a RabbitMQ channel is not thread-safe, so create a singleton class that will handle all these RabbitMQ operations, like the one below. I am writing the code sample in Scala.
import com.rabbitmq.client.{ConnectionFactory, MessageProperties, QueueingConsumer}
import com.rabbitmq.client.QueueingConsumer.Delivery

object QueueManager {
  val FACTORY = new ConnectionFactory
  FACTORY setUsername RABBITMQ_USERNAME
  FACTORY setPassword RABBITMQ_PASSWORD
  FACTORY setVirtualHost RABBITMQ_VIRTUALHOST
  FACTORY setPort RABBITMQ_PORT
  FACTORY setHost RABBITMQ_HOST

  val conn = FACTORY.newConnection
  val channel: com.rabbitmq.client.Channel = conn.createChannel

  // declare the consumer for queue1
  channel.exchangeDeclare(EXCHANGE_NAME, "direct", durable)
  channel.queueDeclare(QUEUE1, durable, false, false, null)
  channel queueBind (QUEUE1, EXCHANGE_NAME, QUEUE1_ROUTING_KEY)
  val queue1Consumer = new QueueingConsumer(channel)
  channel basicConsume (QUEUE1, false, queue1Consumer)

  // declare the consumer for queue2
  channel.exchangeDeclare(EXCHANGE_NAME, "direct", durable)
  channel.queueDeclare(QUEUE2, durable, false, false, null)
  channel queueBind (QUEUE2, EXCHANGE_NAME, QUEUE2_ROUTING_KEY)
  val queue2Consumer = new QueueingConsumer(channel)
  channel basicConsume (QUEUE2, false, queue2Consumer)

  // note: you should use a distinct routing key for each queue

  def addToQueueOne(obj: String) {
    channel.basicPublish(EXCHANGE_NAME, QUEUE1_ROUTING_KEY, MessageProperties.PERSISTENT_TEXT_PLAIN, obj.getBytes)
  }

  def addToQueueTwo(obj: String) {
    channel.basicPublish(EXCHANGE_NAME, QUEUE2_ROUTING_KEY, MessageProperties.PERSISTENT_TEXT_PLAIN, obj.getBytes)
  }

  def getFromQueue1: Delivery = queue1Consumer.nextDelivery

  def getFromQueue2: Delivery = queue2Consumer.nextDelivery
}
I have written the code sample for 2 queues; you can add more queues in the same way as above.
I've written a Grails service with the following code:
EPCGenerationMetadata requestEPCs(String indicatorDigit, FilterValue filterValue,
        PartitionValue partitionValue, String companyPrefix, String itemReference,
        Long quantity) throws IllegalArgumentException, IllegalStateException {
    //... code
    //problematic snippet below
    def serialGenerator
    synchronized(this) {
        log.debug "Generating epcs..."
        serialGenerator = SerialGenerator.findByItemReference(itemReference)
        if(!serialGenerator) {
            serialGenerator = new SerialGenerator(itemReference: itemReference, serialNumber: 0l)
        }
        startingPoint = serialGenerator.serialNumber + 1
        serialGenerator.serialNumber += quantity
        serialGenerator.save(flush: true)
    }
    //code continues...
}
Since a Grails service is a singleton by default, I thought I'd be safe from concurrent inconsistency by adding the synchronized block above. I created a simple client for testing concurrency, as the service is exposed via HTTP invoker. I ran multiple clients at the same time, passing the same itemReference as an argument, and had no problems at all.
However, when I changed the database from MySQL to PostgreSQL 8.4, I couldn't handle concurrent access anymore. When running a single client, everything is fine. However, if I add one more client asking for the same itemReference, I instantly get a StaleObjectStateException:
Exception in thread "main" org.springframework.orm.hibernate3.HibernateOptimisticLockingFailureException: Object of class [br.com.app.epcserver.SerialGenerator] with identifier [10]: optimistic locking failed; nested exception is org.hibernate.StaleObjectStateException: Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect): [br.com.app.epcserver.SerialGenerator#10]
at org.springframework.orm.hibernate3.SessionFactoryUtils.convertHibernateAccessException(SessionFactoryUtils.java:672)
at org.springframework.orm.hibernate3.HibernateAccessor.convertHibernateAccessException(HibernateAccessor.java:412)
at org.springframework.orm.hibernate3.HibernateTemplate.doExecute(HibernateTemplate.java:411)
at org.springframework.orm.hibernate3.HibernateTemplate.executeWithNativeSession(HibernateTemplate.java:374)
at org.springframework.orm.hibernate3.HibernateTemplate.flush(HibernateTemplate.java:881)
at org.codehaus.groovy.grails.orm.hibernate.metaclass.SavePersistentMethod$1.doInHibernate(SavePersistentMethod.java:58)
(...)
at br.com.app.EPCGeneratorService.requestEPCs(EPCGeneratorService.groovy:63)
at br.com.app.epcclient.IEPCGenerator$requestEPCs.callCurrent(Unknown Source)
at br.com.app.epcserver.EPCGeneratorService.requestEPCs(EPCGeneratorService.groovy:29)
at br.com.app.epcserver.EPCGeneratorService$$FastClassByCGLIB$$15a2adc2.invoke()
(...)
Caused by: org.hibernate.StaleObjectStateException: Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect): [br.com.app.epcserver.SerialGenerator#10]
Note: EPCGeneratorService.groovy:63 refers to serialGenerator.save(flush: true).
I don't know what to think, as the only thing that I've changed was the database. I'd appreciate any advice on the matter.
I'm using:
Grails 1.3.3
Postgres 8.4 (postgresql-8.4-702.jdbc4 driver)
JBoss 6.0.0-M4
MySQL:
mysqld Ver 5.1.41 (mysql-connector-java-5.1.13-bin driver)
Thanks in advance!
That's weird; try disabling transactions.
This is indeed strange behavior, but you could try to work around it by using a "select ... for update", via Hibernate's lock method.
Something like this:
def c = SerialGenerator.createCriteria()
serialGenerator = c.get {
    eq "itemReference", itemReference
    lock true
}