Erlang - too many processes - exception

These are my first steps in Erlang, so sorry for this newbie question :) I'm spawning a new Erlang process for every Redis request, which is not what I want (I get "Too many processes" at 32k Erlang processes). How can I throttle the number of processes to, e.g., a maximum of 16?
-module(queue_manager).
-export([add_ids/0, add_id/2]).
add_ids() ->
    {ok, Client} = eredis:start_link(),
    do_spawn(Client, lists:seq(1,100000)).
do_spawn(Client, [H|T]) ->
    Pid = spawn(?MODULE, add_id, [Client, H]),
    do_spawn(Client, T);
do_spawn(_, []) -> none.
add_id(C, Id) ->
    {ok, _} = eredis:q(C, ["SADD", "todo_queue", Id]).

Try using the Erlang pg2 module. It allows you to easily create process groups and provides an API to get the 'closest' (or a random) PID in the group.
Here is an example of a process group for the eredis client:
-module(redis_pg).
-export([create/1,
         add_connections/1,
         connection/0,
         connections/0,
         q/1]).
create(Count) ->
    % create process group using the module name as the reference
    pg2:create(?MODULE),
    add_connections(Count).
% recursive helper for adding +Count+ connections
add_connections(Count) when Count > 0 ->
    ok = add_connection(),
    add_connections(Count - 1);
add_connections(_Count) ->
    ok.
add_connection() ->
    % start redis client connection
    {ok, RedisPid} = eredis:start_link(),
    % join the redis connection PID to the process group
    pg2:join(?MODULE, RedisPid).
connection() ->
    % get a random redis connection PID
    pg2:get_closest_pid(?MODULE).
connections() ->
    % get all redis connection PIDs in the group
    pg2:get_members(?MODULE).
q(Argv) ->
    % execute redis command +Argv+ using random connection
    eredis:q(connection(), Argv).
Here is an example of the above module in action:
1> redis_pg:create(16).
ok
2> redis_pg:connection().
<0.68.0>
3> redis_pg:connection().
<0.69.0>
4> redis_pg:connections().
[<0.53.0>,<0.56.0>,<0.57.0>,<0.58.0>,<0.59.0>,<0.60.0>,
<0.61.0>,<0.62.0>,<0.63.0>,<0.64.0>,<0.65.0>,<0.66.0>,
<0.67.0>,<0.68.0>,<0.69.0>,<0.70.0>]
5> redis_pg:q(["PING"]).
{ok,<<"PONG">>}

You could use a connection pool, e.g., eredis_pool. This is a similar question which might be interesting for you.

You can use a supervisor to launch each new process (for your example it seems that you should use a simple_one_for_one strategy):
supervisor:start_child(SupRef, ChildSpec) -> startchild_ret().
You can then get the process count using the function
supervisor:count_children(SupRef) -> PropListOfCounts.
The result is a proplist of the form
[{specs,N1},{active,N2},{supervisors,N3},{workers,N4}] (the order is not guaranteed!)
If you want more information about active processes, you can also use
supervisor:which_children(SupRef) -> [{Id, Child, Type, Modules}], but this is not recommended when a supervisor manages a "large" number of children.

You are basically "on your own" when you implement limits. There are certain tools which will help you, but I think the general question "how do I avoid spawning too many processes?" still holds. The trick is to keep track of the process count somewhere.
An Erlang-idiomatic way would be to have a process which contains a counter. Whenever you want to spawn a new process, you ask it if you are allowed to do so by registering a need for tokens against it. You then wait for the counting process to respond back to you.
The counting process is then a nice modular guy maintaining a limit for you.
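The counting-process idea can be sketched as follows. This is a minimal illustration in Python rather than Erlang, with a thread and a queue standing in for the process and its mailbox; the names counter_loop and demo, the message tuples, and the sleep that stands in for the Redis request are all assumptions made for the example, not part of any library.

```python
import queue
import threading
import time


def counter_loop(mailbox, limit):
    """Own the token count and answer "acquire"/"release" messages."""
    free = limit
    waiting = []  # reply queues of callers that have to wait for a token
    while True:
        msg, reply = mailbox.get()
        if msg == "acquire":
            if free > 0:
                free -= 1
                reply.put("ok")
            else:
                waiting.append(reply)
        elif msg == "release":
            if waiting:
                waiting.pop(0).put("ok")  # hand the token straight to a waiter
            else:
                free += 1
        else:  # "stop"
            return


def demo(limit=16, jobs=200):
    """Run `jobs` workers, never more than `limit` at once; return (peak, done)."""
    mailbox = queue.Queue()
    counter = threading.Thread(target=counter_loop, args=(mailbox, limit))
    counter.start()

    lock = threading.Lock()
    state = {"running": 0, "peak": 0, "done": 0}

    def worker(_job_id):
        try:
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            time.sleep(0.001)  # stand-in for the Redis request
            with lock:
                state["running"] -= 1
                state["done"] += 1
        finally:
            mailbox.put(("release", None))  # give the token back

    threads = []
    for job_id in range(jobs):
        reply = queue.Queue()
        mailbox.put(("acquire", reply))
        reply.get()  # block until the counter grants a token, then spawn
        t = threading.Thread(target=worker, args=(job_id,))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()
    mailbox.put(("stop", None))
    counter.join()
    return state["peak"], state["done"]
```

Calling demo(limit=16, jobs=200) runs all 200 jobs while the counter keeps at most 16 workers alive at any moment. In Erlang the same shape would be a process with a receive loop that replies to the spawner.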

Related

How many tasks are created when spark read or write from mysql?

As far as I know, Spark executors handle many tasks at the same time so that data is processed in parallel. Here comes the question: when connecting to external data storage, say MySQL, how many tasks are there to finish this job? In other words, are multiple tasks created at the same time with each task reading all the data, or is the data read by only one task and then distributed to the cluster in some other way? And when writing data to MySQL, how many connections are there?
Here is some piece of code to read or write data from/to mysql:
def jdbc(sqlContext: SQLContext, url: String, driver: String, dbtable: String, user: String, password: String, numPartitions: Int): DataFrame = {
  sqlContext.read.format("jdbc").options(Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> s"(SELECT * FROM $dbtable) $dbtable",
    "user" -> user,
    "password" -> password,
    "numPartitions" -> numPartitions.toString
  )).load
}
def mysqlToDF(sparkSession:SparkSession, jdbc:JdbcInfo, table:String): DataFrame ={
  var dF1 = sparkSession.sqlContext.read.format("jdbc")
    .option("url", jdbc.jdbcUrl)
    .option("user", jdbc.user)
    .option("password", jdbc.passwd)
    .option("driver", jdbc.jdbcDriver)
    .option("dbtable", table)
    .load()
  // dF1.show(3)
  dF1.createOrReplaceTempView(s"${table}")
  dF1
}
Here is a good article which answers your question:
https://freecontent.manning.com/what-happens-behind-the-scenes-with-spark/
In simple words: the workers split the read into several parts, and each worker reads only a part of your input data. The number of tasks depends on your resources and your data volume. Writing follows the same principle: Spark writes the data to a distributed storage system such as HDFS, where the data is stored in a distributed way: each worker writes its data to some storage node in HDFS.
By default, data from a JDBC source is loaded by a single thread, so you will have one task processed by one executor. That is the case you can expect in your second function, mysqlToDF.
In the first function, "jdbc", you are closer to a parallel read, but some parameters are still missing. numPartitions alone is not enough: Spark needs a numeric, date, or timestamp column plus lower/upper bounds to be able to read in parallel (it will execute several queries, each returning a partial result).
Spark JDBC documentation
In the documentation you will find:
partitionColumn, lowerBound, upperBound (default: none)
These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
numPartitions (default: none; scope: read/write)
The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
Regarding writes:
How about writing data to mysql, how many connections are there?
As stated in the documentation, this also depends on numPartitions. If the number of partitions at write time is higher than numPartitions, Spark will call coalesce to reduce it. Remember that coalesce may produce skewed partitions, so it is sometimes better to repartition explicitly with repartition(numPartitions) to distribute the data evenly before the write.
If you don't set numPartitions, the number of parallel connections on write may equal the number of active tasks at a given moment, so be aware that with too high a parallelism and no upper bound you may choke the database server.
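To make the "partition stride" idea concrete, here is a rough Python sketch of how the bounds could be turned into one WHERE clause per partition. It is a simplified approximation of what Spark does for an integer column, not Spark's actual code (which differs between versions); the function name partition_predicates is made up for the example, and non-negative integer bounds with upper_bound - lower_bound >= num_partitions are assumed.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause string per partition.

    Assumes non-negative integer bounds and num_partitions >= 2.
    The bounds only decide the stride: the first and last partitions are
    open-ended, so rows outside [lower_bound, upper_bound) are still read.
    """
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            # NULLs in the partition column go to the first partition.
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for clause in partition_predicates("id", 0, 1000, 4):
    print(clause)
# id < 250 OR id IS NULL
# id >= 250 AND id < 500
# id >= 500 AND id < 750
# id >= 750
```

Each clause corresponds to one task and one JDBC connection, which is why numPartitions caps the number of concurrent connections on read.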

How can we run queries concurrently, using go routines?

I am using gorm v1 (ORM), go version 1.14
DB connection is created at the start of my app
and that DB is being passed throughout the app.
I have a complex & long functionality.
Let's say I have 10 sets of queries to run and the order doesn't matter.
So, what I did was
go queryset1(DB)
go queryset2(DB)
...
go queryset10(DB)
// here I have a wait, maybe via channel or WaitGroup.
Inside queryset1:
func queryset1(db *gorm.DB, /*wg or errChannel*/){
db.Count() // basic count query
wg.Done() or errChannel <- nil
}
Now, the problem is I encounter the error :1040 "too many connections" - Mysql.
Why is this happening? Does every go routine create a new connection?
If so, is there a way to check this & "live connections" in mysql
(Not the show status variables like connection)
How can I concurrently query the DB?
Edit:
This guy has the same problem
The error is not directly related to go-gorm, but to the underlying MySQL configuration and your initial connection configuration. In your code, you can manage the following parameters when you first connect to the database:
maximum open connections (SetMaxOpenConns function)
maximum idle connections (SetMaxIdleConns function)
maximum lifetime of a connection, i.e. how long it may be reused (SetConnMaxLifetime function)
For more details, check the official docs or this article on how to get the maximum performance from your connection configuration.
If you want to prevent a situation where each goroutine uses a separate connection, you can do something like this:
// restrict goroutines to be executed 5 at a time
connCh := make(chan bool, 5)
go queryset1(DB, &wg, connCh)
go queryset2(DB, &wg, connCh)
...
go queryset10(DB, &wg, connCh)
wg.Wait()
close(connCh)
Inside your queryset functions:
func queryset1(db *gorm.DB, wg *sync.WaitGroup, connCh chan bool){
    connCh <- true
    db.Count() // basic count query
    <-connCh
    wg.Done()
}
The connCh channel lets the first 5 goroutines write to it and blocks the rest until one of those 5 goroutines takes its value back out of the channel. This prevents the situation where every goroutine starts its own connection. Some of the connections should be reused, but that also depends on the initial connection configuration.
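The same "at most 5 at a time" pattern, sketched in Python for comparison. A bounded semaphore plays the role of the buffered channel; run_querysets and fake_count_query are hypothetical names, and the sleep stands in for a real database query.

```python
import threading
import time


def fake_count_query(n):
    """Stand-in for a real query such as a COUNT; returns n after a short delay."""
    time.sleep(0.01)
    return n


def run_querysets(querysets, max_concurrent=5):
    """Run every callable in `querysets` on its own thread, at most
    `max_concurrent` at a time. Returns (results, peak concurrency seen)."""
    slots = threading.BoundedSemaphore(max_concurrent)
    lock = threading.Lock()
    stats = {"active": 0, "peak": 0}
    results = {}

    def guarded(name, query):
        with slots:  # acquire a slot, like `connCh <- true` ... `<-connCh`
            with lock:
                stats["active"] += 1
                stats["peak"] = max(stats["peak"], stats["active"])
            try:
                value = query()
            finally:
                with lock:
                    stats["active"] -= 1
            with lock:
                results[name] = value

    threads = [
        threading.Thread(target=guarded, args=(name, query))
        for name, query in querysets.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # plays the role of wg.Wait()
    return results, stats["peak"]
```

For example, run_querysets({f"queryset{i}": (lambda i=i: fake_count_query(i)) for i in range(10)}) starts ten threads but lets only five of them query at once.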

Parallel processing and querying SQL with dplyr or pool: MySQL server has gone away

There are a couple of earlier related questions, but none of them solves the issue for me:
https://dba.stackexchange.com/questions/160444/parallel-postgresql-queries-with-r
Parallel Database calls with RODBC
"foreach" loop : Using all cores in R (especially if we are sending sql queries inside foreach loop)
My use case is the following: I have a large database of data that needs to be plotted. Each plot takes a few seconds to create due to some necessary pre-processing of the data and the plotting itself (ggplot2). I need to do a large number of plots. My thinking is that I will connect to the database via dplyr without downloading all the data to memory. Then I have a function that fetches a subset of the data to be plotted. This approach works fine when using single-threading, but when I try to use parallel processing I run into SQL errors related to the connection MySQL server has gone away.
Now, I recently solved the same issue working in Python, in which case the solution was simply to kill the current connection inside the function, which forced the establishment of a new connection. I did this using connection.close() where connection is from Django's django.db.
My problem is that I cannot find an R equivalent of this approach. I thought I had found the solution when I found the pool package for R:
This package enables the creation of object pools for various types of
objects in R, to make it less computationally expensive to fetch one.
Currently the only supported pooled objects are DBI connections (see
the DBI package for more info), which can be used to query a database
either directly through DBI or through dplyr. However, the Pool class
is general enough to allow for pooling of any R objects, provided that
someone implements the backend appropriately (creating the object
factory class and all the required methods) -- a vignette with
instructions on how to do so will be coming soon.
My code is too large to post here, but essentially, it looks like this:
#libraries loaded as necessary
#connect to the db in some kind of way
#with dplyr
db = src_mysql(db_database, username = db_username, password = db_password)
#with RMySQL directly
db = dbConnect(RMySQL::MySQL(), dbname = db_database, username = db_username, password = db_password)
#with pool
db = pool::dbPool(RMySQL::MySQL(),
                  dbname = db_database,
                  username = db_username,
                  password = db_password,
                  minSize = 4)
#I tried all 3
#connect to a table
some_large_table = tbl(db, 'table')
#define the function
some_function = function(some_id) {
  #fetch data from table
  subtable = some_large_table %>% filter(id == some_id) %>% collect()
  #do something with the data
  something(subtable)
}
#parallel process
mclapply(vector_of_ids,
         FUN = some_function,
         mc.cores = num_of_threads)
The code you have above is not the equivalent of your Python code, and that is the key difference. What you did in Python is totally possible in R (see MWE below). However, the code you have above is not:
kill[ing] the current connection inside the function, which forced the establishment of a new connection.
What it is trying (and failing) to do is to make a database connection travel from the parent process to each child process opened by the call to mclapply. This is not possible. Database connections can never travel across process boundaries no matter what.
This is an example of the more general "rule" that a child process cannot affect the state of the parent process, period. For example, a child process cannot write to the parent's memory either, and you can’t plot (to the parent process’s graphics device) from those child processes.
In order to do the same thing you did in Python, you need to open a new connection inside of the function FUN (the second argument to mclapply) if you want it to be truly parallel. I.e. you have to make sure that the dbConnect call happens inside the child process.
This eliminates the point of pool (though it’s perfectly safe to use), since pool is useful when you reuse connections and generally want them to be easily accessible. For your parallel use case, since you can't cross process boundaries, this is useless: you will always need to open and close the connection for each new process, so you might as well skip pool entirely.
Here's the correct "translation" of your Python solution to R:
library(dplyr)
getById <- function(id) {
  # create a connection and close it on exit
  conn <- DBI::dbConnect(
    drv = RMySQL::MySQL(),
    dbname = "shinydemo",
    host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
    username = "guest",
    password = "guest"
  )
  on.exit(DBI::dbDisconnect(conn))
  # get a specific row based on ID
  conn %>% tbl("City") %>% filter(ID == id) %>% collect()
}
parallel::mclapply(1:10, getById, mc.cores = 12)

Update ttl for all records in aerospike

I'm stuck in the following situation: I initialised a namespace with default-ttl set to 30 days, and about 5 million records were written with that (30-day) TTL. My actual requirement is a TTL of zero (0); the 30-day value was left in place by oversight.
Now I want to update the existing 5 million records to the new TTL value (zero).
I've tried "set-disable-eviction true", but it does not work: data is still being removed according to the old TTL.
How do I get around this? (And how can I retrieve the data that has already been removed?)
Someone help me.
First, eviction and expiration are two different mechanisms. You can disable evictions in various ways, such as the set-disable-eviction config parameter you've used. You cannot disable the cleanup of expired records. There's a good knowledge base FAQ, "What are Expiration, Eviction and Stop-Writes?". Unfortunately, the expired records that have been cleaned up are gone if their void time is in the past. If those records were merely evicted (i.e. removed before their void time due to crossing the namespace high-water mark for memory or disk), you can cold restart your node, and those records with a future TTL will come back. They won't return if they were durably deleted or if their TTL is in the past (such records get skipped).
As for resetting TTLs, the easiest way would be to do this through a record UDF that is applied to all the records in your namespace using a scan.
The UDF for your situation would be very simple:
ttl.lua
function to_zero_ttl(rec)
  local rec_ttl = record.ttl(rec)
  if rec_ttl > 0 then
    record.set_ttl(rec, -1)
    aerospike:update(rec)
  end
end
In AQL:
$ aql
Aerospike Query Client
Version 3.12.0
C Client Version 4.1.4
Copyright 2012-2017 Aerospike. All rights reserved.
aql> register module './ttl.lua'
OK, 1 module added.
aql> execute ttl.to_zero_ttl() on test.foo
Using a Python script would be easier if you have more complex logic, with filters etc.
import time
import aerospike
from aerospike_helpers.operations import operations
# client, namespace and set_name are assumed to be defined already
zero_ttl_operation = [operations.touch(-1)]
query = client.query(namespace, set_name)
query.add_ops(zero_ttl_operation)
policy = {}
job = query.execute_background(policy)
print(f'executing job {job}')
while True:
    response = client.job_info(job, aerospike.JOB_SCAN, policy={'timeout': 60000})
    print(f'job status: {response}')
    if response['status'] != aerospike.JOB_STATUS_INPROGRESS:
        break
    time.sleep(0.5)
Aerospike v6 and Python SDK v7.

Exceptions in Yesod

I had made a daemon that used a very primitive form of ipc (telnet and send a String that had certain words in a certain order). I snapped out of it and am now using JSON to pass messages to a Yesod server. However, there were some things I really liked about my design, and I'm not sure what my choices are now.
Here's what I was doing:
buildManager :: Phase -> IO ()
buildManager phase = do
    let buildSeq = findSeq phase
        jid = JobID $ pack "8"
        config = MkConfig $ Just jid
    flip C.catch exceptionHandler $
        runReaderT (sequence_ $ buildSeq <*> stages) config
    -- ^^ I would really like to keep the above line of code, or something like it.
    return ()
each function in buildSeq looked like this
foo :: Stage -> ReaderT Config IO ()
data Config = MkConfig (Either JobID Product) BaseDir JobMap
JobMap is a TMVar Map that tracks information about current jobs.
so now, what I have are Handlers, that all look like this
foo :: Handler RepJson
foo represents a command for my daemon, each handler may have to process a different JSON object.
What I would like to do is send one JSON object that represents success, and another JSON object that expresses information about some exception.
I would like foo's helper function to be able to return an Either, but I'm not sure how I get that, plus the ability to terminate evaluation of my list of actions, buildSeq.
Here's the only choice I see
1) make sure exceptionHandler is in Handler. Put JobMap in the App record. Using getYesod alter the appropriate value in JobMap indicating details about the exception,
which can then be accessed by foo
Is there a better way?
What are my other choices?
Edit: For clarity, I will explain the role of Handler RepJson. The server needs some way to accept commands such as build stop report. The client needs some way of knowing the results of these commands. I have chosen JSON as the medium with which the server and client communicate with each other. I'm using the Handler type just to manage the JSON in/out and nothing more.
Philosophically speaking, in the Haskell/Yesod world you want to pass the values forward, rather than return them backwards. So instead of having the handlers return a value, have them call forwards to the next step in the process, which may be to generate an exception.
Remember that you can bundle any amount of future actions into a single object, so you can pass a continuation object to your handlers and foos that basically tells them, "After you are done, run this blob of code." That way they can be void and return nothing.
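A rough illustration of the "pass a continuation forward" idea, written in Python rather than Haskell to keep it short. Everything here (run_stages, ok_stage, failing_stage, the shape of the result dictionaries) is hypothetical and only meant to show the control flow: each stage either calls the continuation it was given or calls the failure handler, which ends the sequence early and produces the error object.

```python
def run_stages(stages, config, on_success, on_failure):
    """Run `stages` in order. Each stage receives two continuations:
    `next_step()` to carry on, and `fail(reason)` to stop the whole sequence."""

    def step(index):
        if index == len(stages):
            # Every stage called next_step: report success.
            return on_success({"status": "ok", "stages_run": index})
        return stages[index](
            config,
            lambda: step(index + 1),
            lambda reason: on_failure(
                {"status": "error", "stage": index, "reason": reason}
            ),
        )

    return step(0)


def ok_stage(config, next_step, fail):
    """A stage that does its work and passes control forward."""
    config.setdefault("log", []).append("ok_stage ran")
    return next_step()


def failing_stage(config, next_step, fail):
    """A stage that aborts the sequence instead of continuing."""
    return fail("tests failed")


# The two final continuations decide what is sent back; here they just
# return the dictionary that a handler could serialize as JSON.
result = run_stages([ok_stage, failing_stage, ok_stage], {}, dict, dict)
print(result)
```

Because the failing stage never calls next_step, the third stage is not run, which gives the "terminate evaluation of my list of actions" behaviour without exceptions.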