Does an increase in the number of mining pools decrease the block generation time? - ethereum

Can anyone clarify whether (or not) an increase in the number of mining pools in the Ethereum network will decrease the average block generation time? (For example, if another pool like ethermine joined the network today and started mining.) Since all the pools are competing with each other, I am getting confused.

No, block generation times are driven by the current difficulty of the algorithm used in the Proof of Work model. Only when a solution is found is the block accepted onto the chain, and the difficulty determines how long it will take to find that solution. This difficulty automatically adjusts to speed up or slow down block generation times.
From the mining section of the Ethereum wiki:
The proof of work algorithm used is called Ethash (a modified version of Dagger-Hashimoto) and involves finding a nonce input to the algorithm so that the result is below a certain threshold depending on the difficulty. The point in PoW algorithms is that there is no better strategy to find such a nonce than enumerating the possibilities, while verification of a solution is trivial and cheap. If outputs have a uniform distribution, then we can guarantee that on average the time needed to find a nonce depends on the difficulty threshold, making it possible to control the time of finding a new block just by manipulating difficulty.
The difficulty dynamically adjusts so that on average one block is produced by the entire network every 12 seconds.
(Note that the current block generation time is closer to 15 seconds. You can find the block generation times on Etherscan)
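To see why a new pool only matters temporarily, consider the expected block time for the network as a whole (a back-of-the-envelope view, not an exact protocol rule):

$$ \mathbb{E}[t_{\text{block}}] \approx \frac{D}{H} $$

where $D$ is the current difficulty (roughly the expected number of hashes needed to land below the threshold) and $H$ is the total network hashrate. A new pool raises $H$, which briefly lowers the average block time; the difficulty adjustment then raises $D$ until the average is back near its target.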

Related

Defining observation space and reward for traffic signal phase optimization for reinforcement learning

I am trying to use Reinforcement Learning for traffic signal phase optimization for improving traffic flow at intersections.
I am aware that in practice we won't be able to get the information about all the vehicles in each of the lanes.
If we use a camera for getting information about the queue length, then we can get accurate data only up to, say, 200 meters.
Should I take this into consideration while defining my observation space or can I directly use the data from sumo?
Furthermore, what should be the ideal observation space for such a task?
sumo_rl allows using various metrics for reward calculation, such as the pressure metric, queue length metric, etc. What would be a good choice of reward for my use case, or what factors should I consider while defining my reward?
I have tried getting metrics such as throughput, lane delay and queue length from the e2 detector's output file. For the agent, however, I might not be able to use them (as the traci/sumo wrappers offer better implementations?). So how do I use traci to get this information?
Yes, you should try to match your observation space as closely to the real world as possible. SUMO can also filter the data directly (for instance with an E3 detector).
If you want to maximize flow, then the reward should also include the flow metric (throughput). It's quite easy to get it via traci (as you already noticed), but I cannot tell how it integrates with your framework since you did not give details about it.
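As a concrete illustration (not specific to your setup), here is a minimal traci loop that accumulates throughput and reads queue lengths from E2 (lane-area) detectors; the config file name and detector IDs are placeholders you would replace with your own:

```python
import traci

# placeholders - replace with your own SUMO config and E2 detector IDs
SUMO_CMD = ["sumo", "-c", "intersection.sumocfg"]
E2_DETECTORS = ["e2_north", "e2_south", "e2_east", "e2_west"]

traci.start(SUMO_CMD)
throughput = 0
for step in range(3600):
    traci.simulationStep()
    # vehicles that completed their trip in this step -> cumulative throughput
    throughput += traci.simulation.getArrivedNumber()
    # per-detector queue measurements from the E2 (lane-area) detectors
    halting = [traci.lanearea.getLastStepHaltingNumber(d) for d in E2_DETECTORS]
    jam_m = [traci.lanearea.getJamLengthMeters(d) for d in E2_DETECTORS]
traci.close()
```

Reading these values through traci each step (instead of post-processing the detector output file) lets you feed them into the observation and reward at decision time.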

How to process a task of arbitrary size using CUDA?

I'm starting to learn CUDA, and have to dive straight into a project, so I currently am lacking a solid theoretical background; I'll be picking it up along the way.
While I understand that the way the hardware is built requires the programmer to deal with thread blocks and grids, I haven't been able to find an answer to the following questions in my introductory book:
What happens when the task size is greater than the number of threads a GPU can process at a time? Will the GPU then proceed through the array the same way a CPU would, i.e. sequentially?
Thus, should I worry if the number of thread blocks that a given task requires exceeds the number that can run simultaneously on the GPU? I've found the notion of a "thread block limit" so far, and it's obviously higher than what a GPU can be processing at a given moment in time; is that the real (and only) limit I should be concerned with?
Other than choosing the right block size for the given hardware, are there any problems to consider when setting up a kernel for execution? I'm at a loss regarding launching a task of arbitrary size. I've even considered going with OpenCL instead of CUDA, because there appears to be no explicit block size calculation involved when launching a kernel to execute over an array.
I'm fine with this being closed as duplicate in case it is, just be sure to point at the original question.
The number of thread blocks can be arbitrary. The hardware can handle them sequentially if the number is large. This link gives you a basic view.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#scalable-programming-model
On the other hand, you could use a limited number of threads to handle tasks of arbitrary size by increasing the work per thread. This link shows you how to do that and why it is better.
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
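The linked post shows the pattern in CUDA C; purely as an illustrative sketch of the same idea in Python (using Numba's CUDA support, with arbitrary array size and launch configuration chosen here for the example), a grid-stride kernel looks like this:

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    # grid-stride loop: each thread starts at its global index and hops by the
    # total number of threads in the grid, so any array size is covered
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, x.shape[0], stride):
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = cuda.to_device(np.ones(n, dtype=np.float32))
y = cuda.to_device(np.ones(n, dtype=np.float32))
out = cuda.device_array(n, dtype=np.float32)

# far fewer threads than elements: 128 blocks * 256 threads = 32768 threads for 1e6 items
saxpy[128, 256](np.float32(2.0), x, y, out)
result = out.copy_to_host()  # result[i] == 3.0 for every i
```

Because each thread covers multiple elements, the same launch configuration works whether n is a thousand or a billion; you do not need to recompute block counts per task size.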
You may want to read the following two for a full answer.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

How to correctly calculate RAM requirement for a bucket in Couchbase

We have a bucket of about 34 million items in a Couchbase cluster setup of 6 AWS nodes. The bucket has been allocated 32.1GB of RAM (5482MB per node) and is currently using 29.1GB. If I use the formula provided in the Couchbase documentation (http://docs.couchbase.com/admin/admin/Concepts/bp-sizingGuidelines.html) it should use approx. 8.94GB of RAM.
Am I calculating it incorrectly? Below is link to google spreadsheet with all the details.
https://docs.google.com/spreadsheets/d/1b9XQn030TBCurUjv3bkhiHJ_aahepaBmFg_lJQj-EzQ/edit?usp=sharing
Assuming that you indeed have a working set of 0.5%, which, as Kirk pointed out in his comment, is odd but not impossible, then you are calculating the result of the memory sizing formula correctly. However, it's important to understand that the formula is not a hard and fast rule that fits all situations. Rather, it's a general guideline and serves as a good starting point for you to begin your performance tests. Also, keep in mind that RAM sizing isn't the only consideration for deciding on cluster size, because you also have to consider data safety, total disk write throughput, network bandwidth, CPU, how much a single node failure affects the rest of the cluster, and more.
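For reference, here is a small sketch of how that sizing formula is typically evaluated, roughly following the guideline page you linked. The per-item sizes below are invented placeholders (your spreadsheet has the real ones), so only the structure matters here:

```python
# Rough sketch of the Couchbase sizing formula; key/value sizes are placeholders.
items           = 34_000_000
key_size        = 36            # bytes per document ID (placeholder)
value_size      = 1_024         # bytes per document value (placeholder)
replicas        = 1             # replica copies (placeholder)
working_set_pct = 0.005         # 0.5% of documents actively accessed, as in the question
metadata        = 56            # bytes of per-document metadata held in RAM (check your version)
headroom        = 0.25          # extra room for fragmentation, rebalance, etc.
high_water_mark = 0.85          # fraction of the quota the cache is allowed to fill

copies         = 1 + replicas
total_metadata = items * (metadata + key_size) * copies
total_dataset  = items * value_size * copies
working_set    = total_dataset * working_set_pct

ram_quota = (total_metadata + working_set) * (1 + headroom) / high_water_mark
print("cluster RAM quota ~ %.2f GiB" % (ram_quota / 2**30))
```

Divide the cluster quota by the number of nodes to get the per-node bucket quota, and remember the result is only as good as the working-set and size estimates that go into it.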
Using the result of the RAM sizing formula as a starting point, you should now actually test whether your working assumptions were correct. That means putting real (or close to representative) load on the bucket and seeing whether the percentage of cache misses is low enough and the operation latency is within your acceptable limits. There is no general rule for this; what's acceptable to some applications might be too slow for others.
Just as an example, if you see that under load your cache miss ratio is 5% and, while the average read latency is 3 ms, the top 1% latency is 100 ms, then you have to consider whether having one out of every 100 reads take that much longer is acceptable in your application. If it is, great; if not, you need to start increasing the RAM size until it matches your actual working set. Similarly, you should keep an eye on disk throughput, CPU usage, etc.

Utilizing GPU worth it?

I want to compute the trajectories of particles subject to certain potentials, a typical N-body problem. I've been researching methods for utilizing a GPU (CUDA for example), and they seem to benefit simulations with large N (20000). This makes sense since the most expensive calculation is usually finding the force.
However, my system will have "low" N (less than 20), many different potentials/factors, and many time steps. Is it worth it to port this system to a GPU?
Based on the Fast N-Body Simulation with CUDA article, it seems that it is efficient to have different kernels for different calculations (such as acceleration and force). For systems with low N, it seems that the cost of copying to/from the device is actually significant, since for each time step one would have to copy and retrieve data from the device for EACH kernel.
Any thoughts would be greatly appreciated.
If you have less than 20 entities that need to be simulated in parallel, I would just use parallel processing on an ordinary multi-core CPU and not bother about using GPU.
Using a multi-core CPU would be much easier to program and avoid the steps of translating all your operations into GPU operations.
Also, as you already suggested, the performance gain using GPU will be small (or even negative) with this small number of processes.
There is no need to copy results from the device to host and back between time steps. Just run your entire simulation on the GPU and copy results back only after several time steps have been calculated.
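To make that concrete, here is a hedged sketch (Python with Numba's CUDA support, a toy 2-D gravitational-style force, single block) in which the particle state lives on the device for the whole run and only crosses the bus once at the end:

```python
import math
import numpy as np
from numba import cuda

N, STEPS, DT = 16, 10_000, 1e-3   # small N, many time steps, as in the question

@cuda.jit
def step(pos, vel, dt):
    i = cuda.grid(1)
    n = pos.shape[0]
    ax = ay = 0.0
    if i < n:
        for j in range(n):                     # brute-force O(N^2) is cheap for N < 20
            if j != i:
                dx = pos[j, 0] - pos[i, 0]
                dy = pos[j, 1] - pos[i, 1]
                r2 = dx * dx + dy * dy + 1e-9  # softening avoids division by zero
                inv_r3 = 1.0 / (r2 * math.sqrt(r2))
                ax += dx * inv_r3
                ay += dy * inv_r3
    cuda.syncthreads()                         # all threads have read the old positions (single block)
    if i < n:
        vel[i, 0] += ax * dt
        vel[i, 1] += ay * dt
        pos[i, 0] += vel[i, 0] * dt
        pos[i, 1] += vel[i, 1] * dt

pos = cuda.to_device(np.random.rand(N, 2).astype(np.float32))
vel = cuda.to_device(np.zeros((N, 2), dtype=np.float32))
for _ in range(STEPS):
    step[1, N](pos, vel, DT)                   # the data stays on the device between steps
final_pos = pos.copy_to_host()                 # a single transfer at the very end
```

For N this small the per-launch overhead will dominate the arithmetic, which is exactly why the other answer suggests a multi-core CPU may win; the point here is only that the data does not have to move across the bus every step.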
For how many different potentials do you need to run simulations? Enough to just use the structure from the N-body example and still load the whole GPU?
If not, and assuming the potential calculation is expensive, I think it would be best to use one thread for each pair of particles in order to make the problem sufficiently parallel. If you use one block per potential setting, you can then write the forces out to shared memory, call __syncthreads(), and use a subset of the block's threads (one per particle) to sum the forces. Call __syncthreads() again, and continue with the next time step.
If the potential calculation is not expensive, it might be worth exploring first where the main cost of your simulation is.
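If you do go down the pair-per-thread route, a sketch of that layout (same toy force as above, one block, pairwise forces staged in shared memory, hypothetical pair-index arrays passed in from the host) might look like this:

```python
import math
import numpy as np
from numba import cuda, float32

N = 16                           # particles; a compile-time constant so the shared array has a fixed size
PAIRS = N * (N - 1) // 2         # one thread per unordered pair
DT = 1e-3

# precompute the (i, j) indices of every unordered pair on the host
pair_i, pair_j = np.triu_indices(N, k=1)

@cuda.jit
def pair_step(pos, vel, pi, pj, dt):
    # f[i, j] holds the force on particle i from particle j (2-D components)
    f = cuda.shared.array((N, N, 2), dtype=float32)
    t = cuda.threadIdx.x
    if t < PAIRS:                              # phase 1: one thread per pair
        i = pi[t]
        j = pj[t]
        dx = pos[j, 0] - pos[i, 0]
        dy = pos[j, 1] - pos[i, 1]
        r2 = dx * dx + dy * dy + 1e-9
        inv_r3 = 1.0 / (r2 * math.sqrt(r2))
        f[i, j, 0] = dx * inv_r3
        f[i, j, 1] = dy * inv_r3
        f[j, i, 0] = -dx * inv_r3              # Newton's third law
        f[j, i, 1] = -dy * inv_r3
    cuda.syncthreads()
    if t < N:                                  # phase 2: one thread per particle sums its forces
        ax = ay = 0.0
        for j in range(N):
            if j != t:
                ax += f[t, j, 0]
                ay += f[t, j, 1]
        vel[t, 0] += ax * dt
        vel[t, 1] += ay * dt
        pos[t, 0] += vel[t, 0] * dt
        pos[t, 1] += vel[t, 1] * dt

pos = cuda.to_device(np.random.rand(N, 2).astype(np.float32))
vel = cuda.to_device(np.zeros((N, 2), dtype=np.float32))
pi_d = cuda.to_device(pair_i)
pj_d = cuda.to_device(pair_j)
for _ in range(10_000):
    pair_step[1, PAIRS](pos, vel, pi_d, pj_d, DT)  # single block; blockIdx.x could select a potential setting
final_pos = pos.copy_to_host()
```

With multiple potential settings, blockIdx.x could select the parameter set for each block, which is one way to keep more of the GPU busy as the answer above suggests.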

"Work stealing" vs. "Work shrugging"?

Why is it that I can find lots of information on "work stealing" and nothing on "work shrugging" as a dynamic load-balancing strategy?
By "work-shrugging" I mean pushing surplus work away from busy processors onto less loaded neighbours, rather than have idle processors pulling work from busy neighbours ("work-stealing").
I think the general scalability should be the same for both strategies. However, I believe it is much more efficient, in terms of latency and power consumption, to wake an idle processor when there is definitely work for it to do, rather than have all idle processors periodically polling all neighbours for possible work.
Anyway, a quick Google search didn't turn up anything under the heading of "Work Shrugging" or similar, so any pointers to prior art and the jargon for this strategy would be welcome.
Clarification
I actually envisage the work-submitting processor (which may or may not be the target processor) being responsible for looking around the immediate locality of the preferred target processor (based on data/code locality) to decide whether a near neighbour should be given the new work instead, because it has less work to do.
I don't think the decision logic would require much more than an atomic read of the immediate (typically 2 to 4) neighbours' estimated queue length. I do not think this is any more coupling than is implied by the thieves polling and stealing from their neighbours. (I am assuming "lock-free, wait-free" queues in both strategies.)
Resolution
It seems that what I meant (but only partially described!) as the "Work Shrugging" strategy is in the domain of "normal" upfront scheduling strategies that happen to be smart about processor, cache & memory loyalty, and scalable.
I find plenty of references searching on these terms and several of them look pretty solid. I will post a reference when I identify one that best matches (or demolishes!) the logic I had in mind with my definition of "Work Shrugging".
Load balancing is not free; it carries the cost of a context switch (into the kernel), of finding the idle processors, and of choosing work to reassign. Especially on a machine where tasks switch all the time, dozens of times per second, this cost adds up.
So what's the difference? Work-shrugging means you further burden over-provisioned resources (busy processors) with the overhead of load-balancing. Why interrupt a busy processor with administrivia when there's a processor next door with nothing to do? Work stealing, on the other hand, lets the idle processors run the load balancer while busy processors get on with their work. Work-stealing saves time.
Example
Consider: Processor A has two tasks assigned to it. They take time a1 and a2, respectively. Processor B, nearby (the distance of a cache bounce, perhaps), is idle. The processors are identical in all respects. We assume the code for each task and the kernel is in the i-cache of both processors (no added page fault on load balancing).
A context switch of any kind (including load-balancing) takes time c.
No Load Balancing
The time to complete the tasks will be a1 + a2 + c. Processor A will do all the work, and incur one context switch between the two tasks.
Work-Stealing
Assume B steals a2, incurring the context switch time itself. The work will be done in max(a1, a2 + c) time. Suppose processor A begins working on a1; while it does that, processor B will steal a2 and avoid any interruption in the processing of a1. All the overhead on B is free cycles.
If a2 + c is no greater than a1, you have effectively hidden the cost of the context switch in this scenario; the total time is just a1.
Work-Shrugging
Assume B completes a2, as above, but A incurs the cost of moving it ("shrugging" the work). The work in this case will be done in max(a1, a2) + c time; the context switch is now always in addition to the total time, instead of being hidden. Processor B's idle cycles have been wasted, here; instead, a busy processor A has burned time shrugging work to B.
I think the problem with this idea is that it makes the threads with actual work to do waste their time constantly looking for idle processors. Of course there are ways to make that faster, like have a queue of idle processors, but then that queue becomes a concurrency bottleneck. So it's just better to have the threads with nothing better to do sit around and look for jobs.
The basic advantage of 'work stealing' algorithms is that the overhead of moving work around drops to 0 when everyone is busy. So there's only overhead when some processor would otherwise have been idle, and that overhead cost is mostly paid by the idle processor with only a very small bus-synchronization related cost to the busy processor.
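For what it's worth, the pull model is simple to sketch. The toy Python below uses locks rather than the lock-free deques a real scheduler (and the question) assumes, but it shows the key property: busy workers never spend cycles on balancing, only idle ones go looking.

```python
import random
import threading
import time
from collections import deque

class Worker:
    def __init__(self, idx, workers):
        self.idx = idx
        self.workers = workers        # shared list of all workers
        self.tasks = deque()          # owner pops from the right, thieves steal from the left
        self.lock = threading.Lock()

    def push(self, task):
        with self.lock:
            self.tasks.append(task)

    def pop(self):
        with self.lock:
            return self.tasks.pop() if self.tasks else None

    def steal(self):
        with self.lock:
            return self.tasks.popleft() if self.tasks else None

    def run(self, stop):
        while not stop.is_set():
            task = self.pop()         # busy path: just keep draining the local queue
            if task is None:          # idle path: the *idle* worker pays for load balancing
                victim = random.choice(self.workers)
                if victim is not self:
                    task = victim.steal()
            if task:
                task()
            else:
                time.sleep(0.001)     # nothing to steal either; back off briefly

# usage: all work is submitted to worker 0, the other three balance it by stealing
stop = threading.Event()
workers = []
workers.extend(Worker(i, workers) for i in range(4))
for n in range(20):
    workers[0].push(lambda n=n: print("task", n, "done"))
threads = [threading.Thread(target=w.run, args=(stop,)) for w in workers]
for t in threads:
    t.start()
time.sleep(0.5)
stop.set()
for t in threads:
    t.join()
```

The shrugging variant would move the victim search into push(), i.e. onto the busy submitter, which is exactly the overhead the answers here object to.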
Work stealing, as I understand it, is designed for highly-parallel systems, to avoid having a single location (single thread, or single memory region) responsible for sharing out the work. In order to avoid this bottleneck, I think it does introduce inefficiencies in simple cases.
If your application is not so parallel that a single point of work distribution causes scalability problems, then I would expect you could get better performance by managing it explicitly as you suggest.
No idea what you might google for though, I'm afraid.
Some issues... if a thread is busy, wouldn't you want it to spend its time processing real work instead of speculatively looking for idle threads to offload onto?
How does your thread decide when it has so much work that it should stop doing that work to look for a friend that will help?
How do you know that the other threads don't have just as much work and you won't be able to find a suitable thread to offload onto?
Work stealing seems more elegant, because it solves the same problem (contention) in a way that guarantees the threads doing the load balancing are only doing the load balancing while they would otherwise have been idle.
It's my gut feeling that what you've described will not only be much less efficient in the long run, but will require lots of per-system tweaking to get acceptable results.
In your edit, though, you suggest that you want the submitting processor to handle this, not the worker threads as you suggested earlier and in some of the comments here. If the submitting processor is searching for the lowest queue length, you're potentially adding latency to the submit, which isn't really desirable.
But more importantly, it's a supplementary technique to work-stealing, not a mutually exclusive one. You've potentially alleviated some of the contention that work-stealing was invented to control, but you still have a number of things to tweak before you'll get good results; those tweaks won't be the same for every system, and you still risk running into situations where work-stealing would help you.
I think your edited suggestion, with the submission thread doing "smart" work distribution, is potentially a premature optimization against work-stealing. Are your idle threads slamming the bus so hard that your non-idle threads can't get any work done? That's when it's time to optimize work-stealing.
So, in contrast to "Work Stealing", what is really meant here by "Work Shrugging" is a normal upfront work scheduling strategy that is smart about processor, cache & memory loyalty, and scalable.
Searching on combinations of the terms/jargon above yields many substantial references to follow up. Some address the added complication of machine virtualisation, which wasn't in fact a concern of the questioner, but the general strategies are still relevant.