looking for simple cluster configuration - google-compute-engine

I am using compute engine for embarrassingly parallel scientific calculations. Some of my calculations require a single core and some require 64-cores machines. I am currently using my own scripts: I have a qsub-like command that creates a new instance with the required number of cores, booting it from a custom image with the pre-installed software, connects to a storage bucket via gcsfuse, runs the required command and then kills the instance after it's done.
Do I really need to do all of that with my own scripts, or is there any tool that I should use instead? I'd much rather use some ready made tool for all of the management.
My usage fluctuates widely (hundreds of cores in parallel for 3 hours, then 2 days with nothing, etc). So I don't want constant sized machines: I like to be billed by the minute for my computations.

You may want to use auto-scaling feature for managed instance group in Google Compute Engine(GCE). This feature adds more instances to your instance group when there is more load (upscaling), and removes instances when there is less load (downscaling). Moreover, you can define autoscaling policy based upon CPU utilization, or Load balancer utilization or request per seconds. Please refer autoscaler decisions document to understand decisions that autoscaler might make when scaling instance groups.

Related

Compute Engine - Automatic scale

I have one VM Compute Engine to host simple apps. My apps is growing and the number of users too.
Now my users work basicaly from 08:00 AM to 07:00 PM, in this period the usage os CPU and Memory is High and the speed of work is very important.
I'm preparing to expand the memory and processor in the next days, but i search a more scalable and cost efective way.
Is there a way for automatic add resources when i need and reduce after no more need?
Thanks
The cost of running your VMs is directly related to a number of different factors i.e. the type of network in use (premium vs standard), the machine type, the boot disk image you use (premium vs open-source images) and the region/zone where your workloads are running, among other things.
Your use case seems to fit managed instance groups (MIGs). With MIGs you essentially configure a template for VMs that share the same attributes. During the configuration of your MIG, you will be able to specify the CPU/memory limit beyond which the MIG autoscaler will kick off. When your CPU/memory reading goes below that threshold, MIG scales your VMs down to the number of instances specified in your template.
You can also use requests per second as a threshold for autoscaling and I would recommend you explore the docs to know more about it.
See docs

Benefit of running Kubernetes in bare metal and cloud with idle VM or machines?

I want to know the high level benefit of Kubernetes running in bare metal machines.
So let's say we have 100 bare metal machines ready with kubelet being deployed in each. Doesn't it mean that when the application only runs on 10 machines, we are wasting the rest 90 machines, just standing by and not using them for anything?
For cloud, does Kubernetes launch new VMs as needed, so that clients do not pay for idle machines?
How does Kubernetes handle the extra machines that are needed at the moment?
Yes, if you have 100 bare metal machines and use only 10, you are wasting money. You should only deploy the machines you need.
The Node Autoscaler works at certain Cloud Providers like AWS, GKE, or Open Stack based infrastructures.
Now, Node Autoscaler is useful if your load is not very predictable and/or scales up and down widely over the course of a short period of time (think Jobs or cyclic loads like a Netflix type use case).
If you're running services that just need to scale eventually as your customer base grows, that is not so useful as it is as easy to simply add new nodes manually.
Kubernetes will handle some amount of auto-scaling with an assigned number of nodes (i.e. you can run many Pods on one node, and you would usually pick your machines to run in a safe range but still allow handling of spikes in traffic by spinning more Pods on those nodes.
As a side note: with bare metal, you typically gain in performance, since you don't have the overhead of a VM / hypervisor, but you need to supply distributed storage, which a cloud provider would typically provide as a service.

Google Compute Engine auto scaling based on queue length

We host our infrastructure on Google Compute Engine and are looking into Autoscaling for groups of instances. We do a lot of batch processing of binary data from a queue. In our case, this means:
When a worker is processing data the CPU is always 100%
When the queue is empty we want to terminate all workers
Depending on the length of the queue we want a certain amount of workers
However I'm finding it hard to figure out a way to auto-scale this on Google Compute Engine because they appear to scale on instance-only metrics such as CPU. From the documentation:
Not all custom metrics can be used by the autoscaler. To choose a
valid custom metric, the metric must have all of the following
properties:
The metric must be a per-instance metric.
The metric must be a valid utilization metric, which means that data from the metric can be used to proportionally scale up or down
the number of virtual machines.
If I'm reading the documentation properly this makes it hard to use the auto scaling on a global queue length?
Backup solutions
Write a simple auto-scale handler using the Google Cloud API to create or destroy new workers using Instances API
Write a simple auto-scale handler using instance groups and then manually insert/remove instances using the InstanceGroups: insert
Write a simple auto-scaling handler using InstangeGroupManagers: resize
Create a custom per-instance metric which measures len(queue)/len(workers) on all workers
As of February 2018 (Beta) this is possible via "Per-group metrics" in stackdriver.
Per-group metrics allow autoscaling with a standard or custom metric
that does not export per-instance utilization data. Instead, the group
scales based on a value that applies to the whole group and
corresponds to how much work is available for the group or how busy
the group is. The group scales based on the fluctuation of that group
metric value and the configuration that you define.
More information at https://cloud.google.com/compute/docs/autoscaler/scaling-stackdriver-monitoring-metrics#per_group_metrics
The how-to is too long to post here.
As far as I understand this is not implemented yet (as at January 2016). At the moment autoscaling is only targeted at web serving scenarios, where you want to serve web pages/other web services from your machines and keep some reasonable headroom (e.g. in terms of CPU or other metrics) for spikes in traffic. Then the system will adjust the number of instances/VMs to match your target.
You are looking for autoscaling for batch processing scenarios, and this is not catered for at the moment.

Performing a distributed CUDA/OpenCL based password cracking

Is there a way to perform a distributed (as in a cluster of a connected computers) CUDA/openCL based dictionary attack?
For example, if I have a one computer with some NVIDIA card that is sharing the load of the dictionary attack with another coupled computer and thus utilizing a second array of GPUs there?
The idea is to ensure a scalability option for future expanding without the need of replacing the whole set of hardware that we are using. (and let's say cloud is not an option)
This is a simple master / slave work delegation problem. The master work server hands out to any connecting slave process a unit of work. Slaves work on one unit and queue one unit. When they complete a unit, they report back to the server. Work units that are exhaustively checked are used to estimate operations per second. Depending on your setup, I would adjust work units to be somewhere in the 15-60 second range. Anything that doesn't get a response by the 10 minute mark is recycled back into the queue.
For queuing, offer the current list of uncracked hashes, the dictionary range to be checked, and the permutation rules to be applied. The master server should be able to adapt queues per machine and per permutation rule set so that all machines are done their work within a minute or so of each other.
Alternately, coding could be made simpler if each unit of work were the same size. Even then, no machine would be idle longer than the amount of time for the slowest machine to complete one unit of work. Size your work units so that the fastest machine doesn't enter a case of resource starvation (shouldn't complete work faster than five seconds, should always have a second unit queued). Using that method, hopefully your fastest machine and slowest machine aren't different by a factor of more than 100x.
It would seem to me that it would be quite easy to write your own service that would do just this.
Super Easy Setup
Let's say you have some GPU enabled program X that takes a hash h as input and a list of dictionary words D, then uses the dictionary words to try and crack the password. With one machine, you simply run X(h,D).
If you have N machines, you split the dictionary into N parts (D_1, D_2, D_3,...,D_N). Then run P(x,D_i) on machine i.
This could easily be done using SSH. The master machine splits the dictionary up, copies it to each of the slave machines using SCP, then connects to the slaves and tells them to run the program.
Slightly Smarter Setup
When one machine cracks the password, they could easily notify the master that they have completed the task. The master then kills the programs running on the other slaves.

Does there exist an open-source distributed logging library?

I'm talking about a library that would allow me to log events from different machines and would align these events on a "global" time axis with sufficiently high precision.
Actually, I'm asking because I've written such a thing myself in the course of a cluster computing project, I found it terrifically useful, and I was surprised that I couldn't find any analogues.
Therefore, the point is whether something like this exists (and I better contribute to it) or nothing exists (and I better write an open-source analogue of my solution).
Here are the features that I'd expect from such a library:
Independence on the clock offset between different machines
Timing precision on the order of at least milliseconds, preferably microseconds
Scalability to thousands of concurrent logging processes, with at least several megabytes of aggregated logs per second
Soft real-time operation (t.i. I don't want to collect 200 big logs from 200 machines and then compute clock offsets and merge them - I want to see what happens "live", perhaps with a small lag like 10s)
Facebook's contribution in the matter is called 'Scribe'.
Excerpt:
Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups.
...
Scribe is implemented as a thrift service using the non-blocking C++ server. The installation at facebook runs on thousands of machines and reliably delivers tens of billions of messages a day.
The API is Thrift-based, so you have a good platform coverage, but in case you're looking for simple integration for Java you may want to have a look at Digg's log4j appender for Scribe.
You could use log4j/log4net targeting a central syslog daemon. log4j has a builtin SyslogAppender, and in log4net you can do it as shown here. log4cpp docs here.
There are Windows implementations of Syslog around if you don't have a Unix system to hand for this.
Use Chukwa, Its Open source and Large scale Log Monitoring System