What's the difference between a Dataproc cluster on GKE vs Compute Engine?

We can now create Dataproc clusters using Compute Engine or GKE. What are the major advantages of creating a cluster on GKE vs Compute Engine? We have hit an "insufficient resources in zone" error multiple times while creating clusters on Compute Engine. Will using GKE for the cluster solve this issue, and what is the cost difference between them?

To solve the error message “insufficient resources in zone”, you may refer to this GCP documentation.
To answer your question about the difference between a Dataproc cluster on GKE vs GCE:
On GKE, you can create a Dataproc cluster directly on a Kubernetes cluster and run your deployments there.
You may also check the advantages of GKE in the documentation, or review the GKE features below:
- Run your apps on a fully managed Kubernetes cluster with GKE Autopilot.
- Start quickly with single-click clusters and scale up to 15,000 nodes.
- Leverage a high-availability control plane, including multi-zonal and regional clusters.
- Eliminate operational overhead with industry-first four-way autoscaling.
- Secure by default, including vulnerability scanning of container images and data encryption.
On GCE, by contrast, Dataproc provisions and manages plain Compute Engine VMs for you; there is no Kubernetes layer, and if you wanted one on GCE you would have to install and operate it yourself before deploying anything on it.
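To make the distinction concrete, here is a minimal sketch using the gcloud CLI. The cluster names, region, and GKE pool settings are hypothetical placeholders, and the gke create subcommand and its flags are taken from the Dataproc-on-GKE documentation; verify them against your gcloud version:

    # Dataproc on Compute Engine: Dataproc provisions and manages the VMs itself.
    gcloud dataproc clusters create my-gce-cluster \
        --region=us-central1 \
        --num-workers=2

    # Dataproc on GKE: the virtual cluster is deployed onto a GKE cluster you already run.
    gcloud dataproc clusters gke create my-gke-cluster \
        --region=us-central1 \
        --gke-cluster=my-existing-gke-cluster \
        --spark-engine-version=latest \
        --pools='name=dp,roles=default'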

Related

Google Cloud Compute no storage left

Hi, when I try to SSH to a Google Cloud VM instance it doesn't connect, and when I check the logs it says there is no storage available.
But when I connect using the Google Cloud console it connects, and when I check the storage there is enough storage.
Also, one thing: my current persistent disk is 20 GB, but here it shows twice that amount. If anyone can explain what's going on, it would help me out a lot.
The output that you are posting is from Cloud Shell.
When you start Cloud Shell, it provisions a g1-small Google Compute Engine virtual machine running a Debian-based Linux operating system. Cloud Shell instances are provisioned on a per-user, per-session basis. The instance persists while your Cloud Shell session is active; after an hour of inactivity, your session terminates and its VM is discarded. For more on usage quotas, refer to the limitations guide.
With the default Cloud Shell experience, you are allocated an ephemeral, pre-configured VM, and the environment you work with is a Docker container running on that VM. You can also choose to use a custom environment to save your configurations, in which case your environment will be your very own custom Docker image.
Cloud Shell provisions 5 GB of free persistent disk storage mounted as your $HOME directory on the virtual machine instance.
As Travis mentioned, running df -h --total in Cloud Shell reports Cloud Shell's storage, not the VM's.
Here you can find a related SO question with possible solutions to fix your issue: Disk is full, and I can't SSH to instance.
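To see the difference yourself, you can compare the two filesystems directly; a small sketch (the instance name and zone are placeholders):

    # Inside Cloud Shell: this reports the 5 GB Cloud Shell home disk, not your VM's disk.
    df -h --total

    # SSH into the actual VM and check its disk from there.
    gcloud compute ssh my-instance --zone=us-central1-a
    df -h --total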

Dataproc cluster on Google Cloud

My understanding is that the benefit of running a Dataproc cluster, instead of setting up your own Compute Engine cluster, is that it takes care of installing the storage connector (and other connectors). What else does it do for you?
The most significant feature of Dataproc beyond a DIY cluster is the ability to submit jobs (Hadoop & Spark jars, Hive queries, etc.) via an API, web UI, and CLI without configuring tricky network firewalls and exposing YARN to the world.
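For example, a Spark job submission from the CLI can be as small as this (the cluster name and region are placeholders; the example jar ships on standard Dataproc images):

    # Submit a Spark job to a running Dataproc cluster; no YARN endpoint is exposed.
    gcloud dataproc jobs submit spark \
        --cluster=my-cluster \
        --region=us-central1 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 1000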
Cloud Dataproc also takes care of a lot of configuration and initialization, such as setting up a shared Hive Metastore for Hive and Spark, and it allows specifying Hadoop, Spark, etc. properties at boot time.
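As a sketch of that boot-time configuration (the cluster name, region, and property values are illustrative assumptions, not recommendations):

    # Set engine properties at cluster creation; the prefix selects the config file.
    gcloud dataproc clusters create my-tuned-cluster \
        --region=us-central1 \
        --properties='spark:spark.executor.memory=4g,yarn:yarn.nodemanager.resource.memory-mb=8192'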
It boots a cluster in ~90 s, which in my experience is faster than most cluster setups. This allows you to tear down the cluster when you no longer need it, without having to wait tens of minutes to bring a new one up.
I'd encourage you to look at a more comprehensive list of features.

Google Container Engine Architecture

I was exploring the architecture of Google's IaaS/PaaS offerings, and I am confused as to how GKE (Google Container Engine) runs in Google data centers. From this article (http://www.wired.com/2012/07/google-compute-engine/) and also from some of the Google I/O 2012 sessions, I gathered that GCE (Google Compute Engine) runs the provisioned VMs using KVM (Kernel-based Virtual Machine); these VMs run inside Google's cgroups-based containers (this allows Google to schedule user VMs the same way they schedule their existing container-based workloads, probably using Borg/Omega). Now how does Kubernetes figure into this, given that it makes you run Docker containers on GCE-provisioned VMs, and not on bare metal? If my understanding is correct, then Kubernetes-scheduled Docker containers run inside KVM VMs, which themselves run inside Google cgroups containers scheduled by Borg/Omega...
Also, how does Kubernetes networking fit into Google's existing GCE Andromeda software-defined networking?
I understand that this is a very low-level architectural question, but I feel that understanding the internals will improve my understanding of how user workloads eventually run on bare metal. Also, I'm curious whether the whole scheme of running containers on VMs inside containers is necessary from a performance point of view. E.g., doesn't networking performance degrade with multiple layers? Google mentions in its Borg paper (http://research.google.com/pubs/archive/43438.pdf) that they run their container-based workloads without a VM (they don't want to pay the "cost of virtualization"); I understand the logic of running public external workloads in VMs (better isolation, a more familiar model, heterogeneous workloads, etc.), but with Kubernetes, can our workloads not be scheduled directly on bare metal, just like Google's own workloads?
It is possible to run Kubernetes on both virtual and physical machines; see this link. Google Cloud Platform only offers virtual machines as a service, and that is why Google Container Engine is built on top of virtual machines.
In Borg, containers allow arbitrary sizes, and they don't pay any resource penalties for odd-sized tasks.

Can I install MySQL on the VMs provided in Azure Cloud Services?

From what I gather, the only way to use a MySQL database with Azure websites is to use ClearDB, but can I install MySQL on the VMs provided in Azure Cloud Services? And if so, how?
This question might get closed and moved to ServerFault (where it really belongs). That said: ClearDB provides MySQL-as-a-Service in Azure. It has nothing to do with what you can install in your own Virtual Machines. You can absolutely do a VM-based MySQL install (or any other database engine that you can install on Linux or Windows). In fact, the Azure portal even has a tutorial for a MySQL installation on OpenSUSE.
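As a minimal sketch of that VM-based route, shown on Ubuntu for brevity rather than the OpenSUSE tutorial mentioned above (resource names are placeholders, the commands use the modern Azure CLI rather than the portal workflow, and the UbuntuLTS image alias may differ by CLI version):

    # Create a Linux VM, then install MySQL on it over SSH.
    az vm create --resource-group my-rg --name my-mysql-vm \
        --image UbuntuLTS --admin-username azureuser --generate-ssh-keys
    ssh azureuser@<vm-public-ip>
    sudo apt-get update && sudo apt-get install -y mysql-server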
If you're referring to installing in web/worker roles: this simply isn't a good fit for database engines, due to:
- the need to completely script/automate the install with zero interaction (which might take a long time), including downloading/installing all necessary software to the VM images every time a new instance is spun up;
- the likely inability of a database cluster to cope with arbitrary scale-out (the typical use case for web/worker roles): clusters may or may not work well when scaling out (adding a VM), and the same goes for scaling in (removing a VM);
- a less-optimal attached-storage configuration;
- the inability to use Linux VMs.
So, assuming you're still OK with Virtual Machines (vs stateless Cloud Service VMs), you'll need to carefully plan your deployment, with decisions such as:
- Distro (Ubuntu, CentOS, etc.); see the Azure-supported Linux distro list here
- Selecting the proper VM size (the DS series provides SSD attached-disk support; the G series scales to 448 GB RAM)
- Azure Storage attached disks being non-Premium or Premium (Premium disks are SSD-backed, durable disks scaling to 1 TB / 5,000 IOPS per disk, up to 32 disks per VM depending on VM size)
- Virtual network configuration (for a multi-node cluster)
- Accessibility of the database cluster (whether your app is in the vnet or accesses it through a public endpoint; if the latter, setting up ACLs)
- Backup / HA / DR planning
Someone else mentioned using a pre-built VM image from VM Depot. Just realize that, if you go that route, you're relying on someone else to configure the database engine install for you. This may or may not be optimal for what you're trying to achieve. And the images may or may not be up-to-date with the latest versions, patches, etc.
Of course, what I wrote applies to any database engine you install in your own virtual machines, where a service provider (such as ClearDB) tends to take care of most of these things for you.
If you are talking about standard VMs, then you can use a pre-built image from VM Depot for that.
If you are talking about web or worker roles (PaaS), I wouldn't recommend it, but if you really want to, you could. You would need to fully script the install of the solution on the host. The one downside (and it's a big one) is that the role will be moved to a new host at some point, which would mean your MySQL data files would be lost; if you backed up frequently and were happy to lose some data, then this option may work for you.
I think the main question is: what do you want to achieve? As I see it, you want to use a PaaS solution with Web Apps or a Cloud Service, and you need a MySQL database. If so, you have two options (both technically feasible, as David Makogon said). The first is to deploy your own (single) server with MySQL and connect to it from the outside (the internet side). The second is to create a MySQL server or cluster and connect your application internally within an Azure virtual network. With a Cloud Service this is simple, but with a Web App it is not: you must create a VPN gateway in an Azure VM and connect your Web App to this gateway. This way you will have an internal connection from your application to your own MySQL cluster.

Google Compute Engine as an alternative to Amazon Web Services (EC2, ELB, etc...)

I am trying to evaluate Google Compute Engine (GCE) for a cloud project in our company. We have some experience working with Amazon Web Services, but would like to know if GCE is a better alternative for our project.
I have the following questions; our choice for the project will be based on the answers, so please help me with these queries.
1. Is there an equivalent of AWS Route 53 and Elastic Load Balancer on Google Cloud? If they are not available, then how do we load balance GCE instances?
2. Is there a concept like regions (such as us-east-coast-1, us-west-coast-1, etc.)? This is helpful in making sure that the service is not affected during natural calamities.
3. Is there an equivalent of CloudWatch to help us auto-scale Compute Engine instances based on load?
4. Can we set up a private cloud on Google Cloud Platform?
5. Can we get persistent public IP addresses for GCE instances?
6. Are there any advantages (in terms of tighter integration or pricing) when using Google services such as Google Analytics, YouTube, DoubleClick, etc.?
Load Balancing
Google Cloud Platform's Compute Engine (GCE) recently added a load balancing feature. It's lower-level than ELB (it only supports UDP/TCP, not HTTP(S)).
Regions
GCE has feature parity here: AWS Regions correspond to GCE regions, and AWS Availability Zones to GCE zones.
Autoscaling (CloudWatch)
Google Compute Engine does not have autoscaling, but Google App Engine does. Third-party tools such as Scalr or RightScale are, however, compatible with Google Compute Engine.
Disclaimer: I do work at Scalr.
Private Cloud
Did you mean dedicated instances? Those are not available in GCE.
If you meant a VPC, then you can use GCE networks to achieve isolation. You'll also want to disable ephemeral external IP addresses for the instances you want to isolate.
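A rough sketch of that isolation with the gcloud CLI (network and instance names are placeholders, and the syntax shown is the current one, which postdates this answer):

    # Create a dedicated network and launch an instance on it with no external IP.
    gcloud compute networks create my-isolated-net --subnet-mode=auto
    gcloud compute instances create my-private-vm \
        --zone=us-central1-a \
        --network=my-isolated-net \
        --no-address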
Persistent IPs
GCE has persistent IPs; they are called "reserved addresses".
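Reserving one and attaching it at instance creation might look like this (the names, region, and zone are placeholders):

    # Reserve a static external IP, then assign it to a new instance by name.
    gcloud compute addresses create my-static-ip --region=us-central1
    gcloud compute instances create my-frontend-vm \
        --zone=us-central1-a \
        --address=my-static-ip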
Integration with other services
You will likely get better latency to Google services you use in your backend (I recall a couple of presentations at Google I/O talking about Google App Engine + BigQuery).
For frontend services (Google Analytics), you'll likely see no benefit, since this depends on your users, not your servers.