dataproc cluster on google cloud - google-compute-engine

My understanding is that the advantage of running a Dataproc cluster instead of setting up your own Compute Engine cluster is that it takes care of installing the storage connector (and other connectors). What else does it do for you?

The most significant feature of Dataproc beyond a DIY cluster is the ability to submit jobs (Hadoop and Spark jars, Hive queries, etc.) via an API, web UI, and CLI without configuring tricky network firewalls or exposing YARN to the world.
Cloud Dataproc also takes care of a lot of configuration and initialization, such as setting up a shared Hive Metastore for Hive and Spark, and it allows specifying Hadoop, Spark, etc. properties at boot time.
It boots a cluster in ~90s, which in my experience is faster than most cluster setups. This lets you tear down the cluster when you no longer need it without having to wait tens of minutes to bring a new one up.
I'd encourage you to look at a more comprehensive list of features.
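As a rough illustration of two of the points above (properties set when the cluster is created, and jobs submitted through the CLI instead of by opening YARN ports), here is a minimal Python sketch that only assembles `gcloud` command lines. The cluster name, region, jar path, main class, and property value are hypothetical placeholders, and the sketch prints the commands rather than running them.

```python
import shlex


def create_cluster_cmd(name, region, properties=None):
    """Build a `gcloud dataproc clusters create` command line.

    `properties` maps "prefix:key" to a value, for example
    {"spark:spark.executor.memory": "4g"}.
    """
    cmd = ["gcloud", "dataproc", "clusters", "create", name, f"--region={region}"]
    if properties:
        joined = ",".join(f"{key}={value}" for key, value in sorted(properties.items()))
        cmd.append(f"--properties={joined}")
    return cmd


def submit_spark_job_cmd(cluster, region, main_class, jar, job_args=()):
    """Build a `gcloud dataproc jobs submit spark` command line."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={jar}",
    ]
    if job_args:
        cmd.append("--")  # everything after "--" is passed to the job itself
        cmd.extend(job_args)
    return cmd


if __name__ == "__main__":
    # Hypothetical names throughout; nothing is executed here.
    print(shlex.join(create_cluster_cmd(
        "demo-cluster", "us-central1", {"spark:spark.executor.memory": "4g"})))
    print(shlex.join(submit_spark_job_cmd(
        "demo-cluster", "us-central1", "com.example.WordCount",
        "gs://example-bucket/wordcount.jar", ["gs://example-bucket/input"])))
```

Building the argument list separately from running it keeps the sketch safe to try without a Google Cloud project; passing the list to `subprocess.run` would be the next step.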

What's the difference between a Dataproc cluster on GKE vs Compute Engine?

We can now create Dataproc clusters on either Compute Engine or GKE. What are the major advantages of creating a cluster on GKE vs Compute Engine? We have hit the "insufficient resources in zone" error multiple times while creating clusters on Compute Engine. Will using GKE for the cluster solve this issue, and what is the cost difference between the two?
To resolve the "insufficient resources in zone" error, you may refer to this GCP documentation.
To answer your question about the difference between a Dataproc cluster on GKE and one on GCE:
In GKE, you can directly create a Dataproc cluster and do your deployments.
You may also check the advantages of GKE in the documentation, or see the GKE features below:
Run your apps on a fully managed Kubernetes cluster with GKE autopilot.
Start quickly with single-click clusters and scale up to 15000 nodes
Leverage a high-availability control plane including multi-zonal and regional clusters
Eliminate operational overhead with industry-first four-way auto scaling
Secure by default, including vulnerability scanning of container images and data encryption
While in GCE, you have to manually install Kubernetes, then create your Dataproc cluster and do your deployments.

Can I install MySQL on the VMs provided in Azure Cloud Services?

From what I gather, the only way to use a MySQL database with Azure Websites is to use ClearDB, but can I install MySQL on the VMs provided in Azure Cloud Services? And if so, how?
This question might get closed and moved to ServerFault (where it really belongs). That said: ClearDB provides MySQL-as-a-Service in Azure. It has nothing to do with what you can install in your own Virtual Machines. You can absolutely do a VM-based MySQL install (or any other database engine that you can install on Linux or Windows). In fact, the Azure portal even has a tutorial for a MySQL installation on OpenSUSE.
If you're referring to installing in web/worker roles: This simply isn't a good fit for database engines, due to:
the need to completely script/automate the install with zero interaction (which might take a long time). This includes all necessary software being downloaded/installed to the vm images every time a new instance is spun up.
the likely inability for a database cluster to cope with arbitrary scale-out (the typical use case for web/worker roles). Database clusters may or may not work well when a scale-out occurs (adding an additional vm). Same thing when scaling in (removing a vm).
less-optimal attached-storage configuration
inability to use Linux VMs
So, assuming you're still ok with Virtual Machines (vs stateless Cloud Service vm's): You'll need to carefully plan your deployment, with decisions such as:
Distro (Ubuntu, CentOS, etc). Azure-supported Linux distro list here
Selecting proper VM size (the DS series provide SSD attached disk support; the G series scale to 448GB RAM)
Azure Storage attached disks being non-Premium or Premium (premium disks are SSD-backed, durable disks scaling to 1TB/5000 IOPS per disk, up to 32 disks per VM depending on VM size)
Virtual network configuration (for multi-node cluster)
Accessibility of database cluster (whether your app is in the vnet or accesses it through a public endpoint; and if the latter, setting up ACL's)
Backup / HA / DR planning
Someone else mentioned using a pre-built VM image from VM Depot. Just realize that, if you go that route, you're relying on someone else to configure the database engine install for you. This may or may not be optimal for what you're trying to achieve. And the images may or may not be up-to-date with the latest versions, patches, etc.
Of course, what I wrote applies to any database engine you install in your own virtual machines, where a service provider (such as ClearDB) tends to take care of most of these things for you.
If you are talking about standard VMs, then you can use a pre-built image from VM Depot for that.
If you are talking about web or worker roles (PaaS), I wouldn't recommend it, but if you really want to, you could. You would need to fully script the install of the solution on the host. The only downside (and it's a big one) is that the instance will be moved to a new host at some point, which would mean your MySQL data files would be lost. If you backed up frequently and were happy to lose some data, then this option may work for you.
I think the main question is "what do you want to achieve?" As I see it, you want to use a PaaS solution with Web Apps or Cloud Services, and you need a MySQL database. If so, you have two options (both technically possible, as David Makogon said). The first is to deploy your own single server with MySQL and connect to it from the outside (over the internet). The second is to create a MySQL server or cluster and connect your application to it internally over an Azure virtual network. With Cloud Services this is simple, but with Web Apps it is not: you must create a VPN gateway on an Azure VM and connect your Web App to this gateway. This way you will have an internal connection from your application to your own MySQL cluster.

Integrate HDFS in Red Hat OpenShift or in the Infrastructure?

I have a cluster of five virtual machines (with the KVM hypervisor), and I want to find the best way to integrate HDFS in order to optimize storage management of data.
Since HDFS is a distributed file system that allows clients to access a file in parallel, I want to take advantage of this feature.
So, is it possible to install HDFS in the cluster to manage the disk space of the VMs, or to integrate it in OpenShift to manage the data of PaaS end users?
If you are thinking of using this with OpenShift Origin or OpenShift Enterprise then you can just expose the HDFS to the OpenShift nodes as a user disk space and they can use it. Remember when you install OpenShift on your own infrastructure you can expose any file system you want as long as you can normally do it for Linux users.

Manual deployment vs. Amazon Elastic Beanstalk

What advantages do we get by using Elastic Beanstalk over manually creating an EC2 instance, setting up a Tomcat server, deploying, etc. for a typical Java web application? Are load balancing, monitoring, and autoscaling the only advantages?
Suppose that, for my web application which uses a database, I installed the database on the EC2 instance itself. When autoscaling takes place, will the database be created on the newly created instance, or will that instance access the database I created on the master instance? If it just creates a replica when autoscaling happens, how will the data be synced between the instances?
All the things you mentioned like load balancing, monitoring and auto-scaling are definitely advantages.
However, you have to kind of think about it this way: In a true Platform as a Service (PAAS), the goal is to separate the application from the platform. As a developer, you only worry about your application. The platform is "rented" to you. The platform "instances" are automatically updated, administered, scaled, balanced, etc. for you. You just upload your WAR file and it just works (at least theoretically).
EC2 by itself is not PAAS. It is more like IAAS (Infrastructure as a Service). You still have to take care of the server instances, install software on them, keep them updated, etc.
Elastic Beanstalk is a PAAS system. So are App Engine and Azure among many others.
In a true PAAS system, the DBMS is a separate component from the web application server(s). The reason is simple: the DBMS cannot be installed on the instances that run the application server because, as instances are created and destroyed based on your traffic, the data on them would be lost! Having the DBMS and application server on the same machine/instance is generally not a good idea anyway.
In a PAAS system, the DBMS is a separate service. For Amazon, it would be Amazon RDS. Just like with Elastic Beanstalk, where you don't have to worry about the application server and you just upload your WAR file, with RDS, you don't have to worry about the DBMS and you just deploy your database(s).
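One way to keep the application instances stateless, as described above, is to have the application read the database location from its environment instead of assuming a local DBMS. The following is a minimal sketch, assuming the `RDS_HOSTNAME`, `RDS_PORT`, and `RDS_DB_NAME` properties that Elastic Beanstalk provides when an RDS instance is attached to the environment. The fallback values and the example host name are hypothetical, and Python is used here only for illustration; the same idea applies in a Java application.

```python
import os


def jdbc_url_from_env(env=None):
    """Build a MySQL JDBC URL from environment-style settings.

    `env` is any mapping; it defaults to the process environment.
    The fallbacks (localhost, 3306, appdb) are illustrative only.
    """
    env = os.environ if env is None else env
    host = env.get("RDS_HOSTNAME", "localhost")
    port = env.get("RDS_PORT", "3306")
    name = env.get("RDS_DB_NAME", "appdb")
    return f"jdbc:mysql://{host}:{port}/{name}"


if __name__ == "__main__":
    # Hypothetical settings, as they might look on one application instance.
    settings = {
        "RDS_HOSTNAME": "mydb.example.internal",
        "RDS_PORT": "3306",
        "RDS_DB_NAME": "shop",
    }
    print(jdbc_url_from_env(settings))
```

Because every instance resolves the same external database, instances can be added or removed by autoscaling without any data being created or lost on them.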
Elastic Beanstalk and RDS work very well together, especially when deployed in the same availability zone, where the latency would be very low.
Finally, using Elastic Beanstalk doesn't cost anything more than the deployed resources (EC2 instances and the load balancer). However, RDS is not cheap and would definitely be more expensive than using a single EC2 instance for both the application server and the DBMS.
Elastic Beanstalk does more than just load balancing, monitoring, and autoscaling.
1) Manages application versions by storing and managing different versions of your application, allowing you to easily switch back and forth between different versions of your applications.
2) Has the concept of "environments" for each application, allowing you to deploy different versions of your application in each environment. This is handy for example if you want to set up separate QA and DEV environments, and you want to easily deploy a build first in DEV then deploy the same version of the application in QA when your QA team is ready for the next build.
3) Externalizes the important container configuration properties (Tomcat memory settings, for example) to the Elastic Beanstalk console and API. Because of this you can easily save the settings and copy them between environments.
4) Lets you view application log files through the console and automatically rolls and archives log files to S3. (Admittedly this feature is currently a little weak.)
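The relationship between application versions and environments described in points 1) and 2) can be pictured with a toy in-memory model. This is only a sketch of the concept, not the Elastic Beanstalk API; the class, method names, and WAR file names are all hypothetical.

```python
class Application:
    """Toy model of stored versions and per-environment deployments."""

    def __init__(self):
        self.versions = {}  # version label -> artifact name (e.g. a WAR file)
        self.history = {}   # environment name -> labels deployed, oldest first

    def upload(self, label, artifact):
        """Store a version so it can be deployed to any environment later."""
        self.versions[label] = artifact

    def deploy(self, environment, label):
        """Deploy an already uploaded version to one environment."""
        if label not in self.versions:
            raise KeyError(f"unknown version: {label}")
        self.history.setdefault(environment, []).append(label)

    def current(self, environment):
        """Return the label currently deployed in the environment."""
        return self.history[environment][-1]

    def rollback(self, environment):
        """Switch the environment back to the previously deployed version."""
        deployed = self.history[environment]
        if len(deployed) < 2:
            raise ValueError("no earlier version to roll back to")
        deployed.pop()
        return deployed[-1]


if __name__ == "__main__":
    app = Application()
    app.upload("1.0", "shop-1.0.war")
    app.upload("1.1", "shop-1.1.war")
    app.deploy("DEV", "1.0")
    app.deploy("QA", "1.0")
    app.deploy("DEV", "1.1")  # DEV moves ahead; QA stays on 1.0
    print(app.current("DEV"), app.current("QA"))  # prints "1.1 1.0"
```

The point of the model is that versions are stored once and environments only reference them, which is what makes promoting the same build from DEV to QA, or switching back, cheap.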
I had an app deployed both on a dedicated EC2 instance (Nginx & Gunicorn) and in a Beanstalk environment (CentOS & Apache2).
My observations:
Beanstalk is PaaS. Manually creating an EC2 instance (IaaS) is like doing everything from scratch, but you have solid control.
Beanstalk comes with CentOS and Apache (httpd) by default. You can choose the OS on a dedicated instance.
These are the things that mattered to me:
There were lots of 504 errors showing up in the Beanstalk environment.
It was difficult to debug when the Beanstalk server crashed, as logs would not show up and I could not SSH into the machine. This is very important.
Installing and configuring tools like Celery, Redis (which needs to run on another port), etc. is a lot easier on a dedicated instance.
In my case, I had to scale up the Beanstalk server in order to install some packages (like pandoc). These things are simpler on Ubuntu.
Scaling is a lot easier in Beanstalk. Cloning servers is straightforward in Beanstalk.
I used a micro instance in both cases (dedicated and Beanstalk). I felt the dedicated micro instance was better.
Deployment is automated in Beanstalk. On the dedicated instance I had to write scripts to automate the same, which is fine, since it is only done once.

Java EE application deployment on Amazon EC2

We have a Java EE application (EAR file deployed on JBoss, MySQL, MongoDB) which we would like to deploy on an Amazon EC2 instance. I have several questions regarding deployment best practices.
What is the most commonly used Linux AMI which we can rely on for a robust deployment (There are so many Linux variants, and I am not sure which AMI is commonly used, is it Fedora, CentOS, Red Hat, SUSE ...)
How do we handle production upgrades (EAR file modifications or schema upgrades). Are there any tools which are available to handle this installation or rollback of these changes.
What kind of data backup capability is available for the database?
Should I rely on Amazon RDS for MySQL support?
How should I handle support for MongoDB?
This is the first time I am hosting a web app, and I would appreciate some input on how to manage the production instance.
1) I agree with Mark Robinson's answer: Use whichever Unix variant you're most comfortable with. It may pay to pick one with decent cloud support. For my site I use Ubuntu.
2) I have a common image which is the base of every version deploy I do. I have www.mysite.com pointing to an Elastic IP so I can decide which instance it goes to. The common image has all the software I need installed (Postgres/Postgis/Tomcat/etc.), but the database and web server data folders are symlinked to Elastic Block Store (EBS) volumes.
When it comes time to do a deploy, I start a new instance up, freeze and snapshot the EBS volumes on production, and make new volumes from the snapshots. I point my new instance at the new volumes and then install whatever I need onto that. Once I've smoke tested everything successfully, I can switch the Elastic IP to point to the new instance and everything keeps on going.
I'll note that I currently have the advantage that only I can modify the database; no users can. This will become a problem shortly.
3) If you use the XFS filesystem on top of the EBS volume, then you can tell XFS to freeze the file system (so no updates happen), call the EC2 API to snapshot the volume, and then unfreeze the file system. The result is that the snapshot is taken quickly and sent to S3. I have a nightly script which does this.
4) If RDS looks like it will suit your needs then use it. Amazon is building lots of solid tools quickly and this will ease your scalability issues if you have any.
5) I'm sorry, I have no idea.
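The XFS freeze, snapshot, unfreeze sequence from the backup point above could be scripted roughly as follows. This is a sketch under assumptions, not the answerer's nightly script: it uses the `xfs_freeze` utility and the AWS CLI's `aws ec2 create-snapshot` command, and the mount point and volume ID are hypothetical. The command runner is injectable so the sequence can be dry-run without root access or AWS credentials.

```python
import subprocess


def snapshot_frozen_volume(mount_point, volume_id, run=subprocess.check_call):
    """Freeze an XFS filesystem, start an EBS snapshot, then unfreeze.

    `run` is called with one argv list per command and should raise
    on failure (as subprocess.check_call does).
    """
    run(["xfs_freeze", "-f", mount_point])  # suspend writes to the filesystem
    try:
        run([
            "aws", "ec2", "create-snapshot",
            "--volume-id", volume_id,
            "--description", f"nightly snapshot of {mount_point}",
        ])
    finally:
        # Always unfreeze, even if the snapshot call fails, so the
        # filesystem is never left frozen.
        run(["xfs_freeze", "-u", mount_point])


if __name__ == "__main__":
    # Dry run with hypothetical values: print the commands instead of running them.
    snapshot_frozen_volume(
        "/vol", "vol-0123456789abcdef0",
        run=lambda argv: print(" ".join(argv)),
    )
```

The `try`/`finally` is the important design choice: the filesystem only needs to stay frozen while the snapshot is being initiated, and it must be thawed on every code path.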
Good question!
1) I would recommend going with whatever Linux variant you are most comfortable with. If you have someone who is really keen on CentOS, go with that. Once you have selected your AMI, customize it by configuring it how you want it. Then save that AMI as your base layout. It will make rolling out new machines much easier and save your bacon if EC2 goes down.
2) Upgrades with EC2 can be tres cool. Instead of upgrading a live system, take your pre-configured AMI, update that, and save that AMI as myAMI-1.1 (or whatever). That way, you can flip over to the new system almost instantly AND roll back to a previous version in case something breaks. You can also back up DB instances to S3. It's cheap at about $0.10/GB/Month.
3) It depends where you are storing your DB. If you are storing it on your EC2 instance you are in trouble. The EC2 instances have no persistent storage, so if your machine crashes, you lose everything. I'm not familiar with Amazon's DB system, but you should also look into Elastic Block Store. It's basically an actual hard drive you can write to. When you want to upgrade your schema, do a full DB dump to S3 and then upgrade your actual schema. If something goes wrong, you can pull the previous version out of S3.
4) & 5) I have never used those so I can't help you.
1) Any Linux AMI will do the job; all you need is a JRE (assuming development work is not required). If you need to monitor JVM behavior, then get JConsole installed.
2) The easiest and most painless way is to SSH into the home directory, transfer the updated class files or EAR file (depending on the number of changes applied), copy them over the old ones in the Tomcat deployment directory, and restart the server. (Make sure you have tested locally before uploading to production.)
3) It depends on which database you are using. If you are using MySQL, just schedule a backup that writes to your home directory, so that from time to time you can SSH in and download a copy for backup purposes.
4) I would not consider relying on Amazon RDS for MySQL support, for two reasons: MySQL is small enough to be manageable, and I would want to have complete control of the database. Why pay more when you can do it yourself free of charge?
5) The usage of MongoDB should be aligned with the purpose of your application and the benefits you gain from it. I would recommend using MongoDB for static data retrieval (state, country, area, etc.) and MySQL for transaction data only.
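For the scheduled MySQL backup suggested above, a small helper can generate a timestamped dump command that a cron job would then execute. This is a minimal sketch: the database name and backup directory are hypothetical, credentials are assumed to come from the MySQL client's own option file, and `--single-transaction` only gives a consistent dump for transactional tables such as InnoDB.

```python
from datetime import datetime
from pathlib import PurePosixPath


def backup_command(database, backup_dir, now=None):
    """Return (argv, target_path) for a timestamped mysqldump run."""
    now = now or datetime.now()
    # A sortable timestamp keeps the dump files in chronological order by name.
    target = PurePosixPath(backup_dir) / f"{database}-{now:%Y%m%d-%H%M%S}.sql"
    argv = [
        "mysqldump",
        "--single-transaction",     # consistent dump without locking InnoDB tables
        f"--result-file={target}",  # write the dump directly to the target file
        database,
    ]
    return argv, target


if __name__ == "__main__":
    # Hypothetical database and directory; the command is printed, not run.
    argv, target = backup_command("shop", "/home/ec2-user/backups")
    print(" ".join(argv))
```

Keeping the dumps only in the instance's home directory still leaves them on the same machine as the database, so downloading them (or copying them elsewhere) remains part of the routine, as the answer notes.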
If you can live with deploying your Java EE application on TomEE instead of JBoss, Boxfuse does what you want.
For your Java EE application you literally only have to execute the following (TomEE uses WAR files instead of EAR files):
boxfuse run my-tomee-app-1.0.war -env=prod
This will
Create an AMI containing TomEE and your application, ready to boot
Create an Elastic IP or ELB
Create a security group with the correct ports defined
Create an auto-scaling group
Launch your instance(s)
Any subsequent update will be done as a zero downtime blue/green deployment.
More info: https://boxfuse.com/blog/javaee-aws