How can I deploy an HDP cluster without HDFS? I don't want HDFS for storage and will be using an in-house in-memory storage system instead. How can this be done?
HDFS and MapReduce are the main internal parts of Hadoop; they come built into the Hadoop package. You cannot exclude HDFS during HDP cluster deployment. You can exclude services other than HDFS, but you cannot exclude HDFS itself.
HDFS is an implementation of the Hadoop FileSystem API, which models POSIX file system behavior.
You are probably thinking of object stores, often referred to as blob stores. Hadoop does provide FileSystem client classes for some of these, even though they violate many of the API's requirements. This is why, although Hadoop can read and write data in an object store, the two that Hadoop ships with direct support for, Amazon S3 and OpenStack Swift, cannot be used as direct replacements for HDFS.
For further reading, see Object Stores vs Filesystems.
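To illustrate what "read and write data in an object store" looks like in practice, here is a minimal PySpark sketch that goes through the s3a connector instead of hdfs://. It assumes the hadoop-aws module and AWS credentials are already configured on the cluster, and the bucket and paths are made-up placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-store-io").getOrCreate()

# Read from the object store through the Hadoop FileSystem API (s3a scheme)
df = spark.read.parquet("s3a://my-bucket/input/")

# Write results back; note that s3a lacks HDFS semantics such as atomic rename
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")

This works for batch reads and writes, but it does not make the object store behave like HDFS for components that depend on full filesystem semantics.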
I am trying to load JSON messages into a Postgres database, using the Postgres sink connector. I have been reading online and have only found the option of having the schema in the JSON message; however, ideally, I would like not to include the schema in the message. Is there a way to register the JSON schema in the schema registry and use that like it's done with Avro?
Also, I'm currently running Kafka by downloading the bin, as I had several problems with running Kafka Connect on Docker due to ARM compatibility issues. Is there a similar install for the Schema Registry? I'm only finding the option of downloading it through Confluent and running it on Docker. Is it possible to run only the Schema Registry with Docker, keeping my current setup?
Thanks
JSON without schema
The JDBC sink requires a schema
Is there a way to register the JSON schema in the schema registry and use that like it's done with Avro?
Yes, the Registry supports JSONSchema (as well as Protobuf), in addition to Avro. This requires you to use a specific serializer; you cannot just send plain JSON to the topic.
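As a rough sketch of what that looks like with the confluent-kafka Python client (the registry URL, broker address, topic name, and schema below are placeholders):

from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.json_schema import JSONSerializer

# JSON Schema describing the record; it gets registered in the Registry on first use
schema_str = """
{
  "type": "object",
  "properties": {
    "id":   {"type": "integer"},
    "name": {"type": "string"}
  },
  "required": ["id", "name"]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
json_serializer = JSONSerializer(schema_str, registry)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "value.serializer": json_serializer,
})

# The serializer prepends the schema ID to each message, so the sink connector
# (configured with the matching JSON Schema converter) can recover the schema
# without it being embedded in every message.
producer.produce(topic="my_topic", value={"id": 1, "name": "alice"})
producer.flush()

On the Connect side, the connector then needs the corresponding JSON Schema converter pointed at the same registry, rather than the plain JSON converter.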
currently running kafka by downloading the bin... Is there a similar install for schema registry?
The Confluent Schema Registry is not distributed as a standalone package outside of Docker. You'd have to download the Confluent Platform in place of Kafka, copy over your existing zookeeper.properties and server.properties into it, and then run the Schema Registry. Otherwise, compile it from source and build a standalone distribution with mvn -Pstandalone package.
There are other registries as well, such as Apicurio.
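Whichever route you take for the Confluent Schema Registry, you can sanity-check that it is reachable from your existing non-Docker setup with a plain HTTP call. This sketch assumes the default port 8081:

import requests

# List the subjects currently registered (an empty list on a fresh registry)
resp = requests.get("http://localhost:8081/subjects")
resp.raise_for_status()
print(resp.json())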
I am using SSIS to upload multiple files to Azure Blob Storage. The requirement is that when any of the file uploads fails, we need to roll back the transaction. I have tried the transaction option in SSIS, but so far I have not been able to roll back data from Blob Storage.
Has anyone tried a rollback option with Azure Blob Storage? Please let me know your thoughts on this.
Thanks
Vidya
There are two options. If you are importing .csv or AVRO files, you can use the SQL Server Integration Services (SSIS) Feature Pack for Azure, where:
The Azure Blob Source component enables an SSIS package to read data from an Azure blob. The supported file formats are: CSV and AVRO.
If you are not working with these file formats, then consider the solution offered in this SO thread: How to transaction rollback in ssis?, which uses Sequence Containers.
The SSIS Feature Pack for Azure contains connectors for various other data services and might be useful if you are going to be consuming additional services in the future.
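Keep in mind that Azure Blob Storage itself has no transactional upload across multiple blobs, so any "rollback" ultimately has to be compensating logic that deletes whatever was already uploaded. Outside of SSIS, here is a hedged illustration of that compensation with the azure-storage-blob Python SDK (the connection string, container, and file names are placeholders):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("my-container")

uploaded = []
try:
    for path in ["file1.csv", "file2.csv", "file3.csv"]:
        with open(path, "rb") as data:
            container.upload_blob(name=path, data=data)
        uploaded.append(path)
except Exception:
    # "Rollback": delete every blob that made it up before the failure
    for name in uploaded:
        container.delete_blob(name)
    raise

A Sequence Container in SSIS plays the same role: group the uploads, and on failure run a cleanup task that removes the partially uploaded blobs.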
My understanding is that the benefit of running a Dataproc cluster, instead of setting up your own Compute Engine cluster, is that it takes care of installing the storage connector (and other connectors). What else does it do for you?
The most significant feature of Dataproc beyond a DIY cluster is the ability to submit Jobs (Hadoop & Spark jars, Hive queries etc.) via an API, WebUI and CLI without configuring tricky network firewalls and exposing YARN to the world.
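For example, a job submission through that API is a small request against the regional endpoint, with no SSH access or exposed YARN UI required. A rough sketch with the google-cloud-dataproc Python client (the project, region, cluster name, and GCS path are placeholders):

from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Describe a PySpark job to run on an existing cluster
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/wordcount.py"},
}

client.submit_job(project_id="my-project", region=region, job=job)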
Cloud Dataproc also takes care of a lot of configuration and initialization, such as setting up a shared Hive metastore for Hive and Spark, and it allows specifying Hadoop, Spark, etc. properties at boot time.
It boots a cluster in ~90s, which in my experience is faster than most cluster setups. This allows you to tear down the cluster when you no longer need it, without having to wait tens of minutes to bring a new one up.
I'd encourage you to look at a more comprehensive list of features.
It seems that SQLAlchemy can connect to a MySQL table running on Google Cloud SQL. However, I have spent time looking for a wrapper for Google Cloud Bigtable, a NoSQL database, and could not find anything mature enough.
I'm just wondering how to manage Google Cloud Bigtable from SQLAlchemy.
There is a Python API for connecting to Google Cloud Bigtable:
https://googlecloudplatform.github.io/google-cloud-python/stable/
The google-cloud library is pip install-able:
$ pip install google-cloud
Cloud Datastore
from google.cloud import datastore

# Credentials and project are picked up from the environment
client = datastore.Client()

# An incomplete key of kind 'Person'; Datastore assigns the ID on save
key = client.key('Person')
entity = datastore.Entity(key=key)
entity['name'] = 'Your name'
entity['age'] = 25

# Persist the entity
client.put(entity)
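There is also a google-cloud-bigtable client in the same library family. A rough sketch of writing a row with it, assuming an existing Bigtable instance, table, and a 'cf1' column family (all names below are placeholders), looks like this:

from google.cloud import bigtable

# Connect to an existing Bigtable instance and table
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

# Write a single cell into the 'cf1' column family
row = table.direct_row(b"person#1")
row.set_cell("cf1", b"name", b"Your name")
row.commit()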
However, these client libraries are still not integrated with SQLAlchemy, and it is not clear that a schema could easily be mapped onto them.
This is not possible, because SQLAlchemy can only manage SQL-based RDBMS-type systems, while Bigtable (and HBase) are NoSQL, non-relational systems.
Here's my detailed response on a feature request that was filed for the Google Cloud Python library project which has more context and alternative suggestions:
The integration between SQLAlchemy and Google Cloud Bigtable would have to be done in SQLAlchemy. I was going to file a bug on SQLAlchemy on your behalf, but it looks like you've already filed a feature request and it was closed as wontfix:
unfortunately Google bigtable is non-relational and non-SQL, SQLAlchemy does not have support for key/value stores.
and a previous email thread on the sqlalchemy list about adding support for NoSQL databases like HBase (which is very similar to Bigtable) ended up without any answers.
Thus, I am afraid we won't be able to help you use SQLAlchemy together with Bigtable.
That said, as an alternative, consider using Apache Hue, which works with Apache HBase and can be made to work similarly with Bigtable. We don't have a simple howto for how to connect Apache Hue to Cloud Bigtable yet, but I imagine it can be done as follows:
Option 1: Apache Hue -> (a: Thrift API) -> Apache HBase Thrift proxy -> (b: gRPC API) -> Google Cloud Bigtable
The first connection (a) should work out-of-the-box for Hue and HBase. The second connection (b) can use the Google Cloud Bigtable Java client for HBase. This is not as complicated as it looks, although there are several parts to connect together to make it all work.
Option 2: Apache Hue -> (gRPC API) -> Google Cloud Bigtable
This could be done using the Google Cloud Bigtable Java client for HBase, but it requires Apache Hue to use the HBase 1.x API (which I believe is not yet the case; I believe it's currently using the 0.9x API and/or Thrift), so I would recommend following option (1) above for now instead.
Hope this is helpful.
I have a cluster of five virtual machines (with the KVM hypervisor), and I want to find the best way to integrate HDFS in order to optimize storage management of data.
Since HDFS is a distributed file system that allows clients to access a file in parallel, I want to take advantage of this feature.
So, is it possible to install HDFS in the cluster to manage the disk space of the VMs, or to integrate it with OpenShift to manage the data of PaaS end users?
If you are thinking of using this with OpenShift Origin or OpenShift Enterprise, then you can just expose HDFS to the OpenShift nodes as user disk space, and the applications can use it. Remember, when you install OpenShift on your own infrastructure you can expose any file system you want, as long as you can normally do so for Linux users.
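As a small hedged illustration of what an application running on those nodes could then do against the exposed HDFS, here is a sketch using the hdfs package from PyPI over WebHDFS (the NameNode address, user, and paths are placeholders, and no Kerberos is assumed):

from hdfs import InsecureClient

# WebHDFS endpoint of the NameNode
client = InsecureClient("http://namenode.example.com:9870", user="hdfs")

# Write a small file, then read it back
client.write("/data/example.txt", data=b"hello from the cluster", overwrite=True)
with client.read("/data/example.txt") as reader:
    print(reader.read())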