Using DataFlow to read data from Compute Engine

I want to read data from MariaDB on Google Compute Engine and write it to BigQuery with Dataflow, but I always get the exception below when I run the Dataflow program on DataflowRunner.
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.sql.SQLException: Cannot create PoolableConnectionFactory (Could not connect to address=(host=xxx.xxx.xxx.xxx)(port=3306)(type=master) : connect timed out)
I can successfully access the MariaDB instance with DBeaver.
I can also run the Dataflow program successfully on DirectRunner.
Can you give me some ideas? Thanks.

To restrict access so that only Dataflow jobs can reach the database, you can leverage the fact that Dataflow's harness VMs are created with the dataflow tag. Alternatively, you can place the GCE instance and the Dataflow workers on a specific network/subnetwork.
For example, create a GCE instance with a network tag such as mariadb so it can be used as the target for firewall rules, and/or select a specific VPC network/subnetwork. Install MariaDB on it (other options are an initialization script or a preinstalled solution through Cloud Launcher), as sketched below.
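A minimal sketch of such an instance creation, with hypothetical instance, network, subnetwork, and zone names:
gcloud compute instances create mariadb-host \
    --tags=mariadb \
    --network=dataflow-network \
    --subnet=subnet-europe-west \
    --zone=europe-west1-b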
For the firewall rules, the database needs to be reachable on port tcp:3306. For the GCE instance (target tag mariadb), allow ingress traffic on that port either from the source tag dataflow or from within the subnetwork. Take into account that, for the latter option, you'll also need to allow internal communication between Dataflow workers inside the subnetwork.
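For example, a sketch of the tag-based rule (the rule name and network are assumptions):
gcloud compute firewall-rules create allow-dataflow-to-mariadb \
    --network=dataflow-network \
    --allow=tcp:3306 \
    --source-tags=dataflow \
    --target-tags=mariadb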
Now, on the Dataflow side, add the JdbcIO and MariaDB connector dependencies to the pom.xml file:
<!-- https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-jdbc -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-jdbc</artifactId>
  <version>2.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.mariadb.jdbc/mariadb-java-client -->
<dependency>
  <groupId>org.mariadb.jdbc</groupId>
  <artifactId>mariadb-java-client</artifactId>
  <version>1.1.7</version>
</dependency>
A sample Dataflow snippet to connect (use internal IP in the JDBC connection string if using the subnetwork approach):
import java.sql.ResultSet;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class MariaDB {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(PipelineOptions.class);
    Pipeline p = Pipeline.create(options);
    PCollection<String> account = p.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.mariadb.jdbc.Driver", "jdbc:mariadb://INTERNAL_IP:3306/database")
            .withUsername("root")
            .withPassword("pwd"))
        .withQuery("SELECT … FROM table")
        .withCoder(SerializableCoder.of(String.class))
        .withRowMapper(new JdbcIO.RowMapper<String>() {
          public String mapRow(ResultSet rs) throws Exception {
            // Map each ResultSet row to a String here.
            ...
          }
        }));
    p.run();
  }
}
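The question also mentions writing to BigQuery, while the snippet above only reads. A minimal sketch of the write step, assuming the beam-sdks-java-io-google-cloud-platform dependency is also added to the pom and a hypothetical PROJECT_ID:dataset.table destination with a single "value" column (adapt to your real schema):
// Additional imports assumed: com.google.api.services.bigquery.model.TableRow,
// org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO,
// org.apache.beam.sdk.transforms.DoFn, org.apache.beam.sdk.transforms.ParDo.
account
    .apply("ToTableRow", ParDo.of(new DoFn<String, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Wrap each row String in a TableRow; adapt the field name to your schema.
        c.output(new TableRow().set("value", c.element()));
      }
    }))
    .apply(BigQueryIO.writeTableRows()
        .to("PROJECT_ID:dataset.table") // hypothetical destination
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER) // table assumed to exist
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));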
And launch the job specifying the subnetwork and matching zone if needed:
mvn compile exec:java \
  -Dexec.mainClass=com.example.MariaDB \
  -Dexec.args="--project=PROJECT_ID \
  --stagingLocation=gs://BUCKET_NAME/mariadb/staging/ \
  --output=gs://BUCKET_NAME/mariadb/output \
  --network=dataflow-network \
  --subnetwork=regions/europe-west1/subnetworks/subnet-europe-west \
  --zone=europe-west1-b \
  --runner=DataflowRunner"

Related

How to upload a file to OCI Object storage

I am trying to use the UploadObjectExample.java code to upload a file to OCI Object Storage. I am running into a connection timeout error while connecting to the Object Storage URL. The same config file is used by the OCI CLI to successfully upload files to OCI Object Storage.
Here is the Error log:
Exception in thread "main" com.oracle.bmc.model.BmcException: (-1, null, true) Timed out while communicating to: https://objectstorage.us-ashburn-1.oraclecloud.com (outbound opc-request-id: 1EB5AA4A7FD64D58A54F876AD0C9E83B)
at com.oracle.bmc.http.internal.RestClient.convertToBmcException(RestClient.java:572)
at com.oracle.bmc.http.internal.RestClient.put(RestClient.java:380)
at com.oracle.bmc.objectstorage.ObjectStorageClient.putObject(ObjectStorageClient.java:1053)
at com.oracle.bmc.objectstorage.transfer.internal.SimpleRetry$1.apply(SimpleRetry.java:34)
at com.oracle.bmc.objectstorage.transfer.internal.SimpleRetry$1.apply(SimpleRetry.java:26)
at com.oracle.bmc.objectstorage.transfer.UploadManager.singleUpload(UploadManager.java:111)
at com.oracle.bmc.objectstorage.transfer.UploadManager.upload(UploadManager.java:73)
at UploadObjectExample.main(UploadObjectExample.java:74)
Caused by: javax.ws.rs.ProcessingException: java.net.SocketTimeoutException: connect timed out
at org.glassfish.jersey.client.internal.HttpUrlConnector.apply(HttpUrlConnector.java:284)
at org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:278)
at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$0(JerseyInvocation.java:753)
at org.glassfish.jersey.internal.Errors.process(Errors.java:316)
at org.glassfish.jersey.internal.Errors.process(Errors.java:298)
at org.glassfish.jersey.internal.Errors.process(Errors.java:229)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:414)
at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:752)
at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:445)
at org.glassfish.jersey.client.JerseyInvocation$Builder.put(JerseyInvocation.java:334)
at com.oracle.bmc.http.internal.ForwardingInvocationBuilder.put(ForwardingInvocationBuilder.java:141)
at com.oracle.bmc.http.internal.RestClient.put(RestClient.java:377)
Please test curl -v https://objectstorage.us-ashburn-1.oraclecloud.com from the same machine where the Java client times out, just to make sure there are no connectivity issues. If that works fine, you may try to change the timeout value in ClientConfiguration. You can see more details here: https://github.com/oracle/oci-java-sdk/issues/92
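A minimal sketch of raising the timeouts via ClientConfiguration, assuming the config-file authentication used by UploadObjectExample (the timeout values and profile name are illustrative; the constructor form matches older SDK versions):
import com.oracle.bmc.ClientConfiguration;
import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
import com.oracle.bmc.objectstorage.ObjectStorageClient;

// Illustrative timeout values; tune to your network.
ClientConfiguration clientConfig = ClientConfiguration.builder()
        .connectionTimeoutMillis(30000)
        .readTimeoutMillis(60000)
        .build();

ConfigFileAuthenticationDetailsProvider provider =
        new ConfigFileAuthenticationDetailsProvider("~/.oci/config", "DEFAULT");
ObjectStorageClient client = new ObjectStorageClient(provider, clientConfig);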
Before creating a support ticket you might also try to create a new issue on the oci-java-sdk GitHub repository.
Without knowing more about the config file (I do not suggest you post it here), your home region, and other elements, it is very hard to help.
I would suggest you open a support ticket at https://support.oracle.com, making sure that you select the Cloud tab and the Service as "Oracle Cloud Infrastructure".
Are you using a proxy? If so, you may need to use the OCI Java SDK ApacheConnector.
This was an issue with the proxy. It was resolved by using the ash7 proxy.

Unable to connect to MySQL from Ballerina.io on Mac OS X

I want to build a simple app that connects to a remote MySQL server. However, I can't make it work.
import ballerina/io;
import ballerina/jdbc;
import ballerina/mysql;

endpoint jdbc:Client jiraDB {
    host: "jdbc:mysql://DB-SERVER:3306/jira",
    username: "jira",
    password: "PWD",
    poolOptions: { maximumPoolSize: 5 }
};

type Domain record {
    string domain,
    string jira,
};

function main(string... args) {
    var ret = jiraDB->select("SELECT * FROM `domains`", ());
    table domainTable;
    match ret {
        table tableReturned => domainTable = tableReturned;
        error e => io:println("Select data from domains table failed: " + e.message);
    }
    while (domainTable.hasNext()) {
        var domain = <Domain>domainTable.getNext();
        match domain {
            Domain d => io:println("Domain: " + d.domain);
            error e => io:println("Error in get employee from table: " + e.message);
        }
    }
}
The structure of the MySQL database is not really important. I think it has to do with a missing or wrongly used JDBC/MySQL library.
Do you have any ideas how to make it work on Mac OS X?
$ ballerina run hello.bal
error: ballerina/runtime:CallFailedException, message: call failed
at ..<stop>(hello.bal:5)
caused by error
at ballerina/jdbc:stop(endpoint.bal:66)
I'm using latest Mac OS X with:
$ ballerina --version
Ballerina 0.980.1
First, the latest Ballerina version is 0.981.0. It would be great if you could use the latest version since it includes the latest bug fixes and improvements.
In Ballerina, there is a generic jdbc client which can be used to connect to any database that has a JDBC driver. In addition, for MySQL and H2 there are two clients implemented specifically for those databases.
When connecting to MySQL, you could use either the generic jdbc client or the MySQL-specific client. The recommendation is to use the MySQL-specific client.
In your code snippet, I can see you are using the jdbc client. As Anoukh mentioned above, the endpoint configuration is incorrect.
Following is a sample configuration for generic jdbc client endpoint.
endpoint jdbc:Client testDB {
    url: "jdbc:mysql://localhost:3306/testdb",
    username: "user1",
    password: "pass1",
    poolOptions: { maximumPoolSize: 5 }
};
And following is a sample configuration of mysql client endpoint.
endpoint mysql:Client testDB {
    host: "localhost",
    port: 3306,
    name: "testDB",
    username: "user1",
    password: "pass1",
    poolOptions: { maximumPoolSize: 5 }
};
In order to use either of the clients, you need to copy the MySQL JDBC driver to ${BALLERINA_HOME}/bre/lib.
Even after correcting your configuration and copying the driver, if you still face the issue, please check whether a file named ballerina-internal.log is created where you are running your bal file, and share it. Also please share the MySQL database and driver versions you are using.
Have you copied the MySQL JDBC driver to the BALLERINA_HOME/bre/lib folder?
You can find the Ballerina home using the which ballerina command.
You can download the MySQL JDBC driver from http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.6/mysql-connector-java-5.1.6.jar
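For example, a quick sketch of downloading and installing it (assumes BALLERINA_HOME is set, as shown by which ballerina):
# Download the driver and copy it into the Ballerina runtime's lib folder.
curl -O http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.6/mysql-connector-java-5.1.6.jar
cp mysql-connector-java-5.1.6.jar "$BALLERINA_HOME"/bre/lib/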
The issue might be in the jiraDB endpoint configuration. As per the API docs, the URL of the database should be given as url instead of host.
I was not able to connect to MySQL and I faced a driver instance error. I solved it! I'm not sure I'm posting my answer in the right place, but I think it will be a good resource for fixing MySQL connection issues in Ballerina.
In my terminal: echo $BALLERINA_HOME
/Library/Ballerina/ballerina-0.990.2
Copy the right jar to the right place!
Go to: http://central.maven.org/maven2/mysql/mysql-connector-java/
I downloaded the latest stable version (at the time of writing, 8.0.15).
Copy the jar into $BALLERINA_HOME/bre/lib/
I had an error with a prior version.
Be careful that your jar has the right extension (the .jar file, not the directory with the same name).
Also be sure to have followed the recommendations (see the Oracle documentation on installing a jar, i.e. setting the classpath).
In your terminal, set the class path:
export CLASSPATH=$CLASSPATH:/Library/Ballerina/ballerina-0.990.2/bre/lib/mysql-connector-java-8.0.15.jar
Then it will work!

Get Error storing cluster namespace secret (E0025) trying to bind service to a cluster

I am following Tutorial: Creating Kubernetes clusters in IBM Bluemix Container Service, but when I try to bind a service to my cluster I get:
$ bx cs cluster-service-bind kub_cluster myns cloudant
FAILED
Error storing cluster namespace secret (E0025)
Incident ID: ebdbdd0d-5d6a-4373-8e54-b7dd84733a29
I have a worker node:
$ bx cs workers kub_cluster
This lists one worker in State 'normal' and Status 'Ready'.
I tried different services (Message Hub and Cloudant) and different names for the namespace. These are services I already have. Does anyone know how to get around this?
I was able to test this out following the same guide. I used the tone analyzer service. For testing I used the default namespace.
Are you able to see the namespace you are using when you list the available Kubernetes namespaces? The "myns" option needs to be a Kubernetes namespace.
$ kubectl get namespaces
This should print out the default namespace, the system namespaces, and any namespaces you created.
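Typical output looks something like this (the system namespaces and ages are illustrative and vary by cluster):
NAME          STATUS    AGE
default       Active    7d
ibm-system    Active    7d
kube-system   Active    7d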
Earlier in the guide a namespace is set up for the Docker registry; it is possible that you are using that one, which is not a Kubernetes namespace.
Other instances of this issue appear to be related to the status of the cluster. It looks like your cluster has an available node (normal and ready), so it should be able to store the secret in an available namespace.
You might be missing the specific namespace in your cluster.
You can create one by calling:
kubectl create namespace <your namespace>
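For example, using the names from the question, and then retrying the bind:
kubectl create namespace myns
bx cs cluster-service-bind kub_cluster myns cloudant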

Compute Engine accessing DataStore get Invalid Credentials (code: 401)

I am following the tutorial on
https://cloud.google.com/datastore/docs/getstarted/start_nodejs/
trying to use datastore from my Compute Engine project.
Step 2 in the tutorial mentions that I do not have to create new service account credentials when running from Compute Engine.
I run the sample with:
node test.js abc-test-123
where abc-test-123 is my project ID, and that project has all Cloud API access enabled, including the Datastore API.
After uploaded the code and executed the sample, I got the following error:
Adams: { 'rpc error': { [Error: Invalid Credentials] code: 401,
errors: [ [Object] ] } }
Update:
I worked around it by changing the default sample code to use the JWT credential flow (with a generated .json key file), and things are working now.
Update 2:
This is the scope config when I run
gcloud compute instances describe abc-test-123
And the result:
serviceAccounts:
scopes:
- https://www.googleapis.com/auth/cloud-platform
According to the doc:
You can set scopes only when you create a new instance, and cannot
change or expand the list of scopes for existing instances. For
simplicity, you can choose to enable full access to all Google Cloud
Platform APIs with the https://www.googleapis.com/auth/cloud-platform
scope.
I still welcome any answer about why the original code did not work in my case.
Thanks for reading.
This most likely means that when you created the instance, you didn't specify the right scopes (datastore and userinfo-email, according to the tutorial). You can check that by executing the following command:
gcloud compute instances describe <instance>
Look for serviceAccounts/scopes in the output.
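For example, to print just the service account section (the --format projection is a standard gcloud flag):
gcloud compute instances describe <instance> --format="yaml(serviceAccounts)"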
There are two ways to create an instance with the right credentials:
gcloud compute instances create $INSTANCE_NAME --scopes datastore,userinfo-email
Using the web UI: under Access & Security, enable User Info and Datastore.

How to automatically exit/stop the running instance

I have managed to create an instance and SSH into it. However, I have a couple of questions regarding Google Compute Engine.
I understand that I will be charged for the time my instance is running, that is, until I exit the instance. Is my understanding correct?
I wish to run a batch job (a Java program) on my instance. How do I make my instance stop automatically after the job is complete (so that I don't get charged for the additional time it may run)?
If I start the job and disconnect my PC, will the job continue to run on the instance?
Regards,
Asim
Correct, instances are charged for the time they are running (billed per minute, with a 10-minute minimum). Instances run from the time they are started via the API until they are stopped via the API. It doesn't matter whether any user is logged in via SSH or not. For most automated use cases users never log in; programs are installed and started via startup scripts.
You can view your running instances via the Cloud Console, to confirm if any are currently running.
If you want to stop your instance from inside the instance, the easiest way is to start the instance with the compute-rw Service Account Scope and use gcutil.
For example, to start your instance from the command line with the compute-rw scope:
$ gcutil --project=<project-id> addinstance <instance name> --service_account_scopes=compute-rw
(this is the default when manually creating an instance via the Cloud Console)
Later, after your batch job completes, you can remove the instance from inside the instance:
$ gcutil deleteinstance -f <instance name>
You can put the halt command at the end of your batch script (assuming that you write your results to persistent disk).
After halt, the instance will be in the TERMINATED state and you will not be charged.
See https://developers.google.com/compute/docs/pricing and scroll down to "instance uptime".
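For example, a minimal sketch of such a batch script (the jar and persistent-disk paths are hypothetical):
#!/bin/bash
# Run the batch job and persist its output to the mounted persistent disk first.
java -jar /mnt/pd0/jobs/batch-job.jar > /mnt/pd0/jobs/output.log 2>&1
# Power the VM off once the job finishes.
sudo halt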
You can auto-shutdown the instance after model training by running a few extra lines of code once the training is complete.
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
service = discovery.build('compute', 'v1', credentials=credentials)
# Project ID for this request.
project = 'xyz'  # your project ID
# The name of the zone for this request.
zone = 'xyz'  # e.g. us-central1-a
# Name of the instance resource to stop.
instance = 'xyz'  # the instance name (not its numeric ID)
request = service.instances().stop(project=project, zone=zone, instance=instance)
response = request.execute()
Add this to your model training script. When the training is complete, the GCP instance automatically shuts down.
More info on the official website:
https://cloud.google.com/compute/docs/reference/rest/v1/instances/stop
If you want to stop the instance using a Python script, you can do it this way:
from google.cloud.compute_v1.services.instances import InstancesClient

# from_service_account_file is a classmethod, so call it on the class itself.
instance_client = InstancesClient.from_service_account_file(<location-path>)

zone = <zone>
project = <project>
instance = <instance_id>

instance_client.stop(project=project, instance=instance, zone=zone)
In the above script, I have assumed you are using a service account for authentication. Documentation for the libraries used is here:
https://googleapis.dev/python/compute/latest/compute_v1/instances.html