How to call or import a table from Google Cloud SQL into a Spark DataFrame? - mysql

I have created a Google Dataproc cluster and I am running PySpark on it. I am trying to import data from a table into PySpark, so I created a table in Cloud SQL on Google Cloud Platform. But I don't know how to call or import this table from PySpark; I don't have any URL or similar to point to the table. Could you please help with this?

Normally, you could use spark.read.jdbc(); see How to work with MySQL and Apache Spark?
The challenge with Cloud SQL is networking: figuring out how to connect to the instance. There are two main ways to do this:
1) Install the Cloud SQL proxy
You can use this initialization action to do that for you. Follow the instructions under "without configuring Hive metastore", since you don't need to do that:
gcloud dataproc clusters create <CLUSTER_NAME> \
--scopes sql-admin \
--initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
--metadata "enable-cloud-sql-hive-metastore=false"
The proxy is a local daemon that you can connect to on localhost:3306 and that forwards connections to the Cloud SQL instance. You'd need to use localhost:3306 in your JDBC connection URL in spark.read.jdbc().
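For example, once the proxy is running on the cluster, the read could look roughly like this (a minimal sketch: the database name, table name, and credentials are placeholders, and it assumes the MySQL JDBC driver is already on the cluster's classpath):
df = spark.read.jdbc(
    url="jdbc:mysql://localhost:3306/<DATABASE>",
    table="<TABLE>",
    properties={"user": "<USER>", "password": "<PASSWORD>"})
df.show()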
2) If you're instead willing to add to your driver classpath, you can consider installing the Cloud SQL socket factory.
There's some discussion of how to do this here: https://groups.google.com/forum/#!topic/cloud-dataproc-discuss/Ns6umF_FX9g and here: Spark - Adding JDBC Driver JAR to Google Dataproc.
It sounds like you can either package it into a shaded application JAR via your pom.xml, or just provide it at runtime by adding it with --jars.
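For reference, a connection through the socket factory would look roughly like this (a hedged sketch: the JAR name/version, instance connection name, and credentials are placeholders, and the URL format is the one documented for the MySQL socket factory):
# Start pyspark with the socket factory JAR (name/version are placeholders) on the classpath:
#   pyspark --jars mysql-socket-factory-<VERSION>-jar-with-dependencies.jar
df = spark.read.jdbc(
    url=("jdbc:mysql://google/<DATABASE>"
         "?cloudSqlInstance=<PROJECT>:<REGION>:<INSTANCE>"
         "&socketFactory=com.google.cloud.sql.mysql.SocketFactory"),
    table="<TABLE>",
    properties={"user": "<USER>", "password": "<PASSWORD>"})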

Related

Is it safe to connect a DataFlow job to Cloud MySQL with the proxy (via os.system)?

I am trying to create a Python job on DataFlow that needs a Cloud SQL connection (and I'm a total beginner). I need to execute several MySQL queries in a ParDo (Apache Beam). I am using PyMySQL and had problems authenticating, so I tried this answer and apparently it works:
import os
import apache_beam as beam

class MyDoFn(beam.DoFn):
    def setup(self):
        # Download the Cloud SQL proxy binary and make it executable
        os.system("wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy")
        os.system("chmod +x cloud_sql_proxy")
        # Start the proxy in the background, listening on localhost:3306
        os.system(f"./cloud_sql_proxy -instances={self.sql_args['cloud_sql_connection_name']}=tcp:3306 &")
The thing is, I find this to be more of a work-around. Is it safe to authenticate this way? I would appreciate any help! Thank you in advance.
Yes, this is a safe way to use a Cloud SQL connection. The cloud_sql_proxy uses authentication info from the Compute Engine instance to properly authenticate the connection. See https://cloud.google.com/sql/docs/mysql/sql-proxy#authentication-options for more about this.
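With the proxy listening, the connection itself simply targets localhost. A minimal sketch of the PyMySQL side (database name and credentials are placeholders):
import pymysql

# Connect through the proxy, which is listening locally on port 3306
connection = pymysql.connect(
    host="127.0.0.1",
    port=3306,
    user="<USER>",
    password="<PASSWORD>",
    db="<DATABASE>")
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())
finally:
    connection.close()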

MySQL and Keycloak setup

Hello, I am trying to add a MySQL database to my Keycloak server.
I've added module.xml and mysql-connector-java-5.1.42-bin.jar under /modules/system/layers/base/com/mysql/main.
When I run the command to add the MySQL module, ./jboss-cli.sh, it errors out with
Exception in thread "CLI Terminal Connection (uninterruptable)"
java.lang.ArithmeticException: / by zero
And when I try to start Keycloak, I am also notified that there is a missing service:
service jboss.jdbc-driver.mysql (missing)
Please help!!
When I run the command to add the MySQL module, ./jboss-cli.sh, it errors out with
Can you post your command? You don't have to do this with the CLI; it's also possible to modify the config in an editor. At least for testing, you should try that.
Keycloak docs have a pretty good part about database setup: https://www.keycloak.org/docs/latest/server_installation/index.html#_database
The basic steps are:
Locate and download a JDBC driver for your database
Package the driver JAR into a module and install this module into the server (module.xml)
Declare the JDBC driver in the configuration profile of the server (standalone.xml)
Modify the datasource configuration to use your database’s JDBC driver
Modify the datasource configuration to define the connection parameters to your database
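For reference, a module.xml matching the layout in the question (directory modules/system/layers/base/com/mysql/main, hence module name com.mysql; adjust the resource-root to whatever driver JAR you actually copied) could look roughly like this:
<?xml version="1.0" encoding="UTF-8"?>
<module xmlns="urn:jboss:module:1.3" name="com.mysql">
    <resources>
        <resource-root path="mysql-connector-java-5.1.42-bin.jar"/>
    </resources>
    <dependencies>
        <module name="javax.api"/>
        <module name="javax.transaction.api"/>
    </dependencies>
</module>
The driver is then declared in the datasources subsystem of standalone.xml, e.g. with a <driver name="mysql" module="com.mysql"> entry, and referenced from the Keycloak datasource.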
There is an error in the Keycloak documentation. The driver should be in the
modules/system/layers/base/com/mysql/driver/main
folder.
The full, valid instructions are here:
https://github.com/v-ladynev/keycloak-nodejs-example#keycloak-configuration
You can also use Docker images to experiment:
https://github.com/v-ladynev/keycloak-nodejs-example#keycloak-docker-image

Kafka Connect with MySQL Source

Before I begin, I'd like to start by saying I am completely new to Kafka and am fairly new to Linux, so if this ends up being a ridiculously simple answer, please be kind! :)
The high-level idea of what I'm trying to do is to use Confluent's Kafka Connect to read from a MySQL database that has sensor data streamed to it on a minute or sub-minute basis, and then use Kafka as an "ETL pipeline" to instantly route that data to a data warehouse and/or MongoDB for reporting, or even to tie our web app directly into Kafka.
I am using Robin Moffatt's series as well as Confluent's JDBC Source Connector Quickstart as my initial guide. As far as where these are hosted, I am using an Amazon RDS MySQL database and a separate AWS EC2 t2.large instance with Ubuntu 16.04.2 to run Kafka Connect.
Using Robin's workflow, I am at the point where I have created the configuration file, but I am not using the JSON format he uses; I am using the format from the quickstart article.
name=jdbc_source_mysql_4427_Data
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://lndbtest.cdveaddpnevv.us-east-2.rds.amazonaws.com:3306/LNDBv1?user=adminRDS&password=*****
table.whitelist=4427_Data
mode=timestamp
timestamp.column.name=TmStamp
validate.non.null=false
topic.prefix=mysql-
And that is saved at:
/etc/kafka-connect-jdbc/kafka-connect-jdbc-source.properties
I then run:
/usr/bin/confluent load jdbc_source_mysql_4427_Data -d /etc/kafka-connect-jdbc/kafka-connect-jdbc-source.properties
and get this error:
{
"error_code": 400,
"message": "Connector configuration is invalid and contains the following 2 error(s):\nInvalid value java.sql.SQLException: No suitable driver found for jdbc:mysql://lndbtest.cdveaddpnevv.us-east-2.rds.amazonaws.com:3306/LNDBv1?user=adminRDS&password=*** for configuration Couldn't open connection to jdbc:mysql://lndbtest.cdveaddpnevv.us-east-2.rds.amazonaws.com:3306/LNDBv1?user=adminRDS&password=***\nInvalid value java.sql.SQLException: No suitable driver found for jdbc:mysql://lndbtest.cdveaddpnevv.us-east-2.rds.amazonaws.com:3306/LNDBv1?user=adminRDS&password=*** for configuration Couldn't open connection to jdbc:mysql://lndbtest.cdveaddpnevv.us-east-2.rds.amazonaws.com:3306/LNDBv1?user=adminRDS&password=***\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"
}
It seems to be a driver issue. My question at this point is, "Do I need to download the MySQL JDBC driver to my EC2 instance, or should that have been included in the Confluent Platform package?"
Also, does my overall idea sound like a good fit for Kafka Connect?
As I mentioned earlier, I am new to these technologies, but have found the best way to learn something is to jump right in and try to solve a problem. Any ideas and suggestions would be more than welcome. Thank you!
The overall concept makes sense to me. You do need to download the driver and add it to your worker classpath. It isn't packaged with Confluent, for licensing reasons I assume.
As @dawsaw says, you do need to make the MySQL JDBC driver available to the connector.
My observation here would be that, given a free hand in the application and architecture you describe, it would be best to stream from the sensors into Kafka, and then from Kafka into MySQL, Mongo, the web app, etc.
Streaming into a database only to then stream back out of it is not a perfect choice, if you have the option.
It's because there is no MySQL driver in the Confluent distribution. You can solve the problem by downloading the MySQL driver JAR, putting it in the confluent/share/java/kafka-connect-jdbc folder, and re-running the program.
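Concretely, that could look something like this on the Connect worker (a hedged sketch: the Connector/J version, download URL, and Confluent install path are assumptions; adjust them to your setup):
# Download MySQL Connector/J (version and URL are examples only)
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.45.tar.gz
tar xzf mysql-connector-java-5.1.45.tar.gz
# Copy the driver JAR next to the JDBC connector's other JARs
cp mysql-connector-java-5.1.45/mysql-connector-java-5.1.45-bin.jar \
   <CONFLUENT_HOME>/share/java/kafka-connect-jdbc/
# Then restart the Connect worker so it picks up the driver
# (e.g. confluent stop connect && confluent start connect)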

Unable to connect to snappydata store with spark-shell command

SnappyData v0.5
My goal is to start a "spark-shell" from my SnappyData install's /bin directory and issue Scala commands against existing tables in my SnappyData store.
I am on the same host as my SnappyData store, locator, and lead (and yes, they are all running).
To do this, I am running this command as per the documentation here:
Connecting to a Cluster with spark-shell
~/snappydata/bin$ spark-shell --master local[*] --conf snappydata.store.locators=10.0.18.66:1527 --conf spark.ui.port=4041
I get this error when trying to connect spark-shell to my store:
[TRACE 2016/08/12 15:21:55.183 UTC GFXD:error:FabricServiceAPI tid=0x1] XJ040 error occurred while starting server : java.sql.SQLException(XJ040): Failed to start database 'snappydata', see the cause for details.
java.sql.SQLException(XJ040): Failed to start database 'snappydata', see the cause for details.
at com.pivotal.gemfirexd.internal.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:124)
at com.pivotal.gemfirexd.internal.impl.jdbc.Util.newEmbedSQLException(Util.java:110)
at com.pivotal.gemfirexd.internal.impl.jdbc.Util.newEmbedSQLException(Util.java:136)
at com.pivotal.gemfirexd.internal.impl.jdbc.Util.generateCsSQLException(Util.java:245)
at com.pivotal.gemfirexd.internal.impl.jdbc.EmbedConnection.bootDatabase(EmbedConnection.java:3380)
at com.pivotal.gemfirexd.internal.impl.jdbc.EmbedConnection.<init>(EmbedConnection.java:450)
at com.pivotal.gemfirexd.internal.impl.jdbc.EmbedConnection30.<init>(EmbedConnection30.java:94)
at com.pivotal.gemfirexd.internal.impl.jdbc.EmbedConnection40.<init>(EmbedConnection40.java:75)
at com.pivotal.gemfirexd.internal.jdbc.Driver40.getNewEmbedConnection(Driver40.java:95)
at com.pivotal.gemfirexd.internal.jdbc.InternalDriver.connect(InternalDriver.java:351)
at com.pivotal.gemfirexd.internal.jdbc.InternalDriver.connect(InternalDriver.java:219)
at com.pivotal.gemfirexd.internal.jdbc.InternalDriver.connect(InternalDriver.java:195)
at com.pivotal.gemfirexd.internal.jdbc.AutoloadedDriver.connect(AutoloadedDriver.java:141)
at com.pivotal.gemfirexd.internal.engine.fabricservice.FabricServiceImpl.startImpl(FabricServiceImpl.java:290)
at com.pivotal.gemfirexd.internal.engine.fabricservice.FabricServerImpl.start(FabricServerImpl.java:60)
at io.snappydata.impl.ServerImpl.start(ServerImpl.scala:32)
Caused by: com.gemstone.gemfire.GemFireConfigException: Unable to contact a Locator service (timeout=5000ms). Operation either timed out or Locator does not exist. Configured list of locators is "[dev-snappydata-1(null):1527]".
at com.gemstone.gemfire.distributed.internal.membership.jgroup.GFJGBasicAdapter.getGemFireConfigException(GFJGBasicAdapter.java:533)
at com.gemstone.org.jgroups.protocols.TCPGOSSIP.sendGetMembersRequest(TCPGOSSIP.java:212)
at com.gemstone.org.jgroups.protocols.PingSender.run(PingSender.java:82)
at java.lang.Thread.run(Thread.java:745)
Hmm! I assume you are running spark-shell from your desktop and connecting to the cluster in AWS?
I'm not sure this is going to work, because the local JVM launched by spark-shell will attempt to connect to the peer-to-peer cluster in SnappyData, which is not likely to succeed.
snappy-shell, on the other hand, merely uses the JDBC client to connect (and hence will work).
And you cannot use the locator client port (1527) anyway. See here.
Can you try snappydata.store.locators=10.0.18.66:10334 (not 1527) as the port? It's unlikely this will work, but it's worth a try.
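That is the same invocation as before, only pointing at the locator's peer-discovery port (a sketch; nothing else in the original command changes):
~/snappydata/bin$ spark-shell --master local[*] --conf snappydata.store.locators=10.0.18.66:10334 --conf spark.ui.port=4041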
Maybe there is a way to open up all ports and access to these nodes on AWS. Not recommended for production, though.
I am curious to see other responses from the engineering team.
Until then, you may have to start the spark-shell from within the network (on an AWS node).

Using SSL with command-line Flywaydb (flyway) to deploy DB changes

I'm working on a proof of concept to use Flyway's command-line tool from a centralized server to deploy to multiple database platforms (MySQL, Postgres, and SQL Server).
I'm able to deploy successfully without SSL; however, this sends unencrypted connection information such as logins/passwords to the destination database server. My concern is that there's a chance the unencrypted traffic could be seen.
Does anyone have experience with the flyway command line tool using SSL to deploy to:
MySQL
SQL Server
I did not see any information in the documentation unless I missed it.
Thanks for any help and suggestions!
The examples in flyway.conf show how to add additional values to the JDBC URL, for example:
# MySQL : jdbc:mysql://<host>:<port>/<database>?<key1>=<value1>&<key2>=<value2>...
# PostgreSQL : jdbc:postgresql://<host>:<port>/<database>?<key1>=<value1>&<key2>=<value2>...
# Redshift : jdbc:postgresql://<host>:<port>/<database>?<key1>=<value1>&<key2>=<value2>...
So for Redshift/Postgres, for example, you can include the ssl=true flag:
flyway.url=jdbc:postgresql://yourserver:5439/dbname?ssl=true
You need to add the public key that the DB server key was signed with to your host's trust store (for Redshift, see http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html for details), e.g.:
${JAVA_HOME}/bin/keytool -keystore ${JAVA_HOME}/lib/security/cacerts -import -alias <alias> -file <certificate_filename>
I then had to hack the flyway startup script /flyway to include the truststore and password in the JAVA_ARGS (it probably should have these as variables), e.g.:
JAVA_ARGS="-Djava.security.egd=file:/dev/../dev/urandom -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts -Djavax.net.ssl.trustStorePassword=changeit"
For MySQL I used the following URL to connect using SSL.
jdbc:mysql://hostname:3306/wpastudy?useSSL=true
Note the useSSL=true parameter.
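Putting it together for MySQL, the relevant flyway.conf entries could look roughly like this (a sketch: host, database, and credentials are placeholders, and the extra requireSSL/verifyServerCertificate parameters are optional Connector/J settings you may want once the server certificate is in your trust store):
flyway.url=jdbc:mysql://hostname:3306/wpastudy?useSSL=true&requireSSL=true&verifyServerCertificate=true
flyway.user=<USER>
flyway.password=<PASSWORD>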