Can't connect to MySQL database from pyspark, getting JDBC error - mysql

I am learning pyspark and trying to connect to a MySQL database.
But I am getting a java.lang.ClassNotFoundException: com.mysql.jdbc.Driver exception while running the code. I have spent a whole day trying to fix it; any help would be appreciated :)
I am using PyCharm Community Edition with Anaconda and Python 3.6.3.
Here is my code:
from pyspark import SparkContext,SQLContext
sc= SparkContext()
sqlContext= SQLContext(sc)
df = sqlContext.read.format("jdbc").options(
url ="jdbc:mysql://192.168.0.11:3306/my_db_name",
driver = "com.mysql.jdbc.Driver",
dbtable = "billing",
user="root",
password="root").load()
Here is the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o27.load.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver

This got asked 9 months ago at the time of writing, but since there's no answer, here it goes. I was in the same situation, searched Stack Overflow over and over and tried different suggestions, but the answer is absurdly simple: you just have to copy the MySQL driver into the "jars" folder of Spark!
Download it here: https://dev.mysql.com/downloads/connector/j/5.1.html
I'm using version 5.1 although 8.0 exists, because I had some other problems when running the latest version with Spark 2.3.2 (I also had other problems running Spark 2.4 on Windows 10).
Once downloaded you can just copy it into your Spark folder
E:\spark232_hadoop27\jars\ (use your own drive:\folder_name -- this is just an example)
You should have two files:
E:\spark232_hadoop27\jars\mysql-connector-java-5.1.47-bin.jar
E:\spark232_hadoop27\jars\mysql-connector-java-5.1.47.jar
After that, the following code, launched through PyCharm or a Jupyter notebook, should work (as long as you have a MySQL database set up, that is):
import findspark
findspark.init()
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe_mysql = spark.read.format("jdbc").options(
url="jdbc:mysql://localhost:3306/uoc2",
driver = "com.mysql.jdbc.Driver",
dbtable = "company",
user="root",
password="password").load()
dataframe_mysql.show()
Bear in mind, I'm currently working locally with my Spark setup, so no real clusters are involved, and no "production" code gets submitted to such a cluster. For something more elaborate, this answer could help: MySQL read with PySpark
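Before starting the session, it can help to confirm that the connector jar really sits in the folder Spark reads from. A minimal sketch using only the standard library; the helper name find_connector_jars and the fallback to the SPARK_HOME environment variable are assumptions for illustration, not part of the answer above.

```python
import glob
import os


def find_connector_jars(spark_home=None):
    """Return MySQL connector jars found in <spark_home>/jars, sorted by name.

    spark_home falls back to the SPARK_HOME environment variable
    (an assumption; pass your own folder explicitly if it is not set).
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    pattern = os.path.join(spark_home, "jars", "mysql-connector-java-*.jar")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    jars = find_connector_jars()
    if jars:
        print("Connector jars found:", *jars, sep="\n  ")
    else:
        print("No MySQL connector jar found; expect a ClassNotFoundException")
```

If the list comes back empty, the jar is not where Spark will pick it up automatically.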

On my computer, @Kondado's solution works only if I change the driver in the options:
driver = 'com.mysql.cj.jdbc.Driver'
I am using Spark 2.4.0 on Windows. I downloaded mysql-connector-java-8.0.15.jar (the Platform Independent version) and copied it to 'C:\spark-2.4.0-bin-hadoop2.7\jars\'
My code in Pycharm looks like this:
#import findspark # not necessary
#findspark.init() # not necessary
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read through the session directly; a separate SQLContext is not needed
source_df = spark.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/database1',
    driver='com.mysql.cj.jdbc.Driver', # instead of com.mysql.jdbc.Driver
    dbtable='table1',
    user='root',
    password='****').load()
print(source_df)
source_df.show()
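The two answers differ mainly in the driver class name, which follows the Connector/J major version: 5.1 uses com.mysql.jdbc.Driver, while 8.0 introduced com.mysql.cj.jdbc.Driver. A small sketch that picks the class from the jar file name; driver_class_for is a hypothetical helper and assumes the usual mysql-connector-java-<version> file naming.

```python
import re


def driver_class_for(jar_name):
    """Pick the JDBC driver class from a Connector/J jar file name."""
    match = re.search(r"mysql-connector-java-(\d+)\.", jar_name)
    if match is None:
        raise ValueError(f"cannot read a version from {jar_name!r}")
    major = int(match.group(1))
    # Connector/J 8.x moved the driver class into the com.mysql.cj package
    return "com.mysql.cj.jdbc.Driver" if major >= 8 else "com.mysql.jdbc.Driver"


print(driver_class_for("mysql-connector-java-8.0.15.jar"))
print(driver_class_for("mysql-connector-java-5.1.47-bin.jar"))
```

The returned string can be passed as the driver option of the JDBC reader.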

I don't know how to add the jar file to the classpath (can someone tell me how?), so I put it in the SparkSession config and it works fine.
spark = SparkSession \
.builder \
.appName('test') \
.master('local[*]') \
.enableHiveSupport() \
.config("spark.driver.extraClassPath", "<path to mysql-connector-java-5.1.49-bin.jar>") \
.getOrCreate()
df = spark.read.format("jdbc") \
.option("url", "jdbc:mysql://localhost/<database_name>") \
.option("driver", "com.mysql.jdbc.Driver") \
.option("dbtable", "<table_name>") \
.option("user", "<user>") \
.option("password", "<password>") \
.load()
df.show()
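On the classpath question: one other route from a script or notebook is the PYSPARK_SUBMIT_ARGS environment variable, which pyspark reads when it launches the JVM. A sketch under assumptions: the jar path is hypothetical, submit_args_for is an illustrative helper, the variable has to be set before the first SparkSession or SparkContext is created, and the value has to end with pyspark-shell.

```python
import os


def submit_args_for(jar_paths):
    """Build a PYSPARK_SUBMIT_ARGS value that ships the given jars."""
    jars = ",".join(jar_paths)              # --jars takes a comma-separated list
    classpath = os.pathsep.join(jar_paths)  # class paths use ; on Windows, : elsewhere
    return f"--jars {jars} --driver-class-path {classpath} pyspark-shell"


# Hypothetical location; replace with the real path to your connector jar
jar = "/opt/jars/mysql-connector-java-5.1.49-bin.jar"
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args_for([jar])
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

This keeps the jar location out of the session builder, at the cost of having to run before any Spark object exists.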

This worked for me with pyspark and MSSQL.
Java version is 1.7.0_191
pyspark version is 2.1.2
Download the jar files below:
sqljdbc41.jar
mssql-jdbc-6.2.2.jre7.jar
Paste the jars above into the jars folder of the virtual environment:
test_env/lib/python3.6/site-packages/pyspark/jars
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practise').getOrCreate()
url = 'jdbc:sqlserver://your_host_name:your_port;databaseName=YOUR_DATABASE_NAME;useNTLMV2=true;'
df = spark.read.format('jdbc'
).option('url', url
).option('user', 'your_db_username'
).option('password','your_db_password'
).option('dbtable', 'YOUR_TABLE_NAME'
).option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
).load()
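The SQL Server URL uses semicolon-separated key=value properties rather than a path and query string, which is easy to get wrong by hand. A sketch that assembles it; sqlserver_jdbc_url is a hypothetical helper, and 1433 is used here only because it is SQL Server's default port.

```python
def sqlserver_jdbc_url(host, port, database, **properties):
    """Assemble a jdbc:sqlserver URL with ;key=value connection properties."""
    parts = [f"jdbc:sqlserver://{host}:{port}", f"databaseName={database}"]
    parts += [f"{key}={value}" for key, value in properties.items()]
    return ";".join(parts) + ";"


url = sqlserver_jdbc_url("your_host_name", 1433, "YOUR_DATABASE_NAME", useNTLMV2="true")
print(url)
```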

Related

connect prestodb through sqlalchemy

I'd like to connect to prestodb through the SQLAlchemy interface. I'm running prestodb==0.7.0 and SQLAlchemy==1.4.20, and SQLAlchemy doesn't seem to have prestodb baked in:
NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:presto
Not much luck with registering prestodb either:
import os
from sqlalchemy.dialects import registry
import prestodb
from prestodb.dbapi import Connection
registry.register('presto', 'prestodb.dbapi', 'Connection')
from sqlalchemy.engine import create_engine
port = 8889
user = os.environ["USER"]
engine = create_engine(f'presto://{user}@presto:{port}/hive',
    connect_args={'protocol': 'https', 'requests_kwargs': {'verify': False}})
db = engine.raw_connection()
# AttributeError: type object 'Connection' has no attribute 'get_dialect_cls'
Any ideas?
If you have a look at the Dialects docs, you will see that Presto is an external dialect and needs to be installed separately. The Presto dialect is supported through PyHive and can be installed using pip install 'pyhive[presto]'.
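With PyHive installed, the presto:// scheme should resolve without a manual registry.register call. A sketch under assumptions: the host name presto and port 8889 are taken from the question, presto_url and make_presto_engine are hypothetical helpers, and the sqlalchemy import is deferred so the URL builder can be tried without the packages installed.

```python
from urllib.parse import quote


def presto_url(user, host, port, catalog):
    """Build a SQLAlchemy URL for the PyHive Presto dialect."""
    return f"presto://{quote(user, safe='')}@{host}:{port}/{catalog}"


def make_presto_engine(user, host="presto", port=8889, catalog="hive"):
    """Create an engine; needs `pip install 'pyhive[presto]'` and sqlalchemy."""
    from sqlalchemy.engine import create_engine  # imported lazily on purpose
    return create_engine(
        presto_url(user, host, port, catalog),
        connect_args={"protocol": "https"},
    )


print(presto_url("alice", "presto", 8889, "hive"))
```

Calling make_presto_engine(os.environ["USER"]) would then stand in for the create_engine call in the question.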

ModuleNotFoundError: No module named 'fastai.vision'

I am trying to use ImageDataBunch from fastai. It worked fine, but recently when I ran my code it showed this error: ModuleNotFoundError: No module named 'fastai.vision'
Then I upgraded fastai with pip install fastai --upgrade. That error went away, but now I get NameError: name 'ImageDataBunch' is not defined
Here's my code:
import warnings
import numpy as np
from fastai.vision import *
warnings.filterwarnings("ignore", category=UserWarning, module="torch.nn.functional")
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2,
ds_tfms=get_transforms(), size=224, num_workers=4, no_check=True).normalize(imagenet_stats)
How can I fix this?
I actually ran into this same issue when I started using Colab, but haven't been able to reproduce it. Here is the thread describing what another developer and I did to troubleshoot: https://forums.fast.ai/t/no-module-named-fastai-data-in-google-colab/78164/4
I would recommend trying to factory reset your runtime ("Runtime" -> "Factory Reset Runtime").
Then you can check which version of fastai you have (you have to restart the runtime to use the new version if you've already imported it):
import fastai
fastai.__version__
I'm able to run from fastai.vision import * on fastai versions 1.0.61 and 2.0.13.
In Google Colab:
Upgrade fastai on colab:
! [ -e /content ] && pip install -Uqq fastai
Import necessary libraries:
from fastai.vision.all import *
from fastai.text.all import *
from fastai.collab import *
from fastai.tabular.all import *
Get the images and annotations:
path = untar_data(URLs.PETS)
path_anno = path/'annotations'
path_img = path/'images'
print( path_img.ls() ) # print all images
fnames = get_image_files(path_img) # -->> 7390 images
print(fnames[:5]) # print first 5 images
The solution that worked for me was to connect my Google Drive and then run the cells.
You might have installed an older version of fastai. You need to upgrade to fastai v2, which you can do with pip as shown below.
!pip install fastai --upgrade
Also check your fastai version using
import fastai
print(fastai.__version__)
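The NameError after upgrading is consistent with the API change between major versions: fastai v1 exposes ImageDataBunch through from fastai.vision import *, whereas v2 is imported with from fastai.vision.all import * and provides ImageDataLoaders instead. A sketch that maps a version string to the matching import and loader class; vision_api_for is a hypothetical helper, not a fastai function.

```python
def vision_api_for(version):
    """Return (import statement, data-loading class name) for a fastai version string."""
    major = int(version.split(".")[0])
    if major >= 2:
        return "from fastai.vision.all import *", "ImageDataLoaders"
    return "from fastai.vision import *", "ImageDataBunch"


for installed in ("1.0.61", "2.0.13"):
    statement, loader = vision_api_for(installed)
    print(installed, "->", statement, "/", loader)
```

So code written against ImageDataBunch either needs fastai pinned to a 1.x release or a port to the v2 class.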

How to import a package from a local jar in pyspark?

I am using pyspark to do some work on a CSV file, so I need to import the package from spark-csv_2.10-1.4.0.jar, downloaded from https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar
I downloaded the jar to my local machine due to a proxy issue.
Can anyone tell me the right way to refer to a local jar?
Here is the code I use:
pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar
It takes me to the pyspark shell as expected; however, when I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('hdfs://dev-icg/user/spark/routes.dat')
(routes.dat is already uploaded to HDFS at hdfs://dev-icg/user/spark/routes.dat)
it gives me this error:
: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
If I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('routes.dat')
I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.lang.NoClassDefFoundError: Could not initialize class
com.databricks.spark.csv.package$
Can anyone help to sort it out for me? Thank you very much. Any clue is appreciated.
The correct way to do this would be to add these options (say, if you are starting a spark shell):
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-class-path /path/to/csvfilejar.jar
I have not used the Databricks CSV jar directly, but I used a Netezza connector for Spark whose documentation mentions this option:
https://github.com/SparkTC/spark-netezza
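The NoClassDefFoundError for org/apache/commons/csv/CSVFormat suggests that spark-csv's own dependencies are missing: --jars ships only the files it is given, while --packages also resolves transitive dependencies (which needs network access). A sketch that assembles a --jars value from local files; the commons-csv and univocity-parsers file names are assumptions, so check the exact versions against the spark-csv POM before downloading them.

```python
import os


def jars_argument(directory, jar_names):
    """Join local jar files into the comma-separated value that --jars expects."""
    return ",".join(os.path.join(directory, name) for name in jar_names)


# Hypothetical file names; spark-csv needs its dependencies shipped alongside it
jars = jars_argument(
    "/home/rx52019/data",
    ["spark-csv_2.10-1.4.0.jar", "commons-csv-1.1.jar", "univocity-parsers-1.5.1.jar"],
)
print(f"pyspark --jars {jars}")
```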

Make permanent connection with MySQL (Apache Spark)

If I execute these commands in spark-shell it correctly returns the data in the "people" table:
val dataframe_mysql = spark.sqlContext.read.format("jdbc").option("url", "jdbc:mysql://localhost/db_spark").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "people").option("user", "root").option("password", "****").load()
dataframe_mysql.show
The problem is that if I close spark-shell and reopen it, the connection to the database is not maintained.
As per Spark's documentation, SparkContext and HiveContext get created inside the spark shell (when spark-shell command is executed) with HiveContext defined as SQLContext. As the connection is mapped to SQLContext, closing the spark shell will mean you won't be able to access SQLContext and hence, you won't be able to connect.
Here's another reference:
When you run spark-shell, which is your interactive driver
application, it automatically creates a SparkContext defined as sc and
a HiveContext defined as sqlContext.
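Since the contexts (and any DataFrame built on them) go away with the shell, one workaround is to keep the connection settings in one place and re-run the read at the start of each session. A Python sketch of that idea for pyspark; load_table and the MYSQL_SETTINGS dictionary are hypothetical names, and the values are copied from the question.

```python
MYSQL_SETTINGS = {
    "url": "jdbc:mysql://localhost/db_spark",
    "driver": "com.mysql.jdbc.Driver",
    "user": "root",
    "password": "****",  # placeholder, as in the question
}


def load_table(spark, table, settings=MYSQL_SETTINGS):
    """Re-create the JDBC DataFrame for `table` on the given SparkSession."""
    return spark.read.format("jdbc").options(dbtable=table, **settings).load()


# In a fresh pyspark shell, where `spark` already exists:
# people = load_table(spark, "people")
# people.show()
```

The helper could live in a small module that is imported at the start of every shell session.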

Spark read.json can't find file

Hey all, I have a standalone Spark cluster on AWS with 1 master and 1 slave node. I have a folder in my home directory called ~/Notebooks. This is where I launch Jupyter notebooks and connect to Jupyter in my browser. I also have a file in there called people.json (a simple JSON file).
I try running this code
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('Practice').setMaster('spark://ip-172-31-2-186:7077')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.json("people.json")
I get this error when I run that last line. I don't get it, the file is right there... Any ideas?
Py4JJavaError: An error occurred while calling o238.json.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 37, ip-172-31-7-160.us-west-2.compute.internal): java.io.FileNotFoundException: File file:/home/ubuntu/Notebooks/people.json does not exist
Make sure the file is available on the worker nodes. The best way is to use a shared file system (NFS, HDFS). See the External Datasets section of the Spark documentation.
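A relative path such as people.json ends up as a file: path on the local disk (as the error message shows), and each executor then looks for that same path on its own machine. A sketch that makes the resolved location explicit; local_file_uri is a hypothetical helper, and the HDFS URI is an invented example of a shared location, not a path from the question.

```python
from pathlib import Path


def local_file_uri(relative_path):
    """Show the absolute file: URI a relative local path resolves to on this machine."""
    return Path(relative_path).resolve().as_uri()


# Every worker would need a file at this same location
print(local_file_uri("people.json"))

# Hypothetical shared location that all nodes can reach
shared_path = "hdfs://namenode:8020/user/ubuntu/people.json"
# df = sqlContext.read.json(shared_path)
```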