Spark Read.json can't find file - json

Hey all, I have a 1 Master and 1 Slave node standalone Spark cluster on AWS. I have a folder in my home directory called ~/Notebooks. This is where I launch Jupyter notebooks and connect to Jupyter in my browser. I also have a file in there called people.json (a simple JSON file).
I try running this code:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('Practice').setMaster('spark://ip-172-31-2-186:7077')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.json("people.json")
I get this error when I run that last line. I don't get it; the file is right there... Any ideas?
Py4JJavaError: An error occurred while calling o238.json.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 37, ip-172-31-7-160.us-west-2.compute.internal): java.io.FileNotFoundException: File file:/home/ubuntu/Notebooks/people.json does not exist

Make sure the file is available on the worker nodes. The best way is to use a shared file system (NFS, HDFS). Read the External Datasets documentation.
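For example, a minimal sketch of the shared-file-system route, assuming the file has first been copied to HDFS (the namenode address and HDFS path below are placeholders, not from the original post):
# Assumes the file was copied to HDFS beforehand, e.g. with: hdfs dfs -put ~/Notebooks/people.json /data/people.json
df = sqlContext.read.json("hdfs://<namenode-host>:8020/data/people.json")
df.show()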

Related

Databricks csv cannot find local file

In a program I have a CSV extracted from Excel, and I need to upload the CSV to HDFS and save it in Parquet format. The Python version or Spark version doesn't matter; no Scala please.
Almost all discussions I came across are about Databricks; however, it seems it cannot find the file. Here is the code and error:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load("file:///home/rxie/csv_out/wamp.csv")
Error:
java.io.FileNotFoundException: File file:/home/rxie/csv_out/wamp.csv
does not exist
The file path:
ls -la /home/rxie/csv_out/wamp.csv
-rw-r--r-- 1 rxie linuxusers 2896878 Nov 12 14:59 /home/rxie/csv_out/wamp.csv
Thank you.
I found the issue now!
The reason it errors out with file not found is actually correct: I was using a Spark context with setMaster("yarn-cluster"), which means all worker nodes will look for the CSV file. Of course, all worker nodes (except the one starting the program, where the CSV resides) do not have this file and hence error out. What I really should do is use setMaster("local").
FIX:
conf = SparkConf().setAppName('test').setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
csv = "file:///home/rxie/csv_out/wamp.csv"
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load(csv)
Yes, you are right, the file should be present on all worker nodes.
Well, you can still read a local file in YARN cluster mode. You just need to add your file using addFile:
spark.sparkContext.addFile("file:///your local file path ")
Spark will copy the file to each node where an executor is created, so it can process your file in cluster mode as well.
I am using Spark 2.3, so you may need to adjust your Spark context accordingly, but the addFile method stays the same.
Try this with YARN (cluster mode) and let me know if it works for you.
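To illustrate what addFile does, here is a hedged sketch of my own (not from the answer) that checks each executor gets its own copy; SparkFiles.get resolves the local path of the shipped file on whichever node runs the task:
from pyspark import SparkFiles

spark.sparkContext.addFile("file:///home/rxie/csv_out/wamp.csv")

def read_first_line(_):
    # Runs on an executor; SparkFiles.get returns that executor's local copy
    with open(SparkFiles.get("wamp.csv")) as f:
        return [f.readline()]

print(spark.sparkContext.parallelize([0], 1).flatMap(read_first_line).collect())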

Can't connect to MySQL database from pyspark, getting JDBC error

I am learning pyspark and trying to connect to a MySQL database.
But I am getting a java.lang.ClassNotFoundException: com.mysql.jdbc.Driver exception while running the code. I have spent a whole day trying to fix it; any help would be appreciated :)
I am using PyCharm Community Edition with Anaconda and Python 3.6.3.
Here is my code:
from pyspark import SparkContext,SQLContext
sc= SparkContext()
sqlContext= SQLContext(sc)
df = sqlContext.read.format("jdbc").options(
url ="jdbc:mysql://192.168.0.11:3306/my_db_name",
driver = "com.mysql.jdbc.Driver",
dbtable = "billing",
user="root",
password="root").load()
Here is the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o27.load.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
This got asked 9 months ago at the time of writing, but since there's no answer, here it goes. I was in the same situation, searched Stack Overflow over and over, and tried different suggestions, but the answer is absurdly simple in the end: you just have to COPY the MySQL driver into the "jars" folder of Spark!
Download it here: https://dev.mysql.com/downloads/connector/j/5.1.html
I'm using the 5.1 version even though 8.0 exists, because I had some problems running the latest version with Spark 2.3.2 (and also other problems running Spark 2.4 on Windows 10).
Once downloaded, you can just copy it into your Spark jars folder:
E:\spark232_hadoop27\jars\ (use your own drive:\folder_name -- this is just an example)
You should have two files:
E:\spark232_hadoop27\jars\mysql-connector-java-5.1.47-bin.jar
E:\spark232_hadoop27\jars\mysql-connector-java-5.1.47.jar
After that, the following code launched through PyCharm or a Jupyter notebook should work (as long as you have a MySQL database set up, that is):
import findspark
findspark.init()
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe_mysql = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/uoc2",
    driver="com.mysql.jdbc.Driver",
    dbtable="company",
    user="root",
    password="password").load()
dataframe_mysql.show()
Bear in mind, I'm currently working locally with my Spark setup, so there are no real clusters involved, and also no "production" kind of code that gets submitted to such a cluster. For something more elaborate, this answer could help: MySQL read with PySpark
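As a hedged follow-up to the snippet above (not part of the original answer): the JDBC source also accepts a parenthesized subquery as dbtable, which lets MySQL do part of the work before rows reach Spark. The connection details and table name are the ones from the example; the subquery itself is just an illustration:
# Anything valid in a FROM clause can be used as dbtable, e.g. an aliased subquery
dataframe_sample = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/uoc2",
    driver="com.mysql.jdbc.Driver",
    dbtable="(SELECT * FROM company LIMIT 10) AS company_sample",
    user="root",
    password="password").load()
dataframe_sample.show()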
On my computer, @Kondado's solution works only if I change the driver in the options:
driver = 'com.mysql.cj.jdbc.Driver'
I am using Spark 2.4.0 on Windows. I downloaded mysql-connector-java-8.0.15.jar, the Platform Independent version, from here, and copied it to 'C:\spark-2.4.0-bin-hadoop2.7\jars\'.
My code in PyCharm looks like this:
#import findspark # not necessary
#findspark.init() # not necessary
from pyspark import SparkConf, SparkContext, sql
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sqlContext = sql.SQLContext(sc)
source_df = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/database1',
    driver='com.mysql.cj.jdbc.Driver',  # com.mysql.jdbc.Driver
    dbtable='table1',
    user='root',
    password='****').load()
print (source_df)
source_df.show()
I don't know how to add the jar file to the classpath (can someone tell me how?), so I put it in the SparkSession config and it works fine.
spark = SparkSession \
    .builder \
    .appName('test') \
    .master('local[*]') \
    .enableHiveSupport() \
    .config("spark.driver.extraClassPath", "<path to mysql-connector-java-5.1.49-bin.jar>") \
    .getOrCreate()
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost/<database_name>") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", <table_name>) \
    .option("user", <user>) \
    .option("password", <password>) \
    .load()
df.show()
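Regarding the classpath question above, a hedged alternative (not from the original answer) is the spark.jars setting, which ships the listed jars to both the driver and the executors when the session is created; the jar path is the same placeholder as above:
from pyspark.sql import SparkSession

# Sketch only: spark.jars takes a comma-separated list of jar paths
spark = SparkSession \
    .builder \
    .appName('test') \
    .master('local[*]') \
    .config("spark.jars", "<path to mysql-connector-java-5.1.49-bin.jar>") \
    .getOrCreate()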
This worked for me: pyspark with MSSQL.
java version is 1.7.0_191
pyspark version is 2.1.2
Download the below jar files
sqljdbc41.jar
mssql-jdbc-6.2.2.jre7.jar
Paste the above jars inside the jars folder in the virtual environment:
test_env/lib/python3.6/site-packages/pyspark/jars
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practise').getOrCreate()
url = 'jdbc:sqlserver://your_host_name:your_port;databaseName=YOUR_DATABASE_NAME;useNTLMV2=true;'
df = spark.read.format('jdbc') \
    .option('url', url) \
    .option('user', 'your_db_username') \
    .option('password', 'your_db_password') \
    .option('dbtable', 'YOUR_TABLE_NAME') \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .load()

How to import a package from a local jar in pyspark?

I am using pyspark to do some work on a CSV file, hence I need to import the package from spark-csv_2.10-1.4.0.jar, downloaded from https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar
I downloaded the jar to my local machine due to a proxy issue.
Can anyone tell me the right way to refer to a local jar?
Here is the code I use:
pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar
It takes me to the pyspark shell as expected; however, when I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('hdfs://dev-icg/user/spark/routes.dat')
(routes.dat is already uploaded to HDFS at hdfs://dev-icg/user/spark/routes.dat)
it gives me this error:
: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
If I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('routes.dat')
I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.lang.NoClassDefFoundError: Could not initialize class
com.databricks.spark.csv.package$
Can anyone help me sort this out? Thank you very much. Any clue is appreciated.
The correct way to do this would be to add the options when you start the shell (say, if you are starting a spark shell). Note that --packages also pulls in the jar's transitive dependencies, such as commons-csv, which is where the missing org/apache/commons/csv/CSVFormat class lives; passing only the spark-csv jar via --jars does not:
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-class-path /path/to/csvfilejar.jar
I have not used the databricks csv jar directly, but I used a netezza connector to spark where they mention using this option:
https://github.com/SparkTC/spark-netezza

Make permanent connection with MySQL (Apache Spark)

If I execute these commands in spark-shell it correctly returns the data in the "people" table:
val dataframe_mysql = spark.sqlContext.read.format("jdbc").option("url", "jdbc:mysql://localhost/db_spark").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "people").option("user", "root").option("password", "****").load()
dataframe_mysql.show
The problem is that if I close spark-shell and reopen it, the connection to the database is not maintained.
As per Spark's documentation, the SparkContext and HiveContext are created inside the spark shell (when the spark-shell command is executed), with the HiveContext exposed as sqlContext. Since the connection is tied to that SQLContext, closing the spark shell means you can no longer access the SQLContext, and hence you can no longer connect; you have to recreate it in each new session.
Here's another reference:
When you run spark-shell, which is your interactive driver
application, it automatically creates a SparkContext defined as sc and
a HiveContext defined as sqlContext.

SQLAlchemy and adodbapi Database connection error

I'm attempting to connect to an MSSQL SQL Express 2012 database using SQLAlchemy 0.7.8 and adodbapi 2.4.2.2 on IronPython 2.7.3.
I am able to create a SQLAlchemy engine; however, when a query is made I get:
"TypeError: 'NoneType' object is unsubscriptable"
Traceback:
Traceback (most recent call last):
File "C:\Program Files (x86)\IronPython 2.7\Lib\site-packages\SQLAlchemy-0.7.8-py2.7.egg\sqlalchemy\engine\base.py", line 878, in __init__
File "C:\Program Files (x86)\IronPython 2.7\Lib\site-packages\SQLAlchemy-0.7.8-py2.7.egg\sqlalchemy\engine\base.py", line 2558, in raw_connection
File "C:\Program Files (x86)\IronPython 2.7\Lib\site-packages\SQLAlchemy-0.7.8-py2.7.egg\sqlalchemy\pool.py", line 183, in unique_connection
File "<string>", line 9, in <module>
File "C:\Program Files (x86)\IronPython 2.7\Lib\site-packages\SQLAlchemy-0.7.8-py2.7.egg\sqlalchemy\engine\base.py", line 2472, in connect
TypeError: 'NoneType' object is unsubscriptable
Code being used:
def conn():
    return adodbapi.connect('Provider=SQLOLEDB; Data Source=SERVER\SQLEXPRESS; '
                            'Initial Catalog=db; User ID=user; Password=pass;')

engine = create_engine('mssql+adodbapi:///', creator=conn,
                       echo=True, module=adodbapi)
adodbapi seems to work fine on its own, i.e. I can create a connection and then use a cursor to query without any problems; it seems to be something in sqlalchemy.
Anyone have any ideas?
And we have a workaround:
import adodbapi
from sqlalchemy.engine import create_engine
from sqlalchemy.orm import sessionmaker
import sqlalchemy.pool as pool

def connect():
    return adodbapi.connect('Provider=SQLOLEDB.1;Data Source=mypcname\SQLEXPRESS;'
                            'Initial Catalog=dbname;User ID=user; Password=pass;')

mypool = pool.QueuePool(connect)
conn = mypool.connect()
curs = conn.cursor()
curs.execute('select 1')  # anything that forces open the connection

engine = create_engine('mssql+adodbapi://', module=adodbapi, pool=mypool)
Session = sessionmaker()
Session.configure(bind=engine)
sess = Session()
With this my session object works as normal.
I'm probably not using the adodbapi dialect as intended by whoever made it, but I can't find any documentation, so this is what I've gone with for now.
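For instance, a quick sanity check through the session built above (my own hedged usage sketch, not part of the original workaround):
# Run a trivial query through the pooled connection to confirm the session works
result = sess.execute('select 1')
print(result.fetchall())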
Pretty sure adodbapi doesn't work with SQLAlchemy.
The adodbapi dialect is not implemented for 0.6 at this time.
Scroll to the very bottom (this is the 0.7.x documentation); I also checked the 0.8 documentation and it says the same thing.
Sounds like you'll have to change which driver you're using.
I use SQLAlchemy to connect to a PostgreSQL database using psycopg2. I am not sure, but from reading the documentation, I think you need to download pyodbc; it seems to be better supported than adodbapi. Once you have installed it, try the following statement to create the engine:
engine = create_engine('mssql+pyodbc://user:pass@host/db')
Or you can check out different ways of writing the connection string here.
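A hedged sketch of that pyodbc-style URL with an explicit ODBC driver parameter, assuming a CPython environment where pyodbc is installed (the driver name is an assumption, not from the answer):
from sqlalchemy.engine import create_engine

# Hostname-style mssql+pyodbc URL; the ODBC driver name below is an assumption
engine = create_engine('mssql+pyodbc://user:pass@host/db?driver=SQL+Server+Native+Client+11.0')
print(engine.execute('select 1').fetchall())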