PySpark: Unable to import CSV file in Zeppelin instance - csv

I'm unable to run the following lines of code.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_t = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3a://Bucket_name/Train - Copy.csv')
It throws the error below:
AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
I tried restarting the interpreter, but it didn't help.
Can someone please help with this issue?
Thanks,
Naseer

It seems the Hive metastore is not running. You can try starting the service:
hive --service metastore
Alternatively, you can use the following code to read the CSV, which doesn't use SQLContext:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Reading CSV") \
    .getOrCreate()

df_t = spark.read.csv('s3a://Bucket_name/Train - Copy.csv', header=True, inferSchema=True)
df_t.show()
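Note that spark.read.csv has been built into Spark since 2.0, so the external com.databricks.spark.csv package is no longer needed; a SparkSession created without Hive support also does not need to instantiate the Hive metastore client.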

Related

GLUE - o93.getDynamicFrame. com.mysql.cj.jdbc.Driver ERROR

I am trying to connect to MySQL. I have uploaded the corresponding Java JAR we are using, mysql-connector-java-5.1.49.jar, to an S3 bucket. I am using the following code to access MySQL, and it fails with the error:
An error occurred while calling o93.getDynamicFrame.
com.mysql.cj.jdbc.Driver
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import boto3
import json
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
spark.conf.set("jars", "s3://xxxxx/jdbc-drivers/mysql-connector-java-5.1.49.jar")
client = boto3.client("secretsmanager" , region_name = "eu-west-1")
get_secret_value_response = client.get_secret_value(SecretId = "etl-1")
secret = get_secret_value_response["SecretString"]
secret=json.loads(secret)
username = secret.get("mysql_username")
password = secret.get("mysql_password")
url = secret.get("mysql_url")
table = secret.get("mysql_table")
connection_mysql_options_source_session = {
    "url": url,
    "dbtable": table,
    "user": username,
    "password": password,
    "customJdbcDriverS3Path": "s3://xxxxx/jdbc-drivers/mysql-connector-java-5.1.49.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"
}
# Read from JDBC databases with custom driver
df_session = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql_options_source_session)
df_session.printSchema()
In the job details section, I have referenced the JAR libs, and I didn't define any connection in the connection section of the job properties. I can't figure out why I am getting this error.
The strange thing is that I can connect with a crawler, the Data Catalog, and also a direct connection to the same server, but via the notebook and script I can't.
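One mismatch worth checking (my assumption, not something confirmed in this thread): Connector/J 5.1.x JARs only contain com.mysql.jdbc.Driver, while com.mysql.cj.jdbc.Driver exists from Connector/J 8.x onward, so a sketch of self-consistent options with the 5.1.49 JAR would be:
connection_mysql_options_source_session = {
    "url": url,
    "dbtable": table,
    "user": username,
    "password": password,
    # Sketch, not a confirmed fix: the class name below is the 5.1.x driver
    # class matching the 5.1.49 JAR; com.mysql.cj.jdbc.Driver is the 8.x class.
    "customJdbcDriverS3Path": "s3://xxxxx/jdbc-drivers/mysql-connector-java-5.1.49.jar",
    "customJdbcDriverClassName": "com.mysql.jdbc.Driver"
}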

Generate HTML and JSON reports for a Selenium Python automation test using pytest

I have a Selenium Python automation test and it works fine. Now I want to generate HTML and JSON reports, with screenshots in the report, using pytest. I am new to automation and Python, so I am not very aware of how it's done.
Following is my code:
test_screenshot.py
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pytest_html
from selenium.common.exceptions import InvalidSessionIdException

def test_Openurl(setup):
    driver = setup["driver"]
    url = setup["url"]
    try:
        driver.get(url)
    except Exception as e:
        print(e)  # exception objects have no .message attribute in Python 3
    assert driver.current_url == url
    driver.save_screenshot("ss.png")
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    driver.save_screenshot("ss1.png")
    driver.close()
conftest.py
import pytest
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

def pytest_addoption(parser):
    parser.addoption("--url", action="store", default="https://google.com/")

@pytest.fixture()
def setup(pytestconfig):
    s = Service("C:/Users/Yash/Downloads/chromedriver_win32/chromedriver.exe")
    driver = webdriver.Chrome(service=s)
    driver.maximize_window()
    yield {"driver": driver, "url": pytestconfig.getoption("url")}
I ran this using:
pytest test_screenshot.py --url "https://www.netflix.com/in/"
The test case passes. How do I generate the HTML and JSON reports?
I tried this:
pytest -v -s --json-report --json-report-indent=4 --json-report-file=report/report.json --html=report/report.html test_screenshot.py
but got this error:
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --json-report --json-report-indent=4 --json-report-file=report/report.json
inifile: None
You need to install these two plugins: pytest-json-report (https://pypi.org/project/pytest-json-report/) and pytest-html (https://pypi.org/project/pytest-html/). The "unrecognized arguments" error means the plugin providing the --json-report options is not installed.
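After installing them (for example with pip install pytest-html pytest-json-report), the same pytest command should be recognized. To additionally embed the screenshots in the HTML report, a minimal conftest.py hook along these lines can work; this is a sketch assuming pytest-html 3.x, where extras are attached via report.extra:
import pytest
import pytest_html

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    # Runs for every test phase; after the call phase, attach the screenshot
    # saved by the test (ss.png) to the HTML report as an embedded image.
    outcome = yield
    report = outcome.get_result()
    extra = getattr(report, "extra", [])
    if report.when == "call":
        extra.append(pytest_html.extras.image("ss.png"))
        report.extra = extra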

How to run Django and Spark application

I am working on a Spark application and I want to create a REST API in Django. Below is my code:
from django.shortcuts import render
from django.http import Http404
from rest_framework.views import APIView
from rest_framework.decorators import api_view
from rest_framework.response import Response
from rest_framework import status
from django.http import JsonResponse
from django.core import serializers
from django.conf import settings
import json
from pyspark import SparkContext, SparkConf, SQLContext

sc = SparkContext()
sql = SQLContext(sc)
df = sql.read.format("jdbc").options(
    url="jdbc:mysql://127.0.0.1:3306/demo",
    driver="com.mysql.cj.jdbc.Driver",
    dbtable="tablename",
    user="xyz",
    password="abc"
).load()
totalrecords = df.count()

# Create your views here.
@api_view(["GET"])
def Demo(request):
    try:
        a = str(totalrecords)
        return JsonResponse(a, safe=False)
    except ValueError as e:
        return Response(e.args[0], status.HTTP_400_BAD_REQUEST)
I want to know how to run this code. I tried "python manage.py runserver" directly, which is not working. How do I run this Django API together with Spark, using spark-submit with all the required Spark JAR files?
To run this code you have to use spark-submit:
spark-submit --jars mysql.jar manage.py runserver 0.0.0.0:8000
or
spark-submit manage.py runserver
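If you would rather keep plain python manage.py runserver working, a minimal sketch (assuming pyspark is pip-installed in the Django environment; the JAR path is a placeholder, not from the question) is to build the session explicitly and hand it the MySQL driver JAR:
from pyspark.sql import SparkSession

# Sketch: create the session in the Django code and register the driver JAR
# via spark.jars, so the app does not depend on spark-submit flags.
spark = SparkSession.builder \
    .appName("DjangoSparkAPI") \
    .config("spark.jars", "/path/to/mysql-connector-java-5.1.49.jar") \
    .getOrCreate()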

com.mysql.jdbc.Driver not found in spark2 scala

I am using a Jupyter Notebook with a Scala kernel. Below is my code to import a MySQL table into a DataFrame:
val sql="""select * from customer"""
val df_customer = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306/ccfd")
.option("driver", "com.mysql.jdbc.Driver")
.option("dbtable", s"( $sql ) t")
.option("user", "root")
.option("password", "xxxxxxx")
.load()
Below is the error:
Name: java.lang.ClassNotFoundException
Message: com.mysql.jdbc.Driver
StackTrace: at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:79)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:79)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:79)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
Can anyone share a working code snippet here? I am using Spark 2, and a session named spark is ready when I start the kernel in a new notebook.
Thank you in advance.

Using JDBC to read a MySQL table in Spark Scala: warehouse error

I am trying to read a MySQL table using Spark Scala. Following is the code I tried:
val dataframe_mysql = sqlContext.read.format("jdbc")
.option("url","jdbc:mysql://xx.xx.xx.xx:xx")
.option("driver", "com.mysql.jdbc.Driver")
.option("dbtable", "schema.xxxx")
.option("user", "xxxx").option("password", "xxxxx").load()
but I am getting a warehouse path error, as follows:
Warehouse path is 'file:/C:/Users/Owner/eclipse-workspace/stProject/spark-warehouse/'.
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:72)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45) at