PySpark 2.4: issue passing a properties file to spark-submit (MySQL)

I have a PySpark program which connects to a MySQL db successfully and reads a table. Now I am trying to pass the database credentials from a properties file instead of embedding them in the code, but I am not able to make it work.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
# spark-submit --packages mysql:mysql-connector-java:8.0.13 workWithMySQL.py
spark = SparkSession.builder.appName("MySQL connection").getOrCreate()

# create spark context from spark session
sc = spark.sparkContext

# read from mysql
# configuration details
hostname = "localhost"
jdbcport = 3306
dbname = "TEST"
username = "kanchan@localhost"
password = "password"
mysql_url = "jdbc:mysql://{0}:{1}/{2}?user={3}&password={4}".format(hostname, jdbcport, dbname, username, password)
mysql_driver = "com.mysql.jdbc.Driver"
query = "(select * from cats) t1_alias"
df4 = spark.read.format("jdbc").options(driver=mysql_driver, url=mysql_url, dbtable=query).load()
df4.show()
Now I have created a properties file jdbc.properties at $SPARK_HOME/conf:
spark.mysql.user kanchan@localhost
spark.mysql.password password
and added it to the spark-submit call:
spark-submit --packages mysql:mysql-connector-java:8.0.13 --files $SPARK_HOME/conf/jdbc.properties workWithMySQL.py
Then I replaced the assignments:
username=sc.getConf.getOption("spark.mysql.user")
password=sc.getConf.getOption("spark.mysql.password")
When run, it throws an error saying the object has no attribute getOption. I could not locate the appropriate documentation for this. Can anyone help?
Further, is it possible to encrypt the credentials or ensure data security by any other means?

The method getOption should be replaced with the method get:
username=sc.getConf().get("spark.mysql.user")
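Note that --files only ships the file to the executors' working directories; for custom spark.* keys to end up on the SparkConf they generally have to be loaded as Spark properties, for example via --properties-file. A minimal sketch under that assumption (key names as in the question, the defaults are placeholders):
# spark-submit --packages mysql:mysql-connector-java:8.0.13 --properties-file $SPARK_HOME/conf/jdbc.properties workWithMySQL.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MySQL connection").getOrCreate()
sc = spark.sparkContext

# Values loaded from jdbc.properties are now visible on the SparkConf;
# the second argument to get() is a fallback default.
username = sc.getConf().get("spark.mysql.user", "default_user")
password = sc.getConf().get("spark.mysql.password", "default_password")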

Related

Palantir Foundry: how to allow a dynamic number of inputs in compute (Code Repository)

I have a folder where I will upload one file every month. The file will have the same format in every month.
First problem:
The idea is to concatenate all the files in this folder into one file. Currently I am hardcoding the filenames (filename[0], filename[1], filename[2], ...), but imagine later I have 50 files: should I explicitly add them all to the transform_df decorator? Is there any other way to handle this?
Second problem:
Currently I have, say, 4 files (2021_07, 2021_08, 2021_09, 2021_10), and whenever I add the file representing the 2021_12 data I want to avoid changing the code.
If I add input_5 = Input(path_to_2021_12_do_not_exists), the code will not run and gives an error.
How can I implement the code so that it handles future files and ignores an input that does not exist yet, without manually adding a new value to my code each month?
Thank you
# from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame
from functools import reduce

input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']
filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]


@transform_df(
    Output("/Company/Project_name/Data/clean/File_concat"),
    input_1=Input(filenames[0]),
    input_2=Input(filenames[1]),
    input_3=Input(filenames[2]),
    input_4=Input(filenames[3]),
)
def compute(input_1, input_2, input_3, input_4):
    input_dfs = [input_1, input_2, input_3, input_4]
    dfs = []

    def transformation_input(df):
        # some transformation
        return df

    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))

    dfs = reduce(DataFrame.unionByName, dfs)
    return dfs
This question comes up a lot; the simple answer is that you don't. Defining datasets and executing a build on them are two different steps executed at different stages.
Whenever you commit your code and run the checks, your overall Python code is executed during the renderSchrinkwrap stage, except for the compute part. This allows Foundry to discover what datasets exist and to publish them.
Publishing involves creating your dataset, and whatever is inside your compute function is published into the jobspec of the dataset, so Foundry knows what code to execute whenever you run a build.
Once you hit build on the dataset, Foundry will only pick up whatever is on the jobspec and execute it. Any other code has already run during your checks, and it has run just once.
So any dynamic input/output would require you to re-run the checks on your repo, which means that some code change would have had to happen, since checks are part of the CI process, not part of the build.
Taking a step back, assuming each of your input files has the same schema, Foundry would expect you to have all of those files in the same dataset as append transactions.
This might not be possible, though, if, for instance, the only indication of the "year" of the data is embedded in the filename; but your sample code indicates that you expect all of these datasets to have the same schema and to union together easily.
You can do this manually through the Dataset Preview - just use the Upload File button or drag-and-drop the new file into the Preview window - or, if it's an "end user" workflow, with a File Upload Widget in a Workshop app. You may need to coordinate with your Foundry support team if this widget isn't available.
A bit late to the post, but for anyone interested in an answer to most of the question: dynamically determining file names from within a folder is not doable, although some level of dynamic input is possible, as follows:
# from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame
# from functools import reduce
from transforms.verbs.dataframes import union_many  # use this instead of reduce

input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']
filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]
inputs = {'input{}'.format(index): Input(filename) for (index, filename) in enumerate(filenames)}


@transform(
    output=Output("/Company/Project_name/Data/clean/File_concat"),
    **inputs
)
def compute(output, **kwargs):
    # Extract dataframes from input datasets
    input_dfs = [dataset_df.dataframe() for dataset_name, dataset_df in kwargs.items()]
    dfs = []

    def transformation_input(df):
        # some transformation
        return df

    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))

    # dfs = reduce(DataFrame.unionByName, dfs)
    unioned_dfs = union_many(*dfs)
    # With the transform decorator the result has to be written explicitly.
    output.write_dataframe(unioned_dfs)
A couple of points:
Created a dynamic input dict.
That dict is read into the transform using **kwargs.
Using the transform decorator (not transform_df), we can extract the dataframes ourselves.
(Not in the question) Combine multiple dataframes using the union_many function from the transforms.verbs library.

In Django rest framework: GDAL_ERROR 1: b'PROJ: proj_as_wkt: Cannot find proj.db '

I am using these versions for my project:
Django==2.2.7
djangorestframework==3.10.3
mysqlclient==1.4.5
In my database I work with geometric types, and for this I have configured the library: GDAL-3.0.2-cp38-cp38-win32
To run this library, I have to include these variables in the Django settings file:
GEOS_LIBRARY_PATH
GDAL_LIBRARY_PATH
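For reference, a minimal sketch of how these are typically declared in settings.py (the library file names and paths below are placeholders and depend on the local GDAL/GEOS install):
# settings.py -- illustrative only
import os

GDAL_LIBRARY_PATH = r'APP\backend\env\Lib\site-packages\osgeo\gdal300.dll'  # placeholder
GEOS_LIBRARY_PATH = r'APP\backend\env\Lib\site-packages\osgeo\geos_c.dll'   # placeholder

# PROJ locates proj.db through PROJ_LIB, which should point to the directory
# that contains proj.db, not to the proj.db file itself.
os.environ.setdefault('PROJ_LIB', r'APP\backend\env\Lib\site-packages\osgeo\data\proj')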
Now on my models, I do the following import:
from django.contrib.gis.db import models
For types like:
coordinates = models.GeometryField(db_column='Coordinates', blank=True, null=True)
It seems that the queries work correctly, but when creating a new element, I get the following error:
GDAL_ERROR 1: b'PROJ: proj_as_wkt: Cannot find proj.db '
But after this error, the object is persisted correctly in the database.
I would like to know how to solve this error.
I have not found information online; I have only tried to declare a new variable in the Django settings file:
PROJ_LIB = 'APP / backend / env / Lib / site-packages / osgeo / data / proj / proj.db'
But the error still appears, and it may also cause problems in the production environment, which runs an openSUSE image.
Why can't it find proj.db?
How do I solve it?

How to use SQLAlchemy Utils in a SQLAlchemy model

I'm trying to create a user model that uses UUID as primary key:
from src.db import db  # SQLAlchemy instance
import sqlalchemy_utils
import uuid


class User(db.Model):
    __tablename__ = 'user'
    id = db.Column(sqlalchemy_utils.UUIDType(binary=True), primary_key=True, nullable=False)
But when I generate the migrations I receive:
File "/home/pc/Downloads/project/auth/venv/lib/python3.6/site-packages/alembic/runtime/environment.py", line 836, in run_migrations
self.get_context().run_migrations(**kw)
File "/home/pc/Downloads/project/auth/venv/lib/python3.6/site-packages/alembic/runtime/migration.py", line 330, in run_migrations
step.migration_fn(**kw)
File "/home/pc/Downloads/project/auth/migrations/versions/efae4166f832_.py", line 22, in upgrade
sa.Column('id', sqlalchemy_utils.types.uuid.UUIDType(length=16), nullable=False),
NameError: name 'sqlalchemy_utils' is not defined`
I have tried to explicitly reference the module I'm using, and also to use an 'internal' implementation from SQLAlchemy itself.
Note: if I manually import sqlalchemy_utils in /migrations/versions/efae4166f832_.py and remove the length that is generated automatically in sa.Column('id', sqlalchemy_utils.types.uuid.UUIDType(length=16), nullable=False), it works fine.
I generate the migrations using a generate.py script:
from src import create_app
from src.db import db
from flask_migrate import Migrate

# Models
from src.user.models.user import User

app = create_app()
migrate = Migrate(app, db)
Note: MySQL engine.
I expect that when I generate the migration, it generates a user table that uses the UUID type from SQLAlchemy Utils as its primary key.
You just have to add:
import sqlalchemy_utils
to your script.py.mako file inside the migrations folder.
Thanks, Marco, but I have already fixed it. I put the import sqlalchemy_utils line inside env.py and script.py.mako, and I also added the following function:
def render_item(type_, obj, autogen_context):
    """Apply custom rendering for selected items"""
    if type_ == "type" and isinstance(obj, sqlalchemy_utils.types.uuid.UUIDType):
        # Add import for this type
        autogen_context.imports.add("import sqlalchemy_utils")
        autogen_context.imports.add("import uuid")
        return "sqlalchemy_utils.types.uuid.UUIDType(), default=uuid.uuid4"
    # Default rendering for other objects
    return False
inside env.py, and in the same file I set render_item=render_item in the run_migrations_online function:
context.configure(
    ...,
    render_item=render_item,
    ...
)
I researched how to do this automatically, but I couldn't find anything that helped.
The order of the operations matters:
export FLASK_APP=manage.py
flask db init
Do the tutorial above
flask db migrate
flask db upgrade
Background
It would be ideal if you didn't have to go and manually edit each migration file with an import sqlalchemy_utils statement.
Looking at the Alembic documentation, script.py.mako is "a Mako template file which is used to generate new migration scripts." Therefore, you'll need to re-generate your migration files, with Mako already importing sqlalchemy_utils as part of the migration file generation.
Fix
If possible, remove your old migrations (they're probably broken anyway), and add the import sqlalchemy_utils like so to your script.py.mako file:
from alembic import op
import sqlalchemy as sa
import sqlalchemy_utils #<-- line you add
${imports if imports else ""}
then just rerun your alembic migrations:
alembic revision --autogenerate -m "create initial tables"
When you go to look at your migration file, you should see sqlalchemy_utils already imported via the Mako script.
Hope that helps.
Adding import sqlalchemy_utils to the script.py.mako file will automatically add this import to all generated migration files and resolve the issue.
from alembic import op
import sqlalchemy as sa
import sqlalchemy_utils
${imports if imports else ""}
Add the import sqlalchemy_utils line to the newly-created migrations/versions/{hash}_my_comment.py file. However, this will only fix the problem for that specific step of the migration. If you expect that you'll be making lots of changes to columns which reference sqlalchemy_utils, you should probably do something more robust like Walter's suggestion. Even then, though, it looks like you may need to add code to properly deal with each column type you end up using.
NB: Despite seeing the suggestion in multiple places of just adding the import line to the script.py.mako file, that did not work for me.

How to import/export RavenDB data from file?

I have an application that uses embedded RavenDB. I would like to be able to import/export specific sets of documents (a document with all of its nested/referenced documents) to a file.
My ideal function would work like:
var session = store.OpenSession();
MyDocument d1 = session.Load<MyDocument>(someId);
ImportExport.Export(store, d1, "file.xyz");
and then with a different IDocumentStore:
ImportExport.Import(store, "file.xyz");
var session = store.OpenSession();
MyDocument d2 = session.Load<MyDocument>(someId);
And of course d1 should equal d2 in every way.
AFAIK the Smuggler utility exports all documents at once.
My only other idea was using Json.NET to serialize the MyDocument object, save it to a file, and then deserialize it (and store it). I have a feeling this is the way to go, but will it work when MyDocument references many other documents?
I ended up using the "Raven.Smuggler.exe" program to get things done using "large" ravendump files. It's unclear to me, however, whether this import process [drops and replaces from scratch] or [merges] the data -- I would perform a database drop and re-create before doing the import process below, to guarantee data integrity.
Download a copy of a matching RavenDB build.
Extract it someplace simple (ex: C:\RavenDB-Build-2956)
Invoke the Smuggler.
Command (setup/replace variable placeholders as needed)
C:\RavenDB-Build-2956\Smuggler\Raven.Smuggler.exe in $instance $dump "-u=$user" "-p=$plainpassword" -d=$dbname
Example variables
$instance = http://localhost/ravendb_iis_web_app/ or maybe http://localhost:8080/
$dump = C:\dump.ravendump
$user = User
$plainpassword = Password
$dbname = MyDatabase
You have the Smuggler API to handle this. See:
Export:
https://github.com/ravendb/ravendb/blob/d54bfba11995e915cf94f35ef3887fcb7d747033/Raven.Database/Smuggler/SmugglerDatabaseApi.cs#L163
Import:
https://github.com/ravendb/ravendb/blob/d54bfba11995e915cf94f35ef3887fcb7d747033/Raven.Database/Smuggler/SmugglerDatabaseApi.cs#L90

How to upload multiple JSON files into CouchDB

I am new to CouchDB. I need to get 60 or more JSON files in a minute from a server.
I have to upload these JSON files to CouchDB individually as soon as I receive them.
I installed CouchDB on my Linux machine.
I hope someone can help me with my requirement.
If possible, can someone help me with pseudo-code?
My idea is to write a Python script for uploading all the JSON files to CouchDB. Every JSON file must become its own document, and the data present in the JSON must be inserted into CouchDB as-is (in the same format and with the same values as in the file).
Note: these JSON files are transactional; one file is generated every second, so I need to read each file and upload it in the same format to CouchDB, and on successful upload archive the file into a different folder on the local system.
Python program to parse the JSON and insert it into CouchDB:
import sys
import glob
import errno, time, os
import couchdb, simplejson
import json
from pprint import pprint

couch = couchdb.Server()  # Assuming localhost:5984
# couch.resource.credentials = (USERNAME, PASSWORD)
# If your CouchDB server is running elsewhere, set it up like this:
couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']

path = 'C:/Users/Desktop/CouchDB_Python/Json_files/*.json'
# dirPath = 'C:/Users/VijayKumar/Desktop/CouchDB_Python'
files = glob.glob(path)
for file1 in files:
    # dirs = os.listdir(dirPath)
    file2 = glob.glob(file1)
    for name in file2:  # 'file' is a builtin type, 'name' is a less-ambiguous variable name.
        try:
            with open(name) as f:  # No need to specify 'r': this is the default.
                # sys.stdout.write(f.read())
                json_data = f
                data = json.load(json_data)
                db.save(data)
                pprint(data)
                json_data.close()
            # time.sleep(2)
        except IOError as exc:
            if exc.errno != errno.EISDIR:  # Do not fail if a directory is found, just ignore it.
                raise  # Propagate other kinds of IOError.
I would use the CouchDB bulk API, even though you have specified that you need to send them to the db one by one. For example, implementing a simple queue that gets sent out every, say, 5-10 seconds via a bulk-doc call will greatly increase the performance of your application.
There is obviously a quirk in that, namely that you need to know the IDs of the docs that you want to get from the DB. But for the PUTs it is perfect. (This is not entirely true: you can get ranges of docs using a bulk operation if the IDs you are using for your docs can be sorted nicely.)
From my experience working with CouchDB, I have a hunch that you are dealing with transactional documents in order to compile them into some sort of summed result and act on that data accordingly (maybe creating the next transactional doc in the series). For that you can rely on CouchDB by using 'reduce' functions on the views you create. It takes a little practice to get a reduce function working properly, and it is highly dependent on what you actually want to achieve and what data you are prepared to emit from the view, so I can't really provide you with more detail on that.
So in the end the app logic would go something like this:
get _design/someDesign/_view/yourReducedView
calculate new transaction
add transaction to queue
onTimeout
    send all in transaction queue
If I got the first part wrong about why you are using transactional docs, all that would really change in my app logic is the part where you get those transactional docs.
Also, before writing your own 'reduce' function, have a look at the built-in ones (they are a lot faster than anything outside of the db engine can do):
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
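As a rough illustration of the queue-plus-bulk idea, here is a minimal sketch using the couchdb package's Database.update(), which wraps _bulk_docs (the database name and the 5-second flush interval are assumptions):
import time
import couchdb

couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']  # assumed database name

queue = []          # docs waiting to be flushed
FLUSH_INTERVAL = 5  # seconds, tune as needed
last_flush = time.time()

def enqueue(doc):
    """Collect docs and flush them in one _bulk_docs call once the interval elapses."""
    global last_flush
    queue.append(doc)
    if time.time() - last_flush >= FLUSH_INTERVAL:
        db.update(queue)      # one bulk request instead of one save() per doc
        del queue[:]          # clear the queue
        last_flush = time.time()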
EDIT:
Since you are starting out, I strongly recommend having a look at the CouchDB Definitive Guide.
NOTE FOR LATER:
Here is one hidden stone (well, maybe not so much a hidden stone as something that is not obvious for a newcomer to look out for). When you write a reduce function, make sure that it does not produce too much output for a query without boundaries. This will slow down the entire view dramatically, even when you pass reduce=false when reading from it.
So you need to get JSON documents from a server and send them to CouchDB as you receive them. A Python script would work fine. Here is some pseudo-code:
loop (until no more docs)
    get new JSON doc from server
    send JSON doc to CouchDB
end loop
In Python, you could use requests to send the documents to CouchDB and probably to get the documents from the server as well (if it is using an HTTP API).
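For example, a minimal sketch with requests (the URL, database name, and credentials are placeholders):
import requests

COUCH_URL = 'http://localhost:5984'
DB_NAME = 'mydb'                # placeholder database name
AUTH = ('admin', 'password')    # placeholder credentials

def upload(doc):
    """POST one JSON document to CouchDB; CouchDB assigns an _id if the doc has none."""
    resp = requests.post('{}/{}'.format(COUCH_URL, DB_NAME), json=doc, auth=AUTH)
    resp.raise_for_status()
    return resp.json()          # contains the 'id' and 'rev' of the stored document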
You might want to check out the pycouchdb module for Python 3. I've used it myself to upload lots of JSON objects into a CouchDB instance. My project does pretty much the same as you describe, so you can take a look at my project Pyro on GitHub for details.
My class looks like this:
import sys
import pycouchdb


class MyCouch:
    """ COMMUNICATES WITH COUCHDB SERVER """

    def __init__(self, server, port, user, password, database):
        # ESTABLISHING CONNECTION
        self.server = pycouchdb.Server("http://" + user + ":" + password + "@" + server + ":" + port + "/")
        self.db = self.server.database(database)

    def check_doc_rev(self, doc_id):
        # CHECKS REVISION OF SUPPLIED DOCUMENT
        try:
            rev = self.db.get(doc_id)
            return rev["_rev"]
        except Exception as inst:
            return -1

    def update(self, all_computers):
        # UPDATES DATABASE WITH JSON STRING
        try:
            result = self.db.save_bulk(all_computers, transaction=False)
            sys.stdout.write(" Updating database")
            sys.stdout.flush()
            return result
        except Exception as ex:
            sys.stdout.write("Updating database")
            sys.stdout.write("Exception: ")
            print(ex)
            sys.stdout.flush()
            return None
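A hypothetical usage of the wrapper above (host, port, credentials, and database name are placeholders):
couch = MyCouch("localhost", "5984", "admin", "password", "mydb")
docs = [{"_id": "doc1", "value": 1}, {"_id": "doc2", "value": 2}]
result = couch.update(docs)  # bulk-saves the documents via save_bulk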
Let me know in case of any questions - I will be more than glad to help if you find some of my code usable.