As my Flask app should not write anything to my database, I set up Flask-SQLAlchemy to reflect my database. This way I do not have to change my models when I change my schema:
# app/__init__.py
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

def create_app():
    app = Flask(__name__)
    db.init_app(app)
    with app.app_context():
        db.Model.metadata.reflect(db.engine)
    return app
# app/models.py
from app import db

class Data(db.Model):
    __table__ = db.Model.metadata.tables['data']
This all works fine and dandy. But now I want to implement tests using unittest, and I could not find anything about how that is supposed to work. I am used to creating a new SQLite database for testing, but I don't have any models to write there. What is the standard procedure here? Do you copy everything to SQLite? If so, how?
There's no general rule for this situation: your database is decoupled from your application, so you need to somehow get a copy of the database's schema to recreate locally.
Many database engines provide a way to dump a database schema to a file which in turn can be used to load a schema onto another server (or onto the same server with a different name).
If you want to stick to using Python and SQLAlchemy tools you could populate the database metadata via reflection on your production database, then use the metadata to create the tables on your local database.
Something like this. On the production server:
import pickle
import sqlalchemy as sa
engine = sa.create_engine(PRODUCTION_DATABASE_URI)
metadata = sa.MetaData()
metadata.reflect(engine)
# Save the metadata so that it can be transferred to another machine.
with open('metadata.pkl', 'wb') as f:
    pickle.dump(metadata, f)
Then, locally:
# Restore the metadata object
with open('metadata.pkl', 'rb') as f:
    metadata = pickle.load(f)
engine = sa.create_engine(TEST_DATABASE_URI)
# Create the tables
metadata.create_all(engine)
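For the unittest side, a minimal sketch of how the pickled metadata could be used in a test case; this assumes metadata.pkl is available to the test run and that the test database is a throwaway SQLite file, and note that dialect-specific column types reflected from the production database may not translate cleanly to SQLite:
# tests/test_data.py (sketch)
import pickle
import unittest
import sqlalchemy as sa

TEST_DATABASE_URI = 'sqlite:///test.db'

class DataTestCase(unittest.TestCase):
    def setUp(self):
        # Recreate the production schema on the local throwaway database.
        with open('metadata.pkl', 'rb') as f:
            self.metadata = pickle.load(f)
        self.engine = sa.create_engine(TEST_DATABASE_URI)
        self.metadata.create_all(self.engine)

    def tearDown(self):
        # Drop everything again so each test starts from a clean schema.
        self.metadata.drop_all(self.engine)
        self.engine.dispose()

if __name__ == '__main__':
    unittest.main()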
I was unable to find this problem among the numerous similar-sounding Stack Overflow questions on "how to read a CSV into a PySpark dataframe?" (see the list of similar but different questions at the end).
The CSV file in question resides in the /tmp directory of the cluster's driver; note that this CSV file is intentionally NOT in Databricks DBFS cloud storage. Using DBFS will not work for the use case that led to this question.
Note: I am trying to get this working on Databricks Runtime 10.3 with Spark 3.2.1 and Scala 2.12. First, create a small test CSV on the driver:
y_header = ['fruit','color','size','note']
y = [('apple','red','medium','juicy')]
y.append(('grape','purple','small','fresh'))
import csv
with open('/tmp/test.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow(y_header)
    w.writerows(y)
Then use Python's os module to verify the file was created:
import os
list(filter(lambda f: f == 'test.csv',os.listdir('/tmp/')))
Now verify that the Databricks dbutils API can see the file; note that you have to use the file:/// prefix:
dbutils.fs.ls('file:///tmp/test.csv')
Now, as an optional step, specify a dataframe schema for Spark to apply to the CSV file:
from pyspark.sql.types import StructType, StructField, StringType

csv_schema = StructType([
    StructField('fruit', StringType()),
    StructField('color', StringType()),
    StructField('size', StringType()),
    StructField('note', StringType()),
])
Now define the PySpark dataframe:
x = spark.read.csv('file:///tmp/test.csv',header=True,schema=csv_schema)
The above line runs with no errors, but remember that, due to lazy execution, the Spark engine still has not read the file. So next we give Spark a command that forces it to evaluate the dataframe:
display(x)
And the error is:
FileReadException: Error while reading file file:/tmp/test.csv. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
Caused by: FileNotFoundException: File file:/tmp/test.csv does not exist. . .
Digging into the error I found this: java.io.FileNotFoundException: File file:/tmp/test.csv does not exist. I already tried restarting the cluster; the restart did not clear the error.
But I can prove the file does exist; for some reason Spark and Java are just unable to access it, because I can read the same file with pandas without a problem:
import pandas as p
p.read_csv('/tmp/test.csv')
So how do I get Spark to read this CSV file?
Appendix: list of similar Spark read-CSV questions I searched through that did not answer my question: 1 2 3 4 5 6 7 8
I guess the Databricks file loader doesn't recognize the absolute path /tmp/.
You can try the following workaround:
Read the file from its path into a pandas DataFrame.
Pass the pandas DataFrame to Spark using the createDataFrame function.
Code:
import pandas as pd

df_pd = pd.read_csv('file:///tmp/test.csv')
sparkDF = spark.createDataFrame(df_pd)
sparkDF.display()
Output: the displayed Spark dataframe containing the two rows from test.csv.
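If you want the resulting Spark dataframe to carry the explicit schema from the question rather than whatever Spark infers from the pandas dtypes, createDataFrame also accepts a schema argument; a small sketch reusing csv_schema:
import pandas as pd

df_pd = pd.read_csv('/tmp/test.csv')
# csv_schema is the StructType defined in the question; the pandas columns are
# coerced to those Spark types during conversion.
sparkDF = spark.createDataFrame(df_pd, schema=csv_schema)
sparkDF.display()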
I made email contact with a Databricks architect, who confirmed that Databricks can only read locally (from the cluster) in a single-node setup.
So DBFS is the only option for random reading/writing of text data files on a typical cluster that contains more than one node.
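For completeness, a minimal sketch of the DBFS route, assuming a standard multi-node cluster where the /dbfs FUSE mount is available, and reusing y_header, y, and csv_schema from the question:
import csv

# Write the CSV through the /dbfs FUSE mount so it lands in DBFS storage that
# every node of the cluster can reach, then point Spark at the dbfs:/ path.
with open('/dbfs/tmp/test.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(y_header)
    w.writerows(y)

x = spark.read.csv('dbfs:/tmp/test.csv', header=True, schema=csv_schema)
display(x)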
How do you import an h5 model from Foundry into a Code Workbook?
I want to use the Hugging Face library as shown below, and in its documentation the from_pretrained method expects a path or URL to where the pretrained model lives.
I would ideally like to download the model onto my local machine, upload it to Foundry, and have Foundry read in said model.
For reference, I'm trying to do this in Code Workbook or Code Authoring. It looks like you can work directly with files from there, but I've read the documentation and the given example was for a CSV file, whereas this model contains a variety of files in h5 and json format. I am wondering how I can access these files and have them passed into the from_pretrained method from the transformers package.
Relevant links:
https://huggingface.co/transformers/quicktour.html
Pre-trained Model:
https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/tree/main
Thank you!
I've gone ahead and added the transformers (Hugging Face) package onto the platform.
As for uploading the model files, you can follow these steps:
Use your dataset with the model-related files as an input to your Code Workbook transform.
Use Python's raw file access to access the contents of the dataset: https://docs.python.org/3/library/filesys.html
Use Python's built-in tempfile module to build a folder and add the files from step 2: https://docs.python.org/3/library/tempfile.html#tempfile.mkdtemp , https://www.kite.com/python/answers/how-to-write-a-file-to-a-specific-directory-in-python
Pass the temporary folder (tempfile.mkdtemp() returns its absolute path) to the from_pretrained method:
import os
import tempfile

# TF.DistilSequenceClassification / TF.Tokenizer in the original sketch were stand-ins;
# for the DistilBERT SST-2 model in the question the concrete transformers classes are:
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification

def sample(dataset_with_model_folder_uploaded):
    full_folder_path = tempfile.mkdtemp()
    all_file_names = ['config.json', 'tf_model.h5', 'ETC.ot', ...]
    for file_name in all_file_names:
        # Copy each model file out of the dataset into the temp folder; use binary
        # mode since files like tf_model.h5 are not text.
        with dataset_with_model_folder_uploaded.filesystem().open(file_name, 'rb') as f:
            path_of_file = os.path.join(full_folder_path, file_name)
            with open(path_of_file, 'wb') as new_file:
                new_file.write(f.read())
    model = TFDistilBertForSequenceClassification.from_pretrained(full_folder_path)
    tokenizer = AutoTokenizer.from_pretrained(full_folder_path)
    return model, tokenizer
Thanks,
I have a Firebase Realtime Database and would like to import some JSON. However, to import data, Firebase seems to want to delete all existing data in the database. I don't want to do that; I just want to add data by importing it. Is this possible?
It's possible if you write code to read the JSON file and perform the necessary updates against the existing data. There is no automatic process for this.
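A minimal sketch of that approach with the Python Admin SDK; the file names and database URL here are hypothetical placeholders:
import json
import firebase_admin
from firebase_admin import credentials, db

# Hypothetical credential and data files; update() merges the given keys into
# the existing data instead of replacing the whole tree like an import would.
cred = credentials.Certificate('service-account.json')
firebase_admin.initialize_app(cred, {'databaseURL': 'https://YOUR-DB.firebaseio.com'})

with open('data.json') as f:
    data = json.load(f)

db.reference('/').update(data)
Note that each top-level key passed to update() is replaced wholesale, so use deeper slash-delimited paths as dictionary keys (for example 'users/alice') if you need finer-grained merging.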
My backend code, using Python 2.7, is able to convert from a dataframe to JSON using df.to_json(), but I need to export this JSON into a MySQL database, since the frontend code using Angular 2 is JavaScript.
import pandas as pd
from sqlalchemy import create_engine
df.to_csv("abc.csv")
df.to_json("abc_json.json")
engine = create_engine('mysql+mysqldb://user:pw@sbc.mysql.pythonanywhere-services.com/abc$default')
df.to_sql(name='KLSE', con=engine, if_exists='replace')
The code above runs without problems, but I want the data in the MySQL database to be in JSON format so that the frontend code can query it.
I could not find a related link on Google or Stack Overflow with similar issues. Thanks for the help.
For my dev workflow purposes, I'd like to create a new OrientDB database from a JSON schema, on the fly. I don't believe this is natively supported in OrientDB. Are there any existing solutions that do this: provide a JSON schema, point it at an OrientDB instance, and it auto-creates the database (edges, vertices, indexes, and perhaps some sample data)?
I ended up creating a .sh script to create the DB on the fly. The .sh file looks something like:
# (file: createmydb.sh)
# script to create my database declaratively
set echo true
# use this to ignore errors and continue, if needed
# set ignoreErrors true
# create database
create database plocal:../databases/MyDB root root plocal graph
# create User vertex
create class User extends V
create property User.Email STRING
create property User.Firstname STRING
...
And then call it like:
/usr/local/src/orientdb/bin/console.sh createmydb.sh
This works well for my purposes. The DB creation script is very easy to read and can be modified easily, and I am sure it is very backwards compatible (which may not have been the case with importing an exported JSON version of the DB schema).
So far I've found that pre-loading the schema using an external definition stored in either JSON or OSQL has been most successful for me. Currently I am using an OSQL script that contains a whole bunch of CREATE CLASS ... and CREATE PROPERTY ... commands. It works well enough.
Pretty soon I'll have to start supporting dynamic changes to the data model, at which point I will have to write code to read a JSON schema definition and convert that to appropriate calls into OrientDB, either through the Blueprints API or through SQL batches.
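A minimal sketch of that conversion step, assuming a hypothetical JSON layout that maps class names to a superclass plus property/type pairs, and emitting the same style of OSQL commands as the script above:
import json

# Hypothetical layout:
# {"User": {"extends": "V", "properties": {"Email": "STRING", "Firstname": "STRING"}}}
with open('schema.json') as f:
    schema = json.load(f)

statements = []
for class_name, spec in schema.items():
    statements.append('create class {} extends {}'.format(class_name, spec.get('extends', 'V')))
    for prop, prop_type in spec.get('properties', {}).items():
        statements.append('create property {}.{} {}'.format(class_name, prop, prop_type))

# The generated commands can be written to a console script (as above) or sent
# to OrientDB as an SQL batch.
print('\n'.join(statements))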
I've not found a tool that does what you need "automatically." If you find one, please let me (and everyone else here) know.