Import pre-trained Deep Learning Models into Foundry Codeworkbooks - palantir-foundry

How do you import an h5 model into a Code Workbook in Foundry?
I want to use the Hugging Face transformers library as shown below; in its documentation, the from_pretrained method expects a path or URL to where the pretrained model lives.
Ideally, I would like to download the model onto my local machine, upload it to Foundry, and have Foundry read in that model.
For reference, I'm trying to do this in Code Workbook or Code Authoring. It looks like you can work directly with files from there, but the example in the documentation I've read was for a CSV file, whereas this model consists of a variety of files in h5 and json format. I'm wondering how I can access these files and pass them into the from_pretrained method from the transformers package.
Relevant links:
https://huggingface.co/transformers/quicktour.html
Pre-trained Model:
https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/tree/main
Thank you!

I've gone ahead and added the transformers (Hugging Face) package to the platform.
As for uploading the model files, you can follow these steps:
1. Use the dataset containing the model-related files as an input to your Code Workbook transform.
2. Use Python's raw file access to read the contents of the dataset: https://docs.python.org/3/library/filesys.html
3. Use Python's built-in tempfile to create a folder and write the files from step 2 into it: https://docs.python.org/3/library/tempfile.html#tempfile.mkdtemp , https://www.kite.com/python/answers/how-to-write-a-file-to-a-specific-directory-in-python
4. Pass the temporary folder (tempfile.mkdtemp() returns its absolute path) to the from_pretrained method, as in the code below.
import os
import tempfile

from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer

def sample(dataset_with_model_folder_uploaded):
    # Create a temporary folder to hold a local copy of the model files
    full_folder_path = tempfile.mkdtemp()
    all_file_names = ['config.json', 'tf_model.h5', 'ETC.ot']  # ...list every file in the model dataset
    # Copy each file out of the dataset into the temporary folder
    # (binary mode, since h5 files are binary)
    for file_name in all_file_names:
        with dataset_with_model_folder_uploaded.filesystem().open(file_name, 'rb') as f:
            path_of_file = os.path.join(full_folder_path, file_name)
            with open(path_of_file, 'wb') as new_file:
                new_file.write(f.read())
    # from_pretrained accepts the absolute path of the local folder
    model = TFDistilBertForSequenceClassification.from_pretrained(full_folder_path)
    tokenizer = DistilBertTokenizer.from_pretrained(full_folder_path)
    return model, tokenizer
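Once the files are in place, the returned model and tokenizer can be used like any other transformers pair. A minimal usage sketch (the sample sentence is only an illustration; outputs[0] holds the classification logits):

import tensorflow as tf

model, tokenizer = sample(dataset_with_model_folder_uploaded)
# Tokenize a sentence and run it through the TF model
inputs = tokenizer('This works nicely', return_tensors='tf')
outputs = model(inputs)
# Softmax turns the logits into class probabilities
probs = tf.nn.softmax(outputs[0], axis=-1)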
Thanks,

finBert Model - Config JSON File - Outputs Nothing

This is for running the ProsusAI finBert Model.
(https://github.com/ProsusAI/finBERT - GitHub)
(https://huggingface.co/ProsusAI/finbert - HuggingFace)
I downloaded the pytorch_model.bin file and used the config.json file that is shown in its GitHub repository.
There is no download button for the config.json file, so I created the JSON file from Python and saved it as "config.json". I then placed both the pytorch_model.bin and config.json files into a folder called "FinBertProsus". The program itself I saved as "FinBert Model", outside of that folder, in the directory where all my Python programs live.
When I run the "FinBert Model" program, it outputs nothing; I get nothing on my screen in the shell. Why is it outputting nothing, and how do I correct this? (I have tried passing a name/path key-value pair to the model in the config.json file, and also passing the architectures key-value pair, but I get the same result: nothing is output from the program.)
Also, would it be necessary to download and install Git LFS for this type of model?
The config.json file in the GitHub repository is different from the one in the Hugging Face repository. When running with the Hugging Face version, I get an error for the config file, so I decided to use the one from the GitHub repository, as it is the cleaner config.json, and the exercise I am following recommends working from the actual GitHub account. Below are the code for the FinBert Model program and the JSON config file.
FinBert Model Program below,
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('FinBertProsus/pytorch_model.bin', config='FinBertProsus/config.json', num_labels=3)
inputs = tokenizer('We had a great year', return_tensors='pt')
outputs = model(**inputs)
config.json file is below,
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
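One observation on the program above: outputs = model(**inputs) computes the classification scores, but the script never prints anything, so even a successful run shows nothing in the shell. A minimal sketch of how the result could be inspected (the label order is defined by finBERT's own config, so mapping the three columns to sentiment names is left out here):

import torch

outputs = model(**inputs)
logits = outputs[0]  # raw classification scores, shape (1, num_labels)
probs = torch.nn.functional.softmax(logits, dim=-1)
print(probs)  # nothing appears in the shell unless the result is printed explicitly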

How do I read a file from a FileSystem with pyarrow.csv.read_csv?

I want to read a single CSV file in a Google Cloud Storage bucket with pyarrow. How do I do this?
I can create a FileSystem object with gcsfs, but I don't see a way to provide this to pyarrow.csv.read_csv.
Do I need to create some sort of file stream from the file system? What's the best way to do this?
import gcsfs
import pyarrow.csv as csv
fs = gcsfs.GCSFileSystem(project='foo')
csv.read_csv("bucket/foo/bar.csv", filesystem=fs)
TypeError: read_csv() got an unexpected keyword argument 'filesystem'
Using pyarrow version 6.0.1
I'm guessing you are working with this doc. You're correct that the approach listed there does not work with read_csv, because there is no filesystem parameter. We can still generally do this, but the process is a bit different.
Pyarrow has its own filesystem abstraction. If you have a pyarrow filesystem, you can first open a file and then use that file handle to read the CSV:
import pyarrow.csv as csv
import pyarrow.fs as fs

# Open the file through the pyarrow filesystem, then hand the stream to read_csv
local_fs = fs.LocalFileSystem()
with local_fs.open_input_file('foo/bar.csv') as csv_file:
    table = csv.read_csv(csv_file)
Unfortunately, a gcsfs.GCSFileSystem is not a "pyarrow filesystem" but you have a few options.
The method gcsfs.GCSFileSystem.open can give you a "python file object" which you can use as input to pyarrow.csv.read_csv.
import gcsfs
import pyarrow.csv as csv

fs = gcsfs.GCSFileSystem(project='foo')
with fs.open("bucket/foo/bar.csv", 'rb') as csv_file:
    table = csv.read_csv(csv_file)
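A third option, if you prefer to stay inside the pyarrow filesystem interface: recent pyarrow versions (including 6.0.1) can wrap any fsspec-compatible filesystem such as gcsfs.GCSFileSystem via FSSpecHandler. A sketch, assuming the same bucket path as above:

import gcsfs
import pyarrow.csv as csv
import pyarrow.fs as fs

# Wrap the fsspec filesystem so pyarrow can treat it as one of its own
gcs = gcsfs.GCSFileSystem(project='foo')
wrapped = fs.PyFileSystem(fs.FSSpecHandler(gcs))
with wrapped.open_input_file("bucket/foo/bar.csv") as csv_file:
    table = csv.read_csv(csv_file)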

Export from Sketch App to JSON

I want to be able to export layer names and properties from Sketch to JSON format. I think I can figure out how to pull the info I need from Sketch, but I haven't started coding anything because I haven't been able to find any info about this export question.
I'm wondering if anyone can confirm whether Sketch can only export its supported formats, or whether export to JSON is possible. I don't want to dive into this project only to find out that I can't end up with a JSON file.
I have been trying to work with this as well, and it turns out there are a few ways to get access to a JSON file in Sketch:
1. Use the npm package sketch2json.
2. It turns out that if you unzip the .sketch file, there is a JSON file hiding inside:
unzip sketch-header.sketch
This creates a folder called 'pages' with the .json file inside. To get the layer names, you can read the .json file into a string and parse it; the layer names can then be collected like this:
const obj = JSON.parse(fileString);
obj.layers.forEach((layer) => {
  console.log(layer.name);
});
If you rename the .sketch file to a .zip file, you will see as many JSON files as your Sketch document has pages, inside a folder called "pages", along with some BMP preview images and other JSON files holding user and document information.
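Since a .sketch file is just a zip archive, the extraction can also be scripted rather than done by hand. A minimal Python sketch (the file name sketch-header.sketch is carried over from the example above, and the layers key mirrors the JS snippet):

import json
import zipfile

# A .sketch file is a zip archive; read each page's JSON without unzipping to disk
with zipfile.ZipFile('sketch-header.sketch') as archive:
    for name in archive.namelist():
        if name.startswith('pages/') and name.endswith('.json'):
            page = json.loads(archive.read(name))
            for layer in page.get('layers', []):
                print(layer.get('name'))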

Pandas module read_csv reads file within Eclipse+pydev while fail if I run standalone

I'm currently developing a GUI using Python and Tkinter.
One of the tasks is to open and read some *.csv files.
In order to perform this task I have written the following code:
ReadData = pd.read_csv(ResultFile, skipinitialspace=True).values
While I'm running the code within the IDE Eclipse+PyDev, everything works fine. But as soon as I run my code from a DOS window, i.e. python MainGrap.py, the code fails, stating that the file doesn't exist.
I first load the path to a file via self.Inp_Filename = askopenfilename(), then I create a list of the folders by means of the following function:
def PathDisintegrator(Inp_File):
    # Peel the path apart one component at a time, from the file name back to the drive
    Folders = os.path.split(Inp_File)
    LastFolder = Folders[1]
    RootPath = Folders[0]
    Dirs = []
    while LastFolder != '':
        Dirs.insert(0, LastFolder)
        Folders = os.path.split(RootPath)
        LastFolder = Folders[1]
        RootPath = Folders[0]
    Dirs.insert(0, RootPath[:-1])  # drive letter, without the trailing separator
    Dirs = Dirs[:-1]  # drop the file name, keeping only the folders
    return Dirs
Then I can recreate the full path to file via the following function:
def PathAndFile(Folders, File):
    # Rebuild the path by rejoining the folder list with the platform separator
    FileOut = ''
    for item in Folders:
        FileOut = FileOut + os.sep + item
    FileOut = FileOut[1:] + os.sep + File  # strip the leading separator, append the file name
    return FileOut
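As an aside, these two helpers largely reimplement what os.path already provides; a minimal sketch of the same split-and-rejoin round trip using the standard library (the sample path is hypothetical):

import os

# Hypothetical path, standing in for the result of askopenfilename()
full_path = os.path.join('D:' + os.sep, 'Abaqus_Runs', 'result.csv')

# os.path.split and os.path.join do the disassembly and reassembly in one call each
directory, file_name = os.path.split(full_path)
rebuilt = os.path.join(directory, file_name)
assert rebuilt == full_path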
I have printed out the file path even within the parser of Pandas and it looks fine to me: D:\Abaqus_Runs\DOWLEX_PET_LAMINATE_PROTO_REFERENCE_SI_Version_2_Revision_2_MDangle0_Rate0_01_MOVING_NODE_out.csv
The problem here is that your Python environment in Eclipse can see the folder where your CSV resides, but the terminal one does not.
You can observe what the system paths are by doing:
In [331]:
import sys
sys.path
Out[331]:
['',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\python34.zip',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\DLLs',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages\\win32',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages\\win32\\lib',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages\\Pythonwin',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages\\IPython\\extensions']
So you need to provide a complete path, or append the directory where the CSV resides to your sys.path. Note that backslashes must be escaped, e.g. 'c:\\data\\my.csv', but forward slashes work fine as well, e.g. 'c:/data/my.csv'.
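Since the file is opened via a path that may be relative, the current working directory is usually the real difference between the two launch environments; a short diagnostic sketch (the file name here is hypothetical):

import os

result_file = 'results.csv'  # hypothetical relative path

# The directory Python was started from - relative paths resolve against this
print(os.getcwd())
# The absolute path that pd.read_csv would actually try to open
print(os.path.abspath(result_file))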

Read a CSV file from a stream using Roo in Rails 4

I have another question on this here: Open a CSV file from S3 using Roo on Heroku, but I'm not getting any bites - so a reword:
I have a CSV file in an S3 bucket.
I want to read it using Roo in a Heroku-based app (i.e. no local file access).
How do I open the CSV file from a stream?
Or is there a better tool for doing this?
I am using Rails 4 and Ruby 2. Note that I can successfully open the CSV for reading if I post it from a form. How can I adapt this to grab the file from an S3 bucket?
Short answer - don't use Roo.
I ended up using the standard CSV library. Working with small CSV files, you can very simply read the file contents into memory using something like this:

require 'csv'

body = file.read
CSV.parse(body, col_sep: ",", headers: true) do |row|
  row_hash = row.to_hash
  field = row_hash["FieldName"]
end

When reading a file passed in from a form, just reference the params:
file = params[:file]
body = file.read
To read from S3 you can use the AWS gem:

s3 = AWS::S3.new(access_key_id: ENV['AWS_ACCESS_KEY_ID'], secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'])
bucket = s3.buckets['BUCKET_NAME']
# check each object in the bucket
bucket.objects.each do |obj|
  import_file = obj.key
  body = obj.read
  # call the same style import code as above...
end
I put some code together based on this: Make Remote Files Local With Ruby Tempfile, and Roo seems to work OK when handed a temp file. I couldn't get it to work with S3 directly. I don't particularly like the copy approach, but my processing runs on Delayed Job, and I want to keep the Roo features a little more than I dislike the file copy. Plain CSV files work without fishing out the encoding info, but XLS files would not.