HuggingFace Pipeline exceeds 512 tokens of BERT - deep-learning

I tested the following HuggingFace pipeline:
https://huggingface.co/deepset/gelectra-base-germanquad
While testing it, I noticed that the pipeline has no limit for the input size.
I passed inputs of approximately 5,400 tokens and it always gave me good results (even for answers located at the end of the input).
I tried to do something similar (importing the model instead of using the pipeline) by splitting the text into several subtexts of at most 512 tokens. However, with this approach I often get wrong results...
What approach does the HuggingFace pipeline follow here in order to exceed the 512-token limit?
For example, let's retrieve the content of a Wikipedia page, which is way over 512 tokens. Here doc is the retrieved content...
url="https://de.wikipedia.org/wiki/Gesch%C3%A4ftsbericht"
r = requests.get(url)
raw_html = r.text
doc = cleanhtml(raw_html)
question = "Wie teuer ist ein Geschäftsbericht?"
How the pipeline handles the prediction:
from transformers import pipeline
german_qa_model = "deepset/gelectra-base-germanquad"
qa_model = pipeline("question-answering", model=german_qa_model)
qa_model(question=question,context=doc)
>>> {'answer': 'über 100.000', 'score': 0.7242466807365417}
Loading the model itself:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained(german_qa_model)
model = AutoModelForQuestionAnswering.from_pretrained(german_qa_model)
encoding = tokenizer(question, doc, add_special_tokens=True, return_tensors="pt")
output = model(**encoding)
>>> Error: token indices sequence length is longer than the specified maximum sequence length for this model (10402 > 512)
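For what it's worth, the pipeline does not feed all 10,402 tokens to the model at once: it splits the context into overlapping windows (its max_seq_len and doc_stride arguments control the window and overlap sizes), scores candidate answer spans in every window, and returns the best one. Below is a minimal sketch of that sliding-window idea using the tokenizer's built-in overflow support; the window and stride values here are illustrative, and the span selection is simplified (the real pipeline also masks out question tokens and invalid spans).
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained(german_qa_model)
model = AutoModelForQuestionAnswering.from_pretrained(german_qa_model)

# Split the long context into overlapping 512-token windows.
encodings = tokenizer(
    question,
    doc,
    truncation="only_second",        # never truncate the question
    max_length=512,
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)
encodings.pop("overflow_to_sample_mapping", None)  # not a model input

with torch.no_grad():
    output = model(**encodings)

# Keep the best-scoring (start, end) span across all windows.
best_answer, best_score = None, float("-inf")
for i in range(encodings["input_ids"].shape[0]):
    start = int(output.start_logits[i].argmax())
    end = int(output.end_logits[i].argmax())
    score = float(output.start_logits[i][start] + output.end_logits[i][end])
    if start <= end and score > best_score:
        best_answer = tokenizer.decode(encodings["input_ids"][i][start:end + 1])
        best_score = score
print(best_answer, best_score)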

Related

DeepSORT with custom detectors

I have created a class that loads my custom weights for YOLO v5, v7, and v8. From this class I get xyxy values and the class name for each detection. Now I want to run Deep SORT on top of it, using only this class and no extra files. In my detection class I simply use torch.hub.load() and the basic YOLO v5/v7/v8 functions. I am looking for a specific and straightforward way to apply Deep SORT using my detection class. Is this possible? If it is, please tell me how; if not, what is the simplest method to implement Deep SORT?
import torch
import cv2

class YoloDetector:
    def __init__(self, conf_thold, device, weights_path, expected_objs):
        # load the custom weights; set the confidence level and GPU or CPU selection
        self._model = torch.hub.load("WongKinYiu/yolov7", "custom", f"{weights_path}", trust_repo=True)
        self._model.conf = conf_thold        # NMS confidence threshold
        self._model.classes = expected_objs  # (optional list) filter by class
        self._model.to(device)               # specifying device type

    def process_image(self, image):
        results = self._model(image)
        predictions = []  # final list to return all detections
        detection = results.pandas().xyxy[0]
        for i in range(len(detection)):
            # getting bbox and class name one by one
            class_name = detection["name"][i]
            xmin = detection["xmin"][i]
            ymin = detection["ymin"][i]
            xmax = detection["xmax"][i]
            ymax = detection["ymax"][i]
            # appending the values to the list as a dictionary
            predictions.append({'class': class_name,
                                'bbox': [int(xmin), int(ymin), int(xmax), int(ymax)]})
        return predictions
That is the code I am using to get detections. Now, how can I implement Deep SORT using this class?
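One straightforward option (an assumption on my side, not part of your code) is the third-party deep-sort-realtime package, which accepts plain detection tuples and therefore plugs directly into a detector class like yours. Note that Deep SORT needs a confidence score per detection, so process_image would also have to return one; the sketch below assumes a hypothetical 'conf' key in your prediction dicts.
# pip install deep-sort-realtime
import cv2
from deep_sort_realtime.deepsort_tracker import DeepSort

detector = YoloDetector(conf_thold=0.5, device="cuda",
                        weights_path="best.pt", expected_objs=None)
tracker = DeepSort(max_age=30)  # frames to keep a lost track alive

cap = cv2.VideoCapture("video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # deep-sort-realtime expects ([left, top, width, height], confidence, class)
    raw = []
    for det in detector.process_image(frame):
        x1, y1, x2, y2 = det['bbox']
        raw.append(([x1, y1, x2 - x1, y2 - y1], det.get('conf', 1.0), det['class']))
    tracks = tracker.update_tracks(raw, frame=frame)  # runs the appearance model internally
    for track in tracks:
        if not track.is_confirmed():
            continue
        x1, y1, x2, y2 = map(int, track.to_ltrb())
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, str(track.track_id), (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cap.release()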

Count the number of people having a property bounded by two numbers

The following code goes over the 10 pages of JSON returned by a GET request to the URL and checks how many records satisfy the condition that bloodPressureDiastole is between the specified limits. It does the job, but I was wondering if there is a better or cleaner way to achieve this in Python.
import urllib.request
import urllib.parse
import json

baseUrl = 'https://jsonmock.hackerrank.com/api/medical_records?page='
count = 0
for i in range(1, 11):
    url = baseUrl + str(i)
    f = urllib.request.urlopen(url)
    response = f.read().decode('utf-8')
    response = json.loads(response)
    lowerlimit = 110
    upperlimit = 120
    for elem in response['data']:
        bd = elem['vitals']['bloodPressureDiastole']
        if bd >= lowerlimit and bd <= upperlimit:
            count = count + 1
print(count)
You cannot access JSON content through attributes, because json.loads gives you a dict object (see the conversion table in the json docs). A dict provides access via the __getitem__ method (dict[key]) rather than __getattr__ (object.field), since keys may be any hashable objects, not only strings. Moreover, even strings cannot serve as attribute names if they start with digits or coincide with built-in dictionary methods.
Despite this, you can define your own custom class implementing the desired behavior for acceptable key names. json.loads has an object_hook argument that accepts any callable (function or class) taking a dict as its sole argument (not only the final result but every object in the JSON, recursively) and returning something in its place. If your JSON documents follow some template, you can define a class with predefined fields for the JSON content, and even with methods, in order to get a robust Python object that is part of your domain logic.
For instance, let's implement access through fields. I get the JSON content from the response.json method of requests, but it takes the same arguments as the json package. Comments in the code contain remarks on how to make your code more Pythonic.
from collections import Counter
from requests import get


class CustomJSON(dict):
    def __getattr__(self, key):
        return self[key]

    def __setattr__(self, key, value):
        self[key] = value


LOWER_LIMIT = 110  # Constants should be in uppercase.
UPPER_LIMIT = 120

base_url = 'https://jsonmock.hackerrank.com/api/medical_records'
# It is better to use special tools for handling URLs
# in order to avoid possible exceptions in the future.
# By the way, your option could look clearer with an f-string,
# which can put values from variables (and more) in place:
# url = f'https://jsonmock.hackerrank.com/api/medical_records?page={i}'

counter = Counter(normal_pressure=0)
# A plain integer would also do. A Counter is useful
# in case you need to count any other information as well.

for page_number in range(1, 11):
    records = get(
        base_url, params={"page": page_number}
    ).json(object_hook=CustomJSON)
    # Python has a pile of libraries for handling URL requests & responses.
    # urllib is a standard library rewritten from scratch for Python 3.
    # However, there is a more featured (connection pooling, redirections, proxies,
    # SSL verification &c.) & convenient third-party
    # (this is the only disadvantage) library: urllib3.
    # Based on it, requests provides an easier, more convenient & friendlier way
    # to work with URL requests. So I highly recommend using it
    # unless you are aiming for complex connections & URL processing.
    for elem in records.data:
        if LOWER_LIMIT <= elem.vitals.bloodPressureDiastole <= UPPER_LIMIT:
            counter["normal_pressure"] += 1

print(counter)

GridSearchCV without Cross Validation CV = 1

I have a special dataset, and this dataset can be trained with a 1% error. I need to do hyperparameter tuning for MLPRegressor without splitting the train set, meaning cv = 1. Is this possible with GridSearchCV?
One of the options for cv parameter is:
An iterable yielding (train, test) splits as arrays of indices.
So, if you have an input matrix X, a target vector y, an mlp estimator, and a parameter grid params, you can do just one train-test split.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
clf = GridSearchCV(mlp, params, cv=[(train_idx, test_idx)])
But keep in mind that using a single split for a hyperparameter sweep is bad practice: do not take many search steps with such a grid search, or you will overfit that one test fold.
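For completeness, here is a self-contained sketch of the idea; the dataset and the parameter grid are made up for illustration.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# a single (train, test) split, passed to GridSearchCV as the cv argument
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)

params = {"hidden_layer_sizes": [(50,), (100,)], "alpha": [1e-4, 1e-3]}
clf = GridSearchCV(MLPRegressor(max_iter=500), params, cv=[(train_idx, test_idx)])
clf.fit(X, y)
print(clf.best_params_, clf.best_score_)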

Trying to get CSV ready for keras model with tensorflow dataset

I have a Keras CNN model ready that expects [None, 20, 20, 3] arrays as input (20 is the image size here). On the other side, I have a CSV with 1200 (20*20*3) columns ready in my cloud storage.
I want to write an ETL pipeline with TensorFlow to obtain a [20, 20, 3] shaped tensor for each row in the CSV.
My code so far:
I have already spent days of work on this and feel confident that this approach might work out in the end.
import tensorflow as tf
BATCH_SIZE = 30
tf.enable_eager_execution()
X_csv_path = 'gs://my-bucket/dataX.csv'
X_dataset = tf.data.experimental.make_csv_dataset(X_csv_path, BATCH_SIZE, column_names=range(1200), header=False)
X_dataset = X_dataset.map(lambda x: tf.stack(list(x.values())))
iterator = X_dataset.make_one_shot_iterator()
image = iterator.get_next()
I would expect to get a [30, 1200] shape, but I get 1200 tensors of shape [30] instead. My idea is to read every line into a [1200] shaped tensor and then reshape it to a [20, 20, 3] tensor to feed my model with. Thanks for your time!
tf.data.experimental.make_csv_dataset creates an OrderedDict of column arrays. For your task I'd use tf.data.TextLineDataset.
def parse(line):
    # each element of a TextLineDataset is one line of the CSV file
    string = tf.strings.split([line], sep=',').values
    return string

dataset = tf.data.TextLineDataset('sample.csv').map(parse).batch(BATCH_SIZE)
for i in dataset:
    print(i)
This will output a tensor of shape (BATCH_SIZE, row_length), where row_length is the number of fields in a row of the CSV file. You can apply any additional preprocessing, depending on your task.
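Following up on that preprocessing remark: extending parse to also convert the string fields to floats and reshape each row into the [20, 20, 3] tensor the model expects could look like this sketch (it assumes 1200 comma-separated numeric columns, as in the question, and eager mode):
def parse(line):
    # split one CSV line into its fields, convert them to floats,
    # and reshape the 1200 values into a 20x20 RGB image tensor
    fields = tf.strings.split([line], sep=',').values
    values = tf.strings.to_number(fields, out_type=tf.float32)
    return tf.reshape(values, (20, 20, 3))

dataset = tf.data.TextLineDataset('sample.csv').map(parse).batch(BATCH_SIZE)
for images in dataset.take(1):
    print(images.shape)  # (BATCH_SIZE, 20, 20, 3)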

Variable length output in keras

I'm trying to create an autoencoder in keras with bucketing where the input and the output have different time steps.
model = Sequential()
#encoder
model.add(Embedding(vocab_size, embedding_size, mask_zero=True))
model.add(LSTM(units=hidden_size, return_sequences=False))
#decoder
model.add(RepeatVector(max_out_length))
model.add(LSTM(units=hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(num_class, activation='softmax')))
For the input there is no problem, as the network can accept inputs of different lengths as long as the whole batch has the same length. However, the problem is with the output size, as it is determined by the RepeatVector length, and there is no easy way to change it.
Is there a solution for such a problem?
If you mean "inputs with variable lengths" and "outputs with the same lengths as the inputs", you can do this:
Warning: this solution only works with batch size = 1
You will need to create an external loop and pass each sample as a numpy array with the exact length
You cannot use masking in this solution, and the right output depends on the correct length of the input
This is a working code using Keras + Tensorflow:
Imports:
from keras.layers import *
from keras.models import Model
import numpy as np
import keras.backend as K
from keras.utils.np_utils import to_categorical
Custom functions to use in Lambda layers:
#this function gets the length from the original input
#and stores it in the final output of the encoder
def storeLength(x):
    inputTensor = x[0]
    storeInto = x[1]  #the final output
    length = K.shape(inputTensor)[1]
    length = K.cast(length, K.floatx())
    length = K.reshape(length, (1, 1))
    #will put length as the first element in the final output
    return K.concatenate([length, storeInto])

#this function expands the length of the input in the decoder
def expandLength(x):
    #length is the first element in the encoded input
    length = K.cast(x[0, 0], 'int32')  #or int64 if necessary
    #the remaining elements are the actual data to be decoded
    data = x[:, 1:]
    #a tensor with shape (length,)
    length = K.ones_like(K.arange(0, length))
    #make both length tensor and data tensor 3D and with paired dimensions
    length = K.cast(K.reshape(length, (1, -1, 1)), K.floatx())
    data = K.reshape(data, (1, 1, -1))
    #this automatically repeats the elements based on the paired shapes
    return data * length
Creating the models:
I assumed the output is equal to the input, but since you're using an Embedding, I made num_class equal to the number of words.
For this solution we use branching, so I had to use the functional API Model. This will be way better later, because you will want to train with autoencoder.train_on_batch and then just encode with encoder.predict() or decode with decoder.predict().
vocab_size = 100
embedding_size = 7
num_class=vocab_size
hidden_size = 3
#encoder
inputs = Input(batch_shape = (1,None))
outputs = Embedding(vocab_size, embedding_size)(inputs)
outputs = LSTM(units=hidden_size, return_sequences=False)(outputs)
outputs = Lambda(storeLength)([inputs,outputs])
encoder = Model(inputs,outputs)
#decoder
inputs = Input(batch_shape=(1,hidden_size+1))
outputs = Lambda(expandLength)(inputs)
outputs = LSTM(units=hidden_size, return_sequences=True)(outputs)
outputs = TimeDistributed(Dense(num_class, activation='softmax'))(outputs)
decoder = Model(inputs,outputs)
#autoencoder
inputs = Input(batch_shape=(1,None))
outputs = encoder(inputs)
outputs = decoder(outputs)
autoencoder = Model(inputs,outputs)
#see each model's shapes
encoder.summary()
decoder.summary()
autoencoder.summary()
Just an example with fake data and the method that should be used for training:
inputData = []
outputData = []
for i in range(7, 10):
    inp = np.arange(i).reshape((1, i))
    inputData.append(inp)
    outputData.append(to_categorical(inp, num_class))

autoencoder.compile(loss='mse', optimizer='adam')

for epoch in range(1):
    for inputSample, outputSample in zip(inputData, outputData):
        print(inputSample.shape, outputSample.shape)
        autoencoder.train_on_batch(inputSample, outputSample)

for inputSample in inputData:
    print(autoencoder.predict(inputSample).shape)
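And, as mentioned above, the two halves can then be used separately; a quick sketch (shapes follow from the fake data above):
#encode one variable-length sample into a fixed-size vector
#(the hidden state plus the stored length as its first element)
encoded = encoder.predict(inputData[0])
print(encoded.shape)  # (1, hidden_size + 1)

#decode it back into a sequence of the original length
decoded = decoder.predict(encoded)
print(decoded.shape)  # (1, 7, num_class)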