Error while setting up roBERTa model in colab notebook - deep-learning

I am getting error while merging vocabulary and merge txt files for tokenizers designed for Tensorflow roBERTa. I attached the error snapshot!![enter image description here][1]
Code:
tokenizer = tokenizers.ByteLevelBPETokenizer(vocab_file='vocab_roberta_base.json',
merges_file='merges_roberta_base.txt', lowercase=True,add_prefix_space=True)
ERROR:
Exception Traceback (most recent call last)
<ipython-input-9-5dab9f2389e4> in <module>()
1 MAX_LEN = 96
----> 2 tokenizer = tokenizers.ByteLevelBPETokenizer(vocab_file='vocab_roberta_base.json',merges_file='merges_roberta_base.txt')
3 sentiment_id = {'positive': 1313, 'negative': 2430, 'neutral': 7974}
/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/byte_level_bpe.py in __init__(self, vocab_file, merges_file, add_prefix_space, lowercase, dropout, unicode_normalizer, continuing_subword_prefix, end_of_word_suffix)
31 dropout=dropout,
32 continuing_subword_prefix=continuing_subword_prefix or "",
---> 33 end_of_word_suffix=end_of_word_suffix or "",
34 )
35 )
Exception: expected ident at line 1 column 2

Related

Type Error when trying to save model in tensorflow python, getting trace 'Unrecognized type <class 'tensorflow.python.framework.ops.EagerTensor'>.'

I'm relatively new to python and machine learning and I'm trying to classify chest X-ray scans and I created a model that does that. However, when I'm trying to save the model I'm getting the error:
TypeError: Unable to serialize [2.0896919 2.1128857 2.1081853] to JSON. Unrecognized type <class 'tensorflow.python.framework.ops.EagerTensor'>.
The full Stack Trace is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_22092\3924096485.py in <module>
1 working_dir=os.getcwd()
2 subject='chest scans'
----> 3 save_model(subject, classes, img_size, f1score, working_dir)
~\AppData\Local\Temp\ipykernel_22092\541087647.py in save_model(subject, classes, img_size, f1score, working_dir)
3 save_id=f'{name}-{f1score:5.2f}.h5'
4 model_save_loc=os.path.join(working_dir, save_id)
----> 5 model.save(model_save_loc)
6 msg= f'model was saved as {model_save_loc}'
7 print_in_color(msg, (0,255,255), (100,100,100)) # cyan foreground
~\anaconda3\lib\site-packages\keras\utils\traceback_utils.py in error_handler(*args, **kwargs)
68 # To get the full stack trace, call:
69 # `tf.debugging.disable_traceback_filtering()`
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb
~\anaconda3\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
232 if cls is None:
233 cls = JSONEncoder
--> 234 return cls(
235 skipkeys=skipkeys, ensure_ascii=ensure_ascii,
236 check_circular=check_circular, allow_nan=allow_nan, indent=indent,
~\anaconda3\lib\json\encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)
~\anaconda3\lib\json\encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
TypeError: Unable to serialize [2.0896919 2.1128857 2.1081853] to JSON. Unrecognized type <class 'tensorflow.python.framework.ops.EagerTensor'>.
I have created this function to save the model:
def save_model(subject, classes, img_size, f1score, working_dir):
name=subject + '-' + str(len(classes)) + '-(' + str(img_size[0]) + ' X ' + str(img_size[1]) + ')'
save_id=f'{name}-{f1score:5.2f}.h5'
model_save_loc=os.path.join(working_dir, save_id)
model.save(model_save_loc)
msg= f'model was saved as {model_save_loc}'
print_in_color(msg, (0,255,255), (100,100,100)) # cyan foreground
I have the following packages installed for tensorflow.
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.8.1
tensorflow-estimator 2.8.0
tensorflow-io-gcs-filesystem 0.28.0
termcolor 1.1.0
keras 2.8.0
keras-nightly 2.5.0.dev2021032900
Keras-Preprocessing 1.1.2
Any help would be great! I am really not being able to convert the tensorflow EagerTensor to Json File for my ML model. Thanks!
I tried to save a ml model that I created and I failed and it's giving me an error when converting it to json file

OSError Traceback (most recent call last) ~\AppData\Local\Temp/ipykernel_8264/243236269.py in <module>

Trying to run code image to text
code below
image = Image.open("C:\Datasets\demo.jpg")
image = image.resize((300,150))
custom_config = r'-l eng --oem 3 --psm 6'
text = pytesseract.image_to_string(image,config=custom_config)
print(text)
#To Save The Text in Text file
filename = "C:\demo.txt"
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, "w") as f:
f.writ
error
OSError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8264/243236269.py in <module>
14 image = image.resize((300,150))
15 custom_config = r'-l eng --oem 3 --psm 6'
---> 16 text = pytesseract.image_to_string(image,config=custom_config)
17 print(text)
18
to run code file to save as txt

Hazm: POSTagger(): ArgumentError: argument 2: <class 'TypeError'>: wrong type

I have got an error for running the below code. May you give me some help?
from __future__ import unicode_literals
from hazm import *
tagger = POSTagger(model='resources/postagger.model')
tagger.tag(word_tokenize('ما بسیار کتاب موانیم'))
Error:
---------------------------------------------------------------------------
ArgumentError Traceback (most recent call last)
<ipython-input-16-1d74d781e0c1> in <module>
1 tagger = POSTagger(model='resources/postagger.model')
----> 2 tagger = POSTagger()
3 tagger.tag(word_tokenize('ما بسیار کتاب موانیم'))
~/.local/lib/python3.6/site-packages/hazm/SequenceTagger.py in __init__(self, patterns, **options)
21 def __init__(self, patterns=[], **options):
22 from wapiti import Model
---> 23 self.model = Model(patterns='\n'.join(patterns), **options)
24
25 def train(self, sentences):
~/.local/lib/python3.6/site-packages/wapiti/api.py in __init__(self, patterns, encoding, **options)
283 self._model = _wapiti.api_new_model(
284 ctypes.pointer(self.options),
--> 285 self.patterns
286 )
287
ArgumentError: argument 2: <class 'TypeError'>: wrong type
I am using ubuntu18.04 on windows 10. Also, I put mentioned files in resources file beside of code.
Python 3.6.9
Package of hazm
I have no problem to run Chunker one from this packege!
chunker = Chunker(model='resources/chunker.model')
tagged = tagger.tag(word_tokenize('واقعا ک بعضیا چقد بی درکن و ادعا دارن فقط بنده خدا لابد دسترسی نداره ب دکتری چیزی نگران شد'))
tree2brackets(chunker.parse(tagged))
its because of wapiti package! wapiti does not supporting python 3 and just work with python 2! if you need postagger, you should use another postagger package!

How to create a net which takes unlabeled "dummy data" as input?

I currently work myself through the caffe/examples/ to learn more about caffe/pycaffe.
In the 02-fine-tuning.ipynb-notebook there is a codecell which shows how to create a caffenet which takes unlabeled "dummmy data" as input, allowing us to set its input images externally. The notebook can be found here:
https://github.com/BVLC/caffe/blob/master/examples/02-fine-tuning.ipynb
There is a given code-cell, which throws an error:
dummy_data = L.DummyData(shape=dict(dim=[1, 3, 227, 227]))
imagenet_net_filename = caffenet(data=dummy_data, train=False)
imagenet_net = caffe.Net(imagenet_net_filename, weights, caffe.TEST)
error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-9f0ecb4d95e6> in <module>()
1 dummy_data = L.DummyData(shape=dict(dim=[1, 3, 227, 227]))
----> 2 imagenet_net_filename = caffenet(data=dummy_data, train=False)
3 imagenet_net = caffe.Net(imagenet_net_filename, weights, caffe.TEST)
<ipython-input-5-53badbea969e> in caffenet(data, label, train, num_classes, classifier_name, learn_all)
68 # write the net to a temporary file and return its filename
69 with tempfile.NamedTemporaryFile(delete=False) as f:
---> 70 f.write(str(n.to_proto()))
71 return f.name
~/anaconda3/envs/testcaffegpu/lib/python3.6/tempfile.py in func_wrapper(*args, **kwargs)
481 #_functools.wraps(func)
482 def func_wrapper(*args, **kwargs):
--> 483 return func(*args, **kwargs)
484 # Avoid closing the file as long as the wrapper is alive,
485 # see issue #18879.
TypeError: a bytes-like object is required, not 'str'
Anyone knows how to do this right?
tempfile.NamedTemporaryFile() opens a file in binary mode ('w+b') by default. Since you are using Python3.x, string is not the same type as for Python 2.x, hence providing a string as input to f.write() results in error since it expects bytes. Overriding the binary mode should avoid this error.
Replace
with tempfile.NamedTemporaryFile(delete=False) as f:
with
with tempfile.NamedTemporaryFile(delete=False, mode='w') as f:
This has been explained in a previous post:
TypeError: 'str' does not support the buffer interface

Error while exporting a dask dataframe to csv

My dask dataframe has about 120 mm rows and 4 columns:
df_final.dtypes
cust_id int64
score float64
total_qty float64
update_score float64
dtype: object
and I'm doing this operation on jupyter notebooks connected to linux machine :
%time df_final.to_csv('/path/claritin-files-*.csv')
and it throws up this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-24-46468ae45023> in <module>()
----> 1 get_ipython().magic(u"time df_final.to_csv('path/claritin-files-*.csv')")
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
2334 magic_name, _, magic_arg_s = arg_s.partition(' ')
2335 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2336 return self.run_line_magic(magic_name, magic_arg_s)
2337
2338 #-------------------------------------------------------------------------
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2255 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2256 with self.builtin_trap:
-> 2257 result = fn(*args,**kwargs)
2258 return result
2259
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
191 **# but it's overkill for just that one bit of state.**
192 def magic_deco(arg):
--> 193 call = lambda f, *a, **k: f(*a, **k)
194
195 if callable(arg):
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
1161 if mode=='eval':
1162 st = clock2()
-> 1163 out = eval(code, glob, local_ns)
1164 end = clock2()
1165 else:
<timed eval> in <module>()
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in to_csv(self, filename, **kwargs)
936 """ See dd.to_csv docstring for more information """
937 from .io import to_csv
--> 938 return to_csv(self, filename, **kwargs)
939
940 def to_delayed(self):
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.pyc in to_csv(df, filename, name_function, compression, compute, get, **kwargs)
411 if compute:
412 from dask import compute
--> 413 compute(*values, get=get)
414 else:
415 return values
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/base.pyc in compute(*args, **kwargs)
177 dsk = merge(var.dask for var in variables)
178 keys = [var._keys() for var in variables]
--> 179 results = get(dsk, keys, **kwargs)
180
181 results_iter = iter(results)
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/threaded.pyc in get(dsk, result, cache, num_workers, **kwargs)
74 results = get_async(pool.apply_async, len(pool._pool), dsk, result,
75 cache=cache, get_id=_thread_get_id,
---> 76 **kwargs)
77
78 # Cleanup pools associated to dead threads
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/async.pyc in get_async(apply_async, num_workers, dsk, result, cache, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, dumps, loads, **kwargs)
491 _execute_task(task, data) # Re-execute locally
492 else:
--> 493 raise(remote_exception(res, tb))
494 state['cache'][key] = res
495 finish_task(dsk, key, state, results, keyorder.get)
**ValueError: invalid literal for long() with base 10: 'total_qty'**
Traceback
---------
File "/home/mspra/anaconda2/lib/python2.7/site-packages/dask/async.py", line 268, in execute_task
result = _execute_task(task, data)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 55, in pandas_read_text
coerce_dtypes(df, dtypes)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 83, in coerce_dtypes
df[c] = df[c].astype(dtypes[c])
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 3054, in astype
raise_on_error=raise_on_error, **kwargs)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3189, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3056, in apply
applied = getattr(b, f)(**kwargs)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 461, in astype
values=values, **kwargs)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 504, in _astype
values = _astype_nansafe(values.ravel(), dtype, copy=True)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/types/cast.py", line 534, in _astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/lib.pyx", line 980, in pandas.lib.astype_intsafe (pandas/lib.c:17409)
File "pandas/src/util.pxd", line 93, in util.set_value_at_unsafe (pandas/lib.c:72777)
I have a couple of questions:
1) First of all this export was working fine on Friday, it spit out 100 csv files ( since it has 100 partitions), which I later aggregated. So what is wrong today -- anything from the error log?
2) May be this question is for the creators of this package, what is the most time-efficient way to get a csv extract out of a dask dataframe of this size, since it was taking about 1.5 to 2 hrs, the last time it was working.
I'm not using dask distributed and this is on single core of a linux cluster.
This error likely has little to do with to_csv and more to do with something else in your computation. The call to df.to_csv was just the first time you forced the computation to roll through all of the data.
Given the error I actually suspect that this is failing in read_csv. Dask.dataframe read the first few hundred kilobytes of your first file to guess at the datatypes, but it seems to have guessed incorrectly. You might want to try specifying dtypes explicitly in the read_csv call.
In regards to the second question about writing to CSV quickly, my first answer would be "use Parquet or HDF5 instead". They're much faster and more accurate in almost every respect.