I'm trying to train https://github.com/NVIDIA/vid2vid. I'm...
...executing with pretty much the vanilla parametrization shown in the readme, I had to change the number of GPUs though and increased the number of threads for reading the dataset. Command:
python train.py \
--name pose2body_256p \
--dataroot datasets/pose \
--dataset_mode pose \
--input_nc 6 \
--num_D 2 \
--resize_or_crop ScaleHeight_and_scaledCrop \
--loadSize 384 \
--fineSize 256 \
--gpu_ids 0,1 \
--batchSize 1 \
--max_frames_per_gpu 3 \
--no_first_img \
--n_frames_total 12 \
--max_t_step 4 \
--nThreads 6
...training on the supplied example datasets.
...running a docker container built with the scripts in vid2vid/docker, e. g. with CUDA 9.0 and CUDNN 7.
...using two NVIDIA V100 GPUs.
Whenever I start training the script crashes after a couple of minutes with the message RuntimeError: CUDNN_STATUS_MAPPING_ERROR. Full error message:
Traceback (most recent call last):
File "train.py", line 329, in <module>
train()
File "train.py", line 104, in train
fake_B, fake_B_raw, flow, weight, real_A, real_Bp, fake_B_last = modelG(input_A, input_B, inst_A, fake_B_last)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
raise output
File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
output = module(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/vid2vid/models/vid2vid_model_G.py", line 130, in forward
fake_B, fake_B_raw, flow, weight = self.generate_frame_train(netG, real_A_all, fake_B_prev, start_gpu, is_first_frame)
File "/vid2vid/models/vid2vid_model_G.py", line 175, in generate_frame_train
fake_B_feat, flow_feat, fake_B_fg_feat, use_raw_only)
File "/vid2vid/models/networks.py", line 171, in forward
downsample = self.model_down_seg(input) + self.model_down_img(img_prev)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDNN_STATUS_MAPPING_ERROR
From reading the issues in the vid2vid using two V100 should work with this setup. The error also occurs if CUDA 8/CUDNN 6 are used. I checked the flags but haven't found any indication of further necessary changes to the arguments supplied to train.py.
Any ideas on how to solve (or work around) this?
In case anybody deals the same issue: Training on P100 cards worked. Seems like the V100 architecture clashes with version of pytorch used in the supplied Dockerfile at some point. Not quite a solution, but a workaround.
Related
I have developed an application with Django.
This is working fine in my PC with sqlite backend.
But when I am trying to go live with linux server and mysql backend then I am getting bellow error while first time migration.
(env-bulkmailer) [root#localhost bulkmailer]# python3 manage.py migrate
Traceback (most recent call last):
File "/var/www/bulkmailer-folder/bulkmailer/manage.py", line 22, in <module>
main()
File "/var/www/bulkmailer-folder/bulkmailer/manage.py", line 18, in main
execute_from_command_line(sys.argv)
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/core/management/__init__.py", line 446, in execute_from_command_line
utility.execute()
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/core/management/__init__.py", line 440, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/core/management/base.py", line 402, in run_from_argv
self.execute(*args, **cmd_options)
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/core/management/base.py", line 448, in execute
output = self.handle(*args, **options)
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/core/management/base.py", line 96, in wrapped
res = handle_func(*args, **kwargs)
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/core/management/commands/migrate.py", line 114, in handle
executor = MigrationExecutor(connection, self.migration_progress_callback)
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/db/migrations/executor.py", line 18, in __init__
self.loader = MigrationLoader(self.connection)
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/db/migrations/loader.py", line 58, in __init__
self.build_graph()
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/db/migrations/loader.py", line 235, in build_graph
self.applied_migrations = recorder.applied_migrations()
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/db/migrations/recorder.py", line 82, in applied_migrations
return {
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/db/models/query.py", line 394, in __iter__
self._fetch_all()
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/db/models/query.py", line 1866, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/db/models/query.py", line 117, in __iter__
for row in compiler.results_iter(results):
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/db/models/sql/compiler.py", line 1336, in apply_converters
value = converter(value, expression, connection)
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/db/backends/mysql/operations.py", line 331, in convert_datetimefield_value
value = timezone.make_aware(value, self.connection.timezone)
File "/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/utils/timezone.py", line 291, in make_aware
raise ValueError("make_aware expects a naive datetime, got %s" % value)
ValueError: make_aware expects a naive datetime, got 2022-11-20 12:39:18.866299+00:00
In settings-
USE_TZ = True
I have run mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql also as django doc.
I am using django 4.1.3 and mysql community 8.0.30
Thanks in advance.
Ran into the same issue. At some point, django assumes that the the data is timezone-naive without checking. Here's the work-around.
Update the make_aware function that is listed in your stack trace here:
/var/www/bulkmailer-folder/env-bulkmailer/lib64/python3.9/site-packages/django/utils/timezone.py", line 291, in make_aware
Instead of raising an error if the value is already aware, just return the aware value. See the last else statement below.
def make_aware(value, timezone=None, is_dst=NOT_PASSED):
"""Make a naive datetime.datetime in a given time zone aware."""
if is_dst is NOT_PASSED:
is_dst = None
else:
warnings.warn(
"The is_dst argument to make_aware(), used by the Trunc() "
"database functions and QuerySet.datetimes(), is deprecated as it "
"has no effect with zoneinfo time zones.",
RemovedInDjango50Warning,
)
if timezone is None:
timezone = get_current_timezone()
if _is_pytz_zone(timezone):
# This method is available for pytz time zones.
return timezone.localize(value, is_dst=is_dst)
else:
# Check that we won't overwrite the timezone of an aware datetime.
if is_aware(value):
# ADD THIS
return value
# REMOVE THE FOLLOWING LINE
# raise ValueError("make_aware expects a naive datetime, got %s" % value)
# This may be wrong around DST changes!
return value.replace(tzinfo=timezone)
A want to resolve coreferences without Internet using AllenNLP and coref-spanbert-large model.
I try to do it in the way that is describing here https://demo.allennlp.org/coreference-resolution
My code:
from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging
predictor = Predictor.from_path(r"C:\Users\aap\Desktop\coref-spanbert-large-2021.03.10.tar.gz")
example = 'Paul Allen was born on January 21, 1953, in Seattle, Washington, to Kenneth Sam Allen and Edna Faye Allen.Allen attended Lakeside School, a private school in Seattle, where he befriended Bill Gates, two years younger, with whom he shared an enthusiasm for computers.'
pred = predictor.predict(document=example)
coref_res = predictor.coref_resolved(example)
print(pred)
print(coref_res)
When I have an access to internet the code works correctly.
But when I don't have an access to internet I get the following errors:
Traceback (most recent call last):
File "C:/Users/aap/Desktop/CoreNLP/Coref_AllenNLP.py", line 14, in <module>
predictor = Predictor.from_path(r"C:\Users\aap\Desktop\coref-spanbert-large-2021.03.10.tar.gz")
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\predictors\predictor.py", line 361, in from_path
load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\models\archival.py", line 206, in load_archive
config.duplicate(), serialization_dir
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\models\archival.py", line 232, in _load_dataset_readers
dataset_reader_params, serialization_dir=serialization_dir
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 604, in from_params
**extras,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 632, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 200, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 307, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 391, in construct_arg
**extras,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 341, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 604, in from_params
**extras,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\from_params.py", line 634, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\data\token_indexers\pretrained_transformer_mismatched_indexer.py", line 63, in __init__
**kwargs,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\data\token_indexers\pretrained_transformer_indexer.py", line 58, in __init__
model_name, tokenizer_kwargs=tokenizer_kwargs
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\data\tokenizers\pretrained_transformer_tokenizer.py", line 71, in __init__
model_name, add_special_tokens=False, **tokenizer_kwargs
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\allennlp\common\cached_transformers.py", line 110, in get_tokenizer
**kwargs,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 362, in from_pretrained
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\models\auto\configuration_auto.py", line 368, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\configuration_utils.py", line 424, in get_config_dict
use_auth_token=use_auth_token,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\file_utils.py", line 1087, in cached_path
local_files_only=local_files_only,
File "C:\Users\aap\Desktop\CoreNLP\corenlp\lib\site-packages\transformers\file_utils.py", line 1268, in get_from_cache
"Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
Process finished with exit code 1
Please, say me, what do I need to do my code works without Internet?
You will need a local copy of transformer model's configuration file and vocabulary so that the tokenizer and token indexer don't need to download those:
from transformers import AutoConfig, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(transformer_model_name)
config = AutoConfig.from_pretrained(transformer_model_name)
tokenizer.save_pretrained(local_config_path)
config.to_json_file(local_config_path + "/config.json")
You will then need to override the transformer model name in the configuration file to the local directory (local_config_path) where you saved these things:
predictor = Predictor.from_path(
r"C:\Users\aap\Desktop\coref-spanbert-large-2021.03.10.tar.gz",
overrides={
"dataset_reader.token_indexers.tokens.model_name": local_config_path,
"validation_dataset_reader.token_indexers.tokens.model_name": local_config_path,
"model.text_field_embedder.tokens.model_name": local_config_path,
},
)
I have run into similar problem when using structured-prediction-srl-bert without internet, and I saw in the logs 4 item for downloads:
dataset_reader.bert_model_name = bert-base-uncased, Downloading 4 files
model INFO vocabulary.py - Loading token dictionary from data/structured-prediction-srl-bert.2020.12.15/vocabulary. Downloading... 4x smaller files
Spacy models 'en_core_web_sm' not found
later on, [nltk_data] Error loading punkt: <urlopen error [Errno -3] Temporary failure in name resolution> [nltk_data] Error loading wordnet: <urlopen error [Errno -3] Temporary failure in name resolution>
I have solved it with these steps:
structured-prediction-srl-bert:
I have downloaded the structured-prediction-srl-bert.2020.12.15.tar.gz from the https://demo.allennlp.org/semantic-role-labeling (Model Card tab) -
https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz
I have unzipped it into ./data/structured-prediction-srl-bert.2020.12.15
The code:
pip install allennlp==2.10.0 allennlp-models==2.10.0
from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("./data/structured-prediction-srl-bert.2020.12.15/")
bert-base-uncased
I have created a folder ./data/bert-base-uncased and there I have downloaded these files from https://huggingface.co/bert-base-uncased/tree/main
config.json
tokenizer.json
tokenizer_config.json
vocab.txt
pytorch_model.bin
Aditionally, I had to change the "bert_model_name" from "bert-base-uncased" into a path "./data/bert-base-uncased", the earlier causes the download. This has to be done in the ./data/structured-prediction-srl-bert.2020.12.15/config.json , and there are two occurences.
python -m spacy download en_core_web_sm
python -c 'import nltk; nltk.download("punkt"); nltk.download("wordnet")'
After these steps the allennlp did not need internet anymore.
I was trying out the pix2pixHD code from the link below.
https://github.com/NVIDIA/pix2pixHD
The train.py worked with default images (in datasets/cityscapes). However, after changing images in the dataset, it shows the error below.
model [Pix2PixHDModel] was created
create web directory ./checkpoints/label2city/web...
Traceback (most recent call last):
File "/home/shimada/venv/py2.7/projects/Hiwi/pix2pixHD/train.py", line 58, in <module>
Variable(data['image']), Variable(data['feat']), infer=save_fake)
File "/home/shimada/venv/py2.7/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/shimada/venv/py2.7/local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 66, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/shimada/venv/py2.7/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/shimada/venv/py2.7/projects/Hiwi/pix2pixHD/models/pix2pixHD_model.py", line 141, in forward
fake_image = self.netG.forward(input_concat)
File "/home/shimada/venv/py2.7/projects/Hiwi/pix2pixHD/models/networks.py", line 213, in forward
return self.model(input)
File "/home/shimada/venv/py2.7/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/shimada/venv/py2.7/local/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/home/shimada/venv/py2.7/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/shimada/venv/py2.7/local/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 277, in forward
self.padding, self.dilation, self.groups)
File "/home/shimada/venv/py2.7/local/lib/python2.7/site-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: Given groups=1, weight[64, 36, 7, 7], so expected input[1, 39, 518, 1030] to have 36 channels, but got 39 channels instead
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.c line=184 error=59 : device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCStorage.c:184
bash: line 1: 10965 Aborted (core dumped) env "PYCHARM_HOSTED"="1" "PYTHONUNBUFFERED"="1" "PYTHONIOENCODING"="UTF-8" "PYCHARM_MATPLOTLIB_PORT"="42188" "JETBRAINS_REMOTE_RUN"="1" "PYTHONPATH"="/home/shimada/.pycharm_helpers/pycharm_matplotlib_backend:/home/shimada/venv/py2.7/projects/Hiwi/pix2pixHD" /home/shimada/venv/py2.7/bin/python -u /home/shimada/venv/py2.7/projects/Hiwi/pix2pixHD/train.py
I changed the images with same size (width 2048, hight 1024), same extension (.png) and gave the same names. Why doesn't it work?
It looks like your original image/ground truth data is grayscale. In that case you have to define --input_nc 1 --output_nc 1 means grayscale. You also have to change in pix2pixHD code to load grayscale images.
I am learning Deep Learning and want to use python-kereas to implement CNN, but when I run in command, it looks like some errors.
This is my source code. https://github.com/lijhong/CNN-kereas.git
And my fault is like this:
Traceback (most recent call last):
File "/home/ah0818lijhong/CNN-kereas/cnn-kereas.py", line 167, in <module>
model.fit(x_train, y_train,epochs=3)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/keras/models.py", line 845, in fit
initial_epoch=initial_epoch)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1485, in fit
initial_epoch=initial_epoch)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1140, in _fit_loop
outs = f(ins_batch)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2073, in __call__
feed_dict=feed_dict)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,868] = 115873 is not in [0, 20001)
[[Node: embedding_1/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](embedding_1/embeddi
ngs/read, _recv_embedding_1_input_0)]]
Caused by op u'embedding_1/Gather', defined at:
File "/home/ah0818lijhong/CNN-kereas/cnn-kereas.py", line 122, in <module>
model_left.add(embedding_layer)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/keras/models.py", line 422, in add
layer(x)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 554, in __call__
output = self.call(inputs, **kwargs)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/keras/layers/embeddings.py", line 119, in call
out = K.gather(self.embeddings, inputs)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 966, in gather
return tf.gather(reference, indices)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1207, in gather
validate_indices=validate_indices, name=name)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/ah0818lijhong/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): indices[0,868] = 115873 is not in [0, 20001)
[[Node: embedding_1/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](embedding_1/embeddi
ngs/read, _recv_embedding_1_input_0)]]
I hope someone can help me fix it.
I have a Python Django application running on a Google Compute instance. It is using gcloudoem to interface from Django to Google Datastore. gcloudoem uses the same underlying code to communicate with Datastore as gcloud-python 0.5.x
At what seems to be completely random times, I will get SSL errors happening when trying to talk to Datastore. There is no pattern in where in my application code these happen. It's just during a random call to Datastore. Here are the two flavours of errors:
ERROR:django.request:Internal Server Error: /complete/google-oauth2/
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/django/core/handlers/base.py", line 111, in get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python2.7/dist-packages/django/views/decorators/cache.py", line 52, in _wrapped_view_func
response = view_func(request, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/django/views/decorators/csrf.py", line 57, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/apps/django_app/utils.py", line 51, in wrapper
return func(request, backend, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/apps/django_app/views.py", line 28, in complete
redirect_name=REDIRECT_FIELD_NAME, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/actions.py", line 43, in do_complete
user = backend.complete(user=user, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/backends/base.py", line 41, in complete
return self.auth_complete(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/utils.py", line 229, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/backends/oauth.py", line 387, in auth_complete
*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/utils.py", line 229, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/backends/oauth.py", line 396, in do_auth
return self.strategy.authenticate(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/strategies/django_strategy.py", line 96, in authenticate
return authenticate(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/django/contrib/auth/__init__.py", line 60, in authenticate
user = backend.authenticate(**credentials)
File "/usr/local/lib/python2.7/dist-packages/social/backends/base.py", line 82, in authenticate
return self.pipeline(pipeline, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/backends/base.py", line 85, in pipeline
out = self.run_pipeline(pipeline, pipeline_index, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/social/backends/base.py", line 112, in run_pipeline
result = func(*args, **out) or {}
File "/usr/local/lib/python2.7/dist-packages/social/pipeline/social_auth.py", line 20, in social_user
social = backend.strategy.storage.user.get_social_auth(provider, uid)
File "./social_gc/storage.py", line 105, in get_social_auth
return cls.objects.get(provider=provider, uid=uid)
File "/usr/local/lib/python2.7/dist-packages/gcloudoem/queryset/__init__.py", line 162, in get
num = len(clone)
File "/usr/local/lib/python2.7/dist-packages/gcloudoem/queryset/__init__.py", line 126, in __len__
self._fetch_all()
File "/usr/local/lib/python2.7/dist-packages/gcloudoem/queryset/__init__.py", line 370, in _fetch_all
self._result_cache = list(self.iterator())
File "/usr/local/lib/python2.7/dist-packages/gcloudoem/datastore/query.py", line 480, in __iter__
self.next_page()
File "/usr/local/lib/python2.7/dist-packages/gcloudoem/datastore/query.py", line 452, in next_page
transaction_id=transaction and transaction.id,
File "/usr/local/lib/python2.7/dist-packages/gcloudoem/datastore/connection.py", line 249, in run_query
response = self._rpc('runQuery', request, datastore_pb.RunQueryResponse)
File "/usr/local/lib/python2.7/dist-packages/gcloudoem/datastore/connection.py", line 159, in _rpc
data=request_pb.SerializeToString()
File "/usr/local/lib/python2.7/dist-packages/gcloudoem/datastore/connection.py", line 134, in _request
body=data
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 589, in new_request
redirections, connection_type)
File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1609, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1351, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1307, in _conn_request
response = conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1127, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 453, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 409, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 480, in readline
data = self._sock.recv(self._rbufsize)
File "/usr/lib/python2.7/ssl.py", line 734, in recv
return self.read(buflen)
File "/usr/lib/python2.7/ssl.py", line 621, in read
v = self._sslobj.read(len or 1024)
SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1752)
Unfortunately, for the second, I don't have a full stacktrace handy:
[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1752)
These errors don't happen when I am using the GCD tool. Does anyone have any idea what is happening here? Is this some sort of networking problem?
I have also been receiving the [SSL: WRONG_VERSION_NUMBER] error when trying to use Datastore, however, I can repeat the error on demand. As James suggested, I get this error as soon as I introduce another thread querying Datastore. They are using completely separate application-level objects but I would imagine that as they are getting lower down in the gcloud library or lower down still there is some sort of object-sharing happening that is causing this problem.
UPDATE: I have since found the following very helpful thread (https://github.com/GoogleCloudPlatform/gcloud-python/issues/1214) which identifies an issue across the gcloud python apis due to a common dependency on the httplib2 library which turns out to not be thread-safe.
Somebody has written a wrapper for the gcloud suite that will use the requests library instead of httplib2 (gcloud requests) but it is built for Python 2.7. I didnt try to convert it for my Python3 project and instead used the very simple httplib2shim library to monkey-patch httplib2 with urllib3.
It was as simple as adding this :
import httplib2shim
httplib2shim.patch()
I'm now making calls from multiple threads without an issue.
: )
Two things come to mind which may be leading to this. Sorry this is not super specific; trying to help!
Threads - there are objects being shared across threads somehow which is causing the problem
Connections - There are too many connections being made, causing in failure (especially for the second error)