TPU suddenly stops training - google-compute-engine

I am trying to train a transformer model using a TPU in Google Cloud by following the instructions in the official tutorial.
Loading the data worked fine, and after running
t2t-trainer \
--model=transformer \
--hparams_set=transformer_tpu \
--problem=translate_ende_wmt32k_packed \
--train_steps=500000 \
--eval_steps=3000 \
--data_dir=$DATA_DIR \
--output_dir=$OUT_DIR \
--use_tpu=True \
--cloud_tpu_name=$TPU_NAME
the training does start as expected, and the output looks something like this:
I1118 14:48:18.978163 140580835792320 tpu_estimator.py:2307] global_step/sec: 15.2942
INFO:tensorflow:examples/sec: 978.827
I1118 14:48:18.978595 140580835792320 tpu_estimator.py:2308] examples/sec: 978.827
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
I1118 14:48:18.979720 140580835792320 tpu_estimator.py:600] Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
I1118 14:48:18.979935 140580835792320 tpu_estimator.py:604] Dequeue next (100) batch(es) of data from outfeed.
I1118 14:48:24.292932 140577566803712 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-8 in state READY, and health HEALTHY.
W1118 14:48:24.353135 140577566803712 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-8 in state READY, and health HEALTHY.
INFO:tensorflow:loss = 1.8486812, step = 113800 (6.536 sec)
I1118 14:48:25.512768 140580835792320 basic_session_run_hooks.py:260] loss = 1.8486812, step = 113800 (6.536 sec)
INFO:tensorflow:global_step/sec: 15.2986
I1118 14:48:25.514695 140580835792320 tpu_estimator.py:2307] global_step/sec: 15.2986
INFO:tensorflow:examples/sec: 979.11
I1118 14:48:25.515115 140580835792320 tpu_estimator.py:2308] examples/sec: 979.11
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
I1118 14:48:25.516618 140580835792320 tpu_estimator.py:600] Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
I1118 14:48:25.516829 140580835792320 tpu_estimator.py:604] Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Outfeed finished for iteration (388, 47)
I1118 14:48:28.761935 140577575196416 tpu_estimator.py:279] Outfeed finished for iteration (388, 47)
INFO:tensorflow:loss = 1.5237397, step = 113900 (6.573 sec)
I1118 14:48:32.086134 140580835792320 basic_session_run_hooks.py:260] loss = 1.5237397, step = 113900 (6.573 sec)
However, after a non-deterministic number of iterations (sometimes fewer than 25k, sometimes more than 400k, sometimes never), the training suddenly stops. There is no error message, but no more progress is made. When this happens, I get the following output:
I1120 13:40:33.828651 140684764419520 tpu_estimator.py:2307] global_step/sec: 16.3988
INFO:tensorflow:examples/sec: 1049.52
I1120 13:40:33.829339 140684764419520 tpu_estimator.py:2308] examples/sec: 1049.52
INFO:tensorflow:Enqueue next (1000) batch(es) of data to infeed.
I1120 13:40:33.830607 140684764419520 tpu_estimator.py:600] Enqueue next (1000) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1000) batch(es) of data from outfeed.
I1120 13:40:33.830862 140684764419520 tpu_estimator.py:604] Dequeue next (1000) batch(es) of data from outfeed.
INFO:tensorflow:Outfeed finished for iteration (7, 0)
I1120 13:40:34.267921 140681504278272 tpu_estimator.py:279] Outfeed finished for iteration (7, 0)
I1120 13:40:39.989195 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:40:40.056418 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:41:10.124164 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:41:10.177670 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:41:40.259634 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:41:40.309398 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:42:10.377460 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health UNKNOWN.
W1120 13:42:10.431982 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health UNKNOWN.
I1120 13:42:40.508342 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:42:40.567739 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:43:10.638391 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:43:10.694900 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:43:40.763782 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:43:40.810777 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:44:10.889873 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:44:10.942733 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
I1120 13:44:41.011034 140681495885568 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
W1120 13:44:41.066553 140681495885568 preempted_hook.py:91] TPUPollingThread found TPU tpuv3-5 in state READY, and health HEALTHY.
Note that the reported health was UNKNOWN once, which may or may not be related to this problem.
To resume training, I have to stop the process and run the training command again. It will then load the latest checkpoint and continue training, until it eventually stops again.
I am running the training command from within a tmux session, so this should not be caused by connection issues between me and Google Cloud. In fact, I can completely close all windows, and connect to the running training session from another PC.
I have seen the question TPU training freezes in the middle of training, but I am using a predefined model, and my bucket is defined in the same region (TPU in us-central1-a, storage bucket in us-central1).
Edit: In case this is relevant, I am currently on a free one-month trial that I got by applying to the TensorFlow Research Cloud project. Maybe those cluster nodes are less stable than the paid ones?
Edit 2: Maybe this is related to the GitHub issue TPU dies after 3hrs (e.g. with no 'health' state) (and the follow-up)? Note that the issue has been closed, but the given answer appears to be unrelated to the problem. Also, I've checked the file /usr/local/lib/python2.7/dist-packages/tensorflow_core/python/tpu/preempted_hook.py in my cloud VM, and both linked changes are already incorporated.

I had the same issue when training with a TPU from TFRC. As the warning suggests, there seems to be a problem with the connection between the TPU and Google Cloud, even when the instructions are followed.
I tried a few things:
Remove the gcloud config folder:
rm -rf ~/.config/gcloud
Update the gcloud SDK:
gcloud components update
Give the TPU service account access to the Cloud Storage bucket via IAM (see the sketch below).
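A minimal sketch of that IAM step, assuming the bucket is gs://my-bucket and using placeholder names for the TPU and zone (adjust all of them):
# Look up the service account the TPU node runs as (TPU name and zone are placeholders):
gcloud compute tpus describe $TPU_NAME --zone=us-central1-a --format='value(serviceAccount)'
# Grant that service account object read/write access on the training bucket:
gsutil iam ch serviceAccount:SERVICE_ACCOUNT_EMAIL:roles/storage.objectAdmin gs://my-bucket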
The TPU hangs still happen, but less frequently. Hopefully this helps in your case, or you can figure out a universal solution.

This was reported as a bug on GitHub (#1, #2) and subsequently fixed.
If the error still occurs, you should reply to the second GitHub issue. Note that you might have to recreate the TPU; just restarting it may not be enough.
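If you do need to recreate it, a minimal sketch with gcloud (TPU name, zone, accelerator type, and TensorFlow version are placeholders to adjust):
# Delete the stuck TPU node:
gcloud compute tpus delete $TPU_NAME --zone=us-central1-a
# Recreate it under the same name so the training command can be rerun unchanged:
gcloud compute tpus create $TPU_NAME --zone=us-central1-a --accelerator-type=v3-8 --version=1.15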

Related

How to reference LINK token on (forked) development network without "invalid opcode"?

I'd like to run a number of local tests. Everything works well on Rinkeby and other test chains. However, the local development chain does not cooperate with my configuration.
When I run a forked development network:
brownie console --network mainnet-fork
The ganache-cli initiates as expected:
Brownie v1.18.1 - Python development framework for Ethereum
BlockchainProject is the active project.
Launching 'ganache-cli --accounts 10 --hardfork istanbul --fork https://mainnet.infura.io/v3/6a633a4ecae8449abbc69974cdd3a9b9 --gasLimit 12000000 --mnemonic brownie --port 8545 --chainId 1'...
Brownie environment is ready.
However, even the simplest contract interaction fails:
>>> link_token = Contract.from_explorer("0x514910771AF9Ca656af840dff83E8264EcF986CA")
Fetching source of 0x514910771AF9Ca656af840dff83E8264EcF986CA from api.etherscan.io...
>>> accounts[0].balance()
100000000000000000000
>>> accounts[1].balance()
100000000000000000000
>>> link_token.transfer(accounts[0].address, 100, {'from': accounts[0].address})
Transaction sent: 0x1542b679e4d09b2f4523427c7f5048ed01ee0d194c34cd27b82bbd177e1b3f23
Gas price: 0.0 gwei Gas limit: 12000000 Nonce: 2
LinkToken.transfer confirmed (invalid opcode) Block: 14604608 Gas used: 12000000 (100.00%)
<Transaction '0x1542b679e4d09b2f4523427c7f5048ed01ee0d194c34cd27b82bbd177e1b3f23'>
Since the Link token is compiled with an unsupported compiler, I do not get any further information on why this results in LinkToken.transfer confirmed (invalid opcode).
How do I (correctly) run Chainlink code against a forked development network using Brownie - am I missing a step such as funding?
My networks: configuration in brownie-config.yaml:
networks:
  mainnet-fork:
    vrf_coordinator: '0xf0d54349aDdcf704F77AE15b96510dEA15cb7952'
    link_token: '0x514910771AF9Ca656af840dff83E8264EcF986CA'
    keyhash: '0xAA77729D3466CA35AE8D28B3BBAC7CC36A5031EFDC430821C02BC31A238AF445'
I did try to rm -rf build but that does not change anything.
System environment:
Brownie v1.18.1
Node 8.5.5
Ganache v7.0.4
21.3.0 Darwin Kernel Version (macOS 12.2.1)
Python 3.9.7
In this instance, the account used for Link token funding does not hold any Link. For some reason the transaction does not get reverted, but Brownie's unlock: option helps here.
First, adjust the networks: settings to unlock an arbitrary account with a large Link balance:
mainnet-fork:
  cmd_settings:
    unlock:
      - 0xf37c348b7d19b17b29cd5cfa64cfa48e2d6eb8db
  vrf_coordinator: '0xf0d54349aDdcf704F77AE15b96510dEA15cb7952'
  link_token: '0x514910771AF9Ca656af840dff83E8264EcF986CA'
  keyhash: '0xAA77729D3466CA35AE8D28B3BBAC7CC36A5031EFDC430821C02BC31A238AF445'
Second, run a mainnet-fork as before:
brownie console --network mainnet-fork
Third, confirm that the unlocked account is available and funded:
>>> accounts[10]
<Account '0xF37C348B7d19b17B29CD5CfA64cfA48E2d6eb8Db'>
>>> accounts[10].balance()
426496436000000000
Fourth, instantiate the token contract, Link in this instance:
link_token = Contract.from_explorer("0x514910771AF9Ca656af840dff83E8264EcF986CA")
Finally, transfer Link from the unlocked account to some other account (or contract):
link_token.transfer(accounts[0], 20, {"from": accounts[10]})
Alternatively, funding the mainnet address with Link, or even unlocking the Link owner and minting new Link would work too...

Config block returns NOT_FOUND in Hyperledger Fabric

I'm trying to fetch the config block in order to create a config update.
I'm using the test network from fabric-samples with default settings (no CA).
Even after starting the network, I cannot fetch any blocks - not the newest or the oldest either.
This is the output I'm getting:
peer channel fetch config
2022-02-08 11:09:47.306 +03 [channelCmd] InitCmdFactory -> INFO 001 Endorser and orderer connections initialized
2022-02-08 11:09:47.309 +03 [cli.common] readBlock -> INFO 002 Expect block, but got status: &{NOT_FOUND}
Error: can't read the block: &{NOT_FOUND}
I think you need to specify the channel, for example:
peer channel fetch config -c mychannel
That works for me with the default test network channel, and I get the same error you saw without the -c option.
It's also worth having a look at the test network scripts since they are meant to be a sample themselves. In this case configUpdate.sh does a config update.
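If the end goal is a config update, a hedged sketch of the usual next step, assuming the test network defaults (orderer at localhost:7050, channel mychannel, and an $ORDERER_CA variable pointing at the orderer's TLS CA cert):
# Fetch the config block for mychannel into a file:
peer channel fetch config config_block.pb -c mychannel -o localhost:7050 --ordererTLSHostnameOverride orderer.example.com --tls --cafile "$ORDERER_CA"
# Decode it to JSON so it can be edited into a config update:
configtxlator proto_decode --input config_block.pb --type common.Block --output config_block.json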

Error with prysm beacon-chain with testnet pyrmont

I am trying to run a beacon chain for Ethereum 2.0 on the Pyrmont testnet with Prysm and Besu.
I run the ETH1 node with the command:
besu --network=goerli --data-path=/root/goerliData --rpc-http-enabled
This command works: it downloads the entire blockchain and then runs properly.
But when I launch:
./prysm.sh beacon-chain --http-web3provider=localhost:8545 --pyrmont
I get:
Verified /root/prysm/dist/beacon-chain-v1.0.0-beta.3-linux-amd64 has been signed by Prysmatic Labs.
Starting Prysm beacon-chain --http-web3provider=localhost:8545 --pyrmont
[2020-11-18 14:03:06] WARN flags: Running on Pyrmont Testnet
[2020-11-18 14:03:06] INFO flags: Using "max_cover" strategy on attestation aggregation
[2020-11-18 14:03:06] INFO node: Checking DB database-path=/root/.eth2/beaconchaindata
[2020-11-18 14:03:08] ERROR main: database contract is xxxxxxxxxxxx3fdc but tried to run with xxxxxxxxxxxx6a8c
I tried to delete the previous data folder /root/goerliData and re-download the blockchain, but nothing changed...
Why didn't the database contract change, and what should I do?
Thanks :)
The error means that you have an existing database for another network, probably Medalla.
Try starting your beacon node with the flag --clear-db next time, and you'll see the error disappear and the node start syncing Pyrmont.
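For example, reusing the command from the question (note that clearing the database means the beacon node will resync Pyrmont from scratch):
./prysm.sh beacon-chain --http-web3provider=localhost:8545 --pyrmont --clear-db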

Problem running python script using TPU on VM instance

I created a TPU and a VM instance with the same name via the Cloud Console (not ctpu or gcloud).
When I check the TPU on the VM with the command
gcloud compute tpus list
I can see my TPU in state READY.
But when I run this Python script:
from tensorflow.contrib.cluster_resolver import TPUClusterResolver
tpu_grpc_url = TPUClusterResolver(tpu="v3-nonpre", zone="us-central1-a").get_master()
It says googleapiclient.errors.HttpError: <HttpError 403 when requesting https://tpu.googleapis.com/v1alpha1/projects/red-splice-230206/locations/us-central1-a/nodes/v3-nonpre?alt=json returned "Request had insufficient authentication scopes.">
What else do I need to do to get the required authentication?

Chain head is not set yet. Permit all

I installed Hyperledger Sawtooth following this guide:
https://sawtooth.hyperledger.org/docs/core/releases/latest/sysadmin_guide/installation.html
[2018-11-04 02:35:13.204 DEBUG selector_events] Using selector: ZMQSelector
[2018-11-04 02:35:13.205 INFO interconnect] Listening on tcp://127.0.0.1:4004
[2018-11-04 02:35:13.205 DEBUG dispatch] Added send_message function for connection ServerThread
[2018-11-04 02:35:13.206 DEBUG dispatch] Added send_last_message function for connection ServerThread
[2018-11-04 02:35:13.206 DEBUG genesis] genesis_batch_file: /var/lib/sawtooth/genesis.batch
[2018-11-04 02:35:13.206 DEBUG genesis] block_chain_id: not yet specified
[2018-11-04 02:35:13.207 INFO genesis] Producing genesis block from /var/lib/sawtooth/genesis.batch
[2018-11-04 02:35:13.207 DEBUG genesis] Adding 1 batches
[2018-11-04 02:35:13.208 DEBUG executor] no transaction processors registered for processor type sawtooth_settings: 1.0
[2018-11-04 02:35:13.209 INFO executor] Waiting for transaction processor (sawtooth_settings, 1.0)
[2018-11-04 02:35:13.311 INFO processor_handlers] registered transaction processor: connection_id=014a2086c9ffe773b104d8a0122b9d5f867a1b2d44236acf4ab097483dbe49c2ad33d3302acde6f985d911067fe92207aa8adc1c9dbc596d826606fe1ef1d4ef, family=intkey, version=1.0, namespaces=['1cf126']
[2018-11-04 02:35:18.110 INFO processor_handlers] registered transaction processor: connection_id=e615fc881f8e7b6dd05b1e3a8673d125a3e759106247832441bd900abae8a3244e1507b943258f62c458ded9af0c5150da420c7f51f20e62330497ecf9092060, family=xo, version=1.0, namespaces=['5b7349']
[2018-11-04 02:35:21.908 DEBUG permission_verifier] Chain head is not set yet. Permit all.
[2018-11-04 02:35:21.908 DEBUG permission_verifier] Chain head is not set yet. Permit all.
Then:
ubuntu@ip-172-31-42-144:~$ sudo intkey-tp-python -vv
[2018-11-04 02:42:05.710 INFO core] register attempt: OK
Then:
ubuntu@ip-172-31-42-144:~$ intkey create_batch
Writing to batches.intkey...
ubuntu@ip-172-31-42-144:~$ intkey load
batches: 2 batch/sec: 160.14600713999351
REST-API works, too.
I did all the steps exactly as shown in the guide. The older question hyperledger sawtooth validator node permissioning issue doesn't help me either.
ubuntu@ip-172-31-42-144:~$ curl http://localhost:8008/blocks
{
  "error": {
    "code": 15,
    "message": "The validator has no genesis block, and is not yet ready to be queried. Try your request again later.",
    "title": "Validator Not Ready"
  }
}
Genesis was attached?!
MARiE
As the log shows, the genesis batch is waiting on the sawtooth_settings transaction processor. If you start that up, just like you started up intkey and xo, it will process the genesis batch and will then be able to handle your intkey transactions.
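A minimal sketch of starting it, assuming the default validator endpoint tcp://localhost:4004 seen in the log above:
# Start the settings transaction processor, the same way intkey and xo were started:
sudo -u sawtooth settings-tp -v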