Eliminate Longhorn-system stuck in Terminating State - k3s

I am trying to delete an instance of longhorn, as well as the namespace, that is stuck in the terminating state.
I tried all three methods on the longhorn documentation, of which all fail. I cannot uninstall longhorn using a helm chart as I never installed longhorn through helm in the first place. Uninstalling longhorn using the kubectl also fails to create job.batch/longhorn-uninstall because the namespace longhorn-system is in the Terminating state.
Editing the CRDs and finalizers, as per the troubleshooting documentation and the following site (https://avasdream.engineer/kubernetes-longhorn-stuck-terminating) also do not fix the problem of the system terminating, as in both cases, there is no change. Using the script from https://github.com/longhorn/longhorn/blob/master/scripts/cleanup.sh also fails to terminate longhorn, as it fails to find any of the resources.
The query kubectl get namespace longhorn-system -o json gives the following results:
{
"apiVersion": "v1",
"kind": "Namespace",
"metadata": {
"creationTimestamp": "2022-10-31T20:09:45Z",
"deletionTimestamp": "2023-01-26T18:17:03Z",
"labels": {
"kubernetes.io/metadata.name": "longhorn-system",
"name": "longhorn-system"
},
"name": "longhorn-system",
"resourceVersion": "41929420",
"uid": "f1c78184-4613-4f9d-939d-13947ac8befa"
},
"spec": {
"finalizers": [
"kubernetes"
]
},
"status": {
"conditions": [
{
"lastTransitionTime": "2023-01-26T18:29:35Z",
"message": "All resources successfully discovered",
"reason": "ResourcesDiscovered",
"status": "False",
"type": "NamespaceDeletionDiscoveryFailure"
},
{
"lastTransitionTime": "2023-01-26T18:17:27Z",
"message": "All legacy kube types successfully parsed",
"reason": "ParsedGroupVersions",
"status": "False",
"type": "NamespaceDeletionGroupVersionParsingFailure"
},
{
"lastTransitionTime": "2023-01-26T18:17:51Z",
"message": "All content successfully deleted, may be waiting on finalization",
"reason": "ContentDeleted",
"status": "False",
"type": "NamespaceDeletionContentFailure"
},
{
"lastTransitionTime": "2023-01-26T18:17:27Z",
"message": "Some resources are remaining: engines.longhorn.io has 2 resource instances, nodes.longhorn.io has 5 resource instances, orphans.longhorn.io has 1 resource instances, replicas.longhorn.io has 4 resource instances, snapshots.longhorn.io has 3 resource instances, volumes.longhorn.io has 2 resource instances",
"reason": "SomeResourcesRemain",
"status": "True",
"type": "NamespaceContentRemaining"
},
{
"lastTransitionTime": "2023-01-26T18:17:27Z",
"message": "Some content in the namespace has finalizers remaining: longhorn.io in 17 resource instances",
"reason": "SomeFinalizersRemain",
"status": "True",
"type": "NamespaceFinalizersRemaining"
}
],
"phase": "Terminating"
}
}
The query kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n longhorn-system yields the following information.
NAME STATE NODE INSTANCEMANAGER IMAGE AGE
engine.longhorn.io/pvc-1886524a-5ba0-459d-9d51-b8044fec3057-e-344e3d26 stopped 64d
engine.longhorn.io/pvc-5aca06f9-adf1-45c8-b11d-6e79d9719001-e-79f24cb7 stopped 64d
NAME READY ALLOWSCHEDULING SCHEDULABLE AGE
node.longhorn.io/master0 False true True 86d
node.longhorn.io/master1 True true True 14d
node.longhorn.io/master2 False true True 86d
node.longhorn.io/worker0 False true True 70d
node.longhorn.io/worker1 True true True 70d
NAME TYPE NODE
orphan.longhorn.io/orphan-010ee0d16422c151e7e039e27fe2306815361596fa3f8b6cccc8a601b673e429 replica master0
NAME STATE NODE DISK INSTANCEMANAGER IMAGE AGE
replica.longhorn.io/pvc-1886524a-5ba0-459d-9d51-b8044fec3057-r-89dfabab stopped master2 c5a7e70d-09d8-43a2-9ba3-d5b65eb12b34 13d
replica.longhorn.io/pvc-1886524a-5ba0-459d-9d51-b8044fec3057-r-a6881548 running worker1 2d7f16e8-f11b-40e8-8935-7f0559f7674e instance-manager-r-8ccf914f longhornio/longhorn-engine:v1.3.2 16d
replica.longhorn.io/pvc-5aca06f9-adf1-45c8-b11d-6e79d9719001-r-52b5a290 stopped worker1 2d7f16e8-f11b-40e8-8935-7f0559f7674e 31d
replica.longhorn.io/pvc-5aca06f9-adf1-45c8-b11d-6e79d9719001-r-8f0ae6c9 running master2 c5a7e70d-09d8-43a2-9ba3-d5b65eb12b34 instance-manager-r-672003dc longhornio/longhorn-engine:v1.3.2 13d
NAME VOLUME CREATIONTIME READYTOUSE RESTORESIZE SIZE AGE
snapshot.longhorn.io/887f9621-5417-40b3-8999-c2695d5585d7 pvc-1886524a-5ba0-459d-9d51-b8044fec3057 2023-01-12T21:07:46Z false 10737418240 312860672 13d
snapshot.longhorn.io/8f11c48b-da51-4124-80b8-1316db88eb01 pvc-5aca06f9-adf1-45c8-b11d-6e79d9719001 2023-01-12T21:21:54Z false 21474836480 20096512000 13d
snapshot.longhorn.io/b4c31fe7-5ff5-4881-9cd8-b22fc73798bb pvc-5aca06f9-adf1-45c8-b11d-6e79d9719001 2023-01-12T21:38:23Z true 21474836480 102400 13d
NAME STATE ROBUSTNESS SCHEDULED SIZE NODE AGE
volume.longhorn.io/pvc-1886524a-5ba0-459d-9d51-b8044fec3057 attaching unknown 10737418240 master0 64d
volume.longhorn.io/pvc-5aca06f9-adf1-45c8-b11d-6e79d9719001 detaching unknown 21474836480 64d
Attempting to manually delete any of the items described also failed. All APIservices have Availability as TRUE.
What do I do to resolve this problem? I will provide any more information needed.

Related

Google App Engine ERROR Throttling refreshCfg with Gcloud MySQL instance

Continuous failed connection attempts errors are occurring in Google Cloud MySQL running on Google APP Engine with public IP.
These are some of the logs:
receiveTimestamp resource.labels.module_id resource.labels.project_id resource.labels.version_id resource.labels.zone resource.type severity textPayload timestamp
2021-06-08T05:48:43.497385728Z zzzzzzzzzz xxxxxxx 4 eu5 gae_app ERROR Throttling refreshCfg(xxxxxxx:europe-west1:yyyyyyyyy): it was only called 80.802µs ago 2021-06-08T05:48:43.494284Z
2021-06-08T05:19:08.394840567Z zzzzzzzzzz xxxxxxx 4 eu5 gae_app ERROR Throttling refreshCfg(xxxxxxx:europe-west1:yyyyyyyyy): it was only called 42.519µs ago 2021-06-08T05:19:08.391909Z
2021-06-08T05:13:42.889911567Z zzzzzzzzzz xxxxxxx 4 eu5 gae_app ERROR Throttling refreshCfg(xxxxxxx:europe-west1:yyyyyyyyy): it was only called 73.279µs ago 2021-06-08T05:13:42.888659Z
2021-06-08T04:47:07.470804269Z zzzzzzzzzz xxxxxxx 4 eu5 gae_app ERROR Throttling refreshCfg(xxxxxxx:europe-west1:yyyyyyyyy): it was only called 85.928µs ago 2021-06-08T04:47:07.467377Z
I tried some different configurations of max_connections, pool_size, pool_timeout with no success.
I have consulted this previous Issue.
And this documentation.
Some help would be appreciated.
More information. The error is always preceded by this warning in the log record:
"protoPayload": {
"#type": "type.googleapis.com/google.cloud.audit.AuditLog",
"status": {},
"authenticationInfo": {
"principalEmail": "bbbbbbbb#appspot.gserviceaccount.com",
"serviceAccountDelegationInfo": [
{
"firstPartyPrincipal": {
"principalEmail": "app-engine-appserver#prod.google.com"
}
}
]
},
"requestMetadata": {
"callerIp": "2600:1900:2001:12::8",
"requestAttributes": {
"time": "2021-06-09T05:59:27.400680Z",
"auth": {}
},
"destinationAttributes": {}
},
"serviceName": "cloudsql.googleapis.com",
"methodName": "cloudsql.instances.connect",
"authorizationInfo": [
{
"resource": "instances/aaaaaaaaaaa",
"permission": "cloudsql.instances.connect",
"granted": true,
"resourceAttributes": {}
}
],
"resourceName": "instances/aaaaaaaaa",
"request": {
"project": "bbbbbbb",
"#type": "type.googleapis.com/google.cloud.sql.v1beta4.SqlInstancesCreateEphemeralCertRequest",
"instance": "zzzzzzzzz",
"body": {}
},
"response": {
"#type": "type.googleapis.com/google.cloud.sql.v1beta4.SslCert",
"kind": "sql#sslCert"
}
},
"insertId": "-rgtsssssssssss",
"resource": {
"type": "cloudsql_database",
"labels": {
"region": "europe-west1",
"project_id": "bbbbbbbb",
"database_id": "aaaaaaaaaaaaaaa"
}
},
"timestamp": "2021-06-09T05:59:27.381352Z",
"severity": "NOTICE",
"logName":
"projects/demosmf/logs/cloudaudit.googleapis.com%2Factivity",
"receiveTimestamp": "2021-06-09T05:59:27.746071609Z"
I think it has something to do with the management of ssl certificates.
I have verified that the application certificates are valid and have not expired
This error has been reported via Google's Public Issue Tracker.
You can follow the thread I mentioned above to track the progress.

Get host status by CheckMK Web-API

I'm trying to get the status of a host with the CheckMK WebAPI. Can someone point me in the right direction how to get these data?
We're currently using CheckMK enterprise 1.4.0.
I've tried:
https://<monitoringhost.tld>/<site>/check_mk/webapi.py?action=get_host&_username=<user>&_secret=<secret>&output_format=json&effective_attributes=1&request={"hostname": "<hostname>"}
But the response does not have any relevant information about the host itself (e.g. state up/down, uptime, etc.).
{
"result": {
"attributes": {
"network_scan": {
"scan_interval": 86400,
"exclude_ranges": [],
"ip_ranges": [],
"run_as": "api"
},
"tag_agent": "cmk-agent",
"snmp_community": null,
"ipv6address": "",
"alias": "",
"management_protocol": null,
"site": "testjke",
"tag_address_family": "ip-v4-only",
"tag_criticality": "prod",
"contactgroups": [
true,
[]
],
"network_scan_result": {
"start": null,
"state": null,
"end": null,
"output": ""
},
"parents": [],
"management_address": "",
"tag_networking": "lan",
"ipaddress": "",
"management_snmp_community": null
},
"hostname": "<host>",
"path": ""
},
"result_code": 0
The webapi is only for getting/setting the configuration of a host or other objects. If you want't to get the live status of a host use livestatus.
If you enabled livestats on port 6557 (default) you could query the status of a host via network. If you are logged into a shell locally you can use 'lq'.
OMD[mysite]:~$ lq "GET hosts\nColumns: name"
Why:
The CheckMK webapi if for accessing WATO. WATO is the source for creating the nagios configuration. Nagios will do the monitoring of the hosts and the livestatus api is an extension of the nagios core.
http://<monitoringhost.tld>/<site>/check_mk/view.py?view_name=allhosts&output_format=csv
You can use all the views that you see in the webui by adding output_format=[csv|json|python].
You will the data of the table that you see.
You also need to add the creditals as seen in yout question.

How to check if name already exists? Azure Ressource Manager Template

is it possible to check, in an ARM Template, if the name for my Virtual Machine already exists?
I am developing a Solution Template for the Azure Marketplace. Maybe it is possible to set a paramter in the UiDefinition uniqe?
The goal is to reproduce this green Hook
A couple notes...
VM Names only need to be unique within a resourceGroup, not within the subscription
Solution Templates must be deployed to empty resourceGroups, so collisions with existing resources aren't possible
For solution templates the preference is that you simply name the VMs for the user, rather than asking - use something that is appropriate for the workload (e.g. jumpbox) - not all solutions do this but we're trying to improve that experience
Given that it's not likely we'll ever build a control that checks for naming collisions on resources without globally unique constraints.
That help?
This looks impossible, according to the documentation.
There are no validation scenarious.
I assume that you should be using the Microsoft.Common.TextBox UI element in your createUiDefinition.json.
I have tried to reproduce a green check by creating a simple createUiDefinition.json as below with a Microsoft.Common.TextBox UI element as shown below.
{
"$schema": "https://schema.management.azure.com/schemas/0.1.2-preview/CreateUIDefinition.MultiVm.json",
"handler": "Microsoft.Compute.MultiVm",
"version": "0.1.2-preview",
"parameters": {
"basics": [
{
"name": "textBoxA",
"type": "Microsoft.Common.TextBox",
"label": "VM Name",
"defaultValue": "",
"toolTip": "Please enter a VM name",
"constraints": {
"required": true
},
"visible": true
}
],
"steps": [],
"outputs": {}
}
}
I am able to reproduce the green check beside the VM Name textbox as shown below:
However, this green check DOES NOT imply the VM Name is Available.
This is because based on my testing, even if I use an existing VM Name in the same subscription, it is still showing the green check.
Based on the official documented constraints that are supported by the Microsoft.Common.TextBox UI element, it DOES NOT VALIDATE Name Availability.
Hope this helps!
While bmoore's point is correct that it's unlikely you would ever need this for a VM (nor is there an API for it), there are other compute resources that do have global naming requirements.
As of 2022 this concept is possible now with the use of the ArmApiControl UI element. It allows you to call ARM apis as part of validation in the createUiDefinition.json. Here is an example using the check name API for an Azure App service.
{
"$schema": "https://schema.management.azure.com/schemas/0.1.2-preview/CreateUIDefinition.MultiVm.json#",
"handler": "Microsoft.Azure.CreateUIDef",
"version": "0.1.2-preview",
"parameters": {
"basics": [
{}
],
"steps": [
{
"name": "domain",
"label": "Domain Names",
"elements": [
{
"name": "domainInfo",
"type": "Microsoft.Common.InfoBox",
"visible": true,
"options": {
"icon": "Info",
"text": "Pick the domain name that you want to use for your app."
}
},
{
"name": "appServiceAvailabilityApi",
"type": "Microsoft.Solutions.ArmApiControl",
"request": {
"method": "POST",
"path": "[concat(subscription().id, '/providers/Microsoft.Web/checknameavailability?api-version=2021-02-01')]",
"body": "[parse(concat('{\"name\":\"', concat('', steps('domain').domainName), '\", \"type\": \"Microsoft.Web/sites\"}'))]"
}
},
{
"name": "domainName",
"type": "Microsoft.Common.TextBox",
"label": "Domain Name Word",
"toolTip": "The name of your app service",
"placeholder": "yourcompanyname",
"constraints": {
"validations": [
{
"regex": "^[a-zA-Z0-9]{4,30}$",
"message": "Alphanumeric, between 4 and 30 characters."
},
{
"isValid": "[not(equals(steps('domain').appServiceAvailabilityApi.nameAvailable, false))]",
"message": "[concat('Error with the url: ', steps('domain').domainName, '. Reason: ', steps('domain').appServiceAvailabilityApi.reason)]"
},
{
"isValid": "[greater(length(steps('domain').domainName), 4)]",
"message": "The unique domain suffix should be longer than 4 characters."
},
{
"isValid": "[less(length(steps('domain').domainName), 30)]",
"message": "The unique domain suffix should be shorter than 30 characters."
}
]
}
},
{
"name": "section1",
"type": "Microsoft.Common.Section",
"label": "URLs to be created:",
"elements": [
{
"name": "domainExamplePortal",
"type": "Microsoft.Common.TextBlock",
"visible": true,
"options": {
"text": "[concat('https://', steps('domain').domainName, '.azurewebsites.net - The main app service URL')]"
}
}
],
"visible": true
}
]
}
],
"outputs": {
"desiredDomainName": "[steps('domain').domainName]"
}
}
}
You can copy the above code and test it in the createUiDefinition.json sandbox azure provides.

CannotStartContainerError while submitting a AWS Batch Job

In AWS Batch I have a job definition and a job queue and a compute environment where to execute my AWS Batch jobs.
After submitting a job, I find it in the list of the failed ones with this error:
Status reason
Essential container in task exited
Container message
CannotStartContainerError: API error (404): oci runtime error: container_linux.go:247: starting container process caused "exec: \"/var/application/script.sh --file= --key=.
and in the cloudwatch logs I have:
container_linux.go:247: starting container process caused "exec: \"/var/application/script.sh --file=Toulouse.json --key=out\": stat /var/application/script.sh --file=Toulouse.json --key=out: no such file or directory"
I have specified a correct docker image that has all the scripts (we use it already and it works) and I don't know where the error is coming from.
Any suggestions are very appreciated.
The docker file is something like that:
# Pull base image.
FROM account-id.dkr.ecr.region.amazonaws.com/application-image.base-php7-image:latest
VOLUME /tmp
VOLUME /mount-point
RUN chown -R ubuntu:ubuntu /var/application
# Create the source directories
USER ubuntu
COPY application/ /var/application
# Register aws profile
COPY data/aws /home/ubuntu/.aws
WORKDIR /var/application/
ENV COMPOSER_CACHE_DIR /tmp
RUN composer update -o && \
rm -Rf /tmp/*
Here is the Job Definition:
{
"jobDefinitionName": "JobDefinition",
"jobDefinitionArn": "arn:aws:batch:region:accountid:job-definition/JobDefinition:25",
"revision": 21,
"status": "ACTIVE",
"type": "container",
"parameters": {},
"retryStrategy": {
"attempts": 1
},
"containerProperties": {
"image": "account-id.dkr.ecr.region.amazonaws.com/application-dev:latest",
"vcpus": 1,
"memory": 512,
"command": [
"/var/application/script.sh",
"--file=",
"Ref::file",
"--key=",
"Ref::key"
],
"volumes": [
{
"host": {
"sourcePath": "/mount-point"
},
"name": "logs"
},
{
"host": {
"sourcePath": "/var/log/php/errors.log"
},
"name": "php-errors-log"
},
{
"host": {
"sourcePath": "/tmp/"
},
"name": "tmp"
}
],
"environment": [
{
"name": "APP_ENV",
"value": "dev"
}
],
"mountPoints": [
{
"containerPath": "/tmp/",
"readOnly": false,
"sourceVolume": "tmp"
},
{
"containerPath": "/var/log/php/errors.log",
"readOnly": false,
"sourceVolume": "php-errors-log"
},
{
"containerPath": "/mount-point",
"readOnly": false,
"sourceVolume": "logs"
}
],
"ulimits": []
}
}
In Cloudwatch log stream /var/log/docker:
time="2017-06-09T12:23:21.014547063Z" level=error msg="Handler for GET /v1.17/containers/4150933a38d4f162ba402a3edd8b7763c6bbbd417fcce232964e4a79c2286f67/json returned error: No such container: 4150933a38d4f162ba402a3edd8b7763c6bbbd417fcce232964e4a79c2286f67"
This error was because the command was malformed. I was submitting the job by a lambda function (python 2.7) using boto3 and the syntax of the command should be something like this:
'command' : ['sudo','mkdir','directory']
Hope it helps somebody.

Data Factory: AzureSQL in- and output for pipeline activity type AzureMLBatchExecution

In Azure Data Factory, I’m trying to call an Azure Machine Learning model by a Data Factory Pipeline. I want to use a Azure SQL table as input and another Azure SQL table for the output.
First I deployed a Machine Learning (classic) web service. Then I created an Azure Data Factory Pipeline, using a LinkedService (type= ‘AzureML’, using Request URI and API key of the ML-webservice) and a input and output dataset (‘AzureSqlTable’ type).
Deploying and Provisioning is succeeded. The pipeline starts as scheduled, but keeps ‘Running’ without any result. The pipeline activity is not being shown in the Monitor&Manage: Activity Windows.
On different sites and tutorials, I only find JSON-scripts using the activity type ‘AzureMLBatchExecution’ with BLOB in- and outputs. I want to use AzureSQL in- and output but I can’t get this working.
Can someone provide a sample JSON-script or tell me what’s possibly wrong with the code below?
Thanks!
{
"name": "Predictive_ML_Pipeline",
"properties": {
"description": "use MyAzureML model",
"activities": [
{
"type": "AzureMLBatchExecution",
"typeProperties": {},
"inputs": [
{
"name": "AzureSQLDataset_ML_Input"
}
],
"outputs": [
{
"name": "AzureSQLDataset_ML_Output"
}
],
"policy": {
"timeout": "02:00:00",
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Week",
"interval": 1
},
"name": "My_ML_Activity",
"description": "prediction analysis on ML batch input",
"linkedServiceName": "AzureMLLinkedService"
}
],
"start": "2017-04-04T09:00:00Z",
"end": "2017-04-04T18:00:00Z",
"isPaused": false,
"hubName": "myml_hub",
"pipelineMode": "Scheduled"
}
}
With a little help from a Microsoft technician, I've got this working. The JSON script as mentioned above is only changed in the schedule-section:
"start": "2017-04-01T08:45:00Z",
"end": "2017-04-09T18:00:00Z",
A pipeline is active only between its start time and end time. Because the scheduler is set to weekly, the pipeline is triggered at the start of the week: that date should be within start- and end date. For more details about scheduling, see: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution
The Azure SQL Input dataset should look like this:
{
"name": "AzureSQLDataset_ML_Input",
"properties": {
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "SRC_SQL_Azure",
"typeProperties": {
"tableName": "dbo.Azure_ML_Input"
},
"availability": {
"frequency": "Week",
"interval": 1
},
"external": true,
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
I added the external and policy properties to this dataset (see script above) and after that, it worked.