Troubleshooting

Troubleshooting a particular domain resource

After you have an installed and running operator, it is rarely, but sometimes, necessary to debug the operator itself. If you are having problems with a particular domain resource, then first see Domain debugging.

Check Helm status

An operator runtime is installed into a Kubernetes cluster and maintained using a Helm release. For information about how to list your installed Helm releases and get each release’s configuration, see Useful Helm operations.

Ensure the operator CRDs are installed

When you install and run an operator, the installation should have deployed a Domain custom resource definition (CRD) and a Cluster CRD to the cluster. To check, verify that the following command lists a CRD named domains.weblogic.oracle and another CRD named clusters.weblogic.oracle:

$ kubectl get crd

The command output should look something like the following:

NAME                                    CREATED AT
clusters.weblogic.oracle                2022-10-15T03:45:27Z
domains.weblogic.oracle                 2022-10-15T03:45:27Z

When a domain or cluster CRD is not installed, the operator runtimes will not be able to monitor domains or clusters, and commands like kubectl get domains will fail.

Typically, the operator automatically installs each CRD when the operator first starts. However, if a CRD was not installed, for example, if the operator lacked sufficient permission to install it, then refer to the operator Prepare for installation documentation.
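
The CRD check described in this section can be scripted. The following is a minimal sketch, not part of the operator tooling: the function name missing_operator_crds is hypothetical, and it only inspects the text that kubectl get crd prints.

```shell
# Sketch: print the name of each operator CRD that is missing from
# `kubectl get crd` output supplied on stdin. Prints nothing if both exist.
# The function name missing_operator_crds is hypothetical.
missing_operator_crds() {
  awk '
    $1 == "domains.weblogic.oracle"  { have_domain = 1 }
    $1 == "clusters.weblogic.oracle" { have_cluster = 1 }
    END {
      if (!have_domain)  print "domains.weblogic.oracle"
      if (!have_cluster) print "clusters.weblogic.oracle"
    }'
}

# Usage (requires cluster access):
#   kubectl get crd | missing_operator_crds
```

Empty output means both CRDs were listed; any printed name is a CRD that still needs to be installed.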

Check the operator deployment

Verify that the operator’s deployment is deployed and running by listing all deployments with the weblogic.operatorName label.

$ kubectl get deployment --all-namespaces=true -l weblogic.operatorName

Check the operator deployment’s detailed status:

$ kubectl -n OP_NAMESPACE get deployments/weblogic-operator -o yaml

And/or:

$ kubectl -n OP_NAMESPACE describe deployments/weblogic-operator

Each operator deployment will have a corresponding Kubernetes pod with a name that has a prefix that matches the deployment name, plus a unique suffix that changes every time the deployment restarts.

To find operator pods and check their high-level status:

$ kubectl get pods --all-namespaces=true -l weblogic.operatorName

To check the details for a given pod:

$ kubectl -n OP_NAMESPACE get pod weblogic-operator-UNIQUESUFFIX -o yaml
$ kubectl -n OP_NAMESPACE describe pod weblogic-operator-UNIQUESUFFIX

The output of kubectl describe pod usefully includes any events that might be associated with the operator pod.
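
When several operators are installed, it can help to reduce the pod listing to the pods that need attention. The following is a minimal sketch under stated assumptions: the function name not_ready_pods is hypothetical, and it assumes the default column order that kubectl get pods prints with --all-namespaces (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Sketch: list pods that are not fully ready, from the output of
# `kubectl get pods --all-namespaces ...` supplied on stdin.
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
# The function name not_ready_pods is hypothetical.
not_ready_pods() {
  awk '
    NR == 1 { next }                 # skip the header row
    {
      split($3, ready, "/")          # READY looks like "1/1"
      if (ready[1] != ready[2] || $4 != "Running")
        print $1 "/" $2 " (" $3 ", " $4 ")"
    }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces=true -l weblogic.operatorName | not_ready_pods
```

The same filter can be applied to the conversion webhook pods by using the weblogic.webhookName label instead.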

Check the conversion webhook deployment

All operators in a Kubernetes cluster share a single conversion webhook deployment. Verify that the conversion webhook is deployed and running by listing all deployments with the weblogic.webhookName label.

$ kubectl get deployment --all-namespaces=true -l weblogic.webhookName

Check the conversion webhook deployment’s detailed status:

$ kubectl -n WH_NAMESPACE get deployments/weblogic-operator-webhook -o yaml

And/or:

$ kubectl -n WH_NAMESPACE describe deployments/weblogic-operator-webhook

Each conversion webhook deployment will have a corresponding Kubernetes pod with a name that has a prefix that matches the deployment name, plus a unique suffix that changes every time the deployment restarts.

To find conversion webhook pods and check their high-level status:

$ kubectl get pods --all-namespaces=true -l weblogic.webhookName

To check the details for a given pod:

$ kubectl -n WH_NAMESPACE get pod weblogic-operator-webhook-UNIQUESUFFIX -o yaml
$ kubectl -n WH_NAMESPACE describe pod weblogic-operator-webhook-UNIQUESUFFIX

The output of kubectl describe pod usefully includes any events that might be associated with the conversion webhook pod. For information about installing and uninstalling the webhook, see WebLogic Domain resource conversion webhook.

Check common operator issues

Helm 4 server-side apply field ownership conflicts

If a Helm chart manages WebLogic Domain or Cluster resources, a Helm 4 upgrade may fail when Helm uses Kubernetes Server-Side Apply and another field manager owns the same resource fields.

For example, an upgrade of a chart that contains a legacy weblogic.oracle/v8 Domain resource may fail with an error similar to:

Error: UPGRADE FAILED: conflict occurred while applying object MY_NAMESPACE/MY_DOMAIN weblogic.oracle/v8, Kind=Domain:
Apply failed with 1 conflict: conflict with "kubectl-client-side-apply" using weblogic.oracle/v8: .spec.clusters

The conflicting manager name may differ, for example, before-first-apply, kubectl-client-side-apply, or another tool that uses Kubernetes apply semantics. This issue is not caused by the WebLogic Kubernetes Operator reconciliation logic. The Kubernetes API server rejects the Helm update before the operator processes the resource.
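
To see at a glance which field manager and field are involved, the conflict text can be split into its parts. The following is a minimal sketch that assumes the message format shown in the example error above; the function name parse_apply_conflict is hypothetical.

```shell
# Sketch: extract the conflicting field manager, API version, and field path
# from Helm Server-Side Apply conflict text supplied on stdin.
# Assumes the message format shown in the example error above.
# The function name parse_apply_conflict is hypothetical.
parse_apply_conflict() {
  sed -n 's/.*conflict with "\([^"]*\)" using \([^:]*\): \(.*\)$/manager=\1 apiVersion=\2 field=\3/p'
}

# Usage:
#   helm upgrade RELEASE_NAME CHART_PATH --namespace NAMESPACE 2>&1 | parse_apply_conflict
```

To inspect the recorded ownership on the live resource, recent kubectl versions can print metadata.managedFields with the --show-managed-fields option of kubectl get, for example with -o yaml.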

Kubernetes Server-Side Apply tracks ownership of individual resource fields in metadata.managedFields. If Helm 4 applies a field that is already owned by another field manager, Kubernetes rejects the update unless conflicts are forced. The Helm 4 documentation describes Server-Side Apply as a Helm 4 feature for resolving cases where multiple tools manage the same Kubernetes resources; the Helm 4 helm upgrade command documents the --server-side option and the --force-conflicts option for conflict handling. For more information, see Helm 4 Overview and helm upgrade.

This conflict is most likely when all of the following are true:

  • A Helm chart manages WebLogic Domain or Cluster resources.
  • Helm 4 is using Server-Side Apply for the install or upgrade.
  • The same resource fields were previously created or modified by another apply manager, such as kubectl apply, kubectl apply --server-side, a GitOps tool, or an older workflow that caused Kubernetes to assign ownership to before-first-apply.

For legacy weblogic.oracle/v8 Domain resources, conflicts can appear on fields such as .spec.clusters. Beginning with weblogic.oracle/v9, cluster lifecycle settings are represented by separate Cluster resources, so equivalent conflicts may appear on Cluster fields such as .spec.replicas.

This scenario is separate from installing or upgrading the WebLogic Kubernetes Operator Helm chart itself. A typical operator Helm upgrade does not manage user Domain resources.

If the Helm chart is intended to be the source of truth for the affected Domain or Cluster resource fields, run the Helm 4 upgrade with conflict forcing:

$ helm upgrade RELEASE_NAME CHART_PATH \
  --namespace NAMESPACE \
  --server-side=true \
  --force-conflicts

This allows Helm to take ownership of the conflicted fields for future Server-Side Apply operations.

If you do not want Helm to take Server-Side Apply ownership of those fields, disable Server-Side Apply for the Helm 4 upgrade:

$ helm upgrade RELEASE_NAME CHART_PATH \
  --namespace NAMESPACE \
  --server-side=false

Avoid managing the same Domain or Cluster spec fields with multiple Server-Side Apply managers. For example, do not use both Helm Server-Side Apply and another Server-Side Apply based GitOps or kubectl apply --server-side workflow for the same fields unless you intentionally coordinate field ownership.

For legacy weblogic.oracle/v8 Domain resources, migrate authored manifests to weblogic.oracle/v9 and manage cluster lifecycle settings through separate Cluster resources. Do not continue using v8 manifests as the normal update path after conversion. This reduces the amount of mutable cluster lifecycle configuration in the Domain resource and aligns with the current operator schema. Note that Server-Side Apply conflicts can still occur if multiple managers apply the same v9 Domain or Cluster fields.

Check for operator events

To check for Kubernetes events that may have been logged to the operator’s namespace:

$ kubectl -n OP_NAMESPACE get events --sort-by='.lastTimestamp'
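
In a busy namespace, most events are routine. The following is a minimal sketch for narrowing the listing to Warning events; the function name warning_events is hypothetical, and it assumes the default kubectl get events columns, where the event type is the second whitespace-separated field of each data row.

```shell
# Sketch: keep the header row and only the Warning events from
# `kubectl get events` output supplied on stdin.
# Assumes the default columns (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE).
# The function name warning_events is hypothetical.
warning_events() {
  awk 'NR == 1 || $2 == "Warning"'
}

# Usage (requires cluster access):
#   kubectl -n OP_NAMESPACE get events --sort-by='.lastTimestamp' | warning_events
```

Alternatively, kubectl can filter events on the server side with --field-selector type=Warning.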

Check the operator metrics endpoint

The operator exposes a Prometheus-compatible metrics endpoint on port 8083 at path /metrics. The operator deployment is annotated for scraping using prometheus.io/scrape: 'true' and prometheus.io/port: '8083'.

To check the metrics endpoint from the operator namespace:

$ kubectl -n OP_NAMESPACE port-forward deployment/weblogic-operator 8083:8083

Then, in a different terminal:

$ curl http://localhost:8083/metrics

The operator metrics output includes standard JVM and process metrics from the Prometheus Java client. It also includes operator-specific metrics that identify the namespaces and domains that are actively managed by the operator at the time of the scrape:

  • wko_managed_namespace_count: The number of namespaces actively managed by the operator.
  • wko_managed_namespace_info{namespace="..."}: A presence metric with value 1 for each namespace actively managed by the operator.
  • wko_managed_domain_count: The total number of domains actively managed by the operator across all active namespaces.
  • wko_managed_domain_info{namespace="...",domain_uid="..."}: A presence metric with value 1 for each domain actively managed by the operator.

These metrics reflect the operator’s current runtime state. For example, if the operator uses the RegExp namespace selection strategy, then the metrics report the namespaces and domains that are currently being managed after the regular expression has been resolved, not the configured regular expression itself.
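
To turn the scrape output into a plain list of managed domains, the presence metric can be parsed directly. The following is a minimal sketch, assuming the metric and label names listed above and the standard Prometheus text format; the function name list_managed_domains is hypothetical.

```shell
# Sketch: print NAMESPACE/DOMAIN_UID for every wko_managed_domain_info sample
# in Prometheus text output supplied on stdin. Label order does not matter.
# The function name list_managed_domains is hypothetical.
list_managed_domains() {
  awk '
    function label(line, name,    re, s) {
      re = name "=\"[^\"]*\""
      if (!match(line, re)) return ""
      s = substr(line, RSTART, RLENGTH)
      sub("^" name "=\"", "", s)     # drop the leading name="
      sub("\"$", "", s)              # drop the closing quote
      return s
    }
    index($0, "wko_managed_domain_info{") == 1 {
      print label($0, "namespace") "/" label($0, "domain_uid")
    }'
}

# Usage (with the port-forward from this section still running):
#   curl -s http://localhost:8083/metrics | list_managed_domains
```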

Check for conversion webhook events

To check for Kubernetes events that may have been logged to the conversion webhook’s namespace:

$ kubectl -n WH_NAMESPACE get events --sort-by='.lastTimestamp'

Check the operator log

Look for SEVERE and WARNING level messages in your operator logs. For example:

  • Find your operator.

    $ kubectl get deployment --all-namespaces=true -l weblogic.operatorName
    NAMESPACE                     NAME                DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
    sample-weblogic-operator-ns   weblogic-operator   1         1         1            1           20h
  • Use grep on the operator log; look for SEVERE and WARNING level messages.

    $ kubectl logs deployment/weblogic-operator -n sample-weblogic-operator-ns \
      | grep -E "level...(SEVERE|WARNING)"
    {"timestamp":"03-18-2020T20:42:21.702+0000","thread":11,"fiber":"","domainUID":"","level":"WARNING","class":"oracle.kubernetes.operator.helpers.HealthCheckHelper","method":"createAndValidateKubernetesVersion","timeInMillis":1584564141702,"message":"Kubernetes minimum version check failed. Supported versions are 1.13.5+,1.14.8+,1.15.7+, but found version v1.12.3","exception":"","code":"","headers":{},"body":""}
  • You can narrow the output to the operator log messages for a specific domainUID by piping the previous logs command through grep "domainUID...MY_DOMAINUID". For example, assuming your operator is running in namespace sample-weblogic-operator-ns and your domain UID is sample-domain1:

    $ kubectl logs deployment/weblogic-operator -n sample-weblogic-operator-ns \
      | grep -E "level...(SEVERE|WARNING)" \
      | grep "domainUID...sample-domain1"
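
Before reading individual messages, a per-level count gives a quick sense of how noisy the log is. The following is a minimal sketch that assumes the JSON log line format shown in the example above, where each line contains a "level":"..." field; the function name count_log_levels is hypothetical.

```shell
# Sketch: count operator log messages per logging level.
# Reads operator log lines on stdin and prints "LEVEL COUNT", sorted by level.
# Assumes each line contains a field of the form "level":"WARNING".
# The function name count_log_levels is hypothetical.
count_log_levels() {
  awk '
    match($0, /"level":"[A-Z]+"/) {
      # Skip the 9-character prefix "level":" and the closing quote.
      level = substr($0, RSTART + 9, RLENGTH - 10)
      count[level]++
    }
    END { for (level in count) print level, count[level] }' | sort
}

# Usage (requires cluster access):
#   kubectl logs deployment/weblogic-operator -n sample-weblogic-operator-ns | count_log_levels
```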

Check the conversion webhook log

To check the conversion webhook deployment’s log (look especially for SEVERE and WARNING level messages):

$ kubectl logs -n WH_NAMESPACE -c weblogic-operator-webhook deployments/weblogic-operator-webhook

Operator ConfigMap

An operator’s settings are automatically maintained by Helm in a Kubernetes ConfigMap named weblogic-operator-cm in the same namespace as the operator. To view the contents of this ConfigMap, run kubectl -n sample-weblogic-operator-ns get cm weblogic-operator-cm -o yaml.

Domain on PV hostPath PersistentVolume denied after upgrade

Beginning with operator version 4.3.9, operator-created PersistentVolumes that specify a hostPath source under domain.spec.configuration.initializeDomainOnPV.persistentVolume require the operator Helm chart value domainOnPV.localDeveloperMode=true. This mode is intended only for local development clusters and must not be used in production or shared multi-tenant clusters. If multiple Domain resources use the same hostPath and the same domain home location, their domain creation jobs can race and overwrite the same files.

If you upgrade the operator and then apply a Domain on PV resource that uses an operator-created hostPath PersistentVolume, the validating webhook may deny the request with an error similar to the following:

admission webhook "weblogic.validating.webhook" denied the request:
Persistent volume sample-domain1-pv-rwm1 is invalid, the 'spec.hostPath' source is not allowed in
'spec.configuration.initializeDomainOnPV.persistentVolume' unless the operator Helm chart value
'domainOnPV.localDeveloperMode' is enabled.

To enable this setting for a local development cluster, update the Helm release:

$ helm upgrade OPERATOR_RELEASE_NAME weblogic-operator/weblogic-operator \
  --namespace OP_NAMESPACE \
  --reuse-values \
  --set domainOnPV.localDeveloperMode=true \
  --wait

The Helm upgrade updates the operator and webhook ConfigMaps, but if this is the only change, Kubernetes may not restart the already running operator or validating webhook Pods because their Deployment pod templates did not change. Restart the operator and webhook deployments so that they use the updated ConfigMap values:

$ kubectl -n OP_NAMESPACE rollout restart deployment/weblogic-operator
$ kubectl -n OP_NAMESPACE rollout status deployment/weblogic-operator
$ kubectl -n OP_NAMESPACE rollout restart deployment/weblogic-operator-webhook
$ kubectl -n OP_NAMESPACE rollout status deployment/weblogic-operator-webhook

You can verify the live setting with:

$ kubectl -n OP_NAMESPACE get cm weblogic-operator-cm \
  -o jsonpath='{.data.domainOnPVLocalDeveloperMode}{"\n"}'
$ kubectl -n OP_NAMESPACE get cm weblogic-webhook-cm \
  -o jsonpath='{.data.domainOnPVLocalDeveloperMode}{"\n"}'
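
The two verification commands can be combined into a single pass or fail check. The following is a minimal sketch, not an operator-provided tool: the function name local_developer_mode_enabled is hypothetical, and it assumes that the webhook ConfigMap is in the same namespace as the operator and that an enabled setting is stored as the string true.

```shell
# Sketch: succeed only if both ConfigMaps report domainOnPVLocalDeveloperMode
# as "true". Takes the operator namespace as the first argument.
# Assumptions: the webhook ConfigMap is in the same namespace, and the
# enabled value is stored as the string "true".
# The function name local_developer_mode_enabled is hypothetical.
local_developer_mode_enabled() {
  ns=$1
  for cm in weblogic-operator-cm weblogic-webhook-cm; do
    value=$(kubectl -n "$ns" get cm "$cm" \
      -o jsonpath='{.data.domainOnPVLocalDeveloperMode}')
    if [ "$value" != "true" ]; then
      echo "$cm: domainOnPVLocalDeveloperMode is '${value:-unset}'" >&2
      return 1
    fi
  done
  echo "domainOnPVLocalDeveloperMode is true in both ConfigMaps"
}

# Usage (requires cluster access):
#   local_developer_mode_enabled OP_NAMESPACE
```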

Force the operator to restart

Note

An operator is designed to robustly handle thousands of domains even in the event of failures, so it should not normally be necessary to force an operator to restart, even after an upgrade. Accordingly, if you encounter a problem that you think requires an operator restart to resolve, then please make sure that the operator development team is aware of the issue (see Get Help).

When you restart an operator:

  • The operator is temporarily unavailable for managing its namespaces.
    • For example, a domain that is created while the operator is restarting will not be started until the operator pod is fully up again.
  • This will not shut down your current domains or affect their resources.
  • The restarted operator will rediscover existing domains and manage them.

There are several approaches for restarting an operator:

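  • Most simply, use the helm upgrade command with the --recreate-pods option. The command requires both the release name and the chart reference. Note that --recreate-pods is deprecated in Helm 3; if your Helm version rejects it, then use one of the other approaches in this list.

    $ helm upgrade OPERATOR_RELEASE_NAME weblogic-operator/weblogic-operator \
      --namespace OP_NAMESPACE \
      --reuse-values \
      --recreate-pods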
  • Delete the operator pod, and let Kubernetes restart it.

    a. First, find the operator pod you wish to delete:

    $ kubectl get pods --all-namespaces=true -l weblogic.operatorName

    b. Second, delete the pod. For example:

    $ kubectl delete pod/weblogic-operator-65b95bc5b5-jw4hh -n OP_NAMESPACE
  • Scale the operator deployment to 0, and then back to 1, by changing the value of the replicas.

    a. First, find the namespace of the operator deployment you wish to restart:

    $ kubectl get deployment --all-namespaces=true -l weblogic.operatorName

    b. Second, scale the deployment down to zero replicas:

    $ kubectl scale deployment.apps/weblogic-operator -n OP_NAMESPACE --replicas=0

    c. Finally, scale the deployment back up to one replica:

    $ kubectl scale deployment.apps/weblogic-operator -n OP_NAMESPACE --replicas=1

Operator and conversion webhook logging level

Warning

It should rarely be necessary to change the operator and conversion webhook to use a finer-grained logging level, but, in rare situations, the operator support team may direct you to do so. If you change the logging level, then be aware that FINE or finer-grained logging levels can be extremely verbose and quickly use up gigabytes of disk space in the span of hours, or, at the finest levels, during heavy activity, in even minutes. Consequently, the logging level should only be increased for as long as is needed to help get debugging data for a particular problem.

To set the operator javaLoggingLevel to FINE (the default is INFO), assuming that the operator Helm release is named sample-weblogic-operator, its namespace is sample-weblogic-operator-ns, and you have downloaded the operator source to /tmp/weblogic-kubernetes-operator:

$ cd /tmp/weblogic-kubernetes-operator
$ helm upgrade \
  sample-weblogic-operator \
  weblogic-operator/weblogic-operator \
  --namespace sample-weblogic-operator-ns \
  --reuse-values \
  --set "javaLoggingLevel=FINE" \
  --wait

To set the operator javaLoggingLevel back to INFO:

$ helm upgrade \
  sample-weblogic-operator \
  weblogic-operator/weblogic-operator \
  --namespace sample-weblogic-operator-ns \
  --reuse-values \
  --set "javaLoggingLevel=INFO" \
  --wait

For more information, see the javaLoggingLevel documentation.

Troubleshooting the conversion webhook

The following are some common mistakes and solutions for the conversion webhook.

Ensure the conversion webhook is deployed and running

Verify that the conversion webhook is deployed and running by following the steps in Check the conversion webhook deployment. If it is not deployed, then you will see the following conversion webhook not found error when you create a Domain resource that uses the weblogic.oracle/v8 schema.

Error from server: error when creating "k8s-domain.yaml": conversion webhook for weblogic.oracle/v9, Kind=Domain failed: Post "https://weblogic-operator-webhook-svc.sample-weblogic-operator-ns.svc:8084/webhook?timeout=30s": service "weblogic-operator-webhook-svc" not found

The conversion webhook can be deployed standalone or as part of an operator installation. Note that if the conversion webhook was installed as part of an operator installation, then it is implicitly removed by default when the operator is uninstalled. If the conversion webhook is not deployed or running, then reinstall it by following the steps in Installing the conversion webhook.

If the conversion webhook Deployment is deployed but is not in the ready status, then you will see a connection refused error when you create a Domain resource that uses the weblogic.oracle/v8 schema.

The POST URL in the error message has the name of the conversion webhook service and the namespace. For example, if the POST URL is https://weblogic-operator-webhook-svc.sample-weblogic-operator-ns.svc:8084/webhook?timeout=30s, then the service name is weblogic-operator-webhook-svc and the namespace is sample-weblogic-operator-ns. In this case, run the following commands to ensure that the Deployment is running and the webhook service exists in the sample-weblogic-operator-ns namespace.

$ kubectl get deployment weblogic-operator-webhook -n sample-weblogic-operator-ns
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
weblogic-operator-webhook   1/1     1            1           87m

$ kubectl get service weblogic-operator-webhook-svc -n sample-weblogic-operator-ns
NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
weblogic-operator-webhook-svc   ClusterIP   10.106.89.198   <none>        8084/TCP   88m

If the conversion webhook Deployment status is not ready, then check the conversion webhook log and the conversion webhook events in the conversion webhook namespace. If the conversion webhook service doesn’t exist, make sure that the conversion webhook was installed correctly and reinstall the conversion webhook to see if it resolves the issue.
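
Reading the service name and namespace out of the POST URL, as described earlier in this section, can be done mechanically. The following is a minimal sketch that assumes the host part of the URL has the form SERVICE.NAMESPACE.svc; the function name webhook_service_from_url is hypothetical.

```shell
# Sketch: split the webhook POST URL from the error message into the
# service name and namespace.
# Assumes the host has the form SERVICE.NAMESPACE.svc.
# The function name webhook_service_from_url is hypothetical.
webhook_service_from_url() {
  printf '%s\n' "$1" |
    sed -n 's|^https://\([^.]*\)\.\([^.]*\)\.svc[:/].*|service=\1 namespace=\2|p'
}

# Usage:
#   webhook_service_from_url "https://weblogic-operator-webhook-svc.sample-weblogic-operator-ns.svc:8084/webhook?timeout=30s"
```

The printed values can then be substituted into the kubectl get deployment and kubectl get service commands shown in this section.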

X509: Certificate signed by unknown authority error from the webhook

The following x509: certificate signed by unknown authority error from the conversion webhook can be caused by an incorrect proxy configuration of the Kubernetes API server in your environment, or by an incorrect self-signed certificate in the conversion webhook configuration in the Domain CRD.

Error from server (InternalError): error when creating "./weblogic-domains/sample-domain1/domain.yaml": Internal error occurred: conversion webhook for weblogic.oracle/v8, Kind=Domain failed: Post "https://weblogic-operator-webhook-svc.sample-weblogic-operator-ns.svc:8084/webhook?timeout=30s": x509: certificate signed by unknown authority

  • If your environment uses a proxy server, then ensure that the NO_PROXY settings of the Kubernetes API server include the .svc value. The Kubernetes API server makes a REST request to the conversion webhook REST endpoint using the host name weblogic-operator-webhook-svc.${NAMESPACE}.svc in the POST URL. If the REST request is routed through a proxy server, then you will see an “x509: certificate signed by unknown authority” error. Because this REST request is internal to your Kubernetes cluster, ensure that it doesn’t get routed through a proxy server by adding .svc to the NO_PROXY settings.

  • If, for some reason, your Domain CRD conversion webhook configuration has an incorrect self-signed certificate, then you can patch the Domain CRD to remove the existing conversion webhook configuration. The operator will re-create the conversion webhook configuration with the correct self-signed certificate in the Domain CRD. Use the following patch command to remove the conversion webhook configuration in the Domain CRD to see if it resolves the error.

    $ kubectl patch crd domains.weblogic.oracle --type=merge --patch '{"spec": {"conversion": {"strategy": "None", "webhook": null}}}'
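
For the proxy case in the first item above, the following is a minimal sketch of a check for the .svc entry. The function name no_proxy_has_svc is hypothetical. It assumes that you have already obtained the NO_PROXY value from the Kubernetes API server's environment (how to do that depends on your cluster setup) and that the value is a comma-separated list without surrounding spaces.

```shell
# Sketch: succeed if a comma-separated NO_PROXY value contains the exact
# entry ".svc". The value must come from the API server's environment,
# not from your local shell.
# The function name no_proxy_has_svc is hypothetical.
no_proxy_has_svc() {
  case ",$1," in
    *,.svc,*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Usage:
#   no_proxy_has_svc "localhost,127.0.0.1,.svc" && echo "NO_PROXY includes .svc"
```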

Webhook errors in older operator versions

When you install operator version 4.x or upgrade to operator 4.x, a conversion webhook configuration is added to your Domain CRD. If you downgrade or switch back to operator version 3.x, the conversion webhook configuration is not removed from the CRD. This is to support environments with multiple operator installations, potentially with different versions. For environments that have a single operator installation, use the following patch command to manually remove the conversion webhook configuration from the Domain CRD.

$ kubectl patch crd domains.weblogic.oracle --type=merge --patch '{"spec": {"conversion": {"strategy": "None", "webhook": null}}}'
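
To confirm the effect of a patch like the one above, you can read the conversion strategy back from the Domain CRD. The following is a minimal sketch; the function name domain_crd_conversion_strategy is hypothetical, and the field path is the standard spec.conversion.strategy field of a CustomResourceDefinition.

```shell
# Sketch: print the conversion strategy currently set in the Domain CRD
# (for example, None or Webhook).
# The function name domain_crd_conversion_strategy is hypothetical.
domain_crd_conversion_strategy() {
  kubectl get crd domains.weblogic.oracle \
    -o jsonpath='{.spec.conversion.strategy}'
}

# Usage (requires cluster access):
#   domain_crd_conversion_strategy
```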

Webhook errors in operator Dedicated mode

If the operator is running in Dedicated mode, the operator’s service account will not have permission to read or update the CRD. If you need to convert domain resources from the weblogic.oracle/v8 schema to the weblogic.oracle/v9 schema using the conversion webhook in Dedicated mode, then you can manually add the conversion webhook configuration to the Domain CRD. Use the following patch command to add the conversion webhook configuration to the Domain CRD.

NOTE: Substitute YOUR_OPERATOR_NS in the following command with the namespace where the operator is installed.

$ export OPERATOR_NS=YOUR_OPERATOR_NS
$ kubectl patch crd domains.weblogic.oracle --type=merge --patch '{"spec": {"conversion": {"strategy": "Webhook", "webhook": {"clientConfig": { "caBundle": "'$(kubectl get secret weblogic-webhook-secrets -n ${OPERATOR_NS} -o=jsonpath="{.data.webhookCert}"| base64 --decode)'", "service": {"name": "weblogic-operator-webhook-svc", "namespace": "'${OPERATOR_NS}'", "path": "/webhook", "port": 8084}}, "conversionReviewVersions": ["v1"]}}}}'

Check for runtime errors during conversion

If you see a WebLogic Domain custom resource conversion webhook failed error when creating a Domain with a weblogic.oracle/v8 schema domain resource, then check the conversion webhook runtime Pod logs and check for the generated events in the conversion webhook namespace. Assuming that the conversion webhook is deployed in the sample-weblogic-operator-ns namespace, run the following commands to check for logs and events.

$ kubectl logs -n sample-weblogic-operator-ns -c weblogic-operator-webhook deployments/weblogic-operator-webhook

$ kubectl get events -n sample-weblogic-operator-ns

See also

If you have set up either of the following, then these documents may be helpful in debugging: