Skip to content

Troubleshooting guide for OCI Service Operator for Kubernetes (OSOK)

Operator Lifecycle Manager (OLM) Installation Issues

OLM Installation Status

In general verify the status of OLM installation in the cluster using the below command:

$ operator-sdk olm status

Expected output of above command :

INFO[0016] Fetching CRDs for version "0.18.2"
INFO[0016] Fetching resources for resolved version "v0.18.2"
INFO[0061] Successfully got OLM status for version "0.18.2"

NAME                                            NAMESPACE    KIND                        STATUS
operators.operators.coreos.com                               CustomResourceDefinition    Installed
operatorgroups.operators.coreos.com                          CustomResourceDefinition    Installed
operatorconditions.operators.coreos.com                      CustomResourceDefinition    Installed
installplans.operators.coreos.com                            CustomResourceDefinition    Installed
clusterserviceversions.operators.coreos.com                  CustomResourceDefinition    Installed
olm-operator                                    olm          Deployment                  Installed
olm-operator-binding-olm                                     ClusterRoleBinding          Installed
operatorhubio-catalog                           olm          CatalogSource               Installed
olm-operators                                   olm          OperatorGroup               Installed
aggregate-olm-view                                           ClusterRole                 Installed
catalog-operator                                olm          Deployment                  Installed
subscriptions.operators.coreos.com                           CustomResourceDefinition    Installed
aggregate-olm-edit                                           ClusterRole                 Installed
olm                                                          Namespace                   Installed
global-operators                                operators    OperatorGroup               Installed
operators                                                    Namespace                   Installed
packageserver                                   olm          ClusterServiceVersion       Installed
olm-operator-serviceaccount                     olm          ServiceAccount              Installed
catalogsources.operators.coreos.com                          CustomResourceDefinition    Installed
system:controller:operator-lifecycle-manager                 ClusterRole                 Installed

If the output of the OLM installation is having any failures, please uninstall and re-install the OLM into the cluster.

## Uninstall the OLM
$ operator-sdk olm uninstall

## Install the OLM
$ operator-sdk olm install

OLM Installation Issues

OLM installation fails with below error

FATA[0055] Failed to install OLM version "latest": detected existing OLM resources: OLM must be completely uninstalled before installation

Cleanup and Re-Install the OLM as below:

$ operator-sdk olm uninstall

If the above command fails, identify which version of OLM from below command

$ operator-sdk olm status

Option 1 : Run the below command to uninstall OLM using version

$ operator-sdk olm uninstall --version <OLM_VERSION>

Option 2 : Run the below command to uninstall OLM and its related components

$ kubectl -n olm get csvs
$ export OLM_RELEASE=<OLM_VERSION>
$ kubectl delete apiservices.apiregistration.k8s.io v1.packages.operators.coreos.com
$ kubectl delete -f https://github.com/operator-framework/operator-lifecycle-manager/releases/download/${OLM_RELEASE}/crds.yaml
$ kubectl delete -f https://github.com/operator-framework/operator-lifecycle-manager/releases/download/${OLM_RELEASE}/olm.yaml

Option 3 : In case OLM uninstall still fails, run below commands to uninstall OLM and its related components

$ kubectl delete apiservices.apiregistration.k8s.io v1.packages.operators.coreos.com
$ kubectl delete -f https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/crds.yaml
$ kubectl delete -f https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/olm.yaml

Verify OLM has been uninstalled successfully

$ kubectl get namespace olm
Error from server (NotFound): namespaces olm not found

Verify that OLM has been uninstalled successfully by making sure that OLM owned CustomResourceDefinitions are removed:

$ kubectl get crd | grep operators.coreos.com

OSOK Deployment Issues

If OSOK installation using OLM fails with below error message

FATA[0125] Failed to run bundle upgrade: error waiting for CSV to install: timed out waiting for the condition

The error signifies that during the installation of the OSOK, it timed out waiting for the condition. To mitigate this issue. Try to delete the bundle pod of the OSOK version we are trying to deploy

$ kubectl get pods | grep 'oci-service-operator-.*-bundle'

$ kubectl delete pod <POD_FROM_ABOVE_COMMAND>

After the bundle pod is deleted, re-install the OSOK bundle

$ operator-sdk run bundle ghcr.io/<REPOSITORY_OWNER>/oci-service-operator-<GROUP>-bundle:v2.0.0-alpha
## or for Upgrade
$ operator-sdk run bundle-upgrade ghcr.io/<REPOSITORY_OWNER>/oci-service-operator-<GROUP>-bundle:v2.0.0-alpha

Verify the OSOK is deployed successfully

$ kubectl get deployments -n $NAMESPACE | grep "oci-service-operator-<GROUP>-controller-manager"
..

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
oci-service-operator-mysql-controller-manager   1/1     1            1           2d20h

If all the replicas in deployment is not running, verify deployment logs for specific issue using below commands :

$ kubectl logs deploy/oci-service-operator-<GROUP>-controller-manager -n $NAMESPACE -f

OSOK Pods Issues

Verify the OSOK pods are running successfully

$ kubectl get pods -n $NAMESPACE | grep "oci-service-operator-<GROUP>-controller-manager"

oci-service-operator-mysql-controller-manager-5fcf985fd7-zj7d9    1/1     Running     0          2d22h

If the pods not running, verify pod logs for specific issue using below commands :

$ kubectl logs pod/oci-service-operator-<GROUP>-controller-manager-5fcf985fd7-zj7d9 -n $NAMESPACE -f

Note : Use the namespace that contains the controller-manager deployment for the package you installed. For the published subpackage bundles this is normally the package namespace from packages/<group>/metadata.env, for example oci-service-operator-mysql-system, unless you explicitly overrode the installation namespace.

Debugging Custom Resource (CR) Issues

If CR creation fails, monitor the OSOK controller pod logs (with steps outlined above) to understand the corresponding error code. Below are few of the commonly encountered failure scenarios :

  1. Authorization failed or requested resource not found ```bash "message": Failed to create or update resource: Service error:NotAuthorizedOrNotFound. Authorization failed or requested resource not found.. http status code: 404.
This happens mostly due to user authorization. Follow below steps for remediation : 
* Check if the instance principals are configured correctly for the OCI resource being provisioned. 
* If using user credentials or security-token auth, cross verify that the `ocicredentials` secret is populated correctly and that `auth_type` matches the intended mode.
* If `auth_type` is omitted and the secret still contains user-principal inputs (`user`, `tenancy`, `region`, `fingerprint`, `privatekey`) or OCI config-file inputs (`config_file_path`, `config_file_profile`), OSOK will use the user-principal path. Remove those inputs or set `auth_type=instance_principal` if instance principal auth is intended.

2. **Legacy or unsupported Autonomous Database credential fields rejected by schema validation**

The generated v2 `AutonomousDatabase` CR no longer accepts the old
`AutonomousDatabases` compatibility shape. Manifests that still use
`kind: AutonomousDatabases`, `spec.wallet`, `spec.walletPassword`, or a
plaintext `spec.adminPassword` value will be rejected before reconciliation.
Use either `spec.adminPassword.secret.secretName` or `spec.secretId` instead,
but not both in the same manifest. Migrate those manifests to the generated
`AutonomousDatabase` fields and use the dedicated
`AutonomousDatabaseWallet` or `AutonomousDatabaseRegionalWallet` resources for
wallet material.

3. **Service error:InvalidParameter**

Sample error msg : ERROR service-manager.AutonomousDatabase Create AutonomousDatabase failed {"error": "Service error:InvalidParameter. The Autonomous Database name cannot be longer than 14 characters ``` Invalid parameter error happens when one of the parameter for the associated OCI resource is not valid or not as per the specification. Check the specifications for the parameter being reported as invalid from the documentation page of the associated resource and update the same in the yaml for the CR. Parameter specifications : * AutonomousDB : https://docs.oracle.com/en-us/iaas/api/#/en/database/20160918/AutonomousDatabase/ * MySql : https://docs.oracle.com/en-us/iaas/api/#/en/mysql/20190415/DbSystem/ * Streaming : https://docs.oracle.com/en-us/iaas/api/#/en/streaming/20180418/Stream/

  1. If the CR creation fails with any 5XX error :
  2. Contact respective service team from Oracle for support with details of the request (opc-id) and failure message