GCP: How to export Google Datastore for your GCP applications?

Whether it's accidental deletion of data or a hardware fault, it's always good to have a disaster recovery plan for your critical applications in production. This guide takes you through the ways I have found for backing up Datastore entities. If you are using Google App Engine with Google Datastore, it should help you understand the backup mechanisms to a good extent. So far I have only used Python on GAE, so the code snippets and client libraries in this post are all Python based.

 

Overview

If you need one-time datastore backups, Google Datastore has REST/RPC based API endpoints for managed export and import of datastore entities, as well as client libraries for the language of your choice. I am not going to discuss these in detail because the Google documentation has already done that job fairly well, but I would like to suggest an alternative to this managed export/import service.
Google provides the managed export and import service for Google Datastore in the following ways:

– The Google Datastore NDB client library allows App Engine Python apps to connect to Cloud Datastore and perform several operations. NDB is well documented, but exporting via NDB is a bit the Google docs leave undocumented, which I explain later in this post.

– Via the GCP console, at https://ah-builtin-python-bundle-dot-[PROJECT_ID].appspot.com/_ah/datastore_admin?app_id=[PROJECT_ID], which lets you export and import datastore entities with a single click.

– Refer: https://cloud.google.com/appengine/docs/standard/python/console/managing-datastore
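
As an aside, for a quick one-time export with this managed service you can also use the gcloud CLI, which wraps the same export API. A minimal sketch (the bucket, kinds and project ID below are placeholders, not from a real app):

gcloud datastore export gs://[BUCKET_NAME] --kinds=Song,Album --project=[PROJECT_ID]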

Scheduling managed datastore exports

The Google documentation explains how to schedule periodic backups per service by setting up cron jobs. The backups are created and executed as per the specifications in cron.yaml. The configuration (cron.yaml) uses a few directives, described below; a sample cron.yaml follows the list.

  • `url`: an HTTP GET endpoint with certain required/optional query params, such as the storage bucket name, the entities to be backed up, and namespace_id.
    (e.g. url: /cloud-datastore-export?output_url_prefix=gs://[BUCKET_NAME]&namespace_id=Classical&namespace_id=Pop&kind=Song&kind=Album)
  • `target`: the service within which the request handler exists.
  • `schedule`: a different schedule per environment
    (e.g. every 24 hours, every 1 hour, etc.)
  • `timezone`: based on the region of the App Engine app.
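
Putting these directives together, a minimal cron.yaml could look like the sketch below (the bucket name, kinds, target service and timezone are placeholders for illustration):

cron:
- description: daily Cloud Datastore export
  url: /cloud-datastore-export?output_url_prefix=gs://[BUCKET_NAME]&kind=Song&kind=Album
  target: cloud-datastore-admin
  schedule: every 24 hours
  timezone: America/Los_Angeles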

 

If you have a single application and only its associated datastore, the approach Google suggests suits you best. But if your architecture has multiple services (different project IDs altogether) and multiple datastores, with each service having a different set of entities to be backed up on different schedules in every environment (DEV, PROD, etc.), the job becomes cumbersome. It also has the following limitations.

Limitations: –

  • We have to introduce an extra HTTP endpoint which is not app specific.
  • The same endpoint and handler code will be redundant across all our services.
  • The required cron directives may have different values for each environment (TEST, DEV, UAT & PROD) and service, so we have to dynamically generate `cron.yaml`, which in turn means placing extra code redundantly across all our services.
  • You would have to elevate the service account's permissions to `datastore export import admin` by going into each environment of each service, which is just extra work.

 

Solution

 

Isolate the backup service from the application code by using Cloud Scheduler and a Cloud Function.

Cloud Scheduler can trigger an HTTP request at a scheduled time. An HTTP Cloud Function can then trigger the backup, either by using a client library such as NDB or by calling the Datastore endpoint that creates the backup.

HTTP Cloud Functions cannot have an authorisation mechanism due to their limitations. So, to add security, protect the Cloud Function endpoint with a Pub/Sub topic and use the `--trigger-topic` option when deploying the function.
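
As a rough sketch of the wiring (the topic name, job name and schedule below are hypothetical, not from a real project):

gcloud functions deploy get_export --runtime python37 --trigger-topic datastore-export-topic

gcloud scheduler jobs create pubsub datastore-export-job --schedule "0 2 * * *" --topic datastore-export-topic --message-body "{}"

Cloud Scheduler then publishes to the topic on schedule, and Pub/Sub invokes the function, so the endpoint is never exposed over plain HTTP.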

Using datastore API endpoint for exporting datastore entities – 

Refer: https://developers.google.com/identity/protocols/OAuth2ServiceAccount#authorizingrequests

In the Cloud Function, to access the Datastore API `https://datastore.googleapis.com/v1/projects/<APP ID>:export`, we require an `access_token` with the scope `https://www.googleapis.com/auth/datastore`.

I was using my default application service account to hit the Datastore API and had a bit of a struggle getting the access token. I finally succeeded in getting the access_token, but the approach is not so neat, so I wouldn't recommend going with it.

Please refer – https://stackoverflow.com/questions/54762687/google-cloud-functions-how-to-get-access-tokens-in-cloud-function-for-the-requi
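
For completeness, here is a minimal sketch of that idea using the google-auth library (this is my own illustration, not the exact code from the thread above; the function name, project ID and bucket are placeholders):

import google.auth
import google.auth.transport.requests
import requests


def export_datastore(project_id, bucket_name):
    # Obtain default credentials with the Datastore scope and refresh them
    # so that credentials.token holds a valid access token.
    credentials, _ = google.auth.default(
        scopes=['https://www.googleapis.com/auth/datastore'])
    credentials.refresh(google.auth.transport.requests.Request())

    url = 'https://datastore.googleapis.com/v1/projects/{}:export'.format(project_id)
    body = {'outputUrlPrefix': 'gs://{}'.format(bucket_name)}
    headers = {'Authorization': 'Bearer {}'.format(credentials.token)}

    # Call the managed export endpoint directly over REST.
    response = requests.post(url, json=body, headers=headers)
    response.raise_for_status()
    return response.json()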

Using NDB client library for exporting datastore entities –

You can set environment variables on a Google Cloud Function by adding `--set-env-vars=[KEY=VALUE,…]` to your gcloud deploy command, as below –

gcloud functions deploy get_export --set-env-vars GCS=<bucket-name>

But if you have multiple environment variables and want extra control over Cloud Function versioning, it's suggested to use `--env-vars-file=FILE_PATH`, the path to a local YAML file with definitions for all the environment variables.
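
For example, a hypothetical .env.yaml for the function below could look like this (the file name and values are placeholders):

GCS: gs://my-backup-bucket
PROJECT_ID: my-project-id

gcloud functions deploy get_export --env-vars-file .env.yaml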

import os

from googleapiclient.discovery import build


# Note: `request` suits an HTTP-triggered function; if the function is deployed
# with --trigger-topic, the background function signature is (event, context).
def get_export(request):
    # Read the target GCS bucket and project ID from environment variables.
    output_url_prefix = os.environ.get('GCS', 'Specified environment variable is not set - GCS ')
    project_id = os.environ.get('PROJECT_ID', 'Specified environment variable is not set - PROJECT_ID ')

    # Only these kinds are included in the export.
    kinds = ['Customer', 'Purchase']

    entity_filter = {
        'kinds': kinds,
    }

    body = {
        'output_url_prefix': output_url_prefix,
        'entity_filter': entity_filter
    }

    # Build the Datastore Admin API client and kick off a managed export.
    client = build('datastore', 'v1beta1')
    client.projects().export(projectId=project_id, body=body).execute()

    return "success"

This is a pretty neat and easy approach: you get better control over which entities are backed up and on what schedule, without placing the backup service inside the application code.

Please leave comments to improve this article. Thanks for reading.
