Menu

A simple Python script which uses Google Cloud Storage to move BigQuery data between regions

By default, the Google Analytics 360 raw data export to BigQuery is saved to a dataset located in the US. It is possible to force new exports to be stored in the EU, but moving an existing dataset involves a bit more effort. There are many reasons you might want to move datasets of any type between regions, especially as more legislation is created with constraints on where data should be located and stored.

In this blog we’ll run you through the process of migrating the Google Analytics Export between regions. The process can be split into three parts:

1. Storing existing data in Google Cloud Storage (GCS)
2. Creating a new BigQuery dataset
3. Loading data from Cloud Storage into the new BigQuery dataset

We’ll explain what to do for each part, providing Python code samples along the way to help you apply this solution to your own dataset migrations. (Disclaimer: use this code at your own risk, please perform the necessary checks beforehand!)

 

Defining the BigQuery client

Before we can start interacting with BigQuery data programmatically, we first need to define a BigQuery client. The exact method to do this will differ depending on the environment in which you are running the script. For example, the following code works in the Colaboratory Notebook environment:

from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
client = bigquery.Client(project=project_id)

To run the code in the Mac Terminal (or Windows / Linux equivalent) follow the help articles on the Google Cloud website:

Getting Started

Service Accounts

Once you have defined a BigQuery client we are ready to start moving datasets.

 

1. Storing existing data in Google Cloud Storage

We start the migration process by creating a list of table_names within our Python environment.

# Code to define a list of table_names from the source_dataset

table_objects = list(client.list_tables(client.dataset(source_dataset_name)))
table_names = []

for i in range(len(table_objects)):
table_names.append(table_objects[i].table_id)

Once we have all our table names in a list we can start the first step of the migration, backing up all our data into GCS. The code below runs through all items in our table_names list and extracts the data in the Avro format. It then stores this data in the GCS bucket you have specified.

# Code to extract all table_names from the source_dataset in AVRO format and store them into the specified cloud storage bucket

for name in table_names:
source_table = client.dataset(source_dataset_name).table(name)

job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.AVRO

bucket_uri = 'gs://{}/{}-*'.format(bucket_name, name)

extract_job = client.extract_table(
source_table,
bucket_uri,
location=source_dataset_location,
job_config=job_config)

extract_job.result()

 

2. Configuring a new BigQuery dataset

Now that your Google Analytics data is backed up in GCS, we can configure BigQuery to create our dataset to in the desired region. The steps to take for this are as follows:

  • Unlink the Google Analytics 360 BigQuery export
  • Delete the existing dataset
  • Create the new dataset in the correct region
  • Relink the Google Analytics 360 BigQuery export

We won’t cover how to do this within this blog, as the following help page takes you through the process in great detail: https://support.google.com/analytics/answer/7584198?hl=en

Don’t forget to name you new BigQuery dataset as it was before (the name should be equal to the View ID that you are exporting) and to select the correct region in the setup process!

 

3. Storing existing data in Google Cloud Storage

Now that you have created a blank dataset in your desired region within BigQuery and have all your data backed up in Cloud Storage, it’s time to start loading data back into the dataset. We start by creating an empty table for every item in our table_names list:

# Code to create a table in the destination_dataset for each item in the table_names list

for name in table_names:
table_ref = client.dataset(destination_dataset_name).table(name)
table = bigquery.Table(table_ref)
table = client.create_table(table)

Finally, we load the data that is sitting within Google Cloud Storage into the blank tables that we have just defined.

# Code to load data from the specified bucket_uri into the tables within the destination_dataset

for name in table_names:
destination_table = client.dataset(destination_dataset_name).table(name)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.AVRO

bucket_uri = 'gs://{}/{}-*'.format(bucket_name, name)

load_job = client.load_table_from_uri(
bucket_uri,
destination_table,
location=destination_dataset_location,
job_config=job_config)

load_job.result()

Once this script has finished running, we are done with the dataset migration. All of your Google Analytics data will now be sitting in the correct region. Don’t forget to delete all the data in your Cloud Storage bucket to avoid storage costs!

 

Appendix

Avro format

The code in this blog exports tables in the Avro format, which exports the data schema as well as the data within each table. This is very useful when migrating the GA Export, as the schema has changed multiple times. More information on Avro can be found here.

Variables to define

To run the code in this blog on your own BigQuery datasets, you’ll need to define the following variables as strings with information for your specific case:

project_id
bucket_uri
source_dataset_name
source_dataset_location
destination_dataset_name
destination_dataset_location

Manual table name definition

Sometimes it can be useful to manually define the ‘table_names’ list, for instance when a notebook disconnects or your computer turns off midway through a running some code. Rather than start at the beginning, we can often pick back up where we left off by manually defining the table_names list. This can be done with the code below:

# Code to define the table_names list. Input format: mm/dd/YYYY

import pandas as pd
from time import strftime

datelist = pd.date_range(start='mm/dd/YYYY',end='mm/dd/YYYY').tolist()

table_names = []

for x in datelist:
table_names.append('ga_sessions_' + x.strftime('%Y%m%d'))

Joe Whitehead Data Specialist