
Google Cloud Platform guide

A guide for Google Cloud Platform (GCP).

APIs and services

When working with GCP, the user will need to enable the APIs required for the features they will be using. Here is a list of APIs to activate for storage, workflows, Dataflow, and triggers.

  • Cloud:
    • Cloud Pub/Sub API
    • Cloud Storage
    • Cloud Logging API
    • Cloud Storage JSON API
    • BigQuery API
    • Cloud Datastore API
    • Compute Engine API
  • Workflow:
    • Workflows API
    • Workflow Executions API
    • Eventarc API
    • Cloud Scheduler API
  • Dataflow:
    • Dataflow API

This does not mean the user needs to make use of all of these APIs; the list only points out the important cases.

To activate any of the above, do the following:

  • On the dashboard, click on APIs and services

GCP API services

  • Select Library. This will take the user to a library which contains the available APIs
  • The simplest method, if the API name is known, is to search for the API in the search bar
  • For this guide, type “Cloud pub/sub api” in the search bar and press Enter

GCP API search

  • Select the Cloud Pub/Sub API result
  • Click Enable
  • Do this for all required APIs and services

If a GCP feature requires an API that has not been enabled yet, the user will be prompted to enable it.

Cloud Storage

Create a new bucket

How to create a Storage bucket:

  • Go to Cloud Storage and click on Browser

GCP cloud storage

  • Click Create bucket

GCP cloud storage

  • Provide a unique name for the bucket and click Continue
  • Choose a region of your choice and click Continue

Using a single Region type will be the most cost-effective option. Only use Multi-region when required.

  • The user can choose the storage type, storage class, access control, and data protection options that suit their requirements

Default parameters/options should be sufficient for most cases. It is also important to consider the different storage classes, as this can affect the cost.

Storage class    | Minimum storage duration (days) | Retrieval fees (per GB)
Standard storage | None                            | None
Nearline storage | 30                              | $0.01
Coldline storage | 90                              | $0.02
Archive storage  | 365                             | $0.05
  • Click Create

For more information on Cloud Storage and buckets, see https://cloud.google.com/storage/docs/json_api/v1/buckets.
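Buckets can also be created from code. Here is a minimal sketch using the google-cloud-storage client library; the bucket name, storage class, and region are placeholders, and the Python section later in this guide covers credentials and module installation:

from google.cloud import storage

# Minimal sketch: create a bucket with google-cloud-storage.
# The bucket name, storage class and region below are placeholders.
client = storage.Client()

bucket = client.bucket("my-unique-bucket-name")  # bucket names must be globally unique
bucket.storage_class = "STANDARD"                # or NEARLINE, COLDLINE, ARCHIVE

new_bucket = client.create_bucket(bucket, location="us-east1")  # a single region keeps costs down
print(f"Created bucket {new_bucket.name} in {new_bucket.location}")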

Load data into a bucket

  • Select the bucket to which data will be added
  • Click on Create folder
  • Open the newly created folder
  • Click Upload files
  • Select the file which needs to be stored in the bucket, and click Open. The file will be uploaded to the bucket

GCP file uploading

  • The file can be renamed as follows:
    • To the right, open the object overflow menu (button with three dots)
    • Select Rename
    • Rename the file as desired
  • A file can be deleted from the bucket by selecting it (tick next to the file) and clicking on Delete
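The same upload, rename, and delete steps can also be scripted. Here is a minimal sketch using the google-cloud-storage client library; the bucket, folder, and file names are placeholders:

from google.cloud import storage

# Minimal sketch: upload, rename and delete objects in a bucket.
# Bucket, folder and file names are placeholders; "folders" are simply prefixes in the object name.
client = storage.Client()
bucket = client.bucket("my-unique-bucket-name")

# Upload a local file into a folder in the bucket
blob = bucket.blob("my-folder/just_some_text.txt")
blob.upload_from_filename("just_some_text.txt")

# Rename the object (this copies it to the new name and deletes the original)
renamed = bucket.rename_blob(blob, "my-folder/renamed_text.txt")

# Delete the object from the bucket
renamed.delete()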

Workflow

This section deals with creating a workflow in GCP.

  • In the GCP menu, click on Workflows

GCP workflow

  • Click Create to start setting up a workflow

GCP workflow create

  • Provide a workflow name, description (optional) and region

Using a single region option is a more cost-effective approach. Only use multi-regional if required.

  • Select a service account. Do the following if you want to create a new service account:
    • Click on Create a new service account
    • Provide an account name, ID and description (optional). The user can also generate a random ID.
    • Click on Create and Continue
    • Now select the newly created service account

GCP workflow parameters

Description parameters are optional, but it is highly recommended to provide a short, clear description. This helps a lot when multiple users work on the same GCP project.

  • The user can also provide a label to link this workflow to a group. This is optional:
    • Click on Add label
    • Provide a Key and Value
  • GCP provides two trigger types:

GCP workflow triggers

  • Choose either type:
  • Cloud scheduler: This is a recurring schedule option
    • Click on Add new trigger and select Cloud scheduler
    • Provide the following:
    • A name, region, and time zone
    • The frequency needs to be set as explained here: https://cloud.google.com/scheduler/docs/configuring/cron-job-schedules#defining_the_job_schedule
    • Click Continue
    • Set the workflow argument, log level and the service account as desired
    • Click Create
  • Eventarc: This is an event driven option (e.g. file uploaded to a bucket)
    • Click on Add new trigger and select Eventarc
    • Set the Trigger name as desired
    • Choose an Event provider. Depending on the chosen provider, the parameters which follow may differ. For this guide, Cloud Storage will be used
    • Under Event, the user will note that there are numerous options, but for now we will use google.cloud.storage.object.v1.finalized, which occurs when a new object has been created in a bucket
    • To select the bucket which should be monitored, click on Browse. Then select the desired bucket, or create a new bucket. Click Select
    • Choose the Service account
    • Grant permission as required for the chosen Region and Service account. If permissions are already set up, this option will not be shown to the user
    • Click Save trigger
  • Set up the workflow which needs to be performed. See the following for more information on this: https://cloud.google.com/workflows/docs/reference/syntax
  • Click Deploy
  • If the workflow has been deployed successfully, the user can run it: click on Execute to go to the Execute workflow section

GCP workflow execute

  • Click Execute
  • Here is an example of an output from a Workflow

GCP workflow started

Dataflow

GCP Dataflow development can be split into three main categories, namely Java, Python, and notebooks. The outcome and aim of each category are essentially the same. This guide focuses on Python.

To enable Dataflow development for an account, add the Dataflow Developer role to the service account.

Here is a short tutorial on creating a Dataflow:

  • Let's first create a text file which will contain some words for our example:
    • Create a new text file in a directory of your choice on your computer. The file name can be anything (e.g. “just_some_text.txt”)
    • Add some words to the text file. Which words you add does not matter, but here is an example:

GCP dataflow text file

  • Upload the new file:
    • Open Cloud Storage
    • Open an existing bucket, or create a new bucket
    • Click Upload files
    • Select the newly created text file and click OK
  • Create a temporary storage folder for Dataflow:
    • Click on Create folder and name it "temp"
  • On the dashboard, click on Dataflow

GCP dataflow dashboard

  • Click on Create job from template

A user can also create their own template, but for this example we will not be doing that.

GCP dataflow dashboard

  • Provide a Job name
  • Choose a Regional endpoint

It is best to use the same region as your Cloud Storage bucket. This will minimize costs.

  • Set the Dataflow template to Word count. The Word count dataflow will process the provided text file and return the number of times each word appears in the file
  • To set directories the syntax will be “gs://{bucket}/{folder}/{filename}”. For example, “gs://newb2/just_some_text.txt” will refer to the text file named ‘just_some_text.txt’ in the bucket named ‘newb2’. These directories can also refer to folders in a bucket. Set the directories as follows:
  • Input file(s) in Cloud storage: The text file which contains the words (e.g. “gs://newb2/just_some_text.txt”). This will be the input for the dataflow
  • Output Cloud Storage file prefix: This is the prefix which will be used for the output file (“gs://newb2/just_some_text_result.txt”). The counts for each word will be stored in this file
  • Temporary location: Provide the temporary folder we created here (“gs://newb2/temp”). The dataflow will use this folder to store any temporary data used for processing
  • Here is an example of the parameters:

GCP dataflow parameters

  • Click Run job
  • The job will run for a few minutes. The output should be similar to the following:

GCP dataflow job run

  • Once the dataflow has finished running, go back to your bucket. There should be a new file there, named based on the prefix you provided. In this example, the file is named “just_some_text_result.txt-00000-000001”

GCP dataflow output files

  • Download the file
  • The user should see something similar to the following:

GCP dataflow output example

This concludes the basic example for Dataflows.

BigQuery

Create a new table

  • Click the View actions button for the project
  • Click Create data set

GCP bigquery create table

  • Select the Project ID
  • Provide a Data set ID
  • Select a Data location (region)

For cost-effectiveness, make use of a single region. Multi-region is more costly.

  • Click Create data set

GCP bigquery create dataset

  • Click View actions for the newly created data set;
  • Click Create table;

GCP bigquery create table2

  • Provide the required parameters and click on Create table

GCP bigquery parameters

A dataset can contain multiple tables.
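The same steps can be scripted with the google-cloud-bigquery client library. Here is a minimal sketch; the project, dataset, and table IDs are placeholders, and the fields mirror the schema example later in this guide:

from google.cloud import bigquery

# Minimal sketch: create a dataset and an empty table.
# The project, dataset and table IDs are placeholders.
client = bigquery.Client()

dataset = bigquery.Dataset("my-project.weather_data")
dataset.location = "us-east1"  # a single region is more cost-effective
dataset = client.create_dataset(dataset, exists_ok=True)

schema = [
    bigquery.SchemaField("Date", "DATE", mode="NULLABLE"),
    bigquery.SchemaField("Max_Temperature", "FLOAT", mode="NULLABLE"),
]
table = bigquery.Table("my-project.weather_data.daily_readings", schema=schema)
table = client.create_table(table, exists_ok=True)
print(f"Created {table.full_table_id}")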

Table schema

The schema for a table defines all of the fields, their types, sizes, descriptions, and other parameters based on the field type. To add or edit the schema of a table, do the following:

  • Click on the table so that it opens in a new tab in GCP
  • Click Edit schema

GCP bigquery schema

  • Click the Add field button
  • Provide a Field name, Type, Mode and Description. Some schema parameters are optional (e.g. Description). The parameters will also differ depending on the Type;
  • The user can add several fields;

GCP bigquery schema fields

  • Click Save to update the Schema
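Schema changes can also be made from code. Here is a minimal sketch that appends a field to an existing table; the table ID and the new field are placeholders:

from google.cloud import bigquery

# Minimal sketch: append a new field to an existing table's schema.
# The table ID and the new field are placeholders.
client = bigquery.Client()
table = client.get_table("my-project.weather_data.daily_readings")

new_schema = list(table.schema)
new_schema.append(bigquery.SchemaField("Station_Name", "STRING", mode="NULLABLE"))

table.schema = new_schema
table = client.update_table(table, ["schema"])  # new fields must be NULLABLE or REPEATED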

Notes

Some issues the user should avoid or may encounter:

  • BigQuery will not allow loading of a table with an incorrect number of columns:
    • Deleting unwanted fields will solve this problem
    • Always base the schema on the table being loaded, but this can introduce other issues
  • The name of each field needs to match the schema, so the user needs to check the data prior to uploading it. This will also make it easier to automatically delete unwanted fields
  • The data type needs to be the same for all rows of the table. A string cannot be imported into a float field
  • An issue was found where the number of columns changes from a certain row (e.g. row 81), so the data cannot be loaded. This was resolved by simply deleting a number of columns at the end of the file. No data could be found in those columns, so the exact cause is unclear (perhaps spaces or tabs in the table elements)
  • Most of these issues can be solved prior to uploading data, or in the Python code itself when a trigger event occurs
  • Issues can occur when importing a date column. BigQuery needs the date values to be formatted exactly right for importing.

Python

Python can be used for Storage bucket trigger events or Dataflow pipelines. Other languages, such as Java, are also supported.

Locally run Python code

Python code can be run locally before running it on GCP, which makes it easier to test code before uploading it. First the user needs to create a service account JSON key file, then set up environment variables in their OS and install a few libraries/modules.

Service account key

The service account key will be used by the user’s OS to access the GCP account. The JSON key can be created as follows:

  • Go to IAM and admin
  • Select Service account
  • Select the service account for which you want to create the JSON key
  • Click on the Keys tab
  • Click Add key and then Create new key
  • Select JSON and click Create
  • The new key will be created and the browser will download the JSON file

If the download fails, try using a different browser as some browsers cause issues.

Environment variables

On Windows, environment variables need to be set up as follows:

  • Type ‘environment variables’ in the Windows start menu
  • Select Edit the system environment variables
  • Under System variables, click New

GCP python environmental variables

  • Set the variables as follows:
  • Service account JSON key:
    • Variable name: GOOGLE_APPLICATION_CREDENTIALS; and
    • Variable value: Directory with filename to the JSON key file.
  • GCP project:
    • Variable name: GOOGLE_CLOUD_PROJECT; and
    • Variable value: Project ID.

GCP python environmental example

  • Click OK
  • The new environment variables will now be added

On Linux, the user can do the following to set up the environment variables:

  • Open the terminal
  • Service account JSON key:
    • Run the following: export GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH", where KEY_PATH is the path to the JSON key file
  • GCP project:
    • Run the following: export GOOGLE_CLOUD_PROJECT="PROJECT", where PROJECT is the ID of the project.

Accessing GCP from Python code should now work.
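A quick way to confirm that the key and environment variables are picked up is to list the buckets in the project. This is a minimal sketch, assuming the google-cloud-storage module from the next section is installed:

from google.cloud import storage

# Minimal check that local authentication works: list the buckets in the project.
# storage.Client() reads GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_CLOUD_PROJECT.
client = storage.Client()
for bucket in client.list_buckets():
    print(bucket.name)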

Python modules and libraries to install

Required roles:

  • Dataflow Admin
  • Dataflow Developer
  • Dataflow Worker
  • Service Account User

Install pip, if not installed:

  • sudo apt update
  • sudo apt install python3-pip

The following Python modules will be required:

  • pip install google-cloud-storage
  • pip install google-cloud-bigquery
  • pip install google-cloud

These modules should be installed locally if the user wants to run Python code locally, and also in the GCP terminal when code will be run on the platform. To check the list of installed Python modules, use ‘pip list’.

The following libraries should be installed for local Python runs:

  • gcloud command:
    • sudo snap install google-cloud-cli
    • sudo snap install google-cloud-sdk

If the user will be making use of Apache Beam (Dataflow pipeline), do the following to be able to run code locally:

  • Open the console
  • pip install wheel
  • Installing Apache Beam differs for Windows and Linux:
    • Windows: pip install “apache-beam[gcp]”
    • Linux: pip install ‘apache-beam[gcp]’

Take note that Windows uses double quotes, while Linux uses single quotes.

Running the code locally

Before adding code to GCP (e.g. for a trigger), it is best to test the code locally, since uploading or deploying code can be slow. Simply run the Python code using the normal console command:

  • python3 python_file.py

Bucket trigger events

A trigger event can be set up for a Storage bucket. Depending on the event (e.g. object.finalize occurs when a new file has finished uploading to the bucket), code can be triggered. This guide only explains how to do this for Python, but it can also be set up for other languages (e.g. Java). Add the required permissions:

  • cloudfunctions.functions.get
  • cloudfunctions.functions.create
  • cloudfunctions.functions.delete
  • cloudfunctions.functions.call
  • Any other required permissions

Trigger event for Python

A trigger event for Python can be set up as follows:

  • main.py: This Python file will contain the functions which will be triggered when an event occurs. The file has to be named main.py.
  • requirements.txt: This file lists any modules which will be imported and used by the code. Here are two examples of requirements added to the text file:
    • google.cloud.storage
    • google.cloud.bigquery

GCP python requirements file
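Here is a minimal sketch of what main.py could look like for a Storage trigger. The function name is a placeholder; for a 1st gen background function the entry point receives an event dictionary and a context object:

# main.py -- minimal sketch of a function triggered by a Storage event.
# The function name is a placeholder and must match FUNCTION_NAME used when deploying.
def process_new_file(event, context):
    """Triggered by google.storage.object.finalize when an upload completes."""
    bucket_name = event["bucket"]  # bucket that fired the event
    file_name = event["name"]      # name of the uploaded object
    print(f"Processing gs://{bucket_name}/{file_name} (event ID: {context.event_id})")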

  • The code can be deployed by going to Cloud Functions in GCP
  • The code can also be deployed for the bucket using the following command in the GCP console:

gcloud functions deploy FUNCTION_NAME \
  --runtime python39 \
  --trigger-resource BUCKET_NAME \
  --trigger-event TRIGGER_EVENT \
  --project PROJECT_ID

  • The user can manually upload the data to the bucket or use the following command: gsutil cp FILE_NAME BUCKET_DIR

  • To check the logs of the trigger, run the following command: gcloud functions logs read --limit 50 --project PROJECT_ID
  • Variable descriptions:
  • FUNCTION_NAME: The name of the method/function in the main.py file;
  • BUCKET_NAME: The name of the bucket which will trigger the event. This should only be the bucket name (e.g. example_bucket);
  • TRIGGER_EVENT: The event which will trigger the code (e.g. google.storage.object.finalize triggers when a file has finished uploading);
  • PROJECT_ID: The ID of the project (not the name of the project)
  • FILE_NAME: The local file the user wants to upload; and
  • BUCKET_DIR: The bucket directory on GCP. This will consist of ‘gs://’ and the BUCKET_NAME (e.g. gs://example_bucket)

Pub/Sub trigger events

This section focuses on setting up a scheduled trigger event using Pub/Sub, Cloud function and Cloud scheduler, but other approaches are available.

  • In GCP, open Pub/Sub in the menu
  • Click Create topic
  • Provide a Topic ID
  • The other parameters can be left as is. Click Create topic

GCP create topic

  • Open your newly created topic
  • Click on Trigger Cloud function at the top of the page. This will allow you to add a function which will be triggered when the Pub/Sub topic is called
  • Provide the required parameters for the Create function step:
  • Environment can be left as 1st gen, but 2nd gen can also be used
  • Provide a name in the Function name field
  • Choose a Region which will be the same as the data/buckets/BigQuery which will be accessed
  • Set the Timeout to 540 seconds (9 minutes)
  • Set other settings as desired

GCP create function

  • Under Source code select the programming language for Runtime (e.g. Python)
  • Paste your code in the MAIN.PY tab
  • Entry point should be the name of the function which will be called in the code segment pasted in MAIN.PY

GCP source code

The Entry point function needs to have two parameters, namely ‘event’ and ‘context’. These parameters are supplied when the trigger event happens.
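Here is a minimal sketch of such an entry point for a 1st gen Pub/Sub-triggered function. The function name is a placeholder and must match the Entry point field:

# Minimal sketch of a Pub/Sub-triggered entry point (1st gen background function).
import base64

def scheduled_job(event, context):
    """Triggered by a message published to the Pub/Sub topic."""
    message = ""
    if "data" in event:
        message = base64.b64decode(event["data"]).decode("utf-8")
    print(f"Triggered by message '{message}' (event ID: {context.event_id})")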

  • Click Deploy function
  • Now that the Pub/Sub topic and trigger function have been set up, we can set the schedule for when this should run
  • Go to Cloud Scheduler in the GCP menu
  • Click Create job
  • Set the parameters as follows:
  • Name: As desired
  • Region: Select the region which is used by your Pub/Sub topic
  • Description: As desired
  • Frequency: This consists of 5 values in the form * * * * *, which are minute, hour, day of the month, month, and day of the week. * refers to all values for that field. For example, 0 9 * * 1 runs every Monday at 09:00, and */15 * * * * runs every 15 minutes. See https://cloud.google.com/scheduler/docs/configuring/cron-job-schedules for more information on how to set frequencies
  • Time zone: Set as desired. South Africa Standard Time (SAST) is available as an option

GCP cron job

  • Click Continue to go to the Configure the execution section
  • Select Pub/Sub as the Target type
  • Select the Topic which contains the trigger function
  • Add a message as desired
  • Set other options as desired
  • Click Create

The job will be executed based on the frequency set by the user. It will call the Pub/Sub topic, which in turn will trigger the Cloud function. To have a look at the function and its executions, do the following:

  • Go to Cloud Functions in the GCP menu
  • Open the function. The user should see the metrics

GCP function metric

  • This section shows all information related to the function, such as when it was called, and whether it failed, completed, or is currently running
  • Go to LOGS to see the terminal responses and print output from the active function

GCP function console

Dataflow pipeline using Python

When the user wants a Dataflow pipeline to make use of Python code, a template needs to be generated from the Python code (other languages, such as Java, are also supported). This can be done as follows:

  • This needs to be done on the GCP console (Linux-based terminal) and cannot be done locally:
  • Open the cloud shell on GCP
  • Click on More
  • Select Upload
  • Upload the files (e.g. Python code) as needed

GCP pipeline upload

  • The files will now be uploaded to the console home directory;
  • Run the following command in the GCP console to create the pipeline template:

python3 -m MODULE \
  --runner DataflowRunner \
  --project PROJECT_ID \
  --staging_location STAGING_LOC \
  --temp_location TEMPORARY_LOC \
  --template_location TEMPLATE_LOC \
  --region REGION

  • MODULE: The Python module uploaded to the GCP console (the file name without the .py extension)
  • PROJECT_ID: ID of the project (not the project name)
  • STAGING_LOC: The staging location for the pipeline (e.g. gs://pipeline_bucket/staging)
  • TEMPORARY_LOC: Temporary location for the pipeline (gs://pipeline_bucket/temp);
  • TEMPLATE_LOC: The location to which the JSON template will be saved (gs://pipeline_bucket/template);
  • REGION: Region (e.g. us-east1)

If a template fails to generate, there is likely a problem in the code. The code needs to make use of Apache Beam, and even minor issues will cause a failure. For instance, code might work locally, but an issue might only occur when running in the pipeline itself. For more information on Python template generation go to https://cloud.google.com/dataflow/docs/quickstarts/create-pipeline-python.
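For reference, here is a minimal sketch of the kind of Apache Beam module (MODULE above) that could be templated this way. It is a simple word count; the gs:// paths reuse the earlier example as placeholders and would normally be passed in as pipeline options rather than hard-coded:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Minimal word count sketch. The bucket paths are placeholders from the earlier example.
def run(argv=None):
    options = PipelineOptions(argv)  # picks up --runner, --project, --region, etc.
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://newb2/just_some_text.txt")
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.Map(lambda word_count: f"{word_count[0]}: {word_count[1]}")
            | "Write" >> beam.io.WriteToText("gs://newb2/just_some_text_result.txt")
        )

if __name__ == "__main__":
    run()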

Compute Engine

A Dataflow pipeline requires a virtual machine (VM). This can be set up using GCP Compute engine. Follow these instructions to set up a VM:

  • In the GCP navigation menu, select Compute engine
  • Click on VM instances
  • Click Create instance

GCP compute engine

  • Provide the following parameters:
  • Name;
  • Choose a Region and Zone. It is best to use the same region as your other data in buckets or BigQuery;
  • Choose a Series and Machine type
  • Other parameters can be left on default or changed as desired.

GCP virtual machine

  • Click Create

Be sure to choose the machine type as required. The costs will be higher as more processing power is used.

Here is an example using a medium VM:

GCP VM medium

Here is an example of a high end VM:

GCP VM high

Notice the cost difference! There are a large number of machine options, so be sure to choose as required.

Deploy pipeline Python code

Dataflow makes use of Apache Beam. This can be developed/deployed using the SSH console for a VM. Do this as follows:

  • Go to Compute engine in GCP;
  • Create a virtual machine (see Compute engine), or select an existing virtual machine;
  • Click on SSH. This will open the SSH console;
  • Install Python modules:
  • pip install apitools
  • pip install “apache_beam[gcp]” (this can take a very long time)
  • pip install google
  • pip install google-cloud-storage
  • pip install google-cloud-bigquery
  • pip install google-cloud
  • Any other required Python modules (e.g. geopandas)

If git is not installed for the VM, follow the instructions here to install git: https://www.atlassian.com/git/tutorials/install-git.

  • Clone the repo from GitHub or other sources. This will download the repo into the compute engine root folder
  • Navigate in the console to the code which needs to be run
  • If providing parameters as variables from the console:
  • Set the bucket, if required, as follows:
    • BUCKET=""
    • echo $BUCKET
  • Set the project as follows:
    • PROJECT=""
    • echo $PROJECT
  • Set up the environment variables (see Environment variables). This is required for authentication and local runs in the VM:
  • The user will require the credentials JSON file for the service account (see Service account key).

Executing the pipeline:

  • There are two options available: running locally is faster, while running in the cloud executes the pipeline on Dataflow itself but is slower (this is shown in the GCP course). Here is how code can be run:
  • Execute pipeline locally:
    • python3 file.py --parameter $BUCKET --project $PROJECT --otherParameters $PARAMETER --runner DirectRunner
  • Execute in the cloud:
    • python3 file.py --bucket $BUCKET --project $PROJECT --otherParameters $PARAMETER --runner DataflowRunner

BigQuery in Python

This section will deal with BigQuery in Python code.

The table schema, basically the attribute structure, is important to set up correctly in BigQuery; issues will occur if the user attempts to load a table using a structure that does not match the table being loaded. A table schema is easy to set up in code (it can also be expressed as JSON). Here is a code example:

from google.cloud import bigquery

SCHEMA = [
    bigquery.SchemaField('Date', 'DATE', mode='NULLABLE'),
    bigquery.SchemaField('Max_Temperature', 'FLOAT', mode='NULLABLE'),
    bigquery.SchemaField('Min_Temperature', 'FLOAT', mode='NULLABLE'),
    bigquery.SchemaField('Precipitation', 'FLOAT', mode='NULLABLE'),
    bigquery.SchemaField('Relative_Humidity', 'FLOAT', mode='NULLABLE'),
    bigquery.SchemaField('Solar', 'FLOAT', mode='NULLABLE'),
    bigquery.SchemaField('Streamflow', 'FLOAT', mode='NULLABLE'),
]
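Building on the SCHEMA above, here is a minimal sketch of loading a CSV file from a bucket into a table using that schema. The source URI and destination table ID are placeholders:

# Minimal sketch: load a CSV from Cloud Storage into BigQuery using the SCHEMA above.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    schema=SCHEMA,
    skip_leading_rows=1,                 # skip the CSV header row
    source_format=bigquery.SourceFormat.CSV,
)

load_job = client.load_table_from_uri(
    "gs://newb2/weather_data.csv",             # placeholder source file
    "my-project.weather_data.daily_readings",  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete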