Overview
This article covers how to set up a Python development environment, get the Cloud Dataflow SDK for Python, and run an example pipeline on Google Cloud Platform.
Requirements
- A Google Cloud Platform account
Steps
Step 1 - Activate Cloud Shell
In the GCP Console, click the Activate Cloud Shell icon in the top toolbar. Then click Continue.
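Once Cloud Shell opens, you can confirm that it is authenticated and pointed at your project (a quick sanity check, not part of the original steps):
gcloud auth list
gcloud config list project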
Step 2 - Create a Cloud Storage bucket
- In the GCP Console, go to Cloud Storage.
- Click Create Bucket.
- In the Create Bucket dialog, specify the following attributes:
Name: a unique, publicly visible name
Storage class: Multi-Regional
Location: where the bucket data will be stored
- Click Create.
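If you prefer the command line, the same bucket can be created from Cloud Shell with gsutil. The bucket name my-dataflow-bucket and the US location below are hypothetical placeholders:
gsutil mb -c multi_regional -l US gs://my-dataflow-bucket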
Step 3 - Install pip and the Cloud Dataflow SDK
The Cloud Dataflow SDK for Python requires Python version 2.7. Check the versions with the following Cloud Shell commands.
Check the Python version:
python --version
Check the pip version:
pip --version
The pip version should be 7.0.0 or newer. To update pip, run:
sudo pip install -U pip
If you do not have virtualenv version 13.1.0 or newer, install or upgrade it by running:
sudo pip install --upgrade virtualenv
A virtual Python environment is its own isolated Python distribution. To create a virtual environment, run:
virtualenv -p python env
Then, to activate the virtual environment in Bash, run:
source env/bin/activate
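To confirm that the virtual environment is active, check which interpreter is in use; it should resolve to the env directory you just created:
which python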
Step 4 - Install Apache Beam
Install the latest version of Apache Beam:
pip install apache-beam[gcp]
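To verify the installation, import the package and print its version (an optional check, not part of the original steps):
python -c "import apache_beam; print(apache_beam.__version__)"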
Step 5 - Run wordcount locally
Run the wordcount.py example locally with the following command:
python -m apache_beam.examples.wordcount --output OUTPUT_FILE
Log messages will appear in the terminal while the pipeline runs.
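For intuition, here is a minimal sketch of the kind of pipeline the wordcount example builds with the Beam Python API. It is a simplified illustration, not the shipped example; input.txt and counts are hypothetical local file names:

import re
import apache_beam as beam

# Simplified word-count pipeline; file names are placeholders.
with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('input.txt')
     | 'Split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
     | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
     | 'CountPerWord' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda wc: '%s: %d' % wc)
     | 'Write' >> beam.io.WriteToText('counts'))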
Step 6 - Inspect the output
List the files in your local Cloud Shell environment to get the name of the OUTPUT_FILE:
ls
Copy the name of the OUTPUT_FILE and cat it:
cat <OUTPUT_FILE>
Step 7 - Set the bucket variable
To run the example pipeline remotely, first set the BUCKET environment variable to the bucket you created earlier:
BUCKET=gs://<bucket name provided earlier>
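For example, if the bucket you created is named my-dataflow-bucket (a hypothetical name):
BUCKET=gs://my-dataflow-bucket
echo $BUCKET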
Step 8 - Run wordcount on Dataflow
Run the wordcount.py example remotely:
python -m apache_beam.examples.wordcount --project $DEVSHELL_PROJECT_ID \
  --runner DataflowRunner \
  --staging_location $BUCKET/staging \
  --temp_location $BUCKET/temp \
  --output $BUCKET/results/output
Output:
JOB_MESSAGE_DETAILED: Workers have started successfully.
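You can also check the job from Cloud Shell with the standard gcloud CLI (an optional alternative to the console):
gcloud dataflow jobs list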
Check whether your job has succeeded:
- Click Navigation menu -> Cloud Dataflow.
- You should see your wordcount job with a status of Running.
- Click Navigation menu -> Storage.
- Click the name of your bucket. In your bucket, you should see the results and staging directories.
- Click the results folder; you should see the output files that your job created.
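The results can also be listed and read directly from Cloud Shell; the wildcard below assumes the default sharded output naming used by the example:
gsutil ls $BUCKET/results/
gsutil cat $BUCKET/results/output*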
Conclusion
That’s all. We set up a Python environment, installed the Cloud Dataflow SDK, and ran a wordcount pipeline both locally and remotely on Dataflow. I hope this helps you create your own Python environment on Google Cloud Platform.