In the Extract data from websites tutorial, you learned how to handle data dependencies and ingest large amounts of data. Now, you’ll learn how to train a machine learning model using your data. Here’s the pipeline you’ll construct: new data files landing in an S3 bucket trigger a model-training flow on SageMaker, and the trained model landing in a second bucket triggers an inference flow.

Prerequisites

  • A free Prefect Cloud account and an API key
  • The Prefect SDK
  • An AWS account, including:
    • The ability to create an ml.g4dn.xlarge instance with SageMaker.
    • An access key with full access to S3 and SageMaker.
  • Optional: Terraform

Log in to Prefect Cloud

If needed, log in to your Prefect Cloud account.

prefect cloud login

Create a work pool

Create a new Process work pool named my-work-pool in Prefect Cloud.

prefect work-pool create my-work-pool --type process

Now, in a separate terminal, start a worker in that work pool.

prefect worker start --pool my-work-pool

Leave this worker running for the rest of the tutorial.

Create deployments

Clone the repository with the flow code you’ll be deploying to Prefect Cloud.

git clone https://github.com/PrefectHQ/demos.git
cd demos/

Now deploy the model-training flow to Prefect Cloud. This flow trains an XGBoost model on the Iris dataset using SageMaker.

python model_training.py

Next, deploy the model-inference flow to Prefect Cloud. This flow calculates performance metrics using the fitted model from S3.

python model_inference.py
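
Both scripts follow the same basic pattern: load the flow from its source and register it as a deployment against your work pool. A minimal sketch of that pattern, with an illustrative entrypoint and deployment name (the repo’s actual values may differ):

from prefect import flow

if __name__ == "__main__":
    flow.from_source(
        source="https://github.com/PrefectHQ/demos.git",
        entrypoint="model_training.py:train_model",  # illustrative entrypoint
    ).deploy(
        name="model-training",  # illustrative deployment name
        work_pool_name="my-work-pool",
    )

Because the deployment targets my-work-pool, any triggered runs will be picked up by the worker you started earlier.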

Provision cloud resources

This tutorial uses AWS S3 to store data, Amazon EventBridge to emit events when that data changes, and Prefect webhooks and automations to run your flows in response. You’ll use Terraform so you don’t have to provision these resources manually.

Set the following environment variables:

# Prefect auth
export PREFECT_API_KEY=<Your Prefect API key>
export PREFECT_CLOUD_ACCOUNT_ID=<Your account ID>

# AWS auth
export AWS_ACCESS_KEY_ID=<Your AWS access key ID>
export AWS_SECRET_ACCESS_KEY=<Your AWS secret access key>
export AWS_REGION=us-east-1

# Terraform variables
export TF_VAR_prefect_workspace_id=<Your workspace ID>
export TF_VAR_data_bucket_name=prefect-ml-data # You may need to change this if the bucket name is already taken
export TF_VAR_model_bucket_name=prefect-model  # Same caveat applies if this name is taken
export TF_VAR_model_training_deployment_id=<Your model-training deployment ID>
export TF_VAR_model_inference_deployment_id=<Your model-inference deployment ID>

Switch to the infra/ directory.

cd infra/

Provision these resources with Terraform.

terraform init
terraform apply

When you’re ready to permanently delete these resources, empty the prefect-ml-data and prefect-model buckets and then run terraform destroy in your terminal.

Alternative: manual provisioning

If you don’t want to use Terraform, you can create the resources manually.

Grant access to AWS resources

Your Prefect flows need access to the AWS resources that you just provisioned. Create the following blocks in Prefect Cloud:

  • An AWS Credentials block named aws-credentials. Set the AWS Access Key ID and AWS Access Key Secret fields to the values from the access key you created earlier.
  • An S3 Bucket block named s3-bucket-block, set to the name of your model bucket (prefect-model by default). This block can use the aws-credentials block you created earlier.
  • A Secret block named sagemaker-role-arn. This block stores the IAM role ARN for SageMaker that you created earlier.

Now, when you run your flows in Prefect Cloud, they can use these blocks to authenticate with AWS.
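
If you prefer, you can register these blocks from Python instead of the Prefect Cloud UI. A minimal sketch, assuming prefect-aws is installed and the placeholder values are replaced with your own:

from prefect.blocks.system import Secret
from prefect_aws import AwsCredentials, S3Bucket

# Credentials for the access key you created earlier
credentials = AwsCredentials(
    aws_access_key_id="<Your AWS access key ID>",
    aws_secret_access_key="<Your AWS secret access key>",
)
credentials.save("aws-credentials")

# Model bucket block, reusing the credentials block above
S3Bucket(
    bucket_name="prefect-model",  # your model bucket name
    credentials=credentials,
).save("s3-bucket-block")

# IAM role ARN for SageMaker
Secret(value="<Your SageMaker IAM role ARN>").save("sagemaker-role-arn")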

Test the ML pipeline

Use the AWS Console to upload the training data files for this tutorial to the prefect-ml-data bucket.
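
If you’d rather upload from Python than click through the console, a short boto3 script works too. The file names below are placeholders; substitute the actual data files for this tutorial:

import boto3

# Uses the AWS credentials already set in your environment
s3 = boto3.client("s3")

# Placeholder file names; replace with the tutorial's data files
for file_name in ["train.csv", "test.csv"]:
    s3.upload_file(file_name, "prefect-ml-data", file_name)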

After the files are uploaded, the train-model flow will automatically run. Open the flow run details in Prefect Cloud to see the model training logs. Training will take a few minutes.

...
[43]#011train-mlogloss:0.18166#011validation-mlogloss:0.24436
[44]#011train-mlogloss:0.18166#011validation-mlogloss:0.24436
[45]#011train-mlogloss:0.18168#011validation-mlogloss:0.24422
[46]#011train-mlogloss:0.18166#011validation-mlogloss:0.24443
[47]#011train-mlogloss:0.18166#011validation-mlogloss:0.24443
[48]#011train-mlogloss:0.18165#011validation-mlogloss:0.24458

2024-12-20 20:16:56 Completed - Training job completed

Training seconds: 101
Billable seconds: 101
Memory usage post-training: total - 17179869184, percent - 78.2%
Finished in state Completed()

After training is complete, the model will be uploaded to the prefect-model bucket, which will trigger the run-inference flow. Inference will take a few seconds to complete, after which you can see the predictions in the flow logs:

...
Prediction for sample [5.0, 3.4, 1.5, 0.2]: 0.0
Prediction for sample [6.4, 3.2, 4.5, 1.5]: 1.0
Prediction for sample [7.2, 3.6, 6.1, 2.5]: 2.0
Finished in state Completed()
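
For reference, an inference flow of roughly this shape could produce the output above. This is a sketch, not the repo’s exact code: the S3 object key and the model file name inside the archive are assumptions, and SageMaker’s actual archive layout may differ:

import tarfile

import numpy as np
import xgboost as xgb
from prefect import flow
from prefect_aws import S3Bucket

@flow(log_prints=True)
def run_inference(samples: list[list[float]]) -> None:
    # Download the fitted model archive from the model bucket
    s3 = S3Bucket.load("s3-bucket-block")
    s3.download_object_to_path("model.tar.gz", "model.tar.gz")  # assumed key

    # SageMaker writes models as tar archives; extract the booster file
    with tarfile.open("model.tar.gz") as archive:
        archive.extractall(".")

    booster = xgb.Booster()
    booster.load_model("xgboost-model")  # assumed file name in the archive

    predictions = booster.predict(xgb.DMatrix(np.array(samples)))
    for sample, prediction in zip(samples, predictions):
        print(f"Prediction for sample {sample}: {prediction}")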

Next steps

In this tutorial, you learned how to publish data to S3 and train a machine learning model whenever that data changes. You’re well on your way to becoming a Prefect expert!

Now that you’ve finished this tutorial series, continue your learning journey by going deeper into the building blocks you used here: deployments, work pools, blocks, webhooks, and automations.

Need help? Book a meeting with a Prefect Product Advocate to get your questions answered.