I have recently been reading Robust Python by Patrick Viafore. In this blog post, I want to discuss the pydantic library, which is introduced in chapter 14 of the book.

I have known about pydantic for a long time because popular libraries like FastAPI use it. However, I was still confused about what the library actually does after reading the introduction of its documentation. This book helped me understand the topic a lot better.

In short, pydantic helps us validate (or parse) data at runtime using type checks, so we can avoid writing a large number of if-else conditions to validate the data. You might be as confused as I was when you read this sentence for the first time. Let’s walk through an example together.

Suppose we have a deep learning model training script. One of the functions inside it requires the user to pass the learning rate and optimizer in a yaml file. You can learn more about yaml files here if you are not familiar with them.

The yaml file might look like the following.

model_config:
  learning_rate: 0.01
  optimizer: adam

A sample function can look like the following:

from typing import Dict

import yaml


with open('config.yaml') as yaml_file:
    config: Dict = yaml.safe_load(yaml_file)
model_config = config["model_config"]

def train_model(model_config: Dict):
    lr = model_config["learning_rate"]
    optimizer = model_config["optimizer"]
    # the model training code ...

However, we want to restrict the learning rate to be a float between 0.0 and 1.0, and force the optimizer to be one of the following: sgd, rmsprop, adam. Instead of raising the error when the model begins to train, or letting the model fail silently (the code runs but the model cannot converge because of a huge learning rate), we want to catch this error when we load the data from the yaml file.

One naive solution is to use try-except and if-else statements like the following.

try:
    lr = float(model_config["learning_rate"])
except (TypeError, ValueError):
    raise ValueError("learning rate needs to be a float")

optimizer = str(model_config["optimizer"])

if lr < 0. or lr > 1.:
    raise ValueError("learning rate needs to be between 0.0 and 1.0")

if optimizer not in ["sgd", "rmsprop", "adam"]:
    raise ValueError("optimizer needs to be sgd, rmsprop, or adam")

This can get extremely long and complex when we have a large list of hyperparameters. We might try to use type hints and a type checker like mypy to get rid of the type checks, as in the following (note that you need Python 3.8 or above to use the Literal type).

from typing import Literal, TypedDict

class ModelConfig(TypedDict):
    learning_rate: float 
    optimizer: Literal["sgd", "rmsprop", "adam"]
 
def load_yaml(file_path: str) -> ModelConfig:
    with open(file_path) as yaml_file:
        config = yaml.safe_load(yaml_file)
    return config["model_config"]

We are still out of luck because we don’t know what is in the yaml file until we actually run the script (at runtime). Thus, even if we have a yaml file like this:

model_config:
  learning_rate: not_learning
  optimizer: 1.0

A type checker like mypy will not report any error. This is where pydantic comes in. It checks types at runtime.
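
To make the gap concrete, here is a minimal sketch of what happens to the bad config when only type hints are used. It assumes an inline dictionary as a stand-in for the result of yaml.safe_load, and it uses typing.cast (my own addition, not part of the script above) to play the role of the annotated return value.

```python
from typing import Literal, TypedDict, cast


class ModelConfig(TypedDict):
    learning_rate: float
    optimizer: Literal["sgd", "rmsprop", "adam"]


# Stand-in for the result of yaml.safe_load on the bad config file (assumption).
raw_config = {"model_config": {"learning_rate": "not_learning", "optimizer": 1.0}}

# cast() only informs the type checker; it performs no check when the script runs.
model_config = cast(ModelConfig, raw_config["model_config"])

print(type(model_config))             # <class 'dict'>
print(model_config["learning_rate"])  # not_learning
```

A TypedDict is just a plain dict once the script runs, so the bad values pass through untouched. The pydantic version of ModelConfig closes this gap: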

from pydantic import BaseModel, confloat

class ModelConfig(BaseModel):
    learning_rate: confloat(gt=0., lt=1.)
    optimizer: Literal["sgd", "rmsprop", "adam"]

def load_yaml(file_path: str) -> ModelConfig:
    with open(file_path) as yaml_file:
        config = yaml.safe_load(yaml_file)
    return ModelConfig(**config["model_config"])

Let’s explain a few things in this code.

  • We inherit from pydantic’s BaseModel class, so when we construct this class at runtime, the type checks are executed.
  • The confloat type in pydantic allows us to restrict the range of a float value. In our case, we require it to be greater than 0 and less than 1.
  • We use the double asterisk (**) to construct the class from a dictionary. You can learn more about it here.

Now, if we run the script with the invalid yaml config file above, we get the following error (the exact wording depends on the pydantic version).

ValidationError: 2 validation errors for ModelConfig
learning_rate
  value is not a valid float (type=type_error.float)
optimizer
  unexpected value; permitted: 'sgd', 'rmsprop', 'adam' (type=value_error.const; given=1.0; permitted=('sgd', 'rmsprop', 'adam'))

Note that pydantic does the validation for us automatically and simplifies our code quite a bit. To reiterate, pydantic helps us validate (or parse) data at runtime using type checks, so we can avoid writing a large number of if-else conditions to validate the data. Hopefully this sentence makes more sense to you now.

My understanding of pydantic is still at a basic level, and I encourage you to check out the pydantic documentation to learn more. Also, Robust Python is a really nice book that introduces typing in Python and many other good coding practices.