Tensor Processing Units (TPUs)
TPUs are hardware accelerators specialized in deep learning tasks. They are supported in Tensorflow 2.1 both through the Keras high-level API and, at a lower level, in models using a custom training loop.
You can use up to 20 hours of TPUs per week, and up to 9 hours at a time in a single session.
![Tensor Processing Units(TPUs)]()
TPUs in Keras
Once you have flipped the "Accelerator" switch in your notebook to "TPU v3-8", this is how to enable TPU training in Tensorflow Keras:
# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.tpu.experimental.initialize_tpu_system(tpu)
# instantiate a distribution strategy
tpu_strategy = tf.distribute.TPUStrategy(tpu)
# instantiating the model in the strategy scope creates the model on the TPU
with tpu_strategy.scope():
    model = tf.keras.Sequential( … ) # define your model normally
    model.compile( … )
# train model normally
model.fit(training_dataset, epochs=EPOCHS, steps_per_epoch=…)
TPUs are network-connected accelerators, and you must first locate them on the network. This is what TPUClusterResolver() does; tf.tpu.experimental.initialize_tpu_system(tpu) then initializes the TPU runtime. Finally, you use the TPUStrategy by instantiating your model in the scope of the strategy. This creates the model on the TPU. Model size is constrained only by the TPU RAM, not by the amount of memory available on the VM running your Python code. Model creation and model training use the usual Keras APIs.
Batch size, learning rate, steps_per_execution
To go fast on a TPU, increase the batch size. The rule of thumb is to use batches of 128 elements per core (e.g., a batch size of 128*8=1024 for a TPU with 8 cores). At this size, the 128x128 hardware matrix multipliers of the TPU are most likely to be kept busy. You start seeing interesting speedups from a batch size of 8 per core, though. In the sample above, the batch size is scaled with the core count through this line of code:
BATCH_SIZE = 16 * tpu_strategy.num_replicas_in_sync
With a TPUStrategy running on a single TPU v3-8, the core count is 8. It could be more on larger configurations, called TPU pods, available on Google Cloud.
![TPU Batch size]()
With larger batch sizes, TPUs will be crunching through the training data faster. This is only useful if the larger training batches produce more “training work” and get your model to the desired accuracy faster. That is why the rule of thumb also calls for increasing the learning rate with the batch size. You can start with a proportional increase, but additional tuning may be necessary to find the optimal learning rate schedule for a given model and accelerator.
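As a rough sketch, the scaling of both values can be written out explicitly. The base learning rate below is an illustrative assumption, not a value from the sample:
# scale batch size and learning rate with the number of TPU cores
PER_CORE_BATCH_SIZE = 16
BASE_LEARNING_RATE = 0.0001  # illustrative starting value, tune for your model
BATCH_SIZE = PER_CORE_BATCH_SIZE * tpu_strategy.num_replicas_in_sync
LEARNING_RATE = BASE_LEARNING_RATE * tpu_strategy.num_replicas_in_sync  # proportional first guess, then tune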
Starting with Tensorflow 2.4, model.compile() accepts a new steps_per_execution
parameter. This parameter instructs Keras to send multiple batches to the TPU at once. In addition to lowering communication overhead, this gives the XLA compiler the opportunity to optimize TPU hardware utilization across multiple batches. With this option, it is no longer necessary to push batch sizes to very high values to optimize TPU performance. As long as you use batch sizes of at least 8 per core (>=64 for a TPU v3-8), performance should be acceptable. Example:
model.compile( …,
steps_per_execution=32)
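For context, a complete compile call under the strategy scope could look like the sketch below. The tiny model, optimizer, loss and metrics are illustrative assumptions; only steps_per_execution is the parameter being discussed:
with tpu_strategy.scope():
    # hypothetical minimal classifier, just to make the example self-contained
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=[512, 512, 3]),
        tf.keras.layers.Dense(5, activation='softmax')])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),  # scaled learning rate from the sketch above
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
        steps_per_execution=32)  # send 32 batches to the TPU at a time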
tf.data.Dataset and TFRecords
Because TPUs are very fast, many models ported to TPUs end up with a data bottleneck: the TPU sits idle, waiting for data, for most of each training epoch. TPUs read training data exclusively from GCS (Google Cloud Storage), and GCS can sustain a pretty large throughput if it is continuously streaming from multiple files in parallel. Following a couple of best practices will optimize the throughput:
For TPU training, organize your data in GCS in a reasonable number (10s to 100s) of reasonably large files (10s to 100s of MB).
With too few files, GCS will not have enough streams to get max throughput. With too many files, time will be wasted accessing each individual file.
Data for TPU training typically comes sharded across the appropriate number of larger files. The usual container format is TFRecords. You can load a dataset from TFRecords files by writing:
filenames = tf.io.gfile.glob("gs://flowers-public/tfrecords-jpeg-512x512/*.tfrec") # list files on GCS
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...) # TFRecord decoding here...
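The decoding in dataset.map(...) depends entirely on how the TFRecords were written. As an illustration only, a parse function for records holding a JPEG-encoded image and an integer label could look like this (the feature names and types are assumptions, not the actual schema of this dataset):
def read_tfrecord(example):
    # hypothetical feature spec; it must match the schema used when the TFRecords were written
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),  # JPEG-encoded image bytes
        "class": tf.io.FixedLenFeature([], tf.int64),   # integer class label
    }
    example = tf.io.parse_single_example(example, features)
    image = tf.image.decode_jpeg(example["image"], channels=3)
    image = tf.cast(image, tf.float32) / 255.0  # normalize pixel values to [0, 1]
    label = tf.cast(example["class"], tf.int32)
    return image, label

dataset = dataset.map(read_tfrecord)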
To enable parallel streaming from multiple TFRecord files, modify the code like this:
AUTO = tf.data.experimental.AUTOTUNE
ignore_order = tf.data.Options()
ignore_order.experimental_deterministic = False
filenames = tf.io.gfile.glob("gs://flowers-public/tfrecords-jpeg-512x512/*.tfrec") # list files on GCS
dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)
dataset = dataset.with_options(ignore_order)
dataset = dataset.map(...) # TFRecord decoding here...
There are two settings here:
num_parallel_reads=AUTO instructs the API to read from multiple files if available. It figures out how many automatically.
experimental_deterministic = False disables data order enforcement. We will be shuffling the data anyway, so order is not important. With this setting, the API can use any TFRecord as soon as it is streamed in.
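Putting the pieces together, a typical training pipeline then adds shuffling, batching and prefetching. The sketch below assumes the hypothetical read_tfrecord parse function shown earlier and the BATCH_SIZE computed above; the shuffle buffer size is an arbitrary illustrative choice:
dataset = dataset.map(read_tfrecord, num_parallel_calls=AUTO)  # decode TFRecords in parallel
dataset = dataset.repeat()                                     # loop over the data for multiple epochs
dataset = dataset.shuffle(2048)                                # shuffle buffer size chosen for illustration
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)       # static batch shapes work best with XLA on TPUs
dataset = dataset.prefetch(AUTO)                               # prepare the next batches while the TPU trains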
TPU hardware
At approximately 20 inches (50 cm), a TPU v3-8 board is a fairly sizeable piece of hardware. It sports 4 dual-core TPU chips for a total of 8 TPU cores.
![TPU cores and chips]()
Each TPU core has a traditional vector processing part (VPU) as well as dedicated matrix multiplication hardware capable of processing 128x128 matrices. This is the part that specifically accelerates machine learning workloads.
A TPU v3-8 is equipped with 128 GB of high-speed memory, allowing larger batches, larger models, and larger training inputs. In the sample above, you can try using the 512x512 px input images, also provided in the dataset, and see the TPU v3-8 handle them easily.
TPU monitor
The MXU percentage indicates how efficiently the TPU compute hardware is utilized. Higher is better.
The "Idle Time" percentage measures how often the TPU is sitting idle, waiting for data. You should optimize your data pipeline to make this as low as possible.
The measurements are refreshed approximately every 10 seconds and only appear when the TPU is running a computation.
![TPU Measurement]()
Model saving/loading on TPUs
When loading and saving TPU models from/to the local disk, the experimental_io_device option must be used. The technical explanation is at the end of this section. It can be omitted when writing to GCS because TPUs have direct access to GCS. This option does nothing on GPUs.
Saving a TPU model locally
save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
model.save('./model', options=save_locally) # saving in Tensorflow's "SavedModel" format
Loading a TPU model from local disk
with tpu_strategy.scope():
    load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    model = tf.keras.models.load_model('./model', options=load_locally) # loading in Tensorflow's "SavedModel" format
Writing checkpoints locally from a TPU model
save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
checkpoints_cb = tf.keras.callbacks.ModelCheckpoint('./checkpoints', options=save_locally)
model.fit(…, callbacks=[checkpoints_cb])
Loading a model from Tensorflow Hub to TPU directly
import tensorflow_hub as hub
with tpu_strategy.scope():
    load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    pretrained_model = hub.KerasLayer('https://tfhub.dev/tensorflow/efficientnet/b6/feature-vector/1', trainable=True, input_shape=[512,512,3], load_options=load_locally)
experimental_io_device explained
To understand what the experimental_io_device='/job:localhost' flag does, some background information is needed first. TPU users will remember that in order to train a model on a TPU, you have to instantiate the model in a TPUStrategy scope, like this:
# connect to a TPU and instantiate a distribution strategy
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.TPUStrategy(tpu)
# instantiate the model in the strategy scope
with tpu_strategy.scope():
    model = tf.keras.Sequential( … )
This boilerplate code actually does two things:
The strategy scope instructs Tensorflow to instantiate all the variables of the model in the memory of the TPU.
Connecting to the TPU automatically enters the TPU device scope, which instructs Tensorflow to run Tensorflow operations on the TPU.
Now, if you call model.save('./model') while you are connected to a TPU, Tensorflow will try to run the save operations on the TPU, and since the TPU is a network-connected accelerator that has no access to your local disk, the operation will fail. Saving to GCS will work, though: the TPU does have access to GCS.
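For completeness, saving straight to GCS therefore needs no special option. The bucket path below is a placeholder; the bucket must be writable from your session:
# no SaveOptions needed: the TPU can write to GCS directly
model.save('gs://your-bucket/model')  # hypothetical bucket path, Tensorflow's "SavedModel" format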
If you want to save a TPU model to your local disk, you need to run the saving operation on your local machine, and that is what the experimental_io_device='/job:localhost' flag does.
Step 1. Save the Model
# Save your model to disk using the .save() functionality. Here we save in .h5 format
# This step will be replaced with an alternative call to save models in Tensorflow 2.3
model.save('model.h5')
Step 2. Put your model in a Kaggle Dataset
Step 3. Load your model into the inference Notebook
model = tf.keras.models.load_model('model.h5')
Conclusion
You can now load your model and run inference using a GPU in this notebook.