Supported Neural Network Architecture and Neuron Family
TensorFlow Neuron enables native TensorFlow models to be accelerated on Neuron devices, so you can use your existing framework application and get started easily with minimal code changes.
Deep learning model architectures are a good fit for Inferentia, Inferentia2 and Trainium; however, Inferentia comes at a comparatively lower cost. Each NeuronCore has its own tensor engine, vector engine and scalar engine.
If we look at the Inferentia family alone, all transformer encoders are well supported by the PyTorch, TensorFlow 1.x and TensorFlow 2.x Neuron packages. This gives a straight pathway to sequence classification, question answering, masked language modeling and Natural Language Understanding. Overall, autoencoding models are always a good fit on Inferentia, whereas autoregressive models are not.
If we intend to use autoregressive or sequence-to-sequence models for natural language generation and translation, i.e. architectures where decoders are involved, then we can train with PyTorch on Inferentia2 and Trainium.
Both of the above families also support computer vision architectures, which means that even Single Shot Detector (SSD) based object detection is possible with NeuronCores on Inferentia. 😉
Picking one instance of Transfer Learning: ZSC/ZSL
Zero-shot classification (ZSC) most often refers to a fairly specific type of task: learn a classifier on one set of labels, then evaluate it on a different set of labels that the classifier has never seen before. We will perform zero-shot classification using a Hugging Face transformer. One popular technique for zero-shot classification is natural language inference (NLI). Under the hood, zero-shot classification takes our sequence, creates a hypothesis for our pre-trained model (which is specifically trained on premise/hypothesis classification), and then produces a score for each of the given labels.
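Conceptually, the pipeline pairs our sequence (as the premise) with a templated hypothesis for each candidate label and asks the NLI model how strongly the premise entails it. Below is a minimal sketch of that idea using the same roberta-large-mnli model outside the pipeline; the hypothesis template and variable names are illustrative assumptions, not the pipeline's exact internals.
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
# Illustrative sketch of the NLI trick behind zero-shot classification (not the pipeline's exact code)
nli_tokenizer = AutoTokenizer.from_pretrained('roberta-large-mnli')
nli_model = TFAutoModelForSequenceClassification.from_pretrained('roberta-large-mnli')
premise = "one day I will see the world"
for label in ['travel', 'cooking', 'dancing']:
    hypothesis = f"This example is about {label}."  # assumed hypothesis template
    inputs = nli_tokenizer(premise, hypothesis, return_tensors='tf')
    logits = nli_model(**inputs).logits  # roberta-large-mnli outputs [contradiction, neutral, entailment]
    entailment_score = float(tf.nn.softmax(logits, axis=-1)[0, -1])  # entailment probability ranks the label
    print(label, entailment_score)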
Compile the model into a Neuron Optimized Model
We will compile the underlying model inside the pipeline, as well as make some edits to the tokenizer. We need to pin the tokenizer to a specific sequence length (128, using padding) because Neuron only accepts static input shapes.
The trace() call creates a TensorFlow GraphDef protobuf intermediate representation (IR) of the model's compute graph.
Finally, the compilation code snippet (thanks to the AWS team):
from transformers import pipeline
import tensorflow as tf
import tensorflow.neuron as tfn
import time
# Choose a supported model for the 'zero-shot-classification' task
model_name = 'roberta-large-mnli'
pipe = pipeline('zero-shot-classification', model=model_name, framework='tf')
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
start = time.time()
print(pipe(sequence_to_classify, candidate_labels))
print("CPU infer time: ", time.time() - start)
neuron_pipe = pipeline('zero-shot-classification', model='roberta-large-mnli', framework='tf')
original_tokenizer = pipe.tokenizer
def wrapper_function(*args, **kwargs):
    kwargs['padding'] = 'max_length'
    kwargs['max_length'] = 128
    kwargs['truncation'] = True
    kwargs['return_tensors'] = 'tf'
    return original_tokenizer(*args, **kwargs)
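# Use the wrapper as the pipeline tokenizer and re-attach the original tokenizer attributes the pipeline still needs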
neuron_pipe.tokenizer = wrapper_function
neuron_pipe.tokenizer.decode = original_tokenizer.decode
neuron_pipe.tokenizer.mask_token_id = original_tokenizer.mask_token_id
neuron_pipe.tokenizer.pad_token_id = original_tokenizer.pad_token_id
neuron_pipe.tokenizer.convert_ids_to_tokens = original_tokenizer.convert_ids_to_tokens
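# Generate example inputs with the static-shape tokenizer; they are passed to tfn.trace below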
example_inputs = neuron_pipe.tokenizer('we can use any string here to generate example inputs')
# Compile the model by calling tfn.trace, passing in the underlying model and the example inputs
start = time.time()
neuron_model = tfn.trace(pipe.model, example_inputs)
print("Neuron compile time: ", time.time() - start)
neuron_pipe.model = neuron_model
neuron_pipe.model.config = pipe.model.config
start = time.time()
print(neuron_pipe(sequence_to_classify, candidate_labels))
print("Neuron infer time: ", time.time() - start)
We executed this on 16 NeuronCores. Next, we save the compiled Neuron model to disk so that we can avoid recompilation later.
Now we are all set to load the model from disk; it can deliver high performance even on 4 NeuronCores. I have gained a 10x improvement with this model compilation using neuron-cc.
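A minimal sketch of the save-and-reload step is shown below; the './neuron_zsc_model' path is purely illustrative, and the exact save/load calls may vary slightly between tensorflow-neuron releases.
# Save the compiled model as a SavedModel so future runs can skip recompilation (path is illustrative)
neuron_model.save('./neuron_zsc_model')
# Later, or on another Inf1 host: reload the compiled model and plug it back into the pipeline
reloaded_neuron_model = tf.keras.models.load_model('./neuron_zsc_model')
neuron_pipe.model = reloaded_neuron_model
neuron_pipe.model.config = pipe.model.config
# Tip: the number of NeuronCores available to the process can typically be limited via the NEURON_RT_NUM_CORES environment variable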
Optimising further for high throughput and low latency
For further optimization, choose the number of NeuronCores needed for Large Language Model (LLM) inference carefully; it is driven by batch size and sequence length. Moreover, the compiler can be passed the neuroncore-pipeline-cores flag, which can be calculated as follows:
neuroncore-pipeline-cores = 4 * round( number-of-weights-in-model/(2 * 10^7) )
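As a rough illustration, the sketch below applies this formula to the roberta-large-mnli model loaded above; the helper name is ours, and the parameter count comes straight from the loaded TensorFlow model.
# Rough estimate of the neuroncore-pipeline-cores compiler flag using the formula above (helper name is illustrative)
def estimate_pipeline_cores(model):
    num_weights = sum(w.shape.num_elements() for w in model.weights)
    return 4 * round(num_weights / 2e7)

# roberta-large has roughly 3.55e8 weights, so this works out to about 4 * round(17.75) = 72
print(estimate_pipeline_cores(pipe.model))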