Unlock high performance of Transfer Learning with transformer based Neuron architecture on Inferentia Neurons

Supported Neural Network Architecture and Neuron Family

TensorFlow Neuron enables native TensorFlow models to be accelerated on Neuron devices, so you can use your existing framework application and get started easily with minimal code changes.

Deep learning model architectures are a good fit for Inferentia, Inferentia2 and Trainium. However, Inferentia comes at a comparatively lower cost. Each NeuronCore has its own tensor engine, vector engine and scalar engine.

If we just talk about the Inferentia family, all transformer encoders are well supported on PyTorch Neuron and on TensorFlow Neuron for TF1.x and TF2.x. This gives a straight path to sequence classification, question answering, masked language modeling and natural language understanding. Overall, autoencoding models are a good fit for Inferentia, whereas autoregressive models are not.

If we intend to use autoregressive or sequence-to-sequence models for natural language generation and translation, i.e. architectures where decoders are involved, then we can use PyTorch on Inferentia2 and Trainium.

Both of the above families also support computer vision architectures, which means that even Single Shot Detector (SSD) based object detection is possible with Neuron on Inferentia. 😉

Picking one instance of Transfer Learning - ZSC/ZSL

Zero-shot classification (ZSC) most often refers to a fairly specific type of task: learn a classifier on one set of labels and then evaluate it on a different set of labels that the classifier has never seen before. We will perform zero-shot classification with a Hugging Face transformer. One popular technique for zero-shot classification is natural language inference (NLI). Under the hood, the zero-shot classification pipeline takes our sequence as the premise and creates a hypothesis from each candidate label; the pre-trained model, which is specifically trained on premise and hypothesis classification, then gives a score for each of the given labels.
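To make the premise and hypothesis mechanics concrete, here is a minimal pure-Python sketch. It assumes the hypothesis template "This example is {}." (which, as far as I know, is the Hugging Face pipeline default) and that the per-label entailment logits are normalised with a softmax across labels. The logits below are made-up numbers standing in for real model output, and the helper names are hypothetical.

```python
import math

def build_nli_pairs(sequence, candidate_labels, hypothesis_template="This example is {}."):
    """Pair the input sequence (premise) with one hypothesis per candidate label."""
    return [(sequence, hypothesis_template.format(label)) for label in candidate_labels]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(candidate_labels, entailment_logits):
    """Turn per-label entailment logits into scores that sum to 1, highest first."""
    scores = softmax(entailment_logits)
    return sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)

sequence = "one day I will see the world"
labels = ["travel", "cooking", "dancing"]
pairs = build_nli_pairs(sequence, labels)

# Hypothetical entailment logits, one per (premise, hypothesis) pair.
# In the real pipeline these come from the NLI model, not from a hard-coded list.
fake_entailment_logits = [4.0, -2.0, -1.0]

ranking = rank_labels(labels, fake_entailment_logits)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

The label whose hypothesis is most strongly entailed by the premise ends up at the top of the ranking.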

Compile the model into a Neuron Optimized Model

We will compile the underlying model inside the pipeline and make some edits to the tokenizer. We need to fix the tokenizer to a specific sequence length (128, using padding) because Neuron only accepts static input shapes.
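The effect of a static input shape can be illustrated without any framework. The sketch below is a simplification of what padding='max_length' with truncation does: every input comes out with exactly 128 ids plus a matching attention mask. The function name is hypothetical, the pad id of 1 is assumed to match RoBERTa's pad token, and a real tokenizer also preserves the closing special token when truncating, which this sketch does not.

```python
def pad_to_static_length(token_ids, max_length=128, pad_token_id=1):
    """Truncate or right-pad token ids so the result always has max_length entries."""
    clipped = list(token_ids[:max_length])
    attention_mask = [1] * len(clipped)
    padding = max_length - len(clipped)
    # Padded positions get the pad id and a 0 in the attention mask.
    return clipped + [pad_token_id] * padding, attention_mask + [0] * padding

# A short input is padded up to 128 ...
short_ids, short_mask = pad_to_static_length([0, 100, 200, 2])
# ... and a long input is truncated down to 128.
long_ids, long_mask = pad_to_static_length(list(range(300)))

print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

Because both results have the same length, a model compiled for a (1, 128) input can serve either sequence.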

The trace() call creates a TensorFlow GraphDef protobuf intermediate representation (IR) of the model compute graph.

Finally, the compilation code snippet (thanks to the AWS team):


from transformers import pipeline
import tensorflow as tf
import tensorflow.neuron as tfn
import time

# Choose a supported model for the 'zero-shot-classification' task
model_name = 'roberta-large-mnli'
pipe = pipeline('zero-shot-classification', model=model_name, framework='tf')

sequence_to_classify = "one day I will see the world"

candidate_labels = ['travel', 'cooking', 'dancing']

start = time.time()
print(pipe(sequence_to_classify, candidate_labels))
print("CPU infer time: ", time.time() - start)

neuron_pipe = pipeline('zero-shot-classification', model='roberta-large-mnli', framework='tf')

original_tokenizer = pipe.tokenizer
def wrapper_function(*args, **kwargs):
    kwargs['padding'] = 'max_length'
    kwargs['max_length'] = 128
    kwargs['truncation'] = True
    kwargs['return_tensors'] = 'tf'
    return original_tokenizer(*args, **kwargs)

neuron_pipe.tokenizer = wrapper_function
neuron_pipe.tokenizer.decode = original_tokenizer.decode
neuron_pipe.tokenizer.mask_token_id = original_tokenizer.mask_token_id
neuron_pipe.tokenizer.pad_token_id = original_tokenizer.pad_token_id
neuron_pipe.tokenizer.convert_ids_to_tokens = original_tokenizer.convert_ids_to_tokens
example_inputs = neuron_pipe.tokenizer('we can use any string here to generate example inputs')

# Compile the model by calling tfn.trace and passing in the underlying model


start = time.time()
neuron_model = tfn.trace(pipe.model, example_inputs)
print("Neuron compile time: ", time.time() - start)
neuron_pipe.model = neuron_model
neuron_pipe.model.config = pipe.model.config

start = time.time()
print(neuron_pipe(sequence_to_classify, candidate_labels))
print("Neuron infer time: ", time.time() - start)
 

We executed this on 16 NeuronCores. Now we need to save the Neuron model to disk to avoid recompilation.

Now we are all set to load the model from disk, which can provide high performance even on 4 NeuronCores. I gained a 10x improvement from compiling this model with neuron-cc.
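A minimal sketch of the save and reload step follows. It assumes the object returned by tfn.trace under TF2.x behaves like a Keras model, so that .save(model_dir) writes it to disk and tf.keras.models.load_model(model_dir) restores it. The helper names and the directory name are hypothetical, and the imports are deferred so the helpers can be defined on a machine without the Neuron SDK.

```python
def save_neuron_model(neuron_model, model_dir):
    """Persist a traced Neuron model so it does not need to be recompiled."""
    # Assumption: the traced model exposes a Keras-style save() method.
    neuron_model.save(model_dir)
    return model_dir

def load_neuron_model(model_dir):
    """Reload a previously saved Neuron model from disk."""
    import tensorflow as tf
    # Assumption: importing tensorflow.neuron is needed so the Neuron ops
    # inside the saved model are registered before loading.
    import tensorflow.neuron  # noqa: F401
    return tf.keras.models.load_model(model_dir)

# Usage on a Neuron instance, continuing from the snippet above:
#   save_neuron_model(neuron_model, './roberta-large-mnli-neuron')
#   neuron_pipe.model = load_neuron_model('./roberta-large-mnli-neuron')
#   neuron_pipe.model.config = pipe.model.config
```

After reloading, the pipeline's model config has to be reattached, just as it is after the initial trace.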

Optimising further for high throughput and low latency

For further optimization, choose the number of NeuronCores needed for Large Language Model (LLM) inference carefully; it is driven by batch size and sequence length. Moreover, the compiler can be passed the neuroncore-pipeline-cores flag, whose value can be calculated as follows:


neuroncore-pipeline-cores = 4 * round( number-of-weights-in-model/(2 * 10^7) )
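
The formula above translates directly into a small helper. The function name is hypothetical, and the 355 million figure is only an approximate parameter count for roberta-large, used here for illustration. Note that Python's round() rounds exact .5 ties to the nearest even integer, which may differ from the rounding you expect at those boundaries.

```python
def neuroncore_pipeline_cores(number_of_weights):
    """Apply 4 * round(number_of_weights / (2 * 10^7)) from the formula above."""
    return 4 * round(number_of_weights / (2 * 10**7))

# Approximate parameter count for roberta-large (an assumption for illustration).
approx_roberta_large_weights = 355_000_000
suggested_cores = neuroncore_pipeline_cores(approx_roberta_large_weights)
print(suggested_cores)
```

Here 355,000,000 / 20,000,000 = 17.75, which rounds to 18, so the formula suggests 4 * 18 = 72 pipeline cores.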