Building a wrapper and using Phi-3 as an MLFlow model

MLflow is an open-source platform designed to streamline the entire machine learning (ML) lifecycle. It helps data scientists track experiments, manage their ML models and deploy them into production, ensuring reproducibility and efficient collaboration.

In this repo, I'll demonstrate two different approaches to building a wrapper around the Phi-3 small language model (SLM) and then running it as an MLFlow model, either locally or in the cloud, e.g. in an Azure Machine Learning workspace. You can use the attached Jupyter notebooks to jump-start your development process.

Note: this code has now been contributed to Microsoft's Phi-3 Cookbook.

Table of contents:

- Option 1: Transformer pipeline
- Option 2: Custom Python wrapper
- Signatures of generated MLFlow models
- Inference of Phi-3 with MLFlow runtime

Option 1: Transformer pipeline

This is the easiest option for building a wrapper if you want to use a HuggingFace model with MLFlow's experimental transformers flavour.

  1. You will require the relevant Python packages from MLFlow and HuggingFace:
import mlflow
import transformers
  2. Next, initiate a transformer pipeline by referring to the target Phi-3 model in the HuggingFace registry. As can be seen from the Phi-3-mini-4k-instruct model card, its task is “Text Generation” (a quick sanity check of the resulting pipeline is sketched after the code below):
pipeline = transformers.pipeline(
    task = "text-generation",
    model = "microsoft/Phi-3-mini-4k-instruct"
)
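Before logging the pipeline, you can optionally sanity-check it with a one-off generation. The prompt and token limit below are illustrative only:

# Quick smoke test of the pipeline before logging it to MLFlow
test_output = pipeline("What is the capital of Spain?", max_new_tokens = 50)
print(test_output[0]["generated_text"])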
  3. You can now save your Phi-3 model's transformer pipeline into the MLFlow format, providing additional details such as the target artifact path, specific model configuration settings and the inference API type:
model_config = {
    "max_new_tokens": 300,   # illustrative generation settings
    "temperature": 0.2,
}

model_info = mlflow.transformers.log_model(
    transformers_model = pipeline,
    artifact_path = "phi3-mlflow-model",
    model_config = model_config,
    task = "llm/v1/chat"
)
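
The returned ModelInfo object also exposes the logged model's URI and its auto-generated signature, which we will look at in the Signatures section below; a minimal sketch of inspecting it:

# Where the model was logged, and the signature MLFlow inferred for it
print(model_info.model_uri)
print(model_info.signature)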

Option 2: Custom Python wrapper

At the time of writing, the transformer pipeline did not support MLFlow wrapper generation for HuggingFace models in the ONNX format, even with the experimental optimum Python package. For cases like this, you can build a custom Python wrapper for your MLFlow model.

  1. Here I utilise Microsoft's ONNX Runtime generate() API for the ONNX model's inference and token encoding / decoding. You have to choose the onnxruntime_genai package built for your target compute, with the example below targeting the CPU:
import mlflow
from mlflow.models import infer_signature
import onnxruntime_genai as og
  2. Our custom class implements two methods: load_context() to initialise the ONNX model of Phi-3 Mini 4K Instruct, its generator parameters and the tokenizer; and predict() to generate output tokens for the provided prompt (a quick local smoke test of the wrapper is sketched after the class definition below):
class Phi3Model(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Retrieving model from the artifacts
        model_path = context.artifacts["phi3-mini-onnx"]
        model_options = {
            "max_length": 300,
            "temperature": 0.2,
        }

        # Defining the model
        self.phi3_model = og.Model(model_path)
        self.params = og.GeneratorParams(self.phi3_model)
        self.params.set_search_options(**model_options)
        
        # Defining the tokenizer
        self.tokenizer = og.Tokenizer(self.phi3_model)

    def predict(self, context, model_input):
        # Retrieving prompt from the input
        prompt = model_input["prompt"][0]
        self.params.input_ids = self.tokenizer.encode(prompt)

        # Generating the model's response
        response = self.phi3_model.generate(self.params)

        return self.tokenizer.decode(response[0][len(self.params.input_ids):])
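
If you want to smoke-test the wrapper before logging it, you can mimic the MLFlow context with a simple stand-in object. This is a sketch only: the local path to the downloaded ONNX model folder is an assumption.

from types import SimpleNamespace

# Stand-in for the MLFlow context, pointing at a local copy of the ONNX model (path is an assumption)
fake_context = SimpleNamespace(
    artifacts = {"phi3-mini-onnx": "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"}
)

phi3 = Phi3Model()
phi3.load_context(fake_context)
print(phi3.predict(fake_context, {"prompt": ["<|user|>What is MLFlow?<|end|><|assistant|>"]}))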
  3. You can now use the mlflow.pyfunc.log_model() function to generate a custom Python wrapper (in pickle format) for the Phi-3 model, along with the original ONNX model and the required dependencies; the artifact_path and input_example values below are illustrative:
# Illustrative values: adjust the artifact path and example prompt to your setup
artifact_path = "phi3-mlflow-model"
input_example = {"prompt": "<|system|>You are a stand-up comedian.<|end|><|user|>Tell me a joke about atom<|end|><|assistant|>"}

model_info = mlflow.pyfunc.log_model(
    artifact_path = artifact_path,
    python_model = Phi3Model(),
    artifacts = {
        "phi3-mini-onnx": "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4",
    },
    input_example = input_example,
    signature = infer_signature(input_example, ["Run"]),
    extra_pip_requirements = ["torch", "onnxruntime_genai", "numpy"],
)

Signatures of generated MLFlow models

  1. In Step 3 of Option 1 above, we set the MLFlow model's task to “llm/v1/chat”. This setting generates a model API wrapper that is compatible with OpenAI's Chat API, as shown below:
{inputs: 
  ['messages': Array({content: string (required), name: string (optional), role: string (required)}) (required), 'temperature': double (optional), 'max_tokens': long (optional), 'stop': Array(string) (optional), 'n': long (optional), 'stream': boolean (optional)],
outputs: 
  ['id': string (required), 'object': string (required), 'created': long (required), 'model': string (required), 'choices': Array({finish_reason: string (required), index: long (required), message: {content: string (required), name: string (optional), role: string (required)} (required)}) (required), 'usage': {completion_tokens: long (required), prompt_tokens: long (required), total_tokens: long (required)} (required)],
params: 
  None}
  2. As a result, you can submit your prompt in the following format:
messages = [{"role": "user", "content": "What is the capital of Spain?"}]
  3. Then, use OpenAI API-compatible post-processing, e.g. response[0]['choices'][0]['message']['content'], to beautify your output into something like this (a combined sketch of the call and post-processing follows the sample output below):
Question: What is the capital of Spain?

Answer: The capital of Spain is Madrid. It is the largest city in Spain and serves as the political, economic, and cultural center of the country. Madrid is located in the center of the Iberian Peninsula and is known for its rich history, art, and architecture, including the Royal Palace, the Prado Museum, and the Plaza Mayor.

Usage: {'prompt_tokens': 11, 'completion_tokens': 73, 'total_tokens': 84}
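
Putting steps 2 and 3 together, a minimal sketch of querying the chat-flavoured model could look like the one below. It assumes the logged model has already been loaded back with mlflow.pyfunc.load_model() (as shown in the Inference section), and the exact predict() input shape may vary slightly between MLFlow versions:

# Submitting the prompt to the chat-flavoured MLFlow model
messages = [{"role": "user", "content": "What is the capital of Spain?"}]
response = loaded_model.predict({"messages": messages})

# Extracting the answer from the OpenAI-compatible response
print(response[0]["choices"][0]["message"]["content"])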
  4. In Step 3 of Option 2 above, we let the MLFlow package generate the model's signature from the given input example. Our MLFlow wrapper's signature will look like this:
{inputs: 
  ['prompt': string (required)],
outputs: 
  [string (required)],
params: 
  None}
  5. So, our input would need to contain a "prompt" dictionary key, similar to this:
{"prompt": "<|system|>You are a stand-up comedian.<|end|><|user|>Tell me a joke about atom<|end|><|assistant|>",}
  6. The model's output will then be provided in string format:
Alright, here's a little atom-related joke for you!

Why don't electrons ever play hide and seek with protons?

Because good luck finding them when they're always "sharing" their electrons!

Remember, this is all in good fun, and we're just having a little atomic-level humor!

Inference of Phi-3 with MLFlow runtime

  1. To run the generated MLFlow model locally, you can load it with mlflow.pyfunc.load_model() from the model's URI or directory and then call its predict() method (an example predict() call is sketched after the code below):
loaded_model = mlflow.pyfunc.load_model(
    model_uri = model_info.model_uri
)
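Once loaded, calling predict() follows the model's signature. For the custom Python wrapper from Option 2, a minimal sketch re-using the prompt format shown earlier would be:

# Calling the loaded model; the input must carry the "prompt" key expected by the wrapper
response = loaded_model.predict(
    {"prompt": "<|system|>You are a stand-up comedian.<|end|><|user|>Tell me a joke about atom<|end|><|assistant|>"}
)
print(response)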
  2. To run it in a cloud environment, such as an Azure Machine Learning workspace, you can register your MLFlow model with the custom Python wrapper in the workspace's model registry (screenshot: phi3_mlflow_registration). If you prefer to script this step and the next one, a sketch using the Azure ML Python SDK follows this list.
  3. Then, deploy it to a managed real-time endpoint (screenshot: phi3_mlflow_deploy).
  4. Once the deployment succeeds, you can immediately start using it with the code samples provided in JavaScript, Python, C# or R (screenshot: phi3_mlflow_endpoint).
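
If you prefer scripting the registration and deployment steps, a sketch using the Azure Machine Learning Python SDK v2 (azure-ai-ml package) might look like the one below. The workspace identifiers, endpoint and deployment names, and VM size are placeholders, and it assumes the MLFlow run was tracked in the same workspace:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model, ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connecting to the target Azure ML workspace (values are placeholders)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id = "<SUBSCRIPTION_ID>",
    resource_group_name = "<RESOURCE_GROUP>",
    workspace_name = "<WORKSPACE_NAME>"
)

# Registering the MLFlow model produced by log_model() (assumes the run lives in this workspace)
registered_model = ml_client.models.create_or_update(
    Model(
        path = model_info.model_uri,
        type = AssetTypes.MLFLOW_MODEL,
        name = "phi3-mlflow-model"
    )
)

# Creating a managed online endpoint and deploying the registered model to it
endpoint = ManagedOnlineEndpoint(name = "phi3-mlflow-endpoint")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name = "phi3-deployment",
    endpoint_name = endpoint.name,
    model = registered_model.id,
    instance_type = "Standard_DS4_v2",
    instance_count = 1
)
ml_client.online_deployments.begin_create_or_update(deployment).result()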