Generative AI Use Cases#
Introduction#
Besides the TensorFlow Serving API (/v1) and KServe API (/v2) frontends, the model server supports a range of endpoints for generative use cases (/v3). They are extensible using MediaPipe graphs.
Currently supported endpoints are:
OpenAI compatible endpoints: chat/completions, completions, embeddings
Cohere compatible endpoint: rerank
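As a quick orientation, the sketch below builds the full URL of each generative endpoint that the examples on this page call. The helper name `endpoint_url` and the default base address `http://localhost:8000` are assumptions made for illustration; adjust them to match your deployment.

```python
# Endpoint paths under the /v3 prefix, as used by the examples on this page,
# mapped to the API family each one is compatible with.
V3_ENDPOINTS = {
    "chat/completions": "OpenAI",
    "completions": "OpenAI",
    "embeddings": "OpenAI",
    "rerank": "Cohere",
}


def endpoint_url(name: str, base: str = "http://localhost:8000") -> str:
    """Return the full URL of a /v3 endpoint (hypothetical helper)."""
    if name not in V3_ENDPOINTS:
        raise ValueError(f"unknown endpoint: {name}")
    return f"{base.rstrip('/')}/v3/{name}"


if __name__ == "__main__":
    for name, api in V3_ENDPOINTS.items():
        print(f"{api:7s} {endpoint_url(name)}")
```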
OpenAI API Clients#
When creating a Python-based client application, you can use the OpenAI client library, openai.
Alternatively, you can use a plain curl command or the requests Python library.
Install the Package#
pip3 install openai
pip3 install requests
pip3 install cohere
Request chat completions with unary calls#
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=False,
)
print(response.choices[0].message)
import requests
payload = {"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [ {"role": "user","content": "Say this is a test" }]}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post("http://localhost:8000/v3/chat/completions", json=payload, headers=headers)
print(response.text)
curl http://localhost:8000/v3/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [ {"role": "user","content": "Say this is a test" }]}'
Check the LLM quick start and the end-to-end demo of text generation.
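The requests and curl variants return the raw JSON body rather than a parsed object. A minimal sketch of extracting the assistant's reply from such a body is shown here, assuming the response follows the OpenAI chat completions schema. The function name `first_reply` is hypothetical, and the sample body is made up for illustration, not captured from a real server.

```python
import json

# Illustrative response body in the OpenAI chat completions format;
# the values are made up, not output from a real model server.
sample_body = json.dumps({
    "object": "chat.completion",
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "This is a test."},
            "finish_reason": "stop",
        }
    ],
})


def first_reply(body: str) -> str:
    """Return the message content of the first choice (hypothetical helper)."""
    data = json.loads(body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


print(first_reply(sample_body))
```

With the requests example, the same helper could be called as `first_reply(response.text)` after checking that the request succeeded.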
Request completions with unary calls#
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
response = client.completions.create(
    model="meta-llama/Llama-2-7b",
    prompt="Say this is a test",
    stream=False,
)
print(response.choices[0].text)
import requests
payload = {"model": "meta-llama/Llama-2-7b", "prompt": "Say this is a test"}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post("http://localhost:8000/v3/completions", json=payload, headers=headers)
print(response.text)
curl http://localhost:8000/v3/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-2-7b", "prompt": "Say this is a test"}'
Check the LLM quick start and the end-to-end demo of text generation.
Request chat completions with streaming#
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"
)
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Check the LLM quick start and the end-to-end demo of text generation.
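Without the openai package, a streamed response has to be parsed by hand. The sketch below assumes the server follows the OpenAI streaming convention: server-sent event lines of the form `data: {json}`, terminated by `data: [DONE]`. The function name `iter_stream_text` is hypothetical, and the sample lines are made up for illustration, not captured from a real server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from server-sent event lines of a chat stream."""
    for line in lines:
        if not line or not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = (choices[0].get("delta") or {}).get("content")
        if content:
            yield content


# Made-up event lines for illustration only.
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "This is"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " a test"}}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_stream_text(sample_lines)))
```

To use it against a live server, one option is to send the request with `requests.post(..., stream=True)`, add `"stream": True` to the payload, and pass in the lines from `response.iter_lines()` after decoding each one to text.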
Request completions with streaming#
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"
)
stream = client.completions.create(
    model="meta-llama/Llama-2-7b",
    prompt="Say this is a test",
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].text is not None:
        print(chunk.choices[0].text, end="")
Check the LLM quick start and the end-to-end demo of text generation.
Text embeddings#
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"
)
responses = client.embeddings.create(input=['hello world'], model='Alibaba-NLP/gte-large-en-v1.5')
for data in responses.data:
    print(data.embedding)
import requests
payload = {"model": "Alibaba-NLP/gte-large-en-v1.5", "input": "hello world"}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post("http://localhost:8000/v3/embeddings", json=payload, headers=headers)
print(response.text)
curl http://localhost:8000/v3/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "Alibaba-NLP/gte-large-en-v1.5", "input": "hello world"}'
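Embedding vectors are usually compared rather than read directly. As one possible follow-up step, the sketch below computes the cosine similarity between two vectors in plain Python. The function name `cosine_similarity` and the toy vectors are assumptions for illustration; real embeddings returned by the endpoint are much longer.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two embeddings.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]
print(round(cosine_similarity(vec_a, vec_b), 6))
```

With the openai client example, the inputs could be taken from `responses.data[i].embedding` for two different input strings.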
Cohere Python Client#
Clients can use the rerank endpoint via the Cohere Python package, cohere.
As with the OpenAI endpoints, an alternative is to use a curl command or the requests Python library.
Install the Package#
pip3 install cohere
pip3 install requests
Documents reranking#
import cohere
client = cohere.Client(base_url='http://localhost:8000/v3', api_key="not_used")
responses = client.rerank(query="Hello", documents=["Welcome", "Farewell"], model='BAAI/bge-reranker-large')
for res in responses.results:
    print(res.index, res.relevance_score)
import requests
payload = {"model": "BAAI/bge-reranker-large", "query": "Hello", "documents":["Welcome","Farewell"]}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post("http://localhost:8000/v3/rerank", json=payload, headers=headers)
print(response.text)
curl http://localhost:8000/v3/rerank \
-H "Content-Type: application/json" \
-d '{"model": "BAAI/bge-reranker-large", "query": "Hello", "documents":["Welcome","Farewell"]}'
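The rerank response refers to documents by index, so the results need to be mapped back to the input list to be readable. A minimal sketch follows, assuming the JSON body has a `results` array whose entries carry `index` and `relevance_score`, which are the same fields the cohere client example prints. The function name `ranked_documents` is hypothetical and the scores are made up for illustration.

```python
import json
from typing import List, Tuple

documents = ["Welcome", "Farewell"]

# Made-up response body shaped like a rerank response; scores are illustrative.
sample_body = json.dumps({
    "results": [
        {"index": 1, "relevance_score": 0.12},
        {"index": 0, "relevance_score": 0.87},
    ]
})


def ranked_documents(body: str, docs: List[str]) -> List[Tuple[str, float]]:
    """Return (document, score) pairs sorted from most to least relevant."""
    results = json.loads(body)["results"]
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(docs[r["index"]], r["relevance_score"]) for r in ordered]


for text, score in ranked_documents(sample_body, documents):
    print(f"{score:.2f}  {text}")
```

With the requests example, the helper could be called as `ranked_documents(response.text, ["Welcome", "Farewell"])`.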