Transformation

Introduction

Transformation uses language models to generate new data features from an existing data source. The goal is to perform NLP tasks such as classification, named entity recognition, translation, summarization, question answering, or sentiment analysis at scale.

The key challenge is generating bulk LLM responses efficiently. Performing one task per column per row over an HTTP connection is very slow, even when the model is self hosted. 100,000 records times 5 features is 500,000 LLM requests. At an average of 26ms per generated token (OpenAI GPT-3.5), and assuming a single-token response for a classification task, the total processing time is around 4 hours. If the task requires GPT-4 level capability (average latency ~76ms per token), the processing time rises to around 11 hours.
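
The arithmetic behind these estimates can be sketched in a few lines. The record count, feature count, and per-token latencies come from the figures above; the rest is plain arithmetic:

```python
# Back-of-the-envelope check of the sequential, one-input-per-request estimates.
records = 100_000
features = 5
requests = records * features                     # 500,000 single-input requests
output_tokens = 1                                 # single-token classification label

for model, latency_s in [("GPT-3.5", 0.026), ("GPT-4", 0.076)]:
    hours = requests * output_tokens * latency_s / 3600
    print(f"{model}: ~{hours:.1f} hours")         # GPT-3.5: ~3.6 hours, GPT-4: ~10.6 hours
```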

However, if we pack multiple inputs into each request, we can achieve the following (see the batch-size arithmetic sketched after this list):

  • GPT-3.5 - 3 minutes (assuming a gpt-3.5-turbo-1106 context window of 16,385 tokens, 64 tokens per input text, 3 tokens per output allowing for a structured JSON response, 26ms per generated token, a 1,500 token overhead for the system and user prompts, and around 220 inputs per request given the size of the context window.)
  • GPT-4 - 4 minutes (assuming a gpt-4-32k context window of 32,768 tokens, 64 tokens per input text, 3 tokens per output allowing for a structured JSON response, 76ms per generated token, a 1,500 token overhead for the system and user prompts, and around 465 inputs per request given the size of the context window.)
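
A minimal sketch of the batch-size arithmetic behind those figures follows. The prompt overhead and per-input and per-output token counts are the assumptions stated above; the quoted end-to-end times also benefit from the parallel dispatch described under Method below.

```python
# Rough batch-size arithmetic for packing multiple inputs into each request.
def inputs_per_request(context_window: int, prompt_overhead: int = 1_500,
                       tokens_per_input: int = 64, tokens_per_output: int = 3) -> int:
    """How many inputs fit in a single request for a given context window."""
    return (context_window - prompt_overhead) // (tokens_per_input + tokens_per_output)

total_inputs = 100_000 * 5                        # 500,000 input texts
for model, context in [("gpt-3.5-turbo-1106", 16_385), ("gpt-4-32k", 32_768)]:
    batch = inputs_per_request(context)
    num_requests = -(-total_inputs // batch)      # ceiling division
    print(f"{model}: ~{batch} inputs/request, ~{num_requests} requests")
# gpt-3.5-turbo-1106: ~222 inputs/request, ~2253 requests
# gpt-4-32k: ~466 inputs/request, ~1073 requests
```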

Latency Benchmarks

Model                      Average latency / output token
OpenAI GPT-3.5             26ms
OpenAI GPT-4               76ms
OpenAI GPT-3.5-Instruct    12ms
Azure GPT-4                42ms
Azure GPT-3.5              25ms
Azure GPT-3.5-Instruct     8ms
Claude Instant-1           10ms
Claude-2                   31ms

These times can be reduced further by:

  • De-duplicating inputs
  • Caching responses

Method

The process implemented by Prompt Store entails the following steps (illustrative sketches follow the list):

  1. Bin pack inputs, using the selected model’s context window to calculate batch sizes.
  2. Hash inputs to enable caching and to reconstruct outputs after de-duplication. (See step 7.)
  3. De-duplicate inputs. Only novel inputs are processed.
  4. Check the cache for inputs that have already been processed in previous requests.
  5. Parallelize requests to maximize throughput given the published token and request limits of the selected model.
  6. Make the request, using a system prompt to get structured JSON output. This step is model specific, and Prompt Store automatically makes the appropriate request. Some models support “function calling” - the ability to respond in a structured format for use by an external tool. With other models, we coerce a structured response using the system prompt. (The actual response must be parsed, validated, and possibly retried if the response format is incorrect. Prompt Store transparently handles both cases.)
  7. Using the hash keys calculated in step 2, assign each response to the original input in the right sequence. (Where an input was de-duplicated, this involves assigning the same response to more than one input.)
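
To make steps 1-4 concrete, here is a minimal sketch. The SHA-256 hash, in-memory cache dictionary, and length-based token estimate are illustrative assumptions, not Prompt Store’s actual implementation:

```python
import hashlib

def input_hash(text: str) -> str:
    """Stable key used for caching and for reassembling outputs after de-duplication."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_batches(inputs: list[str], cache: dict[str, str], context_window: int,
                 prompt_overhead: int = 1_500, tokens_per_output: int = 3) -> list[list[str]]:
    """De-duplicate inputs, skip cache hits, and bin pack the remainder into
    batches that fit the selected model's context window."""
    unique = {input_hash(text): text for text in inputs}            # step 3: de-duplicate
    pending = [t for h, t in unique.items() if h not in cache]      # step 4: cache lookup

    batches, current, used = [], [], prompt_overhead
    for text in pending:
        cost = len(text) // 4 + tokens_per_output   # crude token estimate (assumption)
        if current and used + cost > context_window:
            batches.append(current)                  # step 1: close the bin, start a new one
            current, used = [], prompt_overhead
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches
```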
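
Steps 5 and 6 amount to fanning requests out in parallel within the provider’s rate limits and validating the structured response. The `client.complete` coroutine below is a hypothetical stand-in for the provider-specific call that Prompt Store makes on your behalf:

```python
import asyncio
import json

MAX_CONCURRENT_REQUESTS = 8      # tune to the selected model's published rate limits
SYSTEM_PROMPT = (
    "Classify each input text. Respond only with a JSON array of labels, "
    "one label per input, in the same order as the inputs."
)

async def classify_batch(client, batch: list[str], retries: int = 2) -> list[str]:
    """Step 6: request structured JSON output for one batch, retrying if the
    response cannot be parsed or has the wrong length."""
    for _ in range(retries + 1):
        # `client.complete` is a hypothetical async wrapper around the chat API.
        raw = await client.complete(system=SYSTEM_PROMPT, user=json.dumps(batch))
        try:
            labels = json.loads(raw)
            if isinstance(labels, list) and len(labels) == len(batch):
                return labels
        except json.JSONDecodeError:
            pass                                     # malformed JSON: retry
    raise ValueError("model did not return valid structured output")

async def classify_all(client, batches: list[list[str]]) -> list[list[str]]:
    """Step 5: dispatch batches concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

    async def bounded(batch: list[str]) -> list[str]:
        async with sem:
            return await classify_batch(client, batch)

    return await asyncio.gather(*(bounded(b) for b in batches))
```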
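
Finally, step 7 maps each response back to every original input (including duplicates) via the hash key, reusing `input_hash` from the sketch above and assuming the fresh responses have been keyed by the same hash:

```python
def reassemble(inputs: list[str], cache: dict[str, str],
               new_results: dict[str, str]) -> list[str]:
    """Assign each response to the original inputs in their original order.
    De-duplicated inputs all receive the same cached or freshly generated label."""
    cache.update(new_results)                        # also primes the cache for the next run
    return [cache[input_hash(text)] for text in inputs]
```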

Destinations

Finally, the result set is written to one or more destinations (a minimal example follows the list below). Supported destinations include:

  • A SQL table
  • A CSV file in object storage
  • A Semantic Index (e.g. Redis, Chroma, Milvus) so the transformation result can be used for retrieval augmented generation (RAG)
  • A Graph Store so the transformation result can be used to build a Knowledge Graph
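
Prompt Store writes to these destinations for you; purely as an illustration of the kind of output produced, the sketch below writes a result set to a SQL table and to a CSV file in object storage using pandas and SQLAlchemy, with placeholder data, connection string, and bucket path:

```python
import pandas as pd
from sqlalchemy import create_engine

# Example result set: original inputs plus the generated feature column (placeholder data).
results = pd.DataFrame({
    "text": ["great product, would buy again", "arrived broken and late"],
    "sentiment": ["positive", "negative"],
})

# SQL table destination (the connection string is a placeholder).
engine = create_engine("postgresql://user:password@host/warehouse")
results.to_sql("reviews_with_sentiment", engine, if_exists="append", index=False)

# CSV file in object storage (pandas writes directly to s3:// paths when an
# S3 filesystem package such as s3fs is installed).
results.to_csv("s3://my-bucket/transformations/reviews_with_sentiment.csv", index=False)
```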