How to Build Traceable and Evaluated LLM Workflows Using Promptflow, Prompty, and OpenAI

Source: MarkTechPost

In this tutorial, we build a complete, production-style LLM workflow using Promptflow within a Colab environment. We begin by setting up a reliable keyring backend to avoid OS dependency issues and securely configure our OpenAI connection. From there, we establish a clean workspace and define a structured Prompty file that acts as the core LLM component of our pipeline. We then design a class-based flex flow that combines deterministic preprocessing with LLM reasoning, allowing us to inject computed hints into model responses. We also enable tracing to monitor each execution step, run both single and batch queries, and generate outputs in a structured format. Finally, we extend the system with an evaluation pipeline that leverages an LLM-as-a-judge to score responses against expected answers.

!pip install -q keyrings.alt

import keyring
from keyrings.alt.file import PlaintextKeyring
keyring.set_keyring(PlaintextKeyring())

import os
from promptflow.client import PFClient
from promptflow.connections import OpenAIConnection

pf = PFClient()
CONN = "open_ai_connection"
try:
    pf.connections.get(name=CONN)
    print(f"Using existing connection '{CONN}'")
except Exception:
    pf.connections.create_or_update(
        OpenAIConnection(name=CONN, api_key=os.environ["OPENAI_API_KEY"])
    )
    print(f"Created connection '{CONN}'")

We begin by installing a fallback keyring backend to avoid dependency issues in environments like Colab. We then initialize the Promptflow client and check if an OpenAI connection already exists. If not, we create one using the API key from the environment, ensuring a reusable and consistent connection setup.
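
As a quick check, the same client can enumerate whatever connections Promptflow has stored locally. Below is a minimal sketch using the pf client from above; note that the exact attributes on the returned connection objects can vary across Promptflow versions, and secret fields such as api_key are scrubbed when connections are read back.

# Sketch: enumerate locally stored connections (assumes the `pf` client above).
for conn in pf.connections.list():
    print(conn.name, "->", type(conn).__name__)

# Reading a connection back scrubs secret fields such as api_key.
c = pf.connections.get(name="open_ai_connection")
print(c.name)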

!pip install -q "promptflow>=1.13.0" "promptflow-tracing" "promptflow-tools" openai

import os, sys, json, getpass, textwrap, importlib
from pathlib import Path

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key: ")

WORK_DIR = Path("/content/pf_demo")
WORK_DIR.mkdir(exist_ok=True, parents=True)
os.chdir(WORK_DIR)
sys.path.insert(0, str(WORK_DIR))

from promptflow.client import PFClient
from promptflow.connections import OpenAIConnection
from promptflow.tracing import start_trace

pf = PFClient()
CONN = "open_ai_connection"
try:
    pf.connections.get(name=CONN)
    print(f"Using existing connection '{CONN}'")
except Exception:
    pf.connections.create_or_update(OpenAIConnection(name=CONN, api_key=os.environ["OPENAI_API_KEY"]))
    print(f"Created connection '{CONN}'")

We install all required Promptflow libraries and set up the project’s working directory. We securely capture the OpenAI API key if it is not already set and configure the environment accordingly. We then reinitialize the Promptflow client and ensure that the connection is properly established for downstream usage.
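
If you want to confirm the key actually works before involving Promptflow at all, a one-off call through the plain openai client is the fastest test. A minimal sketch, assuming the openai>=1.x SDK installed above and access to gpt-4o-mini:

# Sketch: sanity-check the API key directly (assumes openai>=1.x).
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)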

(WORK_DIR / "researcher.prompty").write_text("""---
name: Researcher
description: Concise research assistant.
model:
  api: chat
  configuration:
    type: openai
    connection: open_ai_connection
    model: gpt-4o-mini
  parameters:
    temperature: 0.2
    max_tokens: 350
inputs:
  question: {type: string}
  hint:     {type: string, default: ""}
sample:
  question: "What is the speed of light in vacuum?"
  hint: ""
---
system:
You are a precise research assistant. Answer in 1-3 sentences. If a `hint` is given, weave it in.

user:
Q: {{question}}
{% if hint %}Hint: {{hint}}{% endif %}
""")

(WORK_DIR / "flow.py").write_text(textwrap.dedent('''
    from pathlib import Path
    from promptflow.tracing import trace
    from promptflow.core import Prompty

    BASE = Path(__file__).parent

    @trace
    def safe_calc(expression: str) -> str:
        """A tiny deterministic 'tool' the assistant can lean on."""
        if not set(expression) <= set("0123456789+-*/(). "):
            return "unsafe"
        try:
            return str(eval(expression))
        except Exception as e:
            return f"error:{e}"

    class ResearchAssistant:
        """Class-based flex flow. __init__ args become flow init parameters."""
        def __init__(self, model: str = "gpt-4o-mini"):
            self.model = model
            self.llm = Prompty.load(source=BASE / "researcher.prompty")

        @trace
        def __call__(self, question: str) -> dict:
            hint = ""
            if "*" in question or "+" in question:
                tokens = [t for t in question.replace("?", "").split() if any(c.isdigit() for c in t)]
                expr = "".join(tokens)
                if expr:
                    hint = f"computed: {expr} = {safe_calc(expr)}"

            answer = self.llm(question=question, hint=hint)

            return {"question": question, "answer": str(answer).strip(), "hint_used": hint}
'''))

(WORK_DIR / "flow.flex.yaml").write_text(
    "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json\n"
    "entry: flow:ResearchAssistant\n"
)

We define a Prompty file that structures how the LLM should behave as a concise research assistant. We then create a class-based flow that combines a deterministic calculation tool with an LLM call, enabling hybrid reasoning. Finally, we register this flow using a YAML configuration, making it executable within the Promptflow framework.
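
One benefit of the flex-flow style is that flow.py stays importable as ordinary Python, so the deterministic tool can be unit-tested without any LLM call. A small sketch, run from WORK_DIR so the module resolves:

# Sketch: exercise the deterministic tool in isolation (no API key needed).
import flow as _flow

print(_flow.safe_calc("21*19"))       # "399"
print(_flow.safe_calc("12 * 11"))     # "132"
print(_flow.safe_calc("__import__"))  # "unsafe", blocked by the character allowlist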

try:
    start_trace()
except Exception as e:
    print("trace ui unavailable on Colab — traces still recorded:", e)

import flow as _flow
importlib.reload(_flow)
agent = _flow.ResearchAssistant(model="gpt-4o-mini")

print("\n=== Single call ===")
print(json.dumps(agent(question="In one sentence, what is photosynthesis?"), indent=2))
print(json.dumps(agent(question="What is 21 * 19 ?"), indent=2))

data = [
    {"question": "What is the capital of France?",           "expected": "Paris"},
    {"question": "Chemical symbol for gold?",                "expected": "Au"},
    {"question": "Who wrote the play Hamlet?",               "expected": "Shakespeare"},
    {"question": "What is 12 * 11 ?",                        "expected": "132"},
    {"question": "Boiling point of water at sea level (C)?", "expected": "100"},
    {"question": "Largest planet in our solar system?",      "expected": "Jupiter"},
]
data_path = WORK_DIR / "data.jsonl"
data_path.write_text("\n".join(json.dumps(r) for r in data))

print("\n=== Batch run ===")
base_run = pf.run(
    flow=str(WORK_DIR / "flow.flex.yaml"),
    data=str(data_path),
    column_mapping={"question": "${data.question}"},
    stream=True,
)
print(pf.get_details(base_run))

We enable tracing to capture execution details and instantiate our research assistant flow. We test the system with individual queries to verify both natural language and arithmetic handling. We then prepare a dataset and run a batch job in Promptflow, collecting structured outputs for further evaluation.
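
Since pf.get_details returns a pandas DataFrame, the batch results are easy to slice and inspect. A short sketch; the inputs./outputs. column prefixes follow Promptflow's convention, but verify the names against details.columns for your own run:

# Sketch: inspect the batch run as a DataFrame (assumes base_run from above).
details = pf.get_details(base_run)
print(details.columns.tolist())

# Peek at each question/answer pair via the "inputs."/"outputs." columns.
for _, row in details.iterrows():
    print(row.get("inputs.question"), "->", str(row.get("outputs.answer"))[:80])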

(WORK_DIR / "judge.prompty").write_text("""---
name: Judge
model:
  api: chat
  configuration:
    type: openai
    connection: open_ai_connection
    model: gpt-4o-mini
  parameters:
    temperature: 0
    max_tokens: 150
    response_format: {type: json_object}
inputs:
  question: {type: string}
  answer:   {type: string}
  expected: {type: string}
---
system:
You are an exacting grader. Decide whether the assistant's answer contains the expected fact (case-insensitive, allowing reasonable phrasing/synonyms). Reply ONLY as JSON: {"score": 0 or 1, "reason": "..."}.

user:
Question: {{question}}
Expected: {{expected}}
Answer:   {{answer}}
""")

(WORK_DIR / "eval_flow.py").write_text(textwrap.dedent('''
    import json
    from pathlib import Path
    from promptflow.tracing import trace
    from promptflow.core import Prompty

    BASE = Path(__file__).parent

    class Evaluator:
        def __init__(self):
            self.judge = Prompty.load(source=BASE / "judge.prompty")

        @trace
        def __call__(self, question: str, answer: str, expected: str) -> dict:
            raw = self.judge(question=question, answer=answer, expected=expected)
            if isinstance(raw, str):
                try:
                    raw = json.loads(raw)
                except Exception:
                    raw = {"score": 0, "reason": f"unparseable:{raw[:80]}"}
            return {"score": int(raw.get("score", 0)), "reason": str(raw.get("reason", ""))}

        def __aggregate__(self, line_results):
            """Run-level aggregation. Whatever this returns shows up in pf.get_metrics()."""
            scores = [r["score"] for r in line_results if r]
            return {
                "accuracy": (sum(scores) / len(scores)) if scores else 0.0,
                "passed":   sum(scores),
                "total":    len(scores),
            }
'''))

(WORK_DIR / "eval.flex.yaml").write_text(
    "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json\n"
    "entry: eval_flow:Evaluator\n"
)

print("\n=== Evaluation run ===")
eval_run = pf.run(
    flow=str(WORK_DIR / "eval.flex.yaml"),
    data=str(data_path),
    run=base_run,
    column_mapping={
        "question": "${data.question}",
        "expected": "${data.expected}",
        "answer":   "${run.outputs.answer}",
    },
    stream=True,
)

eval_details = pf.get_details(eval_run)
print(eval_details)

print("\n=== Aggregated metrics (from __aggregate__) ===")
print(json.dumps(pf.get_metrics(eval_run), indent=2))

import pandas as pd
if "outputs.score" in eval_details.columns:
    s = pd.to_numeric(eval_details["outputs.score"], errors="coerce").fillna(0)
    print(f"Manual accuracy: {s.mean():.2%}  ({int(s.sum())}/{len(s)})")

We create a judging Prompty that evaluates model outputs against expected answers using structured JSON responses. We implement an evaluator class that parses results, computes scores, and defines an aggregation method for overall metrics. Finally, we run the evaluation pipeline, link it to the base run, and compute accuracy both through Promptflow's metrics and a manual pandas fallback.
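
Because the evaluator returns a reason alongside each score, the failing lines can be pulled out of eval_details for a quick error review. A sketch under the same column-name assumptions as the batch run above:

import pandas as pd

# Sketch: list only the lines the judge scored 0, with its stated reason.
scores = pd.to_numeric(eval_details["outputs.score"], errors="coerce").fillna(0)
for _, row in eval_details[scores == 0].iterrows():
    print("FAILED:", row.get("inputs.question"), "|", row.get("outputs.reason"))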

In conclusion, we built a robust, modular LLM pipeline that extends beyond basic prompt-response interactions. We integrated deterministic tools, structured prompting, and reusable flow components to create a system that is both transparent and scalable. Through batch execution and linked evaluation runs, we established a clear feedback loop that helps us measure performance using accuracy metrics and detailed reasoning. The inclusion of tracing and aggregation functions enables us to debug, monitor, and improve the system efficiently. Overall, this workflow demonstrates how we can design reliable, end-to-end LLM applications with strong foundations in structure, evaluation, and reproducibility.


Check out the FULL CODES here.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.