Source: MarkTechPost
In this tutorial, we implement an instrumented workflow for Microsoft SkillOpt. We set up the SkillOpt repository, connect it to an OpenAI-compatible endpoint, configure the optimizer and target models, and run the SearchQA optimization pipeline with a controlled sample limit to keep costs manageable. We first evaluate the original seed skill as a baseline, then run a real optimization loop in which SkillOpt improves the skill through rollout, reflection, aggregation, selection, updating, and validation-based gating. Along the way, we inspect the training history, visualize changes in accuracy, review edit-budget behavior, monitor cumulative token usage, and compare the evolved skill with the original baseline.
SkillOpt Environment Setup
import os, re, json, glob, subprocess, pathlib, difflib

# Load the key from Colab Secrets when available, otherwise from the environment.
try:
    from google.colab import userdata
    OPENAI_KEY = userdata.get("OPENAI_API_KEY")
except Exception:
    OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_KEY = OPENAI_KEY or "sk-PASTE-YOUR-KEY-HERE"
# Reject the placeholder too: it starts with "sk-" and would otherwise pass the check.
assert OPENAI_KEY.startswith("sk-") and "PASTE-YOUR-KEY" not in OPENAI_KEY, \
    "Set a real OpenAI key (Colab Secrets -> OPENAI_API_KEY)."

OPTIMIZER_MODEL = "gpt-4o"
TARGET_MODEL = "gpt-4o-mini"
RUN = "outputs/searchqa_adv"
LIMIT = 24
RUN_KNOBS = dict(num_epochs=2, batch_size=8, minibatch=4, merge_batch=4,
                 workers=2, lr=4, lr_sched="cosine", limit=LIMIT)

# Clone and install SkillOpt only if it is not already present in the Colab workspace.
if not pathlib.Path("/content/SkillOpt/scripts/train.py").exists():
    subprocess.run("git clone --depth 1 https://github.com/microsoft/SkillOpt.git",
                   shell=True, cwd="/content")
    subprocess.run('pip -q install -e . && pip -q install "openai>=1.0" pandas matplotlib',
                   shell=True, cwd="/content/SkillOpt")
os.chdir("/content/SkillOpt")

# Point the Azure OpenAI backend at the standard OpenAI endpoint.
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://api.openai.com/v1"
os.environ["AZURE_OPENAI_API_KEY"] = OPENAI_KEY
os.environ["AZURE_OPENAI_AUTH_MODE"] = "openai_compatible"

SPLIT = "data/searchqa_id_split"
CFG = "configs/searchqa/default.yaml"
COMMON = ["--azure_openai_endpoint", "https://api.openai.com/v1",
          "--cfg-options", "model.backend=azure_openai",
          "model.azure_openai_auth_mode=openai_compatible"]
We prepare the full Colab environment for running SkillOpt. We load the OpenAI API key, define the optimizer and target models, clone the SkillOpt repository, and install the required dependencies. We also point SkillOpt's azure_openai backend at the standard OpenAI endpoint, using the AZURE_OPENAI_* environment variables and the openai_compatible auth mode, so the SkillOpt scripts can communicate with the selected models.
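The setup cell relies on the fact that variables written to os.environ are inherited by child processes started afterwards, which is how the SkillOpt scripts launched through subprocess would see the endpoint and auth mode. The following is a minimal sketch of that mechanism only: a throwaway child interpreter stands in for scripts/train.py, nothing contacts the API, and how SkillOpt itself reads these variables is not shown here.

```python
import os
import subprocess
import sys

# Values copied from the tutorial's setup cell; no API call is made.
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://api.openai.com/v1"
os.environ["AZURE_OPENAI_AUTH_MODE"] = "openai_compatible"

# A child interpreter stands in for a SkillOpt script (assumption for illustration).
child_code = (
    "import os;"
    "print(os.environ['AZURE_OPENAI_ENDPOINT']);"
    "print(os.environ['AZURE_OPENAI_AUTH_MODE'])"
)
result = subprocess.run([sys.executable, "-c", child_code],
                        capture_output=True, text=True, check=True)

# Each printed value comes back on its own line.
seen_by_child = result.stdout.split()
print(seen_by_child)
```

Because inheritance happens at process start, the variables must be set before the first call that launches a SkillOpt script.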
Baseline Skill Evaluation
def run_cli(args, tag):
    # Stream the command's output live while also capturing it for parsing.
    print("\n" + "#"*80 + f"\n# {tag}\n# $ " + " ".join(args) + "\n" + "#"*80)
    p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    buf = []
    for line in p.stdout:
        print(line, end=""); buf.append(line)
    p.wait()
    return "".join(buf)

def parse_acc(txt):
    # Prefer the final "Results:" summary line; fall back to the last "hard=" value seen.
    m = re.search(r"Results:\s*hard=([\d.]+)\s+soft=([\d.]+)", txt)
    if m:
        return {"hard": float(m.group(1)), "soft": float(m.group(2))}
    g = re.findall(r"hard=([\d.]+)", txt)
    return {"hard": float(g[-1]), "soft": None} if g else None

seed = "skillopt/envs/searchqa/skills/initial.md"
if not pathlib.Path(seed).exists():
    # Fall back to a one-line skill if the environment's seed file is missing.
    seed = "baseline_skill.md"
    pathlib.Path(seed).write_text("You answer questions from the given context.\n")

base_out = run_cli(["python", "scripts/eval_only.py", "--config", CFG,
                    "--skill", seed, "--split", "valid_unseen", "--split_dir", SPLIT,
                    "--target_model", TARGET_MODEL, *COMMON,
                    "env.workers=1", f"env.limit={LIMIT}"],
                   "BASELINE EVAL (env seed skill, no training)")
base = parse_acc(base_out)
We define helper functions to run SkillOpt commands and extract evaluation accuracy from the output. We then locate the initial seed skill used by the SearchQA environment and evaluate it on the unseen validation split. This gives us a baseline result before any optimization or training takes place.
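To see what the accuracy parser returns in each case, here is a small standalone sketch that runs it on three made-up log strings. The log lines are hypothetical; their shape is inferred only from the regular expressions in the helper, not from real SkillOpt output.

```python
import re

def parse_acc(txt):
    # Same logic as the tutorial helper: summary line first, last "hard=" as fallback.
    m = re.search(r"Results:\s*hard=([\d.]+)\s+soft=([\d.]+)", txt)
    if m:
        return {"hard": float(m.group(1)), "soft": float(m.group(2))}
    g = re.findall(r"hard=([\d.]+)", txt)
    return {"hard": float(g[-1]), "soft": None} if g else None

# Hypothetical outputs, invented for illustration.
full_log = "evaluating 24 items...\nResults: hard=0.5000 soft=0.6250\n"
partial_log = "step 1 hard=0.25\nstep 2 hard=0.75\n"
crashed_log = "Traceback (most recent call last): ..."

summary = parse_acc(full_log)      # both metrics from the summary line
fallback = parse_acc(partial_log)  # last hard value only, soft is None
nothing = parse_acc(crashed_log)   # None when no score is present
print(summary, fallback, nothing)
```

Since the parser can return None, any code that reads base["hard"] should first check that a result was actually parsed, as the later cells do with `if base and trained`.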
Training And Visualization
k = RUN_KNOBS
train_out = run_cli(["python", "scripts/train.py", "--config", CFG, "--split_dir", SPLIT,
                     "--optimizer_model", OPTIMIZER_MODEL, "--target_model", TARGET_MODEL,
                     "--out_root", RUN, *COMMON,
                     "train.train_size=0",
                     f"train.num_epochs={k['num_epochs']}",
                     f"train.batch_size={k['batch_size']}",
                     f"gradient.minibatch_size={k['minibatch']}",
                     f"gradient.merge_batch_size={k['merge_batch']}",
                     f"gradient.analyst_workers={k['workers']}",
                     f"optimizer.learning_rate={k['lr']}",
                     f"optimizer.lr_scheduler={k['lr_sched']}",
                     "optimizer.use_slow_update=true",
                     "optimizer.use_meta_skill=true",
                     f"env.workers={k['workers']}",
                     f"env.limit={k['limit']}"],
                    "TRAIN (rollout->reflect->aggregate->select->update->gate; slow-update + meta-skill)")

import pandas as pd, matplotlib.pyplot as plt

hist = json.loads(pathlib.Path(f"{RUN}/history.json").read_text())
df = pd.json_normalize(hist)
print("\nhistory.json columns:", list(df.columns))

def col(*cands):
    # Return the first column whose lowercase name contains one of the candidates,
    # trying candidates in priority order.
    for c in cands:
        for actual in df.columns:
            if c in actual.lower():
                return actual
    return None

c_step = col("step")
x = df[c_step] if c_step else range(len(df))
c_tr, c_va = col("train_acc", "train_hard", "train"), col("val_acc", "val_hard", "valid", "val")
c_lr, c_tok = col("edit_budget", "lr", "learning_rate", "budget"), col("token", "cost")

fig, ax = plt.subplots(1, 3, figsize=(16, 4))

# Panel 1: train and validation accuracy against the seed-skill baseline.
if c_tr: ax[0].plot(x, df[c_tr], "o-", label="train acc")
if c_va: ax[0].plot(x, df[c_va], "s-", label="val acc (gate)")
if base and base["hard"] is not None:
    ax[0].axhline(base["hard"], ls="--", c="grey", label="baseline (seed)")
ax[0].set_title("Skill accuracy over steps"); ax[0].set_xlabel("step"); ax[0].legend(); ax[0].grid(alpha=.3)

# Panel 2: edit budget / learning-rate schedule.
if c_lr: ax[1].plot(x, df[c_lr], "d-", c="purple")
ax[1].set_title("Edit-budget / LR schedule (cosine)"); ax[1].set_xlabel("step"); ax[1].grid(alpha=.3)

# Panel 3: cumulative token usage.
if c_tok: ax[2].plot(x, pd.to_numeric(df[c_tok], errors="coerce").cumsum(), c="darkorange")
ax[2].set_title("Cumulative token usage"); ax[2].set_xlabel("step"); ax[2].grid(alpha=.3)

plt.tight_layout(); plt.savefig(f"{RUN}/training_dashboard.png", dpi=120); plt.show()
We run the main SkillOpt training loop with the selected optimizer and target models. We configure important training settings such as epochs, batch size, minibatch size, learning rate, slow update, meta-skill, and data limit. We then read history.json and plot accuracy, edit-budget behavior, and cumulative token usage on a three-panel dashboard. Because the exact column names in the history file are not known in advance, the col helper picks each column by substring match.
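The substring-based column lookup used for the dashboard is convenient but can pick the wrong column when names overlap. The sketch below reproduces the same matching logic on plain lists so it runs without pandas; the function name pick_column and every column name are hypothetical, chosen only to illustrate the behavior, and do not reflect the real history.json schema.

```python
def pick_column(columns, *candidates):
    # Candidates are tried in priority order; the first column whose
    # lowercase name contains the candidate wins.
    for cand in candidates:
        for actual in columns:
            if cand in actual.lower():
                return actual
    return None

# Hypothetical flattened column names (not the real SkillOpt schema).
columns = ["step", "metrics.train_hard", "metrics.val_hard",
           "optimizer.edit_budget", "usage.total_tokens"]

train_col = pick_column(columns, "train_acc", "train_hard", "train")
val_col = pick_column(columns, "val_acc", "val_hard", "valid", "val")
budget_col = pick_column(columns, "edit_budget", "lr", "learning_rate", "budget")
token_col = pick_column(columns, "token", "cost")
print(train_col, val_col, budget_col, token_col)

# Pitfall: with no accuracy column present, the loose "train" candidate
# falls through to an unrelated column.
sparse = ["step", "train_tokens", "val_hard"]
wrong_col = pick_column(sparse, "train_acc", "train_hard", "train")
print(wrong_col)
```

Printing the available columns first, as the tutorial does, makes it easy to confirm that each panel is plotting the intended series.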
Inspecting Skill Evolution
snaps = sorted(glob.glob(f"{RUN}/skills/skill_v*.md"))
best = pathlib.Path(f"{RUN}/best_skill.md").read_text()
print("\n" + "="*80 + f"\nSKILL EVOLUTION: {len(snaps)} snapshots; diff v0 -> best_skill\n" + "="*80)
if snaps:
    # Unified diff between the first snapshot and the final best skill.
    diff = difflib.unified_diff(pathlib.Path(snaps[0]).read_text().splitlines(),
                                best.splitlines(),
                                snaps[0].split('/')[-1], "best_skill.md", lineterm="")
    print("\n".join(list(diff)[:120]) or "(no textual diff captured)")

# Everything from the first SLOW_UPDATE marker to the end of the skill file.
prot = re.search(r"(SLOW_UPDATE.*?)$", best, re.S)
print("\n--- protected SLOW_UPDATE block ---\n",
      prot.group(1)[:1500] if prot else "(none; appears after an epoch boundary)")

patch = (sorted(glob.glob(f"{RUN}/steps/step_*/patches/*.json")) or [None])[0]
analy = (sorted(glob.glob(f"{RUN}/steps/step_*/analysis/*")) or [None])[0]
print("\n" + "="*80 + "\nTEXTUAL GRADIENT: one aggregated patch (clipped to edit budget):\n" + "="*80)
print(pathlib.Path(patch).read_text()[:1500] if patch else "(no patch files)")
print("\n--- one raw Reflect-stage analysis ---\n",
      pathlib.Path(analy).read_text()[:1000] if analy else "(no analysis files)")

for name in ("slow_update", "meta_skill"):
    files = sorted(glob.glob(f"{RUN}/{name}/epoch_*/*"))
    print(f"\n[{name}] {len(files)} artifact(s):", [pathlib.Path(f).name for f in files[:6]])
We inspect how the skill evolves during the optimization process. We compare the first saved skill snapshot with the final best skill, check whether a protected slow-update block appears, and review one generated patch and one reflection analysis. We also list the slow-update and meta-skill artifacts created during epoch-level training.
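For readers unfamiliar with difflib, the snapshot comparison can be tried in isolation. This sketch diffs two invented skill texts with the same unified_diff call shape; the skill contents are hypothetical and stand in for skill_v0.md and best_skill.md.

```python
import difflib

# Invented skill texts standing in for the first snapshot and the best skill.
old_skill = "You answer questions from the given context.\nBe concise.\n"
new_skill = ("You answer questions from the given context.\n"
             "Quote the supporting passage before answering.\n"
             "Be concise.\n")

# lineterm="" matches the inputs, which have no trailing newlines after splitlines().
diff_lines = list(difflib.unified_diff(old_skill.splitlines(),
                                       new_skill.splitlines(),
                                       "skill_v0.md", "best_skill.md",
                                       lineterm=""))
print("\n".join(diff_lines))

# Lines the optimizer added: "+" prefix, excluding the "+++" file header.
added = [ln for ln in diff_lines if ln.startswith("+") and not ln.startswith("+++")]
```

Filtering on the "+" and "-" prefixes gives a quick count of how much text each optimization run added or removed.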
Final Evaluation Comparison
best_out = run_cli(["python", "scripts/eval_only.py", "--config", CFG,
                    "--skill", f"{RUN}/best_skill.md", "--split", "valid_unseen", "--split_dir", SPLIT,
                    "--target_model", TARGET_MODEL, *COMMON,
                    "env.workers=1", f"env.limit={LIMIT}"],
                   "FINAL TEST EVAL (best_skill)")
trained = parse_acc(best_out)

print("\n" + "="*80 + "\nRESULT (hard = exact match, the gated metric)\n" + "="*80)
print(f"baseline seed skill : {base}")
print(f"trained best_skill : {trained}")
if base and trained:
    print(f"hard-match lift : {trained['hard'] - base['hard']:+.4f}")
print(f"\nDeployable artifact: {RUN}/best_skill.md ({len(best)} chars)")
We evaluate the final optimized best_skill.md file on the same unseen validation split and sample limit used for the baseline. We compare the trained skill’s hard-match (exact-match) score with the baseline score to measure the change, then print the lift and the path to the deployable optimized skill artifact.
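With a sample limit of 24, each evaluated example moves accuracy by about 4.2 points, so it helps to read the lift as a count of examples. The sketch below does that arithmetic; the function name lift_report and the two scores are hypothetical, chosen only for illustration, and it assumes env.limit caps the number of evaluated examples at LIMIT.

```python
def lift_report(base_hard, trained_hard, n_examples):
    # Express the accuracy difference as a net number of examples.
    lift = trained_hard - base_hard
    return {
        "lift": lift,
        "net_examples": round(lift * n_examples),
        "one_example": 1.0 / n_examples,  # accuracy change from a single example
    }

LIMIT = 24
# Hypothetical scores: 12 of 24 correct before training, 14 of 24 after.
report = lift_report(12 / LIMIT, 14 / LIMIT, LIMIT)
print(f"hard-match lift : {report['lift']:+.4f}")
print(f"net examples    : {report['net_examples']:+d}")
print(f"one example     : {report['one_example']:.4f}")
```

A lift that amounts to one or two examples is hard to distinguish from run-to-run noise, so raising LIMIT is worth the extra cost when the comparison matters.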
Conclusion
We built a complete SkillOpt experiment that goes beyond simply launching a training command. We measured the baseline seed skill, optimized it using a stronger model as the optimizer and a smaller model as the target agent, and inspected how the skill evolved across training steps through saved snapshots, patches, reflections, slow updates, and meta-skill artifacts. We also generated a training dashboard that shows whether the optimization process is improving performance and how much token usage accumulates during the run. By the end, we have a deployable best_skill.md file, a final evaluation on the unseen validation split, and a clear comparison between the original and optimized skills.
Sana Hassan
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


