<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jcatanza.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jcatanza.github.io/" rel="alternate" type="text/html" /><updated>2026-06-15T01:18:32+00:00</updated><id>https://jcatanza.github.io/feed.xml</id><title type="html">Joseph Catanzarite</title><subtitle>AI/ML researcher and Johns Hopkins graduate student in AI. Projects in generative AI, LLMs, and machine learning.</subtitle><author><name>Joseph Catanzarite</name><email>jcat@alumni.caltech.edu</email></author><entry><title type="html">Professor.Claude.AI: An Autonomous Self-Teaching Research Agent</title><link href="https://jcatanza.github.io/generative-ai/agents/2026/06/14/professor-claude-ai.html" rel="alternate" type="text/html" title="Professor.Claude.AI: An Autonomous Self-Teaching Research Agent" /><published>2026-06-14T00:00:00+00:00</published><updated>2026-06-14T00:00:00+00:00</updated><id>https://jcatanza.github.io/generative-ai/agents/2026/06/14/professor-claude-ai</id><content type="html" xml:base="https://jcatanza.github.io/generative-ai/agents/2026/06/14/professor-claude-ai.html"><![CDATA[<p>Professor.Claude.AI is an autonomous, self-teaching research agent that runs every night so its human collaborator can stay at the frontier of AI research with roughly an hour of focused reading a day. Each cycle it ingests new arXiv papers in the agents-and-harnesses subfield, triages them — most are skipped, a few skimmed, the best one or two earn deep attention — then deep-reads those via Claude into structured analyses with provenance, writes the results to a persistent memory layer, and emits a digest both as email and as a markdown commit to a private repo. 
It is built on <strong>LangGraph</strong> using the <strong>Supervisor pattern</strong>: a nightly coordinator dispatches to four sub-agents (ingestion, triage, deep-read, synthesis). The current release is <strong>v0.1 — “the spine”</strong>: a narrow but genuinely end-to-end pipeline, not a finished product. The point of v0.1 was to prove the whole loop runs unattended before adding breadth.</p>

<h2 id="overview">Overview</h2>

<p>The problem this targets is throughput, not capability: the AI literature moves faster than any one person can track, and most papers don’t warrant deep reading. Professor.Claude.AI automates the funnel — wide ingestion, aggressive triage, selective deep reading — so human attention is spent only where it pays off. A <a href="https://github.com/jcatanza/professor-claude-ai/blob/main/docs/sample-digest-2026-05-07.md">sample digest from the first production run</a> shows the end-to-end output.</p>

<h2 id="what-it-does">What it does</h2>

<p>Each nightly cycle:</p>

<ol>
  <li><strong>Ingests</strong> new arXiv papers in the agents-and-harnesses subfield</li>
  <li><strong>Triages</strong> them — keyword filter plus a “you-model” and a taste model — so most are dropped and only the top 1–2 are promoted</li>
  <li><strong>Deep-reads</strong> the survivors via Claude, producing structured analyses with provenance</li>
  <li><strong>Writes</strong> to a persistent memory layer (short-term checkpoints plus a long-term semantic store)</li>
  <li><strong>Emits</strong> a daily digest by email and as a markdown commit to a private GitHub repo</li>
</ol>
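<p>The funnel above can be sketched in a few lines of plain Python. This is an illustration only, not the project's code: the <code>Paper</code> record, the <code>score</code> field (standing in for whatever the you-model and taste model produce), the threshold, and the cap of two promotions are all assumptions made for the example.</p>

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    score: float  # hypothetical combined you-model / taste-model score in [0, 1]


def triage(papers, keywords, threshold=0.5, max_deep_reads=2):
    """Keyword filter, then score threshold, then a hard cap on promotions."""
    lowered = [k.lower() for k in keywords]
    # Stage 1: cheap keyword filter on the title.
    matched = [p for p in papers if any(k in p.title.lower() for k in lowered)]
    # Stage 2: keep only papers whose score clears the threshold.
    scored = [p for p in matched if p.score >= threshold]
    # Stage 3: promote at most max_deep_reads papers, best score first.
    ranked = sorted(scored, key=lambda p: p.score, reverse=True)
    return ranked[:max_deep_reads]


papers = [
    Paper("A harness for tool-using agents", 0.91),
    Paper("Agents that plan with memory", 0.74),
    Paper("Multi-agent debate, revisited", 0.62),
    Paper("Agent benchmarks are noisy", 0.31),
    Paper("A new optimizer for vision models", 0.88),
]
promoted = triage(papers, keywords=["agent", "harness"])
print([p.title for p in promoted])
```

<p>The cap is applied last on purpose: however generous the earlier stages are on a busy night, the number of deep-reads cannot grow past it.</p>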

<h2 id="architecture-at-a-glance">Architecture at a glance</h2>

<p>The Supervisor pattern (from Manning’s <em>AI Agents and Applications</em>, Roberto Infante, 2026) puts a coordinator in charge of routing work to specialist sub-agents:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Nightly Coordinator (Supervisor)
    ├── Ingestion Agent   (arXiv, RSS, HF Daily)
    ├── Triage Agent      (keyword filter + you-model + taste model)
    ├── Deep-Read Agent   (ReAct + ToolNode, structured analysis)
    └── Synthesis Agent   (digest formatting, email + repo commit)
</code></pre></div></div>

<p>State is deliberately layered by lifetime:</p>

<ul>
  <li><strong>Short-term</strong> — LangGraph <code class="language-plaintext highlighter-rouge">SqliteSaver</code> checkpointer holds within-run state, and checkpoints written after every node let a failed run resume from its last good step on the next invocation.</li>
  <li><strong>Long-term</strong> — ChromaDB provides vector RAG, DuckDB holds an episodic event log, and a four-layer Smallville-inspired memory model sits on top (v0.3+).</li>
</ul>
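<p>The resume-from-last-good-step idea can be illustrated without any framework. The sketch below is not LangGraph's API and not the project's code; it uses only Python's standard <code>sqlite3</code> module, and the table layout, function name, and the two toy nodes are hypothetical.</p>

```python
import json
import sqlite3


def run_with_checkpoints(conn, run_id, nodes, initial_state):
    """Run nodes in order, saving state after each; resume after the last saved step."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT, step INTEGER, state TEXT, PRIMARY KEY (run_id, step))"
    )
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE run_id = ? "
        "ORDER BY step DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    if row is None:
        start, state = 0, dict(initial_state)
    else:
        start, state = row[0] + 1, json.loads(row[1])
    for step in range(start, len(nodes)):
        state = nodes[step](state)  # may raise; earlier checkpoints survive
        conn.execute(
            "INSERT INTO checkpoints VALUES (?, ?, ?)",
            (run_id, step, json.dumps(state)),
        )
        conn.commit()
    return state


calls = []
failures_left = {"triage": 1}  # make the hypothetical triage node fail once


def ingest(state):
    calls.append("ingest")
    return {**state, "papers": 3}


def triage(state):
    if failures_left["triage"] > 0:
        failures_left["triage"] -= 1
        raise RuntimeError("simulated outage")
    calls.append("triage")
    return {**state, "kept": 1}


conn = sqlite3.connect(":memory:")
try:
    run_with_checkpoints(conn, "2026-06-14", [ingest, triage], {})
except RuntimeError:
    pass  # the first invocation dies in triage
final = run_with_checkpoints(conn, "2026-06-14", [ingest, triage], {})
print(calls, final)
```

<p>On the second invocation the ingest node is not run again: its output was checkpointed, so work picks up at the node that failed.</p>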

<h2 id="anti-failure-mechanisms">Anti-failure mechanisms</h2>

<p>An agent that reads on your behalf can fail in quiet, compounding ways. The design names five and guards against each:</p>

<ol>
  <li><strong>Information overload</strong> — a hard cap on deep-reads per run, with tunable triage thresholds.</li>
  <li><strong>Hallucinated claims</strong> — provenance everywhere, calibrated confidence, a verification pass on numerical claims, and honest uncertainty.</li>
  <li><strong>Echo chamber</strong> — periodic “outside view” sweeps and topic-diversity metrics.</li>
  <li><strong>Stale interests</strong> — a field-shift detector and quarterly recalibration of the taste model.</li>
  <li><strong>Over-engineered and unused</strong> — a spine-first build, plus a weekly check that the digests are actually being read.</li>
</ol>

<h2 id="status">Status</h2>

<p><strong>v0.1 — the spine.</strong> End-to-end working pipeline, narrow but real. The nightly workflow was <strong>paused May 11 – June 11, 2026</strong> during a high-coursework stretch and re-enabled afterward. The full v0.3 design lives in private project notes and will be sanitized for a future public revision; architecture decisions are tracked in the repo’s ADRs and decision log.</p>

<h2 id="links">Links</h2>

<ul>
  <li><a href="https://github.com/jcatanza/professor-claude-ai">GitHub repository</a></li>
  <li><a href="https://github.com/jcatanza/professor-claude-ai/blob/main/docs/sample-digest-2026-05-07.md">Sample digest (first production run)</a></li>
  <li><strong>Stack:</strong> Python 3.12+, LangGraph, ChromaDB, DuckDB, LangSmith tracing, GitHub Actions</li>
  <li><strong>License:</strong> MIT</li>
</ul>

<p><strong>Updated:</strong> June 14, 2026</p>]]></content><author><name>Joseph Catanzarite</name><email>jcat@alumni.caltech.edu</email></author><category term="generative-ai" /><category term="agents" /><summary type="html"><![CDATA[A nightly multi-agent system that ingests, triages, deep-reads, and synthesizes new AI research — keeping its human collaborator at the frontier with about an hour of focused review per day. Built on LangGraph with a Supervisor pattern. v0.1, the spine: narrow but genuinely end-to-end.]]></summary></entry><entry><title type="html">Generative Augmentation in Sparse Data Regimes: A Controlled Factorial Study in Chinese Character Classification</title><link href="https://jcatanza.github.io/generative-ai/research/2026/06/12/generative-augmentation.html" rel="alternate" type="text/html" title="Generative Augmentation in Sparse Data Regimes: A Controlled Factorial Study in Chinese Character Classification" /><published>2026-06-12T00:00:00+00:00</published><updated>2026-06-12T00:00:00+00:00</updated><id>https://jcatanza.github.io/generative-ai/research/2026/06/12/generative-augmentation</id><content type="html" xml:base="https://jcatanza.github.io/generative-ai/research/2026/06/12/generative-augmentation.html"><![CDATA[<h2 id="teaching-a-machine-to-read-chinese">Teaching a machine to read Chinese</h2>

<p>The Chinese writing system has tens of thousands of characters. Each one is a small, precise arrangement of strokes. To teach a computer to recognize even a single character, you need many labeled examples of it: handwritten, varied, each tagged with the right answer.</p>

<p>That is the catch. Collecting and labeling those examples is slow and expensive. And when a model has too few of them, it does not learn the character. It memorizes. It learns the exact pixels in the handful of images it was shown, then falls apart the moment it sees a new one. This is called overfitting. The model is brilliant on the practice test and helpless on the real thing.</p>

<h2 id="an-idea-let-the-model-invent-its-own-examples">An idea: let the model invent its own examples</h2>

<p>There is a standard fix for too little data. You make more of it.</p>

<p>The simple version rotates, flips, and crops the images you already have. It helps. But it is bounded. Every new image is just a warped copy of an old one.</p>

<p>The ambitious version is stranger. You train a second model to learn what the characters <em>look like</em>, the shapes and proportions and the logic of the strokes. Then you ask it to draw new ones. Not copies. New characters it has never seen, but that look like they belong.</p>

<p>The model that does the drawing is a GAN, a generative adversarial network. Picture two networks locked in a game. One is a forger, painting fake characters. The other is a detective, trying to tell the fakes from the real ones. They train against each other. The forger keeps losing but slowly improves until, finally, its fakes are good enough to fool the detective. By then the forger has learned to produce convincing characters on demand. The exact version I used is a Wasserstein GAN with a gradient penalty. That is just a flavor that trains far more steadily than the original. I will call the forger the generator, and I will call the detective the critic. Remember the critic. It comes back later.</p>
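<p>For readers who want the game in symbols, here is the critic's objective in the standard Wasserstein-GAN-with-gradient-penalty formulation. The notation is the usual one for that recipe rather than anything defined in this post: D is the critic, x a real image drawn from the data distribution P_r, x̃ a generated image drawn from P_g, x̂ a random point on the line between a real and a generated image, and λ the penalty weight.</p>

```latex
L_{\text{critic}} =
  \mathbb{E}_{\tilde{x} \sim P_g}\big[ D(\tilde{x}) \big]
  - \mathbb{E}_{x \sim P_r}\big[ D(x) \big]
  + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
    \Big[ \big( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \big)^2 \Big]
```

<p>The first two terms are the game itself: the critic is rewarded for scoring real images above fakes. The last term is the gradient penalty, which pulls the norm of the critic's gradient toward 1 and is the part that steadies training. The generator, for its part, is trained to raise the critic's score on its own output.</p>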

<h2 id="a-first-look-and-a-trap">A first look, and a trap</h2>

<p>To test the idea I used Chinese-MNIST, a Chinese-character version of the classic MNIST (Modified National Institute of Standards and Technology) handwritten-digit benchmark. It has fifteen characters: the digits zero through nine plus five magnitude characters like <em>ten</em> and <em>thousand</em>, with a thousand small grayscale images each. To simulate scarcity, I held most of them back and trained on a thin slice per class. The classifier I trained on that slice was a plain convolutional network. Nothing fancy.</p>

<p>I started with two experiments.</p>

<p>In the first, I gave the classifier fifty real images per class to train with. It reached 88.5% accuracy. Then I mixed in synthetic characters from the generator, raw and uncurated. Accuracy slipped a point or two. The fakes had made it worse.</p>

<p>In the second, I starved the classifier: I gave it only twenty-five real images per class to train on, half as many examples as before. This time, before adding the synthetic images, I filtered them. I used the critic as a judge, kept only its top-rated fakes, and threw the rest away. Now accuracy jumped, from 71.8% up to around 80%. The filtered fakes had helped, and helped a lot.</p>

<p>Put the two together and the moral seems obvious. Raw synthetic data hurts. Filtered synthetic data helps. The filter is the magic. Curate your fakes and augmentation pays off.</p>

<p>It is a tidy story. It is also wrong. And the way it is wrong is the interesting part.</p>

<h2 id="what-changed-twice">What changed twice</h2>

<p>Look again at what I did. Between the first experiment and the second, I changed two things at the same time.</p>

<p>The first had plenty of data and no filter. The second had scarce data and a filter. So when the result flipped, from “synthetic data hurts” to “synthetic data helps,” what caused the flip? The filter I added? Or the scarcity I imposed? The two moved together, in lockstep. There is no way to separate them after the fact.</p>

<p>This is a confound, and it is the oldest trap in experimental design. When two things change at once, you cannot hand the credit to either one.</p>

<p>Maybe the filter was doing all the work. Or maybe, and this is the possibility I had not taken seriously, the filter was doing nothing. Maybe augmentation helped in the second experiment for a much duller reason: a starved classifier has room to grow, and a well-fed one does not. From those two experiments alone, both explanations fit the numbers perfectly. The data could not tell them apart.</p>

<p>There is only one cure. Change one thing at a time.</p>

<h2 id="changing-one-thing-at-a-time">Changing one thing at a time</h2>

<p>So I built the full grid and ran every combination.</p>

<p>There are three knobs. How much real data: twenty-five or fifty images per class. Whether I filter the synthetic data or not. And how much synthetic data I mix in: one fake per real image, or four. Every combination, plus a plain baseline at each data level with no synthetic data at all.</p>

<p>One rule makes the whole thing honest. The filtered and unfiltered fakes were drawn from the <em>same</em> pool of generated images. The only difference was whether I kept the critic’s top picks or a random handful. Same generator, same pool, one knob moving at a time.</p>
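<p>The rule above is easy to state in code. The sketch below is a minimal illustration, not the notebook's implementation: the image ids and critic scores are random stand-ins, and the function names are made up for the example. Only the pool size of 800 per class comes from the experiment itself.</p>

```python
import random


def select_filtered(pool, scores, k):
    """Keep the k images the critic scored highest."""
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in order[:k]]


def select_unfiltered(pool, k, rng):
    """Draw k images uniformly at random from the same pool."""
    return rng.sample(pool, k)


rng = random.Random(0)
pool = [f"img_{i:03d}" for i in range(800)]   # stand-in image ids for one class
scores = [rng.gauss(0.0, 1.0) for _ in pool]  # stand-in critic scores

k = 25  # e.g. 25 real images per class at a 1:1 ratio
filtered = select_filtered(pool, scores, k)
unfiltered = select_unfiltered(pool, k, rng)
print(len(filtered), len(unfiltered))
```

<p>Both selections read from the same <code>pool</code>, so the only thing that differs between the two arms is the selection rule.</p>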

<p>This is a factorial design, and breaking confounds apart is exactly what it is for. Now I could ask two clean questions instead of one muddled one. Hold the filter fixed and vary the data: what does scarcity do? Hold the data fixed and vary the filter: what does the filter do?</p>
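<p>As a quick check on the bookkeeping, the grid can be enumerated directly. This is a sketch of the design as described, with variable names chosen for the example; it counts conditions and does no training.</p>

```python
from itertools import product

real_per_class = [25, 50]
filtering = ["unfiltered", "filtered"]
synthetic_ratio = [1, 4]  # synthetic images per real image

conditions = []
for n_real in real_per_class:
    # One plain baseline per data level, with no synthetic data.
    conditions.append((n_real, "baseline", 0))
    for mode, ratio in product(filtering, synthetic_ratio):
        conditions.append((n_real, mode, ratio))

for n_real, mode, ratio in conditions:
    print(f"{n_real} real/class, {mode.ljust(10)} ratio {ratio}:1, "
          f"{n_real * ratio} synthetic/class")
print(len(conditions), "conditions")
```

<p>Two data levels times two filter settings times two ratios gives eight augmented conditions; the two baselines bring the total to ten.</p>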

<p>Here is the full ledger.</p>

<table>
  <thead>
    <tr>
      <th>Real images / class</th>
      <th>Condition</th>
      <th>Accuracy (%)</th>
      <th>Change vs. baseline (pp)</th>
      <th>p-value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>25</strong></td>
      <td>Baseline (no synthetic)</td>
      <td>71.77</td>
      <td>—</td>
      <td>—</td>
    </tr>
    <tr>
      <td> </td>
      <td>Unfiltered, 1:1</td>
      <td>77.52</td>
      <td>+5.75</td>
      <td>0.005</td>
    </tr>
    <tr>
      <td> </td>
      <td>Unfiltered, 4:1</td>
      <td>78.79</td>
      <td>+7.01</td>
      <td>0.003</td>
    </tr>
    <tr>
      <td> </td>
      <td>Filtered, 1:1</td>
      <td>77.88</td>
      <td>+6.11</td>
      <td>0.007</td>
    </tr>
    <tr>
      <td> </td>
      <td>Filtered, 4:1</td>
      <td>79.01</td>
      <td>+7.24</td>
      <td>0.004</td>
    </tr>
    <tr>
      <td><strong>50</strong></td>
      <td>Baseline (no synthetic)</td>
      <td>88.49</td>
      <td>—</td>
      <td>—</td>
    </tr>
    <tr>
      <td> </td>
      <td>Unfiltered, 1:1</td>
      <td>87.05</td>
      <td>−1.44</td>
      <td>0.049</td>
    </tr>
    <tr>
      <td> </td>
      <td>Unfiltered, 4:1</td>
      <td>86.35</td>
      <td>−2.15</td>
      <td>0.009</td>
    </tr>
    <tr>
      <td> </td>
      <td>Filtered, 1:1</td>
      <td>87.60</td>
      <td>−0.89</td>
      <td>0.222</td>
    </tr>
    <tr>
      <td> </td>
      <td>Filtered, 4:1</td>
      <td>86.48</td>
      <td>−2.01</td>
      <td>0.001</td>
    </tr>
  </tbody>
</table>

<p>The controlled re-run reproduced both baselines exactly, 88.5% and 71.8%, which is a small but reassuring sign that the whole thing is repeatable.</p>

<p>Read the ledger in two blocks. The bottom block is the well-fed classifier, fifty images per class. Its baseline is already high. Look what synthetic data does to it: every single row is negative. Filtered or not, one-to-one or four-to-one, the fakes drag it down. A classifier with enough real data does not want your inventions.</p>

<p>The top block is the starved classifier, twenty-five per class. Its baseline accuracy is a shaky 71.8%. Here every row is positive. Synthetic data lifts it by six or seven points, and the more you add, the higher it climbs. The best case buys back 7.2 of the 16.7 points I had lost by halving the data. That is about 43% of the damage undone, without having to add any new real data!</p>
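<p>The arithmetic behind that 43% is short enough to spell out. The three accuracies below are copied from the ledger; the variable names are mine.</p>

```python
# Mean accuracies (%) copied from the ledger above.
baseline_50 = 88.49
baseline_25 = 71.77
best_augmented_25 = 79.01  # filtered, 4:1

lost = baseline_50 - baseline_25             # cost of halving the real data
recovered = best_augmented_25 - baseline_25  # gain from the best augmentation
fraction = recovered / lost

print(f"lost {lost:.2f} pp, recovered {recovered:.2f} pp, fraction {fraction:.1%}")
```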

<p>So the first question answers itself, and loudly. Whether augmentation helps or hurts is decided almost entirely by one thing: are you data-starved or not? Starved, it helps. Comfortable, it hurts.</p>

<h2 id="the-filter-i-was-sure-about">The filter I was sure about</h2>

<p>Now the second question. The one I had built the whole project around. Did the filter help?</p>

<p>To ask it cleanly, you compare filtered against unfiltered at the <em>same</em> data level and the <em>same</em> ratio. That holds everything else still and lets the filter be the only thing moving. There are four such comparisons. Here is what the filter bought in each:</p>

<ul>
  <li>Fifty per class, one-to-one: filtered won by 0.55 points.</li>
  <li>Fifty per class, four-to-one: by 0.13 points.</li>
  <li>Twenty-five per class, one-to-one: by 0.36 points.</li>
  <li>Twenty-five per class, four-to-one: by 0.23 points.</li>
</ul>

<p>The largest gap is barely half a point. Set that against the six- and seven-point swings from scarcity, and it is a rounding error.</p>

<p>So I went back to the images, expecting to find the explanation there. I expected the filtered set to look clearly better than the random one. It does not. The critic’s top-scored characters are only a little cleaner than a random handful from the same pool. And neither set looks good. Both are ratty: broken strokes, smears, characters that wobble between two shapes. The generator simply is not strong enough, and skimming off its best few percent does not rescue much. The “good” pile and the “whatever” pile look almost the same, and both look rough.</p>

<figure>
<img src="/assets/images/filtered-vs-unfiltered.png" alt="Filtered (top row) versus unfiltered (bottom row) synthetic characters from the same pool; the two rows look similar and rough" />
<figcaption>The same raw pool, sorted two ways, for five of the fifteen classes. The top row keeps the critic's highest-scored picks; the bottom row is a random draw. For a filter meant to separate good from bad, the two rows are remarkably hard to tell apart.</figcaption>
</figure>

<p>Once you see that, the weak result stops being mysterious. A filter that barely changes the images can only barely change the outcome. I was not watching high-quality data fail to help. I was watching mediocre data get swapped for slightly-less-mediocre data, and getting, predictably, a barely different result.</p>

<p>One number in that list still deserves a second look, because it hides a trap of its own. The first comparison, fifty per class at one-to-one, came in at p = 0.034. By the usual convention, anything under 0.05 counts as a real effect. On its face, this says the filter <em>did</em> help there. Should I believe it?</p>

<p>No, and the reason is worth understanding. I did not run one comparison. I ran four, and then picked out the smallest p-value. Giving yourself four tries and reporting the best one is like buying four lottery tickets and being amazed that one of them won. With enough tickets, something always hits, even when every ticket is a long shot. So when you make several comparisons, you have to raise the bar for what counts as real.</p>

<p>The Holm–Bonferroni correction is one standard way to raise it. The idea is plain. Rank the p-values from smallest to largest and hold the smallest to the strictest bar: your threshold divided by the number of comparisons. With four comparisons, the old 0.05 bar becomes 0.0125 for the smallest p-value. My p = 0.034 clears the old bar easily and slams into the new one. It does not survive, and because the procedure stops at the first failure, nothing behind it can pass either. Once you account for the fact that I went fishing through four results, that lone “significant” effect dissolves.</p>
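<p>In full, Holm's procedure is a step-down rule: the smallest p-value is tested against the threshold divided by the number of comparisons, the next against the threshold divided by one fewer, and so on, stopping at the first failure. A minimal sketch, with a function name of my own choosing, applied to the four paired-<em>t</em> p-values from this study (0.034 is quoted above; the other three are listed in the appendix table):</p>

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/keep decision per hypothesis under Holm's step-down rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        threshold = alpha / (m - rank)  # alpha/m, then alpha/(m-1), ...
        if p_values[i] > threshold:
            break  # the first failure stops the procedure; the rest are kept
        reject[i] = True
    return reject


# Order: 50/class 1:1, 50/class 4:1, 25/class 1:1, 25/class 4:1.
p_values = [0.034, 0.78, 0.55, 0.64]
print(holm_bonferroni(p_values))  # [False, False, False, False]
```

<p>The smallest value, 0.034, is compared with 0.05 / 4 = 0.0125 and fails, so the loop exits at once and no comparison is declared significant.</p>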

<p>So here is the honest verdict, and it is sharper than “the filter does nothing.”</p>

<p>Per condition, the filter is invisible. Three of the four filtered-versus-unfiltered comparisons are nowhere near significant on their own, and the one that looked significant does not survive the correction above.</p>

<p>But the per-condition view throws one thing away: direction. All four comparisons favor filtering. So does every one of the five seeds, once you average its filter effect across the four conditions. That much agreement is itself evidence. Pool the four comparisons properly and the consistency becomes a number: a small but real positive effect of about a third of a percentage point, every seed pointing the same way, sitting right on the edge of statistical significance.</p>

<p>So the filter is not doing nothing. It is doing almost nothing: a genuine, repeatable nudge, far too small to matter beside the six- or seven-point swing from scarcity. And the reason it is so small is the one the pictures already gave away. Stage one barely improved the images, so it could only barely improve the result. And there is something satisfying in that. The effect came out exactly as small as its cause.</p>

<p>The pooled analysis, including what “properly” means and the borderline p-value it lands on, is in the appendix below.</p>

<h2 id="where-the-hope-is">Where the hope is</h2>

<p>This is exactly why I keep saying <em>stage one</em>.</p>

<p>The filter I designed has two stages, and I only built the first. Stage one is the critic: rank the fakes by how real they look, keep the top ones. We just saw what it does, which is to say, not much, because its top picks are not much better than the rest.</p>

<p>Stage two is a different kind of check, and a stricter one. Instead of trusting the critic’s single realism score, it compares each synthetic character directly against real characters of the same class. It uses two perceptual measures. The first, the Structural Similarity Index Measure (SSIM), asks how closely the structure of two images matches. The second, Learned Perceptual Image Patch Similarity (LPIPS), asks how close they sit in the feature space of a pretrained network, weighted to agree with human judgments of similarity. Stage two is built to throw out the images that pass the critic but are subtly the wrong shape: plausible at a glance, wrong on inspection.</p>
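<p>Stage two was never built, so nothing here describes an implementation. For reference only, this is the usual textbook definition of SSIM between two image patches x and y. The symbols are the standard ones, not anything defined in this post: μ is a patch mean, σ² a patch variance, σ<sub>xy</sub> the covariance between the two patches, and C₁, C₂ small constants that keep the ratio stable.</p>

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
```

<p>The score reaches 1 only when the two patches are identical, which is what would make it usable as a per-class shape check against real characters.</p>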

<p>And that is the hopeful part. Stage one helped only negligibly, because it barely sorted good from bad. A filter that <em>actually</em> separates them, strict enough to leave you with characters that look genuinely right rather than just least-wrong, is a different experiment, and an untested one. The door on quality filtering is not closed. I just have not yet built the filter that would knock on it properly.</p>

<h2 id="what-it-means">What it means</h2>

<p>Two things come out of this.</p>

<p>The practical one is for anyone working with scarce data, including the real prize behind all of this: fields like medical imaging, where every label is hard-won. Augmentation is not a free upgrade you sprinkle on everything. It depends entirely on your regime. If you are truly starved, augment. And for now, do not bother filtering, because the simple filter does not earn its keep. If you already have enough data, synthetic data will quietly cost you. The first question is not <em>how should I curate my fakes</em>. It is <em>am I actually starved</em>. Only if the answer is yes should you augment.</p>

<p>The deeper lesson is the one I almost missed. My first two experiments told a clean, satisfying, completely wrong story. They did it precisely because they changed two things at once. That is worth sitting with. A confounded experiment does not just fail to find the answer. It can hand you a convincing wrong one: a mechanism that looks real and is not. The fix was not clever. It was the dullest discipline there is. Change one thing at a time. Old advice, easy to skip, and easiest of all to skip when the wrong answer is the one you were hoping to find.</p>

<h2 id="under-the-hood">Under the hood</h2>

<p>For the reader who wants the specifics:</p>

<ul>
  <li><strong>Generator.</strong> A conditional Wasserstein GAN with gradient penalty, trained for 500 epochs. The Wasserstein-with-gradient-penalty recipe is what kept training stable.</li>
  <li><strong>The filter, stage one, the part actually tested.</strong> For each character, one pool of 800 generated images, each scored once by the trained critic. <em>Filtered</em> keeps the top-scored images, between roughly the top 3% and the top 25% of the pool depending on how many were needed. <em>Unfiltered</em> takes a random draw from that same pool. Stage two, the perceptual screen, was never built.</li>
  <li><strong>Classifier.</strong> A plain three-block convolutional network, held identical across every condition, so that any difference traces to the data and not the model.</li>
  <li><strong>The runs.</strong> Five random seeds per condition, so every number above is a mean with a real spread behind it. Because all conditions share the same five seeds, each seed forms a matched pair, which lets a paired <em>t</em>-test cancel the run-to-run noise and compare treatments directly.</li>
  <li><strong>Significance.</strong> Paired <em>t</em>-test, threshold 0.05.</li>
</ul>

<h2 id="appendix-did-filtering-help-really">Appendix: did filtering help, really?</h2>

<p>The four filtered-versus-unfiltered comparisons in the body were all positive but, one at a time, unconvincing. Does their shared direction add up to something real once pooled? Here is the check.</p>

<p>The inputs are the per-seed accuracies behind every cell of the ledger: ten conditions, five seeds each, fifty classifier trainings in all. The four isolation comparisons (filtered − unfiltered, paired by seed, at fixed real-count and ratio) are:</p>

<table>
  <thead>
    <tr>
      <th>Comparison</th>
      <th>Filter effect (pp)</th>
      <th>Paired-<em>t</em> p</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>50/class, 1:1</td>
      <td>+0.55</td>
      <td>0.034</td>
    </tr>
    <tr>
      <td>50/class, 4:1</td>
      <td>+0.13</td>
      <td>0.78</td>
    </tr>
    <tr>
      <td>25/class, 1:1</td>
      <td>+0.36</td>
      <td>0.55</td>
    </tr>
    <tr>
      <td>25/class, 4:1</td>
      <td>+0.23</td>
      <td>0.64</td>
    </tr>
  </tbody>
</table>

<p>All positive; none significant after correcting for the four looks. To pool them, I averaged each seed’s filter effect across the four conditions. Because the same five seeds run through every condition, that yields five <em>independent</em> per-seed estimates of the average effect: +0.27, +0.18, +0.37, +0.67, +0.10 pp — every one positive.</p>

<p>A one-sample <em>t</em>-test on those five (the appropriate test, since the seed is the independent unit) gives a mean of <strong>+0.32 pp</strong>, 95% CI <strong>[+0.04, +0.59]</strong>, <em>t</em> = 3.21 (df = 4), <strong>p = 0.033</strong>. Two distribution-free checks (a Wilcoxon signed-rank test and an exact sign-flip permutation test) both give <strong>p = 0.063</strong>, the smallest two-sided value either can return with only five seeds. The effect sits right on the 0.05 line: significant under the parametric test, just shy of it under the distribution-free ones.</p>
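<p>The pooled numbers can be re-derived from the five per-seed effects using nothing but the standard library. This sketch is not the pooled-analysis notebook. It starts from the rounded values printed above, so the last digit can differ slightly from the reported figures, and it hard-codes the 95% critical value of Student's <em>t</em> for four degrees of freedom (2.776) instead of computing a parametric p-value.</p>

```python
import math
import statistics
from itertools import product

# Per-seed average filter effects (pp), as listed above.
effects = [0.27, 0.18, 0.37, 0.67, 0.10]
n = len(effects)

mean = statistics.mean(effects)
se = statistics.stdev(effects) / math.sqrt(n)  # standard error of the mean
t_stat = mean / se

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776
ci = (mean - T_CRIT_DF4 * se, mean + T_CRIT_DF4 * se)

# Exact sign-flip permutation test: under the null, each seed's effect is
# equally likely to be positive or negative, so try all 2**5 sign patterns.
observed = abs(sum(effects))
hits = 0
for signs in product((1, -1), repeat=n):
    flipped = abs(sum(s * e for s, e in zip(signs, effects)))
    if flipped >= observed - 1e-12:
        hits += 1
p_perm = hits / 2 ** n

print(f"mean {mean:+.2f} pp, t = {t_stat:.2f}, "
      f"95% CI [{ci[0]:+.2f}, {ci[1]:+.2f}]")
print(f"sign-flip p = {p_perm:.4f}")
```

<p>Because all five effects are positive, only the all-positive and all-negative sign patterns are as extreme as the observed one, which is why the permutation p-value lands on exactly 2 / 32.</p>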

<p>One shortcut is wrong: treating all twenty paired differences as independent gives p = 0.13 and finds nothing. The twenty are not independent (each seed contributes four of them), and pretending otherwise buries the signal; averaging within each seed first respects the structure and recovers it.</p>

<p>Two caveats keep it modest. The effect, about a third of a percentage point, is negligible beside the six- to seven-point scarcity swing. And the seeds randomize only classifier training, not the generator or the synthetic pool, both fixed here — so this is a tiny consistent edge <em>for this generator and this pool</em>, not a claim that it generalizes. The verdict: a small, consistent, borderline-significant positive effect, real and repeatable, and still too small to act on.</p>

<h2 id="the-work">The work</h2>

<ul>
  <li><a href="/assets/papers/Catanzarite_ResearchProject_FinalPaper_v16.pdf">Research Paper (PDF)</a></li>
  <li><a href="/assets/notebooks/Catanzarite_ResearchProject_rev13_FINAL.html">Jupyter Notebook</a></li>
  <li><a href="/assets/notebooks/filter_pooled_analysis.html">Pooled-analysis notebook</a></li>
  <li><strong>Course:</strong> EN.705.603 Introduction to Generative AI, Johns Hopkins University, Spring 2026</li>
</ul>]]></content><author><name>Joseph Catanzarite</name><email>jcat@alumni.caltech.edu</email></author><category term="generative-ai" /><category term="research" /><category term="WGAN" /><category term="data-augmentation" /><category term="CNN" /><category term="Chinese-MNIST" /><summary type="html"><![CDATA[I trained a GAN on a tiny set of Chinese characters and used it to invent more training data. When data was scarce, it worked: it bought back almost half the accuracy I had lost, for free. But the experiment fooled me first. The quality filter I was sure mattered turned out to do almost nothing. Here is how a confounded experiment manufactured a result that was not real, and what a cleaner one showed instead.]]></summary></entry><entry><title type="html">The Language of Agents Is Modal and Epistemic Logic</title><link href="https://jcatanza.github.io/math/logic/education/ai/2026/06/12/modal-epistemic-logic-tutorial.html" rel="alternate" type="text/html" title="The Language of Agents Is Modal and Epistemic Logic" /><published>2026-06-12T00:00:00+00:00</published><updated>2026-06-12T00:00:00+00:00</updated><id>https://jcatanza.github.io/math/logic/education/ai/2026/06/12/modal-epistemic-logic-tutorial</id><content type="html" xml:base="https://jcatanza.github.io/math/logic/education/ai/2026/06/12/modal-epistemic-logic-tutorial.html"><![CDATA[<p>Modal and epistemic logic are the formal languages agents use to reason about what is <em>possible</em>, what is <em>necessary</em>, and what is <em>known</em>. If you want to understand how AI agents represent knowledge, uncertainty, and belief — this is the mathematical foundation.</p>

<p>This interactive tutorial covers:</p>

<ul>
  <li><strong>Propositional modal logic</strong> — the operators □ (necessarily) and ◇ (possibly), Kripke semantics, accessibility relations</li>
  <li><strong>Epistemic logic</strong> — formalizing knowledge (K) and belief (B) for single and multiple agents</li>
  <li><strong>Common knowledge</strong> — what it means for a fact to be known by all agents, known by all to be known by all, and so on without end, and why it matters for coordination</li>
  <li><strong>Applications to AI agents</strong> — how these formalisms underlie planning, multi-agent systems, and reasoning under uncertainty</li>
</ul>

<p>The tutorial is a self-contained interactive HTML file that runs entirely in your browser — no install required.</p>

<p><a href="https://jcatanza.github.io/The-Language-of-Agents-Is-Modal-and-Epistemic-Logic/modal_epistemic_logic_tutorial_rev58_no_Q1.html"><strong>Open the tutorial →</strong></a></p>

<p>Source and earlier revisions on <a href="https://github.com/jcatanza/The-Language-of-Agents-Is-Modal-and-Epistemic-Logic">GitHub</a>.</p>]]></content><author><name>Joseph Catanzarite</name><email>jcat@alumni.caltech.edu</email></author><category term="math" /><category term="logic" /><category term="education" /><category term="ai" /><category term="modal logic" /><category term="epistemic logic" /><category term="AI agents" /><category term="tutorial" /><category term="interactive" /><summary type="html"><![CDATA[An interactive tutorial on modal and epistemic logic — the formal language underlying reasoning, knowledge, and belief in AI agents.]]></summary></entry></feed>