9  Advanced Methods & AI

9.1 Markov Chains and Hidden Markov Models

This presentation is prepared by Reid Davis.

9.1.1 What is a Markov Chain?

  • A Markov Chain is a mathematical system (stochastic model) that models a sequence of events where the probability of transitioning to the next state depends solely on the current state, not on previous history
  • Utilizes the memoryless property, meaning only the present state matters for predicting the future

9.1.2 Components of a Markov Chain

  • States: the possible situations/nodes the chain can be in at any given time
  • Can be represented by the state space (set of all possible states)
  • Transition Probabilities: measure the probability of a stochastic system moving from one state to another within a specific timeframe or single step (represented by arrows)
  • Written mathematically as \[ p_{ij} = \Pr(X_{n+1} = j \mid X_n = i)\]

9.1.3 Example of Weather Markov Chain

[Figure: Weather Markov Chain diagram]

9.1.4 A Transition Matrix

  • Definition: A square matrix that describes the probabilities of transitioning from one state to another
  • \(p_{ij}\) represents the probability of moving from state i to state j
  • All rows must sum to 1
  • We can create our transition matrix using a numpy array
import numpy as np

# Create a 3x3 matrix
my_matrix = np.array([[0.7, 0.2, 0.1],[0.1, 0.6, 0.3],[0.2, 0.5, 0.3]])
print(my_matrix)
[[0.7 0.2 0.1]
 [0.1 0.6 0.3]
 [0.2 0.5 0.3]]
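The memoryless dynamics can be simulated directly: starting from any state, the next state is drawn using only the current state's row of the matrix. A minimal sketch using the transition matrix from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.5, 0.3]])

# Simulate one long trajectory: the next state depends only on the
# current state's row of P (the memoryless property in action)
state = 0
counts = np.zeros(3)
for _ in range(100_000):
    state = rng.choice(3, p=P[state])
    counts[state] += 1

print(counts / counts.sum())  # long-run fraction of time spent in each state
```

No matter which state the simulation starts in, the empirical fractions settle near a fixed vector, which foreshadows the stationary distribution discussed next.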

9.1.5 Stationary Distributions

A stationary distribution of a Markov chain is a probability vector \[ \pi = (\pi_1, \pi_2, \dots, \pi_n) \] that satisfies

\[ \pi P = \pi \]

where P is the transition matrix. The values of \(\pi\) represent the long run probabilities of being in a specific state.

It also satisfies the normalization condition:

\[ \sum_{i=1}^{n} \pi_i = 1 \]

Stationary distributions are important because they describe the long-term behavior of a Markov chain regardless of the initial state.

9.1.6 Stationary Distribution Calculation

Consider the transition matrix


\[\begin{bmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.6 & 0.3 \\ 0.2 & 0.5 & 0.3 \end{bmatrix}\]



9.1.6.0.1 Step 1: Transition Matrix

The transition matrix is:

\[ P = \begin{bmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.6 & 0.3 \\ 0.2 & 0.5 & 0.3 \end{bmatrix} \]

Let the stationary distribution be:

\[ \pi = (\pi_1, \pi_2, \pi_3) \]

satisfying:

\[ \pi P = \pi, \quad \pi_1 + \pi_2 + \pi_3 = 1 \]


9.1.6.0.2 Step 2: System of Equations

\[ \begin{aligned} \pi_1 &= 0.7 \pi_1 + 0.1 \pi_2 + 0.2 \pi_3 \\ \pi_2 &= 0.2 \pi_1 + 0.6 \pi_2 + 0.5 \pi_3 \\ \pi_3 &= 0.1 \pi_1 + 0.3 \pi_2 + 0.3 \pi_3 \end{aligned} \]

Move all terms to the left:

\[ \begin{aligned} 0.3 \pi_1 - 0.1 \pi_2 - 0.2 \pi_3 &= 0 \\ -0.2 \pi_1 + 0.4 \pi_2 - 0.5 \pi_3 &= 0 \\ -0.1 \pi_1 - 0.3 \pi_2 + 0.7 \pi_3 &= 0 \end{aligned} \]


9.1.6.0.3 Step 3: Solve for \(\pi_1\)

From the first equation:

\[ 0.3 \pi_1 = 0.1 \pi_2 + 0.2 \pi_3 \implies \pi_1 = \frac{1}{3} \pi_2 + \frac{2}{3} \pi_3 \]


9.1.6.0.4 Step 4: Substitute into second equation

\[ -0.2\left(\frac{1}{3}\pi_2 + \frac{2}{3}\pi_3\right) + 0.4 \pi_2 - 0.5 \pi_3 = 0 \]

Simplify:

\[ 0.3333 \pi_2 - 0.6333 \pi_3 = 0 \implies \pi_2 \approx 1.9 \pi_3 \]

Then:

\[ \pi_1 \approx \frac{1}{3} (1.9 \pi_3) + \frac{2}{3} \pi_3 \approx 1.3 \pi_3 \]


9.1.6.0.5 Step 5: Apply normalization

\[ \pi_1 + \pi_2 + \pi_3 = 1 \implies 1.3 \pi_3 + 1.9 \pi_3 + \pi_3 = 4.2 \pi_3 = 1 \]

\[ \pi_3 \approx 0.2381, \quad \pi_2 \approx 0.4524, \quad \pi_1 \approx 0.3095 \]


9.1.6.0.6 Step 6: Final Stationary Distribution

\[ \pi = (\pi_1, \pi_2, \pi_3) \approx (0.3095, 0.4524, 0.2381) \]
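The hand calculation can be checked numerically. One standard trick, sketched below, is to stack the equations \(\pi (P - I) = 0\) with the normalization constraint and solve by least squares (the matrix is redefined here so the snippet is self-contained):

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.5, 0.3]])

n = P.shape[0]
# Rows of the linear system: (P^T - I) pi = 0, plus sum(pi) = 1
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0

pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # ≈ [0.3095, 0.4524, 0.2381]
```

The result agrees with the step-by-step derivation above to rounding error.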

9.1.7 Multistep Transition Probabilities in Markov Chains

Let P be the one-step transition matrix of a Markov chain.

The n-step transition probability from state i to state j is defined as

\[ p_{ij}^{(n)} = \Pr(X_n = j | X_0 = i). \]


9.1.8 Matrix Interpretation

The matrix of \(n\)-step transition probabilities is given by

\[ P^{(n)} = P^n \]

That is, the \(n\)-step transition matrix is simply the matrix \(P\) raised to the power \(n\); for example, the 2-step matrix is \(P \times P\). Each row of \(P^n\) gives the distribution of the chain after \(n\) steps starting from a specific state.


9.1.9 Connection to Convergence

If the chain is irreducible (every state can reach every other state) and aperiodic (returns to a state are not locked to a fixed cycle length), then

\[ \lim_{n \to \infty} P^n = \begin{bmatrix} \pi \\ \pi \\ \vdots \\ \pi \end{bmatrix} \]

where \(\pi\) is the stationary distribution.

This means:

  • The rows of \(P^n\) converge to the stationary distribution.
  • The chain “forgets” its initial state.
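This convergence is easy to see numerically: raising the example matrix to a large power makes every row collapse onto \(\pi\). A quick check with NumPy's built-in matrix power:

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.5, 0.3]])

# After many steps, every row is (numerically) the same distribution
Pn = np.linalg.matrix_power(P, 50)
print(Pn)  # every row ≈ (0.3095, 0.4524, 0.2381)
```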

9.1.10 Matrix Power Function

import numpy as np

def matrix_power(matrix, power):
    if power == 0:
        return np.identity(len(matrix))
    elif power == 1:
        return matrix
    else:
        return np.dot(matrix, matrix_power(matrix, power - 1))

matrix_power(my_matrix, 1)
array([[0.7, 0.2, 0.1],
       [0.1, 0.6, 0.3],
       [0.2, 0.5, 0.3]])
for i in range(1, 17, 3):
    print(
        f"\nStep Transition Matrix at the nth power {i}\n",
        matrix_power(my_matrix, i),
        "\n"
    )

Step Transition Matrix at the nth power 1
 [[0.7 0.2 0.1]
 [0.1 0.6 0.3]
 [0.2 0.5 0.3]] 


Step Transition Matrix at the nth power 4
 [[0.3798 0.407  0.2132]
 [0.2714 0.477  0.2516]
 [0.2906 0.4646 0.2448]] 


Step Transition Matrix at the nth power 7
 [[0.3221704 0.4442144 0.2336152]
 [0.3026632 0.4568112 0.2405256]
 [0.3061184 0.45458   0.2393016]] 


Step Transition Matrix at the nth power 10
 [[0.31179963 0.45091134 0.23728903]
 [0.3082892  0.4531782  0.2385326 ]
 [0.30891099 0.45277668 0.23831233]] 


Step Transition Matrix at the nth power 13
 [[0.30993336 0.45211649 0.23795016]
 [0.30930164 0.45252442 0.23817394]
 [0.30941353 0.45245217 0.2381343 ]] 


Step Transition Matrix at the nth power 16
 [[0.30959751 0.45233336 0.23806913]
 [0.30948383 0.45240677 0.2381094 ]
 [0.30950396 0.45239377 0.23810227]] 

9.1.11 Real World Examples

Using Markov Chains to Predict K% and BB%

  • States are defined as the count
  • This is a good example of the memoryless property: the count two pitches ago doesn’t matter, only the count now
  • Transition probabilities were created using indexed pitching stats for each pitcher (normalized zone%, zoneswing%, oswing%, fair/foul%, compared to league average)
  • Based on these probabilities, we can find the stationary distribution that gives us expected K% and BB%
Single Pitch Transition Matrix:

     0-0    0-1    0-2    1-0    1-1    1-2    2-0    2-1    2-2    3-0  \
0-0    0  0.546  0.000  0.344  0.000  0.000  0.000  0.000  0.000  0.000   
0-1    0  0.000  0.471  0.350  0.000  0.000  0.000  0.000  0.000  0.000   
0-2    0  0.000  0.207  0.000  0.395  0.000  0.000  0.000  0.000  0.000   
1-0    0  0.000  0.000  0.542  0.290  0.000  0.000  0.000  0.000  0.000   
1-1    0  0.000  0.000  0.000  0.509  0.283  0.000  0.000  0.000  0.000   
1-2    0  0.000  0.000  0.000  0.000  0.240  0.000  0.317  0.000  0.000   
2-0    0  0.000  0.000  0.000  0.000  0.000  0.564  0.260  0.000  0.000   
2-1    0  0.000  0.000  0.000  0.000  0.000  0.000  0.541  0.225  0.000   
2-2    0  0.000  0.000  0.000  0.000  0.000  0.000  0.283  0.000  0.231   
3-0    0  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.664   
3-1    0  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000   
3-2    0  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000   
K      0  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000   
BB     0  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000   
IP     0  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000   

       3-1    3-2      K     BB     IP  
0-0  0.000  0.000  0.000  0.000  0.110  
0-1  0.000  0.000  0.000  0.000  0.180  
0-2  0.000  0.000  0.000  0.221  0.177  
1-0  0.000  0.000  0.000  0.000  0.168  
1-1  0.000  0.000  0.000  0.000  0.208  
1-2  0.000  0.000  0.000  0.238  0.204  
2-0  0.000  0.000  0.000  0.000  0.175  
2-1  0.000  0.000  0.000  0.000  0.234  
2-2  0.000  0.000  0.246  0.000  0.241  
3-0  0.000  0.000  0.000  0.298  0.038  
3-1  0.567  0.000  0.000  0.203  0.229  
3-2  0.000  0.332  0.242  0.144  0.282  
K    0.000  0.000  1.000  0.000  0.000  
BB   0.000  0.000  0.000  1.000  0.000  
IP   0.000  0.000  0.000  0.000  1.000  

Long-Run Limit Matrix:

     0-0  0-1  0-2  1-0  1-1  1-2  2-0  2-1  2-2  3-0  3-1  3-2      K     BB  \
0-0    0    0    0    0    0    0    0    0    0    0    0    0  0.285  0.041   
0-1    0    0    0    0    0    0    0    0    0    0    0    0  0.369  0.023   
0-2    0    0    0    0    0    0    0    0    0    0    0    0  0.530  0.014   
1-0    0    0    0    0    0    0    0    0    0    0    0    0  0.243  0.082   
1-1    0    0    0    0    0    0    0    0    0    0    0    0  0.341  0.046   
1-2    0    0    0    0    0    0    0    0    0    0    0    0  0.505  0.029   
2-0    0    0    0    0    0    0    0    0    0    0    0    0  0.202  0.197   
2-1    0    0    0    0    0    0    0    0    0    0    0    0  0.295  0.111   
2-2    0    0    0    0    0    0    0    0    0    0    0    0  0.459  0.069   
3-0    0    0    0    0    0    0    0    0    0    0    0    0  0.136  0.515   
3-1    0    0    0    0    0    0    0    0    0    0    0    0  0.205  0.326   
3-2    0    0    0    0    0    0    0    0    0    0    0    0  0.362  0.216   
K      0    0    0    0    0    0    0    0    0    0    0    0  1.000  0.000   
BB     0    0    0    0    0    0    0    0    0    0    0    0  0.000  1.000   
IP     0    0    0    0    0    0    0    0    0    0    0    0  0.000  0.000   

        IP  
0-0  0.674  
0-1  0.608  
0-2  0.455  
1-0  0.675  
1-1  0.613  
1-2  0.466  
2-0  0.602  
2-1  0.594  
2-2  0.471  
3-0  0.349  
3-1  0.469  
3-2  0.422  
K    0.000  
BB   0.000  
IP   1.000  

9.1.12 Hidden Markov Models

9.1.12.1 What is a Hidden Markov Model?

  • A Hidden Markov Model (HMM) is a Markov model in which the observations depend on a hidden (latent) Markov process \(X\)
  • An HMM requires that there be an observable process \(Y\) whose outcomes depend on the outcomes of \(X\)
  • Since \(X\) cannot be observed directly, the goal is to learn about the state of \(X\) by observing \(Y\)

9.1.12.2 Requirements/Assumptions of HMMs

  1. the outcome of \(Y\) at time \(t = t_0\) must be “influenced” exclusively by the outcome of \(X\) at \(t = t_0\)

  2. the outcomes of \(X\) and \(Y\) at \(t < t_0\) must be conditionally independent of \(Y\) at \(t = t_0\) given \(X\) at time \(t = t_0\)

[Figure: Hidden Markov Model diagram]

9.1.12.3 Weather Example cont.

Let us continue with another example involving the weather.

  • Suppose that we cannot go outside and cannot tell what the weather is
  • Our only indicator of possible weather is what our roommate is doing

9.1.12.4 Hidden States

Let the hidden states represent the true weather:

\[ X_t \in \{\text{Sunny}, \text{Rainy}\} \]

The weather evolves according to a Markov chain:

\[ P(X_{t+1} \mid X_t) \]

This means tomorrow’s weather depends only on today’s weather.


9.1.12.5 Observations

Suppose we do not directly observe the weather.

Instead, we observe what activity our roommate does:

\[ Y_t \in \{\text{walking}, \text{shopping}, \text{cleaning}\} \]

The observation depends only on the current hidden state:

\[ P(Y_t \mid X_t) \]

For example:

  • If it is Rainy, cleaning or shopping probability might be high.
  • If it is Sunny, walking probability might be high.

9.1.12.6 Matrices

  • Remember we have a transition matrix describing how weather changes
  • Now we need a matrix to describe how the weather affects the activity, which is called an emission matrix
  • It defines the probability of a specific observable output symbol being generated from a particular hidden state
  • For N states and M observations, this is an N by M matrix
  • Rows sum to 1
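As a small sketch, here is the weather example's emission matrix (the same values used in the forward-algorithm code later in this section), with a check that each row sums to 1:

```python
import numpy as np

# Emission matrix for N = 2 hidden states (Rainy, Sunny) and
# M = 3 observations (walk, shop, clean): an N x M matrix
B = np.array([
    [0.1, 0.4, 0.5],   # P(walk | Rainy), P(shop | Rainy), P(clean | Rainy)
    [0.6, 0.3, 0.1],   # P(walk | Sunny), P(shop | Sunny), P(clean | Sunny)
])

print(B.sum(axis=1))  # each row sums to 1
```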

9.1.12.7 Forward Algorithm

The forward algorithm is used in Hidden Markov Models (HMMs) to compute the probability of an observed sequence. Instead of listing every possible hidden state sequence (which grows exponentially), it uses dynamic programming to build the answer step by step.

At each time step, the algorithm keeps track of the probability of being in each hidden state, given all observations seen so far. It updates these probabilities by combining

  • the previous step’s probabilities
  • the transition probabilities between states
  • the likelihood of the current observation

By moving forward through time and updating these values recursively, the algorithm efficiently computes the total probability of the full observation sequence.

# 0 = Rainy,  1 = Sunny

# 0 = walk, 1 = shop, 2 = clean

import numpy as np

# Initial probabilities
pi = np.array([0.6, 0.4])

# Transition matrix A
A = np.array([
    [0.7, 0.3],   # from Rainy
    [0.4, 0.6]    # from Sunny
])

# Emission matrix B
B = np.array([
    [0.1, 0.4, 0.5],   # Rainy emits walk, shop, clean
    [0.6, 0.3, 0.1]    # Sunny emits walk, shop, clean
])

# Set our sequence of observations
obs = np.array([0, 1, 2])  # walk, shop, clean


def forward_np(obs, pi, A, B):
    T = len(obs)
    N = len(pi)
    
    alpha = np.zeros((T, N)) # creates empty matrix
    
    # Initialization.  init state prob, multiplied by emission prob of first ob
    alpha[0] = pi * B[:, obs[0]]
    
    # Recursion. sums all ways to arrive in next weather state, multiplied by emission prob
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    
    # Termination. sums together all ways to observe the observation seq
    prob = np.sum(alpha[T-1])
    
    return alpha, prob

alpha, prob = forward_np(obs, pi, A, B)

print("Forward matrix:\n", alpha)
print("Total probability:", prob)
Forward matrix:
 [[0.06     0.24    ]
 [0.0552   0.0486  ]
 [0.02904  0.004572]]
Total probability: 0.033611999999999996

9.1.12.8 Hidden Markov Model: Weather Example

9.1.12.8.1 Model Setup

Initial probabilities:

\[ P(R_0 = \text{Rainy}) = 0.6 \]

\[ P(R_0 = \text{Sunny}) = 0.4 \]

Transition matrix:

\[ A = \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix} \]

Emission matrix:

\[ B = \begin{bmatrix} 0.1 & 0.4 & 0.5 \\ 0.6 & 0.3 & 0.1 \end{bmatrix} \]

Observations:

\[ O = (\text{walk}, \text{shop}, \text{clean}) \]


9.1.12.9 Forward Algorithm

We compute:

\[ \alpha_t(j) = P(O_0, \dots, O_t, X_t = j) \]


9.1.12.9.1 Step 0 (Initialization)

\[ \alpha_0(j) = \pi_j B_{j,O_0} \]

For Rainy:

\[ \alpha_0(R) = 0.6 \cdot 0.1 = 0.06 \]

For Sunny:

\[ \alpha_0(S) = 0.4 \cdot 0.6 = 0.24 \]

So,

\[ \alpha_0 = \begin{bmatrix} 0.06 & 0.24 \end{bmatrix} \]


9.1.12.9.2 Step 1 (Observation = shop)

\[ \alpha_1(j) = \left( \sum_i \alpha_0(i) A_{ij} \right) B_{j,O_1} \]

9.1.12.9.3 Rainy

\[ \alpha_1(R) = (0.06 \cdot 0.7 + 0.24 \cdot 0.4)\cdot 0.4 \]

\[ = (0.042 + 0.096)\cdot 0.4 \]

\[ = 0.138 \cdot 0.4 \]

\[ = 0.0552 \]

9.1.12.9.4 Sunny

\[ \alpha_1(S) = (0.06 \cdot 0.3 + 0.24 \cdot 0.6)\cdot 0.3 \]

\[ = (0.018 + 0.144)\cdot 0.3 \]

\[ = 0.162 \cdot 0.3 \]

\[ = 0.0486 \]

So,

\[ \alpha_1 = \begin{bmatrix} 0.0552 & 0.0486 \end{bmatrix} \]


9.1.12.9.5 Step 2 (Observation = clean)
9.1.12.9.6 Rainy

\[ \alpha_2(R) = (0.0552 \cdot 0.7 + 0.0486 \cdot 0.4)\cdot 0.5 \]

\[ = (0.03864 + 0.01944)\cdot 0.5 \]

\[ = 0.05808 \cdot 0.5 \]

\[ = 0.02904 \]

9.1.12.9.7 Sunny

\[ \alpha_2(S) = (0.0552 \cdot 0.3 + 0.0486 \cdot 0.6)\cdot 0.1 \]

\[ = (0.01656 + 0.02916)\cdot 0.1 \]

\[ = 0.04572 \cdot 0.1 \]

\[ = 0.004572 \]

So,

\[ \alpha_2 = \begin{bmatrix} 0.02904 & 0.004572 \end{bmatrix} \]


9.1.12.9.8 Termination Step

\[ P(O) = \sum_j \alpha_2(j) \]

\[ = 0.02904 + 0.004572 \]

\[ = 0.033612 \]


9.1.12.10 Final Result

\[ P(\text{walk, shop, clean}) = 0.033612 \]

This matches the full sum over all 8 possible hidden state paths.
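That claim can be verified by brute force: enumerate all \(2^3 = 8\) hidden state paths, multiply out each path's probability, and sum. A minimal sketch using the same \(\pi\), \(A\), \(B\), and observations as above:

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])                  # Rainy, Sunny
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # transition matrix
B = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.3, 0.1]])            # emission matrix
obs = [0, 1, 2]                            # walk, shop, clean

total = 0.0
for path in itertools.product([0, 1], repeat=len(obs)):
    # joint probability of this hidden path AND the observations
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    total += p

print(total)  # ≈ 0.033612, matching the forward algorithm
```

The forward algorithm reaches the same number in \(O(TN^2)\) work instead of \(O(N^T)\).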


9.1.12.11 Viterbi Algorithm

The Viterbi Algorithm finds the most likely sequence of hidden states that could have generated a given observation sequence

How it works

  1. Initialization: Initialize a probability matrix with the initial state probabilities multiplied by the emission probabilities.
  2. Recursion: For each time step and state, calculate the highest probability of reaching that state from a previous state, keeping track of the best previous state using “backpointers”.
  3. Termination: Find the maximum probability among the final states.
  4. Path Backtracking: Follow the backpointers from the final, most likely state back to the beginning to reconstruct the optimal sequence.
def viterbi_np(obs, pi, A, B):
    T = len(obs)
    N = len(pi)
    
    delta = np.zeros((T, N))     # stores max prob of path ending in state j at time t
    psi = np.zeros((T, N), dtype=int)  # stores index of prev state that gave max prob for j at t

    
    # Initialization
    delta[0] = pi * B[:, obs[0]] # initial prob times emission prob
    
    # Recursion
    for t in range(1, T):  # loop over time steps for all states
        for j in range(N):
            probs = delta[t-1] * A[:, j] # best prob of reaching i at t-1 times prob i to j
            psi[t, j] = np.argmax(probs) # store which prev state i gives max prob
            delta[t, j] = np.max(probs) * B[j, obs[t]] # max trans prob times emiss prob for current obs at j
    
    # Termination
    best_last_state = np.argmax(delta[T-1]) # chooses final state w/ highest prob
    best_prob = delta[T-1, best_last_state] # prob of best overall path
    
    # Backtracking
    best_path = np.zeros(T, dtype=int) # creates array to store best state sequence
    best_path[T-1] = best_last_state # set final state
    
    for t in range(T-2, -1, -1):  # loop backwards
        best_path[t] = psi[t+1, best_path[t+1]] # follow backpointer to prev state that led to current best
    
    return best_path, best_prob # return most likely seq and probability
path, prob = viterbi_np(obs, pi, A, B)

print("Best path (state indices):", path)
print("Probability of best path:", prob)
Best path (state indices): [1 0 0]
Probability of best path: 0.01344

9.1.12.12 Other Algorithms

9.1.12.12.1 Forward-Backward Algorithm
  • The Forward-Backward algorithm computes the probability of each hidden state at every time step given the full observation sequence.
  • It uses a forward pass to calculate the probability of the observations up to each time step and a backward pass to calculate the probability of the remaining observations.
  • Multiplying these forward and backward probabilities and normalizing gives the posterior probability of being in each state at each time.
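A minimal sketch of those two passes on the weather example (same \(\pi\), \(A\), \(B\), and observations as before); each row of the resulting posterior sums to 1 by construction:

```python
import numpy as np

pi = np.array([0.6, 0.4])                  # Rainy, Sunny
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.3, 0.1]])
obs = np.array([0, 1, 2])                  # walk, shop, clean
T, N = len(obs), len(pi)

# Forward pass: alpha[t, i] = P(O_0..O_t, X_t = i)
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

# Backward pass: beta[t, i] = P(O_{t+1}..O_{T-1} | X_t = i)
beta = np.ones((T, N))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

# Posterior P(X_t = i | O): multiply pointwise and normalize per time step
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
print(gamma)
```

These posteriors (the "expected counts") are exactly what the Baum-Welch E-step consumes.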
9.1.12.12.2 Baum-Welch Algorithm
  • The Baum-Welch algorithm is an Expectation-Maximization (EM) method used to learn the HMM parameters from observation sequences when hidden states are unknown.
  • It alternates between computing expected counts of state visits and transitions (E-step) and updating the initial, transition, and emission probabilities to maximize likelihood (M-step).
  • Repeating these steps iteratively improves the model to better fit the observed data.

9.1.12.13 Applications of HMMs

  • Financial Time Series Data
    • Algorithmic trading strategies
    • Credit risk modeling
    • Portfolio optimization
    • Interest rate modeling
  • Bioinformatics
    • Gene finding
    • DNA sequencing
    • Protein family modeling

9.1.12.14 Advantages and Limitations

9.1.12.14.1 Markov Chains

Advantages

  • Simple and easy to understand; only requires transition probabilities between states
  • Can be analyzed mathematically with well-established tools
  • Accurately models systems where only the present, not the past history, matters, which simplifies complex calculations
  • Can be applied to diverse fields, including marketing attribution (customer paths), finance (stock trends), gaming, and weather modeling

Limitations

  • Assumes the current state fully determines the next state (memoryless), which may be unrealistic
  • Transition probabilities must be known or estimated accurately, which can be hard for large state spaces
  • They predict the next state but do not explain why it happened (poor explanatory power)
  • They assume probabilities remain constant, failing to adapt to dynamic systems

9.1.12.14.2 Hidden Markov Models

Advantages

  • Can model sequences where the true states are hidden and only observations are seen
  • Captures uncertainty by using probabilities for transitions and emissions
  • Good at handling missing or incomplete data (power to infer states/observation)
  • Algorithms like the Forward-Backward algorithm and Viterbi algorithm provide polynomial-time solutions for training and decoding, ensuring fast inference

Limitations

  • Assumes Markov property and conditional independence of observations given states, which may not always hold
  • Performance heavily relies on the quality of initial parameter estimates (e.g., transition probabilities)
  • Hidden states may not have clear, real-world meaning, making results harder to interpret
  • With too many hidden states, HMMs can fit the training data very well but generalize poorly.

9.1.12.15 Further Readings

9.2 LLM Agents

This presentation was prepared by Jon Trnka.

9.2.1 Introduction

Hello! My name is Jonathan Trnka and this is my presentation on LLM Agents!

9.2.2 Objective

  • Quickly explain what an LLM is
  • Explain what makes an LLM-Agent different
  • Compare them
  • Explain my process
  • Use statistical tools to explain my Data
  • Conclusion

9.2.2.1 What exactly is an LLM Agent?

  • To put it simply, an LLM Agent is the next step in evolution of LLMs.
  • What is an LLM aka Large Language Model?
    • LLM chatbots are reactive conversational systems that receive user prompts, process them using pretrained knowledge, and generate text responses. Typical use cases include open-domain dialogue, customer support, and question answering. Despite being very powerful in their own right, they are still limited by outdated information and hallucinated responses.
  • What’s a hallucinated response you say?
    • A hallucinated response, in the context of LLMs, is when the model creates incorrect or fabricated content that it believes is correct. For example, you could ask for a 5-step plan and it gives you 4 steps, believing that to be correct. It’s all based on the prompt given to it by the user.
  • What does the ‘agent’ part mean?
    • The ‘agent’ part corresponds to the autonomous nature of an LLM-Agent. This allows the LLM-Agent to resolve tasks on its own, without further input from a user. It can access APIs to reach certain content, search anything on the web, and resolve errors it runs into.
  • What is RAG?
    • RAG, which stands for Retrieval-Augmented Generation, is one approach to addressing the limitations of standard LLMs, such as outdated information, single-prompt answers, and the lack of recovery plans. It works by drawing real-time data from external sources such as APIs.

9.2.3 LLM-Agent vs. Non‑LLM-Agent AI comparisons + examples

9.2.3.1 Table side-by-side Comparison

Aspect         | LLM-Agent                       | Non‑LLM-Agent
Task Handling  | Multi-step planning             | Single-shot answer
Tool Use       | Dynamic, context-driven         | None
Accuracy       | High on factual & numeric tasks | Variable; prone to errors
Adaptability   | Adjusts plan mid-run            | Fixed output
Error Recovery | Retries or switches tools       | No recovery
Transparency   | Shows steps & traces            | Opaque
Latency        | Slower                          | Fast
Cost           | Higher                          | Low
Best For       | Complex tasks                   | Simple tasks

9.2.3.2 Types of LLMs and LLM-Agents

Category | Model / System | Description | Agentic?
Standard LLM | GPT‑4o / GPT‑4 Turbo | General-purpose chat models; strong reasoning but no built‑in planning. | No
Standard LLM | Claude 3 Opus / Sonnet | High‑quality reasoning and writing; non‑agentic unless wrapped in a system. | No
Standard LLM | LLaMA‑3.1 / LLaMA‑3.2 / LLaMA‑3.3 | Open models used in research; non‑agentic by default. | No
Standard LLM | DeepSeek‑V3 | Efficient, high‑performance model; not agentic without a tool layer. | No
Standard LLM | Mixtral 8x7B / 8x22B | Sparse MoE models; strong performance but no autonomy. | No
LLM-Agent System | OpenAI o‑series with function calling | Uses planning + tool calls + multi‑step execution when orchestrated. | Yes
LLM-Agent System | Google Gemini 2.0 + Tools | Can plan, call APIs, and execute multi‑step tasks when tools are enabled. | Yes
LLM-Agent System | Groq LLaMA‑3.3 + my agent pipeline | Becomes agentic through my code of planning loops, tool calls, and error recovery. | Yes
LLM-Agent System | LangChain Agents | Framework that wraps any LLM with planning, tools, and autonomy. | Yes
LLM-Agent System | Microsoft Autogen Agents | Multi‑agent orchestration layer enabling planning and tool use. | Yes
LLM-Agent System | ReAct‑style LLM Agents | LLMs using reasoning + action loops to solve tasks step‑by‑step. | Yes

9.2.4 Specific Aims

My goal is to evaluate how effective an LLM-Agent system is compared to a baseline LLM. To do this, I will construct two models: an LLM-Agent capable of multi‑step planning, tool use, API integration, and limited error recovery, and a baseline LLM that processes only one prompt at a time without tool use or autonomous decision‑making. I will benchmark both systems on the same set of tasks and use linear regression techniques to analyze differences in performance, error rates, and execution time.

9.2.5 Data

9.2.5.1 My data will come from 1 non-API and 3 different API sources

  • Tavily (web search base)
  • OpenWeather
  • Calculator
  • Groq (the brain)

9.2.5.2 Main research question

“How well can an LLM-Agent perform multi-step tasks, self-recovery, accessing API keys, and replanning actions, compared to a baseline LLM that cannot?”

9.2.5.3 Sub-research questions

  • Does the complexity (# of tool calls required to complete a task) of the task predict total_latency?
    • Y = total_latency
    • X = task_complexity (# of tool calls required to complete a task)
  • Does the number of tools used (actual # of tool calls the agent executed) predict total task latency?
    • Y = total_latency
    • X = tool_count (actual # of tool calls the agent executed)
  • Does API difficulty predict total_latency?
    • Y = total_latency
    • X = api_difficulty (numerical: calculator: 1, web search: 2, weather: 3)
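Each sub-question is a simple linear regression of total_latency on one predictor. As an illustration only (the numbers below are made up, not the actual benchmark data), such a fit can be sketched with NumPy's least-squares solver:

```python
import numpy as np

# Hypothetical per-query measurements: task complexity (# of required
# tool calls) vs. total latency in seconds -- illustrative values only
complexity = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)
latency = np.array([0.4, 0.5, 0.6, 0.7, 0.9, 1.0, 1.1, 1.3])

# Ordinary least squares: latency ~ intercept + slope * complexity
X = np.column_stack([np.ones_like(complexity), complexity])
(intercept, slope), *_ = np.linalg.lstsq(X, latency, rcond=None)

print(f"latency ~ {intercept:.3f} + {slope:.3f} * complexity")
```

The actual analysis in the pipeline reports the slope, intercept, p-value, and R-squared, as shown in the comparison tables below.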

9.2.6 Methods and Research Design

9.2.6.1 Experimental Framework

The study uses a between‑systems comparative design. Two AI systems are constructed: - LLM-Agent — a system capable of multi‑step planning, tool use, API calls, and limited error recovery. It can autonomously break down tasks, select tools, and retry operations when failures occur. - Baseline AI — a non‑agentic model restricted to single‑prompt, single‑response interactions. It cannot use tools, plan, or recover from errors. It behaves like a standard chat assistant. - Both systems are evaluated on identical tasks to isolate the effect of agentic capabilities.

9.2.6.2 Task Selection

Tasks are chosen to require reasoning, multi-step execution, or external information retrieval. Each task is designed so that: - The LLM-Agent system can leverage planning and tools. - The baseline system must attempt the same task without those capabilities. - Success, failure, and time-to-completion can be measured consistently.

9.2.6.3 Examples include

  • Multi-step problem solving (e.g., weather + calculation + summarization).
  • API-dependent tasks (e.g., retrieving structured data).
  • Error-prone tasks where retries matter (e.g., handling missing parameters).

9.2.6.4 Research Planning

  • Identify what my main research question would be and create my sub-research questions. All of which evaluate the performance of an LLM-Agent vs non-LLM-Agent (baseline)
  • Select and test my chosen variables.
    • task_complexity
    • api_difficulty
    • total_latency
    • tool_count
    • error_count

9.2.6.5 Tool and Document Design

  • Chose tools that would show multi-step agentic behavior: weather retrieval, calculator functions, and web search.
  • Gather the required API keys and create a .env file. This allows seamless access when executing code without hard-coding the credentials.
  • Create my ‘Agents_Multi_step’ Python file. This holds all core functionality such as calling API keys, tool definitions, LLM-Agent and baseline LLM decision making, log functions, and more.
  • Create my ‘LLM_Agent_pipeline’ Python file. This one holds code for running statistical analysis, creating summary tables, the plots, the legends, and more.

9.2.6.6 How it all comes together

  1. Benchmark Queries (36 total)
  2. Run Agent + Baseline
    • Agent: multi-step, may call 0–many tools
    • Baseline: single LLM call
  3. Write Raw Logs
    • agent_runs.jsonl (one row per tool call)
    • baseline_runs.jsonl (one row per query)
  4. Load JSONL Logs inside the .qmd pipeline
  5. Clean + Aggregate
    • collapse retries
    • compute total latency
    • extract final answers
    • normalize errors
    • count tool usage
    • produce 1 row per query
  6. Export Clean Results to CSV summaries:
    • agentic_results.csv
    • baseline_results.csv
  7. Quarto Loads CSVs for regressions, plots, comparisons, and legends
  8. Generate Final PDF (plots, regressions, comparisons, summaries)

9.2.6.7 Before Diving in

  • 6 questions for each type of query listed:

    • Math
    • Weather
    • Web search
    • Multi-tool
    • Failure-recovery
    • Ambiguous/complex
  • Each question will have its own linear regression plot, a comparison table, an error summary, a summary of the recovery actions used, and 2 legends on the linear regression plot: one for errors, the other for all queried data.

  • HTTP 432 Error = Request Signature Authentication Fails.

  • HTTP 404 Error = Not Found.

  • Recovery actions set in place:

    • The LLM-Agent tries to “fix” the input and try again.
    • Next, it has the capability to switch tools.
    • Lastly, it will just use the baseline LLM. This option is used last.

9.2.6.8 Finally, Some Code

The original benchmark step wipes old .JSONL and .CSV files and reruns all 36 queries through the live Groq, OpenWeather, and Tavily APIs.

For this shared course repository, the render below is shown but not executed. The report uses the saved benchmark artifacts already included in the LLM-Agents/ folder so that it can be rendered without requiring API keys during normal course use.

import os

def reset_logs():
    """Delete all JSONL and CSV log files so the next experiment starts clean."""
    files_to_delete = [
        "agent_runs.jsonl",
        "baseline_runs.jsonl",
        "agent_runs.csv",
        "baseline_runs.csv"
    ]

    for f in files_to_delete:
        if os.path.exists(f):
            os.remove(f)
            print(f"Deleted: {f}")
        else:
            print(f"Not found (skipped): {f}")

# Start fresh
reset_logs()

from Agents_Multi_step import run_full_benchmark_and_save_csv

agentic_stats, baseline_stats = run_full_benchmark_and_save_csv()

Remark: To rerun this benchmark from scratch, set GROQ_API_KEY, OPENWEATHER_API_KEY, and TAVILY_API_KEY, then re-enable evaluation for the code chunk above.

9.2.6.9 Does the complexity (# of tool calls required to complete a task) predict total latency?

import sys
from pathlib import Path

topic_dir = Path("LLM-Agents")
if str(topic_dir.resolve()) not in sys.path:
    sys.path.insert(0, str(topic_dir.resolve()))

from LLM_Agent_pipeline import run_regression_comparison
run_regression_comparison(
    agent_file="LLM-Agents/agent_runs.jsonl",
    baseline_file="LLM-Agents/baseline_runs.jsonl",
    predictor="task_complexity",
    outcome="total_latency"
);

=== Comparison Table ===
              Metric  LLM-Agent  Baseline
0  Slope (predictor)     0.2258    0.0000
1          Intercept     0.2456    0.2828
2    p-value (slope)     0.0476       NaN
3          R-squared     0.1207   -0.0000
4      Adj R-squared     0.0924   -0.0000
5                  N    33.0000   36.0000

=== LLM-Agent Error Summary (grouped) ===
error_group
HTTP 432             51
Syntax error         23
HTTP 404             11
Invalid character     7
Invalid number        2
Name: count, dtype: int64

=== LLM-Agent Recovery Actions Used ===
recovery_action
replan         92
no_recovery     2
Name: count, dtype: int64

9.2.6.10 Does the number of tools (actual # of tool calls the agent executed) predict total task latency?

run_regression_comparison(
    agent_file="LLM-Agents/agent_runs.jsonl",
    baseline_file="LLM-Agents/baseline_runs.jsonl",
    predictor="tool_count",
    outcome="total_latency"
);

=== Comparison Table ===
              Metric  LLM-Agent  Baseline
0  Slope (predictor)     0.1229    0.0000
1          Intercept     0.1196    0.2828
2    p-value (slope)     0.0001       NaN
3          R-squared     0.3905   -0.0000
4      Adj R-squared     0.3708   -0.0000
5                  N    33.0000   36.0000

=== LLM-Agent Error Summary (grouped) ===
error_group
HTTP 432             51
Syntax error         23
HTTP 404             11
Invalid character     7
Invalid number        2
Name: count, dtype: int64

=== LLM-Agent Recovery Actions Used ===
recovery_action
replan         92
no_recovery     2
Name: count, dtype: int64

9.2.6.11 Does API difficulty (calculator: 1, web search: 2, weather: 3) predict total latency?

run_regression_comparison(
    agent_file="LLM-Agents/agent_runs.jsonl",
    baseline_file="LLM-Agents/baseline_runs.jsonl",
    predictor="api_difficulty",
    outcome="total_latency"
);

=== Comparison Table ===
              Metric  LLM-Agent  Baseline
0  Slope (predictor)     0.2344    0.0000
1          Intercept    -0.2216    0.2828
2    p-value (slope)     0.0004       NaN
3          R-squared     0.3405   -0.0000
4      Adj R-squared     0.3192   -0.0000
5                  N    33.0000   36.0000

=== LLM-Agent Error Summary (grouped) ===
error_group
HTTP 432             51
Syntax error         23
HTTP 404             11
Invalid character     7
Invalid number        2
Name: count, dtype: int64

=== LLM-Agent Recovery Actions Used ===
recovery_action
replan         92
no_recovery     2
Name: count, dtype: int64

9.2.7 Conclusion

The comparison between LLM-Agents and a baseline LLM is clear. This data shows how multi-step reasoning, tool use, recovery actions, and API use can change how effective an LLM can be. You can clearly see how the LLM-Agent is able to handle errors, self-recover, access API keys, and perform multi-step reasoning, and how all of that helps predict total latency. I hope I showed just how effective an LLM-Agent can be. Which one would you choose?

9.2.8 For those wanting more

  • https://www.vellum.ai/llm-leaderboard?utm_source=bing&utm_medium=organic (compares top LLM models and LLM-Agents)

  • https://www.thinkstack.ai/blog/what-are-llm-agents/

9.2.9 Disclaimer

Remember to use any AI tool responsibly. AI is a wonderful tool, but it can be very easy to over-rely on it! Do not blindly trust the AI! Use those critical problem-solving skills on hand!