# The Limits of Pure Language Models

### The Tokenization Problem

The fundamental issue with using LLMs for market analysis begins at the tokenization level. When processing numerical data, LLMs split numbers into tokens according to character patterns, not numerical value. Consider a simple price sequence:

$$
P = \{19857.32,\ 19857.33,\ 19857.34\}
$$

To an LLM, this might be tokenized as:

```
['19', '857', '.', '32'], ['19', '857', '.', '33'], ['19', '857', '.', '34']
```

This tokenization obscures the numerical relationships that are crucial for market analysis. The model has no inherent understanding that these values form a monotonically increasing sequence with a constant difference of 0.01. Instead, it must try to reconstruct this understanding through pattern matching across tokens.
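A minimal sketch of this contrast follows. `toy_tokenize` is a hypothetical helper written only to reproduce the split shown above; it is not the tokenizer of any real model, and actual tokenizers may split these strings differently.

```python
from decimal import Decimal

def toy_tokenize(price: str) -> list[str]:
    """Hypothetical splitter that mimics the illustrative split above."""
    whole, _, frac = price.partition(".")
    # Chunk the integer part into groups of up to three digits, from the right.
    head = len(whole) % 3
    chunks = [whole[:head]] if head else []
    chunks += [whole[i:i + 3] for i in range(head, len(whole), 3)]
    return chunks + ([".", frac] if frac else [])

prices = ["19857.32", "19857.33", "19857.34"]

# Token view: each price becomes a list of unrelated string fragments.
tokens = [toy_tokenize(p) for p in prices]

# How many token positions differ between consecutive prices.
changed_tokens = [
    sum(a != b for a, b in zip(t1, t2))
    for t1, t2 in zip(tokens, tokens[1:])
]

# Numeric view: exact decimal arithmetic recovers the constant step directly.
values = [Decimal(p) for p in prices]
diffs = [b - a for a, b in zip(values, values[1:])]

print(tokens[0])        # fragments of the first price
print(changed_tokens)   # token positions that changed per step
print(diffs)            # numeric step between consecutive prices
```

In the token view, the only signal is that one fragment changed from one string to another; the size and sign of the move are available only in the numeric view.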

### Computational Inefficiency

The attention mechanism in transformer-based LLMs, while powerful for natural language, becomes computationally inefficient for numerical analysis:

$$
\text{Complexity} = O(n^2 d)
$$

Here n is the sequence length and d is the embedding dimension. Because the cost grows with the square of n, doubling the number of tokens roughly quadruples the attention cost. For high-frequency market data, this quadratic complexity becomes prohibitive: a single day of minute-level data for multiple market indicators can easily exceed practical processing limits.
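The scaling can be made concrete with a back-of-the-envelope calculation. Every constant below is an assumption chosen for illustration (a 390-minute session, four tokens per value as in the split shown earlier, and an embedding dimension of 1024); none of them describes a specific model or market.

```python
def attention_cost(n_tokens: int, d_model: int) -> int:
    """Leading-order operation count for one self-attention layer: n^2 * d."""
    return n_tokens ** 2 * d_model

MINUTES_PER_SESSION = 390  # assumption: one 6.5-hour trading session
TOKENS_PER_VALUE = 4       # assumption: matches the four-token split shown earlier
D_MODEL = 1024             # assumption: illustrative embedding dimension

for n_indicators in (1, 5, 20):
    n = MINUTES_PER_SESSION * n_indicators * TOKENS_PER_VALUE
    cost = attention_cost(n, D_MODEL)
    print(f"{n_indicators:>2} indicators -> {n:>6} tokens -> {cost:.2e} ops per layer")
```

Going from 1 to 20 indicators multiplies the token count by 20 but the per-layer cost by 400.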

### The Hidden State Problem

LLMs lack explicit state management for tracking market conditions. Their understanding of state must be encoded in the attention patterns:

$$
h_t = \text{Attention}(Q_t, K_{1:t}, V_{1:t})
$$

This makes it difficult to maintain consistent tracking of:

* Position sizes
* Portfolio values
* Running statistics
* Risk metrics
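For contrast, explicit state management can be sketched in a few lines. The `MarketState` class and its method names are hypothetical, introduced here only to show what tracking these quantities outside the model might look like; it uses Welford's online algorithm for the running mean and variance.

```python
from dataclasses import dataclass

@dataclass
class MarketState:
    """Hypothetical explicit state: position, cash, and running price statistics."""
    position: float = 0.0
    cash: float = 0.0
    n: int = 0          # number of prices observed
    mean: float = 0.0   # running mean of observed prices
    m2: float = 0.0     # running sum of squared deviations (Welford)

    def on_fill(self, qty: float, price: float) -> None:
        # A positive qty is a buy, a negative qty is a sell.
        self.position += qty
        self.cash -= qty * price

    def on_price(self, price: float) -> None:
        # Welford's online update: O(1) time and memory per observation.
        self.n += 1
        delta = price - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (price - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined only once two prices have been seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def portfolio_value(self, price: float) -> float:
        return self.cash + self.position * price

state = MarketState(cash=1000.0)
state.on_fill(2, 100.0)
for p in (1.0, 2.0, 3.0):
    state.on_price(p)
```

Each update is a constant-time operation on a handful of numbers, and the state is exact and inspectable at every step rather than implied by attention over a token history.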

### Temporal Understanding Limitations

Market data has explicit temporal structure that LLMs struggle to capture:

$$
\begin{aligned}
\text{Autocorrelation:}\quad & R(\tau) = \mathbb{E}[(X_t - \mu)(X_{t+\tau} - \mu)] \\
\text{Volatility clustering:}\quad & \sigma_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2 + \beta_1 \sigma_{t-1}^2
\end{aligned}
$$

These temporal dependencies require specialized architectures that can:

1. Maintain explicit time awareness
2. Process multiple timeframes simultaneously
3. Capture regime changes
4. Model temporal dependencies directly
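Both quantities above are short, direct computations once the data is numeric. The sketch below estimates R(τ) as written (unnormalized, using a sample average in place of the expectation) and runs the volatility recursion forward. The function names and the parameter values in the usage lines are assumptions for illustration, not fitted values.

```python
def autocov(x: list[float], lag: int) -> float:
    """Sample estimate of R(lag) = E[(X_t - mu)(X_{t+lag} - mu)]."""
    n = len(x)
    mu = sum(x) / n
    pairs = n - lag
    return sum((x[t] - mu) * (x[t + lag] - mu) for t in range(pairs)) / pairs

def garch_variance(returns: list[float], alpha0: float, alpha1: float,
                   beta1: float, sigma2_init: float) -> list[float]:
    """Iterate sigma_t^2 = alpha0 + alpha1 * r_{t-1}^2 + beta1 * sigma_{t-1}^2."""
    sigma2 = [sigma2_init]
    for r_prev in returns:
        sigma2.append(alpha0 + alpha1 * r_prev ** 2 + beta1 * sigma2[-1])
    return sigma2

# Illustrative inputs only.
r0 = autocov([1.0, 2.0, 3.0, 4.0], lag=0)
path = garch_variance([1.0, 0.0], alpha0=0.5, alpha1=0.25, beta1=0.5, sigma2_init=1.0)
print(r0, path)
```

The recursion carries the previous variance forward explicitly, which is the kind of direct temporal dependency the list above refers to.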

### Context and Causality

LLMs process market data as a sequence of tokens without understanding causality:

$$
p(x_t \mid x_{1:t-1}) \neq p(x_t \mid \text{Relevant}(x_{1:t-1}))
$$

This leads to:

* Spurious correlations
* Inability to distinguish cause from effect
* Poor handling of regime changes
* Limited understanding of market microstructure
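One way spurious correlation arises can be shown with a small, hand-constructed example. The two series below are made-up numbers, not market data: both drift upward, so their levels look strongly related, yet their step-to-step changes move in opposite directions. The `pearson` helper is a plain implementation of the Pearson correlation coefficient.

```python
from math import sqrt

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Made-up series that share nothing except an upward trend.
a = [1.0, 3.0, 4.0, 6.0, 7.0, 9.0]
b = [2.0, 3.0, 5.0, 6.0, 8.0, 9.0]

# Step-to-step changes remove the shared trend.
da = [v2 - v1 for v1, v2 in zip(a, a[1:])]
db = [v2 - v1 for v1, v2 in zip(b, b[1:])]

level_corr = pearson(a, b)     # strongly positive: driven by the common trend
change_corr = pearson(da, db)  # strongly negative: the steps move oppositely
print(level_corr, change_corr)
```

A model that matches patterns in the raw levels would see two closely related series; the differenced view shows the opposite relationship.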

### Real-world Impact

These limitations manifest in practical trading scenarios:

1. **Delayed Reactions**: The processing overhead leads to missed opportunities
2. **Inconsistent Analysis**: The same market condition can yield different interpretations
3. **Poor Risk Management**: Inability to maintain consistent risk metrics
4. **Resource Inefficiency**: High computational cost for basic market analysis

The solution isn't to abandon LLMs entirely, but to recognize their appropriate role within a broader market analysis framework. They excel at:

* Processing market news
* Sentiment analysis
* Strategy description
* Explaining complex market events

But they should not be the primary engine for:

* Price prediction
* Risk calculation
* Portfolio optimization
* Trade execution

