We're Inventing a Language for Computers

Cheney Li
We are essentially inventing a language for computers. Much like viewing a Monet painting, we often grasp the full picture of today's developments only from a god-like perspective many years later.

Recently I delved a bit into CNNs and found that, like RNNs, they can handle discrete language. One can also see many shadows of CNNs within transformers. The core feature of the transformer is the attention mechanism, which captures and computes the relationships across the entire context at once, rather than processing one token at a time as an RNN does. The model is not trained merely on text embeddings; it uses attention to relate each token to all the tokens that precede it. Interestingly, these contextual relationships are represented by three vectors, q, k, and v, rather than the intuitive two. Multiplying q with k yields the attention weights, which then scale the contribution of each v in the output. The design seems to suggest that the meaning of each word in language consists of two parts: the information it inherently carries (v) and the information carried by its contextual relationships. These contextual relationships, in turn, are determined by pointing and pointed-at roles. Consider the following two sentences:

1. Lian Po suffered a great defeat.
2. Lian Po suffered a great defeat against the Qin army.

The meaning of "great defeat" depends on the Qin army, while "great defeat" itself supplies the relationship that Lian Po and the Qin army are opposing sides in battle. The original intention behind q, k, and v is to represent some structural aspect of how text conveys information, much as a CNN abstracts the features of an image through its filters. In a transformer, the q, k, and v of a single attention head are a structure designed by hand for text.
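The mechanism above can be sketched in a few lines. This is a minimal single-head version, ignoring masking, batching, and training; the projection matrices here are random stand-ins for learned parameters:

```python
import numpy as np

def single_head_attention(Q, K, V):
    """q·k produces attention weights; the output is a weighted sum of v."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each query points at each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V  # each output token mixes the values v of all tokens

# Toy example: 3 tokens, 4-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                          # token embeddings
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))  # stand-ins for learned weights
out = single_head_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (3, 4): one contextualized vector per token
```

Note that everything before the softmax is plain matrix multiplication; the whole step is a weighted average of the v vectors, with the weights decided by how well each q matches each k.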
They sit after the embedding step, which converts text tokens into computable vectors, and before the feed-forward networks (FFNs) that add non-linearity. There is also multi-head attention, which captures features at different levels and seems to correspond directly to the stacked convolutional filters in CNNs. Three points are worth discussing here:

1. Like convolution, the attention step is entirely linear: essentially a fancier version of weighted averaging, achieved through matrix multiplication. I suspect this is because hand-designed features are meant to be simple and clear, and most such features are linear. Humans cannot intuitively grasp non-linear relationships well; our intuition about numbers comes from counting and geometry.

2. The embedding step is unnecessary when CNNs process images. One could argue that the enormous impact of LLMs is, in some sense, a historical gift. Language is already a highly dense representation of information, and when we train embeddings so that distances in the space capture the meanings of words, we are building on thousands of years of evolution in language, civilization, and knowledge. For instance, even without writing a single formula, the word "calculus" already carries a great deal of meaning about the underlying mathematical relationships. This vast amount of high-density information is the foundation of transfer learning and emergent capabilities.

3. Speaking of which, we haven't yet discussed neural networks themselves. As mentioned in a previous article, neural networks are essentially a kind of information juicer. From a theoretical perspective, however, I believe the manifold hypothesis holds water. For a binary classification task, what we ultimately need to do is separate points in an n-dimensional space with an (n-1)-dimensional surface.
In reality, most data clusters on some surface in high-dimensional space because of how it is generated: handwritten strokes are continuous in space, and time series like stock prices are continuous in time. To cut as cleanly as possible, we want to unfold this curved surface while reducing its dimension. If we decompose a neural network into linear and non-linear transformations, the linear part acts much like PCA, reducing dimension by projecting onto a plane spanned by learned combinations of vectors; the non-linear part stretches and distorts the surface to flatten it as much as possible.

Since the release of "Attention Is All You Need," there seem to have been no revolutionary updates to the transformer architecture. Improvements in model performance have instead come largely from the powerful GPUs of mining tycoons, the resources of large companies, and the remarkable efforts of ML systems engineers. As analyzed above, however, the essence of deep learning and AI still lies in representation: how we present information to computers. The language revolution thousands of years ago gave humanity knowledge, society, and history, and thus civilization; a new civilization now requires us to use a computable language to represent all that can be expressed, compress it through computation, and find ways to harness it through structure. Perhaps, as with training embedding models, the final representation can be fitted entirely through massive computation. For now, though, collecting the states of all atoms on Earth and stuffing them into a neural network does not yield all the laws of physics. Yet we can describe the simple harmonic motion of a pendulum with just an angular frequency and an initial position, or planetary orbits with Kepler's law of equal areas.
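The pendulum point can be made concrete. Under the small-angle approximation, and assuming the pendulum is released from rest, the entire trajectory is determined by just two numbers: the initial angle theta0 and the angular frequency omega = sqrt(g/L). A tiny sketch:

```python
import numpy as np

def pendulum_angle(theta0, omega, t):
    """Small-angle pendulum released from rest:
    the whole trajectory is theta0 * cos(omega * t)."""
    return theta0 * np.cos(omega * t)

g, L = 9.81, 1.0                 # gravity (m/s^2) and pendulum length (m)
omega = np.sqrt(g / L)           # angular frequency, sqrt(g/L)
period = 2 * np.pi / omega
t = np.linspace(0.0, period, 5)  # sample one full swing
print(pendulum_angle(0.1, omega, t))
```

Two scalars reproduce the entire motion. This is exactly the kind of minimal-feature representation that a physical law provides, and that brute-force fitting of raw state data struggles to recover.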
The discovery of physical laws is itself a human endeavor to find nearly linear relationships, representable with a minimal number of features, within high-dimensional data. But at least for now, simply stuffing YouTube pixel data into a neural network does not seem to be a viable path. From this perspective, do transformers still represent a path toward a universal model? If not, what other methods might be needed? Could it be diffusion, or energy-based approaches?