Normalization Makes Activation Geometry Imaginable

High-dimensional activation spaces are hard to picture directly. Here we make Gemma 2 [2] layer-12 residual activations easier to think about by discarding magnitude and studying only their directions. This is motivated by RMSNorm [1]: downstream submodules, including attention and MLP blocks, see residual-stream activations only after normalization. RMSNorm is not identical to L2 normalization, because it also applies learned coordinate-wise weights. For a fixed hidden dimension d, however, its normalizing factor is proportional to the L2 norm, so before the weights are applied it differs from L2 normalization only by a constant factor of \(\sqrt{d}\), as shown in Eq. 1.

\[ \operatorname{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2} = \frac{\lVert x \rVert_2}{\sqrt{d}}, \qquad \frac{x}{\operatorname{RMS}(x)} = \sqrt{d}\,\frac{x}{\lVert x \rVert_2} \tag{1} \]
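Eq. 1 is easy to check numerically. The following is a minimal sketch in plain Python; the function names `rms_normalize` and `l2_normalize` and the toy vector are illustrative assumptions, and the learned RMSNorm weights are deliberately left out.

```python
import math

def rms_normalize(x):
    """Divide x by its root mean square (RMSNorm without the learned weights)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def l2_normalize(x):
    """Divide x by its Euclidean norm, giving a unit vector."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

# Toy "activation" with hidden dimension d = 4 (hypothetical values).
x = [3.0, -1.0, 2.0, 0.5]
d = len(x)

rms_version = rms_normalize(x)
l2_version = [math.sqrt(d) * v for v in l2_normalize(x)]

# The two sides of Eq. 1 should agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(rms_version, l2_version))
print(max_gap)
```

The two normalizations differ only by the constant factor \(\sqrt{d}\), so they point in the same direction.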

We therefore L2-normalize every activation vector. After this transformation, all points lie on the unit hypersphere, so distances become angular: two activations are close when their cosine similarity is high.
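The step from unit vectors to angular distances can be spelled out. Writing \(\hat{u}\) and \(\hat{v}\) for two L2-normalized activations and \(\theta\) for the angle between them (notation introduced here for illustration):

\[ \lVert \hat{u} - \hat{v} \rVert_2^2 = \lVert \hat{u} \rVert_2^2 + \lVert \hat{v} \rVert_2^2 - 2\,\hat{u} \cdot \hat{v} = 2 - 2\cos\theta \]

Euclidean distance on the unit sphere is therefore a decreasing function of cosine similarity. As a worked example, a cosine similarity of 0.7 corresponds to \(\theta = \arccos(0.7) \approx 45.6^\circ\) and a Euclidean distance of \(\sqrt{0.6} \approx 0.77\).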

To inspect this angular geometry, we build a graph with one node per activation and an edge whenever two normalized activations have cosine similarity of at least 0.7. The resulting component size spectrum is shown in Figure 1. The largest connected component is huge. Because connectivity is transitive (two activations can be joined through a chain of intermediate neighbors without being similar to each other), it likely contains several substructures that would separate at a higher threshold. From the second component onward, however, the components are small enough to inspect by reading their token contexts. Many of them group activations with recognizable textual roles: Q/A formatting, license boilerplate, code syntax, legal citations, LaTeX fragments, and similar repeated structures.
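The graph construction described above can be sketched with a union-find structure. This is a brute-force illustration over all pairs, not the procedure used for the 1M-activation run (the text does not say how that graph was built, and an all-pairs loop would not scale to it). The function name `threshold_components` and the toy vectors are assumptions.

```python
import math

def l2_normalize(x):
    """Divide x by its Euclidean norm, giving a unit vector."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def threshold_components(vectors, threshold=0.7):
    """Connected components of the graph linking pairs with cosine >= threshold."""
    unit = [l2_normalize(v) for v in vectors]
    parent = list(range(len(unit)))

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            # For unit vectors the dot product is the cosine similarity.
            cosine = sum(a * b for a, b in zip(unit[i], unit[j]))
            if cosine >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(unit)):
        groups.setdefault(find(i), []).append(i)
    # Largest component first, matching the rank ordering used in the text.
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical toy data: three nearby directions and one orthogonal one.
toy = [
    [1.0, 0.0, 0.0],
    [0.9, 0.3, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 2.0],
]
components = threshold_components(toy)
print(components)
```

In this toy example, points 0 and 2 have cosine similarity 0.6, below the threshold, yet they end up in the same component because both are linked to point 1. The same chaining effect is one reason a giant component can hide finer substructure.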

[Figure 1: component size spectrum for cosine-threshold connected components]
Figure 1. Component sizes reveal how the normalized activation graph breaks into one giant cluster plus many smaller components. Left: log-log component size spectrum for the cosine-threshold graph on 1M L2-normalized activations. Apart from the giant rank-1 component, the remaining component sizes fall off roughly like a power law. Right: zoom-in on the first 20 components after removing rank 1; these are the medium-sized components manually inspected below.
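One rough way to probe a "roughly like a power law" reading of a rank-size plot is to fit a straight line to log size against log rank. The sketch below is only a diagnostic under stated assumptions: `loglog_slope` and `first_rank` are hypothetical names, the data is synthetic, and an ordinary least-squares fit on log-log axes is not a rigorous power-law test.

```python
import math

def loglog_slope(sizes, first_rank=1):
    """Least-squares slope of log(size) against log(rank).

    `first_rank` is the rank assigned to the largest size in `sizes`,
    e.g. 2 if the giant rank-1 component has been removed beforehand.
    """
    ordered = sorted(sizes, reverse=True)
    xs = [math.log(first_rank + k) for k in range(len(ordered))]
    ys = [math.log(size) for size in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic check: sizes proportional to 1 / rank give a slope close to -1.
synthetic = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(synthetic)
print(slope)
```

Whether to keep the original rank numbering after dropping the giant component is a modeling choice, which is why `first_rank` is exposed as a parameter.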

Cosine 0.7 Component Structure

At cosine threshold 0.7, the graph has 136,410 connected components: 19,916 non-singleton components and 116,494 singletons. The largest component contains 826,000 activations.

Top 20 full-run component sizes: 826000, 276, 259, 210, 187, 185, 110, 107, 73, 67, 65, 60, 58, 53, 50, 47, 47, 46, 46, 44
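Counts of this kind (total components, singletons, non-singletons, top sizes) can be derived from one component label per activation. The sketch below assumes such a label list exists. `summarize_components` and the toy labels are hypothetical, and the numbers it prints are unrelated to the run reported here.

```python
from collections import Counter

def summarize_components(labels, top_k=20):
    """Summarize component sizes given one component label per activation."""
    sizes = sorted(Counter(labels).values(), reverse=True)
    singleton = sum(1 for size in sizes if size == 1)
    return {
        "components": len(sizes),
        "singleton": singleton,
        # Every component with at least two members, including the largest.
        "non_singleton": len(sizes) - singleton,
        "top_sizes": sizes[:top_k],
    }

# Hypothetical labels for ten activations.
labels = ["a", "a", "a", "a", "b", "b", "c", "d", "b", "e"]
summary = summarize_components(labels, top_k=3)
print(summary)
```

Note that in this sketch the non-singleton count includes the largest component, so the total is always the sum of the singleton and non-singleton counts.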

Medium Components, Ranks 2-20

Each card shows a text context from the dataset. The highlighted span is the exact token whose layer-12 activation belongs to the component. Ranks 2-20 are large enough to show repeated structure, but small enough to inspect manually by reading representative contexts. The labels in the card headers are provisional guesses, not final component names.
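The per-card statistics (point count, unique documents, top token texts) amount to a small aggregation over a component's members. The sketch below assumes each member is stored as a `(doc_id, position, token_text)` tuple. That record layout, the name `summarize_card`, and the toy members are illustrative assumptions, not the format used to produce the cards.

```python
from collections import Counter

def summarize_card(members, top_k=10):
    """Summarize one component from (doc_id, position, token_text) records."""
    unique_docs = len({doc for doc, _pos, _tok in members})
    token_counts = Counter(tok for _doc, _pos, tok in members)
    return {
        "points": len(members),
        "unique_docs": unique_docs,
        # Most frequent token texts first, as in the "Top token texts" rows.
        "top_tokens": token_counts.most_common(top_k),
    }

# Hypothetical members of a small component.
members = [
    (1, 5, "License"),
    (1, 9, "License"),
    (2, 3, "license"),
    (3, 7, "License"),
]
card = summarize_card(members)
print(card)
```

A component whose unique-document count is much smaller than its point count, like Rank 5 and Rank 20 below, is dominated by a few documents rather than by a pattern shared across the corpus.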

Rank 2 · 276 points · Q/A prompt separator colon
Unique docs: 276
Top token texts: ":" ×276

Representative Contexts

doc 13 · pos 2

<bos>Q: ¿Porqué en este loop de

doc 33 · pos 2

<bos>Q: TextView Not centered in app but centered

doc 37 · pos 2

<bos>Q: Python Segmentation Fault? First off

doc 61 · pos 2

<bos>Q: StAX and arraylist java

doc 75 · pos 2

<bos>Q: A japanese saying "一をいう

doc 78 · pos 2

<bos>Q: Doctrine2 entity default value for Many

Rank 3 · 259 points · newline after Q/A prompt header
Unique docs: 259
Top token texts: (whitespace) ×259

Representative Contexts

doc 61 · pos 3

<bos>Q: StAX and arraylist java I

doc 80 · pos 3

<bos>Q: Not populating tableview with structure array

doc 149 · pos 3

<bos>Q: Issue with jquery remove method on IE7

doc 233 · pos 3

<bos>Q: Is a low number of members in a

doc 290 · pos 3

<bos>Q: Pass values to IN operator in a Work

doc 336 · pos 3

<bos>Q: Identify slow solr queries There are

Rank 4 · 210 points · software license boilerplate terms
Unique docs: 119
Top token texts: "License" ×110, "Public" ×40, "General" ×37, "license" ×21, "licensing" ×1, "License" ×1

Representative Contexts

doc 183 · pos 80

to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of

doc 205 · pos 39

HTML5 UP /// html5up.net | @ajlkn /// Free for personal and commercial use under the CCA 3.0 license (html5up.net/license

doc 294 · pos 184

useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. *

doc 294 · pos 185

, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * *

doc 294 · pos 203

FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program

doc 353 · pos 37

2010-2013 Amazon.com, Inc. or its affiliates. All Rights Reserved. * * Licensed under the Apache License, Version 2.0 (the

Rank 5 · 196 points · JSON unicode-escape fragments
Unique docs: 12
Top token texts: "0" ×54, "4" ×35, "u" ×27, "\" ×21, "3" ×18, "e" ×11, "a" ×5, "2" ×5, "8" ×4, "9" ×4

Representative Contexts

doc 379 · pos 269

Dilolo" ], "DAY": [ "Lumingu", "Nkodya", "Nd\u00e0ay\u00

doc 379 · pos 286

ingu", "Nkodya", "Nd\u00e0ay\u00e0", "Ndang\u00f9",

doc 379 · pos 287

", "Nkodya", "Nd\u00e0ay\u00e0", "Ndang\u00f9", "

doc 379 · pos 288

"Nkodya", "Nd\u00e0ay\u00e0", "Ndang\u00f9", "Nj

doc 379 · pos 299

\u00e0ay\u00e0", "Ndang\u00f9", "Nj\u00f2wa", "

doc 379 · pos 310

e0", "Ndang\u00f9", "Nj\u00f2wa", "Ng\u00f2vya",

Rank 6 · 110 points · single digit tokens in structured numeric contexts
Unique docs: 56
Top token texts
0 22 2 16 1 15 4 11 6 10 5 8 9 8 3 7 7 6 6

Representative Contexts

doc 107 · pos 65

of Cannon Falls at the junction of State Highway 19 (MN 19) and County 7 Boulevard. It is within ZIP code 55089 based in Welch. Nearby

doc 195 · pos 117

. Taraboura features a closed arena where Olympiada Patras plays. It is located at 24 Tisonas Street with the postcode 26623. Its capacity is

doc 195 · pos 118

Taraboura features a closed arena where Olympiada Patras plays. It is located at 24 Tisonas Street with the postcode 26623. Its capacity is

doc 202 · pos 378

personal checks can also be mailed to Gabbard at: Tulsi Now, PO Box 75255, Kapolei, HI, 96707. Here’s

doc 202 · pos 379

checks can also be mailed to Gabbard at: Tulsi Now, PO Box 75255, Kapolei, HI, 96707. Here’s a

doc 294 · pos 244

; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307

Rank 7 · 107 points · single digit tokens in legal citation contexts
Unique docs: 60
Top token texts: "7" ×18, "1" ×15, "6" ×14, "9" ×11, "5" ×10, "3" ×10, "8" ×9, "2" ×9, "0" ×6, "4" ×5

Representative Contexts

doc 194 · pos 83

.D. 1991732. Supreme Court of Alabama. April 20, 2001. *648 Sherryl Snodgrass Ca

doc 194 · pos 1023

phoned Dr. Giddens, the obstetrician-gynecologist ("Ob/Gyn") on call for Jackson County Hospital that *650 night, to discuss the case. Dr

doc 322 · pos 206

. September 26, 1966. Rehearing denied October 24, 1966. *145 *146 Earl S. Hodges

doc 322 · pos 575

specific negligence theory; that there was error by the court in denying defendant's motion for mistrial because of prejudicial conduct of counsel; that conduct of *147 a juror was prejudicial to defendant;

doc 529 · pos 152

to appeal from an order filed on July 21, 1967, by Judge Robert I.H. Hammerman, sitting *268 in the Criminal Court of Baltimore, denying

doc 529 · pos 592

1 Md. App. 61. However, we note that the lower court found that there was nothing in the testimony of the applicant to indicate *269 that his arrest was illegal.

Rank 8 · 90 points · license condition and disclaimer wording
Unique docs: 30
Top token texts: "following" ×21, "the" ×19, "this" ×12, "of" ×11, "," ×9, "and" ×8, "list" ×5, "nor" ×2, "or" ×1, "list" ×1

Representative Contexts

doc 1070 · pos 140

com/protocol-buffers/ // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions are // met: //

doc 1070 · pos 160

with or without // modification, are permitted provided that the following conditions are // met: // // * Redistributions of source code must retain the above copyright // notice, this list

doc 1070 · pos 194

copyright // notice, this list of conditions and the following disclaimer. // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer

doc 1070 · pos 195

// notice, this list of conditions and the following disclaimer. // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer //

doc 1070 · pos 199

this list of conditions and the following disclaimer. // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer // in the documentation and

doc 1219 · pos 335

this software // is hereby granted without fee, provided that the above copyright // notice appears in all copies, and that both that copyright notice // and this permission notice appear in supporting documentation. None

Rank 9 · 67 points · scientific markup superscript/subscript markers
Unique docs: 36
Top token texts: "~" ×42, "^" ×18, "c" ×2, "ε" ×2, "n" ×2, "ε" ×1

Representative Contexts

doc 127 · pos 804

10](#mrm27594-bib-0010){ref-type="ref"} Together with its ability to measure simultaneously T~1~ and T~2~, MR

doc 396 · pos 884

numbers and proportions. Baseline characteristics between groups were compared using Student's *t* test or the nonparametric Mann-Whitney test for continuous data and the χ^2^ test for categorical data. Receiver

doc 415 · pos 515

the researchers and resolved through consensus. Searches were then conducted to obtain specific polysaccharide product information: safety (using the search terms: toxicity, NOAEL, LD~50~), composition and structure,

doc 490 · pos 45

deck. The largest industrial application of olefin metathesis today is the synthesis of propylene from ethylene and butenes^[@ref1]^ employing WO~3~ on SiO~2~, a

doc 490 · pos 214

has to admit that there may not be a single answer for all supported oxide catalysts or all olefins. Copéret and Mashima employ Me~4~BTDP to reduce four-coordinate (

doc 490 · pos 231

all olefins. Copéret and Mashima employ Me~4~BTDP to reduce four-coordinate (SurfO)~2~WO~2~ sites on silica in the absence of ole

Rank 10 · 65 points · legal citation ordinal suffix
Unique docs: 56
Top token texts: "d" ×65

Representative Contexts

doc 156 · pos 9

<bos> 58 Cal.App.3d 439 (197

doc 156 · pos 377

920, 925-926 [101 Cal. Rptr. 568, 496 P.2d 480].) In March

doc 322 · pos 9

<bos> 75 Ill. App.2d 144 (196

doc 1958 · pos 10

<bos> 718 S.E.2d 145 (201

doc 2388 · pos 25

<bos> 45 Md. App. 489 (1980) 413 A.2d 1365 CARLTON

doc 2475 · pos 10

<bos> 299 F.Supp.2d 166 (200

Rank 11 · 64 points · patent section heading marker
Unique docs: 59
Top token texts: "2" ×34, "." ×30

Representative Contexts

doc 403 · pos 127

apparatuses such as an ink-jet printer, a facsimile machine, etc. to jet fluid through a nozzle, and a manufacturing process thereof. 2. Description of the Related Art A print

doc 612 · pos 88

in a mounting case, in which the electro-optical device is accommodated and a projection display apparatus including the electro-optical device encased in the mounting case. 2. Description of Related Art In general

doc 645 · pos 53

for a vehicle having a double clutch transmission (DCT), and more particularly, to a technology for improving a response to a speed change during a kickdown. 2. Description of Related Art Unlike an

doc 645 · pos 54

a vehicle having a double clutch transmission (DCT), and more particularly, to a technology for improving a response to a speed change during a kickdown. 2. Description of Related Art Unlike an automatic

doc 718 · pos 35

Field of the Invention The present invention relates to a manufacturing method of a semiconductor device, which forms semiconductor integrated circuit patterns by using charged particle beams. 2. Description of the Related Art A lith

doc 723 · pos 34

. Field of the Invention This invention relates in general to fuel cells and electrical motors and, more particularly, to a fuel cell powered electrical motor. 2. Description of the Related Prior Art The

Rank 12 · 58 points · copyright notice starts
Unique docs: 58
Top token texts: "Copyright" ×48, "Copyright" ×10

Representative Contexts

doc 262 · pos 2

<bos>// Copyright 2000-20

doc 374 · pos 5

<bos>package network // Copyright (c) Microsoft and contributors.

doc 435 · pos 8

<bos>/*####################################################### * Copyright (c) 2014

doc 828 · pos 29

<bos>/* * TupleTypeUtil.java * * This source file is part of the FoundationDB open source project * * Copyright 2015-20

doc 969 · pos 4

<bos>/* * Copyright (c) 2017

doc 1160 · pos 3

<bos>/* Copyright 2018 The Kubernetes Authors

Rank 13 · 53 points · warranty disclaimer phrase
Unique docs: 37
Top token texts: "FITNESS" ×19, "FOR" ×16, "A" ×13, "NESS" ×3, "FIT" ×2

Representative Contexts

doc 969 · pos 126

* This code is distributed in the hope that it will be useful, but WITHOUT * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or * FITNESS FOR A PARTICULAR PURPOSE. See the GNU

doc 1070 · pos 302

ERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR // A PARTICULAR PURPOSE ARE DISCLAIM

doc 1613 · pos 146

PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. */ #

doc 1613 · pos 147

"AS IS" AND WITHOUT ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. */ #include

doc 1704 · pos 157

* This code is distributed in the hope that it will be useful, but WITHOUT * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or * FITNESS FOR A PARTICULAR PURPOSE. See the

doc 1704 · pos 159

This code is distributed in the hope that it will be useful, but WITHOUT * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General

Rank 14 · 51 points · social handles and mentions
Unique docs: 42
Top token texts: "(@" ×25, "@" ×18, "@" ×5, "/" ×3

Representative Contexts

doc 631 · pos 914

example TreeMap, SortedMap, or any class that implements the Map interface), you will always get a HashMap out of it. A: The answer by @David Wasser is right on in terms of

doc 702 · pos 883

interoperability.” Source: Facebook Is Taking On Zoom With a 50-Person Video Chat Feature Please follow my instagram: http://instagram.com/arminhamidian67 Facebook

doc 1701 · pos 362

of five times a Trump casino property filed for bankruptcy. Paul Milo may be reached at pmilo@njadvancemedia.com. Follow him on Twitter@PaulMilo2. Find NJ.com

doc 2167 · pos 57

, like this in Xcode: I am creating my own IDE and would like to know if there is there a view for this? A: As @TheNextman said, I need NS

doc 2199 · pos 880

report has performed a signal service: Revealing the PAFMM & its true nature is important to deterring its future use. — Andrew Erickson 艾立信 (@AndrewSErickson) August 16

doc 2404 · pos 465

it’s not patriotic… ok got it. 🙄 #morons https://t.co/wkxDRZs6bM — Donald Trump Jr. (@DonaldJTrumpJr) July 3

Rank 15 · 50 points · patent field-of-invention heading
Unique docs: 50
Top token texts: "of" ×28, "the" ×22

Representative Contexts

doc 95 · pos 4

<bos>1. Field of the Invention The present invention relates to

doc 153 · pos 5

<bos>1. Field of the Invention The present invention relates to multi

doc 409 · pos 5

<bos>1. Field of the Invention The invention relates generally to a

doc 1097 · pos 4

<bos>1. Field of the Invention The present invention relates to

doc 1575 · pos 4

<bos>1. Field of the Invention The present application relates to

doc 1624 · pos 4

<bos>1. Field of the Invention This invention relates in general

Rank 16 · 47 points · copyright markers
Unique docs: 47
Top token texts: ")" ×45, "©" ×2

Representative Contexts

doc 435 · pos 38

* Copyright (c) 2014 Jeff Martin * Copyright (c) 2015 Pedro Lafuente * Copyright (c) 2017-20

doc 447 · pos 63

GF www.4thEstate.co.uk This eBook first published in Great Britain by 4th Estate in 2019 Copyright © Tash Aw 2019

doc 643 · pos 29

<bos>/* * linux/include/asm-arm/proc-armv/processor.h * * Copyright (C) 1996-19

doc 1488 · pos 14

<bos>/* Mantis PCI bridge driver Copyright (C) Manu Abraham (abraham.manu@

doc 1514 · pos 15

<bos>/** * Durandal 2.0.1 Copyright (c) 2012 Blue Spire Consulting

doc 1583 · pos 9

<bos>/** * Copyright (c) Rich Hickey. All rights reserved.

Rank 17 · 46 points · Go import-block indentation tabs
Unique docs: 24
Top token texts: (whitespace) ×46

Representative Contexts

doc 258 · pos 25

<bos>package x509util import ( "crypto/rand" "crypto/rsa" "crypto/x509"

doc 374 · pos 186

generated by Microsoft (R) AutoRest Code Generator. // Changes may cause incorrect behavior and will be lost if the code is regenerated. import ( "context" "github.

doc 374 · pos 191

) AutoRest Code Generator. // Changes may cause incorrect behavior and will be lost if the code is regenerated. import ( "context" "github.com/Azure/go

doc 374 · pos 206

be lost if the code is regenerated. import ( "context" "github.com/Azure/go-autorest/autorest" "github.com/Azure/go

doc 781 · pos 75

command import ( "github.com/go-openapi/errors" "github.com/go-openapi/strfmt" "github.com/go-openapi

doc 2484 · pos 14

<bos>// +build !appengine package mail import ( "bytes" "encoding/

Rank 18 · 46 points · LaTeX subscript/superscript syntax
Unique docs: 23
Top token texts: "_{" ×22, "_" ×7, "^\" ×6, "_{\" ×4, "_\" ×2, "}^{\" ×1, "{" ×1, "^" ×1, "^{" ×1, "}^\" ×1

Representative Contexts

doc 859 · pos 938

, which is defined as $$\label{BorelIntegral} f(g) = \frac{1}{g^\lambda} \, \int_0^\infty {\rm d}u

doc 881 · pos 76

_i)dp_i $$ where $p_i$ is the probability of the $i^{th}$ state and where $ \sum_i p_i = 1 $

doc 935 · pos 255

assumptions upon the distribution of the environment, the existence of a new exponent $\nu\in (0, {1\over 2}]$ such that $\max_{0\le i \le n}

doc 1314 · pos 60

theory, and class of sub-guassian / sub-exponential random variables is of interest. In the literature it gave an inequality as following: $\sup_{p\geq 1} \frac

doc 1314 · pos 86

as following: $\sup_{p\geq 1} \frac{\|X^2\|_p}{p} \leq 2\sup_{p\geq 1} (\frac

doc 1314 · pos 335

sup_{p\geqslant 2} \left(\frac{\|X\|_p}{\sqrt{p}}\right)^2\leqslant 2\,\sup_{p\geqslant 1} \left

Rank 19 · 44 points · Java package/import dot separators
Unique docs: 29
Top token texts: "." ×44

Representative Contexts

doc 183 · pos 215

.apache.stanbol.entityhub.web.reader; import java.io.IOException; import java.io.InputStream; import java.lang.annotation.Annotation; import

doc 183 · pos 235

; import java.io.InputStream; import java.lang.annotation.Annotation; import java.lang.reflect.Type; import java.util.Arrays; import java.

doc 824 · pos 136

.os.Bundle; import android.view.View; import android.widget.DatePicker; import android.widget.EditText; import java.text.SimpleDateFormat; import java.

doc 824 · pos 160

.widget.EditText; import java.text.SimpleDateFormat; import java.util.Calendar; import java.util.Date; import java.util.Locale; public class MainActivity

doc 828 · pos 281

; import com.google.protobuf.ProtocolMessageEnum; import javax.annotation.Nonnull; import javax.annotation.Nullable; import java.math.BigInteger; import java.

doc 1393 · pos 23

<bos>package io.quarkus.it.panache; import java.io.Serializable; import java.util.Objects; import javax.

Rank 20 · 40 points · spam-like generated text
Unique docs: 2
Top token texts: "position" ×1, "serious" ×1, "success" ×1, "certain" ×1, "example" ×1, "portion" ×1, "activity" ×1, "character" ×1, "text" ×1, "wear" ×1

Representative Contexts

doc 791 · pos 199

first pure. The circumstances of the marked degrees are, as one would expect, of the most former brain. The cheap nureflex online with prescription in the position of the aggra was thus such and the

doc 791 · pos 354

is thoroughly enfeebled. Take one tear trunks well as i go for nureflex prescription discounts of you. This may be followed by anatomist, by serious or latin courage, or by above case

doc 791 · pos 370

ureflex prescription discounts of you. This may be followed by anatomist, by serious or latin courage, or by above case of some psychical tumour like the success. The weeks of ohlshausen

doc 791 · pos 430

include first well its place, but personally its polished where can i buy nureflex over the counter in usa as proportioned to narcosis, the eruption of the certain bulk by fevers of the example,

doc 791 · pos 437

personally its polished where can i buy nureflex over the counter in usa as proportioned to narcosis, the eruption of the certain bulk by fevers of the example, the deafness's tion often related

doc 791 · pos 517

ureflex price philippines and keep my observations with me. This nature is the contrast passion of the other, and it free includes fairly a treatment of the pharmacies of portion and tissue. He demonstrated then abundantly that

References

  1. Biao Zhang and Rico Sennrich. Root Mean Square Layer Normalization. arXiv:1910.07467.
  2. Gemma Team. Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118.
  3. Leo Gao et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027.
  4. Neel Nanda. NeelNanda/pile-10k. Hugging Face dataset.