Introduction

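This paper was published at ICLR 2023. After three months of reading about GNNs, every justification for GCN I had come across was based either on empirical performance or on its conceptual origins, as a natural development from CNNs, graph signal processing, and related fields. I could not find theoretical support, so I went through the theory literature and found this paper.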

Main body

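The theoretical derivations are lengthy, so only a brief account is given here. First, many GCN variants rely on attention, such as GAT; this paper is not concerned with attention, only with whether the convolution operator itself is justified. Second, GCNs usually have only 2 to 3 layers because of oversmoothing; since the paper focuses on the effect of the GC layer, it sets oversmoothing aside and considers only two-layer and three-layer networks. Finally, since GCN tends to perform worse on heterophilous graphs than on homophilous ones, the paper introduces $\xi = \mathrm{sgn}(p - q)$ into the update, where $p$ is the probability of an edge between nodes of the same class and $q$ is the probability of an edge between nodes of different classes.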

Contributions

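On the theory side, the paper uses a synthetic dataset, referred to as XOR data, and studies binary classification on it. The node features are sampled from a Gaussian mixture (chosen so that the two classes are not linearly separable). The main results are:

First, networks that use graph information are shown to outperform methods that do not. Compared with methods that ignore the graph, a single GC layer lets a multi-layer network classify the nodes in a wider regime: in binary classification, the distance between the two class means that the GC network needs is only a factor $\frac{1}{\sqrt[4]{n (p + q)}}$ of what a network without the graph needs. Furthermore, in denser graphs this factor can be improved to $\frac{1}{\sqrt[4]{n}}$.
Second, for multi-layer networks, using several GC operators performs better than using a single one; moreover, for a fixed number of GC operators, different placements of them give networks with similar performance.
Third, experiments on real-world and large-scale datasets validate the theory and show the performance trends of various GC combinations across layers, under both homophily and heterophily.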

Preliminaries

Data model

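This is the dataset on which the proofs rely. Its setup and notation are as follows:
Graph basics: $n, d$ are positive integers denoting the number of nodes and the feature dimension. $\varepsilon_{1}, \dots, \varepsilon_{n} \sim \mathrm{Ber}(1/2)$ are the node class labels. $C_{b} = \{i \in [n] \mid \varepsilon_{i} = b\}$ for $b \in \{0, 1\}$ is the set of nodes in class $b$.
Node features: $X \in \mathbb{R}^{n \times d}$ is the feature matrix and $X_{i}$ is the feature of node $i$. Given two vectors $\mu, \nu \in \mathbb{R}^{d}$ and signs $\eta_{1}, \dots, \eta_{n} \sim \mathrm{Ber}(1/2)$, $X_{i} \sim N\big((2 \eta_{i} - 1) ((1 - \varepsilon_{i}) \mu + \varepsilon_{i} \nu), \sigma^{2} I\big)$. In other words, class 0 is centered at $\pm\mu$ and class 1 at $\pm\nu$, so $\mu, \nu$ act as the centers of the class features. This is written $X \sim \text{XOR-GMM}(n, d, \mu, \nu, \sigma^{2})$.
Graph information: $A$ is the adjacency matrix. Note that self-loops are allowed here, as the convolution operator requires. $D$ denotes the degree matrix, with $\deg(i) = D_{i, i}$, and $N_{i}$ denotes the set of neighbors of node $i$. Nodes of the same class are connected with probability $p$, nodes of different classes with probability $q$. This is written $(A, X) \sim \text{XOR-CSBM}(n, d, \mu, \nu, \sigma^{2}, p, q)$.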
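The data model above can be made concrete with a small sampler. This is only a sketch under stated assumptions: the function name `sample_xor_csbm`, the use of NumPy, the example values of `mu`, `nu`, `sigma`, `p`, `q`, and the choice to put a self-loop on every node are illustrative choices, not taken from the paper.

```python
import numpy as np

def sample_xor_csbm(n, d, mu, nu, sigma, p, q, seed=0):
    """Draw (A, X, eps) following the XOR-CSBM description above."""
    rng = np.random.default_rng(seed)
    eps = rng.integers(0, 2, size=n)          # class labels, Ber(1/2)
    eta = rng.integers(0, 2, size=n)          # sign selectors, Ber(1/2)
    # Mean of node i: +/- mu for class 0, +/- nu for class 1.
    centers = np.where(eps[:, None] == 0, mu, nu)
    means = (2 * eta[:, None] - 1) * centers
    X = means + sigma * rng.standard_normal((n, d))
    # Edge probability p inside a class, q across classes.
    same = eps[:, None] == eps[None, :]
    upper = np.triu(rng.random((n, n)) < np.where(same, p, q), k=1)
    A = (upper | upper.T).astype(float)       # undirected graph
    np.fill_diagonal(A, 1.0)                  # assumption: self-loop on every node
    return A, X, eps

mu = np.array([1.0, 0.0])                     # orthogonal means of equal norm
nu = np.array([0.0, 1.0])
A, X, eps = sample_xor_csbm(n=200, d=2, mu=mu, nu=nu, sigma=0.5, p=0.3, q=0.1)
```

Plotting `X` colored by `eps` shows the XOR pattern: each class occupies two opposite clusters, so no single hyperplane separates them.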

Network Architecture

The paper studies MLPs with ReLU activations:

$$
\begin{aligned}
H^{(0)} &= X, \\
f^{(l)}(X) &= (D^{-1} A)^{k_{l}} H^{(l-1)} W^{(l)} + b^{(l)}, \\
H^{(l)} &= \mathrm{ReLU}(f^{(l)}(X)) \quad \text{for } l \in [L], \\
\hat{y} &= \varphi(f^{(L)}(X)).
\end{aligned}
$$

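Here $\varphi(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$, and the network outputs $\hat{y}$. $D^{-1} A$ is the normalized adjacency matrix; GCNs usually normalize with $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ instead, and the later experiments indicate that the two behave similarly. $k_{l}$ is the number of graph convolutions applied in layer $l$; a layer that does not use graph information simply has $k_{l} = 0$.
$(W^{(l)}, b^{(l)})$ are the trainable parameters. The model is trained with the cross-entropy loss $\ell_{\theta}(A, X) = -\frac{1}{n} \sum_{i \in [n]} \big[ y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i}) \big]$, and

$$\mathrm{OPT}(A, X) = \min_{\theta \in \mathcal{C}} \ell_{\theta}(A, X),$$

where $\mathcal{C}$ is the feasible set of parameters, here restricted by $\|W^{(1)}\|_{2} \leq R$ and $\|W^{(l)}\|_{2} \leq 1$ for $l > 1$. The theory shows that this restriction is necessary: if $R$ were allowed to be arbitrarily large, the loss could be driven arbitrarily close to 0 and the problem would not be worth studying. From here on, $\ell_{\theta}(X)$ stands for $\ell_{\theta}(I_{n}, X)$.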
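The architecture and loss above translate almost line by line into NumPy. The following is a minimal sketch, not the authors' implementation: the names `forward` and `cross_entropy`, the tiny 3-node graph, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def forward(A, X, weights, biases, ks):
    """f^(l) = (D^-1 A)^{k_l} H^(l-1) W^(l) + b^(l), ReLU between layers."""
    P = A / A.sum(axis=1, keepdims=True)      # row-normalised adjacency D^-1 A
    H = X
    for W, b, k in zip(weights, biases, ks):
        # matrix_power(P, 0) is the identity, so k = 0 ignores the graph.
        f = np.linalg.matrix_power(P, k) @ H @ W + b
        H = np.maximum(f, 0.0)                # ReLU (not used after the last layer)
    return 1.0 / (1.0 + np.exp(-f))           # sigmoid of the last pre-activation

def cross_entropy(y, y_hat, tiny=1e-12):
    """Mean binary cross-entropy; clipping avoids log(0)."""
    y_hat = np.clip(y_hat, tiny, 1.0 - tiny)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)   # toy graph with self-loops
X = rng.standard_normal((3, 2))
weights = [rng.standard_normal((2, 4)), rng.standard_normal((4, 1))]
biases = [np.zeros(4), np.zeros(1)]

# Two-layer network with one graph convolution in the second layer.
y_hat = forward(A, X, weights, biases, ks=[0, 1]).ravel()
loss = cross_entropy(np.array([0, 1, 1]), y_hat)
```

Setting `ks=[0, 0]` recovers the plain MLP, which is the sense in which $\ell_{\theta}(X)$ equals $\ell_{\theta}(I_{n}, X)$.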

Theory

To stay rigorous, the theorems are quoted in the original English.

Main results

Thm1

Theorem 1. Let $X \in \mathbb{R}^{n \times d} \sim \text{XOR-GMM}(n, d, \mu, \nu, \sigma^{2})$ and define $\gamma = \|\mu - \nu\|_{2}$ to be the distance between the means. Then we have the following:

1. Assume that $\gamma \leq K \sigma$ and let $h(x) : \mathbb{R}^{d} \to \{0, 1\}$ be any binary classifier. Then for any $K > 0$ and any $\epsilon \in (0, 1)$, at least a fraction $2 \Phi_{c}(K/2)^{2} - O(n^{-\epsilon/2})$ of all data points are misclassified by $h$ with probability at least $1 - \exp(-2 n^{1 - \epsilon})$.
2. For any $\epsilon > 0$, if the distance between the means is $\gamma = \Omega(\sigma (\log n)^{\frac{1}{2} + \epsilon})$, then for any $c > 0$, with probability at least $1 - O(n^{-c})$, there exist a two-layer and a three-layer network that perfectly classify the data, and obtain a cross-entropy loss given by

$$\ell_{\theta}(X) = C \exp\Big(-\frac{R}{\sqrt{2}} \gamma \big(1 \pm \sqrt{c} / (\log n)^{\epsilon}\big)\Big),$$

where $C \in [1/2, 1]$ is an absolute constant.

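First recall the definitions, which helps in reading the theorem: $\gamma = \|\mu - \nu\|_{2}$ and $X_{i} \sim N\big((2 \eta_{i} - 1) ((1 - \varepsilon_{i}) \mu + \varepsilon_{i} \nu), \sigma^{2} I\big)$. Comparing $\gamma$ with $\sigma$, the ratio $\frac{\gamma}{\sigma}$ measures how much the features of the two classes overlap: the smaller the ratio, the larger the overlap.
Part one of Theorem 1 says that when the separation is small, $\frac{\gamma}{\sigma} \leq K$, then with probability at least $1 - \exp(-2 n^{1 - \epsilon})$ at least a fraction $2 \Phi_{c}(K/2)^{2} - O(n^{-\epsilon/2})$ of the points is misclassified by any classifier. As $K \to \infty$ this lower bound tends to 0 (part two gives a more precise statement for that regime), whereas as $K \to 0$, i.e. when the node features of the two classes are mixed together, the bound approaches $\frac{1}{2}$. Since this is a binary task, no classifier then does better than random guessing.
Part two says that once the classes are separated enough, $\gamma = \Omega(\sigma (\log n)^{\frac{1}{2} + \epsilon})$, a network that does not use graph information classifies the data perfectly with probability at least $1 - O(n^{-c})$.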
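To get a feel for the lower bound in part one, its leading term $2 \Phi_{c}(K/2)^{2}$ can be evaluated for a few values of $K$. This is a small illustrative sketch using only the standard library; the function names are hypothetical and the $O(n^{-\epsilon/2})$ correction is ignored.

```python
import math

def gaussian_tail(x):
    """Phi_c(x) = 1 - Phi(x) for a standard Gaussian."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def misclassified_fraction_bound(K):
    """Leading term 2 * Phi_c(K/2)^2 of the lower bound in Theorem 1, part one."""
    return 2 * gaussian_tail(K / 2) ** 2

for K in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(f"K = {K:3.1f}  ->  bound ~ {misclassified_fraction_bound(K):.4f}")
```

At $K = 0$ the term equals exactly $\frac{1}{2}$, and it decreases monotonically as $K$ grows, matching the discussion above.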

Thm2

Theorem 2. Let $(A, X) \sim \text{XOR-CSBM}(n, d, \mu, \nu, \sigma^{2}, p, q)$, $\gamma = \|\mu - \nu\|_{2}$, and $\Gamma(p, q) = |p - q| / (p + q)$. There exist a two-layer network and a three-layer network with the following properties:

1. If the intra-class and inter-class edge probabilities are $p, q = \Omega(\frac{\log^{2} n}{n})$, and it holds that $\Gamma(p, q)\, \zeta(\gamma / 2\sigma) = \omega\big(\sqrt{\frac{\log n}{n (p + q)}}\big)$, then for any $c > 0$, with probability at least $1 - O(n^{-c})$, the networks equipped with a graph convolution in the second or the third layer perfectly classify the data, and obtain the following loss:

$$\ell_{\theta}(A, X) = C' \exp\Big(-C \sigma R\, \Gamma(p, q)\, \zeta(\gamma / 2\sigma) \big(1 \pm \sqrt{c / \log n}\big)\Big),$$

where $C > 0$ and $C' \in [1/2, 1]$ are constants.

2. If $p, q = \Omega(\frac{\log n}{\sqrt{n}})$ and $\Gamma(p, q)^{2}\, \zeta(\gamma / 2\sigma) = \omega\big(\sqrt{\frac{\log n}{n}}\big)$, then for any $c > 0$, with probability at least $1 - O(n^{-c})$, the networks with any combination of two graph convolutions in the second and/or the third layers perfectly classify the data, and obtain the following loss:

$$\ell_{\theta}(A, X) = C' \exp\Big(-C \sigma R\, \Gamma(p, q)^{2}\, \zeta(\gamma / \sigma) \big(1 \pm \sqrt{c / \log n}\big)\Big),$$

where $C > 0$ and $C' \in [1/2, 1]$ are constants.

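Part one of Theorem 2 says that for a relatively sparse graph, $p, q = \Omega(\frac{\log^{2} n}{n})$, if $\Gamma(p, q)\, \zeta(\gamma / 2\sigma) = \omega\big(\sqrt{\frac{\log n}{n (p + q)}}\big)$ holds, then a network with one GC classifies the data perfectly with probability at least $1 - O(n^{-c})$. To interpret this, suppose $\Gamma(p, q) = \Omega(1)$: a $\gamma$ satisfying the condition then only needs to be a factor $\frac{1}{\sqrt[4]{n (p + q)}}$ of the $\gamma$ that Theorem 1 requires for perfect classification without graph information.
Part two of Theorem 2 says that at higher graph density, $p, q = \Omega(\frac{\log n}{\sqrt{n}})$, two graph convolutions improve this factor further to $\frac{1}{\sqrt[4]{n}}$. This does not mean that denser is always better: when $p, q = \Omega(1)$, the single-convolution factor $\frac{1}{\sqrt[4]{n (p + q)}}$ is already of order $\frac{1}{\sqrt[4]{n}}$, so several GC layers bring no further gain and may even do worse than a single one.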
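The factor $\frac{1}{\sqrt[4]{n (p + q)}}$ can be traced with a short calculation. This is a sketch of the reasoning, assuming $\Gamma(p, q) = \Omega(1)$ and the small-separation regime $\gamma = o(\sigma)$, in which Lemma A.4 (stated in the proof section) gives $\zeta(\gamma / 2\sigma) = \Omega(\gamma^{2} / \sigma^{2})$; logarithmic factors are not tracked.

$$
\begin{aligned}
\Gamma(p, q)\, \zeta\!\left(\tfrac{\gamma}{2\sigma}\right) = \omega\!\left(\sqrt{\tfrac{\log n}{n (p + q)}}\right)
&\;\Longleftarrow\; \frac{\gamma^{2}}{\sigma^{2}} = \omega\!\left(\sqrt{\tfrac{\log n}{n (p + q)}}\right) \\
&\;\Longleftrightarrow\; \gamma = \omega\!\left(\sigma \left(\tfrac{\log n}{n (p + q)}\right)^{1/4}\right).
\end{aligned}
$$

Theorem 1 needs $\gamma = \Omega(\sigma (\log n)^{\frac{1}{2} + \epsilon})$ without the graph, so up to logarithmic factors the required distance shrinks by $\frac{1}{\sqrt[4]{n (p + q)}}$. Repeating the argument with the condition of part two, $\Gamma(p, q)^{2}\, \zeta(\gamma / 2\sigma) = \omega\big(\sqrt{\frac{\log n}{n}}\big)$, gives the factor $\frac{1}{\sqrt[4]{n}}$.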

Corollary

Corollary 2.1. Consider the data model $\text{XOR-CSBM}(n, d, \mu, \nu, \sigma^{2}, p, q)$ and the network architecture.

1. Assume that $p, q = \Omega(\log^{2} n / n)$ and consider the three-layer network characterized by part one of Theorem 2, with one graph convolution. For this network, placing the graph convolution in the second layer $(k_{2} = 1, k_{3} = 0)$ obtains the same results as placing it in the third layer $(k_{2} = 0, k_{3} = 1)$.
2. Assume that $p, q = \Omega(\log n / \sqrt{n})$, and consider the three-layer network characterized by part two of Theorem 2 with two graph convolutions. For this network, placing both convolutions in the second layer $(k_{2} = 2, k_{3} = 0)$ or both of them in the third layer $(k_{2} = 0, k_{3} = 2)$ obtains the same results as placing one convolution in the second layer and one in the third layer $(k_{2} = 1, k_{3} = 1)$.

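Corollary 2.1 follows directly from Theorem 2. It says that the improvement in the classification ability of a multi-layer network depends on the number of convolutions, not on where they are placed. In particular, on XOR-CSBM data, placing the same number of convolutions in any combination across the second and third layers gives similar classification performance.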

Proof

This part involves a large number of theorems and lemmas. They are only listed here; the detailed proofs are not recorded.

Pre

Tools used in the proofs:

Hoeffding's inequality
Chernoff bound
Union bound
Gaussian concentration

To simplify the computations while still conveying the underlying ideas, the following basic assumption is made:

Assumption 1. For the XOR-GMM data model, the means of the Gaussian mixture are such that $\langle \mu, \nu \rangle = 0$ and $\|\mu\|_{2} = \|\nu\|_{2}$.

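Notation used in the proofs:

$[x]_{+} = \mathrm{ReLU}(x)$
$\varphi(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$
$\hat{v} = \frac{v}{\|v\|_{2}}$
$\gamma = \|\mu - \nu\|_{2}$
$\gamma' = \gamma / 2$
$\Gamma(p, q) = \frac{|p - q|}{p + q}$
$\phi(x)$ is the density of the standard Gaussian distribution
$\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution
$\Phi_{c}(x) = 1 - \Phi(x)$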

Graph

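This subsection collects some properties of the random graph in the data model, namely the concentration of quantities such as degrees and numbers of common neighbors.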

Proposition A.1 (Concentration of degrees). Assume that the graph density is $p, q = \Omega(\frac{\log^{2} n}{n})$. Then

$$\deg(i) = \frac{n}{2} (p + q) (1 \pm o_{n}(1)), \qquad \frac{1}{\deg(i)} = \frac{2}{n (p + q)} (1 \pm o_{n}(1)),$$

$$\frac{1}{\deg(i)} \Big(\sum_{j \in C_{1}} a_{ij} - \sum_{j \in C_{0}} a_{ij}\Big) = (2 \varepsilon_{i} - 1) \frac{p - q}{p + q} (1 + o_{n}(1)),$$

where the error term $o_{n}(1) = O\big(\sqrt{\frac{c}{\log n}}\big)$.

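Proposition A.1 says that, with high probability, node degrees are close to $\frac{n}{2} (p + q)$. It also estimates the normalized difference between a node's numbers of class-1 and class-0 neighbors; up to the sign $2 \varepsilon_{i} - 1$, this is the difference between its same-class and different-class neighbors.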

Proposition A.2. Assume that the graph density is $p, q = \Omega(\frac{\log n}{\sqrt{n}})$. Then for any constant $c > 0$, with probability at least $1 - 2 n^{-c}$,

$$|N_{i} \cap N_{j}| = \frac{n}{2} (p^{2} + q^{2}) (1 \pm o_{n}(1)) \quad \text{for all } i \sim j,$$

$$|N_{i} \cap N_{j}| = n p q\, (1 \pm o_{n}(1)) \quad \text{for all } i \nsim j,$$

where the error term $o_{n}(1) = O\big(\sqrt{\frac{c}{\log n}}\big)$.

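Proposition A.2 estimates, for relatively dense graphs, the number of common neighbors of two nodes of the same class ($i \sim j$) and of two nodes of different classes ($i \nsim j$).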
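Both concentration statements are easy to probe numerically. The sketch below is an illustration under assumptions of my own: the values of `n`, `p`, `q`, the exactly balanced classes, and the self-loops on every node are arbitrary choices, and only averages are checked, not the uniform "for all $i$" statements of the propositions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 1000, 0.3, 0.1
eps = np.repeat([0, 1], n // 2)               # balanced classes for simplicity
same = eps[:, None] == eps[None, :]
upper = np.triu(rng.random((n, n)) < np.where(same, p, q), k=1)
A = (upper | upper.T).astype(np.float32)
np.fill_diagonal(A, 1.0)                      # self-loops

deg = A.sum(axis=1)
common = A @ A                                # common[i, j] = number of shared neighbours
off_diag = ~np.eye(n, dtype=bool)

# Ratios of empirical averages to the values predicted by A.1 and A.2.
deg_ratio = deg.mean() / (n / 2 * (p + q))
same_ratio = common[same & off_diag].mean() / (n / 2 * (p**2 + q**2))
diff_ratio = common[~same].mean() / (n * p * q)
print(deg_ratio, same_ratio, diff_ratio)
```

All three ratios come out close to 1; the small excess is due to the self-loops, which add lower-order terms.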

Lemma A.3 (Variance reduction). Denote the event from Proposition A.1 to be $B$. Let $\{X_{i}\}_{i \in [n]} \in \mathbb{R}^{n \times d}$ be an iid sample of data. For a graph with adjacency matrix $A$ (including self-loops) and a fixed integer $K > 0$, define a $K$-convolution to be $\tilde{X} = (D^{-1} A)^{K} X$. Then we have

$$\mathrm{Cov}(\tilde{X}_{i} \mid B) = \rho(K)\, \mathrm{Cov}(X_{i}), \quad \text{where } \rho(K) = \Big(\frac{1 + o_{n}(1)}{\Delta}\Big)^{2K} \sum_{j \in [n]} A^{K}(i, j)^{2}.$$

Here, $A^{K}(i, j)$ is the entry in the $i$th row and $j$th column of the exponentiated matrix $A^{K}$ and $\Delta = \mathbb{E} \deg = \frac{n}{2} (p + q)$.

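Lemma A.3 shows that as the number of convolutions $K$ grows, $\mathrm{Cov}(\tilde{X})$ shrinks. Convolution averages each node's features with those of its neighbors, which makes the features of different nodes more similar; taken too far, they become hard to tell apart. This also indicates that adding more convolution layers does not necessarily improve performance.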
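The shrinking covariance can be observed directly on a sampled graph. The sketch below computes the exact finite-$n$ quantity $\sum_{j} \big((D^{-1} A)^{K}\big)_{ij}^{2}$, which I use as a stand-in for $\rho(K)$ (the lemma replaces degrees by $\Delta$ up to $1 + o_{n}(1)$); the graph parameters and the function name `rho` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 400, 0.3, 0.1
eps = np.repeat([0, 1], n // 2)
same = eps[:, None] == eps[None, :]
upper = np.triu(rng.random((n, n)) < np.where(same, p, q), k=1)
A = (upper | upper.T).astype(float)
np.fill_diagonal(A, 1.0)                      # self-loops

P = A / A.sum(axis=1, keepdims=True)          # D^-1 A

def rho(K):
    """Per-node factor sum_j ((D^-1 A)^K)_{ij}^2 multiplying the feature covariance."""
    PK = np.linalg.matrix_power(P, K)
    return (PK ** 2).sum(axis=1)

for K in (0, 1, 2, 3):
    print(K, rho(K).mean())
```

For $K = 1$ the factor is exactly $1 / \deg(i)$, and a second convolution reduces it further on this graph, consistent with the lemma.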

Basic Network

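Because the data are drawn from a known Gaussian mixture, the Bayes classifier, which is optimal on this data, can be characterized explicitly.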

Lemma A.4. Let $h(x) = |\langle x, \hat{\nu} \rangle| - |\langle x, \hat{\mu} \rangle|$ for all $x \in \mathbb{R}^{d}$ and define

$$\zeta(t) = t\, \mathrm{erf}(t) - \frac{1}{\sqrt{\pi}} \big(1 - e^{-t^{2}}\big).$$

Then we have

1. The expectation $\mathbb{E}\, h(X_{i}) = \begin{cases} -\sqrt{2}\, \sigma\, \zeta(\gamma / 2\sigma) & i \in C_{0} \\ \sqrt{2}\, \sigma\, \zeta(\gamma / 2\sigma) & i \in C_{1} \end{cases}$.
2. For any $\gamma, \sigma > 0$ such that $\gamma = \Omega_{n}(\sigma)$, we have that $\zeta(\frac{\gamma}{\sigma}) = \Omega(\frac{\gamma}{\sigma})$.
3. For any $\gamma, \sigma > 0$ such that $\gamma = o_{n}(\sigma)$, we have that $\zeta(\frac{\gamma}{\sigma}) = \Omega(\frac{\gamma^{2}}{\sigma^{2}})$.
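The expectation in item 1 can be sanity-checked by Monte Carlo. This is a rough sketch, assuming orthogonal means of equal norm as in Assumption 1; the concrete `mu`, `nu`, `sigma`, and sample size `N` are arbitrary illustrative values.

```python
import math
import numpy as np

def zeta(t):
    """zeta(t) = t * erf(t) - (1 - exp(-t^2)) / sqrt(pi), as in Lemma A.4."""
    return t * math.erf(t) - (1.0 - math.exp(-t * t)) / math.sqrt(math.pi)

rng = np.random.default_rng(0)
sigma = 0.7
mu = np.array([1.0, 0.0])                     # <mu, nu> = 0 and ||mu|| = ||nu||
nu = np.array([0.0, 1.0])
gamma = np.linalg.norm(mu - nu)
mu_hat, nu_hat = mu / np.linalg.norm(mu), nu / np.linalg.norm(nu)

def h(X):
    return np.abs(X @ nu_hat) - np.abs(X @ mu_hat)

N = 200_000
signs = rng.choice([-1.0, 1.0], size=(N, 1))  # the (2 * eta - 1) factor
X1 = signs * nu + sigma * rng.standard_normal((N, 2))   # class-1 samples
X0 = signs * mu + sigma * rng.standard_normal((N, 2))   # class-0 samples

predicted = math.sqrt(2) * sigma * zeta(gamma / (2 * sigma))
empirical_c1, empirical_c0 = h(X1).mean(), h(X0).mean()
print(predicted, empirical_c1, empirical_c0)
```

The class-1 average should agree with `predicted` and the class-0 average with `-predicted`, up to Monte Carlo error.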

Let the Bayes classifier be $h^{*}(x) = \arg\max_{b \in \{0, 1\}} \Pr[y = b \mid \mathbf{x} = x]$. Lemma A.6 gives its exact form for the XOR-GMM data.

Lemma A.6. For some fixed $\mu, \nu \in \mathbb{R}^{d}$ and $\sigma^{2} > 0$, the Bayes optimal classifier, $h^{*}(x) : \mathbb{R}^{d} \to \{0, 1\}$ for the data model $\text{XOR-GMM}(n, d, \mu, \nu, \sigma^{2})$ is given by

$$h^{*}(x) = \mathbb{1}\big(|\langle x, \mu \rangle| < |\langle x, \nu \rangle|\big) = \begin{cases} 0 & |\langle x, \mu \rangle| \geq |\langle x, \nu \rangle| \\ 1 & |\langle x, \mu \rangle| < |\langle x, \nu \rangle| \end{cases},$$

where $\mathbb{1}$ is the indicator function.

Proposition A.7. Consider two-layer and three-layer networks of the form described above, without biases (i.e., $b^{(l)} = 0$ for all layers $l$), for parameters $W^{(l)}$ and some $R \in \mathbb{R}^{+}$ as follows.

1. For the two-layer network,

$$W^{(1)} = R \begin{pmatrix} \hat{\mu} & -\hat{\mu} & \hat{\nu} & -\hat{\nu} \end{pmatrix}, \quad W^{(2)} = \begin{pmatrix} -1 & -1 & 1 & 1 \end{pmatrix}^{\top}.$$

2. For the three-layer network,

$$W^{(1)} = R \begin{pmatrix} \hat{\mu} & -\hat{\mu} & \hat{\nu} & -\hat{\nu} \end{pmatrix}, \quad W^{(2)} = \begin{pmatrix} -1 & 1 \\ -1 & 1 \\ 1 & -1 \\ 1 & -1 \end{pmatrix}, \quad W^{(3)} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$

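Proposition A.7 gives two-layer and three-layer networks that implement the Bayes classifier of Lemma A.6. Note that the networks output the probability that the label is 1. Thresholding this probability at $\frac{1}{2}$ is equivalent to the Bayes classifier of Lemma A.6: the pre-activation of the last layer is $R\, h(x)$ with $h(x) = |\langle x, \hat{\nu} \rangle| - |\langle x, \hat{\mu} \rangle|$, so the output exceeds $\frac{1}{2}$ exactly when $|\langle x, \hat{\nu} \rangle| > |\langle x, \hat{\mu} \rangle|$, and $\|\mu\|_{2} = \|\nu\|_{2}$ by Assumption 1.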
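That both weight configurations compute $R\, h(x)$ before the sigmoid can be checked numerically. The sketch below hard-codes the weights of Proposition A.7 for $d = 2$; the particular unit vectors, the value of `R`, and the helper names are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

mu_hat = np.array([1.0, 0.0])                 # unit vectors with <mu, nu> = 0
nu_hat = np.array([0.0, 1.0])
R = 3.0

# Columns of W1 are R * (mu_hat, -mu_hat, nu_hat, -nu_hat).
W1 = R * np.column_stack([mu_hat, -mu_hat, nu_hat, -nu_hat])    # d x 4
W2_two = np.array([[-1.0], [-1.0], [1.0], [1.0]])               # 4 x 1
W2_three = np.array([[-1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, -1.0]])
W3_three = np.array([[1.0], [-1.0]])

def two_layer(X):
    """Last-layer pre-activation of the two-layer network."""
    return (relu(X @ W1) @ W2_two).ravel()

def three_layer(X):
    """Last-layer pre-activation of the three-layer network."""
    return (relu(relu(X @ W1) @ W2_three) @ W3_three).ravel()

def h(X):
    return np.abs(X @ nu_hat) - np.abs(X @ mu_hat)

X = np.random.default_rng(0).standard_normal((5, 2))
print(np.allclose(two_layer(X), R * h(X)), np.allclose(three_layer(X), R * h(X)))
```

The identity behind this is $[z]_{+} + [-z]_{+} = |z|$ for the first layer, and $[z]_{+} - [-z]_{+} = z$ for the extra layer of the three-layer network.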

Network no graph

With the lemmas above, the proof of Theorem 1 can be completed.

Network with GC

Adding GC to the networks of Proposition A.7 yields the following results.

Proposition A.8. Fix a positive integer $d > 0$, $\sigma \in \mathbb{R}^{+}$ and $\mu, \nu \in \mathbb{R}^{d}$. Let $(A, X) \sim \text{XOR-CSBM}(n, d, \mu, \nu, \sigma^{2}, p, q)$. Define $\tilde{X}$ to be the transformed data after applying a graph convolution on $X$, i.e., $\tilde{X} = D^{-1} A X$. Then in the regime where $p, q = \Omega(\frac{\log^{2} n}{n})$, with probability at least $1 - 1/\mathrm{poly}(n)$ we have that

$$\mathbb{E}\, \tilde{X}_{i} = \begin{cases} \frac{p \mu + q \nu}{2 (p + q)} \cdot o_{n}(1) & i \in C_{0} \\ \frac{p \nu + q \mu}{2 (p + q)} \cdot o_{n}(1) & i \in C_{1} \end{cases}.$$

Hence, the distance between the means of the convolved data, given by $\frac{p - q}{2 (p + q)} \|\mu - \nu\|_{2} \cdot o_{n}(1)$, diminishes to 0 for $n \to \infty$.

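Proposition A.8 shows that adding a GC in the first layer is harmful: as $n \to \infty$, after a first-layer convolution the class means of the node features both converge to 0, so the two classes can no longer be distinguished.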

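Next comes the proof of Theorem 2. Lemma A.9 first gives the form of the network output after one GC is added, which is the basis for proving part one of Theorem 2 on adding a single GC. Lemma A.10 shows that when two GCs are added, all placements lead to the same form of output; part two of Theorem 2 follows from Lemma A.10.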

Lemma A.9. Let $h(x) = |\langle x, \hat{\nu} \rangle| - |\langle x, \hat{\mu} \rangle|$ for any $x \in \mathbb{R}^{d}$. Consider the two-layer and three-layer networks in Proposition A.7 where the weight parameter of the last layer, $W^{(L)}$, is scaled by a factor of $\xi = \mathrm{sgn}(p - q)$. If a graph convolution is added to these networks in either the second or the third layer then for a sample $(A, X) \sim \text{XOR-CSBM}(n, d, \mu, \nu, \sigma^{2}, p, q)$, the output of the networks for a point $i \in [n]$ is

$$\hat{y}_{i} = \varphi(f_{i}^{(L)}(X)) = \varphi\Big(\frac{R\, \mathrm{sgn}(p - q)}{\deg(i)} \sum_{j \in [n]} a_{ij}\, h(X_{j})\Big).$$

Lemma A.10. Let $h(x) : \mathbb{R}^{d} \to \mathbb{R} = |\langle x, \hat{\nu} \rangle| - |\langle x, \hat{\mu} \rangle|$. Consider the networks constructed in Proposition A.7 equipped with two graph convolutions in the following combinations:

1. Both convolutions in the second layer of the two-layer network.
2. Both convolutions in the second layer of the three-layer network.
3. One convolution in the second layer and one in the third layer of the three-layer network.
4. Both convolutions in the third layer of the three-layer network.

Then for a sample $(A, X) \sim \text{XOR-CSBM}(n, d, \mu, \nu, \sigma^{2}, p, q)$, the output of the networks in all the above described combinations for a point $i \in [n]$ is

$$\hat{y}_{i} = \varphi(f_{i}^{(L)}(X)) = \varphi\Big(\frac{R}{\deg(i)} \sum_{j \in [n]} \tau_{ij}\, h(X_{j})\Big), \quad \text{where } \tau_{ij} = \sum_{k \in [n]} \frac{a_{ik} a_{jk}}{\deg(k)}.$$
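The contrast between convolving in the first layer (Proposition A.8) and convolving after the first layer (Lemma A.9) shows up clearly in a small simulation. This is an illustrative sketch, not the paper's experiment: the parameters `n`, `p`, `q`, `sigma`, the self-loops, and the rule "predict class 1 when the score is positive" are my own assumptions, and only the sign of the pre-activation is compared, so $R$ is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, sigma = 1000, 0.3, 0.1, 1.0
mu, nu = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # already unit vectors

eps = rng.integers(0, 2, size=n)
eta = rng.integers(0, 2, size=n)
centers = np.where(eps[:, None] == 0, mu, nu)
X = (2 * eta[:, None] - 1) * centers + sigma * rng.standard_normal((n, 2))

same = eps[:, None] == eps[None, :]
upper = np.triu(rng.random((n, n)) < np.where(same, p, q), k=1)
A = (upper | upper.T).astype(float)
np.fill_diagonal(A, 1.0)                      # self-loops
P = A / A.sum(axis=1, keepdims=True)          # D^-1 A

def h(Z):
    return np.abs(Z @ nu) - np.abs(Z @ mu)

def accuracy(score):
    return np.mean((score > 0) == (eps == 1))

xi = np.sign(p - q)
acc_no_graph = accuracy(h(X))                 # Bayes rule on raw features
acc_first_layer = accuracy(h(P @ X))          # convolve features first (Prop. A.8)
acc_later_layer = accuracy(xi * (P @ h(X)))   # average h over neighbours (Lemma A.9)
print(acc_no_graph, acc_first_layer, acc_later_layer)
```

Averaging the raw features cancels the $\pm\mu$ and $\pm\nu$ centers, which is why the first-layer variant performs poorly, while averaging $h(X_{j})$ over neighbors keeps the class signal and only reduces the noise.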

Experiments

The experiments are fairly straightforward; see the original paper.
