Are Loss Functions All the Same?


Overview
Main content

Rosasco L, De Vito E, Caponnetto A, et al. Are loss functions all the same[J]. Neural Computation, 2004, 16(5): 1063-1076.

@article{rosasco2004are,
  title={Are loss functions all the same},
  author={Rosasco, Lorenzo and De Vito, Ernesto and Caponnetto, Andrea and Piana, Michele and Verri, Alessandro},
  journal={Neural Computation},
  volume={16},
  number={5},
  pages={1063--1076},
  year={2004}
}

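The authors study how different loss functions behave in the limit as the number of samples grows. Let \(p(x,y)\) be the density of \((x,y)\), where \(x\in \mathbb{R}^d\) is the input sample and \(y\in \mathbb{R}\) is the target value (regression) or the class label (classification). Let \(V(w,y)\) be the loss function. The expected risk is then: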

\[\tag{1}
I[f]=\int_Z V(f(x),y)p(x,y)\mathrm{d} x \mathrm{d}y,
\]

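where \(f\) is the prediction function, and we let \(f_0\) denote a minimizer of the expected risk. In practice we only have a finite sample \(D=\{(x_1,y_1),\ldots, (x_l,y_l)\}\), so we work with the approximation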

\[\tag{2}
I_{emp}[f]=\frac{1}{l}\sum_{i=1}^lV(f(x_i),y_i),
\]

and

\[\tag{3}
f_D=\arg\min_{f \in \mathcal{H}} I_{emp}[f].
\]

where \(\mathcal{H}\) is the hypothesis space.
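To make (2) and (3) concrete, here is a minimal Python sketch. The toy data, the two losses (square and hinge), and the three-element candidate set standing in for \(\mathcal{H}\) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def square_loss(w, y):
    # V(w, y) = (w - y)^2
    return (w - y) ** 2

def hinge_loss(w, y):
    # V(w, y) = max(0, 1 - y * w), for labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * w)

def empirical_risk(f, loss, xs, ys):
    # I_emp[f] = (1 / l) * sum_i V(f(x_i), y_i), as in equation (2)
    preds = np.array([f(x) for x in xs])
    return float(np.mean(loss(preds, ys)))

# Hypothetical toy sample D with l = 4 points in R^1.
xs = np.array([[0.0], [1.0], [2.0], [3.0]])
ys = np.array([-1.0, -1.0, 1.0, 1.0])

# A tiny finite stand-in for the hypothesis space: f_b(x) = x - b.
candidates = {b: (lambda x, b=b: x[0] - b) for b in (0.5, 1.5, 2.5)}

# Equation (3): pick the candidate with the smallest empirical risk.
risks = {b: empirical_risk(f, square_loss, xs, ys) for b, f in candidates.items()}
best_b = min(risks, key=risks.get)

risk_sq = risks[1.5]
risk_hinge = empirical_risk(candidates[1.5], hinge_loss, xs, ys)
```

On this sample the candidate with \(b=1.5\) minimizes the empirical square loss. A real hypothesis space is infinite-dimensional, so the minimization would be done by an optimizer rather than by enumeration.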

How far \(f_D\) is from \(f_0\) is the central question of the paper.

Some assumptions

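First, \(f_D\) is sought in the space \(\mathcal{H}\). The article on Reproducing Kernel Hilbert Spaces (RKHS), which I have not read, explains how such a space is constructed. Given a symmetric positive definite function \(K(x,s)\) (a Mercer kernel):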

\[K: X \times X \rightarrow \mathbb{R},
\]

where \(K(\cdot, x)\) is a continuous function.

Functions \(f \in \mathcal{H}\) are tied to the kernel through the reproducing property:

\[\tag{4}
f(x) = \langle f, K(\cdot, x)\rangle_{\mathcal{H}}.
\]

Given a constant \(R>0\), define the hypothesis space \(\mathcal{H}_{R}\):

\[\mathcal{H}_{R} = \{f \in \mathcal{H}, \|f\|_{\mathcal{H}}\le R\},
\]

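Then, under \(\|\cdot\|_{\infty}\), \(\mathcal{H}_R\) is a compact subset of the continuous functions \(C(X)\), where \(X\subset \mathbb{R}^d\) is compact. (The proof uses the classical Arzelà-Ascoli theorem: it suffices to show that the elements of \(\mathcal{H}_R\) are equicontinuous, since uniform boundedness follows from the bound on \(|f(x)|\) given next.)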

Moreover:

\[|f(x)|= |\langle f, K(\cdot, x)\rangle_{\mathcal{H}}| \le \|f\|_{\mathcal{H}} \sqrt{K(x,x)},
\]
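The inequality is Cauchy-Schwarz in \(\mathcal{H}\), combined with the reproducing property (4) applied to \(K(\cdot, x)\) itself. Spelled out as a short sketch:

\[|\langle f, K(\cdot, x)\rangle_{\mathcal{H}}| \le \|f\|_{\mathcal{H}} \|K(\cdot, x)\|_{\mathcal{H}}, \qquad \|K(\cdot, x)\|_{\mathcal{H}}^2 = \langle K(\cdot, x), K(\cdot, x)\rangle_{\mathcal{H}} = K(x,x).
\]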

Hence, for every \(f \in \mathcal{H}_R\),

\[\|f\|_{\infty} \le RC_K,
\]

where \(C_K=\sup_{x \in X} \sqrt{K(x,x)}\).
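The sup-norm bound can be checked numerically. The sketch below assumes a Gaussian kernel (for which \(K(x,x)=1\), so \(C_K=1\)) and a function of the form \(f=\sum_i \alpha_i K(\cdot, c_i)\), whose RKHS norm satisfies \(\|f\|_{\mathcal{H}}^2=\alpha^\top G \alpha\) with \(G\) the Gram matrix of the centers. The kernel choice, the centers \(c_i\), and the coefficients \(\alpha_i\) are illustrative assumptions, not from the paper.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), evaluated for all pairs of rows.
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)

# Hypothetical centers c_i in X = [-1, 1]^2 and coefficients alpha_i.
centers = rng.uniform(-1.0, 1.0, size=(5, 2))
alpha = rng.normal(size=5)

# RKHS norm of f = sum_i alpha_i K(., c_i): ||f||_H^2 = alpha^T G alpha.
gram = gaussian_kernel(centers, centers)
rkhs_norm = float(np.sqrt(max(alpha @ gram @ alpha, 0.0)))

# Evaluate f on random points of X and take the largest absolute value.
grid = rng.uniform(-1.0, 1.0, size=(1000, 2))
f_values = gaussian_kernel(grid, centers) @ alpha
sup_abs_f = float(np.max(np.abs(f_values)))

c_k = 1.0  # sup_x sqrt(K(x, x)) for the Gaussian kernel
bound_holds = sup_abs_f <= rkhs_norm * c_k + 1e-12
```

Taking \(R=\|f\|_{\mathcal{H}}\) here, `bound_holds` being true is exactly the statement \(\|f\|_{\infty}\le RC_K\) on the sampled points.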

The loss function \(V\) is convex and satisfies: