/ 20220404 Week 1 - 2 /
Arthur Samuel
The field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Supervised Learning: given a labeled data set, we already know what a correct output/result should look like.
Unsupervised Learning: given an unlabeled data set (or one where every example carries the same label), we have to find structure in the data ourselves.
Others: Reinforcement Learning, Recommender Systems…
Training Set
\[\begin{matrix}
x^{(1)}_1&x^{(1)}_2&\cdots&x^{(1)}_n&&y^{(1)}\\
x^{(2)}_1&x^{(2)}_2&\cdots&x^{(2)}_n&&y^{(2)}\\
\vdots&\vdots&\ddots&\vdots&&\vdots\\
x^{(m)}_1&x^{(m)}_2&\cdots&x^{(m)}_n&&y^{(m)}
\end{matrix}\]
Notation
\(m=\) the number of training examples (the number of rows)
\(n=\) the number of features (the number of columns)
\(x=\) input variable / feature
\(y=\) output variable / target variable
\((x^{(i)},y^{(i)})\): the \(i\)-th training example, where \(x^{(i)}_j\) is the value of its \(j\)-th feature, with \(i=1,\dots,m\) and \(j=1,\dots,n\)
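As a concrete reading of this notation, here is a sketch (the array values are invented for illustration) of how such a training set maps onto NumPy arrays:

```python
import numpy as np

# Toy training set: m = 4 examples, n = 2 features (values are made up).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 3.0],
              [1416.0, 2.0]])              # shape (m, n); row i holds x^(i)_1 ... x^(i)_n
y = np.array([400.0, 330.0, 369.0, 232.0])  # shape (m,); entry i is y^(i)

m, n = X.shape
print(m, n)      # 4 2
print(X[0, 1])   # x^(1)_2 in the notes' 1-based notation (arrays are 0-based): 3.0
print(y[0])      # y^(1): 400.0
```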
Adding a constant feature \(x_0=1\) to every example:
\[\begin{matrix}
x_0&x^{(1)}_1&x^{(1)}_2&\cdots&x^{(1)}_n&&y^{(1)}\\
x_0&x^{(2)}_1&x^{(2)}_2&\cdots&x^{(2)}_n&&y^{(2)}\\
\vdots&\vdots&\vdots&\ddots&\vdots&&\vdots\\
x_0&x^{(m)}_1&x^{(m)}_2&\cdots&x^{(m)}_n&&y^{(m)}\\
\\
\theta_0&\theta_1&\theta_2&\cdots&\theta_n&&
\end{matrix}\]
Hypothesis Function
\[h_{\theta}(x)=\theta_0+\theta_1x
\]
Cost Function - Squared Error Cost Function
\[J(\theta_0,\theta_1)=\frac{1}{2m}\displaystyle\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2
\]
Goal
\[\min_{\theta_0,\theta_1}J(\theta_0,\theta_1)
\]
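A minimal sketch of the one-variable hypothesis and cost on toy data (the function and variable names are my own):

```python
import numpy as np

def h(theta0, theta1, x):
    # Hypothesis: h_theta(x) = theta_0 + theta_1 * x
    return theta0 + theta1 * x

def J(theta0, theta1, x, y):
    # Squared-error cost: J = (1/2m) * sum_i (h(x^(i)) - y^(i))^2
    m = len(y)
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(J(0.0, 1.0, x, y))  # 0.0: theta = (0, 1) fits y = x exactly
print(J(0.0, 0.5, x, y))  # 0.5833...: a worse fit costs more
```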
Hypothesis Function (multiple features)
\[\theta=
\left[
\begin{matrix}
\theta_0\\
\theta_1\\
\vdots\\
\theta_n
\end{matrix}
\right],\
x=
\left[
\begin{matrix}
x_0\\
x_1\\
\vdots\\
x_n
\end{matrix}
\right]\]
\[\begin{aligned}h_\theta(x)&=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n\\
&=\theta^Tx
\end{aligned}\]
Cost Function
\[J(\theta)=\frac{1}{2m}\displaystyle\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2
\]
Goal
\[\min_{\theta}J(\theta)
\]
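With the \(x_0=1\) column included in \(X\), both \(h_\theta\) and \(J\) reduce to a couple of matrix operations; a sketch on toy data:

```python
import numpy as np

def J(theta, X, y):
    # J(theta) = (1/2m) * ||X theta - y||^2, with X already containing the x_0 = 1 column
    m = len(y)
    r = X @ theta - y          # all residuals h_theta(x^(i)) - y^(i) at once
    return (r @ r) / (2 * m)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])     # m x (n+1): the leading column of ones is x_0
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 2.0])   # h_theta(x) = theta^T x = 2 * x_1
print(J(theta, X, y))          # 0.0: this theta fits the toy data exactly
```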
Gradient Descent
Repeat until convergence (simultaneously updating \(\theta_j\) for every \(j=0,\dots,n\)):
\[\begin{aligned}
\theta_j
&:=\theta_j-\alpha{\partial\over\partial\theta_j}J(\theta)\\
&:=\theta_j-\alpha{1\over{m}}\displaystyle\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_j
\end{aligned}\]
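A minimal sketch of this update rule as batch gradient descent (the \(\alpha\) and iteration count below are arbitrary choices for the toy data, not values from the notes):

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    # theta_j := theta_j - (alpha/m) * sum_i (h(x^(i)) - y^(i)) * x^(i)_j, for all j at once
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # vector of partial derivatives dJ/dtheta_j
        theta = theta - alpha * grad       # one assignment updates every theta_j
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(gradient_descent(X, y, alpha=0.1, iters=2000))  # converges to roughly [0, 2]
```

Updating the whole \(\theta\) vector in one assignment gives the required simultaneous update for free; updating the components one at a time inside the loop would use already-updated values and be incorrect.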
Feature Scaling
For each feature \(x_j\), apply $$x_j:={{x_j-\mu_j}\over{s_j}}$$
where \(\mu_j\) is the mean of the \(m\) values of feature \(x_j\), and \(s_j\) is either their range (maximum minus minimum) or their standard deviation.
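A sketch of mean normalization applied column-wise, using the standard deviation for \(s_j\):

```python
import numpy as np

def scale_features(X):
    # Mean normalization, column by column: x_j := (x_j - mu_j) / s_j
    mu = X.mean(axis=0)   # mu_j: mean of feature j over the m examples
    s = X.std(axis=0)     # s_j: standard deviation here (the max-min range also works)
    return (X - mu) / s, mu, s

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [1416.0, 2.0]])
X_scaled, mu, s = scale_features(X)
print(X_scaled.mean(axis=0))   # ~ [0, 0]: every feature now has zero mean
```

Scaling is applied before prepending the constant \(x_0=1\) column (a constant column has \(s_j=0\) and must not be normalized), and \(\mu_j\), \(s_j\) are kept so new inputs can be scaled the same way at prediction time.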
Learning Rate
If \(\alpha\) is too small, gradient descent converges slowly; if \(\alpha\) is too large, \(J(\theta)\) may not decrease on every iteration and may fail to converge. A practical check is to plot \(J(\theta)\) against the number of iterations and confirm it is decreasing.
Normal Equation
Let
\[X=\left[
\begin{matrix}
x_0&x^{(1)}_1&x^{(1)}_2&\cdots&x^{(1)}_n\\
x_0&x^{(2)}_1&x^{(2)}_2&\cdots&x^{(2)}_n\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
x_0&x^{(m)}_1&x^{(m)}_2&\cdots&x^{(m)}_n\\
\end{matrix}
\right],\
y=\left[
\begin{matrix}
y^{(1)}\\
y^{(2)}\\
\vdots\\
y^{(m)}\\
\end{matrix}
\right]\]
where \(X\) is an \(m\times(n+1)\) matrix and \(y\) is an \(m\)-dimensional column vector. Then
\[\theta=(X^TX)^{-1}X^Ty
\]
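A direct sketch of the normal equation on the same toy data; `np.linalg.pinv` is used instead of a plain inverse so the noninvertible case discussed below does not crash:

```python
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^{-1} X^T y; pinv also copes with a noninvertible X^T X
    return np.linalg.pinv(X.T @ X) @ X.T @ y

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(normal_equation(X, y))   # ~ [0, 2], with no feature scaling or alpha needed
```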
If \(X^TX\) is noninvertible, the likely causes are redundant features (linearly dependent columns) or too many features (\(m\le n\)); deleting some features or using regularization fixes both.
Polynomial Regression
If a linear \(h_\theta(x)\) can't fit the data well, we can change the behavior or curve of \(h_\theta(x)\) by making it a quadratic, cubic, or square-root function (or any other form).
e.g.
\(h_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2x_1^2,\ x_2=x_1^2\)
\(h_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2x_1^2+\theta_3x_1^3,\ x_2=x_1^2,\ x_3=x_1^3\)
\(h_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2\sqrt{x_1},\ x_2=\sqrt{x_1}\)
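A sketch of the quadratic case: \(x_2=x_1^2\) is added as an ordinary new feature, after which the unchanged linear machinery (here the normal equation) does the rest. With gradient descent, feature scaling becomes important here because \(x_1^2\) spans a much wider range than \(x_1\).

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x1 ** 2        # toy targets generated from a quadratic

# x_2 = x_1^2 is just another column; the model stays linear in theta.
X = np.column_stack([np.ones_like(x1), x1, x1 ** 2])
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)                    # ~ [2, 0, 3]: recovers theta_0 = 2, theta_2 = 3
```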