オウルです。

今回は単回帰分析と重回帰分析です。日頃からお世話になっている Chainer Tutorial の 7. 単回帰分析と重回帰分析にある単回帰分析と重回帰分析を学ぶ理由が、しっくりきたので、まずはそれを最初に紹介します。

単回帰分析と重回帰分析を紹介することには 2 つの理由があります。 1 つ目は、単回帰分析と重回帰分析の数学がニューラルネットワーク含めたディープラーニングの数学の基礎となるためです。 2 つ目は、単回帰分析のアルゴリズムを通して微分、重回帰分析のアルゴリズムを通して線形代数に関する理解を深めることができるためです。

ということで、Chainer Tutorialで紹介されている単回帰分析と重回帰分析の内容を python で実装しながら理解を深めます。重回帰分析は、scikit-learn と numpy で実装して推論を比較してみます。

単回帰分析

そもそも回帰分析とは、何モノなんでしょう。

回帰（かいき、英: regression）とは、統計学において、Y が連続値の時にデータに Y = f(X) というモデル(「定量的な関係の構造[1]」)を当てはめる事。別の言い方では、連続尺度の従属変数（目的変数）Y と独立変数（説明変数）X の間にモデルを当てはめること。X が1次元ならば単回帰、X が2次元以上ならば重回帰と言う。Y が離散の場合は分類と言う。回帰分析（かいきぶんせき、英: regression analysis）とは、回帰により分析する事。回帰で使われる、最も基本的なモデルは Y = AX + B という形式の線形回帰である。

『ウィキペディア（Wikipedia）』回帰分析より引用

回帰分析の説明を確認したので、まずは単回帰分析からです。Chainer Tutorial 7.1. 単回帰分析では、 1 つの説明変数（部屋の広さ）から 1 つの目的変数（家賃）を予測しています。それでは、手を動かしながら理解を深めていきましょう。

「部屋の広さ」から「家賃」を予測

ライブラリのインポート

In [0]:

# ライブラリのインポート
import numpy as np
import matplotlib.pyplot as plt

				1
2
3

						# ライブラリのインポート
import numpy as np
import matplotlib.pyplot as plt

データ作成

In [1]:

# y: 家賃
y_data = np.array([60000, 115000, 155000])
# x: 部屋の広さ
x_data = np.array([20, 40, 60])

				
					
				1
2
3
4

						# y: 家賃
y_data = np.array([60000, 115000, 155000])
# x: 部屋の広さ
x_data = np.array([20, 40, 60])

					

			

データのプロット

In [2]:

# y 軸に「家賃」、x 軸に部屋の広さのデータをプロット
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111)
ax.set_title('rest / area')
ax.grid()
ax.set_ylim([0, 200000])
ax.set_xlim([0, 80])
ax.set_ylabel('rest')
ax.set_xlabel('area')
ax.scatter(x, y, color='blue')
plt.show()

				
					
				1
2
3
4
5
6
7
8
9
10
11

						# y 軸に「家賃」、x 軸に部屋の広さのデータをプロット
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(111)
ax.set_title('rest / area')
ax.grid()
ax.set_ylim([0, 200000])
ax.set_xlim([0, 80])
ax.set_ylabel('rest')
ax.set_xlabel('area')
ax.scatter(x, y, color='blue')
plt.show()

					

			

Out [2]:

データの中心化

Chainer Tutorial にあるデータの中心化の数式を見ると、偏差を求める数式ですね。

雲の上はいつも晴れ

~~https://book2525stack.com/2019/09/02/beginner-machine-learning-statistics-1~~

準オタクが綴るマンガ、アニメ、ときどきプログラミングのゆるーいブログ

In [3]:

def centering(data):
  return data - np.mean(data)

				1
2

						def centering(data):
  return data - np.mean(data)

パラメータの最適化

次に最適な \(w\) （重み）を以下の数式で求めます。

\(w = \frac{\sum_{n=1}^N x_n t_n}{\sum_{n=1}^N x^2_n}\)

7.1.5. Step 3：最適なパラメータを求める（単回帰分析）で上記の数式の導出を、僕でも分かるくらい分かり易く説明をしてくれています。

In [4]:

def reg1dim(x, t):
  return np.sum(x * t) / np.sum(x**2)

				1
2

						def reg1dim(x, t):
  return np.sum(x * t) / np.sum(x**2)

In [5]:

y_c = centering(y_data)
x_c = centering(x_data)
print(y_c)
print(x_c)

				
					
				1
2
3
4

						y_c = centering(y_data)
x_c = centering(x_data)
print(y_c)
print(x_c)

					

			

Out [5]:

[-50000.   5000.  45000.]
[-20.   0.  20.]

1 2	[-50000. 5000. 45000.] [-20. 0. 20.]

In [6]:

w = reg1dim(x_c, y_c)
print(w)

				1
2

						w = reg1dim(x_c, y_c)
print(w)

Out [6]:

2375.0

2375.0

In [7]:

# 中心化したデータをプロットして線を引く
fig_c = plt.figure(figsize=(5, 5))
ax_c = fig_c.add_subplot(111)
ax_c.set_title('rest area')
ax_c.grid()
ax_c.set_ylim(-60000, 55000)
ax_c.set_xlim(-40, 40)
ax_c.set_ylabel('rest')
ax_c.set_xlabel('area')
ax_c.scatter(x_c, y_c, color='blue')
plt.plot([-20, x_c.max()], [-50000, w * x_c.max()])
plt.show()

				
					
				1
2
3
4
5
6
7
8
9
10
11
12

						# 中心化したデータをプロットして線を引く
fig_c = plt.figure(figsize=(5, 5))
ax_c = fig_c.add_subplot(111)
ax_c.set_title('rest area')
ax_c.grid()
ax_c.set_ylim(-60000, 55000)
ax_c.set_xlim(-40, 40)
ax_c.set_ylabel('rest')
ax_c.set_xlabel('area')
ax_c.scatter(x_c, y_c, color='blue')
plt.plot([-20, x_c.max()], [-50000, w * x_c.max()])
plt.show()

					

			

Out [7]:

推論

In [8]:

# 推論
def predict(x, y, w, x_test):
  return w * (x_test - np.mean(x)) + np.mean(y)

				1
2
3

						# 推論
def predict(x, y, w, x_test):
  return w * (x_test - np.mean(x)) + np.mean(y)

Chainer Tutorial 7.1.6. 数値例の説明にあるように、モデル化の際に中心化を行っていた処理を推論の際には元に戻して計算します。

In [9]:

# 部屋の広さ: 50㎡ の家賃を推論
t50 = predict(x_data, y_data, w, 50)
print(t50)

				1
2
3

						# 部屋の広さ: 50㎡ の家賃を推論
t50 = predict(x_data, y_data, w, 50)
print(t50)

Out [9]:

133750.0

133750.0

Chainer Tutorial 7.1.6. 数値例の結果と一致しています。 python での実装は、ひとまず問題なさそうです。

重回帰分析

続いて重回帰分析では、Chainer Tutorial 9. scikit-learn 入門で出てくる Boston house prices dataset というデータセットを使用します。

まず、scikit-learn「サイキット・ラーン」で実装します。実装内容は、Chainer Tutorial scikit-learn の内容に沿っています。詳細は、Chainer Tutorial 9. scikit-learn 入門を参照してください。その後、 numpy での実装と比較してみます。

scikit-learn

データセットのロード

In [0]:

# ライブラリのインポート
from sklearn.datasets import load_boston
# データのロード
dataset = load_boston()

				
					
				1
2
3
4

						# ライブラリのインポート
from sklearn.datasets import load_boston
# データのロード
dataset = load_boston()

					

			

In [1]:

x = dataset.data
t = dataset.target

				1
2

						x = dataset.data
t = dataset.target

データセットの分割

In [2]:

# データセットの分割
from sklearn.model_selection import train_test_split
x_train, x_test, t_train, t_test = train_test_split(x, t, test_size=0.3, random_state=0)

				1
2
3

						# データセットの分割
from sklearn.model_selection import train_test_split
x_train, x_test, t_train, t_test = train_test_split(x, t, test_size=0.3, random_state=0)

モデルの訓練

In [3]:

# モデルの訓練
from sklearn.linear_model import LinearRegression
reg_model = LinearRegression()

				1
2
3

						# モデルの訓練
from sklearn.linear_model import LinearRegression
reg_model = LinearRegression()

Out[3]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

1	LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

公式ドキュメントの sklearn.linear_model.LinearRegression

In [4]:

# 訓練後のパラメータ: w
reg_model.coef_

				1
2

						# 訓練後のパラメータ: w
reg_model.coef_

Out[4]:

array([-1.21310401e-01,  4.44664254e-02,  1.13416945e-02,  2.51124642e+00,
       -1.62312529e+01,  3.85906801e+00, -9.98516565e-03, -1.50026956e+00,
        2.42143466e-01, -1.10716124e-02, -1.01775264e+00,  6.81446545e-03,
       -4.86738066e-01])

array([-1.21310401e-01, 4.44664254e-02, 1.13416945e-02, 2.51124642e+00,

-1.62312529e+01, 3.85906801e+00, -9.98516565e-03, -1.50026956e+00,

2.42143466e-01, -1.10716124e-02, -1.01775264e+00, 6.81446545e-03,

-4.86738066e-01])

In [5]:

# 訓練後のバイアス: b
reg_model.intercept_

				1
2

						# 訓練後のバイアス: b
reg_model.intercept_

Out[5]:

37.93710774183309

1	37.93710774183309

In [6]:

# 精度の検証
reg_model.score(x_train, t_train)

				1
2

						# 精度の検証
reg_model.score(x_train, t_train)

Out[6]:

0.7645451026942549

1	0.7645451026942549

推論

In [7]:

# 推論
reg_model.predict(x_test[:1])

				1
2

						# 推論
reg_model.predict(x_test[:1])

Out[7]:

array([24.9357079])

1	array([24.9357079])

numpy

In [0]:

# ライブラリのインポート
import numpy as np

				1
2

						# ライブラリのインポート
import numpy as np

パラメータの最適化

パラメータの最適化に正規方程式を使用します。

\(\begin{split}\begin{aligned}
{\bf w} &= ({\bf X}^{\rm T}{\bf X})^{-1}{\bf X}^{\rm T}{\bf t}
\end{aligned}\end{split}\)

正規方程式の導出は、7.2.4. Step 3：パラメータを最適化する（重回帰分析）でとても丁寧に説明してくれています。

In [1]:

# パラメータの最適化
def normal_equation(x, t):
  return np.linalg.inv(x.T.dot(x)).dot(x.T).dot(t)

				1
2
3

						# パラメータの最適化
def normal_equation(x, t):
  return np.linalg.inv(x.T.dot(x)).dot(x.T).dot(t)

In [2]:

w = normal_equation(x_train, t_train)
print(w)

				1
2

						w = normal_equation(x_train, t_train)
print(w)

Out [2]:

[-0.11897829  0.04806563 -0.02139036  2.77854735 -0.31370893  6.12550172
 -0.01906494 -0.9824806   0.09990441 -0.00808653 -0.46446137  0.01248049
 -0.3610918 ]

[-0.11897829 0.04806563 -0.02139036 2.77854735 -0.31370893 6.12550172

-0.01906494 -0.9824806 0.09990441 -0.00808653 -0.46446137 0.01248049

-0.3610918 ]

推論

In [3]:

# 推論
def predict(w, x):
  return np.dot(w.T, x)

				1
2
3

						# 推論
def predict(w, x):
  return np.dot(w.T, x)

In [4]:

y = predict(x_test[:1].T, w)
print(y)

				1
2

						y = predict(x_test[:1].T, w)
print(y)

Out [4]:

[24.22543525]

1	[24.22543525]

推論値の比較

scikit-learn での x_test[:1]テストデータの推論値は ”24.9357079”、numpy でのx_test[:1]テストデータの推論値は ”24.22543525” ・・・違ってる・・・

実装では同じ訓練データを使用しているので、データの問題ではなさそう。LinearRegression 内部の最適化方法が違うのかもと思いつつ、もう一度「Chainer Tutorial 7. 単回帰分析と重回帰分析」を読み直します。

すると、これだと思うところがありました。

重回帰分析では、 \(M\) 個の重み \(w_{1}, w_{2}, \dots, w_{M}\) と 1 個のバイアス \(b\) があり、合わせて \(M+1\) 個のパラメータが存在します。これらのパラメータをうまく定式化することを考えます。そこで、今回は \(x_0 = 1\)、\(w_0 = b\) とおくことで、

\(\begin{split}\begin{aligned}
y
&= w_{1}x_{1} + w_{2}x_{2} + \cdots + w_{M}x_{M} + b \\
&= w_{1}x_{1} + w_{2}x_{2} + \cdots + w_{M}x_{M} + w_{0}x_{0} \\
&= w_{0}x_{0} + w_{1}x_{1} + \cdots + w_{M}x_{M} \\
&= \sum_{m=0}^M w_{m} x_{m}
\end{aligned}\end{split}\)

のように \(b\) を総和の内側の項に含めて、簡潔に表記できるようにします。

Chainer Tutorial 7. 単回帰分析と重回帰分析より引用

そういえば、バイアスを考慮していないです。ということで numpy の実装を修正します。

In [2]:

ones = np.ones((x_train.shape[0], 1))
X_train = np.concatenate((ones, x_train), axis=1)
w = normal_equation(X_train, t_train)
print(w)

				
					
				1
2
3
4

						ones = np.ones((x_train.shape[0], 1))
X_train = np.concatenate((ones, x_train), axis=1)
w = normal_equation(X_train, t_train)
print(w)

					

			

Out [2]:

[ 3.79371077e+01 -1.21310401e-01  4.44664254e-02  1.13416945e-02
  2.51124642e+00 -1.62312529e+01  3.85906801e+00 -9.98516565e-03
 -1.50026956e+00  2.42143466e-01 -1.10716124e-02 -1.01775264e+00
  6.81446545e-03 -4.86738066e-01]

				
					
				1
2
3
4

						[ 3.79371077e+01 -1.21310401e-01  4.44664254e-02  1.13416945e-02
  2.51124642e+00 -1.62312529e+01  3.85906801e+00 -9.98516565e-03
 -1.50026956e+00  2.42143466e-01 -1.10716124e-02 -1.01775264e+00
  6.81446545e-03 -4.86738066e-01]

					

			

In [3]:

# 推論
def predict(w, x, b):
  return np.dot(w.T, x) + b

				1
2
3

						# 推論
def predict(w, x, b):
  return np.dot(w.T, x) + b

In [4]:

b = w[:1]
print(b)
W = w[1:]
print(W)
y = predict(W, x_test[:1].T, b)
print()
print(y)

				
					
				1
2
3
4
5
6
7

						b = w[:1]
print(b)
W = w[1:]
print(W)
y = predict(W, x_test[:1].T, b)
print()
print(y)

					

			

Out [4]:

[37.93710774]
[-1.21310401e-01  4.44664254e-02  1.13416945e-02  2.51124642e+00
 -1.62312529e+01  3.85906801e+00 -9.98516565e-03 -1.50026956e+00
  2.42143466e-01 -1.10716124e-02 -1.01775264e+00  6.81446545e-03
 -4.86738066e-01]

[24.9357079]

[37.93710774]

[-1.21310401e-01 4.44664254e-02 1.13416945e-02 2.51124642e+00

-1.62312529e+01 3.85906801e+00 -9.98516565e-03 -1.50026956e+00

2.42143466e-01 -1.10716124e-02 -1.01775264e+00 6.81446545e-03

-4.86738066e-01]

[24.9357079]

scikit-learn での x_test[:1]テストデータの推論値は ”24.9357079”、numpy での x_test[:1]テストデータの推論値は ”24.9357079” ・・・同じ。

手を動かしていなかったら、分かったつもりになっているところでした。

【初心者】ASP.NET CoreとReactでWebアプリをつくる -開発環境編-

【超初心者向け】重回帰分析を numpy と scikit-learn で実装して比較すると推論値が違う？！原因は？

単回帰分析

「部屋の広さ」から「家賃」を予測

ライブラリのインポート

データ作成

データのプロット

データの中心化

パラメータの最適化

推論

重回帰分析

scikit-learn

データセットのロード

データセットの分割

モデルの訓練

推論

numpy

パラメータの最適化

推論

推論値の比較

関連記事:

コメントを残すコメントをキャンセル

単回帰分析

「部屋の広さ」から「家賃」を予測

ライブラリのインポート

データ作成

データのプロット

データの中心化

パラメータの最適化

推論

重回帰分析

scikit-learn

データセットのロード

データセットの分割

モデルの訓練

推論

numpy

パラメータの最適化

推論

推論値の比較

関連記事:

コメントを残す コメントをキャンセル

コメントを残すコメントをキャンセル