mnist_data: A subset of the MNIST dataset for classification
A function that loads a 5000-sample subset of the MNIST dataset into NumPy arrays.
from mlxtend.data import mnist_data
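Overview
The MNIST dataset was constructed from two datasets of the US National Institute of Standards and Technology (NIST). The training set consists of handwritten digits from 250 different people, 50 percent high school students and 50 percent employees of the Census Bureau. Note that the test set contains handwritten digits from different people, following the same split.
Features
Each feature vector (a row in the feature matrix) consists of 784 pixel intensities, unrolled from the original 28x28 pixel image.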
- Number of samples: a subset of 5000 images (the first 500 digits of each class)
- Target variable (discrete): 500 samples for each of the 10 digit classes
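The unrolling described above can be sketched with plain NumPy. This is a minimal illustration on a synthetic array, not real MNIST data, and it assumes row-major (C order) unrolling, which is what the `reshape(28, 28)` call in Example 2 implies.

```python
import numpy as np

# Synthetic stand-in for one 28x28 grayscale image (not real MNIST data).
image = np.arange(28 * 28, dtype=float).reshape(28, 28)

# Unroll row by row (C order) into a 784-element feature vector.
row = image.reshape(-1)

# Reverse the unrolling to recover the 2D image, e.g. for plotting.
restored = row.reshape(28, 28)

# Pixel (r, c) of the image sits at index r * 28 + c of the vector.
r, c = 3, 5
print(row.shape, restored.shape, row[r * 28 + c] == image[r, c])
```

Because the mapping is a pure reshape, no pixel values change; only the array layout does.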
References
- Source: https://yann.lecun.com/exdb/mnist/
- Y. LeCun and C. Cortes. MNIST handwritten digit database. AT&T Labs [Online]. Available: https://yann.lecun.com/exdb/mnist, 2010.
Example 1 - Dataset overview
from mlxtend.data import mnist_data
X, y = mnist_data()
print('Dimensions: %s x %s' % (X.shape[0], X.shape[1]))
print('1st row', X[0])
Dimensions: 5000 x 784
1st row [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 51. 159. 253. 159. 50.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 48. 238.
252. 252. 252. 237. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 54. 227. 253. 252. 239. 233. 252. 57. 6. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 10. 60. 224. 252. 253. 252. 202. 84. 252.
253. 122. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 163. 252. 252. 252. 253.
252. 252. 96. 189. 253. 167. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 51. 238.
253. 253. 190. 114. 253. 228. 47. 79. 255. 168. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 48. 238. 252. 252. 179. 12. 75. 121. 21. 0. 0.
253. 243. 50. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 38. 165. 253. 233. 208. 84. 0. 0.
0. 0. 0. 0. 253. 252. 165. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 7. 178. 252. 240. 71.
19. 28. 0. 0. 0. 0. 0. 0. 253. 252. 195. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 57.
252. 252. 63. 0. 0. 0. 0. 0. 0. 0. 0. 0.
253. 252. 195. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 198. 253. 190. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 255. 253. 196. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 76. 246. 252. 112. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 253. 252. 148. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 85. 252.
230. 25. 0. 0. 0. 0. 0. 0. 0. 0. 7. 135.
253. 186. 12. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 85. 252. 223. 0. 0. 0. 0. 0. 0. 0.
0. 7. 131. 252. 225. 71. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 85. 252. 145. 0. 0. 0.
0. 0. 0. 0. 48. 165. 252. 173. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 86. 253.
225. 0. 0. 0. 0. 0. 0. 114. 238. 253. 162. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 85. 252. 249. 146. 48. 29. 85. 178. 225. 253.
223. 167. 56. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 85. 252. 252. 252. 229. 215.
252. 252. 252. 196. 130. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 28. 199.
252. 252. 253. 252. 252. 233. 145. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 25. 128. 252. 253. 252. 141. 37. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
import numpy as np
print('Classes: digits 0-9')
print(np.unique(y))
print('Class distribution: %s' % np.bincount(y))
Classes: digits 0-9
[0 1 2 3 4 5 6 7 8 9]
Class distribution: [500 500 500 500 500 500 500 500 500 500]
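A balanced subset like the one counted above (the first 500 digits of each class) could be selected along the following lines. This is only a sketch under assumptions: `first_n_per_class` is a hypothetical helper, not part of mlxtend, the labels are synthetic, and returning the indices in their original order is a choice made here rather than something the source specifies.

```python
import numpy as np

def first_n_per_class(labels, n):
    """Return sorted indices of the first n occurrences of each class."""
    labels = np.asarray(labels)
    picked = [np.flatnonzero(labels == cls)[:n] for cls in np.unique(labels)]
    return np.sort(np.concatenate(picked))

# Synthetic labels standing in for a larger labeled set (not real MNIST labels):
# the digits 0-9 repeated 200 times, so each class occurs 200 times.
full_labels = np.tile(np.arange(10), 200)

idx = first_n_per_class(full_labels, n=50)
subset = full_labels[idx]

# Every class should now appear exactly n times.
print(np.bincount(subset))
```

The same `np.bincount` check used in Example 1 confirms that the selected subset is balanced.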
Example 2 - Visualizing MNIST
%matplotlib inline
import matplotlib.pyplot as plt
def plot_digit(X, y, idx):
    # Reshape the 784-pixel row back into a 28x28 image
    img = X[idx].reshape(28, 28)
    plt.imshow(img, cmap='Greys', interpolation='nearest')
    plt.title('true label: %d' % y[idx])
    plt.show()
plot_digit(X, y, 4)
API
mnist_data()
5000 samples from the MNIST handwritten digits dataset.
Data source: https://yann.lecun.com/exdb/mnist/
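Returns
- X, y : [n_samples, n_features], [n_class_labels]
  X is the feature matrix with 5000 image samples as rows; each row is a 784-pixel feature vector unrolled from an original 28x28 pixel image. y holds one class label per sample, drawn from the 10 unique labels 0-9.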
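When passing these arrays to other code, it can help to verify the layout described above first. The following is a minimal sketch: `check_digit_arrays` is a hypothetical helper, not part of mlxtend, and it is demonstrated on synthetic arrays rather than real MNIST data.

```python
import numpy as np

def check_digit_arrays(X, y, n_pixels=784, n_classes=10):
    """Raise ValueError if X, y do not match the documented layout."""
    X, y = np.asarray(X), np.asarray(y)
    if X.ndim != 2 or X.shape[1] != n_pixels:
        raise ValueError('X must have shape [n_samples, %d]' % n_pixels)
    if y.shape != (X.shape[0],):
        raise ValueError('y must hold one label per row of X')
    if y.min() < 0 or y.max() >= n_classes:
        raise ValueError('labels must lie in 0..%d' % (n_classes - 1))
    return X.shape[0]

# Synthetic arrays with the documented layout (not real MNIST data).
X_demo = np.zeros((20, 784))
y_demo = np.repeat(np.arange(10), 2)
print(check_digit_arrays(X_demo, y_demo))
```

With the real data, the call would be `check_digit_arrays(*mnist_data())`, which should return the number of samples if the arrays match the description.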
Examples
For usage examples, please see https://mlxtend.cn/mlxtend/user_guide/data/mnist_data/