Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API

关键词：Local vector，Labeled point，Local matrix，Distributed matrix，RowMatrix，IndexedRowMatrix，CoordinateMatrix，BlockMatrix。
前言：MLlib支持本地向量和存储在单机上的矩阵，当然也支持被存储为RDD的分布式矩阵。一个有监督的机器学习的例子在MLlib里面叫做标签点。

1. 本地向量

一个本地向量是由整数类型和从０开始的索引存储在单机上
。MLlib支持两种本地向量，稠密向量和稀疏向量。稠密向量由一个浮点数组表示它的的所有值，而一个稀疏矩阵由两个平行的数组组成，索引和值。举个例子，一个向量，(1.0,0.0,3.0)能个用稠密表现为[1.0,0.0,3.0] 或者以稀疏的形式表现为（３，[0,2],[1.0,3.0]）,3是这个向量的大小。（本人注解：３为长度，即是元素个数，[0,2]为索引，[1.0,3.0],为值）

１．１MLlib认为如下数据类型是稠密向量：
~NumPys array
~Python list

1.2MLlib认为如下数据类型是稀疏向量：
~MLlib’s SparseVector.
~SciPy’s csc_matrix with a single colum

为了效率，我们推荐使用Numpy arrays ,并使用工厂方法继承Vectors 来创建稀疏矩阵。

import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0]) # Use a Python list as a dense vector. dv2 = [1.0, 0.0, 3.0] # Create a SparseVector. sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0]) # Use a single-column SciPy csc_matrix as a sparse vector. sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape = (3, 1))

２．标签点

标签点可以是一个本地的向量，可以是稀疏的也可以是稠密的，总之他们是带有标签的。在MLlib中，标签点用来进行有监督的学习算法。我们使用双精度数来存储一个标签，这样我们既可以用标签点做分类，也可以用来做回归了。对于二分法，一个标签应该不是０就是１。对于多种分类，标签应该是索引从0,1,2,3….

一个标签点用LabelPoint来表示。

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0]) # Create a labeled point with a negative label and a sparse feature vector. neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

稀疏数据
在训练中，有一个稀疏训练数据是有一件很平常的事情。MLlib支持读取一个以LIBSVM格式存储训练例子。LIBSVM是LIBSVM和LIBLINEAR默认的格式。这是一种每一行带有一个标签的的稀疏向量格式如下：

label index1:value1 index2:value2 ...

索引从１开始的升序排列的。当读取完成之后，这些特征索引转化为从０开始。

MLUtils.loadLibSVMFile　读取存储LIBSVM格式的训练模型

from pyspark.mllib.util import MLUtilsexamples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

３．本地矩阵
一个本地矩阵有整数型的行，和双精度的列索引，并且存储在单机上。MLlib支持将所有数据存储在一个单独的数组上并且以列为顺序的稠密矩阵，也支持稀疏矩阵。举个例子，比如像下面的稠密矩阵：

这个矩阵是一个存储在一维数组 [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]上的大小为(3,2)的矩阵。

本地矩阵的基础类是Matrix,我们提供两个实现函数：DenseMatrix和SparseMatrix。我们推荐Matrices里面的工厂实现方法来创造本地矩阵。记住，MLlib的本地矩阵是列为序存储。

from pyspark.mllib.linalg import Matrix, Matrices# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
dm2 = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6]) # Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)) sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])

４．分布式矩阵
一个分布式矩阵有long类型的行和列，还有double类型的值，并且被分布式存储在一个或者更多的RDDs中。选择正确的格式来存储巨大的分布式矩阵是很重要的事情。将一个分布式矩阵转换可能需要一个全局的清洗，这是代价非常昂贵的。直到现在，四种类型的分布式矩阵已经实现了。
这四种中基础类型叫做 RowMatrix。这种 RowMatrix一种面向行的分布式矩阵，没有所谓的行索引。比如：一个特征向量的集合。它依赖与RDD自己的行，并且RDD的每个行是一个本地向量。对于一个RowMatrix我们假设列的数量不是非常巨大，以便一个单独的本地向量能够合理正确的与驱动交流并且能够存储操作在一个正在使用它的节点上。
IndexedRowMatrix与 RowMatrix除了能被用来区分行和执行合并的行索引不同之外，其他都非常相似。CoordinateMatrix是一种以coordinate list (COO) 格式存储在RDDs条目中的分布式矩阵。
BlockMatrix 是一种被RDDMatrixBlock支持的分布式矩阵，MatrixBlock是元祖(Int, Int, Matrix).

NOTE
潜在的分布式矩阵RDD必须是确定的，因为我们缓存了矩阵的大小，一般来说使用一个非确定性的RDD会导致错误。

RowMatrix
RowMatrix是一个面向行的分布式矩阵，没有所谓的行索引,可以使用RDD的行，这些行是本地向量。既然每个行都被本地向量表示，列的数目被整数范围限制，但是列数目在实际情况应该是比行小的多的。

from pyspark.mllib.linalg.distributed import RowMatrix# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]) # Create a RowMatrix from an RDD of vectors. mat = RowMatrix(rows) # Get its size. m = mat.numRows() # 4 n = mat.numCols() # 3 # Get the rows as an RDD of vectors again. rowsRDD = mat.rows

IndexedRowMatrix

IndexedRowMatrix与 RowMatrix除了有意义的行索引外，其他都非常相似。它使用RDD索引行，以便每一行代表它的索引和本地向量。
一个 indexedRowMatrix可以被 indexedRowMatrix 创造出来，一个 indexedRowMatrix以能够被转化为RowMatrix通过去掉行索引。

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix# Create an RDD of indexed rows.
#   - This can be done explicitly with the IndexedRow class: indexedRows = sc.parallelize([IndexedRow(0, [1, 2, 3]), IndexedRow(1, [4, 5, 6]), IndexedRow(2, [7, 8, 9]), IndexedRow(3, [10, 11, 12])]) # - or by using (long, vector) tuples: indexedRows = sc.parallelize([(0, [1, 2, 3]), (1, [4, 5, 6]), (2, [7, 8, 9]), (3, [10, 11, 12])]) # Create an IndexedRowMatrix from an RDD of IndexedRows. mat = IndexedRowMatrix(indexedRows) # Get its size. m = mat.numRows() # 4 n = mat.numCols() # 3 # Get the rows as an RDD of IndexedRows. rowsRDD = mat.rows # Convert to a RowMatrix by dropping the row indices. rowMat = mat.toRowMatrix()

CoordinateMatrix

CoordinateMatrix是一个分布式矩阵，
并且由RDD的条目支持的。每一个条目就是一个元祖(i: Long, j: Long, value: Double），i是行索引，j 是列索引，value是条目值。CoordinateMatrix应该仅仅使用当矩阵规模特别大并且矩阵很稀疏的时候。

CoordinateMatrix 能够被MatrixEntry条目创建， CoordinateMatrix能被转化为 RowMatrix通过使用toRowMatrix，或者一个稀疏行IndexedRowMatrix通过使用 toIndexedRowMatrix.

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry# Create an RDD of coordinate entries.
# - This can be done explicitly with the MatrixEntry class: entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(6, 1, 3.7)]) # - or using (long, long, float) tuples: entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)]) # Create an CoordinateMatrix from an RDD of MatrixEntries. mat = CoordinateMatrix(entries) # Get its size. m = mat.numRows() # 3 n = mat.numCols() # 2 # Get the entries as an RDD of MatrixEntries. entriesRDD = mat.entries # Convert to a RowMatrix. rowMat = mat.toRowMatrix() # Convert to an IndexedRowMatrix. indexedRowMat = mat.toIndexedRowMatrix() # Convert to a BlockMatrix. blockMat = mat.toBlockMatrix()

BlockMatrix

BlockMatrix是一个分布式矩阵，并且被 MatrixBlocks支持， MatrixBlocks是一个元祖， ((Int, Int), Matrix)，(Int, Int)是块索引，Matrix是rowsPerBlock x colsPerBlock的形状。

from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix # Create an RDD of sub-matrix blocks. blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])), ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))]) # Create a BlockMatrix from an RDD of sub-matrix blocks. mat = BlockMatrix(blocks, 3, 2) # Get its size. m = mat.numRows() # 6 n = mat.numCols() # 2 # Get the blocks as an RDD of sub-matrix blocks. blocksRDD = mat.blocks # Convert to a LocalMatrix. localMat = mat.toLocalMatrix() # Convert to an IndexedRowMatrix. indexedRowMat = mat.toIndexedRowMatrix() # Convert to a CoordinateMatrix. coordinateMat = mat.toCoordinateMatrix()

原文转自 http://blog.csdn.net/qq_30115765/article/details/52594421

转载于:https://www.cnblogs.com/honey01/p/8043201.html

Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API相关推荐

【Python百日基础系列】Day03 - Python 数据类型
文章目录一.Python中的数据类型 1.1 数据类型系统是编程语言的核心 1.2 Python中的数据类型种类 1.3 数据类型的可变性 1.4 数据类型查看函数 - type() 二.Numbe ...
基于sparkMLlib的机器学习_[1]_基本数据类型[4]分布式矩阵DistrubutedMatrix
参考: <spark MLlib机器学习实践> <SPARK MLLIB机器学习算法.源码及实战详解> http://spark.apache.org/docs/latest ...
机器学习之数学基础(二)～数组、向量、矩阵、向量空间、二维矩阵
1. 概述在学习机器学习(machine learning)或模式识别(pattern recognition)过程中,我经常会困惑于向量.数组和矩阵这三种数据结构,而在学习张学工教授<模式识 ...
【机器学习线性代数】19 复数域中的向量与矩阵
1 回顾:复数和复平面首先我们快速的回顾一下复数和复平面的基本知识,便于后续知识的介绍.我记得当我们接触到 x 2 = − 1 x^2=-1 x2<
Spark机器学习实战 (十二) - 推荐系统实战
0 相关源码将结合前述知识进行综合实战,以达到所学即所用.在推荐系统项目中,讲解了推荐系统基本原理以及实现推荐系统的架构思路,有其他相关研发经验基础的同学可以结合以往的经验,实现自己的推荐系统. 1 ...
Spark Mllib里的分布式矩阵（行矩阵、带有行索引的行矩阵、坐标矩阵和块矩阵概念、构成）（图文详解）...
不多说,直接上干货! Distributed matrix : 分布式矩阵一般能采用分布式矩阵,说明这数据存储下来,量还是有一定的.在Spark Mllib里,提供了四种分布式矩阵存储形式,均由支持 ...
Spark入门实战系列--8.Spark MLlib（上）--机器学习及SparkMLlib简介
[注]该系列文章以及使用到安装包/测试数据可以在<倾情大奉送--Spark入门实战系列>获取 1.机器学习概念 1.1 机器学习的定义在维基百科上对机器学习提出以下几种定义: l&qu ...
Spark机器学习库（MLlib）指南
spark-1.6.1 机器学习库(MLlib)指南 MLlib是Spark的机器学习(ML)库.旨在简化机器学习的工程实践工作,并方便扩展到更大规模.MLlib由一些通用的学习算法和工具组成,包括分 ...
笔记：python spark机器学习与hadoop大数据
目录 1.1机器学习的介绍机器学习架构 1.2Spark的介绍 1.3Spark数据处理RDD.DataFrame.Spark SQl 1.4使用python开发spark机器学习与大数据应用 1. ...

Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API

Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API

Spark机器学习MLlib系列１（for python）－－数据类型，向量，分布式矩阵，API相关推荐

最新文章

热门文章