MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and can expect more features to come. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.
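To make "fits the ML pipeline concept" more concrete, here is a minimal plain-Python sketch of the idea: each stage exposes a transform step that adds a column to the rows it receives, and a pipeline simply chains stages. The class names Tokenizer, TermCounter, and Pipeline below are hypothetical toy classes written for illustration. They do not call Spark, and the real spark.ml classes operate on DataFrames rather than lists of dicts.

```python
# Toy model of the pipeline idea. These classes are hypothetical
# illustrations and are NOT the Spark API.
from collections import Counter


class Tokenizer:
    """Transformer: adds a 'words' column by splitting the 'text' column."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class TermCounter:
    """Feature extractor: adds a 'counts' column of per-word frequencies."""

    def transform(self, rows):
        return [dict(row, counts=Counter(row["words"])) for row in rows]


class Pipeline:
    """Chains stages so the output rows of one stage feed the next."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


rows = [{"text": "Spark makes ML easy"}, {"text": "spark spark mllib"}]
result = Pipeline([Tokenizer(), TermCounter()]).transform(rows)
print(result[1]["counts"]["spark"])  # prints 2
```

An algorithm that can be expressed as a stage of this shape, taking a table in and returning the table with new columns, is the kind of component that composes well with others in a pipeline.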
We list major functionality from both below, with links to detailed guides.
spark.mllib: data types, algorithms, utilities