What is the difference between the spark.ml and spark.mllib packages?

spark.ml is built on DataFrames. It is the package Spark officially recommends today, and it is where new features will continue to land.

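spark.mllib is built on RDDs. This original RDD-based API is in maintenance mode: it receives bug fixes but no new features, and the Spark 2.x documentation expected the package to be removed in Spark 3.0.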
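
To make the difference concrete, here is a minimal sketch that fits the same logistic regression through both packages. The toy data, the object name `MlVsMllib`, and the local master setting are assumptions made for illustration and are not part of the original answer. It is meant only to show how the two entry points differ, not as a recommended setup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical object name, used only for this illustration.
object MlVsMllib {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ml-vs-mllib")
      .master("local[*]") // assumption: run locally
      .getOrCreate()

    // Made-up toy data: (label, feature values).
    val rows = Seq(
      (1.0, Array(0.0, 1.1, 0.1)),
      (0.0, Array(2.0, 1.0, -1.0)),
      (0.0, Array(2.0, 1.3, 1.0)),
      (1.0, Array(0.0, 1.2, -0.5))
    )

    // spark.ml: the input is a DataFrame with "label" and "features" columns,
    // and the estimator's fit() returns a model that transforms DataFrames.
    val df = spark.createDataFrame(
      rows.map { case (label, xs) => (label, MLVectors.dense(xs)) }
    ).toDF("label", "features")
    val mlModel = new LogisticRegression().setMaxIter(10).fit(df)
    mlModel.transform(df).select("label", "prediction").show()

    // spark.mllib: the input is an RDD[LabeledPoint],
    // and the model predicts on individual vectors (or RDDs of vectors).
    val rdd = spark.sparkContext.parallelize(
      rows.map { case (label, xs) => LabeledPoint(label, MLlibVectors.dense(xs)) }
    )
    val mllibModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(rdd)
    rdd.map(p => (p.label, mllibModel.predict(p.features)))
      .collect()
      .foreach(println)

    spark.stop()
  }
}
```

Note that the two packages use different vector types (`org.apache.spark.ml.linalg` versus `org.apache.spark.mllib.linalg`), which is why the imports are aliased here.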


For reference, the original text from the official Spark documentation:


As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

What are the implications?

  • MLlib will still support the RDD-based API in spark.mllib with bug fixes.
  • MLlib will not add new features to the RDD-based API.
  • In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
  • After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
  • The RDD-based API is expected to be removed in Spark 3.0.

