Multinomial Logistic Regression in Spark ML vs MLlib
Spark version 2.0.0 has a stated goal of bringing feature parity between the ml and the now-deprecated mllib packages.
Presently, the ml package only provides elastic-net support for binary logistic regression. To obtain multinomial logistic regression, do we apparently have to accept using the deprecated mllib?
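For concreteness, here is a minimal sketch of what the mllib route looks like; `training` is a hypothetical RDD of labeled points, and the class/method names are the RDD-based API as I understand it:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// `training` is assumed to be an RDD[LabeledPoint] with labels 0.0 to k-1.
val training: RDD[LabeledPoint] = ???

// The RDD-based API handles the multinomial case directly via setNumClasses,
// but it does not expose elastic-net regularization the way spark.ml does.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)
```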
The downsides of using mllib:
- It is deprecated, so we would have to field "why are you using the old stuff?" questions.
- It does not use the
ml
Pipeline workflow and does not integrate cleanly, so for the above reasons we would eventually have to rewrite.
Is there an approach available for achieving one-vs-all multinomial classification with the ml
package?
This is an answer in progress. There is a OneVsRest
classifier in spark.ml
.
Apparently the approach is to provide a LogisticRegression
instance as the binary classifier: OneVsRest runs the binary version once per class and returns the class with the highest score.
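A minimal sketch of that approach, assuming `train` and `test` are DataFrames with the usual "label" and "features" columns (those names are placeholders, not from the question):

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame

// `train` and `test` are assumed DataFrames with "label" and "features" columns.
val train: DataFrame = ???
val test: DataFrame = ???

// Elastic-net-regularized binary logistic regression as the base classifier.
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)
  .setElasticNetParam(0.5)

// OneVsRest fits one binary model per class and predicts the class
// whose model produces the highest score.
val ovr = new OneVsRest().setClassifier(lr)

val ovrModel = ovr.fit(train)
val predictions = ovrModel.transform(test)

// The "accuracy" metric is available from Spark 2.0 onward.
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"Test accuracy = $accuracy")
```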
Update in response to @zero323: here is info from Xiangrui Meng
on the deprecation of mllib:
Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Hi all, more than a year ago, in Spark 1.2 we introduced the ML pipeline API built on top of Spark SQL's DataFrames. Since then, the new DataFrame-based API has been developed under the spark.ml package, while the old RDD-based API has been developed in parallel under the spark.mllib package. While it was easier to implement and experiment with new APIs under the new package, it became harder and harder to maintain as both packages grew bigger and bigger. And new users are often confused by having two sets of APIs with overlapped functions. We started to recommend the DataFrame-based API over the RDD-based API in Spark 1.5 for its versatility and flexibility, and we saw development and usage gradually shifting to the DataFrame-based API. Just counting lines of Scala code, from 1.5 to the current master we added ~10000 lines to the DataFrame-based API while ~700 to the RDD-based API. So, to gather more resources on the development of the DataFrame-based API and to help users migrate over sooner, I want to propose switching the RDD-based MLlib APIs to maintenance mode in Spark 2.0. What does it mean exactly?

* We do not accept new features in the RDD-based spark.mllib package, unless they block implementing new features in the DataFrame-based spark.ml package.
* We still accept bug fixes in the RDD-based API.
* We will add more features to the DataFrame-based API in the 2.x series to reach feature parity with the RDD-based API.
* Once we reach feature parity (possibly in Spark 2.2), we will deprecate the RDD-based API.
* We will remove the RDD-based API from the main Spark repo in Spark 3.0.

Although the RDD-based API is already in de facto maintenance mode, this announcement will make it clear and is hence important to both MLlib developers and users. We'd appreciate your feedback! (As a side note, people sometimes use "Spark ML" to refer to the DataFrame-based API or to the entire MLlib component. This also causes confusion. To be clear, "Spark ML" is not an official name and there are no plans to rename MLlib to "Spark ML" at this time.) Best, Xiangrui
Another update: there is a JIRA, with work nearing completion as of May 2016, to support multiclass logistic regression in spark.ml
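For context, once that work lands (it shipped in Spark 2.1 and later), multinomial logistic regression can be requested directly on the DataFrame-based estimator, so OneVsRest is no longer required for the multiclass case. A sketch, reusing the hypothetical `train` DataFrame from above:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame

// `train` is assumed to have "label" and "features" columns, as above.
val train: DataFrame = ???

// From Spark 2.1 onward, setFamily("multinomial") selects multinomial
// logistic regression while keeping elastic-net support.
val mlr = new LogisticRegression()
  .setFamily("multinomial")
  .setRegParam(0.01)
  .setElasticNetParam(0.5)

val mlrModel = mlr.fit(train)
```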