1.2 ML Pipelines: Spark in a High - Level API for MLlib
ML Pipelines: A New High-Level API for MLlib Article from Databricks Blog by Xiangrui Meng, Joseph Bradley, Evan Sparks (UC Berkeley) and Shivaram Venkataraman (UC Berkeley).
In each version update, in addition to the new algorithm and performance upgrade, Databricks on the usability of MLlib also spent a lot of energy.Similar to Spark Core, MLlib API provides three programming languages: Python, Java and Scala.In addition, MLlib also provides a code example, to facilitate users to learn and use different background.In Spark 1.2, with AMPLab (UC Berkeley), a pipeline API is added to the MLlib, simplified MLlib establish work again, and add the tuning mechanism for ML pipelines.
In practical application, a ML pipeline often include a series of stages, such as data preprocessing, feature extraction, model simulation and visualization.For example, text categorization may contain text segmentation and cleaning, feature extraction, and a classification model by training cross validation.Although there are many libraries can use the moment every step, but connect each step is not easy, especially in the large scale scenarios.At present, most of the library does not support distributed computing, or they do not support a native pipeline established and optimized.Unfortunately, the problem is often ignored by the academic circles, and in the industry have to focus on.
This blog will briefly Databricks and AMPLab in ML pipeline (MLlib) work, by some of the designScikit - learn the projectAnd some of the earlyMLI workInspired by.
In the new pipeline design, data sets are usually made of Spark, SQL SchemaRDD a series of data sets and ML pipeline conversion performance.Every transformation intake of a set of input data, and outputs a transformed data set, and the output data set will be the next step of the input data set.Using Spark SQL, mainly considering the following factors: data input/output, flexible column type and operation, as well as the implementation plan optimization.
Data input and output is a ML the beginning and end of the pipeline.MLlib now has a type provides a practical tool for the input and output, including LabeledPoint used for classification and regression, Rating for collaborative filtering, etc.However real data sets may include a variety of types, such as user/item ID, timestamp, or original records, right now, the tools are not perfectly supports all of these types.At the same time, they also use the inherit the inefficiency of the text from other ML repository storage format.
Mainstream ML pipeline usually contain feature conversion phase, you may consider the characteristics of conversion to add a new column to the existing columns.For example,
such as: text participle documents into a large number of words, whereas tf - idf converts these words to a feature vector.In this process, the tag will be applied to the model fitting.At the same time, in the actual process, feature conversion also often appear more complex.Therefore, data sets need to support different types of columns, including dense/sparse vector, as well as the existing column set up the operation of the new column.
In the example above, the id, the text and the words in the transformation will be moved to.In the model fitting, they are not needed, but in the prediction and model checking when they will be used.If the forecast data set contains only predicted labels, so they do not provide too much information.If we want to test results, such as false positives, then combined with predicted labels, raw input text and tokenized words are necessary.Thus, if the underlying execution engine optimized, and only load the required column will be very necessary.
Fortunately, the function of the Spark has provided the most desired SQL, institutions do not need to start again.Spark support read from the Parque SchemaRDDs, and support to write SchemaRDDs into corresponding Parque.Parque is a very effective storage formats, can free conversion between RDDs and SchemaRDDs, it also supports the Hive and Avro such external data sources.Using Spark SQL, the establishment of (said statement might be more accurate) a new column will be very convenient and friendly.SchemaRDD materialization lazy mode, Spark can be based on SQL column needs to optimize execution plan, can better meet the demand of users.Support SchemaRDD standard data type, in order to make its can better support ML, the technical team to add the support for vector type (user-defined types), at the same time support the dense and sparse feature vector.
Here is a Scala code, it implements the ML input/output data sets, and some simple function.In Spark knowledge base "examples/" directory, you find that some of the more complex data sets sample (using the Scala and Python).Here, we recommend that users readSpark SQL’s user guideTo see more SchemaRDD details, and its support operations.
val sqlContext = SQLContext(sc)
import sqlContext._ // implicit conversions
// Load a LIBSVM file into an RDD[LabeledPoint].
val labeledPointRDD: RDD[LabeledPoint] =
// Save it as a Parquet file with implicit conversion
// from RDD[LabeledPoint] to SchemaRDD.
// Load the parquet file back into a SchemaRDD.
val dataset = parquetFile("/path/to/parquet")
// Collect the feature vectors and print them.
The new Pipeline API called "spark. Ml" bag.Pipeline is composed of multiple steps, these steps can generally be divided into two types: the Transformer and the Estimator.Transformer will consume a set of data, and the output of a new data set.Points, for example, the phrase is a Transformer, it will be a set of text data into a tokenized words data set.Estimator must first meet the input data set, and according to the input data set to create a model.Logic, for example, return is an Estimator, it will have labels and feature in a training data set, and returns a logistic regression model.
Pipeline set up is simple: simple declaration of its steps, configuration parameters, and will be in a Pipeline encapsulated in the object.The following code illustrates a simple text categorization pipeline, by 1 points phrase pieces, 1 hash Term Frequency feature extraction module, and a logistic regression.
val tokenizer = new Tokenizer()
val hashingTF = new HashingTF()
val lr = new LogisticRegression()
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
Pipeline is in itself a Estimator, so we can easy to use.
val model = pipeline.fit(trainingDataset)
Fitting model includes the phrase, hash TF feature extraction module, as well as fitting a logistic regression model.The chart below to draw the entire workflow, dashed part only in pipeline fitting.
The fitting model of Pipeline is a Transformer that can be used to predict and model validation and model test.
.select('text, 'label, 'prediction)
On the ML algorithm, there is a trouble thing is that they have a lot of hyperparameters need to be adjusted.At the same time, these hyperparameters and be MLlib to optimize the model parameters are completely different.Of course, if the lack of data and the professional knowledge on the algorithm, it is difficult to find the optimal combination of the combination of these hyperparameters.However, even with the professional knowledge, as the pipeline and the increasing scale of hyperparameters, this process will be complicated.In practice, the adjustment of the hyperparameters is usually associated with works is the final result.For example, in the pipeline below, we need two hyperparameters tuning, we were given three different values.As a result, may eventually produce nine different combinations, we expect to find an optimal combination.
Here, the spark support hyperparameter cross validation.Cross validation is as an element method, by the user to specify parameters combination for low-level Estimator.The Estimator can be a pipeline, it can team up with the Evaluator and output a scalar measurement is used to predict, such as precision.Tuning a Pipeline is very easy: // Build a parameter grid.
val paramGrid = new ParamGridBuilder()
.addGrid(hashingTF.numFeatures, Array(10, 20, 40))
.addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
// Set up cross-validation.
val cv = new CrossValidator()
// Fit a model with cross-validation.
val cvModel = cv.fit(trainingDataset)
In a ML in the pipeline, of course, the user can embed their own transformers or estimators is very important (has achieved pipeline based on the user interface).This API makes MLlib external code to use and easy to share, we recommend that users to read spark.ml user guideTo get more information about pipeline API。