Building real-time, personalized recommendations with Kiji
Recommendations are now everywhere online. Amazon and other major e-commerce sites recommend products to users in various forms, based on the properties of the page being viewed. Financial planning sites such as Mint.com offer users many suggestions, such as credit cards they may want to apply for or banks that offer better interest rates. Google tailors search results using a user's search history to surface more relevant results.
Well-known companies use these contextual, relevant recommendations to improve the user experience, and with it conversion rates and customer satisfaction. Originally, such recommendations were generally produced by batch jobs that generated new recommendations nightly, weekly, or monthly.
For certain types of recommendations, however, the response time must be much shorter than a batch job allows, for example when offering consumers recommendations based on their geographic location. Consider a movie recommendation system: if a user has previously watched action movies but is now looking for a comedy, batch recommendations are likely to offer more action movies rather than the most relevant comedies. This article describes how to build a real-time recommendation system with the Kiji framework, an open source framework for building big data applications.
Kiji, entity-centric data, and the 360-degree view
To build a real-time recommendation system, we first need a system that can store a 360-degree view of the customer. We also need to be able to retrieve the data for a given user quickly, so that recommendations can be made while the user is interacting with a website or mobile application. Kiji is a modular open source framework for real-time applications that collects, stores, and analyzes this kind of data.
In general, the data needed for a 360-degree view can be called entity-centric data. An entity can be any number of things: a customer, user, or account, or something more abstract such as a POS system or mobile device.
The goal of an entity-centric storage system is to store all the information about a particular entity in a single row. This is a challenge for traditional relational databases, because the information can include both state data (such as name and email address) and event streams (such as clicks). Traditional systems need to store this data in multiple tables and join them at query time, which makes real-time processing difficult. To solve this problem, Kiji uses Apache HBase, which stores data in four dimensions: row, column family, column qualifier, and timestamp. With the timestamp dimension and HBase's ability to store multiple versions of a cell, Kiji can store event stream data alongside more slowly changing state data.
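The four-dimensional addressing scheme can be illustrated with a small in-memory sketch. This is plain Python, not the HBase or Kiji API; the class name `VersionedTable`, its methods, and the sample row and column names are hypothetical and exist only to show how one row can hold both overwritten state and an append-only event stream.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory table addressed by (row, family, qualifier, timestamp)."""

    def __init__(self):
        # row key -> (family, qualifier) -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, timestamp, value):
        self._rows[row][(family, qualifier)][timestamp] = value

    def get_latest(self, row, family, qualifier):
        """Return the newest version of a cell, or None if the cell is empty."""
        versions = self._rows[row][(family, qualifier)]
        if not versions:
            return None
        return versions[max(versions)]

    def get_versions(self, row, family, qualifier):
        """Return all versions of a cell as (timestamp, value), newest first."""
        versions = self._rows[row][(family, qualifier)]
        return sorted(versions.items(), reverse=True)


table = VersionedTable()
# Slowly changing state: one cell, occasionally overwritten by a newer version.
table.put("user42", "info", "email", 1000, "old@example.com")
table.put("user42", "info", "email", 5000, "new@example.com")
# Event stream: every click is a new timestamped version of the same cell.
for ts, page in [(1001, "/home"), (1002, "/cart"), (1003, "/checkout")]:
    table.put("user42", "events", "click", ts, page)
```

Reading state means asking for the latest version of a cell, while reading the event stream means asking for all of its versions, and both live in the same row.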
HBase is a key-value store in the Apache Hadoop ecosystem. It is built on HDFS, which provides the scalability that big data solutions require. Developing applications on HBase is challenging, however, because it requires all data going in and out of the system to be byte arrays. To solve this problem, the final core component of Kiji is Apache Avro, which Kiji uses to store easily handled data types such as standard strings and integers, as well as more complex user-defined types. When reading and writing data, Kiji handles the necessary serialization and deserialization for the application.
Developing models for use in real time
Kiji provides two APIs for developing models, one in Java and one in Scala, and both have batch and real-time components. The purpose of this split is to break model execution into distinct phases. The batch phase is the training phase, typically a learning process in which the model is trained on the complete data set. The output of this phase might be the parameters of a linear classifier, the cluster locations for a clustering algorithm, or a similarity matrix relating items in a collaborative filtering system. The real-time phase is called scoring: it takes the trained model and combines it with an entity's data to produce derived information. Critically, derived data is treated as a first-class citizen, meaning that it can be written back to the entity's row and used to serve recommendations or as input to later computations.
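The split between the two phases can be sketched in a few lines of plain Python. This is not KijiMR or KijiExpress code: the functions `train` and `score`, the row layout, and the deliberately trivial "best-rated movies per genre" model are all assumptions made for illustration.

```python
from collections import defaultdict


def train(all_ratings):
    """Batch phase: scan the full data set and build the model.

    all_ratings: iterable of (movie, genre, rating) tuples from every user.
    Returns {genre: [movies ordered by mean rating, best first]}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (genre, movie) -> [sum, count]
    for movie, genre, rating in all_ratings:
        entry = totals[(genre, movie)]
        entry[0] += rating
        entry[1] += 1
    by_genre = defaultdict(list)
    for (genre, movie), (total, count) in totals.items():
        by_genre[genre].append((total / count, movie))
    return {g: [m for _, m in sorted(ms, reverse=True)] for g, ms in by_genre.items()}


def score(model, entity_row, limit=2):
    """Real-time phase: combine the trained model with one entity's data."""
    latest_genre = entity_row["recent_genres"][-1]
    seen = set(entity_row["watched"])
    recs = [m for m in model.get(latest_genre, []) if m not in seen][:limit]
    # Derived data is written back to the entity's row for later use.
    entity_row["recommendations"] = recs
    return recs


ratings = [("Heat", "action", 5), ("Heat", "action", 4), ("Ronin", "action", 3),
           ("Airplane!", "comedy", 5), ("Clue", "comedy", 4), ("Clue", "comedy", 3)]
model = train(ratings)

# A user with an action-heavy history who has just switched to comedy.
user = {"watched": ["Heat"], "recent_genres": ["action", "action", "comedy"]}
recommendations = score(model, user)
```

Because scoring reads the user's most recent activity at request time, the switch to comedy is reflected immediately, without retraining the model.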
The Java API is called KijiMR, and the Scala API forms the core of the KijiExpress tool. KijiExpress uses the Scalding library to provide an API for building complex MapReduce workflows while avoiding much of the boilerplate Java code and the job scheduling and coordination needed to chain MapReduce jobs together.
The individual versus the whole
The reason for splitting the work into batch training and real-time scoring is Kiji's observation that population-level trends change slowly, while individual trends change quickly.
Consider a data set that contains tens of millions of purchase records across all users. One more purchase is unlikely to have a significant effect on the overall trends in likes and dislikes. But for a particular user with only 10 purchases on record, the 11th purchase will have a large effect on what the system determines that user's interests to be. Given this, an application only needs to retrain its model once it has collected enough data to affect the overall trends. For an individual user, however, we can improve the relevance of recommendations by responding to the user's behavior in real time.
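A quick calculation makes the difference in scale concrete. The figures below are illustrative assumptions rather than numbers from any real data set: a category that accounts for 20% of purchases, a population with 10 million purchase records, and a user with 10.

```python
def share_shift(matching, total):
    """Change in a category's share after one more purchase in that category."""
    return (matching + 1) / (total + 1) - matching / total


# Whole population: 2 of 10 million purchases fall in the category.
overall_shift = share_shift(matching=2_000_000, total=10_000_000)

# Single user: 2 of 10 purchases fall in the category.
user_shift = share_shift(matching=2, total=10)
```

The same single purchase moves the user's category share by roughly 7 percentage points, while the population-wide share moves by less than a millionth.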
Scoring against a model in real time
To score in real time, the KijiScoring module provides a lazy evaluation system, so that an application generates new recommendations only for active users who interact with it frequently. By computing lazily, Kiji applications avoid generating recommendations for users who visit infrequently or never return. This has an additional benefit: Kiji can take contextual information, such as the location of a mobile device, into account when making recommendations.
The main component of KijiScoring is called the Freshener. A Freshener is really a combination of two other Kiji components: ScoringFunctions and FreshnessPolicies. As mentioned earlier, a model consists of a training phase and a scoring phase. A ScoringFunction is a piece of code that describes how to combine a trained model with a single entity's data to produce a score or recommendation. A FreshnessPolicy defines when data becomes stale or out of date. A simple FreshnessPolicy might state, for example, that data expires once it is more than an hour old. A more complex policy might mark data as stale after the entity has experienced a certain number of events, such as clicks or product views. Finally, the ScoringFunction and FreshnessPolicy are attached to a specific column in a Kiji table, and are triggered to refresh the data when necessary.
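The two kinds of freshness policy described above can be sketched as follows. These classes only mirror the concepts; they are not the KijiScoring interfaces, and the names `MaxAgePolicy`, `MaxEventsPolicy`, `is_fresh`, and the column names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


class MaxAgePolicy:
    """Simple policy: data is stale once it is older than max_age_seconds."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds

    def is_fresh(self, written_at, events_since_write, now):
        return now - written_at <= self.max_age_seconds


class MaxEventsPolicy:
    """Event-driven policy: data is stale after max_events new events."""

    def __init__(self, max_events):
        self.max_events = max_events

    def is_fresh(self, written_at, events_since_write, now):
        return events_since_write < self.max_events


@dataclass
class Freshener:
    """Pairs one freshness policy with one scoring function."""
    policy: object
    scoring_function: Callable[[dict], list]


# Fresheners are attached to specific columns (column names are hypothetical).
# The scoring function here is a placeholder that echoes the last three views.
attached = {
    "recs:hourly": Freshener(MaxAgePolicy(3600), lambda entity: entity["viewed"][-3:]),
    "recs:after_clicks": Freshener(MaxEventsPolicy(3), lambda entity: entity["viewed"][-3:]),
}
```

Keeping the policy separate from the scoring function means the same scoring code can be refreshed on different schedules for different columns.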
Applications that score in real time include a server layer, called the KijiScoring server tier, that is responsible for refreshing stale data. When a user interacts with the application, the request is passed to the KijiScoring server tier, which communicates directly with the HBase cluster. The KijiScoring server requests the data and, once it has been retrieved, checks it against the FreshnessPolicy to determine whether it is up to date. If the data is fresh, it is returned directly to the client. If the data is stale, the KijiScoring server runs the specified ScoringFunction for the requesting user. The important point is that data or recommendations are refreshed only for the requesting user, rather than by a batch operation that refreshes data for all users. This lets Kiji do only the work that is necessary. Once refreshed, the data is returned to the user and also written back to HBase for later use.
A typical Kiji application includes some number of KijiScoring servers, which are stateless Java processes that can be scaled out and that can run a ScoringFunction using a single entity's data as input. A Kiji application routes client requests through a KijiScoring server, which determines whether the data is fresh. If necessary, it runs the ScoringFunction to refresh the recommendations before they are passed back to the client, and writes the recomputed data back to HBase for later use.
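The read path described above (fetch, check freshness, rescore only if stale, write back) can be sketched as follows. A plain dictionary stands in for the HBase-backed table, and the class `ScoringServer`, its constructor arguments, and the row layout are hypothetical rather than the actual KijiScoring server.

```python
class ScoringServer:
    """Holds no per-user state; everything it needs lives in the table."""

    def __init__(self, table, is_fresh, scoring_function, clock):
        self.table = table                        # stand-in for the HBase table
        self.is_fresh = is_fresh                  # callable(written_at, now) -> bool
        self.scoring_function = scoring_function  # callable(entity_data) -> value
        self.clock = clock                        # callable() -> current time

    def get_recommendations(self, user_id):
        row = self.table[user_id]
        now = self.clock()
        cached = row.get("recs")  # (written_at, value), or None if never scored
        if cached is not None and self.is_fresh(cached[0], now):
            return cached[1]                      # fresh: return it directly
        value = self.scoring_function(row["data"])
        row["recs"] = (now, value)                # write back for later requests
        return value


calls = []


def recommend(data):
    calls.append(data["last_genre"])              # record each scoring run
    return ["top " + data["last_genre"]]


current_time = [0]
table = {"alice": {"data": {"last_genre": "comedy"}},
         "bob": {"data": {"last_genre": "action"}}}
server = ScoringServer(table, lambda written, now: now - written <= 3600,
                       recommend, lambda: current_time[0])

first = server.get_recommendations("alice")   # never scored: runs the function
second = server.get_recommendations("alice")  # still fresh: served from the table
current_time[0] = 7200
third = server.get_recommendations("alice")   # over an hour old: rescored
```

The scoring function runs twice for alice and never for bob, who made no request, which is the lazy behavior the text describes.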
Deploying models to a production system
An important goal for a real-time recommendation system is to make it easy to iterate on the underlying predictive models, without taking the production application down to deploy a new or improved model. For this, Kiji provides the Kiji model repository, which combines the code that trains and scores a model with metadata describing how the model should be executed. A KijiScoring server needs to know which column accesses should trigger a refresh, which FreshnessPolicy to apply, which ScoringFunction to execute against the user's data, and the location of any trained models or external data needed for scoring. This metadata is stored in a Kiji system table, which is simply another HBase table underneath. In addition, the model repository stores code artifacts for registered models in a managed Maven repository. KijiScoring servers periodically poll the model repository for newly registered or unregistered models, and load or unload code as needed.
Putting it all together
A very common way to produce recommendations is collaborative filtering. Collaborative filtering algorithms typically build a large similarity matrix that stores how each product relates to the other products in the catalog. Each row of the matrix represents a product Pi and each column represents another product Pj. The value at (Pi, Pj) is the similarity between the two products.
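As a minimal sketch of how such a matrix might be filled in, the following computes item-to-item similarities from user ratings. The article does not say which similarity measure is used; cosine similarity is assumed here purely for illustration, and the function name and input layout are hypothetical.

```python
import math
from collections import defaultdict


def item_similarities(ratings):
    """ratings: {user: {product: rating}} -> {(Pi, Pj): cosine similarity}."""
    vectors = defaultdict(dict)  # product -> {user: rating}
    for user, items in ratings.items():
        for product, rating in items.items():
            vectors[product][user] = rating

    norms = {p: math.sqrt(sum(r * r for r in v.values())) for p, v in vectors.items()}
    sims = {}
    for pi in vectors:
        for pj in vectors:
            if pi == pj:
                continue
            # Only users who rated both products contribute to the dot product.
            shared = set(vectors[pi]) & set(vectors[pj])
            dot = sum(vectors[pi][u] * vectors[pj][u] for u in shared)
            denom = norms[pi] * norms[pj]
            sims[(pi, pj)] = dot / denom if denom else 0.0
    return sims


ratings = {
    "u1": {"A": 5, "B": 5},
    "u2": {"A": 4, "B": 4, "C": 1},
    "u3": {"C": 5},
}
sims = item_similarities(ratings)
```

Products A and B are rated identically by the same users and so end up far more similar to each other than either is to C. This all-pairs loop is quadratic in the number of products, which is one reason the work belongs in a batch job.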
In Kiji, the similarity matrix is computed by a batch training process and can then be stored in a file or a Kiji table. Each row of the similarity matrix is stored in a single column of that product's row in a Kiji products table. In practice, this column can become very large, because it would list every product in the catalog along with its similarity. Typically, the batch job therefore writes only the most similar entries to the table.
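Keeping only the most similar entries is a top-N selection over each row of the matrix. The sketch below uses Python's standard `heapq.nlargest`; the function name, the sample similarities, and the dictionary standing in for the products table are illustrative assumptions.

```python
import heapq


def top_n_similar(similarity_row, n):
    """Keep the n most similar products from one row of the matrix."""
    return heapq.nlargest(n, similarity_row.items(), key=lambda kv: kv[1])


# One row of the similarity matrix, for product P1 (values are made up).
row_for_p1 = {"P2": 0.91, "P3": 0.12, "P4": 0.67, "P5": 0.05}

# The pruned list is what gets written to P1's row in the products table.
products_table = {"P1": {"similar": top_n_similar(row_for_p1, 2)}}
```

Pruning bounds the size of each stored column regardless of how large the catalog grows.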
At scoring time, the similarity matrix is accessed through the KeyValueStore API, which provides access to external data. Matrices that are too large to fit entirely in memory can be stored in a distributed table, so that the application requests only the data it needs for a computation, greatly reducing its memory requirements.
Because much of the hard work is done before the scoring phase, scoring itself becomes a fairly simple operation. If we want to display recommendations based on the item being viewed, a generic scoring function only needs to look up the related products in the products table and display them.
Taking this one step further and personalizing the results is a relatively simple task. In a personalized system, the scoring function takes the user's recent product ratings and uses the KeyValueStore API to find products similar to the ones the user has rated. By combining those ratings with the similarities stored in the products table, the application can predict the ratings the user would give to related items, and recommend the products with the highest predicted ratings. By limiting the number of ratings used and the number of similar products considered per rating, the system can easily perform this operation while the user is interacting with the application.
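A personalized scoring step along these lines might look like the sketch below. The article does not give the prediction formula, so a similarity-weighted average of the user's ratings is assumed; a plain dictionary stands in for the KeyValueStore lookup, and every name here is hypothetical.

```python
def predict_ratings(recent_ratings, similar_products, max_ratings=10, max_neighbors=20):
    """Predict ratings for products related to what the user rated recently.

    recent_ratings: list of (product, rating), newest first.
    similar_products: {product: [(other_product, similarity), ...]}, best first.
    Returns [(product, predicted_rating), ...], highest prediction first.
    """
    already_rated = {product for product, _ in recent_ratings}
    weighted = {}  # candidate -> [sum(similarity * rating), sum(similarity)]
    # Both limits bound the work done per request.
    for product, rating in recent_ratings[:max_ratings]:
        for other, similarity in similar_products.get(product, [])[:max_neighbors]:
            if other in already_rated or similarity <= 0:
                continue
            acc = weighted.setdefault(other, [0.0, 0.0])
            acc[0] += similarity * rating
            acc[1] += similarity
    predictions = {p: num / den for p, (num, den) in weighted.items()}
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)


recent = [("P1", 5), ("P2", 2)]
similar = {
    "P1": [("P3", 0.8), ("P4", 0.2)],
    "P2": [("P1", 0.9), ("P4", 0.8)],
}
ranked = predict_ratings(recent, similar)
```

P3 is similar only to the highly rated P1, so it ranks first. P4 is mostly similar to the poorly rated P2, which pulls its predicted rating down.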
In this article, we have seen how to use Kiji to develop a recommendation system that refreshes its recommendations in real time. By using HBase for low-latency processing, Avro for storing complex data types, and MapReduce and Scalding for processing data, an application can provide users with relevant recommendations in real time. If you want to see sample code for this system, please visit.