Building a Numerical Regression Model with Amazon Machine Learning

By Joanne Gardner, 2015-04-16 20:11

    In production, businesses often need to predict future values. Good predictions support better resource planning and business decisions. Because more complex models such as numerical regression carry overhead, companies often fall back on cheap approaches, such as averaging over the previous period and adding a few assumptions.

    This post uses a bike rental program as an example and predicts the hourly demand for bikes in a specific city. In this scenario, you need a machine learning model that predicts a numeric value from a set of features (or predictors). Here, you will use open data from Kaggle to build a regression model. Having built this model, you can apply the same approach to your own machine learning scenarios.

    The difference between analysis and machine learning

    The bike rental program is a good way to illustrate the limits of an analytics system when it comes to accurate prediction. One of the Kaggle participants built a website for analyzing the provided data. If you click the Plots tab, you can see a data visualization built with R. Shiny is a very popular free web interface for R; for details, see the View Bike Sharing Demand page.

    This picture shows how demand for bikes differs between working days and holidays; the peak times are 8:00 and 17:00. You can also dig deeper, for example by comparing registered users with casual users. The visualization shows that casual users prefer to rent bikes on weekends and Mondays, while registered users prefer to rent on working days.

    In other words, the visualization above already lets you predict usage of the bike rental service. Beyond this, however, are there other factors that affect the prediction, such as weather or holidays? Once you add these factors, prediction becomes complicated, which is why we turn to machine learning.

    Preparing data to build the machine learning model

    To build a successful machine learning model, you first need to find the right data. The core idea is "Garbage In, Garbage Out" (or "Gold In, Gold Out", depending on your point of view). When identifying features, domain knowledge helps you decide whether a feature is worth training on.

    In the example above, you saw that weekdays and weekends have an important influence on the results. Weather has a similarly large influence on bike use: when it rains or a cold front hits, few people are likely to rent a bike. Historical weather information is easy to obtain, as are forecasts for the coming days. The Kaggle competition organizers prepared the following data:

    datetime - hourly date + timestamp

    season - 1 = spring, 2 = summer, 3 = fall, 4 = winter

    holiday - whether the day is considered a holiday

    workingday - whether the day is neither a weekend nor holiday

    weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy

    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

    4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

    temp - temperature in Celsius

    atemp - "feels like" temperature in Celsius

    humidity - relative humidity

    windspeed - wind speed

    casual - number of non-registered user rentals initiated

    registered - number of registered user rentals initiated

    count - number of total rentals

    The list above contains the three numbers you are most likely to want to predict: the number of casual rentals, the number of registered rentals, and the total number of rentals. Since the total is the sum of the first two, and you already know that casual and registered users behave differently, you need to build two separate models.

    You can use different tools to manipulate the columns of the CSV file, such as Microsoft Excel or the RStudio IDE, both of which are popular among data scientists. In this post, you will use cut, sed and awk to process the data.
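
    The relationship between the three targets can be checked directly on the data. The sketch below is an illustration only: the file name sample_train.csv and its three rows are assumptions in the Kaggle column layout, not the real training file. It counts the rows in which count differs from casual + registered.

```shell
# Illustrative sample in the Kaggle column order (file name and rows are assumptions).
cat > sample_train.csv <<'EOF'
datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0,8,32,40
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0,5,27,32
EOF

# Fields 10, 11 and 12 are casual, registered and count.
# Skip the header (NR > 1) and count the rows where the sum does not match.
mismatches=$(awk -F',' 'NR > 1 && $10 + $11 != $12 { n++ } END { print n + 0 }' sample_train.csv)
echo "rows where casual + registered != count: $mismatches"
```

    If the number of mismatches is zero, predicting casual and registered separately and adding the results loses no information about count.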

    The first thing to do is shuffle the lines of the training data, which removes any ordering in the data that could bias the machine learning model.

    # shuffle the lines except for the first header line
    # (gshuf is GNU shuf as installed on macOS by Homebrew coreutils; on Linux the command is shuf)
    tail -n+2 train.csv | gshuf -o BikeShareTrainData.csv

    # add the header line from the original file as the first line of the shuffled file
    head -1 train.csv | cat - BikeShareTrainData.csv > temp && mv temp BikeShareTrainData.csv

    To build the training data for the casual-rental model, you need to remove the last two columns (registered and count) from the original training file. Use cut with the delimiter "," to keep the first 10 fields in a new file for casual users:

    cat BikeShareTrainData.csv | cut -d',' -f1-10 > BikeShareCasualTrainData.csv

    Repeat the same steps for the registered-user model, removing the tenth field (casual) and keeping the eleventh (registered):

    cat BikeShareTrainData.csv | cut -d',' -f1-9,11 > BikeShareRegisteredTrainData.csv
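
    To see which columns each cut invocation keeps, the same field lists can be applied to the header line alone. This is a small sketch; the header string is copied from the field list in this post, and the variable names are illustrative.

```shell
# Header of the Kaggle training file, as listed in this post.
header="datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count"

# Fields 1-10 keep everything up to and including casual.
casual_cols=$(printf '%s\n' "$header" | cut -d',' -f1-10)

# Fields 1-9 and 11 drop casual and count, keeping registered as the last column.
registered_cols=$(printf '%s\n' "$header" | cut -d',' -f1-9,11)

printf '%s\n' "$casual_cols" "$registered_cols"
```

    In both cases the prediction target ends up as the last column and the other two rental counts are gone.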

    To train the models, you need to upload the files to Amazon S3. Create a bucket in the AWS region where you will build and train the model, and use the AWS CLI to copy the data into the bucket.

    aws s3 cp BikeShareCasualTrainData.csv s3://<BUCKET_NAME>/ML/input/ --region us-east-1

    aws s3 cp BikeShareRegisteredTrainData.csv s3://<BUCKET_NAME>/ML/input/ --region us-east-1

    Before training, make sure that you have removed the data you don't need. For example, if you leave the casual and registered fields in the training data and predict the count variable, the model becomes trivial: it simply adds the two variables and ignores the weather and the other features.

    In the Amazon ML console, point to the training data you just uploaded to Amazon S3. Then define and refine the schema for the data.

    The season variable, for example, is represented by a number (spring is 1, summer is 2), but it should be marked as categorical rather than numeric. A numeric variable holds a measurable quantity, answering "how many" or "how much". If you know that a particular number does not represent a quantity, it is better to define its data type as categorical. Next, select the target for the machine learning model to predict:

    For this model, choose the casual variable as the prediction target. The service identifies it as a number and uses numerical regression. On the next screen, select the default configuration and start the build. The build takes several minutes, depending on the size of the data. Later you may find better ways to build the model, but for beginners the simple default options are a good place to start.

    Evaluating the machine learning model

    Once the model is built, you can evaluate it; if you used the simple creation process with the default settings, the evaluation runs automatically. It is important to evaluate the model on data that was not used to train it. With the default settings, Amazon ML does this by splitting the data randomly, using 70% of the data for training and the rest for evaluation. You can also split the data according to your own requirements.
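
    If you prefer to split the data yourself, a 70/30 split can be sketched with awk. This is not how Amazon ML performs its split internally; it is a minimal illustration that assumes the input has already been shuffled, so that taking the first 70% of the rows amounts to a random split. The file names and sample rows are made up.

```shell
# Illustrative input: a header plus 10 data rows (values are made up).
{
  echo "datetime,casual"
  for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "2011-01-01 0$((i - 1)):00:00,$i"
  done
} > sample_shuffled.csv

# Number of data rows, excluding the header.
total=$(($(wc -l < sample_shuffled.csv) - 1))
# Size of the training part: 70% of the rows, rounded down.
train_rows=$((total * 70 / 100))

# The header (NR == 1) is written to both files. The next train_rows rows go
# to the training file and the remaining rows go to the evaluation file.
awk -v n="$train_rows" '
  NR == 1     { print > "sample_train_part.csv"; print > "sample_eval_part.csv"; next }
  NR <= n + 1 { print > "sample_train_part.csv"; next }
              { print > "sample_eval_part.csv" }
' sample_shuffled.csv

wc -l sample_train_part.csv sample_eval_part.csv
```

    Writing the header to both files keeps each part usable as a standalone CSV file.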

    The evaluation produces a numeric value and a chart. For numerical regression, the numeric value is the root mean square error (RMSE). The smaller the RMSE, the smaller the prediction error and the better the model fits. In this case, the RMSE of the simple averaging baseline is 49, while the RMSE of the numerical regression model is 39.

    You can also assess how much each variable (temp, windspeed, workingday and so on) contributes to predicting the target, which in this case is casual or registered rentals.

    As the figure shows, the higher the value, the more the variable helps to produce better predictions. In this case, for casual users, atemp (the "feels like" temperature) has a value of 0.32, while windspeed has an influence of 0.01. Interestingly, datetime also accounts for 0.21.

    Amazon ML can also analyze text fields, using tokens such as 01, 02 and 03 as predictors for the model.
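
    RMSE is the square root of the mean of the squared differences between predicted and actual values. The sketch below computes it with awk for a set of predictions and for a baseline that always predicts the mean. The file name, column layout and values are assumptions made for illustration; they are not Amazon ML output and do not reproduce the numbers in this post.

```shell
# Illustrative file: actual value and model prediction (values are made up).
cat > sample_predictions.csv <<'EOF'
actual,predicted
10,12
20,18
30,33
40,39
EOF

# RMSE of the predictions: square root of the mean squared error.
model_rmse=$(awk -F',' '
  NR > 1 { d = $2 - $1; sse += d * d; n++ }
  END    { printf "%.2f", sqrt(sse / n) }
' sample_predictions.csv)

# RMSE of a baseline that always predicts the mean of the actual values.
# The file is read twice: the first pass computes the mean, the second the error.
baseline_rmse=$(awk -F',' '
  FNR == 1  { next }                       # skip the header in both passes
  NR == FNR { sum += $1; n++; next }       # first pass: accumulate the mean
            { d = $1 - sum / n; sse += d * d }
  END       { printf "%.2f", sqrt(sse / n) }
' sample_predictions.csv sample_predictions.csv)

echo "model RMSE: $model_rmse, mean-baseline RMSE: $baseline_rmse"
```

    Comparing the two numbers shows how much a model improves on simple averaging for the same data.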

    Now you can decide whether to use the model as it is or to improve it by lowering the RMSE. For example, you can extract the hour of the day from datetime as a feature, although the service already handles this in a suitable way. In the same way, you can extract the day of the week or the day of the month. Each feature transformation can potentially improve model accuracy, and domain experts can identify which variables are worth adding.

    Below is a sample script that adds the day of the week as a variable and merges it into the casual-user training set:

    awk 'NR>1{system("date -j -f \"%Y-%m-%d\" " $1 " +%A")}' BikeShareTrain.csv >


    paste -d "," BikeShareTrainDoW.csv BikeShareCasualTrain.csv > BikeShareCasualDoW.csv
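
    The date -j -f form used above is specific to BSD and macOS. As a portable alternative, the hour and the day of the week can be derived inside awk. This is a sketch rather than the script from the original post: it computes the weekday arithmetically (Sakamoto's method), uses made-up sample rows and file names, and gives the header row its own column names so that rows stay aligned.

```shell
# Illustrative input reduced to two columns (file name and values are made up).
cat > sample_casual.csv <<'EOF'
datetime,casual
2011-01-01 08:00:00,3
2011-01-03 17:00:00,25
EOF

# Prepend "hour" and "dayofweek" columns to every row.
# The weekday index follows Sakamoto's method: 0 = Sunday ... 6 = Saturday.
awk -F',' '
  BEGIN {
    OFS = ","
    split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")   # per-month offsets, t[1] = January
    split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", name, " ")
  }
  NR == 1 { print "hour", "dayofweek", $0; next }
  {
    y = substr($1, 1, 4) + 0
    m = substr($1, 6, 2) + 0
    d = substr($1, 9, 2) + 0
    hour = substr($1, 12, 2)                    # kept as text, e.g. "08"
    if (m < 3) y -= 1
    dow = (y + int(y / 4) - int(y / 100) + int(y / 400) + t[m] + d) % 7
    print hour, name[dow + 1], $0
  }
' sample_casual.csv > sample_casual_features.csv

cat sample_casual_features.csv
```

    Doing the calculation inside awk also avoids starting a separate date process for every row.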

    Using the ML model for prediction

    Once you have the model you need, you can start using it for predictions. Amazon ML supports both large-scale batch predictions and real-time predictions; here, we use a batch prediction.

    When the job finishes, usually after a few minutes, download the batch prediction results. Before submitting the results, add the predicted casual and registered rentals to get the total number of bikes needed for each hour.

    You can use the following code:

    paste casual_batch.out registered_batch.out | awk '$1+$2>0 {print int($1 + $2); next} {print "0"}' > bike_share_sub_test.csv
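
    The effect of that command is easiest to see on a tiny example. The sketch below assumes that each output file has already been reduced to one predicted value per line, as the command expects; the file names and values are made up.

```shell
# Illustrative batch outputs, one predicted value per line (made-up values).
printf '%s\n' 1.6 -3.2 0.4 > casual_sample.out
printf '%s\n' 2.7 1.0 12.9 > registered_sample.out

# paste joins the files line by line (tab-separated). awk adds the two
# predictions, truncates positive sums to integers, and prints 0 otherwise,
# since a negative number of rentals makes no sense.
paste casual_sample.out registered_sample.out |
  awk '$1 + $2 > 0 { print int($1 + $2); next } { print "0" }' > total_sample.csv

cat total_sample.csv
```

    Note that int truncates rather than rounds, so a sum of 4.9 becomes 4.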

    With Amazon ML, we obtained in a few minutes results that are better than the earlier averaging approach. To learn more about Amazon ML, see the Amazon ML Developer Guide.
