Scalable Data Science in the Cloud with R and Python
Data science is becoming more and more complex. This complexity is driven by the following three factors:
1. Growing capacity to produce data: look around you. How many devices within reach can produce
data? Count one if you are reading this article on a laptop, add one if a smartphone (with apps
installed) is beside you, add one if you wear a fitness band, and add one more for the car you
drive (in some cases). All of them produce data continuously. Now imagine that within a few years
your refrigerator, your home thermostat, your clothes, the pen in your pocket, and even your water
kettle may carry embedded sensors that constantly send data to data scientists (and databases)
for analysis.
2. Low cost of data storage: take a guess. How much would it cost to store all the music the world
has ever produced? What is your answer? My estimate is that the total cost is far less than
3. Cheap computing power: look at the configuration of a recently released laptop: 64 GB of RAM,
a Xeon processor, and a Quadro GPU. The machine is expected to cost less than $2000 and to weigh
about 2.5 kg. Need I say more?
All in all, we produce data continuously (as you read this, you too become a data sample), we can store that data at very low cost, and we can run calculations and simulations on it.
Why do data science in the cloud?
So why would we want to process data in the cloud? When a laptop ships with 64 GB of RAM, do we really need to send the data to the cloud? The answer is an absolute yes, and for many reasons. Here are a few:
Need to run scalable data science: let us go back a few years. In 2010, I joined a multinational insurance company to set up a data science department. One of the tasks was to buy a server with 16 GB of RAM. Because it was a new department, we bought what we judged would be enough for the next 3 to 5 years. The setup that seemed adequate at the time had to be extended as our headcount grew. Not only did the team grow, the data volume grew exponentially. With only one physical machine, we were in trouble! We either had to buy a new, more powerful server or keep running the existing one at full capacity (where it would soon be overloaded). The last thing you want is your data scientists staring at a screen, waiting for data to be processed! A machine in the cloud can be scaled with a few mouse clicks and saves a lot of trouble. Even if the data volume grows several times over, your scripts and models still run normally.
Cost: scalability is one side of the story; cost is the other. Suppose you have a problem that comes up only occasionally but needs heavier computing infrastructure. This is very common: say you want to mine social media data for an annual event you sponsor, and you want to see the results in real time. Buying a new machine for this cannot be justified on cost. The solution is simple: rent a high-spec machine for the few hours or days you need it, solve the problem, and pay a fraction of what a new machine would cost.
Collaboration: what if several data scientists want to work on the same problem at the same time? Surely you do not want each of them to keep a copy of the data and code on a local machine.
Sharing: what if you want to share Python or R code with team members? The libraries you use may not be installed on their machines, or may be an older version. How do you make sure the code is portable across machines?
A larger ecosystem for developing machine learning systems: some cloud services, such as AWS and Azure, provide a complete ecosystem to help you collect data, build models, and deploy them. With a physical machine, you would need to set all of this up yourself.
Rapid prototyping: often a new idea pops up while you are travelling or talking with friends. In such cases cloud computing services are very convenient: you can build a prototype quickly without worrying about versions and scalability. Once the idea is validated, it can also be turned into a product easily.
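The portability question raised under Sharing can be checked mechanically. The following is a minimal Python sketch, assuming the team agrees on a list of pinned package versions. The package names and version numbers in `pinned` are hypothetical placeholders, and `installed_versions` and `find_mismatches` are illustrative helpers, not part of any library. The sketch relies only on the standard-library `importlib.metadata` module (Python 3.8 or later).

```python
from importlib import metadata


def installed_versions(packages):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(pinned, installed):
    """Return {name: (expected, found)} for packages that are missing or differ."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }


if __name__ == "__main__":
    # Hypothetical pins agreed on by the team; replace with your own list.
    pinned = {"numpy": "1.26.4", "pandas": "2.2.2"}
    problems = find_mismatches(pinned, installed_versions(pinned))
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Running the same script on every team member's machine, or on a freshly created cloud machine, shows at a glance where environments have drifted apart.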
More about the components of cloud computing can be found here.
Now that you understand why data science needs cloud computing, let us look at the different options for running R and Python in the cloud.
Options for doing data science in the cloud:
Amazon Web Services (AWS)
Amazon leads the cloud computing industry. It holds the largest market share, has thorough documentation, and offers a convenient environment that supports rapid scaling. A separate article explains how to run R or RStudio on AWS machines. If the cloud machine runs Linux, Python comes preinstalled. You can also install any additional libraries and modules you need.
You can use AWS machine learning, configure a machine yourself, or even use DataScienceToolbox directly, a toolkit that comes with all the software installed. The platform not only provides services but also offers some large datasets so that you can experiment with big data.
Because AWS is one of the most popular choices, it has built a mature ecosystem, and it is easier (than with other providers) to find people and resources with the right experience. However, Amazon's services are usually more expensive than the alternatives, and some services are not offered at all. In addition, for some reason its machine learning service is not available in the Asia-Pacific region. So if you happen to be located there, you need to choose a North American server or configure a virtual machine in the cloud yourself.
Azure Machine Learning
If AWS is the champion, Azure is the title challenger. Microsoft has clearly stepped up its efforts to provide an end-to-end interface for the data science and machine learning workflow. You can use its studio to build machine learning workflows, use Jupyter notebooks in the cloud, or call the ML APIs directly.
Microsoft also offers a free e-book and an introductory MOOC through its Virtual Academy to help you get started.
If Amazon and Microsoft grew organically in cloud computing, IBM's approach is slightly different. IBM developed the BlueMix platform and then began marketing its services aggressively. Although the offering is not as straightforward as AWS and Azure, it can still be used to set up a notebook in the cloud.
It will also be interesting to see how the data science community uses the APIs provided by Watson.
If everything described so far feels too complicated, you should know about Sense. Projects can be deployed with the click of a button. It offers services based on R, Python, Spark, Julia, and Impala, with flexible collaboration among team members and sharing of analysis results. Watch this video for a first look at the product: https://www.youtube.com/embed/n3RwCr9l4G8
Domino, based in San Francisco, provides a secure cloud computing environment that supports development in R, Python, Julia, and Matlab. The platform offers version control and other features that make collaboration and sharing within a team easy.
DataJoy currently looks like a stripped-down version of Sense and DominoDataLab, but it will be interesting to see how it develops. For now, if you want to run R or Python in the cloud, DataJoy is worth a try.
If you are developing web applications and need to build a website that includes data science modules, PythonAnywhere looks like a perfect choice. As the name suggests, it is built around Python, and it offers a single place for hosting websites, working with data, and running scientific analysis.
Challenges of doing data science in the cloud:
Although cloud computing has unique advantages, it also faces many challenges. I do not think these challenges will stop the growth of cloud services in the long run, but they do create obstacles at times.
Concerns about sharing data with a third party: this is a challenge I face constantly. No matter how hard you try to explain the security of the cloud, sharing data externally always worries people. Usually this stems from regulatory or legal requirements, but sometimes the reasons are quite irrational. For example, many banks are reluctant to upload their data to the cloud for analysis.
Need to upload/download large amounts of data: when a data center stores large amounts of data, uploading it all at once is a major challenge if the network infrastructure is not robust.
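To get a feel for the upload challenge, a rough back-of-the-envelope estimate helps. The sketch below is illustrative only: the function name `transfer_time_hours`, the `efficiency` factor, and the example figures are assumptions made for this illustration, not measurements from any provider.

```python
def transfer_time_hours(data_gb, bandwidth_mbps, efficiency=0.8):
    """Estimate the hours needed to move data_gb gigabytes over a link.

    bandwidth_mbps is the nominal link speed in megabits per second.
    efficiency is an assumed fraction of that speed actually achieved.
    Decimal units are used: 1 GB = 1000 MB = 8000 megabits.
    """
    if data_gb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    megabits = data_gb * 8 * 1000
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical scenario: 2 TB of data over a 100 Mbps office connection.
    hours = transfer_time_hours(2000, 100)
    print(f"Estimated upload time: {hours:.1f} hours")
```

Even under these optimistic assumptions the one-time upload takes days rather than minutes, which is why a weak network can become the bottleneck.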