China Faculty Summit
Maggie Johnson, Director of Education and University Relations
Google's Faculty Summits
• A relatively small gathering
• Rotating faculty invitations
• Goals
  o Explain
  o Interact
  o Learn
  o Have fun
China Faculty Summit Themes
• Commerce
• Mobile
• Education and Curriculum Development
Agenda: Day one
Keynote 1: Google Research Overview
Keynote 2: Google China R&D
Keynote 3: Mobile 2014
Commerce Track
Talk 1: "Large Scale Data Mining in Product Search", Si Li, Senior Software Engineer, Google China
Talk 2: "Online Shopping & Social Media", Yonggang Wang, Staff Software Engineer, Google China
Round Table
Mobile Track
Talk 1: "Mobile Peer-to-Peer Streaming", Prof. Gary Chan, Hong Kong University of Science and Technology
Talk 2: "Location Determination for Mobile Devices", Dr. Tsuwei Chen, Senior Software Engineer, Google
Round Table
Agenda: Day Two
Education and Curriculum Development
• Google University Relations: a Global Perspective
• Google University Relations in China
• Collaboration Example 1: Tsinghua University, Cloud Computing
• Collaboration Example 2: Zhejiang University, Android
• Collaboration Example 3: Sun Yat-Sen University, Web technologies
Innovation and Research at Google
Organizing the world's information and making it universally accessible and useful.
Google Scale in Commerce
• Over 1 million AdWords advertisers worldwide
• Over 1 million AdSense publishers worldwide host ads
• Via the Google Ad Network, AdSense publishers reach over 80% of global internet users in 100 countries and 20 languages
• YouTube is monetizing over a billion video views per week globally
• In 2009, Google generated $54 billion of economic activity for businesses, website publishers, and nonprofits, in the US alone. Similar benefits elsewhere.
Google Scale in Hardware and Storage
• Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18, Zetta 10^21
• Publicized: Bigtable of 70 petabytes, 10M ops/sec.
• Some representative numbers:
  o Storage: 10^18 -> 10^20 to 10^21
  o Users: 10^9 -> 10^10
  o Network: 10^20 now -> 10^21/yr (32 KB/sec. for 1B people)
• Warehouse computing possibilities
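The network figure can be checked with a quick back-of-envelope calculation. The sketch below assumes the slide's figure is in bytes, that a kilobyte is 1000 bytes, and that every one of the 1B people sustains 32 KB/sec. all year; none of these assumptions is stated on the slide.

```python
# Back-of-envelope check: 32 KB/sec. for 1B people, summed over one year.
BYTES_PER_KB = 1000                    # assumption: decimal kilobytes
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def yearly_traffic_bytes(rate_kb_per_sec, users):
    """Total bytes moved in one year if every user sustains the given rate."""
    return rate_kb_per_sec * BYTES_PER_KB * users * SECONDS_PER_YEAR


total = yearly_traffic_bytes(32, 1_000_000_000)
print(f"{total:.2e} bytes/year")  # prints "1.01e+21 bytes/year"
```

Under those assumptions the result lands almost exactly on 10^21 (one zettabyte) per year.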
A variety of science and engineering challenges
Focus on Innovation to Benefit Our Users
• Commitment to advancing technology
• Rich domain of work due to our mission
• Internal consensus that production issues are often as challenging/fun as pure invention
• Technical leverage
  1. Google Common Distributed System
  2. A Focus on Services
  3. Empiricism and a Holistic Approach to Design
Our Innovation Culture
• We work very hard to have a culture of innovation
• Managers try hard to say "yes" and empower teams
• 20% time
• Focus on research, distributed across the organization
  o Impacting Google necessitates broad, diverse involvement in science and engineering
  o Research is done both in our research team and in our engineering organization, organized opportunistically
  o Research ideas can immediately influence products, and product experience can motivate and shape the research agenda
Research Challenges in Distributed Computing
• Alternative designs that would give better energy efficiency at lower utilization
• Server O.S. design aimed at many highly-connected machines in one building
• Unifying abstractions for exploiting parallelism beyond intertransaction parallelism and map-reduce
• Latency reduction
• A general model of replication
• Machine learning techniques applied to monitoring/controlling such systems
• Automatic, dynamic world-wide placement of data & computation to minimize latency and/or cost, given constraints on
• Building retrieval systems that efficiently and usably deal with ACLs
• Holistic models of privacy
• The user interface to the user's diverse processing and state
Totally Transparent Processing
For all d in D, all l in L, all m in M, and all c in C
D: The set of all end-user access devices
  Personal Computers, Phone, Media Players/Readers, Telematics, Set-top Boxes, Appliances, Health devices, ...
L: The set of all human languages
  Current languages, Historical languages, Other forms of human notation, Possible language specialization, Formal languages, ...
M: The set of all modalities
  Text, Image, Audio, Video, Graphics, Other sensor-based data, ...
C: The set of all corpora
  The normal web, The deep web, Periodicals, Books, Catalogs, Blogs, Geodata, Scientific datasets, Health data, ...
Selected Research Areas
• Machine Translation
• Speech
• Structured Data
• Vision
• Operations Research
• Digital Humanities
Machine Translation @ Google
Statistical Machine Translation
• Model translation process with an automated statistical model
• Learning from data: monolingual & bilingual
  o More data: better translation quality
• Computationally expensive approach
  o Models have many hundreds of gigabytes of data
Results:
• Much better translation quality
• Ongoing progress
  o Constant feedback and improvement
• 58 languages (so far)
  o Recently: Haitian Creole, Urdu, Georgian, ..., Latin
Chrome/Toolbar (websites). YouTube (CLIR, captions, snippets). Reader (feeds). GMail, Docs, Spreadsheets, and more
Speech Technology at Google
Much of the world's information is spoken - we need to recognize it before we can organize it:
• YouTube transcription and translation (breaking the language barrier for YouTube access)
• Voicemail transcription
Mobile is the fastest growing and most widespread platform for communication and services that has ever existed
• Spoken input and output are key to usability
• Our goal is completely ubiquitous availability of speech I/O (every application/service, every usage scenario, every language)
How do we get there?
• Delivery from the cloud - support constant iteration and refinement
• Operating at large scale - train huge statistical models on huge amounts of data
Structured Data on the Web
Discovery and search for structured data:
• The deep Web -- significant gap in coverage
• Structured tables on the Web -- not leveraged in search
Enable easy creation, management, sharing and publishing of structured data:
• Fusion Tables: www.google.com/fusiontables
Google Fusion Tables
host, manage, collaborate on, visualize, and publish data tables online
What can I do with Fusion Tables?
• Host data online - and stay in control
  o Control can be at the level of columns or rows
• Re-use data without making copies
• Collaborate on the details
  o Merge data from multiple tables
  o Comment on individual rows, columns or cells
• Make a map (or chart or timeline) in minutes!
• Manage data via our site or an API
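To make "merge data from multiple tables" concrete, here is a minimal sketch of an inner join on a shared key column. It is plain Python, not the Fusion Tables API; the function name `merge`, the column names, and the station values are all hypothetical and chosen only for illustration.

```python
def merge(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    # Index the right-hand table by key so each left row is matched quickly.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        for match in index.get(row[key], []):
            # Combine columns; on a name clash the right-hand value wins.
            merged.append({**row, **match})
    return merged


# Made-up example tables sharing a "station" column.
locations = [
    {"station": "S1", "lat": 10.0, "lng": 20.0},
    {"station": "S2", "lat": 11.5, "lng": 21.5},
]
readings = [
    {"station": "S1", "level_cm": 42},
    {"station": "S3", "level_cm": 17},
]

print(merge(locations, readings, "station"))
```

Only station S1 appears in both tables, so the merged result has a single row carrying the columns of both.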
Fusion Table Example Gallery
Easy Data Upload, Attribution recorded
Easily Create Informative Maps
Baby steps towards the dream platform
• DEMO: Circle of Blue
Discussions, Data Integration
Advance the state of the art in 3 key areas of image/audio/video analysis and apply the results to our multimedia products.
  o Semantic Interpretation: Generate human-understandable descriptions of content (e.g. auto-tagging videos on YouTube, image annotation).
  o Matching: Find similar entities in a large corpus (e.g. "find similar" on image search).
  o Synthesis: Generate better images/video by understanding the statistics of a large corpus of images (e.g. better facades on 3D buildings in Google Earth, automatic shadow removal from aerial images).
Semantic Interpretation sample problem - Video Annotation
• Video metadata has a cognitive cost on the user because they have to type it in, be careful about what keywords they use, and in general try to make their video searchable
• Many uploaders don't have the motivation or energy to provide proper metadata
• Noisy metadata hurts everyone - spam, misspellings, acronyms, etc.
Operations Research Challenges
• Size: Optimization is often NP-complete
  o The tools are barely keeping up with the problems.
• Uncertainty: Data is often fuzzy.
  o How do you route cars when there are roadblocks, new one-ways, traffic jams?
  o How well can you optimize against forecasted data, and how do you react if the forecast is bad?
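The roadblock question can be illustrated with a deliberately small sketch: re-solve a shortest-path problem after an edge becomes unusable. This is ordinary Dijkstra on a toy graph, not the NP-complete routing problems the slide refers to, and the road network, its weights, and the `blocked` parameter are invented for the example.

```python
import heapq


def shortest_path(graph, start, goal, blocked=frozenset()):
    """Dijkstra over a dict-of-dicts graph, skipping blocked (u, v) edges."""
    queue = [(0, start, [start])]
    best = {start: 0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for nxt, weight in graph.get(node, {}).items():
            if (node, nxt) in blocked:
                continue  # roadblock: treat the edge as absent
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return None  # goal unreachable


# Hypothetical one-way road network with travel times as weights.
roads = {
    "A": {"B": 4, "C": 2},
    "B": {"D": 5},
    "C": {"B": 1, "D": 8},
    "D": {},
}

print(shortest_path(roads, "A", "D"))                        # (8, ['A', 'C', 'B', 'D'])
print(shortest_path(roads, "A", "D", blocked={("C", "B")}))  # (9, ['A', 'B', 'D'])
```

Blocking one road changes both the route and its cost, which is the small-scale version of the uncertainty problem: the plan is only as good as the data it was computed from.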
Operations Research Opportunities
• Machine Learning can help us in three ways:
  o By providing guidance towards good solutions.
  o By qualifying valid solutions.
  o By reducing the search space.
• Large computing resources mean we can try a bit harder.
• Crowd-sourcing means better data, better feedback, better evaluations of algorithms and solutions.
• Having all our code open-source means we can collaborate on building the best set of tools.
  o See http://code.google.com/p/or-tools
An Application: Earth Engine
Initial Earth Engine Motivation: Forest Carbon Tracking
UNEP: "Atlas of our Changing Environment"
Original image ... is divided into 256px sub-units.
Sub-units are distributed ... to separate machines ... where they can be processed in parallel.
Thousands can be processed simultaneously
Result is reassembled ... into a finished image
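The split, distribute, process, reassemble flow can be sketched on a single machine. This is an illustration under stated assumptions, not Earth Engine code: the image is a plain list of pixel rows, threads stand in for separate machines, and `process` (an 8-bit inversion) is a hypothetical placeholder for a real per-pixel algorithm.

```python
from concurrent.futures import ThreadPoolExecutor

TILE = 256  # sub-unit edge length in pixels, as on the slide


def split(image, tile=TILE):
    """Yield (row, col, sub_image) for each tile; edge tiles may be smaller."""
    height, width = len(image), len(image[0])
    for r in range(0, height, tile):
        for c in range(0, width, tile):
            yield r, c, [row[c:c + tile] for row in image[r:r + tile]]


def process(tile_pixels):
    """Hypothetical per-pixel operation: invert an 8-bit grayscale value."""
    return [[255 - p for p in row] for row in tile_pixels]


def run(image, tile=TILE, workers=4):
    """Split the image, process tiles concurrently, and reassemble the result."""
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    tiles = list(split(image, tile))
    # Threads stand in for the separate machines of a real deployment.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        processed = pool.map(process, (t for _, _, t in tiles))
        for (r, c, _), out in zip(tiles, processed):
            # Write each processed tile back at its original offset.
            for dr, row in enumerate(out):
                result[r + dr][c:c + len(row)] = row
    return result


# A 300 x 500 test image that does not divide evenly into 256px tiles.
image = [[(x + y) % 256 for x in range(500)] for y in range(300)]
inverted = run(image)
```

Because each tile depends only on its own pixels, the tiles can be handled in any order and on any worker, which is what makes the approach scale to thousands of sub-units at once.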
Global-scale earth observation and informatics platform
• For public benefit, and to support emerging green economy
• Help science come out of research lab and into operational use, at scale
• Unprecedented catalog of earth observation data for mining and analysis
• Promote transparency, reproducibility, collaboration, "open science"
• Intrinsically-parallel pixel processing system
• Built-in Google algorithms as well as user-supplied
• Earth Engine API for 3rd party algorithm development
• Access control, versioning, provenance
• Online and desktop versions (open source desktop version)
• Every available Landsat and MODIS scene (more satellites coming)
• Commercial datasets (very high resolution satellite imagery)
• Environmental data (atmospheric, ocean, terrestrial)
• User-supplied (ex: in-situ data collected via Android phones)
Very fast computation of scientific map products
On a lot of useful data
Digital Humanities and Education
Illuminating the Humanities
Q: What can you do with:
• 12 million books in
• over 400 languages
• comprised of 5 billion pages and 2 trillion words
• ... all digitized?
A: Look to the humanities for new questions...
For example, what are the differences between early and later editions of:
红楼梦 (Dream of the Red Chamber), published in 1784
1. Early versions are Rouge versions (脂本) with 80 chapters, "The Story of the Stone (石头记)" (10+ editions)
2. Around 30 years after the first edition, the book was amended with another 40 chapters, bringing the novel to 120 chapters; these are called Cheng-Gao versions (程高本).
Digital Humanities Awards
Research program supporting university research taking a computational approach to traditional humanist questions.
US program, Summer 2010
• 12 projects
• 23 researchers
• 15 universities
European program, Winter 2010
• 15 projects planned
Seeding and supporting computing curriculum development
o Exploring computational thinking in K12 (google.com/edu/ect)
o CS4HS: High school computer science (cs4hs.com)
o Undergraduate open source CS curriculum: Google Code University (code.google.com/edu)
Supporting our Academic Institutions
o Research Awards Programs - 230+ projects funded in the last year
  Next due date Feb 1, 2011
  Researchemail@example.com
o Focused Grant Program
  Mobile 2014
o Visiting Faculty Program - 20 faculty (ongoing)
  Universityfirstname.lastname@example.org
o Ph.D. Fellowship Program
  2009: 13 students supported in US
  2010: 15 in US, 15 in EMEA, 2 in China (and more to come)
o Over 150 other scholarships, most in China
o ~1000 interns worldwide
o Scale of communication and computing is profound
o Endless opportunity for technical growth
o Rapid innovation in science/technology and value to consumers
o We are providing increased support for academic institutions in computer science and related areas
It's a most exciting area in which to innovate
Thank you! 谢谢