Description
In this course, you will:
- Understand how MapReduce can be used to analyze big data sets
- Write your own MapReduce jobs using Python and MRJob
- Run MapReduce jobs on Hadoop clusters using Amazon Elastic MapReduce
- Chain MapReduce jobs together to analyze more complex problems
- Analyze social network data using MapReduce
- Analyze movie ratings data using MapReduce and produce movie recommendations from it
- Understand other Hadoop-based technologies, including Hive, Pig, and Spark
- Understand what Hadoop is for, and how it works
Syllabus:
1. Understanding MapReduce
- MapReduce Basic Concepts
- A Quick Note on File Names
- Walkthrough of Rating Histogram Code
- Understanding How MapReduce Scales / Distributed Computing
- Average Friends by Age Example: Part 1
- Average Friends by Age Example: Part 2
- Minimum Temperature by Location Example
- Maximum Temperature by Location Example
- Word Frequency in a Book Example
- Making the Word Frequency Mapper Better with Regular Expressions
- Sorting the Word Frequency Results Using Multi-Stage MapReduce Jobs
- Activity: Design a Mapper and Reducer for Total Spent by Customer
- Activity: Write Code for Total Spent by Customer
- Compare Your Code to Mine. Activity: Sort Results by Amount Spent
- Compare Your Code to Mine for Sorted Results
- Combiners
2. Advanced MapReduce Examples
- Including Ancillary Lookup Data in the Example
- Example: Most Popular Superhero, Part 1
- Example: Most Popular Superhero, Part 2
- Example: Degrees of Separation: Concepts
- Degrees of Separation: Preprocessing the Data
- Degrees of Separation: Code Walkthrough
- Degrees of Separation: Running and Analyzing the Results
- Example: Similar Movies Based on Ratings: Concepts
- Similar Movies: Code Walkthrough
- Similar Movies: Running and Analyzing the Results
- Learning Activity: Improving our Movie Similarities MapReduce Job
3. Using Hadoop and Elastic MapReduce
- Fundamental Concepts of Hadoop
- The Hadoop Distributed File System (HDFS)
- Apache YARN
- Hadoop Streaming: How Hadoop Runs your Python Code
- Setting Up Your Amazon Elastic MapReduce Account
- Linking Your EMR Account with MRJob
- Exercise: Run Movie Recommendations on Elastic MapReduce
- Analyze the Results of Your EMR Job
4. Advanced Hadoop and EMR
- Distributed Computing Fundamentals
- Activity: Running Movie Similarities on Four Machines
- Analyzing the Results of the 4-Machine Job
- Troubleshooting Hadoop Jobs with EMR and MRJob, Part 1
- Troubleshooting Hadoop Jobs, Part 2
- ml-1m Dataset: Alternate Download Link
- Analyzing One Million Movie Ratings Across 16 Machines
5. Other Hadoop Technologies
- Introducing Apache Hive
- Introducing Apache Pig
- Apache Spark: Concepts
- Spark Example: Part 1
- Spark Example: Part 2