Course Overview

Spark has become the most popular, and arguably the most important, distributed data processing framework in the Hadoop ecosystem. It is particularly well suited to machine learning and interactive data workloads, and can deliver an order of magnitude better performance than traditional Hadoop data processing tools. In this course, we will provide a deep dive into Spark as a framework: understanding its design, how to utilize that design optimally, and how to develop effective machine learning applications with Spark on HDInsight.

Key Learning Areas

The course covers the fundamentals of Spark: its core APIs and design, relational data processing with Spark SQL, and the essentials of Spark job execution, performance tuning, tracking, and debugging. Students will get hands-on experience processing streaming data with Spark Streaming and training machine learning models with Spark ML and R Server on Spark. The course also addresses HDInsight configuration and platform-specific considerations, such as remote development and access with Livy and IntelliJ, securing Spark, multi-user notebooks with Zeppelin, and virtual networking with other HDInsight clusters.

Course Outline

Day One - Spark on HDInsight Overview

  • Spark Clusters on HDInsight
  • Developer Tools and Remote Debugging with IntelliJ IDEA
  • Submitting Spark Jobs Remotely Using Livy
  • Spark Fundamentals - Functional Programming, Scala and the Collections API
  • Cluster Architecture
  • RDDs - Parallel, Distributed Memory Data Structures
  • Spark SQL/DataFrames - Relational Data Processing with Spark
  • Sharing Metastore and Storage Accounts with Hadoop/Hive Clusters and Spark Clusters
  • DataFrames API - Collection of Rows with a Consistent Schema
  • Integrated APIs for Mixing Relational, Graph, and ML Jobs
  • Exploring Relational Data with Spark SQL
  • Catalyst Query Optimization
  • Optimizing Joins in Spark SQL
  • Broadcast Joins versus Merge Joins
  • Creating Custom UDFs for Spark SQL
  • Caching Spark DataFrames, Saving to Parquet
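The Spark SQL join and caching topics above can be sketched in a few lines of Scala. This is a minimal, illustrative example, not course material: the input paths and the `countryCode` join key are hypothetical, and it assumes access to a Spark cluster (or local mode).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Sketch: broadcast join plus caching and Parquet output.
// Paths and column names are illustrative assumptions.
object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastJoinSketch")
      .getOrCreate()

    val sales     = spark.read.parquet("/data/sales")     // large fact table
    val countries = spark.read.parquet("/data/countries") // small dimension table

    // Hint Catalyst to ship the small table to every executor,
    // avoiding a shuffle (merge) join on the large table.
    val joined = sales.join(broadcast(countries), Seq("countryCode"))

    joined.cache()                       // keep the result in executor memory
    joined.write.parquet("/data/joined") // persist to columnar Parquet

    spark.stop()
  }
}
```

The `broadcast` hint is what distinguishes the two join strategies in the outline: without it, Catalyst may fall back to a shuffle-based merge join when the dimension table exceeds the broadcast size threshold.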

Day Two - Spark Job Execution, Performance Tuning, Tracking and Debugging

  • Jobs, Stages, and Tasks
  • Spark Contexts, Applications, the Driver Program and Spark Executors
  • Partitions and Shuffles
  • Understanding Data Locality
  • Monitoring Spark Jobs with the Spark WebUI
  • Managing Spark Thrift Servers and Changing YARN Resource Allocations
  • Managing Interactive Livy Sessions and their Resources
  • Viewing Spark Job Graphs, and Understanding Spark Stages
  • Spark Streaming
  • Creating Spark Streaming Applications Using the DStream API
  • DStreams, Stateful, and Stateless Streams
  • Comparison of DStreams and RDDs
  • Transformers for DStreams
  • Persisting Long Term Data in HBase, Hive or SQL
  • Creating Spark Structured Streams
  • Using the DataFrame and Dataset APIs to Create Streaming DataFrames and Datasets
  • Window Transformations for Stateful and Stateless Operations
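The Structured Streaming and windowing topics above can be sketched as follows. This is an illustrative example rather than course code: it assumes a text socket source on `localhost:9999` and a running Spark cluster or local mode.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split, window}

// Sketch: a stateful windowed word count with Structured Streaming.
// The socket source and host/port are illustrative assumptions.
object WindowedCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WindowedCountSketch")
      .getOrCreate()

    // Each line arriving on the socket becomes a row in a streaming DataFrame;
    // includeTimestamp attaches an event-time column for windowing.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load()

    val words = lines
      .select(explode(split(col("value"), " ")).as("word"), col("timestamp"))

    // Stateful operation: count words per 1-minute window, sliding every 30s.
    val counts = words
      .groupBy(window(col("timestamp"), "1 minute", "30 seconds"), col("word"))
      .count()

    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

The same query expressed over a DStream would require manual state management with `updateStateByKey` or `mapWithState`; Structured Streaming handles the windowed state for you.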

Day Three - Spark Machine Learning and Graph Analytics

  • MLlib and Spark ML - Understanding API Patterns
  • Featurizing DataFrames using Transformers
  • Developing Machine Learning Pipelines with Spark ML
  • Cross-Validation and Hyperparameter Tuning
  • Training ML Models on Text Data: Tokenization, TF/IDF, and Topic Modeling with LDA
  • Using Evaluators to Evaluate Machine Learning Models
  • Unsupervised Learning and Clustering
  • Managing Models with ModelDB
  • Understanding Graph Analytics and Graph Operators
  • Vertex and Edge Classes
  • Mapping Operations
  • Measuring Connectedness
  • Training Graph Algorithms with GraphX
  • Performance and Monitoring
  • Reducing Memory Allocation with Serialization
  • Checkpointing
  • Visualizing Networks with SparkR, d3, and Jupyter
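The Spark ML topics above (featurizing with Transformers, pipelines, cross-validation, and TF/IDF on text) follow a single API pattern that can be sketched briefly. The training data path and column names below are illustrative assumptions, and the example presumes a Spark runtime.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Sketch of the Spark ML pattern: Transformers featurize a DataFrame,
// an Estimator fits a model, and a CrossValidator tunes hyperparameters.
// The input path and the "text"/"label" columns are illustrative.
object TextPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TextPipelineSketch")
      .getOrCreate()

    val training = spark.read.parquet("/data/labeled_text") // columns: text, label

    // Featurization: tokenize, then TF/IDF.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf  = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val lr  = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, tf, idf, lr))

    // Grid-search regularization strength with 3-fold cross-validation,
    // scored by a binary classification Evaluator.
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    val model = cv.fit(training)
    spark.stop()
  }
}
```

The fitted `CrossValidatorModel` is itself a Transformer, so the winning pipeline can be applied directly to new DataFrames with `model.transform(...)`.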

Who Benefits

Data Scientists interested in Spark

Prerequisites

Hadoop - Administration, Configuration, and Security