
Karau H., Konwinski A., Wendell P., Zaharia M. Learning Spark

  • File format: pdf
  • Size: 7.82 MB
O'Reilly Media, 2015. — 274 p. — e-ISBN: 978-1-4493-5904-1, ISBN 10: 1-4493-5904-3.
Data in all domains is getting bigger. How can you work with it efficiently? This book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala.
Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You’ll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.
  • Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell
  • Leverage Spark’s powerful built-in libraries, including Spark SQL, Spark Streaming, and MLlib
  • Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm
  • Learn how to deploy interactive, batch, and streaming applications
  • Connect to data sources including HDFS, Hive, JSON, and S3
  • Master advanced topics like data partitioning and shared variables
INTRODUCTION TO DATA ANALYSIS WITH SPARK
What Is Apache Spark?
A Unified Stack
Who Uses Spark, and for What?
A Brief History of Spark
Spark Versions and Releases
Storage Layers for Spark
DOWNLOADING SPARK AND GETTING STARTED
Downloading Spark
Introduction to Spark’s Python and Scala Shells
Introduction to Core Spark Concepts
Standalone Applications
Conclusion
PROGRAMMING WITH RDDS
RDD Basics
Creating RDDs
RDD Operations
Passing Functions to Spark
Common Transformations and Actions
Persistence (Caching)
Conclusion
WORKING WITH KEY/VALUE PAIRS
Motivation
Creating Pair RDDs
Transformations on Pair RDDs
Actions Available on Pair RDDs
Data Partitioning (Advanced)
Conclusion
LOADING AND SAVING YOUR DATA
Motivation
File Formats
Filesystems
Structured Data with Spark SQL
Databases
Conclusion
ADVANCED SPARK PROGRAMMING
Introduction
Accumulators
Broadcast Variables
Working on a Per-Partition Basis
Piping to External Programs
Numeric RDD Operations
Conclusion
RUNNING ON A CLUSTER
Introduction
Spark Runtime Architecture
Deploying Applications with spark-submit
Packaging Your Code and Dependencies
Scheduling Within and Between Spark Applications
Cluster Managers
Which Cluster Manager to Use?
Conclusion
TUNING AND DEBUGGING SPARK
Configuring Spark with SparkConf
Components of Execution: Jobs, Tasks, and Stages
Finding Information
Key Performance Considerations
Conclusion
SPARK SQL
Linking with Spark SQL
Using Spark SQL in Applications
Loading and Saving Data
JDBC/ODBC Server
User-Defined Functions
Spark SQL Performance
Conclusion
SPARK STREAMING
A Simple Example
Architecture and Abstraction
Transformations
Output Operations
Input Sources
24/7 Operation
Streaming UI
Performance Considerations
Conclusion
MACHINE LEARNING WITH MLLIB
Overview
System Requirements
Machine Learning Basics
Data Types
Algorithms
Tips and Performance Considerations
Pipeline API
Conclusion