What is Spark?

Apache Spark™ is a unified analytics engine for large-scale data processing.

Speed

Run workloads 100x faster.

Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

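Spark is a fast compute engine that works largely in memory; it can run some workloads more than 100 times faster than Hadoop MapReduce.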

Ease of Use

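# Assumes an existing SparkSession named `spark`, as in the pyspark shell.
df = spark.read.json("logs.json")
# Keep rows with age > 21, then show the nested name.first field.
df.where("age > 21").select("name.first").show()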

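Applications can be written in Java, Scala, Python, or R. Spark also provides more than 80 high-level operators, which make it easy to build distributed applications.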
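
To make clear what the DataFrame query above computes, here is a rough plain-Python equivalent. It does not use Spark at all: the three records are hypothetical stand-ins for lines of logs.json, and `first_names_over` is an illustrative helper name, not a Spark API. Spark would run the same filter and projection in parallel across a cluster.

```python
# Plain-Python sketch of the query's semantics; no Spark involved.
# The records below are hypothetical stand-ins for lines of logs.json.
import json

raw_lines = [
    '{"name": {"first": "Ada", "last": "Lovelace"}, "age": 36}',
    '{"name": {"first": "Tim", "last": "Young"}, "age": 19}',
    '{"name": {"first": "Grace", "last": "Hopper"}, "age": 85}',
]

def first_names_over(lines, min_age):
    """Return name.first for every record whose age is greater than min_age."""
    records = [json.loads(line) for line in lines]        # like spark.read.json
    adults = [r for r in records if r["age"] > min_age]   # like where("age > 21")
    return [r["name"]["first"] for r in adults]           # like select("name.first")

result = first_names_over(raw_lines, 21)
print(result)  # ['Ada', 'Grace']
```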

Generality

Spark supports Spark SQL, streaming, and complex analytics in one engine.

Run Everywhere

You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

Spark can run on YARN, Mesos, Kubernetes, a standalone cluster, or cloud services, and it can access a wide range of data sources.
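
In practice, the choice of cluster manager mostly shows up as the `--master` URL passed to spark-submit. The sketch below lists the usual URL formats and builds a command line from them. The host names, ports, and app.py are hypothetical placeholders, and `spark_submit_command` is an illustrative helper, not part of Spark.

```python
# Sketch: master URL formats per cluster manager, and a spark-submit
# command built from them. Hosts, ports, and app.py are hypothetical.

MASTER_URLS = {
    "local": "local[*]",                                 # all cores on one machine
    "standalone": "spark://master.example.com:7077",     # Spark standalone cluster
    "yarn": "yarn",                                      # cluster found via Hadoop config
    "mesos": "mesos://mesos.example.com:5050",           # deprecated in recent Spark releases
    "kubernetes": "k8s://https://k8s.example.com:6443",  # Kubernetes API server
}

def spark_submit_command(cluster, app="app.py"):
    """Build the spark-submit argument list for the chosen cluster manager."""
    if cluster not in MASTER_URLS:
        raise ValueError(f"unknown cluster manager: {cluster}")
    return ["spark-submit", "--master", MASTER_URLS[cluster], app]

command = spark_submit_command("yarn")
print(" ".join(command))  # spark-submit --master yarn app.py
```

The application code itself stays the same across these targets; only the master URL (and any cluster-specific configuration) changes.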