1. Getting to Know Flink

What is Flink?

Flink is one of Apache's top-level projects, and its official website is easy to find: http://flink.apache.org.

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

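In other words, according to the official site, Apache Flink is a distributed processing engine for stateful computation over both bounded and unbounded data streams. It runs in all common cluster environments and computes at in-memory speed and at any scale.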

Basic Flink Concepts

1. Processing Unbounded and Bounded Data

Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, or user interactions on a website or mobile application, all of these data are generated as a stream.

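According to the official site, any kind of data can be viewed as a stream of events. (For example, credit card transactions, sensor measurements, machine logs, and user interactions on a website or mobile application can all be treated as streams.)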

Unbounded data

Unbounded streams have a start but no defined end. They do not terminate and provide data as it is generated. Unbounded streams must be continuously processed, i.e., events must be promptly handled after they have been ingested. It is not possible to wait for all input data to arrive because the input is unbounded and will not be complete at any point in time. Processing unbounded data often requires that events are ingested in a specific order, such as the order in which events occurred, to be able to reason about result completeness.

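In other words, unbounded data is produced continuously and has no defined end, so Flink processes it continuously as it arrives. Because the input is never complete, events often have to be ingested in a specific order, such as the order in which they occurred, before the completeness of a result can be reasoned about.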

Bounded data

Bounded streams have a defined start and end. Bounded streams can be processed by ingesting all data before performing any computations. Ordered ingestion is not required to process bounded streams because a bounded data set can always be sorted. Processing of bounded streams is also known as batch processing.

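A bounded data set has a defined start and end, so all of it can be ingested before any computation begins. Ordered ingestion is not required, because a bounded data set can always be sorted. Processing bounded data is also known as batch processing.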
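The difference between the two processing styles can be sketched in plain Java. This is only an illustration and does not use Flink's API: the `Event` type, `runningTotalsBatch`, and `RunningTotal` are hypothetical names made up for this example. The batch method sorts the complete input before computing, while the streaming-style operator keeps state and updates its result on each event as it arrives.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BoundedVsUnbounded {

    /** Hypothetical event type: an event-time timestamp plus a value. */
    static final class Event {
        final long timestamp;
        final int value;

        Event(long timestamp, int value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Bounded (batch) style: the whole data set is available up front,
     * so it can be sorted by timestamp before anything is computed.
     */
    static List<Integer> runningTotalsBatch(List<Event> allEvents) {
        List<Event> sorted = new ArrayList<>(allEvents);
        sorted.sort(Comparator.comparingLong((Event e) -> e.timestamp));
        List<Integer> totals = new ArrayList<>();
        int sum = 0;
        for (Event e : sorted) {
            sum += e.value;
            totals.add(sum);
        }
        return totals;
    }

    /**
     * Unbounded (streaming) style: there is never a complete input,
     * so the operator keeps state and emits an updated result per event.
     */
    static final class RunningTotal {
        private int sum = 0;

        int onEvent(Event e) {
            sum += e.value;
            return sum;
        }
    }

    public static void main(String[] args) {
        // Events arrive out of timestamp order.
        List<Event> input = List.of(new Event(3, 30), new Event(1, 10), new Event(2, 20));

        System.out.println("batch:  " + runningTotalsBatch(input));

        RunningTotal streaming = new RunningTotal();
        for (Event e : input) {
            System.out.println("stream: " + streaming.onEvent(e));
        }
    }
}
```

Both styles end at the same final total, but the intermediate results differ because the streaming operator sees events in arrival order. That is why the order of ingestion matters when reasoning about unbounded data.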

2. Deploy Applications Anywhere

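Flink runs on many common resource managers, such as Hadoop YARN, Apache Mesos, and Kubernetes, and it can also be deployed as a standalone cluster. When an application is deployed, Flink automatically identifies the resources it needs based on the application's configured parallelism and requests them from the resource manager. If a failure occurs, Flink replaces the failed container by requesting new resources. All communication to submit or control an application happens via REST calls.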
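As a small illustration of the REST-based control path, the sketch below builds an HTTP request for the `/jobs/overview` endpoint of Flink's REST API using Java's standard `java.net.http` classes. The host `localhost` and port `8081` are assumptions about where the cluster's REST endpoint is reachable, and `FlinkRestSketch` and `jobsOverviewRequest` are hypothetical names. The request is only constructed here, not sent, so the example runs without a cluster.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class FlinkRestSketch {

    /** Builds (but does not send) a GET request for the cluster's job overview. */
    static HttpRequest jobsOverviewRequest(String host, int port) {
        URI uri = URI.create("http://" + host + ":" + port + "/jobs/overview");
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Assumed address of the REST endpoint; adjust for a real cluster.
        HttpRequest request = jobsOverviewRequest("localhost", 8081);
        System.out.println(request.method() + " " + request.uri());
        // To send it against a running cluster:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```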

3. Run Applications at any Scale

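Flink scales out easily and can maintain very large application state. Its asynchronous and incremental checkpointing algorithm keeps the impact on processing latency minimal while guaranteeing exactly-once state consistency.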

4. Leverage In-Memory Performance

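Flink is a high-performance processing engine that is optimized for local state access. Task state is always kept in memory or, if it exceeds the available memory, in access-efficient on-disk data structures. Tasks therefore compute against local, often in-memory, state, which yields very low processing latency. Flink guarantees exactly-once state consistency in case of failures by periodically and asynchronously checkpointing the local state to durable storage.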
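The idea behind checkpoint-based recovery can be shown with a toy model in plain Java. This is a sketch under simplifying assumptions, not Flink's actual algorithm: the snapshot is taken synchronously and as a full copy (Flink's checkpoints are asynchronous and can be incremental), the input is assumed to be replayable from any offset, and `CheckpointSketch`, `Checkpoint`, and `run` are hypothetical names. After a simulated crash the job restores the last snapshot and resumes from the offset stored with it, so each record affects the state exactly once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CheckpointSketch {

    /** A snapshot: the input offset plus a copy of the state at that offset. */
    static final class Checkpoint {
        final int offset;
        final Map<String, Integer> counts;

        Checkpoint(int offset, Map<String, Integer> counts) {
            this.offset = offset;
            this.counts = new HashMap<>(counts); // copy stands in for durable storage
        }
    }

    /**
     * Counts words from a replayable input and takes a checkpoint every
     * checkpointInterval records (must be greater than zero). If failAt >= 0,
     * one simulated crash happens just before the record at that index.
     */
    static Map<String, Integer> run(List<String> input, int checkpointInterval, int failAt) {
        Map<String, Integer> counts = new HashMap<>(); // local, in-memory state
        Checkpoint last = new Checkpoint(0, counts);   // initial empty checkpoint
        boolean failed = false;
        int offset = 0;
        while (offset < input.size()) {
            if (!failed && offset == failAt) {
                // Simulated crash: local state is lost, roll back to the snapshot.
                failed = true;
                counts = new HashMap<>(last.counts);
                offset = last.offset;
                continue;
            }
            counts.merge(input.get(offset), 1, Integer::sum);
            offset++;
            if (offset % checkpointInterval == 0) {
                last = new Checkpoint(offset, counts);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "a", "c", "a", "b");
        System.out.println("no failure:   " + run(input, 2, -1));
        System.out.println("with failure: " + run(input, 2, 3));
    }
}
```

In the second run the crash occurs just before the record at index 3 is processed. The state is rolled back to the checkpoint taken at offset 2, and the record at index 2 is processed again. Because the state was rolled back together with the offset, the final counts match the failure-free run.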