
Berlin 2019
Agenda


Schedule
Flink Forward Europe 2019 kicks off with a full day of training sessions led by Apache Flink® experts from Ververica.
The program consists of four hands-on and instructional sessions running in parallel on October 7. The sessions are updated for each new release of the software and designed to help you improve your stream processing and Apache Flink skills.
The training pass is sold both separately and in combination with the main conference days on October 8 & 9.
After an inspiring day of technical sessions, we invite you to join our Flink Fest in the evening on October 8.
Apache Flink Developer Training
October 7: 10:00 am - 11:30 am (1h 30min)
Standard Training
This course is a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. The presentations will focus on the core concepts of distributed streaming dataflows, event time, and key-partitioned state. The exercises will give you a chance to see how these concepts are reflected in the API, and to understand how the pieces fit together to solve real problems.
Course Content
- Introduction to Stream Processing and Apache Flink
- Foundations of the DataStream API
- Getting Set Up for Flink Development (incl. exercise)
- Stateful Stream Processing (incl. exercise)
- Time, Timers, and ProcessFunction (incl. exercise)
- Connected Streams: Control and Enrichment Patterns (incl. exercise)
- Testing (incl. exercise)
Prerequisites
No prior knowledge of Apache Flink is required.
You will need your own laptop with these tools installed:
- Git
- Java 8 JDK (a JRE is not sufficient, and newer versions of Java will not work)
- Maven 3.x
- An IDE for Java (or Scala) development
Any laptop capable of running an IDE, such as IntelliJ (macOS, Linux, or Windows), should be fine. To save time during the event, we would also like you to get Flink running before you come to the training. We will send an email with detailed setup instructions a week or two before the conference.
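Two of the core concepts this course covers, event time and key-partitioned state, can be sketched in plain Python. This is an illustrative sketch only, not Flink API code; all names and values are made up for the example:

```python
from collections import defaultdict

# Plain-Python sketch (not Flink code) of two course concepts:
# key-partitioned state and event-time tumbling windows.
WINDOW_MS = 60_000  # 1-minute tumbling windows

# State is partitioned per (key, window start), as Flink partitions keyed state.
state = defaultdict(int)

def process(event):
    """Assign the event to a window by its event timestamp, not arrival time."""
    key = event["user"]
    window_start = event["ts"] - event["ts"] % WINDOW_MS
    state[(key, window_start)] += 1

events = [
    {"user": "a", "ts": 10_000},
    {"user": "a", "ts": 70_000},  # next window for user "a"
    {"user": "b", "ts": 15_000},
    {"user": "a", "ts": 20_000},  # arrives out of order, still lands in window 0
]
for e in events:
    process(e)

print(state[("a", 0)])       # 2 events from user "a" in window [0s, 60s)
print(state[("a", 60_000)])  # 1 event from user "a" in window [60s, 120s)
```

Because the window is derived from the event's own timestamp, the out-of-order record is still counted in the window where it belongs; Flink's DataStream API applies the same idea with real windows and watermarks.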
Apache Flink Developer Training
October 7: 11:45 am - 1:15 pm (1h 30min)
Standard Training
A repeat of the 10:00 am Developer Training session; see the full course description and prerequisites above.
Apache Flink Developer Training
October 7: 2:15 pm - 3:45 pm (1h 30min)
Standard Training
A repeat of the 10:00 am Developer Training session; see the full course description and prerequisites above.
Apache Flink Developer Training
October 7: 4:00 pm - 5:30 pm (1h 30min)
Standard Training
A repeat of the 10:00 am Developer Training session; see the full course description and prerequisites above.
Apache Flink Operations Training
October 7: 10:00 am - 11:30 am (1h 30min)
Standard Training
This course is a hands-on introduction to topics relating to the deployment and operation of Apache Flink applications. The intended audience includes developers and operations staff responsible for deploying Flink applications and maintaining Flink clusters. The presentations will focus on the core concepts of the Flink runtime and the principal tools available for deploying, upgrading, and monitoring Flink applications. The exercises will provide a hands-on introduction to these same topics.
Course Content
- Introduction to Stream Processing and Apache Flink
- Flink in the Data Center
- Architecture and Distributed Runtime
- Containerized Deployments (incl. hands-on)
- State Backends and Fault Tolerance (incl. hands-on)
- Upgrades and State Migration (incl. hands-on)
- Metrics (incl. hands-on)
- Capacity Planning
Prerequisites
No prior knowledge of Apache Flink is required.
You will need a notebook with at least 8 GB RAM and Docker installed. We will send an email with more detailed setup instructions a week or two before the conference.
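The checkpoint-and-restore idea behind the "State Backends and Fault Tolerance" module can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Flink code; in Flink, snapshots are taken consistently across a distributed job and stored durably by a state backend:

```python
import copy

# Plain-Python sketch (not Flink code) of checkpointing: periodically snapshot
# operator state so processing can resume from the last checkpoint after a failure.
state = {"count": 0}
checkpoint = None

def process(value):
    state["count"] += value

def take_checkpoint():
    global checkpoint
    checkpoint = copy.deepcopy(state)  # a real state backend persists this durably

def restore():
    global state
    state = copy.deepcopy(checkpoint)  # roll back to the last consistent snapshot

for v in [1, 2, 3]:
    process(v)
take_checkpoint()       # snapshot taken with count == 6
process(100)            # progress made after the checkpoint...
restore()               # ...is discarded when recovering from a failure
print(state["count"])   # 6
```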
Apache Flink Operations Training
October 7: 11:45 am - 1:15 pm (1h 30min)
Standard Training
A repeat of the 10:00 am Operations Training session; see the full course description and prerequisites above.
Apache Flink Operations Training
October 7: 2:15 pm - 3:45 pm (1h 30min)
Standard Training
A repeat of the 10:00 am Operations Training session; see the full course description and prerequisites above.
Apache Flink Operations Training
October 7: 4:00 pm - 5:30 pm (1h 30min)
Standard Training
A repeat of the 10:00 am Operations Training session; see the full course description and prerequisites above.
SQL Developer Training
October 7: 10:00 am - 11:30 am (1h 30min)
Standard Training
Apache Flink supports SQL as a unified API for stream and batch processing. SQL can be used for a wide variety of use cases and leads to solutions that are easier to build and maintain than applications built with Flink's lower-level APIs. In this hands-on training you will learn how to fully leverage the potential of running SQL queries on data streams with Apache Flink. We will look at different use cases for streaming SQL, including enriching and joining streaming data, computing windowed aggregations, maintaining materialized views, and defining and matching patterns using the MATCH_RECOGNIZE clause, which has been part of the SQL standard since 2016.
Course Content
- Introduction to SQL on Flink
- Querying Dynamic Tables with SQL
- Joining Dynamic Tables
- Pattern Matching with MATCH_RECOGNIZE
- Ecosystem & Writing to External Tables
Prerequisites
No prior knowledge of Apache Flink is required. Basic knowledge of SQL is required.
You will need a notebook with at least 8 GB RAM and Docker installed. To save time during the event, we would also like you to set up the training environment beforehand, i.e. to download all required Docker containers. We will send an email with detailed setup instructions about a week before the conference.
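The "dynamic table" idea behind streaming SQL, a query result that is maintained incrementally as rows arrive rather than computed once over a finished dataset, can be sketched in plain Python. This is an illustrative sketch only; the training itself uses Flink SQL:

```python
from collections import Counter

# Plain-Python sketch (not Flink SQL) of a continuously maintained aggregate,
# conceptually the result of: SELECT user, COUNT(*) FROM clicks GROUP BY user
materialized_view = Counter()

def on_row(row):
    # Each arriving row updates the materialized result in place.
    materialized_view[row["user"]] += 1

stream = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
for row in stream:
    on_row(row)

print(dict(materialized_view))  # {'a': 2, 'b': 1}
```

In Flink, the same GROUP BY query over a stream produces exactly this kind of continuously updated result, which is what makes materialized-view maintenance one of the listed use cases.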
SQL Developer Training
October 7: 11:45 am - 1:15 pm (1h 30min)
Standard Training
A repeat of the 10:00 am SQL Developer Training session; see the full course description and prerequisites above.
SQL Developer Training
October 7: 2:15 pm - 3:45 pm (1h 30min)
Standard Training
A repeat of the 10:00 am SQL Developer Training session; see the full course description and prerequisites above.
SQL Developer Training
October 7: 4:00 pm - 5:30 pm (1h 30min)
Standard Training
A repeat of the 10:00 am SQL Developer Training session; see the full course description and prerequisites above.
Apache Flink Tuning & Troubleshooting
October 7: 10:00 am - 11:30 am (1h 30min)
Advanced Training
Working with numerous Flink users over the last years, we have learned a lot about the most common challenges when bringing a streaming application from early PoC stages into production. In this training, we will focus on eliminating a few of these challenges together. We will provide a starting point for a useful troubleshooting toolset and present best practices, tips, and tricks in fields such as monitoring, watermarking, serialization, state backends, and more. During hands-on sessions between the talks, participants will have the opportunity to apply the newly acquired knowledge to tackle some of the usual suspects in the context of an ill-performing Flink job: we will identify reasons for jobs not making progress or not performing as expected in either throughput or latency.
Prerequisites
To make the most of this training, you should already have Flink knowledge equivalent to our two-day developer training. In particular, you should be familiar with:
- time and watermarking,
- state handling and state backends,
- Flink's fault-tolerance guarantees,
- checkpoints and savepoints,
- the DataStream API and ProcessFunction.
You will need your own laptop with these tools installed:
- Git
- Java 8 JDK (a JRE is not sufficient, and newer versions of Java will not work)
- Maven 3.x
- an IDE for Java development
Any laptop capable of running an IDE, such as IntelliJ (macOS, Linux, or Windows), should be fine. To save time during the event, we would also like you to get Flink running before you come to the training. We will send an email with detailed setup instructions a week or two before the conference.
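Watermarking, one of the prerequisites listed above, can be sketched in plain Python: a watermark trails the largest event timestamp seen so far by a bounded delay, and events with timestamps older than the current watermark are considered late. This is an illustrative sketch only, not Flink code, and the bound of 5 is made up:

```python
# Plain-Python sketch (not Flink code) of bounded-out-of-orderness watermarking.
MAX_OUT_OF_ORDERNESS = 5

max_ts = 0

def watermark():
    # The watermark asserts: no more events with timestamp < this value expected.
    return max_ts - MAX_OUT_OF_ORDERNESS

def observe(ts):
    """Return True if the event is late relative to the current watermark."""
    global max_ts
    late = ts < watermark()
    max_ts = max(max_ts, ts)
    return late

results = [observe(ts) for ts in [10, 12, 4]]
print(results)  # [False, False, True]: 4 arrives after the watermark passed 7
```

Choosing the out-of-orderness bound is exactly the kind of throughput-vs-latency tradeoff this training digs into: too small and correct events are dropped as late, too large and windows fire slowly.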
Apache Flink Tuning & Troubleshooting
October 7: 11:45 am - 1:15 pm (1h 30min)
Advanced Training
A repeat of the 10:00 am Tuning & Troubleshooting session; see the full course description and prerequisites above.
Apache Flink Tuning & Troubleshooting
October 7: 2:15 pm - 3:45 pm (1h 30min)
Advanced Training
A repeat of the 10:00 am Tuning & Troubleshooting session; see the full course description and prerequisites above.
Apache Flink Tuning & Troubleshooting
October 7: 4:00 pm - 5:30 pm (1h 30min)
Advanced Training
A repeat of the 10:00 am Tuning & Troubleshooting session; see the full course description and prerequisites above.
Apache Flink Worst Practices
October 8: 11:00 am - 11:40 am (40min)
Operations
Distributed stream processing is evolving from a technology in the sidelines of Big Data to a key enabler for businesses to provide more scalable, real-time services to their customers. We at Ververica, the company founded by the original creators of Apache Flink, and other prominent players in the Flink community have witnessed this development from the driver's seat. Working with our customers and the wider community, we have seen great success stories, and we have seen things going wrong.
In this talk, I would like to share anecdotes and hard-learned lessons of adopting distributed stream processing, both Apache Flink-specific and across frameworks. Afterwards, you will know how not to model your use cases as a stream processing application, which data structures not to use, how not to deal with failure, how not to approach the topic of monitoring, and much more.
As Head of Product for Ververica Platform, Konstantin is responsible for Ververica's commercial product, an enterprise-ready stream processing platform based on Apache Flink. Previously, he led the solutions architecture team, helping clients as well as the open-source community get the most out of Apache Flink and Ververica Platform. Before joining Ververica, he worked as a Senior Consultant with TNG Technology Consulting, where he supported clients mainly in the areas of distributed systems and automation. Konstantin studied Mathematics and Computer Science at TU Darmstadt, specializing in Stochastics and Algorithmics.
Unlocking the next wave of applications with Stream Processing
October 8: 11:50 am - 12:30 pm (40min)
Technology Deep Dive
Build a Flink AI Ecosystem
October 8: 1:30 pm - 2:10 pm (40min)
Technology Deep Dive
Machine learning, especially deep learning, has become critical in data processing. Frameworks such as TensorFlow, PyTorch, or Caffe are widely adopted in the industry to fulfill deep learning requirements, while traditional machine learning is still preferred in some other areas. Compared to some other processing engines, Flink was not considered a strong player in ML in the past. In the recent year, however, the community has invested quite some effort to catch up in AI. This talk summarizes the efforts and progress in Flink to build its own AI ecosystem, including:
1. A refined ML Pipeline. We have redesigned the ML pipeline interface on top of the Table API. Thanks to the stream/batch unification in the Table API, users enjoy a simple and consistent experience in both online and offline learning scenarios.
2. Deep learning on Flink. Instead of creating yet another DL framework, we decided to embrace existing popular frameworks such as TensorFlow and PyTorch. This allows users to run these frameworks in parallel in Flink with fault tolerance and monitoring support.
3. Enhanced native iteration support in the Table API. One key feature that distinguishes Flink from other processing engines is its support for native iterations, which offers better performance. We added native iteration to the Table API to address the need for iterative processing in AI. Moreover, to improve the user experience, we enhanced the iteration API to support multi-variable and nested iterations.
Combining Flink's data processing capability with these efforts, we would like to establish an ecosystem that provides users with end-to-end support in all AI stages. The talk will demonstrate how AI practitioners, such as algorithm authors and ML library users, would benefit in their ML tasks, from data cleansing to model training, model validation, model serving, and inference.
Jiangjie (Becket) is currently a software engineer at Alibaba, where he mostly focuses on the development of Apache Flink and its ecosystem. Prior to Alibaba, Becket worked at LinkedIn building streaming infrastructure around Apache Kafka after receiving his Master's degree from Carnegie Mellon University in 2014. Becket is a PMC member of Apache Kafka.
A Tale of Dual Sources: Pictures of Grief and The Job Manager’s Clock
October 8: 2:20 pm - 3:00 pm40minTechnology Deep DiveIt was the best of times, it was the worst of times. In the year of our Flink 2017, Lyft proclaimed that, indeed, dual sources are hard. But how hard could they be?
Many of Stripe’s Flink jobs need to start at the beginning of time; from the start of Stripe’s Kafka archives which are stored in S3 until they have “caught-up” and can begin reading from Kafka. Thus, we embarked on creating a specialized Flink source that would start with Kafka archives in S3 and transparently handover to Kafka. Sounds straight-forward, right?
In this experience report, we revel in the many challenges we encountered writing this specialized dual source and the many hacks we endured in its development. We will also highlight how much easier this problem will become in upcoming versions of Flink. Attendees will walk away with a deeper understanding of Flink sources, the challenges of distributed state, and the lifelong lesson of unadulterated hubris.
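The handover at the heart of such a dual source can be sketched in heavily simplified form, outside of Flink's API. The function below is illustrative only, not Stripe's implementation; it assumes both the archive and the live stream expose comparable offsets over the same logical stream.

```python
def read_with_handover(archive, live):
    """Yield records from the archived stream first, then hand over to the live one.

    `archive` and `live` are iterables of (offset, record) pairs. After the
    archive is exhausted, live records whose offsets the archive already
    covered are skipped to avoid emitting duplicates.
    """
    last_offset = -1
    for offset, record in archive:
        last_offset = offset
        yield record
    # Handover point: continue from the live stream.
    for offset, record in live:
        if offset <= last_offset:
            continue  # already emitted from the archive
        yield record
```

The real difficulty the talk describes lies in doing this inside a distributed, checkpointed source, where "the archive is exhausted" must be decided consistently across parallel readers.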
Aaron Levin is a mathematician-turned-radio-DJ-turned software engineer working on Stripe’s real-time data team (✨Streaming✨). Aaron used to live in Berlin, but now lives in Canada’s Berlin (Montréal - not to be mistaken with Berlin, Ontario).
Mike Mintz is a software engineer on Stripe’s Streaming team. Mike previously worked in the trading industry, where it was valuable to have a unified system for historical backtesting and live trading. Mike is originally from Anchorage, Alaska, but now lives in San Francisco.
Streaming Event-Time Partitioning With Apache Flink and Apache Iceberg
October 8: 3:30 pm - 3:50 pm | 20 min | Ecosystem
Netflix’s playback data records every user interaction with video on the service, from trailers on the home page to full-length movies. This is a critical, high-volume dataset that is used broadly across Netflix, powering product experiences, AB test metrics, and offline insights. In processing playback data, we depend heavily on event-time partitioning to handle a long tail of late-arriving events. In this talk, I’ll provide an overview of our recent implementation of generic event-time partitioning on high-volume streams using Apache Flink and Apache Iceberg (Incubating). Built as configurable Flink components that leverage Iceberg as a new output table format, we are now able to write playback data and other large-scale datasets directly from a stream into a table partitioned on event time, replacing the common pattern of relying on a post-processing batch job that “puts the data in the right place”. We’ll talk through what it took to apply this to our playback data in practice, as well as challenges we hit along the way and tradeoffs of a streaming approach to event-time partitioning.
Julia Bennett is a member of the data engineering team for personalization at Netflix, which delivers recommendations made for each user. The team is responsible for building the large-scale data processing used in training and scoring the various machine learning models that power the Netflix UI experience. They have recently been working on moving some of the company’s core datasets from being processed in a once-a-day batch ETL to being processed in near real time using Apache Flink. Before joining Netflix, Julia completed her PhD in mathematics at The University of Texas at Austin.
Flink’s New Batch Architecture
October 8: 4:00 pm - 4:40 pm | 40 min | Technology Deep Dive
Since its inception, Flink has supported the execution of batch workloads. Using specialized operators for processing bounded streams allows Flink to achieve already very decent batch performance. However, Flink’s fault recovery in particular, which restarts the whole topology in case of task faults, caused problems for complex and large batch jobs. Moreover, supporting batch and streaming alike required generalizations in some components that prevented further batch optimizations:
* The scheduler needs to schedule topologies with complex dependencies as well as low latency requirements
* The shuffle service needs to support high-throughput batch as well as fast streaming data exchanges
In this talk, we will shed some light on the community’s effort to address these limitations and on the new components that compose Flink’s improved batch architecture. We will demonstrate how the new fine-grained recovery feature minimizes the set of computations to restart in case of a failover. Moreover, we will explain how a batch job differs from a streaming job and what this means for the scheduler. We will also discuss why it can be beneficial to separate results from computation and how Flink supports this. Last but not least, we want to give an outlook on possible future improvements, like support for speculative execution and RDMA-based data exchanges, and how they relate to Flink’s new batch architecture.
Till is a PMC member of Apache Flink and engineering lead at Ververica. His main work focuses on enhancing Flink’s scalability as a distributed system. Till studied computer science at TU Berlin, TU Munich and École Polytechnique where he specialized in machine learning and massively parallel dataflow systems.
2010 ~ Now Alibaba Inc.
2007 ~ 2010 Baidu Inc.
Towards More Efficient and Adaptive Scheduling for Flink Batch
October 8: 5:00 pm - 5:20 pm | 20 min | Technology Deep Dive
As a unified data processing framework, Flink continues to evolve, and its scheduling strategies are currently being refactored. Based on the redesigned interfaces, we have implemented a new LazyFromSources batch scheduler dedicated to making job execution more efficient and adaptive to stragglers such as long-tail tasks, which can be caused by many factors, such as environment issues and data skew.
In distributed clusters, it is common to encounter performance degradation on some nodes due to hardware problems, transient I/O contention, or CPU load bursts. This kind of degradation can cause the tasks running on such a node to become quite slow; these are the so-called long-tail tasks. Although long-tail tasks do not fail, they can severely affect the total job running time. To deal with this problem, we have implemented speculative execution, which runs a copy of a task on another node once the original task has been identified as a long tail. Our production experience shows that it can significantly reduce the performance degradation.
Since speculative execution resolves problems caused by environment factors, we incorporate adaptive parallelism to accelerate tasks slowed down by data skew. We implement an adaptive strategy that determines task parallelism at runtime from the input data size and other operator statistics, i.e., merging partitions with little data and splitting partitions with a lot of data.
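The merge/split decision described above can be illustrated with a toy plan; this greedy policy and the function name are assumptions for illustration, not the scheduler's actual logic:

```python
import math

def plan_tasks(partition_sizes, target_bytes):
    """Greedy planning sketch: merge small partitions, split large ones.

    Returns a list of per-task input sizes, each kept close to
    `target_bytes`, so that no task is starved and no task is a straggler.
    """
    tasks = []
    current = 0
    for size in partition_sizes:
        if size >= target_bytes:
            # Split a large (skewed) partition into roughly target-sized chunks.
            chunks = math.ceil(size / target_bytes)
            tasks.extend([size / chunks] * chunks)
        else:
            # Accumulate small partitions until they reach the target.
            current += size
            if current >= target_bytes:
                tasks.append(current)
                current = 0
    if current:
        tasks.append(current)  # leftover merged partitions
    return tasks
```

A runtime scheduler would additionally consult operator statistics gathered during execution, but the size-driven merge/split idea is the same.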
I am a senior engineer at Alibaba Group and have worked on Alibaba's big data processing platform for over three years. My work mainly focuses on distributed computing, streaming computing, and distributed resource management. I designed and developed Alibaba's distributed computing platform, which has been deployed on hundreds of thousands of nodes in production, supporting millions of business jobs every day.
Intelligent Log Analysis and Real-time Anomaly Detection @ Salesforce
October 8: 5:30 pm - 5:50 pm | 20 min | Use Case
Application performance monitoring is one of the key DevOps and SRE duties for any large cloud-based organization, and Salesforce is one of the largest of these. The volume of application logs being collected is enormous and may surpass tens of gigabytes per second for large SaaS operations. Trying to identify a degrading performance trend, detect an application anomaly, or perform a root-cause analysis is a needle-in-a-haystack problem which many companies only tackle retroactively on a case-by-case basis, typically knowing what they want to find and commonly using ad-hoc offline queries in Splunk/Elasticsearch or some Big SQL tools. Real-time, proactive log monitoring requires a different approach: a stack of unsupervised statistical analytics and ML models augmenting and classifying the ingested log data at medium-cardinality scopes to identify performance problems as they start to unfold. These methods cannot be too complex, as statistics need to be continuously re-evaluated online: trend and anomaly detection is a moving-target problem with volatile/spiky dynamics. Moreover, there are many definitions of anomalies and performance degradation events, as each software service has unique performance pitfalls, especially when it comes to root-cause analysis.
At Salesforce, we decided to build a public-cloud-hosted, elastic, multi-tenant real-time log analysis and anomaly detection platform so that each tenant can easily onboard their own anomaly models and log monitoring analytics and scale them independently based on their log volume processing requirements. Apache Flink is the engine that powers our platform: most of the intelligent monitoring can be performed at scale using the time-windowing features of Flink, with more complex ML models served through Flink connected streams.
Flink also provides basic multi-tenancy support through Application Master sessions running on top of elastic compute cluster managers such as YARN, Kubernetes and Mesos: this allows us to provide per-tenant elasticity and billing cost estimation, often saving tens of thousands of dollars in monthly compute costs. In this talk, I'll provide a high-level end-to-end overview of our platform and show how we deliver the results via alerting services for DevOps monitoring and via real-time BI tools for performance engineering insights.
Andrew Torson is a Principal Data Engineer at Salesforce. His current work is focused on real-time ML-based anomaly detection and application performance monitoring for the Salesforce cloud software. Before joining Salesforce, he was a data engineering lead working on the Smart Pricing platform at Walmart Labs, generating real-time algorithmic price decisions for the global Walmart e-commerce catalog. Andrew is a Scala enthusiast and an active Flink developer with a long industry track record. He holds a PhD in Operations Management from New York University and an M.Sc. in Applied Mathematics from the Moscow Institute of Physics and Technology.
Dynamically Generated Flink Jobs at Scale
October 8: 11:00 am - 11:40 am | 40 min | Ecosystem
The Data Lake runs 120K jobs a day, all dynamically generated from metadata, with the goal of ingesting data from source, validating it, and then performing milestoning. This talk will cover the use case, architecture, challenges, solutions, and our experience with using Flink. I will focus on optimizations and tuning in a dynamic and variable workload. We will also cover metrics and performance of our original heterogeneous stack versus Flink.
Background:
Our Data Lake provides a platform to manage data in a central location so that anyone in the firm can rapidly query, analyze, or refine the data in a standard way. While there are many data lakes in industry, ours differs in several key ways:
1) Decoupling of producers and consumers – Traditionally in the industry state of the art, producers have to build dedicated pipelines for each producer-consumer pair into a data warehouse, where transformations can then be made. We see this as two user flows with the Lake acting as the intermediary, thereby changing an m (number of datasets) * n (number of consumers) problem into m + n. In the previous model, the transformations performed on the original dataset are isolated, whereas we capture the output of the transformations and put it back into the lake. We call these refiners: consumers of existing data and producers of new, transformed data.
2) Dynamic ingestion pipelines – In order to scale, we make it self-service for our clients to onboard and define a schedule. The Data Lake then dynamically kicks off the jobs necessary to extract, validate, and merge/commit their data.
3) Technology agnostic – We acknowledge that the technologies around us can change. Producers have different source types that we can extract from, and consumers may want their data in a variety of endpoints. Both of these are simply connectors into and out of the core Data Lake. Our core pipeline technology can also change without having our clients rebuild.
4) Shared infrastructure management – We own the central cluster that serves both as data nodes and compute for ingestion and refiner processing. We also own shared database pairs where consumers query from rather than having to provision and manage their own.
5) Built-in milestoning/replication – Reproducibility is critical for regulatory functions. We milestone the data by comparing the incoming data with the existing data, marking each record with the batch ids from which and through which it was live. We handle replication centrally, both for the core Lake and for the consumer endpoints.
6) Centralized controls and entitlements -- We centralize security of the datasets by maintaining a mapping between users and the datasets and apply these on both the files in our cluster as well as the consumer endpoints.
In addition to this, because we have the metadata on the levels of transformations done, we can derive data provenance as well as increase the incentives to ensure high levels of data quality.
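The milestoning described in point 5 might be sketched roughly as follows. This is an illustrative toy, not Goldman Sachs's implementation; it assumes each incoming batch is a full snapshot and uses a made-up record layout with open-ended "thru" markers:

```python
OPEN = None  # marker for a record that is still live ("thru infinity")

def milestone(existing, incoming, batch_id):
    """Merge an incoming snapshot into milestoned records (simplified).

    Each existing record is a dict with a business `key`, a `value`, and
    the batch ids it was live `from` and `thru`. Changed or deleted records
    are closed out at `batch_id`; new versions are opened at `batch_id`.
    """
    incoming_by_key = {r["key"]: r["value"] for r in incoming}
    result = []
    for rec in existing:
        if rec["thru"] is not OPEN:
            result.append(rec)  # historical version, keep as-is
            continue
        new_value = incoming_by_key.pop(rec["key"], None)
        if new_value == rec["value"]:
            result.append(rec)  # unchanged, stays open
        else:
            # Close the old version; open a new one if the key still exists.
            result.append({**rec, "thru": batch_id})
            if new_value is not None:
                result.append({"key": rec["key"], "value": new_value,
                               "from": batch_id, "thru": OPEN})
    for key, value in incoming_by_key.items():  # brand-new keys
        result.append({"key": key, "value": value,
                       "from": batch_id, "thru": OPEN})
    return result
```

Because closed versions are never mutated, any historical batch can be reproduced by filtering on the `from`/`thru` interval, which is what makes the approach attractive for regulatory reproducibility.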
The Data Lake infrastructure runs around a Hadoop core consisting of a pair of Hadoop clusters, each sized at 20K cores, 100TB of memory, and 13PB of storage, for multi-data-center redundancy. We originally built this using a heterogeneous stack involving a mixture of homegrown tech, Spark, and MapReduce, but have since created a homogeneous stack using Flink, resulting in a 3-5x reduction in wall-clock latency while reducing CPU and memory utilization at the same time. The Flink pipeline is dynamically constructed on a per-job basis from metadata describing what the lake needs to do (what the data is, where it is, how often to get it, who is using it, and so on). We generate about 120K jobs from this daily. In addition to being a drop-in replacement for the existing processing architecture, Flink allows us to decouple the interdependency of having to fully ingest raw data before transforming it. More importantly, Flink offers us the ability to have simultaneous streaming and batch versions of existing pipelines without writing code, potentially chaining the n levels of refining, reducing overall latency, and thereby realizing a 100% Kappa architecture. Replication of data/metadata between lake clusters is also handled using Flink jobs.
Regina Chan is a Senior Engineer at Goldman Sachs in the Data Architecture team building solutions to service the firm’s growing demand for data. She is one of the original members of the Data Lake team building it from the ground up and has been leading the effort in rebuilding using Flink.
Airbus makes more of the sky with Flink
October 8: 11:50 am - 12:30 pm | 40 min | Use Case
Make More of the Sky – Air Traffic Management (ATM) services affect the quality and performance of every commercial flight in the world, currently transporting about a billion passengers a year. Growing traffic, limited systems capacity, strict safety requirements, and high environmental awareness demand constant improvement of ATM services.
Come and listen to Hassene Ben Salem, Chief Engineer for Airbus AirSense, and Jesse Anderson, Big Data guru, tell how they use Azure to continuously develop, deploy, and release AirSense services, Airbus's new generation of ATM services. These services, powered by the Azure Data & Analytics platform, provide business insights based on real-time aircraft position data to Airbus internal and external customers in order to shape the future of ATM.
Find out how Flink, Event Hubs, CosmosDB and Databricks, amongst others, were combined to handle real-time streams, derive analytics, and provide decision support directly relevant and valuable for the Air Transportation industry. Hassene and Jesse will also cover the collaborative approach they leverage to overcome the challenges encountered along the way.
Jesse Anderson is a Data Engineer, Creative Engineer and Managing Director of Big Data Institute.
He works with companies ranging from startups to Fortune 100 companies on Big Data. This includes training on cutting edge technologies like Apache Kafka, Apache Hadoop and Apache Spark. He has taught over 30,000 people the skills to become data engineers.
He is widely regarded as an expert in the field and for his novel teaching practices. Jesse is published on O’Reilly and Pragmatic Programmers. He has been covered in prestigious publications such as The Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired.
Hassene Ben Salem is Chief Engineer for Airbus AirSense, where he works on developing solutions for real-time tracking and airspace situational awareness. Before joining AirSense, he was one of the original members of the advanced analytics team, building it from the ground up, and has been leading the efforts in setting up the analytics and AI practices within Airbus Defence and Space, first as a Data Scientist, then as a Product Owner and Lead Architect. Prior to that, he received his M.Sc. in Computer Science from École Polytechnique with a focus on systems architecture.
Not So Big – Flink as a true Application Framework
October 8: 1:30 pm - 2:10 pm | 40 min | Use Case
MotaWord is a collaborative translation platform where multiple translators work together on documents in real time. It uses Flink to power its intelligent platform manager, which manages all of the reactive and analytical workflows using event-based analysis. In this talk, we will steer away from the phrase “stream processor” and look at Flink as an “application framework”. We will take a sample business case and try to reliably construct the business domain, flow the data, and define operations optimally/minimally, under three very important constraints: 1) no external data reference during operation except for sources and backfills (forget about Redis), 2) most elements are shared and referenced, 3) everything just flows in streams.
Oytun is the co-founder and CTO of MotaWord, the world’s fastest business translation platform. Having majored in linguistics, he is a software engineer by vocation. He grew an interest in collaborative workflows, which MotaWord implements fully, and in the automation of human collaboration. His most recent toys are Apache Flink, inline skating, and kites.
Running Flink in production: The good, the bad and the in-between
October 8: 2:20 pm - 3:00 pm | 40 min | Operations
The streaming platform team at Lyft has been running Flink jobs in production for more than a year now, powering critical use cases like improving pickup ETA accuracy, dynamic pricing, generating machine learning features for fraud detection, and real-time analytics, among many others. Broadly, the jobs fall into two abstraction layers: applications (Flink jobs that run on the native platform) and analytics (which leverage Dryft, Lyft’s fully managed data processing engine). This talk will give an overview of the platform architecture, deployment model, and user experience. The talk will also dive deeper into some of the challenges and lessons learned running Flink jobs at scale, specifically around scaling Flink connectors and dealing with event-time skew (source synchronization), and highlight common patterns of problems observed across several Flink jobs. Finally, the talk will give insights into how we are re-architecting the streaming platform at Lyft using a Kubernetes-based deployment.
Lakshmi is a software engineer on the streaming platform team at Lyft. The team builds and supports the core infrastructure that enables several product teams at Lyft to easily and reliably spin up Flink jobs to perform aggregations on real-time data. Most recently, she has been spending time re-architecting the platform to a Kubernetes based deployment. Prior to Lyft, Lakshmi worked in fin-tech land, building a search and information retrieval platform for Goldman Sachs.
Introspection of the Flink in production
October 8: 3:30 pm - 3:50 pm | 20 min | Operations
Your Flink job is breaking your SLA. An alert is raised and you try to understand the root cause of the issue and mitigate it. What would your first steps be, which metrics should you look at, and which operations should you apply in order to recover the SLA?
In this talk, I would like to share our experience of monitoring and operating Flink at scale. You will learn what we believe a meaningful monitoring dashboard for a Flink job looks like and which tools to use in order to understand the execution of your streaming job. We will see how to inspect the different levels of your job’s execution, from Linux through the JVM to Flink’s infrastructure logic.
More than 10 years of experience in the industry. Currently part of the SRE Kafka team at Criteo, which builds the streaming platform. Previously worked at Grammarly. Likes the JVM and functional programming; a fan of improving development productivity.
Bringing Cypher to Apache Flink
October 8: 4:00 pm - 4:40 pm | 40 min | Ecosystem
Cypher is a declarative graph query language based on the property graph model and used by database systems such as Neo4j. The drive for insights into growing volumes of data, often in the range of petabytes, creates the need for distributed processing engines that can handle this task. While the relational world can use Flink’s Table API with its SQL capabilities, the graph world lacks such engines. At Neo4j, we implemented a backend-agnostic Cypher query engine called OKAPI that translates Cypher queries into relational operators on top of an abstract table. That table can then be implemented by any relational backend.
During my master’s thesis, I implemented such a backend for the Flink Table API and compared its performance to that of Neo4j Morpheus, our backend implementation for Apache Spark.
In this talk I will present the OKAPI query engine and my implementation, including lessons learned from working with the Flink Table API and the Calcite optimizer framework. Furthermore, I will demonstrate techniques and results for benchmarking Cypher queries on Apache Flink and Apache Spark.
Sören is a software engineer in the graph analytics team at Neo4j. His interests cover working with graphs in big data environments as well as query execution engines. Prior to joining Neo4j, he studied at Leipzig University and wrote his master’s thesis about Cypher on Flink.
Real-time Experiment Analytics at Pinterest with Apache Flink
October 8: 5:00 pm - 5:20 pm | 20 min | Use Case
At Pinterest, over 250M monthly users all over the world come for inspiration and to discover and do what they love. Our experimentation platform processes petabytes of data, and we run thousands of experiments every day. In order to accelerate decision making, feature iteration, and failure monitoring, we built a real-time experiment analytics pipeline with Flink that computes experiment metrics and performs statistical tests in real time. This helps reduce the delay of our most important experiment metrics from 1 day to 15 minutes.
With powerful low-level operators like KeyedProcessFunction, KeyedBroadcastProcessFunction, IntervalJoin, and CoProcessFunction, we have been able to implement complex business logic and perform stateful operations based on experiment user-activation and event streams. In this talk we will share our development experience and lessons learned in dealing with message out-of-orderness, state management, fault tolerance, checkpoint failures, back-pressure, and job monitoring in a large-scale real-time experiment analytics environment.
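As one illustration of the out-of-orderness challenge mentioned above, a watermark-driven reordering buffer, in heavily simplified form and not Pinterest's code, could look like the following sketch:

```python
import heapq
from collections import defaultdict

def sort_by_event_time(events, max_out_of_orderness):
    """Re-emit out-of-order events per key in event-time order.

    `events` is an iterable of (key, timestamp, payload). The watermark
    trails the highest timestamp seen by `max_out_of_orderness`; buffered
    events are emitted once the watermark passes them, and anything still
    buffered is flushed at end of input.
    """
    buffers = defaultdict(list)
    watermark = float("-inf")
    for key, ts, payload in events:
        heapq.heappush(buffers[key], (ts, payload))
        watermark = max(watermark, ts - max_out_of_orderness)
        for k, buf in buffers.items():
            while buf and buf[0][0] <= watermark:
                ets, p = heapq.heappop(buf)
                yield k, ets, p
    for k, buf in buffers.items():  # flush the remainder at end of input
        while buf:
            ets, p = heapq.heappop(buf)
            yield k, ets, p
```

In Flink this bookkeeping is what keyed state, timers, and watermarks give you for free inside a KeyedProcessFunction; the sketch just makes the trade-off visible: a larger allowed out-of-orderness means more buffering and higher latency.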
Experienced Software Engineer with 15 years of professional experience building scalable, distributed, high-performance web applications, backend services and big data applications. Experience working with high scale systems like Apple's IdMS and Ooyala's recommendation engine.
Languages - Java, Scala, Python
Big Data - Apache Spark, HBase, Elasticsearch, Couchbase NoSQL, Cassandra, Flink
Machine Learning - Content and Collaborative filtering algorithms for video recommendations based on Spark
Software engineer at Pinterest focusing on large scale data analytics with work experience in Spark, Hive, Flink and HBase.
Before joining, Ben Liu graduated from Stanford University with an MS in Statistics and a background in computer science.
Do Flink on Web with FLOW
October 8: 5:30 pm - 5:50 pm | 20 min | Use Case
We present a web service named FLOW that lets users do FLink On Web. FLOW aims to minimize the effort of handwriting streaming applications, similar in spirit to Hortonworks Stream Analytics Manager, StreamAnalytix, and Nussknacker, by letting users drag and drop graphical icons representing streaming operators in a GUI.
FLOW builds on the Flink Table API and lets users assemble graphical icons associated with not only basic SQL operations but also advanced ones like window aggregation, temporal joins, and pattern recognition (the MATCH_RECOGNIZE clause). Its data preview function lets users observe on screen how sample data changes before and after applying each operation. In addition, FLOW shows the sample data as time-series charts and geographical maps by interacting with Elasticsearch and Kibana. Therefore, domain experts with basic knowledge of SQL can design their streaming applications easily in a GUI without an understanding of the Flink DataStream API or the Flink CEP library.
In this talk, we first present what motivated the development of FLOW, then show how FLOW can be used to solve the "Popular Places" exercise in its own style, and lastly explain how FLOW leverages the Flink Table API.
Dongwon Kim is a big data architect at SK telecom. During his post-doctoral work, he was fascinated by the internal architecture of Flink and gave a talk titled “A comparative performance evaluation of Flink” at Flink Forward 2015. He introduced Flink to SK telecom, SK energy, and SK hynix to fulfill those companies’ various needs for real-time stream processing, and shared the experiences at Flink Forward 2017 and 2018. He is currently working on a web service to promote the wide adoption of streaming applications company-wide.
Haemee Park is a big data engineer at SK telecom. She studied Computer Science and Management of Technology at Korea University. She previously worked at Oracle as a technical consultant and at Samsung Life Insurance as a DBA. Currently, she is focusing on developing a PdM solution based on Flink.
Time-To-Live: How to perform Automatic State Cleanup in Apache Flink
October 8: 11:00 am - 11:40 am | 40 min | Technology Deep Dive
A common requirement for many stateful streaming applications is to automatically clean up application state for effective management of state size and visibility. The state time-to-live (TTL) feature enables application state cleanup in Apache Flink.
In this talk, we will first discuss the state TTL feature and its use cases. We will then outline the semantics of the feature and provide code examples before taking a closer look at the implementation details and the challenges encountered with the background cleanup process. Finally, we will talk about the roadmap of the TTL feature, including potential improvements in future Flink releases.
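As a toy model of the semantics the talk covers, not Flink's actual API, a value state whose TTL timer resets on every write and which never returns an expired value (even before background cleanup removes it) could be sketched like this:

```python
import time

class TtlValueState:
    """Toy model of state TTL semantics (illustrative, not Flink's API).

    Mimics a TTL that resets on create and write, and a visibility rule
    where an expired value is never returned, even if the background
    cleanup has not physically removed it yet.
    """
    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (value, last_write_timestamp)

    def update(self, key, value):
        # Writing (re)starts the TTL timer for this key.
        self.entries[key] = (value, self.clock())

    def value(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        stored, written = entry
        if self.clock() - written >= self.ttl:
            del self.entries[key]  # lazy cleanup piggybacked on the read
            return None            # never return an expired value
        return stored
```

The hard part in a real state backend, and a theme of the talk, is the background cleanup: expired entries that are never read again must still be removed eventually, without blowing up checkpoint sizes or read/write latency.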
Andrey Zagrebin is a Software Engineer at Ververica. Andrey’s work focuses primarily on Apache Flink’s distributed coordination and state backends. Previously, he worked as a Software Engineer at T-Mobile building a large scale infrastructure for batch and real-time analytics of customer experience. Before that, he worked at LinkResearchTools, where he developed an SEO web crawler and at Qubit Digital where he built multiple distributed streaming applications.
Building a Self-Service Streaming Platform at Pinterest
October 8: 11:50 am - 12:30 pm | 40 min | Other
Pinterest is a visual discovery engine that helps more than 250 million monthly active users discover things they love and inspires them to go do those things in their daily lives. Creating Pinterest's new stream processing platform, Xenon, around Flink has enabled teams across the company to tackle new real-time applications. From accelerating machine learning model training iterations to enabling real-time analytics, real-time stream processing has opened up new possibilities everywhere. Given that we ingest over a trillion messages every day through Kafka at Pinterest, a lot went into the design of our new stream processing platform.
We would like to share our experience in making the decision to move from different streaming technologies to Flink. We also discuss how we are building a self-service streaming platform for our engineers, data analysts and data scientists that provides a seamless job deployment pipeline, common monitoring, alerting and tooling. We have also enabled ad-hoc Flink SQL queries on bounded Kafka data streams through a UI to help users explore streaming data easily, extract real-time insights and assist product development.
Steven is a software engineer on the Data Processing Platform at Pinterest. He primarily works on Pinterest’s streaming platform, Xenon, and has helped Pinterest move from a Mesos-based micro-batch stream processing model to true streaming with Flink on YARN.
Beam on Flink: How does it actually work?
October 8: 1:30 pm - 2:10 pm | 40 min | Technology Deep Dive
Apache Beam is a data processing model built with a focus on portability. Beam jobs can be written in the language of your choice: Java, Python, Go, or SQL. Once written, they can be executed on various execution engines including Apache Flink, Apache Spark, Google Cloud Dataflow, and many more.
In order for Beam to support multiple execution engines, the Beam API needs to be translated to the API of the execution engine (e.g. Flink's). In Beam, this is the responsibility of the "Runner".
The Flink Runner has come a long way from an early-stage Runner to a fully-featured one. Its latest addition is the integration of Beam's language portability layer, which enables running jobs written in languages other than Java.
In this talk, we will dissect the Flink Runner and show how Beam's components tie together with Flink. If you have ever wondered how the Flink Runner or Beam works, this is your chance to find out.
Max is a software engineer and a PMC member of Apache Flink and Apache Beam. During his studies at the Free University of Berlin and Istanbul University, he worked at the Zuse Institute Berlin on Scalaris, a distributed transactional database. Inspired by the principles of distributed systems and open source, he helped to develop Apache Flink at Ververica and, in the course of this work, joined the Apache Beam community to create the Flink Runner. After maintaining the SQL layer of the distributed database CrateDB, he is now working on the portability aspects of Apache Beam.
One SQL to Rule Them All – a Syntactically Idiomatic Approach to Management of Streams and Tables
October 8: 2:20 pm - 3:00 pm | 40 min | Community

Apache Calcite is a data management framework that includes a SQL parser and query optimizer. It is used by many projects that implement SQL processing capabilities, including Apache Beam and Apache Flink. Over the last years, members of these three communities had many discussions about the semantics and syntax of “Streaming SQL”. At the end of last year, we decided to formalize and summarize our views and ideas in a paper that we submitted to the Industrial Track of the SIGMOD 2019 conference. The paper was accepted (http://sigmod2019.org/sigmod_industry_list).
It presents a three-part proposal for integrating robust streaming into SQL, namely: (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, (3) a limited set of optional keyword extensions to control the materialization of time-varying query results.
The paper shows how, with these minimal additions, it is possible to utilize the complete suite of standard SQL semantics to perform robust stream processing. It motivates and illustrates these concepts using examples and describes lessons learned from implementations in Apache Calcite, Apache Flink, and Apache Beam.
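To make the first idea concrete, here is a hedged toy simulation (plain Python, not SQL) of a time-varying relation: a grouped SUM whose result is a new version of the relation after every input event.

```python
from collections import defaultdict

def time_varying_sum(events):
    """Yield one snapshot of a grouped SUM per input event, i.e. the
    time-varying relation as it evolves. The final snapshot equals the
    classic batch result; the sequence of snapshots is what a streaming
    query observes over time."""
    totals = defaultdict(int)
    for key, value in events:
        totals[key] += value
        yield dict(totals)

snapshots = list(time_varying_sum([("a", 1), ("b", 2), ("a", 3)]))
# snapshots[-1] -> {'a': 4, 'b': 2}
```

The proposed keyword extensions then only control how these versions are materialized, e.g. as a stream of changes or as the latest table.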
In this talk, we present our “Syntactically Idiomatic Approach to Manage Streams and Tables”.
Fabian Hueske is a committer and PMC member of the Apache Flink® project and has been contributing to Flink since its earliest days. Fabian is a co-founder of Ververica, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink®. He holds a PhD in computer science from TU Berlin and is currently writing a book about “Stream Processing with Apache Flink®”.
Demo: From Zero to Production with Ververica Platform
October 8: 3:30 pm - 3:50 pm | 20 min | Other

Come see a demo of the upcoming release of Ververica Platform. I will show how to get started with the platform, from initial installation to a running application, and showcase some of the features coming in the next release, including artifact management, automatic checkpoint configuration, and Flink JobManager failover without ZooKeeper.
Patrick Lucas is a Senior Data Engineer at Ververica working on the team developing Ververica Platform. Previously, he worked on and led various infrastructure teams at Yelp in San Francisco, and prior to that worked at the Cyber Technology and Information Security Laboratory at the Georgia Tech Research Institute. Patrick studied Computer Science at the Georgia Institute of Technology.
Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications
October 8: 4:00 pm - 4:40 pm | 40 min | Ecosystem

Stream processing facilitates the collection, processing, and analysis of real-time data and enables the continuous generation of insights and quick reactions to emerging situations. Yet, despite these advantages compared to traditional batch-oriented analytics applications, streaming applications are much more challenging to operate. Some of these challenges include the ability to provide and maintain low end-to-end latency, to seamlessly recover from failure, and to deal with a varying amount of throughput.
Flink is well known for taking on these challenges with grace. In this session, we explore an end-to-end example that shows how you can use Apache Flink and Amazon Kinesis Data Analytics for Java Applications to build a reliable, scalable, and highly available streaming application. We discuss how you can leverage managed services to quickly build Flink-based streaming applications and show how managed services can help to substantially reduce the operational overhead that is required to run the application. We also review best practices for running streaming applications with Apache Flink on AWS.
So you will not only see how to actually build streaming applications with Apache Flink on AWS, you will also learn how leveraging managed services can help to reduce the overhead that is usually required to build and operate streaming applications to a bare minimum.
Dr. Steffen Hausmann is a Specialist Solutions Architect for Analytics with Amazon Web Services. He has a strong background in the area of complex event and stream processing and supports customers on their cloud journey. In his spare time, he likes hiking in the nearby mountains.
Sponsored Talk: Agile Analytics: Model-driven Flink
October 8: 5:00 pm - 5:20 pm | 20 min | Other

Every organization wants to go faster. Business wants data, but Engineering can't keep up. Growing demands for faster analytics and insights have dramatically increased the growth of streaming data, and the need to quickly extract business intelligence from it in real time has become a necessity. However, streaming analytics is very complex, requiring an enormous and orchestrated effort from programmers and IT personnel to translate business questions into technical specifications. Cogynt simplifies this by providing a unified, model-driven system for your streaming needs: from building data pipelines to sifting through multiple data streams and rapidly detecting complex patterns, it significantly reduces time to market.
In this presentation we will look at two use cases enabled by agile analytics. One explores a financial fraud scenario, while the other investigates opioid transactions in the United States. During the presentation, we will build and deploy the analysis for these scenarios and compare and contrast our model-based approach with the more conventional SQL-based approach.
Michael is a co-founder and CTO of Cogility, currently spearheading the efforts in Cogynt’s development. Having over 10 years of experience in behavioral analytics, distributed systems, big data and model-driven systems, Michael has played a key role in integrating Flink into the Cogynt platform presented today.
Large Scale Real Time Ad Invalid Traffic Detection with Flink
October 8: 5:30 pm - 5:50 pm | 20 min | Use Case

Criteo receives 300 billion ad bid requests and performs 4 billion displays daily. To protect our advertisers from invalid traffic, we need a real-time stream processing framework that fits our scale and can help us detect that traffic. We will talk about the invalid traffic detection architecture at Criteo and focus on our stream processing rule engine and how it evolved from Kafka Streams to Flink, where we leverage the SQL API to quickly create new business rules, as well as the technical challenges we encountered along the way and how we solved them.
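As a hedged illustration of what such a rule might look like (the thresholds and field names are invented for illustration; the production rules described above are expressed via Flink's SQL API), here is a toy sliding-window rate check in plain Python:

```python
from collections import deque

def excessive_rate(events, max_per_window=100, window_ms=1000):
    """Flag (timestamp, user) pairs whose user exceeded max_per_window
    requests within the last window_ms. Events must arrive in timestamp
    order. Thresholds and fields are illustrative only."""
    recent = {}  # user -> deque of timestamps inside the current window
    for ts, user in events:
        q = recent.setdefault(user, deque())
        q.append(ts)
        while q and q[0] <= ts - window_ms:  # evict expired timestamps
            q.popleft()
        if len(q) > max_per_window:
            yield ts, user
```

A streaming SQL version of the same rule would be a windowed COUNT per user with a HAVING threshold, which is what makes the SQL API attractive for quickly adding new rules.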
Senior software engineer at Criteo. He started his career as a freelance web developer, then joined Criteo in 2015 as a software engineer in the Invalid Traffic Detection team.
Senior software engineer at Criteo. Working with Big data technologies for the last 8 years and currently developing a rule-based engine for invalid traffic detection based on Flink.
Writing an interactive streaming SQL engine and pre-parser using Flink
October 8: 11:00 am - 11:40 am | 40 min | Use Case

In this talk I will cover our journey building an interactive SQL interface and engine that leverages Apache Flink and Apache Calcite. I will talk about the architectural design of backend components, including an interactive SQL parser, how communication from web client to server is handled using WebSockets, user interface decisions, scalability considerations, schema declaration and evaluation, and providing feedback to the user issuing SQL in an interactive manner.
I will go over some architectural decisions and tradeoffs, example queries and capabilities, utilizing multiple sources and sinks, and how Apache Flink enables unmatched statefulness, availability, and scalability for this use case.
Attendees will leave the talk with concrete examples of what worked and didn't work, why we chose Flink for our product, and how to create an entire streaming stack using just SQL!
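As a hedged sketch of the pre-parsing idea (a toy regex; a real pre-parser and Calcite handle quoting, subqueries, and dialects), pulling out source tables before the full parse lets a client validate sources and suggest schemas interactively:

```python
import re

def preparse_sources(sql):
    """Naively extract table names following FROM/JOIN so the client can
    give early feedback before the full parse. Illustrative only."""
    return re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", sql, re.IGNORECASE)

preparse_sources("SELECT a FROM clicks JOIN users ON clicks.uid = users.id")
# -> ['clicks', 'users']
```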
Kenny has 18 years of experience with various database platforms behind some of the busiest datasets in the world. Most recently he Co-Founded ObjectRocket. He has had roles as Chief Technologist, Architect, Director, Manager, Developer, and DBA. He was a key member of the early teams that scaled Paypal and then eBay, ran one of the largest PostgreSQL installations on the planet, and was a very early adopter and Entrepreneur using MongoDB. He is an active database community member, speaker, and evangelist.
Loves vi.
Unify Enterprise Data Processing System: Platform-level integration of Flink and Hive
October 8: 11:50 am - 12:30 pm | 40 min | Technology Deep Dive

In this talk, I will present how Flink enables enterprise customers to unify their data processing systems by using Flink to query Hive data.
Unification of streaming and batch is a main theme for Flink. Since 1.9.0, we have integrated Flink with Hive at the platform level. I will talk about:
- what features we have released so far, and what they enable our customers to do
- best practices to use Flink with Hive
- what is the latest development status of Flink-Hive integration at the time of Flink Forward Berlin (Oct 2019), and what to look for in the next release (probably 1.11)
Bowen is a committer of Apache Flink and a Senior Software Engineer at Alibaba. He is currently focusing on advancing Flink as a unified data processing system and developing Flink's metadata and batch capabilities. Bowen is the host of the Seattle Flink Meetup, where he frequently organizes meetups and events and gives talks on Flink.
What's new for Flink's Table & SQL APIs? Planners, Python, DDL, and more!
October 8: 1:30 pm - 2:10 pm | 40 min | Technology Deep Dive

About three years ago, the Apache Flink community started adding a Table & SQL API to process static and streaming data in a unified fashion. It makes data processing accessible to non-programmers and significantly reduces the effort to solve common tasks. Today, Flink SQL already powers production systems at Alibaba, Huawei, Lyft, and Uber. But we are only getting started! The community is currently re-shaping the future of data processing.
Even for followers of the Flink mailing lists, it can be quite difficult to keep track of all the developments happening on Flink's relational APIs. In this talk, we give an overview of recent contributions, such as pluggable optimizers, the new type system with consistent type inference, SQL DDL support, and the Python Table API. We elaborate on how all these efforts interact and discuss the future roadmap.
Aljoscha Krettek is a co-founder at Ververica where he works on the Flink APIs in the open source. He is also a PMC member at Apache Flink and Apache Beam. Before working on Flink, he studied Computer Science at TU Berlin, he has worked at IBM Germany and at the IBM Almaden Research Center in San Jose. Aljoscha has spoken at Hadoop Summit, Strata, Flink Forward and several meetups about stream processing and Apache Flink before.
Timo Walther is a committer and PMC member of the Apache Flink project. He studied Computer Science at TU Berlin. Alongside his studies, he participated in the Database Systems and Information Management Group there and worked at IBM Germany. Timo works as a software engineer at Ververica. In Flink, he is mainly working on the Table & SQL API.
SQL Ask-Me-Anything
October 8: 2:20 pm - 3:00 pm | 40 min | Community

The Flink community has been working on unified SQL support for batch and streaming data for many years.
In the last two releases (Flink 1.8 and Flink 1.9), many features were added and improved, and a few things changed.
The most significant effort was certainly the donation and integration of Blink's query processor that became available for preview in Flink 1.9.
The new processor provides better SQL coverage (full TPC-H in Flink 1.9 and full TPC-DS planned for 1.10) and significant performance improvements for queries on batch data.
Moreover, the community reworked the type system to be compliant with the SQL standard, added support for SQL DDL statements, and improved the interfaces to connect to external catalogs.
Integration with the Apache Hive ecosystem (also available as a preview in Flink 1.9) marks another significant step for Flink SQL.
Some of these efforts are still ongoing and there are many more SQL-related work items on Flink's roadmap.
For this Ask-me-Anything session, many Flink committers who are working on Flink SQL will come together to answer all your questions around current and future SQL support in Flink.
Timo Walther is a committer and PMC member of the Apache Flink project. He studied Computer Science at TU Berlin. Alongside his studies, he participated in the Database Systems and Information Management Group there and worked at IBM Germany. Timo works as a software engineer at Ververica. In Flink, he is mainly working on the Table & SQL API.
I work at realtime compute team in Alibaba, and mostly focus on building a unified, high-performance SQL engine based on Apache Flink.
Bowen is a committer of Apache Flink and a Senior Software Engineer at Alibaba. He is currently focusing on advancing Flink as a unified data processing system and developing Flink's metadata and batch capabilities. Bowen is the host of the Seattle Flink Meetup, where he frequently organizes meetups and events and gives talks on Flink.
Dawid Wysakowicz is a Flink committer, currently working as a Software Engineer at Ververica. Recently his main area of interest has been detecting patterns in streams of data with Flink's Complex Event Processing library. He previously worked at GetInData, where he implemented real-time streaming solutions based on Apache Flink. His journey with highly distributed and scalable solutions started in 2015 while writing a Master's thesis on a distributed genomic data warehouse.
A year in the Apache Flink Community
October 8: 3:30 pm - 3:50 pm | 20 min | Community

Flink is moving quickly, and so is the community driving it. There are many efforts underway, such as the integration of the Chinese-speaking community, the development of a community packages website, and improvements to the contribution process.
The sheer number of efforts makes it difficult to keep track of all of them; that's why this talk will provide an overview of what happened in the past year, and which new community efforts are coming.
Join this talk if you want to get a condensed overview of non-coding efforts in Flink.
Robert Metzger is a PMC member of the Apache Flink project and a co-founder and an engineering lead at Ververica. He is the author of many Flink components including the Kafka and YARN connectors. Robert studied Computer Science at TU Berlin and worked at IBM Germany and at the IBM Almaden Research Center in San Jose. He is a frequent speaker at conferences such as the Hadoop Summit, ApacheCon and meetups around the world.
Flink SQL Powered AutoML Pipeline
October 8: 4:00 pm - 4:40 pm | 40 min | Use Case

Machine learning at scale is a team effort requiring the collaboration of data engineers, who develop, construct, and maintain architectures, and data scientists, who organize data to build models and gain insights to solve business problems. Either role requires specific skill sets that the other does not necessarily have, which makes it hard to leverage all resources to efficiently extract features, train models, and serve in production.
We are proud to introduce our AutoML pipeline solution, an end-to-end system that builds a bridge over the chasm between data engineering and data science by bringing SQL, Flink and ML together using Flink SQL as the abstraction layer.
Since SQL is the most intuitive and widely used data management language among engineers, analysts, and scientists, we came up with the idea of building a SQL-like language interface that exposes Flink SQL and stream processing capabilities to end users in a typical ML scenario. In this pipeline, features can be extracted, ingested into a feature store, and accessed and reused for model training and prediction with a unified API. Stream processing, incremental feature backfill, and ML model updates are achieved by leveraging Flink's windowing features.
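A hedged toy of the windowing idea behind incremental feature extraction (plain Python standing in for a windowed GROUP BY in Flink SQL; the names are illustrative):

```python
from collections import defaultdict

def tumbling_window_counts(events, size_ms):
    """Count events per (key, window start) for tumbling windows of
    size_ms milliseconds. The toy shows how a window assigns each
    timestamped event to exactly one bucket."""
    counts = defaultdict(int)
    for ts, key in events:
        counts[(key, (ts // size_ms) * size_ms)] += 1
    return dict(counts)

tumbling_window_counts([(0, "u"), (500, "u"), (1000, "u")], 1000)
# -> {('u', 0): 2, ('u', 1000): 1}
```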
To make it easy to use, we also built an AutoML interface where you can simply drag and drop components such as datasets, ETL, and built-in ML algorithms to visually build the pipeline and see the result of every step.
Master's degree from Chongqing University. Currently a senior big data engineer at HanSight. Mostly interested in applying machine learning technologies on fast accurate anomaly detection in streaming processing system. I’m currently researching on how to build a flexible AutoML process based on big data processing frameworks. I’m also the main contributor of the UEBA product of our company.
University of Waterloo alumni of 2012, master degree of software engineering, former Flink Forward Berlin 2017 speaker. I’m a senior big data processing architect and currently the leader of UEBA product development at HanSight, the leading cyber security company in China and the only Asian vendor in Gartner Peer Insights “Voice of the Customers” SIEM Customers’ Choice 2019. My skills span multiple big data processing frameworks (e.g., Flink, Spark, Kafka, Zookeeper), data intensive applications design and machine learning technologies. Currently I’m focusing on powering machine learning process with an AutoML architecture that enhances feature reusability, feature standardization, consistency of model training/serving and user experience, and that as a result fills the gap between data engineering and data science.
Fighting phishing and spam with online machine learning on data streams
October 8: 5:00 pm - 5:20 pm | 20 min | Use Case

Machine learning on streaming data is becoming truly important for our products. Our systems grow in size, producing more and more data, with users expecting to get their results in real time. On the other hand, our real-life business scenarios are changing even faster; we are experiencing critical patterns changing within hours. Great examples where online machine learning on data streams can be applied are problems like spam or fraud detection, online recommendations, feed personalisation, and others.
In this presentation, I will go through our approach and the lessons we learnt when building and applying machine learning solutions on data streams using Apache Flink. At FreshMail, we are using this approach to build a new-generation anti-abuse engine, SendGuard, which handles thousands of messages per second (over 70M messages a day) using FlinkML and Flink CEP. We use it to protect recipients from abusive messages like spam, spoofing, or phishing.
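As a hedged, minimal illustration of "online" learning on a stream (a toy perceptron, not SendGuard's actual model or FlinkML's API), the model is updated one message at a time rather than retrained in batch:

```python
def online_perceptron(samples, lr=1.0):
    """Update sparse weights one labeled message at a time, as a streaming
    job would. Features are {name: value} dicts; labels are +1/-1.
    Illustrative toy only."""
    w = {}
    for features, label in samples:
        score = sum(w.get(f, 0.0) * v for f, v in features.items())
        if label * score <= 0:  # misclassified (or untrained): update in place
            for f, v in features.items():
                w[f] = w.get(f, 0.0) + lr * label * v
    return w

w = online_perceptron([({"lottery": 1.0}, -1), ({"invoice": 1.0}, +1)])
# w["lottery"] < 0 (spam-leaning), w["invoice"] > 0 (ham-leaning)
```

The appeal for anti-abuse is exactly this incremental update: when attack patterns change within hours, the model adapts without a batch retraining cycle.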
Wojtek works as FreshMail’s CTO and independent consultant and trainer. He loves various aspects of data-driven business culture transformations and development of data and software architectures in such companies. He is also a great supporter of fostering an organization-wide learning culture.
At FreshMail he leads the product development team and works on some core business solutions, often applying machine learning and AI to solve problems. There, together with the team, he works on a new generation of an anti-abuse engine (fighting spam, phishing, and other attacks) that uses data stream processing and ML & AI on the scale of tens of millions of emails every day. He is a creator of several ML/AI workshops, including public ones.
For almost 10 years he ran the Ministry of Ideas, which he co-founded, consulting on data-driven organizational transformations and the implementation of tools and processes supporting them. As a consultant and trainer he has worked, among others, with The Coca-Cola Company, the American Bankers Association, Macy's, Bloomingdale's, Heineken, Saks 5th Avenue, BP, Boots, Polo Ralph Lauren, Homebase, Porsche, HSBC, Intel, Oracle, and others. Outside of his professional life, he is an enthusiast of mountain sports (downhill, enduro, free-touring, freeride snowboarding) as well as travel, expeditions, and photography.
Sponsored Talk: Streaming Processing Options in Google Cloud
October 8: 5:30 pm - 5:50 pm | 20 min | Other

By 2025, more than a quarter of the data created in the global datasphere will be real-time in nature. To help make sense of the 50+ zettabytes (10^21 bytes) of streaming data expected in 2025, Google Cloud offers a comprehensive set of services for real-time ingestion (Cloud Pub/Sub), streaming analytics (Cloud Dataflow), data warehousing (BigQuery), dashboarding, and AI. Forrester recently rated Google Cloud Dataflow a Leader in the 2019 Streaming Analytics Wave.
In this session, Sergei Sokolenko, the Google product manager for Cloud Dataflow, will give an overview of the managed services available in the Google Cloud, and offer a reference architecture for building streaming processing applications, taking advantage of the unique features available in these services.
Sergei is the Product Manager for Cloud Dataflow, Google’s serverless, fully-managed service for streaming analytics. Dataflow offers advanced resource usage and execution time optimization techniques including autoscaling and fully-integrated batch processing. Sergei holds an MBA degree from the Wharton School, and a Computer Science degree from the Technical University of Munich, Germany.
How to configure your streaming jobs like a pro
October 8: 11:00 am - 11:40 am | 40 min | Operations

In this talk we take an in-depth tour of the different job and cluster configuration options available in Flink and how to leverage them to maximize the performance, stability, security, and observability of your pipeline. Flink features an impressive set of configuration and tuning options that an expert user can use to take their application to the next level, and with every release the list of possibilities keeps on growing.
Based on experience gathered over years of large-scale Flink deployments of hundreds of streaming jobs, we walk you through the most important configuration parameters that can make or break your application, and also cover some rabbit holes that you might want to avoid. We will try to touch on the most important topics, including resource allocation, state backend configuration, logging, metrics, debugging, and security.
By the end of this talk you should have an up-to-date and comprehensive view of the configuration options available in Flink to fine tune your streaming applications.
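For orientation, a few of the `flink-conf.yaml` settings in this space look like the following (a sketch; the values are illustrative examples, not recommendations):

```yaml
# Illustrative flink-conf.yaml excerpt; tune values to your workload.
state.backend: rocksdb                    # state backend choice
state.checkpoints.dir: hdfs:///flink/chk  # durable checkpoint storage
state.backend.incremental: true           # incremental RocksDB checkpoints
taskmanager.heap.size: 4096m              # resource allocation
parallelism.default: 4
metrics.reporters: prom                   # metrics / observability
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
```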
Gyula is a Software Engineer in the Flink Engineering team at Cloudera working on integrating Flink into the Cloudera platform.
He has been a committer and contributor since the early days of Flink streaming and has used Flink in large scale production at King for almost 4 years delivering innovative real-time applications at a global scale.
Gyula grew up in Budapest where he first started working on distributed stream processing and later became a core contributor to the Apache Flink project. Gyula has been a speaker at numerous big data related conferences and meetups, talking about stream processing technologies and use-cases.
Matyas has been working at Cloudera and assisting customers on their big data journey since 2016. After being a member of the Support, then the Professional Services team he has joined Engineering as a founding member of the Cloudera Flink team. He focuses on enterprise requirements including security and operations. Before joining Cloudera he was responsible for delivering classical software development projects in the Telecommunication and Financial sectors. He is a wine enthusiast and a hobby winemaker. He is married and proud owner of a 'sausage dog', Ziggy.
Flink for Everyone: Self-Service Data Analytics with StreamPipes
October 8: 11:50 am - 12:30 pm | 40 min | Ecosystem

This talk presents StreamPipes (https://www.streampipes.org), an open source self-service data analytics solution leveraging existing big data technologies such as Apache Flink to provide non-technical users with an easy and intuitive way to connect, analyze, and exploit a variety of different streaming data sources.
Newly arising IoT-driven use cases in domains such as manufacturing, smart cities, or autonomous driving often demand continuous integration and processing of sensor data in order to derive time-sensitive actions. One example is the optimization of maintenance processes based on the current condition of machines (condition-based maintenance). While this is technically already well supported by the existing big data tool landscape, building such applications still requires a crucial set of expertise, ranging from general domain expertise and programming skills to deep knowledge of distributed and scalable systems. Such skills are usually not present in hardware-focused manufacturing companies.
To mitigate these shortcomings, StreamPipes allows non-technical users to leverage a graphical editor to model and deploy analytical tasks as pipelines in a drag and drop manner. Pipelines are built based on a toolbox of reusable data adapters, processors and sinks. Toolbox elements encapsulate dedicated algorithms (e.g., filter, aggregation, machine learning classifiers) implemented in big data processing engines such as Apache Flink communicating over an internal distributed messaging system (e.g. Apache Kafka).
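Conceptually, a pipeline built from such toolbox elements behaves like the following toy sketch (plain Python; this is not StreamPipes' API — real pipelines are wired graphically and run on engines like Flink):

```python
def run_pipeline(source, processors, sink):
    """Feed each event through a chain of processors; a processor may
    transform the event or drop it by returning None. Surviving events
    go to the sink. Purely conceptual."""
    results = []
    for event in source:
        for process in processors:
            event = process(event)
            if event is None:
                break
        else:
            results.append(sink(event))
    return results

readings = [{"temp": 20}, {"temp": 95}]
too_hot = lambda e: e if e["temp"] > 80 else None  # filter processor
alert = lambda e: "ALERT temp=%d" % e["temp"]      # sink
run_pipeline(readings, [too_hot], alert)  # -> ['ALERT temp=95']
```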
In this talk, we present technologies and tools enabling flexible modeling of real-time processing pipelines by domain experts. We motivate our talk by showing real-world examples we gathered from a number of industry projects during the past years in Industrial IoT domains such as manufacturing and supply chain management. For instance, we show how StreamPipes eases the accessibility of big data tools for non-technical users based on examples such as supervising a fleet of autonomous electric delivery vehicles as well as data analytics in one of the largest test areas for autonomous driving in Germany.
Philipp Zehnder is a research scientist at the FZI Research Center of Information Technology and PhD student at the Karlsruhe Institute of Technology (KIT). Philipp holds a master degree in Computer Science from KIT. He was a student assistant at FZI, where he was working on the ProaSense FP7 project. His current research interests are in the areas of Distributed Stream Processing and Streaming Machine Learning. He received a Microsoft Azure for Research Award for his current research work focused on the development of distributed machine learning pipelines.
Patrick Wiener currently works at the FZI Research Center for Information Technology in Karlsruhe. His research interests include Distributed Computing (Cloud, Edge/Fog Computing), IoT, and Stream Processing. Patrick is an expert for infrastructure management such as containers and container orchestration frameworks. He has worked in several public-funded research projects related to Big Data Management and Stream Processing in domains such as logistics and geographical information systems.
Self-managed and automatically reconfigurable stream processing
October 8: 1:30 pm - 2:10 pm | 40 min | Research

With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are extensively used to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions (when and how much to scale) is currently placed on the user.
In this talk, I will share our recent work at ETH Zurich on providing support for self-managed and automatically reconfigurable stream processing. I will present SnailTrail (NSDI'18), an online critical path analysis module that detects bottlenecks and provides insights into streaming application performance, and DS2 (OSDI'18), an automatic scaling controller which identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I will conclude with evaluation results, ongoing work, and future challenges in this area.
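The core of DS2's scaling rule can be sketched as follows (a simplification of the OSDI'18 paper; real DS2 instruments operators to measure "true" rates that exclude time spent waiting on input or backpressure):

```python
import math

def target_parallelism(true_rate, parallelism, target_rate):
    """Estimate an operator's backpressure-free parallelism: true_rate is
    the records/s it processes at the current parallelism counting useful
    work only; target_rate is the rate it must sustain."""
    per_instance = true_rate / parallelism
    return math.ceil(target_rate / per_instance)

target_parallelism(true_rate=2000, parallelism=2, target_rate=5000)  # -> 5
```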
Vasia is a postdoctoral fellow at the Systems Group of ETH Zurich and will soon be moving to Boston University as an Assistant Professor of Computer Science. She is interested in distributed stream processing, large-scale graph analytics, and the intersection of the two. Vasia is a PMC member of Apache Flink and co-author of O’Reilly’s “Stream Processing with Apache Flink”.
Research
Self-managed and automatically reconfigurable stream processingMoving on from RocksDB to something FASTER
Moving on from RocksDB to something FASTER
October 8: 2:20 pm - 3:00 pm | 40 min | Research

For many years, streaming applications requiring larger-than-memory fault-tolerant state have settled for RocksDB as the de facto state backend. This is despite it being optimised for read and range queries rather than the update-intensive workloads typically exhibited in stream processing. Several features of RocksDB's design, such as its key-order page format and read-copy-update approach, become limiting factors in the throughput of state updates.

Given these limitations, we have evaluated the use of FASTER, an embedded key-value store from Microsoft Research, as an alternative backend that is more suitable for streaming workloads. It uses in-place updates on a changeable "hot" set in memory and a cache-optimised hash index to ensure a high throughput of point operations on its HybridLog, which spans memory and disk.

In this talk we present benchmarking results for different streaming workloads, highlighting the performance differences between FASTER and RocksDB. We use these results to motivate an integration between FASTER and Timely Dataflow, with promising results demonstrating FASTER's suitability as the state backend of choice for large stateful computations. Finally, we will show early results from the integration of FASTER with Flink.
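The contrast between the two update styles can be sketched in a few lines (a conceptual toy, not either engine's actual code):

```python
def update_in_place(store, key, delta):
    """FASTER-style hot-path update: mutate the record where it sits
    in the mutable region of the log."""
    store[key] = store.get(key, 0) + delta

def read_copy_update(store, key, delta):
    """LSM/RocksDB-style update modeled as read + write of a new version;
    old versions linger until compaction, so frequent updates pay extra
    read and copy cost compared to mutating in place."""
    new_version = dict(store)  # the old version is retained elsewhere
    new_version[key] = store.get(key, 0) + delta
    return new_version

s = {}
update_in_place(s, "count", 1)
update_in_place(s, "count", 2)  # s == {'count': 3}
```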
Studied Computing at Imperial College London which included a year-long exchange program to ETH Zurich. Wrote Master's Thesis as part of the Strymon group under the supervision of Vasia Kalavri and John Liagouris. In September 2019 began a Backend Engineer position at Monzo, London. Interested in Stream Processing and Data-Intensive Applications.
Research
Introducing Arc: A common intermediate language for unified batch and stream analytics
October 8: 3:30 pm - 3:50 pm20minResearchToday's end-to-end data pipelines need to combine many diverse workloads such as machine learning, relational operations, stream dataflows, tensor transformations, and graphs. For each of these workload types, several frontends exist (e.g., DataFrames/SQL, Beam, Keras) based on different programming languages, as well as different runtimes (e.g., Spark, Flink, Tensorflow) that target a particular frontend and possibly a hardware architecture (e.g., GPUs). Putting all the pieces of a data pipeline together simply leads to excessive data materialisation, type conversions and hardware utilisation, as well as mismatches of processing guarantees.
Our research group at RISE and KTH in Sweden has developed Arc, an intermediate language that bridges the gap between any frontend and a dataflow runtime (e.g., Flink) through a set of fundamental building blocks for expressing data pipelines. Arc incorporates Flink- and Beam-inspired stream semantics such as windows, state and out-of-order processing, as well as concepts found in batch computation models. With Arc, we can cross-compile and optimise diverse tasks written in any programming language into a unified dataflow program. Arc programs can run efficiently on various hardware backends, as well as allowing seamless, distributed execution on dataflow runtimes. To that end, we showcase Arcon, a concept runtime built in Rust that can execute Arc programs natively, and present a minimal set of extensions to make Flink an Arc-ready runtime.
Max Meldrum is a researcher and systems engineer at RISE SICS in Sweden. His interests lie in distributed systems and the areas it intersects with: dataflow processing frameworks (e.g., Flink), scheduling, and data management. Max is one of the core developers of Arcon, a distributed Rust-based dataflow runtime capable of executing stream and batch workloads efficiently at native hardware speeds.
Klas Segeljakt is a next-gen compilers researcher and PhD student at KTH in Sweden, currently investigating the space of programming languages and hardware acceleration for data processing. He is known for his contributions to Arc, an intermediate representation aiming to bridge the worlds of batch and stream processing, independently of the frontend language (e.g. SQL) or backend system executing the optimized code (e.g., Flink).
Research
Deploying Stateful Functions-as-a-Service (FaaS) on Streaming Dataflows
October 8: 4:00 pm - 4:40 pm40minResearchIn the serverless model, users upload application code to a cloud platform and the cloud provider undertakes the deployment, execution, and scaling of the application, relieving users from all operational aspects. Although very popular, current serverless offerings provide poor support for the management of local application state, the main reason being that managing state and keeping it consistent at large scale is very challenging. As a result, the serverless model is inadequate for executing stateful, latency-sensitive applications.
In this talk, we present a research project at TU Delft which focuses on a high-level programming model for developing stateful functions and deploying them in the cloud. Our programming model allows functions to retain state as well as call other functions. In order to deploy stateful functions in a cloud infrastructure, we translate functions and their data exchange into a stateful dataflow graph. With this talk we aim to demonstrate that a modified version of an open-source dataflow engine, such as Apache Flink, can serve as a runtime for stateful functions. Finally, we will demonstrate that we can deploy scalable stateful services in the cloud with surprisingly low latency and high throughput.
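The translation of stateful functions into a dataflow can be sketched, under heavy simplification, as a toy message-routing runtime. All names below (`Runtime`, `register`, `send`) are illustrative and not the project's actual API; the sketch only shows the core idea of per-key state plus function-to-function messaging routed like a keyed stream:

```python
# Toy model (illustrative, not the TU Delft system): stateful functions
# as keyed operators. Each function keeps state per (function, key) and
# may send messages to other functions via the runtime's mailbox.

class Runtime:
    def __init__(self):
        self.functions = {}   # function name -> handler
        self.state = {}       # (function, key) -> state dict
        self.mailbox = []     # pending messages, processed in order

    def register(self, name, handler):
        self.functions[name] = handler

    def send(self, target, key, msg):
        self.mailbox.append((target, key, msg))

    def run(self):
        while self.mailbox:
            target, key, msg = self.mailbox.pop(0)
            state = self.state.setdefault((target, key), {})
            self.functions[target](self, state, key, msg)

def counter(rt, state, key, msg):
    state["count"] = state.get("count", 0) + msg
    rt.send("auditor", "all", 1)          # functions can call functions

def auditor(rt, state, key, msg):
    state["events"] = state.get("events", 0) + msg

rt = Runtime()
rt.register("counter", counter)
rt.register("auditor", auditor)
for v in [1, 2, 3]:
    rt.send("counter", "user-7", v)
rt.run()

assert rt.state[("counter", "user-7")]["count"] == 6
assert rt.state[("auditor", "all")]["events"] == 3
```

Keying state by (function, key) is what lets such a graph be partitioned and scaled like any keyed dataflow, which is the property the talk exploits.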
Adil Akhter is a functional programmer with a focus on distributed system engineering and data-intensive application architecture. He works at ING as a Lead Engineer and is involved in building a state-of-the-art Prediction Serving system. He is passionate about technology and interested in category theory, streaming analytics, scalable machine learning infrastructure, and more. In his spare time, he hacks with Haskell and Idris, speaks at different conferences, or organises meetups.
Marios Fragkoulis is a postdoctoral researcher at TU Delft, working on scalable stream processing. He holds a PhD in main memory data analytics from the Athens University of Economics and Business and an MSc degree from Imperial College London. Marios is the co-developer of dgsh, the directed graph shell.
Research
Deep Stream Dynamic Graph Analytics with Grapharis
October 8: 5:00 pm - 5:20 pm20minResearchThe world's toughest and most interesting analysis tasks lie at the intersection of graph data (inter-dependencies in the data) and deep learning (inter-dependencies in the model). Classical graph embedding techniques have for years occupied research groups seeking ways to encode complex graphs into a low-dimensional latent space. Recently, deep learning has come to dominate embedding generation due to its ability to automatically generate embeddings for any static graph.
Grapharis is a project that revitalizes the concept of graph embeddings, yet it does so in a realistic setting where graphs are not static but keep changing over time (think of user interactions in social networks). More specifically, we explored how a system like Flink can be used both to simplify the process of training a graph embedding model incrementally and to make complex inferences and predictions in real time using graph-structured data streams. To our knowledge, Grapharis is the first complete data pipeline using Flink and Tensorflow for real-time deep graph learning. This talk will cover how we can train, store and generate embeddings continuously and accurately as data evolves over time, without the need to re-train the underlying model.
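The idea of updating an embedding model incrementally as edges arrive, rather than retraining from scratch, can be illustrated with a toy sketch. This is not Grapharis code and the update rule is a deliberately simple assumption (one SGD-style step that pulls the two endpoint embeddings together when an edge between them is observed):

```python
# Toy sketch of incremental graph-embedding updates (illustrative only):
# when a new edge (u, v) arrives on the stream, nudge the endpoint
# embeddings toward each other, so the model tracks the evolving graph
# without a full retrain.
import random

def update_on_edge(emb, u, v, dim=4, lr=0.1):
    rng = random.Random(42)   # deterministic init for the sketch
    for node in (u, v):
        emb.setdefault(node, [rng.uniform(-0.5, 0.5) for _ in range(dim)])
    eu, ev = emb[u], emb[v]
    for i in range(dim):      # pull the two vectors together (simultaneous update)
        eu[i], ev[i] = eu[i] + lr * ev[i], ev[i] + lr * eu[i]

emb = {}
dot_before = None
for _ in range(20):           # the same edge keeps arriving on the stream
    update_on_edge(emb, "a", "b")
    d = sum(x * y for x, y in zip(emb["a"], emb["b"]))
    if dot_before is not None:
        assert d >= dot_before   # similarity grows with each incremental step
    dot_before = d
```

Each arriving edge costs one small update instead of a retraining pass, which is the property that makes continuous, real-time embedding maintenance feasible.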
Massimo Perini is a graph analytics aficionado with deep scientific knowledge and engineering experience in the field. Massimo is currently researching online graph embedding techniques within a multi-MSc degree in Data Science from KTH in Sweden, Politecnico di Milano and Torino in Italy, while also holding a joint Computer Engineering BSc with Tongji University in China. He has been a finalist at the Xilinx Open Hardware 2018 and the winner of the Italian Statistics and Probability Competition in 2013. His general interests lie in the fields of machine learning, big data and real-time data processing.
Research
Introducing WinBro for Scalable Streaming Graph Partitioning
October 8: 5:30 pm - 5:50 pm20minResearchIn today’s world of sensors, events and social networks, data is produced with high frequency, often represented in the form of graphs. Certain applications such as network monitoring and credit fraud detection require fast (near) real-time graph analysis based on the latest data, making them a great use case for a stream processing framework such as Flink. However, Flink only supports static key-based partitioning, which is not a viable solution for ever-changing graph data dependencies. On top of that, existing smart graph partitioning algorithms cannot really scale out due to heavy dependencies on shared state. We have therefore investigated the heart of the problem and developed WinBro, a Flink-native framework that allows the partitioning of graph streams in a scalable way. In principle, our solution exploits Flink's windowing and broadcast state pattern to achieve the first smart yet scalable online graph partitioner. We used our partitioner to create optimal subgraphs using real-world data from social networks with billions of edges. WinBro can be used out of the box for boosting dynamic applications (PageRank and Connected Components labelling) as a specialized alternative to Flink’s default hash partitioning.
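To see why "smart" partitioning differs from hash partitioning, here is a generic greedy streaming edge partitioner in miniature. This is not WinBro's actual algorithm (the abstract does not specify it); it only illustrates the general idea such partitioners share: assign each incoming edge to the partition that already hosts its endpoints, using load as a tie-breaker, instead of hashing keys blindly:

```python
# Illustrative greedy streaming edge partitioner (a generic sketch, not
# WinBro): score each partition by endpoint locality, then by load.

def partition_edges(edges, num_partitions):
    placement = {p: set() for p in range(num_partitions)}  # p -> vertices seen
    load = {p: 0 for p in range(num_partitions)}           # p -> edges assigned
    assignment = []
    for u, v in edges:
        # Prefer the partition already holding u or v; break ties by
        # choosing the least-loaded partition.
        best = max(range(num_partitions),
                   key=lambda p: (len({u, v} & placement[p]), -load[p]))
        placement[best].update((u, v))
        load[best] += 1
        assignment.append(((u, v), best))
    return assignment

edges = [(1, 2), (2, 3), (4, 5), (2, 4)]
result = dict(partition_edges(edges, 2))
assert result[(1, 2)] == result[(2, 3)]   # edges sharing vertex 2 co-locate
```

The catch the abstract points at is that `placement` is exactly the shared state that blocks scale-out; broadcasting such state to parallel instances (Flink's broadcast state pattern) is one way around that limitation.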
Zainab Abbas is a PhD student at the KTH Royal Institute of Technology, Stockholm, and the Université catholique de Louvain, Louvain La-Neuve. She holds a joint master's degree in Distributed Systems from KTH, Stockholm, and the Polytechnic University of Catalonia (UPC), Barcelona. Her research work is focused on performance optimization techniques for large-scale data, in particular stream processing using modern data stream processing engines such as Apache Flink.
Adrian Ackva is a Systems Research Intern at the Research Institutes of Sweden (RISE). He has a background in Business Information Systems and is about to finish M.Sc. degrees specializing in data-intensive computing at KTH Royal Institute of Technology, Stockholm, and the University of Rennes 1. Before his Master's studies, he worked as a technical consultant on different projects in Germany and England, helping to make their infrastructure scalable and automated.
Research
Keynote: Stream Processing and Applications in the Modern Age
October 8: 9:10 am - 9:50 am40minKeynoteSince the beginning of the year, Apache Flink has made big headway in the unification of its stream- and batch processing capabilities. This effort is a major milestone in the bigger vision of building a stream processor that can form the backbone for both unified batch and stream data processing as well as for event-driven applications in a common way.
With the batch/streaming unification well underway, we want to look again at the other end of the spectrum: How can Stream Processing help to build modern event-driven applications:
- What challenges do application developers face when using stream processors as their framework?
- What building blocks and tools are still missing to build complex applications?
- How does stream processing fit into the major trends in the application development space, like serverless architectures?
- How does stream processing compare to other approaches (req/resp, actors, etc.) and what can those approaches learn from each other?
Stephan Ewen is CTO and co-founder at Ververica where he leads the development of the stream processing platform based on open source Apache Flink. He is also a PMC member and one of the original creators of Apache Flink. Before working on Apache Flink, Stephan worked on in-memory databases, query optimization, and distributed systems. He holds a Ph.D. from the Berlin University of Technology.
Keynote
Keynote: Cloudera's Data-in-Motion vision
October 8: 10:05 am - 10:30 am25minKeynoteCloudera’s Data-in-Motion team is adopting Flink to complete its platform for collecting, processing and analyzing event streams. In working with our customers we find that an increasing number of enterprises implement streaming-first solutions, where insights are delivered in near real time. In this talk, we discuss the enterprise requirements for an end-to-end streaming platform and Flink’s role in our vision for it. We share usage patterns where our customers are interested in using or are already using Flink in production.
Marton is a Flink PMC member and one of the first contributors to the streaming API. He has driven big data adoption at around 50 customers as a Senior Solutions Architect at Cloudera during the last four years. He is the manager of the newly formed Streaming Analytics team and focuses on adding Flink to the Cloudera platform.
As the Chief Technology Officer of Cloudera in Asia Pacific, Andrew is responsible for working closely with enterprises to transform their businesses by unlocking the potential of their data running on any cloud, from the edge to AI. A trusted advisor and partner to many prominent executives across the region, Andrew helps businesses maximize ROI by identifying complex business problems, reducing them to deliverable solutions and achieving business objectives. Andrew joined Cloudera as part of the merger with Hortonworks in early 2019 and has a 20+ year career leading from the intersection of business and technology to drive strategic planning, tactical development, and implementation of leading-edge technology solutions across enterprises globally. He is recognized as a hands-on leader with a customer focus, deep technical expertise and a reputation for integrity, quality, efficiency and reliability. Andrew is a highly sought-after speaker and can often be seen presenting at thought-leading industry events. He also teaches at a number of universities in North America and has authored a book titled “Streaming Data” (http://manning.com/psaltis/).
Keynote
Lunch Talk: How to contribute to Apache Flink
October 8: 1:00 pm - 1:30 pm30minCommunityThis talk will introduce you to the various ways of contributing to Apache Flink. We will start with an introduction to the Apache Software Foundation, its history and present-day status, and where Apache Flink fits within the Foundation. We’ll then give an overview of the different areas to contribute to, and what to consider in each of them.
For example: How can I best help users on the user@flink mailing list? How are decisions made in the community? How can I help with Flink releases, or contribute code to the project?
Robert Metzger is a PMC member of the Apache Flink project and a co-founder and an engineering lead at Ververica. He is the author of many Flink components including the Kafka and YARN connectors. Robert studied Computer Science at TU Berlin and worked at IBM Germany and at the IBM Almaden Research Center in San Jose. He is a frequent speaker at conferences such as the Hadoop Summit, ApacheCon and meetups around the world.
Community
Coffee Break - Book Signing: "Stream Processing with Apache Flink" by Fabian Hueske & Vasiliki Kalavri
October 8: 3:00 pm - 3:30 pm30minOtherBecome one of the lucky Flink Forward attendees to get a free signed copy of "Stream Processing with Apache Flink" - the recent book by Fabian Hueske & Vasiliki Kalavri! The number of books is limited so be sure to visit the Ververica booth during the break to grab your copy.
Welcome Session
Welcome Session
October 8: 9:00 am - 9:10 am10minKeynoteHolger leads all things growth, which we define as community as well as commercial growth, since both are essential to the success of the company and of the wider Flink community. Holger has been in the IT industry for almost 20 years, working for organisations such as Citrix, VMware and Neo4j, and is passionate about open-source deep-tech technologies and Open Core business models.
Keynote
Keynote: Open Source in the Age of Change
October 8: 9:50 am - 10:05 am15minKeynoteTechnological advancement has fundamentally changed our global economy, presenting both challenges for incumbents and opportunities for new entrants. An organization’s ability to adapt, respond and harness technology is the new measure of success. Gone are the days of maintaining a sustainable competitive advantage in your industry. Many enterprises are adapting to this new reality, but many are still struggling with the pace of change. For those of us that help create new technologies, especially those that build open source software, this is an exciting time. We get to create. How do we continue creating, while helping our users continue to adapt and adopt? Join Chip Childers, CTO of Cloud Foundry Foundation, as he discusses the unique challenges and opportunities generated by the digital era.
Chip has spent 20 years in large-scale computing and open source software. In 2015, he became the co-founder of the Cloud Foundry Foundation as Technology Chief of Staff. He was the first VP of Apache CloudStack, a platform he helped drive while leading Enterprise Cloud Services at SunGard and then as VP Product Strategy at CumuLogic. Prior to SunGard, he led the rebuild of mission-critical applications for organizations including IRS.gov, USMint.gov, Merrill Lynch and SEI Investments. Chip is an experienced speaker at events like OSCON, LinuxCon North America, LC Japan, LC EU, ApacheCon, O’Reilly Software Architecture Conference, and many more. In his free time, Chip loves trail hiking with his black lab, sailing catamarans and sunfish, and trying to keep up with his young daughter.
Keynote
October 8 talks at a glance:
- Keynote: Stream Processing and Applications in the Modern Age
- Keynote: Open Source in the Age of Change
- Keynote: Cloudera's Data-in-Motion vision
- Apache Flink Worst Practices
- Lunch Talk: How to contribute to Apache Flink
- Build a Flink AI Ecosystem
- A Tale of Dual Sources: Pictures of Grief and The Job Manager’s Clock
- Coffee Break - Book Signing: "Stream Processing with Apache Flink" by Fabian Hueske & Vasiliki Kalavri
- Streaming Event-Time Partitioning With Apache Flink and Apache Iceberg
- Flink’s New Batch Architecture
- Towards More Efficient and Adaptive Scheduling for Flink Batch
- Intelligent Log Analysis and Real-time Anomaly Detection @ Salesforce
- Dynamically Generated Flink Jobs at Scale
- Airbus makes more of the sky with Flink
- Not So Big – Flink as a true Application Framework
- Running Flink in production: The good, the bad and the in-between
- Introspection of the Flink in production
- Bringing Cypher to Apache Flink
- Real-time Experiment Analytics at Pinterest with Apache Flink
- Do Flink on Web with FLOW
- Time-To-Live: How to perform Automatic State Cleanup in Apache Flink
- Building a Self-Service Streaming Platform at Pinterest
- Beam on Flink: How does it actually work?
- One SQL to Rule Them All – a Syntactically Idiomatic Approach to Management of Streams and Tables
- Demo: From Zero to Production with Ververica Platform
- Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications
- Sponsored Talk: Agile Analytics: Model-driven Flink
- Large Scale Real Time Ad Invalid Traffic Detection with Flink
- Writing an interactive streaming SQL engine and pre-parser using Flink
- Unify Enterprise Data Processing System: Platform-level integration of Flink and Hive
- What's new for Flink's Table & SQL APIs? Planners, Python, DDL, and more!
- SQL Ask-Me-Anything
- A year in the Apache Flink Community
- Flink SQL Powered AutoML Pipeline
- Fighting phishing and spam with online machine learning on data streams
- Sponsored Talk: Streaming Processing Options in Google Cloud
- How to configure your streaming jobs like a pro
- Flink for Everyone: Self-Service Data Analytics with StreamPipes
- Self-managed and automatically reconfigurable stream processing
- Moving on from RocksDB to something FASTER
- Introducing Arc: A common intermediate language for unified batch and stream analytics
- Deploying Stateful Functions-as-a-Service (FaaS) on Streaming Dataflows
- Deep Stream Dynamic Graph Analytics with Grapharis
- Introducing WinBro for Scalable Streaming Graph Partitioning
Flinking, Fast and Slow
October 9: 11:30 am - 11:50 am20minOperationsHave you tried to build a new Flink project in an already well-established, non-Flink ecosystem? Curious to try it? My team built our company’s first Flink service in production and had to determine how to incorporate it into a complex environment of non-Flink services and processes. I will cover what we learned about integrating large, tier-1 Flink services, as well as simpler microservices, and the best ways we found to connect them all into a healthy and robust software system. This talk will provide pragmatic strategies and best practices for overcoming the challenges of integrating Flink into pre-existing systems in a way that is efficient and maximizes Flink’s strengths.
Enabling Machine Learning with Apache Flink
Enabling Machine Learning with Apache Flink
October 9: 12:00 pm - 12:40 pm40minUse CaseIn the world of ride sharing, many decisions, such as matching a passenger to the nearest driver, need to be made in real time. As a result, building and reacting to the current state of the world is imperative. In this presentation I will talk about how we built a machine learning platform for real-time feature generation at Lyft using Flink Streaming. I will be covering some of the problems we solved, such as dealing with skew when reading from multiple sources, controlling throughput when talking to microservices, and bootstrapping, as well as other operational aspects of building a production-quality ML platform.
Sherin is a Software Engineer at Lyft. In her career spanning 8 years, she has worked on most parts of the tech stack, but enjoys the challenges in Data Science and Machine Learning the most. Most recently she has been focussed on building products that would facilitate advances in Artificial Intelligence and Machine Learning through Streaming.
She is passionate about getting more people, especially women, interested in the field of data and has been trying her best to share her work with the community through tech talks and panel discussions. Most recently she gave a talk about Flink Streaming at Connect 2019 (a Women Who Code event) in San Francisco.
In her free time she loves to read and paint. She is also the president of the Russian Hill book club based in San Francisco and loves to organize events for her local library.
Use Case
Towards Flink 2.0: Unified Batch & Stream Processing
October 9: 1:40 pm - 2:20 pm40minTechnology Deep DiveFlink currently features different APIs for bounded/batch (DataSet) and streaming (DataStream) programs. And while the DataStream API can handle batch use cases, it is much less efficient at them than the DataSet API. The Table API was built as a unified API on top of both, covering batch and streaming with the same API and delegating under the hood to either DataSet or DataStream.
In this talk, we present the latest on the Flink community's efforts to rework the APIs and the stack for better unified batch & streaming experience. We will discuss:
- The future roles and interplay of DataSet, DataStream, and Table API
- The new Flink stack and the abstractions on which these APIs will build
- The new unified batch/streaming sources
- How batch and streaming optimizations differ in the runtime, and what the future interplay of batch and streaming execution could look like
Aljoscha Krettek is a co-founder at Ververica where he works on the Flink APIs in open source. He is also a PMC member at Apache Flink and Apache Beam. Before working on Flink, he studied Computer Science at TU Berlin and worked at IBM Germany and at the IBM Almaden Research Center in San Jose. Aljoscha has previously spoken about stream processing and Apache Flink at Hadoop Summit, Strata, Flink Forward and several meetups.
Technology Deep Dive
Batch/Stream Ask-Me-Anything
October 9: 2:30 pm - 3:10 pm40minCommunityProcessing batch and streaming data with a single, unified engine has been a goal of the Apache Flink community for several years.
Over the last months, the community has made good progress towards realizing this vision.
While most changes currently affect the internals of the system and are not visible to users, the last step of the plan is to replace the DataSet API with an extension of the DataStream API that works on bounded streams.
Flink committers who are involved in the design and implementation of this effort will gather for this Ask-me-Anything session and answer all the questions that you might have about unified batch/stream processing with Flink.
Aljoscha Krettek is a co-founder at Ververica where he works on the Flink APIs in open source. He is also a PMC member at Apache Flink and Apache Beam. Before working on Flink, he studied Computer Science at TU Berlin and worked at IBM Germany and at the IBM Almaden Research Center in San Jose. Aljoscha has previously spoken about stream processing and Apache Flink at Hadoop Summit, Strata, Flink Forward and several meetups.
Till is a PMC member of Apache Flink and engineering lead at Ververica. His main work focuses on enhancing Flink’s scalability as a distributed system. Till studied computer science at TU Berlin, TU Munich and École Polytechnique where he specialized in machine learning and massively parallel dataflow systems.
I work on the realtime compute team at Alibaba, where I mostly focus on building a unified, high-performance SQL engine based on Apache Flink.
Community
Stream SQL with Flink @ Yelp
October 9: 3:30 pm - 3:50 pm · 20 min · Use Case
Yelp has been using Flink for over two years to power its entire Data Pipeline infrastructure. We’ve built several components to power a modular architecture that allows our users to solve complex use cases that translate to critical product features. In this talk I’ll present a component called Stream SQL, a service built around the Flink SQL API. I’ll discuss how we integrated the Flink SQL API into Yelp’s Data Pipeline infrastructure, and the strategy we follow for Flink acceptance testing and deployment, which has evolved over time to meet a constant increase in demand and scale.
Enrico works as a tech lead of the Data Infrastructure at Yelp, designing, building and maintaining data streaming and real-time processing infrastructure. Since 2013 he’s been working on real-time processing systems, designing and scaling Yelp's data pipeline to move and process in real-time hundreds of terabytes of data and tens of billions of messages every day. Enrico loves designing robust software solutions for stream processing that scale and building tools to make application developers’ interaction with the infrastructure as simple as possible. At Yelp, Enrico has led the teams that build and maintain the Kafka and Flink deployments and the overall data pipeline. Enrico has previously spoken about Apache Flink, Apache Kafka and Apache Beam at Flink Forward SF, Berlin Buzzwords, Techsummit.io, ApacheCon and several meetups.
Use Case
Panel Discussion: Building Self Service Platforms for Apache Flink
October 9: 4:00 pm - 5:00 pm · 1 h · Community
Moderated by: Robert Metzger & Konstantin Knauf (Ververica)
This panel session will review lessons learned and best practices for building self-service platforms for Apache Flink. The panellists have gained many years of experience building internal and public Flink-based platforms.
We will compare the different solutions, the interfaces and abstractions they provide to their users, operational challenges as well as future plans.
Kenny has 18 years of experience with various database platforms behind some of the busiest datasets in the world. Most recently he co-founded ObjectRocket. He has had roles as Chief Technologist, Architect, Director, Manager, Developer, and DBA. He was a key member of the early teams that scaled PayPal and then eBay, ran one of the largest PostgreSQL installations on the planet, and was a very early adopter of and entrepreneur using MongoDB. He is an active database community member, speaker, and evangelist.
Loves vi.
Steven is a software engineer on the Data Processing Platform at Pinterest. He primarily works on Pinterest’s streaming platform, Xenon, and has helped Pinterest move from a Mesos-based micro-batch stream processing model to true streaming with Flink on YARN.
Enrico works as a tech lead of the Data Infrastructure at Yelp, designing, building and maintaining data streaming and real-time processing infrastructure. Since 2013 he’s been working on real-time processing systems, designing and scaling Yelp's data pipeline to move and process in real-time hundreds of terabytes of data and tens of billions of messages every day. Enrico loves designing robust software solutions for stream processing that scale and building tools to make application developers’ interaction with the infrastructure as simple as possible. At Yelp, Enrico has led the teams that build and maintain the Kafka and Flink deployments and the overall data pipeline. Enrico has previously spoken about Apache Flink, Apache Kafka and Apache Beam at Flink Forward SF, Berlin Buzzwords, Techsummit.io, ApacheCon and several meetups.
Ryan Nienhuis is a technical product manager who helps customers use the technology to deliver business value. He has created and managed cloud services products focused on analytics. Ryan has worked on the Amazon Kinesis team at AWS for the past five years, where he defines products that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.
Community
Run Interactive Queries with Apache Flink
October 9: 11:30 am - 11:50 am · 20 min · Technology Deep Dive
As a well-established stream processing engine, Flink has also kept expanding its horizon into the batch processing world. Among all the differences between batch and streaming applications, an important one is the query pattern. The continuous query is the native query pattern of stream processing: stream applications describe their query logic as a static DAG and submit it as a long-running job. In contrast, batch processing relies heavily on interactive queries, where later query logic may vary depending on the output of earlier queries. Hence the application logic is difficult to describe as a static DAG.
In this short talk we will introduce new Flink features that better support interactive queries, including the API semantics, the intermediate-result caching and metadata management mechanisms, and potential use cases. The audience will learn how Flink supports batch and stream processing at the same time, and how interactive queries can help users write their batch applications.
Jiangjie (Becket) is currently a software engineer at Alibaba, where he mostly focuses on the development of Apache Flink and its ecosystem. Prior to Alibaba, Becket worked at LinkedIn building streaming infrastructure around Apache Kafka, after receiving his Master’s degree from Carnegie Mellon University in 2014. Becket is a PMC member of Apache Kafka.
Technology Deep Dive
Query Pulsar Streams using Apache Flink
October 9: 12:00 pm - 12:40 pm · 40 min · Ecosystem
Both Apache Pulsar and Apache Flink share a similar view on how the data and computation levels of an application can be “streaming-first”, with batch as a special case of streaming. With Apache Pulsar’s segmented-stream storage and Apache Flink’s steps toward unifying batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale and build a real streaming warehouse.
In this talk, Sijie Guo from the Apache Pulsar community will share the latest integrations between Apache Pulsar and Apache Flink. He will explain how Apache Flink can integrate with and leverage Pulsar’s built-in, efficient schemas to let Flink SQL users query Pulsar streams in real time.
Sijie Guo is the founder of StreamNative. StreamNative is an infrastructure startup, focusing on building cloud native event streaming systems around Apache Pulsar. Previously, he was the tech lead for the Messaging Group at Twitter, and worked on push notification infrastructure at Yahoo. He is also the VP of Apache BookKeeper and PMC Member of Apache Pulsar.
Ecosystem
The role Stream Processing plays for an Opti Channel Experience
October 9: 1:40 pm - 2:20 pm · 40 min · Use Case
Customers want instant gratification and are pushing for their banking experiences to be enabled via the different platforms they frequently interact with, rather than through traditional banking apps – often referred to as an “Opti (Optimal) Channel Experience”.
This means organizations no longer own all the channels they offer services through; rather, ownership lies with the disparate data points being exposed through these channels.
This talk describes how Apache Kafka, Beam, and Flink are used for stream processing, how data is exposed through opti channels, and how cognitive automation and Artificial Intelligence are used to support these experiences.
Jamie has been working with data for 13+ years, began his data career as an ETL developer, and now finds himself in the world of stream processing.
Over the past 4 years Jamie has been leading stream processing implementations, from the design of the dataflows, to building the engineering capabilities to support the implementation, across two banks, in two geographical locations.
Use Case
From BaaB to EaaS in the Financial Industry
October 9: 2:30 pm - 3:10 pm · 40 min · Use Case
Is it possible to kill the batch way in the Financial Industry?
Indizen Technologies plays a key role in the architecture and innovation team of the corporate and investment banking department of the leading Spanish bank. For the last two years we have been working on the area’s leading streaming platform.
The goal is to change the BaaB (Batch as a Batch) approach into EaaS (Everything as a Stream). We will discuss the problems we faced in overcoming legacy dependencies to reach the real-time-first era, and the main benefits of our zeta architecture implementation, with Flink and Kafka as the main characters. This architecture has many special features, such as the separation of technology and business using a business rule engine, metadata evolution, real-time analytics, and the integration of many different projects with each other using event sourcing and CQRS patterns.
Last but not least, we will share some particular real-time use cases, learned over these years, aimed at reducing costs, operational risk, and reconciliation operations.
Big Data changed my life. I started working with the elephant and his friends in 2013 on one of the first big data projects in Spain, for Deutsche Bank. Since then I have had the opportunity to work with several teams in different countries, from MapReduce through Spark, and from 2017 until the present with Flink, designing and developing innovative solutions.
My current role is Big Data & Innovation Architect at Indizen Technologies, a Spanish company located in Madrid and Málaga specialized in R&D for financial services.
An innovation and technology enthusiast with a broad career (18+ years) as a Technical Expert and Leader, lately focused on helping companies take advantage of Big Data technologies in their business.
Use Case
Scotty: Efficient Window Aggregation with General Stream Slicing
October 9: 3:30 pm - 3:50 pm · 20 min · Research
Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, and minimizing memory usage.
However, each technique operates under different assumptions with respect to workload characteristics such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, session), windowing measures (e.g., time- or count-based), and stream (dis)order. Violating a technique’s assumptions can render it unusable or drastically reduce its performance.
In this talk, we present Scotty, an implementation of a general stream slicing technique for window aggregation. The technique automatically adapts to workload characteristics to improve performance without sacrificing general applicability. Our experiments show that Scotty outperforms alternative implementations, such as the default window operator in Flink, by up to an order of magnitude.
Furthermore, we show how to use Scotty as a library in Flink, Storm, or Beam without changing the underlying stream processing system.
General stream slicing was first published at EDBT 2019 (http://www.user.tu-berlin.de/powibol/assets/publications/traub-efficient-window-aggregation-with-general-stream-slicing-edbt-2019.pdf) where it received the Best Paper Award.
The Scotty library and its connectors are available as open-source (https://github.com/TU-Berlin-DIMA/scotty-window-processor) and contributions are highly welcome.
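As a rough illustration of the stream-slicing idea described above (a toy sketch only; Scotty’s actual Java API and data structures differ), events are pre-aggregated once into non-overlapping slices, and each overlapping sliding window is then composed by combining the slices it covers, instead of re-aggregating every event per window:

```python
# Toy sketch of stream slicing for window aggregation (sum as the aggregate).
# Each event is touched once per slice; sliding windows reuse the slices.

def slice_aggregate(events, slice_size):
    """Pre-aggregate (timestamp, value) events into non-overlapping slices."""
    slices = {}
    for ts, value in events:
        start = (ts // slice_size) * slice_size   # slice this event falls into
        slices[start] = slices.get(start, 0) + value
    return slices

def sliding_windows(slices, slice_size, window_size, window_slide, end_ts):
    """Compose sliding-window sums by combining pre-aggregated slices."""
    results = {}
    w_start = 0
    while w_start + window_size <= end_ts:
        total, s = 0, w_start
        while s < w_start + window_size:
            total += slices.get(s, 0)
            s += slice_size
        results[w_start] = total
        w_start += window_slide
    return results

events = [(0, 1), (3, 2), (7, 4), (12, 8)]            # (timestamp, value)
slices = slice_aggregate(events, slice_size=5)
print(slices)                                          # {0: 3, 5: 4, 10: 8}
print(sliding_windows(slices, 5, window_size=10, window_slide=5, end_ts=20))
# {0: 7, 5: 12, 10: 8}
```

With slicing, overlapping windows of size 10 sliding by 5 share the per-slice partial sums rather than each re-scanning the raw events.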
Jonas is a Research Associate at Technische Universität Berlin and the German Research Center for Artificial Intelligence (DFKI). His research interests include data stream processing, sensor data analysis, and data acquisition from sensor nodes. Jonas authored several publications related to data stream gathering, processing and transmission in the Internet of Things and will complete his PhD in March 2019 under the supervision of Volker Markl. Before he started his PhD, Jonas wrote his master thesis at the Royal Institute of Technology (KTH) and the Swedish Institute of Computer Science (SICS) / RISE in Stockholm under the supervision of Seif Haridi and Volker Markl, advised by Paris Carbone and Asterios Katsifodimos. Prior to that, he received his B.Sc. degree at Baden-Württemberg Cooperative State University (DHBW Stuttgart) and worked several years at IBM in Germany and the USA. Jonas is an alumnus of "Software Campus", "Studienstiftung des deutschen Volkes" and "Deutschlandstipendium".
Philipp is a Research Associate at Technische Universität Berlin and a PhD candidate supervised by Volker Markl. His research interests include data stream processing, query compilation, and the exploitation of modern hardware. Before joining TU Berlin, he has worked for several companies and collected experiences in frontend and backend software development. At the German Research Center for Artificial Intelligence, he joined a streaming systems oriented research project involving Apache Flink as a research assistant.
He graduated with a M.Sc. in computer science in March 2019 at TU-Berlin. Prior to that, he received his B.Sc degree at Hamburg University of Applied Sciences.
Research
When ordering matters
October 9: 4:00 pm - 4:40 pm · 40 min · Use Case
In the last decades many systems described as "queues" have been used (AQ, ActiveMQ, RabbitMQ, etc.), yet from a computer science perspective these are not queues at all. Many of us have learned to work quite effectively with these messaging systems, and we all understand that we can expect neither to receive messages in any particular order nor to get all messages exactly once – both of which we could expect from a true queue. With the arrival of Kafka and Flink, a new class of applications became possible. In this talk I will go into several real applications from the bol.com context that all revolve around low-latency behavioral analytics. I will talk about the entire end-to-end pipeline, from the web browser and application server to the analysis application, and discuss many of the things to think about when creating such an application. I will also touch upon using state machines as a way of doing this type of behavioral analysis with very simple software, and show example algorithms from our context.
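The state-machine approach mentioned above can be sketched as follows (an illustrative toy, not bol.com’s code; the states and events are hypothetical): a per-visitor machine advances through states as clickstream events arrive and flags when a pattern of interest completes.

```python
# Toy per-visitor state machine flagging a "browse -> cart -> buy" pattern.
# Unknown (state, event) pairs leave the state unchanged.
TRANSITIONS = {
    ("start",       "view"): "browsing",
    ("browsing",    "view"): "browsing",
    ("browsing",    "cart"): "considering",
    ("considering", "buy"):  "converted",
}

def run(events):
    state = "start"
    for event in events:
        state = TRANSITIONS.get((state, event), state)  # ignore irrelevant events
    return state

print(run(["view", "view", "cart", "buy"]))   # converted
print(run(["view", "cart"]))                  # considering
```

In a streaming job the current state would live in keyed state per visitor, updated by each incoming event.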
Niels Basjes (1971) has been working for bol.com since May 2008. Before that, he worked as a web analytics architect for Moniforce and as an IT architect/researcher at the National Aerospace Laboratory in Amsterdam. Since the second half of the 1990s he has been working on processing problems that require scalability. Over the past 20 years he has applied these concepts in aircraft/runway planning, IT operations, and web analytics, building reports for some of the biggest websites in the Netherlands. At bol.com, Niels’s primary focus is scalability problems, and he is responsible for a shift in thinking about data and the business value it contains. Niels designed and implemented many of the personalization algorithms that are in production today at bol.com. Niels studied Computer Science at TU Delft and holds a Business Administration degree from Nyenrode University. Niels is an active open-source developer, one of the Apache Avro PMC members, and has authored ( https://github.com/nielsbasjes/ ) and contributed various improvements and bugfixes to projects like Hadoop, HBase, Pig and Flink.
Use Case
Real-time Stream Analytics and Scoring Using Apache Flink, Druid & Cassandra at Deep.BI
October 9: 4:50 pm - 5:10 pm · 20 min · Use Case
One of the hardest challenges we are trying to solve is how to deliver customizable insights, based on billions of data points, in real time – insights that scale from the perspective of a single individual up to millions of users.
At Deep.BI we track user habits, engagement, and product and content performance, processing terabytes of data and billions of events daily. Our goal is to provide real-time insights based on custom metrics across a variety of self-created dimensions. The platform allows users to perform tasks from various domains, such as adjusting websites using real-time analytics, running AI-optimized marketing campaigns, providing a dynamic paywall based on user engagement and AI scoring, or detecting fraud based on data anomalies and adaptive patterns.
To accomplish this, our system collects every user interaction. We use Apache Flink for event enrichment, custom transformations, aggregations and serving machine learning models. The processed data is then indexed by Apache Druid for real-time analytics and Apache Cassandra for delivery of the scores. Historical data is also stored on Apache Hadoop for machine learning model building. Using the low-level DataStream API, custom Process Functions, and Broadcast State, we have built an abstract feature engineering framework that provides reusable templates for data transformations. This allowed us to easily define domain-specific features for analytics and machine learning, and to migrate our batch data preprocessing pipeline from Python jobs deployed on Apache Spark to Flink, resulting in a significant performance boost.
This talk covers our challenges with building and maintaining our platform and lessons learned along the way, namely how to:
- evolve a continuous application processing an unbounded data stream,
- provide an API for defining, updating and reusing features for machine learning,
- handle late events and state TTL,
- serve machine learning models with the lowest latency possible,
- dynamically update the business logic at runtime without a need of redeploy, and
- automate the data pipeline deployment.
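The "update business logic at runtime" point above follows the Broadcast State pattern: rules arrive on a broadcast stream and are applied to the data stream by every parallel instance. A minimal sketch (names and structure are hypothetical, not Deep.BI’s framework):

```python
# Minimal sketch of the Broadcast State pattern: a rules stream installs or
# replaces feature definitions at runtime; the data stream is evaluated
# against whatever rules are currently installed -- no redeploy needed.

class FeatureEngine:
    def __init__(self):
        self.rules = {}          # "broadcast state": feature name -> function

    def process_rule(self, name, fn):
        """Broadcast side: install or replace a rule at runtime."""
        self.rules[name] = fn

    def process_event(self, event):
        """Data side: compute every currently installed feature."""
        return {name: fn(event) for name, fn in self.rules.items()}

engine = FeatureEngine()
engine.process_rule("is_engaged", lambda e: e["page_views"] >= 5)
print(engine.process_event({"page_views": 7}))   # {'is_engaged': True}

# Tighten the rule while the "job" keeps running:
engine.process_rule("is_engaged", lambda e: e["page_views"] >= 10)
print(engine.process_event({"page_views": 7}))   # {'is_engaged': False}
```

In Flink the rules dictionary would be a `MapStateDescriptor`-backed broadcast state inside a `BroadcastProcessFunction`.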
Michał Ciesielczyk is a Machine Learning Engineer at Deep.BI. He is responsible for researching, building and integrating machine learning tools with a variety of technologies including Scala, Python, Flink, Kafka, Spark, and Cassandra. Previously, he worked as an assistant professor at Poznan University of Technology, where he received a Ph.D. in computer science and was a member of a research team working on numerous scientific and R&D projects. He has published more than 15 refereed journal and conference papers in the areas of recommender systems and machine learning.
Sebastian Zontek is the CEO, CTO and co-founder of Deep.BI, a Predictive Customer Data Platform with real-time user scoring. He is an experienced IT systems architect with particular emphasis on the production use of open-source big data systems such as Flink, Cassandra, Hadoop, Spark, Kafka, and Druid in BDaaS (Big Data as a Service), SaaS (Software as a Service), and PaaS (Platform as a Service) solutions. Previously, he was CEO and main platform architect at Advertine. The Advertine network matched product ads with user preferences, predicting purchasing intent using ML and NLP techniques.
Use Case
Making Sense of Streaming Sensor Data: How Uber Detects On-trip Car Crashes
October 9: 11:30 am - 11:50 am · 20 min · Use Case
Safety is one of the most crucial concerns of Uber’s ride-sharing platform. To improve the timeliness of responses to safety issues in daily business, Uber runs a Flink pipeline that joins multiple high-volume (>10 TB/day) streaming sources of sensor data with trip information in order to extract a number of contextual features. It deploys a deep learning model, on top of TensorFlow in Flink, to detect potential car crashes and identify general trip and driving anomalies. This knowledge is then passed to the business operations teams, who proactively reach out to riders and drivers to confirm their experiences and provide prompt safety assistance if needed.
Nikolas is a software engineer on Uber's Driving Safety team, where he works on using sensor/context-derived insights to make inferences about events that Uber drivers experience and acting on this knowledge accordingly. Previously, Nikolas studied Mathematics and Statistics at the University of Chicago.
Jin is a software engineer on Uber’s Driving Safety team. In particular, she works with safety-related activities that happen on-trip, including detecting distracted driving behavior and potential car crashes. She has a Computer Science degree from the University of Southern California. Previously, she worked at Mercedes-Benz R&D North America to collect streaming telematics data for business insight and product improvement. Both inside and outside her work, she enjoys cultivating her interest in driving, cars, and machine learning.
Use Case
Multi-tenanted streams @Workday
October 9: 12:00 pm - 12:40 pm · 40 min · Use Case
At Workday Inc., #1 on Fortune’s Future 50 list for 2018 (link: https://fortune.com/future-50/2018/workday/), we process data for a community of more than 39 million workers, including 40 percent of Fortune 500 organizations. Our success is driven by the trust our customers place in us, and we earn their confidence with strict security practices. This demands that we always encrypt customer data at rest and in transit: each piece of data must always be stored encrypted with the customer’s key.
This is a challenge in a data streaming platform like Flink, where data may be persisted in multiple places:
- storage of state in checkpoints or savepoints,
- temporary filesystem storage for time-window aggregation,
- spilling to disk when the heap is full.
On top of that, we need to consider that in a Flink dataflow the data may be manipulated, so we must maintain the context needed to encrypt it correctly.
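One way to keep that context with the data can be sketched as follows (illustrative only, not Workday’s implementation; the master key, derivation scheme, and record layout are assumptions): derive a per-customer key from a master secret and carry the tenant id alongside each record, so the right key is available wherever the record is persisted. Real encryption of the payload is elided here; an HMAC tag stands in for the cryptographic step.

```python
import hmac
import hashlib

# Hypothetical master key; in practice this would live in a KMS/HSM.
MASTER_KEY = b"master-secret"

def tenant_key(tenant_id: str) -> bytes:
    """HMAC-based derivation of a per-tenant key (a simple KDF stand-in)."""
    return hmac.new(MASTER_KEY, tenant_id.encode(), hashlib.sha256).digest()

def seal(tenant_id: str, payload: bytes) -> dict:
    """Wrap a record with its tenant context plus an integrity tag.
    A real system would encrypt `payload` with the tenant key here."""
    tag = hmac.new(tenant_key(tenant_id), payload, hashlib.sha256).hexdigest()
    return {"tenant": tenant_id, "payload": payload, "tag": tag}

record = seal("acme", b"salary=42")
print(record["tenant"])   # acme
```

Because the tenant id travels with the record, a serializer (or state backend hook) can look up the correct key at every persistence point.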
Come join us to see how we solved these challenges to provide a secure platform supporting our Machine Learning organization, how we extended the Avro libraries to enable encryption at serialization time, and how we support data traceability for GDPR.
Agnoli Enrico is a Software Engineer at Workday. Over the last 5 years he has worked on multiple technical projects as a developer, tech lead, and people manager. Current involvements:
- As architect and developer, technically leading the delivery of a new data streaming platform to support ML
- Investigating new technologies and delivering POCs for possible new tools/products, such as streaming platforms, blockchain, auditability of machine learning models, and data security
- As part of the Workday Giving&Doing foundation, helping to organize events and raise awareness for various causes and nonprofit groups.
He studied at the Politecnico di Milano and moved to Germany right after, working first on Honda’s ASIMO humanoid robots, then on automation software in one of Europe’s biggest datacenters for Amadeus, and finally for Workday, #1 on Fortune’s Future 50 list for 2018. He was Workday’s Innovator of the Year in 2018 for a research project on blockchain.
Leire has been a Software Engineer at Workday for the last 4 years, though she has been immersed in software development for over a decade, performing multiple roles as developer, tech lead, and mentor.
Leire is passionate about building quality code, from conception through implementation, testing and delivery. Being the newest member of the Data Streaming Platform team in Workday, she's excited to be given the opportunity to work with Apache Flink and explore the possibilities and challenges that it has to offer.
When not behind a computer, Leire enjoys ‘all things outdoors’, with a bit of circus arts on the side.
Use Case
Extending Flink state serialization for better performance and smaller checkpoint size
October 9: 1:40 pm - 2:20 pm · 40 min · Operations
Operations on Flink state are a common source of performance issues for a typical stateful stream processing application. One tiny mistake can easily make your job spend most of its precious CPU time on serialization and inflate the checkpoint size to the sky. In this talk we’ll focus on the Flink serialization framework and common problems around it:
* Is the Kryo fallback really that expensive from a CPU and state-size perspective?
* How to plug your own or existing serializers (like protobuf) into Flink.
* Using Scala sealed traits without Kryo fallback.
* Using custom variable-length integer encoding and delta encoding for primitive arrays to further reduce state size.
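The last point can be sketched as follows (illustrative and independent of Flink’s TypeSerializer API): variable-length integers spend fewer bytes on small values, and delta-encoding a sorted array first makes most values small.

```python
# Unsigned LEB128-style varint plus delta encoding for sorted integer arrays.

def varint_encode(n: int) -> bytes:
    """Emit 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def delta_varint(values):
    """Encode a sorted list as varint-encoded deltas from the previous value."""
    out = bytearray()
    prev = 0
    for v in values:
        out += varint_encode(v - prev)
        prev = v
    return bytes(out)

timestamps = [1_000_000, 1_000_003, 1_000_010]
encoded = delta_varint(timestamps)
print(len(encoded))   # 5 -- versus 24 bytes for three fixed 64-bit longs
```

The first value costs 3 bytes, but each subsequent delta fits in a single byte, which is where the state-size savings come from on timestamp-like data.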
Roman Grebennikov is a passionate software developer from Russia with hands-on experience in software development, the JVM, and high-performance computing. In recent years he has focused on bringing functional programming principles and practices to real-world data analysis and machine learning projects.
Operations
State Unlocked
October 9: 2:30 pm - 3:10 pm · 40 min · Operations
As stateful stream processing becomes more and more mature for complex event-driven applications and real-time analytics, users have put Apache Flink at the center of their business logic and entrusted it with managing their most valuable asset, their application data, as internal state of Flink streaming pipelines.
At the same time, the Flink community has continued efforts to make sure that users feel safe and future-proof in doing so. Users should have sufficient means to access and modify their state, and it should be much easier to bootstrap state with existing data from external systems. These efforts span multiple major Flink releases and consist of the following: 1) evolvable state schema, 2) flexibility in swapping state backends, and 3) an offline tool (currently named the “Savepoint Connector” in the community) to read, process, and create new snapshots from which streaming applications can bootstrap their state.
In this talk, we will go over these topics and demonstrate how users can interact with state using these new features. A demo will showcase usage of the Savepoint Connector.
Tzu-Li (Gordon) Tai is an Apache Flink PMC member and software engineer at Ververica. His main contributions in Apache Flink include work on some of the most widely used Flink connectors (Apache Kafka, AWS Kinesis, Elasticsearch). Gordon has spoken at conferences such as Flink Forward and Strata Data, as well as several Taiwan-based conferences on the Hadoop ecosystem and data engineering in general.
Seth Wiesman is a Solutions Architect at Ververica, where he works with engineering teams inside of various organizations to build the best possible stream processing architecture for their use cases.
Operations
Sponsored Talk: Running Apache Flink on AWS
October 9: 3:30 pm - 3:50 pm · 20 min · Other
In this sponsored talk, we will describe different options for running Apache Flink on AWS and the advantages of each, including Amazon EMR, Amazon Elastic Kubernetes Service (EKS), and Amazon Kinesis Data Analytics. We will then go in depth on how Amazon Kinesis Data Analytics uses EKS under the covers to provide you with a stateful and serverless stream processing service.
Ryan Nienhuis is a technical product manager who helps customers use the technology to deliver business value. He has created and managed cloud services products focused on analytics. Ryan has worked on the Amazon Kinesis team at AWS for the past five years, where he defines products that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.
Other
Event Streaming Architecture for Industry 4.0
October 9: 4:00 pm - 4:40 pm · 40 min · Use Case
New use cases under the Industry 4.0 umbrella are playing a key role in improving factory operations, process optimization, cost reduction, and quality. We propose an event streaming architecture that streamlines the information flow all the way from the factory to the main data center. Such a streaming architecture enables a manufacturer to react faster to critical operational events. However, it presents two main challenges:
- Data acquisition in real time: data should be collected regardless of its location or access challenges. It is commonplace to ingest data from hundreds of heterogeneous sources (ERP, MES, sensors, maintenance systems, etc.).
- Event processing in real time: events collected from different parts of the organization should be combined into actionable insights in real time. This is extremely challenging in a context where events can be lost or delayed.
In this talk, we show how Apache NiFi and MiNiFi can be used to collect data from a wide range of sources in real time, connecting the industrial and information worlds. Then, we show how Apache Flink’s unique features enable us to make sense of this data. For instance, we will explain how Flink’s time management – event-time mode, late-arrival handling, and the watermark mechanism – can address the challenge of processing IoT data originating from geographically distributed plants. Finally, we demonstrate an end-to-end streaming architecture for Industry 4.0 based on the Cloudera DataFlow platform.
Abdelkrim is a senior data streaming specialist at Cloudera with 10 years of experience with distributed systems (big data, IoT, peer-to-peer, and cloud). Previously, he held several positions including big data lead, CTO, and software engineer at several companies. He has spoken at various international conferences and published several scientific papers in well-known IEEE and ACM journals. Abdelkrim holds PhD, MSc, and MSE degrees in computer science.
Jan Kunigk holds a B.Sc. in Computer Science from DHBW Mannheim and started his career with distributed systems at IBM in 2005. Ever since then he has been busy with (Tera) bytes flying by. He led T-Systems' introduction of Hadoop hosting services in 2013 and joined Cloudera in 2014. At Cloudera, Jan has helped customers in all industries to be successful with large scale data processing projects in all industries as a solutions architect. Currently, Jan serves as Field Chief Technology Officer for EMEA. He is also a co-author of O'Reilly's "Architecting Modern Data Platforms".
Use Case
Faster checkpointing through unaligned checkpoints
October 9: 4:50 pm - 5:10 pm · 20 min · Technology Deep Dive
Flink is a state-of-the-art stream processing engine with exactly-once semantics. At the core of providing exactly-once guarantees under failure is Flink’s checkpointing feature. Checkpointing is implemented in a way that provides maximal throughput with minimal overhead and no additional IO access.
However, the current checkpointing approach comes with some challenges, and the particularly tricky one on which we focus in this talk is checkpoint barrier alignment.
I will explain what checkpoint barrier alignment is, in which scenarios it can cause problems, and what those problems are. Next, we will present unaligned checkpoints, a new feature that tackles those problems. We will compare checkpointing with and without alignment and discuss the implications for throughput, latency, and resource usage.
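To make the problem concrete, here is a toy Python model of what alignment costs a two-input operator: once a barrier has arrived on one channel, further records on that channel must be buffered until the barrier arrives on the other channel as well. This is a deliberate simplification for intuition, not Flink's actual implementation.

```python
def align(channel_a, channel_b, barrier="B"):
    """Toy model of aligned checkpointing for a two-input operator.

    Each channel is a list of records; the string `barrier` marks the
    checkpoint barrier. Records arriving on a channel after its barrier
    must be buffered until the barrier shows up on the other channel
    too. Returns (processed_before_snapshot, buffered). Records still
    queued when both barriers are in would be handled after the
    snapshot, so they are left untouched here.
    """
    processed, buffered = [], []
    seen = {"a": False, "b": False}
    # Interleave the channels round-robin to mimic arrival order.
    queues = {"a": list(channel_a), "b": list(channel_b)}
    while any(queues.values()) and not all(seen.values()):
        for name in ("a", "b"):
            if not queues[name]:
                continue
            rec = queues[name].pop(0)
            if rec == barrier:
                seen[name] = True
            elif seen[name]:
                buffered.append(rec)   # channel blocked: its barrier already arrived
            else:
                processed.append(rec)
    return processed, buffered

# Channel "a" delivers its barrier early; record 2 arrives after it
# and must be held back until channel "b"'s barrier shows up.
processed, buffered = align([1, "B", 2, 3], [10, 11, "B"])
print(processed)  # [1, 10, 11]
print(buffered)   # [2]
```

It is exactly this buffering stall on the fast channel that unaligned checkpoints are designed to avoid.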
Piotr Nowojski is a Software Engineer at Ververica and a Flink committer, working mostly on Flink’s runtime code. Previously, he was a Software Engineer at Teradata working on Presto, a distributed batch SQL query engine.
Technology Deep Dive
Demystifying Flink Memory Allocation and tuning
October 9: 11:30 am - 11:50 am | 20min | Technology Deep Dive
Ever tried to get clarity on what kinds of memory there are and how to tune each of them? If not, your jobs are very likely configured incorrectly. As we found out, it is not straightforward, and it is not well documented either. This session covers the types of memory to be aware of, the calculations involved in determining how much is allocated to each type, and how to tune them depending on the use case.
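As a taste of the knobs involved, here is an illustrative flink-conf.yaml fragment using Flink 1.9-era option names. The values are made-up examples, not recommendations from this talk, and later Flink releases reworked these options, so consult the configuration reference for your version:

```yaml
# Illustrative only -- Flink 1.9-era memory options, example values.
taskmanager.heap.size: 4096m              # total JVM heap per TaskManager
taskmanager.memory.fraction: 0.7          # share of free heap for managed memory
taskmanager.memory.off-heap: false        # keep managed memory on-heap
taskmanager.network.memory.fraction: 0.1  # share of JVM memory for network buffers
taskmanager.network.memory.min: 64mb
taskmanager.network.memory.max: 1gb
```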
Roshan is a technical lead on Uber's stream processing platform team (Athena), working on problems of stream processing at scale. He was previously at Hortonworks, where he architected Storm 2.0's new high-performance execution engine and authored Hive's transactional streaming ingest APIs. He is a committer on Flume, Streamline, and Storm. He is also the author of Castor, an open-source C++ library that brings the logic paradigm to C++.
Technology Deep Dive
Change data capture in production with Apache Flink
October 9: 12:00 pm - 12:40 pm | 40min | Use Case
Drive the business with your KPIs. That is what we aimed to do at OVH. As an 18-year-old company and a fairly large cloud provider, we encountered several issues on this long journey to set up change data capture and a data-driven culture.
Getting data from thousands of tables into one place and keeping it all up to date was not possible without a strong streaming engine like Apache Flink.
We will present our current production pipeline with its pros and cons: from data collection done directly from the databases' binary logs, to continuous writes into Apache Hive on a Kerberized, cloud-based Apache Hadoop cluster. We will describe how we handle schema transcription, event lifecycles, stream partitioning, and event ordering using watermarks and window aggregation, all in a transactional way, up to the point where the data becomes available on the user side.
Finally, we will introduce our cloud-only production infrastructure, and how we operate and monitor it.
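For readers new to these terms, the following toy Python sketch shows the event-time mechanics the abstract refers to: tumbling windows that fire once a watermark passes their end, with too-late events set aside. This is a hypothetical simplification for intuition, not OVH's pipeline or Flink's API.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size, max_lateness):
    """Count events per tumbling event-time window.

    `events` is a list of (timestamp, key) pairs in arrival order.
    The watermark trails the largest timestamp seen by `max_lateness`;
    a window [start, start + window_size) fires once the watermark
    passes its end, and events behind the watermark are treated as late.
    """
    windows = defaultdict(int)   # window start -> running count
    emitted = {}                 # fired windows
    late = []                    # events that arrived after their window closed
    watermark = float("-inf")
    for ts, key in events:
        if ts < watermark:
            late.append((ts, key))
            continue
        windows[(ts // window_size) * window_size] += 1
        watermark = max(watermark, ts - max_lateness)
        # Fire every window whose end the watermark has passed.
        for s in sorted(w for w in windows if w + window_size <= watermark):
            emitted[s] = windows.pop(s)
    return emitted, late

# The event at t=2 arrives after the watermark has advanced past it.
emitted, late = tumbling_window_counts(
    [(1, "x"), (3, "x"), (12, "x"), (2, "x"), (25, "x")],
    window_size=10, max_lateness=5)
print(emitted)  # {0: 2, 10: 1}
print(late)     # [(2, 'x')]
```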
David is a big data DevOps engineer in the Data Convergence team at OVH. He works on building architectures for OVH products around data (ingestion, analytics, storage, processing). He was introduced to big data with Hadoop 6 years ago and fell in love with its dynamic ecosystem. Since then he’s been working with all kinds of systems dealing with data, with plenty of technical challenges along the way.
Yann is a senior software engineer in the Data Convergence team at OVH, working on creating products around data ingestion, data lakes, and analytics platforms. More focused on the backend side of things, he is passionate about API design, modularity, and performance, a passion he shares with his students as a teacher at the University of Brest (France).
Use Case
Kubernetes + Operator + PaaSTA = Flink @ Yelp
October 9: 1:40 pm - 2:20 pm | 40min | Operations
At Yelp we run hundreds of Flink jobs to power a wide range of applications: push notifications, data replication, ETL, sessionizing, and more. Routine operations like deploys, restarts, and savepointing for so many jobs would take quite a bit of developers’ time without the right degree of automation. The latest addition to our toolshed is a Kubernetes operator managing the deployment and the lifetime of Flink clusters on PaaSTA, Yelp’s Platform as a Service.
We replaced our deployment framework, which launched Flink clusters on top of AWS EMR, with a Kubernetes operator managing fully Dockerized Flink clusters. Compared to EMR, this architecture allowed us both to drastically reduce the deployment time of our Flink clusters and to share our hardware resources more efficiently. In addition, we now offer our developers the same interface they are used to for running REST services, batch jobs, and many other workloads on PaaSTA.
This talk will give a brief overview of Yelp’s PaaSTA before diving into the details of how the Kubernetes operator has been implemented and how it has been integrated into Yelp developers’ workflow (deploys, logs, savepoints, upgrades, etc.), ending with a glimpse of the future features we are planning for the operator (Flink as a library, autoscaling, etc.).
Writing code and tinkering with computers for a living, writing code and tinkering with computers for fun. Still uncertain whether he’s a Software Engineer, a Systems Engineer or a Software Reliability Engineer, keeps telling people he’s one of the computer guys at Yelp. Mainly interested in distributed systems and stream processing, has a taste for open-source software.
Operations
FlinkDTW: time-series pattern search at scale using Dynamic Time Warping
October 9: 2:30 pm - 3:10 pm | 40min | Ecosystem
Dynamic Time Warping (DTW) is a well-known method for finding patterns within a time series. It can find a pattern even if the data are distorted. It can be used to detect trends in sales, defects in machine signals in industry, electrocardiogram patterns in medicine, DNA…
Most implementations are usually very slow, but a very efficient open-source implementation (best paper at SIGKDD 2012) exists in C. It can easily be ported to other languages, such as Java, so that it can then be used in Flink.
We present the slight modifications we made so that it can be used with Flink at even greater scale to return the top-k best matches on past or streaming data.
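For readers unfamiliar with DTW, here is a minimal Python sketch of the classic dynamic programme. This is the naive O(n*m) version; the efficient C implementation the abstract mentions adds lower bounds and early abandoning on top of this idea.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Unlike Euclidean distance, DTW may stretch or compress one sequence
    against the other, so distorted patterns still match.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = DTW distance between prefixes a[:i] and b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# The second sequence repeats a value; warping absorbs the distortion.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([0, 0], [1, 1]))           # 2.0
```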
Christophe Salperwyck started as a software engineer and then moved to machine learning. He specialised in machine learning on streaming data during his PhD at Orange. He is also interested in designing algorithms that scale, such as CourboSpark, an adaptation of Spark's decision trees to time series, built for EDF. There he also worked on creating a data lake for 30 years of historical power plant data, mainly in HBase: 1,000 billion points / 100 TB of data.
Ecosystem
Tips and Tricks for Developing Streaming and Table Connectors
October 9: 3:30 pm - 3:50 pm | 20min | Technology Deep Dive
In this session we share tips for developing an effective connector for Apache Flink. Topics include how to develop data sources and data sinks for the Flink Streaming API and the Table API, and how to support data serialization, parallelism, exactly-once semantics using checkpoints, event time, and metrics.
Technology Deep Dive
What's new in 1.9.0 blink planner
October 9: 4:00 pm - 4:40 pm | 40min | Technology Deep Dive
Flink 1.9.0 added the ability to support multiple SQL planners under the same API. With its help, we successfully merged many features from Alibaba's internal Flink version, called Blink. In this talk, I will give an introduction to the architecture of the Blink planner and share the functionality and performance enhancements we added.
I work on the real-time compute team at Alibaba, mostly focused on building a unified, high-performance SQL engine based on Apache Flink.
Technology Deep Dive
AStream: Ad-hoc Shared Stream Processing
October 9: 4:50 pm - 5:10 pm | 20min | Research
In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also with thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for ad-hoc stream queries in a multi-user environment.
The goal of this talk is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: ad-hoc query processing should be a composable layer that can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: ad-hoc query creation and deletion must be performed in a consistent manner, ensuring exactly-once semantics and correctness; (3) Performance: in contrast to state-of-the-art SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared-computation stream processing framework.
To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream delivers performance comparable to Flink for single-query deployments and outperforms it by orders of magnitude with multiple queries.
Check-in & Breakfast
October 9: 8:30 am - 9:30 am | 1h 0min | Other
Keynote: Customer Journey with streaming data on AWS
October 9: 9:30 am - 10:15 am | 45min | Keynote
Amazon Web Services (AWS) offers over 165 fully featured cloud services from data centers globally. AWS launched its first data streaming service, Amazon Kinesis Data Streams, over five years ago. Now, customers are using streaming data across most AWS services, including two that support running Apache Flink: Amazon EMR and Amazon Kinesis Data Analytics. In this keynote, we will describe how customers and their use of streaming data have evolved on AWS. We will look at how streaming data and Apache Flink are used externally and internally at AWS, and where we see usage of Apache Flink growing.
Rahul Pathak is currently General Manager of Databases, Analytics, and Blockchain at AWS. He owns Amazon Managed Blockchain, Athena, EMR, DocumentDB, Neptune, and Timestream at AWS. During his 7+ years at AWS, Rahul has focused on managed database and analytics services. Prior to his current role, he was the GM for AWS Glue and Lake Formation, and Principal Product Manager for Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service in the cloud. He has also worked on Amazon ElastiCache, Amazon RDS, and Amazon RDS Provisioned IOPS. Rahul has over twenty years of experience in technology and has co-founded two companies, one focused on digital media analytics and the other on IP geolocation. He holds a degree in Computer Science from MIT and an Executive MBA from the University of Washington.
Keynote
Keynote: Building and operating a serverless streaming runtime for Apache Beam in the Google Cloud
October 9: 10:15 am - 11:00 am | 45min | Keynote
Apache Beam is Flink’s sibling in the Apache family of stream processing frameworks. The Beam and Flink teams work closely together on advancing what is possible in stream processing, including streaming SQL extensions and code interoperability on both platforms.
Beam was originally developed at Google as the amalgamation of its internal batch and streaming frameworks to power the exabyte-scale data processing for Gmail, YouTube, and Ads. It now powers Google Cloud Dataflow, a fully managed, serverless service, and is also available to run in other public clouds and on-premises when deployed in portability mode on Apache Flink, Spark, Samza, and other runners. Users regularly run distributed data processing jobs on Beam spanning tens of thousands of CPU cores and processing millions of events per second.
In this session, Sergei Sokolenko, Cloud Dataflow product manager, and Reuven Lax, a founding member of the Dataflow and Beam team, will share Google’s learnings from building and operating a global stream processing infrastructure shared by thousands of customers, including:
- safe deployment to dozens of geographic locations,
- resource autoscaling to minimize processing costs,
- separating compute and state storage for better scaling behavior,
- dynamic work rebalancing of work items away from overutilized worker nodes,
- offering a throughput-optimized batch processing capability with the same API as streaming,
- grouping and joining hundreds of terabytes in a hybrid in-memory/on-disk file system,
- integrating with the Google Cloud security ecosystem, and other lessons.
Customers benefit from these advances through faster execution of jobs, resource savings, and a fully managed data processing environment that runs in the Cloud and removes the need to manage infrastructure.
Sergei is the Product Manager for Cloud Dataflow, Google’s serverless, fully-managed service for streaming analytics. Dataflow offers advanced resource usage and execution time optimization techniques including autoscaling and fully-integrated batch processing. Sergei holds an MBA degree from the Wharton School, and a Computer Science degree from the Technical University of Munich, Germany.
Reuven Lax is a senior staff software engineer at Google. He has been at Google since 2006 and involved in designing and building Google's streaming data processing infrastructure since 2008, serving as technical lead for MillWheel and leading development of Dataflow's streaming engine.
Keynote
Lunch Talk: Non-code Contributions as a Road to Diversity
October 9: 1:00 pm - 1:30 pm | 30min | Community
Open source draws its power and strength from the communities that build it. Projects that welcome a diversity of perspectives have stronger communities and ecosystems. In turn, a track record of contributions to open source is often a critical first step in building a career in technology. Yet not everybody in the community feels empowered or welcome to contribute to open source projects, and that hurts us all.
In this talk, we will cover what exactly non-code contributions are, how they are important to project health, and provide concrete recommendations and best practices for building a welcoming environment for all kinds of contributions. We will also give project maintainers tips on how to recognize and reward the non-code contributions that help an open source community flourish.
Aizhamal is an open source enthusiast and a committer to Apache Airflow. She helps build healthy open source communities and improve contributor experience, and advocates for documentation and recognition of non-code contributions. In her free time she watches too many movies, follows football (she’s a fan of Messi), dances salsa and bakes lava cakes.
Community