
【Effective Data Science Infrastructure】Stack

Good methodology plus good tools equals sustainable, efficient production.


Update history

  • 2022.07.24: First draft completed

Impressions

This book is essentially the manual for Metaflow, but it embeds the framework in a broader way of thinking about architecture, which makes it worth reading.

Reading notes

The following are brief translations of the book's key points.

Introducing data science infrastructure

The technology stack (the higher the layer, the more data-science-specific it is):

  1. Model development
  2. Feature engineering
  3. Model operations
  4. Architecture: Metaflow
  5. Versioning: A/B testing
  6. Job scheduler: workflow orchestration
  7. Compute resources: CPU + GPU
  8. Data warehouse: a single centralized data warehouse

Different roles focus on different parts of the stack:

  • Data science (1-3): iteration and experimentation
  • Software architecture (3-6): integrations
  • Foundational infrastructure (6-8): data and compute

Why do we need good architecture? Because we must manage complexity: N data science teams × N product engineering teams × N scenarios × N datasets × N versions × N workflows × N data products × N customers.

  • Implementation: Designing and implementing infrastructure that deals with this level of complexity is a nontrivial task.
  • Usability: It is a key challenge of effective infrastructure to make data scientists productive despite the complexities involved, which is a key motivation for human-centric infrastructure.
  • Operations: How do we keep the machines humming with minimal human intervention? Reducing the operational burden of data science applications is another key goal of the infrastructure.

The toolchain of data science

  • Data scientists need a development environment that provides excellent ergonomics for the following two key activities:
    • Prototyping loop: writing, evaluating, and analyzing application code
    • Interaction with production deployments: deploying, monitoring, and debugging production applications
  • Workflows are a useful abstraction for structuring data science applications. Workflows provide a number of benefits: they are easy to understand and explain, they help in managing complexity as the number of data science applications grows, and they can make execution more scalable and performant
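
The workflow abstraction described above can be sketched in plain Python. This is a hypothetical minimal step graph for illustration, not Metaflow's actual implementation: each step names its successor, and a tiny runner executes the chain while passing state along.

```python
# Hypothetical minimal workflow: steps share a state dict and each
# step returns the name of the next step (None means done).

def start(state):
    state["raw"] = [1, 2, 3]
    return "transform"

def transform(state):
    state["squared"] = [x * x for x in state["raw"]]
    return "end"

def end(state):
    return None  # terminal step

STEPS = {"start": start, "transform": transform, "end": end}

def run(flow, first="start"):
    state, step = {}, first
    while step is not None:
        step = flow[step](state)
    return state

result = run(STEPS)
print(result["squared"])  # [1, 4, 9]
```

Making the step graph explicit is what lets an orchestrator retry, resume, or parallelize individual steps.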

Introducing Metaflow

The thought process when starting a new project:

  1. What business problem are we solving?
  2. What input data is available, where does it live, and how do we read it?
  3. How should the output data be processed and written?
  4. What techniques can improve the quality of the output data?

How to do parallel computation with Metaflow (the same design idea applies to other frameworks):

  • Use branches to make your application more understandable by making data dependencies explicit as well as to achieve higher performance
  • You can run either one operation on multiple pieces of data using dynamic branches, or you can run many distinct operations in parallel using static branches
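
The two branching styles can be sketched with the standard library. This is a conceptual sketch of the fan-out/join shape, not Metaflow's actual foreach/branch syntax:

```python
from concurrent.futures import ThreadPoolExecutor

# Dynamic branch: one operation mapped over many shards of data.
def score(shard):
    return sum(shard)

shards = [[1, 2], [3, 4], [5, 6]]

# Static branches: a fixed set of distinct operations run in parallel.
def fit_model(data):
    return ("model", len(data))

def compute_stats(data):
    return ("stats", max(data))

with ThreadPoolExecutor() as pool:
    dynamic_results = list(pool.map(score, shards))          # fan-out over data
    branch_a = pool.submit(fit_model, [1, 2, 3])             # distinct op 1
    branch_b = pool.submit(compute_stats, [1, 2, 3])         # distinct op 2
    static_results = [branch_a.result(), branch_b.result()]  # join

print(dynamic_results)  # [3, 7, 11]
print(static_results)   # [('model', 3), ('stats', 3)]
```

In both cases the join step that collects results is what makes the data dependencies explicit.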

How to develop a simple end-to-end application

  • It is best to develop applications iteratively
  • Use resume to continue execution quickly after failures
  • Metaflow is designed to be used with off-the-shelf data science libraries like Scikit-Learn

Scaling with the compute layer

4 Vs:

  • Volume: support a large number of data science applications
  • Velocity: prototype quickly and productionize data science applications
  • Validity: results are valid and consistent
  • Variety: support different kinds of data science models and applications

To support scalability, every layer itself needs to be scalable; the levels below become progressively more fine-grained:

  1. Organization -> projects++ -> people++
  2. Project -> versions++ -> instances++
  3. Version -> workflows++ -> instances++
  4. Workflow -> tasks++ -> instances++
  5. Task -> data++ -> cpu/gpu/ram++
  6. Algorithm -> data++ -> cpu/gpu/ram++

The corresponding infrastructure layers are:

  • Architecture and Versioning(1-3)
  • Orchestration(2-4)
  • Data and Compute(4-6)

How should different compute layers be evaluated? A few dimensions:

  • Workload support: e.g., does it support GPUs?
  • Latency: how quickly can a task start?
  • Workload management: how are tasks managed?
  • Cost-efficiency: resource utilization
  • Operational complexity: the operational burden

A comparison of the different options along these dimensions:

  • K8s
    • Workload support: general-purpose
    • Latency: primarily a container orchestration system, so latency depends mainly on the cluster management approach
    • Workload management: can be paired with any workflow system
    • Cost efficiency: configurable, depends mainly on the cluster management approach
    • Operational complexity: high, with a steep learning curve
  • AWS Batch
    • Workload support: general-purpose
    • Latency: relatively high; designed mainly for batch processing
    • Workload management: built-in job queues
    • Cost efficiency: configurable; can use any EC2 instance type
    • Operational complexity: low; easy to set up and nearly maintenance-free
  • AWS Lambda
    • Workload support: lightweight tasks only
    • Latency: low; starts within seconds
    • Workload management: includes a queue in asynchronous mode
    • Cost efficiency: excellent; you pay almost exclusively for what you use
    • Operational complexity: very low; nearly maintenance-free
  • Apache Spark
    • Workload support: Spark jobs only
    • Latency: depends on the cluster management system
    • Workload management: built-in job queues
    • Cost efficiency: configurable, depends on the cluster setup
    • Operational complexity: relatively high; Spark is a complex engine that requires expertise to operate
  • Distributed training platforms
    • Workload support: very limited
    • Latency: high; optimized for batch processing
    • Workload management: managed at the task level
    • Cost efficiency: typically very expensive
    • Operational complexity: relatively high, although cloud services usually provide a simplified operations console
  • Local processes
    • Workload support: general-purpose
    • Latency: very low; processes start instantly
    • Workload management: configurable; none by default
    • Cost efficiency: cheap, but compute capacity is limited
    • Operational complexity: moderate; workstations need maintenance and debugging

Practicing scalability and performance

An effective workflow:

  1. Start with the simplest possible approach. A simple, obviously correct solution provides a robust foundation for gradual optimization.
  2. If you are concerned that the approach is not scalable, think about when and how you will hit the limits in practice. If the answer is never, or at least not any time soon, increase complexity only when it becomes necessary.
  3. Use vertical scalability to make the simple version work with realistic input data.
  4. If the initial implementation can’t take advantage of hardware resources provided by vertical scalability, consider using an off-the-shelf optimized library that can.
  5. If the workflow contains embarrassingly parallel parts and/or data can be easily sharded, leverage horizontal scalability for parallelism.
  6. If the workflow is still too slow, carefully analyze where the bottlenecks lie. Consider whether simple performance optimizations could remove the bottleneck, maybe using one of the tools from the Python data science toolkit.
  7. If the workflow is still too slow, which is rare, consider using specialized compute layers that can leverage distributed algorithms and specialized hardware.
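
Step 6 above (finding the bottleneck before reaching for distributed tools) can be sketched with the standard library profiler. The two-stage pipeline here is hypothetical, purely for illustration:

```python
import cProfile
import io
import pstats

# Hypothetical two-stage pipeline: profiling shows which stage dominates.
def load(n):
    return list(range(n))

def heavy_transform(rows):
    return [x * x for x in rows]  # the likely bottleneck at scale

def pipeline(n):
    return heavy_transform(load(n))

profiler = cProfile.Profile()
profiler.enable()
result = pipeline(100_000)
profiler.disable()

# Print the top functions by cumulative time.
stats = io.StringIO()
pstats.Stats(profiler, stream=stats).sort_stats("cumulative").print_stats(5)
print(stats.getvalue())
```

If the profile shows one function dominating, a targeted optimization (or an optimized library) usually beats jumping straight to a specialized compute layer.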

Going to production

  • Using a centralized metadata server helps to track all executions and artifacts across all projects, users, and production deployments.
  • Leverage a highly available, scalable production scheduler like AWS Step Functions to execute workflows on a schedule without human supervision.
  • Use the @schedule decorator to make workflows run automatically on a predefined schedule (the platform must also provide an automatic scheduling mechanism).
  • Use containers and the @conda decorator to manage third-party dependencies in production deployments.
  • User namespaces help isolate prototypes that users run on their local workstations, making sure that prototypes don’t interfere with each other.
  • Production deployments get a namespace of their own, isolated from prototypes. New users must obtain a production token to deploy new versions to production, which prevents accidental overwrites.

Processing data

Two different kinds of applications, with different inputs and outputs:

  • Analytics Application: Complex query -> Data Warehouse -> Small result -> Dashboard
  • ML Application: Simple query -> Data Warehouse -> Large result -> Workflow
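
The contrast can be illustrated with sqlite3. The events table, its columns, and the queries are made up for this sketch:

```python
import sqlite3

# Hypothetical events table to contrast the two access patterns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(u, float(a)) for u in range(100) for a in range(10)],
)

# Analytics application: complex aggregation, small result (for a dashboard).
small = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id "
    "ORDER BY SUM(amount) DESC LIMIT 3"
).fetchall()

# ML application: simple query, large result (training data for a workflow).
large = conn.execute("SELECT user_id, amount FROM events").fetchall()

print(len(small), len(large))  # 3 1000
```

The asymmetry is the point: the warehouse does the heavy lifting for analytics, while ML workflows pull large raw extracts and do the heavy lifting themselves.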

The modern data stack, from the inside out:

  • Data: the core
  • Durable storage: Iceberg, storing Parquet files
  • Data loading and transformations: EL (extract and load) - Airbyte, T (transform) - dbt
  • Query engine: Trino / Spark
  • Workflow orchestrator: Airflow / Dagster
  • Data management: data catalogue / data governance / data monitoring

How these map onto the stack layers introduced earlier:

  • Data Warehouse = Query engine + Durable storage + Data
  • Job scheduler = Workflow orchestrator
  • Versioning = Data catalogue
  • Feature engineering = Data catalogue
  • Model operations = Data monitoring

Distinguishing facts from features:

  • Facts
    • Role: Data engineer
    • Key Activity: Collect and persist reliable observations
    • Speed of iteration: Slow
    • Can we control: partially; we do not control the inputs
    • Trustworthiness: high by design
  • Features
    • Role: Data scientist
    • Key Activity: Define new features and challenge existing ones
    • Speed of iteration: Fast
    • Can we control: fully; we control both inputs and outputs
    • Trustworthiness: varies widely; low by default
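
The facts-vs-features split might look like this in code. The event records and the feature definition are hypothetical, invented for this sketch:

```python
from datetime import date

# Facts: immutable observations, collected and persisted by data engineering.
facts = [
    {"user_id": 1, "event": "purchase", "day": date(2022, 7, 1)},
    {"user_id": 1, "event": "purchase", "day": date(2022, 7, 20)},
    {"user_id": 2, "event": "purchase", "day": date(2022, 7, 22)},
]

# Feature: a fast-iterating definition owned by data science, derived from
# facts. Today's definition counts purchases in a trailing window; tomorrow
# a data scientist may redefine it without touching the facts.
def purchases_in_window(user_id, as_of, window_days=30):
    return sum(
        1
        for f in facts
        if f["user_id"] == user_id
        and f["event"] == "purchase"
        and 0 <= (as_of - f["day"]).days < window_days
    )

print(purchases_in_window(1, date(2022, 7, 24)))  # 2
```

Facts change slowly and must stay trustworthy; feature definitions like this one are cheap to change and challenge, which is exactly why they iterate fast.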

Using and operating models

  • To produce value, machine learning models must be connected to other surrounding systems.
  • There isn’t a single way to deploy a data science application and produce predictions: the right approach depends on the use case.
  • Choose the right infrastructure for predictions, depending on the time window between when the input data becomes known and when predictions are needed.
  • Another key consideration is whether surrounding systems need to request predictions from the model, or whether the model can push predictions to the surrounding systems. In the latter case, batch or streaming predictions are a good approach.
  • If the input data is known at least 15-30 minutes before predictions are needed, it is often possible to produce predictions as a batch workflow, which is the most straightforward approach technically.
  • It is important to attach a version identifier in all model outputs, both in batch and real-time use cases.
  • Real-time predictions can be produced either using a general-purpose micro-service framework or a solution that is tailored to data science applications. The latter may be the best approach if your models are computationally demanding.
  • Make sure your deployments are debuggable by investing in monitoring tools and lineage. It should be possible to track every prediction all the way to the model and a workflow that produced it.
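
The versioning and lineage advice above can be sketched as follows. The field names and identifiers are hypothetical; in practice they would come from the metadata service of the workflow that produced the model:

```python
import json

# Hypothetical identifiers for the deployed model and the workflow run
# that produced it.
MODEL_VERSION = "churn-model-v3"
RUN_ID = "run-1842"

def predict(features):
    return 0.5  # stand-in for a real model's scoring function

def predict_with_lineage(features):
    # Attach version and run identifiers to every prediction so each
    # output can be traced back to the model and workflow that made it.
    return {
        "prediction": predict(features),
        "model_version": MODEL_VERSION,
        "run_id": RUN_ID,
    }

record = predict_with_lineage({"logins": 4})
print(json.dumps(record))
```

With these fields in every output record, debugging a bad prediction reduces to looking up the named run in the metadata server.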