Kubernetes
Advanced
Posit Package Manager can be configured to run on AWS in a Kubernetes cluster with EKS for a non-air-gapped environment. In this architecture, Package Manager can handle a large number of users and benefits from deploying across multiple availability zones.
This configuration is suitable for teams of hundreds of data scientists who want or require multiple availability zones.
Most companies don’t need to run in this configuration unless they have many concurrent package uploads/downloads or are required to run across multiple availability zones for compliance reasons. Instead, the single server architecture of Package Manager is more suitable for small teams that don’t have these requirements.
Architecture Overview
This Posit Package Manager implementation deploys the application in an EKS cluster following the Kubernetes installation instructions. It additionally leverages:
- AWS Elastic Kubernetes Service (EKS) to provision and manage the Kubernetes cluster.
- AWS Relational Database Service (RDS) for PostgreSQL, serving as the application database for Posit Package Manager.
- AWS Simple Storage Service (S3) for Posit Package Manager’s object storage.
- AWS Application Load Balancer (ALB) to route requests to the Posit Package Manager service.
Architecture Diagram
Kubernetes Cluster
The Kubernetes cluster can be provisioned using AWS Elastic Kubernetes Service (EKS).
Nodes
We recommend three worker nodes across multiple availability zones. We have tested with m6i.2xlarge instances (8 vCPUs, 32 GiB memory) for each of the nodes, and this configuration can serve 30 million package installs per month, or one million package installs per day. It can also handle 100 Git builders concurrently building packages from Git repositories.
Note
Each Posit Package Manager user could be downloading dozens or hundreds of packages a day. There are also other usage patterns such as an admin uploading local packages or the server building packages for Git builders, but package installations give a good idea of what load and throughput this configuration can handle.
This reference architecture does not assume autoscaling node groups. It assumes you have a fixed number of nodes within your node group. However, it is safe for Posit Package Manager pods to run on auto-scaling nodes. If a pod is evicted from a node due to a scale-down event, any long-running jobs (e.g. Git builders) that are in progress will be restarted on a different pod. All long-running jobs are tracked externally in the database.
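As a starting point, the cluster and node group can be described with an eksctl configuration. The sketch below is illustrative only: the cluster name, region, and availability zones are placeholders, and it assumes a fixed-size managed node group of three m6i.2xlarge workers.
# Sketch of an eksctl ClusterConfig for this reference architecture.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: package-manager          # placeholder cluster name
  region: us-east-1              # placeholder region
availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]   # placeholder zones
managedNodeGroups:
  - name: package-manager-nodes
    instanceType: m6i.2xlarge    # 8 vCPUs, 32 GiB memory
    privateNetworking: true      # place worker nodes in the private subnets
    # Fixed-size group; this reference architecture does not assume autoscaling.
    desiredCapacity: 3
    minSize: 3
    maxSize: 3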
Database
This configuration uses RDS with PostgreSQL on a db.t3.xlarge instance (4 vCPUs, 16 GiB memory) with 100 GiB of General Purpose SSD (gp3) storage, and Multi-AZ enabled with one standby.
Multi-AZ allows the RDS instance to run in an active/passive configuration across two availability zones, with automatic failover when the primary instance goes down.
The RDS instance should be configured with an empty Postgres database for the Posit Package Manager metadata. To handle a higher number of concurrent users, the configuration option PostgresPool.MaxOpenConnections should be increased to 50.
This is a very generous configuration. In our testing, the Postgres database handled one million package installs per day without exceeding 10-20% CPU utilization.
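The database settings can be provided through the Helm chart's values.yaml. The sketch below assumes the chart's config value is rendered into rstudio-pm.gcfg (verify the section and key names against the chart README and the Package Manager admin guide); the RDS endpoint, database name, and user are placeholders, and the database password should be supplied separately rather than in plain text.
# Sketch of values.yaml database settings (placeholders, not a definitive configuration).
config:
  Database:
    Provider: postgres                 # use the external PostgreSQL database
  Postgres:
    # Placeholder RDS endpoint, database, and user; supply the password
    # separately (for example, as an encrypted value or a mounted secret).
    URL: "postgres://packagemanager@<rds-endpoint>:5432/packagemanager?sslmode=require"
  PostgresPool:
    MaxOpenConnections: 50             # raised to handle more concurrent users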
Storage
The S3 bucket is used to store data about packages and sources, as well as cached metadata to decrease response times for requests. S3 can also be used with KMS for client-side encryption.
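Object storage is also set in Package Manager's configuration file. The section and key names in the sketch below are assumptions based on the admin guide's S3 storage settings and should be verified for your version; the bucket name and region are placeholders.
# Sketch of values.yaml storage settings (assumed section/key names; verify before use).
config:
  Storage:
    Default: s3                        # assumed key: use S3 as the default storage class
  S3Storage:
    Bucket: my-packagemanager-bucket   # placeholder bucket name
    Region: us-east-1                  # placeholder region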
Networking
Posit Package Manager should be deployed in an EKS cluster with the control plane and node group in a private subnet, with ingress provided by an Application Load Balancer in a public subnet. This should run across multiple availability zones.
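One way to provision the ALB is with the AWS Load Balancer Controller and an Ingress resource such as the sketch below. The service name and port are assumptions; match them to what the Helm chart creates in your release, and add TLS settings as required.
# Sketch of an Ingress for the AWS Load Balancer Controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rstudio-pm
  annotations:
    # Internet-facing ALB in the public subnets, forwarding to pod IPs in the private subnets.
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rstudio-pm   # assumed service name from the Helm release
                port:
                  number: 80       # assumed service port; verify against the chart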
Configuration Details
The configuration of Package Manager is managed through the official Helm chart: https://github.com/rstudio/helm/tree/main/charts/rstudio-pm. For complete details, refer to the Kubernetes installation steps.
Encryption key
When running with more than one replica, it is important that each replica has the same encryption key. To ensure that each replica has access to the same encryption key, create a Kubernetes secret and then expose it as an environment variable using the values.yaml file. For more details on the encryption key, see the Configuration Encryption page.
First, create an encryption key and Kubernetes secret:
# Create an encryption key
openssl rand -hex 64
# Create a secret. Replace 'xxxx' with the encryption key generated above.
kubectl create secret generic ppm-secrets --from-literal=ppm-encryption-key-user='xxxx'
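Alternatively, if you manage resources declaratively, the same secret can be expressed as a Secret manifest; replace the placeholder value below with the key generated above.
# Equivalent declarative Secret manifest (sketch).
apiVersion: v1
kind: Secret
metadata:
  name: ppm-secrets
type: Opaque
stringData:
  ppm-encryption-key-user: "xxxx"   # replace with the key generated by openssl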
Then, update your values.yaml file to use the secret:
# How to use the secret in your values.yaml
pod:
  env:
    - name: PACKAGEMANAGER_ENCRYPTION_KEY
      valueFrom:
        secretKeyRef:
          name: ppm-secrets
          key: ppm-encryption-key-user
Replicas
This reference architecture uses three replicas for the Posit Package Manager service. If you want to ensure that each replica runs on a different node, set topologySpreadConstraints:
replicas: 3
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        # The helm chart will add this label to the pods automatically
        app.kubernetes.io/name: rstudio-pm
Resiliency and Availability
This configuration of Posit Package Manager is comparable to what has been deployed on the Posit Public Package Manager service. As a publicly available service, the architecture is tested by the R and Python communities that use it. Public Package Manager is used by many more users than any private Posit Package Manager instance. The current uptime for the Posit Public Package Manager service can be found on the status page.
FAQ
See the Frequently Asked Questions page for more information.