Create a provider to use Azure Batch (#133)

* Started work on provider

* WIP Adding batch provider

* Working basic call into pool client. Need to parameterize the baseurl

* Fixed job creation by manipulating the content-type

* WIP Kicking off containers. Dirty

* [wip] More meat around scheduling simple containers.

* Working on basic task wrapper to co-schedule pods

* WIP on task wrapper

* WIP

* Working pod minimal wrapper for batch

* Integrate pod template code into provider

* Cleaning up

* Move to docker without gpu

* WIP batch integration

* partially working

* Working logs

* Tidy code

* WIP: Testing and readme

* Added readme and terraform deployment for GPU Azure Batch pool.

* Update to enable low priority nodes for gpu

* Fix log formatting bug. Return node logs when container not yet started

* Moved to golang v1.10

* Fix cri test

* Fix up minor docs Issue. Add provider to readme. Add var for vk image.
Authored by Lawrence Gripper on 2018-06-23 00:33:49 +01:00, committed by Robbie Zhang
parent 1ad6fb434e
commit d6e8b3daf7
75 changed files with 20040 additions and 6 deletions


@@ -0,0 +1,79 @@
# Kubernetes Virtual Kubelet with Azure Batch
[Azure Batch](https://docs.microsoft.com/en-us/azure/batch/) provides an HPC computing environment in Azure for distributed tasks. Azure Batch handles scheduling discrete jobs and tasks across pools of VMs. It is commonly used for batch processing tasks such as rendering.
The Virtual Kubelet integration allows you to take advantage of this from within Kubernetes. The primary use case for the provider is to make it easy to run GPU-based workloads from normal Kubernetes clusters. For example, creating Kubernetes Jobs which train or execute ML models using NVIDIA GPUs, or running FFmpeg.
Azure Batch allows for [low priority nodes](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms) which can also help to reduce cost for non-time-sensitive workloads.
__The [ACI provider](../azure/README.MD) is the best option unless you're looking to utilise some specific features of Azure Batch__.
## Status: Experimental
This provider is currently in the experimental stage. Contributions welcome!
## Quick Start
The following Terraform template deploys an AKS cluster with the Virtual Kubelet, an Azure Batch account and a GPU-enabled Azure Batch pool. The Batch pool contains 1 dedicated NC6 node and 2 low-priority NC6 nodes.
1. Setup Terraform for Azure following [this guide here](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/terraform-install-configure)
2. From the command line, move to the deployment folder `cd ./providers/azurebatch/deployment` then edit `vars.example.tfvars`, adding in your Service Principal details
3. Download the latest version of the Community Kubernetes Provider for Terraform. Get the correct link [from here](https://github.com/sl1pm4t/terraform-provider-kubernetes/releases) and use it as follows (the current official Terraform Kubernetes provider doesn't support `Deployments`):
```shell
curl -L -o - PUT_RELEASE_BINARY_LINK_YOU_FOUND_HERE | gunzip > terraform-provider-kubernetes
chmod +x ./terraform-provider-kubernetes
```
4. Use `terraform init` to initialize the template
5. Use `terraform plan -var-file=./vars.example.tfvars` and `terraform apply -var-file=./vars.example.tfvars` to deploy the template
6. Run `kubectl describe deployment/vkdeployment` to check that the Virtual Kubelet is running correctly.
7. Run `kubectl create -f examplegpupod.yaml`
8. Run `pods=$(kubectl get pods --selector=app=examplegpupod --show-all --output=jsonpath={.items..metadata.name})` then `kubectl logs $pods` to view the logs. You should see:
```text
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
### Tweaking the Quickstart
You can update [main.tf](./main.tf) to increase the number of nodes allocated to the Azure Batch pool or update [./aks/main.tf](./aks/main.tf) to increase the number of agent nodes allocated to your AKS cluster.
## Advanced Setup
## Prerequisites
1. An Azure Batch account configured
2. An Azure Batch pool created with the necessary VM spec. VMs in the pool must have:
    - `docker` installed and correctly configured
    - `nvidia-docker` and `cuda` drivers installed
3. A Kubernetes cluster
4. An Azure Service Principal with access to the Azure Batch account
## Setup
The provider expects the following environment variables to be configured:
```
ClientID: AZURE_CLIENT_ID
ClientSecret: AZURE_CLIENT_SECRET
ResourceGroup: AZURE_RESOURCE_GROUP
SubscriptionID: AZURE_SUBSCRIPTION_ID
TenantID: AZURE_TENANT_ID
PoolID: AZURE_BATCH_POOLID
JobID (optional): AZURE_BATCH_JOBID
AccountLocation: AZURE_BATCH_ACCOUNT_LOCATION
AccountName: AZURE_BATCH_ACCOUNT_NAME
```
## Running
The provider assigns pods to machines in the Azure Batch pool. Each machine can, by default, process only one pod at a time; running more than one pod per machine isn't currently supported and will result in errors.
Azure Batch queues tasks when no machines are available, so pods will sit in the `Pending` state while waiting for a VM to become available.
If `AZURE_BATCH_JOBID` isn't set, the provider uses the hostname of the machine it runs on as the Batch job ID.
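Internally, each pod is tracked as a Batch task whose ID is derived deterministically from the pod's namespace and name (an MD5 hash), so create, status and delete operations all address the same task. A minimal sketch of the mapping, mirroring the provider's `getTaskIDForPod` helper:
```go
package main

import (
	"crypto/md5"
	"fmt"
)

// taskIDForPod mirrors the provider's getTaskIDForPod helper:
// the Batch task ID is the MD5 hash of "<namespace>-<name>".
func taskIDForPod(namespace, name string) string {
	return fmt.Sprintf("%x", md5.Sum([]byte(fmt.Sprintf("%s-%s", namespace, name))))
}

func main() {
	fmt.Println(taskIDForPod("default", "cuda-vector-add"))
}
```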


@@ -0,0 +1,409 @@
package azurebatch
import (
"context"
"encoding/json"
"fmt"
"github.com/Azure/go-autorest/autorest"
"os"
"strings"
"io/ioutil"
"log"
"net/http"
"github.com/Azure/go-autorest/autorest/azure"
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
"github.com/Azure/go-autorest/autorest/to"
"github.com/lawrencegripper/pod2docker"
"github.com/virtual-kubelet/virtual-kubelet/manager"
azureCreds "github.com/virtual-kubelet/virtual-kubelet/providers/azure"
"k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/resource"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
const (
podJSONKey string = "virtualkubelet_pod"
)
// Provider the base struct for the Azure Batch provider
type Provider struct {
batchConfig *Config
ctx context.Context
cancelCtx context.CancelFunc
fileClient *batch.FileClient
resourceManager *manager.ResourceManager
listTasks func() (*[]batch.CloudTask, error)
addTask func(batch.TaskAddParameter) (autorest.Response, error)
getTask func(taskID string) (batch.CloudTask, error)
deleteTask func(taskID string) (autorest.Response, error)
getFileFromTask func(taskID, path string) (result batch.ReadCloser, err error)
nodeName string
operatingSystem string
cpu string
memory string
pods string
internalIP string
daemonEndpointPort int32
}
// Config - Basic azure config used to interact with ARM resources.
type Config struct {
ClientID string
ClientSecret string
SubscriptionID string
TenantID string
ResourceGroup string
PoolID string
JobID string
AccountName string
AccountLocation string
}
// NewBatchProvider Creates a batch provider
func NewBatchProvider(configString string, rm *manager.ResourceManager, nodeName, operatingSystem string, internalIP string, daemonEndpointPort int32) (*Provider, error) {
fmt.Println("Starting create provider")
config := &Config{}
if azureCredsFilepath := os.Getenv("AZURE_CREDENTIALS_LOCATION"); azureCredsFilepath != "" {
creds, err := azureCreds.NewAcsCredential(azureCredsFilepath)
if err != nil {
return nil, err
}
config.ClientID = creds.ClientID
config.ClientSecret = creds.ClientSecret
config.SubscriptionID = creds.SubscriptionID
config.TenantID = creds.TenantID
}
err := getAzureConfigFromEnv(config)
if err != nil {
log.Println("Failed to get auth information")
}
return NewBatchProviderFromConfig(config, rm, nodeName, operatingSystem, internalIP, daemonEndpointPort)
}
// NewBatchProviderFromConfig Creates a batch provider
func NewBatchProviderFromConfig(config *Config, rm *manager.ResourceManager, nodeName, operatingSystem string, internalIP string, daemonEndpointPort int32) (*Provider, error) {
p := Provider{}
p.batchConfig = config
// Set sane defaults for Capacity in case config is not supplied
p.cpu = "20"
p.memory = "100Gi"
p.pods = "20"
p.resourceManager = rm
p.operatingSystem = operatingSystem
p.nodeName = nodeName
p.internalIP = internalIP
p.daemonEndpointPort = daemonEndpointPort
p.ctx, p.cancelCtx = context.WithCancel(context.Background())
auth := getAzureADAuthorizer(config, azure.PublicCloud.BatchManagementEndpoint)
batchBaseURL := getBatchBaseURL(config.AccountName, config.AccountLocation)
_, err := getPool(p.ctx, batchBaseURL, config.PoolID, auth)
if err != nil {
log.Panicf("Error retreiving Azure Batch pool: %v", err)
}
_, err = createOrGetJob(p.ctx, batchBaseURL, config.JobID, config.PoolID, auth)
if err != nil {
log.Panicf("Error retreiving/creating Azure Batch job: %v", err)
}
taskClient := batch.NewTaskClientWithBaseURI(batchBaseURL)
taskClient.Authorizer = auth
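// listTasks pages through every task in the wrapper job and flattens the pages into a single slice.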
p.listTasks = func() (*[]batch.CloudTask, error) {
res, err := taskClient.List(p.ctx, config.JobID, "", "", "", nil, nil, nil, nil, nil)
if err != nil {
return &[]batch.CloudTask{}, err
}
currentTasks := res.Values()
for res.NotDone() {
err = res.Next()
if err != nil {
return &[]batch.CloudTask{}, err
}
pageTasks := res.Values()
if pageTasks != nil && len(pageTasks) != 0 {
currentTasks = append(currentTasks, pageTasks...)
}
}
return &currentTasks, nil
}
p.addTask = func(task batch.TaskAddParameter) (autorest.Response, error) {
return taskClient.Add(p.ctx, config.JobID, task, nil, nil, nil, nil)
}
p.getTask = func(taskID string) (batch.CloudTask, error) {
return taskClient.Get(p.ctx, config.JobID, taskID, "", "", nil, nil, nil, nil, "", "", nil, nil)
}
p.deleteTask = func(taskID string) (autorest.Response, error) {
return taskClient.Delete(p.ctx, config.JobID, taskID, nil, nil, nil, nil, "", "", nil, nil)
}
p.getFileFromTask = func(taskID, path string) (batch.ReadCloser, error) {
return p.fileClient.GetFromTask(p.ctx, config.JobID, taskID, path, nil, nil, nil, nil, "", nil, nil)
}
fileClient := batch.NewFileClientWithBaseURI(batchBaseURL)
fileClient.Authorizer = auth
p.fileClient = &fileClient
return &p, nil
}
// CreatePod accepts a Pod definition
func (p *Provider) CreatePod(pod *v1.Pod) error {
log.Println("Creating pod...")
podCommand, err := pod2docker.GetBashCommand(pod2docker.PodComponents{
InitContainers: pod.Spec.InitContainers,
Containers: pod.Spec.Containers,
PodName: pod.Name,
Volumes: pod.Spec.Volumes,
})
if err != nil {
return err
}
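// Store the full pod spec on the task as an environment setting so GetPod/GetPods can rehydrate it later.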
bytes, err := json.Marshal(pod)
if err != nil {
return err
}
task := batch.TaskAddParameter{
DisplayName: to.StringPtr(string(pod.UID)),
ID: to.StringPtr(getTaskIDForPod(pod.Namespace, pod.Name)),
CommandLine: to.StringPtr(fmt.Sprintf(`/bin/bash -c "%s"`, podCommand)),
UserIdentity: &batch.UserIdentity{
AutoUser: &batch.AutoUserSpecification{
ElevationLevel: batch.Admin,
Scope: batch.Pool,
},
},
EnvironmentSettings: &[]batch.EnvironmentSetting{
{
Name: to.StringPtr(podJSONKey),
Value: to.StringPtr(string(bytes)),
},
},
}
_, err = p.addTask(task)
if err != nil {
return err
}
return nil
}
// GetPodStatus retrieves the status of a given pod by name.
func (p *Provider) GetPodStatus(namespace, name string) (*v1.PodStatus, error) {
log.Println("Getting pod status ....")
pod, err := p.GetPod(namespace, name)
if err != nil {
return nil, err
}
if pod == nil {
return nil, nil
}
return &pod.Status, nil
}
// UpdatePod accepts a Pod definition
func (p *Provider) UpdatePod(pod *v1.Pod) error {
log.Println("Pod Update called: No-op as not implemented")
return nil
}
// DeletePod accepts a Pod definition
func (p *Provider) DeletePod(pod *v1.Pod) error {
taskID := getTaskIDForPod(pod.Namespace, pod.Name)
task, err := p.deleteTask(taskID)
if err != nil {
log.Println(task)
log.Println(err)
return err
}
log.Printf("Deleting task: %v", taskID)
return nil
}
// GetPod returns a pod by name
func (p *Provider) GetPod(namespace, name string) (*v1.Pod, error) {
log.Println("Getting Pod ...")
task, err := p.getTask(getTaskIDForPod(namespace, name))
if err != nil {
if task.Response.StatusCode == http.StatusNotFound {
return nil, nil
}
log.Println(err)
return nil, err
}
pod, err := getPodFromTask(&task)
if err != nil {
return nil, err
}
status, _ := convertTaskToPodStatus(&task)
pod.Status = *status
return pod, nil
}
// GetContainerLogs returns the logs of a container running in a pod by name.
func (p *Provider) GetContainerLogs(namespace, podName, containerName string, tail int) (string, error) {
log.Println("Getting pod logs ....")
taskID := getTaskIDForPod(namespace, podName)
logFileLocation := fmt.Sprintf("wd/%s.log", containerName)
containerLogReader, err := p.getFileFromTask(taskID, logFileLocation)
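// A 404 for the container log file means the container hasn't started yet,
// so fall back to the startup stdout/stderr from the Batch node instead.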
if containerLogReader.Response.Response != nil && containerLogReader.StatusCode == http.StatusNotFound {
stdoutReader, err := p.getFileFromTask(taskID, "stdout.txt")
if err != nil {
return "", err
}
stderrReader, err := p.getFileFromTask(taskID, "stderr.txt")
if err != nil {
return "", err
}
var builder strings.Builder
builderPtr := &builder
mustWriteString(builderPtr, "Container still starting....\n")
mustWriteString(builderPtr, "Showing startup logs from Azure Batch node instead:\n")
mustWriteString(builderPtr, "----- STDOUT -----\n")
stdoutBytes, _ := ioutil.ReadAll(*stdoutReader.Value)
mustWrite(builderPtr, stdoutBytes)
mustWriteString(builderPtr, "\n")
mustWriteString(builderPtr, "----- STDERR -----\n")
stderrBytes, _ := ioutil.ReadAll(*stderrReader.Value)
mustWrite(builderPtr, stderrBytes)
mustWriteString(builderPtr, "\n")
return builder.String(), nil
}
if err != nil {
return "", err
}
result, err := formatLogJSON(containerLogReader)
if err != nil {
return "", fmt.Errorf("Container log formating failed err: %v", err)
}
return result, nil
}
// GetPods retrieves a list of all pods scheduled to run.
func (p *Provider) GetPods() ([]*v1.Pod, error) {
log.Println("Getting pods...")
tasksPtr, err := p.listTasks()
if err != nil {
return nil, err
}
if tasksPtr == nil {
return []*v1.Pod{}, nil
}
tasks := *tasksPtr
pods := make([]*v1.Pod, len(tasks))
for i, t := range tasks {
pod, err := getPodFromTask(&t)
if err != nil {
return nil, err
}
pods[i] = pod
}
return pods, nil
}
// Capacity returns a resource list containing the capacity limits
func (p *Provider) Capacity() v1.ResourceList {
return v1.ResourceList{
"cpu": resource.MustParse(p.cpu),
"memory": resource.MustParse(p.memory),
"pods": resource.MustParse(p.pods),
"nvidia.com/gpu": resource.MustParse("1"),
}
}
// NodeConditions returns a list of conditions (Ready, OutOfDisk, etc), for updates to the node status
// within Kubernetes.
func (p *Provider) NodeConditions() []v1.NodeCondition {
return []v1.NodeCondition{
{
Type: "Ready",
Status: v1.ConditionTrue,
LastHeartbeatTime: metav1.Now(),
LastTransitionTime: metav1.Now(),
Reason: "KubeletReady",
Message: "kubelet is ready.",
},
{
Type: "OutOfDisk",
Status: v1.ConditionFalse,
LastHeartbeatTime: metav1.Now(),
LastTransitionTime: metav1.Now(),
Reason: "KubeletHasSufficientDisk",
Message: "kubelet has sufficient disk space available",
},
{
Type: "MemoryPressure",
Status: v1.ConditionFalse,
LastHeartbeatTime: metav1.Now(),
LastTransitionTime: metav1.Now(),
Reason: "KubeletHasSufficientMemory",
Message: "kubelet has sufficient memory available",
},
{
Type: "DiskPressure",
Status: v1.ConditionFalse,
LastHeartbeatTime: metav1.Now(),
LastTransitionTime: metav1.Now(),
Reason: "KubeletHasNoDiskPressure",
Message: "kubelet has no disk pressure",
},
{
Type: "NetworkUnavailable",
Status: v1.ConditionFalse,
LastHeartbeatTime: metav1.Now(),
LastTransitionTime: metav1.Now(),
Reason: "RouteCreated",
Message: "RouteController created a route",
},
}
}
// NodeAddresses returns a list of addresses for the node status
// within Kubernetes.
func (p *Provider) NodeAddresses() []v1.NodeAddress {
// TODO: Make these dynamic and augment with Azure Batch specific addresses of interest
return []v1.NodeAddress{
{
Type: "InternalIP",
Address: p.internalIP,
},
}
}
// NodeDaemonEndpoints returns NodeDaemonEndpoints for the node status
// within Kubernetes.
func (p *Provider) NodeDaemonEndpoints() *v1.NodeDaemonEndpoints {
return &v1.NodeDaemonEndpoints{
KubeletEndpoint: v1.DaemonEndpoint{
Port: p.daemonEndpointPort,
},
}
}
// OperatingSystem returns the operating system for this provider.
func (p *Provider) OperatingSystem() string {
return p.operatingSystem
}
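For reference, a hypothetical standalone sketch of constructing the provider directly (virtual-kubelet normally does this when started with `--provider azurebatch`, as in the Terraform deployment below). The import path and config values are placeholders/assumptions; the constructor reaches out to Azure to verify the pool and create the wrapper job, so it needs valid credentials and an existing pool:
```go
package main

import (
	"log"

	// Assumed import path for the azurebatch provider added in this commit.
	"github.com/virtual-kubelet/virtual-kubelet/providers/azurebatch"
)

func main() {
	// Placeholder values; in the real deployment these come from the environment
	// variables listed in the README (AZURE_CLIENT_ID, AZURE_BATCH_POOLID, ...).
	cfg := &azurebatch.Config{
		ClientID:        "client-id",
		ClientSecret:    "client-secret",
		SubscriptionID:  "subscription-id",
		TenantID:        "tenant-id",
		ResourceGroup:   "resource-group",
		PoolID:          "pool1",
		JobID:           "virtual-kubelet",
		AccountName:     "mybatchaccount",
		AccountLocation: "westeurope",
	}
	provider, err := azurebatch.NewBatchProviderFromConfig(cfg, nil, "virtual-kubelet", "Linux", "10.0.0.1", 10250)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("Azure Batch provider ready, capacity:", provider.Capacity())
}
```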


@@ -0,0 +1,216 @@
package azurebatch
import (
"bufio"
"context"
"crypto/md5"
"encoding/json"
"fmt"
"log"
"os"
"strings"
"time"
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
"github.com/Azure/go-autorest/autorest"
"github.com/Azure/go-autorest/autorest/adal"
"github.com/Azure/go-autorest/autorest/azure"
)
func mustWriteString(builder *strings.Builder, s string) {
_, err := builder.WriteString(s)
if err != nil {
panic(err)
}
}
func mustWrite(builder *strings.Builder, b []byte) {
_, err := builder.Write(b)
if err != nil {
panic(err)
}
}
// newServicePrincipalTokenFromCredentials creates a new ServicePrincipalToken using the
// values of the passed Config.
func newServicePrincipalTokenFromCredentials(c *Config, scope string) (*adal.ServicePrincipalToken, error) {
oauthConfig, err := adal.NewOAuthConfig(azure.PublicCloud.ActiveDirectoryEndpoint, c.TenantID)
if err != nil {
panic(err)
}
return adal.NewServicePrincipalToken(*oauthConfig, c.ClientID, c.ClientSecret, scope)
}
// getAzureADAuthorizer returns an authorizer for an Azure service principal
func getAzureADAuthorizer(c *Config, azureEndpoint string) autorest.Authorizer {
spt, err := newServicePrincipalTokenFromCredentials(c, azureEndpoint)
if err != nil {
panic(fmt.Sprintf("Failed to create authorizer: %v", err))
}
auth := autorest.NewBearerAuthorizer(spt)
return auth
}
func getPool(ctx context.Context, batchBaseURL, poolID string, auth autorest.Authorizer) (*batch.PoolClient, error) {
poolClient := batch.NewPoolClientWithBaseURI(batchBaseURL)
poolClient.Authorizer = auth
poolClient.RetryAttempts = 0
pool, err := poolClient.Get(ctx, poolID, "*", "", nil, nil, nil, nil, "", "", nil, nil)
// If we observe an error which isn't related to the pool not existing, return it.
// A 404 is expected if this is the first run.
if err != nil && pool.Response.Response == nil {
log.Printf("Failed to get pool. nil response %v", poolID)
return nil, err
} else if err != nil && pool.StatusCode == 404 {
log.Printf("Pool doesn't exist 404 received Error: %v PoolID: %v", err, poolID)
return nil, err
} else if err != nil {
log.Printf("Failed to get pool. Response:%v", pool.Response)
return nil, err
}
if pool.State == batch.PoolStateActive {
log.Println("Pool active and running...")
return &poolClient, nil
}
return nil, fmt.Errorf("Pool not in active state: %v", pool.State)
}
func createOrGetJob(ctx context.Context, batchBaseURL, jobID, poolID string, auth autorest.Authorizer) (*batch.JobClient, error) {
jobClient := batch.NewJobClientWithBaseURI(batchBaseURL)
jobClient.Authorizer = auth
// check if job exists already
currentJob, err := jobClient.Get(ctx, jobID, "", "", nil, nil, nil, nil, "", "", nil, nil)
if err == nil && currentJob.State == batch.JobStateActive {
log.Println("Wrapper job already exists...")
return &jobClient, nil
} else if currentJob.Response.Response != nil && currentJob.Response.StatusCode == 404 {
log.Println("Wrapper job missing... creating...")
wrapperJob := batch.JobAddParameter{
ID: &jobID,
PoolInfo: &batch.PoolInformation{
PoolID: &poolID,
},
}
_, err := jobClient.Add(ctx, wrapperJob, nil, nil, nil, nil)
if err != nil {
return nil, err
}
return &jobClient, nil
} else if currentJob.State == batch.JobStateDeleting {
log.Printf("Job is being deleted... Waiting then will retry")
time.Sleep(time.Minute)
return createOrGetJob(ctx, batchBaseURL, jobID, poolID, auth)
}
return nil, err
}
func getBatchBaseURL(batchAccountName, batchAccountLocation string) string {
return fmt.Sprintf("https://%s.%s.batch.azure.com", batchAccountName, batchAccountLocation)
}
func envHasValue(env string) bool {
return os.Getenv(env) != ""
}
// getAzureConfigFromEnv retrieves the azure configuration from environment variables.
func getAzureConfigFromEnv(config *Config) error {
if envHasValue("AZURE_CLIENT_ID") {
config.ClientID = os.Getenv("AZURE_CLIENT_ID")
}
if envHasValue("AZURE_CLIENT_SECRET") {
config.ClientSecret = os.Getenv("AZURE_CLIENT_SECRET")
}
if envHasValue("AZURE_RESOURCE_GROUP") {
config.ResourceGroup = os.Getenv("AZURE_RESOURCE_GROUP")
}
if envHasValue("AZURE_SUBSCRIPTION_ID") {
config.SubscriptionID = os.Getenv("AZURE_SUBSCRIPTION_ID")
}
if envHasValue("AZURE_TENANT_ID") {
config.TenantID = os.Getenv("AZURE_TENANT_ID")
}
if envHasValue("AZURE_BATCH_POOLID") {
config.PoolID = os.Getenv("AZURE_BATCH_POOLID")
}
if envHasValue("AZURE_BATCH_JOBID") {
config.JobID = os.Getenv("AZURE_BATCH_JOBID")
}
if envHasValue("AZURE_BATCH_ACCOUNT_LOCATION") {
config.AccountLocation = os.Getenv("AZURE_BATCH_ACCOUNT_LOCATION")
}
if envHasValue("AZURE_BATCH_ACCOUNT_NAME") {
config.AccountName = os.Getenv("AZURE_BATCH_ACCOUNT_NAME")
}
if config.JobID == "" {
hostname, err := os.Hostname()
if err != nil {
log.Panic(err)
}
config.JobID = hostname
}
if config.ClientID == "" ||
config.ClientSecret == "" ||
config.ResourceGroup == "" ||
config.SubscriptionID == "" ||
config.PoolID == "" ||
config.TenantID == "" {
return &ConfigError{CurrentConfig: config, ErrorDetails: "Missing configuration"}
}
return nil
}
func getTaskIDForPod(namespace, name string) string {
ID := []byte(fmt.Sprintf("%s-%s", namespace, name))
return fmt.Sprintf("%x", md5.Sum(ID))
}
type jsonLog struct {
Log string `json:"log"`
}
func formatLogJSON(readCloser batch.ReadCloser) (string, error) {
//Read line by line as file isn't valid json. Each line is a single valid json object.
scanner := bufio.NewScanner(*readCloser.Value)
var b strings.Builder
for scanner.Scan() {
result := jsonLog{}
err := json.Unmarshal(scanner.Bytes(), &result)
if err != nil {
return "", err
}
mustWriteString(&b, result.Log)
}
return b.String(), nil
}
// ConfigError - Error when reading configuration values.
type ConfigError struct {
CurrentConfig *Config
ErrorDetails string
}
func (e *ConfigError) Error() string {
configJSON, err := json.Marshal(e.CurrentConfig)
if err != nil {
return e.ErrorDetails
}
return e.ErrorDetails + ": " + string(configJSON)
}
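`formatLogJSON` expects the Docker JSON-file log format (one JSON object per line, as in `testdata/logresponse.json`) and concatenates the `log` fields. A small in-package sketch, assuming it sits alongside the existing tests:
```go
package azurebatch

import (
	"io/ioutil"
	"strings"
	"testing"

	"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
)

func Test_formatLogJSON_sketch(t *testing.T) {
	// Two lines in the Docker JSON-file log format.
	raw := `{"log":"Test PASSED\n","stream":"stdout"}` + "\n" +
		`{"log":"Done\n","stream":"stdout"}` + "\n"
	rc := ioutil.NopCloser(strings.NewReader(raw))
	out, err := formatLogJSON(batch.ReadCloser{Value: &rc})
	if err != nil || out != "Test PASSED\nDone\n" {
		t.Errorf("unexpected result: %q, %v", out, err)
	}
}
```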


@@ -0,0 +1,141 @@
package azurebatch
import (
"encoding/json"
"fmt"
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
apiv1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
func getPodFromTask(task *batch.CloudTask) (pod *apiv1.Pod, err error) {
if task == nil || task.EnvironmentSettings == nil {
return nil, fmt.Errorf("invalid input task: %v", task)
}
ok := false
jsonData := ""
settings := *task.EnvironmentSettings
for _, s := range settings {
if *s.Name == podJSONKey && s.Value != nil {
ok = true
jsonData = *s.Value
}
}
if !ok {
return nil, fmt.Errorf("task doesn't have pod json stored in it: %v", task.EnvironmentSettings)
}
pod = &apiv1.Pod{}
err = json.Unmarshal([]byte(jsonData), pod)
if err != nil {
return nil, err
}
return
}
func convertTaskToPodStatus(task *batch.CloudTask) (status *apiv1.PodStatus, err error) {
pod, err := getPodFromTask(task)
if err != nil {
return
}
// TODO: Review individual container status response
status = &apiv1.PodStatus{
Phase: convertTaskStatusToPodPhase(task),
Conditions: []apiv1.PodCondition{},
Message: "",
Reason: "",
HostIP: "",
PodIP: "127.0.0.1",
StartTime: &pod.CreationTimestamp,
}
for _, container := range pod.Spec.Containers {
containerStatus := apiv1.ContainerStatus{
Name: container.Name,
State: convertTaskStatusToContainerState(task),
Ready: true,
RestartCount: 0,
Image: container.Image,
ImageID: "",
ContainerID: "",
}
status.ContainerStatuses = append(status.ContainerStatuses, containerStatus)
}
return
}
func convertTaskStatusToPodPhase(t *batch.CloudTask) (podPhase apiv1.PodPhase) {
switch t.State {
case batch.TaskStatePreparing:
podPhase = apiv1.PodPending
case batch.TaskStateActive:
podPhase = apiv1.PodPending
case batch.TaskStateRunning:
podPhase = apiv1.PodRunning
case batch.TaskStateCompleted:
podPhase = apiv1.PodFailed
if t.ExecutionInfo != nil && t.ExecutionInfo.ExitCode != nil && *t.ExecutionInfo.ExitCode == 0 {
podPhase = apiv1.PodSucceeded
}
}
return
}
func convertTaskStatusToContainerState(t *batch.CloudTask) (containerState apiv1.ContainerState) {
startTime := metav1.Time{}
if t.ExecutionInfo != nil {
if t.ExecutionInfo.StartTime != nil {
startTime.Time = t.ExecutionInfo.StartTime.Time
}
}
switch t.State {
case batch.TaskStatePreparing:
containerState = apiv1.ContainerState{
Waiting: &apiv1.ContainerStateWaiting{
Message: "Waiting for machine in AzureBatch",
Reason: "Preparing",
},
}
case batch.TaskStateActive:
containerState = apiv1.ContainerState{
Waiting: &apiv1.ContainerStateWaiting{
Message: "Waiting for machine in AzureBatch",
Reason: "Queued",
},
}
case batch.TaskStateRunning:
containerState = apiv1.ContainerState{
Running: &apiv1.ContainerStateRunning{
StartedAt: startTime,
},
}
case batch.TaskStateCompleted:
termStatus := apiv1.ContainerState{
Terminated: &apiv1.ContainerStateTerminated{
FinishedAt: metav1.Time{
Time: t.StateTransitionTime.Time,
},
StartedAt: startTime,
},
}
if t.ExecutionInfo != nil && t.ExecutionInfo.ExitCode != nil {
exitCode := *t.ExecutionInfo.ExitCode
termStatus.Terminated.ExitCode = exitCode
if exitCode != 0 && t.ExecutionInfo.FailureInfo != nil && t.ExecutionInfo.FailureInfo.Message != nil {
termStatus.Terminated.Message = *t.ExecutionInfo.FailureInfo.Message
}
}
}
return
}


@@ -0,0 +1,166 @@
package azurebatch
import (
"github.com/Azure/go-autorest/autorest/to"
"reflect"
"testing"
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
apiv1 "k8s.io/api/core/v1"
)
func Test_getPodFromTask(t *testing.T) {
type args struct {
task *batch.CloudTask
}
tests := []struct {
name string
task batch.CloudTask
wantPod *apiv1.Pod
wantErr bool
}{
{
name: "SimplePod",
task: batch.CloudTask{
EnvironmentSettings: &[]batch.EnvironmentSetting{
{
Name: to.StringPtr(podJSONKey),
Value: to.StringPtr(`{"metadata":{"creationTimestamp":null},"spec":{"containers":[{"name":"web","image":"nginx:1.12","ports":[{"name":"http","containerPort":80,"protocol":"TCP"}],"resources":{}}]},"status":{}}`),
},
},
},
wantPod: &apiv1.Pod{
Spec: apiv1.PodSpec{
Containers: []apiv1.Container{
{
Name: "web",
Image: "nginx:1.12",
Ports: []apiv1.ContainerPort{
{
Name: "http",
Protocol: apiv1.ProtocolTCP,
ContainerPort: 80,
},
},
},
},
},
},
},
{
name: "InvalidJson",
task: batch.CloudTask{
EnvironmentSettings: &[]batch.EnvironmentSetting{
{
Name: to.StringPtr(podJSONKey),
Value: to.StringPtr("---notjson--"),
},
},
},
wantErr: true,
},
{
name: "NilEnvironment",
task: batch.CloudTask{
EnvironmentSettings: nil,
},
wantErr: true,
},
{
name: "NilString",
task: batch.CloudTask{
EnvironmentSettings: &[]batch.EnvironmentSetting{
{
Name: to.StringPtr(podJSONKey),
Value: nil,
},
},
},
wantErr: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
gotPod, err := getPodFromTask(&tt.task)
if (err != nil) != tt.wantErr {
t.Errorf("getPodFromTask() error = %v, wantErr %v", err, tt.wantErr)
return
}
if !reflect.DeepEqual(gotPod, tt.wantPod) {
t.Errorf("getPodFromTask() = %v, want %v", gotPod, tt.wantPod)
}
})
}
}
func Test_convertTaskStatusToPodPhase(t *testing.T) {
type args struct {
t *batch.CloudTask
}
tests := []struct {
name string
task batch.CloudTask
wantPodPhase apiv1.PodPhase
}{
{
name: "PreparingTask",
task: batch.CloudTask{
State: batch.TaskStatePreparing,
},
wantPodPhase: apiv1.PodPending,
},
{
//Active tasks are sitting in a queue waiting for a node
// so maps best to pending state
name: "ActiveTask",
task: batch.CloudTask{
State: batch.TaskStateActive,
},
wantPodPhase: apiv1.PodPending,
},
{
name: "RunningTask",
task: batch.CloudTask{
State: batch.TaskStateRunning,
},
wantPodPhase: apiv1.PodRunning,
},
{
name: "CompletedTask_ExitCode0",
task: batch.CloudTask{
State: batch.TaskStateCompleted,
ExecutionInfo: &batch.TaskExecutionInformation{
ExitCode: to.Int32Ptr(0),
},
},
wantPodPhase: apiv1.PodSucceeded,
},
{
name: "CompletedTask_ExitCode127",
task: batch.CloudTask{
State: batch.TaskStateCompleted,
ExecutionInfo: &batch.TaskExecutionInformation{
ExitCode: to.Int32Ptr(127),
},
},
wantPodPhase: apiv1.PodFailed,
},
{
name: "CompletedTask_nilExecInfo",
task: batch.CloudTask{
State: batch.TaskStateCompleted,
ExecutionInfo: nil,
},
wantPodPhase: apiv1.PodFailed,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if gotPodPhase := convertTaskStatusToPodPhase(&tt.task); !reflect.DeepEqual(gotPodPhase, tt.wantPodPhase) {
t.Errorf("convertTaskStatusToPodPhase() = %v, want %v", gotPodPhase, tt.wantPodPhase)
}
})
}
}
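The tests above cover `getPodFromTask` and `convertTaskStatusToPodPhase`; a similar in-package sketch, under the same assumptions, for `convertTaskStatusToContainerState` on a running task:
```go
package azurebatch

import (
	"testing"

	"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
)

func Test_convertTaskStatusToContainerState_running(t *testing.T) {
	// A running Batch task should surface as a Running container state.
	task := batch.CloudTask{State: batch.TaskStateRunning}
	state := convertTaskStatusToContainerState(&task)
	if state.Running == nil || state.Waiting != nil || state.Terminated != nil {
		t.Errorf("expected a Running container state, got %+v", state)
	}
}
```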


@@ -0,0 +1,165 @@
package azurebatch
import (
"crypto/md5"
"fmt"
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
"github.com/Azure/go-autorest/autorest"
"io/ioutil"
"net/http"
"os"
"strings"
"testing"
apiv1 "k8s.io/api/core/v1"
)
func Test_deletePod(t *testing.T) {
podNamespace := "bob"
podName := "marley"
concatName := []byte(fmt.Sprintf("%s-%s", podNamespace, podName))
expectedDeleteTaskID := fmt.Sprintf("%x", md5.Sum(concatName))
provider := Provider{}
provider.deleteTask = func(taskID string) (autorest.Response, error) {
if taskID != expectedDeleteTaskID {
t.Errorf("Deleted wrong task! Expected delete: %v Actual: %v", taskID, expectedDeleteTaskID)
}
return autorest.Response{}, nil
}
pod := &apiv1.Pod{}
pod.Name = podName
pod.Namespace = podNamespace
err := provider.DeletePod(pod)
if err != nil {
t.Error(err)
}
}
func Test_deletePod_doesntExist(t *testing.T) {
pod := &apiv1.Pod{}
pod.Namespace = "bob"
pod.Name = "marley"
provider := Provider{}
provider.deleteTask = func(taskID string) (autorest.Response, error) {
return autorest.Response{}, fmt.Errorf("Task doesn't exist")
}
err := provider.DeletePod(pod)
if err == nil {
t.Error("Expected error but none seen")
}
}
func Test_createPod(t *testing.T) {
pod := &apiv1.Pod{}
pod.Namespace = "bob"
pod.Name = "marley"
provider := Provider{}
provider.addTask = func(task batch.TaskAddParameter) (autorest.Response, error) {
if task.CommandLine == nil || *task.CommandLine == "" {
t.Error("Missing commandline args")
}
derefVars := *task.EnvironmentSettings
if len(derefVars) != 1 || *derefVars[0].Name != podJSONKey {
t.Error("Missing pod json")
}
return autorest.Response{}, nil
}
err := provider.CreatePod(pod)
if err != nil {
t.Errorf("Unexpected error creating pod %v", err)
}
}
func Test_createPod_errorResponse(t *testing.T) {
pod := &apiv1.Pod{}
pod.Namespace = "bob"
pod.Name = "marley"
provider := Provider{}
provider.addTask = func(task batch.TaskAddParameter) (autorest.Response, error) {
return autorest.Response{}, fmt.Errorf("Failed creating task")
}
err := provider.CreatePod(pod)
if err == nil {
t.Error("Expected error but none seen")
}
}
func Test_readLogs_404Response_expectReturnStartupLogs(t *testing.T) {
pod := &apiv1.Pod{}
pod.Namespace = "bob"
pod.Name = "marley"
containerName := "sam"
provider := Provider{}
provider.getFileFromTask = func(taskID, path string) (batch.ReadCloser, error) {
if path == "wd/sam.log" {
// Autorest - Seriously? Can't find a better way to make a 404 :(
return batch.ReadCloser{Response: autorest.Response{Response: &http.Response{StatusCode: 404}}}, nil
} else if path == "stderr.txt" {
response := ioutil.NopCloser(strings.NewReader("stderrResponse"))
return batch.ReadCloser{Value: &response}, nil
} else if path == "stdout.txt" {
response := ioutil.NopCloser(strings.NewReader("stdoutResponse"))
return batch.ReadCloser{Value: &response}, nil
} else {
t.Errorf("Unexpected Filepath: %v", path)
}
return batch.ReadCloser{}, fmt.Errorf("Failed in test mock of getFileFromTask")
}
result, err := provider.GetContainerLogs(pod.Namespace, pod.Name, containerName, 0)
if err != nil {
t.Errorf("GetContainerLogs return error: %v", err)
}
fmt.Print(result)
if !strings.Contains(result, "stderrResponse") || !strings.Contains(result, "stdoutResponse") {
t.Errorf("Result didn't contain expected content have: %v", result)
}
}
func Test_readLogs_JsonResponse_expectFormattedLogs(t *testing.T) {
pod := &apiv1.Pod{}
pod.Namespace = "bob"
pod.Name = "marley"
containerName := "sam"
provider := Provider{}
provider.getFileFromTask = func(taskID, path string) (batch.ReadCloser, error) {
if path == "wd/sam.log" {
fileReader, err := os.Open("./testdata/logresponse.json")
if err != nil {
t.Error(err)
}
readCloser := ioutil.NopCloser(fileReader)
return batch.ReadCloser{Value: &readCloser, Response: autorest.Response{Response: &http.Response{StatusCode: 200}}}, nil
}
t.Errorf("Unexpected Filepath: %v", path)
return batch.ReadCloser{}, fmt.Errorf("Failed in test mock of getFileFromTask")
}
result, err := provider.GetContainerLogs(pod.Namespace, pod.Name, containerName, 0)
if err != nil {
t.Errorf("GetContainerLogs return error: %v", err)
}
fmt.Print(result)
if !strings.Contains(result, "Copy output data from the CUDA device to the host memory") || strings.Contains(result, "{") {
t.Errorf("Result didn't contain expected content have or had json: %v", result)
}
}
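The tests above stub `deleteTask`, `addTask` and `getFileFromTask`; a stub of `getTask` can exercise the not-found path in `GetPod` the same way. A minimal sketch, assuming it lives in the same package:
```go
package azurebatch

import (
	"fmt"
	"net/http"
	"testing"

	"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
	"github.com/Azure/go-autorest/autorest"
)

func Test_getPod_notFound(t *testing.T) {
	provider := Provider{}
	provider.getTask = func(taskID string) (batch.CloudTask, error) {
		// Simulate the 404 the Batch service returns for a missing task.
		return batch.CloudTask{
			Response: autorest.Response{Response: &http.Response{StatusCode: http.StatusNotFound}},
		}, fmt.Errorf("task not found")
	}
	pod, err := provider.GetPod("default", "missing")
	if pod != nil || err != nil {
		t.Errorf("expected nil pod and nil error for a missing task, got %v / %v", pod, err)
	}
}
```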


@@ -0,0 +1,59 @@
resource "random_id" "workspace" {
keepers = {
# Generate a new id each time we switch to a new resource group
group_name = "${var.resource_group_name}"
}
byte_length = 8
}
#an attempt to keep the AKS name (and dns label) somewhat unique
resource "random_integer" "random_int" {
min = 100
max = 999
}
resource "azurerm_kubernetes_cluster" "aks" {
name = "aks-${random_integer.random_int.result}"
location = "${var.resource_group_location}"
dns_prefix = "aks-${random_integer.random_int.result}"
resource_group_name = "${var.resource_group_name}"
kubernetes_version = "1.9.2"
linux_profile {
admin_username = "${var.linux_admin_username}"
ssh_key {
key_data = "${var.linux_admin_ssh_publickey}"
}
}
agent_pool_profile {
name = "agentpool"
count = "2"
vm_size = "Standard_DS2_v2"
os_type = "Linux"
}
service_principal {
client_id = "${var.client_id}"
client_secret = "${var.client_secret}"
}
}
output "cluster_client_certificate" {
value = "${base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_certificate)}"
}
output "cluster_client_key" {
value = "${base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_key)}"
}
output "cluster_ca" {
value = "${base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.cluster_ca_certificate)}"
}
output "host" {
value = "${azurerm_kubernetes_cluster.aks.kube_config.0.host}"
}


@@ -0,0 +1,31 @@
variable "client_id" {
type = "string"
description = "Client ID"
}
variable "client_secret" {
type = "string"
description = "Client secret."
}
variable "resource_group_name" {
type = "string"
description = "Name of the azure resource group."
default = "akc-rg"
}
variable "resource_group_location" {
type = "string"
description = "Location of the azure resource group."
default = "eastus"
}
variable "linux_admin_username" {
type = "string"
description = "User name for authentication to the Kubernetes linux agent virtual machines in the cluster."
}
variable "linux_admin_ssh_publickey" {
type = "string"
description = "Configure all the linux virtual machines in the cluster with the SSH RSA public key string. The key should include three parts, for example 'ssh-rsa AAAAB...snip...UcyupgH azureuser@linuxvm'"
}


@@ -0,0 +1,146 @@
resource "random_string" "batchname" {
keepers = {
# Generate a new id each time we switch to a new resource group
group_name = "${var.resource_group_name}"
}
length = 8
upper = false
special = false
number = false
}
resource "azurerm_template_deployment" "test" {
name = "tfdeployment"
resource_group_name = "${var.resource_group_name}"
# these key-value pairs are passed into the ARM Template's `parameters` block
parameters {
"batchAccountName" = "${random_string.batchname.result}"
"storageAccountID" = "${var.storage_account_id}"
"poolBoostrapScriptUrl" = "${var.pool_bootstrap_script_url}"
"location" = "${var.resource_group_location}"
"poolID" = "${var.pool_id}"
"vmSku" = "${var.vm_sku}"
"lowPriorityNodeCount" = "${var.low_priority_node_count}"
"dedicatedNodeCount" = "${var.dedicated_node_count}"
}
deployment_mode = "Incremental"
template_body = <<DEPLOY
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"batchAccountName": {
"type": "string",
"metadata": {
"description": "Batch Account Name"
}
},
"poolID": {
"type": "string",
"metadata": {
"description": "GPU Pool ID"
}
},
"dedicatedNodeCount": {
"type": "string"
},
"lowPriorityNodeCount": {
"type": "string"
},
"vmSku": {
"type": "string"
},
"storageAccountID": {
"type": "string"
},
"poolBoostrapScriptUrl": {
"type": "string"
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Location for all resources."
}
}
},
"resources": [
{
"type": "Microsoft.Batch/batchAccounts",
"name": "[parameters('batchAccountName')]",
"apiVersion": "2015-12-01",
"location": "[parameters('location')]",
"tags": {
"ObjectName": "[parameters('batchAccountName')]"
},
"properties": {
"autoStorage": {
"storageAccountId": "[parameters('storageAccountID')]"
}
}
},
{
"type": "Microsoft.Batch/batchAccounts/pools",
"name": "[concat(parameters('batchAccountName'), '/', parameters('poolID'))]",
"apiVersion": "2017-09-01",
"scale": null,
"properties": {
"vmSize": "STANDARD_NC6",
"interNodeCommunication": "Disabled",
"maxTasksPerNode": 1,
"taskSchedulingPolicy": {
"nodeFillType": "Spread"
},
"startTask": {
"commandLine": "/bin/bash -c ./init.sh",
"resourceFiles": [
{
"blobSource": "[parameters('poolBoostrapScriptUrl')]",
"fileMode": "777",
"filePath": "./init.sh"
}
],
"userIdentity": {
"autoUser": {
"elevationLevel": "Admin",
"scope": "Pool"
}
},
"waitForSuccess": true,
"maxTaskRetryCount": 0
},
"deploymentConfiguration": {
"virtualMachineConfiguration": {
"imageReference": {
"publisher": "Canonical",
"offer": "UbuntuServer",
"sku": "16.04-LTS",
"version": "latest"
},
"nodeAgentSkuId": "batch.node.ubuntu 16.04"
}
},
"scaleSettings": {
"fixedScale": {
"targetDedicatedNodes": "[parameters('dedicatedNodeCount')]",
"targetLowPriorityNodes": "[parameters('lowPriorityNodeCount')]",
"resizeTimeout": "PT15M"
}
}
},
"dependsOn": [
"[resourceId('Microsoft.Batch/batchAccounts', parameters('batchAccountName'))]"
]
}
]
}
DEPLOY
}
output "name" {
value = "${random_string.batchname.result}"
}


@@ -0,0 +1,43 @@
variable "pool_id" {
type = "string"
description = "Name of the Azure Batch pool to create."
default = "pool1"
}
variable "vm_sku" {
type = "string"
description = "VM SKU to use - Default to NC6 GPU SKU."
default = "STANDARD_NC6"
}
variable "pool_bootstrap_script_url" {
type = "string"
description = "Publicly accessible url used for boostrapping nodes in the pool. Installing GPU drivers, for example."
}
variable "storage_account_id" {
type = "string"
description = "Name of the storage account to be used by Azure Batch"
}
variable "resource_group_name" {
type = "string"
description = "Name of the azure resource group."
default = "akc-rg"
}
variable "resource_group_location" {
type = "string"
description = "Location of the azure resource group."
default = "eastus"
}
variable "low_priority_node_count" {
type = "string"
description = "The number of low priority nodes to allocate to the pool"
}
variable "dedicated_node_count" {
type = "string"
description = "The number dedicated nodes to allocate to the pool"
}


@@ -0,0 +1,19 @@
apiVersion: v1
kind: Pod
metadata:
name: cuda-vector-add
labels:
app: examplegpupod
spec:
restartPolicy: OnFailure
containers:
- name: cuda-vector-add
# https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
image: "k8s.gcr.io/cuda-vector-add:v0.1"
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
nodeName: virtual-kubelet
tolerations:
- key: azure.com/batch
effect: NoSchedule


@@ -0,0 +1,20 @@
apiVersion: v1
kind: Pod
metadata:
name: examplegpujob
spec:
containers:
- image: nvidia/cuda
command: ["nvidia-smi"]
imagePullPolicy: Always
name: nvidia
resources:
requests:
memory: 1G
cpu: 1
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
nodeName: virtual-kubelet
tolerations:
- key: azure.com/batch
effect: NoSchedule


@@ -0,0 +1,53 @@
resource "azurerm_resource_group" "batchrg" {
name = "${var.resource_group_name}"
location = "${var.resource_group_location}"
}
module "aks" {
source = "aks"
//Defaults to using the current ssh key: recommend changing as needed
linux_admin_username = "aks"
linux_admin_ssh_publickey = "${file("~/.ssh/id_rsa.pub")}"
client_id = "${var.client_id}"
client_secret = "${var.client_secret}"
resource_group_name = "${azurerm_resource_group.batchrg.name}"
resource_group_location = "${azurerm_resource_group.batchrg.location}"
}
module "storage" {
source = "storage"
pool_bootstrap_script_path = "./scripts/poolstartup.sh"
resource_group_name = "${azurerm_resource_group.batchrg.name}"
resource_group_location = "${azurerm_resource_group.batchrg.location}"
}
module "azurebatch" {
source = "azurebatch"
storage_account_id = "${module.storage.id}"
pool_bootstrap_script_url = "${module.storage.pool_boostrap_script_url}"
resource_group_name = "${azurerm_resource_group.batchrg.name}"
resource_group_location = "${azurerm_resource_group.batchrg.location}"
dedicated_node_count = 1
low_priority_node_count = 2
}
module "virtualkubelet" {
source = "virtualkubelet"
virtualkubelet_docker_image = "${var.virtualkubelet_docker_image}"
cluster_client_key = "${module.aks.cluster_client_key}"
cluster_client_certificate = "${module.aks.cluster_client_certificate}"
cluster_ca = "${module.aks.cluster_ca}"
cluster_host = "${module.aks.host}"
azure_batch_account_name = "${module.azurebatch.name}"
resource_group_location = "${azurerm_resource_group.batchrg.location}"
}


@@ -0,0 +1,49 @@
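# Bootstrap script for Azure Batch pool nodes: adds the Docker CE and NVIDIA repositories,
# installs the CUDA drivers and nvidia-docker2, then moves the Docker data root onto the
# node's temp disk.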
export DEBIAN_FRONTEND=noninteractive
export TEMP_DISK=/mnt
apt-get install -y -q --no-install-recommends \
build-essential
# Add dockerce repo
apt-get update -y -q --no-install-recommends
apt-get install -y -q -o Dpkg::Options::="--force-confnew" --no-install-recommends \
apt-transport-https ca-certificates curl software-properties-common cgroup-lite
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
#Install latest cuda driver..
CUDA_REPO_PKG=cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
wget -O /tmp/${CUDA_REPO_PKG} http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/${CUDA_REPO_PKG}
sudo dpkg -i /tmp/${CUDA_REPO_PKG}
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
rm -f /tmp/${CUDA_REPO_PKG}
sudo apt-get update -y -q --no-install-recommends
sudo apt-get install cuda-drivers -y -q --no-install-recommends
# install nvidia-docker
curl -fSsL https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
curl -fSsL https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64/nvidia-docker.list | \
tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update -y -q --no-install-recommends
apt-get install -y -q --no-install-recommends -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confnew" nvidia-docker2
systemctl restart docker.service
nvidia-docker version
# prep docker
systemctl stop docker.service
rm -rf /var/lib/docker
mkdir -p /etc/docker
mkdir -p $TEMP_DISK/docker
chmod 777 $TEMP_DISK/docker
echo "{ \"data-root\": \"$TEMP_DISK/docker\", \"hosts\": [ \"unix:///var/run/docker.sock\", \"tcp://127.0.0.1:2375\" ] }" > /etc/docker/daemon.json.merge
python -c "import json;a=json.load(open('/etc/docker/daemon.json.merge'));b=json.load(open('/etc/docker/daemon.json'));a.update(b);f=open('/etc/docker/daemon.json','w');json.dump(a,f);f.close();"
rm -f /etc/docker/daemon.json.merge
sed -i 's|^ExecStart=/usr/bin/dockerd.*|ExecStart=/usr/bin/dockerd|' /lib/systemd/system/docker.service
systemctl daemon-reload
systemctl start docker.service


@@ -0,0 +1,77 @@
resource "random_string" "storage" {
keepers = {
# Generate a new id each time we switch to a new resource group
group_name = "${var.resource_group_name}"
}
length = 8
upper = false
special = false
number = false
}
resource "azurerm_storage_account" "batchstorage" {
name = "${lower(random_string.storage.result)}"
resource_group_name = "${var.resource_group_name}"
location = "${var.resource_group_location}"
account_tier = "Standard"
account_replication_type = "LRS"
}
resource "azurerm_storage_container" "boostrapscript" {
name = "scripts"
resource_group_name = "${var.resource_group_name}"
storage_account_name = "${azurerm_storage_account.batchstorage.name}"
container_access_type = "private"
}
resource "azurerm_storage_blob" "initscript" {
name = "init.sh"
resource_group_name = "${var.resource_group_name}"
storage_account_name = "${azurerm_storage_account.batchstorage.name}"
storage_container_name = "${azurerm_storage_container.boostrapscript.name}"
type = "block"
source = "${var.pool_bootstrap_script_path}"
}
data "azurerm_storage_account_sas" "scriptaccess" {
connection_string = "${azurerm_storage_account.batchstorage.primary_connection_string}"
https_only = true
resource_types {
service = false
container = false
object = true
}
services {
blob = true
queue = false
table = false
file = false
}
start = "${timestamp()}"
expiry = "${timeadd(timestamp(), "8776h")}"
permissions {
read = true
write = false
delete = false
list = false
add = false
create = false
update = false
process = false
}
}
output "pool_boostrap_script_url" {
value = "${azurerm_storage_blob.initscript.url}${data.azurerm_storage_account_sas.scriptaccess.sas}"
}
output "id" {
value = "${azurerm_storage_account.batchstorage.id}"
}


@@ -0,0 +1,14 @@
variable "resource_group_name" {
description = "Resource group name"
type = "string"
}
variable "resource_group_location" {
description = "Resource group location"
type = "string"
}
variable "pool_bootstrap_script_path" {
description = "The filepath of the pool boostrapping script"
type = "string"
}


@@ -0,0 +1,23 @@
variable "client_id" {
type = "string"
description = "Client ID"
}
variable "client_secret" {
type = "string"
description = "Client secret."
}
variable "resource_group_name" {
description = "Resource group name"
type = "string"
}
variable "resource_group_location" {
description = "Resource group location"
type = "string"
}
variable "virtualkubelet_docker_image" {
type = "string"
}


@@ -0,0 +1,14 @@
// Provide the Client ID of a service principal for use by AKS
client_id = "00000000-0000-0000-0000-000000000000"
// Provide the Client Secret of a service principal for use by AKS
client_secret = "00000000-0000-0000-0000-000000000000"
// The resource group you would like to deploy to
resource_group_name = "vkgpu"
// The location of all resources
resource_group_location = "westeurope"
// Virtual Kubelet docker image
virtualkubelet_docker_image = "microsoft/virtual-kubelet"


@@ -0,0 +1,126 @@
provider "kubernetes" {
host = "${var.cluster_host}"
client_certificate = "${var.cluster_client_certificate}"
client_key = "${var.cluster_client_key}"
cluster_ca_certificate = "${var.cluster_ca}"
}
resource "kubernetes_secret" "vkcredentials" {
metadata {
name = "vkcredentials"
}
data {
cert.pem = "${var.cluster_client_certificate}"
key.pem = "${var.cluster_client_key}"
}
}
resource "kubernetes_deployment" "vkdeployment" {
metadata {
name = "vkdeployment"
}
spec {
selector {
app = "virtualkubelet"
}
template {
metadata {
labels {
app = "virtualkubelet"
}
}
spec {
container {
name = "vk"
image = "${var.virtualkubelet_docker_image}"
args = [
"--provider",
"azurebatch",
"--taint",
"azure.com/batch",
"--namespace",
"default",
]
port {
container_port = 10250
protocol = "TCP"
name = "kubeletport"
}
volume_mount {
name = "azure-credentials"
mount_path = "/etc/aks/azure.json"
}
volume_mount {
name = "credentials"
mount_path = "/etc/virtual-kubelet"
}
env = [
{
name = "AZURE_BATCH_ACCOUNT_LOCATION"
value = "${var.resource_group_location}"
},
{
name = "AZURE_BATCH_ACCOUNT_NAME"
value = "${var.azure_batch_account_name}"
},
{
name = "AZURE_BATCH_POOLID"
value = "${var.azure_batch_pool_id}"
},
{
name = "KUBELET_PORT"
value = "10250"
},
{
name = "AZURE_CREDENTIALS_LOCATION"
value = "/etc/aks/azure.json"
},
{
name = "APISERVER_CERT_LOCATION"
value = "/etc/virtual-kubelet/cert.pem"
},
{
name = "APISERVER_KEY_LOCATION"
value = "/etc/virtual-kubelet/key.pem"
},
{
name = "VKUBELET_POD_IP"
value_from {
field_ref {
field_path = "status.podIP"
}
}
},
]
}
volume {
name = "azure-credentials"
host_path {
path = "/etc/kubernetes/azure.json"
}
}
volume {
name = "credentials"
secret {
secret_name = "vkcredentials"
}
}
}
}
}
}


@@ -0,0 +1,41 @@
variable "cluster_client_certificate" {
type = "string"
description = "Cluster client Certificate"
default = "eastus"
}
variable "cluster_client_key" {
type = "string"
description = "Cluster client Certificate Key"
}
variable "cluster_ca" {
type = "string"
description = "Cluster Certificate Authority"
}
variable "cluster_host" {
type = "string"
description = "Cluster Admin API host"
}
variable "virtualkubelet_docker_image" {
type = "string"
description = "The docker image to use for deploying the virtual kubelet"
}
variable "azure_batch_account_name" {
type = "string"
description = "The name of the Azure Batch account to use"
}
variable "azure_batch_pool_id" {
type = "string"
description = "The PoolID to use in Azure batch"
default = "pool1"
}
variable "resource_group_location" {
description = "Resource group location"
type = "string"
}


@@ -0,0 +1,6 @@
{"log":"[Vector addition of 50000 elements]\n","stream":"stdout","time":"2018-05-30T17:02:49.967357287Z"}
{"log":"Copy input data from the host memory to the CUDA device\n","stream":"stdout","time":"2018-05-30T17:02:49.967417086Z"}
{"log":"CUDA kernel launch with 196 blocks of 256 threads\n","stream":"stdout","time":"2018-05-30T17:02:49.967423286Z"}
{"log":"Copy output data from the CUDA device to the host memory\n","stream":"stdout","time":"2018-05-30T17:02:49.967427386Z"}
{"log":"Test PASSED\n","stream":"stdout","time":"2018-05-30T17:02:49.967431286Z"}
{"log":"Done\n","stream":"stdout","time":"2018-05-30T17:02:49.967435286Z"}


@@ -616,7 +616,7 @@ func readLogFile(filename string, tail int) (string, error) {
lines = append(lines, scanner.Text())
}
if tail > 0 && tail < len(lines) {
lines = lines[len(lines)-tail : len(lines)]
lines = lines[len(lines)-tail:]
}
return strings.Join(lines, ""), nil
}
@@ -636,7 +636,7 @@ func (p *CRIProvider) GetContainerLogs(namespace, podName, containerName string,
}
container := pod.containers[containerName]
if container == nil {
return "", fmt.Errorf("Cannot find container %s in pod %s namespace %s", containerName, pod, namespace)
return "", fmt.Errorf("Cannot find container %s in pod %s namespace %s", containerName, podName, namespace)
}
return readLogFile(container.LogPath, tail)