Create a provider to use Azure Batch (#133)
* Started work on provider
* WIP Adding batch provider
* Working basic call into pool client. Need to parameterize the baseurl
* Fixed job creation by manipulating the content-type
* WIP Kicking off containers. Dirty
* [wip] More meat around scheduling simple containers.
* Working on basic task wrapper to co-schedule pods
* WIP on task wrapper
* WIP
* Working pod minimal wrapper for batch
* Integrate pod template code into provider
* Cleaning up
* Move to docker without gpu
* WIP batch integration
* partially working
* Working logs
* Tidy code
* WIP: Testing and readme
* Added readme and terraform deployment for GPU Azure Batch pool.
* Update to enable low priority nodes for gpu
* Fix log formatting bug. Return node logs when container not yet started
* Moved to golang v1.10
* Fix cri test
* Fix up minor docs Issue. Add provider to readme. Add var for vk image.
Committed by Robbie Zhang
parent 1ad6fb434e · commit d6e8b3daf7
providers/azurebatch/README.md (new file, 79 lines)
@@ -0,0 +1,79 @@
# Kubernetes Virtual Kubelet with Azure Batch

[Azure Batch](https://docs.microsoft.com/en-us/azure/batch/) provides an HPC computing environment in Azure for distributed tasks. Azure Batch handles scheduling discrete jobs and tasks across pools of VMs. It is commonly used for batch processing tasks such as rendering.

The Virtual Kubelet integration allows you to take advantage of this from within Kubernetes. The primary use case for the provider is to make it easy to run GPU-based workloads from normal Kubernetes clusters. For example, creating Kubernetes Jobs which train or execute ML models using NVIDIA GPUs, or which transcode video with FFmpeg.

Azure Batch also allows for [low-priority nodes](https://docs.microsoft.com/en-us/azure/batch/batch-low-pri-vms), which can help to reduce cost for non-time-sensitive workloads.

__The [ACI provider](../azure/README.MD) is the best option unless you're looking to utilise some specific features of Azure Batch__.

## Status: Experimental

This provider is currently in the experimental stages. Contributions welcome!

## Quick Start

The following Terraform template deploys an AKS cluster with the Virtual Kubelet, an Azure Batch account and a GPU-enabled Azure Batch pool. The Batch pool contains 1 dedicated NC6 node and 2 low-priority NC6 nodes.

1. Set up Terraform for Azure following [this guide](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/terraform-install-configure)
2. From the command line, move to the deployment folder with `cd ./providers/azurebatch/deployment`, then edit `vars.example.tfvars`, adding your Service Principal details
3. Download the latest version of the community Kubernetes provider for Terraform (the current official Terraform Kubernetes provider doesn't support `Deployments`). Get the correct link [from here](https://github.com/sl1pm4t/terraform-provider-kubernetes/releases) and use it as follows:

```shell
curl -L -o - PUT_RELEASE_BINARY_LINK_YOU_FOUND_HERE | gunzip > terraform-provider-kubernetes
chmod +x ./terraform-provider-kubernetes
```

4. Use `terraform init` to initialize the template
5. Use `terraform plan -var-file=./vars.example.tfvars` and `terraform apply -var-file=./vars.example.tfvars` to deploy the template
6. Run `kubectl describe deployment/vkdeployment` to check that the Virtual Kubelet is running correctly.
7. Run `kubectl create -f examplegpupod.yaml`
8. Run `pods=$(kubectl get pods --selector=app=examplegpupod --show-all --output=jsonpath={.items..metadata.name})` then `kubectl logs $pods` to view the logs. You should see:

```text
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

### Tweaking the Quickstart

You can update [main.tf](./main.tf) to increase the number of nodes allocated to the Azure Batch pool, or update [./aks/main.tf](./aks/main.tf) to increase the number of agent nodes allocated to your AKS cluster.

## Advanced Setup

## Prerequisites

1. An Azure Batch account configured
2. An Azure Batch pool created with the necessary VM spec. VMs in the pool must have:
    - `docker` installed and correctly configured
    - `nvidia-docker` and CUDA drivers installed
3. A Kubernetes cluster
4. An Azure Service Principal with access to the Azure Batch account

## Setup

The provider expects the following environment variables to be configured:

```
ClientID: AZURE_CLIENT_ID
ClientSecret: AZURE_CLIENT_SECRET
ResourceGroup: AZURE_RESOURCE_GROUP
SubscriptionID: AZURE_SUBSCRIPTION_ID
TenantID: AZURE_TENANT_ID
PoolID: AZURE_BATCH_POOLID
JobID (optional): AZURE_BATCH_JOBID
AccountLocation: AZURE_BATCH_ACCOUNT_LOCATION
AccountName: AZURE_BATCH_ACCOUNT_NAME
```
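
For local testing these can be exported in the shell before starting the Virtual Kubelet binary. A minimal sketch, assuming a bash shell; all values below are placeholders for your own service principal and Batch account:

```shell
# Placeholder values - replace with your own service principal and Batch account details
export AZURE_TENANT_ID="00000000-0000-0000-0000-000000000000"
export AZURE_SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
export AZURE_CLIENT_ID="00000000-0000-0000-0000-000000000000"
export AZURE_CLIENT_SECRET="<service-principal-secret>"
export AZURE_RESOURCE_GROUP="my-batch-rg"
export AZURE_BATCH_ACCOUNT_NAME="mybatchaccount"
export AZURE_BATCH_ACCOUNT_LOCATION="eastus"
export AZURE_BATCH_POOLID="pool1"
# Optional: defaults to the node's hostname when unset
export AZURE_BATCH_JOBID="virtualkubeletjob"
```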

## Running

The provider assigns pods to machines in the Azure Batch pool. Each machine can, by default, process only one pod at a time; running more than one pod per machine isn't currently supported and will result in errors.

Azure Batch queues tasks when no machines are available, so pods will sit in the `Pending` phase while waiting for a VM to become available.
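
To observe this queuing behaviour with the quick start example, you can watch the pod phase while the Batch task waits for a node. The pod name and selector below come from `examplegpupod.yaml`; the commands are otherwise illustrative:

```shell
# Prints "Pending" while the Batch task is queued for a free node
kubectl get pod cuda-vector-add --output=jsonpath='{.status.phase}'

# Watch the pod move to Running/Succeeded once a VM becomes available
kubectl get pods --selector=app=examplegpupod --watch
```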
providers/azurebatch/batch.go (new file, 409 lines)
@@ -0,0 +1,409 @@
|
||||
package azurebatch
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"github.com/Azure/go-autorest/autorest"
|
||||
"os"
|
||||
"strings"
|
||||
|
||||
"io/ioutil"
|
||||
"log"
|
||||
"net/http"
|
||||
|
||||
"github.com/Azure/go-autorest/autorest/azure"
|
||||
|
||||
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
|
||||
"github.com/Azure/go-autorest/autorest/to"
|
||||
"github.com/lawrencegripper/pod2docker"
|
||||
"github.com/virtual-kubelet/virtual-kubelet/manager"
|
||||
azureCreds "github.com/virtual-kubelet/virtual-kubelet/providers/azure"
|
||||
"k8s.io/api/core/v1"
|
||||
"k8s.io/apimachinery/pkg/api/resource"
|
||||
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
|
||||
)
|
||||
|
||||
const (
|
||||
podJSONKey string = "virtualkubelet_pod"
|
||||
)
|
||||
|
||||
// Provider the base struct for the Azure Batch provider
|
||||
type Provider struct {
|
||||
batchConfig *Config
|
||||
ctx context.Context
|
||||
cancelCtx context.CancelFunc
|
||||
fileClient *batch.FileClient
|
||||
resourceManager *manager.ResourceManager
|
||||
listTasks func() (*[]batch.CloudTask, error)
|
||||
addTask func(batch.TaskAddParameter) (autorest.Response, error)
|
||||
getTask func(taskID string) (batch.CloudTask, error)
|
||||
deleteTask func(taskID string) (autorest.Response, error)
|
||||
getFileFromTask func(taskID, path string) (result batch.ReadCloser, err error)
|
||||
nodeName string
|
||||
operatingSystem string
|
||||
cpu string
|
||||
memory string
|
||||
pods string
|
||||
internalIP string
|
||||
daemonEndpointPort int32
|
||||
}
|
||||
|
||||
// Config - Basic azure config used to interact with ARM resources.
|
||||
type Config struct {
|
||||
ClientID string
|
||||
ClientSecret string
|
||||
SubscriptionID string
|
||||
TenantID string
|
||||
ResourceGroup string
|
||||
PoolID string
|
||||
JobID string
|
||||
AccountName string
|
||||
AccountLocation string
|
||||
}
|
||||
|
||||
// NewBatchProvider Creates a batch provider
|
||||
func NewBatchProvider(configString string, rm *manager.ResourceManager, nodeName, operatingSystem string, internalIP string, daemonEndpointPort int32) (*Provider, error) {
|
||||
fmt.Println("Starting to create the Azure Batch provider")
|
||||
|
||||
config := &Config{}
|
||||
if azureCredsFilepath := os.Getenv("AZURE_CREDENTIALS_LOCATION"); azureCredsFilepath != "" {
|
||||
creds, err := azureCreds.NewAcsCredential(azureCredsFilepath)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
config.ClientID = creds.ClientID
|
||||
config.ClientSecret = creds.ClientSecret
|
||||
config.SubscriptionID = creds.SubscriptionID
|
||||
config.TenantID = creds.TenantID
|
||||
}
|
||||
|
||||
err := getAzureConfigFromEnv(config)
|
||||
if err != nil {
|
||||
log.Println("Failed to get auth information")
|
||||
}
|
||||
|
||||
return NewBatchProviderFromConfig(config, rm, nodeName, operatingSystem, internalIP, daemonEndpointPort)
|
||||
}
|
||||
|
||||
// NewBatchProviderFromConfig Creates a batch provider
|
||||
func NewBatchProviderFromConfig(config *Config, rm *manager.ResourceManager, nodeName, operatingSystem string, internalIP string, daemonEndpointPort int32) (*Provider, error) {
|
||||
p := Provider{}
|
||||
p.batchConfig = config
|
||||
// Set sane defaults for Capacity in case config is not supplied
|
||||
p.cpu = "20"
|
||||
p.memory = "100Gi"
|
||||
p.pods = "20"
|
||||
p.resourceManager = rm
|
||||
p.operatingSystem = operatingSystem
|
||||
p.nodeName = nodeName
|
||||
p.internalIP = internalIP
|
||||
p.daemonEndpointPort = daemonEndpointPort
|
||||
p.ctx, p.cancelCtx = context.WithCancel(context.Background())
|
||||
|
||||
auth := getAzureADAuthorizer(config, azure.PublicCloud.BatchManagementEndpoint)
|
||||
|
||||
batchBaseURL := getBatchBaseURL(config.AccountName, config.AccountLocation)
|
||||
_, err := getPool(p.ctx, batchBaseURL, config.PoolID, auth)
|
||||
if err != nil {
|
||||
log.Panicf("Error retrieving Azure Batch pool: %v", err)
|
||||
}
|
||||
_, err = createOrGetJob(p.ctx, batchBaseURL, config.JobID, config.PoolID, auth)
|
||||
if err != nil {
|
||||
log.Panicf("Error retrieving/creating Azure Batch job: %v", err)
|
||||
}
|
||||
taskClient := batch.NewTaskClientWithBaseURI(batchBaseURL)
|
||||
taskClient.Authorizer = auth
|
||||
p.listTasks = func() (*[]batch.CloudTask, error) {
|
||||
res, err := taskClient.List(p.ctx, config.JobID, "", "", "", nil, nil, nil, nil, nil)
|
||||
if err != nil {
|
||||
return &[]batch.CloudTask{}, err
|
||||
}
|
||||
currentTasks := res.Values()
|
||||
for res.NotDone() {
|
||||
err = res.Next()
|
||||
if err != nil {
|
||||
return &[]batch.CloudTask{}, err
|
||||
}
|
||||
pageTasks := res.Values()
|
||||
if len(pageTasks) > 0 {
|
||||
currentTasks = append(currentTasks, pageTasks...)
|
||||
}
|
||||
}
|
||||
|
||||
return &currentTasks, nil
|
||||
}
|
||||
p.addTask = func(task batch.TaskAddParameter) (autorest.Response, error) {
|
||||
return taskClient.Add(p.ctx, config.JobID, task, nil, nil, nil, nil)
|
||||
}
|
||||
p.getTask = func(taskID string) (batch.CloudTask, error) {
|
||||
return taskClient.Get(p.ctx, config.JobID, taskID, "", "", nil, nil, nil, nil, "", "", nil, nil)
|
||||
}
|
||||
p.deleteTask = func(taskID string) (autorest.Response, error) {
|
||||
return taskClient.Delete(p.ctx, config.JobID, taskID, nil, nil, nil, nil, "", "", nil, nil)
|
||||
}
|
||||
p.getFileFromTask = func(taskID, path string) (batch.ReadCloser, error) {
|
||||
return p.fileClient.GetFromTask(p.ctx, config.JobID, taskID, path, nil, nil, nil, nil, "", nil, nil)
|
||||
}
|
||||
|
||||
fileClient := batch.NewFileClientWithBaseURI(batchBaseURL)
|
||||
fileClient.Authorizer = auth
|
||||
p.fileClient = &fileClient
|
||||
|
||||
return &p, nil
|
||||
}
|
||||
|
||||
// CreatePod accepts a Pod definition
|
||||
func (p *Provider) CreatePod(pod *v1.Pod) error {
|
||||
log.Println("Creating pod...")
|
||||
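// Render a bash command that runs the pod's containers with Docker on the Batch node (via the pod2docker package).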
podCommand, err := pod2docker.GetBashCommand(pod2docker.PodComponents{
|
||||
InitContainers: pod.Spec.InitContainers,
|
||||
Containers: pod.Spec.Containers,
|
||||
PodName: pod.Name,
|
||||
Volumes: pod.Spec.Volumes,
|
||||
})
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
bytes, err := json.Marshal(pod)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
|
||||
task := batch.TaskAddParameter{
|
||||
DisplayName: to.StringPtr(string(pod.UID)),
|
||||
ID: to.StringPtr(getTaskIDForPod(pod.Namespace, pod.Name)),
|
||||
CommandLine: to.StringPtr(fmt.Sprintf(`/bin/bash -c "%s"`, podCommand)),
|
||||
UserIdentity: &batch.UserIdentity{
|
||||
AutoUser: &batch.AutoUserSpecification{
|
||||
ElevationLevel: batch.Admin,
|
||||
Scope: batch.Pool,
|
||||
},
|
||||
},
|
||||
EnvironmentSettings: &[]batch.EnvironmentSetting{
|
||||
{
|
||||
Name: to.StringPtr(podJSONKey),
|
||||
Value: to.StringPtr(string(bytes)),
|
||||
},
|
||||
},
|
||||
}
|
||||
_, err = p.addTask(task)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// GetPodStatus retrieves the status of a given pod by name.
|
||||
func (p *Provider) GetPodStatus(namespace, name string) (*v1.PodStatus, error) {
|
||||
log.Println("Getting pod status ....")
|
||||
pod, err := p.GetPod(namespace, name)
|
||||
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
if pod == nil {
|
||||
return nil, nil
|
||||
}
|
||||
return &pod.Status, nil
|
||||
}
|
||||
|
||||
// UpdatePod accepts a Pod definition
|
||||
func (p *Provider) UpdatePod(pod *v1.Pod) error {
|
||||
log.Println("Pod Update called: No-op as not implemented")
|
||||
return nil
|
||||
}
|
||||
|
||||
// DeletePod accepts a Pod definition
|
||||
func (p *Provider) DeletePod(pod *v1.Pod) error {
|
||||
taskID := getTaskIDForPod(pod.Namespace, pod.Name)
|
||||
task, err := p.deleteTask(taskID)
|
||||
if err != nil {
|
||||
log.Println(task)
|
||||
log.Println(err)
|
||||
return err
|
||||
}
|
||||
|
||||
log.Printf("Deleting task: %v", taskID)
|
||||
return nil
|
||||
}
|
||||
|
||||
// GetPod returns a pod by name
|
||||
func (p *Provider) GetPod(namespace, name string) (*v1.Pod, error) {
|
||||
log.Println("Getting Pod ...")
|
||||
task, err := p.getTask(getTaskIDForPod(namespace, name))
|
||||
if err != nil {
|
||||
if task.Response.StatusCode == http.StatusNotFound {
|
||||
return nil, nil
|
||||
}
|
||||
log.Println(err)
|
||||
return nil, err
|
||||
}
|
||||
|
||||
pod, err := getPodFromTask(&task)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
|
||||
status, _ := convertTaskToPodStatus(&task)
|
||||
pod.Status = *status
|
||||
|
||||
return pod, nil
|
||||
}
|
||||
|
||||
// GetContainerLogs returns the logs of a container running in a pod by name.
|
||||
func (p *Provider) GetContainerLogs(namespace, podName, containerName string, tail int) (string, error) {
|
||||
log.Println("Getting pod logs ....")
|
||||
|
||||
taskID := getTaskIDForPod(namespace, podName)
|
||||
logFileLocation := fmt.Sprintf("wd/%s.log", containerName)
|
||||
containerLogReader, err := p.getFileFromTask(taskID, logFileLocation)
|
||||
|
||||
if containerLogReader.Response.Response != nil && containerLogReader.StatusCode == http.StatusNotFound {
|
||||
stdoutReader, err := p.getFileFromTask(taskID, "stdout.txt")
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
stderrReader, err := p.getFileFromTask(taskID, "stderr.txt")
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
var builder strings.Builder
|
||||
builderPtr := &builder
|
||||
mustWriteString(builderPtr, "Container still starting....\n")
|
||||
mustWriteString(builderPtr, "Showing startup logs from Azure Batch node instead:\n")
|
||||
mustWriteString(builderPtr, "----- STDOUT -----\n")
|
||||
stdoutBytes, _ := ioutil.ReadAll(*stdoutReader.Value)
|
||||
mustWrite(builderPtr, stdoutBytes)
|
||||
mustWriteString(builderPtr, "\n")
|
||||
|
||||
mustWriteString(builderPtr, "----- STDERR -----\n")
|
||||
stderrBytes, _ := ioutil.ReadAll(*stderrReader.Value)
|
||||
mustWrite(builderPtr, stderrBytes)
|
||||
mustWriteString(builderPtr, "\n")
|
||||
|
||||
return builder.String(), nil
|
||||
}
|
||||
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
result, err := formatLogJSON(containerLogReader)
|
||||
if err != nil {
|
||||
return "", fmt.Errorf("Container log formating failed err: %v", err)
|
||||
}
|
||||
|
||||
return result, nil
|
||||
}
|
||||
|
||||
// GetPods retrieves a list of all pods scheduled to run.
|
||||
func (p *Provider) GetPods() ([]*v1.Pod, error) {
|
||||
log.Println("Getting pods...")
|
||||
tasksPtr, err := p.listTasks()
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
if tasksPtr == nil {
|
||||
return []*v1.Pod{}, nil
|
||||
}
|
||||
|
||||
tasks := *tasksPtr
|
||||
|
||||
pods := make([]*v1.Pod, len(tasks), len(tasks))
|
||||
for i, t := range tasks {
|
||||
pod, err := getPodFromTask(&t)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
pods[i] = pod
|
||||
}
|
||||
return pods, nil
|
||||
}
|
||||
|
||||
// Capacity returns a resource list containing the capacity limits
|
||||
func (p *Provider) Capacity() v1.ResourceList {
|
||||
return v1.ResourceList{
|
||||
"cpu": resource.MustParse(p.cpu),
|
||||
"memory": resource.MustParse(p.memory),
|
||||
"pods": resource.MustParse(p.pods),
|
||||
"nvidia.com/gpu": resource.MustParse("1"),
|
||||
}
|
||||
}
|
||||
|
||||
// NodeConditions returns a list of conditions (Ready, OutOfDisk, etc), for updates to the node status
|
||||
// within Kubernetes.
|
||||
func (p *Provider) NodeConditions() []v1.NodeCondition {
|
||||
return []v1.NodeCondition{
|
||||
{
|
||||
Type: "Ready",
|
||||
Status: v1.ConditionTrue,
|
||||
LastHeartbeatTime: metav1.Now(),
|
||||
LastTransitionTime: metav1.Now(),
|
||||
Reason: "KubeletReady",
|
||||
Message: "kubelet is ready.",
|
||||
},
|
||||
{
|
||||
Type: "OutOfDisk",
|
||||
Status: v1.ConditionFalse,
|
||||
LastHeartbeatTime: metav1.Now(),
|
||||
LastTransitionTime: metav1.Now(),
|
||||
Reason: "KubeletHasSufficientDisk",
|
||||
Message: "kubelet has sufficient disk space available",
|
||||
},
|
||||
{
|
||||
Type: "MemoryPressure",
|
||||
Status: v1.ConditionFalse,
|
||||
LastHeartbeatTime: metav1.Now(),
|
||||
LastTransitionTime: metav1.Now(),
|
||||
Reason: "KubeletHasSufficientMemory",
|
||||
Message: "kubelet has sufficient memory available",
|
||||
},
|
||||
{
|
||||
Type: "DiskPressure",
|
||||
Status: v1.ConditionFalse,
|
||||
LastHeartbeatTime: metav1.Now(),
|
||||
LastTransitionTime: metav1.Now(),
|
||||
Reason: "KubeletHasNoDiskPressure",
|
||||
Message: "kubelet has no disk pressure",
|
||||
},
|
||||
{
|
||||
Type: "NetworkUnavailable",
|
||||
Status: v1.ConditionFalse,
|
||||
LastHeartbeatTime: metav1.Now(),
|
||||
LastTransitionTime: metav1.Now(),
|
||||
Reason: "RouteCreated",
|
||||
Message: "RouteController created a route",
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
// NodeAddresses returns a list of addresses for the node status
|
||||
// within Kubernetes.
|
||||
func (p *Provider) NodeAddresses() []v1.NodeAddress {
|
||||
// TODO: Make these dynamic and augment with custom Azure Batch specific conditions of interest
|
||||
return []v1.NodeAddress{
|
||||
{
|
||||
Type: "InternalIP",
|
||||
Address: p.internalIP,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
// NodeDaemonEndpoints returns NodeDaemonEndpoints for the node status
|
||||
// within Kubernetes.
|
||||
func (p *Provider) NodeDaemonEndpoints() *v1.NodeDaemonEndpoints {
|
||||
return &v1.NodeDaemonEndpoints{
|
||||
KubeletEndpoint: v1.DaemonEndpoint{
|
||||
Port: p.daemonEndpointPort,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
// OperatingSystem returns the operating system for this provider.
|
||||
func (p *Provider) OperatingSystem() string {
|
||||
return p.operatingSystem
|
||||
}
|
||||
providers/azurebatch/batch_helpers.go (new file, 216 lines)
@@ -0,0 +1,216 @@
|
||||
package azurebatch
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"context"
|
||||
"crypto/md5"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"log"
|
||||
"os"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
|
||||
"github.com/Azure/go-autorest/autorest"
|
||||
"github.com/Azure/go-autorest/autorest/adal"
|
||||
"github.com/Azure/go-autorest/autorest/azure"
|
||||
)
|
||||
|
||||
func mustWriteString(builder *strings.Builder, s string) {
|
||||
_, err := builder.WriteString(s)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
||||
|
||||
func mustWrite(builder *strings.Builder, b []byte) {
|
||||
_, err := builder.Write(b)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
}
|
||||
|
||||
// newServicePrincipalTokenFromCredentials creates a new ServicePrincipalToken
// using values from the passed Config.
|
||||
func newServicePrincipalTokenFromCredentials(c *Config, scope string) (*adal.ServicePrincipalToken, error) {
|
||||
oauthConfig, err := adal.NewOAuthConfig(azure.PublicCloud.ActiveDirectoryEndpoint, c.TenantID)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
return adal.NewServicePrincipalToken(*oauthConfig, c.ClientID, c.ClientSecret, scope)
|
||||
}
|
||||
|
||||
// getAzureADAuthorizer returns an authorizer for an Azure service principal
|
||||
func getAzureADAuthorizer(c *Config, azureEndpoint string) autorest.Authorizer {
|
||||
spt, err := newServicePrincipalTokenFromCredentials(c, azureEndpoint)
|
||||
if err != nil {
|
||||
panic(fmt.Sprintf("Failed to create authorizer: %v", err))
|
||||
}
|
||||
auth := autorest.NewBearerAuthorizer(spt)
|
||||
return auth
|
||||
}
|
||||
|
||||
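// getPool checks that the configured Azure Batch pool exists and is in the active state, returning a pool client for further calls.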
func getPool(ctx context.Context, batchBaseURL, poolID string, auth autorest.Authorizer) (*batch.PoolClient, error) {
|
||||
poolClient := batch.NewPoolClientWithBaseURI(batchBaseURL)
|
||||
poolClient.Authorizer = auth
|
||||
poolClient.RetryAttempts = 0
|
||||
|
||||
pool, err := poolClient.Get(ctx, poolID, "*", "", nil, nil, nil, nil, "", "", nil, nil)
|
||||
|
||||
// If we observe an error, return it so the caller can handle it.
|
||||
// 404 is expected if this is first run.
|
||||
if err != nil && pool.Response.Response == nil {
|
||||
log.Printf("Failed to get pool. nil response %v", poolID)
|
||||
return nil, err
|
||||
} else if err != nil && pool.StatusCode == 404 {
|
||||
log.Printf("Pool doesn't exist 404 received Error: %v PoolID: %v", err, poolID)
|
||||
return nil, err
|
||||
} else if err != nil {
|
||||
log.Printf("Failed to get pool. Response:%v", pool.Response)
|
||||
return nil, err
|
||||
}
|
||||
|
||||
if pool.State == batch.PoolStateActive {
|
||||
log.Println("Pool active and running...")
|
||||
return &poolClient, nil
|
||||
}
|
||||
return nil, fmt.Errorf("Pool not in active state: %v", pool.State)
|
||||
}
|
||||
|
||||
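// createOrGetJob returns a job client for the wrapper job, creating the job on the given pool if it doesn't already exist
// (and retrying once a previous job finishes deleting).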
func createOrGetJob(ctx context.Context, batchBaseURL, jobID, poolID string, auth autorest.Authorizer) (*batch.JobClient, error) {
|
||||
jobClient := batch.NewJobClientWithBaseURI(batchBaseURL)
|
||||
jobClient.Authorizer = auth
|
||||
// check if job exists already
|
||||
currentJob, err := jobClient.Get(ctx, jobID, "", "", nil, nil, nil, nil, "", "", nil, nil)
|
||||
|
||||
if err == nil && currentJob.State == batch.JobStateActive {
|
||||
log.Println("Wrapper job already exists...")
|
||||
return &jobClient, nil
|
||||
} else if currentJob.Response.StatusCode == 404 {
|
||||
|
||||
log.Println("Wrapper job missing... creating...")
|
||||
wrapperJob := batch.JobAddParameter{
|
||||
ID: &jobID,
|
||||
PoolInfo: &batch.PoolInformation{
|
||||
PoolID: &poolID,
|
||||
},
|
||||
}
|
||||
|
||||
_, err := jobClient.Add(ctx, wrapperJob, nil, nil, nil, nil)
|
||||
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return &jobClient, nil
|
||||
|
||||
} else if currentJob.State == batch.JobStateDeleting {
|
||||
log.Printf("Job is being deleted... Waiting then will retry")
|
||||
time.Sleep(time.Minute)
|
||||
return createOrGetJob(ctx, batchBaseURL, jobID, poolID, auth)
|
||||
}
|
||||
|
||||
return nil, err
|
||||
}
|
||||
|
||||
func getBatchBaseURL(batchAccountName, batchAccountLocation string) string {
|
||||
return fmt.Sprintf("https://%s.%s.batch.azure.com", batchAccountName, batchAccountLocation)
|
||||
}
|
||||
|
||||
func envHasValue(env string) bool {
|
||||
val := os.Getenv(env)
|
||||
if val == "" {
|
||||
return false
|
||||
}
|
||||
return true
|
||||
}
|
||||
|
||||
// getAzureConfigFromEnv retrieves the Azure configuration from environment variables.
|
||||
func getAzureConfigFromEnv(config *Config) error {
|
||||
if envHasValue("AZURE_CLIENT_ID") {
|
||||
config.ClientID = os.Getenv("AZURE_CLIENT_ID")
|
||||
}
|
||||
if envHasValue("AZURE_CLIENT_SECRET") {
|
||||
config.ClientSecret = os.Getenv("AZURE_CLIENT_SECRET")
|
||||
}
|
||||
if envHasValue("AZURE_RESOURCE_GROUP") {
|
||||
config.ResourceGroup = os.Getenv("AZURE_RESOURCE_GROUP")
|
||||
}
|
||||
if envHasValue("AZURE_SUBSCRIPTION_ID") {
|
||||
config.SubscriptionID = os.Getenv("AZURE_SUBSCRIPTION_ID")
|
||||
}
|
||||
if envHasValue("AZURE_TENANT_ID") {
|
||||
config.TenantID = os.Getenv("AZURE_TENANT_ID")
|
||||
}
|
||||
if envHasValue("AZURE_BATCH_POOLID") {
|
||||
config.PoolID = os.Getenv("AZURE_BATCH_POOLID")
|
||||
}
|
||||
if envHasValue("AZURE_BATCH_JOBID") {
|
||||
config.JobID = os.Getenv("AZURE_BATCH_JOBID")
|
||||
}
|
||||
if envHasValue("AZURE_BATCH_ACCOUNT_LOCATION") {
|
||||
config.AccountLocation = os.Getenv("AZURE_BATCH_ACCOUNT_LOCATION")
|
||||
}
|
||||
if envHasValue("AZURE_BATCH_ACCOUNT_NAME") {
|
||||
config.AccountName = os.Getenv("AZURE_BATCH_ACCOUNT_NAME")
|
||||
}
|
||||
|
||||
if config.JobID == "" {
|
||||
hostname, err := os.Hostname()
|
||||
if err != nil {
|
||||
log.Panic(err)
|
||||
}
|
||||
config.JobID = hostname
|
||||
}
|
||||
|
||||
if config.ClientID == "" ||
|
||||
config.ClientSecret == "" ||
|
||||
config.ResourceGroup == "" ||
|
||||
config.SubscriptionID == "" ||
|
||||
config.PoolID == "" ||
|
||||
config.TenantID == "" {
|
||||
return &ConfigError{CurrentConfig: config, ErrorDetails: "Missing configuration"}
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
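// getTaskIDForPod derives a deterministic Azure Batch task ID for a pod from its namespace and name (md5 hex digest).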
func getTaskIDForPod(namespace, name string) string {
|
||||
ID := []byte(fmt.Sprintf("%s-%s", namespace, name))
|
||||
return fmt.Sprintf("%x", md5.Sum(ID))
|
||||
}
|
||||
|
||||
type jsonLog struct {
|
||||
Log string `json:"log"`
|
||||
}
|
||||
|
||||
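// formatLogJSON converts a JSON-lines container log into plain text by concatenating the "log" field of each line.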
func formatLogJSON(readCloser batch.ReadCloser) (string, error) {
|
||||
// Read line by line as the file isn't valid JSON; each line is a single valid JSON object.
|
||||
scanner := bufio.NewScanner(*readCloser.Value)
|
||||
var b strings.Builder
|
||||
for scanner.Scan() {
|
||||
result := jsonLog{}
|
||||
err := json.Unmarshal(scanner.Bytes(), &result)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
mustWriteString(&b, result.Log)
|
||||
}
|
||||
|
||||
return b.String(), nil
|
||||
}
|
||||
|
||||
// ConfigError - Error when reading configuration values.
|
||||
type ConfigError struct {
|
||||
CurrentConfig *Config
|
||||
ErrorDetails string
|
||||
}
|
||||
|
||||
func (e *ConfigError) Error() string {
|
||||
configJSON, err := json.Marshal(e.CurrentConfig)
|
||||
if err != nil {
|
||||
return e.ErrorDetails
|
||||
}
|
||||
|
||||
return e.ErrorDetails + ": " + string(configJSON)
|
||||
}
|
||||
providers/azurebatch/batch_status.go (new file, 141 lines)
@@ -0,0 +1,141 @@
|
||||
package azurebatch
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
|
||||
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
|
||||
apiv1 "k8s.io/api/core/v1"
|
||||
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
|
||||
)
|
||||
|
||||
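// getPodFromTask rebuilds the original pod object from the JSON stored in the task's environment settings.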
func getPodFromTask(task *batch.CloudTask) (pod *apiv1.Pod, err error) {
|
||||
if task == nil || task.EnvironmentSettings == nil {
|
||||
return nil, fmt.Errorf("invalid input task: %v", task)
|
||||
}
|
||||
|
||||
ok := false
|
||||
jsonData := ""
|
||||
settings := *task.EnvironmentSettings
|
||||
for _, s := range settings {
|
||||
if *s.Name == podJSONKey && s.Value != nil {
|
||||
ok = true
|
||||
jsonData = *s.Value
|
||||
}
|
||||
}
|
||||
if !ok {
|
||||
return nil, fmt.Errorf("task doesn't have pod json stored in it: %v", task.EnvironmentSettings)
|
||||
}
|
||||
|
||||
pod = &apiv1.Pod{}
|
||||
err = json.Unmarshal([]byte(jsonData), pod)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return
|
||||
}
|
||||
|
||||
func convertTaskToPodStatus(task *batch.CloudTask) (status *apiv1.PodStatus, err error) {
|
||||
|
||||
pod, err := getPodFromTask(task)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
|
||||
// TODO: Review individual container status response
|
||||
status = &apiv1.PodStatus{
|
||||
Phase: convertTaskStatusToPodPhase(task),
|
||||
Conditions: []apiv1.PodCondition{},
|
||||
Message: "",
|
||||
Reason: "",
|
||||
HostIP: "",
|
||||
PodIP: "127.0.0.1",
|
||||
StartTime: &pod.CreationTimestamp,
|
||||
}
|
||||
|
||||
for _, container := range pod.Spec.Containers {
|
||||
containerStatus := apiv1.ContainerStatus{
|
||||
Name: container.Name,
|
||||
State: convertTaskStatusToContainerState(task),
|
||||
Ready: true,
|
||||
RestartCount: 0,
|
||||
Image: container.Image,
|
||||
ImageID: "",
|
||||
ContainerID: "",
|
||||
}
|
||||
status.ContainerStatuses = append(status.ContainerStatuses, containerStatus)
|
||||
}
|
||||
|
||||
return
|
||||
}
|
||||
|
||||
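// convertTaskStatusToPodPhase maps an Azure Batch task state onto a Kubernetes pod phase;
// a completed task counts as Succeeded only when its exit code is zero.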
func convertTaskStatusToPodPhase(t *batch.CloudTask) (podPhase apiv1.PodPhase) {
|
||||
switch t.State {
|
||||
case batch.TaskStatePreparing:
|
||||
podPhase = apiv1.PodPending
|
||||
case batch.TaskStateActive:
|
||||
podPhase = apiv1.PodPending
|
||||
case batch.TaskStateRunning:
|
||||
podPhase = apiv1.PodRunning
|
||||
case batch.TaskStateCompleted:
|
||||
podPhase = apiv1.PodFailed
|
||||
|
||||
if t.ExecutionInfo != nil && t.ExecutionInfo.ExitCode != nil && *t.ExecutionInfo.ExitCode == 0 {
|
||||
podPhase = apiv1.PodSucceeded
|
||||
}
|
||||
}
|
||||
|
||||
return
|
||||
}
|
||||
|
||||
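// convertTaskStatusToContainerState maps an Azure Batch task state onto a Kubernetes container state (waiting, running or terminated).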
func convertTaskStatusToContainerState(t *batch.CloudTask) (containerState apiv1.ContainerState) {
|
||||
|
||||
startTime := metav1.Time{}
|
||||
if t.ExecutionInfo != nil {
|
||||
if t.ExecutionInfo.StartTime != nil {
|
||||
startTime.Time = t.ExecutionInfo.StartTime.Time
|
||||
}
|
||||
}
|
||||
|
||||
switch t.State {
|
||||
case batch.TaskStatePreparing:
|
||||
containerState = apiv1.ContainerState{
|
||||
Waiting: &apiv1.ContainerStateWaiting{
|
||||
Message: "Waiting for machine in AzureBatch",
|
||||
Reason: "Preparing",
|
||||
},
|
||||
}
|
||||
case batch.TaskStateActive:
|
||||
containerState = apiv1.ContainerState{
|
||||
Waiting: &apiv1.ContainerStateWaiting{
|
||||
Message: "Waiting for machine in AzureBatch",
|
||||
Reason: "Queued",
|
||||
},
|
||||
}
|
||||
case batch.TaskStateRunning:
|
||||
containerState = apiv1.ContainerState{
|
||||
Running: &apiv1.ContainerStateRunning{
|
||||
StartedAt: startTime,
|
||||
},
|
||||
}
|
||||
case batch.TaskStateCompleted:
|
||||
termStatus := apiv1.ContainerState{
|
||||
Terminated: &apiv1.ContainerStateTerminated{
|
||||
FinishedAt: metav1.Time{
|
||||
Time: t.StateTransitionTime.Time,
|
||||
},
|
||||
StartedAt: startTime,
|
||||
},
|
||||
}
|
||||
|
||||
if t.ExecutionInfo != nil && t.ExecutionInfo.ExitCode != nil {
|
||||
exitCode := *t.ExecutionInfo.ExitCode
|
||||
termStatus.Terminated.ExitCode = exitCode
|
||||
if exitCode != 0 {
|
||||
termStatus.Terminated.Message = *t.ExecutionInfo.FailureInfo.Message
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return
|
||||
}
|
||||
providers/azurebatch/batch_status_test.go (new file, 166 lines)
@@ -0,0 +1,166 @@
|
||||
package azurebatch
|
||||
|
||||
import (
|
||||
"github.com/Azure/go-autorest/autorest/to"
|
||||
"reflect"
|
||||
"testing"
|
||||
|
||||
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
|
||||
apiv1 "k8s.io/api/core/v1"
|
||||
)
|
||||
|
||||
func Test_getPodFromTask(t *testing.T) {
|
||||
type args struct {
|
||||
task *batch.CloudTask
|
||||
}
|
||||
tests := []struct {
|
||||
name string
|
||||
task batch.CloudTask
|
||||
wantPod *apiv1.Pod
|
||||
wantErr bool
|
||||
}{
|
||||
{
|
||||
name: "SimplePod",
|
||||
task: batch.CloudTask{
|
||||
EnvironmentSettings: &[]batch.EnvironmentSetting{
|
||||
{
|
||||
Name: to.StringPtr(podJSONKey),
|
||||
Value: to.StringPtr(`{"metadata":{"creationTimestamp":null},"spec":{"containers":[{"name":"web","image":"nginx:1.12","ports":[{"name":"http","containerPort":80,"protocol":"TCP"}],"resources":{}}]},"status":{}}`),
|
||||
},
|
||||
},
|
||||
},
|
||||
wantPod: &apiv1.Pod{
|
||||
Spec: apiv1.PodSpec{
|
||||
Containers: []apiv1.Container{
|
||||
{
|
||||
Name: "web",
|
||||
Image: "nginx:1.12",
|
||||
Ports: []apiv1.ContainerPort{
|
||||
{
|
||||
Name: "http",
|
||||
Protocol: apiv1.ProtocolTCP,
|
||||
ContainerPort: 80,
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "InvalidJson",
|
||||
task: batch.CloudTask{
|
||||
EnvironmentSettings: &[]batch.EnvironmentSetting{
|
||||
{
|
||||
Name: to.StringPtr(podJSONKey),
|
||||
Value: to.StringPtr("---notjson--"),
|
||||
},
|
||||
},
|
||||
},
|
||||
wantErr: true,
|
||||
},
|
||||
{
|
||||
name: "NilEnvironment",
|
||||
task: batch.CloudTask{
|
||||
EnvironmentSettings: nil,
|
||||
},
|
||||
wantErr: true,
|
||||
},
|
||||
{
|
||||
name: "NilString",
|
||||
task: batch.CloudTask{
|
||||
EnvironmentSettings: &[]batch.EnvironmentSetting{
|
||||
{
|
||||
Name: to.StringPtr(podJSONKey),
|
||||
Value: nil,
|
||||
},
|
||||
},
|
||||
},
|
||||
wantErr: true,
|
||||
},
|
||||
}
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
gotPod, err := getPodFromTask(&tt.task)
|
||||
if (err != nil) != tt.wantErr {
|
||||
t.Errorf("getPodFromTask() error = %v, wantErr %v", err, tt.wantErr)
|
||||
return
|
||||
}
|
||||
|
||||
if !reflect.DeepEqual(gotPod, tt.wantPod) {
|
||||
|
||||
t.Errorf("getPodFromTask() = %v, want %v", gotPod, tt.wantPod)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func Test_convertTaskStatusToPodPhase(t *testing.T) {
|
||||
type args struct {
|
||||
t *batch.CloudTask
|
||||
}
|
||||
tests := []struct {
|
||||
name string
|
||||
task batch.CloudTask
|
||||
wantPodPhase apiv1.PodPhase
|
||||
}{
|
||||
{
|
||||
name: "PreparingTask",
|
||||
task: batch.CloudTask{
|
||||
State: batch.TaskStatePreparing,
|
||||
},
|
||||
wantPodPhase: apiv1.PodPending,
|
||||
},
|
||||
{
|
||||
//Active tasks are sitting in a queue waiting for a node
|
||||
// so maps best to pending state
|
||||
name: "ActiveTask",
|
||||
task: batch.CloudTask{
|
||||
State: batch.TaskStateActive,
|
||||
},
|
||||
wantPodPhase: apiv1.PodPending,
|
||||
},
|
||||
{
|
||||
name: "RunningTask",
|
||||
task: batch.CloudTask{
|
||||
State: batch.TaskStateRunning,
|
||||
},
|
||||
wantPodPhase: apiv1.PodRunning,
|
||||
},
|
||||
{
|
||||
name: "CompletedTask_ExitCode0",
|
||||
task: batch.CloudTask{
|
||||
State: batch.TaskStateCompleted,
|
||||
ExecutionInfo: &batch.TaskExecutionInformation{
|
||||
ExitCode: to.Int32Ptr(0),
|
||||
},
|
||||
},
|
||||
wantPodPhase: apiv1.PodSucceeded,
|
||||
},
|
||||
{
|
||||
name: "CompletedTask_ExitCode127",
|
||||
task: batch.CloudTask{
|
||||
State: batch.TaskStateCompleted,
|
||||
ExecutionInfo: &batch.TaskExecutionInformation{
|
||||
ExitCode: to.Int32Ptr(127),
|
||||
},
|
||||
},
|
||||
wantPodPhase: apiv1.PodFailed,
|
||||
},
|
||||
{
|
||||
name: "CompletedTask_nilExecInfo",
|
||||
task: batch.CloudTask{
|
||||
State: batch.TaskStateCompleted,
|
||||
ExecutionInfo: nil,
|
||||
},
|
||||
wantPodPhase: apiv1.PodFailed,
|
||||
},
|
||||
}
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
if gotPodPhase := convertTaskStatusToPodPhase(&tt.task); !reflect.DeepEqual(gotPodPhase, tt.wantPodPhase) {
|
||||
t.Errorf("convertTaskStatusToPodPhase() = %v, want %v", gotPodPhase, tt.wantPodPhase)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
providers/azurebatch/batch_test.go (new file, 165 lines)
@@ -0,0 +1,165 @@
|
||||
package azurebatch
|
||||
|
||||
import (
|
||||
"crypto/md5"
|
||||
"fmt"
|
||||
"github.com/Azure/azure-sdk-for-go/services/batch/2017-09-01.6.0/batch"
|
||||
"github.com/Azure/go-autorest/autorest"
|
||||
"io/ioutil"
|
||||
"net/http"
|
||||
"os"
|
||||
"strings"
|
||||
"testing"
|
||||
|
||||
apiv1 "k8s.io/api/core/v1"
|
||||
)
|
||||
|
||||
func Test_deletePod(t *testing.T) {
|
||||
podNamespace := "bob"
|
||||
podName := "marley"
|
||||
concatName := []byte(fmt.Sprintf("%s-%s", podNamespace, podName))
|
||||
expectedDeleteTaskID := fmt.Sprintf("%x", md5.Sum(concatName))
|
||||
|
||||
provider := Provider{}
|
||||
provider.deleteTask = func(taskID string) (autorest.Response, error) {
|
||||
if taskID != expectedDeleteTaskID {
|
||||
t.Errorf("Deleted wrong task! Expected delete: %v Actual: %v", taskID, expectedDeleteTaskID)
|
||||
}
|
||||
return autorest.Response{}, nil
|
||||
}
|
||||
|
||||
pod := &apiv1.Pod{}
|
||||
pod.Name = podName
|
||||
pod.Namespace = podNamespace
|
||||
|
||||
err := provider.DeletePod(pod)
|
||||
if err != nil {
|
||||
t.Error(err)
|
||||
}
|
||||
}
|
||||
|
||||
func Test_deletePod_doesntExist(t *testing.T) {
|
||||
pod := &apiv1.Pod{}
|
||||
pod.Namespace = "bob"
|
||||
pod.Name = "marley"
|
||||
|
||||
provider := Provider{}
|
||||
provider.deleteTask = func(taskID string) (autorest.Response, error) {
|
||||
return autorest.Response{}, fmt.Errorf("Task doesn't exist")
|
||||
}
|
||||
|
||||
err := provider.DeletePod(pod)
|
||||
if err == nil {
|
||||
t.Error("Expected error but none seen")
|
||||
}
|
||||
}
|
||||
|
||||
func Test_createPod(t *testing.T) {
|
||||
pod := &apiv1.Pod{}
|
||||
pod.Namespace = "bob"
|
||||
pod.Name = "marley"
|
||||
|
||||
provider := Provider{}
|
||||
provider.addTask = func(task batch.TaskAddParameter) (autorest.Response, error) {
|
||||
if task.CommandLine == nil || *task.CommandLine == "" {
|
||||
t.Error("Missing commandline args")
|
||||
}
|
||||
|
||||
derefVars := *task.EnvironmentSettings
|
||||
if len(derefVars) != 1 || *derefVars[0].Name != podJSONKey {
|
||||
t.Error("Missing pod json")
|
||||
}
|
||||
return autorest.Response{}, nil
|
||||
}
|
||||
|
||||
err := provider.CreatePod(pod)
|
||||
if err != nil {
|
||||
t.Errorf("Unexpected error creating pod %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
func Test_createPod_errorResponse(t *testing.T) {
|
||||
pod := &apiv1.Pod{}
|
||||
pod.Namespace = "bob"
|
||||
pod.Name = "marley"
|
||||
|
||||
provider := Provider{}
|
||||
provider.addTask = func(task batch.TaskAddParameter) (autorest.Response, error) {
|
||||
return autorest.Response{}, fmt.Errorf("Failed creating task")
|
||||
}
|
||||
|
||||
err := provider.CreatePod(pod)
|
||||
if err == nil {
|
||||
t.Error("Expected error but none seen")
|
||||
}
|
||||
}
|
||||
|
||||
func Test_readLogs_404Response_expectReturnStartupLogs(t *testing.T) {
|
||||
pod := &apiv1.Pod{}
|
||||
pod.Namespace = "bob"
|
||||
pod.Name = "marley"
|
||||
containerName := "sam"
|
||||
|
||||
provider := Provider{}
|
||||
provider.getFileFromTask = func(taskID, path string) (batch.ReadCloser, error) {
|
||||
if path == "wd/sam.log" {
|
||||
// Autorest - Seriously? Can't find a better way to make a 404 :(
|
||||
return batch.ReadCloser{Response: autorest.Response{Response: &http.Response{StatusCode: 404}}}, nil
|
||||
} else if path == "stderr.txt" {
|
||||
response := ioutil.NopCloser(strings.NewReader("stderrResponse"))
|
||||
return batch.ReadCloser{Value: &response}, nil
|
||||
} else if path == "stdout.txt" {
|
||||
response := ioutil.NopCloser(strings.NewReader("stdoutResponse"))
|
||||
return batch.ReadCloser{Value: &response}, nil
|
||||
} else {
|
||||
t.Errorf("Unexpected Filepath: %v", path)
|
||||
}
|
||||
|
||||
return batch.ReadCloser{}, fmt.Errorf("Failed in test mock of getFileFromTask")
|
||||
}
|
||||
|
||||
result, err := provider.GetContainerLogs(pod.Namespace, pod.Name, containerName, 0)
|
||||
if err != nil {
|
||||
t.Errorf("GetContainerLogs return error: %v", err)
|
||||
}
|
||||
|
||||
fmt.Print(result)
|
||||
|
||||
if !strings.Contains(result, "stderrResponse") || !strings.Contains(result, "stdoutResponse") {
|
||||
t.Errorf("Result didn't contain expected content have: %v", result)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func Test_readLogs_JsonResponse_expectFormattedLogs(t *testing.T) {
|
||||
pod := &apiv1.Pod{}
|
||||
pod.Namespace = "bob"
|
||||
pod.Name = "marley"
|
||||
containerName := "sam"
|
||||
|
||||
provider := Provider{}
|
||||
provider.getFileFromTask = func(taskID, path string) (batch.ReadCloser, error) {
|
||||
if path == "wd/sam.log" {
|
||||
fileReader, err := os.Open("./testdata/logresponse.json")
|
||||
if err != nil {
|
||||
t.Error(err)
|
||||
}
|
||||
readCloser := ioutil.NopCloser(fileReader)
|
||||
return batch.ReadCloser{Value: &readCloser, Response: autorest.Response{Response: &http.Response{StatusCode: 200}}}, nil
|
||||
}
|
||||
|
||||
t.Errorf("Unexpected Filepath: %v", path)
|
||||
return batch.ReadCloser{}, fmt.Errorf("Failed in test mock of getFileFromTask")
|
||||
}
|
||||
|
||||
result, err := provider.GetContainerLogs(pod.Namespace, pod.Name, containerName, 0)
|
||||
if err != nil {
|
||||
t.Errorf("GetContainerLogs return error: %v", err)
|
||||
}
|
||||
|
||||
fmt.Print(result)
|
||||
if !strings.Contains(result, "Copy output data from the CUDA device to the host memory") || strings.Contains(result, "{") {
|
||||
t.Errorf("Result didn't contain expected content have or had json: %v", result)
|
||||
}
|
||||
|
||||
}
|
||||
providers/azurebatch/deployment/aks/main.tf (new file, 59 lines)
@@ -0,0 +1,59 @@
|
||||
resource "random_id" "workspace" {
|
||||
keepers = {
|
||||
# Generate a new id each time we switch to a new resource group
|
||||
group_name = "${var.resource_group_name}"
|
||||
}
|
||||
|
||||
byte_length = 8
|
||||
}
|
||||
|
||||
# An attempt to keep the AKS name (and DNS label) somewhat unique
|
||||
resource "random_integer" "random_int" {
|
||||
min = 100
|
||||
max = 999
|
||||
}
|
||||
|
||||
resource "azurerm_kubernetes_cluster" "aks" {
|
||||
name = "aks-${random_integer.random_int.result}"
|
||||
location = "${var.resource_group_location}"
|
||||
dns_prefix = "aks-${random_integer.random_int.result}"
|
||||
|
||||
resource_group_name = "${var.resource_group_name}"
|
||||
kubernetes_version = "1.9.2"
|
||||
|
||||
linux_profile {
|
||||
admin_username = "${var.linux_admin_username}"
|
||||
|
||||
ssh_key {
|
||||
key_data = "${var.linux_admin_ssh_publickey}"
|
||||
}
|
||||
}
|
||||
|
||||
agent_pool_profile {
|
||||
name = "agentpool"
|
||||
count = "2"
|
||||
vm_size = "Standard_DS2_v2"
|
||||
os_type = "Linux"
|
||||
}
|
||||
|
||||
service_principal {
|
||||
client_id = "${var.client_id}"
|
||||
client_secret = "${var.client_secret}"
|
||||
}
|
||||
}
|
||||
|
||||
output "cluster_client_certificate" {
|
||||
value = "${base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_certificate)}"
|
||||
}
|
||||
|
||||
output "cluster_client_key" {
|
||||
value = "${base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.client_key)}"
|
||||
}
|
||||
|
||||
output "cluster_ca" {
|
||||
value = "${base64decode(azurerm_kubernetes_cluster.aks.kube_config.0.cluster_ca_certificate)}"
|
||||
}
|
||||
|
||||
output "host" {
|
||||
value = "${azurerm_kubernetes_cluster.aks.kube_config.0.host}"
|
||||
}
|
||||
providers/azurebatch/deployment/aks/variables.tf (new file, 31 lines)
@@ -0,0 +1,31 @@
|
||||
variable "client_id" {
|
||||
type = "string"
|
||||
description = "Client ID"
|
||||
}
|
||||
|
||||
variable "client_secret" {
|
||||
type = "string"
|
||||
description = "Client secret."
|
||||
}
|
||||
|
||||
variable "resource_group_name" {
|
||||
type = "string"
|
||||
description = "Name of the azure resource group."
|
||||
default = "akc-rg"
|
||||
}
|
||||
|
||||
variable "resource_group_location" {
|
||||
type = "string"
|
||||
description = "Location of the azure resource group."
|
||||
default = "eastus"
|
||||
}
|
||||
|
||||
variable "linux_admin_username" {
|
||||
type = "string"
|
||||
description = "User name for authentication to the Kubernetes linux agent virtual machines in the cluster."
|
||||
}
|
||||
|
||||
variable "linux_admin_ssh_publickey" {
|
||||
type = "string"
|
||||
description = "Configure all the linux virtual machines in the cluster with the SSH RSA public key string. The key should include three parts, for example 'ssh-rsa AAAAB...snip...UcyupgH azureuser@linuxvm'"
|
||||
}
|
||||
providers/azurebatch/deployment/azurebatch/main.tf (new file, 146 lines)
@@ -0,0 +1,146 @@
|
||||
resource "random_string" "batchname" {
|
||||
keepers = {
|
||||
# Generate a new id each time we switch to a new resource group
|
||||
group_name = "${var.resource_group_name}"
|
||||
}
|
||||
|
||||
length = 8
|
||||
upper = false
|
||||
special = false
|
||||
number = false
|
||||
}
|
||||
|
||||
resource "azurerm_template_deployment" "test" {
|
||||
name = "tfdeployment"
|
||||
resource_group_name = "${var.resource_group_name}"
|
||||
|
||||
# these key-value pairs are passed into the ARM Template's `parameters` block
|
||||
parameters {
|
||||
"batchAccountName" = "${random_string.batchname.result}"
|
||||
"storageAccountID" = "${var.storage_account_id}"
|
||||
"poolBoostrapScriptUrl" = "${var.pool_bootstrap_script_url}"
|
||||
"location" = "${var.resource_group_location}"
|
||||
"poolID" = "${var.pool_id}"
|
||||
"vmSku" = "${var.vm_sku}"
|
||||
"lowPriorityNodeCount" = "${var.low_priority_node_count}"
|
||||
"dedicatedNodeCount" = "${var.dedicated_node_count}"
|
||||
}
|
||||
|
||||
deployment_mode = "Incremental"
|
||||
|
||||
template_body = <<DEPLOY
|
||||
{
|
||||
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
|
||||
"contentVersion": "1.0.0.0",
|
||||
"parameters": {
|
||||
"batchAccountName": {
|
||||
"type": "string",
|
||||
"metadata": {
|
||||
"description": "Batch Account Name"
|
||||
}
|
||||
},
|
||||
"poolID": {
|
||||
"type": "string",
|
||||
"metadata": {
|
||||
"description": "GPU Pool ID"
|
||||
}
|
||||
},
|
||||
"dedicatedNodeCount": {
|
||||
"type": "string"
|
||||
},
|
||||
"lowPriorityNodeCount": {
|
||||
"type": "string"
|
||||
},
|
||||
"vmSku": {
|
||||
"type": "string"
|
||||
},
|
||||
"storageAccountID": {
|
||||
"type": "string"
|
||||
},
|
||||
"poolBoostrapScriptUrl": {
|
||||
"type": "string"
|
||||
},
|
||||
"location": {
|
||||
"type": "string",
|
||||
"defaultValue": "[resourceGroup().location]",
|
||||
"metadata": {
|
||||
"description": "Location for all resources."
|
||||
}
|
||||
}
|
||||
},
|
||||
"resources": [
|
||||
{
|
||||
"type": "Microsoft.Batch/batchAccounts",
|
||||
"name": "[parameters('batchAccountName')]",
|
||||
"apiVersion": "2015-12-01",
|
||||
"location": "[parameters('location')]",
|
||||
"tags": {
|
||||
"ObjectName": "[parameters('batchAccountName')]"
|
||||
},
|
||||
"properties": {
|
||||
"autoStorage": {
|
||||
"storageAccountId": "[parameters('storageAccountID')]"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "Microsoft.Batch/batchAccounts/pools",
|
||||
"name": "[concat(parameters('batchAccountName'), '/', parameters('poolID'))]",
|
||||
"apiVersion": "2017-09-01",
|
||||
"scale": null,
|
||||
"properties": {
|
||||
"vmSize": "STANDARD_NC6",
|
||||
"interNodeCommunication": "Disabled",
|
||||
"maxTasksPerNode": 1,
|
||||
"taskSchedulingPolicy": {
|
||||
"nodeFillType": "Spread"
|
||||
},
|
||||
"startTask": {
|
||||
"commandLine": "/bin/bash -c ./init.sh",
|
||||
"resourceFiles": [
|
||||
{
|
||||
"blobSource": "[parameters('poolBoostrapScriptUrl')]",
|
||||
"fileMode": "777",
|
||||
"filePath": "./init.sh"
|
||||
}
|
||||
],
|
||||
"userIdentity": {
|
||||
"autoUser": {
|
||||
"elevationLevel": "Admin",
|
||||
"scope": "Pool"
|
||||
}
|
||||
},
|
||||
"waitForSuccess": true,
|
||||
"maxTaskRetryCount": 0
|
||||
},
|
||||
"deploymentConfiguration": {
|
||||
"virtualMachineConfiguration": {
|
||||
"imageReference": {
|
||||
"publisher": "Canonical",
|
||||
"offer": "UbuntuServer",
|
||||
"sku": "16.04-LTS",
|
||||
"version": "latest"
|
||||
},
|
||||
"nodeAgentSkuId": "batch.node.ubuntu 16.04"
|
||||
}
|
||||
},
|
||||
"scaleSettings": {
|
||||
"fixedScale": {
|
||||
"targetDedicatedNodes": "[parameters('dedicatedNodeCount')]",
|
||||
"targetLowPriorityNodes": "[parameters('lowPriorityNodeCount')]",
|
||||
"resizeTimeout": "PT15M"
|
||||
}
|
||||
}
|
||||
},
|
||||
"dependsOn": [
|
||||
"[resourceId('Microsoft.Batch/batchAccounts', parameters('batchAccountName'))]"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
DEPLOY
|
||||
}
|
||||
|
||||
output "name" {
|
||||
value = "${random_string.batchname.result}"
|
||||
}
|
||||
providers/azurebatch/deployment/azurebatch/variables.tf (new file, 43 lines)
@@ -0,0 +1,43 @@
|
||||
variable "pool_id" {
|
||||
type = "string"
|
||||
description = "Name of the Azure Batch pool to create."
|
||||
default = "pool1"
|
||||
}
|
||||
|
||||
variable "vm_sku" {
|
||||
type = "string"
|
||||
description = "VM SKU to use - Default to NC6 GPU SKU."
|
||||
default = "STANDARD_NC6"
|
||||
}
|
||||
|
||||
variable "pool_bootstrap_script_url" {
|
||||
type = "string"
|
||||
description = "Publicly accessible url used for boostrapping nodes in the pool. Installing GPU drivers, for example."
|
||||
}
|
||||
|
||||
variable "storage_account_id" {
|
||||
type = "string"
|
||||
description = "Name of the storage account to be used by Azure Batch"
|
||||
}
|
||||
|
||||
variable "resource_group_name" {
|
||||
type = "string"
|
||||
description = "Name of the azure resource group."
|
||||
default = "akc-rg"
|
||||
}
|
||||
|
||||
variable "resource_group_location" {
|
||||
type = "string"
|
||||
description = "Location of the azure resource group."
|
||||
default = "eastus"
|
||||
}
|
||||
|
||||
variable "low_priority_node_count" {
|
||||
type = "string"
|
||||
description = "The number of low priority nodes to allocate to the pool"
|
||||
}
|
||||
|
||||
variable "dedicated_node_count" {
|
||||
type = "string"
|
||||
description = "The number dedicated nodes to allocate to the pool"
|
||||
}
|
||||
providers/azurebatch/deployment/examplegpupod.yaml (new file, 19 lines)
@@ -0,0 +1,19 @@
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: cuda-vector-add
|
||||
labels:
|
||||
app: examplegpupod
|
||||
spec:
|
||||
restartPolicy: OnFailure
|
||||
containers:
|
||||
- name: cuda-vector-add
|
||||
# https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
|
||||
image: "k8s.gcr.io/cuda-vector-add:v0.1"
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1 # requesting 1 GPU
|
||||
nodeName: virtual-kubelet
|
||||
tolerations:
|
||||
- key: azure.com/batch
|
||||
effect: NoSchedule
|
||||
providers/azurebatch/deployment/examplenvidiasmi.yaml (new file, 20 lines)
@@ -0,0 +1,20 @@
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: examplegpujob
|
||||
spec:
|
||||
containers:
|
||||
- image: nvidia/cuda
|
||||
command: ["nvidia-smi"]
|
||||
imagePullPolicy: Always
|
||||
name: nvidia
|
||||
resources:
|
||||
requests:
|
||||
memory: 1G
|
||||
cpu: 1
|
||||
limits:
|
||||
nvidia.com/gpu: 1 # requesting 1 GPU
|
||||
nodeName: virtual-kubelet
|
||||
tolerations:
|
||||
- key: azure.com/batch
|
||||
effect: NoSchedule
|
||||
providers/azurebatch/deployment/main.tf (new file, 53 lines)
@@ -0,0 +1,53 @@
|
||||
resource "azurerm_resource_group" "batchrg" {
|
||||
name = "${var.resource_group_name}"
|
||||
location = "${var.resource_group_location}"
|
||||
}
|
||||
|
||||
module "aks" {
|
||||
source = "aks"
|
||||
|
||||
// Defaults to using the current SSH key: recommend changing as needed
|
||||
linux_admin_username = "aks"
|
||||
linux_admin_ssh_publickey = "${file("~/.ssh/id_rsa.pub")}"
|
||||
|
||||
client_id = "${var.client_id}"
|
||||
client_secret = "${var.client_secret}"
|
||||
|
||||
resource_group_name = "${azurerm_resource_group.batchrg.name}"
|
||||
resource_group_location = "${azurerm_resource_group.batchrg.location}"
|
||||
}
|
||||
|
||||
module "storage" {
|
||||
source = "storage"
|
||||
pool_bootstrap_script_path = "./scripts/poolstartup.sh"
|
||||
|
||||
resource_group_name = "${azurerm_resource_group.batchrg.name}"
|
||||
resource_group_location = "${azurerm_resource_group.batchrg.location}"
|
||||
}
|
||||
|
||||
module "azurebatch" {
|
||||
source = "azurebatch"
|
||||
|
||||
storage_account_id = "${module.storage.id}"
|
||||
pool_bootstrap_script_url = "${module.storage.pool_boostrap_script_url}"
|
||||
|
||||
resource_group_name = "${azurerm_resource_group.batchrg.name}"
|
||||
resource_group_location = "${azurerm_resource_group.batchrg.location}"
|
||||
|
||||
dedicated_node_count = 1
|
||||
low_priority_node_count = 2
|
||||
}
|
||||
|
||||
module "virtualkubelet" {
|
||||
source = "virtualkubelet"
|
||||
virtualkubelet_docker_image = "${var.virtualkubelet_docker_image}"
|
||||
|
||||
cluster_client_key = "${module.aks.cluster_client_key}"
|
||||
cluster_client_certificate = "${module.aks.cluster_client_certificate}"
|
||||
cluster_ca = "${module.aks.cluster_ca}"
|
||||
cluster_host = "${module.aks.host}"
|
||||
|
||||
azure_batch_account_name = "${module.azurebatch.name}"
|
||||
|
||||
resource_group_location = "${azurerm_resource_group.batchrg.location}"
|
||||
}
|
||||
providers/azurebatch/deployment/scripts/poolstartup.sh (new file, 49 lines)
@@ -0,0 +1,49 @@
|
||||
export DEBIAN_FRONTEND=noninteractive
|
||||
export TEMP_DISK=/mnt
|
||||
|
||||
apt-get install -y -q --no-install-recommends \
|
||||
build-essential
|
||||
|
||||
|
||||
# Add dockerce repo
|
||||
apt-get update -y -q --no-install-recommends
|
||||
apt-get install -y -q -o Dpkg::Options::="--force-confnew" --no-install-recommends \
|
||||
apt-transport-https ca-certificates curl software-properties-common cgroup-lite
|
||||
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
|
||||
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
|
||||
apt-get update
|
||||
|
||||
|
||||
#Install latest cuda driver..
|
||||
CUDA_REPO_PKG=cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
|
||||
wget -O /tmp/${CUDA_REPO_PKG} http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/${CUDA_REPO_PKG}
|
||||
sudo dpkg -i /tmp/${CUDA_REPO_PKG}
|
||||
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
|
||||
rm -f /tmp/${CUDA_REPO_PKG}
|
||||
sudo apt-get update -y -q --no-install-recommends
|
||||
sudo apt-get install cuda-drivers -y -q --no-install-recommends
|
||||
|
||||
# install nvidia-docker
|
||||
curl -fSsL https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
|
||||
curl -fSsL https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64/nvidia-docker.list | \
|
||||
tee /etc/apt/sources.list.d/nvidia-docker.list
|
||||
apt-get update -y -q --no-install-recommends
|
||||
apt-get install -y -q --no-install-recommends -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confnew" nvidia-docker2
|
||||
systemctl restart docker.service
|
||||
nvidia-docker version
|
||||
|
||||
# prep docker
|
||||
systemctl stop docker.service
|
||||
rm -rf /var/lib/docker
|
||||
mkdir -p /etc/docker
|
||||
mkdir -p $TEMP_DISK/docker
|
||||
chmod 777 $TEMP_DISK/docker
|
||||
echo "{ \"data-root\": \"$TEMP_DISK/docker\", \"hosts\": [ \"unix:///var/run/docker.sock\", \"tcp://127.0.0.1:2375\" ] }" > /etc/docker/daemon.json.merge
|
||||
python -c "import json;a=json.load(open('/etc/docker/daemon.json.merge'));b=json.load(open('/etc/docker/daemon.json'));a.update(b);f=open('/etc/docker/daemon.json','w');json.dump(a,f);f.close();"
|
||||
rm -f /etc/docker/daemon.json.merge
|
||||
sed -i 's|^ExecStart=/usr/bin/dockerd.*|ExecStart=/usr/bin/dockerd|' /lib/systemd/system/docker.service
|
||||
systemctl daemon-reload
|
||||
systemctl start docker.service
|
||||
|
||||
|
||||
|
||||
providers/azurebatch/deployment/storage/main.tf (new file, 77 lines)
@@ -0,0 +1,77 @@
resource "random_string" "storage" {
  keepers = {
    # Generate a new id each time we switch to a new resource group
    group_name = "${var.resource_group_name}"
  }

  length  = 8
  upper   = false
  special = false
  number  = false
}

resource "azurerm_storage_account" "batchstorage" {
  name                     = "${lower(random_string.storage.result)}"
  resource_group_name      = "${var.resource_group_name}"
  location                 = "${var.resource_group_location}"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_container" "boostrapscript" {
  name                  = "scripts"
  resource_group_name   = "${var.resource_group_name}"
  storage_account_name  = "${azurerm_storage_account.batchstorage.name}"
  container_access_type = "private"
}

resource "azurerm_storage_blob" "initscript" {
  name = "init.sh"

  resource_group_name    = "${var.resource_group_name}"
  storage_account_name   = "${azurerm_storage_account.batchstorage.name}"
  storage_container_name = "${azurerm_storage_container.boostrapscript.name}"

  type   = "block"
  source = "${var.pool_bootstrap_script_path}"
}

data "azurerm_storage_account_sas" "scriptaccess" {
  connection_string = "${azurerm_storage_account.batchstorage.primary_connection_string}"
  https_only        = true

  resource_types {
    service   = false
    container = false
    object    = true
  }

  services {
    blob  = true
    queue = false
    table = false
    file  = false
  }

  start  = "${timestamp()}"
  expiry = "${timeadd(timestamp(), "8776h")}"

  permissions {
    read    = true
    write   = false
    delete  = false
    list    = false
    add     = false
    create  = false
    update  = false
    process = false
  }
}

output "pool_boostrap_script_url" {
  value = "${azurerm_storage_blob.initscript.url}${data.azurerm_storage_account_sas.scriptaccess.sas}"
}

output "id" {
  value = "${azurerm_storage_account.batchstorage.id}"
}
14
providers/azurebatch/deployment/storage/variables.tf
Normal file
@@ -0,0 +1,14 @@
variable "resource_group_name" {
  description = "Resource group name"
  type        = "string"
}

variable "resource_group_location" {
  description = "Resource group location"
  type        = "string"
}

variable "pool_bootstrap_script_path" {
  description = "The file path of the pool bootstrapping script"
  type        = "string"
}
23
providers/azurebatch/deployment/variables.tf
Normal file
@@ -0,0 +1,23 @@
variable "client_id" {
  type        = "string"
  description = "Client ID of the service principal used by AKS"
}

variable "client_secret" {
  type        = "string"
  description = "Client secret of the service principal used by AKS"
}

variable "resource_group_name" {
  description = "Resource group name"
  type        = "string"
}

variable "resource_group_location" {
  description = "Resource group location"
  type        = "string"
}

variable "virtualkubelet_docker_image" {
  type        = "string"
  description = "Docker image used to run the Virtual Kubelet"
}
14
providers/azurebatch/deployment/vars.example.tfvars
Normal file
@@ -0,0 +1,14 @@
// Provide the Client ID of a service principal for use by AKS
client_id = "00000000-0000-0000-0000-000000000000"

// Provide the Client Secret of a service principal for use by AKS
client_secret = "00000000-0000-0000-0000-000000000000"

// The resource group you would like to deploy to
resource_group_name = "vkgpu"

// The location of all resources
resource_group_location = "westeurope"

// Virtual Kubelet Docker image
virtualkubelet_docker_image = "microsoft/virtual-kubelet"
126
providers/azurebatch/deployment/virtualkubelet/main.tf
Normal file
@@ -0,0 +1,126 @@
provider "kubernetes" {
  host = "${var.cluster_host}"

  client_certificate     = "${var.cluster_client_certificate}"
  client_key             = "${var.cluster_client_key}"
  cluster_ca_certificate = "${var.cluster_ca}"
}

resource "kubernetes_secret" "vkcredentials" {
  metadata {
    name = "vkcredentials"
  }

  data {
    "cert.pem" = "${var.cluster_client_certificate}"
    "key.pem"  = "${var.cluster_client_key}"
  }
}

resource "kubernetes_deployment" "vkdeployment" {
  metadata {
    name = "vkdeployment"
  }

  spec {
    selector {
      app = "virtualkubelet"
    }

    template {
      metadata {
        labels {
          app = "virtualkubelet"
        }
      }

      spec {
        container {
          name  = "vk"
          image = "${var.virtualkubelet_docker_image}"

          args = [
            "--provider",
            "azurebatch",
            "--taint",
            "azure.com/batch",
            "--namespace",
            "default",
          ]

          port {
            container_port = 10250
            protocol       = "TCP"
            name           = "kubeletport"
          }

          volume_mount {
            name       = "azure-credentials"
            mount_path = "/etc/aks/azure.json"
          }

          volume_mount {
            name       = "credentials"
            mount_path = "/etc/virtual-kubelet"
          }

          env = [
            {
              name  = "AZURE_BATCH_ACCOUNT_LOCATION"
              value = "${var.resource_group_location}"
            },
            {
              name  = "AZURE_BATCH_ACCOUNT_NAME"
              value = "${var.azure_batch_account_name}"
            },
            {
              name  = "AZURE_BATCH_POOLID"
              value = "${var.azure_batch_pool_id}"
            },
            {
              name  = "KUBELET_PORT"
              value = "10250"
            },
            {
              name  = "AZURE_CREDENTIALS_LOCATION"
              value = "/etc/aks/azure.json"
            },
            {
              name  = "APISERVER_CERT_LOCATION"
              value = "/etc/virtual-kubelet/cert.pem"
            },
            {
              name  = "APISERVER_KEY_LOCATION"
              value = "/etc/virtual-kubelet/key.pem"
            },
            {
              name = "VKUBELET_POD_IP"

              value_from {
                field_ref {
                  field_path = "status.podIP"
                }
              }
            },
          ]
        }

        volume {
          name = "azure-credentials"

          host_path {
            path = "/etc/kubernetes/azure.json"
          }
        }

        volume {
          name = "credentials"

          secret {
            secret_name = "vkcredentials"
          }
        }
      }
    }
  }
}
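For reference, the environment variables set on the `vk` container above are how this deployment hands its Azure Batch settings to the provider. The Go sketch below only illustrates the shape of that configuration; the `batchConfig` type and `loadBatchConfigFromEnv` function are hypothetical names, not the provider's actual code.

```go
// Illustrative only: read the environment variables set by the Terraform
// deployment above. Names are hypothetical, not the provider's actual code.
package main

import (
	"fmt"
	"os"
)

type batchConfig struct {
	AccountName     string
	AccountLocation string
	PoolID          string
	KubeletPort     string
}

func loadBatchConfigFromEnv() (batchConfig, error) {
	cfg := batchConfig{
		AccountName:     os.Getenv("AZURE_BATCH_ACCOUNT_NAME"),
		AccountLocation: os.Getenv("AZURE_BATCH_ACCOUNT_LOCATION"),
		PoolID:          os.Getenv("AZURE_BATCH_POOLID"),
		KubeletPort:     os.Getenv("KUBELET_PORT"),
	}
	if cfg.AccountName == "" || cfg.PoolID == "" {
		return cfg, fmt.Errorf("AZURE_BATCH_ACCOUNT_NAME and AZURE_BATCH_POOLID must be set")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadBatchConfigFromEnv()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("batch account %q (pool %q) in %s, kubelet port %s\n",
		cfg.AccountName, cfg.PoolID, cfg.AccountLocation, cfg.KubeletPort)
}
```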
41
providers/azurebatch/deployment/virtualkubelet/variables.tf
Normal file
@@ -0,0 +1,41 @@
variable "cluster_client_certificate" {
  type        = "string"
  description = "Cluster client certificate"
}

variable "cluster_client_key" {
  type        = "string"
  description = "Cluster client certificate key"
}

variable "cluster_ca" {
  type        = "string"
  description = "Cluster certificate authority"
}

variable "cluster_host" {
  type        = "string"
  description = "Cluster admin API host"
}

variable "virtualkubelet_docker_image" {
  type        = "string"
  description = "The Docker image to use for deploying the Virtual Kubelet"
}

variable "azure_batch_account_name" {
  type        = "string"
  description = "The name of the Azure Batch account to use"
}

variable "azure_batch_pool_id" {
  type        = "string"
  description = "The pool ID to use in Azure Batch"
  default     = "pool1"
}

variable "resource_group_location" {
  description = "Resource group location"
  type        = "string"
}
6
providers/azurebatch/testdata/logresponse.json
vendored
Normal file
@@ -0,0 +1,6 @@
{"log":"[Vector addition of 50000 elements]\n","stream":"stdout","time":"2018-05-30T17:02:49.967357287Z"}
{"log":"Copy input data from the host memory to the CUDA device\n","stream":"stdout","time":"2018-05-30T17:02:49.967417086Z"}
{"log":"CUDA kernel launch with 196 blocks of 256 threads\n","stream":"stdout","time":"2018-05-30T17:02:49.967423286Z"}
{"log":"Copy output data from the CUDA device to the host memory\n","stream":"stdout","time":"2018-05-30T17:02:49.967427386Z"}
{"log":"Test PASSED\n","stream":"stdout","time":"2018-05-30T17:02:49.967431286Z"}
{"log":"Done\n","stream":"stdout","time":"2018-05-30T17:02:49.967435286Z"}
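For reference, the fixture above uses Docker's `json-file` log format: one JSON object per line with `log`, `stream`, and `time` fields. The sketch below decodes such lines into plain text; `dockerLogLine` and `extractLogs` are illustrative names only, not the provider's actual log-handling code.

```go
// Illustrative only: turn Docker json-file log lines like the fixture above
// into their plain-text "log" payloads.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

type dockerLogLine struct {
	Log    string `json:"log"`
	Stream string `json:"stream"`
	Time   string `json:"time"`
}

func extractLogs(raw string) (string, error) {
	var out strings.Builder
	scanner := bufio.NewScanner(strings.NewReader(raw))
	for scanner.Scan() {
		var line dockerLogLine
		if err := json.Unmarshal(scanner.Bytes(), &line); err != nil {
			return "", err
		}
		out.WriteString(line.Log)
	}
	return out.String(), scanner.Err()
}

func main() {
	sample := `{"log":"Test PASSED\n","stream":"stdout","time":"2018-05-30T17:02:49.967431286Z"}`
	text, err := extractLogs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Print(text) // prints: Test PASSED
}
```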
@@ -616,7 +616,7 @@ func readLogFile(filename string, tail int) (string, error) {
 		lines = append(lines, scanner.Text())
 	}
 	if tail > 0 && tail < len(lines) {
-		lines = lines[len(lines)-tail : len(lines)]
+		lines = lines[len(lines)-tail:]
 	}
 	return strings.Join(lines, ""), nil
 }
@@ -636,7 +636,7 @@ func (p *CRIProvider) GetContainerLogs(namespace, podName, containerName string,
 	}
 	container := pod.containers[containerName]
 	if container == nil {
-		return "", fmt.Errorf("Cannot find container %s in pod %s namespace %s", containerName, pod, namespace)
+		return "", fmt.Errorf("Cannot find container %s in pod %s namespace %s", containerName, podName, namespace)
 	}
 
 	return readLogFile(container.LogPath, tail)
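The two hunks above tidy the CRI provider's log handling: the slice drops its redundant upper bound, and the error message now prints `podName` instead of the pod struct. The snippet below is a tiny, self-contained illustration of the tail-trimming behaviour; `tailLines` is a hypothetical helper, not code from the provider.

```go
// Illustrative only: the tail-trimming behaviour used by readLogFile above.
package main

import (
	"fmt"
	"strings"
)

func tailLines(lines []string, tail int) []string {
	if tail > 0 && tail < len(lines) {
		// Equivalent to lines[len(lines)-tail : len(lines)]; the upper bound is implicit.
		lines = lines[len(lines)-tail:]
	}
	return lines
}

func main() {
	lines := []string{"a\n", "b\n", "c\n", "d\n"}
	fmt.Print(strings.Join(tailLines(lines, 2), "")) // prints the last two lines: c, d
}
```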