[LGTM] Integrating Alloy with Tempo and Loki: A Java Example
1. A Brief Introduction to LGTM and Alloy
In the open-source camp, Grafana has released Loki, Tempo, and Mimir as the storage backends for the logs, traces, and metrics of Day 2 operations. As a side note, these three solutions share a very similar architecture: each can be used on its own or combined with the others, and all of them scale out easily, making them a fit for anything from small to large deployments. All three use Grafana as the visualization layer, and the first letters of the four solutions together form the acronym LGTM.
In addition, Alloy acts as the agent: it takes the collected logs, traces, and metrics and processes them according to its configuration file, for example attaching labels and forwarding the data to the designated destinations.
2. Lab Overview
This time a Java application is used as the example, and the following tests are carried out:
- how the application's logs get into Loki through Alloy
- how the application's traces are written directly to Tempo
- how the application's traces are written to Tempo via Alloy
Loki, Tempo, Alloy, and the Java application under test are all deployed on a K8S cluster.
The K8S version is 1.28.12, running on containerd.
3. Implementation Steps
The installation steps below only cover how to deploy the relevant solutions onto the K8S cluster using helm and the corresponding helm charts.
3.1 Installing/Configuring Loki
I installed minio beforehand so that Loki can store its data in minio buckets.
Loki's helm chart documents in its values.yaml that deploymentMode comes in three flavors: SingleBinary, SimpleScalable, and Distributed.
Specifically,
- SingleBinary runs everything as a single binary, much like a classic monolithic setup
- SimpleScalable roughly splits the architecture into read, write, and backend components
- Distributed deploys every service as microservices on the K8S cluster, so each one can later be scaled independently as needed
Here, of course, we deploy in Distributed mode.
helm repo add grafana https://grafana.github.io/helm-charts
helm pull grafana/loki --untar
cd loki
cat > values-loki.yaml << EOF
loki:
schemaConfig:
configs:
- from: 2024-04-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
ingester:
chunk_encoding: snappy
tracing:
enabled: true
querier:
# Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
max_concurrent: 4
storage:
bucketNames:
chunks: chunks
ruler: ruler
admin: admin
type: s3
s3:
s3: null
endpoint: http://minio.minio:9000/loki
accessKeyId: loki_admin
secretAccessKey: loki_admin123
s3ForcePathStyle: true
insecure: true
auth_enabled: false
lokiCanary:
enabled: false
test:
enabled: false
deploymentMode: Distributed
ingester:
replicas: 1
querier:
replicas: 1
maxUnavailable: 2
queryFrontend:
replicas: 1
maxUnavailable: 1
queryScheduler:
replicas: 1
distributor:
replicas: 1
maxUnavailable: 2
compactor:
replicas: 1
indexGateway:
replicas: 1
maxUnavailable: 1
# optional experimental components
bloomPlanner:
replicas: 0
bloomBuilder:
replicas: 0
bloomGateway:
replicas: 0
# Enable minio for storage
minio:
enabled: false
# Zero out replica counts of other deployment modes
backend:
replicas: 0
read:
replicas: 0
write:
replicas: 0
singleBinary:
replicas: 0
EOF
sed -i 's/memory: {{ $requestMemory }}Mi/memory: 500Mi/g' templates/memcached/_memcached-statefulset.tpl
helm install -n loki --create-namespace loki . -f values-loki.yaml
After the installation, let's confirm everything is up, starting with the pod status.
Next, take a look at the services.
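A quick check with kubectl, for example (the loki namespace matches the helm install above):
kubectl get pods -n loki   # distributor, ingester, querier, query-frontend, query-scheduler, compactor, index-gateway, gateway, ... should all be Running
kubectl get svc -n loki    # loki-gateway is the service that Grafana and Alloy will point at later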
3.2 Installing/Configuring Grafana
Next, Grafana is installed the same way with its helm chart, into the grafana namespace.
helm repo add grafana https://grafana.github.io/helm-charts
helm pull grafana/grafana --untar
cd grafana
cat > values-grafana.yaml << EOF
ingress:
enabled: true
ingressClassName: nginx
path: "/grafana"
pathType: Prefix
hosts:
- "test.example.com"
grafana.ini:
server:
domain: test.example.com
root_url: "http://test.example.com:32080/grafana"
serve_from_sub_path: true
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki-gateway.loki
adminUser: admin
adminPassword: admin123
EOF
helm install -n grafana --create-namespace grafana . -f values-grafana.yaml
Once Grafana is installed, its pods and services are a bit simpler.
Because the datasources section of values-grafana.yaml above already pre-provisions a data source of type loki named Loki, that data source is visible as soon as you log in to Grafana.
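To double-check the deployment without going through the ingress, a port-forward also works (a quick sketch; the service name comes from the release name, and local port 3000 is just an assumption):
kubectl get pods,svc -n grafana
kubectl port-forward -n grafana svc/grafana 3000:80   # the UI is then served under the /grafana sub-path configured above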
3.3 Installing/Configuring Tempo
Next, install Tempo.
helm repo add grafana https://grafana.github.io/helm-charts
helm pull grafana/tempo-distributed --untar
cd tempo-distributed
cat > values-tempo.yaml << EOF
global_overrides:
metrics_generator_processors:
- service-graphs
- span-metrics
metricsGenerator:
enabled: true
storage:
remote_write:
- url: "http://loki-stack-prometheus-server.monitoring/api/v1/write"
gateway:
enabled: true
traces:
otlp:
http:
enabled: true
grpc:
enabled: true
EOF
helm install -n tracing --create-namespace tempo . -f values-tempo.yaml
Since OTLP is what is being tested this time, only the http and grpc receivers are enabled.
As before, take a look at the pod status after installation.
Then the services.
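Again, a quick kubectl check (names follow the tempo release in the tracing namespace):
kubectl get pods -n tracing
kubectl get svc -n tracing   # tempo-distributor should expose 4317/4318 for OTLP, while tempo-gateway fronts the query path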
Then open Grafana in a web browser and add a data source of type tempo pointing at http://tempo-gateway.tracing
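If you would rather skip the UI, the same data source can also be created through Grafana's HTTP API. A rough example using the admin credentials and root_url defined in values-grafana.yaml earlier (adjust the host/port to however you reach Grafana):
curl -s -X POST "http://test.example.com:32080/grafana/api/datasources" \
  -u admin:admin123 \
  -H "Content-Type: application/json" \
  -d '{"name":"Tempo","type":"tempo","access":"proxy","url":"http://tempo-gateway.tracing"}'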
3.4 Installing/Configuring Alloy
Next, it's Alloy's turn.
helm repo add grafana https://grafana.github.io/helm-charts
helm pull grafana/alloy --untar
cat > values-alloy-to-loki.yaml << EOF
alloy:
mounts:
varlog: true
extraPorts:
- name: "otlp-grpc"
port: 4317
targetPort: 4317
- name: "otlp-http"
port: 4318
targetPort: 4318
resources:
limits:
cpu: 10m
memory: 100Mi
EOF
sed -i 's/example.alloy/config.alloy/g' ~/helm-charts/alloy/templates/configmap.yaml
Everything Alloy does is driven by its configuration file; here the configuration is written directly into config/config.alloy inside the helm chart.
The configmap template reads config/example.alloy by default, which is why templates/configmap.yaml was patched with sed above.
Now we can add config.alloy:
logging {
level = "debug"
format = "logfmt"
}
loki.write "default" {
endpoint {
url = "http://loki-gateway.loki/loki/api/v1/push"
}
}
// for k8s pod logs
discovery.kubernetes "pod" {
role = "pod"
}
discovery.relabel "pod_logs" {
targets = discovery.kubernetes.pod.targets
rule {
source_labels = ["__meta_kubernetes_namespace"]
action = "replace"
target_label = "namespace"
}
rule {
source_labels = ["__meta_kubernetes_pod_node_name"]
action = "replace"
target_label = "node_name"
}
rule {
source_labels = ["__meta_kubernetes_pod_name"]
action = "replace"
target_label = "pod"
}
rule {
source_labels = ["__meta_kubernetes_pod_ip"]
action = "replace"
target_label = "pod_ip"
}
rule {
source_labels = ["__meta_kubernetes_pod_container_name"]
action = "replace"
target_label = "container"
}
rule {
source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
action = "replace"
target_label = "app"
}
rule {
source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_container_name"]
action = "replace"
target_label = "job"
separator = "/"
replacement ="$1"
}
rule {
source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
action = "replace"
target_label = "__path__"
separator = "/"
replacement = "/var/log/pods/*$1/*.log"
}
rule {
source_labels = ["__meta_kubernetes_pod_container_id"]
action = "replace"
target_label = "container_runtime"
regex = "^(\\S+):\\/\\/.+$"
replacement = "$1"
}
}
loki.source.kubernetes "pod_logs" {
targets = discovery.relabel.pod_logs.output
forward_to = [loki.process.pod_logs.receiver]
}
loki.process "pod_logs" {
stage.static_labels {
values = {
cluster = "vicky-test",
}
}
forward_to = [loki.write.default.receiver]
}
// k8s events
loki.source.kubernetes_events "cluster_events" {
job_name = "integrations/kubernetes/eventhandler"
log_format = "logfmt"
forward_to = [loki.process.cluster_events.receiver]
}
loki.process "cluster_events" {
forward_to = [loki.write.default.receiver]
stage.static_labels {
values = {
cluster = "vicky-test",
}
}
stage.labels {
values = {
kubernetes_cluster_events = "job",
}
}
}
// for Traces - to Tempo
otelcol.receiver.otlp "otlp_receiver" {
grpc {
endpoint = "0.0.0.0:4317"
}
http {
endpoint = "0.0.0.0:4318"
}
output {
traces = [otelcol.exporter.otlphttp.tempo.input]
}
}
otelcol.exporter.otlphttp "tempo" {
client {
endpoint = "http://tempo-distributor.tracing:4318"
tls {
insecure = true
insecure_skip_verify = true
}
}
}
For logs, Alloy reads the files under /var/log/pods and then applies the other rules, for example rewriting labels and adding a cluster: vicky-test label.
Finally, following forward_to = [loki.write.default.receiver], it looks up the loki.write "default" block and writes to the endpoint configured there, i.e. http://loki-gateway.loki/loki/api/v1/push
Traces, on the other hand, go through the otelcol.receiver.otlp "otlp_receiver" block: they enter Alloy on port 4317 or 4318, and the output is written to the endpoint defined in the otelcol.exporter.otlphttp "tempo" block, i.e. http://tempo-distributor.tracing:4318
Finally, deploy Alloy as a DaemonSet.
helm install -n alloy --create-namespace alloy . -f values-alloy-to-loki.yaml
As usual, check the state after the deployment.
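For example (a sketch; since Alloy runs as a DaemonSet, there should be one pod per node):
kubectl get daemonset,pods -n alloy -o wide
kubectl logs -n alloy ds/alloy --tail=20   # with level = "debug" above, the logs should show the pods being tailed and the Loki push endpoint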
Phew, that completes the server-side setup; all that's left is for the test application to send its logs and traces over.
3.5 Deploying the Java Test Application - Spring PetClinic
The main task here is compiling the test application into a JAR file, which ends up at target/spring-petclinic-3.4.0-SNAPSHOT.jar (the filename may differ slightly for other versions).
In addition, download the OpenTelemetry Java agent first:
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
Then use a Dockerfile to package everything into a container image and either push it to a container registry or load the image manually on the worker nodes.
I won't spell out every step; the Dockerfile is below, followed by a rough sketch of the build commands.
FROM eclipse-temurin:21-jdk-jammy
WORKDIR /app
COPY target/spring-petclinic-3.4.0-SNAPSHOT.jar spring-petclinic.jar
COPY opentelemetry-javaagent.jar opentelemetry-javaagent.jar
CMD ["sh", "-c", "java \
-Dotel.service.name=${OTEL_SERVICE_NAME} \
-Dotel.exporter.otlp.endpoint=${OTEL_EXPORTER_OTLP_ENDPOINT} \
-Dotel.traces.exporter=${OTEL_TRACES_EXPORTER} \
-Dotel.metrics.exporter=${OTEL_METRICS_EXPORTER} \
-javaagent:./opentelemetry-javaagent.jar \
-jar spring-petclinic.jar"]
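A minimal sketch of the build steps, assuming the upstream Spring PetClinic sources and the image name/tag that values.yaml below expects (petclinic:0.1); adjust to your own registry or node-loading workflow:
git clone https://github.com/spring-projects/spring-petclinic.git
cd spring-petclinic
./mvnw -DskipTests package                 # produces target/spring-petclinic-3.4.0-SNAPSHOT.jar
# place opentelemetry-javaagent.jar next to the Dockerfile, as downloaded above
docker build -t petclinic:0.1 .
# either push the image to a registry, or save it and import it on a containerd worker node:
docker save petclinic:0.1 -o petclinic.tar
sudo ctr -n k8s.io images import petclinic.tar   # run on the worker node after copying the tar over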
Create a helm chart for the application with helm create spring-petclinic, then modify the relevant files.
- values.yaml
Here, otlp.ServiceName is the service name that can later be used for queries in Tempo, while otlp.endpoint is where the traces should be sent.
Because this service needs to accept requests from outside the K8S cluster, the service is exposed as a NodePort.
# Default values for spring-petclinic.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
replicaCount: 1
otlp:
ServiceName: petclinic-direct
endpoint: http://tempo-distributor.tracing:4318
image:
repository: petclinic
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
tag: "0.1"
imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
serviceAccount:
# Specifies whether a service account should be created
create: true
# Automatically mount a ServiceAccount's API credentials?
automount: true
# Annotations to add to the service account
annotations: {}
# The name of the service account to use.
# If not set and create is true, a name is generated using the fullname template
name: ""
podAnnotations: {}
podLabels: {}
podSecurityContext: {}
# fsGroup: 2000
securityContext: {}
# capabilities:
#   drop:
#   - ALL
# readOnlyRootFilesystem: true
# runAsNonRoot: true
# runAsUser: 1000
service:
type: NodePort
port: 8080
ingress:
enabled: false
className: ""
annotations: {}
# kubernetes.io/ingress.class: nginx
# kubernetes.io/tls-acme: "true"
hosts:
- host: chart-example.local
paths:
- path: /
pathType: ImplementationSpecific
tls: []
#  - secretName: chart-example-tls
#    hosts:
#      - chart-example.local
resources: {}
# We usually recommend not to specify default resources and to leave this as a conscious
# choice for the user. This also increases chances charts run on environments with little
# resources, such as Minikube. If you do want to specify resources, uncomment the following
# lines, adjust them as necessary, and remove the curly braces after 'resources:'.
# limits:
#   cpu: 100m
#   memory: 128Mi
# requests:
#   cpu: 100m
#   memory: 128Mi
autoscaling:
enabled: false
minReplicas: 1
maxReplicas: 100
targetCPUUtilizationPercentage: 80
# targetMemoryUtilizationPercentage: 80

# Additional volumes on the output Deployment definition.
volumes: []
# - name: foo
#   secret:
#     secretName: mysecret
#     optional: false

# Additional volumeMounts on the output Deployment definition.
volumeMounts: []
# - name: foo
#   mountPath: "/etc/foo"
#   readOnly: true
nodeSelector: {}
tolerations: []
affinity: {}
- templates/deployment.yaml
A few environment variables are added here, such as OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, and so on.
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "spring-petclinic.fullname" . }}
labels:
{{- include "spring-petclinic.labels" . | nindent 4 }}
spec:
{{- if not .Values.autoscaling.enabled }}
replicas: {{ .Values.replicaCount }}
{{- end }}
selector:
matchLabels:
{{- include "spring-petclinic.selectorLabels" . | nindent 6 }}
template:
metadata:
{{- with .Values.podAnnotations }}
annotations:
{{- toYaml . | nindent 8 }}
{{- end }}
labels:
{{- include "spring-petclinic.labels" . | nindent 8 }}
{{- with .Values.podLabels }}
{{- toYaml . | nindent 8 }}
{{- end }}
spec:
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
serviceAccountName: {{ include "spring-petclinic.serviceAccountName" . }}
securityContext:
{{- toYaml .Values.podSecurityContext | nindent 8 }}
containers:
- name: {{ .Chart.Name }}
env:
- name: OTEL_SERVICE_NAME
value: "{{ .Values.otlp.ServiceName }}"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "{{ .Values.otlp.endpoint }}"
- name: OTEL_TRACES_EXPORTER
value: "otlp"
- name: OTEL_METRICS_EXPORTER
value: "none"
securityContext:
{{- toYaml .Values.securityContext | nindent 12 }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
- name: http
containerPort: {{ .Values.service.port }}
protocol: TCP
resources:
{{- toYaml .Values.resources | nindent 12 }}
{{- with .Values.volumeMounts }}
volumeMounts:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- with .Values.volumes }}
volumes:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.affinity }}
affinity:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.tolerations }}
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}
- templates/service.yaml
apiVersion: v1
kind: Service
metadata:
name: {{ include "spring-petclinic.fullname" . }}
labels:
{{- include "spring-petclinic.labels" . | nindent 4 }}
spec:
type: {{ .Values.service.type }}
ports:
- port: {{ .Values.service.port }}
targetPort: 8080
protocol: TCP
name: http
selector:
{{- include "spring-petclinic.selectorLabels" . | nindent 4 }
Once all of this is ready, install the application with helm install -n default petclinic .
As usual, check the state after installation; in this run the NodePort ended up on port 30621.
Browse to the home page first; if the PetClinic home page renders, the application has been deployed to the K8S cluster successfully.
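A quick way to find the NodePort and hit the home page (the service name below assumes the default fullname template, i.e. release name plus chart name; the node IP is a placeholder):
kubectl get svc -n default petclinic-spring-petclinic   # PORT(S) should show something like 8080:30621/TCP
curl -I http://<worker-node-ip>:30621/                  # an HTTP 200 here means the home page is being served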
4. Testing
With the application deployed, we can start testing. The following scenarios are covered.
4.1 Application log -> Loki
Go straight into Grafana: Explore -> Loki, and set the label filter to app = spring-petclinic.
You should then see logs like the ones below: the pod's startup logs, which also show that traces are being sent directly to http://tempo-distributor.tracing:4318/
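The same query can also be run against the Loki API from inside the cluster, for example with a throwaway curl pod (a sketch; the LogQL selector matches the label filter used in Grafana):
kubectl run tmp-curl -n loki -i --rm --restart=Never --image=curlimages/curl --command -- \
  curl -sG "http://loki-gateway.loki/loki/api/v1/query_range" \
  --data-urlencode 'query={app="spring-petclinic"}' \
  --data-urlencode 'limit=5'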
4.2 Application trace -> directly into Tempo
The logs in 4.1 already show that traces go straight from the application to http://tempo-distributor.tracing:4318/
Pick one trace and open it.
You get a view much like a Jaeger query result, where it is easy to spot which span takes the longest.
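Traces can also be looked up through Tempo's search API behind the gateway, again with a throwaway pod (a sketch; I'm assuming the gateway routes /api/search and that the tags parameter accepts the service.name attribute):
kubectl run tmp-curl -n tracing -i --rm --restart=Never --image=curlimages/curl --command -- \
  curl -sG "http://tempo-gateway.tracing/api/search" \
  --data-urlencode 'tags=service.name=petclinic-direct' \
  --data-urlencode 'limit=5'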
4.3 Application trace -> Alloy first -> then forwarded to Tempo
Before this test, update values.yaml and change the otlp block to the following:
otlp:
ServiceName: petclinic-alloy
endpoint: http://alloy.alloy:4318
After the change, run the following command to apply it:
helm upgrade -n default petclinic .
Once the pod is back up, check the logs first: the application starts up normally, but this time the traces are sent to Alloy first, and Alloy forwards them on to Tempo.
The logs confirm that the change just made in values.yaml has taken effect.
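Besides reading the application log, the new endpoint can also be confirmed from the environment variables on the running Deployment (the Deployment name below assumes the default fullname template):
kubectl -n default rollout status deploy/petclinic-spring-petclinic
kubectl -n default exec deploy/petclinic-spring-petclinic -- env | grep OTEL   # OTEL_EXPORTER_OTLP_ENDPOINT should now be http://alloy.alloy:4318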
Then open the application page, click around a bit, and check the traces again.
You will see that the service name has changed to petclinic-alloy, the value just set in values.yaml; that is what lets us tell the two paths apart.
Being able to see the traces means Alloy is forwarding them to Tempo correctly.
5. Conclusion
Alloy is Grafana's three-in-one agent for shipping logs, metrics, and traces. It noticeably simplifies day-to-day operations: instead of maintaining three different agents, you only need to maintain one. If another backend solution comes along in the future, the data can easily be rerouted by changing Alloy's configuration, without touching anything on the application side.
That said, Alloy has a fairly steep learning curve; it takes a good amount of time reading the documentation before the configuration file starts to make sense. Hopefully Grafana will make the docs, both on the website and in the helm charts, easier to get started with.