Tags: LGTM, Alloy, Loki, Tempo, Trace, Log

[LGTM] Integrating Alloy with Tempo and Loki: A Java Example

石偉琪 Vicky Shih 2025/03/03 16:00:00

1. Introduction to LGTM and Alloy

In the open-source world, Grafana has released Loki, Tempo, and Mimir as the storage backends for the logs, traces, and metrics of Day 2 operations, respectively. As a side note, these three solutions share very similar architectures: they can be used individually or together, and all of them scale out easily, so they fit everything from small setups to large ones. All three use Grafana as the visualization layer, and the first letters of these four solutions form the acronym LGTM.

On top of that, Alloy acts as the agent: it takes the logs, traces, and metrics it collects and processes them according to its configuration file, for example adding labels and forwarding the data to the configured destinations.

2. Lab Overview

This time we'll use a Java application as a concrete example and run the following tests:

  1. how the application's logs get into Loki through Alloy

  2. how the application's traces get written directly to Tempo

  3. how the application's traces go through Alloy and then on to Tempo

Loki, Tempo, Alloy, and the Java test application are all deployed on a K8S cluster.

The K8S version is 1.28.12, running containerd.

3. Implementation Steps

The steps below cover only how to deploy the relevant pieces onto the K8S cluster with helm and the corresponding helm charts.

3.1 Installing/Configuring Loki

I installed minio first so that Loki can store its data in minio buckets.
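I won't go through the minio install here, but as a rough sketch it could look something like the following, assuming the official MinIO community chart; the credentials and bucket names are my own picks, chosen to line up with the Loki storage settings further down:

helm repo add minio https://charts.min.io/
helm install -n minio --create-namespace minio minio/minio \
  --set mode=standalone \
  --set rootUser=loki_admin \
  --set rootPassword=loki_admin123 \
  --set persistence.size=10Gi \
  --set resources.requests.memory=512Mi \
  --set buckets[0].name=chunks,buckets[0].policy=none,buckets[0].purge=false \
  --set buckets[1].name=ruler,buckets[1].policy=none,buckets[1].purge=false \
  --set buckets[2].name=admin,buckets[2].policy=none,buckets[2].purge=false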

As noted in values.yaml of the loki helm chart, deploymentMode comes in three flavors: SingleBinary, SimpleScalable, and Distributed.

Of these:

  • SingleBinary runs everything as a single binary, much like the old monolithic style
  • SimpleScalable roughly splits the architecture into read, write, and backend components
  • Distributed deploys each service as microservices on the K8S cluster, so individual services can be scaled independently later as needed

Here, of course, we deploy in Distributed mode.

helm repo add grafana https://grafana.github.io/helm-charts
helm pull grafana/loki --untar

cd loki
cat > values-loki.yaml << EOF
loki:
  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
  tracing:
    enabled: true
  querier:
    # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    max_concurrent: 4
  storage:
    bucketNames:
      chunks: chunks
      ruler: ruler
      admin: admin
    type: s3
    s3:
      s3: null
      endpoint: http://minio.minio:9000/loki
      accessKeyId: loki_admin
      secretAccessKey: loki_admin123
      s3ForcePathStyle: true
      insecure: true
  auth_enabled: false

lokiCanary:
  enabled: false

test:
  enabled: false

deploymentMode: Distributed

ingester:
  replicas: 1
querier:
  replicas: 1
  maxUnavailable: 2
queryFrontend:
  replicas: 1
  maxUnavailable: 1
queryScheduler:
  replicas: 1
distributor:
  replicas: 1
  maxUnavailable: 2
compactor:
  replicas: 1
indexGateway:
  replicas: 1
  maxUnavailable: 1

# optional experimental components
bloomPlanner:
  replicas: 0
bloomBuilder:
  replicas: 0
bloomGateway:
  replicas: 0

# Enable minio for storage
minio:
  enabled: false

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

singleBinary:
  replicas: 0
EOF

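# Patch the chart's memcached template so the templated memory value is hard-coded to 500Mi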
sed -i 's/memory: {{ $requestMemory }}Mi/memory: 500Mi/g' templates/memcached/_memcached-statefulset.tpl

helm install -n loki --create-namespace loki . -f values-loki.yaml

After installing, let's verify things: first the pod status.

Then take a look at the services.
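Something along these lines lists both (the loki namespace comes from the helm install command above):

kubectl -n loki get pods
kubectl -n loki get svc

All the components from values-loki.yaml, plus the gateway and the memcached caches, should end up in Running state.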

3.2 Installing/Configuring Grafana

Next, install grafana the same way via helm chart, into the grafana namespace.

helm repo add grafana https://grafana.github.io/helm-charts
helm pull grafana/grafana --untar

cd grafana
cat > values-grafana.yaml << EOF
ingress:
  enabled: true
  ingressClassName: nginx
  path: "/grafana"
  pathType: Prefix
  hosts:
  - "test.example.com"

grafana.ini:
  server:
    domain: test.example.com
    root_url: "http://test.example.com:32080/grafana"
    serve_from_sub_path: true

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki-gateway.loki

adminUser: admin
adminPassword: admin123
EOF

helm install -n grafana --create-namespace grafana . -f values-grafana.yaml

Once grafana is installed, its pod and svc are a little simpler.
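A quick check along these lines should confirm it:

kubectl -n grafana get pods,svc,ingress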

Because the datasources section in values-grafana.yaml above already defines a data source of type loki, named Loki,

it will show up in grafana as soon as you log in.

3.3 Installing/Configuring Tempo

Next up: tempo.

helm repo add grafana https://grafana.github.io/helm-charts
helm pull grafana/tempo-distributed --untar

cd tempo-distributed
cat > values-tempo.yaml << EOF
global_overrides:
  metrics_generator_processors:
    - service-graphs
    - span-metrics

metricsGenerator:
  enabled: true
  storage:
    remote_write:
      - url: "http://loki-stack-prometheus-server.monitoring/api/v1/write"

gateway:
  enabled: true

traces:
  otlp:
    http:
      enabled: true
    grpc:
      enabled: true
EOF

helm install -n tracing --create-namespace tempo . -f values-tempo.yaml

Since this round of testing uses OTLP, only the http and grpc receivers are enabled.

As before, check the pods after installation,

and then the services.
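For example:

kubectl -n tracing get pods
kubectl -n tracing get svc

The two services used later are tempo-distributor (where OTLP traces are sent) and tempo-gateway (which Grafana queries).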

Then open grafana in a web browser and add a data source of type tempo, pointing at http://tempo-gateway.tracing.

3.4 Installing/Configuring Alloy

Next, it's alloy's turn.

helm repo add grafana https://grafana.github.io/helm-charts
helm pull grafana/alloy --untar
cd alloy

cat > values-alloy-to-loki.yaml << EOF
alloy:
  mounts:
    varlog: true

  extraPorts:
    - name: "otlp-grpc"
      port: 4317
      targetPort: 4317
    - name: "otlp-http"
      port: 4318
      targetPort: 4318

  resources:
    limits:
      cpu: 10m
      memory: 100Mi
EOF

sed -i 's/example.alloy/config.alloy/g' templates/configmap.yaml

Alloy does exactly what its configuration file tells it to. Here the configuration is placed directly inside the helm chart, at config/config.alloy.

The configmap, however, reads config/example.alloy by default, which is why templates/configmap.yaml is modified above.

With that out of the way, create config/config.alloy with the following content:

logging {
  level = "debug"
  format = "logfmt"
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.loki/loki/api/v1/push"
  }
}

// for k8s pod logs
discovery.kubernetes "pod" {
  role = "pod"
}

discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pod.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    action = "replace"
    target_label = "namespace"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    action = "replace"
    target_label = "node_name"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    action = "replace"
    target_label = "pod"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_ip"]
    action = "replace"
    target_label = "pod_ip"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    action = "replace"
    target_label = "container"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
    action = "replace"
    target_label = "app"
  }

  rule {
    source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_container_name"]
    action = "replace"
    target_label = "job"
    separator = "/"
    replacement = "$1"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
    action = "replace"
    target_label = "__path__"
    separator = "/"
    replacement = "/var/log/pods/*$1/*.log"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_container_id"]
    action = "replace"
    target_label = "container_runtime"
    regex = "^(\\S+):\\/\\/.+$"
    replacement = "$1"
  }
}

loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.pod_logs.output
  forward_to = [loki.process.pod_logs.receiver]
}

loki.process "pod_logs" {
  stage.static_labels {
      values = {
        cluster = "vicky-test",
      }
  }

  forward_to = [loki.write.default.receiver]
}

// k8s events
loki.source.kubernetes_events "cluster_events" {
  job_name   = "integrations/kubernetes/eventhandler"
  log_format = "logfmt"
  forward_to = [loki.process.cluster_events.receiver]
}

loki.process "cluster_events" {
  forward_to = [loki.write.default.receiver]

  stage.static_labels {
    values = {
      cluster = "vicky-test",
    }
  }

  stage.labels {
    values = {
      kubernetes_cluster_events = "job",
    }
  }
}

// for Traces - to Tempo
otelcol.receiver.otlp "otlp_receiver" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    traces = [otelcol.exporter.otlphttp.tempo.input]
  }
}

otelcol.exporter.otlphttp "tempo" {
  client {
    endpoint = "http://tempo-distributor.tracing:4318"
    tls {
      insecure = true
      insecure_skip_verify = true
    }
  }
}

For logs, Alloy reads the files under /var/log/pods and then runs the various rules against them, for example rewriting labels and adding a cluster: vicky-test label.

Finally, following forward_to = [loki.write.default.receiver], it resolves the loki.write "default" block and writes to the endpoint configured there, i.e. http://loki-gateway.loki/loki/api/v1/push.

As for traces, the otelcol.receiver.otlp "otlp_receiver" block accepts them on Alloy's ports 4317 and 4318, and the output is sent to the endpoint configured in the otelcol.exporter.otlphttp "tempo" block, i.e. http://tempo-distributor.tracing:4318.

Finally, deploy alloy; it runs as a DaemonSet.

helm install -n alloy --create-namespace alloy . -f values-alloy-to-loki.yaml

As usual, check the state after deployment.
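For example:

kubectl -n alloy get daemonset,pods -o wide
kubectl -n alloy get svc

Since it runs as a DaemonSet, there should be one alloy pod per node, and the alloy service should expose the two OTLP ports (4317/4318) added via extraPorts.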

Phew, that completes the server-side setup; all that's left is for the test application to send over its logs and traces.

3.5 Deploying the Java Test Application - Spring PetClinic

The main task here is to compile the test application into a JAR file, which ends up at target/spring-petclinic-3.4.0-SNAPSHOT.jar (the file name may differ slightly for other versions).
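If you're building from the upstream Spring PetClinic sources, something like this should produce that JAR; the repository URL and Maven wrapper are from the upstream project, so adjust if your copy differs:

git clone https://github.com/spring-projects/spring-petclinic.git
cd spring-petclinic
./mvnw -DskipTests package
ls target/spring-petclinic-*.jar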

You also need to download the opentelemetry java agent first:

wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

Then use the Dockerfile to build a container image and either push it to a container registry or load the image manually on the worker nodes.

I won't spell out every step; the Dockerfile is below, and a rough sketch of the build follows it.

FROM eclipse-temurin:21-jdk-jammy

WORKDIR /app

COPY target/spring-petclinic-3.4.0-SNAPSHOT.jar spring-petclinic.jar
COPY opentelemetry-javaagent.jar opentelemetry-javaagent.jar

CMD ["sh", "-c", "java \
 -Dotel.service.name=${OTEL_SERVICE_NAME} \
 -Dotel.exporter.otlp.endpoint=${OTEL_EXPORTER_OTLP_ENDPOINT} \
 -Dotel.traces.exporter=${OTEL_TRACES_EXPORTER} \
 -Dotel.metrics.exporter=${OTEL_METRICS_EXPORTER} \
 -javaagent:./opentelemetry-javaagent.jar \
 -jar spring-petclinic.jar"]
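As a sketch of the build step (the image name and tag are assumptions that must match image.repository and image.tag in the chart values below; the registry name is a placeholder):

# build the image next to the Dockerfile, the JAR, and the agent
docker build -t petclinic:0.1 .

# option 1: push to a registry the cluster can pull from
# docker tag petclinic:0.1 <your-registry>/petclinic:0.1
# docker push <your-registry>/petclinic:0.1

# option 2: load the image manually on each containerd worker node
docker save petclinic:0.1 -o petclinic.tar
# then, on each worker node:
# ctr -n k8s.io images import petclinic.tar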

Create a helm chart for the application with helm create spring-petclinic, then modify the relevant files.

 

  • values.yaml

Here, otlp.ServiceName is the service name you'll later use to search in tempo, and otlp.endpoint is where the traces get sent.

Because this service has to be reachable by requests from outside the K8S cluster, it is exposed as a NodePort.

# Default values for spring-petclinic.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

replicaCount: 1

otlp:
  ServiceName: petclinic-direct
  endpoint: http://tempo-distributor.tracing:4318

image:
  repository: petclinic
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "0.1"

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Automatically mount a ServiceAccount's API credentials?
  automount: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

podAnnotations: {}
podLabels: {}

podSecurityContext: {}
  # fsGroup: 2000

securityContext: {}
  # capabilities:
  #   drop:
  #   - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

service:
  type: NodePort
  port: 8080

ingress:
  enabled: false
  className: ""
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

resources: {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80
  # targetMemoryUtilizationPercentage: 80

# Additional volumes on the output Deployment definition.
volumes: []
# - name: foo
#   secret:
#     secretName: mysecret
#     optional: false

# Additional volumeMounts on the output Deployment definition.
volumeMounts: []
# - name: foo
#   mountPath: "/etc/foo"
#   readOnly: true

nodeSelector: {}

tolerations: []

affinity: {}

 

  • templates/deployment.yaml

A few environment variables are added here, such as OTEL_SERVICE_NAME and OTEL_EXPORTER_OTLP_ENDPOINT; they match the variables consumed by the CMD in the Dockerfile above.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "spring-petclinic.fullname" . }}
  labels:
    {{- include "spring-petclinic.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "spring-petclinic.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      {{- with .Values.podAnnotations }}
      annotations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      labels:
        {{- include "spring-petclinic.labels" . | nindent 8 }}
        {{- with .Values.podLabels }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "spring-petclinic.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          env:
            - name: OTEL_SERVICE_NAME
              value: "{{ .Values.otlp.ServiceName }}"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "{{ .Values.otlp.endpoint }}"
            - name: OTEL_TRACES_EXPORTER
              value: "otlp"
            - name: OTEL_METRICS_EXPORTER
              value: "none"
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          {{- with .Values.volumeMounts }}
          volumeMounts:
            {{- toYaml . | nindent 12 }}
          {{- end }}
      {{- with .Values.volumes }}
      volumes:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}

 

  • templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "spring-petclinic.fullname" . }}
  labels:
    {{- include "spring-petclinic.labels" . | nindent 4 }}
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      targetPort: 8080
      protocol: TCP
      name: http
  selector:
    {{- include "spring-petclinic.selectorLabels" . | nindent 4 }}

Once everything is in place, run helm install -n default petclinic . to install the application.

As usual, check the state after installation; from the service listing you can see the NodePort ended up on port 30621.
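For example:

kubectl -n default get pods
kubectl -n default get svc

The NodePort shows up in the PORT(S) column of the svc output; it was 30621 in this run, and yours will most likely be a different randomly assigned port.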

Open the home page first; if the PetClinic welcome page comes up, the application has been deployed to the K8S cluster successfully.
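A quick smoke test from outside the cluster could look like this; <node-ip> is a placeholder for any worker node's IP, and 30621 is the NodePort from this run:

curl -s -o /dev/null -w "%{http_code}\n" http://<node-ip>:30621/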

4. Testing

With the application deployed, it's time to test. The following scenarios are covered:

4.1 Application logs -> Loki

Go straight into grafana: Explore -> Loki, then set the label filter to app = spring-petclinic.

You should see logs like the ones below, including the pod's startup log lines, which also show that traces are being sent directly to http://tempo-distributor.tracing:4318/.
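If you prefer the command line, the same query can be run against the Loki HTTP API through the gateway; this sketch assumes kubectl port-forward access from your workstation and the auth_enabled: false setting used above:

kubectl -n loki port-forward svc/loki-gateway 3100:80 &
sleep 2
curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={app="spring-petclinic"}' \
  --data-urlencode 'limit=10'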

4.2 Application traces -> directly into Tempo

From the logs in 4.1, you can tell that the application sends its traces straight to http://tempo-distributor.tracing:4318/.

Pick one trace and open it.

You get a view much like a Jaeger query result, which makes it easy to see which span takes the longest.

4.3 Application traces -> Alloy first -> forwarded to Tempo

Before running this test, edit values.yaml and change the otlp block to the following:

otlp:
  ServiceName: petclinic-alloy
  endpoint: http://alloy.alloy:4318

Then apply the change with:

helm upgrade -n default petclinic .

Once the pod is up, check the logs first: the application starts normally, but the traces now go to alloy first, and alloy forwards them to tempo.

The logs show that the change made in values.yaml has taken effect.

Then open the application page, click around a bit, and look at the traces again.
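If you'd rather script the traffic than click around, a loop like this works too; <node-ip> is a placeholder and the paths are standard PetClinic pages:

for path in / /vets.html /owners/find; do
  curl -s -o /dev/null -w "%{http_code} ${path}\n" "http://<node-ip>:30621${path}"
done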

You'll see the Service name has changed to petclinic-alloy, i.e. the value just set in values.yaml; this is what lets you tell the two setups apart.

Seeing the traces here means alloy is forwarding them to tempo correctly.

5. Conclusion

Alloy is Grafana's three-in-one agent for shipping logs, metrics, and traces. It makes day-to-day operations noticeably simpler: instead of maintaining three different agents, you only maintain one. And if you ever move to a different backend, you can redirect the data just by changing Alloy's configuration, without touching the application side.

That said, Alloy's learning curve is fairly steep; it takes a good amount of time with the documentation before the configuration format starts to make sense. Hopefully Grafana will make the docs, both on the website and in the helm charts, easier to get started with.
