Telegraf Configuration


Configuration Official Guides

  • General Guide

  • Guide details by Plugins

[Important] Cautions

For a telegraf agent running inside an instance of a Cloud Service that CloudHub supports as an Addon, you must set the csp key under [global_tags]; otherwise the host will not be listed correctly in the Addon-related views of the CloudHub UI (Infrastructure > Host > AWS, etc.).

The predefined csp codes are as follows.

Cloud Service Provider | Key
-----------------------+---------------------------
AWS                    | [global_tags] csp = "aws"
GCP                    | [global_tags] csp = "gcp"
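As a sketch, the tag goes at the top of the host's telegraf.conf; for an AWS instance this would look like the following (any other settings shown in the templates below are unaffected):

```toml
# Required for the host to appear under Infrastructure > Host > AWS
[global_tags]
  csp = "aws"
```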

Config Templates

Common

Log Handling

  • In the templates below, logtarget is intentionally left unset in order to follow the pattern of The Twelve-Factor App > Logs.
    The logs are therefore written to stderr, and:

  • On Linux, they are aggregated into /var/log/messages.

  • On Windows, set logtarget to "eventlog"; the logs can then be viewed with the Windows Event Viewer tool.

Note (logtarget): Log target controls the destination for logs and can be one of "file", "stderr" or, on Windows, "eventlog". When set to "file", the output file is determined by the "logfile" setting.
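In [agent] terms, the two platform conventions above correspond to the following (a sketch; on Linux, logtarget is simply left at its default of "stderr"):

```toml
[agent]
  ## Linux: leave logtarget unset; telegraf logs to stderr, and
  ## syslog/systemd consolidates it into /var/log/messages.
  # logtarget = "stderr"

  ## Windows: uncomment to write to the Windows Event Log instead,
  ## viewable via the Windows Event Viewer tool.
  # logtarget = "eventlog"
```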

 

Default Configuration Templates

Below are the default configuration templates for basic server monitoring once CloudHub installation is complete.

Collector Server

When Salt Master + Salt Minion + Salt API services are installed.

# Fill in global_tags when the hosts need to be distinguished
[global_tags]
  svc_type = "ch-mgmt"
  dc = ""
  rack = ""

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "10s"
  precision = ""
  hostname = "" # if empty, this is assigned by OS hostname.
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://{influxdb_url}:8086"]
  database = "Default"
  retention_policy = ""
  write_consistency = "any"
  username = ""
  password = ""

[[inputs.cpu]]
  percpu = false
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.net]]

[[inputs.processes]]

[[inputs.procstat]]
  systemd_unit = "snet-salt-master"

[[inputs.procstat]]
  systemd_unit = "snet-salt-api"

[[inputs.procstat]]
  systemd_unit = "snet-salt-minion"

[[inputs.procstat]]
  systemd_unit = "telegraf"

[[inputs.swap]]

[[inputs.system]]

[[inputs.netstat]]

[[inputs.influxdb]]
  urls = ["http://{influxdb_url}:8086/debug/vars"]

[[inputs.kapacitor]]
  urls = ["http://{kapacitor_url}:9094/kapacitor/v1/debug/vars"]
  timeout = "5s"

[[inputs.prometheus]]
  urls = ["http://{etcd_url}:2379/metrics"]
  metric_version = 2

Collector Agent

When only Salt Minion is installed, or when only the collector agent (telegraf) is installed.
In other words, a monitored host.

# Fill in global_tags when the hosts need to be distinguished
[global_tags]
  svc_type = "ch-mgmt"
  dc = ""
  rack = ""

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "10s"
  precision = ""
  hostname = "" # if empty, this is assigned by OS hostname.
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://{influxdb_url}:8086"]
  database = "Default"
  retention_policy = ""
  write_consistency = "any"
  username = ""
  password = ""

[[inputs.cpu]]
  percpu = false
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.net]]

[[inputs.processes]]

[[inputs.procstat]]
  systemd_unit = "snet-salt-minion"

[[inputs.procstat]]
  systemd_unit = "telegraf"

[[inputs.swap]]

[[inputs.system]]

[[inputs.netstat]]

Telegraf config advanced sample on Linux (CloudHub Server)

Full Attributes sample: https://github.com/snetsystems/telegraf/blob/release-1.19-snet/etc/telegraf.conf

[global_tags]
  dc = "five.sensory.lab"
  rack = "RACK-A-01"

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "10s"
  precision = ""
  hostname = "" # if empty, this is assigned by OS hostname.
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://{influxdb}:8086"]
  database = "Default"
  retention_policy = ""
  write_consistency = "any"
  username = ""
  password = ""

[[inputs.cpu]]
  percpu = false
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.net]]

[[inputs.processes]]

[[inputs.procstat]]
  systemd_unit = "snet-salt-master"

[[inputs.procstat]]
  systemd_unit = "snet-salt-api"

[[inputs.procstat]]
  systemd_unit = "snet-salt-minion"

[[inputs.procstat]]
  systemd_unit = "telegraf"

[[inputs.swap]]

[[inputs.system]]

[[inputs.netstat]]

[[inputs.influxdb]]
  urls = ["http://k8s-master01:8086/debug/vars"]

[[inputs.kapacitor]]
  urls = ["http://k8s-master01:9094/kapacitor/v1/debug/vars"]
  timeout = "5s"

[[inputs.prometheus]]
  urls = ["http://k8s-master01:2382/metrics"]
  metric_version = 2

[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  container_names = []
  timeout = "10s"
  perdevice = true
  total = false

[[inputs.kube_inventory]]
  interval = "1m"
  url = "https://k8s-master01:6443"
  namespace = ""
  bearer_token_string = ""
  response_timeout = "5s"
  insecure_skip_verify = true
  [inputs.kube_inventory.tagdrop]
    pod_name = ["cronjob-*"]

[[inputs.kubernetes]]
  url = "https://k8s-master01:10250"
  bearer_token_string = ""
  insecure_skip_verify = true
  [inputs.kubernetes.tagdrop]
    pod_name = ["cronjob-*"]

[[inputs.cloudwatch]]
  region = "ap-northeast-2"
  access_key = ""
  secret_key = ""
  period = "5m"
  delay = "5m"
  interval = "5m"
  namespace = "AWS/ApplicationELB"
  statistic_include = ["average"]

[[inputs.cloudwatch]]
  region = "ap-northeast-2"
  access_key = ""
  secret_key = ""
  period = "5m"
  delay = "5m"
  interval = "5m"
  namespace = "AWS/EC2"
  statistic_include = ["average"]

[[inputs.cloudwatch]]
  region = "ap-northeast-2"
  access_key = ""
  secret_key = ""
  period = "5m"
  delay = "5m"
  interval = "5m"
  namespace = "AWS/Usage"
  statistic_include = ["average"]

[[inputs.vsphere]]
  interval = "1m"
  vcenters = ["https://{vcenter}/sdk"]
  username = ""
  password = ""
  insecure_skip_verify = true
  force_discover_on_init = true
  datastore_metric_exclude = ["*"]
  cluster_metric_exclude = ["*"]
  datacenter_metric_exclude = ["*"]
  collect_concurrency = 5
  discover_concurrency = 5

[[inputs.vsphere]]
  interval = "5m"
  vcenters = ["https://{vcenter}/sdk"]
  username = ""
  password = ""
  insecure_skip_verify = true
  force_discover_on_init = true
  host_metric_exclude = ["*"]
  vm_metric_exclude = ["*"]
  max_query_metrics = 256
  collect_concurrency = 3

[[processors.rename]]
  namepass = ["kubernetes_ingress"]
  [[processors.rename.replace]]
    tag = "host"
    dest = "ingress_host"

Telegraf config advanced sample on Windows

Full Attributes Template: https://github.com/snetsystems/telegraf/blob/release-1.19-snet/etc/telegraf_windows.conf

# Global tags can be specified here in key="value" format.
[global_tags]
  dc = "five.sensory.lab"
  rack = "RACK-A-01"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## Maximum number of unwritten metrics per output. Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Log at debug level.
  # debug = false
  ## Log only error level messages.
  # quiet = false

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog". When set to "file", the output file
  ## is determined by the "logfile" setting.
  logtarget = "eventlog"

  ## Name of the file to be logged to when using the "file" logtarget. If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""

  ## The logfile will be rotated after the time interval specified. When set
  ## to 0 no time based rotation is performed. Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0d"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size. When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time. Example: 'America/Chicago'.
  ## See https://socketloop.com/tutorials/golang-display-list-of-timezones-with-gmt for timezone formatting options.
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  # hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  ## The full HTTP or UDP URL for your InfluxDB instance.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  urls = ["http://10.20.2.51:8086"]

  ## The target database for metrics; will be created as needed.
  ## For UDP url endpoint database needs to be configured on server side.
  database = "RnD"

  ## The value of this tag will be used to determine the database. If this
  ## tag is not set the 'database' option is used as the default.
  # database_tag = ""

  ## If true, the 'database_tag' will not be included in the written metric.
  # exclude_database_tag = false

  ## If true, no CREATE DATABASE queries will be sent. Set to true when using
  ## Telegraf with a user without permissions to create databases or when the
  ## database already exists.
  # skip_database_creation = false

  ## Name of existing retention policy to write to. Empty string writes to
  ## the default retention policy. Only takes effect when using HTTP.
  # retention_policy = ""

  ## The value of this tag will be used to determine the retention policy. If this
  ## tag is not set the 'retention_policy' option is used as the default.
  # retention_policy_tag = ""

  ## If true, the 'retention_policy_tag' will not be included in the written metric.
  # exclude_retention_policy_tag = false

  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all".
  ## Only takes effect when using HTTP.
  # write_consistency = "any"

  ## Timeout for HTTP messages.
  timeout = "10s"

  ## HTTP Basic Auth
  # username = "telegraf"
  # password = "metricsmetricsmetricsmetrics"

  ## HTTP User-Agent
  # user_agent = "telegraf"

  ## UDP payload size is the maximum packet size to send.
  # udp_payload = "512B"

  ## Optional TLS Config for use on HTTP connections.
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## HTTP Proxy override, if unset values the standard proxy environment
  ## variables are consulted to determine which proxy, if any, should be used.
  # http_proxy = "http://corporate.proxy:3128"

  ## Additional HTTP headers
  # http_headers = {"X-Special-Header" = "Special-Value"}

  ## HTTP Content-Encoding for write request body, can be set to "gzip" to
  ## compress body or "identity" to apply no encoding.
  # content_encoding = "identity"

  ## When true, Telegraf will output unsigned integers as unsigned values,
  ## i.e.: "42u". You will need a version of InfluxDB supporting unsigned
  ## integer values. Enabling this option will result in field type errors if
  ## existing data has been written.
  # influx_uint_support = false

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Windows Performance Counters plugin.
# These are the recommended method of monitoring system metrics on windows,
# as the regular system plugins (inputs.cpu, inputs.mem, etc.) rely on WMI,
# which utilize more system resources.
#
# See more configuration examples at:
#   https://github.com/influxdata/telegraf/tree/master/plugins/inputs/win_perf_counters

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Processor usage, alternative to native, reports on a per core.
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = [
      "% Idle Time",
      "% Interrupt Time",
      "% Privileged Time",
      "% User Time",
      "% Processor Time",
      "% DPC Time",
    ]
    Measurement = "win_cpu"
    # Set to true to include _Total instance when querying for all (*).
    IncludeTotal = true

  [[inputs.win_perf_counters.object]]
    # Disk times and queues
    ObjectName = "LogicalDisk"
    Instances = ["*"]
    Counters = [
      "% Idle Time",
      "% Disk Time",
      "% Disk Read Time",
      "% Disk Write Time",
      "% Free Space",
      "Current Disk Queue Length",
      "Free Megabytes",
    ]
    Measurement = "win_disk"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    ObjectName = "PhysicalDisk"
    Instances = ["*"]
    Counters = [
      "Disk Read Bytes/sec",
      "Disk Write Bytes/sec",
      "Current Disk Queue Length",
      "Disk Reads/sec",
      "Disk Writes/sec",
      "% Disk Time",
      "% Disk Read Time",
      "% Disk Write Time",
    ]
    Measurement = "win_diskio"

  [[inputs.win_perf_counters.object]]
    ObjectName = "Network Interface"
    Instances = ["*"]
    Counters = [
      "Bytes Received/sec",
      "Bytes Sent/sec",
      "Packets Received/sec",
      "Packets Sent/sec",
      "Packets Received Discarded",
      "Packets Outbound Discarded",
      "Packets Received Errors",
      "Packets Outbound Errors",
    ]
    Measurement = "win_net"

  [[inputs.win_perf_counters.object]]
    ObjectName = "System"
    Counters = [
      "Context Switches/sec",
      "System Calls/sec",
      "Processor Queue Length",
      "System Up Time",
    ]
    Instances = ["------"]
    Measurement = "win_system"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    # Example query where the Instance portion must be removed to get data back,
    # such as from the Memory object.
    ObjectName = "Memory"
    Counters = [
      "Available Bytes",
      "Cache Faults/sec",
      "Demand Zero Faults/sec",
      "Page Faults/sec",
      "Pages/sec",
      "Transition Faults/sec",
      "Pool Nonpaged Bytes",
      "Pool Paged Bytes",
      "Standby Cache Reserve Bytes",
      "Standby Cache Normal Priority Bytes",
      "Standby Cache Core Bytes",
    ]
    # Use 6 x - to remove the Instance bit from the query.
    Instances = ["------"]
    Measurement = "win_mem"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    # Example query where the Instance portion must be removed to get data back,
    # such as from the Paging File object.
    ObjectName = "Paging File"
    Counters = [
      "% Usage",
    ]
    Instances = ["_Total"]
    Measurement = "win_swap"

# Windows system plugins using WMI (disabled by default, using
# win_perf_counters over WMI is recommended)

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = false
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
  report_active = false

# Read metrics about memory usage
[[inputs.mem]]

# Read metrics about swap memory usage
[[inputs.swap]]
