README
Welcome to the yb_stats book.
This is the documentation of the yb_stats project.
yb_stats is a CLI tool to query, investigate and extract all facts from a YugabyteDB cluster, and store these in snapshot files to persist the facts and make them easy to transport.
The way yb_stats works is that it obtains information from the YugabyteDB and node-exporter http endpoints using the least amount of dependencies, which means using only the yb_stats tool and the http endpoints.
The tool provides six general modes of usage:
- Ad-hoc mode: request and store performance metrics and status data in memory, then wait for enter while the actions under investigation are performed. Pressing enter requests and stores the performance metrics and status data in memory again, after which the differences are shown.
- Snapshot mode: request and store all performance metrics and status data to snapshot files and display the snapshot number.
- Snapshot-diff mode: read the performance metrics and status data from two different snapshots, and show the differences. This produces the same output as ad-hoc mode, but with the metrics taken from snapshots instead.
- Print snapshot mode: use the print functions with a snapshot number to display data stored in a snapshot, such as version, vars, etc.
- Print ad-hoc mode: use the print functions without a snapshot number to fetch and display the same data live from the cluster.
- Raw mode: you can always look into the stored snapshot files yourself. For two sources of data there is no function in yb_stats to publish the results: the http endpoint /pprof/growth output and the /memz output.
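A minimal sketch of a typical workflow, combining snapshot and snapshot-diff mode (host addresses are illustrative, and the example assumes these are the first two snapshots, numbered 0 and 1):
yb_stats --hosts 10.1.0.1,10.1.0.2,10.1.0.3 --snapshot
(perform the workload or change under investigation)
yb_stats --snapshot
yb_stats --snapshot-diff -b 0 -e 1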
This is an overview of the datasources, the type of daemon they come from, the (default) ports, and the http endpoint used.
Datasource | Type | Default Port(s) | Endpoint |
---|---|---|---|
Metrics | tserver/master | 7000,9000,12000,13000 | /metrics |
Statements | tserver | 13000 | /statements |
Metrics | node-exporter | 9300 | /metrics |
Gflags | tserver/master | 7000,9000 | /varz |
Vars | tserver/master | 7000,9000 | /api/v1/varz |
Threads | tserver/master | 7000,9000 | /threadz |
Mem-trackers | tserver/master | 7000,9000 | /mem-trackers |
Logging | tserver/master | 7000,9000 | /logs |
Version | tserver/master | 7000,9000 | /api/v1/version |
Entities | master | 7000 | /dump-entities |
Masters | master | 7000 | /api/v1/masters |
Tablet-servers | master | 7000 | /api/v1/tablet-servers |
RPCs | tserver/master | 7000,9000,12000,13000 | /rpcz |
Pprof | tserver/master | 7000,9000 | /pprof/growth |
Memory breakdown | tserver/master | 7000,9000 | /memz |
Based on the datasources, these are the modes available for each datasource:
Datasource | Snapshot | Ad-hoc | Diff | Print snap | Print ad-hoc | Raw |
---|---|---|---|---|---|---|
Metrics (YB) | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
Statements | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
Metrics (node-exporter) | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
Gflags | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Vars | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Thread | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Mem-trackers | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Logging | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Version | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Entities | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Masters | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Tablet-servers | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
RPCs | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Pprof | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
Memory breakdown | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
- Snapshot: the datasources captured with the snapshot command.
- Ad-hoc: the datasources that are involved in an ad-hoc (non-snapshot) use.
- Diff: the datasources that are involved in a snapshot-diff use. This is identical to ad-hoc, but with the data taken from snapshots instead of 'live'.
- Print: the datasources that are queryable via a yb_stats print command.
- Raw: the datasources that are captured in the snapshot directory, but require reading the file directly; no print or diff command exists.
Install
Mac OSX via homebrew
Add the yb_stats "brew tap" (which is a github repository):
brew tap fritshoogland-yugabyte/yb_stats
Install yb_stats:
brew install yb_stats
yb_stats is available in /usr/local/bin, which should normally be in $PATH.
Uninstall yb_stats via homebrew
Remove yb_stats:
brew uninstall yb_stats
Remove the yb_stats "brew tap":
brew untap fritshoogland-yugabyte/yb_stats
RPM based distributions
Install the provided yb_stats RPM via yum:
EL7:
sudo yum install https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.9.8/yb_stats-0.9.8-el.7.x86_64.rpm
EL8:
sudo yum install https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.9.8/yb_stats-0.9.8-el.8.x86_64.rpm
EL9:
sudo yum install https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.9.8/yb_stats-0.9.8-el.9.x86_64.rpm
After yum install, yb_stats is available in /usr/local/bin, which should normally be in $PATH.
These are the latest versions at the time of writing. Look at the yb_stats github repository releases page for newer versions.
Mac OSX compile from source
Install Rust via rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Clone the yb_stats repository:
git clone https://github.com/fritshoogland-yugabyte/yb_stats.git
Build yb_stats:
cd yb_stats
cargo build --release
The yb_stats executable is available in the ./target/release/ directory after successful compilation.
Linux compile from source
Install dependencies via yum:
sudo yum install -y git openssl-devel gcc
Install Rust via rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Clone the yb_stats repository:
git clone https://github.com/fritshoogland-yugabyte/yb_stats.git
Build yb_stats:
cd yb_stats
cargo build --release
The yb_stats executable is available in the ./target/release/ directory after successful compilation.
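Optionally, copy the executable to a directory in $PATH; a minimal sketch (the target path is illustrative, chosen to match the packaged installs):
sudo cp ./target/release/yb_stats /usr/local/bin/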
Upgrade
Warning
Be aware that when upgrading from a version before 0.9, the snapshot format has changed:
- Before yb_stats version 0.9, the snapshot data is stored as CSV.
- Starting from yb_stats version 0.9, the snapshot data is stored as JSON.
Mac OSX via homebrew
This will upgrade the current installed version to the latest available tap version.
brew upgrade yb_stats
RPM based distributions
EL7:
sudo yum upgrade https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.8.8/yb_stats-0.8.8-1.el7.x86_64.rpm
EL8:
sudo yum upgrade https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.8.8/yb_stats-0.8.8-1.el8.x86_64.rpm
These are the latest versions at the time of writing. Look at the yb_stats github repository releases page for the current versions.
Running yb_stats
Once yb_stats is installed, it can be used to get metadata from a YugabyteDB cluster and to create snapshots.
Before yb_stats can be used to create snapshots or perform ad-hoc operations, in other words query data from a YugabyteDB cluster, the addresses of the endpoints must be specified using the --hosts switch, and optionally the --ports switch if ports have been changed (see: Specifying hosts and Specifying ports), or the .env file should exist in the current working directory.
If yb_stats is run for obtaining data, which means running in ad-hoc mode, ad-hoc print mode or snapshot mode, it must be able to access the YugabyteDB cluster nodes, and be allowed to access the different ports.
If yb_stats is run for investigating snapshots using print mode or snapshot-diff mode, there is no need to access the cluster ip addresses or ports: it will use the locally stored snapshot data only.
For a more comprehensive view of the objects in the cluster, as well as the tablets, the --extra-data switch can be used. This switch obtains detailed data about objects as well as tablets, at the cost of performing more work, and thus taking more time. See the --extra-data switch.
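For example, to create a snapshot including the detailed object and tablet data:
yb_stats --snapshot --extra-data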
Specifying hosts
When yb_stats is used to run in ad-hoc mode, ad-hoc print mode or in snapshot mode, the YugabyteDB cluster nodes must be reachable, and yb_stats must be configured with the list of nodes. This is done via the --hosts switch, which requires a comma separated list of ip addresses or hostnames.
The default hosts for yb_stats are 192.168.66.80,192.168.66.81,192.168.66.82. This is unlikely to match your environment.
Please mind there should be no spaces between the hostnames or ip addresses and the commas.
Example:
yb_stats --hosts 10.1.0.1,10.1.0.2,10.1.0.3
The set hosts (as well as ports and parallelism) are stored in the file .env in the current working directory.
This means that if you have different YugabyteDB clusters you want to use, you either:
- Need to specify the hosts on every usage, which will still create a .env file, which is then not used because of the explicit specification.
- Use a different directory for each cluster.
Using a different directory is highly recommended, especially because then the snapshots in that directory will only be about the specified cluster.
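A minimal sketch of the directory-per-cluster approach (directory names and addresses are illustrative):
mkdir cluster-a && cd cluster-a
yb_stats --hosts 10.1.0.1,10.1.0.2,10.1.0.3 --snapshot
cd .. && mkdir cluster-b && cd cluster-b
yb_stats --hosts 10.2.0.1,10.2.0.2,10.2.0.3 --snapshot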
Specifying ports
When yb_stats is used to run in ad-hoc mode, ad-hoc print mode or in snapshot mode, the YugabyteDB cluster nodes must be reachable, and yb_stats must be configured with the list of nodes. However, it must also connect to the correct port numbers.
In most cases, the port numbers are kept at their defaults. In that case there is no need to specify ports, and the default list will be okay. This means that unless one or more default ports have been changed, there is no need to specify ports.
Another reason to specify ports is when, for example, node-exporter is not installed: in such a case the list can be specified excluding the node-exporter port YugabyteDB uses (9300), although this is not required.
Please mind that YugabyteDB runs node-exporter on port 9300 instead of the node-exporter default port, because the default port conflicts with a port already used by the YugabyteDB daemons.
Example:
yb_stats --ports 7000,9000,12000,13000
As an example, this list excludes the 9300 port for node-exporter.
The set ports (as well as hosts and parallelism) are stored in the file .env in the current working directory. This means that if you have different YugabyteDB clusters you want to use, you either:
- Need to specify the ports on every usage, which will still create a .env file, which is then not used because of the explicit specification.
- Use a different directory for each cluster.
Using a different directory is highly recommended, especially because then the snapshots in that directory will only be about the specified cluster.
Specifying parallel
The amount of parallelism can be set using the --parallel switch. By default, yb_stats uses a single thread for performing its work. The work is parallelized in two places: per type of datasource, and per hostname:port combination fetched for a specific type of data.
There is no parallelism for reading the data from the snapshots and presenting it.
Please mind that the above description means parallelism is applied at two different points. This means that the parallelism value should be chosen with care, to not make the total parallelism too high.
This is how that works; threads are used for:
- Snapshot: when a snapshot is created, each different datasource is executed by an individual thread in parallel.
- Requesting a hostname:port combination in a datasource thread: the thread for each datasource will scan each hostname:port combination using a thread.
Step 1 is always done in parallel, and limited by the number of concurrent active parallel threads the executable is allowed to create. Setting parallelism to 3 will start 3 threads for each datasource, performing requests to 3 hostname:port combinations in each of these threads. The reason for this combination is that most of the time in requesting data is spent idle, waiting for a response.
On MacOS, setting parallel to a number that is too high can throw errors on:
- The number of files/file descriptors (tcp connect error: Too many open files (os error 24)).
- The number of threads (Err value: ThreadPoolBuildError).
In both cases this can be solved by lowering the value for parallel. For the Too many open files error, another solution can be to increase the OS/user limit for the number of open files.
MacOS: https://gist.github.com/tombigel/d503800a282fcadbee14b537735d202c
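A minimal sketch of taking a snapshot with increased parallelism, after raising the open files limit for the current shell (the values are illustrative):
ulimit -n 10240
yb_stats --snapshot --parallel 3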
.env
Whenever the yb_stats executable is run with --hosts, --ports or --parallel specified, it will use the value of the flag(s) set, but it will also write the specification of hosts, ports or parallelism into a file called .env.
When yb_stats is executed with a .env file in the current working directory, it will set the environment variables listed in that file. If any of:
- YBSTATS_HOSTS
- YBSTATS_PORTS
- YBSTATS_PARALLEL
are set in the .env file, it will use the set values. This way, specifying any or all of the hosts list, ports list or parallelism only needs to be done once, and will be set automatically for every next invocation executed from the same directory.
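As an illustration, assuming the common KEY=value dotenv format, a .env file written by yb_stats might look like this (the values are examples):
YBSTATS_HOSTS=192.168.66.80,192.168.66.81,192.168.66.82
YBSTATS_PORTS=7000,9000,12000,13000,9300
YBSTATS_PARALLEL=1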
If a flag is specified with yb_stats that is also set in .env, the flag specified on the command line takes precedence, and the entry in .env will be overwritten with the new value.
WARNING
Please mind that if something outside of yb_stats uses .env for its own purposes and sets its own values in that file, running yb_stats with that .env file in the current working directory will overwrite the file and only set the yb_stats values, removing any settings not used by yb_stats.
yb_stats snapshot results
Directory yb_stats.snapshots
When yb_stats is run with the --snapshot mode switch, it will try to find a directory in the current working directory by the name of yb_stats.snapshots.
If it can't find the directory, it will try to create it. If opening an existing directory or creating a non-existing one fails, execution stops with an error indicating the issue.
The yb_stats.snapshots/snapshot.index file
Inside the yb_stats.snapshots directory the first snapshot will create a CSV file called snapshot.index. This file lists the snapshots taken, comma separated, with the following fields:
- snapshot number
- snapshot timestamp
- snapshot comments
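As an illustration, a snapshot.index file might look like this (the values are examples, with an empty comment for snapshot 0):
0,2022-10-17 19:50:58.048195 +02:00,
1,2022-10-17 19:52:34.413494 +02:00,second snap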
Using this file, yb_stats can determine the snapshot numbers, and add one to the highest number for a new snapshot.
Another use of this file is to obtain the first snapshot timestamp for use with snapshot diffs: if a second snapshot contains a metric that is not present in the first snapshot, it can safely assume the first snapshot value of that metric to be 0. However, it must make a guess about the approximate first snapshot time, which is where the timestamp is used.
A last use of this file is when yb_stats is run with the --snapshot-list switch: yb_stats will list this information, and quit.
Example
This is how a typical snapshot is performed:
% yb_stats --snapshot
snapshot number 0
Error
If the current working directory is not writable, it will provide the following error:
[2022-11-30T10:59:18Z ERROR yb_stats::snapshot] Fatal: error creating directory /Users/fritshoogland/Downloads/t/yb_stats.snapshots: Permission denied (os error 13)
yb_stats --snapshot --extra-data switch
When the --extra-data switch is added to the --snapshot switch, yb_stats will get detailed data for each "user table" (actual table or materialized view), "index table" (index) and "system table" (PostgreSQL catalog table), as well as for every tablet.
This requires more work to be done by yb_stats, which is why this is a separate switch. It will request more data from all masters and tablet servers, proportional to the number of tables and the number of tablets.
Output modes
The output that yb_stats provides can roughly be divided into 3 categories:
- Information that is obtained from stored snapshots, which is formatted for readability.
- Information that is obtained from the live cluster, which is formatted for readability.
- Raw output provided in a text file, which is obtained at snapshot time. This is not printed, the resulting file can be used.
Snapshot usage
The locally stored snapshots must be available in the following way:
- The directory yb_stats.snapshots must be present and accessible in the current working directory.
- The directory yb_stats.snapshots must contain a file called snapshot.index.
- The snapshot.index file must list all the available snapshots.
- The snapshots listed in snapshot.index must be available as directories with the snapshot number as directory name inside the yb_stats.snapshots directory.
- Inside each snapshot directory, the files containing the JSON data that make up the snapshot must be present.
All the files in all the snapshots are UTF8 JSON files, and therefore can be easily transported.
This means you don't have to use the same computer to view the snapshot output: if the yb_stats.snapshots directory and its contents are zipped, they can be copied and shared.
This means you can unzip a snapshots archive and investigate it on your own computer, without needing access to the cluster where the snapshots came from.
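A minimal sketch of transporting snapshots (the archive name is illustrative):
zip -r snapshots.zip yb_stats.snapshots
Then, on another machine, unzip the archive and run the diff from the directory containing yb_stats.snapshots:
unzip snapshots.zip
yb_stats --snapshot-diff -b 0 -e 1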
Security restrictions
Because the snapshots are UTF8 JSON files, the files can be inspected by security officers for the existence of security issues or secret data.
yb_stats only stores cluster metadata, not actual table data. However, yb_stats does store (part of) the logfiles, so these can potentially contain actual data.
Filters
Filters can be used when showing data, and are never used for creating snapshots.
To be more concrete, filters can be used with the following usage:
- No switches/ad-hoc mode.
- Snapshot-diff mode.
- Print modes, depending on the print topic.
Filters use regular expressions for optimal flexibility.
--hostname-match
Most yb_stats query options allow filtering on hostname. This is done with the --hostname-match switch. The hostname match switch uses a regex to filter entries based on the hostname:port specification.
A very simple use of --hostname-match is to filter on the port number, for example to select only the tablet servers:
yb_stats --hostname-match 9000
Or select only the tablet servers and master servers, thereby filtering out the node-exporter, YEDIS, YCQL and YSQL output:
yb_stats --hostname-match '(7000|9000)'
Because the filter is based on regex pattern matching, it is also easy to specify, for example, a master (port 7000) on the host with host number 82 in a class C network (class C means for example 192.168.66/24, so one octet remains: '82'), which means you can filter in this way:
yb_stats --hostname-match '82:7000'
--stat-name-match
It is possible to filter on the statistic names of the performance metrics in an in-memory (ad-hoc) or snapshot diff, as well as on the gflags names and the mem-trackers id values. The switch for this is --stat-name-match. This is a regex filter.
For example if you are only interested in master and tserver cpu statistics:
yb_stats --stat-name-match 'cpu_.time'
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 0.900 seconds
192.168.66.80:12000 server cpu_stime 15 ms 16.502 /s
192.168.66.80:7000 server cpu_stime 2 ms 2.200 /s
192.168.66.80:7000 server cpu_utime 8 ms 8.801 /s
192.168.66.80:9000 server cpu_stime 14 ms 15.402 /s
192.168.66.81:12000 server cpu_stime 6 ms 6.608 /s
192.168.66.81:12000 server cpu_utime 9 ms 9.912 /s
192.168.66.81:7000 server cpu_stime 8 ms 8.801 /s
192.168.66.81:9000 server cpu_stime 7 ms 7.709 /s
192.168.66.81:9000 server cpu_utime 9 ms 9.912 /s
192.168.66.82:12000 server cpu_stime 6 ms 6.645 /s
192.168.66.82:12000 server cpu_utime 10 ms 11.074 /s
192.168.66.82:7000 server cpu_utime 8 ms 8.840 /s
192.168.66.82:9000 server cpu_stime 6 ms 6.637 /s
192.168.66.82:9000 server cpu_utime 11 ms 12.168 /s
Or a more sophisticated filter to look at tserver and master cpu time, as well as voluntary and involuntary context switches:
yb_stats --stat-name-match '(cpu_.time|voluntary_context_switches)'
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 0.915 seconds
192.168.66.80:12000 server cpu_utime 12 ms 13.086 /s
192.168.66.80:12000 server voluntary_context_switches 234 csws 255.180 /s
192.168.66.80:7000 server cpu_stime 2 ms 2.181 /s
192.168.66.80:7000 server cpu_utime 9 ms 9.815 /s
192.168.66.80:7000 server voluntary_context_switches 105 csws 114.504 /s
192.168.66.80:9000 server cpu_utime 13 ms 14.177 /s
192.168.66.80:9000 server voluntary_context_switches 235 csws 256.270 /s
192.168.66.81:12000 server cpu_stime 6 ms 6.550 /s
192.168.66.81:12000 server cpu_utime 8 ms 8.734 /s
192.168.66.81:12000 server voluntary_context_switches 262 csws 286.026 /s
192.168.66.81:7000 server cpu_utime 7 ms 7.625 /s
192.168.66.81:7000 server voluntary_context_switches 70 csws 76.253 /s
192.168.66.81:9000 server cpu_stime 6 ms 6.543 /s
192.168.66.81:9000 server cpu_utime 8 ms 8.724 /s
192.168.66.81:9000 server voluntary_context_switches 263 csws 286.805 /s
192.168.66.82:12000 server cpu_stime 6 ms 6.565 /s
192.168.66.82:12000 server cpu_utime 8 ms 8.753 /s
192.168.66.82:12000 server voluntary_context_switches 272 csws 297.593 /s
192.168.66.82:7000 server cpu_utime 8 ms 8.762 /s
192.168.66.82:7000 server voluntary_context_switches 65 csws 71.194 /s
192.168.66.82:9000 server cpu_stime 6 ms 6.565 /s
192.168.66.82:9000 server cpu_utime 8 ms 8.753 /s
192.168.66.82:9000 server voluntary_context_switches 269 csws 294.311 /s
--table-name-match
yb_stats by default sums up per-table and per-tablet statistics per hostname:port combination, to sensibly reduce the amount of output. However, yb_stats can be set to display per-table and per-tablet statistics individually using the --details-enable switch.
If the --details-enable switch is set, the table name is stored with the statistics for both the table and tablet data. The --table-name-match switch allows you to filter on the table name, to look at the statistics of only the tables of interest.
Additionally, the --table-name-match switch can also be used for printing the details of the entities data, which also are tables.
Example: filter for the sys.catalog (postgres catalog) entries in the master. Please mind the --hostname-match option is also used, because otherwise the node-exporter output would still be shown: that data is not filtered by --table-name-match.
yb_stats --details-enable --table-name-match catalog --hostname-match 7000
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 1.621 seconds
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_bytes_read 4095007 bytes 2545063.393 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_data_hit 97 blocks 60.286 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_hit 172 blocks 106.899 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_index_hit 75 blocks 46.613 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_single_touch_bytes_read 4095007 bytes 2545063.393 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_single_touch_hit 172 blocks 106.899 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_db_iter_bytes_read 1353537 bytes 841228.713 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_no_table_cache_iterators 48 iters 29.832 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_next 2180 keys 1354.879 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_next_found 2225 keys 1382.846 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_seek 131 keys 81.417 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_seek_found 122 keys 75.823 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_superversion_acquires 4 nr 2.486 /s
Data enrichment
Despite yb_stats providing a lot of output, it still tries to reduce the total amount of data.
There are two options for increasing the amount of data: adding gauge data, and showing the original, non-summarized data for table, tablet and cdc (change data capture) statistics.
--gauges-enable
The default output for metrics shows the values of metrics of the type counter. Counters are ever increasing values. In most cases, a counter value by itself is not that useful, but the difference between two points in time is. That is the reason metrics require two snapshots: to have the ability to calculate the difference, and thus understand how much a counter has changed.
But there are also metrics that show absolute values, which are not ever increasing, but instead show the current situation. This means that the difference between two points in time does not have the same meaning as for a counter, although it can still be important. Such a value is called a gauge.
By default, yb_stats does NOT show gauge values. To make yb_stats show gauge values in addition to the counters, use the --gauges-enable switch.
This is yb_stats in ad-hoc mode (not showing gauges):
yb_stats
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 1.019 seconds
192.168.66.80:12000 server cpu_stime 2 ms 2.200 /s
192.168.66.80:12000 server cpu_utime 13 ms 14.301 /s
192.168.66.80:12000 server server_uptime_ms 910 ms 1001.100 /s
192.168.66.80:12000 server voluntary_context_switches 233 csws 256.326 /s
This is yb_stats in ad-hoc mode showing gauges:
yb_stats --gauges-enable
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 1.053 seconds
192.168.66.80:12000 server cpu_stime 2 ms 1.927 /s
192.168.66.80:12000 server cpu_utime 13 ms 12.524 /s
192.168.66.80:12000 server generic_current_allocated_bytes 26914872 bytes +1120
192.168.66.80:12000 server generic_heap_size 43188224 bytes +0
192.168.66.80:12000 server hybrid_clock_error 500000 us +0
192.168.66.80:12000 server hybrid_clock_hybrid_time 6824678428388270080 us +4253745152
192.168.66.80:12000 server server_uptime_ms 1038 ms 1000.000 /s
192.168.66.80:12000 server tcmalloc_current_total_thread_cache_bytes 2560384 bytes +137976
192.168.66.80:12000 server tcmalloc_max_total_thread_cache_bytes 33554432 bytes +0
192.168.66.80:12000 server tcmalloc_pageheap_free_bytes 3579904 bytes -90112
192.168.66.80:12000 server tcmalloc_pageheap_unmapped_bytes 7847936 bytes -8192
192.168.66.80:12000 server threads_running 45 threads +0
192.168.66.80:12000 server threads_running_CQLServer_reactor 1 threads +0
192.168.66.80:12000 server threads_running_acceptor 1 threads +0
192.168.66.80:12000 server threads_running_iotp_CQLServer 4 threads +0
192.168.66.80:12000 server threads_running_rpc_thread_pool 15 threads +0
192.168.66.80:12000 server voluntary_context_switches 294 csws 283.237 /s
The gauge values can be spotted because they do not end with '/s': they do not show a per-second value. Instead, the first value is the value of the END snapshot, and the second value is the difference from the FIRST snapshot.
--details-enable
By default, statistics of the metric_types table, tablet and cdc are summed by metric_type and statistic name. This is done to reduce the amount of output as sensibly as possible without losing accuracy.
However, sometimes it is necessary to split up the statistics per metric_type to understand metrics about a specific table, tablet or cdc object. This can be done using the --details-enable switch. This switch introduces a few more columns in the output to detail the table, tablet or cdc metric type statistics.
This is what the regular output looks like:
yb_stats --snapshot-diff -b 0 -e 1 --hostname-match 82:7000 --stat-name-match rocksdb
192.168.66.81:7000 tablet rocksdb_block_cache_bytes_read 11760999 bytes 865351.998 /s
192.168.66.81:7000 tablet rocksdb_block_cache_data_hit 255 blocks 18.762 /s
192.168.66.81:7000 tablet rocksdb_block_cache_hit 548 blocks 40.321 /s
192.168.66.81:7000 tablet rocksdb_block_cache_index_hit 293 blocks 21.558 /s
192.168.66.81:7000 tablet rocksdb_block_cache_single_touch_bytes_read 11760999 bytes 865351.998 /s
192.168.66.81:7000 tablet rocksdb_block_cache_single_touch_hit 548 blocks 40.321 /s
192.168.66.81:7000 tablet rocksdb_db_iter_bytes_read 1695826 bytes 124775.660 /s
192.168.66.81:7000 tablet rocksdb_no_table_cache_iterators 193 iters 14.201 /s
192.168.66.81:7000 tablet rocksdb_number_db_next 2783 keys 204.768 /s
192.168.66.81:7000 tablet rocksdb_number_db_next_found 2783 keys 204.768 /s
192.168.66.81:7000 tablet rocksdb_number_db_seek 268 keys 19.719 /s
192.168.66.81:7000 tablet rocksdb_number_db_seek_found 257 keys 18.910 /s
192.168.66.81:7000 tablet rocksdb_number_superversion_acquires 3 nr 0.221 /s
This is filtered down to statistics that occur on the tablet metric type.
This is what that output looks like when --details-enable is added:
yb_stats --snapshot-diff -b 0 -e 1 --hostname-match 82:7000 --stat-name-match rocksdb --details-enable
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_bytes_read 11760999 bytes 865351.998 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_data_hit 255 blocks 18.762 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_hit 548 blocks 40.321 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_index_hit 293 blocks 21.558 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_single_touch_bytes_read 11760999 bytes 865351.998 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_single_touch_hit 548 blocks 40.321 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_db_iter_bytes_read 1695826 bytes 124775.660 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_no_table_cache_iterators 193 iters 14.201 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_next 2783 keys 204.768 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_next_found 2783 keys 204.768 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_seek 268 keys 19.719 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_seek_found 257 keys 18.910 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_superversion_acquires 3 nr 0.221 /s
In this case, the statistics were generated by a single tablet. However if multiple tablets were involved, the statistics for all of these would be shown.
The --details-enable switch introduces a couple of extra columns:
- A column for the UUID or number of the metric object
- A column that shows the namespace and the object name.
yb_stats ad-hoc results
When yb_stats is run without any switch, or with only the data or filter switches, it will perform a snapshot in memory, and wait for enter to perform the next snapshot and present the differences.
This is called 'ad-hoc mode'.
This will not only show the difference for performance based statistics (the metric counters and optionally gauges), but also any change in the cluster, such as:
- The addition or removal of tablet servers or masters.
- Restarts of tablet servers or masters.
- The creation or removal of database objects (tables, indexes, materialized views, databases/keyspaces).
- The change of any gflags of the tablet servers or masters.
- Any change for a replica, notably the LEADER or FOLLOWER state.
- Role changes for the masters.
The usage of either ad-hoc mode or snapshot mode should be carefully considered. Ad-hoc mode, alias in-memory snapshots, does not write anything. In most cases, performing snapshots that persist all available information is the best way, so results can be reviewed later and cannot get lost, because they are stored. However, if you are performing repeated tests where storing all snapshot information would simply be too much and would require you to remove all the snapshots after testing anyway, AND you are sure what to look for, then ad-hoc mode can be used.
Example
This is what the first snapshot looks like in ad-hoc mode:
% yb_stats
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
After the snapshot is created, you can perform the task under investigation. Once that is done, press enter:
Time between snapshots: 70.166 seconds
192.168.66.80:12000 server cpu_stime 874 ms 12.458 /s
192.168.66.80:12000 server cpu_utime 7 ms 0.100 /s
...etcetera
snapshot-diff mode
The purpose of snapshot-diff mode is to read two locally stored snapshots, and show a difference report.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by statistic name regular expression.
- --table-name-match: filter by table name regular expression (requires --details-enable to split table and tablet statistics out).
- --details-enable: split table and tablet statistics, instead of summarizing them per server.
- --gauges-enable: add non-counter statistics to the output.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
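As an illustration, these switches can be combined; for example, a snapshot-diff restricted to rocksdb statistics on the master http port (the regexes are illustrative):
yb_stats --snapshot-diff -b 0 -e 1 --hostname-match 7000 --stat-name-match rocksdb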
snapshot-diff mode means using already stored snapshots, which can be from a cluster that is currently unavailable or even deleted, because snapshot-diff mode only uses the information stored in the locally available snapshot (JSON) data. This gives a lot of options for investigation that would otherwise be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke snapshot-diff mode is to use the --snapshot-diff switch.
If --snapshot-diff is used without -b/--begin and -e/--end, yb_stats will list the available snapshots and prompt for the begin and end snapshot numbers.
snapshot-diff without begin/end specification:
yb_stats --snapshot-diff
0 2022-10-17 19:50:58.048195 +02:00
1 2022-10-17 19:52:34.413494 +02:00 second snap
2 2022-10-18 15:26:20.061213 +02:00
Enter begin snapshot: 0
Enter end snapshot: 1
192.168.66.80:12000 server cpu_stime 654 ms 6.792 /s
192.168.66.80:12000 server cpu_utime 311 ms 3.230 /s
192.168.66.80:12000 server involuntary_context_switches 1 csws 0.010 /s
192.168.66.80:12000 server server_uptime_ms 96292 ms 1000.000 /s
192.168.66.80:12000 server threads_started 4 threads 0.042 /s
192.168.66.80:12000 server threads_started_thread_pool 4 threads 0.042 /s
192.168.66.80:12000 server voluntary_context_switches 21821 csws 226.613 /s
snapshot-diff with begin/end specification:
yb_stats --snapshot-diff -b 0 -e 1
192.168.66.80:12000 server cpu_stime 654 ms 6.792 /s
192.168.66.80:12000 server cpu_utime 311 ms 3.230 /s
192.168.66.80:12000 server involuntary_context_switches 1 csws 0.010 /s
192.168.66.80:12000 server server_uptime_ms 96292 ms 1000.000 /s
192.168.66.80:12000 server threads_started 4 threads 0.042 /s
192.168.66.80:12000 server threads_started_thread_pool 4 threads 0.042 /s
192.168.66.80:12000 server voluntary_context_switches 21821 csws 226.613 /s
...etc...
The --snapshot-diff switch shows the differences for all of the following data points:
- Metrics
- (YSQL) statements
- Node-exporter
- Versions (master and tablet server software versions)
- Entities (YSQL and YCQL objects (tables, indexes and materialized views), databases/keyspaces, tablets and replicas)
- Master status
- Tablet server status
- Vars (gflags)
- Health check (from the master)
metrics-diff mode
The purpose of metrics-diff mode is to read two locally stored snapshots, and show a difference report for the metrics data only.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by statistic name regular expression.
- --table-name-match: filter by table name regular expression (requires --details-enable to split table and tablet statistics out).
- --details-enable: split table and tablet statistics, instead of summarizing them per server.
- --gauges-enable: add non-counter statistics to the output.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
metrics-diff mode means using already stored snapshots, which can be from a cluster that is currently unavailable or even deleted, because metrics-diff mode only uses the information stored in the locally available snapshot (JSON) data. This gives a lot of options for investigation that would otherwise be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke metrics-diff mode is to use the --metrics-diff switch.
If --metrics-diff is used without -b/--begin and -e/--end, yb_stats will list the available snapshots and prompt for the begin and end snapshot numbers.
metrics-diff without begin/end specification:
yb_stats --metrics-diff
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
3 2023-03-19 14:20:50.977417 +01:00
4 2023-03-19 14:21:14.418544 +01:00
5 2023-03-19 14:24:17.733927 +01:00
Enter begin snapshot: 4
Enter end snapshot: 5
192.168.66.80:12000 server cpu_stime 2669 ms 14.531 /s
192.168.66.80:12000 server cpu_utime 80 ms 0.436 /s
192.168.66.80:12000 server involuntary_context_switches 36 csws 0.196 /s
192.168.66.80:12000 server server_uptime_ms 183203 ms 997.441 /s
192.168.66.80:12000 server spinlock_contention_time 1018673 us 5546.123 /s
192.168.66.80:12000 server threads_started 12 threads 0.065 /s
192.168.66.80:12000 server threads_started_thread_pool 12 threads 0.065 /s
192.168.66.80:12000 server voluntary_context_switches 52096 csws 283.635 /s
...lots more output
metrics-diff with begin/end specification:
yb_stats --metrics-diff -b 4 -e 5
192.168.66.80:12000 server cpu_stime 2669 ms 14.531 /s
192.168.66.80:12000 server cpu_utime 80 ms 0.436 /s
192.168.66.80:12000 server involuntary_context_switches 36 csws 0.196 /s
192.168.66.80:12000 server server_uptime_ms 183203 ms 997.441 /s
192.168.66.80:12000 server spinlock_contention_time 1018673 us 5546.123 /s
192.168.66.80:12000 server threads_started 12 threads 0.065 /s
192.168.66.80:12000 server threads_started_thread_pool 12 threads 0.065 /s
192.168.66.80:12000 server voluntary_context_switches 52096 csws 283.635 /s
...lots more output
Metrics
The metrics related output always uses two snapshots.
- Ad-hoc mode performs the snapshots live and stores the results in memory, and doesn't write any file.
- Snapshot-diff mode takes local available previously taken snapshots.
Value statistics
In the output generated by ad-hoc or snapshot-diff mode, the first group of statistics shown are value statistics. The captured statistics are essentially a statistic name, and a statistic value. The values to be displayed are ordered by hostname, metric_type and statistic name.
By default, counters are shown, for which the value is the difference between the end and begin values.
- If a counter is zero during both the begin and end snapshot, the statistic is skipped.
- If a counter is non-zero in the end snapshot, and the statistic does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If a counter is non-zero in the begin snapshot, and does not exist in the end snapshot, the statistic is skipped.
- If a counter is non-zero in the begin and end snapshots, but subtracting leads to zero, the statistic is not printed: supposedly nothing happened, although previously something happened.
- If a counter is non-zero in the begin and end snapshots, but the end value is lower than the begin value: this is a suspicious situation. Currently the resulting negative value is shown.
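As a worked example (the numbers are illustrative): if cpu_stime is 1000 ms in the begin snapshot and 1005 ms in the end snapshot, with 0.9 seconds between the snapshots, the shown value is 1005 - 1000 = 5 ms, and the shown rate is 5 / 0.9 ≈ 5.556 /s.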
counters
This is what value statistic output looks like:
192.168.66.80:12000 server cpu_stime 5 ms 6.188 /s
192.168.66.80:12000 server cpu_utime 9 ms 11.139 /s
192.168.66.80:12000 server server_uptime_ms 807 ms 998.762 /s
192.168.66.80:12000 server voluntary_context_switches 217 csws 268.564 /s
Explanation:
hostname:port | metric_type | statistic name | value | unit | value / snapshot time (s) |
---|---|---|---|---|---|
192.168.66.80:12000 | server | cpu_stime | 5 | ms | 6.188 /s |
192.168.66.80:12000 | server | cpu_utime | 9 | ms | 11.139 /s |
192.168.66.80:12000 | server | server_uptime_ms | 807 | ms | 998.762 /s |
192.168.66.80:12000 | server | voluntary_context_switches | 217 | csws | 268.564 /s |
gauges
If the --gauges-enable switch is used, gauge values are shown alongside counter values.
A gauge value is a value that can get higher and lower during its runtime.
Therefore, the end value is shown, together with the difference from the begin snapshot value, prefixed with plus or minus.
- If a gauge is zero during both the begin and end snapshot, the statistic is skipped.
- If a gauge is non-zero in the end snapshot, and the statistic does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If a gauge is non-zero in the begin snapshot, and does not exist in the end snapshot, the statistic is skipped.
- If a gauge is non-zero in the begin and end snapshots, and subtracting leads to zero, the value is printed(!).
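As a worked example (the numbers are illustrative): if generic_current_allocated_bytes is 26882536 bytes in the begin snapshot and 26908008 bytes in the end snapshot, the output shows the end value 26908008 bytes and the difference 26908008 - 26882536 = +25472.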
This is what that looks like:
192.168.66.80:12000 server cpu_stime 10 ms 10.893 /s
192.168.66.80:12000 server cpu_utime 5 ms 5.447 /s
192.168.66.80:12000 server generic_current_allocated_bytes 26908008 bytes +25472
192.168.66.80:12000 server generic_heap_size 43188224 bytes +0
192.168.66.80:12000 server hybrid_clock_error 500000 us +0
192.168.66.80:12000 server hybrid_clock_hybrid_time 6824687165143556096 us +3762429952
192.168.66.80:12000 server server_uptime_ms 918 ms 1000.000 /s
192.168.66.80:12000 server tcmalloc_current_total_thread_cache_bytes 2675304 bytes +124184
192.168.66.80:12000 server tcmalloc_max_total_thread_cache_bytes 33554432 bytes +0
192.168.66.80:12000 server tcmalloc_pageheap_free_bytes 1228800 bytes -90112
192.168.66.80:12000 server tcmalloc_pageheap_unmapped_bytes 9977856 bytes +0
192.168.66.80:12000 server threads_running 47 threads +0
192.168.66.80:12000 server threads_running_CQLServer_reactor 1 threads +0
192.168.66.80:12000 server threads_running_acceptor 1 threads +0
192.168.66.80:12000 server threads_running_iotp_CQLServer 4 threads +0
192.168.66.80:12000 server threads_running_rpc_thread_pool 15 threads +0
192.168.66.80:12000 server voluntary_context_switches 262 csws 285.403 /s
These are gauge values:
192.168.66.80:12000 server generic_current_allocated_bytes 26908008 bytes +25472
192.168.66.80:12000 server generic_heap_size 43188224 bytes +0
192.168.66.80:12000 server hybrid_clock_error 500000 us +0
Explanation:
hostname:port | metric_type | statistic name | value | unit | end value - begin value |
---|---|---|---|---|---|
192.168.66.80:12000 | server | generic_current_allocated_bytes | 26908008 | bytes | +25472 |
192.168.66.80:12000 | server | generic_heap_size | 43188224 | bytes | +0 |
192.168.66.80:12000 | server | hybrid_clock_error | 500000 | us | +0 |
details
For the metric_types table, tablet and cdc, the statistics are kept per table, tablet or cdc object.
To reduce the amount of data shown, these are by default summed together per server.
If the --details-enable switch is used, the output changes to include metric_id, namespace and object_name.
This allows seeing the statistics per individual object.
This is what that looks like:
192.168.66.80:9000 server - - - tcp_bytes_received 75765 bytes 4293.608 /s
192.168.66.80:9000 server - - - tcp_bytes_sent 80901 bytes 4584.665 /s
192.168.66.80:9000 server - - - threads_started 5 threads 0.283 /s
192.168.66.80:9000 server - - - transaction_pool_cache_queries 1 qry 0.057 /s
192.168.66.80:9000 server - - - voluntary_context_switches 6232 csws 353.168 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t log_bytes_logged 403 bytes 22.838 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t rocksdb_bytes_written 12 bytes 0.677 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t rocksdb_sequence_number 2 rows 0.113 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t rocksdb_write_self 1 writes 0.056 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t rows_inserted 2 rows 0.113 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 log_bytes_logged 389 bytes 22.045 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_bytes_written 12 bytes 0.677 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_number_db_seek 1 keys 0.057 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_number_superversion_acquires 1 nr 0.057 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_sequence_number 1 rows 0.057 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_write_self 1 writes 0.056 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rows_inserted 1 rows 0.056 /s
192.168.66.80:9000 tablet 654e97ca348833d system transactions log_bytes_logged 1199 bytes 67.947 /s
192.168.66.80:9000 tablet 04d3aadfcc0c75e system transactions log_bytes_logged 395 bytes 22.385 /s
Explanation:
hostname:port | metric_type | object_id | namespace | object name | statistic name | value | unit | value / snapshot time (s) |
---|---|---|---|---|---|---|---|---|
192.168.66.80:9000 | server | - | - | - | tcp_bytes_received | 75765 | bytes | 4293.608 /s |
192.168.66.80:9000 | tablet | d3265ac130b2b1f | yugabyte | t | log_bytes_logged | 403 | bytes | 22.838 /s |
192.168.66.80:9000 | tablet | 654e97ca348833d | system | transactions | log_bytes_logged | 1199 | bytes | 67.947 /s |
The columns added are the third, fourth and fifth columns.
- The third column shows the metric_id, which for a tablet is the tablet UUID, for a table is the table_id, and for cdc is ?. The snapshot stores the full metric_id; the length shown is limited to 15 characters.
- The fourth column shows the namespace.
- The fifth column shows the object_name (it says 'table_name' in the attributes on the metric page, but an object can be an index or materialized view too).
- A 'server' metric_type does not carry a meaningful value in 'metric_id', and the namespace and object name are not present. Therefore, for server a '-' is shown.
details and gauges
The switches --details-enable and --gauges-enable work individually, but do influence each other.
This means that when --gauges-enable is set, --details-enable will also show gauge data per table, tablet or cdc object:
192.168.66.80:9000 server - - - tcp_bytes_received 4694 bytes 3684.458 /s
192.168.66.80:9000 server - - - tcp_bytes_sent 4101 bytes 3218.995 /s
192.168.66.80:9000 server - - - threads_running 46 threads +0
192.168.66.80:9000 server - - - ts_split_compaction_added 15 reqs +0
192.168.66.80:9000 server - - - voluntary_context_switches 413 csws 324.176 /s
192.168.66.80:9000 tablet a06ff106f2b846d system transactions follower_lag_ms 97 ms -722
192.168.66.80:9000 tablet a06ff106f2b846d system transactions in_progress_ops 1 ops +0
192.168.66.80:9000 tablet a06ff106f2b846d system transactions log_wal_size 1048576 bytes +0
192.168.66.80:9000 tablet a06ff106f2b846d system transactions raft_term 9 terms +0
192.168.66.80:9000 tablet cf45509727f9601 system transactions follower_lag_ms 281 ms -339
hostname:port | metric_type | object_id | namespace | object name | statistic name | value | unit | end value - begin value |
---|---|---|---|---|---|---|---|---|
192.168.66.80:9000 | server | - | - | - | threads_running | 46 | threads | +0 |
192.168.66.80:9000 | tablet | a06ff106f2b846d | system | transactions | log_wal_size | 1048576 | bytes | +0 |
CountSum statistics
In the output generated by ad-hoc or snapshot-diff mode, the second group of statistics shown are 'countsum' statistics.
These statistics are named this way because, for the use of yb_stats, the count (total_count) and sum (total_sum) fields are the only usable statistical values.
The way 'countsum' statistics work is that an event in the code that is tracked by 'countsum' statistics keeps a count of the number of times the event was triggered, and a sum of what it measures.
In many cases the unit of the sum is time (to capture the latency of the event), but it can also be bytes (to capture the size of, for example, an IO), or something else.
The count and sum statistics are counters, for which the value used is the difference between the end and begin values. For the count value difference:
- If the value is zero during both the begin and end snapshot, the statistic is skipped.
- If the value is non-zero in the end snapshot, and the statistic does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If the value is non-zero in the begin snapshot, and does not exist in the end snapshot, the statistic is skipped.
- If the value is non-zero in the begin and end snapshots, but subtracting leads to zero, the statistic is not printed: supposedly nothing happened, although previously something happened.
- If the value is non-zero in the begin and end snapshots, but the end value is lower than the begin value: this is a suspicious situation. Currently the resulting negative value is shown.
This is what countsum statistic output looks like:
192.168.66.80:7000 server handler_latency_outbound_call_queue_time 3 2.899 /s avg: 0 tot: 0 us
192.168.66.80:7000 server handler_latency_outbound_call_send_time 3 2.899 /s avg: 0 tot: 0 us
192.168.66.80:7000 server handler_latency_outbound_call_time_to_response 3 2.899 /s avg: 2666 tot: 8000 us
192.168.66.80:7000 server handler_latency_yb_master_MasterHeartbeat_TSHeartbeat 3 2.899 /s avg: 128 tot: 386 us
192.168.66.80:7000 server rpc_incoming_queue_time 3 2.899 /s avg: 146 tot: 439 us
Explanation:
hostname:port | metric_type | statistic name | count | count / snapshot time (s) | sum / count | sum total | sum unit |
---|---|---|---|---|---|---|---|
192.168.66.80:7000 | server | handler_latency_outbound_call_queue_time | 3 | 2.899 /s | avg: 0 | tot: 0 | us |
192.168.66.80:7000 | server | handler_latency_outbound_call_send_time | 3 | 2.899 /s | avg: 0 | tot: 0 | us |
192.168.66.80:7000 | server | handler_latency_outbound_call_time_to_response | 3 | 2.899 /s | avg: 2666 | tot: 8000 | us |
192.168.66.80:7000 | server | handler_latency_yb_master_MasterHeartbeat_TSHeartbeat | 3 | 2.899 /s | avg: 128 | tot: 386 | us |
192.168.66.80:7000 | server | rpc_incoming_queue_time | 3 | 2.899 /s | avg: 146 | tot: 439 | us |
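As a worked example from the rows above: avg is the sum divided by the count, so for rpc_incoming_queue_time the average is 439 / 3 ≈ 146 us, and the count rate of 2.899 /s implies roughly 3 / 2.899 ≈ 1.035 seconds between the snapshots.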
gauges
There is no gauges-like statistic type in 'countsum' statistics.
details enable
This is what countsum output looks like when --details-enable is added:
192.168.66.80:9000 server - - - rpc_incoming_queue_time 143 13.877 /s avg: 103 tot: 14807 us
192.168.66.80:9000 server - - - transaction_pool_cache 1 0.097 /s avg: 0 tot: 0 us
192.168.66.80:9000 table 000000000004000 yugabyte t log_append_latency 4 0.388 /s avg: 45 tot: 182 us
192.168.66.80:9000 table 000000000004000 yugabyte t log_entry_batches_per_group 3 0.291 /s avg: 1 tot: 4 requests
192.168.66.80:9000 table 000000000004000 yugabyte t log_group_commit_latency 3 0.291 /s avg: 2319 tot: 6958 us
192.168.66.80:9000 table 000000000004000 yugabyte t log_sync_latency 1 0.097 /s avg: 6706 tot: 6706 us
192.168.66.80:9000 table 000000000004000 yugabyte t rocksdb_bytes_per_write 3 0.291 /s avg: 12 tot: 36 bytes
192.168.66.80:9000 table 000000000004000 yugabyte t rocksdb_db_write_micros 3 0.291 /s avg: 11 tot: 34 us
Explanation:
hostname:port | metric_type | object_id | namespace | object name | statistic name | count | count snapshot time (s) | sum / count | sum total | sum unit |
---|---|---|---|---|---|---|---|---|---|---|
192.168.66.80:9000 | server | - | - | - | rpc_incoming_queue_time | 143 | 13.877 /s | avg: 103 | tot: 14807 | us |
192.168.66.80:9000 | server | - | - | - | transaction_pool_cache | 1 | 0.097 /s | avg: 0 | tot: 0 | us |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | log_append_latency | 4 | 0.388 /s | avg: 45 | tot: 182 | us |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | log_entry_batches_per_group | 3 | 0.291 /s | avg: 1 | tot: 4 | requests |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | log_group_commit_latency | 3 | 0.291 /s | avg: 2319 | tot: 6958 | us |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | log_sync_latency | 1 | 0.097 /s | avg: 6706 | tot: 6706 | us |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | rocksdb_bytes_per_write | 3 | 0.291 /s | avg: 12 | tot: 36 | bytes |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | rocksdb_db_write_micros | 3 | 0.291 /s | avg: 11 | tot: 34 | us |
Countsum statistics are called 'coarse histograms' in the YugabyteDB sourcecode, and have the fields count and sum in common with 'summaries' in Prometheus; however, quantile items are not available. YugabyteDB adds the fields min, mean, max, percentile_75, percentile_95, percentile_99, percentile_99_9 and percentile_99_99 to its metrics. These fields are reset when the metrics are read.
CountSumRows statistics
In the output generated by ad-hoc or snapshot-diff mode, a third group that can optionally be shown are the 'countsumrows' statistics. These statistics are taken from the YSQL http endpoint (normally port 13000). If no SQL interaction happened between YSQL/postgres and DocDB, no statistics will be shown.
- If a statistic has a zero count in both the begin and end snapshot, it will be skipped.
- If a statistic has a non-zero count in both the begin and end snapshot, and subtracting the values leads to zero, it will be skipped.
- If a statistic has a lower value in the end snapshot than in the begin snapshot, the statistic will currently be shown, and may be negative.
This is what countsumrows statistic output looks like:
192.168.66.80:13000 handler_latency_yb_ysqlserver_SQLProcessor_InsertStmt 1 avg: 18.552 tot: 18.552 ms, avg: 1 tot: 1 rows
192.168.66.80:13000 handler_latency_yb_ysqlserver_SQLProcessor_SingleShardTransactions 1 avg: 18.552 tot: 18.552 ms, avg: 1 tot: 1 rows
192.168.66.80:13000 handler_latency_yb_ysqlserver_SQLProcessor_Single_Shard_Transactions 1 avg: 18.552 tot: 18.552 ms, avg: 1 tot: 1 rows
192.168.66.80:13000 handler_latency_yb_ysqlserver_SQLProcessor_Transactions 1 avg: 18.552 tot: 18.552 ms, avg: 1 tot: 1 rows
Explanation:
hostname:port | statistic name | count | sum / count | sum total | sum unit | rows / count | rows total |
---|---|---|---|---|---|---|---|
192.168.66.80:13000 | handler_latency_yb_ysqlserver_SQLProcessor_InsertStmt | 1 | avg: 18.552 | tot: 18.552 | ms | avg: 1 | tot: 1 rows |
192.168.66.80:13000 | handler_latency_yb_ysqlserver_SQLProcessor_SingleShardTransactions | 1 | avg: 18.552 | tot: 18.552 | ms | avg: 1 | tot: 1 rows |
192.168.66.80:13000 | handler_latency_yb_ysqlserver_SQLProcessor_Single_Shard_Transactions | 1 | avg: 18.552 | tot: 18.552 | ms | avg: 1 | tot: 1 rows |
192.168.66.80:13000 | handler_latency_yb_ysqlserver_SQLProcessor_Transactions | 1 | avg: 18.552 | tot: 18.552 | ms | avg: 1 | tot: 1 rows |
node-exporter-diff mode
The purpose of node-exporter-diff mode is to read two locally stored snapshots, and show a difference report for the node-exporter data only.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by statistic name regular expression.
- --details-enable: add the source of summarized counters, and some filtered out counters.
- --gauges-enable: add non-counter statistics to the output.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
node-exporter-diff mode means using already stored snapshots, which can be from a cluster that is currently unavailable or even deleted, because node-exporter-diff mode only uses the information stored in the locally available snapshot (JSON) data. This gives a lot of options for investigation that would otherwise be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke node-exporter-diff mode is to use the --node-exporter-diff switch.
If --node-exporter-diff is used without -b/--begin and -e/--end, yb_stats will list the available snapshots and prompt for the begin and end snapshot numbers.
node-exporter-diff without begin/end specification:
yb_stats --node-exporter-diff
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
3 2023-03-19 14:20:50.977417 +01:00
4 2023-03-19 14:21:14.418544 +01:00
5 2023-03-19 14:24:17.733927 +01:00
Enter begin snapshot: 4
Enter end snapshot: 5
192.168.66.80:9300 counter node_context_switches_total 169483.000000 926.137 /s
192.168.66.80:9300 counter node_cpu_seconds_total_idle 169.720000 0.927 /s
192.168.66.80:9300 counter node_cpu_seconds_total_iowait 0.010000 0.000 /s
192.168.66.80:9300 counter node_cpu_seconds_total_irq 5.350000 0.029 /s
192.168.66.80:9300 counter node_cpu_seconds_total_softirq 1.300000 0.007 /s
192.168.66.80:9300 counter node_cpu_seconds_total_system 2.110000 0.012 /s
192.168.66.80:9300 counter node_cpu_seconds_total_user 0.090000 0.000 /s
192.168.66.80:9300 counter node_disk_io_time_seconds_total_sda 0.076000 0.000 /s
...lots more output
node-exporter-diff with begin/end specification:
yb_stats --node-exporter-diff -b 4 -e 5
192.168.66.80:9300 counter node_context_switches_total 169483.000000 926.137 /s
192.168.66.80:9300 counter node_cpu_seconds_total_idle 169.720000 0.927 /s
192.168.66.80:9300 counter node_cpu_seconds_total_iowait 0.010000 0.000 /s
192.168.66.80:9300 counter node_cpu_seconds_total_irq 5.350000 0.029 /s
192.168.66.80:9300 counter node_cpu_seconds_total_softirq 1.300000 0.007 /s
192.168.66.80:9300 counter node_cpu_seconds_total_system 2.110000 0.012 /s
192.168.66.80:9300 counter node_cpu_seconds_total_user 0.090000 0.000 /s
192.168.66.80:9300 counter node_disk_io_time_seconds_total_sda 0.076000 0.000 /s
...lots more output
Node-exporter statistics
In the output of ad-hoc or snapshot-diff mode, if node-exporter is installed on the YugabyteDB cluster, the last group of statistics shown is the node-exporter statistics. The captured statistics are essentially a statistic name and a statistic value. The values to be displayed are ordered by hostname, metric_type and metric_name.
By default, counter values are shown, for which the value is the difference between the end and begin values.
- If a counter is zero in both the begin and end snapshot, the statistic is skipped.
- If a counter is non-zero and exists in the end snapshot, but does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If a counter is non-zero and exists in the begin snapshot, but does not exist in the end snapshot, the statistic is skipped.
- If a counter is non-zero in both the begin and end snapshot, but subtracting leads to zero, the statistic is not printed: nothing happened between the snapshots, even though something happened before the begin snapshot.
- If a counter is non-zero in both the begin and end snapshot, but the end value is lower than the begin value, this is a suspicious situation; currently the resulting negative value is shown.
This is what node-exporter statistics output looks like:
192.168.66.80:9300 counter node_context_switches_total 7759.000000 862.111 /s
192.168.66.80:9300 counter node_cpu_seconds_total_idle 8.150000 0.906 /s
192.168.66.80:9300 counter node_cpu_seconds_total_irq 0.310000 0.034 /s
192.168.66.80:9300 counter node_cpu_seconds_total_softirq 0.120000 0.013 /s
192.168.66.80:9300 counter node_cpu_seconds_total_system 0.170000 0.019 /s
192.168.66.80:9300 counter node_cpu_seconds_total_user 0.010000 0.001 /s
Explanation:
hostname:port | metric_type | statistic_name | value | value / snapshot time (s) |
---|---|---|---|---|
192.168.66.80:9300 | counter | node_context_switches_total | 7759.000000 | 862.111 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_idle | 8.150000 | 0.906 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_irq | 0.310000 | 0.034 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_softirq | 0.120000 | 0.013 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_system | 0.170000 | 0.019 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_user | 0.010000 | 0.001 /s |
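The rate in the last column is the counter's delta divided by the time between the two snapshots. Taking the first line as a worked example: the counter increased by 7759 context switches, and 7759 divided by the roughly 9 seconds between the snapshots gives the 862.111 /s rate (7759 / 9.0 ≈ 862.1).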
gauges
If the --gauges-enable switch is used, gauge type values are shown alongside counter values.
- If a gauge is zero in both the begin and end snapshot, the statistic is skipped.
- If a gauge is non-zero and exists in the end snapshot, but does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If a gauge is non-zero and exists in the begin snapshot, but does not exist in the end snapshot, the statistic is skipped.
- If a gauge is non-zero in both the begin and end snapshot, and subtracting leads to zero, the value is still printed (unlike counters).
This is what this looks like:
192.168.66.80:9300 gauge node_arp_entries_eth0 2.000000 +0
192.168.66.80:9300 gauge node_arp_entries_eth1 3.000000 +0
192.168.66.80:9300 gauge node_boot_time_seconds 1666174770.000000 +0
192.168.66.80:9300 counter node_context_switches_total 994.000000 994.000 /s
192.168.66.80:9300 gauge node_cooling_device_max_state_1_intel_powerclamp 50.000000 +0
Explanation:
hostname:port | metric_type | statistic name | end value | end value - begin value |
---|---|---|---|---|
192.168.66.80:9300 | gauge | node_arp_entries_eth0 | 2.000000 | +0 |
192.168.66.80:9300 | gauge | node_arp_entries_eth1 | 3.000000 | +0 |
192.168.66.80:9300 | gauge | node_boot_time_seconds | 1666174770.000000 | +0 |
statements-diff mode
The purpose of statements-diff mode is to read two snapshots which must be locally stored, and show a difference report.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --sql-length: the maximum length of a SQL statement shown, for readability. Default: 80.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
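For example, a sketch of combining these switches (the snapshot numbers and length are illustrative assumptions): show the statement differences between snapshots 4 and 5, truncating the SQL text at 120 characters instead of the default 80:
yb_stats --statements-diff -b 4 -e 5 --sql-length 120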
statements-diff mode means using already stored snapshots, which can be from a cluster that currently is unavailable or even deleted, because statements-diff mode only uses the information that is stored in the locally available snapshot (JSON) data. This gives a lot of options for investigation that otherwise would be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke statements-diff mode is to use the --statements-diff
switch.
If --statements-diff is used without -b/--begin and -e/--end, it will list the available snapshots and prompt for the begin and end snapshot numbers.
statements-diff without begin/end specification:
yb_stats --statements-diff
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
3 2023-03-19 14:20:50.977417 +01:00
4 2023-03-19 14:21:14.418544 +01:00
5 2023-03-19 14:24:17.733927 +01:00
Enter begin snapshot: 4
Enter end snapshot: 5
192.168.66.80:13000 1 avg: 0.459 tot: 0.459 ms avg: 1 tot: 1 rows: select now()
statements-diff with begin/end specification:
yb_stats --statements-diff -b 4 -e 5
192.168.66.80:13000 1 avg: 0.459 tot: 0.459 ms avg: 1 tot: 1 rows: select now()
Statement statistics
In the output generated by ad-hoc or snapshot-diff mode, a fourth group that can optionally be shown is the statement statistics. These statistics are taken from the YSQL (normally port 13000) http endpoint. If no SQL interaction happened between YSQL/postgres and DocDB, there will be no statements shown.
This is what the statement output looks like:
192.168.66.80:13000 98 avg: 0.002 tot: 0.156 ms avg: 0 tot: 0 rows: begin
192.168.66.80:13000 98 avg: 0.003 tot: 0.305 ms avg: 0 tot: 0 rows: commit
192.168.66.80:13000 98 avg: 78.722 tot: 7714.786 ms avg: 1020 tot: 100000 rows: copy ysql_bench_accounts from stdin
192.168.66.80:13000 1 avg: 456.096 tot: 456.096 ms avg: 0 tot: 0 rows: create table ysql_bench_accounts(aid int not null,bid int,abalance int,filler
192.168.66.80:13000 1 avg: 451.798 tot: 451.798 ms avg: 0 tot: 0 rows: create table ysql_bench_branches(bid int not null,bbalance int,filler char(88),P
192.168.66.80:13000 1 avg: 483.839 tot: 483.839 ms avg: 0 tot: 0 rows: create table ysql_bench_history(tid int,bid int,aid int,delta int,mtime times
192.168.66.80:13000 1 avg: 394.892 tot: 394.892 ms avg: 0 tot: 0 rows: create table ysql_bench_tellers(tid int not null,bid int,tbalance int,filler cha
192.168.66.80:13000 1 avg: 569.624 tot: 569.624 ms avg: 0 tot: 0 rows: drop table if exists ysql_bench_accounts, ysql_bench_branches, ysql_bench_histor
192.168.66.80:13000 1 avg: 11.560 tot: 11.560 ms avg: 1 tot: 1 rows: insert into ysql_bench_branches(bid,bbalance) values($1,$2)
192.168.66.80:13000 10 avg: 8.218 tot: 82.179 ms avg: 1 tot: 10 rows: insert into ysql_bench_tellers(tid,bid,tbalance) values ($1,$2,$3)
192.168.66.80:13000 1 avg: 6641.962 tot: 6641.962 ms avg: 0 tot: 0 rows: truncate table ysql_bench_accounts, ysql_bench_branches, ysql_bench_history, ysq
Explanation:
hostname:port | calls | total_time / calls | total_time | unit total_time | rows / calls | total rows | query |
---|---|---|---|---|---|---|---|
192.168.66.80:13000 | 98 | avg: 0.002 | tot: 0.156 | ms | avg: 0 | tot: 0 rows | begin |
192.168.66.80:13000 | 98 | avg: 0.003 | tot: 0.305 | ms | avg: 0 | tot: 0 rows | commit |
192.168.66.80:13000 | 98 | avg: 78.722 | tot: 7714.786 | ms | avg: 1020 | tot: 100000 rows | copy ysql_bench_accounts from stdin |
192.168.66.80:13000 | 1 | avg: 456.096 | tot: 456.096 | ms | avg: 0 | tot: 0 rows | create table ysql_bench_accounts(aid int not null,bid int,abalance int,filler |
192.168.66.80:13000 | 1 | avg: 451.798 | tot: 451.798 | ms | avg: 0 | tot: 0 rows | create table ysql_bench_branches(bid int not null,bbalance int,filler char(88),P |
192.168.66.80:13000 | 1 | avg: 483.839 | tot: 483.839 | ms | avg: 0 | tot: 0 rows | create table ysql_bench_history(tid int,bid int,aid int,delta int,mtime times |
192.168.66.80:13000 | 1 | avg: 394.892 | tot: 394.892 | ms | avg: 0 | tot: 0 rows | create table ysql_bench_tellers(tid int not null,bid int,tbalance int,filler cha |
192.168.66.80:13000 | 1 | avg: 569.624 | tot: 569.624 | ms | avg: 0 | tot: 0 rows | drop table if exists ysql_bench_accounts, ysql_bench_branches, ysql_bench_histor |
192.168.66.80:13000 | 1 | avg: 11.560 | tot: 11.560 | ms | avg: 1 | tot: 1 rows | insert into ysql_bench_branches(bid,bbalance) values($1,$2) |
192.168.66.80:13000 | 10 | avg: 8.218 | tot: 82.179 | ms | avg: 1 | tot: 10 rows | insert into ysql_bench_tellers(tid,bid,tbalance) values ($1,$2,$3) |
192.168.66.80:13000 | 1 | avg: 6641.962 | tot: 6641.962 | ms | avg: 0 | tot: 0 rows | truncate table ysql_bench_accounts, ysql_bench_branches, ysql_bench_history, ysq |
For the sake of simplicity, any identical SQL (based on the query text) is summed up and assumed to be the same statement. This is not strictly correct (at the time of creation, query_id was not exposed, so this was the only solution).
Please mind the source of the SQL statistics is postgres' pg_stat_statements, which has a few quirks:
- Any SQL that returns an error is not saved in pg_stat_statements.
- The 'total_time' is actually the time spent in the execution phase. Especially since YSQL might need to perform RPCs to complete its catalog (in the rewrite/semantic parse and plan phases), this can miss some time and therefore show less time than a client sees.
- A statement's uniqueness in pg_stat_statements depends on query_id, dbid and userid. Currently not all of these fields are exposed in the http endpoint.
versions-diff mode
The purpose of versions-diff mode is to read two snapshots which must be locally stored, and show a difference report.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
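For example, a sketch (the snapshot numbers and the port filter are illustrative assumptions): compare the versions between snapshots 0 and 1 for the masters only, by filtering on port 7000:
yb_stats --versions-diff -b 0 -e 1 --hostname-match 7000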
versions-diff mode means using already stored snapshots, which can be from a cluster that currently is unavailable or even deleted, because versions-diff mode only uses the information that is stored in the locally available snapshot (CSV) data. This gives a lot of options for investigation that otherwise would be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke versions-diff mode is to use the --versions-diff
switch.
If --versions-diff is used without -b/--begin and -e/--end, it will list the available snapshots and prompt for the begin and end snapshot numbers.
versions-diff without begin/end specification:
yb_stats --versions-diff
0 2022-12-08 16:42:01.226043 +01:00
1 2022-12-09 16:34:08.057222 +01:00
2 2022-12-10 16:07:57.948800 +01:00
3 2022-12-10 21:39:04.439287 +01:00
4 2022-12-10 21:39:33.664075 +01:00
5 2022-12-10 21:42:56.852644 +01:00
6 2022-12-10 21:43:00.348445 +01:00
Enter begin snapshot: 5
Enter end snapshot: 6
No output means there is no version difference between the two snapshots.
versions-diff with begin/end specification:
yb_stats --versions-diff -b 0 -e 1
* 192.168.66.80:7000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.80:9000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.81:7000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.81:9000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.82:7000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.82:9000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
version diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the version using the /api/v1/version
endpoint.
This is executed for both the masters and the tablet servers.
This is done during the begin and end snapshots.
Example of changed versions on all servers between the begin and end snapshot:
* ip-172-158-59-19:7000 Versions: 2.17.1.0 184->201 RELEASE 09 Nov 2022 22:48:49 UTC->14 Nov 2022 13:07:31 UTC 971d32f9bf50d11c067bda4b5498d27611583c2c->d9d98761806ef4f37f501e2eb40bb7dcd981bb65
* ip-172-158-59-19:9000 Versions: 2.17.1.0 184->201 RELEASE 09 Nov 2022 22:48:49 UTC->14 Nov 2022 13:07:31 UTC 971d32f9bf50d11c067bda4b5498d27611583c2c->d9d98761806ef4f37f501e2eb40bb7dcd981bb65
This shows that on the servers ip-172-158-59-19:7000
and ip-172-158-59-19:9000
there have been version changes.
- The version stayed the same.
- The build number changed from 184 to 201.
- The build date changed from 09 Nov 2022 22:48:49 UTC to 14 Nov 2022 13:07:31 UTC.
- The build git hash changed from 971d32f9bf50d11c067bda4b5498d27611583c2c to d9d98761806ef4f37f501e2eb40bb7dcd981bb65.
entities diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the entity data from the master leader during the begin and the end snapshot.
Entities are objects known to the master which deal with storing and organising the data for YSQL and YCQL.
The end snapshot data is compared with the entities found in the begin snapshot, and any differences found are shown.
By default, it will skip the YugabyteDB default system databases/keyspaces:
- "00000000000000000000000000000001" | // ycql system
- "00000000000000000000000000000002" | // ycql system_schema
- "00000000000000000000000000000003" | // ycql system_auth
- "00000001000030008000000000000000" | // ysql template1
- "000033e5000030008000000000000000") // ysql template0
When any of the other entities (databases, objects, tablets and replicas) is added or removed, yb_stats
will show the change.
Example diff where a YSQL database, a table and an index are created:
+ Database: ysql.testdb, id: 0000414d000030008000000000000000
+ Object: ysql.testdb.testtable, state: RUNNING, id: 0000414d00003000800000000000414e
+ Object: ysql.testdb.testindex, state: RUNNING, id: 0000414d000030008000000000004153
+ Tablet: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 state: RUNNING, leader: yb-2.local:9100
+ Tablet: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a state: RUNNING, leader: yb-3.local:9100
+ Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-2.local:9100, type: VOTER
+ Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-1.local:9100, type: VOTER
+ Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-3.local:9100, type: VOTER
+ Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-2.local:9100, type: VOTER
+ Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-1.local:9100, type: VOTER
+ Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-3.local:9100, type: VOTER
And an example where the above YSQL database is dropped:
- Database: ysql.testdb, id: 0000414d000030008000000000000000
- Object: ysql.testdb.testtable, state: RUNNING, id: 0000414d00003000800000000000414e
- Object: ysql.testdb.testindex, state: RUNNING, id: 0000414d000030008000000000004153
- Tablet: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 state: RUNNING, leader: yb-2.local:9100
- Tablet: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a state: RUNNING, leader: yb-3.local:9100
- Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-2.local:9100, type: VOTER
- Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-1.local:9100, type: VOTER
- Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-3.local:9100, type: VOTER
- Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-2.local:9100, type: VOTER
- Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-1.local:9100, type: VOTER
- Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-3.local:9100, type: VOTER
Or detect a leader change for a tablet:
* Tablet: ysql.testdb.testtable.5350a928953c4eb1aaa9eb0581a3112b state: RUNNING leader: yb-3.local:9100->yb-1.local:9100
masters diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the master metadata from the master leader during the begin and the end snapshot.
In this way, any changes to the masters, such as a reboot/restart or a role change can be seen.
Example of a leader change because of a restart:
* Master b460d504c6aa488d97bfe266ab506ab6 FOLLOWER->LEADER Cloud: local, Region: local, Zone: local,
Seqno: 1669798949509360, Start time: 1669798949509360
Http ( yb-3.local:7000 )
Rpc ( yb-3.local:7100 )
* Master d3db2544098b4b808c0c65d4d19f4d3a LEADER->FOLLOWER Cloud: local, Region: local, Zone: local,
Seqno: 1669798888831913->1669824380213235, Start time: 1669798888831913->1669824380213235
Http ( yb-1.local:7000 )
Rpc ( yb-1.local:7100 )
Here the master d3db2544098b4b808c0c65d4d19f4d3a
had its role changed from LEADER to FOLLOWER.
In the same snapshot, the master b460d504c6aa488d97bfe266ab506ab6
had its role changed from FOLLOWER to LEADER.
The reason for the role change from LEADER to FOLLOWER cannot be seen, but the start time and the seqno properties also changed. The change of start time shows the start time was renewed, indicating the master was restarted. The sequence number is identical to the start time, and therefore changed along with it.
Please mind that if no change happened between the begin and end snapshot, no output will be shown.
tablet-servers-diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the tablet server metadata from the master leader during the begin and the end snapshot.
In this way, any changes to the tablet servers, such as a reboot/restart or a role change can be seen.
Example of a tablet server having been restarted:
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
3 2023-03-19 14:20:50.977417 +01:00
4 2023-03-19 14:21:14.418544 +01:00
5 2023-03-19 14:24:17.733927 +01:00
6 2023-03-19 15:52:10.007082 +01:00
7 2023-03-19 15:52:45.711866 +01:00
Enter begin snapshot: 5
Enter end snapshot: 7
= Tserver: yb-1.local:9000, status: ALIVE, uptime: 819->40
Here the tablet server named yb-1.local:9000 shows a change of its uptime from 819 to 40 seconds, which indicates a restart.
The next thing to look for from here might be --entity-diff
, because this can cause replicas to change RAFT role.
Please mind that if no change happened between the begin and end snapshot, no output will be shown.
vars diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the vars (gflags) using the /api/v1/varz
endpoint.
This is executed for both the masters and the tablet servers.
This is done during the begin and end snapshots.
Example of a changed var on all servers between the begin and end snapshot:
* 192.168.66.80:9000 Vars: ysql_enable_packed_row false->true Default->Custom
* 192.168.66.81:9000 Vars: ysql_enable_packed_row false->true Default->Custom
* 192.168.66.82:9000 Vars: ysql_enable_packed_row false->true Default->Custom
This shows that on the servers 192.168.66.80, 192.168.66.81 and 192.168.66.82, on endpoint 9000 (the default tablet server port), a change was detected and reported.
- The var/gflag is ysql_enable_packed_row.
- The value changed from false to true.
- The change of the value changed the type of the var from Default (not changed) to Custom (changed).
healthcheck-diff mode
The way to invoke healthcheck-diff mode is to use the --healthcheck-diff switch:
yb_stats --healthcheck-diff
snapshot-nonmetrics-diff mode
The purpose of snapshot-nonmetrics-diff mode is to read two snapshots which must be locally stored, and show a difference report. The special purpose of nonmetrics is that it excludes the quite numerous detailed statistics, and only shows the differences for:
- entities
- masters
- tablet servers
- vars (gflags)
- versions
- healthcheck
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by statistic name regular expression.
- --details-enable: split table and tablet statistics, instead of summarizing these per server.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
snapshot-nonmetrics-diff mode means using already stored snapshots, which can be from a cluster that currently is unavailable or even deleted, because snapshot-nonmetrics-diff mode only uses the information that is stored in the locally available snapshot (CSV) data. This gives a lot of options for investigation that otherwise would be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke snapshot-nonmetrics-diff mode is to use the --snapshot-nonmetrics-diff
switch.
If --snapshot-nonmetrics-diff is used without -b/--begin and -e/--end, it will list the available snapshots and prompt for the begin and end snapshot numbers.
snapshot-nonmetrics-diff without begin/end specification:
➜ yb_stats --snapshot-nonmetrics-diff
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
Enter begin snapshot: 0
Enter end snapshot: 2
+ Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004100
+ Tablet: ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf, state: RUNNING, leader: yb-1.local:9100
+ Replica: yb-1.local:9100:ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf, Type: VOTER
= Tserver: yb-1.local:9000, status: ALIVE, uptime: 2619->0
= 192.168.66.80:12000 Vars: heap_profile_path /tmp/yb-tserver.1052->/tmp/yb-tserver.8389 Default
= 192.168.66.80:9000 Vars: heap_profile_path /tmp/yb-tserver.1052->/tmp/yb-tserver.8389 Default
For an explanation of the changes, see below.
snapshot-nonmetrics-diff with begin/end specification:
➜ yb_stats --snapshot-nonmetrics-diff -b 0 -e 2
+ Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004100
+ Tablet: ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf, state: RUNNING, leader: yb-1.local:9100
+ Replica: yb-1.local:9100:ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf, Type: VOTER
= Tserver: yb-1.local:9000, status: ALIVE, uptime: 2619->0
= 192.168.66.80:12000 Vars: heap_profile_path /tmp/yb-tserver.1052->/tmp/yb-tserver.8389 Default
= 192.168.66.80:9000 Vars: heap_profile_path /tmp/yb-tserver.1052->/tmp/yb-tserver.8389 Default
In this example, a begin snapshot (0) and an end snapshot (2) are specified, and the differences found between snapshot numbers 0 and 2 are:
- An object ysql.yugabyte.t is added (+).
- A tablet ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf is added (+) on yb-1.local:9100.
- A replica yb-1.local:9100:ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf is added (+).
- A change (=) is detected on Tserver yb-1.local:9000: the uptime has changed from 2619->0, indicating it has been restarted.
- A change (=) is detected on 192.168.66.80:12000 for the var/gflag heap_profile_path.
- A change (=) is detected on 192.168.66.80:9000 identical to port 12000. Port 12000 (the YCQL port) shows identical information to port 9000, the general default tablet server port.
adhoc-metrics-diff
When yb_stats
is run with the --adhoc-metrics-diff switch, it will perform a snapshot in memory, and wait for enter to perform the next snapshot and present the difference.
This is called 'adhoc mode'; however, the --adhoc-metrics-diff mode will only take the metrics (excluding node-exporter), and show the difference.
The usage of either adhoc mode or snapshot mode should be carefully considered: adhoc mode, alias in-memory snapshots, does not write anything to disk.
In most cases, performing snapshots persisting all the available information is the best way, so results can be reviewed later and cannot get lost, because they are stored. However, if you are performing repeated tests where storing all snapshot information would simply be too much and would require you to remove all the snapshots after testing anyway, AND you are sure what to look for, then adhoc mode might be used.
Example
This is what the first snapshot looks like in ad-hoc mode:
yb_stats --adhoc-metrics-diff
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
After the snapshot is created, you can perform the task under investigation. Once that is done, press enter:
Time between snapshots: 2.910 seconds
192.168.66.80:12000 server cpu_stime 49 ms 17.008 /s
192.168.66.80:12000 server cpu_utime 2 ms 0.694 /s
192.168.66.80:12000 server involuntary_context_switches 1 csws 0.347 /s
192.168.66.80:12000 server server_uptime_ms 2879 ms 999.306 /s
192.168.66.80:12000 server threads_started 2 threads 0.694 /s
...etcetera
adhoc-node-exporter-diff
When yb_stats
is run with the --adhoc-node-exporter-diff switch, it will perform a snapshot in memory, and wait for enter to perform the next snapshot and present the difference.
This is called 'adhoc mode'; however, the --adhoc-node-exporter-diff mode will only take the metrics from node-exporter, and show the difference.
Node-exporter shows the operating system statistics.
The usage of either adhoc mode or snapshot mode should be carefully considered: adhoc mode, alias in-memory snapshots, does not write anything to disk.
In most cases, performing snapshots persisting all the available information is the best way, so results can be reviewed later and cannot get lost, because they are stored. However, if you are performing repeated tests where storing all snapshot information would simply be too much and would require you to remove all the snapshots after testing anyway, AND you are sure what to look for, then adhoc mode might be used.
Example
This is what the first snapshot looks like in ad-hoc mode:
yb_stats --adhoc-node-exporter-diff
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
After the snapshot is created, you can perform the task under investigation. Once that is done, press enter:
Time between snapshots: 18.843 seconds
192.168.66.80:9300 counter node_context_switches_total 14946.000000 830.333 /s
192.168.66.80:9300 counter node_cpu_seconds_total_idle 17.610000 0.978 /s
192.168.66.80:9300 counter node_cpu_seconds_total_irq 0.510000 0.028 /s
192.168.66.80:9300 counter node_cpu_seconds_total_softirq 0.130000 0.007 /s
192.168.66.80:9300 counter node_cpu_seconds_total_system 0.220000 0.012 /s
192.168.66.80:9300 counter node_cpu_seconds_total_user 0.010000 0.001 /s
192.168.66.80:9300 counter node_disk_io_time_seconds_total_sda 0.004000 0.000 /s
192.168.66.80:9300 counter node_disk_io_time_weighted_seconds_total_sda 0.003000 0.000 /s
...etcetera
For an explanation of the fields see: Node-exporter statistics
Print modes
Outside of performance metrics, a snapshot contains a lot more information.
Most of the additional information can be obtained using print commands.
All of the print commands take a single snapshot number; some print commands also work without a snapshot number, which makes yb_stats
perform a live lookup.
print-version
Print version information from a live cluster or from a snapshot.
- --print-version <snapshot number>: print version information from a stored snapshot.
- --print-version: print version information from a live cluster.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
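For example, a sketch (the port filter is an illustrative assumption): restrict the live version output to the tablet servers by filtering on port 9000:
% yb_stats --print-version --hostname-match 9000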
Example:
% yb_stats --print-version
hostname_port version_number build_nr build_type build_timestamp git_hash
192.168.66.82:9000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.82:7000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.81:9000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.81:7000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.80:9000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.80:7000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
print-log
Print log information from a live cluster or from a snapshot.
- --print-log <snapshot number>: print log information from a stored snapshot.
- --print-log: print log information from a live cluster.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --log-severity: filters log lines by the letter that indicates the severity. Default: WEF, optional: I.
- --stat-name-match: filters log lines by the sourcefile name and line number (such as "leader_election.cc:216") or message fields via a regular expression.
Explanation of the severity letters (increasing in severity):
- I: Informational
- W: Warning
- E: Error
- F: Fatal
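For example, a sketch (assuming any subset of the severity letters is accepted; the snapshot number is an illustrative assumption): print only the error and fatal log lines from snapshot 2:
% yb_stats --print-log 2 --log-severity EF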
The --print-log
option prints the log lines based on the timestamp found in the log line.
The log lines are taken from the different servers, and ordered on time, which is the local timestamp, so you have to be aware of clock skew.
The timestamps in the logs are UTC time, so timezone settings should not require recalculation of the timestamp.
Example:
% yb_stats --print-log --log-severity IWEF --hostname-match '(7000|9000)'
...
192.168.66.80:9000 2023-01-29 12:29:45.447729 UTC I raft_consensus.cc:3356 T 5652a6b7a4ea47d198f0142895addf09 P 414ad0910477464fb4c17dfbd912de10 [term 1 FOLLOWER]: Calling mark dirty synchronously for reason code FOLLOWER_NO_OP_COMPLETE
192.168.66.81:9000 2023-01-29 12:29:45.448503 UTC I raft_consensus.cc:3356 T 5652a6b7a4ea47d198f0142895addf09 P 8e317433953244dfbff8dea89bdd7d77 [term 1 FOLLOWER]: Calling mark dirty synchronously for reason code FOLLOWER_NO_OP_COMPLETE
192.168.66.82:7000 2023-01-29 12:29:45.448714 UTC I catalog_manager.cc:7115 Peer 8e317433953244dfbff8dea89bdd7d77 sent incremental report for 5652a6b7a4ea47d198f0142895addf09, prev state op id: -1, prev state term: 1, prev state has_leader_uuid: 1. Consensus state: current_term: 1 leader_uuid: "0f41bb1e8bc34afe801d422c3c3064b4" config { opid_index: -1 peers { permanent_uuid: "0f41bb1e8bc34afe801d422c3c3064b4" member_type: VOTER last_known_private_addr { host: "yb-3.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local3" } } peers { permanent_uuid: "414ad0910477464fb4c17dfbd912de10" member_type: VOTER last_known_private_addr { host: "yb-1.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local1" } } peers { permanent_uuid: "8e317433953244dfbff8dea89bdd7d77" member_type: VOTER last_known_private_addr { host: "yb-2.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local2" } } }
192.168.66.82:7000 2023-01-29 12:29:45.449941 UTC I catalog_manager.cc:7115 Peer 414ad0910477464fb4c17dfbd912de10 sent incremental report for 5652a6b7a4ea47d198f0142895addf09, prev state op id: -1, prev state term: 1, prev state has_leader_uuid: 1. Consensus state: current_term: 1 leader_uuid: "0f41bb1e8bc34afe801d422c3c3064b4" config { opid_index: -1 peers { permanent_uuid: "0f41bb1e8bc34afe801d422c3c3064b4" member_type: VOTER last_known_private_addr { host: "yb-3.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local3" } } peers { permanent_uuid: "414ad0910477464fb4c17dfbd912de10" member_type: VOTER last_known_private_addr { host: "yb-1.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local1" } } peers { permanent_uuid: "8e317433953244dfbff8dea89bdd7d77" member_type: VOTER last_known_private_addr { host: "yb-2.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local2" } } }
192.168.66.82:7000 2023-01-29 12:29:45.457672 UTC I ysql_transaction_ddl.cc:55 Verifying Transaction { transaction_id: c57ef1a4-8917-437d-8fc6-a00328d6fc95 isolation: SNAPSHOT_ISOLATION status_tablet: ecc3d88a304f4a82894229c65831bb9f priority: 15800696521729142229 start_time: { physical: 1674995385026221 } locality: GLOBAL old_status_tablet: }
192.168.66.82:7000 2023-01-29 12:29:45.459448 UTC I ysql_transaction_ddl.cc:110 Got Response for { transaction_id: c57ef1a4-8917-437d-8fc6-a00328d6fc95 isolation: SNAPSHOT_ISOLATION status_tablet: ecc3d88a304f4a82894229c65831bb9f priority: 15800696521729142229 start_time: { physical: 1674995385026221 } locality: GLOBAL old_status_tablet: }, resp: status: PENDING status_hybrid_time: 6860781098837565439 propagated_hybrid_time: 6860781098837594112 aborted_subtxn_set { }
192.168.66.80:9000 2023-01-29 12:29:45.480443 UTC I table_creator.cc:363 Created table yugabyte.t of type PGSQL_TABLE_TYPE
192.168.66.82:7000 2023-01-29 12:29:45.668880 UTC I catalog_manager.cc:4098 T 00000000000000000000000000000000 P 52d4dda57d5740459c014a1a9dc07eae: Table transaction succeeded: t [id=000033e8000030008000000000004000]
These are some informational messages which are produced when a table is created, and show some of the internal dealings with RAFT, and the master (indicated by port 7000; the network address implicitly shows the master leader) managing the table and therefore the tablet creation.
tail-log
Print new log entries from a live cluster, in the same way that tail -f
works on a file.
- --tail-log: print log information from a live cluster.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --log-severity: filters log lines by the letter that indicates the severity. Default: WEF, optional: I.
- --stat-name-match: filters log lines by the sourcefile name and line number (such as "log.cc:1516") or message fields via a regular expression.
Explanation of the severity letters (increasing in severity):
- I: Informational
- W: Warning
- E: Error
- F: Fatal
The --tail-log
option prints the log lines based on the timestamp found in the log line.
The log lines are taken from the different servers, and ordered on time, which is the local timestamp, so you have to be aware of clock skew.
The timestamps in the logs are UTC time, so timezone settings should not require recalculation of the timestamp.
Example:
% yb_stats --tail-log --log-severity IWEF --hostname-match 9000 --stat-name-match '(leader_election|Granting vote|Granting yes vote)'
192.168.66.80:9000 2022-12-19 14:37:16.188669 +01:00 I leader_election.cc:216 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 pre-election: Requesting vote from peer 64dc6522080c433c9cbd2c83efccb025
192.168.66.80:9000 2022-12-19 14:37:16.188694 +01:00 I leader_election.cc:216 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 pre-election: Requesting vote from peer 9e982449051e43f3902a4fb24d639b84
192.168.66.82:9000 2022-12-19 14:37:16.188913 +01:00 I raft_consensus.cc:2375 T 411a6843451a4e87aeca805085a228ba P 64dc6522080c433c9cbd2c83efccb025 [term 0 FOLLOWER]: Pre-election. Granting vote for candidate 3152ba00f652431baeeafa1b36336093 in term 1
192.168.66.81:9000 2022-12-19 14:37:16.191915 +01:00 I raft_consensus.cc:2375 T 411a6843451a4e87aeca805085a228ba P 9e982449051e43f3902a4fb24d639b84 [term 0 FOLLOWER]: Pre-election. Granting vote for candidate 3152ba00f652431baeeafa1b36336093 in term 1
192.168.66.80:9000 2022-12-19 14:37:16.192420 +01:00 I leader_election.cc:367 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 pre-election: Vote granted by peer 64dc6522080c433c9cbd2c83efccb025
192.168.66.80:9000 2022-12-19 14:37:16.192452 +01:00 I leader_election.cc:242 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 pre-election: Election decided. Result: candidate won.
192.168.66.80:9000 2022-12-19 14:37:16.202473 +01:00 I leader_election.cc:216 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 election: Requesting vote from peer 64dc6522080c433c9cbd2c83efccb025
192.168.66.80:9000 2022-12-19 14:37:16.202504 +01:00 I leader_election.cc:216 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 election: Requesting vote from peer 9e982449051e43f3902a4fb24d639b84
192.168.66.82:9000 2022-12-19 14:37:16.210361 +01:00 I raft_consensus.cc:3022 T 411a6843451a4e87aeca805085a228ba P 64dc6522080c433c9cbd2c83efccb025 [term 1 FOLLOWER]: Leader election vote request: Granting yes vote for candidate 3152ba00f652431baeeafa1b36336093 in term 1.
192.168.66.80:9000 2022-12-19 14:37:16.211731 +01:00 I leader_election.cc:367 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 election: Vote granted by peer 9e982449051e43f3902a4fb24d639b84
192.168.66.80:9000 2022-12-19 14:37:16.211760 +01:00 I leader_election.cc:242 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 election: Election decided. Result: candidate won.
192.168.66.81:9000 2022-12-19 14:37:16.211910 +01:00 I raft_consensus.cc:3022 T 411a6843451a4e87aeca805085a228ba P 9e982449051e43f3902a4fb24d639b84 [term 1 FOLLOWER]: Leader election vote request: Granting yes vote for candidate 3152ba00f652431baeeafa1b36336093 in term 1.
In this example, the --tail-log command has --log-severity set to additionally include 'I' severity loglines,
has --hostname-match set to '9000' in order to only allow tablet server loglines,
and has --stat-name-match set to the regex '(leader_election|Granting vote|Granting yes vote)' to only allow matching sourcefile names/line numbers and message fields.
Press interrupt (ctrl-c) to terminate tailing the logs.
print-entities
Print entities (database, database object, tablet, replica) from a live cluster or from a snapshot.
- --print-entities <snapshot number>: print entities from a stored snapshot.
- --print-entities: print entities from a live cluster.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --table-name-match: filter by object name regular expression.
- --details-enable: print the entity information from all masters.
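For example, a sketch (the regular expression is an illustrative assumption): print only the entities whose object name matches ysql_bench from a live cluster:
% yb_stats --print-entities --table-name-match ysql_bench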
The entity information is available on the masters, on the leader and the followers.
In order to get the current information, yb_stats
fetches information to learn the master leader first, and then obtains the entity information from the master leader, unless the --details-enable
switch is set.
For YSQL objects/entities, yb_stats
takes the OID from the object id and filters out OIDs lower than 16384, because these are system OIDs.
Example:
% yb_stats --print-entities
Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.t.7f4fc16eba28432e8ed2baf4603f9590 state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.t.ce1302dada834f619e67dffc847a80fe state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.t.f035128dd43d43d3a1a9d4c44727df99 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.tt, state: RUNNING, id: 000033e8000030008000000000004100
Tablet: ysql.yugabyte.tt.3ae53662d5374897b8a55899f7ceb9c4 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.tt.880a7b69ae4a474b96d3ff0b7117867b state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.tt.e844bce904794c9799301a2a95cdbe82 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.ysql_bench_history, state: RUNNING, id: 000033e800003000800000000000413b
Tablet: ysql.yugabyte.ysql_bench_history.11476b9ff3bd4cdeb89a6b188de44b51 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_history.1f01c2e8a9ba467b8495b304649bcbde state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_history.fd9094233dc04e9fa17084b99c42fea6 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.ysql_bench_tellers, state: RUNNING, id: 000033e800003000800000000000413e
Tablet: ysql.yugabyte.ysql_bench_tellers.918cba44a4d34b699aab6a53eb2399bf state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_tellers.a0a6a3f68cfd4ce697f9c412b74cf84d state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_tellers.d2b6e484972c4443868e1887d05bc7a4 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.ysql_bench_accounts, state: RUNNING, id: 000033e8000030008000000000004143
Tablet: ysql.yugabyte.ysql_bench_accounts.43e33bb5a7a34a4c8b631d08f4544165 state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_accounts.634d9612c86e4a98a9ffdba70a76227f state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_accounts.da848f4a61ea43c7a7d903b1c28b6942 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.ysql_bench_branches, state: RUNNING, id: 000033e8000030008000000000004148
Tablet: ysql.yugabyte.ysql_bench_branches.b243250ea9f145ccbb68119be37f540d state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_branches.b948c3f199954959b295c78d6f3f99c7 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_branches.d456575fb4a04b9ea9e438b93129aa2f state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.testdb.testtable, state: RUNNING, id: 00004154000030008000000000004155
Tablet: ysql.testdb.testtable.5350a928953c4eb1aaa9eb0581a3112b state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Object: ysql.testdb.testindex, state: RUNNING, id: 0000415400003000800000000000415a
Tablet: ysql.testdb.testindex.16c89cf34d054f0fb9116534d366ec33 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Using the /dump-entities
endpoint it's not possible to make a distinction between the object types of table, index or materialized view.
- For every object, a full name is shown in the form of [database type].[database/keyspace].[object name], along with the state and the id of the object.
- An object contains one or more tablets.
- For every tablet, a full name is shown in the form of [database type].[database/keyspace].[object name].[tablet id]. A tablet has no name, only an id.
- For every tablet there is also the replica information between brackets. For every replica, the type, the RPC hostname and port, and the follower or leader designation are shown.
print-masters
Print the current masters from a live cluster or from a snapshot.
- --print-masters <snapshot number>: print masters from a stored snapshot.
- --print-masters: print masters from a live cluster.
Additional switches:
- --details-enable: print the master information from all masters.
In order to get the current information, yb_stats
fetches information to learn the master leader first, and then obtains the master information from the master leader, unless the --details-enable
switch is set.
Example:
% yb_stats --print-masters
d3db2544098b4b808c0c65d4d19f4d3a LEADER Cloud: local, Region: local, Zone: local
Seqno: 1669886426374545 Start time: 1669886426374545
RPC addresses: ( yb-1.local:7100 )
HTTP addresses: ( yb-1.local:7000 )
5334e8170e74496c9780d64e09177010 FOLLOWER Cloud: local, Region: local, Zone: local
Seqno: 1669886456237856 Start time: 1669886456237856
RPC addresses: ( yb-2.local:7100 )
HTTP addresses: ( yb-2.local:7000 )
b460d504c6aa488d97bfe266ab506ab6 FOLLOWER Cloud: local, Region: local, Zone: local
Seqno: 1669886489682609 Start time: 1669886489682609
RPC addresses: ( yb-3.local:7100 )
HTTP addresses: ( yb-3.local:7000 )
print-tablet-servers
Print the current tablet servers from a live cluster or from a snapshot.
- --print-tablet-servers <snapshot number>: print tablet servers from a stored snapshot.
- --print-tablet-servers: print tablet servers from a live cluster.
Additional switches:
- --details-enable: print the tablet server information from all masters.
In order to get the current information, yb_stats
fetches information to learn the master leader first, and then obtains the tablet server information from the master leader, unless the --details-enable
switch is set.
Example:
% yb_stats --print-tablet-servers
yb-2.local:9000 ALIVE Cloud: local, Region: local, Zone: local2
HB time: 2.1s, Uptime: 0, Ram 8.39 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 14, user (leader/total): 7/20, system (leader/total): 0/12
Path: /mnt/d0, total: 10724835328, used: 992555008 (9.25%)
yb-1.local:9000 ALIVE Cloud: local, Region: local, Zone: local1
HB time: 0.0s, Uptime: 0, Ram 9.44 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 16, user (leader/total): 6/20, system (leader/total): 0/12
Path: /mnt/d0, total: 10724835328, used: 441438208 (4.12%)
yb-3.local:9000 ALIVE Cloud: local, Region: local, Zone: local3
HB time: 0.1s, Uptime: 1118, Ram 62.46 MB
SST files: nr: 13, size: 4.46 MB, uncompressed: 15.24 MB
ops read: 0, write: 0
tablets: active: 32, user (leader/total): 6/20, system (leader/total): 5/12
Path: /mnt/d0, total: 10724835328, used: 922292224 (8.60%)
print-vars
Print the current vars (gflags) from a live cluster or from a snapshot.
- --print-vars <snapshot number>: print variables/gflags from every server (endpoint) in a stored snapshot.
- --print-vars: print variables/gflags from every server (endpoint) in the cluster.
Additional switches:
- --details-enable: print variables with 'Default' type too.
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-filter: filter by variable name regular expression.
Example:
% yb_stats --print-vars --hostname-match 192.168.66.82:9000
192.168.66.82:9000 log_filename yb-tserver NodeInfo
192.168.66.82:9000 placement_cloud local NodeInfo
192.168.66.82:9000 placement_region local NodeInfo
192.168.66.82:9000 placement_zone local3 NodeInfo
192.168.66.82:9000 rpc_bind_addresses 0.0.0.0 NodeInfo
192.168.66.82:9000 webserver_interface NodeInfo
192.168.66.82:9000 webserver_port 9000 NodeInfo
192.168.66.82:9000 client_read_write_timeout_ms 600000 Custom
192.168.66.82:9000 cql_proxy_bind_address 0.0.0.0:9042 Custom
192.168.66.82:9000 db_block_cache_size_percentage 10 Custom
192.168.66.82:9000 default_memory_limit_to_ram_ratio 0.59999999999999998 Custom
192.168.66.82:9000 flagfile /opt/yugabyte/conf/tserver.conf Custom
192.168.66.82:9000 fs_data_dirs /mnt/d0 Custom
192.168.66.82:9000 global_log_cache_size_limit_mb 32 Custom
192.168.66.82:9000 leader_lease_duration_ms 4000 Custom
192.168.66.82:9000 log_cache_size_limit_mb 16 Custom
192.168.66.82:9000 mem_tracker_tcmalloc_gc_release_bytes 5062950 Custom
192.168.66.82:9000 pg_yb_session_timeout_ms 600000 Custom
192.168.66.82:9000 pgsql_proxy_bind_address 0.0.0.0:5433 Custom
192.168.66.82:9000 raft_heartbeat_interval_ms 1000 Custom
192.168.66.82:9000 redis_proxy_bind_address 0.0.0.0:6379 Custom
192.168.66.82:9000 server_tcmalloc_max_total_thread_cache_bytes 33554432 Custom
192.168.66.82:9000 start_pgsql_proxy true Custom
192.168.66.82:9000 tserver_master_addrs yb-1.local:7100,yb-2.local:7100,yb-3.local:7100 Custom
192.168.66.82:9000 yb_num_shards_per_tserver 2 Custom
192.168.66.82:9000 ysql_num_shards_per_tserver 1 Custom
192.168.66.82:9000 regular_tablets_data_block_key_value_encoding three_shared_parts Auto
192.168.66.82:9000 TEST_auto_flags_initialized true Auto
This is using the new /api/v1/varz
endpoint. For older versions, use --print-gflags
.
The new variables/gflags page shows a classification or 'type' per variable/gflag.
- Default
- NodeInfo
- Custom
- Auto
Variables of the type Default
are not changed, and therefore are not shown by default.
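For example, a sketch (reusing the host filter from the example above): also show the unchanged Default variables for a single tablet server:
% yb_stats --print-vars --details-enable --hostname-match 192.168.66.82:9000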
print-memtrackers
Print the memtracker page from a snapshot.
- --print-memtrackers <snapshot number>: print memtrackers information from a stored snapshot.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by memory area name (id) regular expression.
Example:
% yb_stats --print-memtrackers 2 --hostname-match 82:700 --stat-name-match '(root|TCMalloc|server)'
--------------------------------------------------------------------------------------------------------------------------------------
Host: 192.168.66.82:7000, Snapshot number: 2, Snapshot time: 2022-10-18 15:26:20.125948 +02:00
--------------------------------------------------------------------------------------------------------------------------------------
hostname_port id current_consumption peak_consumption limit
--------------------------------------------------------------------------------------------------------------------------------------
192.168.66.82:7000 root 26.86M 28.09M 241.44M
192.168.66.82:7000 TCMalloc Central Cache 812.7K 961.9K none
192.168.66.82:7000 TCMalloc PageHeap Free 2.26M 4.73M none
192.168.66.82:7000 TCMalloc Thread Cache 716.9K 1.97M none
192.168.66.82:7000 TCMalloc Transfer Cache 985.5K 1.41M none
192.168.66.82:7000 server 197.5K (12.62M) 232.1K none
print-rpcs
Print RPC (remote procedure call) information from a snapshot.
- --print-rpcs <snapshot number>: print rpc information from a stored snapshot.
By default, --print-rpcs
prints out a summary (a count of RPCs) per host and port. Use --details-enable
to see all individual RPCs.
--print-rpcs
includes port 13000 (YSQL), which means the YSQL connections.
Additional switches:
- --details-enable: print all individual RPC connections, instead of a summary.
- --hostname-match: filter by hostname or port regular expression.
Example:
% yb_stats --print-rpcs 2
----------------------------------------------------------------------------------------------------
Host: 192.168.66.80; port: 13000, count: 2; port: 7000, count: 20; port: 9000, count: 49
----------------------------------------------------------------------------------------------------
Host: 192.168.66.81; port: 13000, count: 1; port: 7000, count: 19; port: 9000, count: 41
----------------------------------------------------------------------------------------------------
Host: 192.168.66.82; port: 13000, count: 1; port: 7000, count: 57; port: 9000, count: 40
----------------------------------------------------------------------------------------------------
With --details-enable
and --hostname-match
it's possible to see the current connections to YSQL for a node, for example:
% yb_stats --print-rpcs 27 --details-enable --hostname-match 80:13000
----------------------------------------------------------------------------------------------------
Host: 192.168.66.80; port: 13000, count: 2
----------------------------------------------------------------------------------------------------
192.168.66.80:13000 idle yugabyte client backend ysqlsh 127.0.0.1
192.168.66.80:13000 checkpointer
print-threads
Print current threads from a snapshot.
- --print-threads <snapshot number>: print current threads from a snapshot.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
Example:
% yb_stats --print-threads 27 --hostname-match 80:9000
--------------------------------------------------------------------------------------------------------------------------------------
Host: 192.168.66.80:9000, Snapshot number: 27, Snapshot time: 2022-12-01 11:24:08.436165 +01:00
--------------------------------------------------------------------------------------------------------------------------------------
hostname_port thread_name cum_user_cpu_s cum_kernel_cpu_s cum_iowait_cpu_s stack
--------------------------------------------------------------------------------------------------------------------------------------
192.168.66.80:9000 pg_supervisorxx-7789 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::pgwrapper::PgSupervisor::RunThread();yb::Subprocess::DoWait();__GI___waitpid
192.168.66.80:9000 CQLServer_reactor-7801 0.070s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Reactor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 RedisServer_reactor-7794 0.280s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Reactor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 TabletServer_reactor-7732 0.070s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Reactor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 acceptorxxxxxxx-7803 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Acceptor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 acceptorxxxxxxx-7796 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Acceptor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 acceptorxxxxxxx-7739 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Acceptor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 heartbeatxxxxxx-7743 0.190s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::tserver::Heartbeater::Thread::RunThread();__pthread_cond_timedwait
192.168.66.80:9000 iotp_CQLServer_3-7800 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_CQLServer_2-7799 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_CQLServer_1-7798 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_CQLServer_0-7797 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();boost::asio::detail::epoll_reactor::run();__GI_epoll_wait
192.168.66.80:9000 iotp_RedisServer_1-7791 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_RedisServer_2-7792 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_RedisServer_3-7793 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_RedisServer_0-7790 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();boost::asio::detail::epoll_reactor::run();__GI_epoll_wait
192.168.66.80:9000 iotp_TabletServer_0-7728 0.150s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_TabletServer_2-7730 0.320s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_TabletServer_3-7731 0.310s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_TabletServer_1-7729 0.210s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();boost::asio::detail::epoll_reactor::run();__GI_epoll_wait
192.168.66.80:9000 iotp_call_home_0-7748 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();boost::asio::detail::epoll_reactor::run();__GI_epoll_wait
192.168.66.80:9000 maintenance_scheduler-7745 0.330s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::MaintenanceManager::RunSchedulerThread();__pthread_cond_timedwait
192.168.66.80:9000 rb-session-expx-7736 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::tserver::RemoteBootstrapServiceImpl::EndExpiredSessions();yb::CountDownLatch::WaitFor();yb::ConditionVariable::WaitUntil();__pthread_cond_timedwait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_4-7782 0.230s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_3-7774 0.170s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_2-7773 0.250s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_1-7768 0.210s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_0-7767 0.170s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_11-7761 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_10-7760 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_9-7759 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_8-7758 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_7-7757 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_6-7756 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_5-7755 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_4-7754 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_3-7753 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_2-7747 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_1-7746 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_0-7741 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 flush scheduler bgtask-7733 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::BackgroundTask::Run();__pthread_cond_wait
192.168.66.80:9000 server_clientcb [worker]-7744 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 cdc_clientcb [worker]-7737 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 MaintenanceMgr [worker]-7129 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 log-alloc [worker]-7128 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 append [worker]-7127 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 prepare [worker]-7126 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 log-sync [worker]-7125 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 consensus [worker]-7124 0.830s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
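A sketch for finding the busiest threads in this output: sort on the cum_user_cpu_s column. This assumes GNU sort (its general-numeric sort reads the leading number of values such as 0.830s) and that data rows start with the ip address:
yb_stats --print-threads 27 --hostname-match 80:9000 \
  | grep '^192' \
  | sort -k3,3 -gr \
  | head -5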
print-gflags
Print gflags information from a snapshot.
--print-gflags <snapshot number>
: print gflags from a stored snapshot.
Additional switches:
--hostname-match
: filter by hostname or port regular expression.
--stat-name-match
: filter by gflag name regular expression.
Example:
% yb_stats --print-gflags 2 --hostname-match 82:700 --stat-name-match wal
--------------------------------------------------------------------------------------------------------------------------------------
Host: 192.168.66.82:7000, Snapshot number: 2, Snapshot time: 2022-10-18 15:26:20.128980 +02:00
--------------------------------------------------------------------------------------------------------------------------------------
cdc_wal_retention_time_secs 14400
time_based_wal_gc_clock_delta_usec 0
bytes_durable_wal_write_mb 1
durable_wal_write true
interval_durable_wal_write_ms 1000
require_durable_wal_write false
save_index_into_wal_segments false
fs_wal_dirs /mnt/d0
skip_wal_rewrite true
TEST_download_partial_wal_segments false
TEST_pause_rbs_before_download_wal false
TEST_fault_crash_after_wal_deleted 0
print-cluster-config
Print the current cluster configuration from a live cluster or from a snapshot.
--print-cluster-config <snapshot number>
: print the cluster configuration from a stored snapshot.
--print-cluster-config
: print the cluster configuration from a live cluster.
Example:
➜ yb_stats --print-cluster-config
{
"hostname_port": "192.168.66.80:7000",
"timestamp": "2023-03-18T14:29:18.929159+01:00",
"version": 0,
"replication_info": null,
"server_blacklist": null,
"cluster_uuid": "e9e7c5bb-9494-4a56-b3c0-c3b1d9a7caf7",
"encryption_info": null,
"consumer_registry": null,
"leader_blacklist": null
}
This is using the /api/v1/cluster-config
endpoint, and uses the information from the current master leader.
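As a sketch, the same data can be fetched directly with curl (assuming a master listening on 192.168.66.80:7000); the other /api/v1 endpoints can be queried the same way:
curl -s http://192.168.66.80:7000/api/v1/cluster-config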
print-health-check
Print the health check output from the master leader from a live cluster or from a snapshot.
--print-health-check <snapshot number>
: print the health check output from a stored snapshot.
--print-health-check
: print the health check output from a live cluster.
Example:
➜ yb_stats --print-health-check
{
"hostname_port": "192.168.66.82:7000",
"timestamp": "2023-03-18T14:42:09.572026+01:00",
"dead_nodes": [],
"most_recent_uptime": 261,
"under_replicated_tablets": [],
"failed_tablets": null
}
This is using the /api/v1/health-check
endpoint, and uses the information from the current master leader.
print-drives
Print the drives usage information from the masters and tablet servers from a live cluster or from a snapshot.
--print-drives <snapshot number>
: print the drives usage information from a stored snapshot.
--print-drives
: print the drives usage information from a live cluster.
Example:
➜ yb_stats --print-drives
192.168.66.80:7000 /mnt/d0 9.99G 149.83M
192.168.66.80:9000 /mnt/d0 9.99G 149.83M
192.168.66.80:12000 /mnt/d0 9.99G 149.83M
192.168.66.82:12000 /mnt/d0 9.99G 146.08M
192.168.66.82:9000 /mnt/d0 9.99G 146.08M
192.168.66.81:9000 /mnt/d0 9.99G 145.90M
192.168.66.81:7000 /mnt/d0 9.99G 145.90M
192.168.66.82:7000 /mnt/d0 9.99G 146.08M
192.168.66.81:12000 /mnt/d0 9.99G 145.90M
The output shows the following fields:
- endpoint (hostname/ip address and port number)
- path
- total space
- used space
This is using the /drives
URL, which is present on each master and tablet server.
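As a sketch, assuming the sizes are always printed with a K, M or G suffix as shown above, the total and used columns can be turned into a used percentage with awk:
yb_stats --print-drives | awk '
  # convert a size such as 9.99G or 149.83M to bytes
  function bytes(s) {
    n = s + 0
    if (s ~ /G$/) n *= 1024 ^ 3
    else if (s ~ /M$/) n *= 1024 ^ 2
    else if (s ~ /K$/) n *= 1024
    return n
  }
  { printf "%-22s %-10s %5.1f%% used\n", $1, $2, 100 * bytes($4) / bytes($3) }'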
print-tablet-server-operations
Print the current tablet server operations from each tablet server in a live cluster or from a snapshot.
--print-tablet-server-operations <snapshot number>
: print the tablet server operations from a stored snapshot.
--print-tablet-server-operations
: print the tablet server operations from a live cluster.
Example:
➜ yb_stats --print-tablet-server-operations
192.168.66.82:9000 acc6cf799c22457fb722ba9a361c54b9 term: 1 index: 34 WRITE_OP 53479 us. R-P { type: kWrite consensus_round: 0x0000561862d89700 -> id { term: 1 index: 34
192.168.66.81:9000 acc6cf799c22457fb722ba9a361c54b9 term: 1 index: 35 WRITE_OP 40356 us. R-P { type: kWrite consensus_round: 0x00005628f5fc9b00 -> id { term: 1 index: 35
192.168.66.80:9000 56263119ab85438481a3b5865dfc5787 term: 1 index: 34 WRITE_OP 18758 us. R-P { type: kWrite consensus_round: 0x00005605e07ed000 -> id { term: 1 index: 34
192.168.66.80:9000 acc6cf799c22457fb722ba9a361c54b9 term: 1 index: 34 WRITE_OP 149000 us. R-P { type: kWrite consensus_round: 0x00005605e57224c0 -> id { term: 1 index: 34
192.168.66.80:9000 999219c1c7ce4b03bb41b568719f7c26 term: 1 index: 11 UPDATE_TRANSACTION_OP 11560 us. R-P { type: kUpdateTransaction consensus_round: 0x00005605e1b093c0 -> id { term:
The output shows the following fields:
- endpoint (hostname/ip address and port number)
- tablet id
- operation id (term: N, index: N)
- transaction type (WRITE_OP, UPDATE_TRANSACTION_OP)
- total time in flight (N us.)
- description (R-P { type: ...)
This is using the /operations
URL, which is present on each tablet server.
print-master-tasks
Print the current master tasks, the last 20 user-initiated tasks, and the last 100 tasks started in the last 300 seconds, from the master leader.
--print-master-tasks <snapshot number>
: print the master tasks from a stored snapshot.
--print-master-tasks
: print the master tasks from a live cluster.
Example:
➜ yb_stats --print-master-tasks
task done Truncate Tablet kComplete 3.75 min ago 473 ms Truncate Tablet RPC for tablet 0x000056279f4eeb00 -> 56263119ab85438481a3b5865dfc5787 (table t [id=000033e8000030008000000000004000]) (t [id=000033e8000030008000000000004000])
task done Truncate Tablet kComplete 3.75 min ago 575 ms Truncate Tablet RPC for tablet 0x000056279f4ee840 -> 4abf56bde0e843cfa9de8f48ca0e6a71 (table t [id=000033e8000030008000000000004000]) (t [id=000033e8000030008000000000004000])
task done Truncate Tablet kComplete 3.75 min ago 1.39 s Truncate Tablet RPC for tablet 0x000056279f4ee580 -> acc6cf799c22457fb722ba9a361c54b9 (table t [id=000033e8000030008000000000004000]) (t [id=000033e8000030008000000000004000])
The output shows the following fields:
- task category (task done)
- task name (Truncate Tablet)
- task state (kComplete)
- start time (3.75 min ago)
- duration (N ms)
- description (Truncate Tablet RPC for ...)
This is using the /tasks
URL, which is present on each master server.
--print-table-detail
The --print-table-detail
switch takes no argument in order to look up the table detail from a live cluster,
or a snapshot number to look up the table detail from a snapshot.
Because the table detail information is fetched separately for each individual table, it is NOT fetched by default.
To let yb_stats
fetch the additional data for --print-table-detail
, you must add the --extra-data
switch!
The --print-table-detail
switch also requires the extra --uuid
switch to set the UUID for the table to print the details.
In order to obtain the UUID to use for this switch, use the --print-entities
option to obtain a list of tables with their UUIDs.
For YSQL tables, the UUID is not really a UUID, but a large hexadecimal number that is composed of several components,
such as the database OID and the table OID.
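For example, under the assumption that the first four bytes of the id encode the database OID and the last four bytes the table OID, the two can be extracted from the hexadecimal id like this (a sketch, using bash arithmetic):
id=000033e8000030008000000000004000
# first 8 hex characters: database OID; last 8 hex characters: table OID
echo "database oid: $((16#${id:0:8})), table oid: $((16#${id:24:8}))"
# database oid: 13288, table oid: 16384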
Get list of entities to obtain table UUID:
yb_stats --print-entities
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.t.4abf56bde0e843cfa9de8f48ca0e6a71 state: RUNNING
Replicas: (yb-1.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER),)
Tablet: ysql.yugabyte.t.56263119ab85438481a3b5865dfc5787 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER:LEADER),)
Tablet: ysql.yugabyte.t.acc6cf799c22457fb722ba9a361c54b9 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER:LEADER), yb-2.local:9100(VOTER),)
For the table ysql.yugabyte.t, the UUID is 000033e8000030008000000000004000.
yb_stats --print-table-detail --extra-data --uuid 000033e8000030008000000000004000
Table UUID: 000033e8000030008000000000004000, version: 0, type: PGSQL_TABLE_TYPE, state: Running, keyspace: yugabyte, object_type: User tables, name: t
On disk size: Total: 191.69M WAL Files: 132.00M SST Files: 59.69M SST Files Uncompressed: 569.71M
Replication info:
Columns:
0 id int32 NOT NULL PARTITION KEY
1 f1 string NULLABLE NOT A PARTITION KEY
Tablets:
acc6cf799c22457fb722ba9a361c54b9 hash_split: [0x0000, 0x5554], Split depth: 0, State: Running, Hidden: false, Message: Tablet reported with an active leader, Raft: FOLLOWER: yb-1.local LEADER: yb-3.local FOLLOWER: yb-2.local
56263119ab85438481a3b5865dfc5787 hash_split: [0xAAAA, 0xFFFF], Split depth: 0, State: Running, Hidden: false, Message: Tablet reported with an active leader, Raft: FOLLOWER: yb-1.local FOLLOWER: yb-3.local LEADER: yb-2.local
4abf56bde0e843cfa9de8f48ca0e6a71 hash_split: [0x5555, 0xAAA9], Split depth: 0, State: Running, Hidden: false, Message: Tablet reported with an active leader, Raft: LEADER: yb-1.local FOLLOWER: yb-3.local FOLLOWER: yb-2.local
Tasks:
This shows:
- The table UUID again.
- The version. A table gets a new version if it's modified.
- The type. This is a PGSQL_TABLE_TYPE, which means it's a postgres (YSQL) type object. Materialized views and indexes are also PGSQL_TABLE_TYPE objects.
- The state.
- The keyspace (database).
- The object_type. This tells whether the object is a table (user tables), an index (index tables), or a catalog table (system tables). A materialized view is listed as a user table.
- The name.
- The on disk size. This is the total size (to the accuracy YugabyteDB reports) of all tablets.
- Replication info. This will show replication settings as JSON.
- The columns on the DocDB level, along with the DocDB column type, and which columns are part of the partition key (the primary key).
- The tablets. This not only shows the number of tablets, but also how they are split. The output above shows the (default) hash split; see the sketch after this list.
- Tasks. Tasks that can happen at the tablet level are, for example, index backfills.
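The hash ranges shown above split the full 16-bit hash space (0x0000-0xFFFF) into three roughly equal parts; a quick sketch of the boundary arithmetic:
printf '0x%04X\n' $(( 65536 / 3 ))      # 0x5555: start of the second range
printf '0x%04X\n' $(( 2 * 65536 / 3 ))  # 0xAAAA: start of the third range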
--print-tablet-detail
The --print-tablet-detail
switch takes no argument in order to look up the tablet detail from a live cluster,
or a snapshot number to look up the tablet detail from a snapshot.
Because the tablet detail information is fetched separately for each individual tablet, it is NOT fetched by default.
To let yb_stats
fetch the additional data for --print-tablet-detail
, you must add the --extra-data
switch!
The --print-tablet-detail
switch also requires the extra --uuid
switch to set the UUID for the tablet to print the details.
In order to obtain the UUID to use for this switch, use the --print-entities
option to obtain a list of tablets with their UUIDs.
Get a list of entities to obtain tablet UUID:
yb_stats --print-entities
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.t.4abf56bde0e843cfa9de8f48ca0e6a71 state: RUNNING
Replicas: (yb-1.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER),)
Tablet: ysql.yugabyte.t.56263119ab85438481a3b5865dfc5787 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER:LEADER),)
Tablet: ysql.yugabyte.t.acc6cf799c22457fb722ba9a361c54b9 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER:LEADER), yb-2.local:9100(VOTER),)
For the table ysql.yugabyte.t, there are 3 tablets. Let's pick 4abf56bde0e843cfa9de8f48ca0e6a71.
yb_stats --print-tablet-detail --extra-data --uuid 4abf56bde0e843cfa9de8f48ca0e6a71
192.168.66.80:9000
General info:
Keyspace: yugabyte
Object name: t
On disk sizes: Total: 21.87M Consensus Metadata: 1.5K WAL Files: 2.00M SST Files: 19.87M SST Files Uncompressed: 189.64M
State: RUNNING
Consensus:
State: Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 0, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
Queue overview: Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 0, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
Watermark:
- { peer: 370474e547cc422ab838282184367b9b is_new: 0 last_received: 4.659 next_index: 660 last_known_committed_idx: 659 is_last_exchange_successful: 1 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 4 last_applied: 4.659 }
- { peer: 3fc85b37b3fd4332bc7ed0bcf128b5de is_new: 0 last_received: 4.659 next_index: 660 last_known_committed_idx: 659 is_last_exchange_successful: 1 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 2 last_applied: 4.659 }
- { peer: 35db27f008cb4a3ba7c7c5b224bacb7a is_new: 0 last_received: 4.659 next_index: 660 last_known_committed_idx: 659 is_last_exchange_successful: 1 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 4 last_applied: 4.659 }
Messages:
- Entry: 0, Opid: 0.0, mesg. type: REPLICATE UNKNOWN_OP, size: 6, status: term: 0 index: 0
LogAnchor:
Latest log entry op id: 4.659
Min retryable request op id: 9223372036854775807.9223372036854775807
Last committed op id: 4.659
Earliest needed log index: 659
Transactions:
- { safe_time_for_participant: { physical: 1679241188204604 logical: 1 } remove_queue_size: 0 }
Rocksdb:
IntentDB:
RegularDB:
total_size: 2051458, uncompressed_size: 19655884, name_id: 14, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6273456, uncompressed_size: 59741998, name_id: 13, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6220244, uncompressed_size: 59741913, name_id: 12, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6291230, uncompressed_size: 59709058, name_id: 11, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
192.168.66.82:9000
General info:
Keyspace: yugabyte
Object name: t
On disk sizes: Total: 84.87M Consensus Metadata: 1.5K WAL Files: 65.00M SST Files: 19.87M SST Files Uncompressed: 189.64M
State: RUNNING
Consensus:
State: Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 1, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
Queue overview:
Watermark:
Messages:
LogAnchor:
Latest log entry op id: 4.659
Min retryable request op id: 9223372036854775807.9223372036854775807
Last committed op id: 4.659
Earliest needed log index: 659
Transactions:
- { safe_time_for_participant: { physical: 1679241188201243 } remove_queue_size: 0 }
Rocksdb:
IntentDB:
RegularDB:
total_size: 2051458, uncompressed_size: 19655884, name_id: 14, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6273456, uncompressed_size: 59741998, name_id: 13, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6220244, uncompressed_size: 59741913, name_id: 12, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6291230, uncompressed_size: 59709058, name_id: 11, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
192.168.66.81:9000
General info:
Keyspace: yugabyte
Object name: t
On disk sizes: Total: 84.87M Consensus Metadata: 1.5K WAL Files: 65.00M SST Files: 19.87M SST Files Uncompressed: 189.63M
State: RUNNING
Consensus:
State: Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 2, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=1)
Queue overview:
Watermark:
Messages:
LogAnchor:
Latest log entry op id: 4.659
Min retryable request op id: 9223372036854775807.9223372036854775807
Last committed op id: 4.659
Earliest needed log index: 659
Transactions:
- { safe_time_for_participant: { physical: 1679241187538899 } remove_queue_size: 0 }
Rocksdb:
IntentDB:
RegularDB:
total_size: 5205489, uncompressed_size: 49526687, name_id: 12, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 15627161, uncompressed_size: 149320345, name_id: 11, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
This shows that with replication factor 3, a single tablet is stored on multiple tablet servers. Therefore, the first thing shown for a tablet (replica) is the hostname/ip address of the tablet server.
- General info:
- Keyspace: The YCQL keyspace or YSQL database name.
- Object name: the table or index name.
- On disk sizes: the sizes (to the accuracy YugabyteDB reports) of this single tablet.
- State.
- Consensus:
- The state.
- Queue overview: this field only contains data if the replica is the leader.
- Watermark: this will list the peers (including the local replica) for the tablet on the leader.
- Messages: this shows entries on the leader.
- Log Anchor:
- The log anchor data is internal administration of the log state.
- Transactions:
- The transactions show the HLC safe time, and might list in-flight transactions on the leader when these are happening; see the sketch after this list.
- Rocksdb:
- IntentDB: the file status of the SST files for the IntentDB.
- RegularDB: the file status of the SST files for the RegularDB. This is the actual data storage.
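The physical component of the safe time above looks like a microsecond Unix timestamp; as a sketch under that assumption (and assuming GNU date), it can be made readable by dropping the last six digits:
# 1679241188204604 microseconds -> 1679241188 seconds since the epoch
date -u -d @1679241188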
This is low-level detail for troubleshooting.
Crontab
In many cases it is convenient to have historical snapshots, so you can "look back in time". This can be done by scheduling yb_stats
via crontab.
- Install yb_stats using the RPM. See install
- Create a directory on a filesystem with enough disk space
mkdir /home/yugabyte/yb_stats-history
In this example, a directory in the home directory of the yugabyte user is used.
It is hard to quantify 'enough disk space': it depends on the size of the cluster.
In general it's best to run yb_stats
in a separate directory, so that it can use its own .env
file.
- Set up the run configuration with
.env
cd /home/yugabyte/yb_stats-history
yb_stats --hosts 10.1.2.0,10.1.2.1,10.1.2.3 --parallel 3
This will trigger ad-hoc mode: press enter, and validate that it fetched the correct hosts and endpoints. Be conservative with parallelism. See parallel
- Schedule
yb_stats
in crontab
crontab -e
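# at minute 5 of every hour: cd into the history directory, take a snapshot,
# and append stdout and stderr to yb_stats_run.out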
5 */1 * * * yb_stats_path="/home/yugabyte/yb_stats-history" && (date && cd $yb_stats_path && /usr/local/bin/yb_stats --snapshot) >> $yb_stats_path/yb_stats_run.out 2>&1
- (Experimental) Let logrotate clean up old yb_stats snapshots:
vi /etc/logrotate.d/yb_stats
File contents:
/home/yugabyte/yb_stats-history/yb_stats.snapshots {
daily
rotate 7
missingok
}
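To check what this configuration would do without rotating or removing anything, logrotate's debug mode performs a dry run:
logrotate -d /etc/logrotate.d/yb_stats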
Mind the path that logrotate scans (/home/yugabyte/yb_stats-history/yb_stats.snapshots
), and the cleaning schedule: with daily
rotation and rotate 7, files that are no longer touched are kept for a week before they are removed.
This currently will leave the removed snapshots in the snapshot.index file.
Troubleshooting
yb_stats
by default prints as little as it can to the screen, and will therefore NOT show issues that it can overcome.
In order to let yb_stats
provide more information about what it is encountering, you can increase the logging level. The default logging level is error
, and an error will also terminate execution. This is how that is done:
Logging level warning: warn
:
RUST_LOG=warn yb_stats
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.82:12000, type: server, namespace: -, table_name: -: RejectedU64MetricValue {
name: "threads_running_thread_pool",
value: 18446744073709551614,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.82:7000, type: cluster, namespace: -, table_name: -: RejectedBooleanMetricValue {
name: "is_load_balancing_enabled",
value: false,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.81:12000, type: server, namespace: -, table_name: -: RejectedU64MetricValue {
name: "threads_running_thread_pool",
value: 18446744073709551614,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.81:7000, type: cluster, namespace: -, table_name: -: RejectedBooleanMetricValue {
name: "is_load_balancing_enabled",
value: true,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.80:12000, type: server, namespace: -, table_name: -: RejectedU64MetricValue {
name: "threads_running_thread_pool",
value: 18446744073709551614,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.80:7000, type: cluster, namespace: -, table_name: -: RejectedBooleanMetricValue {
name: "is_load_balancing_enabled",
value: false,
}
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
These are statistics that do not fit their type (both issues have been solved, but the fixes have not yet made it to a public release).
The warn
level is still reasonably quiet.
All available log levels, in order of increasing verbosity, are:
- error (default)
- warn
- info
- debug
- trace
Please be aware that beyond info, the amount of output can be high.
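A sketch for capturing verbose output in a file instead of on the terminal, assuming the log lines are written to stderr (the default for Rust's logging output):
RUST_LOG=debug yb_stats --snapshot 2> yb_stats_debug.log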
Advanced logging
If you want to understand more about a specific module, you can enable a logging level for that module only. This requires some understanding of the module system in Rust; however, the module name can be seen in the log output, as shown previously:
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.80:7000, type: ...
yb_stats::metrics
is the module and submodule here.
In order to produce trace
level logging for yb_stats::metrics
only, use:
RUST_LOG="yb_stats::metrics=trace" yb_stats
To set different logging levels for different modules, separate them with a comma:
RUST_LOG="yb_stats::metrics=trace,yb_stats::rpcs=info" ./target/release/yb_stats --snapshot
[2022-10-21T10:01:13Z INFO yb_stats::rpcs] perform_rpcs_snapshot
[2022-10-21T10:01:13Z INFO yb_stats::metrics] perform_snapshot (metrics)
[2022-10-21T10:01:13Z INFO yb_stats::rpcs] Could not parse 192.168.66.82:9300/rpcz json data for rpcs, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::metrics] (192.168.66.82:9300) error parsing /metrics json data for metrics, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::rpcs] Could not parse 192.168.66.81:9300/rpcz json data for rpcs, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::rpcs] Could not parse 192.168.66.80:9300/rpcz json data for rpcs, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::metrics] (192.168.66.81:9300) error parsing /metrics json data for metrics, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::metrics] (192.168.66.80:9300) error parsing /metrics json data for metrics, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: server, metric_id: yb.ysqlserver, metric_attribute_namespace_name: -, metric_attribute_table_name: -
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: server, metric_id: yb.cqlserver, metric_attribute_namespace_name: -, metric_attribute_table_name: -
[2022-10-21T10:01:13Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.82:12000, type: server, namespace: -, table_name: -: RejectedU64MetricValue {
name: "threads_running_thread_pool",
value: 18446744073709551613,
}
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: server, metric_id: yb.tabletserver, metric_attribute_namespace_name: -, metric_attribute_table_name: -
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: tablet, metric_id: 5abf0e6155ea4f64860325f6cfd2332a, metric_attribute_namespace_name: system, metric_attribute_table_name: transactions
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: tablet, metric_id: e0ac5a9011874a668654e97ca348833d, metric_attribute_namespace_name: system, metric_attribute_table_name: transactions
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: tablet, metric_id: 21962b8b5dbd4a6f99c3f3d5bc0780a6, metric_attribute_namespace_name: yugabyte, metric_attribute_table_name: t