README
Welcome to the yb_stats book.
This is the documentation of the yb_stats project.
yb_stats is a CLI tool to query, investigate and extract all facts from a YugabyteDB cluster, and store these in snapshot files to persist the facts and make them easy to transport.
The way yb_stats works is that it obtains information from the YugabyteDB and node-exporter http endpoints using the least amount of dependencies, which means using only the yb_stats tool and the http endpoints.
The tool provides six general modes of usage:
- Ad-hoc mode: request and store performance metrics and status data in memory, then wait for enter while the actions under investigation are performed. Pressing enter requests and stores the performance metrics and status data in memory again, after which the differences are shown.
- Snapshot mode: request and store all performance metrics and status data to snapshot files and display the snapshot number.
- Snapshot-diff mode: read the performance metrics and status data from two different snapshots, and show the differences. This produces the same output as ad-hoc mode, but with the metrics taken from snapshots instead.
- Print snapshot mode: use the print functions with a snapshot number to display data stored in a snapshot, such as version, vars, etc.
- Print ad-hoc mode: use the print functions without a snapshot number to fetch and display the same data live from the cluster.
- Raw mode: you can always look into the stored snapshot files yourself. For two sources of data there is no function in yb_stats to publish the results: the http endpoint /pprof/growth output and the /memz output.
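A minimal sketch of a typical workflow, combining snapshot and snapshot-diff mode (host addresses are illustrative, and the example assumes these are the first two snapshots, numbered 0 and 1):
yb_stats --hosts 10.1.0.1,10.1.0.2,10.1.0.3 --snapshot
(perform the workload or change under investigation)
yb_stats --snapshot
yb_stats --snapshot-diff -b 0 -e 1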
This is an overview of the datasources, the type of daemon they come from, the (default) ports, and the http endpoint used.
Datasource | Type | Default Port(s) | Endpoint |
---|---|---|---|
Metrics | tserver/master | 7000,9000,12000,13000 | /metrics |
Statements | tserver | 13000 | /statements |
Metrics | node-exporter | 9300 | /metrics |
Gflags | tserver/master | 7000,9000 | /varz |
Vars | tserver/master | 7000,9000 | /api/v1/varz |
Threads | tserver/master | 7000,9000 | /threadz |
Mem-trackers | tserver/master | 7000,9000 | /mem-trackers |
Logging | tserver/master | 7000,9000 | /logs |
Version | tserver/master | 7000,9000 | /api/v1/version |
Entities | master | 7000 | /dump-entities |
Masters | master | 7000 | /api/v1/masters |
Tablet-servers | master | 7000 | /api/v1/tablet-servers |
RPCs | tserver/master | 7000,9000,12000,13000 | /rpcz |
Pprof | tserver/master | 7000,9000 | /pprof/growth |
Memory breakdown | tserver/master | 7000,9000 | /memz |
Based on the datasources, these are the modes available for each datasource:
Datasource | Snapshot | Ad-hoc | Diff | Print snap | Print ad-hoc | Raw |
---|---|---|---|---|---|---|
Metrics (YB) | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
Statements | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
Metrics (node-exporter) | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
Gflags | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Vars | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Thread | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Mem-trackers | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Logging | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Version | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Entities | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Masters | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Tablet-servers | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
RPCs | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
Pprof | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
Memory breakdown | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
- Snapshot: the datasources captured with the snapshot command.
- Ad-hoc: the datasources that are involved in an ad-hoc (non-snapshot) use.
- Diff: the datasources that are involved in a snapshot-diff use. This is identical to ad-hoc, but with the data taken from snapshots instead of 'live'.
- Print: the datasources that are queryable via a yb_stats print command.
- Raw: the datasources that are captured in the snapshot directory, but require reading the file directly; no print or diff command exists.
Install
Mac OSX via homebrew
Add the yb_stats "brew tap" (which is a github repository):
brew tap fritshoogland-yugabyte/yb_stats
Install yb_stats:
brew install yb_stats
yb_stats is available in /usr/local/bin, which should normally be in $PATH.
Uninstall yb_stats via homebrew
Remove yb_stats:
brew uninstall yb_stats
Remove the yb_stats "brew tap":
brew untap fritshoogland-yugabyte/yb_stats
RPM based distributions
Install the provided yb_stats RPM via yum:
EL7:
sudo yum install https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.9.8/yb_stats-0.9.8-el.7.x86_64.rpm
EL8:
sudo yum install https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.9.8/yb_stats-0.9.8-el.8.x86_64.rpm
EL9:
sudo yum install https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.9.8/yb_stats-0.9.8-el.9.x86_64.rpm
After yum install, yb_stats is available in /usr/local/bin, which should normally be in $PATH.
These are the latest versions at the time of writing. Look at the yb_stats github repository releases page for newer versions.
Mac OSX compile from source
Install Rust via rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Clone the yb_stats repository:
git clone https://github.com/fritshoogland-yugabyte/yb_stats.git
Build yb_stats:
cd yb_stats
cargo build --release
The yb_stats executable is available in the ./target/release/ directory after successful compilation.
Linux compile from source
Install dependencies via yum:
sudo yum install -y git openssl-devel gcc
Install Rust via rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Clone the yb_stats repository:
git clone https://github.com/fritshoogland-yugabyte/yb_stats.git
Build yb_stats:
cd yb_stats
cargo build --release
The yb_stats executable is available in the ./target/release/ directory after successful compilation.
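Optionally, copy the executable to a directory in $PATH; a minimal sketch (the target path is illustrative, chosen to match the packaged installs):
sudo cp ./target/release/yb_stats /usr/local/bin/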
Upgrade
Warning
Be aware that when upgrading from a version before 0.9, the snapshot format has changed:
- Before yb_stats version 0.9, the snapshot data is stored as CSV.
- Starting from yb_stats version 0.9, the snapshot data is stored as JSON.
Mac OSX via homebrew
This will upgrade the current installed version to the latest available tap version.
brew upgrade yb_stats
RPM based distributions
EL7:
sudo yum upgrade https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.8.8/yb_stats-0.8.8-1.el7.x86_64.rpm
EL8:
sudo yum upgrade https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.8.8/yb_stats-0.8.8-1.el8.x86_64.rpm
These are the latest versions at the time of writing. Look at the yb_stats github repository releases page for the current versions.
Running yb_stats
Once yb_stats is installed, it can be used to get metadata from a YugabyteDB cluster and to create snapshots.
Before yb_stats can be used to create snapshots or perform ad-hoc operations, in other words query data from a YugabyteDB cluster, the addresses of the endpoints must be specified using the --hosts switch, and optionally the --ports switch if ports have been changed (see: Specifying hosts and Specifying ports), or the .env file should exist in the current working directory.
If yb_stats is run for obtaining data, which means running in ad-hoc mode, ad-hoc print mode or snapshot mode, it must be able to access the YugabyteDB cluster nodes, and be allowed to access the different ports.
If yb_stats is run for investigating snapshots using print mode or snapshot-diff mode, there is no need to access the cluster ip addresses or ports: it will use the locally stored snapshot data only.
For a more comprehensive view of the objects in the cluster, as well as the tablets, the --extra-data switch can be used. This switch obtains detailed data about objects as well as tablets, at the cost of performing more work, and thus taking more time. See the --extra-data switch.
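For example, to create a snapshot including the detailed object and tablet data:
yb_stats --snapshot --extra-data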
Specifying hosts
When yb_stats is used to run in ad-hoc mode, ad-hoc print mode or in snapshot mode, the YugabyteDB cluster nodes must be reachable, and yb_stats must be configured with the list of nodes. This is done via the --hosts switch, which requires a comma separated list of ip addresses or hostnames.
The default hosts for yb_stats are 192.168.66.80,192.168.66.81,192.168.66.82. This is unlikely to match your environment.
Please mind there should be no spaces between the hostnames or ip addresses and the commas.
Example:
yb_stats --hosts 10.1.0.1,10.1.0.2,10.1.0.3
The set hosts (as well as ports and parallelism) are stored in the file .env in the current working directory.
This means that if you have different YugabyteDB clusters you want to use, you either:
- Need to specify the hosts on every usage, which will still create a .env file, which is then not used because of the explicit specification.
- Use a different directory for each cluster.
Using a different directory is highly recommended, especially because then the snapshots in that directory will only be about the specified cluster.
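A minimal sketch of the directory-per-cluster approach (directory names and addresses are illustrative):
mkdir cluster-a && cd cluster-a
yb_stats --hosts 10.1.0.1,10.1.0.2,10.1.0.3 --snapshot
cd .. && mkdir cluster-b && cd cluster-b
yb_stats --hosts 10.2.0.1,10.2.0.2,10.2.0.3 --snapshot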
Specifying ports
When yb_stats is used to run in ad-hoc mode, ad-hoc print mode or in snapshot mode, the YugabyteDB cluster nodes must be reachable, and yb_stats must be configured with the list of nodes. However, it must also connect to the correct port numbers.
In most cases, the port numbers are kept at their defaults. In that case there is no need to specify ports, and the default list will be okay. This means that unless one or more default ports have been changed, there is no need to specify ports.
Another reason to specify ports is when, for example, node-exporter is not installed: in such a case the list can be specified excluding the node-exporter port YugabyteDB uses (9300), although this is not required.
Please mind that YugabyteDB runs node-exporter on port 9300 instead of the node-exporter default port, because the default port conflicts with a port already used by the YugabyteDB daemons.
Example:
yb_stats --ports 7000,9000,12000,13000
As an example, this list excludes the 9300 port for node-exporter.
The set ports (as well as hosts and parallelism) are stored in the file .env in the current working directory. This means that if you have different YugabyteDB clusters you want to use, you either:
- Need to specify the ports on every usage, which will still create a .env file, which is then not used because of the explicit specification.
- Use a different directory for each cluster.
Using a different directory is highly recommended, especially because then the snapshots in that directory will only be about the specified cluster.
Specifying parallel
The amount of parallelism can be set using the --parallel switch. By default, yb_stats uses a single thread for performing its work. The work is parallelized in two places: per type of datasource, and per hostname:port combination fetched for a specific type of data.
There is no parallelism for reading the data from the snapshots and presenting it.
Please mind that the above description means parallelism is applied at two different points. This means that the parallelism value should be chosen with care, to not make the total parallelism too high.
This is how that works; threads are used for:
- Snapshot: when a snapshot is created, each different datasource is executed by an individual thread in parallel.
- Requesting a hostname:port combination in a datasource thread: the thread for each datasource will scan each hostname:port combination using a thread.
Step 1 is always done in parallel, and limited by the number of concurrent active parallel threads the executable is allowed to create. Setting parallelism to 3 will start 3 threads for each datasource, performing requests to 3 hostname:port combinations in each of these threads. The reason for this combination is that most of the time in requesting data is spent idle, waiting for a response.
On MacOS, setting parallel to a number that is too high can throw errors on:
- The number of files/file descriptors (tcp connect error: Too many open files (os error 24)).
- The number of threads (Err value: ThreadPoolBuildError).
In both cases this can be solved by lowering the value for parallel. For the Too many open files error, another solution can be to increase the OS/user limit for the number of open files.
MacOS: https://gist.github.com/tombigel/d503800a282fcadbee14b537735d202c
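A minimal sketch of taking a snapshot with increased parallelism, after raising the open files limit for the current shell (the values are illustrative):
ulimit -n 10240
yb_stats --snapshot --parallel 3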
.env
Whenever the yb_stats executable is run with --hosts, --ports or --parallel specified, it will use the value of the flag(s) set, but it will also write the specification of hosts, ports or parallelism into a file called .env.
When yb_stats is executed with a .env file in the current working directory, it will set the environment variables listed in that file. If any of:
- YBSTATS_HOSTS
- YBSTATS_PORTS
- YBSTATS_PARALLEL
are set in the .env file, it will use the set values. This way, specifying any or all of the hosts list, ports list or parallelism only needs to be done once, and will be set automatically for every next invocation executed from the same directory.
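As an illustration, assuming the common KEY=value dotenv format, a .env file written by yb_stats might look like this (the values are examples):
YBSTATS_HOSTS=192.168.66.80,192.168.66.81,192.168.66.82
YBSTATS_PORTS=7000,9000,12000,13000,9300
YBSTATS_PARALLEL=1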
If a flag is specified with yb_stats that is also set in .env, the flag specified on the command line takes precedence, and the entry in .env will be overwritten with the new value.
WARNING
Please mind that if something outside of yb_stats uses .env for its own purposes and sets its own values in that file, running yb_stats with that .env file in the current working directory will overwrite the file and only set the yb_stats values, removing any settings not used by yb_stats.
yb_stats snapshot results
Directory yb_stats.snapshots
When yb_stats is run with the --snapshot mode switch, it will try to find a directory in the current working directory by the name of yb_stats.snapshots.
If it can't find the directory, it will try to create it. If opening an existing directory or creating a non-existing one fails, execution stops with an error indicating the issue.
The yb_stats.snapshots/snapshot.index file
Inside the yb_stats.snapshots directory the first snapshot will create a CSV file called snapshot.index. This file lists the snapshots taken, comma separated, with the following fields:
- snapshot number
- snapshot timestamp
- snapshot comments
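As an illustration, a snapshot.index file might look like this (the values are examples, with an empty comment for snapshot 0):
0,2022-10-17 19:50:58.048195 +02:00,
1,2022-10-17 19:52:34.413494 +02:00,second snap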
Using this file, yb_stats can determine the snapshot numbers, and add one to the highest number for a new snapshot.
Another use of this file is to obtain the first snapshot timestamp for use with snapshot diffs: if a second snapshot contains a metric that is not present in the first snapshot, it can safely assume the first snapshot value of that metric to be 0. However, it must make a guess about the approximate first snapshot time, which is where the timestamp is used.
A last use of this file is when yb_stats is run with the --snapshot-list switch: yb_stats will list this information, and quit.
Example
This is how a typical snapshot is performed:
% yb_stats --snapshot
snapshot number 0
Error
If the current working directory is not writable, it will provide the following error:
[2022-11-30T10:59:18Z ERROR yb_stats::snapshot] Fatal: error creating directory /Users/fritshoogland/Downloads/t/yb_stats.snapshots: Permission denied (os error 13)
yb_stats --snapshot --extra-data switch
When the --extra-data switch is added to the --snapshot switch, yb_stats will get detailed data for each "user table" (actual table or materialized view), "index table" (index) and "system table" (PostgreSQL catalog table), as well as for every tablet.
This requires more work to be done by yb_stats, which is why this is a separate switch. It will request more data from all masters and tablet servers, proportional to the number of tables and the number of tablets.
Output modes
The output that yb_stats provides can roughly be divided into 3 categories:
- Information that is obtained from stored snapshots, which is formatted for readability.
- Information that is obtained from the live cluster, which is formatted for readability.
- Raw output provided in a text file, which is obtained at snapshot time. This is not printed, the resulting file can be used.
Snapshot usage
The locally stored snapshots must be available in the following way:
- The directory yb_stats.snapshots must be present and accessible in the current working directory.
- The directory yb_stats.snapshots must contain a file called snapshot.index.
- The snapshot.index file must list all the available snapshots.
- The snapshots listed in snapshot.index must be available as directories with the snapshot number as directory name inside the yb_stats.snapshots directory.
- Inside each snapshot directory, the files containing the JSON data that make up the snapshot must be present.
All the files in all the snapshots are UTF8 JSON files, and therefore can be easily transported.
This means you don't have to use the same computer to view the snapshot output: if the yb_stats.snapshots directory and its contents are zipped, they can be copied and shared.
This means you can unzip a snapshots archive and investigate it on your own computer, without needing access to the cluster where the snapshots came from.
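A minimal sketch of transporting snapshots (the archive name is illustrative):
zip -r snapshots.zip yb_stats.snapshots
Then, on another machine, unzip the archive and run the diff from the directory containing yb_stats.snapshots:
unzip snapshots.zip
yb_stats --snapshot-diff -b 0 -e 1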
Security restrictions
Because the snapshots are UTF8 JSON files, the files can be inspected by security officers for the existence of security issues or secret data.
yb_stats only stores cluster metadata, not actual table data. However, yb_stats does store (part of) the logfiles, so these can potentially contain actual data.
Filters
Filters can be used when showing data, and are never used for creating snapshots.
To be more concrete, filters can be used with the following usage:
- No switches/ad-hoc mode.
- Snapshot-diff mode.
- Print modes, depending on the print topic.
Filters use regular expressions for optimal flexibility.
--hostname-match
Most yb_stats query options allow filtering on hostname. This is done with the --hostname-match switch. The hostname match switch uses a regex to filter entries based on the hostname:port specification.
A very simple use of --hostname-match is to filter on the port number, for example to select only the tablet servers:
yb_stats --hostname-match 9000
Or select only the tablet servers and master servers, thereby filtering out the node-exporter, YEDIS, YCQL and YSQL output:
yb_stats --hostname-match '(7000|9000)'
Because the filter is based on regex pattern matching, it is also easy to specify, for example, a master (port 7000) on the host with host number 82 in a class C network (class C means for example 192.168.66/24, so one octet remains: '82'), which means you can filter in this way:
yb_stats --hostname-match '82:7000'
--stat-name-match
It is possible to filter on the statistic names of the performance metrics in an in-memory (ad-hoc) or snapshot diff, as well as on the gflags names and the mem-trackers id values. The switch for this is --stat-name-match. This is a regex filter.
For example if you are only interested in master and tserver cpu statistics:
yb_stats --stat-name-match 'cpu_.time'
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 0.900 seconds
192.168.66.80:12000 server cpu_stime 15 ms 16.502 /s
192.168.66.80:7000 server cpu_stime 2 ms 2.200 /s
192.168.66.80:7000 server cpu_utime 8 ms 8.801 /s
192.168.66.80:9000 server cpu_stime 14 ms 15.402 /s
192.168.66.81:12000 server cpu_stime 6 ms 6.608 /s
192.168.66.81:12000 server cpu_utime 9 ms 9.912 /s
192.168.66.81:7000 server cpu_stime 8 ms 8.801 /s
192.168.66.81:9000 server cpu_stime 7 ms 7.709 /s
192.168.66.81:9000 server cpu_utime 9 ms 9.912 /s
192.168.66.82:12000 server cpu_stime 6 ms 6.645 /s
192.168.66.82:12000 server cpu_utime 10 ms 11.074 /s
192.168.66.82:7000 server cpu_utime 8 ms 8.840 /s
192.168.66.82:9000 server cpu_stime 6 ms 6.637 /s
192.168.66.82:9000 server cpu_utime 11 ms 12.168 /s
Or a more sophisticated filter to look at tserver and master cpu time, as well as voluntary and involuntary context switches:
yb_stats --stat-name-match '(cpu_.time|voluntary_context_switches)'
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 0.915 seconds
192.168.66.80:12000 server cpu_utime 12 ms 13.086 /s
192.168.66.80:12000 server voluntary_context_switches 234 csws 255.180 /s
192.168.66.80:7000 server cpu_stime 2 ms 2.181 /s
192.168.66.80:7000 server cpu_utime 9 ms 9.815 /s
192.168.66.80:7000 server voluntary_context_switches 105 csws 114.504 /s
192.168.66.80:9000 server cpu_utime 13 ms 14.177 /s
192.168.66.80:9000 server voluntary_context_switches 235 csws 256.270 /s
192.168.66.81:12000 server cpu_stime 6 ms 6.550 /s
192.168.66.81:12000 server cpu_utime 8 ms 8.734 /s
192.168.66.81:12000 server voluntary_context_switches 262 csws 286.026 /s
192.168.66.81:7000 server cpu_utime 7 ms 7.625 /s
192.168.66.81:7000 server voluntary_context_switches 70 csws 76.253 /s
192.168.66.81:9000 server cpu_stime 6 ms 6.543 /s
192.168.66.81:9000 server cpu_utime 8 ms 8.724 /s
192.168.66.81:9000 server voluntary_context_switches 263 csws 286.805 /s
192.168.66.82:12000 server cpu_stime 6 ms 6.565 /s
192.168.66.82:12000 server cpu_utime 8 ms 8.753 /s
192.168.66.82:12000 server voluntary_context_switches 272 csws 297.593 /s
192.168.66.82:7000 server cpu_utime 8 ms 8.762 /s
192.168.66.82:7000 server voluntary_context_switches 65 csws 71.194 /s
192.168.66.82:9000 server cpu_stime 6 ms 6.565 /s
192.168.66.82:9000 server cpu_utime 8 ms 8.753 /s
192.168.66.82:9000 server voluntary_context_switches 269 csws 294.311 /s
--table-name-match
yb_stats by default sums up per-table and per-tablet statistics per hostname:port combination, to sensibly reduce the amount of output. However, yb_stats can be set to display per-table and per-tablet statistics individually using the --details-enable switch.
If the --details-enable switch is set, the table name is stored with the statistics for both the table and tablet data. The --table-name-match switch allows you to filter on the table name, to look at the statistics of only the tables of interest.
Additionally, the --table-name-match switch can also be used for printing the details of the entities data, which also are tables.
Example: filter for the sys.catalog (postgres catalog) entries in the master. Please mind the --hostname-match option is also used, because otherwise the node-exporter output would still be shown: that data is not filtered by --table-name-match.
yb_stats --details-enable --table-name-match catalog --hostname-match 7000
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 1.621 seconds
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_bytes_read 4095007 bytes 2545063.393 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_data_hit 97 blocks 60.286 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_hit 172 blocks 106.899 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_index_hit 75 blocks 46.613 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_single_touch_bytes_read 4095007 bytes 2545063.393 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_single_touch_hit 172 blocks 106.899 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_db_iter_bytes_read 1353537 bytes 841228.713 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_no_table_cache_iterators 48 iters 29.832 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_next 2180 keys 1354.879 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_next_found 2225 keys 1382.846 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_seek 131 keys 81.417 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_seek_found 122 keys 75.823 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_superversion_acquires 4 nr 2.486 /s
Data enrichment
Despite yb_stats providing a lot of output, it still tries to reduce the total amount of data.
There are two options for increasing the amount of data: adding gauge data, and showing the original, non-summarized data for table, tablet and cdc (change data capture) statistics.
--gauges-enable
The default output for metrics shows the values of metrics of the type counter. Counters are ever increasing values. In most cases, a counter value by itself is not that useful, but the difference between two points in time is. That is the reason metrics require two snapshots: to have the ability to calculate the difference, and thus understand how much a counter has changed.
But there are also metrics that show absolute values, which are not ever increasing, but instead show the current situation. This means that the difference between two points in time does not have the same meaning as for a counter, although it can still be important. Such a value is called a gauge.
By default, yb_stats does NOT show gauge values. To make yb_stats show gauge values in addition to the counters, use the --gauges-enable switch.
This is yb_stats in ad-hoc mode (not showing gauges):
yb_stats
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 1.019 seconds
192.168.66.80:12000 server cpu_stime 2 ms 2.200 /s
192.168.66.80:12000 server cpu_utime 13 ms 14.301 /s
192.168.66.80:12000 server server_uptime_ms 910 ms 1001.100 /s
192.168.66.80:12000 server voluntary_context_switches 233 csws 256.326 /s
This is yb_stats in ad-hoc mode showing gauges:
yb_stats --gauges-enable
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
Time between snapshots: 1.053 seconds
192.168.66.80:12000 server cpu_stime 2 ms 1.927 /s
192.168.66.80:12000 server cpu_utime 13 ms 12.524 /s
192.168.66.80:12000 server generic_current_allocated_bytes 26914872 bytes +1120
192.168.66.80:12000 server generic_heap_size 43188224 bytes +0
192.168.66.80:12000 server hybrid_clock_error 500000 us +0
192.168.66.80:12000 server hybrid_clock_hybrid_time 6824678428388270080 us +4253745152
192.168.66.80:12000 server server_uptime_ms 1038 ms 1000.000 /s
192.168.66.80:12000 server tcmalloc_current_total_thread_cache_bytes 2560384 bytes +137976
192.168.66.80:12000 server tcmalloc_max_total_thread_cache_bytes 33554432 bytes +0
192.168.66.80:12000 server tcmalloc_pageheap_free_bytes 3579904 bytes -90112
192.168.66.80:12000 server tcmalloc_pageheap_unmapped_bytes 7847936 bytes -8192
192.168.66.80:12000 server threads_running 45 threads +0
192.168.66.80:12000 server threads_running_CQLServer_reactor 1 threads +0
192.168.66.80:12000 server threads_running_acceptor 1 threads +0
192.168.66.80:12000 server threads_running_iotp_CQLServer 4 threads +0
192.168.66.80:12000 server threads_running_rpc_thread_pool 15 threads +0
192.168.66.80:12000 server voluntary_context_switches 294 csws 283.237 /s
The gauge values can be spotted because they do not end with '/s': they do not show a per-second value. Instead, the first value is the value of the END snapshot, and the second value is the difference from the FIRST snapshot.
--details-enable
By default, statistics of the metric_types table, tablet and cdc are summed by metric_type and statistic name. This is done to reduce the amount of output as sensibly as possible without losing accuracy.
However, sometimes it is necessary to split up the statistics per metric_type to understand metrics about a specific table, tablet or cdc object. This can be done using the --details-enable switch. This switch introduces a few more columns in the output to detail the table, tablet or cdc metric type statistics.
This is what the regular output looks like:
yb_stats --snapshot-diff -b 0 -e 1 --hostname-match 82:7000 --stat-name-match rocksdb
192.168.66.81:7000 tablet rocksdb_block_cache_bytes_read 11760999 bytes 865351.998 /s
192.168.66.81:7000 tablet rocksdb_block_cache_data_hit 255 blocks 18.762 /s
192.168.66.81:7000 tablet rocksdb_block_cache_hit 548 blocks 40.321 /s
192.168.66.81:7000 tablet rocksdb_block_cache_index_hit 293 blocks 21.558 /s
192.168.66.81:7000 tablet rocksdb_block_cache_single_touch_bytes_read 11760999 bytes 865351.998 /s
192.168.66.81:7000 tablet rocksdb_block_cache_single_touch_hit 548 blocks 40.321 /s
192.168.66.81:7000 tablet rocksdb_db_iter_bytes_read 1695826 bytes 124775.660 /s
192.168.66.81:7000 tablet rocksdb_no_table_cache_iterators 193 iters 14.201 /s
192.168.66.81:7000 tablet rocksdb_number_db_next 2783 keys 204.768 /s
192.168.66.81:7000 tablet rocksdb_number_db_next_found 2783 keys 204.768 /s
192.168.66.81:7000 tablet rocksdb_number_db_seek 268 keys 19.719 /s
192.168.66.81:7000 tablet rocksdb_number_db_seek_found 257 keys 18.910 /s
192.168.66.81:7000 tablet rocksdb_number_superversion_acquires 3 nr 0.221 /s
This is filtered down to statistics that occur on the tablet metric type.
This is what that output looks like when --details-enable is added:
yb_stats --snapshot-diff -b 0 -e 1 --hostname-match 82:7000 --stat-name-match rocksdb --details-enable
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_bytes_read 11760999 bytes 865351.998 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_data_hit 255 blocks 18.762 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_hit 548 blocks 40.321 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_index_hit 293 blocks 21.558 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_single_touch_bytes_read 11760999 bytes 865351.998 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_block_cache_single_touch_hit 548 blocks 40.321 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_db_iter_bytes_read 1695826 bytes 124775.660 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_no_table_cache_iterators 193 iters 14.201 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_next 2783 keys 204.768 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_next_found 2783 keys 204.768 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_seek 268 keys 19.719 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_db_seek_found 257 keys 18.910 /s
192.168.66.81:7000 tablet 00000000000000000000000000000000 sys.catalog rocksdb_number_superversion_acquires 3 nr 0.221 /s
In this case, the statistics were generated by a single tablet. However if multiple tablets were involved, the statistics for all of these would be shown.
The --details-enable switch introduces a couple of extra columns:
- A column for the UUID or number of the metric object
- A column that shows the namespace and the object name.
yb_stats ad-hoc results
When yb_stats is run without any switch, or with only the data or filter switches, it will perform a snapshot in memory, and wait for enter to perform the next snapshot and present the differences.
This is called 'ad-hoc mode'.
This will not only show the difference for performance based statistics (the metric counters and optionally gauges), but also any change in the cluster, such as:
- The addition or removal of tablet servers or masters.
- Restarts of tablet servers or masters.
- The creation or removal of database objects (tables, indexes, materialized views, databases/keyspaces).
- The change of any gflags of the tablet servers or masters.
- Any change for a replica, notably the LEADER or FOLLOWER state.
- Role changes for the masters.
The usage of either ad-hoc mode or snapshot mode should be carefully considered. Ad-hoc mode, alias in-memory snapshots, does not write anything. In most cases, performing snapshots that persist all available information is the best way, so results can be reviewed later and cannot get lost, because they are stored. However, if you are performing repeated tests where storing all snapshot information would simply be too much and would require you to remove all the snapshots after testing anyway, AND you are sure what to look for, then ad-hoc mode can be used.
Example
This is what the first snapshot looks like in ad-hoc mode:
% yb_stats
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
After the snapshot is created, you can perform the task under investigation. Once that is done, press enter:
Time between snapshots: 70.166 seconds
192.168.66.80:12000 server cpu_stime 874 ms 12.458 /s
192.168.66.80:12000 server cpu_utime 7 ms 0.100 /s
...etcetera
snapshot-diff mode
The purpose of snapshot-diff mode is to read two locally stored snapshots, and show a difference report.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by statistic name regular expression.
- --table-name-match: filter by table name regular expression (requires --details-enable to split table and tablet statistics out).
- --details-enable: split table and tablet statistics, instead of summarizing them per server.
- --gauges-enable: add non-counter statistics to the output.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
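As an illustration, these switches can be combined; for example, a snapshot-diff restricted to rocksdb statistics on the master http port (the regexes are illustrative):
yb_stats --snapshot-diff -b 0 -e 1 --hostname-match 7000 --stat-name-match rocksdb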
snapshot-diff mode means using already stored snapshots, which can be from a cluster that is currently unavailable or even deleted, because snapshot-diff mode only uses the information stored in the locally available snapshot (JSON) data. This gives a lot of options for investigation that would otherwise be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke snapshot-diff mode is to use the --snapshot-diff switch.
If --snapshot-diff is used without -b/--begin and -e/--end, yb_stats will list the available snapshots and prompt for the begin and end snapshot numbers.
snapshot-diff without begin/end specification:
yb_stats --snapshot-diff
0 2022-10-17 19:50:58.048195 +02:00
1 2022-10-17 19:52:34.413494 +02:00 second snap
2 2022-10-18 15:26:20.061213 +02:00
Enter begin snapshot: 0
Enter end snapshot: 1
192.168.66.80:12000 server cpu_stime 654 ms 6.792 /s
192.168.66.80:12000 server cpu_utime 311 ms 3.230 /s
192.168.66.80:12000 server involuntary_context_switches 1 csws 0.010 /s
192.168.66.80:12000 server server_uptime_ms 96292 ms 1000.000 /s
192.168.66.80:12000 server threads_started 4 threads 0.042 /s
192.168.66.80:12000 server threads_started_thread_pool 4 threads 0.042 /s
192.168.66.80:12000 server voluntary_context_switches 21821 csws 226.613 /s
snapshot-diff with begin/end specification:
yb_stats --snapshot-diff -b 0 -e 1
192.168.66.80:12000 server cpu_stime 654 ms 6.792 /s
192.168.66.80:12000 server cpu_utime 311 ms 3.230 /s
192.168.66.80:12000 server involuntary_context_switches 1 csws 0.010 /s
192.168.66.80:12000 server server_uptime_ms 96292 ms 1000.000 /s
192.168.66.80:12000 server threads_started 4 threads 0.042 /s
192.168.66.80:12000 server threads_started_thread_pool 4 threads 0.042 /s
192.168.66.80:12000 server voluntary_context_switches 21821 csws 226.613 /s
...etc...
The --snapshot-diff switch shows the differences for all of the following data points:
- Metrics
- (YSQL) statements
- Node-exporter
- Versions (master and tablet server software versions)
- Entities (YSQL and YCQL objects (tables, indexes and materialized views), databases/keyspaces, tablets and replicas)
- Master status
- Tablet server status
- Vars (gflags)
- Health check (from the master)
metrics-diff mode
The purpose of metrics-diff mode is to read two locally stored snapshots, and show a difference report for the metrics data only.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by statistic name regular expression.
- --table-name-match: filter by table name regular expression (requires --details-enable to split table and tablet statistics out).
- --details-enable: split table and tablet statistics, instead of summarizing them per server.
- --gauges-enable: add non-counter statistics to the output.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
metrics-diff mode means using already stored snapshots, which can be from a cluster that is currently unavailable or even deleted, because metrics-diff mode only uses the information stored in the locally available snapshot (JSON) data. This gives a lot of options for investigation that would otherwise be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke metrics-diff mode is to use the --metrics-diff switch.
If --metrics-diff is used without -b/--begin and -e/--end, yb_stats will list the available snapshots and prompt for the begin and end snapshot numbers.
metrics-diff without begin/end specification:
yb_stats --metrics-diff
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
3 2023-03-19 14:20:50.977417 +01:00
4 2023-03-19 14:21:14.418544 +01:00
5 2023-03-19 14:24:17.733927 +01:00
Enter begin snapshot: 4
Enter end snapshot: 5
192.168.66.80:12000 server cpu_stime 2669 ms 14.531 /s
192.168.66.80:12000 server cpu_utime 80 ms 0.436 /s
192.168.66.80:12000 server involuntary_context_switches 36 csws 0.196 /s
192.168.66.80:12000 server server_uptime_ms 183203 ms 997.441 /s
192.168.66.80:12000 server spinlock_contention_time 1018673 us 5546.123 /s
192.168.66.80:12000 server threads_started 12 threads 0.065 /s
192.168.66.80:12000 server threads_started_thread_pool 12 threads 0.065 /s
192.168.66.80:12000 server voluntary_context_switches 52096 csws 283.635 /s
...lots more output
metrics-diff with begin/end specification:
yb_stats --metrics-diff -b 4 -e 5
192.168.66.80:12000 server cpu_stime 2669 ms 14.531 /s
192.168.66.80:12000 server cpu_utime 80 ms 0.436 /s
192.168.66.80:12000 server involuntary_context_switches 36 csws 0.196 /s
192.168.66.80:12000 server server_uptime_ms 183203 ms 997.441 /s
192.168.66.80:12000 server spinlock_contention_time 1018673 us 5546.123 /s
192.168.66.80:12000 server threads_started 12 threads 0.065 /s
192.168.66.80:12000 server threads_started_thread_pool 12 threads 0.065 /s
192.168.66.80:12000 server voluntary_context_switches 52096 csws 283.635 /s
...lots more output
Metrics
The metrics related output always uses two snapshots.
- Ad-hoc mode performs the snapshots live and stores the results in memory, and doesn't write any file.
- Snapshot-diff mode takes local available previously taken snapshots.
Value statistics
In the output generated by ad-hoc or snapshot-diff mode, the first group of statistics shown are value statistics. The captured statistics are essentially a statistic name, and a statistic value. The values to be displayed are ordered by hostname, metric_type and statistic name.
By default, counters are shown, for which the value is the difference between the end and begin values.
- If a counter is zero during both the begin and end snapshot, the statistic is skipped.
- If a counter is non-zero in the end snapshot, and the statistic does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If a counter is non-zero in the begin snapshot, and does not exist in the end snapshot, the statistic is skipped.
- If a counter is non-zero in the begin and end snapshots, but subtracting leads to zero, the statistic is not printed: supposedly nothing happened, although previously something happened.
- If a counter is non-zero in the begin and end snapshots, but the end value is lower than the begin value: this is a suspicious situation. Currently the resulting negative value is shown.
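As a worked example (the numbers are illustrative): if cpu_stime is 1000 ms in the begin snapshot and 1005 ms in the end snapshot, with 0.9 seconds between the snapshots, the shown value is 1005 - 1000 = 5 ms, and the shown rate is 5 / 0.9 ≈ 5.556 /s.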
counters
This is what value statistic output looks like:
192.168.66.80:12000 server cpu_stime 5 ms 6.188 /s
192.168.66.80:12000 server cpu_utime 9 ms 11.139 /s
192.168.66.80:12000 server server_uptime_ms 807 ms 998.762 /s
192.168.66.80:12000 server voluntary_context_switches 217 csws 268.564 /s
Explanation:
hostname:port | metric_type | statistic name | value | unit | value / snapshot time (s) |
---|---|---|---|---|---|
192.168.66.80:12000 | server | cpu_stime | 5 | ms | 6.188 /s |
192.168.66.80:12000 | server | cpu_utime | 9 | ms | 11.139 /s |
192.168.66.80:12000 | server | server_uptime_ms | 807 | ms | 998.762 /s |
192.168.66.80:12000 | server | voluntary_context_switches | 217 | csws | 268.564 /s |
gauges
If the --gauges-enable switch is used, gauge values are shown alongside counter values.
A gauge value is a value that can get higher and lower during its runtime.
Therefore, the end value is shown, together with the difference from the begin snapshot value, prefixed with plus or minus.
- If a gauge is zero during both the begin and end snapshot, the statistic is skipped.
- If a gauge is non-zero in the end snapshot, and the statistic does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If a gauge is non-zero in the begin snapshot, and does not exist in the end snapshot, the statistic is skipped.
- If a gauge is non-zero in the begin and end snapshots, and subtracting leads to zero, the value is printed(!).
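As a worked example (the numbers are illustrative): if generic_current_allocated_bytes is 26882536 bytes in the begin snapshot and 26908008 bytes in the end snapshot, the output shows the end value 26908008 bytes and the difference 26908008 - 26882536 = +25472.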
This is what that looks like:
192.168.66.80:12000 server cpu_stime 10 ms 10.893 /s
192.168.66.80:12000 server cpu_utime 5 ms 5.447 /s
192.168.66.80:12000 server generic_current_allocated_bytes 26908008 bytes +25472
192.168.66.80:12000 server generic_heap_size 43188224 bytes +0
192.168.66.80:12000 server hybrid_clock_error 500000 us +0
192.168.66.80:12000 server hybrid_clock_hybrid_time 6824687165143556096 us +3762429952
192.168.66.80:12000 server server_uptime_ms 918 ms 1000.000 /s
192.168.66.80:12000 server tcmalloc_current_total_thread_cache_bytes 2675304 bytes +124184
192.168.66.80:12000 server tcmalloc_max_total_thread_cache_bytes 33554432 bytes +0
192.168.66.80:12000 server tcmalloc_pageheap_free_bytes 1228800 bytes -90112
192.168.66.80:12000 server tcmalloc_pageheap_unmapped_bytes 9977856 bytes +0
192.168.66.80:12000 server threads_running 47 threads +0
192.168.66.80:12000 server threads_running_CQLServer_reactor 1 threads +0
192.168.66.80:12000 server threads_running_acceptor 1 threads +0
192.168.66.80:12000 server threads_running_iotp_CQLServer 4 threads +0
192.168.66.80:12000 server threads_running_rpc_thread_pool 15 threads +0
192.168.66.80:12000 server voluntary_context_switches 262 csws 285.403 /s
These are gauge values:
192.168.66.80:12000 server generic_current_allocated_bytes 26908008 bytes +25472
192.168.66.80:12000 server generic_heap_size 43188224 bytes +0
192.168.66.80:12000 server hybrid_clock_error 500000 us +0
Explanation:
hostname:port | metric_type | statistic name | value | unit | end value - begin value |
---|---|---|---|---|---|
192.168.66.80:12000 | server | generic_current_allocated_bytes | 26908008 | bytes | +25472 |
192.168.66.80:12000 | server | generic_heap_size | 43188224 | bytes | +0 |
192.168.66.80:12000 | server | hybrid_clock_error | 500000 | us | +0 |
details
For the metric_types table, tablet and cdc, the statistics are kept per table, tablet or cdc object.
To reduce the amount of data shown, these are by default summed together per server.
If the --details-enable switch is used, the output changes to include metric_id, namespace and object_name.
This allows seeing the statistics per individual object.
This is what that looks like:
192.168.66.80:9000 server - - - tcp_bytes_received 75765 bytes 4293.608 /s
192.168.66.80:9000 server - - - tcp_bytes_sent 80901 bytes 4584.665 /s
192.168.66.80:9000 server - - - threads_started 5 threads 0.283 /s
192.168.66.80:9000 server - - - transaction_pool_cache_queries 1 qry 0.057 /s
192.168.66.80:9000 server - - - voluntary_context_switches 6232 csws 353.168 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t log_bytes_logged 403 bytes 22.838 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t rocksdb_bytes_written 12 bytes 0.677 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t rocksdb_sequence_number 2 rows 0.113 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t rocksdb_write_self 1 writes 0.056 /s
192.168.66.80:9000 tablet d3265ac130b2b1f yugabyte t rows_inserted 2 rows 0.113 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 log_bytes_logged 389 bytes 22.045 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_bytes_written 12 bytes 0.677 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_number_db_seek 1 keys 0.057 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_number_superversion_acquires 1 nr 0.057 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_sequence_number 1 rows 0.057 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rocksdb_write_self 1 writes 0.056 /s
192.168.66.80:9000 tablet 97122de784c10a3 yugabyte i_t_f1 rows_inserted 1 rows 0.056 /s
192.168.66.80:9000 tablet 654e97ca348833d system transactions log_bytes_logged 1199 bytes 67.947 /s
192.168.66.80:9000 tablet 04d3aadfcc0c75e system transactions log_bytes_logged 395 bytes 22.385 /s
Explanation:
hostname:port | metric_type | object_id | namespace | object name | statistic name | value | unit | value / snapshot time (s) |
---|---|---|---|---|---|---|---|---|
192.168.66.80:9000 | server | - | - | - | tcp_bytes_received | 75765 | bytes | 4293.608 /s |
192.168.66.80:9000 | tablet | d3265ac130b2b1f | yugabyte | t | log_bytes_logged | 403 | bytes | 22.838 /s |
192.168.66.80:9000 | tablet | 654e97ca348833d | system | transactions | log_bytes_logged | 1199 | bytes | 67.947 /s |
The columns added are the third, fourth and fifth columns.
- The third column shows the metric_id, which for a tablet is the tablet UUID, for a table is the table_id, and for cdc is ?. The snapshot stores the full metric_id; the length shown is limited to 15 characters.
- The fourth column shows the namespace.
- The fifth column shows the object_name (it says 'table_name' in the attributes on the metric page, but an object can be an index or materialized view too).
- A 'server' metric_type does not carry a meaningful value in 'metric_id', and the namespace and object name are not present. Therefore, for server a '-' is shown.
details and gauges
The switches --details-enable and --gauges-enable work individually, but do influence each other.
This means that when --gauges-enable is set, --details-enable will also show gauge data per table, tablet or cdc object:
192.168.66.80:9000 server - - - tcp_bytes_received 4694 bytes 3684.458 /s
192.168.66.80:9000 server - - - tcp_bytes_sent 4101 bytes 3218.995 /s
192.168.66.80:9000 server - - - threads_running 46 threads +0
192.168.66.80:9000 server - - - ts_split_compaction_added 15 reqs +0
192.168.66.80:9000 server - - - voluntary_context_switches 413 csws 324.176 /s
192.168.66.80:9000 tablet a06ff106f2b846d system transactions follower_lag_ms 97 ms -722
192.168.66.80:9000 tablet a06ff106f2b846d system transactions in_progress_ops 1 ops +0
192.168.66.80:9000 tablet a06ff106f2b846d system transactions log_wal_size 1048576 bytes +0
192.168.66.80:9000 tablet a06ff106f2b846d system transactions raft_term 9 terms +0
192.168.66.80:9000 tablet cf45509727f9601 system transactions follower_lag_ms 281 ms -339
hostname:port | metric_type | object_id | namespace | object name | statistic name | value | unit | end value - begin value |
---|---|---|---|---|---|---|---|---|
192.168.66.80:9000 | server | - | - | - | threads_running | 46 | threads | +0 |
192.168.66.80:9000 | tablet | a06ff106f2b846d | system | transactions | log_wal_size | 1048576 | bytes | +0 |
CountSum statistics
In the output generated by ad-hoc or snapshot-diff mode, the second group of statistics shown are 'countsum' statistics.
These statistics are named this way because, for the use of yb_stats, the count (total_count) and sum (total_sum) fields are the only usable statistical values.
The way 'countsum' statistics work is that an event in the code that is tracked by 'countsum' statistics keeps a count of the number of times the event was triggered, and a sum of what it measures.
In many cases the unit of the sum is time (to capture the latency of the event), but it can also be bytes (to capture the size of, for example, an IO), or something else.
The count and sum statistics are counters, for which the value used is the difference between the end and begin values. For the count value difference:
- If the value is zero during both the begin and end snapshot, the statistic is skipped.
- If the value is non-zero in the end snapshot, and the statistic does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If the value is non-zero in the begin snapshot, and does not exist in the end snapshot, the statistic is skipped.
- If the value is non-zero in the begin and end snapshots, but subtracting leads to zero, the statistic is not printed: supposedly nothing happened, although previously something happened.
- If the value is non-zero in the begin and end snapshots, but the end value is lower than the begin value: this is a suspicious situation. Currently the resulting negative value is shown.
This is what countsum statistic output looks like:
192.168.66.80:7000 server handler_latency_outbound_call_queue_time 3 2.899 /s avg: 0 tot: 0 us
192.168.66.80:7000 server handler_latency_outbound_call_send_time 3 2.899 /s avg: 0 tot: 0 us
192.168.66.80:7000 server handler_latency_outbound_call_time_to_response 3 2.899 /s avg: 2666 tot: 8000 us
192.168.66.80:7000 server handler_latency_yb_master_MasterHeartbeat_TSHeartbeat 3 2.899 /s avg: 128 tot: 386 us
192.168.66.80:7000 server rpc_incoming_queue_time 3 2.899 /s avg: 146 tot: 439 us
Explanation:
hostname:port | metric_type | statistic name | count | count / snapshot time (s) | sum / count | sum total | sum unit |
---|---|---|---|---|---|---|---|
192.168.66.80:7000 | server | handler_latency_outbound_call_queue_time | 3 | 2.899 /s | avg: 0 | tot: 0 | us |
192.168.66.80:7000 | server | handler_latency_outbound_call_send_time | 3 | 2.899 /s | avg: 0 | tot: 0 | us |
192.168.66.80:7000 | server | handler_latency_outbound_call_time_to_response | 3 | 2.899 /s | avg: 2666 | tot: 8000 | us |
192.168.66.80:7000 | server | handler_latency_yb_master_MasterHeartbeat_TSHeartbeat | 3 | 2.899 /s | avg: 128 | tot: 386 | us |
192.168.66.80:7000 | server | rpc_incoming_queue_time | 3 | 2.899 /s | avg: 146 | tot: 439 | us |
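As a worked example from the rows above: avg is the sum divided by the count, so for rpc_incoming_queue_time the average is 439 / 3 ≈ 146 us, and the count rate of 2.899 /s implies roughly 3 / 2.899 ≈ 1.035 seconds between the snapshots.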
gauges
There is no gauges-like statistic type in 'countsum' statistics.
details enable
This is what countsum output looks like when --details-enable is added:
192.168.66.80:9000 server - - - rpc_incoming_queue_time 143 13.877 /s avg: 103 tot: 14807 us
192.168.66.80:9000 server - - - transaction_pool_cache 1 0.097 /s avg: 0 tot: 0 us
192.168.66.80:9000 table 000000000004000 yugabyte t log_append_latency 4 0.388 /s avg: 45 tot: 182 us
192.168.66.80:9000 table 000000000004000 yugabyte t log_entry_batches_per_group 3 0.291 /s avg: 1 tot: 4 requests
192.168.66.80:9000 table 000000000004000 yugabyte t log_group_commit_latency 3 0.291 /s avg: 2319 tot: 6958 us
192.168.66.80:9000 table 000000000004000 yugabyte t log_sync_latency 1 0.097 /s avg: 6706 tot: 6706 us
192.168.66.80:9000 table 000000000004000 yugabyte t rocksdb_bytes_per_write 3 0.291 /s avg: 12 tot: 36 bytes
192.168.66.80:9000 table 000000000004000 yugabyte t rocksdb_db_write_micros 3 0.291 /s avg: 11 tot: 34 us
Explanation:
hostname:port | metric_type | object_id | namespace | object name | statistic name | count | count snapshot time (s) | sum / count | sum total | sum unit |
---|---|---|---|---|---|---|---|---|---|---|
192.168.66.80:9000 | server | - | - | - | rpc_incoming_queue_time | 143 | 13.877 /s | avg: 103 | tot: 14807 | us |
192.168.66.80:9000 | server | - | - | - | transaction_pool_cache | 1 | 0.097 /s | avg: 0 | tot: 0 | us |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | log_append_latency | 4 | 0.388 /s | avg: 45 | tot: 182 | us |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | log_entry_batches_per_group | 3 | 0.291 /s | avg: 1 | tot: 4 | requests |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | log_group_commit_latency | 3 | 0.291 /s | avg: 2319 | tot: 6958 | us |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | log_sync_latency | 1 | 0.097 /s | avg: 6706 | tot: 6706 | us |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | rocksdb_bytes_per_write | 3 | 0.291 /s | avg: 12 | tot: 36 | bytes |
192.168.66.80:9000 | table | 000000000004000 | yugabyte | t | rocksdb_db_write_micros | 3 | 0.291 /s | avg: 11 | tot: 34 | us |
Countsum statistics are called 'coarse histograms' in the YugabyteDB sourcecode, and have the fields count and sum in common with 'summaries' in Prometheus; however, quantile items are not available. YugabyteDB adds the fields min, mean, max, percentile_75, percentile_95, percentile_99, percentile_99_9 and percentile_99_99 to its metrics. These fields are reset when the metrics are read.
CountSumRows statistics
In the output generated by ad-hoc or snapshot-diff mode, a third group that can optionally be shown are the 'countsumrows' statistics. These statistics are taken from the YSQL http endpoint (normally port 13000). If no SQL interaction happened between YSQL/postgres and DocDB, no statistics will be shown.
- If a statistic has a zero count in both the begin and end snapshot, it will be skipped.
- If a statistic has a non-zero count in both the begin and end snapshot, and subtracting the values leads to zero, it will be skipped.
- If a statistic has a lower value in the end snapshot than in the begin snapshot, the statistic will currently be shown, and may be negative.
This is what countsumrows statistic output looks like:
192.168.66.80:13000 handler_latency_yb_ysqlserver_SQLProcessor_InsertStmt 1 avg: 18.552 tot: 18.552 ms, avg: 1 tot: 1 rows
192.168.66.80:13000 handler_latency_yb_ysqlserver_SQLProcessor_SingleShardTransactions 1 avg: 18.552 tot: 18.552 ms, avg: 1 tot: 1 rows
192.168.66.80:13000 handler_latency_yb_ysqlserver_SQLProcessor_Single_Shard_Transactions 1 avg: 18.552 tot: 18.552 ms, avg: 1 tot: 1 rows
192.168.66.80:13000 handler_latency_yb_ysqlserver_SQLProcessor_Transactions 1 avg: 18.552 tot: 18.552 ms, avg: 1 tot: 1 rows
Explanation:
hostname:port | statistic name | count | sum / count | sum total | sum unit | rows / count | rows total |
---|---|---|---|---|---|---|---|
192.168.66.80:13000 | handler_latency_yb_ysqlserver_SQLProcessor_InsertStmt | 1 | avg: 18.552 | tot: 18.552 | ms | avg: 1 | tot: 1 rows |
192.168.66.80:13000 | handler_latency_yb_ysqlserver_SQLProcessor_SingleShardTransactions | 1 | avg: 18.552 | tot: 18.552 | ms | avg: 1 | tot: 1 rows |
192.168.66.80:13000 | handler_latency_yb_ysqlserver_SQLProcessor_Single_Shard_Transactions | 1 | avg: 18.552 | tot: 18.552 | ms | avg: 1 | tot: 1 rows |
192.168.66.80:13000 | handler_latency_yb_ysqlserver_SQLProcessor_Transactions | 1 | avg: 18.552 | tot: 18.552 | ms | avg: 1 | tot: 1 rows |
node-exporter-diff mode
The purpose of node-exporter-diff mode is to read two locally stored snapshots, and show a difference report for the node-exporter data only.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by statistic name regular expression.
- --details-enable: add the source of summarized counters, and some filtered out counters.
- --gauges-enable: add non-counter statistics to the output.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
node-exporter-diff mode means using already stored snapshots, which can be from a cluster that is currently unavailable or even deleted, because node-exporter-diff mode only uses the information stored in the locally available snapshot (JSON) data. This gives a lot of options for investigation that would otherwise be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke node-exporter-diff mode is to use the --node-exporter-diff switch.
If --node-exporter-diff is used without -b/--begin and -e/--end, yb_stats will list the available snapshots and prompt for the begin and end snapshot numbers.
node-exporter-diff without begin/end specification:
yb_stats --node-exporter-diff
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
3 2023-03-19 14:20:50.977417 +01:00
4 2023-03-19 14:21:14.418544 +01:00
5 2023-03-19 14:24:17.733927 +01:00
Enter begin snapshot: 4
Enter end snapshot: 5
192.168.66.80:9300 counter node_context_switches_total 169483.000000 926.137 /s
192.168.66.80:9300 counter node_cpu_seconds_total_idle 169.720000 0.927 /s
192.168.66.80:9300 counter node_cpu_seconds_total_iowait 0.010000 0.000 /s
192.168.66.80:9300 counter node_cpu_seconds_total_irq 5.350000 0.029 /s
192.168.66.80:9300 counter node_cpu_seconds_total_softirq 1.300000 0.007 /s
192.168.66.80:9300 counter node_cpu_seconds_total_system 2.110000 0.012 /s
192.168.66.80:9300 counter node_cpu_seconds_total_user 0.090000 0.000 /s
192.168.66.80:9300 counter node_disk_io_time_seconds_total_sda 0.076000 0.000 /s
...lots more output
node-exporter-diff with begin/end specification:
yb_stats --node-exporter-diff -b 4 -e 5
192.168.66.80:9300 counter node_context_switches_total 169483.000000 926.137 /s
192.168.66.80:9300 counter node_cpu_seconds_total_idle 169.720000 0.927 /s
192.168.66.80:9300 counter node_cpu_seconds_total_iowait 0.010000 0.000 /s
192.168.66.80:9300 counter node_cpu_seconds_total_irq 5.350000 0.029 /s
192.168.66.80:9300 counter node_cpu_seconds_total_softirq 1.300000 0.007 /s
192.168.66.80:9300 counter node_cpu_seconds_total_system 2.110000 0.012 /s
192.168.66.80:9300 counter node_cpu_seconds_total_user 0.090000 0.000 /s
192.168.66.80:9300 counter node_disk_io_time_seconds_total_sda 0.076000 0.000 /s
...lots more output
Node-exporter statistics
In the output of ad-hoc or snapshot-diff mode, if node-exporter is installed on the YugabyteDB cluster, the last group of statistics shown is the node-exporter statistics. The captured statistics are essentially a statistic name and a statistic value. The values to be displayed are ordered by hostname, metric_type and metric_name.
By default, counter values are shown, for which the value is the difference between the end and begin values.
- If a counter is zero in both the begin and end snapshot, the statistic is skipped.
- If a counter is non-zero and exists in the end snapshot, but does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If a counter is non-zero and exists in the begin snapshot, but does not exist in the end snapshot, the statistic is skipped.
- If a counter is non-zero in both the begin and end snapshot, but subtracting leads to zero, the statistic is not printed: nothing happened between the snapshots, even though something happened before the begin snapshot.
- If a counter is non-zero in both the begin and end snapshot, but the end value is lower than the begin value, this is a suspicious situation; currently the resulting negative value is shown.
This is what node-exporter statistics output looks like:
192.168.66.80:9300 counter node_context_switches_total 7759.000000 862.111 /s
192.168.66.80:9300 counter node_cpu_seconds_total_idle 8.150000 0.906 /s
192.168.66.80:9300 counter node_cpu_seconds_total_irq 0.310000 0.034 /s
192.168.66.80:9300 counter node_cpu_seconds_total_softirq 0.120000 0.013 /s
192.168.66.80:9300 counter node_cpu_seconds_total_system 0.170000 0.019 /s
192.168.66.80:9300 counter node_cpu_seconds_total_user 0.010000 0.001 /s
Explanation:
hostname:port | metric_type | statistic_name | value | value / snapshot time (s) |
---|---|---|---|---|
192.168.66.80:9300 | counter | node_context_switches_total | 7759.000000 | 862.111 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_idle | 8.150000 | 0.906 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_irq | 0.310000 | 0.034 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_softirq | 0.120000 | 0.013 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_system | 0.170000 | 0.019 /s |
192.168.66.80:9300 | counter | node_cpu_seconds_total_user | 0.010000 | 0.001 /s |
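The rate in the last column is the counter's delta divided by the time between the two snapshots. Taking the first line as a worked example: the counter increased by 7759 context switches, and 7759 divided by the roughly 9 seconds between the snapshots gives the 862.111 /s rate (7759 / 9.0 ≈ 862.1).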
gauges
If the --gauges-enable switch is used, gauge type values are shown alongside counter values.
- If a gauge is zero in both the begin and end snapshot, the statistic is skipped.
- If a gauge is non-zero and exists in the end snapshot, but does not exist in the begin snapshot, the end snapshot value is taken as the value.
- If a gauge is non-zero and exists in the begin snapshot, but does not exist in the end snapshot, the statistic is skipped.
- If a gauge is non-zero in both the begin and end snapshot, and subtracting leads to zero, the value is still printed (unlike counters).
This is what this looks like:
192.168.66.80:9300 gauge node_arp_entries_eth0 2.000000 +0
192.168.66.80:9300 gauge node_arp_entries_eth1 3.000000 +0
192.168.66.80:9300 gauge node_boot_time_seconds 1666174770.000000 +0
192.168.66.80:9300 counter node_context_switches_total 994.000000 994.000 /s
192.168.66.80:9300 gauge node_cooling_device_max_state_1_intel_powerclamp 50.000000 +0
Explanation:
hostname:port | metric_type | statistic name | end value | end value - begin value |
---|---|---|---|---|
192.168.66.80:9300 | gauge | node_arp_entries_eth0 | 2.000000 | +0 |
192.168.66.80:9300 | gauge | node_arp_entries_eth1 | 3.000000 | +0 |
192.168.66.80:9300 | gauge | node_boot_time_seconds | 1666174770.000000 | +0 |
statements-diff mode
The purpose of statements-diff mode is to read two snapshots which must be locally stored, and show a difference report.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --sql-length: the maximum length of a SQL statement shown, for readability. Default: 80.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
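For example, a sketch of combining these switches (the snapshot numbers and length are illustrative assumptions): show the statement differences between snapshots 4 and 5, truncating the SQL text at 120 characters instead of the default 80:
yb_stats --statements-diff -b 4 -e 5 --sql-length 120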
statements-diff mode means using already stored snapshots, which can be from a cluster that currently is unavailable or even deleted, because statements-diff mode only uses the information that is stored in the locally available snapshot (JSON) data. This gives a lot of options for investigation that otherwise would be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke statements-diff mode is to use the --statements-diff
switch.
If --statements-diff is used without -b/--begin and -e/--end, it will list the available snapshots and prompt for the begin and end snapshot numbers.
statements-diff without begin/end specification:
yb_stats --statements-diff
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
3 2023-03-19 14:20:50.977417 +01:00
4 2023-03-19 14:21:14.418544 +01:00
5 2023-03-19 14:24:17.733927 +01:00
Enter begin snapshot: 4
Enter end snapshot: 5
192.168.66.80:13000 1 avg: 0.459 tot: 0.459 ms avg: 1 tot: 1 rows: select now()
statements-diff with begin/end specification:
yb_stats --statements-diff -b 4 -e 5
192.168.66.80:13000 1 avg: 0.459 tot: 0.459 ms avg: 1 tot: 1 rows: select now()
Statement statistics
In the output generated by ad-hoc or snapshot-diff mode, a fourth group that can optionally be shown is the statement statistics. These statistics are taken from the YSQL (normally port 13000) http endpoint. If no SQL interaction happened between YSQL/postgres and DocDB, there will be no statements shown.
This is what the statement output looks like:
192.168.66.80:13000 98 avg: 0.002 tot: 0.156 ms avg: 0 tot: 0 rows: begin
192.168.66.80:13000 98 avg: 0.003 tot: 0.305 ms avg: 0 tot: 0 rows: commit
192.168.66.80:13000 98 avg: 78.722 tot: 7714.786 ms avg: 1020 tot: 100000 rows: copy ysql_bench_accounts from stdin
192.168.66.80:13000 1 avg: 456.096 tot: 456.096 ms avg: 0 tot: 0 rows: create table ysql_bench_accounts(aid int not null,bid int,abalance int,filler
192.168.66.80:13000 1 avg: 451.798 tot: 451.798 ms avg: 0 tot: 0 rows: create table ysql_bench_branches(bid int not null,bbalance int,filler char(88),P
192.168.66.80:13000 1 avg: 483.839 tot: 483.839 ms avg: 0 tot: 0 rows: create table ysql_bench_history(tid int,bid int,aid int,delta int,mtime times
192.168.66.80:13000 1 avg: 394.892 tot: 394.892 ms avg: 0 tot: 0 rows: create table ysql_bench_tellers(tid int not null,bid int,tbalance int,filler cha
192.168.66.80:13000 1 avg: 569.624 tot: 569.624 ms avg: 0 tot: 0 rows: drop table if exists ysql_bench_accounts, ysql_bench_branches, ysql_bench_histor
192.168.66.80:13000 1 avg: 11.560 tot: 11.560 ms avg: 1 tot: 1 rows: insert into ysql_bench_branches(bid,bbalance) values($1,$2)
192.168.66.80:13000 10 avg: 8.218 tot: 82.179 ms avg: 1 tot: 10 rows: insert into ysql_bench_tellers(tid,bid,tbalance) values ($1,$2,$3)
192.168.66.80:13000 1 avg: 6641.962 tot: 6641.962 ms avg: 0 tot: 0 rows: truncate table ysql_bench_accounts, ysql_bench_branches, ysql_bench_history, ysq
Explanation:
hostname:port | calls | total_time / calls | total_time | unit total_time | rows / calls | total rows | query |
---|---|---|---|---|---|---|---|
192.168.66.80:13000 | 98 | avg: 0.002 | tot: 0.156 | ms | avg: 0 | tot: 0 rows | begin |
192.168.66.80:13000 | 98 | avg: 0.003 | tot: 0.305 | ms | avg: 0 | tot: 0 rows | commit |
192.168.66.80:13000 | 98 | avg: 78.722 | tot: 7714.786 | ms | avg: 1020 | tot: 100000 rows | copy ysql_bench_accounts from stdin |
192.168.66.80:13000 | 1 | avg: 456.096 | tot: 456.096 | ms | avg: 0 | tot: 0 rows | create table ysql_bench_accounts(aid int not null,bid int,abalance int,filler |
192.168.66.80:13000 | 1 | avg: 451.798 | tot: 451.798 | ms | avg: 0 | tot: 0 rows | create table ysql_bench_branches(bid int not null,bbalance int,filler char(88),P |
192.168.66.80:13000 | 1 | avg: 483.839 | tot: 483.839 | ms | avg: 0 | tot: 0 rows | create table ysql_bench_history(tid int,bid int,aid int,delta int,mtime times |
192.168.66.80:13000 | 1 | avg: 394.892 | tot: 394.892 | ms | avg: 0 | tot: 0 rows | create table ysql_bench_tellers(tid int not null,bid int,tbalance int,filler cha |
192.168.66.80:13000 | 1 | avg: 569.624 | tot: 569.624 | ms | avg: 0 | tot: 0 rows | drop table if exists ysql_bench_accounts, ysql_bench_branches, ysql_bench_histor |
192.168.66.80:13000 | 1 | avg: 11.560 | tot: 11.560 | ms | avg: 1 | tot: 1 rows | insert into ysql_bench_branches(bid,bbalance) values($1,$2) |
192.168.66.80:13000 | 10 | avg: 8.218 | tot: 82.179 | ms | avg: 1 | tot: 10 rows | insert into ysql_bench_tellers(tid,bid,tbalance) values ($1,$2,$3) |
192.168.66.80:13000 | 1 | avg: 6641.962 | tot: 6641.962 | ms | avg: 0 | tot: 0 rows | truncate table ysql_bench_accounts, ysql_bench_branches, ysql_bench_history, ysq |
For the sake of simplicity, any identical SQL (based on the query text) is summed up and assumed to be the same statement. This is not strictly correct (at the time of creation, query_id was not exposed, so this was the only solution).
Please mind the source of the SQL statistics is postgres' pg_stat_statements, which has a few quirks:
- Any SQL that returns an error is not saved in pg_stat_statements.
- The 'total_time' is actually the time spent in the execution phase. Especially since YSQL might need to perform RPCs to complete its catalog (in the rewrite/semantic parse and plan phases), this can miss some time and therefore show less time than a client sees.
- A statement's uniqueness in pg_stat_statements depends on query_id, dbid and userid. Currently not all of these fields are exposed in the http endpoint.
versions-diff mode
The purpose of versions-diff mode is to read two snapshots which must be locally stored, and show a difference report.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
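For example, a sketch (the snapshot numbers and the port filter are illustrative assumptions): compare the versions between snapshots 0 and 1 for the masters only, by filtering on port 7000:
yb_stats --versions-diff -b 0 -e 1 --hostname-match 7000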
versions-diff mode means using already stored snapshots, which can be from a cluster that currently is unavailable or even deleted, because versions-diff mode only uses the information that is stored in the locally available snapshot (CSV) data. This gives a lot of options for investigation that otherwise would be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke versions-diff mode is to use the --versions-diff
switch.
If --versions-diff is used without -b/--begin and -e/--end, it will list the available snapshots and prompt for the begin and end snapshot numbers.
versions-diff without begin/end specification:
yb_stats --versions-diff
0 2022-12-08 16:42:01.226043 +01:00
1 2022-12-09 16:34:08.057222 +01:00
2 2022-12-10 16:07:57.948800 +01:00
3 2022-12-10 21:39:04.439287 +01:00
4 2022-12-10 21:39:33.664075 +01:00
5 2022-12-10 21:42:56.852644 +01:00
6 2022-12-10 21:43:00.348445 +01:00
Enter begin snapshot: 5
Enter end snapshot: 6
No output means there is no version difference between the two snapshots.
versions-diff with begin/end specification:
yb_stats --versions-diff -b 0 -e 1
* 192.168.66.80:7000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.80:9000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.81:7000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.81:9000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.82:7000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
* 192.168.66.82:9000 Versions: 2.15.3.2->2.17.0.0 b1->b24 RELEASE 04 Nov 2022 16:53:01 UTC->16 Nov 2022 00:21:52 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14->d4f01a5e26b168585e59f9c1a95766ffdd9655b1
version diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the version using the /api/v1/version
endpoint.
This is executed for both the masters and the tablet servers.
This is done during the begin and end snapshots.
Example of changed versions on all servers between the begin and end snapshot:
* ip-172-158-59-19:7000 Versions: 2.17.1.0 184->201 RELEASE 09 Nov 2022 22:48:49 UTC->14 Nov 2022 13:07:31 UTC 971d32f9bf50d11c067bda4b5498d27611583c2c->d9d98761806ef4f37f501e2eb40bb7dcd981bb65
* ip-172-158-59-19:9000 Versions: 2.17.1.0 184->201 RELEASE 09 Nov 2022 22:48:49 UTC->14 Nov 2022 13:07:31 UTC 971d32f9bf50d11c067bda4b5498d27611583c2c->d9d98761806ef4f37f501e2eb40bb7dcd981bb65
This shows that on the servers ip-172-158-59-19:7000
and ip-172-158-59-19:9000
there have been version changes.
- The version stayed the same.
- The build number changed from 184 to 201.
- The build date changed from 09 Nov 2022 22:48:49 UTC to 14 Nov 2022 13:07:31 UTC.
- The build git hash changed from 971d32f9bf50d11c067bda4b5498d27611583c2c to d9d98761806ef4f37f501e2eb40bb7dcd981bb65.
entities diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the entity data from the master leader during the begin and the end snapshot.
Entities are objects known to the master which deal with storing and organising the data for YSQL and YCQL.
The end snapshot data is compared with the entities found in the begin snapshot, and any differences found are shown.
By default, it will skip the YugabyteDB default system databases/keyspaces:
- "00000000000000000000000000000001" | // ycql system
- "00000000000000000000000000000002" | // ycql system_schema
- "00000000000000000000000000000003" | // ycql system_auth
- "00000001000030008000000000000000" | // ysql template1
- "000033e5000030008000000000000000") // ysql template0
When any of the other entities (databases, objects, tablets and replicas) is added or removed, yb_stats
will show the change.
Example diff where a YSQL database, a table and an index are created:
+ Database: ysql.testdb, id: 0000414d000030008000000000000000
+ Object: ysql.testdb.testtable, state: RUNNING, id: 0000414d00003000800000000000414e
+ Object: ysql.testdb.testindex, state: RUNNING, id: 0000414d000030008000000000004153
+ Tablet: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 state: RUNNING, leader: yb-2.local:9100
+ Tablet: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a state: RUNNING, leader: yb-3.local:9100
+ Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-2.local:9100, type: VOTER
+ Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-1.local:9100, type: VOTER
+ Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-3.local:9100, type: VOTER
+ Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-2.local:9100, type: VOTER
+ Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-1.local:9100, type: VOTER
+ Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-3.local:9100, type: VOTER
And an example where the above YSQL database is dropped:
- Database: ysql.testdb, id: 0000414d000030008000000000000000
- Object: ysql.testdb.testtable, state: RUNNING, id: 0000414d00003000800000000000414e
- Object: ysql.testdb.testindex, state: RUNNING, id: 0000414d000030008000000000004153
- Tablet: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 state: RUNNING, leader: yb-2.local:9100
- Tablet: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a state: RUNNING, leader: yb-3.local:9100
- Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-2.local:9100, type: VOTER
- Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-1.local:9100, type: VOTER
- Replica: ysql.testdb.testtable.13887764dbe74dfd8303d5785e5821a8 server: yb-3.local:9100, type: VOTER
- Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-2.local:9100, type: VOTER
- Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-1.local:9100, type: VOTER
- Replica: ysql.testdb.testindex.3320638518854a2b8c0c5bed30a9861a server: yb-3.local:9100, type: VOTER
Or detect a leader change for a tablet:
* Tablet: ysql.testdb.testtable.5350a928953c4eb1aaa9eb0581a3112b state: RUNNING leader: yb-3.local:9100->yb-1.local:9100
masters diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the master metadata from the master leader during the begin and the end snapshot.
In this way, any changes to the masters, such as a reboot/restart or a role change can be seen.
Example of a leader change because of a restart:
* Master b460d504c6aa488d97bfe266ab506ab6 FOLLOWER->LEADER Cloud: local, Region: local, Zone: local,
Seqno: 1669798949509360, Start time: 1669798949509360
Http ( yb-3.local:7000 )
Rpc ( yb-3.local:7100 )
* Master d3db2544098b4b808c0c65d4d19f4d3a LEADER->FOLLOWER Cloud: local, Region: local, Zone: local,
Seqno: 1669798888831913->1669824380213235, Start time: 1669798888831913->1669824380213235
Http ( yb-1.local:7000 )
Rpc ( yb-1.local:7100 )
Here the master d3db2544098b4b808c0c65d4d19f4d3a
had its role changed from LEADER to FOLLOWER.
In the same snapshot, the master b460d504c6aa488d97bfe266ab506ab6
had its role changed from FOLLOWER to LEADER.
The reason for the role change from LEADER to FOLLOWER cannot be seen, but the start time and the seqno properties also changed. The change of start time shows the start time was renewed, indicating the master was restarted. The sequence number is identical to the start time, and therefore changed along with it.
Please mind that if no change happened between the begin and end snapshot, no output will be shown.
tablet-servers-diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the tablet server metadata from the master leader during the begin and the end snapshot.
In this way, any changes to the tablet servers, such as a reboot/restart or a role change can be seen.
Example of a tablet server having been restarted:
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
3 2023-03-19 14:20:50.977417 +01:00
4 2023-03-19 14:21:14.418544 +01:00
5 2023-03-19 14:24:17.733927 +01:00
6 2023-03-19 15:52:10.007082 +01:00
7 2023-03-19 15:52:45.711866 +01:00
Enter begin snapshot: 5
Enter end snapshot: 7
= Tserver: yb-1.local:9000, status: ALIVE, uptime: 819->40
Here the tablet server named yb-1.local:9000 shows a change of its uptime from 819 to 40 seconds, which indicates a restart.
The next thing to look for from here might be --entity-diff
, because this can cause replicas to change RAFT role.
Please mind that if no change happened between the begin and end snapshot, no output will be shown.
vars diff
Whenever yb_stats
is used in ad-hoc diff mode or in snapshot-diff mode, it will read the vars (gflags) using the /api/v1/varz
endpoint.
This is executed for both the masters and the tablet servers.
This is done during the begin and end snapshots.
Example of a changed var on all servers between the begin and end snapshot:
* 192.168.66.80:9000 Vars: ysql_enable_packed_row false->true Default->Custom
* 192.168.66.81:9000 Vars: ysql_enable_packed_row false->true Default->Custom
* 192.168.66.82:9000 Vars: ysql_enable_packed_row false->true Default->Custom
This shows that on the servers 192.168.66.80, 192.168.66.81 and 192.168.66.82, on endpoint 9000 (the default tablet server port), a change was detected and reported.
- The var/gflag is ysql_enable_packed_row.
- The value changed from false to true.
- The change of the value changed the type of the var from Default (not changed) to Custom (changed).
healthcheck-diff mode
The way to invoke healthcheck-diff mode is to use the --healthcheck-diff switch:
yb_stats --healthcheck-diff
snapshot-nonmetrics-diff mode
The purpose of snapshot-nonmetrics-diff mode is to read two snapshots which must be locally stored, and show a difference report. The special purpose of nonmetrics is that it excludes the quite numerous detailed statistics, and only shows the differences for:
- entities
- masters
- tablet servers
- vars (gflags)
- versions
- healthcheck
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by statistic name regular expression.
- --details-enable: split table and tablet statistics, instead of summarizing these per server.
- -b/--begin: set the begin snapshot number.
- -e/--end: set the end snapshot number.
snapshot-nonmetrics-diff mode means using already stored snapshots, which can be from a cluster that currently is unavailable or even deleted, because snapshot-nonmetrics-diff mode only uses the information that is stored in the locally available snapshot (CSV) data. This gives a lot of options for investigation that otherwise would be hard or painful, and allows investigating airgapped clusters (clusters that are not connected to the internet).
The way to invoke snapshot-nonmetrics-diff mode is to use the --snapshot-nonmetrics-diff
switch.
If --snapshot-nonmetrics-diff is used without -b/--begin and -e/--end, it will list the available snapshots and prompt for the begin and end snapshot numbers.
snapshot-nonmetrics-diff without begin/end specification:
➜ yb_stats --snapshot-nonmetrics-diff
0 2023-03-18 14:13:01.407795 +01:00
1 2023-03-18 14:13:53.959694 +01:00
2 2023-03-18 14:14:05.162338 +01:00
Enter begin snapshot: 0
Enter end snapshot: 2
+ Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004100
+ Tablet: ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf, state: RUNNING, leader: yb-1.local:9100
+ Replica: yb-1.local:9100:ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf, Type: VOTER
= Tserver: yb-1.local:9000, status: ALIVE, uptime: 2619->0
= 192.168.66.80:12000 Vars: heap_profile_path /tmp/yb-tserver.1052->/tmp/yb-tserver.8389 Default
= 192.168.66.80:9000 Vars: heap_profile_path /tmp/yb-tserver.1052->/tmp/yb-tserver.8389 Default
For an explanation of the changes, see below.
snapshot-nonmetrics-diff with begin/end specification:
➜ yb_stats --snapshot-nonmetrics-diff -b 0 -e 2
+ Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004100
+ Tablet: ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf, state: RUNNING, leader: yb-1.local:9100
+ Replica: yb-1.local:9100:ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf, Type: VOTER
= Tserver: yb-1.local:9000, status: ALIVE, uptime: 2619->0
= 192.168.66.80:12000 Vars: heap_profile_path /tmp/yb-tserver.1052->/tmp/yb-tserver.8389 Default
= 192.168.66.80:9000 Vars: heap_profile_path /tmp/yb-tserver.1052->/tmp/yb-tserver.8389 Default
In this example, a begin snapshot (0) and an end snapshot (2) are specified, and the differences found between snapshot numbers 0 and 2 are:
- An object ysql.yugabyte.t is added (+).
- A tablet ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf is added (+) on yb-1.local:9100.
- A replica yb-1.local:9100:ysql.yugabyte.t.3b964b2604ae4201a1f865097f3eacdf is added (+).
- A change (=) is detected on Tserver yb-1.local:9000: the uptime has changed from 2619->0, indicating it has been restarted.
- A change (=) is detected on 192.168.66.80:12000 for the var/gflag heap_profile_path.
- A change (=) is detected on 192.168.66.80:9000 identical to port 12000. Port 12000 (the YCQL port) shows identical information to port 9000, the general default tablet server port.
adhoc-metrics-diff
When yb_stats
is run with the --adhoc-metrics-diff switch, it will perform a snapshot in memory, and wait for enter to perform the next snapshot and present the difference.
This is called 'adhoc mode'; however, the --adhoc-metrics-diff mode will only take the metrics (excluding node-exporter), and show the difference.
The usage of either adhoc mode or snapshot mode should be carefully considered: adhoc mode, alias in-memory snapshots, does not write anything to disk.
In most cases, performing snapshots persisting all the available information is the best way, so results can be reviewed later and cannot get lost, because they are stored. However, if you are performing repeated tests where storing all snapshot information would simply be too much and would require you to remove all the snapshots after testing anyway, AND you are sure what to look for, then adhoc mode might be used.
Example
This is what the first snapshot looks like in ad-hoc mode:
yb_stats --adhoc-metrics-diff
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
After the snapshot is created, you can perform the task under investigation. Once that is done, press enter:
Time between snapshots: 2.910 seconds
192.168.66.80:12000 server cpu_stime 49 ms 17.008 /s
192.168.66.80:12000 server cpu_utime 2 ms 0.694 /s
192.168.66.80:12000 server involuntary_context_switches 1 csws 0.347 /s
192.168.66.80:12000 server server_uptime_ms 2879 ms 999.306 /s
192.168.66.80:12000 server threads_started 2 threads 0.694 /s
...etcetera
adhoc-node-exporter-diff
When yb_stats
is run with the --adhoc-node-exporter-diff switch, it will perform a snapshot in memory, and wait for enter to perform the next snapshot and present the difference.
This is called 'adhoc mode'; however, the --adhoc-node-exporter-diff mode will only take the metrics from node-exporter, and show the difference.
Node-exporter shows the operating system statistics.
The usage of either adhoc mode or snapshot mode should be carefully considered: adhoc mode, alias in-memory snapshots, does not write anything to disk.
In most cases, performing snapshots persisting all the available information is the best way, so results can be reviewed later and cannot get lost, because they are stored. However, if you are performing repeated tests where storing all snapshot information would simply be too much and would require you to remove all the snapshots after testing anyway, AND you are sure what to look for, then adhoc mode might be used.
Example
This is what the first snapshot looks like in ad-hoc mode:
yb_stats --adhoc-node-exporter-diff
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
After the snapshot is created, you can perform the task under investigation. Once that is done, press enter:
Time between snapshots: 18.843 seconds
192.168.66.80:9300 counter node_context_switches_total 14946.000000 830.333 /s
192.168.66.80:9300 counter node_cpu_seconds_total_idle 17.610000 0.978 /s
192.168.66.80:9300 counter node_cpu_seconds_total_irq 0.510000 0.028 /s
192.168.66.80:9300 counter node_cpu_seconds_total_softirq 0.130000 0.007 /s
192.168.66.80:9300 counter node_cpu_seconds_total_system 0.220000 0.012 /s
192.168.66.80:9300 counter node_cpu_seconds_total_user 0.010000 0.001 /s
192.168.66.80:9300 counter node_disk_io_time_seconds_total_sda 0.004000 0.000 /s
192.168.66.80:9300 counter node_disk_io_time_weighted_seconds_total_sda 0.003000 0.000 /s
...etcetera
For an explanation of the fields see: Node-exporter statistics
Print modes
Outside of performance metrics, a snapshot contains a lot more information.
Most of the additional information can be obtained using print commands.
All of the print commands take a single snapshot number; some print commands also work without a snapshot number, which makes yb_stats
perform a live lookup.
print-version
Print version information from a live cluster or from a snapshot.
- --print-version <snapshot number>: print version information from a stored snapshot.
- --print-version: print version information from a live cluster.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
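For example, a sketch (the port filter is an illustrative assumption): restrict the live version output to the tablet servers by filtering on port 9000:
% yb_stats --print-version --hostname-match 9000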
Example:
% yb_stats --print-version
hostname_port version_number build_nr build_type build_timestamp git_hash
192.168.66.82:9000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.82:7000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.81:9000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.81:7000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.80:9000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
192.168.66.80:7000 2.15.3.2 1 RELEASE 04 Nov 2022 16:53:01 UTC 5ef608b43a994ab03a12ed3359258ec156d04a14
print-log
Print log information from a live cluster or from a snapshot.
- --print-log <snapshot number>: print log information from a stored snapshot.
- --print-log: print log information from a live cluster.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --log-severity: filters log lines by the letter that indicates the severity. Default: WEF, optional: I.
- --stat-name-match: filters log lines by the sourcefile name and line number (such as "leader_election.cc:216") or message fields via a regular expression.
Explanation of the severity letters (increasing in severity):
- I: Informational
- W: Warning
- E: Error
- F: Fatal
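For example, a sketch (assuming any subset of the severity letters is accepted; the snapshot number is an illustrative assumption): print only the error and fatal log lines from snapshot 2:
% yb_stats --print-log 2 --log-severity EF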
The --print-log
option prints the log lines based on the timestamp found in the log line.
The log lines are taken from the different servers, and ordered on time, which is the local timestamp, so you have to be aware of clock skew.
The timestamps in the logs are UTC time, so timezone settings should not require recalculation of the timestamp.
Example:
% yb_stats --print-log --log-severity IWEF --hostname-match '(7000|9000)'
...
192.168.66.80:9000 2023-01-29 12:29:45.447729 UTC I raft_consensus.cc:3356 T 5652a6b7a4ea47d198f0142895addf09 P 414ad0910477464fb4c17dfbd912de10 [term 1 FOLLOWER]: Calling mark dirty synchronously for reason code FOLLOWER_NO_OP_COMPLETE
192.168.66.81:9000 2023-01-29 12:29:45.448503 UTC I raft_consensus.cc:3356 T 5652a6b7a4ea47d198f0142895addf09 P 8e317433953244dfbff8dea89bdd7d77 [term 1 FOLLOWER]: Calling mark dirty synchronously for reason code FOLLOWER_NO_OP_COMPLETE
192.168.66.82:7000 2023-01-29 12:29:45.448714 UTC I catalog_manager.cc:7115 Peer 8e317433953244dfbff8dea89bdd7d77 sent incremental report for 5652a6b7a4ea47d198f0142895addf09, prev state op id: -1, prev state term: 1, prev state has_leader_uuid: 1. Consensus state: current_term: 1 leader_uuid: "0f41bb1e8bc34afe801d422c3c3064b4" config { opid_index: -1 peers { permanent_uuid: "0f41bb1e8bc34afe801d422c3c3064b4" member_type: VOTER last_known_private_addr { host: "yb-3.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local3" } } peers { permanent_uuid: "414ad0910477464fb4c17dfbd912de10" member_type: VOTER last_known_private_addr { host: "yb-1.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local1" } } peers { permanent_uuid: "8e317433953244dfbff8dea89bdd7d77" member_type: VOTER last_known_private_addr { host: "yb-2.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local2" } } }
192.168.66.82:7000 2023-01-29 12:29:45.449941 UTC I catalog_manager.cc:7115 Peer 414ad0910477464fb4c17dfbd912de10 sent incremental report for 5652a6b7a4ea47d198f0142895addf09, prev state op id: -1, prev state term: 1, prev state has_leader_uuid: 1. Consensus state: current_term: 1 leader_uuid: "0f41bb1e8bc34afe801d422c3c3064b4" config { opid_index: -1 peers { permanent_uuid: "0f41bb1e8bc34afe801d422c3c3064b4" member_type: VOTER last_known_private_addr { host: "yb-3.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local3" } } peers { permanent_uuid: "414ad0910477464fb4c17dfbd912de10" member_type: VOTER last_known_private_addr { host: "yb-1.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local1" } } peers { permanent_uuid: "8e317433953244dfbff8dea89bdd7d77" member_type: VOTER last_known_private_addr { host: "yb-2.local" port: 9100 } cloud_info { placement_cloud: "local" placement_region: "local" placement_zone: "local2" } } }
192.168.66.82:7000 2023-01-29 12:29:45.457672 UTC I ysql_transaction_ddl.cc:55 Verifying Transaction { transaction_id: c57ef1a4-8917-437d-8fc6-a00328d6fc95 isolation: SNAPSHOT_ISOLATION status_tablet: ecc3d88a304f4a82894229c65831bb9f priority: 15800696521729142229 start_time: { physical: 1674995385026221 } locality: GLOBAL old_status_tablet: }
192.168.66.82:7000 2023-01-29 12:29:45.459448 UTC I ysql_transaction_ddl.cc:110 Got Response for { transaction_id: c57ef1a4-8917-437d-8fc6-a00328d6fc95 isolation: SNAPSHOT_ISOLATION status_tablet: ecc3d88a304f4a82894229c65831bb9f priority: 15800696521729142229 start_time: { physical: 1674995385026221 } locality: GLOBAL old_status_tablet: }, resp: status: PENDING status_hybrid_time: 6860781098837565439 propagated_hybrid_time: 6860781098837594112 aborted_subtxn_set { }
192.168.66.80:9000 2023-01-29 12:29:45.480443 UTC I table_creator.cc:363 Created table yugabyte.t of type PGSQL_TABLE_TYPE
192.168.66.82:7000 2023-01-29 12:29:45.668880 UTC I catalog_manager.cc:4098 T 00000000000000000000000000000000 P 52d4dda57d5740459c014a1a9dc07eae: Table transaction succeeded: t [id=000033e8000030008000000000004000]
These are some informational messages which are produced when a table is created, and show some of the internal dealings with RAFT, and the master (indicated by port 7000; the network address implicitly shows the master leader) managing the table and therefore the tablet creation.
tail-log
Print new log entries from a live cluster, in the same way that tail -f
works on a file.
- --tail-log: print log information from a live cluster.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --log-severity: filters log lines by the letter that indicates the severity. Default: WEF, optional: I.
- --stat-name-match: filters log lines by the sourcefile name and line number (such as "log.cc:1516") or message fields via a regular expression.
Explanation of the severity letters (increasing in severity):
- I: Informational
- W: Warning
- E: Error
- F: Fatal
The --tail-log
option prints the log lines based on the timestamp found in the log line.
The log lines are taken from the different servers, and ordered on time, which is the local timestamp, so you have to be aware of clock skew.
The timestamps in the logs are UTC time, so timezone settings should not require recalculation of the timestamp.
Example:
% yb_stats --tail-log --log-severity IWEF --hostname-match 9000 --stat-name-match '(leader_election|Granting vote|Granting yes vote)'
192.168.66.80:9000 2022-12-19 14:37:16.188669 +01:00 I leader_election.cc:216 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 pre-election: Requesting vote from peer 64dc6522080c433c9cbd2c83efccb025
192.168.66.80:9000 2022-12-19 14:37:16.188694 +01:00 I leader_election.cc:216 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 pre-election: Requesting vote from peer 9e982449051e43f3902a4fb24d639b84
192.168.66.82:9000 2022-12-19 14:37:16.188913 +01:00 I raft_consensus.cc:2375 T 411a6843451a4e87aeca805085a228ba P 64dc6522080c433c9cbd2c83efccb025 [term 0 FOLLOWER]: Pre-election. Granting vote for candidate 3152ba00f652431baeeafa1b36336093 in term 1
192.168.66.81:9000 2022-12-19 14:37:16.191915 +01:00 I raft_consensus.cc:2375 T 411a6843451a4e87aeca805085a228ba P 9e982449051e43f3902a4fb24d639b84 [term 0 FOLLOWER]: Pre-election. Granting vote for candidate 3152ba00f652431baeeafa1b36336093 in term 1
192.168.66.80:9000 2022-12-19 14:37:16.192420 +01:00 I leader_election.cc:367 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 pre-election: Vote granted by peer 64dc6522080c433c9cbd2c83efccb025
192.168.66.80:9000 2022-12-19 14:37:16.192452 +01:00 I leader_election.cc:242 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 pre-election: Election decided. Result: candidate won.
192.168.66.80:9000 2022-12-19 14:37:16.202473 +01:00 I leader_election.cc:216 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 election: Requesting vote from peer 64dc6522080c433c9cbd2c83efccb025
192.168.66.80:9000 2022-12-19 14:37:16.202504 +01:00 I leader_election.cc:216 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 election: Requesting vote from peer 9e982449051e43f3902a4fb24d639b84
192.168.66.82:9000 2022-12-19 14:37:16.210361 +01:00 I raft_consensus.cc:3022 T 411a6843451a4e87aeca805085a228ba P 64dc6522080c433c9cbd2c83efccb025 [term 1 FOLLOWER]: Leader election vote request: Granting yes vote for candidate 3152ba00f652431baeeafa1b36336093 in term 1.
192.168.66.80:9000 2022-12-19 14:37:16.211731 +01:00 I leader_election.cc:367 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 election: Vote granted by peer 9e982449051e43f3902a4fb24d639b84
192.168.66.80:9000 2022-12-19 14:37:16.211760 +01:00 I leader_election.cc:242 T 411a6843451a4e87aeca805085a228ba P 3152ba00f652431baeeafa1b36336093 [CANDIDATE]: Term 1 election: Election decided. Result: candidate won.
192.168.66.81:9000 2022-12-19 14:37:16.211910 +01:00 I raft_consensus.cc:3022 T 411a6843451a4e87aeca805085a228ba P 9e982449051e43f3902a4fb24d639b84 [term 1 FOLLOWER]: Leader election vote request: Granting yes vote for candidate 3152ba00f652431baeeafa1b36336093 in term 1.
In this example, the --tail-log command has --log-severity set to additionally include 'I' severity loglines,
has --hostname-match set to '9000' in order to only allow tablet server loglines,
and has --stat-name-match set to the regex '(leader_election|Granting vote|Granting yes vote)' to only allow matching sourcefile names/line numbers and message fields.
Press interrupt (ctrl-c) to terminate tailing the logs.
print-entities
Print entities (database, database object, tablet, replica) from a live cluster or from a snapshot.
- --print-entities <snapshot number>: print entities from a stored snapshot.
- --print-entities: print entities from a live cluster.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --table-name-match: filter by object name regular expression.
- --details-enable: print the entity information from all masters.
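For example, a sketch (the regular expression is an illustrative assumption): print only the entities whose object name matches ysql_bench from a live cluster:
% yb_stats --print-entities --table-name-match ysql_bench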
The entity information is available on the masters, on the leader and the followers.
In order to get the current information, yb_stats
fetches information to learn the master leader first, and then obtains the entity information from the master leader, unless the --details-enable
switch is set.
For YSQL objects/entities, yb_stats
takes the OID from the object id and filters out OIDs lower than 16384, because these are system OIDs.
Example:
% yb_stats --print-entities
Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.t.7f4fc16eba28432e8ed2baf4603f9590 state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.t.ce1302dada834f619e67dffc847a80fe state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.t.f035128dd43d43d3a1a9d4c44727df99 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.tt, state: RUNNING, id: 000033e8000030008000000000004100
Tablet: ysql.yugabyte.tt.3ae53662d5374897b8a55899f7ceb9c4 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.tt.880a7b69ae4a474b96d3ff0b7117867b state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.tt.e844bce904794c9799301a2a95cdbe82 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.ysql_bench_history, state: RUNNING, id: 000033e800003000800000000000413b
Tablet: ysql.yugabyte.ysql_bench_history.11476b9ff3bd4cdeb89a6b188de44b51 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_history.1f01c2e8a9ba467b8495b304649bcbde state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_history.fd9094233dc04e9fa17084b99c42fea6 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.ysql_bench_tellers, state: RUNNING, id: 000033e800003000800000000000413e
Tablet: ysql.yugabyte.ysql_bench_tellers.918cba44a4d34b699aab6a53eb2399bf state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_tellers.a0a6a3f68cfd4ce697f9c412b74cf84d state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_tellers.d2b6e484972c4443868e1887d05bc7a4 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.ysql_bench_accounts, state: RUNNING, id: 000033e8000030008000000000004143
Tablet: ysql.yugabyte.ysql_bench_accounts.43e33bb5a7a34a4c8b631d08f4544165 state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_accounts.634d9612c86e4a98a9ffdba70a76227f state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_accounts.da848f4a61ea43c7a7d903b1c28b6942 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.yugabyte.ysql_bench_branches, state: RUNNING, id: 000033e8000030008000000000004148
Tablet: ysql.yugabyte.ysql_bench_branches.b243250ea9f145ccbb68119be37f540d state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_branches.b948c3f199954959b295c78d6f3f99c7 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,LEADER VOTER,yb-3.local:9100,FOLLOWER )
Tablet: ysql.yugabyte.ysql_bench_branches.d456575fb4a04b9ea9e438b93129aa2f state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Object: ysql.testdb.testtable, state: RUNNING, id: 00004154000030008000000000004155
Tablet: ysql.testdb.testtable.5350a928953c4eb1aaa9eb0581a3112b state: RUNNING
( VOTER,yb-2.local:9100,LEADER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER )
Object: ysql.testdb.testindex, state: RUNNING, id: 0000415400003000800000000000415a
Tablet: ysql.testdb.testindex.16c89cf34d054f0fb9116534d366ec33 state: RUNNING
( VOTER,yb-2.local:9100,FOLLOWER VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER )
Using the /dump-entities
endpoint it's not possible to make a distinction between the object types of table, index or materialized view.
- For every object, a full name is shown in the form of [database type].[database/keyspace].[object name], along with the state and the id of the object.
- An object contains one or more tablets.
- For every tablet, a full name is shown in the form of [database type].[database/keyspace].[object name].[tablet id]. A tablet has no name, only an id.
- For every tablet there is also the replica information between brackets. For every replica, the type, the RPC hostname and port, and the follower or leader designation are shown.
print-masters
Print the current masters from a live cluster or from a snapshot.
- --print-masters <snapshot number>: print masters from a stored snapshot.
- --print-masters: print masters from a live cluster.
Additional switches:
- --details-enable: print the master information from all masters.
In order to get the current information, yb_stats
fetches information to learn the master leader first, and then obtains the master information from the master leader, unless the --details-enable
switch is set.
Example:
% yb_stats --print-masters
d3db2544098b4b808c0c65d4d19f4d3a LEADER Cloud: local, Region: local, Zone: local
Seqno: 1669886426374545 Start time: 1669886426374545
RPC addresses: ( yb-1.local:7100 )
HTTP addresses: ( yb-1.local:7000 )
5334e8170e74496c9780d64e09177010 FOLLOWER Cloud: local, Region: local, Zone: local
Seqno: 1669886456237856 Start time: 1669886456237856
RPC addresses: ( yb-2.local:7100 )
HTTP addresses: ( yb-2.local:7000 )
b460d504c6aa488d97bfe266ab506ab6 FOLLOWER Cloud: local, Region: local, Zone: local
Seqno: 1669886489682609 Start time: 1669886489682609
RPC addresses: ( yb-3.local:7100 )
HTTP addresses: ( yb-3.local:7000 )
print-tablet-servers
Print the current tablet servers from a live cluster or from a snapshot.
- --print-tablet-servers <snapshot number>: print tablet servers from a stored snapshot.
- --print-tablet-servers: print tablet servers from a live cluster.
Additional switches:
- --details-enable: print the tablet server information from all masters.
In order to get the current information, yb_stats
fetches information to learn the master leader first, and then obtains the tablet server information from the master leader, unless the --details-enable
switch is set.
Example:
% yb_stats --print-tablet-servers
yb-2.local:9000 ALIVE Cloud: local, Region: local, Zone: local2
HB time: 2.1s, Uptime: 0, Ram 8.39 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 14, user (leader/total): 7/20, system (leader/total): 0/12
Path: /mnt/d0, total: 10724835328, used: 992555008 (9.25%)
yb-1.local:9000 ALIVE Cloud: local, Region: local, Zone: local1
HB time: 0.0s, Uptime: 0, Ram 9.44 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 16, user (leader/total): 6/20, system (leader/total): 0/12
Path: /mnt/d0, total: 10724835328, used: 441438208 (4.12%)
yb-3.local:9000 ALIVE Cloud: local, Region: local, Zone: local3
HB time: 0.1s, Uptime: 1118, Ram 62.46 MB
SST files: nr: 13, size: 4.46 MB, uncompressed: 15.24 MB
ops read: 0, write: 0
tablets: active: 32, user (leader/total): 6/20, system (leader/total): 5/12
Path: /mnt/d0, total: 10724835328, used: 922292224 (8.60%)
print-vars
Print the current vars (gflags) from a live cluster or from a snapshot.
- --print-vars <snapshot number>: print variables/gflags from every server (endpoint) in a stored snapshot.
- --print-vars: print variables/gflags from every server (endpoint) in the cluster.
Additional switches:
- --details-enable: print variables with 'Default' type too.
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-filter: filter by variable name regular expression.
Example:
% yb_stats --print-vars --hostname-match 192.168.66.82:9000
192.168.66.82:9000 log_filename yb-tserver NodeInfo
192.168.66.82:9000 placement_cloud local NodeInfo
192.168.66.82:9000 placement_region local NodeInfo
192.168.66.82:9000 placement_zone local3 NodeInfo
192.168.66.82:9000 rpc_bind_addresses 0.0.0.0 NodeInfo
192.168.66.82:9000 webserver_interface NodeInfo
192.168.66.82:9000 webserver_port 9000 NodeInfo
192.168.66.82:9000 client_read_write_timeout_ms 600000 Custom
192.168.66.82:9000 cql_proxy_bind_address 0.0.0.0:9042 Custom
192.168.66.82:9000 db_block_cache_size_percentage 10 Custom
192.168.66.82:9000 default_memory_limit_to_ram_ratio 0.59999999999999998 Custom
192.168.66.82:9000 flagfile /opt/yugabyte/conf/tserver.conf Custom
192.168.66.82:9000 fs_data_dirs /mnt/d0 Custom
192.168.66.82:9000 global_log_cache_size_limit_mb 32 Custom
192.168.66.82:9000 leader_lease_duration_ms 4000 Custom
192.168.66.82:9000 log_cache_size_limit_mb 16 Custom
192.168.66.82:9000 mem_tracker_tcmalloc_gc_release_bytes 5062950 Custom
192.168.66.82:9000 pg_yb_session_timeout_ms 600000 Custom
192.168.66.82:9000 pgsql_proxy_bind_address 0.0.0.0:5433 Custom
192.168.66.82:9000 raft_heartbeat_interval_ms 1000 Custom
192.168.66.82:9000 redis_proxy_bind_address 0.0.0.0:6379 Custom
192.168.66.82:9000 server_tcmalloc_max_total_thread_cache_bytes 33554432 Custom
192.168.66.82:9000 start_pgsql_proxy true Custom
192.168.66.82:9000 tserver_master_addrs yb-1.local:7100,yb-2.local:7100,yb-3.local:7100 Custom
192.168.66.82:9000 yb_num_shards_per_tserver 2 Custom
192.168.66.82:9000 ysql_num_shards_per_tserver 1 Custom
192.168.66.82:9000 regular_tablets_data_block_key_value_encoding three_shared_parts Auto
192.168.66.82:9000 TEST_auto_flags_initialized true Auto
This is using the new /api/v1/varz
endpoint. For older versions, use --print-gflags
.
The new variables/gflags page shows a classification or 'type' per variable/gflag.
- Default
- NodeInfo
- Custom
- Auto
Variables of the type Default
are not changed, and therefore are not shown by default.
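For example, a sketch (reusing the host filter from the example above): also show the unchanged Default variables for a single tablet server:
% yb_stats --print-vars --details-enable --hostname-match 192.168.66.82:9000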
print-memtrackers
Print the memtracker page from a snapshot.
- --print-memtrackers <snapshot number>: print memtrackers information from a stored snapshot.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
- --stat-name-match: filter by memory area name (id) regular expression.
Example:
% yb_stats --print-memtrackers 2 --hostname-match 82:700 --stat-name-match '(root|TCMalloc|server)'
--------------------------------------------------------------------------------------------------------------------------------------
Host: 192.168.66.82:7000, Snapshot number: 2, Snapshot time: 2022-10-18 15:26:20.125948 +02:00
--------------------------------------------------------------------------------------------------------------------------------------
hostname_port id current_consumption peak_consumption limit
--------------------------------------------------------------------------------------------------------------------------------------
192.168.66.82:7000 root 26.86M 28.09M 241.44M
192.168.66.82:7000 TCMalloc Central Cache 812.7K 961.9K none
192.168.66.82:7000 TCMalloc PageHeap Free 2.26M 4.73M none
192.168.66.82:7000 TCMalloc Thread Cache 716.9K 1.97M none
192.168.66.82:7000 TCMalloc Transfer Cache 985.5K 1.41M none
192.168.66.82:7000 server 197.5K (12.62M) 232.1K none
print-rpcs
Print RPC (remote procedure call) information from a snapshot.
- --print-rpcs <snapshot number>: print rpc information from a stored snapshot.
By default, --print-rpcs
prints out a summary (a count of RPCs) per host and port. Use --details-enable
to see all individual RPCs.
--print-rpcs
includes port 13000 (YSQL), which means the YSQL connections.
Additional switches:
- --details-enable: print all individual RPC connections, instead of a summary.
- --hostname-match: filter by hostname or port regular expression.
Example:
% yb_stats --print-rpcs 2
----------------------------------------------------------------------------------------------------
Host: 192.168.66.80; port: 13000, count: 2; port: 7000, count: 20; port: 9000, count: 49
----------------------------------------------------------------------------------------------------
Host: 192.168.66.81; port: 13000, count: 1; port: 7000, count: 19; port: 9000, count: 41
----------------------------------------------------------------------------------------------------
Host: 192.168.66.82; port: 13000, count: 1; port: 7000, count: 57; port: 9000, count: 40
----------------------------------------------------------------------------------------------------
With --details-enable
and --hostname-match
it's possible to see the current connections to YSQL for a node, for example:
% yb_stats --print-rpcs 27 --details-enable --hostname-match 80:13000
----------------------------------------------------------------------------------------------------
Host: 192.168.66.80; port: 13000, count: 2
----------------------------------------------------------------------------------------------------
192.168.66.80:13000 idle yugabyte client backend ysqlsh 127.0.0.1
192.168.66.80:13000 checkpointer
print-threads
Print current threads from a snapshot.
- --print-threads <snapshot number>: print current threads from a snapshot.
Additional switches:
- --hostname-match: filter by hostname or port regular expression.
Example:
% yb_stats --print-threads 27 --hostname-match 80:9000
--------------------------------------------------------------------------------------------------------------------------------------
Host: 192.168.66.80:9000, Snapshot number: 27, Snapshot time: 2022-12-01 11:24:08.436165 +01:00
--------------------------------------------------------------------------------------------------------------------------------------
hostname_port thread_name cum_user_cpu_s cum_kernel_cpu_s cum_iowait_cpu_s stack
--------------------------------------------------------------------------------------------------------------------------------------
192.168.66.80:9000 pg_supervisorxx-7789 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::pgwrapper::PgSupervisor::RunThread();yb::Subprocess::DoWait();__GI___waitpid
192.168.66.80:9000 CQLServer_reactor-7801 0.070s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Reactor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 RedisServer_reactor-7794 0.280s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Reactor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 TabletServer_reactor-7732 0.070s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Reactor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 acceptorxxxxxxx-7803 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Acceptor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 acceptorxxxxxxx-7796 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Acceptor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 acceptorxxxxxxx-7739 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::Acceptor::RunThread();ev_run;epoll_poll;__GI_epoll_wait
192.168.66.80:9000 heartbeatxxxxxx-7743 0.190s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::tserver::Heartbeater::Thread::RunThread();__pthread_cond_timedwait
192.168.66.80:9000 iotp_CQLServer_3-7800 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_CQLServer_2-7799 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_CQLServer_1-7798 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_CQLServer_0-7797 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();boost::asio::detail::epoll_reactor::run();__GI_epoll_wait
192.168.66.80:9000 iotp_RedisServer_1-7791 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_RedisServer_2-7792 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_RedisServer_3-7793 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_RedisServer_0-7790 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();boost::asio::detail::epoll_reactor::run();__GI_epoll_wait
192.168.66.80:9000 iotp_TabletServer_0-7728 0.150s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_TabletServer_2-7730 0.320s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_TabletServer_3-7731 0.310s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();__pthread_cond_wait
192.168.66.80:9000 iotp_TabletServer_1-7729 0.210s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();boost::asio::detail::epoll_reactor::run();__GI_epoll_wait
192.168.66.80:9000 iotp_call_home_0-7748 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::rpc::IoThreadPool::Impl::Execute();boost::asio::detail::scheduler::run();boost::asio::detail::epoll_reactor::run();__GI_epoll_wait
192.168.66.80:9000 maintenance_scheduler-7745 0.330s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::MaintenanceManager::RunSchedulerThread();__pthread_cond_timedwait
192.168.66.80:9000 rb-session-expx-7736 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::tserver::RemoteBootstrapServiceImpl::EndExpiredSessions();yb::CountDownLatch::WaitFor();yb::ConditionVariable::WaitUntil();__pthread_cond_timedwait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_4-7782 0.230s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_3-7774 0.170s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_2-7773 0.250s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_1-7768 0.210s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer-high-pri_0-7767 0.170s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_11-7761 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_10-7760 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_9-7759 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_8-7758 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_7-7757 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_6-7756 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_5-7755 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_4-7754 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_3-7753 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_2-7747 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_1-7746 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 rpc_tp_TabletServer_0-7741 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();__pthread_cond_wait
192.168.66.80:9000 flush scheduler bgtask-7733 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::BackgroundTask::Run();__pthread_cond_wait
192.168.66.80:9000 server_clientcb [worker]-7744 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 cdc_clientcb [worker]-7737 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 MaintenanceMgr [worker]-7129 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 log-alloc [worker]-7128 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 append [worker]-7127 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 prepare [worker]-7126 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 log-sync [worker]-7125 0.000s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
192.168.66.80:9000 consensus [worker]-7124 0.830s 0.000s 0.000s __clone;start_thread;yb::Thread::SuperviseThread();yb::ThreadPool::DispatchThread();__pthread_cond_wait
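A sketch for finding the busiest threads in this output: sort on the cum_user_cpu_s column. This assumes GNU sort (its general-numeric sort reads the leading number of values such as 0.830s) and that data rows start with the ip address:
yb_stats --print-threads 27 --hostname-match 80:9000 \
  | grep '^192' \
  | sort -k3,3 -gr \
  | head -5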
print-gflags
Print gflags information from a snapshot.
--print-gflags <snapshot number>
: print gflags from a stored snapshot.
Additional switches:
--hostname-match
: filter by hostname or port regular expression.
--stat-name-match
: filter by gflag name regular expression.
Example:
% yb_stats --print-gflags 2 --hostname-match 82:700 --stat-name-match wal
--------------------------------------------------------------------------------------------------------------------------------------
Host: 192.168.66.82:7000, Snapshot number: 2, Snapshot time: 2022-10-18 15:26:20.128980 +02:00
--------------------------------------------------------------------------------------------------------------------------------------
cdc_wal_retention_time_secs 14400
time_based_wal_gc_clock_delta_usec 0
bytes_durable_wal_write_mb 1
durable_wal_write true
interval_durable_wal_write_ms 1000
require_durable_wal_write false
save_index_into_wal_segments false
fs_wal_dirs /mnt/d0
skip_wal_rewrite true
TEST_download_partial_wal_segments false
TEST_pause_rbs_before_download_wal false
TEST_fault_crash_after_wal_deleted 0
print-cluster-config
Print the current cluster configuration from a live cluster or from a snapshot.
--print-cluster-config <snapshot number>
: print the cluster configuration from a stored snapshot.
--print-cluster-config
: print the cluster configuration from a live cluster.
Example:
➜ yb_stats --print-cluster-config
{
"hostname_port": "192.168.66.80:7000",
"timestamp": "2023-03-18T14:29:18.929159+01:00",
"version": 0,
"replication_info": null,
"server_blacklist": null,
"cluster_uuid": "e9e7c5bb-9494-4a56-b3c0-c3b1d9a7caf7",
"encryption_info": null,
"consumer_registry": null,
"leader_blacklist": null
}
This is using the /api/v1/cluster-config
endpoint, and uses the information from the current master leader.
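As a sketch, the same data can be fetched directly with curl (assuming a master listening on 192.168.66.80:7000); the other /api/v1 endpoints can be queried the same way:
curl -s http://192.168.66.80:7000/api/v1/cluster-config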
print-health-check
Print the health check output from the master leader from a live cluster or from a snapshot.
--print-health-check <snapshot number>
: print the health check output from a stored snapshot.
--print-health-check
: print the health check output from a live cluster.
Example:
➜ yb_stats --print-health-check
{
"hostname_port": "192.168.66.82:7000",
"timestamp": "2023-03-18T14:42:09.572026+01:00",
"dead_nodes": [],
"most_recent_uptime": 261,
"under_replicated_tablets": [],
"failed_tablets": null
}
This is using the /api/v1/health-check
endpoint, and uses the information from the current master leader.
print-drives
Print the drives usage information from the masters and tablet servers from a live cluster or from a snapshot.
--print-drives <snapshot number>
: print the drives usage information from a stored snapshot.
--print-drives
: print the drives usage information from a live cluster.
Example:
➜ yb_stats --print-drives
192.168.66.80:7000 /mnt/d0 9.99G 149.83M
192.168.66.80:9000 /mnt/d0 9.99G 149.83M
192.168.66.80:12000 /mnt/d0 9.99G 149.83M
192.168.66.82:12000 /mnt/d0 9.99G 146.08M
192.168.66.82:9000 /mnt/d0 9.99G 146.08M
192.168.66.81:9000 /mnt/d0 9.99G 145.90M
192.168.66.81:7000 /mnt/d0 9.99G 145.90M
192.168.66.82:7000 /mnt/d0 9.99G 146.08M
192.168.66.81:12000 /mnt/d0 9.99G 145.90M
The output shows the following fields:
- endpoint (hostname/ip address and port number)
- path
- total space
- used space
This is using the /drives
URL, which is present on each master and tablet server.
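As a sketch, assuming the sizes are always printed with a K, M or G suffix as shown above, the total and used columns can be turned into a used percentage with awk:
yb_stats --print-drives | awk '
  # convert a size such as 9.99G or 149.83M to bytes
  function bytes(s) {
    n = s + 0
    if (s ~ /G$/) n *= 1024 ^ 3
    else if (s ~ /M$/) n *= 1024 ^ 2
    else if (s ~ /K$/) n *= 1024
    return n
  }
  { printf "%-22s %-10s %5.1f%% used\n", $1, $2, 100 * bytes($4) / bytes($3) }'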
print-tablet-server-operations
Print the current tablet server operations from each tablet server in a live cluster or from a snapshot.
--print-tablet-server-operations <snapshot number>
: print the tablet server operations from a stored snapshot.
--print-tablet-server-operations
: print the tablet server operations from a live cluster.
Example:
➜ yb_stats --print-tablet-server-operations
192.168.66.82:9000 acc6cf799c22457fb722ba9a361c54b9 term: 1 index: 34 WRITE_OP 53479 us. R-P { type: kWrite consensus_round: 0x0000561862d89700 -> id { term: 1 index: 34
192.168.66.81:9000 acc6cf799c22457fb722ba9a361c54b9 term: 1 index: 35 WRITE_OP 40356 us. R-P { type: kWrite consensus_round: 0x00005628f5fc9b00 -> id { term: 1 index: 35
192.168.66.80:9000 56263119ab85438481a3b5865dfc5787 term: 1 index: 34 WRITE_OP 18758 us. R-P { type: kWrite consensus_round: 0x00005605e07ed000 -> id { term: 1 index: 34
192.168.66.80:9000 acc6cf799c22457fb722ba9a361c54b9 term: 1 index: 34 WRITE_OP 149000 us. R-P { type: kWrite consensus_round: 0x00005605e57224c0 -> id { term: 1 index: 34
192.168.66.80:9000 999219c1c7ce4b03bb41b568719f7c26 term: 1 index: 11 UPDATE_TRANSACTION_OP 11560 us. R-P { type: kUpdateTransaction consensus_round: 0x00005605e1b093c0 -> id { term:
The output shows the following fields:
- endpoint (hostname/ip address and port number)
- tablet id
- operation id (term: N, index: N)
- transaction type (WRITE_OP, UPDATE_TRANSACTION_OP)
- total time in flight (N us.)
- description (R-P { type: ...)
This is using the /operations
URL, which is present on each tablet server.
print-master-tasks
Print the current master tasks, the last 20 user-initiated tasks, and the last 100 tasks started in the last 300 seconds, from the master leader.
--print-master-tasks <snapshot number>
: print the master tasks from a stored snapshot.
--print-master-tasks
: print the master tasks from a live cluster.
Example:
➜ yb_stats --print-master-tasks
task done Truncate Tablet kComplete 3.75 min ago 473 ms Truncate Tablet RPC for tablet 0x000056279f4eeb00 -> 56263119ab85438481a3b5865dfc5787 (table t [id=000033e8000030008000000000004000]) (t [id=000033e8000030008000000000004000])
task done Truncate Tablet kComplete 3.75 min ago 575 ms Truncate Tablet RPC for tablet 0x000056279f4ee840 -> 4abf56bde0e843cfa9de8f48ca0e6a71 (table t [id=000033e8000030008000000000004000]) (t [id=000033e8000030008000000000004000])
task done Truncate Tablet kComplete 3.75 min ago 1.39 s Truncate Tablet RPC for tablet 0x000056279f4ee580 -> acc6cf799c22457fb722ba9a361c54b9 (table t [id=000033e8000030008000000000004000]) (t [id=000033e8000030008000000000004000])
The output shows the following fields:
- task category (task done)
- task name (Truncate Tablet)
- task state (kComplete)
- start time (3.75 min ago)
- duration (N ms)
- description (Truncate Tablet RPC for ...)
This is using the /tasks
URL, which is present on each master server.
--print-table-detail
The --print-table-detail
switch takes no argument in order to look up the table detail from a live cluster,
or a snapshot number to look up the table detail from a snapshot.
Because the table detail information is fetched separately for each individual table, it is NOT fetched by default.
To let yb_stats
fetch the additional data for --print-table-detail
, you must add the --extra-data
switch!
The --print-table-detail
switch also requires the extra --uuid
switch to set the UUID for the table to print the details.
In order to obtain the UUID to use for this switch, use the --print-entities
option to obtain a list of tables with their UUIDs.
For YSQL tables, the UUID is not really a UUID, but a large hexadecimal number that is composed of several components,
such as the database OID and the table OID.
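For example, under the assumption that the first four bytes of the id encode the database OID and the last four bytes the table OID, the two can be extracted from the hexadecimal id like this (a sketch, using bash arithmetic):
id=000033e8000030008000000000004000
# first 8 hex characters: database OID; last 8 hex characters: table OID
echo "database oid: $((16#${id:0:8})), table oid: $((16#${id:24:8}))"
# database oid: 13288, table oid: 16384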
Get list of entities to obtain table UUID:
yb_stats --print-entities
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.t.4abf56bde0e843cfa9de8f48ca0e6a71 state: RUNNING
Replicas: (yb-1.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER),)
Tablet: ysql.yugabyte.t.56263119ab85438481a3b5865dfc5787 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER:LEADER),)
Tablet: ysql.yugabyte.t.acc6cf799c22457fb722ba9a361c54b9 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER:LEADER), yb-2.local:9100(VOTER),)
For the table ysql.yugabyte.t, the UUID is 000033e8000030008000000000004000.
yb_stats --print-table-detail --extra-data --uuid 000033e8000030008000000000004000
Table UUID: 000033e8000030008000000000004000, version: 0, type: PGSQL_TABLE_TYPE, state: Running, keyspace: yugabyte, object_type: User tables, name: t
On disk size: Total: 191.69M WAL Files: 132.00M SST Files: 59.69M SST Files Uncompressed: 569.71M
Replication info:
Columns:
0 id int32 NOT NULL PARTITION KEY
1 f1 string NULLABLE NOT A PARTITION KEY
Tablets:
acc6cf799c22457fb722ba9a361c54b9 hash_split: [0x0000, 0x5554], Split depth: 0, State: Running, Hidden: false, Message: Tablet reported with an active leader, Raft: FOLLOWER: yb-1.local LEADER: yb-3.local FOLLOWER: yb-2.local
56263119ab85438481a3b5865dfc5787 hash_split: [0xAAAA, 0xFFFF], Split depth: 0, State: Running, Hidden: false, Message: Tablet reported with an active leader, Raft: FOLLOWER: yb-1.local FOLLOWER: yb-3.local LEADER: yb-2.local
4abf56bde0e843cfa9de8f48ca0e6a71 hash_split: [0x5555, 0xAAA9], Split depth: 0, State: Running, Hidden: false, Message: Tablet reported with an active leader, Raft: LEADER: yb-1.local FOLLOWER: yb-3.local FOLLOWER: yb-2.local
Tasks:
This shows:
- The table UUID again.
- The version. A table gets a new version if it's modified.
- The type. This is a PGSQL_TABLE_TYPE, which means it's a postgres (YSQL) type object. Materialized views and indexes are also PGSQL_TABLE_TYPE objects.
- The state.
- The keyspace (database).
- The object_type. This tells whether the object is a table (user tables), an index (index tables), or a catalog table (system tables). A materialized view is listed as a user table.
- The name.
- The on disk size. This is the total size (to the accuracy YugabyteDB reports) of all tablets.
- Replication info. This will show replication settings as JSON.
- The columns on the DocDB level, along with the DocDB column type, and which columns are part of the partition key (the primary key).
- The tablets. This not only shows the number of tablets, but also how they are split. The output above shows the (default) hash split; see the sketch after this list.
- Tasks. Tasks that can happen at the tablet level are, for example, index backfills.
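The hash ranges shown above split the full 16-bit hash space (0x0000-0xFFFF) into three roughly equal parts; a quick sketch of the boundary arithmetic:
printf '0x%04X\n' $(( 65536 / 3 ))      # 0x5555: start of the second range
printf '0x%04X\n' $(( 2 * 65536 / 3 ))  # 0xAAAA: start of the third range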
--print-tablet-detail
The --print-tablet-detail
switch takes no argument in order to look up the tablet detail from a live cluster,
or a snapshot number to look up the tablet detail from a snapshot.
Because the tablet detail information is fetched separately for each individual tablet, it is NOT fetched by default.
To let yb_stats
fetch the additional data for --print-tablet-detail
, you must add the --extra-data
switch!
The --print-tablet-detail
switch also requires the extra --uuid
switch to set the UUID for the tablet to print the details.
In order to obtain the UUID to use for this switch, use the --print-entities
option to obtain a list of tablets with their UUIDs.
Get a list of entities to obtain tablet UUID:
yb_stats --print-entities
Keyspace: ysql.postgres id: 000033e6000030008000000000000000
Keyspace: ysql.yugabyte id: 000033e8000030008000000000000000
Keyspace: ysql.system_platform id: 000033e9000030008000000000000000
Object: ysql.yugabyte.t, state: RUNNING, id: 000033e8000030008000000000004000
Tablet: ysql.yugabyte.t.4abf56bde0e843cfa9de8f48ca0e6a71 state: RUNNING
Replicas: (yb-1.local:9100(VOTER:LEADER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER),)
Tablet: ysql.yugabyte.t.56263119ab85438481a3b5865dfc5787 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER), yb-2.local:9100(VOTER:LEADER),)
Tablet: ysql.yugabyte.t.acc6cf799c22457fb722ba9a361c54b9 state: RUNNING
Replicas: (yb-1.local:9100(VOTER), yb-3.local:9100(VOTER:LEADER), yb-2.local:9100(VOTER),)
For the table ysql.yugabyte.t, there are 3 tablets. Let's pick 4abf56bde0e843cfa9de8f48ca0e6a71.
yb_stats --print-tablet-detail --extra-data --uuid 4abf56bde0e843cfa9de8f48ca0e6a71
192.168.66.80:9000
General info:
Keyspace: yugabyte
Object name: t
On disk sizes: Total: 21.87M Consensus Metadata: 1.5K WAL Files: 2.00M SST Files: 19.87M SST Files Uncompressed: 189.64M
State: RUNNING
Consensus:
State: Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 0, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
Queue overview: Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 0, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
Watermark:
- { peer: 370474e547cc422ab838282184367b9b is_new: 0 last_received: 4.659 next_index: 660 last_known_committed_idx: 659 is_last_exchange_successful: 1 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 4 last_applied: 4.659 }
- { peer: 3fc85b37b3fd4332bc7ed0bcf128b5de is_new: 0 last_received: 4.659 next_index: 660 last_known_committed_idx: 659 is_last_exchange_successful: 1 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 2 last_applied: 4.659 }
- { peer: 35db27f008cb4a3ba7c7c5b224bacb7a is_new: 0 last_received: 4.659 next_index: 660 last_known_committed_idx: 659 is_last_exchange_successful: 1 needs_remote_bootstrap: 0 member_type: VOTER num_sst_files: 4 last_applied: 4.659 }
Messages:
- Entry: 0, Opid: 0.0, mesg. type: REPLICATE UNKNOWN_OP, size: 6, status: term: 0 index: 0
LogAnchor:
Latest log entry op id: 4.659
Min retryable request op id: 9223372036854775807.9223372036854775807
Last committed op id: 4.659
Earliest needed log index: 659
Transactions:
- { safe_time_for_participant: { physical: 1679241188204604 logical: 1 } remove_queue_size: 0 }
Rocksdb:
IntentDB:
RegularDB:
total_size: 2051458, uncompressed_size: 19655884, name_id: 14, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6273456, uncompressed_size: 59741998, name_id: 13, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6220244, uncompressed_size: 59741913, name_id: 12, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6291230, uncompressed_size: 59709058, name_id: 11, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
192.168.66.82:9000
General info:
Keyspace: yugabyte
Object name: t
On disk sizes: Total: 84.87M Consensus Metadata: 1.5K WAL Files: 65.00M SST Files: 19.87M SST Files Uncompressed: 189.64M
State: RUNNING
Consensus:
State: Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 1, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=0)
Queue overview:
Watermark:
Messages:
LogAnchor:
Latest log entry op id: 4.659
Min retryable request op id: 9223372036854775807.9223372036854775807
Last committed op id: 4.659
Earliest needed log index: 659
Transactions:
- { safe_time_for_participant: { physical: 1679241188201243 } remove_queue_size: 0 }
Rocksdb:
IntentDB:
RegularDB:
total_size: 2051458, uncompressed_size: 19655884, name_id: 14, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6273456, uncompressed_size: 59741998, name_id: 13, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6220244, uncompressed_size: 59741913, name_id: 12, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 6291230, uncompressed_size: 59709058, name_id: 11, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
192.168.66.81:9000
General info:
Keyspace: yugabyte
Object name: t
On disk sizes: Total: 84.87M Consensus Metadata: 1.5K WAL Files: 65.00M SST Files: 19.87M SST Files Uncompressed: 189.63M
State: RUNNING
Consensus:
State: Consensus queue metrics:Only Majority Done Ops: 0, In Progress Ops: 2, Cache: LogCacheStats(num_ops=0, bytes=0, disk_reads=1)
Queue overview:
Watermark:
Messages:
LogAnchor:
Latest log entry op id: 4.659
Min retryable request op id: 9223372036854775807.9223372036854775807
Last committed op id: 4.659
Earliest needed log index: 659
Transactions:
- { safe_time_for_participant: { physical: 1679241187538899 } remove_queue_size: 0 }
Rocksdb:
IntentDB:
RegularDB:
total_size: 5205489, uncompressed_size: 49526687, name_id: 12, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
total_size: 15627161, uncompressed_size: 149320345, name_id: 11, /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004000/tablet-4abf56bde0e843cfa9de8f48ca0e6a71
This shows that with replication factor 3, a single tablet is stored on multiple tablet servers. Therefore, the first thing shown for a tablet (replica) is the hostname/ip address of the tablet server.
- General info:
- Keyspace: The YCQL keyspace or YSQL database name.
- Object name: the table or index name.
- On disk sizes: the sizes (to the accuracy YugabyteDB reports) of this single tablet.
- State.
- Consensus:
- The state.
- Queue overview: this field only contains data if the replica is the leader.
- Watermark: this will list the peers (including the local replica) for the tablet on the leader.
- Messages: this shows entries on the leader.
- Log Anchor:
- The log anchor data is internal administration of the log state.
- Transactions:
- The transactions show the HLC safe time, and might list in-flight transactions on the leader when these are happening; see the sketch after this list.
- Rocksdb:
- IntentDB: the file status of the SST files for the IntentDB.
- RegularDB: the file status of the SST files for the RegularDB. This is the actual data storage.
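The physical component of the safe time above looks like a microsecond Unix timestamp; as a sketch under that assumption (and assuming GNU date), it can be made readable by dropping the last six digits:
# 1679241188204604 microseconds -> 1679241188 seconds since the epoch
date -u -d @1679241188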
This is low-level detail for troubleshooting.
Crontab
In many cases it is convenient to have historical snapshots, so you can "look back in time". This can be done by scheduling yb_stats
via crontab.
- Install yb_stats using the RPM. See install
- Create a directory on a filesystem with enough disk space
mkdir /home/yugabyte/yb_stats-history
In this example, a directory in the home directory of the yugabyte user is used.
It is hard to quantify 'enough disk space': it depends on the size of the cluster.
In general it's best to run yb_stats
in a separate directory, so that it can use its own .env
file.
- Set up the run configuration with
.env
cd /home/yugabyte/yb_stats-history
yb_stats --hosts 10.1.2.0,10.1.2.1,10.1.2.3 --parallel 3
This will trigger ad-hoc mode: press enter, and validate that it fetched the correct hosts and endpoints. Be conservative with parallelism. See parallel
- Schedule
yb_stats
in crontab
crontab -e
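# at minute 5 of every hour: cd into the history directory, take a snapshot,
# and append stdout and stderr to yb_stats_run.out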
5 */1 * * * yb_stats_path="/home/yugabyte/yb_stats-history" && (date && cd $yb_stats_path && /usr/local/bin/yb_stats --snapshot) >> $yb_stats_path/yb_stats_run.out 2>&1
- (Experimental) Let logrotate clean up old yb_stats snapshots:
vi /etc/logrotate.d/yb_stats
File contents:
/home/yugabyte/yb_stats-history/yb_stats.snapshots {
daily
rotate 7
missingok
}
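To check what this configuration would do without rotating or removing anything, logrotate's debug mode performs a dry run:
logrotate -d /etc/logrotate.d/yb_stats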
Mind the path that logrotate scans (/home/yugabyte/yb_stats-history/yb_stats.snapshots
), and the cleaning schedule: with daily
rotation and rotate 7, files that are no longer touched are kept for a week before they are removed.
This currently will leave the removed snapshots in the snapshot.index file.
Troubleshooting
yb_stats
by default prints as little as it can to the screen, and will therefore NOT show issues that it can overcome.
In order to let yb_stats
provide more information about what it is encountering, you can increase the logging level. The default logging level is error
, and an error will also terminate execution. This is how that is done:
Logging level warning: warn
:
RUST_LOG=warn yb_stats
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.82:12000, type: server, namespace: -, table_name: -: RejectedU64MetricValue {
name: "threads_running_thread_pool",
value: 18446744073709551614,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.82:7000, type: cluster, namespace: -, table_name: -: RejectedBooleanMetricValue {
name: "is_load_balancing_enabled",
value: false,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.81:12000, type: server, namespace: -, table_name: -: RejectedU64MetricValue {
name: "threads_running_thread_pool",
value: 18446744073709551614,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.81:7000, type: cluster, namespace: -, table_name: -: RejectedBooleanMetricValue {
name: "is_load_balancing_enabled",
value: true,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.80:12000, type: server, namespace: -, table_name: -: RejectedU64MetricValue {
name: "threads_running_thread_pool",
value: 18446744073709551614,
}
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.80:7000, type: cluster, namespace: -, table_name: -: RejectedBooleanMetricValue {
name: "is_load_balancing_enabled",
value: false,
}
Begin metrics snapshot created, press enter to create end snapshot for difference calculation.
These are statistics that do not fit their type (both issues have been solved, but the fixes have not yet made it to a public release).
The warn
level is still reasonably quiet.
All available log levels, in order of increasing verbosity, are:
- error (default)
- warn
- info
- debug
- trace
Please be aware that beyond info, the amount of output can be high.
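A sketch for capturing verbose output in a file instead of on the terminal, assuming the log lines are written to stderr (the default for Rust's logging output):
RUST_LOG=debug yb_stats --snapshot 2> yb_stats_debug.log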
Advanced logging
If you want to understand more about a specific module, you can enable a logging level for that module only. This requires some understanding of the module system in Rust; however, the module name can be seen in the log output, as shown previously:
[2022-10-21T09:38:56Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.80:7000, type: ...
yb_stats::metrics
is the module and submodule here.
In order to produce trace
level logging for yb_stats::metrics
only, use:
RUST_LOG="yb_stats::metrics=trace" yb_stats
To set different logging levels for different modules, separate them with a comma:
RUST_LOG="yb_stats::metrics=trace,yb_stats::rpcs=info" ./target/release/yb_stats --snapshot
[2022-10-21T10:01:13Z INFO yb_stats::rpcs] perform_rpcs_snapshot
[2022-10-21T10:01:13Z INFO yb_stats::metrics] perform_snapshot (metrics)
[2022-10-21T10:01:13Z INFO yb_stats::rpcs] Could not parse 192.168.66.82:9300/rpcz json data for rpcs, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::metrics] (192.168.66.82:9300) error parsing /metrics json data for metrics, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::rpcs] Could not parse 192.168.66.81:9300/rpcz json data for rpcs, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::rpcs] Could not parse 192.168.66.80:9300/rpcz json data for rpcs, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::metrics] (192.168.66.81:9300) error parsing /metrics json data for metrics, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z INFO yb_stats::metrics] (192.168.66.80:9300) error parsing /metrics json data for metrics, error: expected value at line 1 column 1
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: server, metric_id: yb.ysqlserver, metric_attribute_namespace_name: -, metric_attribute_table_name: -
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: server, metric_id: yb.cqlserver, metric_attribute_namespace_name: -, metric_attribute_table_name: -
[2022-10-21T10:01:13Z WARN yb_stats::metrics] statistic that is unknown or inconsistent: hostname_port: 192.168.66.82:12000, type: server, namespace: -, table_name: -: RejectedU64MetricValue {
name: "threads_running_thread_pool",
value: 18446744073709551613,
}
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: server, metric_id: yb.tabletserver, metric_attribute_namespace_name: -, metric_attribute_table_name: -
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: tablet, metric_id: 5abf0e6155ea4f64860325f6cfd2332a, metric_attribute_namespace_name: system, metric_attribute_table_name: transactions
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: tablet, metric_id: e0ac5a9011874a668654e97ca348833d, metric_attribute_namespace_name: system, metric_attribute_table_name: transactions
[2022-10-21T10:01:13Z TRACE yb_stats::metrics] metric_type: tablet, metric_id: 21962b8b5dbd4a6f99c3f3d5bc0780a6, metric_attribute_namespace_name: yugabyte, metric_attribute_table_name: t