Skip to content

Conversation

yknoya
Copy link
Contributor

@yknoya yknoya commented Jul 28, 2025

Problem

When loading host.db, it is incorrectly determined as "incompatible," and the file fails to load. This issue does not occur on the master branch.
An example of how to reproduce the issue is shown below:

  1. Set a value greater than 0 for proxy.config.cache.hostdb.sync_frequency in records.config:
CONFIG proxy.config.cache.hostdb.sync_frequency INT 3
  1. Delete the existing host.db file (to ensure that no old version is reused).
  2. Start traffic_server.
  3. Wait for host.db to sync (in this example, wait 3 seconds).
  4. Restart traffic_server.
  5. Check diags.log:
$ grep "host.db" diags.log | tail -n2
[Jul 28 09:52:16.055] traffic_server WARNING: Incompatible cache at /opt/ats/9.2.x-9f6abab/var/trafficserver/host.db, not loading.
[Jul 28 09:52:16.055] traffic_server WARNING: Error loading cache from /opt/ats/9.2.x-9f6abab/var/trafficserver/host.db: -1

Cause

Compatibility is checked when loading host.db using the following logic:
https://ghe.corp.yahoo.co.jp/cdn/trafficserver/blob/16f04bd6a849998d1db9348c5504164328541803/iocore/hostdb/P_RefCountCache.h#L593-L597

The RefCountCacheHeader::compatible function checks whether the following member values of RefCountCacheHeader are equal:

  • magic
  • version
  • object_version

https://ghe.corp.yahoo.co.jp/cdn/trafficserver/blob/16f04bd6a849998d1db9348c5504164328541803/iocore/hostdb/RefCountCache.cc#L49-L53

However, the current implementation mistakenly compares object_version with version, and since the values differ, the cache is deemed incompatible:

  • object_version: 1.1
  • version: 1.0

object_version was updated to 1.1 in the following PR, which likely introduced the issue starting from ATS 9.1.0:
#7264

Fix

This patch updates RefCountCacheHeader::compatible to compare object_version values correctly.

@github-project-automation github-project-automation bot moved this from In progress to Ready to Merge in 9.2.x Branch and Release Jul 28, 2025
@yknoya
Copy link
Contributor Author

yknoya commented Jul 31, 2025

Thank you for the review.
However, I’ve found another bug related to HostDB, so I plan to push an additional commit.
The issue is that the cache loaded from host.db is unintentionally reinitialized and incorrectly marked as stale.

Copy link
Contributor

@JosiahWI JosiahWI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting for additional commit. Marking as draft can also prevent accidental merge in the future.

@github-project-automation github-project-automation bot moved this from Ready to Merge to For Review in 9.2.x Branch and Release Jul 31, 2025
@yknoya yknoya marked this pull request as draft July 31, 2025 23:38
@yknoya
Copy link
Contributor Author

yknoya commented Aug 1, 2025

Problem

A bug was discovered where the cache loaded from host.db is not being used.
Upon debugging, it was found that the following logic in the probe function is executed, resulting in the cache being marked as stale:

} else if (!ignore_timeout && r->is_ip_timeout() && !r->serve_stale_but_revalidate()) {
HOSTDB_INCREMENT_DYN_STAT(hostdb_ttl_expires_stat);
return make_ptr((HostDBInfo *)nullptr);
}

Cause

The cause of the stale determination was identified in HostDB::unmarshall, specifically in the following section:

memcpy((void *)ret, buf, size);
// Reset the refcount back to 0, this is a bit ugly-- but I'm not sure we want to expose a method
// to mess with the refcount, since this is a fairly unique use case
ret = new (ret) HostDBInfo();

This code first populates HostDBInfo using memcpy, then calls the constructor via placement new.
As described in the comment, the constructor is invoked to reset the reference count of the base class RefCountObj to zero.
However, calling the constructor also reinitializes members that use default member initializers:

uint64_t key{0};
// Application specific data. NOTE: We need an integral number of
// these per block. This structure is 32 bytes. (at 200k hosts =
// 8 Meg). Which gives us 7 bytes of application information.
HostDBApplicationInfo app;
union {
IpEndpoint ip; ///< IP address / port data.
unsigned int hostname_offset; ///< Some hostname thing.
SRVInfo srv;
} data;
unsigned int hostname_offset{0}; // always maintain a permanent copy of the hostname for non-rev dns records.
unsigned int ip_timestamp{0};
unsigned int ip_timeout_interval{0}; // bounded between 1 and HOST_DB_MAX_TTL (0x1FFFFF, 24 days)

Debug logs added before and after the constructor call showed that fields such as ip_timestamp were unintentionally reset, causing the cache to be incorrectly judged as stale:

[Aug  1 10:29:01.022] traffic_server DEBUG: <I_HostDBProcessor.h:182 (unmarshall)> (hostdb) HostDBInfo before new: key=17900179173150759707 hostname_offset=80, ip_timestamp=1754011708, ip_timeout_interval=900, _iobuffer_index=0, m_refcount=2
[Aug  1 10:29:01.022] traffic_server DEBUG: <I_HostDBProcessor.h:192 (unmarshall)> (hostdb) HostDBInfo after new: key=0 hostname_offset=0, ip_timestamp=0, ip_timeout_interval=0, _iobuffer_index=0, m_refcount=0

The patch used to add the debug logs is shown below:

patch
diff --git a/iocore/hostdb/I_HostDBProcessor.h b/iocore/hostdb/I_HostDBProcessor.h
index 0be0a81b4..2356337d7 100644
--- a/iocore/hostdb/I_HostDBProcessor.h
+++ b/iocore/hostdb/I_HostDBProcessor.h
@@ -178,10 +178,22 @@ struct HostDBInfo : public RefCountObj {
     HostDBInfo *ret = HostDBInfo::alloc(size - sizeof(HostDBInfo));
     int buf_index   = ret->_iobuffer_index;
     memcpy((void *)ret, buf, size);
+
+    Debug("hostdb", "HostDBInfo before new: key=%lu hostname_offset=%u, "
+      "ip_timestamp=%u, ip_timeout_interval=%u, _iobuffer_index=%d, m_refcount=%d",
+      ret->key, ret->hostname_offset, ret->ip_timestamp, ret->ip_timeout_interval,
+      ret->_iobuffer_index, ret->refcount());
+
     // Reset the refcount back to 0, this is a bit ugly-- but I'm not sure we want to expose a method
     // to mess with the refcount, since this is a fairly unique use case
     ret                  = new (ret) HostDBInfo();
     ret->_iobuffer_index = buf_index;
+
+    Debug("hostdb", "HostDBInfo after new: key=%lu hostname_offset=%u, "
+      "ip_timestamp=%u, ip_timeout_interval=%u, _iobuffer_index=%d, m_refcount=%d",
+      ret->key, ret->hostname_offset, ret->ip_timestamp, ret->ip_timeout_interval,
+      ret->_iobuffer_index, ret->refcount());
+
     return ret;
   }
 
diff --git a/iocore/hostdb/RefCountCache.cc b/iocore/hostdb/RefCountCache.cc
index af9cf0fd2..8f58c3767 100644
--- a/iocore/hostdb/RefCountCache.cc
+++ b/iocore/hostdb/RefCountCache.cc
@@ -49,5 +49,5 @@ RefCountCacheHeader::operator==(RefCountCacheHeader const &that) const
 bool
 RefCountCacheHeader::compatible(RefCountCacheHeader *that) const
 {
-  return this->magic == that->magic && this->version == that->version && this->object_version == that->version;
+  return this->magic == that->magic && this->version == that->version && this->object_version == that->object_version;
 };

Fix

To fix this issue, the construction of HostDBInfo in HostDB::unmarshall must satisfy the following conditions:

  • The reference counter (m_refcount) should be initialized.
  • All other member variables should not be reinitialized.

I considered ways to initialize the reference counter without calling the constructor.
However, I concluded that invoking the constructor is also necessary to reassign the virtual function table pointer (vptr):

memcpy((void *)ret, buf, size);
// Reset the refcount back to 0, this is a bit ugly-- but I'm not sure we want to expose a method
// to mess with the refcount, since this is a fairly unique use case
ret = new (ret) HostDBInfo();

As a result, I chose the following workaround:
Back up the member variable values before calling the constructor, then restore them after the constructor has been executed.

@yknoya yknoya marked this pull request as ready for review August 1, 2025 03:18
@yknoya
Copy link
Contributor Author

yknoya commented Aug 1, 2025

Thank you for waiting.
I've pushed the additional commit and added details about the issue I found.

@JosiahWI JosiahWI dismissed their stale review August 1, 2025 11:18

Thanks for the additional commit. Your description of the hostdb issue is very helpful.

@bneradt
Copy link
Contributor

bneradt commented Aug 4, 2025

[approve ci autest]

@github-project-automation github-project-automation bot moved this from For Review to Ready to Merge in 9.2.x Branch and Release Sep 4, 2025
@ezelkow1 ezelkow1 merged commit 6dedd1e into apache:9.2.x Sep 4, 2025
15 checks passed
@ezelkow1 ezelkow1 moved this from Ready to Merge to Done for v9.2.x in 9.2.x Branch and Release Sep 4, 2025
@yknoya yknoya deleted the fix-hostdb-compatibility branch September 5, 2025 01:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done for v9.2.x
Development

Successfully merging this pull request may close these issues.

4 participants