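The login API on the test environment suddenly started failing. Tracing the problem upstream, I found this error message in the log: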
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.)
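From experience, the first half is the standard message for a SQL Server that is down or unreachable over the network, but the trailing "because the system lacked sufficient buffer space or because a queue was full." is rare. An MSDN Blog post I found lists two possible causes: the OS has run out of TCP buffer memory, or the ephemeral TCP ports are used up (by default Windows 2003/XP only hands out ports 1025 through 5000, so fewer than 4,000 are available). On a first look, netstat showed fewer than 20 connections at that moment, nowhere near the limit; as for TCP buffer exhaustion, I found no clear evidence to confirm or rule it out. I identified the SQL Server IP that the API connects to and connected to it from another host without trouble, which rules out a network problem on the SQL Server side and puts the focus back on the API host's own networking. I then modified the API service to add trace logging and got a different error message: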
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - Only one usage of each socket address (protocol/network address/port) is normally permitted.)
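Searching for "Only one usage of each socket address (protocol/network address/port) is normally permitted." led to MSDN documentation, and the root cause again points to TCP ports running out. Connecting from the API host to another IIS server (telnet xxx.xxx.xxx.xxx 80) also failed, which confirms that the API host cannot establish outbound connections (I could still connect to it over VNC for remote debugging, so inbound connections were unaffected). Repeated telnet tests occasionally succeeded but mostly failed. My guess is that the remaining TCP ports or buffer resources were scarce, so success came down to luck, but netstat never showed thousands of concurrent connections, so there is no supporting evidence.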
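The repeated telnet test above could be scripted so that the success rate and the exact OS error are recorded rather than eyeballed. The following Python is only a minimal sketch and not what was used in the original investigation; the function name `probe_outbound` and its parameters are made up for illustration, and the host and port are placeholders you would supply.

```python
import socket


def probe_outbound(host, port, attempts=10, timeout=3.0):
    """Try to open (and immediately close) a TCP connection several times.

    Returns one entry per attempt: "ok" on success, otherwise the
    exception type and message reported by the OS.
    Hypothetical helper for illustration only.
    """
    results = []
    for _ in range(attempts):
        try:
            # create_connection resolves the address, connects, and
            # raises OSError (or a subclass) if the connection fails.
            with socket.create_connection((host, port), timeout=timeout):
                results.append("ok")
        except OSError as exc:
            # On Windows the message here would carry the Winsock error,
            # for example 10055 (WSAENOBUFS) or 10048 (WSAEADDRINUSE).
            results.append(f"{type(exc).__name__}: {exc}")
    return results
```

Calling `probe_outbound("xxx.xxx.xxx.xxx", 80)` on the API host and counting the `"ok"` entries would give a rough success ratio. Each attempt itself consumes an ephemeral port, so on a host suspected of port exhaustion it seems wise to keep `attempts` small.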
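The problem disappeared after a reboot. A reboot resets TCP and related resources, so it makes sense that any port or buffer shortage clears and everything returns to normal. But since netstat never confirmed that the resources had been eaten up by a large number of connections, the real cause remains a mystery; filed under the X-Files for now~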
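If the problem comes back, one way to collect the missing netstat evidence would be to tally TCP connections by state, since connections in states such as TIME_WAIT still hold a port. The sketch below is an assumption-laden illustration, not part of the original troubleshooting: it assumes the Windows `netstat -an` column layout (Proto, Local Address, Foreign Address, State), and the function names are invented for this example.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state from `netstat -an` text.

    Assumes the Windows column layout: Proto, Local Address,
    Foreign Address, State. Hypothetical helper for illustration.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Header and UDP lines are skipped; UDP rows have no state column.
        if len(fields) >= 4 and fields[0].upper().startswith("TCP"):
            counts[fields[3]] += 1
    return counts


def snapshot_tcp_states():
    """Run `netstat -an` on the local machine and tally its TCP states."""
    completed = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    )
    return count_tcp_states(completed.stdout)
```

Logging the result of `snapshot_tcp_states()` every few minutes on the API host would show whether any single state grows toward the ephemeral port limit before the errors start.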