Resolving ORA-25408 in OCCI and RAC with TAF enabled

Environment

DB: Oracle 12.2.0.1 RAC 2 instances
Client <-> APP <-> DB

Symptoms

Shutdown primary instance and when client wants to logout and login, it could receive ORA-25408: can not safely replay call

Diag

Collect tcpdump on APP server, which captures ALL traffic between DB server AND client and here is what really happens:

# Short version:
1. App connects to DB and everything is fine
2. Shutdown current primary instance
3. DB sends FIN request to APP, which puts the socket on APP to CLOSE_WAIT
4. APP did not reply with the 2nd FIN for a long time
5. Client sends login request to APP, APP redirect the validating request (stored procedure) to DB with the CLOSE_WAIT socket
6. Since current instance has already been shutdown, DB Reset the stale connection
7. APP reconnects to another instance and tries to replay the validating request
8. Since our failover type is SELECT and procedure is not supported thus ORA-25408 raised

# Long version:

# DB (199.31.217.81) shutdown, asks APP server (172.17.217.84) to terminate TCP connection

2019-12-12 11:52:18.765076 IP 199.31.217.81.1521 > 172.17.217.84.41340: Flags [F.], seq 135540, ack 17685, win 532, options [nop,nop,TS val 488703983 ecr 2757020596], length 0
0x0000: 0050 5691 4030 6805 ca21 d6e5 0800 4500 .PV.@0h..!....E.
0x0010: 0034 dc12 4000 3906 3fda c71f d951 ac11 .4..@.9.?....Q..
0x0020: d954 05f1 a17c d13d 5be4 a7ee aa95 8011 .T...|.=[.......
0x0030: 0214 92a3 0000 0101 080a 1d21 07ef a454 ...........!...T
0x0040: cbb4

# App Server (172.17.217.84) receives this FIN request

2019-12-12 11:52:18.804896 IP 172.17.217.84.41340 > 199.31.217.81.1521: Flags [.], ack 135541, win 1424, options [nop,nop,TS val 2757207726 ecr 488703983], length 0
0x0000: 6805 ca21 d6e5 0050 5691 4030 0800 4500 h..!...PV.@0..E.
0x0010: 0034 4d38 4000 4006 c7b4 ac11 d954 c71f .4M8@.@......T..
0x0020: d951 a17c 05f1 a7ee aa95 d13d 5be5 8010 .Q.|.......=[...
0x0030: 0590 25fe 0000 0101 080a a457 a6ae 1d21 ..%........W...!
0x0040: 07ef

# However, App Server (172.17.217.84) did NOT send the other FIN request to DB server,
# which means that the TCP socket is in CLOSE_WAIT state on app server.

# The Client session did not interact with DB
# 100 seconds later …

# Client (200.31.139.243) send login request to APP Server (172.17.217.84)

2019-12-12 11:53:58.068139 IP 200.31.139.243.50707 > 172.17.217.84.8006: Flags [P.], seq 1:340, ack 1, win 16425, length 339
0x0000: 0050 5691 4030 6805 ca21 d6e5 0800 4500 .PV.@0h..!....E.

# Interesting things happen here
# App (172.17.217.84) Redirect this login request to DB (199.31.217.81) with the above HALF OPEN socket (CLOSE_WAIT)
# Essentially, it calls procedure PROC_VALIDATE_USER

2019-12-12 11:53:58.069718 IP 172.17.217.84.41340 > 199.31.217.81.1521: Flags [P.], seq 17685:17765, ack 135541, win 1424, options [nop,nop,TS val 2757306990 ecr 488703983], length 80
0x0000: 6805 ca21 d6e5 0050 5691 4030 0800 4500 h..!...PV.@0..E.
0x0010: 0084 4d39 4000 4006 c763 ac11 d954 c71f ..M9@.@..c...T..
0x0020: d951 a17c 05f1 a7ee aa95 d13d 5be5 8018 .Q.|.......=[...
0x0030: 0590 264e 0000 0101 080a a459 2a6e 1d21 ..&N.......Y*n.!
0x0040: 07ef 0000 0050 0600 0000 0000 1169 dafe .....P.......i..
0x0050: ffff ffff ffff ff01 0000 0000 0000 0001 ................
0x0060: 0000 0003 77db feff ffff ffff ffff 1200 ....w...........
0x0070: 0000 0300 0000 0000 0000 0000 0000 0012 ................
0x0080: 5052 4f43 5f56 414c 4944 4154 455f 5553 PROC_VALIDATE_US
0x0090: 4552 ER

# After sending the request to DB (199.31.217.81), APP (172.17.217.84) sends the 2nd FIN request

2019-12-12 11:53:58.069778 IP 172.17.217.84.41340 > 199.31.217.81.1521: Flags [F.], seq 17765, ack 135541, win 1424, options [nop,nop,TS val 2757306990 ecr 488703983], length 0
0x0000: 6805 ca21 d6e5 0050 5691 4030 0800 4500 h..!...PV.@0..E.
0x0010: 0034 4d3a 4000 4006 c7b2 ac11 d954 c71f .4M:@.@......T..
0x0020: d951 a17c 05f1 a7ee aae5 d13d 5be5 8011 .Q.|.......=[...
0x0030: 0590 25fe 0000 0101 080a a459 2a6e 1d21 ..%........Y*n.!
0x0040: 07ef

# DB (199.31.217.81) replies with RST, meaning 1521 port is already closed and app needs to connect to the other instance ..

2019-12-12 11:53:58.070207 IP 199.31.217.81.1521 > 172.17.217.84.41340: Flags [R], seq 3510459365, win 0, length 0
0x0000: 0050 5691 4030 6805 ca21 d6e5 0800 4500 .PV.@0h..!....E.
0x0010: 0028 c988 4000 3906 5270 c71f d951 ac11 .(..@.9.Rp...Q..
0x0020: d954 05f1 a17c d13d 5be5 0000 0000 5004 .T...|.=[.....P.
0x0030: 0000 b578 0000 c368 0000 0000 ...x...h....

# App server (172.17.217.84) connects to the other instance (199.31.217.82)

2019-12-12 11:53:58.071660 IP 172.17.217.84.37322 > 199.31.217.82.1521: Flags [S], seq 60403879, win 29200, options [mss 1460,sackOK,TS val 2757306992 ecr 0,nop,wscale 7], length 0

Conclusion

1. Do NOT reuse CLOSE_WAIT connection in application
2. Close stale sockets timely

Reference:

Implement TAF with OCCI (DOC ID 605544.1)

Why should connections in a custom connection pool be explicitly cleaned/repaired?

Example : Application has 500 connections in a pool, 10 of them are busy (i.e doing a roundtrip) and 490 are free/idle. If the database instance goes down now, then failover will be done for the 10 active connections and the user requests on these connections need to be restarted. When each of the 490 connections are picked up next by the application, failover will be done and the OCCI/OCI API will return one of the above errors, forcing the user request to be restarted. The application can avoid these errors on the 490 idle connections by ‘repairing’ or ‘purging’ these connections by using the knowledge that TAF has occurred on the other 10 connections.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this:
search previous next tag category expand menu location phone mail time cart zoom edit close