Transactd High Availability (THA)
Overview
Transactd High Availability (THA) is available since Transactd version 3.5. A redundant THA configuration with multiple servers provides a highly available system with much shorter downtime than a single-server configuration. It also makes load balancing easy, for example by directing read operations to a slave.
THA has the following features:
- It provides all the functions needed for failover: fault detection, master switching, hostname resolution, and so on.
- Minimal downtime on a fault (within a few seconds).
- Automatic master failover.
- Alive monitoring for the master server.
- Hostname resolution between virtual hostnames and real hostnames.
- No dependence on other products such as the OS, external devices, or DNS.
- Usable by changing only a few lines of the application program.
- Switchover.
- Health checking.
- Available in a wide range of environments: MySQL/MariaDB, Linux/Windows/Mac OSX, C++/PHP/Ruby/COM.
- It needs no libraries other than the Transactd plugin and Transactd clients.
Index
- Structures of THA
- Configure THA
- Note of THA applications
- THA administration
- THA with both of Transactd Client and SQL
- haMgr reference
Structures of THA
THA consists of a master server and one or more slave servers that use MySQL GTID replication. Optionally, you can balance read operations across the slave servers. Alternatively, the master can handle all access while the slaves serve only as spares for failover. At least one slave is required; two or more are preferable.
All servers must run the same version of MySQL (5.6 or later) or MariaDB (10.0 or later), because THA uses GTID replication. See the MySQL / MariaDB documentation to set up replication with GTID.
- Multi-master replication is not currently supported. Only a single master is available.
- We recommend semi-synchronous replication to prevent data loss.
Mechanism
THA consists of these elements:
- A master server and slave servers
- Hostname resolver (THNR)
- Server role
- Client reconnecting
- Alive monitoring
- Failover program (haMgr)
Hostname resolver (THNR)
THNR (Transactd HostName Resolver) is a hostname resolver embedded in the client library.
You can easily use it in your application.
The application accesses the database with a virtual hostname instead of a real hostname.
THNR is embedded in the tdclc library. It converts virtual hostnames to real hostnames.
Master switching works with THNR alone, without relying on other products such as the OS, network devices, or DNS.
One THNR instance is shared by all threads in a process.
Server role
Typically, the server role is master or slave. The master could be detected from the replication configuration, but THA requires the role to be specified explicitly in server variables, in order to support more complicated configurations in the future. On failover, the failover program updates the server roles accordingly.
THNR detects a change of server role by checking whether the role requested by the client matches the server's actual role. This mechanism prevents problems such as writing to a slave by mistake.
The role requested by the client is determined by whether the virtual hostname it uses is the master one or the slave one.
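The role check described above can be pictured with a small sketch. This is only an illustration of the idea, not THNR's actual code: the `Role` type, the function names, and the virtual hostname `master_host` are assumptions made for this example.

```cpp
#include <string>

// Hypothetical model of a server role; not a Transactd type.
enum class Role { Master, Slave };

// Role implied by the virtual hostname the client used.
// "master_host" is an assumed virtual hostname for the master.
inline Role requestedRole(const std::string& virtualHost,
                          const std::string& masterName = "master_host")
{
    return virtualHost == masterName ? Role::Master : Role::Slave;
}

// True when the server's actual role matches the requested one.
// A mismatch means the role has changed and the resolved host is stale.
inline bool roleMatches(const std::string& virtualHost, Role actualRole)
{
    return requestedRole(virtualHost) == actualRole;
}
```

With this model, a request through the master virtual hostname that reaches a server whose role is now slave is rejected instead of being written to the slave.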
Client reconnecting
The client can reconnect to the server when a network error occurs.
If reconnection succeeds, the current record and lock status are recovered,
and the application can continue processing as if nothing had happened.
However, the client does not try to reconnect if a transaction has been started with beginTrn() or beginSnapshot().
In this case, the application receives an error.
It can abort the transaction and report the result to the user.
Reconnection is attempted on the next operation.
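One way an application might handle the in-transaction case is sketched below. It assumes the application wraps each unit of work (begin, operations, commit) in a callable. `TrnResult` and `runWithRetry` are hypothetical names and no Transactd API is called here. Retrying the whole unit of work is an option for the application, not something THA does by itself.

```cpp
#include <functional>

// Hypothetical outcome of one unit of work; not a Transactd type.
enum class TrnResult { Committed, NetworkError };

// Runs `work` up to maxAttempts times. After a network error the
// caller-supplied `abort` is invoked; the next attempt is the "next
// operation" on which the client gets a chance to reconnect.
inline bool runWithRetry(const std::function<TrnResult()>& work,
                         const std::function<void()>& abort,
                         int maxAttempts)
{
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (work() == TrnResult::Committed)
            return true;
        abort();
    }
    return false;  // give up and report the failure to the user
}
```

An application that prefers to report the error to the user immediately can pass `maxAttempts = 1`.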
Alive monitoring
THNR performs alive monitoring and calls the failover program. If a network error caused by the death of the master server occurs, reconnection is attempted according to the error status. At that time THNR is notified that this is a reconnection.
When THNR receives the reconnection notification, it discards its cache and searches for the correct host. If it cannot connect to the master during this process, it concludes that the master server has died. It then calls the failover program, switches the master, and searches for the new master.
This alive monitoring is implemented by handling client access errors, which has these advantages:
- Time-based polling is not required.
- Polling puts no load on the server.
- There is no time lag between a failure and its detection.
- No access errors occur between a failure and its detection.
Failover program (haMgr)
The failover program haMgr is an independent process program (c.f. haMgr reference).
Failover is performed if haMgr is callable by THNR (that is, haMgr is on the PATH).
If it is not callable, the application simply receives a network error.
Only the first client that starts the failover process can perform it. Clients that try later cannot lock the HA object on the server, so their attempts fail.
Clients that failed to acquire the lock immediately start resolving hostnames with THNR. During this time, they try to lock the HA object for up to 60 seconds to check whether the servers are still in the failover process. Once the failover process has finished and the lock has been released, they obtain the new hostname.
Minimize downtime
The mechanism above enables automatic failover to a new master and lets processing continue with almost no downtime. The time required for failover depends on the number of slaves, but detection takes several seconds and switching takes less than 1 second.
No errors occur between detection of the master server fault and the start of the failover program, because the client itself monitors the server's liveness directly. Only transactions that were in progress when the master went down fail.
THA also supports switchover, which switches the master intentionally. It waits for transactions on the current master to finish, so no data is lost.
Configure THA
Now we explain how to set up THA. The examples assume these servers:
Server | IP address |
---|---|
Master | 192.168.0.2 |
Slave 1 | 192.168.0.3 |
Slave 2 | 192.168.0.4 |
- Configure replication between a master and multiple slaves, with MySQL 5.6 or later or MariaDB 10.0 or later. c.f: details of MySQL/MariaDB GTID replication (Japanese).
  At this point, use the same configuration on all servers, such as the username and password for Transactd, the channel, and the username and password for replication.
- Specify the hostname or IP address, like report-host=192.168.0.x, in my.cnf on each server. If you use a port other than the default (8610), specify it like report-host=192.168.0.x:8611.
  This hostname is used by the start function and by health_check, which are described later.
- Restart the master server with the boot option --transactd-startup_ha=1.
- Put the haMgr program in a directory from which the Transactd client application can call it (set the path).
- Check failover availability with the haMgr -c health_check command.
  ./haMgr64 -c health_check -o 192.168.0.2 -s 192.168.0.3,192.168.0.4 -u root -p abcd
  2016-07-05T18:34:28 Starting health check...
  SLAVE_LIST=192.168.0.3,192.168.0.4
  192.168.0.2: Role = Master OK!
  192.168.0.2: HA lock OK!
  192.168.0.3: Role = Slave OK!
  192.168.0.3: Failover is disabled NG!
  192.168.0.3: HA lock OK!
  192.168.0.3: channel name=
  192.168.0.3: SQL thread running OK!
  192.168.0.3: IO thread running OK!
  192.168.0.3: SQL thread delay=0
  192.168.0.4: Role = Slave OK!
  192.168.0.4: Failover is disabled NG!
  192.168.0.4: HA lock OK!
  192.168.0.4: channel name=
  192.168.0.4: SQL thread running OK!
  192.168.0.4: IO thread running OK!
  192.168.0.4: SQL thread delay=0
  2 errors detected.
  2016-07-05T18:34:29 Done!
  There are two "Failover is disabled NG!" errors, but they are not a problem at this point. We will enable failover later.
- Add the THNR start function to the Transactd client application.
  haNameResolver::start("master_host", "slave_host", "192.168.0.2,192.168.0.3,192.168.0.4", 1, "root", "abcd");
  See the haNameResolver SDK document (Japanese) for details of this function.
- Replace the real hostname with the virtual hostname in the Transactd client application.
  /* Access to the master */
  db->open("tdap://root@master_host/db/test?pwd=abcd");
  /* Access to the slave */
  db->open("tdap://root@slave_host/db/test?pwd=abcd");
- After testing the application, enable failover with the haMgr program.
  ./haMgr64 -c set_failover_enable -v 1 -o 192.168.0.2
  2016-07-06T10:21:04 Done!
The transactd-startup_ha option specifies the server role at startup.
- transactd-startup_ha=0, or no entry in my.cnf: the server starts as a slave.
- transactd-startup_ha=1: the server starts as a master.
The role can be changed without restarting the server, using the haMgr program or the Transactd API.
If you restart a server after the master has been changed by switchover,
the server starts up with the role specified in this boot option, even though the master has changed.
To restore the role the server had last time, add 4 to the transactd-startup_ha value.
- transactd-startup_ha=4: restore the last role if role information from last time exists. Start as a slave if it does not.
- transactd-startup_ha=5: restore the last role if role information from last time exists. Start as a master if it does not.
However, a failed server still holds the master role in its role information from last time.
After it is repaired, you need to specify transactd-startup_ha=0 explicitly so that it starts as a slave.
The transactd-startup_ha value is saved in data_dir/transactd_srv_master.info.
Role restoring is disabled if you delete this file.
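The four values described above (0, 1, 4, 5) can be read as "default role" plus "add 4 to restore". The sketch below decodes them that way. The struct and function names are made up for this illustration, and the behavior for any other value is not described in this document, so the sketch should not be relied on for those.

```cpp
// Decodes the transactd-startup_ha values listed in this section
// (0, 1, 4, 5). Names are illustrative only; not part of Transactd.
struct StartupHa {
    bool restoreLastRole;  // true when 4 was added to the value
    bool masterByDefault;  // role used when nothing is restored
};

inline StartupHa decodeStartupHa(int value)
{
    StartupHa s;
    s.restoreLastRole = value >= 4;
    s.masterByDefault = (value % 4) == 1;
    return s;
}
```

For example, `decodeStartupHa(5)` yields a server that restores its previous role when role information exists and otherwise starts as a master.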
Note of THA applications
Two database objects and transactions in the same thread
For example, assume that our program has two database object instances, DB_A and DB_B.
First, a transaction is started with DB_A, and then an error occurs with DB_B during that transaction.
In this case, if DB_A and DB_B are in the same thread, reconnection is not attempted.
Reconnecting DB_B has to wait until the DB_A transaction finishes,
but the program cannot continue processing because DB_A and DB_B are in the same thread.
THA administration
Keep health
To maintain high availability, check daily that failover will succeed when a fault occurs.
haMgr has a health_check function for this purpose.
Keep THA healthy by scheduling this test. c.f: Health check.
The health check function does not validate the accounts specified by repl_user and username.
Please verify by other means that the account has the privilege to perform the CHANGE MASTER TO operation.
Check whether failover was done
CHANGE MASTER TO ... is logged to the MySQL error.log (or the event log on Windows) when THNR performs a failover. Monitor this log to check whether a failover has occurred.
Detailed log of failover process
A more detailed log is written to the client log. (If THNR calls haMgr, the log is redirected to the error log.
If you execute haMgr yourself, the log is shown in the console.)
Specifically, the log looks like this (c.f: error log):
2016-07-07T09:29:33 Starting fail over...
SLAVE_LIST=192.168.0.2,192.168.0.3,192.168.0.4
HP6730B:8611: promote to master
HP6730B:8611: channel name=
HP6730B:8611: set role=MASTER
HP6730B:8612: channel name=
HP6730B:8612: stop slave all
HP6730B:8612: change master to new master, pos=slave_pos
HP6730B:8612: start slave
2016-07-07T09:29:33 Done!
THA with both of Transactd Client and SQL
THA delivers maximum performance with applications that use only the Transactd API. If high availability is the most important issue in your project, we recommend modifying your application accordingly.
THA does not interfere with SQL access. It is friendly to the Transactd API and transparent to SQL access. If you need both, there are several ways to combine THA elements. The major choices are explained below.
Use THA mainly
Set up THA according to the instructions above. Modify the SQL part of your application to use haNameResolver::master() as the master hostname and haNameResolver::slave() as the slave hostname. This causes no problems in normal operation.
A problem occurs when the first access after a master fault comes from SQL. If the first access comes from the Transactd API, THA performs fault detection, failover, and hostname resolution automatically. But if the first access comes from SQL, these recovery steps are not performed automatically, and errors occur. The recovery steps are performed at the next Transactd API access.
To speed up recovery, modify the application to try a Transactd API access whenever an SQL access fails.
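The fallback suggested above could be structured as follows. This is only a sketch under stated assumptions: both callables are supplied by the application, `sqlWithTransactdFallback` is a hypothetical name, and what counts as a cheap Transactd API access (for example, opening a database through the virtual hostname) is left to the application.

```cpp
#include <functional>

// sqlAccess:      the application's SQL call; returns false on error.
// transactdProbe: any Transactd API access; returns false on error.
inline bool sqlWithTransactdFallback(const std::function<bool()>& sqlAccess,
                                     const std::function<bool()>& transactdProbe)
{
    if (sqlAccess())
        return true;
    // SQL access does not trigger THA recovery, so touch the Transactd API
    // once; that access is what drives fault detection and failover.
    if (!transactdProbe())
        return false;
    // Retry the SQL access once. The callable is expected to look up the
    // master hostname again before connecting.
    return sqlAccess();
}
```

The retry happens only once here. An application that needs stricter guarantees would add its own limits and logging around it.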
If you do not need automatic failover, using THA mainly is the best choice. Projects for which manual switchover on a fault is sufficient can use it without any problems. In this case, disable failover:
./haMgr64 -c set_failover_enable -v 0 -o 192.168.0.2
"The first access after the master fault" is counted per client process. All threads in one client process share THNR.
Use MySQL's HA
A traditional MySQL HA setup is constructed like this:
- Monitoring with heartbeat
- Failover with the mysqlfailover program
- Change hostname with VirtualIP
- Update router ARP cache
In this approach, enable reconnection in the client application so that it reconnects to the new server after failover.
/* In the first of the application */
nsdatabase::setEnableAutoReconnect(true);
This approach has some problems:
- Accesses fail from the fault until failover finishes.
- Transactions with the Transactd API fail on switchover, because failover programs other than haMgr do not wait for them.
- Updating the IP address or ARP depends on the OS or other devices.
Use MySQL's HA mainly, and use haMgr for failover
If you use haMgr as the failover program, transactions with the Transactd API do not fail at switchover,
because haMgr waits for them to finish.
haMgr reference
haMgr is the program that controls THA. It has these features:
- Switchover
- Manual failover
- Demote master to slave
- Change failover enable/disable
- Change server role
- Health check
Options
command line option:
-c [ --command ] command [switchover | failover | demote_to_slave | set_failover_enable | set_server_role | health_check]
-o [ --cur_master ] current master host name
-n [ --new_master ] new master host name
-C [ --channel ] new master channel name
-P [ --repl_port ] new master port
-r [ --repl_user ] new master repl user
-d [ --repl_passwd ] new master repl password
-O [ --repl_option ] option params for change master(ex:MASTER_CONNECT_RETRY=30)
-s [ --slaves ] slave list for failover
-a [ --portmap ] port map ex:3307:8611
-v [ --value ] value (For set_failover_enable or set_server_role)
-u [ --username ] transactd username
-p [ --password ] transactd password
-R [ --readonly ] 0 | 1: When it is 1, the READONLY variable will be set ON to slaves and OFF to a master
-D [ --disable_demote ] 0 | 1: disable old master demote
-command
Specify the command.
-cur_master
Specify the current master.
-new_master
Specify the server that becomes the new master on switchover.
-channel
Specify the channel name used to connect to the new master on switchover. The default value is "". It is available on MySQL 5.7 or later, or MariaDB 10.0 or later.
-repl_port
Specify the replication port number used by the CHANGE MASTER TO command.
-repl_user
Specify the username used for replication.
-repl_passwd
Specify the password for repl_user.
-repl_option
Specify additional option(s) for the CHANGE MASTER TO command, e.g. MASTER_CONNECT_RETRY=30. Separate options with commas.
-slaves
Specify a comma-separated list of slave server names for failover.
-portmap
Specify the Transactd port and the MySQL port if you changed them from the defaults, e.g. -a 3307:8611.
-value
Specify the value used by the set_failover_enable and set_server_role commands.
-username
Specify the username used for Transactd access.
-password
Specify the password for username.
-readonly
Specify 0 or 1. Use 1 to set the server variable READONLY on switchover or failover: it is set OFF on the master and ON on the slaves. READONLY affects only SQL access. Transactd API access is not affected by it.
-disable_demote
Specify 0 or 1. Use 1 to detach the old master on switchover (it is then not demoted to a slave).
Switchover
Change the master.
e.g. Change the master from 192.168.0.2 to 192.168.0.3.
./haMgr64 -c switchover -o 192.168.0.2 -n 192.168.0.3 -R1 -r replication_user -d abcd -u root -p xxxx
Switchover changes the master and demotes the old master to a slave at the same time.
- If --disable_demote 1 is specified, the old master is not demoted. It is detached.
- If --readonly 1 is specified, the old master is set to READONLY, and SQL access to it is limited to read only.
Failover
Perform failover manually.
e.g. Fail over using the two current slaves 192.168.0.3 and 192.168.0.4.
./haMgr64 -c failover -s 192.168.0.3,192.168.0.4 -u root -p xxxx
Demote the master
e.g. Add the old master 192.168.0.2 as a slave of the current master 192.168.0.3.
./haMgr64 -c demote_to_slave -o 192.168.0.2 -n 192.168.0.3 -r replication_user -d abcd -u root -p xxxx
Change enable / disable of failover
Use 1 to enable failover. Use 0 to disable it.
e.g. Enable failover on the current master 192.168.0.3.
./haMgr64 -c set_failover_enable -v 1 -o 192.168.0.3 -u root -p xxxx
Change server role
Use 1 to set the server as the master. Use 0 to set it as a slave.
e.g. Set 192.168.0.3 as the master.
./haMgr64 -c set_server_role -o 192.168.0.3 -v 1 -u root -p xxxx
Health check
Health check reports the replication status, server roles, lock status, and failover enablement of the master and its slaves to stdout.
The program returns 0 if no error was found. If any error was found, it returns 1.
e.g. Health check with the master 192.168.0.2 and the two slaves 192.168.0.3 and 192.168.0.4.
./haMgr64 -c health_check -o 192.168.0.2 -s 192.168.0.3,192.168.0.4 -u root -p xxxx
2016-07-05T18:34:28 Starting health check...
SLAVE_LIST=192.168.0.3,192.168.0.4
192.168.0.2: Role = Master OK!
192.168.0.2: HA lock OK!
192.168.0.3: Role = Slave OK!
192.168.0.3: Failover is enabled OK!
192.168.0.3: HA lock OK!
192.168.0.3: channel name=
192.168.0.3: SQL thread running OK!
192.168.0.3: IO thread running OK!
192.168.0.3: SQL thread delay=0
192.168.0.4: Role = Slave OK!
192.168.0.4: Failover is enabled OK!
192.168.0.4: HA lock OK!
192.168.0.4: channel name=
192.168.0.4: SQL thread running OK!
192.168.0.4: IO thread running OK!
192.168.0.4: SQL thread delay=0
No errors detected.
2016-07-05T18:34:29 Done!
The health check shows whether the system can fail over on a fault. Running it regularly enhances the reliability of the system.
To perform the health check correctly, it is important to pass the same values as those given to haNameResolver::start().
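To keep the health check and the application in step, a scheduled job could build the command line from the same host list the application uses. The sketch below only assembles the command shown in this section. The function name is hypothetical, `./haMgr64` is the binary name used in the examples here, and the arguments are not shell-quoted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds the health_check command line shown in this section.
// No shell quoting is applied; pass only trusted values.
inline std::string healthCheckCommand(const std::string& master,
                                      const std::vector<std::string>& slaves,
                                      const std::string& user,
                                      const std::string& password)
{
    std::string list;
    for (std::size_t i = 0; i < slaves.size(); ++i) {
        if (i > 0)
            list += ',';
        list += slaves[i];
    }
    return "./haMgr64 -c health_check -o " + master + " -s " + list +
           " -u " + user + " -p " + password;
}
```

The resulting string can be run by whatever scheduler or process API the job uses. As described above, an exit code of 0 means no error was found and 1 means at least one error was found.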