One of our clients Gorkana was bought by Durrants. As such I have been working on integrating and merging data. As a result I wrote a script to synchronize data between MySQL and Oracle, which is presently a one way synchronization.

The implementation

The script(s) to synchronize data was broken into a few scripts.

  • Automatically generate triggers for tables to push data into changelogs
  • Generate sql to backfill changelog tables
  • For additional data from model classes push changesets into the database and queued into batches to be written into the database.

Key points of the script.

  • Only synchronize key sets of data (Pushing all data was not required).
  • Only works on changesets of data, data is pushed into a changelog with the field name, the from value and the to value, etc.
  • Identify relations within the database
  • Recurse through the dependencies of any foreign keys and push change sets in an order for an atomic commit.
  • Performance was not a key requirement, however the original dataset to be synchronized has become a lot larger. Roughly 27 million rows with a fresh set of changelogs
  • The changelogs can be flushed at any point.
  • Data can be filtered

Connectivity

Whilst setting this up I came across a few issues. The infrastructure for Gorkana is located at Bytemark in Manchester, and Durrants in London.
The first task was to setup a VPN connection between Durrants and our office, as well as our production machines.

We had issues setting up the VPN and tried varying software VPN clients including racoon and openvpn with a site to site connection. This chewed up quite some time and I ended up using vpnc as we were unable to establish a site to site connection in adequate time.

Whilst using vpnc the connection dropped after long periods of time and I wrote a few scripts to reestablish the tunnel if it was down. Ping is restricted on the network, which is why I used “netcat -w 3 -z ${ORACLE_SERVER} ${PORT}” to test the connection.

/etc/vpnc/vpn.wrapper.sh download – Ensure single execution
/etc/vpnc/routes.sh download – Tests and attempts to reconnect VPN.

Blocking IO

Whilst testing, the VPN connection would drop out and one of oci_connect, oci_parse, oci_execute would block IO indefinitely on a read syscall. Despite the php documentation stating “OCI8 can hang until a TCP timeout occurs and an error is returned, which might be several minutes.”.

There are a few ways I attempted to resolve the problem.

Oracle Configuration

I added connect and send timeouts to “$ORACLE_HOME/network/admin/sqlnet.ora” with the following.

TCP.CONNECT_TIMEOUT=10
SQLNET.SEND_TIMEOUT=300
Fast Application Notification (FAN)

Oracle supports “Fast Application Notification (FAN) Support“, which effectively pings the oracle server and fails over to other connections in the pool. The only issue with this is that the network I am connecting to prohibit ping. As such not making this method feasible.

In order to enable FAN add to the php ini “oci8.ping_interval=1” and “oci8.events = On” and enable on oracle;

SQL> execute dbms_service.modify_service(
            SERVICE_NAME        => 'GMD',
            AQ_HA_NOTIFICATIONS => TRUE);
Forking

I have yet to see the script become blocked on IO after making the above changes or tested over a long period, however if this fails, I can fork the script and either set an alarm or terminate the child from the parent process using IPC .

Other

Oracle Multi Table Inserts

I first pushed the data to Oracle using Multi Table Inserts, which has a limitation in Oracle 10g where by the sum of all the INTO columns cannot exceed 999 (ORA-24335). As an example 100 rows (1000 columns divided with 10 columns = maximum 100 rows to insert). This has been corrected in 11g.

INSERT ALL
  INTO changelog_users (changetype, fieldname fromval, toval) VALUES  ('INSERT', 'fname', '','Andrew')
  INTO changelog_users (changetype, fieldname fromval, toval) VALUES  ('INSERT', 'sname', '','Johnstone')

I switched this to union the results, which avoids this limitation.

INSERT INTO changelog_addresses (addresses_changelog_id,date_created,changetype,fieldname,fromval,toval,entity_id)  SELECT '11833904',(to_date('2011-01-18 15:44:37', 'yyyy-mm-dd hh24:mi:ss')),'INSERT','is_primary','','0','45012' FROM dual UNION ALL
 SELECT '11833905',(to_date('2011-01-18 15:44:37', 'yyyy-mm-dd hh24:mi:ss')),'INSERT','is_primary','','0','45013' FROM dual

Filtering and optimization

Before pushing data to oracle I rewrote queries to take advantage of associative fieldnames in order, so I do not have to filter data in PHP. As an example the following converts enum types to boolean values to insert directly into oracle.

SELECT `cl`.*, IF(toval IS NULL,NULL, IF (toval = "YES",0,1)) AS `toval` FROM `users_changelog` AS `cl`  INNER JOIN `users` AS `u` ON u.users_id = cl.entity_id WHERE (synchronized = 'NO') AND (u.user_type = "JOURNALIST") AND (`fieldname` IN ('search_hidden_gds','is_primary_contact'))

Queries are batched into chunks of 10,000 rows to insert into Oracle and grouped by the corresponding tables to insert. Database queries to source data from MySQL were reused within filters to ensure only necessary data was pushed across, as only a subset was required to be pushed.

There are a few things that need to be optimized yet, updating a single table with 14 million rows is taking 44 seconds to update and the index is not being used with the between statement. However once the data has been processed it no longer needs to be retained. As such can simply truncate the data.

UPDATE addresses_changelog SET synchronized = "YES", last_sync=NOW() WHERE synchronized = "NO" AND addresses_changelog_id BETWEEN 11833904 AND 11840954
Finished Query MySQL in 44.525539875

It takes 0.2 seconds to insert batches of 10k into oracle an and currently the MySQL queries have not been optimized, but execute between 0.2 to 40 seconds on tables that range up to 18 million rows. The data to be synced after the initial transfer of data will be fairly minimal as the application is not write intensive.

Ideally I need to fork the script when executing statements, which will vastly speed up the script.