Network Status
  Albury Local Internet Network Status checked on 05-Jun-2023 20:20 -ALImon-

  Core Router                 0.2
  Albury Core                42.0
  Albury GTF                 33.9
  nameserver 1                0.0
  nameserver 2                0.2
  nameserver 3                0.2
  nameserver 4               13.9
  Mail Server                 0.1
  webserver 1                 0.2
  webserver 2                 0.0
  webserver 3                 0.1
  webserver 5                 0.4
  webserver 6                 0.4
  Time Server                 0.1
  Darkfighter PTZ            39.8
  Weather 1                  45.9
  Albury Lightning Sensor    45.1
  Corryong Lightning Sensor  16.1
  Wagga Lightning Sensor      -
  ADSB feeder                42.4
  TTN Gateway                39.6

  Key: UP/optimal, Slow, Slight Loss, Excess Loss, Down, Restored, Alarm, DNS Fail

07:03	 A SAN controller failure has seen several servers go offline. Engineers
	 got replacement parts quickly. Most services were back running at 8:45
	 but one server took forever to complete file system consistency checks
	 and didn't come back online until 9:37.
	 Auth, Phone and email all affected. Disruption to DNS will have affected
	 various unrelated services until restoration was complete.

01:40	 Loss of connectivity to a small number of hosts. A 3rd-party
	 router is showing multiple failures. Engineers are working on it.
	 Router restored 02:30, however only internal traffic is flowing. More
	 analysis underway. 06:45 corrupted firewall rules rebuilt and all working.

22:30	 BGP failure on core router saw pretty much all connectivity lost.
	 Reload of router by technicians at 23:00 restored service.
	 Investigations continuing.
19:00	 Network connections in WAIT_STATE inexplicably increasing exponentially.
	 No source identified, performance appears unaffected but not comfortable
	 leaving it without explanation. Subsystem being restarted after 374 days.
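	 A state buildup like the one above is usually diagnosed by tallying
	 socket states from the kernel tables. A minimal sketch (assumptions: a
	 Linux host, and that the log's "WAIT_STATE" maps to one of the standard
	 TCP wait states such as TIME_WAIT or CLOSE_WAIT), parsing /proc/net/tcp:

```python
# Tally TCP connection states by parsing /proc/net/tcp (Linux-specific).
# The 4th whitespace-separated column is the socket state as a hex code
# (per the kernel's tcp_states.h).
import os
from collections import Counter

TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def tcp_state_counts(path="/proc/net/tcp"):
    counts = Counter()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            code = line.split()[3]
            counts[TCP_STATES.get(code, code)] += 1
    return counts

if os.path.exists("/proc/net/tcp"):
    print(tcp_state_counts())
```

	 Sampling these counts over time (e.g. from cron) would show whether a
	 subsystem restart actually clears the growth or it resumes.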
16:40	 Remarkably similar outage to 18-Feb. Sudden loss of all connectivity.
	 Problem identified as failed Redback. Engineers despatched to the
	 datacentre to install/configure the spare. Services restored 18:39.
	 Waiting on post-incident report.

11:25	 Sudden loss of all connectivity to all customers and all servers.
	 Upstream experienced a "power event" within the datacentre that
	 has destroyed both primary and secondary power supplies on their
	 aggregation switch, effectively isolating every part of the network.
	 Engineers on site installing backup router, and rebuilding configuration.
	 Seeing customers come back online at 14:12, restoring most of our core
	 infrastructure including authentication, mail and phones. Some servers
	 are still unreachable; we have no ETR but it is being worked on.

02:00	 Over the last couple of hours, performance issues have been observed
	 with several hosts, ultimately being identified as a failure of a SAN.
	 Due to subsequent loss of ISCSI connections, multiple servers have
	 rebooted. Although most servers were restored soon after, phone, alarms,
	 monitoring, weather, graphing, some DNS and a few other services were
	 unable to be restored until 09:00, and all services ran degraded
	 until nominal performance was restored around 14:30.
09:00	 Have lost all our meteorological, Albury cameras, lightning tracking,
	 amateur radio wormhole, IRLP, ADSB, websdr etc feeds due to being out
	 of town and suffering yet another UN-NOTIFIED NBN outage. At least it
	 was only 6 hours not the 12 they now suggest it may take! Restored just
	 before 15:00. How many more times, nbn???

06:30	 We are advised there is a "Major incident impacting various POIs in NSW.
	 Services are experiencing no data flow. POIs include: Albury, Campbelltown,
	 Coffs Harbour, Gosford, Hamilton, Liverpool, Maitland, Nowra, Tamworth,
	 Wagga, Berkeley Vale. Engineers are investigating with highest priority."

07:35	 Loss of external connectivity to our main servers. Engineers
	 investigating found upstream was resetting BGP. Fixed and
	 back running 08:05. Root cause being researched.

01:03	 Scheduled work by our upstream warns of a couple of minutes
	 downtime while routes change. Appears AAPT failed to accept
	 some of the new routes, resulting in some services being
	 unreachable outside our network. Engineers finally contacted
	 at 6:30am and all services restored at 6:47am.
	 Process failure analysis underway.

14:02	 Partial loss of connectivity to several hosts in S1 datacentre.
	 Servers were all reachable again in about 90 seconds.
	 Investigations are underway to find the cause and prevent any repeat.

04:00	 We have been advised of an "urgent works" requirement by the datacentre
	 and are expecting an outage of 30-60 minutes from 4am.

09:06	 Became aware that we have not been getting RADIUS auth requests.
	 Problem identified: the telco's RADIUS server died at 6:27
	 and went undetected by their monitoring. Service restored 9:58
	 after talking with telco engineers, who rebooted their server.

18:00	 We are upgrading some of our Melbourne infrastructure, and are
	 anticipating up to 60 minutes of server downtime for our mel-2
	 and mel-3 servers. These are primarily quaternary nameserver,
	 backup monitoring and some VPN services. No impact is expected
	 to any customers.

11:45	 The shutdown of several servers planned for 11am was deferred
	 45 mins due to unplanned access difficulties at the
	 datacentre. All hosting on syd-5, dataretention and liveweatherviews
	 systems is anticipated to be down for 30 minutes while a new
	 storage subsystem is installed, requiring servers to be relocated
	 within the rack, and additional fibre interfaces to be installed.
	 ILO back 12:00
	 All other services back 12:12

00:07	 Appears AAPT have an aggregation switch failure and half of our
	 upstream provider's network is down. They're working on it but no ETR.
	 All ADSL and NBN services affected.
	 (Resolved 01:00)

10:00	 Due to power infrastructure upgrades at our office, some
	 services (billing queries etc) will be offline for approx
	 1hr. (Work completed. System was offline 10:08-10:48am)

10:12	 Billing, accounts and enquiries, office phones, weather
	 and aviation data all offline. Construction workers 400m
	 away have put an excavator through the main street cable.
	 Waiting on repair crew.
	 (13:15. Repair crew still at least 3 hours away. Taken
	 action myself, it's only a 30-pair cable, have done a
	 temporary repair to at least get back online)

07:19	 Some systems demonstrating degraded performance.
	 Within a few minutes, virtually all systems were non-responsive.
	 Half-million dollar SAN has crashed, and vendor-provided
	 configuration contains an error preventing failover to the
	 secondary unit! Rebuild under way, but it's taking forever.
	 Finally recovered and all systems back online at 15:39

10:47	 Packet loss on all external links. Under investigation.
	 (Partially Restored 11:02 but some paths still showing loss)
	 (Fully restored 12:00, massive DDoS taking down all links)

03:02	 syd-1 server became unresponsive. Engineers called, services
	 restored 03:28. Investigation into cause to commence later today.
	 Some websites, most authentication, mail affected.

10:45	 All services to both datacentres are isolated. Engineers are
	 enroute to sites now to see what has happened. No ETR.
	 (Restored 12:46, waiting on word as to the cause)

18:30	 We're declaring an urgent maintenance window of 30 minutes
	 for work on one of the blade servers.
	 Outage expected to be 10 minutes. Affected services will
	 include email, authentication and some web services.
	 Shutdown at 18:23, back online 18:43.

06:37	 Loss of connectivity to all servers in both datacentres in
	 Sydney. Engineers investigating.
	 Resolved and all services back 09:27. Awaiting full report.

06:30	 Due to a developing fault in a fibre-optic service, engineers
	 will be replacing a link at around 6:30am. All services, except
	 for our 3G mobile services, will be unavailable during this
	 time. Duration is expected to be less than 1 minute.
	 Work was completed ahead of time. Outage commenced 05:49:50
	 and full connectivity restored 05:50:30  (40 seconds outage)

16:00	 It is with some sadness, I report the decommissioning of our
	 longest-running server. "Starone" was built on 1-April-1997
	 and brought into service shortly after, where it has run 24/7
	 ever since. After starting to throw hard disk errors a while
	 ago, a replacement platform has been brought online to take
	 over all of its tasks. 18 years 9 months and 12 days.
10:54	 Works completed. New servers running, all tested services are
	 back online, no mail missed. Will continue to check everything
	 has migrated properly and monitor for any problems. In the event
	 users do encounter issues, please call 0409 578 660.

10:30	 A 30-minute maintenance window declared for migration of our
	 remaining servers from Global Switch datacentre at Ultimo to
	 our new facilities at S1.

15:10	 Notified by datacentre that critical switch fabric has failed, and
	 an emergency maintenance window was declared to replace it. All
	 services down at 15:12, restored approx 15:22. Not anticipating
	 any more issues. This is also the explanation for the brief blip
	 at 14:29 this afternoon.

14:29	 Brief loss of connectivity to all core servers. Investigating.

17:21	 Loss of Australian VoIP services. Upstream carrier appears to have
	 changed their infrastructure without advising wholesalers. After
	 finally getting someone there to identify the cause, it was a quick
	 fix to restore services. Most incoming calls redirected to mobiles
	 and all services restored by 18:45

21:43	 Intermittent problems with several of our servers over the last 2
	 days has been identified as a "perfect storm" confluence of three
	 events. 1. A controller failure in the SAN, resulting in degraded
	 IO performance; 2. ISCSI driver problem in the new ESXI VM host
	 causing further SAN problems and 3. Loss of a blade server, causing
	 additional load on the remaining heads, exacerbating a bad situation.
	 Problems resolved by relocating all our hosts onto a brand new SAN
	 and new higher-performance servers. We are assured there should be
	 no further performance issues!

09:02	 SAN connectivity issues in the datacentre resulted in two servers
	 losing file systems and rebooting. Little impact to customers, some
	 webservers (CMS systems) affected for approx 12 minutes, cameras,
	 weather data etc inoperative for the same time.

16:30	 Mail, authentication and webservers lost connectivity.
	 Investigating root cause. VMs restarted, systems back online
	 at 16:40.

10:00	 Major disruption to infrastructure at our upstream. Claimed to be due
	 to the leap-second inserted at 23:59:59 UTC (ie, 10AM local time),
	 causing fibre-optic links to lose timing integrity and shutdown.
	 Required powercycling and disabling of time sync at multiple sites.
	 Restored 11:25.
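	 For context: assuming this entry refers to the leap second inserted on
	 30-Jun-2015, the "10AM local time" gloss checks out for the Australian
	 east coast (UTC+10, no DST in winter). A quick sanity check:

```python
# Sanity check (assumption: the 30-Jun-2015 leap second, inserted as
# 23:59:60 UTC): confirm the last pre-leap UTC second falls at ~10AM AEST.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+, uses system tzdata

last_utc_second = datetime(2015, 6, 30, 23, 59, 59, tzinfo=timezone.utc)
local = last_utc_second.astimezone(ZoneInfo("Australia/Sydney"))
print(local)  # 2015-07-01 09:59:59+10:00; the leap second follows immediately
```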

05:09	 Connectivity loss at transit provider. Cause unknown, but their tech
	 people are working on it. No ETA or cause known yet. Will update as
	 I get anything. All our equipment is working, traffic is dying 1 or 2
	 hops into the carrier's network.
	 Restored 6:57am. Word back from the carrier is that they were doing
	 "network maintenance" that went seriously wrong. They had not issued
	 maintenance notice or hazard warnings because in their view it was
	 "inconsequential maintenance". It is unclear why both our primary and
	 backup links failed as they are supposed to be with different carriers
	 on different infrastructure.

06:45	 Connectivity to both datacentres lost! Completely unrelated to all the
	 other server migration works. Post-incident analysis shows that the
	 main route-server lost its mind at 6:45. The backup server should have
	 taken over, but as Murphy would have it - the backup server was away
	 being upgraded so that full seamless-failover functionality would work
	 between both sites! Primary server rebuilt, backup server will be back
	 in a couple of days and will be reinstalled then. (This server is not
	 part of our infrastructure and we were not advised it was being taken
	 out of service - an oversight that is being addressed!)
	 All services restored between 09:15 and 09:22

20:27	 ali-syd-4 has been relocated and decommissioned
	 $ uptime
 	 8:26PM  up 654 days,  6:41, 1 user, load averages: 0.01, 0.01, 0.00

20:17	 The second of this weekend's planned infrastructure upgrades has completed
	 without any evident problems. Syd2 has been migrated to our new server
	 platform at a different datacentre. This server runs (amongst other things)
	 most of our monitoring system, VoIP system and weather site.
	 Like syd3, it has been a faithful, reliable and hard working server.
	 [root@ALI-SYD-2 /]# uptime
	 08:40PM  up 556 days, 20:45, 12 users, load averages: 1.20, 0.80, 0.78

11:14	 The first of this weekend's planned infrastructure upgrades has completed
	 without any evident problems. One of our oldest remaining servers (syd3)
	 has been migrated to our new server platform at a different datacentre.
	 It's been a faithful and reliable server, will be a pity to turn it off!
	 [root@ALI-SYD-3 /]# uptime
	 11:40AM  up 821 days, 13:50, 24 users, load averages: 0.00, 0.00, 0.02
	 Services were stopped on old syd3 at 11:14:30, the final data sync
	 completed and the new server brought online at 11:27:05

10:31	 ProxyRadius server restored, authentication working again. The carrier is
	 now investigating why their failover system failed to work, and apologises
	 for any inconvenience.

06:02	 Our upstream provider has had a failure in part of their infrastructure,
	 resulting in them not passing us authentication requests. They are working
	 on it but have no ETR. This will affect any dial-up customer, and anyone
	 on ADSL whose modem drops off-line and has to re-authenticate.

20:35	 Everything is back online and running. Problems getting other people in
	 to the restricted access areas of secure datacentres caused significant
	 delays. We have now identified the cause of these mysterious and intermittent
	 "non-responsive" periods and replacement hardware will be ordered tomorrow.

16:42	 Our main authentication/mail/web server has become non-responsive.
	 Support staff are all interstate; backup staff are being recalled to the
	 datacentre.

08:36	 Our main mail/authentication/web server has again become non-responsive.
	 System remotely restarted, back online 8:55. Suspect hardware identified
	 and replacements being arranged.

08:48	 Our main mail/authentication/web server has become non-responsive.
	 System restarted remotely and back online 8:53. Investigation underway.

07:00	 We are relocating our Melbourne rack to a different part of the datacentre.
	 This will result in approx 2hrs downtime. This will only affect tertiary
	 nameservers and some monitoring. Customer services, mail, authentication,
	 accounting, websites etc will be unaffected.

18:22	 Due to a failed power supply, the ALI offices are currently without
	 power or on very limited power. Some services may be unavailable
	 or will be intermittently available until further notice, but we
	 hope replacement equipment will be here within 24 hours.
	 Affected services will include on-line bill enquiries and payments,
	 weather, lightning, some skycam data and aircraft tracking systems.

Previous status reports (by year): 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996