Saturday, April 20, 2024


Disaster recovery and fault tolerance

SOME manufacturers may claim that off-the-shelf computer
products are as reliable as the best proprietary DVR solutions but this is not
the case. Most computer systems operate in an environment that does not demand
perfect operation and there is less emphasis on reliability and onboard
redundancy.

Better DVRs are custom-built for their role with powerful
streaming capabilities, bulletproof operating systems, multiple ports and
either mirrored HDDs or onboard RAID storage. They also have the ability to
support multiple monitors and analogue CCTV cameras. As part of your disaster
recovery plans you need to think hard about which type of solution will
offer the greatest reliability.

Having said this, there are also computer components and
installation techniques that allow duplication of vital comms paths, power
sources and administrative functions. Applying backup systems during planning
is what will guarantee system integrity in the long term, whether you use DVRs or
off-the-shelf hardware.

As clever installers will be aware, one of the first
things you need to undertake when considering fault tolerance and disaster
recovery is impact analysis. What will complete failure of a networked video
surveillance system mean for an organization? What would failure of a networked
access control system mean? You need to assess the results of failure and then
work backwards to avert risks and establish safeguards.

Central to the disaster recovery process is sitting down
with the management of an organization and spelling out to them the likely
results of a series of disaster scenarios. Senior management of a big site may
assume that in the event of catastrophic power, phone or Internet failure they
will have the unlimited support of the networked access system, while in truth
that system may be hampered by limited UPS support and guaranteed to fail
open within 2 hours of site power failure.

The last thing you want as an installer (or as a security
manager) is to foster unrealistic expectations of system performance in the
event of network failure. Networks are touchy beasts and their multiplicity of
essential elements – firmware, hardware and software – can lead to unusual and
unexpected failures that may take many hours or even several days to put right.

In a recent incident, an IT solutions provider we know of
had a customer experience complete server failure. Just to make things more
interesting this was topped off with trouble with a hot replacement server. The
full extent of the problem was not fully appreciated early on – junior techs
were on the site initially and they battled the problem without reporting its
seriousness for most of a day.

The upshot of all this was that a large company was denied
Internet connection or email services for 2-and-a-half days while the entire
technical staff of the ISP tried to rebuild the failed system. In the end the IT
company kept the business but it was a near-run thing and the stress on the
technical team and company management was intense.

Apply this scenario to the servers supporting video
surveillance at a casino, airport, metropolitan rail system or any large
iconic, industrial or defense site and you have serious problems that
would mean partial closure of operations at great cost to end user and security
contractor.

Clearly it’s vitally important that you establish what
the user expects from their system and apply those demands in the process of
system design. If the user wants the sort of data recovery redundancy that will
demand RAID 1, make sure you find that out before the system is installed, not
after a hard drive failure. And security managers – know what you want and make
your demands very clear from the outset.

When working on disaster-capable networked systems,
security teams and the IT team will need to sit down and work together on
problems of this complexity. Not only are you going to need to establish the
results of security system failure, you’ll also need to work out what sort of
problems are likely to lead to such failures. The bigger the system and the
bigger the network, the more difficult it will be to establish the variables
that will relate to system failure. Just to complicate matters there may be
partial disasters as well as total disasters.

When you think disaster recovery you’re not just thinking
of natural disasters, fires or terrorist attacks that might cause structural
damage, or upset the utilities that support the system. You also need to think
about local disasters that might cause destruction and disruption to the system
itself.

The sort of disasters that are likely to impact on a
networked electronic security system include lightning strike, total power
failure, loss of communications between node zero and control room, failure of
controlling PC, failure of switches, routers or hubs, system intrusion (virus,
hacker attack, etc), failure of HDDs and software bugs. Once you have a list of
potential problems then you can start thinking about what’s required to recover
from them.

A key issue with disaster recovery is putting together a
plan that will bring a system back from failure and the best way to do this is
to compartmentalize various tasks and assign a team leader to each area of the
system. This sounds complicated but it may simply mean giving leadership in the
event of a particular problem to the individual with greatest relevant
expertise.

Think hard about people. There may be some worst case
scenarios that could mean key members of staff may be unable to perform their
duties and the responsibility will need to be carried by other team members. If
possible you need to duplicate chains of command in other offices. Management
at an interstate office might take over, and vice versa.

Your recovery plan will be broad. There will be issues
with the physical components like the cable plant and the patch panel on one
hand, while there may be problems with control software or remote
communications systems on another.

Just to make matters harder there are also guaranteed to
be areas of potential failure where a security electronics team has no business
getting involved. These areas may be handled by the systems department, or in
the event of external lighting failure, responsibility may go to in-house
maintenance crews or contractors.

In each and every case, the recovery plan needs to be
broken down to its smallest parts. In order to get this element right you need
detailed procedures, up-to-date contact lists and guaranteed fast
communication. There may also be areas where training is required to ensure
appropriate support. Such training should be included in disaster recovery
procedure manuals.

Managing the networked security system’s disaster
recovery plans is a job for the security manager, or a competent and well respected
assistant manager or supervisor, not a junior. This individual will be
responsible for contacting team leaders, as well as overseeing upgrades to the
plan that will come as personnel and equipment change. It sounds simple enough
but getting this part right will require constant monitoring.

Another key element to the disaster recovery process is
establishing and maintaining test procedures. Once you’ve established a
recovery plan it’s not going to be much use if that plan has never been tested
in a simulated disaster.

Remote locations

Depending on the nature of the installation, recovery may
mean that a control room at a remote location takes over the responsibility of
managing the system using a secondary comms channel and a dedicated power
source.

The system may also be designed with distributed
intelligence so components can continue to function even if the control room is
not operational. And in surveillance applications on large sites, a central
core of cameras in the highest security areas may be double-wired to a
standalone machine with its own power source while security officers patrol the
rest of the site.

Something that’s easy to do on larger sites is to
physically separate vital system components so that a local disaster can’t destroy
an entire control and storage solution. This could be as simple as spreading
servers around a number of different server rooms or keeping RAID or DAT
backups in another location from your servers/DVRs.

Almost every big security office has always kept its
backup audio, video and event files behind the mantrap and in the same location
as primary storage devices. Even if all you do is keep a single server and
controller PC outside the control room and in another part of the facility, it
will be a major enhancement to your disaster recovery capabilities.

When thinking about an alternate management and storage
site you’ll need to talk to the IT department but the ability to move the
security operation off site and still function effectively is vital in the
event of disasters like fires, chemical spills or any threat that causes
complete evacuation of a site.

Modern networked access control and video surveillance
systems are admirably equipped to offer this capability. A little foresight
during system planning should allow either the staff at another major office,
or the control room team to use remote workstations – laptops if necessary – to
manage the system externally. 

If there are plans to relocate in the event of trouble
then you need to think about staffing the remote location. This means serious
consideration must be given to issues like network access, training and sharing
of information at the highest levels.

Important with all IT infrastructure is thinking about
things like policy-based management, root cause analysis and knowledge bases to
broaden the ability of teams to make the best possible decisions when working
without direct guidance of senior management. The last thing you want is a
comprehensive recovery plan that falls over because the only person who knows
how it works is on holiday.

Something else to consider is maintenance of a full
schematic diagram of security networks and all their peripheral equipment. In
the event of the destruction of a site this sort of detail allows fast rebuilding.

As mentioned earlier in this article, redundancy is vital
and the earlier it’s incorporated into a system design the better. Any component
whose operation is essential for the system’s overall functionality must be
duplicated if there’s any hope of maintaining operation in the event of a
disaster.

Data recovery with RAID options

Disaster recovery planning needs to be systematic and
thoroughly thought out. One of the most important things to think about is data
management. This can be a big issue for security departments with large numbers
of cameras whose storage requirements might be enormous. Many security managers
are flat out getting storage enough for 15 images per second held 7 days, let
alone having video servers or DVR hard drives backed up off site.
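
To put rough numbers on that storage problem, here is a minimal sketch of the arithmetic. The 15 images per second and 7 days come from the figures above; the camera count and the 20 KB average image size are assumptions picked purely for illustration, and real figures depend heavily on resolution and compression.

```python
def storage_bytes(cameras, images_per_second, days, bytes_per_image):
    """Raw storage needed to hold every image for the whole retention period."""
    seconds = days * 24 * 60 * 60
    return cameras * images_per_second * seconds * bytes_per_image


# Assumed for illustration only: 32 cameras, 20,000 bytes per compressed image.
total = storage_bytes(cameras=32, images_per_second=15, days=7, bytes_per_image=20_000)
per_camera = storage_bytes(cameras=1, images_per_second=15, days=7, bytes_per_image=20_000)

print(f"Per camera: {per_camera / 1e9:.2f} GB")
print(f"All cameras: {total / 1e12:.2f} TB")
```

Any off-site backup roughly doubles whatever total you arrive at, which is why it pays to run the sum before the storage is specified.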

One of the key elements of onsite data recovery for
security teams is use of RAID storage systems and it’s worth us going into a
bit of detail to give the best understanding of how these systems work. For a
start the RAID acronym stands for Redundant Array of Independent Disks and
central to the concept of RAID is “striping”. This is a way of combining an
array of HDDs into a single storage unit. Essentially striping an array of hard
drives involves partitioning each drive’s platter into storage stripes of any
size from half a kilobyte to a few megabytes.

Storage stripes are interleaved across the array so that
the entire storage solution is actually made up of many different stripes from
all the disks woven together. Data saves or searches see the disks shuffled
like a deck of cards during download and retrieval. The benefits of RAID
include the fact that storage levels are exceedingly high and that in the event
of disk failure certain recording modes guarantee no data will be lost. Possible
RAID modes include 0, 1, 2, 3, 4, 5 and 6.
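
The interleaving described above can be sketched as simple arithmetic. This is a simplified illustration of plain striping with no parity, not the layout used by any particular controller; the function name `locate` and the 4-disk, 512-byte figures are assumptions for the example.

```python
def locate(offset, stripe_size, disks):
    """Map a byte offset in the logical volume to (disk index, byte offset on that disk).

    Stripes are dealt out to the disks in turn, like cards to players.
    """
    stripe_number, within_stripe = divmod(offset, stripe_size)
    row, disk = divmod(stripe_number, disks)
    return disk, row * stripe_size + within_stripe


# Assumed array: 4 disks with 512-byte stripes.
for offset in (0, 512, 1500, 2048):
    disk, disk_offset = locate(offset, stripe_size=512, disks=4)
    print(f"logical byte {offset} -> disk {disk}, byte {disk_offset}")
```

Consecutive stripes land on different disks, which is why a long sequential read or write keeps every drive in the array busy at once.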

In video surveillance applications, RAID allows the use
of small stripes around 512 bytes long so that images are recorded across every
disk in the array with each drive storing a part of the image stream. There are
2 advantages here. Firstly, loss of a hard drive doesn’t mean complete loss of
data – the other disks in the array can rebuild the files lost from one failed
HDD. And secondly, record accesses can be performed very quickly – that’s
perfect for video applications. 

Different modes are very much worth having as they give a
range of performance options, depending on your requirements. RAID-0 sees data
split over the array with no redundancy, giving high performance and full use of
storage capacity at the expense of data loss in the event of any disk failure.
It’s the fastest RAID mode.

RAID-1 is perfect for performance-critical,
fault-tolerant solutions. It provides redundancy by writing all data to 2 or
more disks giving faster reads and slightly slower writes over single drive
storage. Most importantly though, there’s full data redundancy.

RAID-3 is ideal in data heavy situations where long
sequential record recalls improve data transfer. It lays down data in
byte-sized stripes, storing parity on one drive. This parity configuration
allows complete recovery of all information in the event one drive fails –
excellent for video surveillance applications if you want to employ every disk
to its full capacity. 
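
Parity recovery of this kind rests on XOR arithmetic, and a minimal sketch shows why one lost drive can be rebuilt. The 3 data blocks below are made-up stand-ins for the slices of data held on 3 drives; a real controller does this in hardware or firmware across whole disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up data blocks, one per data drive.
data = [b"CAM1", b"CAM2", b"CAM3"]

# The parity drive stores the XOR of all data blocks.
parity = xor_blocks(data)

# The second drive fails: XOR the survivors with the parity block to rebuild it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)
```

Note that the trick only covers a single failure. Lose 2 drives at once and there is not enough information left to rebuild either.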

Lastly, RAID-5 stripes data in blocks in the same way as RAID-4 but unlike
RAID-3 and RAID-4 it shares parity across all the disks, meaning there’s no
single-disk parity bottleneck. RAID-5 allows smaller writes to be undertaken
faster than RAID-4 but read performance is not as good. RAID-5 is the answer in
multi-user environments where performance is not the ultimate goal.
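
When weighing up these modes it helps to see what each costs in usable storage. The sketch below uses the standard capacity rules for each level and assumes identical disks, with RAID-1 treated as every disk mirroring the same data. The function name and the 4 x 2 TB array are assumptions for illustration.

```python
def usable_capacity(level, disks, disk_tb):
    """Usable capacity in TB for an array of identical disks at a given RAID level."""
    if level == 0:
        return disks * disk_tb          # no redundancy, every byte is usable
    if level == 1:
        if disks < 2:
            raise ValueError("RAID-1 needs at least 2 disks")
        return disk_tb                  # all disks hold the same data
    if level in (3, 4, 5):
        if disks < 3:
            raise ValueError("parity RAID needs at least 3 disks")
        return (disks - 1) * disk_tb    # one disk's worth goes to parity
    if level == 6:
        if disks < 4:
            raise ValueError("RAID-6 needs at least 4 disks")
        return (disks - 2) * disk_tb    # two disks' worth goes to parity
    raise ValueError(f"unsupported RAID level: {level}")


# Assumed array: 4 disks of 2 TB each.
for level in (0, 1, 5, 6):
    print(f"RAID-{level}: {usable_capacity(level, disks=4, disk_tb=2)} TB usable")
```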

Addressing the practicalities of network redundancy

For electronic security teams hot-swappable servers/DVRs
are central to the recovery process but you also need at least one alternate
Ethernet path and back-ups for routers, switches and hubs. What you want from a
networked electronic security system is fault tolerance and high availability.

Typical fault tolerant systems resist problems by
duplicating power supplies, duplicating disk arrays and offering automatic
changeover software. If there’s a negative with such automation it’s that you
may be unaware there’s a problem because the system will do the thinking for
you. A capable monitoring and reporting solution is vital here.
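
As a rough illustration of what such reporting needs to catch, the sketch below flags subsystems that are still working but have silently lost a redundant unit. The status dictionary is a hypothetical input; in practice it would be fed by whatever monitoring your hardware supports.

```python
def failover_alerts(status):
    """Return alert strings for subsystems that have lost some or all redundant units.

    status maps a subsystem name to a list of booleans, one per redundant
    unit, where True means the unit is healthy.
    """
    alerts = []
    for name, units in sorted(status.items()):
        healthy = sum(units)
        if healthy == 0:
            alerts.append(f"DOWN: {name} has no healthy units")
        elif healthy < len(units):
            alerts.append(f"DEGRADED: {name} running on {healthy} of {len(units)} units")
    return alerts


# Hypothetical snapshot: one power supply has failed over without anyone noticing.
snapshot = {
    "power supply": [True, False],
    "disk array": [True, True],
    "nic": [False, False],
}
for alert in failover_alerts(snapshot):
    print(alert)
```

The point is the DEGRADED case: the system is still up, so nobody will complain, yet the next failure takes it down.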

Another aspect of fault tolerance is building multiple
connections between video servers/DVRs and network switches. Building networks
this way ensures there’s backup should a NIC fail. You might also connect a NIC
to a pair of switches instead of just one.

As an alternative you might opt for high availability. A
high availability network design is one that ensures performance at a level
that guarantees that no matter what components fail – short of complete site
destruction – some operational capability will be retained.

If there’s a standout advantage of high availability vs
fault tolerance it’s cost – high availability solutions are going to be much
less expensive than fault tolerant ones. A typical high availability site may
incorporate separate Ethernet systems, with both client and server machines
fitted with a pair of Ethernet cards.
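
The payoff from a second Ethernet path can be estimated with the usual formula for redundant components. This sketch assumes the paths fail independently of each other, which will not hold if they share a switch, a conduit or a power source, and the 99 per cent figure for a single path is an assumption for illustration only.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(path_availability, paths):
    """Availability of a set of redundant paths, assuming independent failures."""
    return 1 - (1 - path_availability) ** paths


def downtime_hours_per_year(availability):
    """Expected hours per year with no working path."""
    return (1 - availability) * HOURS_PER_YEAR


single = 0.99  # assumed availability of one Ethernet path
dual = combined_availability(single, paths=2)

print(f"One path:  {downtime_hours_per_year(single):.1f} hours down per year")
print(f"Two paths: {downtime_hours_per_year(dual):.2f} hours down per year")
```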

Duplication of the Ethernet network may seem like an
expensive business but when you think about the low cost of hardware and the
ease of pulling duplicate cables, getting full network redundancy locally is
almost too easy, especially when you consider the small size of most security
LANs or LAN segments. It’s definitely harder work maintaining a hot-swappable
server than it is to keep a simple Ethernet idling, ready to go.

Another seriously valuable addition to the security
control room and its network would be a full blown intranet router designed to
direct traffic around a network using smarts like network address translation
and port address translation, as well as having the ability to execute firewall
rules. You can use a top end router like this to handle critical operations,
too.

Having such a router solution integrated and aligned with
a security network would give excellent support but you need to take into
account that rebuilding an intranet router from scratch is not easy so you’d be
looking at hot standby features that allow primary and secondary routers to
stay in touch in real time. Any failure of primary routers would see the
secondary unit take over. Using a load balancing router solution a similar
effect is achieved because the overall solution has the ability to pick up the
slack in the event a router fails. The central issue with these solutions,
however, is dollars.
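
The hot standby idea can be sketched as a simple heartbeat timer. This is a toy model under stated assumptions, not how any particular router implements it: the class name, the 3-second timeout and the timestamps are all invented for illustration.

```python
class StandbyRouter:
    """Secondary unit that takes over when the primary's heartbeats stop."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = now
        self.active = False

    def heartbeat(self, now):
        """Record a heartbeat from the primary and hand control back to it."""
        self.last_heartbeat = now
        self.active = False

    def tick(self, now):
        """Check the timer; become active if the primary has been silent too long."""
        if now - self.last_heartbeat > self.timeout_s:
            self.active = True
        return self.active


# Assumed 3-second timeout; times are seconds since start.
standby = StandbyRouter(timeout_s=3.0)
standby.heartbeat(now=1.0)
print(standby.tick(now=2.0))   # primary heard recently, standby stays idle
print(standby.tick(now=4.5))   # silence longer than the timeout, standby takes over
standby.heartbeat(now=5.0)
print(standby.tick(now=5.5))   # primary is back, standby steps down
```

The timeout is the design choice that matters: set it too short and a brief network hiccup triggers a takeover, set it too long and the outage drags on before the secondary steps in.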

Replacement bits

Getting disaster recovery right means having the ability
to rebuild a failed system on the spot and that means having replacement
hardware on site where it can be plugged into the system immediately. Such
equipment may include a server, a PC, a DVR, and a couple of switches or
routers.

While a hotel or large industrial site may be able to get
away with having the surveillance system down for half an hour, big airports or
casinos will need to get failover times down to a couple of seconds.

With networked DVRs or video servers this may mean servers
link to a switch through a bunch of different NICs using different switch ports.
It’s straightforward stuff but it means not only will there be failure
protection, the resultant “fat pipe” will pump up local bandwidth
possibilities.

If failure of the cable plant, its connectors and
terminations is a worry then building lots of connections between switches will
help. You could also connect a DVR/video server to a couple of different
switches on the same network so no single switch failure will see the system off
line.
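
One way to check a design for this is to model the network as a list of links and knock out each device in turn. The sketch below does exactly that; the device names and both layouts are hypothetical, and a real audit would also need to cover cabling routes and power.

```python
from collections import deque


def build_links(pairs):
    """Turn a list of (device, device) links into a two-way adjacency map."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, start, goal, failed=None):
    """Breadth-first search from start to goal, skipping the failed device."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in links.get(node, ()):
            if neighbour != failed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, start, goal):
    """Devices whose failure alone cuts start off from goal."""
    candidates = set(links) - {start, goal}
    return sorted(d for d in candidates if not reachable(links, start, goal, failed=d))


# Hypothetical layouts: a DVR on one switch versus a DVR on two switches.
one_switch = build_links([("dvr", "sw1"), ("sw1", "control")])
two_switches = build_links([("dvr", "sw1"), ("dvr", "sw2"),
                            ("sw1", "control"), ("sw2", "control")])

print(single_points_of_failure(one_switch, "dvr", "control"))
print(single_points_of_failure(two_switches, "dvr", "control"))
```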

Quick tips for disaster recovery of networked security
systems include:

* Plan for the worst – this way you won’t be
surprised
* Check your plan and update it regularly to
keep it fresh
* Incorporate many solutions, all able to be executed
fast
* Document the plan and ensure there are multiple copies
* Make sure your team is familiar with the plan.

AUTHOR

SEN News
https://sen.news
Security & Electronics Networks - Leading the Security Industry with News and Latest Events. Providing information and pre-release updates on the latest tech and bringing it all to you daily. SEN News has been in print for over 20 years and has grown strong as a worldwide resource in digital media.
