Top 10 Patterns for Scaling Out Java Applications
My DRAFT notes for the session (TS-6339) given by Cameron Purdy from the Oracle Coherence team at JavaOne.
Current trends to reflect the move to lighter technologies. However, the patterns for scaling don’t change, and this was a session aimed at giving guidance to achieve that…without shooting yourself in the foot.
10. Understanding the Problem
- Initial expectations are crucial.
- A scalable system will always be slower in single user mode
- Complexity always increases path length
Indeed, if a system is fast in single user mode but slow under heavy load, then the problem is not one of performance, but scalability.
Horizontal scalability is the only answer
9. Define the requirements
The ARSPMS Requirements model
Availability
Reliability
Scalability
Performance
Manageability
Serviceability
- Understand the theoretical achievable performance – define single user load scenario
- Understand the cost of achieving these capabilities.
- Work towards ensuring linear system scalability, i.e. making sure the cost per user doesn’t rise (what you see when adding more servers, each successive addition supports less users than before)
Gotcha – Predictable latency
more moving pieces
- Garbage Collection
- Thread switches
- Network I/O
- Switches and Routers
Gotcha – Read Consistency
2nd law of distributed systems – The only things you can distribute are state (info) and behaviour. This presents two challenges
- distribution creates multiple copies – how do you keep them in sync?
- how do you guarantee transactional consistency
Gotcha – Durability
Achieving availability of stateful system requires a form of durability e.g.
- sync replication
- disks/SANs
- transaction logging
- durable message queues
The challenges then are:
- latency
- shared access between writer and any potential failover
8. Architecture trumps technology!
I like this one, and it’s at the heart of many of the others here i.e. you can’t just throw more technology at the problem. There may be natural domain specific limitations, sequential high latency operations, globally ordered operations or data hot spots.
On the flip side, data consistency and reliability, immediate reliable state access, durability of state change.
Red herrings – in truth, these are all scalable…
EJBs, tiered architectures, RPC, XML, Web Services, SOAP, HTTP, Programming languages and Operating systems. They may well be inefficient but not un-scalable!
There are some legitimate language and platform concerns that should worry us:
- predictability of latency – unpredictable scheduling and duration of full gc pauses
- Lack of control over thread scheduling
- Lack of asynch I/O
- Ease of development – languages differ in effectiveness for special tasks
- Availability of libraries and components, after all we don’t want to keep re-inventing the wheel..especially if someone smart like Doug Lea wrote it
7. Understand the basics
a) Delivery – packet oriented, stream oriented, req/resp
UDP /IP Packet oriented – not reliabile, limited payload size but predictable latency
TCP/IP – stream oriented, unpredictable latency but easy programming
HTTP – req/response based, built on top of TCP/IP
b) Messaging concepts
delivery guarantees, latency, processing model (at best once, at most once etc.) and ordering guarantees
Efficient models are single threaded, reliable, in order, zero latency once and only once – like being sat at your workstation.
- messaging makes you pay
- ordering limits scalability
- reliability adds costs
- introduces latency
- once and only once is hard
6. visualize the network
The wire – a few basics e.g. Ethernet and Infiniband, all very fragile
Switched fabric offers fault tolerance, scalability and direct p2p connections.
5. visualise the design
Patterns of stateful scaleout
Sprayer – no knowledge of app
Routing – to the right server
Patitioning – assign responsibility
Availabilty
Messaging
Coordination
Routing – information and processing in right place where there is locality of state. Reliable routing is fundamental and supports partitioning and replication.
Partitioning – ownership of state distributed
Replication (availability) – limit loss, relies on coordination
Coordination – difficult to scale and can work against availability, orchestrated state change process.
benefits – simplifies programming, supports synchronous behaviour
Messaging
Queue supports reliability(hand off), durability, ordering, collection (batching – a bus load). He made a good point about messaging, in that if you have enough localised queues, when you go to get a message from the queue, it’s not always best just to get one – think about batching work, why not take a bus load?
4a plan for overload
load balancing – spraying
partition
queue – collect then process in bulk
parellelize – spread concurrent processing.
Oveloaded? What are your options?
a) turn away request
b) Queue
c) add capacity dynamically
d) increase batch size or q delay – let them stand in larger groups
e) relax read and / or write consistency –> sometimes you have to e.g. caching
4b partition for scalability
It’s the only way to provide significant scale. And dynamic partitioning supports dynamic capacity. This introduces a challenge for distributed read consistency. Even with custom affinity of data, complete data locality is simply impossible for some apps.
3a plan for failure
we cause accidents…
The 3 R’s
- redundancy
- reserve
- recoverability
Plan for failure at two levels – component, and system.
When possible failover to transparently handle component failure – e.g. server. Use recoverability to handle everything else (requires durability – tx logs, db logs etc. to reconstruct after failure).
3b replicate for availability
Do
synchronous replication for durability
replicate at levels that are responsible for failover – not everywhere
asynch replication for increased locality e.g. caching or for recovery
Don’t
replicate synchronous inside a critical session unless you have to
Only replicate only what you need!
In most successful models events are partitioned and THEN replicated.
2. tier where it makes sense
- this is not horizontal scalability, just segmenting responsibility.
Pro’s
it can increase scale
localisation can eliminate need for coordination, and thus help to scale
tier traversal creates opportunities for batching queues
Cons
added complexity
potential for reduced efficiency
- tier traversal should be on a constant order
There is no reason to tier stateless systems for scalability! Though maybe for other reasons:
- latency
- scaling
- availability
- efficiency
1. Simplify
Complexity is the enemy. Reliable distributed systems must be modelled as finite state architectures – prove them
Partitioning comes with hidden treasures – localized ordering managed locally
Consider change to the programming model e.g. use an event driven architecture to turn the problem on it’s head.
Conclusion
if you can’t show it working on a white board…it’ll never work in production. Don’t we know it!
- Posted by admin at 05:45 pm
- Permalink for this entry
- Filed under: Best Practice, Educating, Introduction
- RSS comments feed of this entry
- TrackBack URI
No comments
Leave a comment