Share on twitter

Using vlingo/PLATFORM On AWS Fargate

I consider myself a refugee from the old JEE architectures. Threading was handled (naively, for the most part) by the container. Requests were synchronous, transactions distributed. The applications were deployed to a private data center, so scaling the ponderous beast was handled by scaling out (nodes to a weblogic cluster) and scaling up (increasing memory and CPU to what would be considered obscene by today’s standards). The inefficiencies were obvious, even then. Through VMware tooling, we could profile the virtual machines and see a single CPU in the multi-core configuration working at or near capacity while the other CPUs stood idle, engaging only for the frenzied gc activity. Three CPUs on a four CPU system sat idle 90% of the time, and our application performed poorly under load. There had to be a better way.

I’ve known and respected Vaughn for almost twenty years.  He introduced me to the reactive manifesto and, eventually, the vlingo/PLATFORM.  I began tinkering with vlingo somewhere around version 0.3, but did not then have a problem on which to use my shiny new hammer.  Sometime later,  I created a simulation to show the advantages of the actor model over stodgy JEE architecture.  With Vaughn’s encouragement, I expanded the simulation to include a Java/vlingo version of the application.  In the end, there were three benchmarks:

        • Scala with Akka 
        • Java with vlingo
        • A Java container version that didn’t schedule across CPUs

It ran from my aging laptop. The results were both predictable and enlightening.  It was predictable in the sense that software fully utilizing a multicore architecture will kick the ass of software that doesn’t.  The enlightenment came when I realized that vlingo out-performed Akka. Have you seen the film Ford v Ferrari in theaters? According to the Gatling report, the underdog was for real.  As far back as version 0.5, vlingo/PLATFORM could run with (or flat outrun) the big dogs.

Found a Nail

The chance to use the vlingo/PLATFORM outside of a simulation came with a request to create a simple service that inspects documents for metadata before uploading them to a service for archival.  It was a perfect nail for the vlingo hammer. By perfect, I mean it was well-defined and had obvious payoff from the innate concurrency of an actor system. By this time the vlingo/PLATFORM was somewhere around version 0.7 and was obviously solid enough to handle this problem space.

The target configuration eventually landed on Fargate, the AWS managed docker container. The application workload would be fed from S3:object created events.  My preference was to be event driven, but the reality was that routing to an SQS queue and polling was the course of wisdom.

Even though there was no requirement, I kept an HTTP interface for future diagnostic and instrumentation entry points. That proved to be a fateful choice. (I’ll explain a bit later.) The service sat behind an application load balancer that monitored the application availability with a periodic health check.

Once deployed to AWS, we began load testing. The traffic would come in batches; large batches that is. These create a “message storm” for the service.  I believed this was a perfect opportunity to showcase the strength of vlingo and its implementation of the actor model.

Trouble in Paradise

The test dropped enough documents into the S3 landing zone that the service took around five hours to clear its message queue.  What ran flawlessly on my workstation, however, did not play well in the Fargate environment. Periodically, under load, the container would cross the failure threshold and ECS would kill the container. That instability created two issues, one obvious, the other less so.

      • First the obvious problem: messages were left in flight, workflow was in an indeterminate state.  Although the inflight messages would eventually migrate back to availability and reenter processing, that took hours.  The state was self-correcting, but took time. SLAs would be undercut for a simple file transfer. Not acceptable.
      • The not-so-obvious problem is that Fargate is serverless, the underlying host is managed by Amazon. Each Fargate docker container shares an instance with other services.  Since there is no guarantee of landing on the same instance, the image is pulled from the repository each time a new container is created. The size for the service docker image is ~1.1GB.  Each time the service was recreated following a healthcheck termination, the network traffic generated revenue for Amazon

At first, I went down the route of configuring a very friendly health check. That only lessened the frequency, it did not address root cause.

I immediately turned a suspicious eye to vlingo/http. Come on!  It has to be able to service a health check! The problem was interesting enough that Vaughn got involved.  He turned, tinkered, and revved the platform to compensate for what amounted to a new use case. Things did improve, but the container was still unstable.  Finally, Vaughn insisted: “Tom, something else is going on.” He was right.

First, resources.  Here is where my background in JEE and private data centers worked against me (again). In my old world, a CPU has cores, a core had threads.  The math is easy. The math changes in the Fargate world. It’s still easy, but it is definitely different, and the difference matters.

Fargate doesn’t deal with CPUs.  It allocates resources in vCPU units.  A vCPU is not a CPU with cores.  It is not a core with threads.  It is a thread, period. Where I assumed (what did my momma tell me about assuming?)  I was running on 2 cores with 2 threads per core, I was running on two threads, period.  The service was running on half the resources I believed it to be. Still, that did not explain why a health check couldn’t be serviced.  That led to the second “aha moment”.

When actors that are designed for non-blocking make a blocking I/O request, they block on I/O.  With only two threads and intensive synchronous I/O with S3 and HTTP requests, the health check never stood a chance, no matter how liberal the health check. The two threads were fully utilized by actors processing documents, and they would rarely give those up for vlingo/http to use.

The Solution

I removed the HTTP interface.  The risk/reward simply was not there. The purpose of the service is to read and process messages from a queue.  With scarce resources, HTTP health checks are a luxury for a message-based service. Being free of an HTTP interface brought freedom from the load balancer and the ability to tune resources without worrying about the unintended consequence of a container restart.

And just as a chain is no stronger than its weakest link, the elegant reactive patterns that come naturally in vlingo come to a grinding, blocking halt when a synchronous operation is encountered. Happily, Amazon released V2 of their API with strong support for asynchronous operations.

The lesson learned: Other libraries provide the capability but not necessarily the abstractions that promote the use of reactive patterns. You have to craft them yourself. On the other hand, the use of reactive patterns comes naturally with vlingo. They are built in and extensible.  In the present, reactive applications are sometimes forced to straddle both of those worlds. But watch. As the vlingo/PLATFORM continues its inevitable expansion to codify reactive patterns, purely message based systems will be increasingly prevalent.

Tom Stockton is a Principal Architect at MAXIMUS, providing software solutions to support government health and human services programs in the United States.  Tom has been a Java enthusiast since the Java World Tour in ’97, is a proponent of reactive systems, Domain-Driven Design, and an early vlingo/PLATFORM adopter.

Share on twitter

More to explore

Are You A Techie?

Are you a technical person? I am often asked that question. Why? Does the answer matter? I believe most are curious and