Explicit Implicit Conversion

On July 5, 2014, in Scala, Tips and Tricks, by Noam Almog
One of the most common patterns we use day to day is converting objects from one type to another. The reasons for this are varied: one is to distinguish between external and internal implementations; another is to enrich incoming data with additional information or to filter out some aspects of the data before sending it over to the user. There are several approaches to achieve this conversion between objects:

1. The Naïve Approach

Add your converter code to the object explicitly
case class ClassA(s: String)

case class ClassB(s: String) {
   def toClassA = ClassA(s)
}
While this is the most straightforward and obvious implementation, it ties ClassA and ClassB together which is exactly what we want to avoid.

2. The fat belly syndrome

When we want to convert between objects, the best way is to refactor the logic out of the class, allowing us to test it separately but still use it in several classes. A typical implementation would look like this:
class SomeClass(c1: SomeConverter, c2: AnotherConverter, ...., cn: YetAnotherConverter) {
...........
}
The converter itself can be implemented as a standalone class – or, as in this Java example, as an enum-based singleton:
enum CustomToStringConverter {

    INSTANCE;

    public ClassB convert(ClassA source) {
        return new ClassB(source.str);
    }
}
This method forces us to include all the needed converters for each class that requires them. Some developers might be tempted to mock those converters, which will tightly couple their tests to concrete converters. For example:
   // set mock expectations
   converter1.convert(c1) returns c2
   dao.listObj(c2) returns List(c3)
   converter2.convert(c3) returns o4

   someClass.listObj(o0) mustEqual o4
What I don’t like about these tests is that all of the code flows through the conversion logic, and in the end you are comparing the result returned by some of the mocks. If, for example, one of the converter mock expectations doesn’t exactly match the input object, a programmer may give up on matching the input and use the any matcher instead, rendering the test moot.

3. The Lizard’s Tail

Another option Scala gives us is the ability to inherit multiple traits and supply the converter code in traits, allowing us to mix and match these converters. A typical implementation would look like this:
class SomeClass extends AnotherClass with SomeConverter with AnotherConverter..... with YetAnotherConverter {
  ...............
}
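To make this concrete, here is a hedged sketch of what one of these converter traits might look like, reusing ClassA and ClassB from the first example (the trait and method names are illustrative):

trait ClassAConverter {
   def toClassA(source: ClassB): ClassA = ClassA(source.s)
}

class SomeService extends ClassAConverter {
   def handle(incoming: ClassB): ClassA = toClassA(incoming)
}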
Using this approach will allow us to plug the converters into several implementations while removing the need (or the urge) to mock conversion logic in our tests, but it raises a design question – is the ability to convert one object to another related to the purpose of the class? It also encourages developers to pile more and more traits onto a class and never remove old, unused traits from it.

4. The Ostrich way

Scala allows us to hide the problem altogether by using implicit conversions. An implementation would now look like this:
implicit def convertO0ToO1(o0: SomeObject): AnotherObj = ...
implicit def convertO1ToO2(o1: AnotherObject): YetAnotherObj = ...

def listObj(o0: SomeObj): YetAnotherObj = dao.doSomethingWith(entity = o0)
What this code actually does is convert o0 to o1, because that is what the dao call needs, and when the result comes back it is implicitly converted to o2, the declared return type. The code above hides a lot from us and leaves us puzzled if the tooling doesn’t show us those conversions. A good use case for implicit conversions is converting between objects that have the same functionality and purpose; a good example is converting between Scala and Java lists, which are basically the same, where we do not want to litter our code in all of the places where we convert between the two.

To summarize the issues we encountered:

1. A long and partly unused list of junk traits, or junk classes in the constructor.
2. Traits that don’t represent the true purpose of the class.
3. Code that hides its true flow.

To solve all of these, Scala provides a good pattern with the usage of implicit classes. To write conversion code we can do something like this:
object ObjectsConverters {

   implicit class Converto0To1(o0: SomeObject) {
      def asO1: AnotherObject = .....
   }

   implicit class Converto1To2(o1: AnotherObject) {
      def asO2With(id: String): YetAnotherObject = .....
   }
}
Now our code will look like this:
import ObjectsConverters._

def listObj(o0: SomeObj): YetAnotherObj = dao.doSomethingWith(entity = o0.asO1).asO2With(id = "someId")
This approach allows us to be implicit and explicit at the same time. From looking at the code above you can understand that o0 is converted to o1 and the result is converted again to o2. If a conversion is not being used, the IDE will optimize the import out of our code. Our tests won’t prompt us to mock each converter, resulting in specifications which explain the proper behavior of the code flow in our class. Note that the converter code is tested elsewhere. This approach also allows us to write more readable tests in other parts of the code. For example, in our e2e tests we reduce the number of objects we define:
"some API test" in {
   callSomeApi(someId, o0) mustEqual o0.asO2With(id = "someId")
}
This code is now more readable and makes more sense; we are passing some inputs and the result matches the same objects that we used in our API call.

Continuous Delivery – Feature Toggles

On April 19, 2014, in Web Development, by Aviran Mordo

One of the key elements in Continuous Delivery is the fact that you stop working with feature branches in your VCS repository; everybody works on the MASTER branch. During our transition to Continuous Deployment we switched from SVN to Git, which handles code merges much better, and has some other advantages over SVN; however SVN and basically every other VCS will work just fine.

For people who are just getting to know this methodology it sounds a bit crazy, because they think developers cannot check in their code until it’s completed and all the tests pass. But this is definitely not the case. Working in Continuous Deployment we tell developers to check in their code as often as possible, at least once a day. So how can this work if developers cannot finish their task in one day? Well, there are a few strategies to support this mode of development.

Feature toggles

Telling your developers they must check in their code at least once a day will get you a reaction along the lines of “But my code is not finished yet, I cannot check it in”. The way to overcome this “problem” is with feature toggles.

Feature toggles are a software development technique that provides an alternative to maintaining multiple source code branches, called feature branches. Continuous release and continuous deployment give you quick feedback about your code, which requires you to integrate your changes as early as possible. Feature branches bypass this process. Feature toggles bring you back on track, although the execution paths of your feature are still “dead” and “untested” while the toggle is “off”; the effort to enable the new execution paths is as low as flipping the toggle to “on”.

So what really is a feature toggle? A feature toggle is basically an “if” statement in your code that is part of your standard code flow. If the toggle is “on” (the “if” condition is true) the code is executed, and if the toggle is “off” the code is not executed. Every new feature you add to your system has to be wrapped in a feature toggle. This way developers can check in unfinished code, as long as it compiles, and it will never get executed until you change the toggle to “on”. If you design your code correctly you will see that in most cases you will only have ONE spot in your code for a specific feature toggle “if” statement.
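To illustrate the idea, here is a minimal sketch of a toggle in Scala; the toggle store and the toggle name are made up for this example – a real system would typically read toggles from configuration, a database or an admin UI:

object FeatureToggles {
  // Illustrative in-memory store; in practice this comes from configuration or a DB
  private val toggles = Map("newCheckoutFlow" -> false)

  def isOn(name: String): Boolean = toggles.getOrElse(name, false)
}

def checkout(itemCount: Int): String =
  if (FeatureToggles.isOn("newCheckoutFlow"))
    s"new flow, $itemCount items"   // unfinished code can live here, never executed, until the toggle is flipped
  else
    s"old flow, $itemCount items"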


 

Using Specs² macro matchers for fun and profit

On December 27, 2013, in Scala, by Shai Yallin

At Wix, we make extensive use of the Specs² testing framework. It has become the standard tool for writing software specifications in our backend group, replacing JUnit, Hamcrest and ScalaTest.

We have always been highly fond of matchers, having adopted Hamcrest back in 2010, and our testing methodology has developed to heavily rely on matchers. We went as far as writing a small utility framework, back in our Java days, that takes an Interface and creates a matcher from it using JDK Proxy. For a class Cat with fields age and name, the interface would basically look like this:

  
interface CatMatcherBuilder extends MatcherBuilder<Cat> {  
  public CatMatcherBuilder withAge(int age);  
  public CatMatcherBuilder withName(String name);  
}  

Which then would’ve been used like this:

  
CatMatcherBuilder aCat() {
  return MatcherBuilderFactory.newMatcher(CatMatcherBuilder.class);
}

...


assertThat(cat, is(aCat().withName("Felix").withAge(1)));

As you can see, this involves a fairly large amount of boilerplate.

Moving to Specs², we wanted to keep the ability to write matchers for our domain objects. Sadly, this didn’t look much better written using Specs² matchers:

  
def matchACat(name: Matcher[String] = AlwaysMatcher(),  
              age: Matcher[Int] = AlwaysMatcher()): Matcher[Cat] =  
  name ^^ {(cat: Cat) => cat.name} and  
  age ^^ {(cat: Cat) => cat.age}  
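For context, here is a self-contained sketch of how such a hand-rolled matcher is defined and used; the Cat case class and the imports reflect a typical Specs² setup and are our assumption, not code from the original post:

import org.specs2.matcher.{AlwaysMatcher, Matcher}
import org.specs2.mutable.Specification

class CatSpec extends Specification {
  case class Cat(name: String, age: Int)

  // One of these had to be written by hand for every domain class
  def matchACat(name: Matcher[String] = AlwaysMatcher(),
                age: Matcher[Int] = AlwaysMatcher()): Matcher[Cat] =
    name ^^ { (cat: Cat) => cat.name } and
    age ^^ { (cat: Cat) => cat.age }

  "a cat" should {
    "have the correct name" in {
      Cat(name = "Felix", age = 2) must matchACat(name = startWith("Fel"))
    }
  }
}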

What quickly became apparent is that this calls for a DRY approach. I looked for a way to create matchers like these automatically, but there was no immediate solution for implementing this without the use of compiler-level trickery.

The breakthrough came when we hosted Eugene Burmako during Scalapeño 2013. I discussed the issue with Eugene who assured me that it should be fairly easy to implement this using macros. Next, I asked Eric, the author and maintainer of Specs², if it would be possible for him to do that. Gladly, Eric took the challenge and Eugene joined in and helped a lot, and finally, starting with version 2.3 of Specs², we can use macros to automatically generate matchers for any complex type. Usage is fairly simple; you need to add the Macro Paradise compiler plugin, then simply extend the MatcherMacros trait:

  
class CatTest extends Specification with MatcherMacros {
  "a cat" should {  
   "have the correct name" in {  
     val felix = Cat(name = "Felix", age = 2)  
     felix must matchA[Cat].name("Felix")  
   }  
  }  
}  

It’s also possible to pass a nested matcher to any of the generated matcher methods, like so:

  
class CatTest extends Specification with MatcherMacros {
  "a cat" should {  
   "have the correct name" in {  
     val felix = Cat(name = "Felix", age = 2)  
     felix must matchA[Cat].name(startWith("Fel"))  
   }  
  }  
}  

For further usage examples you can look at the appropriate Unit test or check out my usage in the newly released Future Perfect library.

This would be the place to thank Eric for his hard work and awesomeness responding to my crazy requests so quickly. Eric, you rock.

 

Introducing Accord: a sane validation library for Scala

On December 13, 2013, in Scala, by Tomer Gabel

Accord is an open-source (Apache-licensed) Scala validation library developed at Wix. It’s hosted on GitHub and you’re welcome to fork and dig into it; let us know what you think!

Why another validation library?

As we were transitioning from Java to Scala we’ve started hitting walls with the existing validation libraries, namely JSR 303 and Spring Validation. While there are a few validation frameworks written for Scala, notably scalaz with its validation features, after evaluating them we remained dissatisfied and ended up designing our own. If you’re interested, there’s more background and comparisons with existing frameworks on the project wiki at GitHub.

So what does it look like?

A type validator is defined by providing a set of validation rules via the DSL:

import com.wix.accord.dsl._    // Import the validator DSL

case class Person( firstName: String, lastName: String )
case class Classroom( teacher: Person, students: Seq[ Person ] )

implicit val personValidator = validator[ Person ] { p =>
  p.firstName is notEmpty                   // The expression being validated is resolved automatically, see below
  p.lastName as "last name" is notEmpty     // You can also explicitly describe the expression being validated
}

implicit val classValidator = validator[ Classroom ] { c =>
  c.teacher is valid        // Implicitly relies on personValidator!
  c.students.each is valid
  c.students have size > 0
}

You can then execute the validators freely and get the result back. A failure result includes its respective violations:

scala> val validPerson = Person( "Wernher", "von Braun" )
validPerson: Person = Person(Wernher,von Braun)
 
scala> validate( validPerson )
res0: com.wix.accord.Result = Success
 
scala> val invalidPerson = Person( "", "No First Name" )
invalidPerson: Person = Person(,No First Name)
 
scala> validate( invalidPerson )
res1: com.wix.accord.Result = Failure(List(RuleViolation(,must not be empty,firstName)))
 
scala> val explicitDescription = Person( "No Last Name", "" )
explicitDescription: Person = Person(No Last Name,)
 
scala> validate( explicitDescription )
res2: com.wix.accord.Result = Failure(List(RuleViolation(,must not be empty,last name)))
 
scala> val invalidClassroom = Classroom( Person( "Alfred", "Aho" ), Seq.empty )
invalidClassroom: Classroom = Classroom(Person(Alfred,Aho),List())
 
scala> validate( invalidClassroom )
res3: com.wix.accord.Result = Failure(List(RuleViolation(List(),has size 0, expected more than 0,students)))
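Based on the REPL session above, handling a result is a straightforward pattern match; this is a sketch of the idea rather than a tour of the full Result API:

import com.wix.accord._

val result = validate( Person( "", "No First Name" ) )
result match {
  case Success               => println( "valid person" )
  case Failure( violations ) => violations.foreach( println )   // e.g. RuleViolation(,must not be empty,firstName)
}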

Design goals

Accord was designed to satisfy four principal design goals:

  • Minimalistic: Provide the bare minimum functionality necessary to deal with the problem domain. Any extended functionality is delivered in a separate module and satisfies the same design goals.
  • Simple: Provide a very simple and lean API across all four categories of call sites (validator definition, combinator definition, validator execution and result processing).
  • Self-contained: Reduce or eliminate external dependencies entirely where possible.
  • Integrated: Provide extensions to integrate with common libraries and enable simple integration points where possible.

The first milestone release (0.1) already includes a substantial set of combinators (Accord’s terminology for discrete validation rules, e.g. IsEmpty or IsNotNull), a concise DSL for defining validators, result matchers for ScalaTest and Specs², and integration facilities for Spring Validation.

Accord’s syntax is specifically designed to avoid user-specified strings in the API (this includes scala.Symbols). In practical terms, this means it doesn’t use reflection at runtime, and furthermore can automatically generate descriptions for expressions being validated. In the above example, you can see that RuleViolations can include both implicit (as in firstName) and explicit (as in lastName) descriptions; this feature enables extremely concise validation rules without sacrificing the legibility of the resulting violations.

Next up on this blog: Accord’s architecture. Stay tuned…


Continuous Delivery – Production Visibility

On November 3, 2013, in Web Development, by Aviran Mordo

A key point for successful continuous delivery is to make production metrics available to the developers. At the heart of the continuous delivery methodology is empowering developers and making them responsible for the deployment and successful operation of the production environment. In order for developers to do that, you should make all the information about the applications running in production easily available.

Although we give our developers root (sudo) access to the production servers, we do not want our developers to have to look at the logs in order to understand how the application behaves in production and to solve problems when they occur. Instead we developed a framework that every application at Wix is built on, which takes care of this concern.

Every application built with our framework automatically exposes a web dashboard that shows the application state and statistics. The dashboard shows the following (partial list):
• Server configuration
• All the RPC endpoints
• Resource Pools statistics
• Self test status (will be explained in future post)
• The list of artifacts (dependencies) and their version deployed with this version
• Feature toggles and their values
• Recent log entries (can be filtered by severity)
• A/B tests
• And most importantly we collect statistics about methods (timings, exceptions, number of calls and historical graphs).

Wix Dashboard

We use code instrumentation to automatically expose statistics on every controller and service end-point. Developers can also annotate methods they feel are important to monitor. For every method we can see the historical performance data, exception counters and also the last 10 exceptions for that method.
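As an illustration only – this is not the actual Wix framework – a per-method statistics collector along these lines can be sketched in a few lines of Scala:

import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

final class MethodStats {
  val calls      = new AtomicLong   // number of invocations
  val exceptions = new AtomicLong   // number of failed invocations
  val totalNanos = new AtomicLong   // accumulated execution time
}

object Instrumentation {
  private val stats = new ConcurrentHashMap[String, MethodStats]()

  // Wraps a method body, recording call count, timing and exceptions under the given name
  def timed[T](method: String)(body: => T): T = {
    val s = stats.computeIfAbsent(method, _ => new MethodStats)
    s.calls.incrementAndGet()
    val start = System.nanoTime()
    try body
    catch { case e: Throwable => s.exceptions.incrementAndGet(); throw e }
    finally s.totalNanos.addAndGet(System.nanoTime() - start)
  }
}

// usage (names are hypothetical): Instrumentation.timed("ItemService.listItems") { dao.listItems(userId) }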

We have 2 categories of exceptions: business exceptions and system exceptions.
A business exception is anything that has to do with application business logic; you will always have these kinds of exceptions, validation exceptions for example. The important thing with this kind of exception is to watch for a sudden increase, especially after a deployment.

The other type is the system exception, something like “Cannot get JDBC connection” or “HTTP connection timeout”. A perfect system should have zero system exceptions.

For each exception we also have 4 severity levels, from Recoverable to Fatal, which also help set fine-grained monitoring (you should have zero fatal exceptions).
Using this dashboard makes it easy to understand what is going on with the server without the need to look at the logs (in most cases).

One more benefit of this dashboard is that the method statistics are also exposed in JSON format, which is monitored by Nagios. We can set Nagios to monitor overall exceptions and also per-method exceptions and performance. If the number of exceptions increases or if we have performance degradation, we get alerts about the offending server.

The app dashboard is just one way we expose our production servers to the developers. However, the app dashboard only shows one server at a time. For an overview of our production environment we also use an external monitoring service. There are several services you can use, like AppDynamics, Newrelic, etc.
Every monitoring service has its own pros and cons; you should try them out and pick whatever works best for you (we currently use Newrelic).

Every server is deployed with a Newrelic agent. We installed a large screen in every office which shows the Newrelic graphs of our production servers. This way the developers are always exposed to the status of the production system, and if something bad is going on we immediately see it in the graph, even before the alert threshold is crossed. Also, having the production metrics exposed, the developers see how the production system behaves at all hours of the day. It is not a rare case that a developer looks at the production graphs and decides that we can improve the server performance, and so we do. We have seen time and again that every time we improve our servers’ performance we increase the conversion rate of our users.

The fact that we expose all this information does not mean we do not use logs. We do try to keep the developers out of the log files, but logs have information that can be useful for post mortem forensics or when some information is missing from the dashboards.

To summarize, you should expose all the information about your production environment to the developers in an easy to use interface, which includes not only the application statistics but also system information like routing tables, reverse proxy settings, deployed servers, server configurations and everything else you may think of that can help you understand better the production system.


The road to continuous delivery

On October 2, 2013, in Tips and Tricks, by Aviran Mordo

The following series of posts are coming from my experience as the head of back-end engineering at Wix.com. I will try to tell the story of Wix and how we see and practice continuous delivery, hoping it will help you make the switch too.

So you have decided that your development process is too slow and you are thinking of moving to a continuous delivery methodology instead of the “not so agile” Scrum. I assume you did some research, talked to a couple of companies and attended some lectures on the subject, and want to practice continuous deployment too; still, many companies ask me how to start and what to do.
In this series of articles I will try to describe some strategies to make the switch to Continuous delivery (CD).

Continuous Delivery is the last step in a long process. If you are just starting, you should not expect to get there within a few weeks or even a few months; it might take almost a year before you are actually making several deployments a day.
One important thing to know: it takes full commitment from management. Real CD is going to change the whole development methodology and affect everyone in R&D.

Phase 1 – Test Driven Development
In order to do successful CD you need to change the development methodology to Test Driven Development. There are many books and online resources about how to do TDD; I will not write about it here, but I will share our experience and the things we did in order to do TDD. One of the books I recommend most is “Growing Object-Oriented Software, Guided by Tests”.

A key concept of CD is that everything should be tested automatically. Like most companies we had a manual QA department, which was one of the reasons the release process was taking so long: with every new version of the product, regression testing takes longer.

Usually when you suggest moving to TDD and eventually to CI/CD, the QA department will start having concerns that they are going to be expendable and be fired, but we did no such thing. What we did is send our entire QA department to learn Java. Up to that point our QA personnel were not developers and did not know how to write code. Our initial thought was that the QA department was going to write tests, but not unit tests; they were going to write integration and end-to-end tests.

Since we had a lot of legacy code that was not tested at all, the best way to test it is with integration tests, because an integration test is similar to what manual QA does: testing the system from the outside. We needed the manpower to help the developers, so training the QA personnel was a good choice.

Now as for the development department, we started to teach the developers how to write tests. Of course the first tests we wrote were pretty bad, but writing good tests is a skill like any other, so it improves with time.
In order to succeed in moving to CD it is critical to get support from the management, because before you see results there is a lot of investment to be made and development velocity is going to sink even further as you start training and building the infrastructure to support CD.

We were lucky to get such support. We identified that our legacy code was unmaintainable and we decided we needed a complete re-write. However, this is not always possible, especially with large legacy systems, so you may want to make the switch only for new products.
So what we did is we stopped all development of new features and started to progress on several fronts. First we selected our new build system and CI server. There are many options to choose from; we chose Git, Maven, TeamCity and Artifactory. Then we started to build our new framework in TDD so we could have a good foundation for our products. Note that we did not do anything related to deployment (yet).

Building our framework, we set a few requirements for ourselves. When a developer checks out code from Git, he should be able to run unit tests and integration tests on his own laptop WITHOUT any network connection. This is very important, because if you depend on fixtures such as a remote database to run your integration tests, you don’t write good integration tests: you are limited to working only from the office, your tests will probably run slower, and you will probably get into trouble running multiple tests against the same database because the tests will contaminate the DB.

Next chapter: Production visibility


Hello, hello, what’s this all about?

In the previous post in the series I introduced Lifecycle – Wix’ integrated CI/CD action center. In this post I’d like to share a bit of nifty math I threw in a couple of months ago, which allowed us to save about 60% of our CI resources.  I’ll start with a brief overview of our setup and the problem it caused, and then move on to a little fun math for the geeks out there.

 So where’s that overview you promised?

Here it is. Our build and release process is undergoing some changes as we attempt to scale CI tools and methods to adjust for rapid company growth. Our current setup includes 2 TeamCity servers, one called ‘CI’ and the other ‘Release’. In olden days (last year), the CI server had builds running on every Git commit (to the specific project), using latest snapshot version of internal artifacts. The release server had the same configuration, and ‘releasing’ a build meant building it again, with the release version of all our internal artifacts, after which it could be deployed to production.

We have recently switched to a setup where each build in the CI server has automatically configured snapshot dependencies on all other relevant builds, and the Git trigger is configured to run when any of the dependencies change. This means that on any commit to a project, it is built along with all projects which depend on it. Among other things, this allowed us to move to a ‘quick release’ mode, where releasing a build means merely taking the latest snapshot and changing its name (well, some other stuff too, but nothing that has to do with the compiled code itself).

Well, that’s all very cool. Where’s the problem?

The problem is that we have quite a lot of projects going on. Before we started doing dependency reduction, our setup included about 220 build configurations, with almost 700 dependencies between them. On the first couple of times we tried to make this process work, we inflated the TeamCity build queue to many hundreds of builds, leading to extreme slowness, Git connections getting stuck, and in some cases the server crashing completely.

A few of these incidents exposed real TeamCity issues or places for optimization. But the clear conclusion was that we needed to reduce the number of builds, especially the number of ones trying to run concurrently. Since when you add a build to the queue in TeamCity all its snapshot dependencies are added as well (at least until suitable previously existing artifacts are found), this essentially meant reducing the number of dependencies.

Interesting. How do you reduce the number of dependencies?

Well, here’s where nifty math starts to creep in. The important thing to remember here is that the structure of snapshot dependencies between builds is a Directed Acyclic Graph, or DAG.  A graph is a collection of vertices (build configurations in our case), and edges (dependency relations). Each edge connects exactly two vertices. The graph is directed, meaning that each edge has a direction. In our case, a dependency between builds A and B means that A is dependent on B, but not the other way round. The graph is also acyclic (i.e. having no cycles), as a build cannot ultimately depend on itself (in fact, TeamCity throws an error if you try to define dependencies that form a cycle).

Now, what we want to do is take our graph of dependencies and remove all redundant edges. A dependency of A on B is redundant in this case if A is already dependent on B through some other path. For example, if A is dependent on B and C, and B is also dependent on C (3 edges in total), we can safely remove the dependency of A on C, and still know that a build of C will trigger a build of A (because it will trigger B, on which A is still dependent).

As it turns out, in Graph Theory this concept is called ‘Transitive Reduction’, and for a DAG the transitive reduction is unique. From the Wikipedia entry: “The transitive reduction of a finite directed graph G is a graph with the fewest possible edges that has the same reachability relation as the original graph. That is, if there is a path from a vertex x to a vertex y in graph G, there must also be a path from x to y in the transitive reduction of G, and vice versa. The following image displays drawings of graphs corresponding to a non-transitive binary relation (on the left) and its transitive reduction (on the right).”

[Figure: a graph representing a non-transitive binary relation (left) and its transitive reduction (right)]

The same wiki entry also tells us that the computational complexity of finding the transitive reduction of a DAG with n vertices is the same as that of multiplying Boolean matrices of size n*n, so that’s what we set out to do.

We Were Promised Some Math, I Recall…

Ok, here’s where it gets nerdy. A graph with k vertices may be trivially represented as a k*k matrix G, where the value at G(i,j) is 1 if there is an edge from i to j, and 0 otherwise. Now, a nice trait of the matrix representation of a graph is that it can easily be used to find paths between vertices. For example, if we raise the matrix of a graph to the 2nd power, we get a new matrix of the same size, G², where the value at G²(i,j) is the number of paths of length 2 from i to j (i.e. the number of vertices v that have an edge from i to v and an edge from v to j). The same goes for any other natural power (the proof is pretty simple, and left as an exercise for the reader).

Now, consider that any edge from i to j in our original graph is redundant if there is a value greater than 0 at Gⁿ(i,j) for some n > 1. This is because a positive value indicates a path of length n from i to j, so we can naturally remove the direct edge and still be assured that j can be reached[1] from i.

Since our graph has k vertices and no cycles, the longest dependency chain possible is of length k-1. This means that any power of G greater than k-1 will be a null matrix. In fact, if our longest dependency chain is of length k’, any power of G greater than k’ will be a null matrix.

This gives us an easy way to find redundant edges in our original graph. We compute the powers of G from 2 upwards, until we get to a null matrix. Then we take G’, the sum of all these powers, and set G(i,j) to 0 wherever G’(i,j) is greater than 0. This is equivalent to removing every edge that has an alternative path of length 2 or greater, which is exactly what we set out to do[2].
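For the geeks who prefer code to prose, here is a hedged sketch of the algorithm in Scala – our own illustration, not the actual DependencyManager code. Boolean matrix powers accumulate every path of length 2 or more, and any direct edge duplicated by such a path is dropped; a non-zero diagonal in any power signals a cycle, which is discussed further below:

object TransitiveReduction {
  type Matrix = Vector[Vector[Boolean]]

  // Boolean matrix product: result(i)(j) is true iff some v has a(i)(v) && b(v)(j)
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val k = a.size
    Vector.tabulate(k, k)((i, j) => (0 until k).exists(v => a(i)(v) && b(v)(j)))
  }

  def isNull(m: Matrix): Boolean = m.forall(row => !row.contains(true))

  def hasCycle(power: Matrix): Boolean = power.indices.exists(i => power(i)(i))

  // Removes every edge (i, j) that is duplicated by a path of length >= 2.
  // Only valid for a DAG; a cycle detected in any power aborts the reduction.
  def reduce(g: Matrix): Matrix = {
    val k = g.size
    var power = multiply(g, g)                    // paths of length 2, then 3, ...
    var redundant = Vector.fill(k, k)(false)
    while (!isNull(power)) {
      require(!hasCycle(power), "cycle detected; reduction would remove needed edges")
      redundant = Vector.tabulate(k, k)((i, j) => redundant(i)(j) || power(i)(j))
      power = multiply(power, g)
    }
    Vector.tabulate(k, k)((i, j) => g(i)(j) && !redundant(i)(j))
  }
}

For the earlier example (A depends on B and C, B depends on C), reduce drops the direct A→C edge, since the A→B→C path already shows up in the square of the matrix.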

A minor point of interest here is that previous proofs of this algorithm (such as the one by Aho, Garey and Ullman below) have only gone as far as to show that once you have identified the redundant edges, you can remove any one of them, and then you have to compute the redundancies all over again.

It is, however, easily proven by induction that if you can remove one redundant edge, you can remove them all at the same time, as long as the graph is acyclic. Indeed, if the graph has cycles, our entire algorithm is faulty, and will end up removing many more edges than we want.

This proved a serendipitous discovery, since once you have a matrix representation of a graph, cycle detection is extremely easy. Recall that when looking at G² we know that the value at G²(i,j) is the number of paths of length 2 from i to j. Naturally, this means that if we have a non-zero value at some G²(i,i) along the diagonal of G², there is a cycle of length 2 from i to itself.

This is true for any power matrix of our graph, so we added a cycle detector to our dependency manager. After calculating a power matrix, we check that the trace (sum of the values along the diagonal of the matrix) is 0. If the trace is not 0, we locate the cycles and alert about them, leaving the dependency structure unchanged until the issue is resolved.

Does This Really Work?

As you may recall from the start of this post, when we started working on this mechanism we had 220 build configurations, with about 700 dependencies between them. Attempts at activating any sort of automatic dependency chains caused TeamCity build queue to swell up with many hundreds of builds, resulting in stuck Git connections, server crashes, and other shenanigans.

In contrast, we now have about 50% more projects. The last run of the dependency manager covered 310 build configurations, resulting in only 234 dependencies post-reduction. TeamCity queues rarely go over a hundred builds, and most of these are quickly disposed of as artifacts built from the same code revision already exist. This means that we were able to activate the automatic dependencies while keeping the ratio of time builds spend in the queue to actual build time from going up.


[1] This is only true since we implicitly assume the graph has no edge from a vertex to itself. This is trivial in our case, since a build never depends on itself.

[2] For a more rigorous proof, see: Aho, A. V.; Garey, M. R.; Ullman, J. D. (1972), “The transitive reduction of a directed graph”, SIAM Journal on Computing 1 (2): 131–137, doi:10.1137/0201008, MR 0306032.

 

There Are So Many CI/CD Tools, Why Write One In-house?

About 3 years ago we first set foot on the long and winding road to working in a full CI/CD mode. Since then we have made some impressive strides, finding the way to better development and deployment while handling several years’ worth of legacy code.

An important lesson learned was that CI/CD is a complex process, both in terms of conceptual changes required from developers and in terms of integrating quite a few tools and sub-processes into a coherent and cohesive system. It quickly became clear that with the growing size of the development team, a central point of entry into the system was needed. This service should provide a combined dashboard and action center for CI/CD, to prevent day to day development from becoming dependent on many scattered tools and requirements.

To this end, we started building Wix Lifecycle more than two years ago. This post describes Lifecycle in broad terms, giving an overview of what it can do (and a little about what it is going to be able to do). Following posts will describe interesting series and other fun stuff the system does.

Keep in mind that our processes are nowhere near full CI/CD yet, and that Wix Lifecycle has to allow a great deal of backward compatibility for processes that are remnants of earlier days, and for teams that move towards the end goal at a different pace.

So What Does It Do?

A full CI/CD process is a complicated creature. Not only does it require of many developers a significant change in their mindset and working practices, it also relies on many tools which are crucial for the process. Many of these tools are used in ‘classic’ development processes as well (everybody needs a VCS), but a main tenet of CI/CD is to make the process as automatic as possible (ideally completely).

In our experience this meant we had to write a single system, integrating all our tools and services in one central dashboard / action center. In my opinion, any company which embraces CI/CD in today’s tooling ecosystem will need to craft a similar system to suit its own needs, handling quirks and idiosyncrasies peculiar to its own build, deployment and monitoring process.

Lifecycle handles (either directly or through links to other services) all tasks associated with building, deploying, monitoring and – if necessary – rolling back our artifacts. In our case, this means that Lifecycle integrates with:

  • VCS – in our case both a local Gitorious installation and github projects.
  • Build servers – 2 instances of JetBrains TeamCity
  • Maven – through our own release plugin
  • Artifactory
  • Issue Tracker – Jira
  • Deployment services – handled by Chef through in-house services
  • Monitoring – NewRelic monitoring and our own AppInfo (method-level monitoring on individual artifacts) and other internal services.

As you will see below, Lifecycle has action buttons that allow users to control the deployment cycle of their artifacts. A user can release a build – provided it is green in CI – which sets the internal versions of dependencies, and allows it to pass to manual QA (where required). Released builds can be deployed to a ‘testbed’ production server, GA-ed (deployed to all relevant servers), and rolled back if necessary. In the same fashion, ‘Hotfix’ branches can be created in a project (as I mentioned, we are not fully CI/CD, so this happens) and artifacts can be released and deployed from those branches as well.

In addition, Lifecycle provides many discrete small services and reports which are easily generated once such a system is in place. E.g. a ‘where is my jar’ API that allows a developer to find out which configurations in TeamCity are responsible for the jar on which she just discovered a dependency, or an ‘Idle Builds’ report which shows projects that were not deployed for a specified time interval.

And How Does It Look?

In the pic below you can see a sample Lifecycle screenshot from the desk of an actual user.

[Screenshot: a sample Lifecycle dashboard]

On the left we have a list of TeamCity projects (the dark grey lines) and child build configurations (lighter grey lines). Each user adds the projects and builds relevant to her, to see a concise status of the relevant parts of the system.

The right side of the screen shows system events – current and recent deployments, summaries of pertinent NewRelic reports, system events from the Lifecycle application itself, and current running builds.

The middle section shows details for a selected project. The header has the build name, owner, and links to the Gitorious repository, CI and Release TeamCity servers, NewRelic monitoring for this build’s artifact, and a couple of internal monitoring services.

Below the header are two sets of buttons – Action buttons in dark blue, and information buttons in lighter blue. The pic below shows the central info panel when each of the buttons is selected. Naturally, artifact names link to the relevant Artifactory location. Server names, on the other hand, link to our internal server monitoring (AppInfo) screen on each server for the specific deployable associated with the build configuration.

[Screenshot: the central info panel shown for each of the buttons]

How Does It Work, Really?

Lifecycle is written in Java, with the frontend developed in AngularJS. In most cases, it integrates with other tools through REST APIs. When needed, and possible, we extend those APIs for our own purposes. For example, you can see our extension of the TeamCity REST API at https://github.com/wix/teamcity-rest-api-extension. As mentioned above, we have our own Maven release plugin as well.

When faced with more obtuse tools – in particular, Gitorious is notorious for its lack of API – we hack around them, and in a future post we may show how we can create and manage Git repos in Gitorious programmatically (basically, there is a lot of form scraping and JavaScript tomfoolery, but it is working).

Wherever possible, we make use of existing features provided by external tools. Artifactory, Maven and TeamCity are pretty well integrated. We make extensive use of TeamCity API (originally a plugin, now part of TeamCity), as well as the tcWebHooks plugin (at http://netwolfuk.wordpress.com/teamcity-plugins/). Similarly, while we have our own Maven plugin for releasing projects, we prefer using existing plugins wherever possible.

To touch on a point made early in this post, while we believe that every company above a certain size will need to create its own CI/CD system (at least for the near future), the complexity of such a system makes relying on external tools all the more important.

What’s Next for Lifecycle?

Many more things are in store for Lifecycle, and more requirements become apparent as the development team grows.

The main things on the table in the short term are a better users/roles/permissions mechanism, a Build/project management system that will let users create and manage TC builds/projects; Git repos and more from the Lifecycle app itself; and some UI improvements.

What’s the Next Post?

The next post in this series will be about a little piece of Lifecycle called DependencyManager, which calculates – through some slightly nifty math – the smallest dependency graph equivalent to our full graph, so it can set TeamCity triggers in a way that ensures all relevant projects are built with the relevant dependencies using the minimal number of builds.

Another post I would expect is about the Wix release/deploy process, demonstrating the progress we have made towards full CI/CD and the steps still left to make.

And of course, feel free to ask in the comments about any part which you would like to read more about.

 

In slightly more than 2 weeks, Tomer Gabel – with whom I founded Underscore, the Israeli Scala user group – and I are hosting Scalapeño, the first Israeli Scala conference. Working to promote the conference, Typesafe have been kind enough to host me as a guest blogger. Here’s a reblog of that post.

Of two beginnings

I first came upon the name Scala in 2009, when the startup I worked for started using Groovy to write smaller, more concise end-to-end tests. Starting to learn Groovy, I came upon the now-infamous quote by James Strachan. I shortly looked at the Wikipedia page for Scala, got scared by the syntax (“it looks too much like Ruby. I hate Ruby!”) and went back to Groovy.

Fast-forward to late 2010. I’m now a backend engineer for Wix.com, where lots of server software needs to be rewritten, and this time it needs to be more maintainable, readable, testable and concise. I recall my brief encounter with Scala and decide to give it another shot, using the O’Reilly online book as my guide. This time I got hooked. The concepts of immutability hit close to home, I had good experiences with the collections framework in Groovy, and more importantly, it all just made sense. I had made up my mind – I’m now a Scala developer.

A very delicate time

And it wasn’t easy. I managed to get a core team of engineers excited about the language, its powers and potential, and we decided to develop one project in 100% Scala. However, it soon became apparent that the tools just don’t cut it. These were the days of IntelliJ IDEA 10 (our IDE of choice) and its Scala plugin was… less than convenient. We ran into frequent bugs and issues with the development environment, and we made some bad choices of 3rd-party libraries. Towards the middle of 2011 we decided to hold back with Scala for a few months and give it another try when the language, the community and tooling support become more mature.

I kept the flame going during most of 2011, checking tooling support and the language maturity often, migrating the one existing artifact to 2.9 when it came out. In the meantime, I kept programming in Java, but my coding style was not the same. I kept looking for ways to enforce immutability and to have functional or pseudo-functional transformations (turning to Google’s Guava suite of libraries), and even wrote a small utility library for Option and Either in Java.

Finally, early in 2012 we decided it was time to try again. This time it was an immediate success. We wrote several of our core products in Scala, making heavy use of pattern matching, immutable value objects and collections and asynchronous execution using Futures and Promises. We now have about 50% of our codebase in Scala, lazily migrating existing Java code as part of the maintenance tasks, and have decided to use Scala exclusively for new projects.

Spreading the word

One of the problems of being at the forefront of technology is having very few people you can consult with or come to for help, so very soon I started looking for other people who shared my passion for Scala. It was during the 2nd meetup of the JJTV group that I met Tomer Gabel, my partner in running Underscore. It soon became apparent that Tomer is far more experienced with Scala than I was at the time, so naturally it was to him I turned to help persuade my colleagues at Wix to give Scala a second chance. As time went by, I came upon some other avid Scala enthusiasts, mostly by chance.

There are two things you need to know about Israel: it’s a small country with a small but very active software community, and Israelis are, let’s say, not the most delicate of peoples. As a result of the two, the country is ideal as an incubator for new technologies and innovation. Recognizing this, Tomer and myself decided it was time to form a local group to help bring Scala users together and help interested parties at making the switch.

Introducing Scalapeño

The next logical step was a conference. For this purpose, we contacted Typesafe and several leading Israeli companies for help and sponsorship and put out a call-for-papers in the local and international community. Eventually we came up with an impressive (we’d like to think so, at least) agenda, including talks by Typesafe’s very own Stefan Zeiger and EPFL’s Eugene Burmako, as well as leading Israeli Scala engineers.

By running the Scalapeño conference and the Underscore group, we hope to bring together local Scala developers, some of which may not even know other companies that have adopted the language, help raise awareness for Scala as a prime choice for new projects and bring it closer to the mainstream in the Israeli software industry. We’re already seeing new companies paying attention to Scala and the birth of local Scala shops, and we have very high hopes for the future of Scala in Israel.

Reblogged from the Typesafe blog.


How Many Threads Does It Take to Fill a Pool?

On June 12, 2013, in Uncategorized, by Yoav Abrahami

In recent months we have been seeing a small but persistent percentage of our operations fail with a strange exception – org.springframework.jdbc.CannotGetJdbcConnectionException – “Could not get JDBC Connection; nested exception is java.sql.SQLException: An attempt by a client to checkout a Connection has timed out.”

Our natural assumption was that we have some sort of contention on our C3P0 connection pool, where clients trying to acquire a connection have to wait for one to be available. Our best guess was that it is this contention that causes the timeouts.

So of course, the first thing we did was increase the max number of connections in the connection pool. However, no matter how high we set the limit, it did not help. Then we tried changing the timeout parameters of the connection. That did not produce any better results.

At this point intelligence settled in, and since guessing did not seem to work, we decided to measure. Using a simple wrapper on the connection pool, we saw that even when we have free connections in the connection pool, we still get checkout timeouts.
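Such a wrapper does not have to be anything fancy; here is a hedged sketch of the idea (not the code we actually used) that simply delegates to the real pool and times every checkout:

import java.io.PrintWriter
import java.sql.Connection
import javax.sql.DataSource

// Delegates everything to the underlying pool, reporting how long each checkout took
class TimingDataSource(underlying: DataSource, reportNanos: Long => Unit) extends DataSource {

  private def timed(acquire: => Connection): Connection = {
    val start = System.nanoTime()
    try acquire
    finally reportNanos(System.nanoTime() - start)
  }

  override def getConnection: Connection = timed(underlying.getConnection)
  override def getConnection(user: String, password: String): Connection =
    timed(underlying.getConnection(user, password))

  // Plain delegation for the rest of the DataSource contract
  override def getLogWriter: PrintWriter = underlying.getLogWriter
  override def setLogWriter(out: PrintWriter): Unit = underlying.setLogWriter(out)
  override def getLoginTimeout: Int = underlying.getLoginTimeout
  override def setLoginTimeout(seconds: Int): Unit = underlying.setLoginTimeout(seconds)
  override def getParentLogger: java.util.logging.Logger = underlying.getParentLogger
  override def unwrap[T](iface: Class[T]): T = underlying.unwrap(iface)
  override def isWrapperFor(iface: Class[_]): Boolean = underlying.isWrapperFor(iface)
}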

Investigating Connection Pool Overhead

To investigate connection pool overhead we performed a benchmark consisting of 6 rounds, each including 20,000 SQL operations (with a read/write ratio of 1:10), performed using 20 threads with a connection pool of 20 connections. Having 20 threads using a pool with 20 connections means there is no contention on the resources (the connections). Therefore any overhead is caused by the connection pool itself.

We disregarded results from the first (warm-up) run, taking the statistics from the subsequent 5 runs. From these data we gather the connection checkout time, connection release time and total pool overhead.

The benchmark project code can be found at: https://github.com/yoavaa/connection-pool-benchmark

We tested 3 different connection pools:

  • C3P0 – com.mchange:c3p0:0.9.5-pre3 – class C3P0DataSourceBenchmark
  • Bone CP – com.jolbox:bonecp:0.8.0-rc1 – class BoneDataSourceBenchmark
  • Apache DBCP – commons-dbcp:commons-dbcp:1.4 – class DbcpDataSourceBenchmark

(In the project there is another benchmark of my own experimental async pool – https://github.com/yoavaa/async-connection-pool. However, for the purpose of this post, I am going to ignore it).

In order to run the benchmark yourself, you should set up MySQL with the following table:

CREATE TABLE item (
  file_name     varchar(100) NOT NULL,
  user_guid     varchar(50) NOT NULL,
  media_type    varchar(16) NOT NULL,
  date_created  datetime NOT NULL,
  date_updated  timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY(file_name)
)

Then update the Credentials object to point at this MySQL installation.

When running the benchmarks, a sample result would look like this:

run, param, total time, errors, under 1000 nSec,1000 nSec - 3200 nSec,3200 nSec - 10 µSec,10 µSec - 32 µSec,32 µSec - 100 µSec,100 µSec - 320 µSec,320 µSec - 1000 µSec,1000 µSec - 3200 µSec,3200 µSec - 10 mSec,10 mSec - 32 mSec,32 mSec - 100 mSec,100 mSec - 320 mSec,320 mSec - 1000 mSec,1000 mSec - 3200 mSec,other
0, acquire,29587,0,0,5,1625,8132,738,660,1332,1787,2062,2048,1430,181,0,0,0
0, execution, , ,0,0,0,0,0,0,0,1848,6566,6456,5078,52,0,0,0
0, release, , ,0,8,6416,9848,3110,68,77,115,124,148,75,11,0,0,0
0, overhead, , ,0,0,49,4573,5459,711,1399,1812,2142,2157,1498,200,0,0,0
1, acquire,27941,0,0,125,8153,499,658,829,1588,2255,2470,2377,1013,33,0,0,0
1, execution, , ,0,0,0,0,0,0,6,1730,6105,6768,5368,23,0,0,0
1, release, , ,0,49,15722,3733,55,42,69,91,123,101,14,1,0,0,0
1, overhead, , ,0,0,2497,5819,869,830,1610,2303,2545,2448,1042,37,0,0,0

This information was imported into an Excel file (also included in the benchmark project) for analysis.

C3P0

It is with C3P0 that we originally saw the exception in our production environment. Let’s see how it performs:

[Charts: C3P0 connection acquire, release and overhead time distributions, and an operation waterfall]

Reading the charts:

The first three charts (acquire, release, overhead) are bucket charts based on performance. The Y-axis indicates the number of operations that completed within a certain time range (shown on the X-axis). The default rule of thumb here is that the higher the bars to the left, the better. The 4th chart is a waterfall chart, where each horizontal line indicates one DB operation. Brown indicates time waiting to acquire a connection, green is time to execute the DB operation, blue indicates time to return a connection to the connection pool.

Looking at the charts, we see that generally, C3P0 acquires a connection within 3.2-10 microseconds and releases connections within 3.2-10 microseconds. That is definitely some impressive performance. However, C3P0 also has another peak at about 3.2-32 milliseconds, as well as a long tail getting as high as 320-1000 milliseconds. It is this second peak that causes our exceptions.

What’s going on with C3P0? What causes this small but significant percentage of extremely long operations, while most of the time performance is pretty amazing? Looking at the 4th chart can point us in the direction of the answer.

The 4th chart has a clear diagonal line from top left to bottom right, indicating that overall, connection acquisition is starting in sequence. But we can identify something strange – we can see brown triangles, indicating cases where when multiple threads try to acquire a connection, the first thread waits more time than subsequent threads. This translates to two performance ‘groups’ for acquiring a connection. Some threads get a connection extremely quickly, whereas some reach starvation waiting for a connection while latecomer threads’ requests are answered earlier.

Such a behavior, where an early thread waits longer than a subsequent thread, means unfair synchronization. Indeed, when digging into C3P0 code, we have seen that during the acquisition of a connection, C3P0 uses the ‘synchronized’ keyword three times. In Java the ‘synchronized’ keyword creates an unfair lock, which can cause thread starvation.
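For comparison, the java.util.concurrent.locks package offers fair locks out of the box. A tiny sketch – purely illustrative, not C3P0 code – of guarding a checkout with one:

import java.util.concurrent.locks.ReentrantLock

class FairCheckoutGate {
  private val lock = new ReentrantLock(true)   // true = fair: the longest-waiting thread goes first

  def withGate[T](checkout: => T): T = {
    lock.lock()
    try checkout
    finally lock.unlock()
  }
}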

We may try patching C3P0 with fair locks later on. If we do so, we will naturally share our findings.

The C3P0 configuration for this benchmark:

  • Minimum pool size: 20
  • Initial pool size: 20
  • Maximum pool size: 20
  • Acquire increment: 10
  • Number of helper threads: 6

BoneCP

We have tried BoneCP at Wix with mixed results, so at the moment we are not sure if we like it. We include the results of the BoneCP benchmarks in this post, though the analysis is not as comprehensive.

[Charts: BoneCP connection acquire, release and overhead time distributions, and an operation waterfall]

Looking at the charts, we can see that Bone’s connection acquisition performance is outstanding – most of the operations are completed within 3.2 microseconds, much faster than C3P0. However, we also observe that the connection release time is significant, about 1-10 milliseconds, way too high. We also observe that Bone has a long tail of operations with overhead getting as high as 320 milliseconds.

Looking at the data, it appears BoneCP is better compared to C3P0 – both in the normal and ‘extreme’ cases. However, the difference is not large, as is evidenced by the charts. Looking at the 4th chart, we see we have less brown compared to C3P0 (since connection acquisition is better) but trailing blue lines have appeared, indicating the periods of time that threads wait for a connection to be released.

As mentioned above, since we are ambivalent at best about using BoneCP, we have not invested significant resources in analyzing this connection pool’s performance issues.

Apache DBCP

Apache DBCP is known as the Old Faithful of datasources. Let’s see how it fares compared to the other two.

[Charts: Apache DBCP connection acquire, release and overhead time distributions, and an operation waterfall]

One thing is evident – DBCP performance is superior to both C3P0 and Bone. It outperforms the other alternatives in all regards, whether in connection checkout times, connection release times, or the shape of the waterfall chart.

So What Datasource Should You Be Using?

Well, that is a non-trivial question. It is clear that with regards to connection pool performance, we have a clear winner – DBCP. It also seems that C3P0 should be easy to fix, and we may just try that.

However, it is important to remember that the scope of this investigation was limited only to the performance of the actual connection acquisition/release. Actual datasource selection is a more complex issue. This benchmark for example, ignores some important aspects, such as growth and shrinkage of the pool, handling of network errors, handling failover in case of DB failure and more.