Tuesday, July 1, 2014

Octopush

Octopush was born of the need to arbitrate our deployments to the Staging environment, to guarantee the consistency of that environment with Production, and to coordinate the DEV and RM Jenkins instances. It is basically a queue and a web page where everybody can see what is being deployed, what is waiting, and what has recently been deployed.
It also integrates with GitHub, providing SSO authentication and permission handling to push releases LIVE!
Users from different teams in the company can easily deploy their components LIVE with the push of a button; they can also roll back to any version available in the Production - Deployed section.
An interesting thing we implemented to make the process more foolproof and secure is a couple of alerts in our Nagios system: whenever they detect a number of errors above a certain threshold, they call Octopush and disable it, putting Octopush into a paused state where no more releases can be made. We combine this logic with our canary release process on Jenkins, where we deploy to a subset of servers and then wait a couple of minutes before moving on to the rest. Before resuming the deployment we check the Octopush state; if it is paused, it means there are errors LIVE and we should not proceed with the full deployment, so the Jenkins job automatically rolls back the few servers it has deployed to and returns a FAILED DEPLOYMENT state.
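The canary gate described above can be sketched roughly like this in Bash. The Octopush URL, the `/api/state` endpoint and the `deploy_to`/`rollback` helpers are hypothetical placeholders for the real Jenkins job steps:

```shell
#!/usr/bin/env bash
# Hypothetical Octopush endpoint; the real state check may differ.
OCTOPUSH_URL="http://octopush.example.com"

# Pure decision: what to do given the Octopush state.
gate() {
  if [ "$1" = "paused" ]; then echo "rollback"; else echo "proceed"; fi
}

canary_deploy() {
  deploy_to canary-servers        # hypothetical helper: deploy the canary subset
  sleep 120                       # give Nagios time to trip the error-rate alert

  state=$(curl -s "$OCTOPUSH_URL/api/state")
  if [ "$(gate "$state")" = "rollback" ]; then
    rollback canary-servers       # hypothetical helper: undo the canary deploy
    echo "FAILED DEPLOYMENT"
    return 1
  fi
  deploy_to remaining-servers     # errors stayed below threshold: roll out fully
}
```

The point of the design is that the canary job never decides on its own whether the site is healthy; it just asks Octopush, which Nagios has already paused if error rates spiked.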
Octopush has been released as an open source PHP project so everybody can download it, try it, and contribute! On the GitHub page you can find plenty of documentation and even a link to a running version with its own Jenkins installation.

Continuous Delivery at OLX (part III)

On the tooling side we have developed, besides Octopush, a very simple web app called TagReporter, where we track and show everyone which version of each component is deployed on each environment (components on the rows, environments on the columns; links take the user to JIRA, GitHub or JFrog).

Additionally, we provide information on the release JIRA ticket that was, or is about to be, deployed LIVE, along with its current status. The last column provides different actions according to the logged-in user's profile: devs can create tickets automatically, and Release Engineers can trigger deployments and rollbacks. The app integrates all these tools via REST APIs, providing a transactional deployment service: it calls Jenkins (triggering deploys), interacts with JIRA (creating and transitioning tickets) and reports to IRC (notifying the start and end of deployments).
Deployment scripts report back to TagReporter with the component, version and environment that was just deployed.
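The report-back step at the end of a deployment script looks roughly like this; the TagReporter URL, endpoint path and payload field names are assumptions, not the tool's actual API:

```shell
#!/usr/bin/env bash
# Hypothetical TagReporter endpoint.
TAGREPORTER_URL="http://tagreporter.example.com"

# Build the form payload for a finished deploy (pure, so it is easy to test).
deploy_payload() {
  echo "component=$1&version=$2&environment=$3"
}

# Called as the last step of a deployment script.
report_deploy() {
  curl -s -X POST "$TAGREPORTER_URL/api/deploys" \
       -d "$(deploy_payload "$1" "$2" "$3")"
}

# Usage (hypothetical component name):
#   report_deploy listings 2.4.1 staging
```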
Since the tool saves all this information, it also provides reports of historical deploys, allowing all kinds of filters.
We used Hubot for IRC interactions, not only for notifications but also to ask it things like the versions deployed on different servers or configuration values.
Finally, our DBA team has developed a tool for tracking and running SQL scripts: devs commit scripts to Git (using pull requests to have them reviewed) and the tool runs them.

Continuous Delivery at OLX (part II)

Moving on with the Continuous Delivery posts at OLX, another big challenge we had was how to deal with application properties/configuration files. At the beginning everything was done manually: each time a developer wanted to change some application config file, he would file the request in a JIRA ticket, specifying the environment, the file and the diff. As you can imagine, this brought a lot of problems, from human typing errors to lack of visibility and traceability of changes. So we desperately needed version control over these files (they were treated differently from code) and an automatic way of making the changes without compromising the stability of our environments. We wanted devs to make the changes themselves but also keep the Release Management and Operations guys in the loop, giving the heads-up (or not) whenever they could foresee a dangerous change. Another issue was security: in the same config file we had credentials with passwords right next to feature flags or third-party service URLs.
So first we started splitting these files (security-sensitive settings and the rest); credentials were not something we wanted devs or any unauthorized person to see.
Secondly, we adopted GitHub (though this was embedded in a much larger decision, since we also decided to port our whole SVN tracking system to Git, but we'll talk more about this in another post), so we created a config repository for each code repository we had. We placed the config files there (using the original paths) but used branches to specify the environment.
Third, we gave devs only read permission on these repos, so they have proper visibility of the configuration and the changes made over time in each environment. But since we didn't want them writing directly to the repos, we decided to use GitHub's pull requests: devs propose the change to us, and we review and approve it (or not). The approval is the commit itself, so it triggers a commit hook on a Jenkins job that deploys the configuration automatically (only in QA environments). GitHub is great for this; it makes pull requests really easy. Devs only need to edit the files online, and since they don't have write permission, GitHub creates a fork and offers to create the pull request against the original repo automatically.
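The branch-per-environment layout can be illustrated with a toy repo; the repo name, property file and values here are all made up:

```shell
#!/usr/bin/env bash
# Toy config repo: one branch per environment (all names hypothetical).
git init -q listings-config
cd listings-config
echo 'search.timeout=3' > app.properties
git add app.properties
git -c user.name=dev -c user.email=dev@example.com commit -qm 'initial config'
git branch -m qa                 # the default branch becomes the QA environment
git checkout -qb staging         # staging starts from the QA values
sed -i 's/timeout=3/timeout=5/' app.properties
git -c user.name=dev -c user.email=dev@example.com commit -qam 'staging: raise timeout'
```

Each environment branch carries the same files at the same paths, so the deploy job only needs to know which branch to check out.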

This whole process had been working great for us for some time, but we still needed to deal with credential files. They weren't being tracked, and losing them could have been catastrophic if we hadn't had backups (and we all know backups are not enough when you need traceability). So we needed to track them, but give visibility only to certain authorized IT people. We thought of having yet another repo with more restricted permissions for credentials, next to code and config, but we would have had to create yet another deployment process for it. Then we found git-crypt. This great tool lets us keep encrypted and non-encrypted files in the same repo; you only need to install it and have the proper keys to see the files, otherwise you see only rubbish (even GitHub shows them as binary files). This way we saved ourselves another set of repos and a custom process for credentials: we do the same as for normal config files, but we install git-crypt and the keys wherever we check out the repo, and voilà!
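A git-crypt setup in a config repo looks roughly like this. The `init`, `add-gpg-user` and `unlock` commands are git-crypt's actual commands; the credentials file name and GPG identity are placeholders:

```shell
#!/usr/bin/env bash
# One-time setup in the config repo (commented out here because it needs
# git-crypt installed and a GPG key available):
#   git-crypt init                       # generate the repository key
#   git-crypt add-gpg-user release@olx   # placeholder GPG identity

# Tell git which files get encrypted on commit, via .gitattributes:
echo 'credentials.properties filter=git-crypt diff=git-crypt' >> .gitattributes

# After cloning on an authorized machine:
#   git-crypt unlock                     # decrypts the files in the working tree
# Anyone without the key sees only binary garbage, even on GitHub.
```

The nice property is that the encrypted files live in the same repo and follow the same pull-request and deploy flow as the plain config files.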
None of these changes were made without some resistance: many devs were used to just typing some new configuration in a ticket, or asking what value is set for "foo.bar" in Prod (which clearly doesn't scale when you have more than 100 devs and thousands of settings). But we gave them a much more powerful tool, and in time everybody got used to it and learned to love it.

Continuous Delivery at OLX (part I)

A year ago we started an ambitious project at OLX: achieving Continuous Delivery. You may ask why. It's a fair question, and there's a good answer too: because we needed a change, and like most changes this one was triggered by pain. This is a very common story among IT companies: every release is painful; it's a whole bunch of features and fixes that most Operations guys don't know about, and it's infrequent, manual and error-prone. Even though we were releasing once every two weeks, which is a lot for many companies, for us it was clearly not enough; we had several issues, rollbacks and even blackouts every time.
So basically we started meeting once a week to discuss what Continuous Delivery would look like at OLX, how to build a pipeline, and WHAT the HELL a pipeline even was! Anyway, many of us had had the chance to read the famous book by Jez Humble and were excited about the possibilities. But we were light years away from what was described there. I mean, most teams hadn't even gotten to continuous integration, and our architecture didn't help either, since it was basically a gigantic monolithic PHP codebase that we would check out from SVN and rsync completely onto our web servers in about 10 minutes or so.
Another issue was that the few dev teams that were continuously integrating had different Jenkins installations, each with a different set of plugins and tools. Even I, as a Release Manager, had my own Jenkins, which I found very useful for some automatic deployment tasks I had already started implementing. So the first challenge was to decide what to do with our Jenkins installations. If we integrated the whole thing into one Jenkins it would be a mess: too many hands on the same plate, many different technologies (from different teams) and different roles as well (compiling, testing, deploying). Who would be the owner of such a tool? It was clear to us that this wasn't a great solution; for smaller companies it would be more than fine, but it doesn't scale.
So we decided to keep our Jenkins installations separate and started researching how to integrate them. Sadly, there's no plugin out there for this (none that we could find, anyway), and the Jenkins master/slave schema doesn't fit our problem, because a slave is just that: a plain simple slave that mirrors its master's jobs.
So we used the Bash console in a Jenkins job to curl another Jenkins and trigger the remote job. It turned out to be pretty easy, and thanks to the Jenkins API the caller job could loop until the callee finished, and then we could even retrieve the result, whether it failed or succeeded.
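The trigger-and-poll loop can be sketched like this. The remote Jenkins URL and job name are hypothetical, authentication (API tokens, crumbs) is omitted for brevity, and polling `lastBuild` is a simplification (it assumes no concurrent builds of the same job); `POST /job/NAME/build` and `/job/NAME/lastBuild/api/json` are Jenkins' actual remote API endpoints:

```shell
#!/usr/bin/env bash
# Hypothetical remote Jenkins and job.
REMOTE_JENKINS="http://rm-jenkins.example.com"
JOB="deploy-to-qa"

# Extract the "result" field from the lastBuild JSON (pure, so it is testable).
build_result() {
  grep -o '"result":"[A-Z_]*"' | cut -d'"' -f4
}

trigger_and_wait() {
  curl -s -X POST "$REMOTE_JENKINS/job/$JOB/build"   # queue the remote job
  result=""
  while [ -z "$result" ]; do                         # "result" is null while running
    sleep 10
    result=$(curl -s "$REMOTE_JENKINS/job/$JOB/lastBuild/api/json" | build_result)
  done
  [ "$result" = "SUCCESS" ]                          # propagate failure to the caller
}
```

Because the caller's build step exits nonzero when the remote result isn't SUCCESS, the calling Jenkins job fails exactly as if the work had run locally.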

Our first experiments were made possible by the Mobile team's collaboration, and this is a key part: find a team willing to experiment and shake things up a little. They would build the code on each commit with their Jenkins, then call the Release Management (RM) Jenkins to get their code deployed on a QA environment, and once that was successful they could run a battery of acceptance tests against the environment. Easy, right? Well... it took us some time to get there.
We will continue talking about the CD project at OLX in following posts; there is a lot more to this story, including the open-sourcing of our deployment tool, Octopush!