17 April 2020
When I started out as a software developer in a multinational company, my main drive was improving my skills as a coder, so I was always skeptical of the long meetings, the pain of setting up the environment, and all the time spent running and fixing tests with almost no return on investment. During each of those meetings, one thought kept running through my mind: “Why do we keep losing time when we could be implementing this thing?”
After I resigned, I started working at a startup with some friends, and the beauty of working there couldn't be compared to my previous job. There were no rules, no meetings and no restrictions on the resources that we had. We didn’t need to file a request to get access to a repository, wait days to have a program installed, or fix tests. We were just a team of four people sharing technical knowledge with one another and working day and night on our idea, using a wall full of sticky notes with features and bugs as a backlog and competing to see who would finish more of them by the end of the day. The only drawback was not having someone with more experience to guide us on the right path, so we made mistakes, learned from them and corrected them as we went along.
As our codebase grew, so did our team, and since every member was working on their own little piece, we started losing sight of the bigger picture. We couldn’t know whether something done by X could impact something done by Y, and with burnout right behind us, we made obvious errors that anyone could have foreseen had they had the chance to review what was written.
We realized that we couldn’t keep working like this for much longer: our backlog was beginning to fill up with regressions and technical debt. For the first time, I understood why we had all the meetings and all the restrictions in that company. We needed a model like that to have stability. But we were still a startup. We couldn’t afford to lose precious time in meetings or in lengthy reviews of what had been done. We needed our own workflow: something that would not add too much overhead to our productivity, but enough to assure us that production errors would decrease and that we would be able to see an overall picture of how development was going.
So the search began for tools and services that could help us automate most of that overhead. As for every company out there, the best and most obvious starting point was Atlassian, since we were already using Bitbucket Cloud as our versioning tool. But we didn’t stop there. We added continuous integration and continuous delivery tools, a private Docker registry, error monitoring tools, an analytics engine to store, search and view logs, automatic static code analysis, and static and dynamic security testing services. All of them were interconnected with one another and connected to the same user directory to ease user management, and we chose the open-source solution whenever one that suited our needs existed. We added and configured them with the mindset that if something can be automated, then it should be automated.
Since our team works behind a VPN, we wanted to have all the tools needed for development in-house, so we installed them on-premises, on our local server. This is the point where things started to fall apart. Within two months, a blackout fried the server’s motherboard. If it weren't for some backups that we had made manually, all the data would have gone down the drain, from tasks and documentation to code. Even so, we lost all the hours of work spent installing and configuring those tools.
To keep this from happening again, the cleanest viable solution was Docker. Since Atlassian and some other third-party providers didn’t support Docker, we used images created by the community or built our own images to suit our needs. Everything was working, every volume was stored on our network-attached storage, and we could have replicated the whole setup on any virtual machine with a single docker-compose command. But it wasn’t scalable. And if something happened to a container, by the time we figured out that it was down and what the problem was, it was already too late.
To mitigate this issue, the solution at hand was to move everything to Kubernetes. Combined with Rancher, all the tools are scalable, every piece of data is backed up every day, and we get notified about what’s not working and why.
By the beginning of 2020, our attempt to bring some order and predictability to the development process had turned into a whole project, deployable and replicable with just one click, with self-scaling services that are constantly monitored for flaws and malfunctions. From versioning tools, task managers, error handlers and notifiers to code analysis tools, log monitoring, and CI/CD, all the services that we use are under one manager that works for us as our 24/7 DevOps, providing us with metrics, statistics, and forecasts about the infrastructure.
We believed that adding overhead to the development process would hurt performance. In fact, that overhead, and the time spent building this process, saved us months of refactoring and, most importantly, our relationships with our clients: if one of them called us to report a problem, our developers were already preparing a fix, knowing which task and which commit were involved, who added that commit, how the issue was introduced and why it wasn’t caught by the tests.
This is the trade-off that everyone has to make at some point in order to bring order to chaos, and we found out the hard way that the overhead is imperceptible if you use the tools right. We found a balance in our workflow, but we believe that it isn’t finished yet. We are on the 3rd iteration and we still think that it can be better. Every day we improve our system, automating the automation and having the machines work for us and not the other way around, to make our lives and the lives of the people who come after us easier.