The channel for cloud experts, by cloud experts
S01:E01 - Break Free from the Task Dungeon
The constant task stream is becoming a heavy burden in cloud operations. Luke Walker talks about automating routine cloud operations and building operational efficiency.
S01:E01 - Break Free from the Task Dungeon - Podcast Transcript
Rick: A CloudOps workday is filled with many big and small tasks. Dealing with the constant task stream is becoming a bigger burden than many realize. It’s time to talk about task automation.
This is OpsTalk Podcast. Here we talk about the real, everyday reality of cloud ops. I am Rick Hebly, thanks for tuning in.
With me today is Luke Walker. Luke runs product management at MontyCloud. Great to have you, Luke! Today we are talking about the task stream. What does that mean, and what problems do these task streams present?
Luke: Thanks Rick, great to join OpsTalk. Here is what I’m seeing. The reward for working hard is more work, and nothing could be more true for an Ops team. Companies are expanding their cloud footprints. They migrate more workloads, they refactor more applications and build more natively in their cloud.
Unlike a project with a fixed start and end, your typical CloudOps team has a never-ending stream of tasks, and it's no surprise that teams can't hire or spend fast enough to meet this accelerating workload.
Rick: Can you tell us what kind of tasks we are talking about and how that can spin out of control? Is it simply related to the amount of resources living in the cloud?
Luke: Most look at the amount of resources deployed, but it’s really about app consumption: the more people use your apps, the more tasks appear in your ops backlog.
Now, app consumption is a good thing. It means you’re delivering value to the business, and that’s what you’re building for. But are you ready for the operations load that comes with it?
Your teams are then under pressure when you assess the priority, volume and sequencing of tasks that must take place to maintain service levels.
Yet trying to carve out time to find a more efficient way to cope with that pressure is a constant struggle that I hear from customers.
Rick: Can you give me an example?
Luke: Watching your Ops team be buried by report deployment is one place to start.
One of our early customers is a luxury jewellery retailer with stores all over the world. The customer built their Enterprise Data Hub on AWS, using 10 AWS services, including ECS, EMR, and Elasticsearch, to run analytics on their point-of-sale data.
Once app consumption increased, the team was getting more and more requests for changes to existing, or new reports. That in itself created even more tasks to process the changes, re-deploy, and scale their clusters to meet the new demand.
But that’s where the bigger problem lies. As with many other projects, Infrastructure as Code handled the deployment, but those same tools generally don’t deal with the operational aspects of an application. This was compounded by the fact that all of the customer’s operations had been built organically.
Tooling, alerting, and the automation of onboarding tasks were completed only as required, so Ops found themselves needing days to push, test, and measure simple report updates, and months to push a new release into production or even just implement new monitoring rules.
It gets worse: manual task execution, like deploying resources, integrating services, and setting up IAM roles across multiple teams, created many inconsistencies and errors. Even with 15 cloud engineers running operations it became unmanageable, compromising compliance.
And even if they could scale out the team fast enough, which they couldn’t, the operations and infrastructure costs would skyrocket, along with the size of the backlog and the inability to meet business demand.
So, you can imagine what the weekly Ops review with business units started to look like after several months.
Rick: Oh boy, that’s quite a predicament, now I get where you’re coming from. It sounds like it comes down to preparedness, to control the task storm. Surely cloud teams want to avoid getting into a situation like that. How can they go about it?
Luke: With the retailer, after watching the ops team operate for a couple of weeks, it was apparent that just 10 simple technical tasks were taking the team 40 hours, spread over two weeks or more, to implement, on top of all of their other tasks.
This is going to sound cliché, but you just have to automate tasks, Rick. It’s that simple. Getting those 40 hours back and fulfilling a change request in less than an hour isn’t going to happen unless you start to automate.
Rick: That is right, but also easier said than done, isn’t there a big automation debt?
Luke: Today it takes a lot of scripting and maintenance, based on the tasks you can schedule and the events you can predict. You are right: writing the code to stitch APIs together across service boundaries is where things get more difficult. It takes significant DevOps skills, a deep understanding of AWS services and architectures, as well as time. And this is all before you see any reduction in tasks, cost, and risk. Like I said, adding talent to teams is difficult nowadays.
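To make Luke's point concrete, here is a minimal sketch of the kind of hand-rolled routine-task script he is describing: a tag-policy audit that stitches the EC2 API into an ops workflow. The required tag keys and the region are illustrative assumptions, not MontyCloud or AWS conventions.

```python
# Sketch of a hand-rolled routine-task script: audit an EC2 fleet
# against a tagging policy. The tag keys below are an assumed org
# policy, purely for illustration.

REQUIRED_TAGS = {"Owner", "CostCenter"}

def missing_tags(instance_tags, required=REQUIRED_TAGS):
    """Return the required tag keys absent from an instance's tag list."""
    present = {tag["Key"] for tag in instance_tags}
    return required - present

def find_untagged_instances(region="us-east-1"):
    """Scan every EC2 instance in a region and report policy violations."""
    import boto3  # imported lazily so the pure logic above runs offline
    ec2 = boto3.client("ec2", region_name=region)
    violations = {}
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                gap = missing_tags(instance.get("Tags", []))
                if gap:
                    violations[instance["InstanceId"]] = sorted(gap)
    return violations
```

Even a check this small has to be written, permissioned, scheduled, and maintained per account and per region, which is exactly the automation debt being discussed.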
Rick: I see another predicament, but you and our product and engineering teams at MontyCloud found a better way, right? How do you make CloudOps simple and help customers break out of the automation debt?
Luke: Sure thing. First of all, AWS does a brilliant job of offering rich API sets across both the resources that make up the building blocks of your application and the tools to deploy, manage, and secure those constituents. Think of CloudFormation to provision Infrastructure as Code, Systems Manager to manage your EC2 fleet, and IAM roles to secure access, just to name a few. To take full advantage of cloud-native APIs, you need the code to stitch them together, across service and tooling boundaries, so that it works for you, in your own application context.
You have different architectures powered by various infrastructure and database instances and clusters, managed by multiple tools, possibly even across accounts and regions. There’s lots of code for DevOps to write there, provided you understand really well what outcome you’re building for. You must do it deliberately and consistently. If not, it can get ugly, as it did for our customer in the luxury retail business.
Rick: I can only imagine – most IT are not skilled DevOps engineers and Python experts. So again, how do you solve this problem for IT teams? Or should I say Cloud Centers of Excellence?
Luke: We need to break down what a deployment really means, or in our case, a DAY2 blueprint. To attain Operational Excellence, you must take into account BOTH your deployment and operations templates.
Without an operations template, you hit the same predicament our retail customer encountered: the app is deployed, but the team is drowning in tasks.
Rick: So why not just standardize your scripts, put them in a common repository and reduce scriptwriting time?
Luke: If you flip the coin over and try to establish a one-size-fits-all standard for ops, the generic operations playbook has no tie back to the deployment template, and you end up with generic tooling and automations that will never make a dent in your backlog.
Standardizing your ops tooling does help, but you may find that it only meets 30 to 40% of your needs, because every application is different.
So, when you design your architecture, that is when you must also design how you operate it, in order to stay ahead of the task stream.
This is what we believe makes our DAY2 Blueprints different – we are planning for success by forcing the conversation on not just what needs to be built, but what must be defined and automated to successfully operate that application.
Rick: That’s a plan. Failure to plan is planning to fail?
Luke: Yeah – clichés exist because they’re true. Here is what we’ve done to combat the task stream problem.
We’ve set up an entire library for our customers, featuring Well-Architected Blueprints for over 20 common services, and growing every week. Those range from basic infrastructure build-outs (pre-architected public and private VPCs, EC2, RDS, and container clusters managed with Kubernetes and/or Fargate) all the way to complex data analytics applications such as Elastic MapReduce and Elasticsearch.
And as we see blueprints being a combination of deployment and operations, every blueprint is built with a health dashboard, monitoring metrics and routine tasks out-of-the-box.
We’ve done all of the heavy lifting to get these templates compliant with AWS’s Well-Architected Framework, so our customers just get to click and deploy any blueprint on a self-service basis and take advantage of all of this work.
Rick: Sounds like a value prop, Luke. All the time I hear folks screaming for self-service deployment capabilities to lower the threshold and get going faster with AWS, instead of burning precious time reinventing the wheel (and then hoping those wheels didn’t turn out square).
Luke: Yes, and this must certainly be based on no-code operations to democratize cloud for the majority. Again, most cloud operators lack the skills or time to do it all themselves. And they are not the problem; the problem is the unrealistic expectation that everyone touching cloud can code, or should want to.
Rick: Now I picked up on something else in what you just said. You talked about post-deployment task automation. It’s true that most operations come after the deployment. Managing IAM roles, as one unpopular example. Is that what you were getting at?
Luke: That is the dilemma I was waiting for! Don’t get me wrong, it’s certainly challenging to write a CFN template to meet Well-Architected standards, but in an age where deployments are becoming more and more automated, it’s no longer the big problem to solve.
Let me share another war story. Another customer we worked with had successfully automated the deployment of data sets for data scientists performing experiments on healthcare data. These deployments create hundreds of S3 buckets every day, which reduced their time-to-experiment significantly while keeping each project and researcher isolated, meeting their HIPAA needs.
Next thing you know, operations ground to a halt when they discovered how time-intensive it was to unwind all of these deployments, or even to modify and remove access for researchers who had left their various projects, resulting in multiple P0 tickets and management calls.
Now, these tasks weren’t technically difficult, but the sheer volume of them simply crippled the team, because Operations was not included in the design phase.
And this is EXACTLY why it takes more to go from a deployment to a Well-Managed Application.
Rick: Can I interrupt you there for a second? A Well-Managed Application. What does that mean?
Luke: Good question. In a nutshell, it is an application deployed in such a way that it can be managed efficiently, is secure and is compliant.
Remember, it always starts with deployment. When you deploy an application that is manageable, then you can manage the application well. (We seem to be all about clichés today. [laughs]) In all seriousness though, it is much easier when you do it right from the start.
What comes next is your Routine Management tasks. Depending on the application, resource and policy, this can include tasks like adding and removing nodes, backup & restore or configuring alerts. These management tasks are more commonly known as DAY2 tasks.
Bringing resources under management used to be a separate workstream, including agent installation, setting access permissions, and applying governance policies. Instead of hoping that gets done, a Well-Managed Application is aware of the context and provisioned with those components and configurations at the time of deployment.
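One of the routine DAY2 tasks Luke mentions, configuring alerts, can be sketched directly against the CloudWatch API. The alarm name prefix, threshold, and period below are illustrative assumptions, not DAY2's actual defaults.

```python
# Sketch of one routine DAY2 task: configuring a per-instance CPU
# alert. Alarm naming and thresholds are assumed for illustration.

def cpu_alarm_params(instance_id, threshold=80.0, period=300):
    """Build parameters for a standard per-instance CPU-high alarm."""
    return {
        "AlarmName": f"day2-cpu-high-{instance_id}",  # assumed naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": 2,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }

def create_cpu_alarm(instance_id, **overrides):
    """Apply the alarm via the real CloudWatch PutMetricAlarm API."""
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**{**cpu_alarm_params(instance_id), **overrides})
```

Provisioning such alarms at deployment time, rather than as a later workstream, is what the "operations template" idea amounts to in practice.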
Rick: So you are suggesting that with MontyCloud your DAY2 operations are pre-configured in the application context?
Luke: Yes, I am, more than that actually. We created the ability to enable self-service tasks at an application level and at the time of deployment.
Then we also created a per application dashboard to actually monitor it, audit changes, perform routine tasks through simple clicks or on an automated schedule, track and forecast resource-level costs, and run reports for the business.
Rick: Now that is something different. You can see the entire production line come together here. Again, the no-code self-service is hitting it home. Speaking of which, I have a question. Self-service can be a little scary; without proper governance, users can break policy, compliance, budget, and what not. Have you thought about that?
Luke: Absolutely. Self-Service is governed through guardrails, and this is all part of the routine tasks bundled into the Blueprints, ensuring you can’t move out of bounds.
In the S3 bucket dilemma we talked about before, you want a self-service task that resets permissions, but that task should only be accessible to the researcher for that data set and your Ops team, and perhaps be scheduled to run on a regular basis. You may also want another task that allows alert configuration, but without the ability to remove a CloudTrail log.
Self-service guardrails are essential for a Well-Managed application.
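The permission-reset task Luke describes could look something like the following sketch: a pure decision step (which buckets to lock down) plus real S3 calls to strip the bucket policy and block public access. The roster format and the bucket-to-owner mapping are assumptions for illustration, not the customer's actual scheme.

```python
# Sketch of the S3 permission-reset task. The owner roster and
# bucket mapping are illustrative assumptions.

def buckets_to_lock_down(bucket_owners, active_researchers):
    """Return buckets whose tagged owner is no longer on an active project."""
    return sorted(bucket for bucket, owner in bucket_owners.items()
                  if owner not in active_researchers)

def reset_bucket_access(bucket_name):
    """Strip the bucket policy and block public access (real S3 APIs)."""
    import boto3
    s3 = boto3.client("s3")
    s3.delete_bucket_policy(Bucket=bucket_name)
    s3.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
```

Per Luke's guardrail point, a task like this would be exposed as self-service only to the data set's researcher and the Ops team, and possibly run on a schedule.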
Rick: Got it. Now how did it work out for that luxury jeweler you talked about?
Luke: I can tell you this, they’re in a far better place now. Remember it was taking weeks and months to execute change requests? We took the top 10 common tasks plaguing the team’s backlog, built them into the DAY2 blueprints, and change requests are now completed within the hour, down from an average of 40 hours spread over two-plus weeks.
So automated deployment governed by guardrails didn’t just reduce time and cost; they now also have monitoring and task automation set up at the time of deployment.
Rick: Thanks, Luke. I appreciate you taking the time to unpack the task automation issues and sharing proven solutions that are also very accessible. Can our listeners take a look at it themselves?
Luke: You are welcome Rick. And yes, people can get a fully-featured free trial. DAY2 is SaaS and Cloud-Native, so there’s nothing to install – and that includes managing your instances, we are agentless.
Just go to montycloud.com, hit Get Started for Free, and you’ll be ready to connect your AWS account in no time, with no code and no agents.
Rick: Brilliant. Thanks for watching the OpsTalk podcast.
On our next episode, we’re going to be talking about the Remote Console feature, where you can gain shell-level access to Windows and Linux instances without the need for VPNs, bastion hosts, or even granting access to the AWS console.
Make sure you hit subscribe on our YouTube channel and don’t forget to click the bell to make sure you don’t miss that episode. We look forward to talking to you then!
S01:E02 - One Click Remote Access To AWS EC2 VMs with MontyCloud DAY2™
Server shell access is the most potent tool in a sysadmin's toolbox. Luke Walker talks about how MontyCloud DAY2™ works with AWS Systems Manager to enable one-click shell access to EC2 Windows and Linux VMs.
S01:E02 - One Click Remote Access To AWS EC2 VMs with MontyCloud DAY2™ - Podcast Transcript
Rick: In an earlier episode of OpsTalk we talked about task automation.
As CloudOps teams chase automation, one sobering thought stands out. Automation works for predictable, routine tasks. But you can’t predict every single scenario, and therefore you can’t automate everything. So how do you deal with such unforeseen tasks on the fly?
That is the question we address today.
SysAdmins and DevOps engineers frequently need shell-level access to their EC2 instances precisely for such on-the-fly tasks. These can be routine maintenance or, more often than not, serious break-glass interventions. Getting shell-level access to servers remotely can be surprisingly complicated, cumbersome, and error-prone, sometimes exposing security holes in your infrastructure.
Thank you for joining this session of OpsTalk podcast. I am Rick Hebly, and with me – is Luke Walker, head of Product Management at MontyCloud. Today, we discuss why it is worth the effort to simplify the process of getting shell access with one click.
Rick: Luke, welcome to OpsTalk, thank you for joining.
Luke: Thanks Rick, great to be here.
Rick: This may sound a bit naive, but haven’t compute infrastructure, operating systems, and management tools evolved enough that you don’t need shell access to compute nodes anymore? Why do IT admins and DevOps engineers still require shell-level access to their virtual machines, even as they use immutable infrastructure?
Luke: Well Rick, organizations are starting their journey toward more deterministic builds: environments where systems are built and deployed to a standard and configuration management is automated and catered for. Even so, the ability to log in directly to a target server and look under the hood while it’s in operation remains one of the most powerful tools in a CloudOps engineer’s toolkit.
Even when you’re managing a large server fleet with config management tools, something as simple as an inadvertent change can result in a bad system state and you need to ‘break glass’ to determine what change to the config is necessary.
Rick: Having to break glass is quite the dramatic moment, but it sounds like you still see this occurring frequently, even in the most robust environments?
Luke: More than many realize. One of our customers, LeadSquared, a marketing automation platform powered by AWS, ran into this very scenario not too long ago.
While the application is generally well-managed, they pushed out a change that inadvertently caused the Guest OS to deny access to critical application traffic for about 20 minutes. In most cases you would roll back to the last known good configuration, but in this instance the only way to recover was to get shell access and directly remediate the application and supporting frameworks.
Rick: I just hope this was an isolated incident? In such a scenario, why wouldn’t one immediately write a remediation script to prevent this in the first place?
Luke: It’s every Ops engineer’s dream to write that one single master script that does everything, including brewing coffee, but the reality is LeadSquared did not anticipate that a routine software update would result in the Guest OS denying access. Not every scenario can be automated ahead of time. Well, maybe with the exception of brewing coffee, that has been done! But I think you get the picture here.
Rick: If only my own coffee would just magically appear at my table… but given we’re talking about direct access to a VM, in this age of automation, data collection, and AI, are there reasons why direct access remains an overlooked but important piece of the Ops toolbox?
Luke: Absolutely. While inadvertent config drift is one reason you’d want shell access to a VM, it can also be necessary to directly review application and server logs, fine-tune a configuration, or even just terminate a runaway process. These tasks are not always best suited to automation or remediation scripts, because they still require a human in the decision-making process.
Rick: You’re saying planning for failure still doesn’t mean definitive automation and process; in effect, you’re nowhere without humans holding the right tools. So what makes this so difficult that we’re even talking about it? Surely logging into a server is easier than it was pre-cloud.
Luke: Surprisingly, providing secure, robust remote access into one’s infrastructure can be time-consuming and presents several operational and security risks. As you set up a well-architected AWS application environment with a blueprint, one of the core tenets is security. You need to ensure that you are establishing controls and protecting your systems from attack. But in order to observe and react to incidents like LeadSquared’s, where human interaction is still required, you need secure access into those target EC2 instances.
For most people today, whether you’re in AWS, Azure, or an on-premises environment, that means creating a secure path such as bastion hosts or a VPN, managing SSH keys, or an agent-based management approach. All of these mean more infrastructure, which must equally be secured and well-managed.
The remediation scenarios are clear. We are also working with another customer who runs an environment for data science. One of their top three support tickets is regenerating SSH keys that have been lost or misplaced by users authorized to access instances within the environment.
Rick: If I get you right, that’s really no different than having to replace my house keys every week! A security nightmare. In a more controlled environment, I’m sure that this may be better managed, but especially with the majority of us working remotely in these extraordinary times, how exactly would you go about ensuring secure access?
Luke: Sometimes it just takes a fresh look at the environment to find a better way.
Recently, MontyCloud shipped a feature that extends AWS Systems Manager’s Session Manager service. Quite the mouthful, but by leveraging this service, DAY2™ can open shell access directly into Windows and Linux EC2 instances in a single click. Customers no longer need to build all of the additional security infrastructure that we’ve seen at LeadSquared and other accounts time and time again.
So, this means no bastion hosts, no VPNs, and no more managing SSH keys. It’s a straightforward single click from a web browser, and you’re right into a PowerShell session or a shell prompt.
Rick: If something almost sounds too simple to be true, you have me looking for the catch. I mean, remote access in my mind usually means directly connecting to a host, so how is this different, and how do you even keep this secure?
Luke: The interactive shell sessions are very secure and AWS-native. DAY2™ works in concert with Session Manager: the sessions are handled securely through AWS services and endpoints, through to the Systems Manager agent that’s bundled by default on nearly every AWS image.
You could have an instance deployed with no inbound rules. Because we work with the Systems Manager agent, we can still ensure a secure path to shell access on that host without compromising your infrastructure.
In addition to a secure path, with DAY2™ every session is logged: all the commands you type and the output the operator sees are logged directly into a CloudWatch log trail and retained for 30 days. This means session details are kept off-host from the original server and secured away in case they need to be inspected.
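Under the hood, the logging behavior Luke describes maps onto Session Manager's regional preferences document, which names the CloudWatch log group sessions stream to, and onto the StartSession API. A sketch, where the log group name is an illustrative assumption and the 30-day retention would be set on the log group itself:

```python
import json

# Sketch of Session Manager-level session logging and session start.
# The log group name is assumed for illustration.

def session_preferences(log_group="/day2/session-logs"):
    """Content for the SSM-SessionManagerRunShell preferences document,
    which tells Session Manager to stream session transcripts to
    CloudWatch Logs."""
    return json.dumps({
        "schemaVersion": "1.0",
        "description": "Session Manager regional settings",
        "sessionType": "Standard_Stream",
        "inputs": {
            "cloudWatchLogGroupName": log_group,
            "cloudWatchEncryptionEnabled": True,
        },
    })

def open_shell(instance_id):
    """Start an interactive session against an SSM-managed instance.
    No SSH keys, bastions, or inbound rules: the agent polls outbound."""
    import boto3
    ssm = boto3.client("ssm")
    return ssm.start_session(Target=instance_id)
```

The session transcript therefore never depends on the target host for retention, which is the off-host audit property Luke is pointing at.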
Rick: So all of the access with none of the tradeoffs, you still get to keep that powerful tool in your toolbox, and you get to make your infrastructure more secure by not having to build any special or crazy backdoors.
Luke: Correct. CloudOps engineers want to avoid creating complicated environments (I certainly do) and avoid being forced into making trade-offs between responsiveness and security.
Rick: Given that you are leveraging AWS services here, what does DAY2™ add to Sessions Manager? Can I not directly use Sessions Manager to do the same thing?
Luke: That is a fantastic question, Rick. There are two parts to the puzzle of gaining secure access: configuration, and actually gaining access. Where DAY2™ shines is in simplifying the process.
Seeing is believing. I’ll just show you.
DAY2™ makes the entire process of configuring Systems Manager and activating the agent present within your EC2 instances amazingly simple. A single task is all it takes, whether it’s 1 instance, 10, or 1,000 instances. DAY2™ will coordinate with AWS to ensure the necessary roles and agent activation requests are submitted and processed accordingly.
Once your hosts are under Systems Manager and DAY2™ management, you can view your entire server fleet, across all accounts and all regions, from a single pane, and gain shell access directly into a Windows or Linux host with a single click. Within our Infrastructure page on the portal, there’s a Remote Console option directly on the interface, next to each of the servers you can see here. In a single click, up pops the shell dialog, and you’re in.
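The single-pane fleet view rests on Systems Manager's own inventory of managed instances. A minimal sketch of the underlying check (which instances currently answer through the SSM agent), assuming only the standard DescribeInstanceInformation fields:

```python
# Sketch of the fleet-visibility check behind a single-pane view:
# which instances are reachable through the Systems Manager agent.

def partition_by_ping(instance_info):
    """Split SSM instance records into online and unreachable sets,
    based on the PingStatus field each record carries."""
    online = {rec["InstanceId"] for rec in instance_info
              if rec["PingStatus"] == "Online"}
    offline = {rec["InstanceId"] for rec in instance_info} - online
    return online, offline

def fleet_status(region="us-east-1"):
    """Collect all managed-instance records in a region and partition them."""
    import boto3
    ssm = boto3.client("ssm", region_name=region)
    records = []
    for page in ssm.get_paginator("describe_instance_information").paginate():
        records.extend(page["InstanceInformationList"])
    return partition_by_ping(records)
```

Running this per account and region, and merging the results, is roughly what a cross-account single pane has to do on your behalf.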
Rick: That’s really amazing to see in action as you talk through it. You said this Remote Console functionality is GA, right? Can I test it myself?
Luke: This is base functionality for DAY2™. You can sign up for a free trial today via our website, montycloud.com, or if you’re already a customer, just log in with your account; on the Infrastructure pane you’ll see the Remote Console option for any managed instance discovered within your attached cloud accounts.
Rick: Thanks, Luke. It’s worth mentioning there are more details published on our OpsTalk blog at montycloud.com/blog, and I’d certainly recommend trying out this feature.
Rick: This concludes today’s OpsTalk. You can find more details on one-click shell access and other DAY2™ automations at montycloud.com. Stay tuned for our next OpsTalk. Luke will be joining me again to talk about application cost management.
Subscribe to get the reminder. Thank you for joining us today!