Many companies, mine included, rely heavily on AWS infrastructure as a service (IaaS). Sometimes we need to perform a potentially risky operation on an EC2 instance. As long as we are not working with immutable infrastructure, it is imperative to be prepared for an instant rollback.
One solution is a custom script that duplicates the instance, but in modern environments, where standardization is essential, it is wiser to use widely known software than to invent yet another script.
Here comes Ansible!
Ansible is a simple automation tool. It handles configuration management, application deployment, cloud provisioning, ad-hoc task execution, network automation, and multi-node orchestration. It is marketed as a tool for making complex changes such as zero-downtime rolling patching, so a straightforward snapshotting task is well within its reach.
Requirements
For this example we only need Ansible. In my case it was version 2.9; later releases introduce a major change, collections, so let's stick with this one for simplicity.
Since we are working with AWS, we also need a minimal set of permissions, which allow us to:
- create EBS snapshots
- register images (AMI)
- start and stop EC2 instances
Environment preparation
Since I am forced to work on Windows, I have used a Vagrant virtual machine. The Vagrantfile is shown below.
It launches a virtual machine with CentOS 7 and Ansible installed.
For security reasons Ansible, by default, ignores an ansible.cfg located in a world-writable directory, which a mounted Vagrant share usually is. Therefore we have to point to /vagrant/ansible.cfg explicitly through the ANSIBLE_CONFIG environment variable.
Listing 1. Vagrantfile for our research
    Vagrant.configure("2") do |config|
      config.vm.box = "geerlingguy/centos7"
      config.vm.hostname = "awx"
      config.vm.provider "virtualbox" do |vb|
        vb.name = "AWX"
        vb.memory = "2048"
        vb.cpus = 3
      end
      config.vm.provision "shell", inline: "yum install -y git python3-pip"
      config.vm.provision "shell", inline: "pip3 install ansible==2.9.10"
      config.vm.provision "shell", inline: "echo 'export ANSIBLE_CONFIG=/vagrant/ansible.cfg' >> /home/vagrant/.bashrc"
    end
First tasks
In the first lines of the playbook we specify a few play-level keywords. Of these, hosts is mandatory, tasks holds the actual work, and name, although optional, is strongly recommended.
The others provide auxiliary functions.
Listing 2. duplicate_ec2.yml playbook first lines
    ---
    - name: yolo
      hosts: localhost
      connection: local
      gather_facts: false
      become: false
      vars:
        instance_id: i-deadbeef007
      tasks:
        - name: Getting minimal set of facts for datetime
          setup:
            gather_subset: 'min'
        - set_fact:
            current_datetime: "{{ ansible_date_time.iso8601 }}"
        - name: Install required pip packages
          become: yes
          pip:
            name:
              - boto3
              - boto
From the top: we assign a name that tells what this playbook is about.
Since the playbook only talks to the AWS API, we limit execution to localhost and, to avoid SSH attempts, set the connection type to local, which means no remote connection at all, just direct execution on the control machine.
Next we disable fact gathering to speed up execution. The become keyword determines whether Ansible should escalate privileges (e.g. with sudo). Since we do not need that, it is good practice to disable privilege escalation.
The vars section defines variables usable throughout the entire playbook. Currently we have only one, the instance id.
Finally, the actual work begins in the tasks section.
First we need a datetime value, so a minimal set of facts has to be collected. Possible values of gather_subset are:
- all: virtually all facts, the default set when gather_facts is left enabled in the play,
- min: a greatly reduced set that does not require digging into the system setup,
- hardware,
- network,
- virtual,
- ohai and facter: two common external fact providers; read more about ohai in the Chef documentation and about facter in the Puppet documentation.
Now we can obtain the datetime and, since it will be used several times across the tasks, we store it as a fact in our second task.
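To see what the min subset provides for dates, a throwaway playbook along these lines can help. It is only a sketch: the play name and the selection of keys to print are illustrative, and the values depend on when and where it runs.

```yaml
---
- name: Inspect the date facts from the minimal subset
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Collect only the minimal fact subset
      setup:
        gather_subset: 'min'

    - name: Show a few formats offered by ansible_date_time
      debug:
        msg:
          iso8601: "{{ ansible_date_time.iso8601 }}"
          date: "{{ ansible_date_time.date }}"
          epoch: "{{ ansible_date_time.epoch }}"
```

Note that the iso8601 value contains colons, which becomes relevant later when the timestamp is reused in an AMI name.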
Finally, the AWS modules require the boto and boto3 Python packages, so we ensure they are present by running the pip module.
Please notice that a task can override play-level values: in this example we escalate privileges for this single task by setting become: yes.
Authentication
AWS authentication can take different forms: a simple one, just an access key and a secret key, or a more advanced one that requires assuming roles.
Moreover, we might be forced to use multi-factor authentication (MFA), which complicates authentication further.
To handle this we need to provide: the access key, the secret key, the MFA device serial number and the current MFA code.
Since the vault-encrypted secrets are quite long, they have been truncated in all listings to save space.
Listing 3. duplicate_ec2.yml enhanced with authentication elements
    (...)
      vars:
        instance_id: i-deadbeef007
        aws_credentials:
          aws_region: eu-west-2
          aws_access_key: !vault |
            $ANSIBLE_VAULT;1.1;AES256
            34613664316337623136383935636262353361643736666432666331623563636333363431626134
            6533636363383231...
          aws_secret_key: !vault |
            $ANSIBLE_VAULT;1.1;AES256
            39386565376137663934333734316236346232643838623530386538303561393730373662626238
            6337636363353938663664...
          mfa_serial_number: !vault |
            $ANSIBLE_VAULT;1.1;AES256
            66306332383534343338373532633930373536663638303439633837613832643966303236396562
            393861366333333562336231666...
      tasks:
    (...)
        - pause:
            prompt: "Please enter Your MFA code: "
            echo: yes
          register: mfa_code
        - sts_assume_role:
            role_arn: "arn:aws:iam::807777736438:role/User"
            aws_region: "{{ aws_credentials.aws_region }}"
            aws_access_key: "{{ aws_credentials.aws_access_key }}"
            aws_secret_key: "{{ aws_credentials.aws_secret_key }}"
            mfa_serial_number: "{{ aws_credentials.mfa_serial_number }}"
            mfa_token: "{{ mfa_code.user_input }}"
            role_session_name: "Snapshotting-{{ instance_id }}"
          register: assumed_role
        - set_fact:
            aws_secrets: &aws_secrets
              aws_access_key: "{{ assumed_role.sts_creds.access_key }}"
              aws_secret_key: "{{ assumed_role.sts_creds.secret_key }}"
              security_token: "{{ assumed_role.sts_creds.session_token }}"
              region: "{{ aws_credentials.aws_region }}"
The first thing that catches the eye is the big blocks of digits. These are ansible-vault encrypted strings. We create them by calling `ansible-vault encrypt_string` and typing in the value to protect.
Please be aware that Ctrl+D must not be preceded by the Enter key, otherwise the newline character will become part of the secret's value! Without a trailing newline, the terminal may need Ctrl+D pressed twice.
Listing 4. ansible-vault usage example
    [vagrant@awx ~]$ ansible-vault encrypt_string
    New Vault password:
    Confirm New Vault password:
    Reading plaintext input from stdin. (ctrl-d to end input)
    This is the secret content!vault |
              $ANSIBLE_VAULT;1.1;AES256
              34363966326337613933623331306331613939303661303530613466613036346336613032333637
              3632313064336133383036396266633761643664656664620a626165343439393832643236613438
              32623339396130323531643862366532623434343931613165633931663739353065396234313034
              6165373262393764610a373763623865356131383133316638333635616665313463343563646564
              35353161613039303437383135383165393661343132623133663231653035376338
    Encryption successful
Now we need to obtain from the user the final element of authentication: the MFA code.
Since the code is valid only for a short time (typically well under a minute), it has to be provided as late as possible. Therefore we prompt for it with the `pause` module AFTER fact gathering and package installation, just before it is needed. We register the result of this module in the variable mfa_code for later reuse.
With the `sts_assume_role` module we can finally assume the appropriate role. It requires a bunch of values: some are static, some come from the vars section above and the last one, `{{ mfa_code.user_input }}`, comes from the previous step. The task returns temporary credentials, so its result has to be stored in another variable, assumed_role.
The subsequent step is for our convenience: we assemble the values that will be needed later into one fact.
Moreover, here we use a YAML feature called anchors and aliases.
We can name a block by putting the `&` character and a name in front of it, and later reuse it wherever it is needed.
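The anchor mechanism can be tried in isolation, without any AWS account. The sketch below is an assumption-laden toy: the variable `shared_args` and its message are made up, and the built-in `debug` module stands in for the AWS modules. An alias (`*name`) refers back to the anchor and the merge key `<<` copies the anchored keys into another mapping.

```yaml
---
- name: YAML anchor and merge key demo
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Hypothetical shared arguments; "&shared_args" anchors this mapping.
    shared_args: &shared_args
      msg: "These arguments were written only once"
  tasks:
    - name: First reuse of the anchored mapping
      debug:
        <<: *shared_args

    - name: Second reuse, with an extra key next to the merged ones
      debug:
        <<: *shared_args
        verbosity: 0
```

Anchors are resolved by the YAML parser when the file is loaded, so an alias only works within the same file and after the anchor has been defined.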
Obtaining instance details
Since we want to clone the instance, we need to gather some information about it.
With Ansible this is just a call to the ec2_instance_info module (warning: this module has been renamed about three times within the last two years).
The mandatory parameters are aws_access_key, aws_secret_key, security_token and region.
Luckily we have all of them under the YAML anchor `aws_secrets`. An anchored block is referenced with the `*` character before its name (an alias). To insert its keys into another mapping, we type the merge key `<<` and refer to the block with the asterisk. Voila! No need to retype all the information line by line. Of course we also need to specify which instance we want.
For this we use the filters keyword and provide the values to search by.
Listing 5. Collecting instance data
    - name: Get data about ec2 instance {{ instance_id }}
      ec2_instance_info:
        <<: *aws_secrets
        filters:
          instance-id: "{{ instance_id }}"
      register: instance_facts
    - set_fact:
        instance_data: "{{ instance_facts.instances[0] }}"
    - debug:
        msg: "{{ instance_data | to_nice_json }}"
Once again, for convenience, we store the interesting data in a new fact. It is easier to type `instance_data` than `instance_facts.instances[0]`.
Finally we print the details to the screen.
For better readability I recommend the `to_nice_json` filter when printing JSON. In addition, in ansible.cfg we can set stdout_callback to debug, which pretty-prints our errors.
Listing 6. ansible.cfg content
    [defaults]
    # This will make our output look much better.
    # (Kept on its own line: Ansible accepts only ';' for inline comments.)
    stdout_callback = debug
Creating snapshots and AMI
To make snapshots we call the `ec2_snapshot` module for each volume attached to the original instance.
Some time ago the Ansible team introduced the `loop` keyword, while in older playbooks we find `with_items`, `with_list`, etc., actually anything matching `with_*`.
`loop` has a unified interface and, with the help of filters, can fulfill the role of any `with_*`.
The most basic loop takes a list, e.g. of strings or dicts, and assigns each value in turn to the variable `item`.
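As a small illustration of the difference, the following sketch iterates over made-up sample data (the `devices` and `sizes` variables are hypothetical) first in the old style and then with `loop`. The last task shows how a filter, here `dict2items`, lets `loop` take over the job of `with_dict`.

```yaml
---
- name: loop compared with with_* keywords
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Hypothetical sample data
    devices:
      - /dev/sda1
      - /dev/sdb
    sizes:
      /dev/sda1: 8
      /dev/sdb: 100
  tasks:
    - name: Old style iteration over a list
      debug:
        msg: "device {{ item }}"
      with_items: "{{ devices }}"

    - name: The same iteration with loop
      debug:
        msg: "device {{ item }}"
      loop: "{{ devices }}"

    - name: Replacement for with_dict using the dict2items filter
      debug:
        msg: "{{ item.key }} has {{ item.value }} GiB"
      loop: "{{ sizes | dict2items }}"
```

For a flat list the first two tasks behave identically; they differ only for nested lists, which `with_items` flattens by one level and `loop` does not.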
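Also notable in this task is the use of asynchronous execution through two keywords: async, the maximum time the task is allowed to run, and poll, the interval between checks for completion. Both are given in seconds. Note that with a positive poll value Ansible still waits for each loop item to finish before starting the next one.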
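If truly parallel execution of the loop items is wanted, a common variation is to set poll to 0 and collect the results afterwards with the `async_status` module. The sketch below is not what this playbook does; it uses `sleep` commands as stand-ins for slow jobs, and the timeouts and retry counts are arbitrary assumptions.

```yaml
---
- name: Running loop items in parallel with async
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Start three slow jobs without waiting for them
      command: "sleep {{ item }}"
      loop: [3, 4, 5]
      async: 60     # each job may run for at most 60 seconds
      poll: 0       # do not wait, move on to the next item immediately
      register: started_jobs

    - name: Wait until every job has finished
      async_status:
        jid: "{{ item.ansible_job_id }}"
      loop: "{{ started_jobs.results }}"
      register: job_result
      until: job_result.finished
      retries: 20   # give up after 20 checks per job
      delay: 2      # seconds between checks
```

With this pattern the registered results of interest live in `job_result.results`, not in the variable registered by the starting task.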
Listing 7. Tasks for creating snapshots
    - name: Create snapshot of volumes
      ec2_snapshot:
        instance_id: "{{ instance_id }}"
        device_name: "{{ item.device_name }}"
        <<: *aws_secrets
        snapshot_tags:
          Name: "{{ instance_data.tags.Name | default(instance_id, true) }}-{{ item.device_name }}"
          id_instance: "{{ instance_id }}"
          volume: "{{ item.device_name }}"
          date_created: "{{ current_datetime }}"
      loop: "{{ instance_data.block_device_mappings }}"
      register: snapshots_list
      async: 1200
      poll: 5
    - debug:
        msg: "{{ snapshots_list.results | to_nice_json }}"
    - name: Snapshot data modification
      set_fact:
        # the most ugly piece of code I ever wrote in Ansible
        snapshots_2_reuse: "{{ snapshots_2_reuse | default([]) + [ { 'device_name': item.item.device_name, 'snapshot': item.snapshot_id } ] }}"
      loop: "{{ snapshots_list.results }}"
    - debug:
        msg: "{{ snapshots_2_reuse | to_nice_json }}"
    - name: Create AMI from snapshot
      ec2_ami:
        <<: *aws_secrets
        name: "{{ instance_data.tags.Name | default(instance_id, true) | replace(' ','_') }}-{{ current_datetime | replace(':','-') }}"
        root_device_name: "{{ snapshots_2_reuse[0].device_name }}"
        device_mapping:
          - device_name: "{{ snapshots_2_reuse[0].device_name }}"
            snapshot_id: "{{ snapshots_2_reuse[0].snapshot }}"
            delete_on_termination: true
      register: created_ami
    - debug:
        msg: "{{ created_ami | to_nice_json }}"
Afterwards we need to reshape the output about the created snapshots, since we need only a subset of the data in the form of dictionaries.
Let's break this expression into pieces for better understanding:
`snapshots_2_reuse: "{{ snapshots_2_reuse | default([]) + …`: first we take the current value of the variable `snapshots_2_reuse` and, if it is undefined, substitute an empty list.
`… + [ { 'device_name': item.item.device_name, 'snapshot': item.snapshot_id } ] }}"`:
next we append to the list a dynamically created map with correctly named keys.
Finally the resulting object is assigned to the variable `snapshots_2_reuse` and we proceed to the next element of the original list.
Please note that `item.item.device_name` is correct: we iterate with the variable `item`, but each registered result has its own sub-element named `item` that holds the loop value of the previous task. See the output of the previous debug task.
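The accumulation pattern is easier to follow on a small, self-contained example. In this sketch `fake_results` is a hypothetical stand-in for `snapshots_list.results`, reduced to the two fields the expression reads; the device names and snapshot ids are invented.

```yaml
---
- name: Accumulating a list with set_fact in a loop
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Hypothetical stand-in for the registered results of a looped task
    fake_results:
      - item: { device_name: /dev/sda1 }
        snapshot_id: snap-aaa
      - item: { device_name: /dev/sdb }
        snapshot_id: snap-bbb
  tasks:
    - name: Append one small dict per iteration
      set_fact:
        # On the first pass "reduced" is undefined, so default([]) supplies
        # an empty list; every later pass extends the previous value.
        reduced: "{{ reduced | default([]) + [ { 'device_name': item.item.device_name, 'snapshot': item.snapshot_id } ] }}"
      loop: "{{ fake_results }}"

    - name: Show the reshaped list
      debug:
        var: reduced
```

The final debug task prints a list of two dictionaries, each holding only the `device_name` and `snapshot` keys.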
Because it is impossible to launch an instance directly from a snapshot, we need an AMI based on the root device snapshot.
Note the extensive use of filters in building the AMI name.
We try to use the value of the `Name` tag; if it is undefined or empty (hence `true` as the second parameter of the filter), we replace it with `instance_id`, which is always present.
We also need to make sure there are no spaces, which are common in `Name`.
Furthermore, we need to get rid of the colons in the datetime, since they cannot appear in an AMI name either.
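The filter chain can be checked offline with a few sample tag sets. Everything in the vars section below is invented for the demonstration (the tag values and the fixed timestamp); only the expression itself mirrors the one used for the AMI name.

```yaml
---
- name: Building a safe AMI name with filters
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    instance_id: i-deadbeef007
    stamp: "2020-07-01T12:34:56Z"     # hypothetical fixed timestamp
    tag_samples:
      - { Name: "web server 01" }     # yields web_server_01-2020-07-01T12-34-56Z
      - { Name: "" }                  # yields i-deadbeef007-2020-07-01T12-34-56Z
      - {}                            # yields i-deadbeef007-2020-07-01T12-34-56Z
  tasks:
    - name: Apply default and replace filters to every sample
      debug:
        msg: "{{ item.Name | default(instance_id, true) | replace(' ', '_') }}-{{ stamp | replace(':', '-') }}"
      loop: "{{ tag_samples }}"
```

Without `true` as the second argument of `default`, the empty `Name` in the second sample would be kept as is and the resulting name would start with a dash.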
Creating an instance with fallback
Next in our journey through the playbook we encounter a block/rescue construction.
It is the Ansible version of exception handling.
If any task in the `block` section fails, the `rescue` tasks below are executed.
In this particular example the block has two tasks: stop the original instance and launch the new one.
If either of them fails, Ansible immediately proceeds to print a warning message and start the original instance. If it was the stop task that failed, the instance is still running, so the rescue task simply reports OK.
Of course blocks can be nested, e.g. the rescue section can contain another block/rescue construction.
It should be noted that the play continues if the rescue section completes successfully, as it 'erases' the error status (but not the reporting). This means it will not trigger the max_fail_percentage or any_errors_fatal settings, but the failure will still appear in the playbook statistics.
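The control flow can be observed safely with a simulated failure. This is a minimal sketch that uses only the built-in `fail` and `debug` modules; the task names and messages are made up, and the optional `always` section is an addition that the playbook in this article does not use.

```yaml
---
- name: block / rescue control flow demo
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Risky section
      block:
        - name: Simulate a failing step
          fail:
            msg: "Simulated failure"

        - name: Never reached
          debug:
            msg: "Skipped, because the previous task failed"
      rescue:
        - name: Report which task failed
          debug:
            msg: "Recovered from a failure in '{{ ansible_failed_task.name }}'"
      always:
        - name: Runs whether or not the block failed
          debug:
            msg: "Cleanup would go here"

    - name: The play continues after a successful rescue
      debug:
        msg: "Still running"
```

In the final recap the run is expected to count the failure as rescued, while the play itself finishes successfully.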
Listing 8. Instance creation
    - name: Stop original instance and start new one
      block:
        - name: Stop original running instance
          ec2:
            <<: *aws_secrets
            state: stopped
            instance_id: "{{ instance_id }}"
        - name: Launch new instance
          ec2:
            <<: *aws_secrets
            key_name: "{{ instance_data.key_name }}"
            group_id: "{{ instance_data.network_interfaces[0].groups | map(attribute='group_id') | list }}"
            instance_type: "{{ instance_data.instance_type }}"
            image: "{{ created_ami.image_id }}"
            wait: yes
            wait_timeout: 600
            instance_tags: "{{ instance_data.tags }}"
            volumes: "{{ snapshots_2_reuse[1:] }}"
            vpc_subnet_id: "{{ instance_data.subnet_id }}"
      rescue:
        - debug:
            msg: "We've caught an error during new instance creation. Reverting by restoring previous instance"
        - name: Starting back the original running instance
          ec2:
            <<: *aws_secrets
            state: running
            instance_id: "{{ instance_id }}"
Summary
In this demo we walked through an Ansible playbook that logs into AWS, creates snapshots of the volumes of a given EC2 instance and launches a new instance based on the original one.
We also presented a few tips and tricks that improve the performance, stability and readability of playbooks.
This article shows that Ansible, thanks to its simplicity, can outshine Chef or Puppet, which could be overkill for such a simple task. In other words: Ansible just keeps it simple.