Matillion is a versatile ELT tool. It has many connectors to numerous data sources already at our disposal, but sometimes a situation arises when your use case is a little bit different than just data extraction—for example, extracting a list of files from a remote Linux/Unix host or executing something on a remote Linux/Unix host. There is nothing more pervasive than SSH for such purposes and where would you run your script or command if not in the mighty Matillion Bash Script component!
In this post, you will learn how to use the Matillion Bash component to connect and run commands on another host over SSH.
The procedure in this post was tested on a Matillion instance hosted in the AWS Cloud. The target host to be contacted was also in the same AWS Cloud account and rather in the same VPC and subnet. The target host was an Ubuntu virtual machine. However, as SSH is ubiquitous, there is no reason why this procedure would not work on any host, including Windows, when certain prerequisites like open access to an SSH port and running an SSH daemon are fulfilled.
Another item to note is that this exercise was done using a private key for the SSH, which is generally more secure than password authentication. Besides, by using a key file, you don’t have to store the password in a Matillion password manager or enable privileged user access on a Bash Script component.
Also, this post assumes knowledge and familiarity in working with the command line, Linux/Unix tools and Matillion.
- SSH access to the Matillion VM with privileged user (root/centos) credentials (How Matillion SSH access can be used has been explained in detail in this blog post by my colleague Chris Hastie)
- The private key for the user of the target host. If the target host is an AWS EC2 instance, AWS itself creates the key.
- The VM must have a public IP or Elastic IP on AWS (even in the same subnet and VPC, the communication does not work within the private IP addresses as AWS has it clamped down)
- Availability of remote connectivity and data/file transfer tools like SSH/Putty and SCP/WinSCP, etc., on your local working machine
- Additional setting for AWS: If the target host is an EC2 VM, and if there is a security group controlling the traffic to the SSH port (22), the Elastic/public IP of Matillion VM must be included in the inbound traffic rules. This will ensure the Matillion VM can reach the SSH port of the target host.
The Step-by-Step Process
The actual procedure is quite simple once you have all the things required (or checked all the things required). Simply start by connecting to your Matillion instance with ssh utility. The default user to connect to the Matillion instance is centos unless it was changed during the time of VM setup.
Once logged in, create a folder on the Matillion VM to store the key files. While running any commands on Matillion VM using the centos user, sudo qualifier is required to force access elevation:
sudo mkdir /ssh_keys
Using SFTP, transfer the key file (.pem file) for the remote host to the Matillion instance. On Windows, a utility like WinSCP can also be used. Move the file inside the folder.
As Matillion ELT processes are owned by the tomcat user in the VM, it is also a good idea to transfer the ownership of the folder to the tomcat user. Change the ownership of the folder recursively (i.e. also for the constituent files) to tomcat:tomcat (i.e. both user and group tomcat):
sudo chown tomcat:tomcat -R /ssh_keys/
Both the keys folder and the keys inside need to be locked down. Otherwise, using the key to connect to the remote host would yield an error like this:
Thus, change the mode of the directory to 755 and change the mode of the key file to 600:
sudo chmod 755 /ssh_keys/ sudo chmod 600 /ssh_keys/key-file-to-target-server.pem
Chris also discusses best practices and shares more details about storing keys in another blog post.
So, you have your key to the host and you have primed it up. It’s not difficult to predict what happens next. The ssh command, which is used in general to connect to a remote host, will also be used from within the Matillion Bash component. But there is a pitfall to avoid first. The next step is very important.
Before trying to connect directly from the Bash component, change to the tomcat user:
sudo su tomcat
Connect once from the command line using the key file:
cd /ssh_keys ssh -i key-file-to-target-server.pem user@host
The host key fingerprint confirmation prompt is shown. After confirmation, the host key (i.e. the fingerprint of the remote host we’re trying to connect to) will be permanently added to the tomcat’s known_hosts file:
If the step of connecting once from the console is skipped, the Bash component errors out with a permission error, even if everything has been set up correctly. The reason is simply that the remote host entry is not present in the known_hosts file of the tomcat user. The Matillion UI cannot automatically add the host, and the Bash component does not have the ability to report it in the UI. That’s why it fails. Those feeling adventurous can try it out.
Now, connection from the Bash script component will be successful. An example of the Bash component instance is shown below. The user and host details of the target server can be stored in job variables:
An example job using the above technique could look like this:
Here, a file list is obtained from a remote host and then files are staged to S3 using the Data Transfer component.
I hope you found this helpful! Let us know if we can help you troubleshoot or sort through other areas of your data management and strategy.