Overcoming Challenges for MLOps in Distributed AI

As we saw in our previous blog posts, running MLOps in a Distributed AI environment has its own challenges. Some of these challenges are inherent to the distributed nature of this implementation, while others manifest due to the specific requirements of running MLOps for Distributed AI.

Below is a list of these challenges and how Aikaan’s various features can help overcome them:

Remote Devices

When Distributed AI models run across thousands of devices spread over a large geographical area, connectivity can be an issue. This matters most during key phases of Distributed AI, such as model training. Frequent connection drops, high latency, and long round-trip times can all affect the training phase of an AI model. As seen in previous posts, the architecture of Distributed AI already addresses this in part through techniques such as model pruning and quantization. Another way to overcome this is to have a dedicated backbone to these devices, but this is not feasible in all situations.

Aikaan provides tools for such scenarios. Aikaan’s Network Monitoring feature helps MLOps engineers keep an eye on connectivity issues across the entire Distributed AI network. On logging in to the Aikaan Controller, you can see the list of devices and their connectivity status. The screenshot below shows this list; the “Status” field gives a bird’s-eye view of the connectivity status of each device connected to the platform. If a device loses connectivity, its status immediately changes to a red ‘X’ mark. This way, the MLOps engineer can look out for devices with frequent or lengthy connectivity issues and address them straightaway.
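To make "frequent or lengthy connectivity issues" concrete, here is a minimal Python sketch of how such devices could be picked out of a log of connectivity outages. It does not use Aikaan's API; the `Outage` record, the `flag_unreliable` function, and the default thresholds are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Outage:
    """One loss of connectivity for one device (hypothetical record format)."""
    device_id: str
    start: float  # seconds since epoch when the connection dropped
    end: float    # seconds since epoch when the connection came back


def flag_unreliable(outages: Iterable[Outage],
                    max_drops: int = 3,
                    max_outage_seconds: float = 300.0) -> List[str]:
    """Return sorted ids of devices that dropped too often or for too long."""
    drops: Dict[str, int] = {}
    longest: Dict[str, float] = {}
    for o in outages:
        drops[o.device_id] = drops.get(o.device_id, 0) + 1
        duration = o.end - o.start
        longest[o.device_id] = max(longest.get(o.device_id, 0.0), duration)
    return sorted(
        dev for dev in drops
        if drops[dev] > max_drops or longest[dev] > max_outage_seconds
    )
```

The thresholds are assumptions and would need tuning per deployment; a fleet on cellular links will tolerate more short drops than one on wired connections.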

System Monitoring

Another challenge of running ML models in distributed environments is device health. Each of the thousands of devices needs to be monitored constantly for anomalies such as irregular CPU usage, abnormal device temperature, and OS- or application-level crashes. For an MLOps engineer, managing a huge array of devices is easier with Aikaan’s Reports and Events features.

You can access Reports and Events from the main page of the Aikaan Controller. As seen in the screenshot below, with a few clicks the MLOps engineer can generate reports to monitor CPU usage, device temperature, device uptime, etc.

The same is true for the Events tab as well. The Ops engineer can configure Events to trigger when a specific set of conditions is met. Furthermore, all generated events can be forwarded to an email address, a Slack ID, or a WhatsApp account.

These two features help the MLOps engineer constantly monitor the system and take the necessary actions when needed.
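The idea of an event that triggers "when a specific set of conditions is met" can be sketched in a few lines of Python. This is an illustration only, not Aikaan's event engine: the `Condition` and `EventRule` types, the metric names, and the `triggered` function are assumptions made for the example.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# Comparison operators a rule may use (an assumed, minimal set).
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


@dataclass
class Condition:
    metric: str       # e.g. "cpu_percent" (hypothetical metric name)
    op: str           # one of the keys of OPS
    threshold: float


@dataclass
class EventRule:
    name: str
    conditions: List[Condition]  # every condition must hold for the rule to fire


def triggered(rule: EventRule, sample: Dict[str, float]) -> bool:
    """True when every condition holds; a missing metric never matches."""
    for cond in rule.conditions:
        if cond.metric not in sample:
            return False
        if not OPS[cond.op](sample[cond.metric], cond.threshold):
            return False
    return True
```

For example, a rule built as `EventRule("overheat", [Condition("cpu_percent", ">", 90), Condition("temp_c", ">=", 80)])` fires only for samples where both readings cross their thresholds. Forwarding the resulting event to email or chat would be a separate step that is not shown here.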


System Security

As with any distributed architecture, security is paramount here as well. Apart from securing the physical aspects of the remote devices with access protection, network firewalls, etc., it’s critical to make sure the OS-level security patches are up to date. Managing such security updates for thousands of devices across a wide geographical area can be challenging.

Aikaan provides the tools an MLOps engineer needs to push security patches to multiple devices via OTA (Over The Air) updates. The ‘Device Profile’ parameter helps the engineer group devices by parameters such as OS version, hardware version, and security patch version. Next, the binary that needs to be pushed can be targeted at each device profile. This is done by generating a different artifact for each of the device profiles defined.
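The grouping and targeting described above can be pictured with a small Python sketch. It is a simplified model, not Aikaan's implementation: the device dictionaries, the `PROFILE_KEYS` fields, and the `group_by_profile` and `plan_rollout` functions are hypothetical and exist only to show the shape of the idea.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

# Fields that make up a device profile in this sketch (assumed names).
PROFILE_KEYS = ("os_version", "hw_version", "patch_level")

Profile = Tuple[str, ...]


def group_by_profile(devices: Sequence[dict],
                     keys: Sequence[str] = PROFILE_KEYS) -> Dict[Profile, List[str]]:
    """Group device ids by the tuple of profile fields they share."""
    groups: Dict[Profile, List[str]] = defaultdict(list)
    for dev in devices:
        profile = tuple(dev[k] for k in keys)
        groups[profile].append(dev["id"])
    return dict(groups)


def plan_rollout(devices: Sequence[dict],
                 artifacts_by_profile: Dict[Profile, str],
                 keys: Sequence[str] = PROFILE_KEYS) -> Dict[Optional[str], List[str]]:
    """Map each artifact to the device ids it targets.

    Devices whose profile has no artifact are collected under the key None,
    so they are visible rather than silently skipped.
    """
    plan: Dict[Optional[str], List[str]] = {}
    for profile, ids in group_by_profile(devices, keys).items():
        artifact = artifacts_by_profile.get(profile)
        plan.setdefault(artifact, []).extend(ids)
    return plan
```

Keeping unmatched devices under `None` is a design choice of the sketch: it makes gaps in artifact coverage easy to spot before a rollout starts.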

The next step is to deploy these artifacts onto thousands of devices across the network. Aikaan manages these deployments internally, retrying each deployment multiple times and re-initiating failed deployments. Aikaan also provides live logging of each deployment for easy monitoring and debugging.
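As a rough illustration of what "multiple retries along with re-initiation of failed deployments" means in practice, the following Python sketch retries a deployment across a list of devices. How Aikaan does this internally is not described in this post, so everything here is an assumption: the `deploy` callable stands in for whatever actually pushes the artifact, and `deploy_with_retries` and `max_attempts` are hypothetical names.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("ota_sketch")


def deploy_with_retries(device_ids: Iterable[str],
                        deploy: Callable[[str], None],
                        max_attempts: int = 3) -> Tuple[List[str], List[str]]:
    """Try deploy(device_id) up to max_attempts times per device.

    deploy is expected to raise an exception on failure.
    Returns (succeeded, failed) lists of device ids.
    """
    pending = list(device_ids)
    succeeded: List[str] = []
    for attempt in range(1, max_attempts + 1):
        still_failing: List[str] = []
        for dev in pending:
            try:
                deploy(dev)
                succeeded.append(dev)
            except Exception as exc:
                # Log each failure so the rollout can be followed live.
                logger.warning("attempt %d: %s failed: %s", attempt, dev, exc)
                still_failing.append(dev)
        pending = still_failing  # only failed devices are re-initiated
        if not pending:
            break
    return succeeded, pending
```

A production rollout would usually add a delay or backoff between rounds and run deployments in parallel; both are left out to keep the sketch short.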

With the help of Remote SSH sessions, the engineer can also check the status of these deployments from a shell terminal. The same sessions can be used to check the current patch level, OS version, etc. Below is a screenshot of a shell terminal opened from within the Aikaan Controller:

Below is the OTA Upgrade section of the Aikaan Controller, which gives a list of artifacts ready to be deployed.

With the help of the flexible OTA Upgrade feature on Aikaan, the MLOps engineer can make sure the security and integrity of the whole distributed network are maintained at all times.

System Uptime

One of the key performance indicators (KPIs) for an MLOps engineer is system uptime, with the goal of working towards 100% uptime. In a distributed environment, this is quite challenging. However, with the tools mentioned above and the various reports that can be generated, the engineer’s task of minimizing system failures and eliminating anomalous behavior becomes more efficient and manageable.
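For reference, uptime as a KPI is usually expressed as the percentage of a reporting window during which a device was available. The Python sketch below computes it per device and averaged across a fleet. It is a generic calculation, not Aikaan's reporting logic; the function names are hypothetical, and outages are assumed to be non-overlapping `(start, end)` pairs in seconds.

```python
from typing import Dict, Iterable, Tuple

Interval = Tuple[float, float]


def uptime_percent(window_start: float, window_end: float,
                   outages: Iterable[Interval]) -> float:
    """Percentage of the window during which the device was up.

    Outages are clipped to the window and assumed not to overlap each other.
    """
    total = window_end - window_start
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in outages:
        start = max(start, window_start)
        end = min(end, window_end)
        if end > start:
            down += end - start
    return 100.0 * (total - down) / total


def fleet_uptime(window_start: float, window_end: float,
                 outages_by_device: Dict[str, Iterable[Interval]]
                 ) -> Tuple[Dict[str, float], float]:
    """Return per-device uptime and the unweighted fleet average."""
    if not outages_by_device:
        raise ValueError("no devices given")
    per_device = {
        dev: uptime_percent(window_start, window_end, outages)
        for dev, outages in outages_by_device.items()
    }
    overall = sum(per_device.values()) / len(per_device)
    return per_device, overall
```

The unweighted average treats every device equally; weighting by device importance or traffic would be a reasonable alternative depending on what the KPI is meant to capture.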

Aikaan gives a holistic view of the whole system, as well as the uptime of each individual device, from the Controller dashboard. Below are screenshots of the Controller and the individual device uptime:

With the help of the above tools, the MLOps engineer can navigate the complex architecture of a Distributed AI setup. Aikaan provides the tools and features needed to overcome these challenges and make the day-to-day operations of such a network easily manageable.

To know more about Aikaan and its various features, visit www.aikaan.io

You can also open a free account and explore the platform by signing up at https://experience.aikaan.io/signup