With AI and Machine Learning growing at a rapid pace, companies are evolving their data infrastructure to benefit from the latest technological developments and stay ahead of the curve.

Making a company’s data infrastructure and operations “AI ready” entails several critical steps and considerations for data and analytics leaders looking to leverage Artificial Intelligence at scale. These range from ensuring that the data processes required to feed these technologies are in place to securing the right set of skills for the job.

Companies therefore usually begin their journey to “AI proficiency” by implementing technologies to streamline the operation (and orchestration) of data teams across their organisation, and by rethinking business strategy: what data do they actually need? This is a natural first step for most organisations, given that Machine Learning and other AI initiatives rely heavily on the availability and quality of input data to produce meaningful and correct outputs. Guaranteeing that the pipelines producing these outputs meet the desired performance and fault tolerance requirements becomes a necessary, but secondary, step.

As a recent O’Reilly Media study showed, more than 60% of organizations plan to spend at least 5% of their IT budget over the next 12 months on Artificial Intelligence.

Figure: Budget allocation for AI over the next year.

Considering that interest in AI continues to grow and companies plan to invest heavily in AI initiatives for the remainder of the year, we can expect a growing number of early-adopter organisations to spend more of their IT budget on foundational data technologies for collecting, cleaning, transforming, storing and making data widely available in the organisation. Such technologies may include platforms for data integration and ETL, data governance and metadata management, amongst others.

Still, the great majority of organisations that set out on this journey already employ teams of data scientists or similarly skilled employees, and leverage the flexibility of cloud infrastructure to explore and build organisation-wide data services platforms. Such platforms ideally support collaboration through multi-tenancy and coordinate multiple services under one roof, democratising data access and manipulation within the organisation. It comes as no surprise that technology behemoths like Uber, Airbnb and Netflix have rolled out their own internal data platforms that empower users by streamlining difficult processes like training and productionising Deep Learning models or reusing Machine Learning models across experiments.

But how do companies step up their infrastructure to become “AI ready”? Are they deploying data science platforms and data infrastructure projects on premises or taking advantage of a hybrid, multi-cloud approach to their infrastructure? As more and more companies embrace the “write once, run anywhere” approach to data infrastructure, we can expect more enterprise developments in a combination of on-prem and cloud environments or even a combination of different cloud services for the same application. In a recent O’Reilly Media survey, more than 85% of respondents stated that they plan on using one (or multiple) of the seven major public cloud providers for their data infrastructure projects, namely AWS, Google Cloud, Microsoft Azure, Oracle, IBM, Alibaba Cloud or other partners.

Figure: Cloud providers used for data infrastructure.

Enterprises across geographies expressed interest in shifting to a cloud data infrastructure as a means of leveraging AI and Machine Learning, with more than 80% of respondents across North America, EMEA and Asia replying that this is their preferred choice. A testament to the growing trend towards hybrid, multi-cloud application development is the finding in the same survey that 1 out of 10 respondents uses all three major cloud providers (Google Cloud Platform, AWS and Microsoft Azure) for some part of their data infrastructure.

Without question, once companies become serious about their AI and Machine Learning efforts, technologies for effectively collecting and processing data at scale become not just a top priority, but a necessity. This is no surprise, given the importance of real-time data for developing, training and serving ML models in the modern enterprise. Continuous processing and real-time data architectures also become key when Machine Learning and other Artificial Intelligence use cases move into production.

This is where Apache Flink comes into play as a first-class open source stream processing engine. Built from the ground up for stream processing, with strong performance characteristics, a highly scalable architecture, and strong consistency and fault tolerance guarantees, Flink is used and battle-tested in some of the largest streaming production deployments in the world, processing massive amounts of real-time data with sub-second latency.
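
As a rough illustration of the kind of continuous computation a stream processor performs, the sketch below counts events per key in fixed-size (tumbling) time windows, a typical building block for real-time features fed to ML models. It is plain Python and does not use Flink's API; the event shape, the function name `tumbling_window_counts` and the assumption that events arrive in timestamp order are all hypothetical simplifications. A real engine would also handle out-of-order data, state checkpointing and parallelism.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in milliseconds, key such as an event type).
Event = Tuple[int, str]


def tumbling_window_counts(
    events: Iterable[Event], window_ms: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts_per_key) each time a window closes.

    Assumes events arrive in non-decreasing timestamp order.
    Windows that contain no events are not emitted.
    """
    current_start = None
    counts: Dict[str, int] = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp to the start of its window.
        start = timestamp - timestamp % window_ms
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The event belongs to a later window: emit the finished one.
            yield current_start, dict(counts)
            counts = defaultdict(int)
            current_start = start
        counts[key] += 1
    # Emit the last, still-open window once the input ends.
    if current_start is not None:
        yield current_start, dict(counts)


if __name__ == "__main__":
    sample = [(0, "play"), (400, "search"), (900, "play"), (1200, "play")]
    for window_start, per_key in tumbling_window_counts(sample, window_ms=1000):
        print(window_start, per_key)
```

Because the function is a generator, results are produced as soon as each window closes rather than after the whole input has been read, which is the essential difference between continuous and batch processing.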

Examples of such large-scale use cases include Netflix, which uses Apache Flink for real-time data processing to build, maintain and serve Machine Learning models that power different parts of the website, including video recommendations, search results ranking and selection of artwork. Another is Google, which uses Apache Flink together with Apache Beam and TensorFlow to develop TensorFlow Extended (TFX), an end-to-end machine learning platform for TensorFlow that powers products across all of Alphabet.

The journey to Artificial Intelligence proficiency might seem daunting at first. Making the right investments and decisions upfront to nurture the right data engineering and analytics infrastructure, thinking cloud-based and considering stream processing as a real-time business enabler will help tech leaders navigate the journey successfully and turn their enterprise into an AI-led organisation of the future.