
Inria
2023-05844 – PhD Position F/M Dynamic Adaptation of Machine Learning and Deep Learning Workflows across the Cloud-Fog-Edge Continuum
Contract type: Fixed-term contract
Level of qualifications required: Graduate degree or equivalent
Function: PhD Position
Level of experience: Recently graduated
About the research centre or Inria department
The Inria Rennes – Bretagne Atlantique Centre is one of Inria’s eight centres and has more than thirty research teams. The centre is a major and recognized player in the field of digital sciences. It is at the heart of a rich R&D and innovation ecosystem: highly innovative SMEs, large industrial groups, competitiveness clusters, research and higher-education players, laboratories of excellence, a technology research institute, etc.
Context
Description
The recent spectacular rise of the Internet of Things (IoT) and the resulting growth of the data deluge have motivated the emergence of Edge computing [1] as a means to move processing from centralized Clouds towards decentralized processing units close to the data sources. The key idea is to leverage computing and storage resources at the “edge” of the network, i.e., near the places where data is produced (e.g., sensors, routers, etc.). These resources can be used to filter and pre-process data or to perform (simple) local computations (for instance, a home assistant may perform a first lexical analysis before sending a translation request to the Cloud).
This shift towards the edge of the network has been driven in particular by Artificial Intelligence (AI) applications collecting data from huge numbers of sensors. AI has recently gained unprecedented momentum in a rapidly increasing number of application areas: Deep Neural Networks (DNNs) are becoming a pervasive tool across a wide range of domains, including autonomous vehicles, industrial automation, and pharmaceutical research, to name just a few. As these neural network architectures and their training data grow more and more complex, so do the infrastructures needed to execute them sufficiently fast.
Problem statement
However, this extreme geographic distribution of computation comes with new challenges related to the heterogeneity of the underlying resources (e.g., spanning from energy-constrained devices at the Edge to supercomputers and HPC nodes in the Cloud) and to their availability (e.g., parts of the network could become inaccessible due to volatile nodes at the Edge or power shortages). This thesis aims to address these heterogeneity and availability challenges.
Assignment
Thesis goal
One of the major challenges of these distributed heterogeneous systems lies in the ability to have the relevant data available at a given location and at a given time. To this end, three approaches must be studied:
- Data locality. Data sharing between devices, which includes the ability to locate and transfer data and to guarantee its integrity and consistency.
- Task scheduling. The scheduling of computational tasks on devices, which includes the ability to determine which tasks can be executed on each device, depending on the available computational resources (hardware and software stack) and on the state of the system at a given time (hardware utilization, battery state of charge for battery-powered devices).
- Orchestration. An orchestration mechanism that has a global view, or multiple partial local views, of the system, and is able to make or propose decisions about the two mechanisms above. The orchestration mechanism includes the evaluation of different scenarios for satisfying the same set of instructions, each scenario potentially implying very different communication and computation costs on each device. It also includes the knowledge or prediction of task computation and data transfer times (a minimal, purely illustrative sketch of such a cost-based placement decision follows this list).
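
To make the orchestration idea more concrete, the following minimal Python sketch shows how a cost-based placement decision could look: a task is assigned to the feasible device with the lowest estimated transfer-plus-compute time. All device names, numbers and the scoring formula are hypothetical illustrations, not artifacts of the project.

from dataclasses import dataclass

# Illustrative only: a toy cost model for placing one task on one of several
# heterogeneous devices (edge node, fog node, cloud node). All names, numbers
# and the scoring formula are hypothetical assumptions, not project artifacts.

@dataclass
class Device:
    name: str
    flops: float             # available compute (GFLOP/s)
    bandwidth_mbps: float     # network bandwidth to the data source
    battery_level: float      # 1.0 = mains-powered or fully charged

@dataclass
class Task:
    name: str
    compute_gflop: float      # total compute demand
    input_mb: float           # input data that must be transferred

def completion_time(task: Task, device: Device) -> float:
    """Estimated transfer time + compute time on a given device (seconds)."""
    transfer = task.input_mb * 8 / device.bandwidth_mbps
    compute = task.compute_gflop / device.flops
    return transfer + compute

def place(task: Task, devices: list[Device], min_battery: float = 0.2) -> Device:
    """Pick the feasible device with the lowest estimated completion time."""
    feasible = [d for d in devices if d.battery_level >= min_battery]
    return min(feasible, key=lambda d: completion_time(task, d))

if __name__ == "__main__":
    devices = [
        Device("edge-sensor-hub", flops=5, bandwidth_mbps=100, battery_level=0.6),
        Device("fog-gateway", flops=50, bandwidth_mbps=50, battery_level=1.0),
        Device("cloud-vm", flops=500, bandwidth_mbps=10, battery_level=1.0),
    ]
    task = Task("preprocess-sensor-window", compute_gflop=20, input_mb=200)
    best = place(task, devices)
    print(f"Place '{task.name}' on {best.name} "
          f"(~{completion_time(task, best):.1f} s estimated)")

A real orchestrator would of course also account for energy consumption, data locality and node volatility, as discussed above.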
Many well-studied solutions to these issues exist in homogeneous distributed systems such as supercomputers or large-scale file-sharing systems, but new problems arise when the targeted system is composed of several heterogeneous systems, each bringing different paradigms for task and data management. The research context of this thesis is the management of this heterogeneity for data management.
This PhD thesis aims to propose a set of techniques to detect on the fly or to predict potential network partitions or resource failures on the Computing Continuum and to react proactively. This reaction includes a set of strategies to efficiently deal with network partitioning, fault tolerance and extreme heterogeneity issues. We will study the trade-offs of different solutions (e.g., relocation of datasets in case of network inaccessibility, application state backup, smart caches in disconnected mode, restarting computations on different nodes) from a performance and energy consumption perspective. We plan to integrate these approaches into a unified framework for seamless and efficient deployment, execution and adaptation of geo-distributed applications.
To this end, an important objective is to propose methodologies and supporting tools enabling researchers to:
- describe the application behavior in a representative way;
- reproduce it in a reliable, controlled environment for extensive experiments; and
- understand how the end-to-end performance of applications is correlated to various algorithm-dependent or infrastructure-dependent parameters.
The thesis will validate its outcomes through large-scale experimentation on real-life applications (described below) on experimental testbeds like Grid’5000 [4].
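
As a purely illustrative example of the third point, the sketch below shows how end-to-end latency could be recorded against a small grid of algorithm-dependent (batch size) and infrastructure-dependent (edge bandwidth) parameters. The parameter names, value ranges and the simulated run are assumptions; a real study would instead drive deployments on a testbed such as Grid’5000.

import itertools, random, statistics

# Illustrative only: a toy parameter sweep showing how end-to-end latency could
# be recorded against algorithm- and infrastructure-dependent parameters.
# The parameters and the simulated run are assumptions made for illustration.

BATCH_SIZES = [8, 32, 128]             # algorithm-dependent parameter
EDGE_BANDWIDTHS_MBPS = [10, 50, 100]   # infrastructure-dependent parameter

def run_once(batch_size: int, bandwidth_mbps: float) -> float:
    """Stand-in for one real experiment; returns end-to-end latency (s)."""
    transfer = batch_size * 1.0 * 8 / bandwidth_mbps   # assume ~1 MB per sample
    compute = 0.05 * batch_size                        # assumed per-sample compute cost
    return transfer + compute + random.gauss(0, 0.1)   # small measurement noise

results = {}
for batch, bw in itertools.product(BATCH_SIZES, EDGE_BANDWIDTHS_MBPS):
    latencies = [run_once(batch, bw) for _ in range(5)]   # repeat for stability
    results[(batch, bw)] = statistics.mean(latencies)

for (batch, bw), latency in sorted(results.items()):
    print(f"batch={batch:4d}  bandwidth={bw:3d} Mbps  mean latency={latency:6.2f} s")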
Target use-cases
Early warning systems for disaster risk mitigation
Earthquakes cause substantial loss of life and environmental damage in areas hundreds of miles from their origins. These large ground movements often lead to hazards such as tsunamis, fires and landslides. To mitigate the disastrous effects, a number of earthquake early warning systems have been proposed. These critical systems, operating 24 hours a day, 7 days a week, are supposed to automatically detect and characterize earthquakes as they occur, and issue alerts before ground motion reaches sensitive areas so that protective measures can be taken. It is essential for such a system to detect all large earthquakes with 100% accuracy because the decisions following a large earthquake warning involve important measures for the potentially affected population. This type of detection can be likened to a classification problem, where the input is sensor data and the output is a class (normal activity / medium / large earthquake). Recent machine learning approaches designed to combine large volumes of data from multiple data sources can be applied. The challenge remains the integration and real-time processing of high-throughput data streams from multiple sensors dispersed over a large area, with some sensors becoming isolated from the network. A traditional centralized approach that transfers all data to a single point may be impractical. Thus, detection solutions based on distributed machine learning and relying on high-performance computing techniques and equipment are needed to enable real-time alerts.
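
As a purely illustrative sketch of how such a detection task could be framed as a classification problem, the following Python snippet defines a toy neural network that maps a fixed-length window of multi-sensor readings to one of three classes (normal activity / medium / large earthquake). The architecture, window length and channel count are assumptions for illustration only, not the system envisioned in the thesis.

import torch
import torch.nn as nn

# Illustrative only: a toy 3-class classifier over fixed-length windows of
# seismic sensor readings. All sizes below are hypothetical assumptions.

N_SENSORS = 16        # channels streamed from geographically dispersed sensors
WINDOW = 256          # samples per channel in one detection window
N_CLASSES = 3         # normal activity / medium / large earthquake

class QuakeClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_SENSORS, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time dimension
            nn.Flatten(),
            nn.Linear(32, N_CLASSES),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N_SENSORS, WINDOW) -> (batch, N_CLASSES) logits
        return self.net(x)

if __name__ == "__main__":
    model = QuakeClassifier()
    window = torch.randn(1, N_SENSORS, WINDOW)   # one synthetic sensor window
    probs = torch.softmax(model(window), dim=-1)
    print("class probabilities:", probs.squeeze().tolist())

In the setting targeted by the thesis, such a model would have to be trained and served in a distributed fashion, with parts of the pipeline running close to the sensors and parts in the Cloud.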
Enabling technologies
In the process of designing this adaptive and resilient deployment and execution framework for AI workloads across the Edge-Cloud continuum, we will in particular leverage data-processing techniques already investigated by the participating teams and available as proof-of-concept software validated in real-life environments:
The E2Clab [2,3] approach, initiated in the KerData team at Inria, is a framework implementing a rigorous methodology that provides guidelines to move from real-life application workflows to representative settings of the physical infrastructure underlying the application, in order to accurately reproduce its relevant behaviors and therefore understand its end-to-end performance. In addition to potential parallelization strategies for learning and inference tasks, it enables reproducible experimentation with complex AI workloads across hybrid infrastructures and helps optimize deployment strategies depending on multiple factors, including the application characteristics, the target performance metrics and the features of the available execution hardware.
Main activities
International visibility and mobility
The thesis is funded by the PEPR CLOUD STEEL project, which involves several other French research groups. It will also include collaborations with other partners of the KerData and Myriads teams working on stream processing across the Computing Continuum, in particular the team of Bogdan Nicolae at Argonne National Laboratory (Exascale storage and ML/DL processing models).
The PhD position is mainly based in Rennes, at IRISA. The candidate is also expected to be hosted for 3- to 6-month internships at the partners mentioned above (i.e., ANL).
Interdisciplinarity
The targeted use-case of this PhD proposal will provide a perfect opportunity to illustrate the impact of research in computer science (more specifically in AI-based Big Data analytics) on the domain of earth science. In particular, we plan to show that earthquakes can be monitored more efficiently and that agencies can react faster to the detected events, using the Edge-Cloud processing techniques designed in this thesis.
References
[1] M. Satyanarayanan, “The emergence of edge computing,” Computer, 2017.
[2] D. Rosendo, P. Silva, et al., “E2Clab: Exploring the Computing Continuum through Repeatable, Replicable and Reproducible Edge-to-Cloud Experiments,” Cluster 2020 – IEEE International Conference on Cluster Computing, Kobe, Japan, Sep. 2020.
[3] The E2Clab project: https://team.inria.fr/kerdata/e2clab/
[4] The Grid’5000 experimental testbed: https://www.grid5000.fr/w/Grid5000:Home
[5] S. C. H. Hoi, D. Sahoo, J. Lu, P. Zhao, “Online Learning: A Comprehensive Survey,” 2018. https://arxiv.org/abs/1802.02871
[6] D. Sahoo, Q. Pham, J. Lu, S. C. H. Hoi, “Online Deep Learning: Learning Deep Neural Networks on the Fly,” 2017. https://arxiv.org/abs/1711.03705
Skills
Requirements of the candidate
- An excellent Master’s degree in computer science or equivalent
- Strong knowledge of computer networks and distributed systems
- Knowledge of storage and (distributed) file systems
- Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
- Strong programming skills (e.g., C/C++, Java, Python)
- Working experience in the areas of Big Data management, Cloud computing or HPC is an advantage
- Very good communication skills in oral and written English
- Open-mindedness, strong integration skills and team spirit
Benefits package
- Subsidized meals
- Partial reimbursement of public transport costs
- Possibility of teleworking (90 days per year) and flexible organization of working hours
- Partial payment of insurance costs
Remuneration
Monthly gross salary of 2051 euros for the first and second years, and 2158 euros for the third year.