In-person
11-12 December

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon India 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in India Standard Time (UTC+5:30). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.
Wednesday December 11, 2024 2:55pm - 3:30pm IST
Large Language Model (LLM) finetuning on enterprise private data has emerged as an important strategy for enhancing model performance on specific downstream tasks. This process, however, demands substantial compute resources and presents some unique challenges in Kubernetes environments. This session offers a practical, step-by-step guide to implementing multi-node LLM finetuning on Kubernetes clusters with GPUs, using PyTorch FSDP and the Kubeflow training operator. We'll cover preparing a Kubernetes cluster for LLM finetuning; optimizing cluster, system, and network configurations; and comparing the performance of various network topologies, including pod networking, secondary networks, and GPU Direct RDMA over Ethernet, for peak performance. By the end of this session, the audience will have a comprehensive understanding of the intricacies involved in multi-node LLM finetuning on Kubernetes, empowering them to introduce the same in their own production Kubernetes environments.
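
The abstract centers on PyTorch FSDP driven by the Kubeflow training operator. As a rough illustration only (not the speakers' actual code), the sketch below shows what the per-worker training process in such a setup typically looks like. It assumes the worker is launched by a Kubeflow PyTorchJob (or torchrun), which injects the standard torch.distributed environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK); the model checkpoint name and hyperparameters are placeholders.

```python
# Hypothetical sketch of the per-worker script for multi-node FSDP finetuning.
# Assumes the Kubeflow training operator (PyTorchJob) or torchrun has set
# MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK for this process.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM


def main():
    # NCCL is the usual backend for GPU collectives across nodes.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # Placeholder checkpoint; substitute the causal LM you intend to finetune.
    model = AutoModelForCausalLM.from_pretrained(
        "org/base-llm-checkpoint", torch_dtype=torch.bfloat16
    )

    # FSDP shards parameters, gradients, and optimizer state across all ranks,
    # which is what makes multi-node finetuning of large models feasible.
    model = FSDP(model, device_id=local_rank)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # ... iterate over the finetuning dataset here: forward pass, loss.backward(),
    # optimizer.step(), optimizer.zero_grad() ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The session's other topics (cluster and network tuning, secondary networks, GPU Direct RDMA over Ethernet) would be handled at the Kubernetes and NCCL configuration level rather than inside a script like this.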
Speakers
Ashish Kamra

Senior Manager, Red Hat
Dr. Ashish Kamra is an accomplished engineering leader with over 15 years of experience managing high-performing teams in AI, machine learning, and cloud computing. He joined Red Hat in March 2017 and currently serves as Senior Manager of AI Performance.
Boaz Ben Shabat

Principal AI Performance Engineer, Red Hat
Boaz is an engineer with extensive experience in optimizing large-scale, high-performance computing environments. His expertise includes network architecture, system performance tuning, and cloud infrastructure. He excels in solving complex technical challenges and improving efficiency.
Auditorium
  AI + ML