AI&

Member of Technical Staff - Inference Optimization

Japan (Hybrid) | Remote | Posted 27 days ago | ¥4,800,000 – ¥4,800,000

Full Time | Senior | Remote | JP

Job Description

As a Kernel Optimization Engineer, your objective is to extract everything from heterogeneous GPU hardware. This means going below the framework layer, writing, profiling, and tuning custom CUDA and R

Key Highlights

Custom Kernel Development: Design and implement high-performance kernels for core AI primitives including GEMM, attention, normalization, and convolution. Own the full cycle from profiling to production deployment across LLM inference, training, and generative model workloads.
Attention & Linear Algebra Primitives: Build and tune fused attention kernels (Flash Attention variants, MLA, paged attention), GEMM primitives, and quantized compute paths (INT8, FP8, AWQ, GPTQ) that push the hardware to its limits.
Precision & Numerical Stability: Prototype and evaluate precision formats including FP16, BF16, FP8, e5m2, and stochastic rounding. Understand the accuracy and performance trade-offs at a deep level and make principled decisions about where each format belongs.
Profiling & Bottleneck Analysis: Use Nsight Compute, rocprof, Perfetto, VTune, and custom instrumentation to identify and eliminate performance bottlenecks. Translate profiling data into concrete architectural improvements.
Operator Fusion: Identify opportunities to fuse multi-step operations into single kernel launches, reducing memory round-trips and kernel launch overhead across the inference and training execution graphs.

Qualifications

Required Qualifications

Deep Kernel Authorship: You have written production CUDA or ROCm kernels from scratch. You understand warp execution, shared memory bank conflicts, occupancy, and instruction-level parallelism at an intuitive level. Strong proficiency in C++11 or higher, CUDA, Triton, and ideally LLVM/MLIR.
Hardware Architecture Knowledge: Strong familiarity with NVIDIA Hopper/Ampere and AMD CDNA architectures. You know the differences between HBM bandwidth profiles, cache sizes, and execution units and you write code that reflects that knowledge. Deep understanding of memory layout, vectorization, thread and block scheduling, and cache behavior.
Precision & Numerical Fluency: Solid grasp of numerical stability, mixed precision arithmetic, and modern precision formats. Experience making principled trade-offs between precision and performance in production systems.
Profiling Fluency: Comfortable with Nsight Compute, rocprof, Perfetto, VTune, and roofline modeling. You do not guess where the bottleneck is. You measure it.
Parallel Programming Breadth: Strong background across parallel programming models including CUDA, Triton, SYCL, OpenCL, or OpenMP. Experience optimizing irregular algorithms such as sparse linear algebra or graph computations.
Systems Thinking: Ability to reason about how individual kernels compose into larger execution graphs, and how kernel-level decisions propagate up through the inference or training stack.
Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.

Skills & Technologies

PyTorch, C++

About the Company

AI&

Job Details

Employment Type: Full Time
Experience Level: Senior
Salary Range: ¥4,800,000 – ¥4,800,000
Location: Japan (Hybrid)
Work Mode: Remote
Posted: 27 days ago

Country