
AI&
Member of Technical Staff - Inference Optimization
Japan (Hybrid) | Remote | Posted 27 days ago | ¥4,800,000 – ¥4,800,000
Full Time | Senior | Remote | JP
Job Description
As a Kernel Optimization Engineer, your objective is to extract maximum performance from heterogeneous GPU hardware. This means going below the framework layer: writing, profiling, and tuning custom CUDA and R
Key Highlights
- Custom Kernel Development: Design and implement high-performance kernels for core AI primitives including GEMM, attention, normalization, and convolution. Own the full cycle from profiling to production deployment across LLM inference, training, and generative model workloads.
- Attention & Linear Algebra Primitives: Build and tune fused attention kernels (Flash Attention variants, MLA, paged attention), GEMM primitives, and quantized compute paths (INT8, FP8, AWQ, GPTQ) that push the hardware to its limits.
- Precision & Numerical Stability: Prototype and evaluate reduced-precision formats including FP16, BF16, and FP8 (such as e5m2), along with techniques such as stochastic rounding. Understand the accuracy and performance trade-offs at a deep level and make principled decisions about where each format belongs.
- Profiling & Bottleneck Analysis: Use Nsight Compute, rocprof, Perfetto, VTune, and custom instrumentation to identify and eliminate performance bottlenecks. Translate profiling data into concrete architectural improvements.
- Operator Fusion: Identify opportunities to fuse multi-step operations into single kernel launches, reducing memory round-trips and kernel launch overhead across the inference and training execution graphs.
Qualifications
Required Qualifications
- Deep Kernel Authorship: You have written production CUDA or ROCm kernels from scratch. You understand warp execution, shared memory bank conflicts, occupancy, and instruction-level parallelism at an intuitive level. Strong proficiency in C++11 or higher, CUDA, Triton, and ideally LLVM/MLIR.
- Hardware Architecture Knowledge: Strong familiarity with NVIDIA Hopper/Ampere and AMD CDNA architectures. You know how HBM bandwidth profiles, cache sizes, and execution units differ across these architectures, and you write code that reflects that knowledge. Deep understanding of memory layout, vectorization, thread and block scheduling, and cache behavior.
- Precision & Numerical Fluency: Solid grasp of numerical stability, mixed precision arithmetic, and modern precision formats. Experience making principled trade-offs between precision and performance in production systems.
- Profiling Fluency: Comfortable with Nsight Compute, rocprof, Perfetto, VTune, and roofline modeling. You do not guess where the bottleneck is. You measure it.
- Parallel Programming Breadth: Strong background across parallel programming models including CUDA, Triton, SYCL, OpenCL, or OpenMP. Experience optimizing irregular algorithms such as sparse linear algebra or graph computations.
- Systems Thinking: Ability to reason about how individual kernels compose into larger execution graphs, and how kernel-level decisions propagate up through the inference or training stack.
- Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.
Skills & Technologies
PyTorch, C++
About the Company
AI&
Job Details
Employment Type: Full Time
Experience Level: Senior
Salary Range: ¥4,800,000 – ¥4,800,000
Location: Japan (Hybrid)
Work Mode: Remote
Posted: 27 days ago
Country: JP