
Bright Vision Technologies
Site Reliability Engineer (SRE)
New Jersey | Remote | Posted Today
Full Time | Senior | Remote
Job Description
Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cuttin
Key Highlights
- Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services, and use those measures to drive concrete engineering and prioritization decisions.
- Lead incident response and resolution for production issues, acting as a calm and effective incident commander when needed, and ensuring high-quality post-incident reviews that drive lasting improvements.
- Design and implement comprehensive monitoring, logging, and tracing strategies using Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar tooling so that operators have rich, actionable visibility into system behavior.
- Build and maintain robust on-call processes, runbooks, and escalation paths that reduce mean time to detect and mean time to resolve while protecting the well-being of the engineers on rotation.
- Automate operational toil aggressively by writing production-grade tooling in Python, Go, Bash, or similar languages, replacing manual workflows with reliable, auditable automation.
Qualifications
Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
- Strong programming skills in at least one of Python, Go, or Java, with the ability to build robust automation and tooling.
- Deep, hands-on experience operating Linux at scale, including networking, performance tuning, and systems-level troubleshooting.
- Production experience operating Kubernetes and container-based workloads.
- Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
- Hands-on experience designing and operating CI/CD pipelines for both infrastructure and applications.
- Solid understanding of distributed system design, including consistency models, partitioning, and failure semantics.
- Demonstrated experience leading incident response and conducting effective post-incident reviews.
- Excellent communication and documentation skills.
Preferred Qualifications
- Experience defining and operationalizing SLOs and error budgets in real production environments.
- Exposure to chaos engineering practices and tools such as Chaos Monkey, Gremlin, or Litmus.
- Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Background in capacity planning, performance engineering, or large-scale load testing.
- Familiarity with service mesh technologies such as Istio, Linkerd, or Consul.
Skills & Technologies
Adobe Creative Suite, Adobe Photoshop, Analytics, Architect, Architecture, Blog Writing, Analytics Tools, Editing, Audio Editing, Brand Identity Design, Python, Go, Bash, Kubernetes, CI/CD, Java, Linux, AWS, Azure, GCP
Job Details
Employment Type: Full Time
Experience Level: Senior
Location: New Jersey
Work Mode: Remote
Posted: Today