Berkeley Lab's National Energy Research Scientific Computing Center (NERSC) has an opening for a Data Management Engineer. In this role, you will provide a variety of engineering support services to manage a data warehouse and notification infrastructure. You will ensure that the cluster is accessible, reliable, secure and available to continue collecting or queuing data from heterogeneous data sources in the NERSC computational facility. The infrastructure holds 125 TB of immediately available time-series data, collected at a rate of 25K data points/second. The datasets range from the facility environment (power, temperature, humidity) to storage I/O to system logs of the HPC systems and support services. This information is used not only to generate alerts for managing the facility but also to correlate data in support of business decisions and to analyze future trends. In addition to managing the data warehouse, you will assist other groups in creating plots and analyses of their data.
This position supports a 24x7 operation. While the current schedule is during the day, you will be part of a 24x7 on-call rotation and may support night shifts, weekend hours and holidays depending on operational needs.
This position will be hired at a level commensurate with the business needs and with the skills, knowledge, and abilities of the successful candidate.
What You Will Do:
Apply working knowledge of clustered Linux systems to manage the reliability of the data warehouse cluster and ensure that it continues to collect data 24/7.
Apply demonstrated skills as a Linux Systems Administrator and a site reliability engineer, using skill sets in container management like Kubernetes, virtualization technologies like oVirt, systems monitoring software like Prometheus and data warehouse management systems like the Elastic stack or VictoriaMetrics.
Apply demonstrated skills in a data warehouse stack's visualization software like Kibana and Grafana and assist other groups in creating plots and analyses of their data.
Solve problems of diverse scope related to keeping the critical services of the data collection infrastructure functioning, and create alerting, notification and problem-solving programs to prevent problem recurrence, with the goal of automating the response to all routine service conditions.
Collaborate with others in developing and maintaining diagnostic tools used to support the HPC community within NERSC using programming languages like C, C++, Python, Java or Perl or within the NOW framework, using knowledge of standard software development practices.
Provide accurate information in the trouble ticketing system for outages, maintenance and other incidents so that the workflow and protocols can be appropriately tracked by others.
Work in close cooperation with other NERSC groups to manage maintenance for the cluster, performing tasks such as upgrades and shutting down batch queues. Manage diagnostic and notification software.
What is Required:
Bachelor's Degree in Computer Science or similar discipline and a minimum of 5 years of hands-on experience, or an equivalent combination of education, certification and experience.
3 years of experience as a Linux (or similar type of operating system) system administrator or system engineer in a customer-facing environment supporting data clusters, managing the replacement of hardware, and ensuring its continued availability to the user community. This can include assisting in the deployment of new nodes and internal switches into production, resolving ticket incidents and working with vendors on hardware warranty replacements.
Hands-on experience in Red Hat Enterprise Linux or another Linux variant in a shell or command line environment.
Minimum of 3 years of experience in a UNIX or Linux environment, with networking, IT infrastructure or cluster management experience in a distributed computing environment.
Hands-on experience configuring distributed, server-based or cluster-based infrastructure supporting a high volume of transactions in a Linux environment. An understanding of VMs and containers and how to manage them, and an understanding of IoT technologies.
Demonstrated skill sets in container management like Kubernetes, in systems monitoring software like Prometheus and in data collection management systems like the ELK stack.
Demonstrated skills in the ELK stack's visualization software like Kibana and Grafana, with the knowledge to assist other groups in creating plots or analyses of their data.
Hands-on experience with developing and maintaining diagnostic tools using programming languages like C, C++, Python, Java or Perl, or within the NOW framework, using knowledge of standard software development practices. Feel free to share your GitHub repository for us to look at.
Experience with network theory such as TCP/IP, UDP, ICMP (networking protocols in general), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
An understanding of different monitoring implementations, like those in the NOW framework, and of their system administration.
Exposure to Oracle or other high end Storage Infrastructure.
Excellent problem-solving skills. Must be able to think independently, work collaboratively and contribute to the final resolution.
Additional Desired Qualifications:
Experience with network security: configuring/maintaining ACLs, knowledge of firewalls.
Network programming or a network certification.
A certification in a system administration area.
The posting shall remain open until the position is filled.
This is a full-time, 1-year term appointment with the possibility of extension or conversion to Career appointment based upon satisfactory job performance, continuing availability of funds and ongoing operational needs.
This position supports a 24x7 operation and may support night shift, weekend hours and holidays depending on operational needs and a 24x7 on-call rotation.
This position may be subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
Work will be primarily performed at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA.
Learn About Us:
Working at Berkeley Lab has many rewards including a competitive compensation program, excellent health and welfare programs, a retirement program that is second to none, and outstanding development opportunities.
Berkeley Lab (LBNL) addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the U.S. Department of Energy's Office of Science.
Equal Employment Opportunity: Berkeley Lab is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or protected veteran status. Berkeley Lab is in compliance with the Pay Transparency Nondiscrimination Provision under 41 CFR 60-1.4. Click here to view the poster and supplement: "Equal Employment Opportunity is the Law."
Internal Number: 88307
About Lawrence Berkeley National Laboratory
In the world of science, Lawrence Berkeley National Laboratory (Berkeley Lab) is synonymous with excellence. Thirteen scientists associated with Berkeley Lab have won the Nobel Prize. Fifty-seven Lab scientists are members of the National Academy of Sciences (NAS), one of the highest honors for a scientist in the United States. Thirteen of our scientists have won the National Medal of Science, our nation's highest award for lifetime achievement in fields of scientific research. Eighteen of our engineers have been elected to the National Academy of Engineering, and three of our scientists have been elected into the Institute of Medicine. In addition, Berkeley Lab has trained thousands of university science and engineering students who are advancing technological innovations across the nation and around the world.
Berkeley Lab is a member of the national laboratory system supported by the U.S. Department of Energy through its Office of Science. It is managed by the University of California (UC) and is charged with conducting unclassified research across a wide range of scientific disciplines. Located on a 200-acre site in the hills above the UC Berkeley campus that offers spectacular views of the San Francisco Bay, Berkeley Lab employs approximately 4,200 scientists, engineers, support staff and students. Its budget for 2011 is $735 million, with an additional $101 million in funding from the American Recovery and Reinvestment Act, for a total of $836 million.
A recent study estimates the Laboratory's overall economic impact through direct, indirect and induced spending on the nine counties that make up the San Francisco Bay Area to be nearly $700 million annually. The Lab was also responsible for creating 5,600 jobs locally and 12,000 nationally. The overall economic impact on the national economy is estimated at $1.6 billion a year. Technologies developed at Berkeley Lab have generated billions of dollars in revenues, and thousands of jobs. Savings as a result of Berkeley Lab developments in lighting and windows, and other energy-efficient technologies, have also been in the billions of dollars.
Berkeley Lab was founded in 1931 by Ernest Orlando Lawrence, a UC Berkeley physicist who won the 1939 Nobel Prize in physics for his invention of the cyclotron, a circular particle accelerator that opened the door to high-energy physics. It was Lawrence's belief that scientific research is best done through teams of individuals with different fields of expertise, working together. His teamwork concept is a Berkeley Lab legacy that continues today.