Pete Fritchman
120 Sandbar Road
Egg Harbor Township, NJ 08234
cell: +1.408.393.3757

Work Experience

* Goldman Sachs, remote/NYC
  Site Reliability Engineer - 8/2018 - current

  Build and manage the SRE team responsible for the Apple Card backend: a 99.99% available service hosted in the public cloud, with a 5-minute response 24x7 on-call rotation. Introduced the concept of SLOs and built out a stack based around Prometheus for metrics gathering, visualization, and real-time alerting. Drove the adoption of a blameless post-mortem culture and of monitoring and decision making built around SLOs, and generally instilled SRE principles into the larger organization around the Apple Card project.

* FXCM, Inc., remote/NYC
  Operations Engineer - 11/2013 - 8/2018

  Operations lead for designing and deploying updated system management, including: configuration databases, configuration management, monitoring, graphing, and other at-scale system-related issues. Most of the contract was spent writing an in-house CMDB and importing/auditing existing data.

* OnLive2, Inc, remote
  Principal Operations Engineer - 10/2012 - 10/2013

  Operations lead for existing game infrastructure. Key member of the team designing/building architecture for our next-generation platform.

* XDN, Inc, remote
  Principal Operations Engineer - 8/2012 - 10/2012

  Operations lead for production cloud GSLB, CDN, and all supporting infrastructure. Implemented higher-visibility monitoring via statsd/graphite/cepmon. Worked closely with engineering on performance/scalability issues.

* Mozilla Corporation, Mountain View, CA, US
  Sr. Operations Engineer - 1/2011 - 7/2012 [remote since 10/2010]

  Early member of a new operations group, Services Operations, responsible for running large-scale user-facing services for Mozilla. Responsibilities included leading new projects into production, problem firefighting, and everything from working on new project architecture/design to actual day-to-day operations and deploys.
  Large focus on monitoring and metric collection; lots of work on open source tools around this.

* Dasient, Inc., Palo Alto, CA, US
  Operations Engineer - 1/2010 - 1/2011

  Member of a small engineering team, focusing on operations and some backend engineering work. On the engineering side, led and implemented a proxy network build-out, made major performance enhancements to critical parts of our scanning pipeline, and championed a major re-architecture (moving from MySQL to MongoDB and central queuing). On the operations front, automated all pushes and release management, built a machine database as the single source of truth, and deployed Puppet for config management.

* Rearden Labs / OnLive, Inc., Palo Alto, CA, US
  Sr. Systems Administrator - 3/2008 - 12/2009

  Lead systems administrator for all of production (including dev and staging). Built out all infrastructure from zero machines to hundreds of machines (planning to scale into thousands soon). Designed end-to-end infrastructure for running production: a machine database, OS-imaging system (Linux & Windows), config management, software deployment, monitoring, escalation, etc. Everything done with automation and revision control. I am the go-to guy for all production-related architecture and have a close relationship with the engineering teams. I also act as a tech lead in the sysadmin group, providing guidance and direction.

* Enfold, Redwood City, CA, US
  Engineering Operations Manager - 9/2007 - 2/2008

  Responsible for all operations in the company (corporate & production, networks & systems, etc). Designed an infrastructure to run a large Ruby on Rails application with different Java-based backends (search, rules engine, etc). Designed a high-availability Xen cluster, wrote management tools, and deployed a distributed file system to achieve near-100% uptime for the VMs that would in turn run the site.
  Managed all aspects of moving the site to production (deployment, colo space / dedicated servers, etc) and all operational aspects (on-call, monitoring, backups, HIPAA compliance). All systems deployed under config management (Puppet) with automation covering everything else (deployment, actual system build, generating monitoring configs, etc).

* LiveOps, Palo Alto, CA, US
  Sr. System Administrator - 5/2007 - 9/2007

  Responsible for all production infrastructure, specifically building out infrastructure to support more machines and a faster rate of growth (automation, more central control, documented procedures, best practices, etc). Consulted with and helped the ops/tools group design inventory management, RPM repository management, and revision control (Subversion) software and scripts. Rolled out central single-sign-on authentication for production and pre-production environments, helping move away from the "everyone with root" environment. Worked with Puppet for configuration and system management. Wrote and maintained Ruby scripts for automation (and Puppet integration with our existing inventory system).

* Google, Mountain View, CA, US
  Ads Site Reliability Engineer - 3/2005 - 5/2007

  Part of the "SRE" group, responsible for keeping Google up 24x7x365, running quickly, and scaling for growth. Specifically responsible for Ads-related customer-facing frontend systems. Took a key role in enhancing our production monitoring both for ads and all of Google production (member of the production monitoring team). Also a member of the production DNS team for internal and external DNS. On-call for all user-facing interactive Ads frontends (AdWords, AdSense); responsible for designing, monitoring, maintaining, and scaling the infrastructure behind them. Key participant in disaster recovery exercises and in moving our services to new datacenters (identifying dependencies, timelines, and key players, and making it happen).
  Wrote many internal utilities to assist with day-to-day operational work (logs analysis, draining/moving traffic, working with monitoring data, etc). Established as the "go-to guy" for ads production and monitoring issues. Mentored new hires in our group, provided training as needed, and jumped in to help debug problems when necessary.

* FedEx Services, Memphis, TN, US
  Senior Technical Analyst - 1/2003 - 3/2005

  Part of the "SA&C" group -- Systems Administration and Consulting. SA&C is responsible for all Internet-related Unix systems and some of the network infrastructure they are connected to. Responsible for everything from routine daily maintenance (adding users, installing/updating software, and backups) to dealing with production emergencies to enhancing/building out infrastructure (backup, MX, NTP, DNS, administrative command and control involving config management and remote console services, etc) to architecting new highly-available applications for FedEx.com. Approximately 620 servers in scope: 60% Sun, 35% Linux, and 5% FreeBSD/HP-UX/AIX.

  Developed and enhanced local tools to scale well for maintenance of our machines. Took a "work smarter, not harder" approach in all aspects: writing automation to help do work, find problems, monitor systems, etc. Helped out on automated FreeBSD class builds. Responsible for rolling out an implementation of HP OpenView on SA&C Internet-facing machines for 24x7 monitoring by another Ops group in the company. Engineered a scalable backup system to meet our needs and integrate with existing vendor backup products. Assisted in interviewing and training new hires. Re-wrote our user accounting system and integrated it with LDAP for user authentication. FedEx.com has been in the top 10 of the Keynote Business 40 list for over 200 consecutive weeks, with consistently fast page load times.
* JH Compunet LLC, Jackson, WY, US
  System Engineer - 6/2002 - 12/2002

  Responsible for maintaining and upgrading the network, which consisted of 14 T1s and 10 Unix servers (Solaris and FreeBSD). Responsibilities also included maintaining the wireless network, consisting of 13 separate access points in the 2.4 GHz spectrum and 7 access points in the 5.2 GHz spectrum. On-call 24x7. Implemented our intranet (ticketing software, calendar/scheduling software, custom scripts to manage IP space and users, etc.) entirely from scratch.

* NetReach, Inc., Ambler, PA, US
  System Engineer - 4/1998 - 12/2001

  Responsible, along with two other people, for maintaining a group of approximately 75 Unix machines for web hosting and backend infrastructure services, as well as our production ISP network. Network-wise, responsibilities included administering all routers (mostly Ciscos, ranging from 800s to 7200s) and switches (HP ProCurves and Cisco Catalysts). Worked with dial-up terminal servers (Portmasters), and hardware for ISDN, frame relay, T1, channelized T1, and SMDS DS3 lines. On the Unix side, responsibilities included purchasing new hardware and upgrades for existing machines, deploying these new machines and upgrades, and administering all the existing systems. Administering the existing systems involved keeping the OSes up to date with critical patches, keeping software up to date, and testing all the changes on a test bed of machines. I assisted in writing a NOC ticket system and a network/host monitoring system which discovered problems within our network and escalated them appropriately. I was one of the main contacts for the system engineering group within the company and for customers. I worked full time during the summer, and part time (20 hours/week) during the school year.

* Databits Network Services, Inc., Collegeville, PA, US
  President - 5/1998 - 9/2001

  Databits Network Services was a small web hosting company I ran on a few co-located FreeBSD machines.
  Everything was done in-house (nothing outsourced). I gained experience in implementing services from scratch, general business skills, and professional communication. Databits became a fully incorporated entity, and is currently dormant because I do not have the time to devote to running a company.

* Feith Systems and Software, Fort Washington, PA, US
  Programmer - 6/1997 - 9/1997

  I worked as a programmer creating a test suite for their image database software in Perl and some Visual Basic. The tests covered their Windows end-user product and the Unix-backed server end.

Skills

* Programming/scripting languages: Python, Ruby, shell, Go
* Monitoring geek: logstash co-author, using and authoring tools such as graphite, pencil (graphite dashboard), statsd, prometheus, grafana, etc.
* OS/Platform experience with administration: Linux (x86), FreeBSD (x86), Solaris 6-10 (x86, x86-64, Sparc), NetBSD (x86, Sparc), OpenBSD (x86), SunOS 4.1.4 (Sparc)
* Linux specifics: CentOS, RHEL, Ubuntu, Debian. Also experience in building and maintaining an embedded distro (using ptxdist and buildroot).
* Networking experience: basic routing, VLANs, familiar with IOS, debugging.
* Experience with wide-scale implementation of the above services (horizontal scaling and redundancy across multiple machines & datacenters, central administration scripts & monitoring, config management, etc).
* Very solid knowledge of the Unix environment and development environment (including compilers, makefiles, linkers, shells, vi, etc).
* MySQL and NoSQL (specifically MongoDB) administration experience.
* Excellent debugging skills.

Education

* Rochester Institute of Technology, Rochester, NY, US
  Computer Science, one year completed (3.0 GPA)

  Completed my freshman year working towards a computer science degree. Took mostly programming and introductory liberal arts courses, and completed math through Calculus 3. Left after one year to pursue career interests.
* Germantown Academy, Fort Washington, PA, US
  High School, June 2001

$Id: resume.txt,v 1.30 2020/08/18 20:25:49 petef Exp $