Pete Fritchman
120 Sandbar Road
Egg Harbor Township, NJ 08234
cell: +1.408.393.3757

<petef@databits.net>
<https://github.com/fetep>

Work Experience

* Goldman Sachs, remote/NYC
  Site Reliability Engineer - 8/2018 - current

  Build and manage the SRE team responsible for the Apple Card backend: a
  99.99% available service hosted in the public cloud, with a 5-minute response
  24x7 on-call rotation. Introduced the concepts of SLOs and built out a stack
  based around Prometheus for metrics gathering, visualization, and real-time
  alerting. Drove the adoption of a blameless post-mortem culture and of
  monitoring and decision making centered on SLOs, and instilled SRE
  principles in the larger organization around the Apple Card project.

* FXCM, Inc., remote/NYC
  Operations Engineer - 11/2013 - 8/2018

  Operations lead for designing and deploying updated system management,
  including: configuration databases, configuration management, monitoring,
  graphing, and other at-scale system-related issues.  Most of the current
  contract has been spent writing an in-house CMDB and importing/auditing
  existing data.

* OnLive2, Inc., remote
  Principal Operations Engineer - 10/2012 - 10/2013

  Operations lead for existing game infrastructure. Key member of the team
  designing and building the architecture for our next-generation platform.

* XDN, Inc., remote
  Principal Operations Engineer - 8/2012 - 10/2012

  Operations lead for production cloud GSLB, CDN, and all supporting
  infrastructure. Implemented higher-visibility monitoring via
  statsd/graphite/cepmon. Worked closely with engineering on
  performance/scalability issues.

* Mozilla Corporation, Mountain View, CA, US
  Sr. Operations Engineer - 1/2011 - 7/2012
  [remote since 10/2010]

  Early member of a new operations group, Services Operations, responsible
  for running large-scale user-facing services for Mozilla. Responsibilities
  include leading new projects into production, problem firefighting,
  and everything from working on new project architecture/design to actual
  day-to-day operations and deploys.  Large focus on monitoring and metric
  collection; lots of work on open source tools around this.

* Dasient, Inc., Palo Alto, CA, US
  Operations Engineer - 1/2010 - 1/2011

  Member of a small engineering team, focusing on operations and some backend
  engineering work. On the engineering side, led and implemented a proxy
  network build-out, made major performance enhancements to critical parts of
  our scanning pipeline, and championed a major re-architecture (moving from
  MySQL to MongoDB and central queuing). On the operations front, automated
  all pushes and release management, built a machine database as the single
  source of truth, and deployed Puppet for config management.

* Rearden Labs / OnLive, Inc., Palo Alto, CA, US
  Sr. Systems Administrator - 3/2008 - 12/2009

  Lead systems administrator for all of production (including dev and
  staging). Building out all infrastructure from zero machines to hundreds
  of machines (planning to scale into thousands soon).  Designed end-to-end
  infrastructure for running production: a machine database, OS-imaging system
  (Linux & Windows), config management, software deployment, monitoring,
  escalation, etc.  Everything done with automation and revision control. I
  am the go-to guy for all production-related architecture and have a close
  relationship with the engineering teams. I also act as a tech lead in the
  sysadmin group, providing guidance and direction.

* Enfold, Redwood City, CA, US
  Engineering Operations Manager - 9/2007 - 2/2008

  Responsible for all operations in the company (corporate & production,
  networks & systems, etc). Designing an infrastructure to run a large Ruby on
  Rails application with different Java-based backends (search, rules engine,
  etc). Designed a high-availability Xen cluster, wrote management tools,
  and deployed a distributed file system for achieving near-100% uptime
  for VMs, which would in turn be running the site.  Managing all aspects
  of moving the site to production (deployment, colo space / dedicated
  servers, etc) and all operational aspects (oncall, monitoring, backups,
  HIPAA compliance). All systems deployed under config management (Puppet)
  with automation covering everything else (deployment, actual system build,
  generating monitoring configs, etc).

* LiveOps, Palo Alto, CA, US
  Sr. System Administrator - 5/2007 - 9/2007

  Responsible for all production infrastructure, specifically building
  out infrastructure to support more machines and a faster rate of growth
  (automation, more central control, documented procedures, best practices,
  etc).  Consulted with and helped the ops/tools group design inventory
  management, RPM repository management, and revision control (Subversion)
  software and scripts.  Rolled out central single-sign-on authentication for
  production and pre-production environments, helping move away from the
  "everyone with root" environment.  Worked with Puppet for configuration
  and system management.  Wrote and maintained Ruby scripts for automation
  (and Puppet integration with our existing inventory system).

* Google, Mountain View, CA, US
  Ads Site Reliability Engineer - 3/2005 - 5/2007

  Part of the "SRE" group, responsible for keeping Google up 24x7x365,
  running quickly, and scaling for growth.  Specifically responsible for
  Ads-related customer-facing frontend systems.  Took a key role in enhancing
  our production monitoring both for ads and all of Google production (member
  of the production monitoring team).  Also a member of the production DNS
  team for internal and external DNS.

  On-call for all user-facing interactive Ads frontends (AdWords, AdSense);
  responsible for designing, monitoring, maintaining, and scaling the
  infrastructure behind it.  Key participant in disaster recovery exercises
  and moving our services to new datacenters (identifying dependencies,
  timelines, key players and making it happen).  Wrote many internal
  utilities to assist with day-to-day operational work (log analysis,
  draining/moving traffic, working with monitoring data, etc).

  Established as the "go-to guy" for ads production and monitoring issues.
  Mentored new hires in our group, provided training as needed, and jumped
  in to help debug problems when necessary.

* FedEx Services, Memphis, TN, US
  Senior Technical Analyst - 1/2003 - 3/2005

  Part of the "SA&C" group -- Systems Administration and Consulting.  SA&C is
  responsible for all Internet-related Unix systems and some of the network
  infrastructure they are connected to.  Responsible for everything from
  routine daily maintenance (adding users, installing/updating software,
  and backups) to dealing with production emergencies to enhancing/building
  out infrastructure (backup, MX, NTP, DNS, administrative command and
  control involving config management and remote console services, etc)
  to architecting new highly-available applications for FedEx.com.

  Approximately 620 servers in scope, 60% Sun, 35% Linux, and 5%
  FreeBSD/HPUX/AIX.  Develop and enhance local tools to scale well for
  maintenance of our machines.  Take a "work smarter, not harder" approach
  in all aspects: writing automation to help do work, find problems, monitor
  systems, etc.

  Helped out on automated FreeBSD class builds.  Responsible for rolling
  out an implementation of HP OpenView on SA&C Internet-facing machines for
  24x7 monitoring by another Ops group in the company.  Engineered a scalable
  backup system to meet our needs and integrate with existing vendor backup
  products.  Assisted with new-hire interviews and training.
  Re-wrote our user accounting system and integrated it with LDAP for user
  authentication.

  FedEx.com has been in the top-10 Keynote Business 40 list for over 200
  weeks consecutively with consistently fast page load times.

* JH Compunet LLC, Jackson, WY, US
  System Engineer - 6/2002 - 12/2002

  Responsible for maintaining and upgrading the network, which consisted of
  14 T1s and 10 Unix servers (Solaris and FreeBSD).  Responsibilities also
  included maintaining the wireless network, consisting of 13 separate access
  points in the 2.4 GHz spectrum and 7 access points in the 5.2 GHz spectrum.
  On call 24x7.  Implemented our intranet (ticketing software,
  calendar/scheduling software, custom scripts to manage IP space and users,
  etc.) from scratch.

* NetReach, Inc., Ambler, PA, US
  System Engineer - 4/1998 - 12/2001

  Responsible along with 2 other people for maintaining a group of
  approximately 75 Unix machines for web hosting and backend infrastructure
  services, as well as our production ISP network.  Network-wise,
  responsibilities included administering all routers (mostly Ciscos, ranging
  from 800s to 7200s) and switches (HP ProCurves and Cisco Catalysts).
  Worked with dial-up terminal servers (Portmasters), and hardware for
  ISDN, frame relay, T-1, channelized T-1s, and SMDS DS3 lines. On the Unix
  side, responsibilities included purchasing new hardware and upgrades for
  existing machines, deploying these new machines and upgrades, as well
  as administering all the existing systems.  Administering the existing
  systems involved keeping the OSes up to date with critical patches,
  keeping software up to date, and testing all the changes on a test bed
  of machines.  I assisted in writing a NOC ticket system and a network/host
  monitoring system which discovered problems within our network and
  escalated them appropriately.  I was one of the main contacts for the
  system engineering group within the company and for customers.  I worked full
  time during the summer, and part time (20 hours/week) during the school year.

* Databits Network Services, Inc., Collegeville, PA, US
  President - 5/1998 - 9/2001

  Databits Network Services was a small web hosting company I ran on a
  few co-located FreeBSD machines.  Everything was done in-house (nothing
  outsourced).  I gained experience in implementing services from scratch,
  general business skills, and professional communication.  Databits became
  a fully incorporated entity, and is currently at rest because I do not
  have the time to devote to running a company.

* Feith Systems and Software, Fort Washington, PA, US
  Programmer - 6/1997 - 9/1997

  I worked as a programmer creating a test suite for their image database
  software in Perl and some Visual Basic.  The tests involved their Windows
  end-user product and the Unix-backed server end.


Skills

* Programming/scripting languages: Python, Ruby, shell, Go

* Monitoring geek: logstash co-author, using and authoring tools
such as graphite, pencil (graphite dashboard), statsd, prometheus,
grafana, etc.

* OS/Platform experience with administration: Linux (x86), FreeBSD (x86),
Solaris 6-10 (x86, x86-64, Sparc), NetBSD (x86, Sparc), OpenBSD (x86),
SunOS 4.1.4 (Sparc)

* Linux specifics: CentOS, RHEL, Ubuntu, Debian. Also experience in building
and maintaining an embedded distro (using ptxdist and buildroot).

* Networking experience: basic routing, VLANs, familiar with IOS,
debugging.

* Experience with wide-scale implementation of the above services
(horizontal scaling and redundancy across multiple machines & datacenters,
central administration scripts & monitoring, config management, etc).

* Very solid knowledge of the Unix environment and development environment
(including compilers, makefiles, linkers, shells, vi, etc).

* MySQL and NoSQL (specifically MongoDB) administration experience.

* Excellent debugging skills.


Education

* Rochester Institute of Technology, Rochester, NY, US
  Computer Science, one year completed (3.0 GPA)

  Completed my freshman year working towards a computer science degree.
  Took mostly programming and introductory liberal arts courses, and
  finished up to Calculus 3.  Left after one year to pursue career
  interests.

* Germantown Academy, Fort Washington, PA, US
  High School, June 2001

$Id: resume.txt,v 1.30 2020/08/18 20:25:49 petef Exp $