Building A Data Platform From Scratch At Collectors: Part 3 of 3

This blog post is part 3 of a 3-part series about how we built a data platform from scratch at Collectors. It is part memoir, part instructional manual for data teams embarking on a “build a data platform” journey. Most of it is based on facts. All of it is based on my own personal experience and opinions. In case you missed it, check out part 1 and part 2.

Data for the people

Well, here we are: we have data flowing through daily updated pipelines and a few “prototype” dashboards, business stakeholders across the organization are starting to ask about this “data platform” you’re building, and you might even get the odd request to “pull some numbers”. So far, I’ve mainly focused on the technical side of what I’ve been doing and how I got to working data infrastructure. In this part of the post, I’m going to talk a little bit about the other side of a data platform: people. In other words, how do data users interact with the platform?

Remember how I talked to “an enormous list of stakeholders” early on when I first started the data platform adventure? As someone who’s mostly worked for smaller startups in the past, the complexity of an established organization like Collectors was certainly overwhelming for me at first. Trying to understand the different business units which all had their own processes, data workflows, acronyms, and stakeholders seemed like a barely manageable task – where do I even start? This was my first time dealing with such a vast task, so I approached it using a loosely structured “expand – contract” method. At first, I talked to everyone and took unstructured notes on pretty much everything. Aside from learning about the data questions at hand, this also helped me build relationships with folks across my new company, put faces to names, and communicate the intent of what I was building.

Next, the DBA team, which owns the existing data reporting infrastructure at Collectors, and I went through a “user stories” exercise. If you’ve ever worked with a product manager, you may be familiar with “As a <Persona A>, I want to do <Action X>” type user stories. We wrote out a few dozen of these stories related to data use cases and grouped them to help us identify key “themes” which roughly aligned with business functions, such as the product owners, the grading operations team, the finance team, HR and recruiting, and others. Finally, I identified the key stakeholders for each function and, with guidance from my colleague Colin, created a lightweight interview script which I used to interview these stakeholders in a more structured way. The interviews mostly focused on key business goals and the metrics used to measure success for each goal, which gave me a clear understanding of what kinds of data and metrics could really help us “move the needle”.

In addition to talking to people directly, I also made sure to have a clear way of communicating the intent and status of the data platform work to relevant stakeholders. I put together a “data platform vision doc” which outlined the key goals of the project, and started a “data platform newsletter”. The newsletter initially went out weekly, and I still send one out whenever we reach a major milestone. It helped get everyone on the same page about what was going on in data platform land and gave me a great reason to reflect on accomplishments every week, as well as give shout-outs to everyone who had put effort into supporting the work. Now, how do we actually decide what kind of data sets and analyses to work on?

So many data questions, so little time

Another aspect of being a “data platform team” that was important to me was the operating model of the team. Previously, any stakeholder at the organization who needed some data would file a ticket with the IT team, which would then route the ticket to the DBA team that owned the existing reporting platform. The DBA team would work through tickets on a first-in-first-out basis and either create a report using stored procedures and SSRS, or send the numbers to the requester via email or in a spreadsheet. This kind of service model for a data team often leads to long turnaround times unless something is flagged as urgent, many ad-hoc requests, and little feedback for the data producers as to how their work impacted business decisions. I believe that having business context and collaborating closely with data users empowers data producers to prioritize and scope data needs better, ask the right questions to reduce back-and-forth, and be proactive about shaping how data is being used for decision making.

This is why I implemented an “embedded consultant” or “hub and spokes” type model for a centralized data team (which most closely resembles Pardis Noorzad’s product data science model): The centralized team makes it possible to streamline processes and tooling across data producers, while the majority of analysts and analytics engineers on the team are staffed directly to one (or more) teams or business units, where they participate in key meetings and gain a deep understanding of the business context. Aside from leading to better and more tailored data output, I believe that this type of model also contributes to employee happiness: Being part of the team that uses your data and directly seeing the impact your work has on business decisions is often significantly more satisfying than running some queries, throwing the results over the fence, and not really knowing how your work “moved the needle” towards business goals.

All the conversations and interviews I had at the beginning of the project led me to a list of key metrics across the organization and an idea of the different functions we needed to support, but it wasn’t entirely clear to me how to prioritize these. As a small team (myself and a new analytics engineer), there was only a limited amount of work we could do – and remember, the entire end-to-end process would require us to perform exploratory data analysis, reverse-engineer any existing reporting logic, add integrations to Stitch to replicate the base tables, write models in dbt, and build out dashboards in Tableau. After some debate, we decided to prioritize supporting the PSA Product and Marketing teams for two reasons. First, PSA is our most active and most visible brand at Collectors, and it is currently undergoing a lot of exciting updates and innovations. Second, unlike the internal Operations team, the Product and Marketing teams barely had any content in the existing reporting infrastructure, and did not have any SQL-savvy users who knew their way around our data. This meant that any data work we could do to give these teams insights into key metrics would close a significant gap and, well, move the needle.

And this is where we are at today!

After five months of conversations, documents, meetings, code hacking, and dashboard building, we now have one member of the data platform team staffed to the Product and Marketing teams to support them with data questions such as “Which was the most popular graded card in the past month?”, “How many people created new sets in the PSA Set Registry?”, and “What’s our customer lifetime value?”. We built a solid and scalable foundation of infrastructure and workflows that allows us to quickly explore and integrate new data sets and deliver value to our data users. While there is still some infrastructure work outstanding, such as more comprehensive testing, incremental dbt models to improve run performance, and workflow orchestration, we’re in a great place to focus on creating data insights for our stakeholders and moving a little closer to the wonderful world of data-driven decision making.
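To make the first of those questions a bit more concrete, here is a minimal sketch of how “Which was the most popular graded card in the past month?” might translate into a query. Everything in it is an assumption made for illustration: the `graded_cards` table, its `card_name` and `graded_on` columns, the 30-day window, and the sample rows are all hypothetical and are not our actual schema or data. It uses an in-memory SQLite database only so that the example runs on its own; in practice this kind of logic would live in a dbt model on top of the warehouse.

```python
import sqlite3
from datetime import date, timedelta


def most_popular_card(conn, as_of, days=30):
    """Return (card_name, graded_count) for the card graded most often
    in the `days` days up to and including `as_of`, or None if no rows match."""
    # ISO date strings sort chronologically, so plain string comparison works.
    start = (as_of - timedelta(days=days)).isoformat()
    return conn.execute(
        """
        SELECT card_name, COUNT(*) AS graded_count
        FROM graded_cards              -- hypothetical table name
        WHERE graded_on > ? AND graded_on <= ?
        GROUP BY card_name
        ORDER BY graded_count DESC, card_name ASC  -- name breaks ties
        LIMIT 1
        """,
        (start, as_of.isoformat()),
    ).fetchone()


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE graded_cards (card_name TEXT, graded_on TEXT)")
    conn.executemany(
        "INSERT INTO graded_cards VALUES (?, ?)",
        [
            ("Card A", "2021-06-02"),
            ("Card A", "2021-06-10"),
            ("Card B", "2021-06-11"),
            ("Card B", "2021-04-01"),  # outside the 30-day window
            ("Card B", "2021-04-02"),  # outside the 30-day window
        ],
    )
    return conn


if __name__ == "__main__":
    print(most_popular_card(build_demo_db(), date(2021, 6, 30)))
```

Note that even this toy version forces a few decisions a stakeholder conversation would need to settle, such as what “past month” means (a rolling 30 days here) and how ties are broken.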

We’re hiring at Collectors! We’ve got open roles on the Data Platform team and many other teams: careers.collectors.com

Sam Bail

Sam is a Principal Data Engineer at Collectors Universe. She is a "data person" with extensive experience in healthcare data analytics, building data pipelines, running engineering teams, developer relations, and strategic partnerships. Her toolkit includes a broad data engineering and analytics stack, including SQL, Python, Pandas, Airflow and other workflow orchestration tools, data warehouses and ETL tools, cloud providers and dev ops, you name it - whatever gets the job done.
