You can access a wealth of marketing-related data — from web analytics and customer journey behavior to competitor analysis and product usage.

However, if the data isn’t clean, you can’t truly tap into its value. Or worse, you could steer your marketing in the wrong direction and see diminishing returns.

James Hunt, principal consultant at Vivanti, says data cleaning and modeling are essential to extract value and gain knowledge and wisdom from the information. In his presentation at the Marketing Analytics & Data Science Conference, he details why it’s necessary, the basics of data cleaning, and the role of governance and observability.

What is data modeling?

Data models turn data into something useful, and you need to understand data modeling so you can understand the best cleaning options. James explains that data modeling involves three parts — additive, context, and domain.

Additive means you let the machines figure out how to standardize the data. You don’t manually “fix” the data, such as lowercasing the sporadic all-cap names on a spreadsheet. That would actually be data destruction because, as James says, “As humans, we’re really bad at doing the same thing twice.”
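One way to read the additive idea is that code derives a standardized value and stores it next to the raw one, so nothing is overwritten by hand. The following is a minimal sketch under that assumption; the `name` and `name_clean` field names and the `add_standardized_name` helper are hypothetical and do not come from James's presentation.

```python
def add_standardized_name(rows):
    """Return new rows with a derived 'name_clean' field; the raw 'name' is kept."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are never modified
        # Collapse runs of whitespace, then apply title case.
        new_row["name_clean"] = " ".join(row["name"].split()).title()
        cleaned.append(new_row)
    return cleaned


raw_rows = [{"name": "JANE DOE"}, {"name": "john  smith"}]
result = add_standardized_name(raw_rows)
```

Because the same function runs on every row, the standardization is applied the same way each time, and the original values remain available if the rule later turns out to be wrong.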

Context organizes the data to tell a story. You don’t add new information; you connect the data you already have. For example, the context of a sales transaction could include the marketing emails the buyer saw, the social media content the buyer engaged with, and the other products they viewed.

Domain is the set of all possible data values for a given element. It can be qualitative and quantitative. James points to these five common domain types:

  • Identity — a unique value that distinctly and discretely pinpoints somebody, such as an email address, Social Security number, or customer ID

  • Nominative — a supplemental identity not strong enough to stand on its own, such as a person’s full name or a product name

  • Categorical — a grouping across arbitrary boundaries, such as customer type or industry; often used for cohort subdivision

  • Monetary — the currency which can be compared, ordered, aggregated, and disaggregated, such as order total or unit price

  • Temporal — a point or span of dates and times, such as sign-up date, last purchase date, or loyalty period

With this foundational understanding of modeling, you’re ready to learn about cleaning the data.

What types of data cleaning exist?

James details the three types of data cleaning — mechanical, explicit mappings, and patterns and rules:

With mechanical cleaning, the data is cleaned up without changing the meaning of the information, such as normalizing the case for names and removing unnecessary spaces. “These are all things that I can do all by myself as a data engineer that nobody gets mad (about),” James says. “Nobody says, ‘Well, you took the spaces out of their first name, so it’s a different person.’”
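As a rough illustration of mechanical cleaning, the sketch below removes unnecessary spaces and normalizes case. The helper names are hypothetical, and lowercasing email addresses is an assumption that suits most marketing datasets rather than a rule from the presentation.

```python
def clean_whitespace(value: str) -> str:
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return " ".join(value.split())


def clean_email(value: str) -> str:
    """Remove stray whitespace and lowercase the address (an assumption)."""
    return clean_whitespace(value).lower()


name = clean_whitespace("  Mary   Ann ")
email = clean_email(" Jane.Doe@Example.COM ")
```

Both helpers are idempotent: running them a second time on their own output changes nothing, which is what makes this kind of cleaning safe to automate.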

Explicit mapping uses an activity called “cardinality reduction” to decrease the number of unique values associated with an attribute. It simplifies the dataset by grouping values while retaining the relevant information. The resulting datasets are more manageable and can improve model performance.

For example, James says, perhaps a customer status field started with two values: active and inactive. Over time, the field expanded to include suspended, on-hold, and prospective options. An explicit mapping might fold the “suspended” customer status into the “active” value.
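The customer status example could be sketched as a lookup table. Only the “suspended” to “active” mapping comes from the example above; the targets chosen for “on-hold” and “prospective” are assumptions made for illustration, as is the `map_status` helper.

```python
# Explicit mapping: five raw statuses reduced to two.
STATUS_MAP = {
    "active": "active",
    "inactive": "inactive",
    "suspended": "active",      # from the example in the text
    "on-hold": "inactive",      # assumption for illustration
    "prospective": "inactive",  # assumption for illustration
}


def map_status(raw_status: str, mapping=STATUS_MAP) -> str:
    """Return the reduced status, or raise if no mapping has been decided."""
    key = raw_status.strip().lower()
    if key not in mapping:
        raise KeyError(f"No explicit mapping for status {raw_status!r}")
    return mapping[key]


reduced = map_status("Suspended")
```

Raising an error for an unmapped value, rather than guessing, keeps every mapping an explicit decision that someone has to make.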

Cleaning with patterns and rules identifies and corrects inconsistencies, inaccuracies, or errors in the data based on identifiable structures (i.e., patterns) and constraints (i.e., rules).

Standard patterns encompass data like email addresses, date strings, and phone numbers. Deviations from that structure indicate data that needs to be cleaned.
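A pattern check might look like the sketch below. The email expression is deliberately loose (something, an @ sign, something, a dot, something) and is an assumption for illustration, not a full address validator; the date check assumes dates are stored as year-month-day strings.

```python
import re
from datetime import datetime

# Loose structural check: text, "@", text, ".", text, with no spaces.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_email(value: str) -> bool:
    """True when the whole string matches the loose email structure."""
    return EMAIL_PATTERN.fullmatch(value) is not None


def is_iso_date(value: str) -> bool:
    """True when the string parses as a real year-month-day date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False
```

Values that fail either check are candidates for cleaning or for review, depending on how the team decides to handle them.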

Rules refer to logical conditions or constraints. For example, if a monetary amount recorded against an insurance policy exceeds the policy’s maximum value, the entry needs to be cleaned.
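The insurance rule can be expressed as a simple comparison. In this sketch the field names (`policy_id`, `amount`, `policy_max`) and the sample values are hypothetical, and `Decimal` is used so currency amounts compare exactly.

```python
from decimal import Decimal


def find_over_limit(entries):
    """Return the entries whose amount exceeds their policy maximum."""
    return [e for e in entries if e["amount"] > e["policy_max"]]


entries = [
    {"policy_id": "P-1", "amount": Decimal("2500.00"), "policy_max": Decimal("10000.00")},
    {"policy_id": "P-2", "amount": Decimal("12000.00"), "policy_max": Decimal("10000.00")},
]
flagged = find_over_limit(entries)
```

The function only flags the offending entries. What to do with them (correct, cap, or exclude) is a separate decision.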

James says you also can set rules and patterns to map the customer journey. Let’s say a brand doesn’t care how many times a person opens and clicks its email. Instead, it cares about identifying who is susceptible to purchasing from an email marketing campaign. It could set up rules to clean the data for that goal.

For example, all emails sent would be labeled “E,” all clicks would be labeled “C,” and an order would be recognized as “O.” Those rules collapse the data so it’s most helpful for the brand and its marketing goals.
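Here is a minimal sketch of that collapsing step. The event names (`email_sent`, `email_click`, `order`) are hypothetical, and two choices are assumptions rather than part of James's example: events without a code (such as opens) are dropped, and repeated consecutive codes are merged because the brand does not care how many times something happened.

```python
from itertools import groupby

# Hypothetical event names mapped to the one-letter codes from the text.
EVENT_CODES = {"email_sent": "E", "email_click": "C", "order": "O"}


def collapse_journey(events):
    """Turn an ordered list of events into a compact string such as 'ECO'."""
    codes = [EVENT_CODES[e] for e in events if e in EVENT_CODES]
    # groupby merges consecutive duplicates, so 'EECCO' becomes 'ECO'.
    return "".join(code for code, _ in groupby(codes))


journey = collapse_journey(
    ["email_sent", "email_open", "email_sent", "email_click", "email_click", "order"]
)
ordered_after_click = "CO" in journey
```

With journeys reduced to short strings, finding customers who clicked and then ordered becomes a substring check.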


What is governance’s role in data cleaning?

"Anytime you are cleaning data, you are making a decision. You are deciding what is relevant; you are deciding what is important. You're deciding what to keep and what to surface," James says.

You must document those data-cleaning decisions in an internal repository, such as a spreadsheet, or use a version control system like the open-source Git.

Each decision should answer these four questions:

  • What decision was made?

  • When was it made? This point-in-time reference helps with historical analysis.

  • Who made the decision?

  • Why was this decision made? It’s helpful to inform future actions. For example, if the decision was made because of a government update, reversing it probably isn’t possible. But, if the decision was made because the data team thought it was a better way to do it, reversing course may remain a viable option, James says.

Let’s go back to the example of collapsing the customer status fields so the “suspended” status was grouped into “active” customers. Here’s how that decision might be recorded:
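One possible form for such a record is sketched below. The four fields mirror the four questions; the date, owner, and reason are hypothetical placeholders rather than details from the presentation, and the `CleaningDecision` class is an assumption for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class CleaningDecision:
    what: str   # What decision was made?
    when: date  # When was it made?
    who: str    # Who made the decision?
    why: str    # Why was it made?


decision = CleaningDecision(
    what='Map customer status "suspended" to "active"',
    when=date(2024, 1, 15),                        # hypothetical date
    who="data-team@example.com",                   # hypothetical owner
    why="Hypothetical reason: team preference for simpler reporting",
)

record = asdict(decision)
record["when"] = decision.when.isoformat()  # dates are not JSON-serializable
line = json.dumps(record)
```

A line like this can be appended to a spreadsheet export or committed to a Git repository alongside the mapping it describes.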

  • Datasets where the integrity of data is not pristine or perfect

  • Datasets with a high number of unique values (i.e., for which cardinality reduction can help processing and analysis)

Where would you find that data? It could come from a multitude of sources, such as:

    • CRM platforms

    • Customer contact records

    • Customer questionnaires and feedback forms

    • Survey responses

    • Web analytics

    • Customer behaviors

    • Product or platform information

    • Competitor analyses

Start with the ones that would most benefit from one or more of the three types of data cleaning, proper governance, and observability. Then, you can decide whether to engage with data teams in your organization to assist.

Visit the Marketing Analytics & Data Science conference website to keep up with all things MADS.

Cover image by Joseph Kalinowski/Content Marketing Institute

About the Author

Dennis Shiao

Dennis is the founder of B2B marketing agency Attention Retention and organizer of the Bay Area Content Marketing Meetup. He curates a marketing-related email newsletter called Content Corner that comes out every other Friday. Feel free to reach out to Dennis on Twitter @dshiao.