
Audit Sampling: It’s a Numbers Game

by Melissa Rach on November 10th, 2011

A few months ago, a client called us about a content audit. For a site with hundreds of millions of pages. That’s 100,000,000+ pages. Yep, those zeros are correct.

Now, old-timey (2007) content strategy logic says you need to audit all of your content. You need to see it all with your own human eyes. And if you have a small site (less than 5,000 pages or pieces of content), you probably still should review every piece.

But manual audits get more and more unrealistic as sites get bigger. Who has the time to review 25,000, 100,000, or 1,000,000 pieces of content? For our client with 100,000,000 pages, it would take 20 people working full-time for 270 years to manually view all of those pages. Obviously, that’s not an option. Unless you could hire an army of robots. Speaking of …

Can’t robots just do it?

It depends on what kind of audit you’re performing. If you’re doing a quantitative audit (simply finding out how much content you have, where it lives, and which keywords are associated with it), then yes, there are technical tools that can help.

But if you’re doing a qualitative audit, where you’re trying to learn about the substance, accuracy, and quality of your content, robots can’t help you out. Well, maybe if you had a robot like C-3PO (fluent in six million forms of communication), or this guy:

[Image: Content strategy robot. Image by Sean Tubridy. All rights reserved.]

But you don’t. So, what else can you do?

Sample-size it!

You can pick a sample—a subset of your content—to review. Although a sample doesn’t replace a total site audit, it does help you reduce uncertainty about your content. Scientific and marketing researchers have been doing sampling for years, and when done correctly, sampling can give you a fairly accurate indication of your overall content situation.

You can choose your sample randomly or base it on various factors, such as user segments, product categories, content purpose, location on the site, etc. It all depends on why you’re doing the audit and what you want to learn.
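As a minimal sketch of both approaches, assuming you already have a full inventory of your content, here is how a random sample and a category-based sample might be drawn in Python. The `inventory` list, its URLs, and the `category` field are hypothetical placeholders, not a real site.

```python
import random
from collections import defaultdict

def simple_random_sample(items, n, seed=None):
    """Pick n items uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(items, min(n, len(items)))

def stratified_sample(items, n, key, seed=None):
    """Pick roughly n items, split across groups in proportion to group size.

    Every non-empty group gets at least one item, so the total can
    drift slightly from n.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    sample = []
    for members in groups.values():
        quota = max(1, round(n * len(members) / len(items)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical inventory: 10,000 pages in three made-up categories.
inventory = (
    [{"url": f"/products/{i}", "category": "products"} for i in range(7000)]
    + [{"url": f"/support/{i}", "category": "support"} for i in range(2500)]
    + [{"url": f"/news/{i}", "category": "news"} for i in range(500)]
)

random_pick = simple_random_sample(inventory, 1000, seed=1)
stratified_pick = stratified_sample(
    inventory, 1000, key=lambda page: page["category"], seed=1
)
```

The stratified version guarantees that small categories (like the 500 news pages here) show up in the sample in proportion to their size, which a purely random draw only does on average. You could swap the `key` function for user segment, content purpose, or site section, depending on what you want to learn.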

How much is enough?

There’s no rule or benchmark to use. It would seem like the more content you could review, the better off you’d be. That is somewhat true, but mostly you just have to look at enough content to see patterns emerge. On a relatively small site (e.g., 10,000 pieces of content), you might need to look at half of the content before the patterns become obvious.

On a million-page site, you might look at only 1% of the content. That’s still 10,000 pages … so you’re not exactly off the hook. But hopefully you’d recognize some valuable patterns by then. You might not have the same level of certainty as you would with the smaller site audit, but you’ll have some ideas. And you probably aren’t going to learn much more by auditing another 1,000 or 10,000 items: the percentage of items reviewed is still so low that the change in the margin of error is microscopic.
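To put rough numbers on that margin of error, here is a small sketch using the standard formula for a proportion with a finite population correction. It assumes a simple random sample, a yes/no audit question (say, "is this page outdated?"), a worst-case proportion of 0.5, and 95% confidence; none of those assumptions will hold exactly for a real audit.

```python
import math

def margin_of_error(sample_size, population, p=0.5, z=1.96):
    """Margin of error for a proportion from a simple random sample.

    Uses the normal approximation with a finite population correction.
    p=0.5 is the worst case; z=1.96 corresponds to roughly 95% confidence.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    fpc = math.sqrt((population - sample_size) / (population - 1))
    return z * standard_error * fpc

population = 1_000_000
for n in (10_000, 11_000, 20_000):
    print(f"{n:>6,} pages reviewed: +/- {margin_of_error(n, population):.2%}")
```

Under these assumptions, 10,000 pages already gets you to about one percentage point either way, and reviewing another 1,000 pages moves the margin by well under a tenth of a point.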

So where do I begin?

Your sample depends on the size of your site. Here’s a rough table of suggested sample sizes (adapted from market research sampling guidelines):

Total number of pages/pieces    Sample size
<5,000                          Review all
10,000                          5,000
25,000                          7,000
50,000                          8,000
100,000                         9,000
>1,000,000                      10,000–16,000
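One way to read the table is as a shrinking share of the site. The short sketch below works out that share for the rows that give a single number. The `suggested` list simply restates the table, and the helper name is made up for illustration.

```python
# (total pieces of content, suggested sample size) from the table above.
# The "<5,000: review all" and ">1,000,000" rows are left out because
# they are not single numbers.
suggested = [
    (10_000, 5_000),
    (25_000, 7_000),
    (50_000, 8_000),
    (100_000, 9_000),
]

def sampled_share(total, sample):
    """Fraction of the site that the sample covers."""
    return sample / total

for total, sample in suggested:
    print(f"{total:>7,} pieces: review {sample:,} ({sampled_share(total, sample):.0%})")
```

The share drops from half of the site at 10,000 pieces to under a tenth at 100,000. For sites over a million pages, 10,000 to 16,000 items works out to 1.6% of the content at most.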


Sampling doesn’t lead to a perfect picture of your content. But a sample audit can provide useful information to support arguments for funding, make the case for content work, or demonstrate progress.

  • http://hobbsontech.com David Hobbs

    Thanks for the useful post.  As a general rule, I think sampling is a great principle for getting a handle on larger sites.  And I’m a fan for random samples, which implies you have enumerated all the content somewhere (otherwise it isn’t really random).  That said, and as you mention above, you probably don’t want a pure random sample either.  I would argue that you want to make sure to sample the different layers of content that have accumulated over time, ensuring you cover different “reigns” of management, etc.  In addition, you can do an initial coarse sample, which then informs further exploration in areas that require more analysis.  Anyway, thanks for the post.

  • http://twitter.com/GabbyAtConfab Gabby


    Great article! I’ve always been disappointed that many content strategy books
    only talk about strategies and tactics that are applicable to small or medium
    sized sites. Many of my CS peeps work on large retail or document repository
    sites where it’s simply not practical to do content audits and similar
    activities for every page.


    You’re also right about not simply doing random sampling
    of pages. In fact, adding sophistication around how that is done would be
    beneficial. I’m thinking of more or less representative sampling techniques
    applied to your content. Perhaps weighting by most viewed pages, newest/oldest,
    business prioritized areas, and the like.


    Keep up the good work!


  • http://twitter.com/johntmohr John Mohr

    Excellent article. I like the idea of combining random and selected sampling to ensure you get some coverage of what I imagine would be many content types.

    I am curious, what is an estimate on the hours to review 16,000 pages? 

  • Melissa Rach

    Glad you enjoyed the article. It’s hard to do an exact estimate without knowing what kind of content it is (it takes longer to review scientific, complicated stuff than entertainment content, for example).

    But off the top of my head I’d say somewhere around 1100 hours (4 people for 6 weeks) to do the audit, analysis, and report. Which would not be cheap–but for organizations that have millions of pages, it can be money well spent.

  • Melissa Rach

    Gabby! Have you been found? (I should pay more attention to Twitter.)

    I like the sound of your representative techniques–those are just the kind of ideas that work well.  Thanks!

  • Melissa Rach

    Glad you liked the post. You mention some great ideas for choosing a sample. 

  • http://twitter.com/siteupdate4you Site Update Service

    I am all for sampling pages when a site gets over 5k pages. My only worry with sampling is that it may not always find the pages that need updating. Because the new guidelines for SERPs and PageRank involve quality content, we keep track of our post-Panda content, topic, title, keywords, etc. This is all because SEOs suggest that updates to past articles help the content quality “grade”. Thoughts on a better system for searching a site for articles that need updating?

  • Melissa Rach

    Unfortunately, that’s a pretty difficult question to answer without knowing more about the specifics of your content, your CMS, and your specific SEO goals (and the business goals behind them). I suspect you’re already doing the obvious things like looking at your CMS to see what date things were published, assigning review-by dates, etc.

    But, as for auditing, since you have your keywords recorded so nicely (well done!), I would choose my samples based on the keywords that your audiences (and business leaders) are most interested in.

    Sorry, I know this isn’t particularly helpful–it’s hard to give a real answer without more context.

  • http://twitter.com/Mengkai Gareth Morgan

    “If you are doing a quantitative audit—simply finding out how much content you have, where it lives, and associated keywords; yes, there are technical tools that can help.”
    …are you able to elaborate on some of those technical tools? Names, pros/cons, etc? 
