Swear Words

My wife and I have been loving the Dropout.tv show Game Changer, but we've found it a bit difficult to share episodes with my in-laws. Browsing the Dropout subreddit, I found other users with the same issue.

The solution: a site that lists the number of swear words per episode.

Data

The first step: get all the video URLs. The sitemap.xml provides a great overview of all the episodes, and with a regex I pulled out all the URLs that contain "game changer".
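A minimal sketch of that filtering step might look like the following. The function name, the sample sitemap, and the `game-changer` slug are assumptions made for illustration; the real sitemap and URL pattern may differ.

```javascript
// Hypothetical helper: pull every <loc> URL out of a sitemap and keep
// only those that mention the given slug (case-insensitive).
function extractEpisodeUrls(sitemapXml, slug) {
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const urls = [];
  for (const match of sitemapXml.matchAll(locPattern)) {
    if (match[1].toLowerCase().includes(slug)) {
      urls.push(match[1]);
    }
  }
  return urls;
}

// Made-up sample data; real sitemap entries may look different.
const sampleSitemap = `
<urlset>
  <url><loc>https://example.com/videos/game-changer-episode-one</loc></url>
  <url><loc>https://example.com/videos/some-other-show</loc></url>
  <url><loc>https://example.com/videos/Game-Changer-episode-two</loc></url>
</urlset>`;

console.log(extractEpisodeUrls(sampleSitemap, "game-changer"));
```

With the sample data above, this keeps the two Game Changer entries and drops the other show.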

With URLs in hand, we now need the transcript of each video. When subtitles are enabled, the browser receives a single file with the full transcript of the video.

It seems like the easiest way to get the transcripts is through the subtitles, but this requires a logged-in session and the player sits in an iframe, so we're probably going to have to use a browser-based scraping tool.

Violentmonkey is a Chrome extension for running user scripts on websites, and someone had already made a script that interacts with the video iframe to enable the subtitles. This is perfect, but we're not going to use Violentmonkey. We're going to use Puppeteer, a JavaScript tool that uses Chromium to run a browser session.

Next step: write a script that will

Log into dropout.tv
Go to a video, and enable subtitles
Collect some metadata from the page (episode, season...)
Wait for the transcript to be sent
Go to the next video, and loop until it's done
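The steps above can be sketched as a small driver loop. This is only an outline under assumptions: `page` stands in for a Puppeteer Page (only `page.goto` is called on it here), and the four functions on `steps` are hypothetical placeholders for the site-specific parts.

```javascript
// Sketch of the driver loop. `page` is expected to be a Puppeteer Page;
// `steps` holds hypothetical helpers for the site-specific parts.
async function scrapeAll(page, urls, steps) {
  const results = [];
  // Log in once; the same browser session is reused for every video.
  await steps.login(page);
  for (const url of urls) {
    // Open the video page and switch subtitles on.
    await page.goto(url);
    await steps.enableSubtitles(page);
    // Collect metadata such as episode and season.
    const meta = await steps.collectMeta(page);
    // Wait until the transcript file has been requested.
    const vtt = await steps.waitForTranscript(page);
    results.push({ url, vtt, meta });
  }
  return results;
}
```

Keeping the site-specific parts behind small functions makes it possible to try the loop against a stub page before pointing it at a real browser.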

I'll share the most interesting scripts:

Selecting the subtitles (Puppeteer runs this inside the browser page):

await page.evaluate(() => {
  const iframe = document.getElementById("watch-embed");
  // if we found the iframe
  if (iframe) {
    // add the api=1 to the src
    iframe.src += "&api=1";
    // create a new player
    const player = new VHX.Player("watch-embed");
    // when the video is loaded
    player.on("loadeddata", (event) => {
      // get the subtitles
      const languages = player.getSubtitles();
      // if there are subtitles
      if (languages.length > 0) {
        // set the first subtitle
        player.setSubtitle(languages[0].language);
      }
    });
  }
});

Waiting for the transcript to be loaded:

await new Promise((resolve) => {
  const callback = (response) => {
    if (response.url().includes(".vtt")) {
      transcripts.push({ url, vtt: response.url(), meta: content });
      page.off("response", callback);
      fs.writeFileSync(
        "transcripts.json",
        JSON.stringify(transcripts, null, 2)
      );
      resolve();
    }
  };
  page.on("response", callback);
});
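One thing to watch for: a promise like this never settles if a video has no subtitles, so the loop would hang. A possible variation, sketched here with a hypothetical `waitForTranscript` helper and an assumed 30 second default, rejects after a timeout instead. It relies only on `page.on` and `page.off`, which a Puppeteer Page provides.

```javascript
// Hypothetical helper: resolve with the URL of the first response that
// contains ".vtt", or reject once `timeoutMs` has passed without one.
function waitForTranscript(page, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    const onResponse = (response) => {
      if (response.url().includes(".vtt")) {
        // Found the transcript: cancel the timeout and stop listening.
        clearTimeout(timer);
        page.off("response", onResponse);
        resolve(response.url());
      }
    };
    const timer = setTimeout(() => {
      // Nothing arrived in time: stop listening and report the failure.
      page.off("response", onResponse);
      reject(new Error(`No .vtt response within ${timeoutMs} ms`));
    }, timeoutMs);
    page.on("response", onResponse);
  });
}
```

Puppeteer also ships `page.waitForResponse`, which takes a predicate and a `timeout` option and may be enough on its own.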

Parsing the data

Now that we have the data, we just go through each transcript with a list of cuss words and collect the totals for each.

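const fs = require("fs");

// Count how often each swear word appears in one downloaded transcript.
const getSwearWords = (filename) => {
  const swearWords = [...cuss words...];
  // Lowercase the transcript so matching is case-insensitive
  const text = fs
    .readFileSync(`./downloads/${filename}.vtt`, "utf-8")
    .toLowerCase();
  const out = {};
  swearWords.forEach((word) => {
    // Splitting on the word yields one more piece than there are matches
    out[word.trim()] = text.split(word).length - 1;
  });
  return out;
};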
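Counting with `split` matches substrings, so a word that happens to sit inside a longer, harmless word is counted too. If that matters, one alternative (not what the code above does) is to count whole-word matches with a regular expression. The helper names and the mild placeholder words below are made up for illustration.

```javascript
// Escape characters that have a special meaning inside a regular expression.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Hypothetical alternative to split(): count whole-word matches only.
const countWholeWords = (text, words) => {
  const lowered = text.toLowerCase();
  const out = {};
  for (const raw of words) {
    const word = raw.trim().toLowerCase();
    // \b marks a word boundary, so "darn" does not match inside "darnell".
    const pattern = new RegExp(`\\b${escapeRegExp(word)}\\b`, "g");
    const matches = lowered.match(pattern);
    out[word] = matches ? matches.length : 0;
  }
  return out;
};

// Placeholder words keep the example family friendly.
console.log(countWholeWords("Darn it, Darnell! Heck, darn.", ["darn", "heck"]));
// → { darn: 2, heck: 1 }
```

On the same sentence, the `split` approach would report 3 for "darn", because it also counts the one hiding in "Darnell".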

We then package all of this up into a tidy JSON file.
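The exact shape of that file isn't shown here, so the following is only a guess at what the packaging step could look like: one entry per episode, carrying its metadata, its per-word counts, and a total. The `buildSummary` name, the field names, and the sample data are all hypothetical.

```javascript
// Hypothetical shape: one entry per episode with per-word counts and a total.
const buildSummary = (episodes) =>
  episodes.map(({ meta, counts }) => ({
    ...meta,
    counts,
    // Add up the counts for every word in this episode.
    total: Object.values(counts).reduce((sum, n) => sum + n, 0),
  }));

// Made-up sample data with placeholder words.
const summary = buildSummary([
  { meta: { season: 1, episode: 1 }, counts: { darn: 2, heck: 1 } },
  { meta: { season: 1, episode: 2 }, counts: { darn: 0, heck: 0 } },
]);

console.log(JSON.stringify(summary, null, 2));
```

Precomputing the total per episode keeps the front end simple, since it only has to read and display the file.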

As always, I threw together a simple Vue app with Vite.

Check it out here