It would seem that I have far too much time on my hands. After the post about a Star Trek “test”, I started wondering if there could be any data to back it up and… well here we go:
Those Old Scientists
Name | Total Lines | Percentage of Lines |
---|---|---|
KIRK | 8257 | 32.89 |
SPOCK | 3985 | 15.87 |
MCCOY | 2334 | 9.3 |
SCOTT | 912 | 3.63 |
SULU | 634 | 2.53 |
UHURA | 575 | 2.29 |
CHEKOV | 417 | 1.66 |
The Next Generation
Name | Total Lines | Percentage of Lines |
---|---|---|
PICARD | 11175 | 20.16 |
RIKER | 6453 | 11.64 |
DATA | 5599 | 10.1 |
LAFORGE | 3843 | 6.93 |
WORF | 3402 | 6.14 |
TROI | 2992 | 5.4 |
CRUSHER | 2833 | 5.11 |
WESLEY | 1285 | 2.32 |
Deep Space Nine
Name | Total Lines | Percentage of Lines |
---|---|---|
SISKO | 8073 | 13.0 |
KIRA | 5112 | 8.23 |
BASHIR | 4836 | 7.79 |
O’BRIEN | 4540 | 7.31 |
ODO | 4509 | 7.26 |
QUARK | 4331 | 6.98 |
DAX | 3559 | 5.73 |
WORF | 1976 | 3.18 |
JAKE | 1434 | 2.31 |
GARAK | 1420 | 2.29 |
NOG | 1247 | 2.01 |
ROM | 1172 | 1.89 |
DUKAT | 1091 | 1.76 |
EZRI | 953 | 1.53 |
Voyager
Name | Total Lines | Percentage of Lines |
---|---|---|
JANEWAY | 10238 | 17.7 |
CHAKOTAY | 5066 | 8.76 |
EMH | 4823 | 8.34 |
PARIS | 4416 | 7.63 |
TUVOK | 3993 | 6.9 |
KIM | 3801 | 6.57 |
TORRES | 3733 | 6.45 |
SEVEN | 3527 | 6.1 |
NEELIX | 2887 | 4.99 |
KES | 1189 | 2.06 |
Enterprise
Name | Total Lines | Percentage of Lines |
---|---|---|
ARCHER | 6959 | 24.52 |
T’POL | 3715 | 13.09 |
TUCKER | 3610 | 12.72 |
REED | 2083 | 7.34 |
PHLOX | 1621 | 5.71 |
HOSHI | 1313 | 4.63 |
TRAVIS | 1087 | 3.83 |
SHRAN | 358 | 1.26 |
Discovery
Important Note: As the source material is incomplete for Discovery, the following table only includes line counts from seasons 1 and 4 along with a single episode of season 2.
Name | Total Lines | Percentage of Lines |
---|---|---|
BURNHAM | 2162 | 22.92 |
SARU | 773 | 8.2 |
BOOK | 586 | 6.21 |
STAMETS | 513 | 5.44 |
TILLY | 488 | 5.17 |
LORCA | 471 | 4.99 |
TARKA | 313 | 3.32 |
TYLER | 300 | 3.18 |
GEORGIOU | 279 | 2.96 |
CULBER | 267 | 2.83 |
RILLAK | 205 | 2.17 |
DETMER | 186 | 1.97 |
OWOSEKUN | 169 | 1.79 |
ADIRA | 154 | 1.63 |
COMPUTER | 152 | 1.61 |
ZORA | 151 | 1.6 |
VANCE | 101 | 1.07 |
CORNWELL | 101 | 1.07 |
SAREK | 100 | 1.06 |
T’RINA | 96 | 1.02 |
If anyone is interested, here’s the (rather hurried, don’t judge me) Python used:
#!/usr/bin/env python
#
# This script assumes that you've already downloaded all the episode lines from
# the fantastic chakoteya.net:
#
# wget --accept=html,htm --relative --wait=2 --include-directories=/STDisco17/ http://www.chakoteya.net/STDisco17/episodes.html -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Enterprise/ http://www.chakoteya.net/Enterprise/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Voyager/ http://www.chakoteya.net/Voyager/episode_listing.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/DS9/ http://www.chakoteya.net/DS9/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/NextGen/ http://www.chakoteya.net/NextGen/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/StarTrek/ http://www.chakoteya.net/StarTrek/episodes.htm -m
#
# Then you'll probably have to convert the following files to UTF-8 as they
# differ from the rest:
#
# * Voyager/709.htm
# * Voyager/515.htm
# * Voyager/416.htm
# * Enterprise/41.htm
#
import re
from collections import defaultdict
from pathlib import Path
EPISODE_REGEX = re.compile(r"^\d+\.html?$")
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")
EPISODES = Path("www.chakoteya.net")
DISCO = EPISODES / "STDisco17"
ENT = EPISODES / "Enterprise"
TNG = EPISODES / "NextGen"
TOS = EPISODES / "StarTrek"
DS9 = EPISODES / "DS9"
VOY = EPISODES / "Voyager"
NAMES = {
TOS.name: "Those Old Scientists",
TNG.name: "The Next Generation",
DS9.name: "Deep Space Nine",
VOY.name: "Voyager",
ENT.name: "Enterprise",
DISCO.name: "Discovery",
}
class CharacterLines:
def __init__(self, path: Path) -> None:
self.path = path
self.line_count = defaultdict(int)
def collect(self) -> None:
for episode in self.path.glob("*.htm*"):
if EPISODE_REGEX.match(episode.name):
for line in episode.read_text().split("\n"):
if m := LINE_REGEX.match(line):
self.line_count[m.group("name")] += 1
@property
def as_tablular_data(self) -> tuple[tuple[str, int, float], ...]:
total = sum(self.line_count.values())
r = []
for k, v in self.line_count.items():
percentage = round(v * 100 / total, 2)
if percentage > 1:
r.append((str(k), v, percentage))
return tuple(reversed(sorted(r, key=lambda _: _[2])))
def render(self) -> None:
print(f"\n\n# {NAMES[self.path.name]}\n")
print("| Name | Total Lines | Percentage of Lines |")
print("| ---------------- | :---------: | ------------------: |")
for character, total, pct in self.as_tablular_data:
print(f"| {character:16} | {total:11} | {pct:19} |")
if __name__ == "__main__":
for series in (TOS, TNG, DS9, VOY, ENT, DISCO):
counter = CharacterLines(series)
counter.collect()
counter.render()
Maybe the two Dax hosts on DS9 should be combined, as they didn’t overlap.
Fascinating! It would be illuminating to see this broken up by season as well. Seven of Nine’s relatively low ratio, for instance, can definitely be attributed to her late arrival to the series. In the latter seasons, I suspect her percentage could be rivalling Janeway’s.
Conversely, it’s impressive Lorca ranks as highly as he does, given he was gone by the end of Disco season one. But since he was simultaneously captain and antagonist while he was around, I guess it isn’t that surprising.
This is really cool stuff! Thanks for posting the code!
This definitely goes to show why people felt Discovery was the Micheal Burnham show. Not that she had an unusual number of lines but that no one else spoke even half as much as her, with all of the other percentages of lines broken up by more characters than the other series.
Also does GEORGIOU count for both prime and mirror versions of the character?
This is beautiful! I love data and I’m delighted you were inspired by my post to gather the data.
Thank you for doing this!
Thanks for sharing. I notice chakoteya.net has TOS scripts. Is there any reason they weren’t included in the analysis?
Honestly, it’s 'cause I forgot to include it! I’ll see if I can add it tonight. Check back in 24hrs :-)