The number of lines for each character by percentage of the series

posted 4 months ago

It would seem that I have far too much time on my hands. After the post about a Star Trek “test”, I started wondering if there could be any data to back it up and… well here we go:

Those Old Scientists

Name	Total Lines	Percentage of Lines
KIRK	8257	32.89
SPOCK	3985	15.87
MCCOY	2334	9.3
SCOTT	912	3.63
SULU	634	2.53
UHURA	575	2.29
CHEKOV	417	1.66

The Next Generation

Name	Total Lines	Percentage of Lines
PICARD	11175	20.16
RIKER	6453	11.64
DATA	5599	10.1
LAFORGE	3843	6.93
WORF	3402	6.14
TROI	2992	5.4
CRUSHER	2833	5.11
WESLEY	1285	2.32

Deep Space Nine

Name	Total Lines	Percentage of Lines
SISKO	8073	13.0
KIRA	5112	8.23
BASHIR	4836	7.79
O’BRIEN	4540	7.31
ODO	4509	7.26
QUARK	4331	6.98
DAX	3559	5.73
WORF	1976	3.18
JAKE	1434	2.31
GARAK	1420	2.29
NOG	1247	2.01
ROM	1172	1.89
DUKAT	1091	1.76
EZRI	953	1.53

Voyager

Name	Total Lines	Percentage of Lines
JANEWAY	10238	17.7
CHAKOTAY	5066	8.76
EMH	4823	8.34
PARIS	4416	7.63
TUVOK	3993	6.9
KIM	3801	6.57
TORRES	3733	6.45
SEVEN	3527	6.1
NEELIX	2887	4.99
KES	1189	2.06

Enterprise

Name	Total Lines	Percentage of Lines
ARCHER	6959	24.52
T’POL	3715	13.09
TUCKER	3610	12.72
REED	2083	7.34
PHLOX	1621	5.71
HOSHI	1313	4.63
TRAVIS	1087	3.83
SHRAN	358	1.26

Discovery

Important Note: As the source material is incomplete for Discovery, the following table only includes line counts from seasons 1 and 4 along with a single episode of season 2.

Name	Total Lines	Percentage of Lines
BURNHAM	2162	22.92
SARU	773	8.2
BOOK	586	6.21
STAMETS	513	5.44
TILLY	488	5.17
LORCA	471	4.99
TARKA	313	3.32
TYLER	300	3.18
GEORGIOU	279	2.96
CULBER	267	2.83
RILLAK	205	2.17
DETMER	186	1.97
OWOSEKUN	169	1.79
ADIRA	154	1.63
COMPUTER	152	1.61
ZORA	151	1.6
VANCE	101	1.07
CORNWELL	101	1.07
SAREK	100	1.06
T’RINA	96	1.02

If anyone is interested, here’s the (rather hurried, don’t judge me) Python used:

#!/usr/bin/env python

#
# This script assumes that you've already downloaded all the episode lines from
# the fantastic chakoteya.net:
#
# wget --accept=html,htm --relative --wait=2 --include-directories=/STDisco17/ http://www.chakoteya.net/STDisco17/episodes.html -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Enterprise/ http://www.chakoteya.net/Enterprise/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Voyager/ http://www.chakoteya.net/Voyager/episode_listing.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/DS9/ http://www.chakoteya.net/DS9/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/NextGen/ http://www.chakoteya.net/NextGen/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/StarTrek/ http://www.chakoteya.net/StarTrek/episodes.htm -m
#
# Then you'll probably have to convert the following files to UTF-8 as they
# differ from the rest:
#
# * Voyager/709.htm
# * Voyager/515.htm
# * Voyager/416.htm
# * Enterprise/41.htm
#

import re
from collections import defaultdict
from pathlib import Path

EPISODE_REGEX = re.compile(r"^\d+\.html?$")
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")

EPISODES = Path("www.chakoteya.net")
DISCO = EPISODES / "STDisco17"
ENT = EPISODES / "Enterprise"
TNG = EPISODES / "NextGen"
TOS = EPISODES / "StarTrek"
DS9 = EPISODES / "DS9"
VOY = EPISODES / "Voyager"

NAMES = {
    TOS.name: "Those Old Scientists",
    TNG.name: "The Next Generation",
    DS9.name: "Deep Space Nine",
    VOY.name: "Voyager",
    ENT.name: "Enterprise",
    DISCO.name: "Discovery",
}


class CharacterLines:
    def __init__(self, path: Path) -> None:
        self.path = path
        self.line_count = defaultdict(int)

    def collect(self) -> None:
        for episode in self.path.glob("*.htm*"):
            if EPISODE_REGEX.match(episode.name):
                for line in episode.read_text(encoding="utf-8").split("\n"):
                    if m := LINE_REGEX.match(line):
                        self.line_count[m.group("name")] += 1

    @property
    def as_tabular_data(self) -> tuple[tuple[str, int, float], ...]:
        # Percentages are relative to every matched line in the series,
        # including characters that end up filtered out below.
        total = sum(self.line_count.values())
        r = []
        for k, v in self.line_count.items():
            percentage = round(v * 100 / total, 2)
            # Only keep characters with more than 1% of the lines.
            if percentage > 1:
                r.append((str(k), v, percentage))
        return tuple(reversed(sorted(r, key=lambda row: row[2])))

    def render(self) -> None:
        print(f"\n\n# {NAMES[self.path.name]}\n")
        print("| Name             | Total Lines | Percentage of Lines |")
        print("| ---------------- | :---------: | ------------------: |")
        for character, total, pct in self.as_tabular_data:
            print(f"| {character:16} | {total:11} | {pct:19} |")


if __name__ == "__main__":
    for series in (TOS, TNG, DS9, VOY, ENT, DISCO):
        counter = CharacterLines(series)
        counter.collect()
        counter.render()
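
The manual UTF-8 conversion mentioned in the script's header comment could probably be avoided by falling back to a second encoding when decoding fails. A minimal sketch, assuming the four odd files are Windows-1252 (the post doesn't say what they are actually encoded in); `read_episode` is a hypothetical helper, not part of the script above:

```python
from pathlib import Path


def read_episode(path: Path, fallback: str = "cp1252") -> str:
    """Read an episode file as UTF-8, falling back to another encoding.

    The fallback encoding is an assumption: the post only says these
    files "differ from the rest", not what they are encoded in.
    """
    raw = path.read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Replace undecodable bytes rather than abort the whole run.
        return raw.decode(fallback, errors="replace")
```

If this were used, `collect()` would call `read_episode(episode)` in place of `episode.read_text()`.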


Clay_pidgin@sh.itjust.works

3 points

4 months ago

Maybe the two Dax hosts on DS9 should be combined, as they didn’t overlap.
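
One way to try this without touching the parser is to fold speaker labels together after counting. The sketch below is only an illustration: `ALIASES` and `merge_aliases` are hypothetical names, and the counts are copied from the Deep Space Nine table in the post.

```python
from collections import defaultdict

# Hypothetical alias table: speaker label -> label to count it under.
ALIASES = {"EZRI": "DAX"}


def merge_aliases(line_count, aliases=ALIASES):
    """Return a new dict with aliased speakers summed under one label."""
    merged = defaultdict(int)
    for name, count in line_count.items():
        merged[aliases.get(name, name)] += count
    return dict(merged)


# Counts taken from the Deep Space Nine table.
counts = {"DAX": 3559, "EZRI": 953, "ODO": 4509}
print(merge_aliases(counts))  # {'DAX': 4512, 'ODO': 4509}
```

With the two hosts combined, DAX would sit at 4512 lines, just ahead of ODO (4509) and just behind O’BRIEN (4540).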


usernamefactory@lemmy.ca

5 points

4 months ago

Fascinating! It would be illuminating to see this broken up by season as well. Seven of Nine’s relatively low ratio, for instance, can definitely be attributed to her late arrival to the series. In the later seasons, I suspect her percentage could be rivalling Janeway’s.

Conversely, it’s impressive Lorca ranks as highly as he does, given he was gone by the end of Disco season one. But since he was simultaneously captain and antagonist while he was around, I guess it isn’t that surprising.


milkisklim@lemm.ee

8 points

4 months ago

This is really cool stuff! Thanks for posting the code!

This definitely goes to show why people felt Discovery was the Michael Burnham show. Not that she had an unusual share of the lines, but that no one else spoke even half as much as she did, with the remaining lines spread across more characters than in the other series.

Also does GEORGIOU count for both prime and mirror versions of the character?
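
The "no one else spoke even half as much" point can be checked against the tables in the post by dividing the runner-up's line count by the lead's for each series. A small sketch, with the counts copied from those tables; `TOP_TWO` and `runner_up_ratio` are hypothetical names used only for this illustration:

```python
# (lead, runner-up) line counts copied from the tables in the post.
TOP_TWO = {
    "Those Old Scientists": (8257, 3985),  # KIRK, SPOCK
    "The Next Generation": (11175, 6453),  # PICARD, RIKER
    "Deep Space Nine": (8073, 5112),       # SISKO, KIRA
    "Voyager": (10238, 5066),              # JANEWAY, CHAKOTAY
    "Enterprise": (6959, 3715),            # ARCHER, T'POL
    "Discovery": (2162, 773),              # BURNHAM, SARU
}


def runner_up_ratio(lead: int, runner_up: int) -> float:
    """Runner-up's lines as a fraction of the lead's, to 2 decimals."""
    return round(runner_up / lead, 2)


for series, (lead, runner_up) in TOP_TWO.items():
    print(f"{series:20} {runner_up_ratio(lead, runner_up):.2f}")
```

On these numbers Discovery comes out lowest (about 0.36), though Those Old Scientists and Voyager also land just under one half, and the Discovery figures cover only part of the series.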


exocrinous@startrek.website

0 points

4 months ago

Georgiou also got fridged for Michael’s character development. And then we follow Michael over the timeskip. Right out the gate, the universe exists to tell a story about Michael.


Rob T Firefly@lemmy.world

1 point

4 months ago

As the prime version of Georgiou’s lines basically amounted to “Hi!” “Oh crap!” “Bye!” the overall math shouldn’t be too affected.


Daniel Quinn@lemmy.caOP

6 points

4 months ago

That was my takeaway as well. I just wish I had data for the other seasons. It’d be interesting to see how that might change the percentages as they are.

As for GEORGIOU, I’m reasonably sure that this refers to both versions of her.


Indy@startrek.website

7 points

4 months ago

This is beautiful! I love data and I’m delighted you were inspired by my post to gather the data.

Thank you for doing this!


deegeese@sopuli.xyz

9 points

4 months ago

Thanks for sharing. I notice chakoteya.net has TOS scripts. Is there any reason they weren’t included in the analysis?


Daniel Quinn@lemmy.caOP

12 points

4 months ago

Honestly, it’s 'cause I forgot to include it! I’ll see if I can add it tonight. Check back in 24hrs :-)


deegeese@sopuli.xyz

4 points

4 months ago

Thanks for the update.

Poor Chekov has almost no lines, but Koenig was great as Bester on B5.
