I’ve been reading Mastering Regular Expressions by Jeffrey E.F. Friedl, and since nobody in my life (aside from my wife) cares, I thought I’d share something I’m pretty proud of: my first set of regular expressions, which I wrote myself to manipulate the text I’m working with.

What I’m so happy about is that I wrote these expressions. I understand exactly what they do and the purpose of each character in each expression.

I’ve used regex in the past, mostly stuff cobbled together from Stack Overflow, but I never really understood how it worked or what the expressions meant, just that they did what I needed them to do at the time.

I’m only about 10% of the way through the book, but already I understand so much more than I ever did about regex (I also recognize I have a lot to learn).

I wrote the expressions to be used with egrep and sed to generate and clean up a list of filenames pulled out of tarballs (movies I’ve ripped from my DVD collection and tarballed to archive them).

The first expression I wrote was this one used with tar and egrep to list the files in the tarball and get just the name of the video file:

tar -tzvf file.tar.gz | egrep -o '\/[^/]*\.m(kv|p4)' > movielist

That gives me a list of movies with entries like this:

/The.Hunger.Games.(2012).[tmdbid-70160].mp4
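As a rough sketch of what that expression does piece by piece, the snippet below runs it against one made-up line in the style of `tar -tzvf` output. The permissions, owner, size, date, and the `movies/` directory are assumptions for illustration; only the filename comes from the post. It uses `grep -E`, the current spelling of egrep, and leaves the slash unescaped, since the backslash in front of it is optional.

```shell
# A made-up listing line; only the filename is taken from the post.
sample='-rw-r--r-- user/user 1234567 2024-01-01 12:00 movies/The.Hunger.Games.(2012).[tmdbid-70160].mp4'

# /         a literal slash
# [^/]*     any run of characters that are not a slash, so the match
#           cannot span more than one path component
# \.m       a literal dot followed by "m"
# (kv|p4)   either "kv" or "p4"
# -o prints only the part of the line that matched.
name=$(printf '%s\n' "$sample" | grep -Eo '/[^/]*\.m(kv|p4)')
printf '%s\n' "$name"
```

The slash in `user/user` does not produce a match because nothing between it and the next slash looks like `.mkv` or `.mp4`.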

Then I used sed with a group of expressions to remove:

  • The leading forward slash
  • Everything from .[ to the end
  • All of the periods in between words (anything that isn’t a letter, digit, parenthesis, ampersand, or hyphen becomes a space)

And the last expression looks for runs of one or more spaces and replaces each run with a single space.

This is the full sed command:

sed -Ei 's/^\///; s/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//; s/[^a-zA-Z0-9\(\)&-]/ /g; s/ +/ /g' movielist

Which leaves me with a pretty list of movies that looks like this:

The Hunger Games (2012)
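To see what each of the four expressions contributes, here is a small sketch that applies them one at a time to the sample filename, reading from standard input instead of editing movielist in place. The `step1` to `step4` variable names are just for illustration.

```shell
# The sample filename from the post, as produced by the egrep step.
line='/The.Hunger.Games.(2012).[tmdbid-70160].mp4'

# Step 1: strip the leading slash.
step1=$(printf '%s\n' "$line" | sed -E 's/^\///')
# Step 2: drop the ".[tmdbid-70160].mp4" (or .mkv) tail.
step2=$(printf '%s\n' "$step1" | sed -E 's/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//')
# Step 3: turn every character outside the allowed set into a space.
step3=$(printf '%s\n' "$step2" | sed -E 's/[^a-zA-Z0-9\(\)&-]/ /g')
# Step 4: squeeze each run of spaces down to a single space.
step4=$(printf '%s\n' "$step3" | sed -E 's/ +/ /g')

# Print the intermediate results, one per line.
printf '%s\n' "$step1" "$step2" "$step3" "$step4"
```

For this particular title, step 4 changes nothing because step 3 never produces two spaces in a row; it only matters for names with adjacent punctuation.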

I’m sure this could be done more elegantly, and I’m happy for any feedback on how to do that! For now, I’m just excited that I’m beginning to understand regex and how to use it!

Edit: fixed title so it didn’t say “regex expressions”


Good job!

I highly recommend trying out the various online regex editors.

These WYSIWYG-style editors are great because you immediately see what the regex is catching and why.

I took the first one in my search results, but try different ones.

https://regex101.com/

Also, I used GPT to get some regexes for some specific strings, and it can be a helpful way to get a quick start on building a specific regex.

In that case I was building a regex for a specific log from postfix.

PS: just make sure to select the correct flavor of regex you are using in these online tools.
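A quick sketch of why the flavor matters, using the filename from the post: the same pattern text behaves differently under grep's basic (BRE) and extended (ERE) syntax. The variable names are hypothetical.

```shell
name='The.Hunger.Games.(2012).[tmdbid-70160].mp4'

# ERE (grep -E): ( | ) mean grouping and alternation, so this matches ".mp4".
ere=$(printf '%s\n' "$name" | grep -Ec '\.m(kv|p4)$' || true)
# BRE (plain grep): the same characters are literal, so this looks for
# the text ".m(kv|p4)" and finds nothing.
bre=$(printf '%s\n' "$name" | grep -c '\.m(kv|p4)$' || true)

# -c prints the number of matching lines.
printf 'ERE matching lines: %s, BRE matching lines: %s\n' "$ere" "$bre"
```

The `|| true` is only there because grep exits with a non-zero status when nothing matches.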

Edit: Also, one of my favorite YT channels has a pretty cool video on regex: https://youtu.be/6gddK-cOxYc?si=0bnNkSDzifjdxwjU


Regex101 is amazing, definitely use that to learn, explain and check the regexes.


Very cool :) A little tip: the ex in regex stands for expression. So saying regex expression is saying regular expression expression.

Fun fact: regexes are one of the foundations of building interpreters and compilers (the tokenizing stage is commonly described with them), as they are a major tool for defining and recognizing formal languages. That’s what theoretical computer science 3 is about, at least at my university.


You can most likely combine your multiple (grep and) sed calls into a single one. In your case, I’d use capture groups to remove the empty brackets and whitespace.
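One way that could look, as a minimal sketch rather than the definitive version: a single sed call that picks out the video lines, keeps the title with a capture group, and cleans it up. The two listing lines are made up in the style of `tar -tzvf` output; only the filename comes from the post.

```shell
# Made-up listing: a directory entry and the video file.
listing='drwxr-xr-x user/user 0 2024-01-01 12:00 movies/
-rw-r--r-- user/user 1234567 2024-01-01 12:00 movies/The.Hunger.Games.(2012).[tmdbid-70160].mp4'

# -n turns off automatic printing; only lines ending in .mkv or .mp4
# enter the { } block and get printed by p.
# First s///: capture group 1 holds the filename up to the ".[tag-id]"
# part, and the whole line is replaced by that group.
# Second s///: each run of disallowed characters becomes one space.
titles=$(printf '%s\n' "$listing" | sed -nE '/\.m(kv|p4)$/{
  s/^.*\/([^/]*)\.\[[a-z]+-[0-9]+\]\.m(kv|p4)$/\1/
  s/[^a-zA-Z0-9()&-]+/ /g
  p
}')
printf '%s\n' "$titles"
```

In real use the input would come from `tar -tzvf file.tar.gz` instead of the sample variable. One caveat: a video file without the bracketed tag would not match the first substitution, so its whole listing line would pass through to the cleanup step.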

Consider ripgrep for a faster grep and sd (search and displace) as a less cumbersome sed alternative.
