cross-posted from: https://lemmy.sdf.org/post/12950329
There’s a difference between ‘processing’ the text and ‘parsing’ it. The processing described in the section you posted is fine, and you can manage a similar level of processing on HTML. The tricky/impossible bit is parsing the languages. For instance you can’t write a regex that’ll reliably find the subject, object and verb in any English sentence, and you can’t write a regex that’ll break an HTML document down into a hierarchy of tags, as regexes don’t support counting depth of recursion. HTML isn’t a regular language anyway, meaning it can’t be reliably parsed with a regular parser.
For instance you can’t write a regex that’ll reliably find the subject, object and verb in any English sentence
Identifying parts of speech isn’t a requirement of the word ‘parse’. That’s the linguistic definition. In computer science, identifying tokens is parsing.
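To make the distinction being argued here concrete, below is a rough Python sketch, not anything from the original posts. It assumes a deliberately simplified tag pattern (the hypothetical `TAG` regex handles only bare `<name>` and `</name>`, with no attributes, comments or void elements). The regex is enough to identify tokens, but working out how deeply the tags nest takes a stack on top of it.

```python
import re

# Hypothetical, simplified tag pattern: matches <name> and </name> only.
# Real HTML (attributes, comments, void elements) needs a proper parser.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>")

def tokenize(html):
    """Return (kind, name) pairs, where kind is 'open' or 'close'."""
    return [("close" if m.group(1) else "open", m.group(2))
            for m in TAG.finditer(html)]

def max_depth(tokens):
    """Track nesting with an explicit stack, the part the regex alone lacks."""
    stack, deepest = [], 0
    for kind, name in tokens:
        if kind == "open":
            stack.append(name)
            deepest = max(deepest, len(stack))
        elif not stack or stack.pop() != name:
            raise ValueError(f"unbalanced </{name}>")
    if stack:
        raise ValueError(f"unclosed <{stack[-1]}>")
    return deepest

doc = "<div><p><b>hi</b></p><p>there</p></div>"
tokens = tokenize(doc)
print(tokens[:3])         # the first three tokens found by the regex
print(max_depth(tokens))  # nesting depth, computed by the stack
```

Whether you call the first step ‘parsing’ or just tokenizing is the disagreement above; either way, the hierarchy only shows up once something is counting depth.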