Problems parsing a string with pyparsing
from prettydarknwild@lemmy.world to python@programming.dev on 06 Dec 2023 00:02
https://lemmy.world/post/9197847
I was trying to parse a string with pyparsing so that all the words are separated from the punctuation signs. I was using this expression to do it:
OneOrMore(Word(alphanums)) + OneOrMore(Char(printables))
But when I parse the following string with this expression:
return abc(1, ULLONG_MAX)
All the words inside the parentheses get split:
['return', 'abc', '(', '1', ',', 'U', 'L', 'L', 'O', 'N', 'G', '_', 'M', 'A', 'X', ')', ';']
But if I use this expression:
OneOrMore(Word(alphanums)) + OneOrMore(Char(string.punctuation))
Only a part of the string gets parsed:
['return', 'abc', '(']
What is wrong with those expressions?
#python
Personally, I would recommend using regex for the parsing instead, which would also let you test your expressions more easily and give you the token list directly.
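A minimal sketch of that regex approach, using only the standard `re` module. The pattern here is an assumption chosen for illustration: it takes either a run of word characters (letters, digits, underscore) or a single character that is neither a word character nor whitespace.

```python
import re

source = "return abc(1, ULLONG_MAX);"

# \w+      : a run of letters, digits and underscores (one "word")
# [^\w\s]  : exactly one character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", source)
print(tokens)
# ['return', 'abc', '(', '1', ',', 'ULLONG_MAX', ')', ';']
```

Because `\w` includes the underscore, `ULLONG_MAX` stays in one piece with this pattern.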
As for what’s wrong with your expressions:
First expression: Once you hit `(`, `OneOrMore(Char(printables))` will take over and continue matching every printable char. Instead you should use OR (`|`), with the alphanumeric word first for priority: `OneOrMore(Word(alphanums) | Char(printables))`
Second expression: You're running into the same issue with your use of `+`. Once `string.punctuation` takes over, it will continue matching until it encounters a char that is not punctuation, and then the matching stops. Instead you can apply the same `|` fix here, with `Char(string.punctuation)` as the second alternative. Do note that underscore is considered punctuation, so ULLONG_MAX will be split; not sure if that's what you want or not.
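The underscore point can be checked with the standard library alone. In this sketch, `alphanums_like` is a stand-in for pyparsing's `alphanums` (ASCII letters plus digits), and the two adjusted alphabets at the end are hypothetical names showing one way to keep identifiers whole if that is what you want.

```python
import string

# Stand-in for pyparsing's `alphanums`: ASCII letters plus digits.
alphanums_like = string.ascii_letters + string.digits

print("_" in string.punctuation)  # True: underscore counts as punctuation
print("_" in alphanums_like)      # False: so a word stops at the underscore

# Hypothetical adjusted alphabets: treat "_" as a word character instead.
identifier_chars = alphanums_like + "_"
punctuation_without_underscore = string.punctuation.replace("_", "")
```

Passing alphabets like these to `Word(...)` and `Char(...)` should keep `ULLONG_MAX` as a single token.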
Haven't used that particular library, but I have written libraries that do similar sorts of things and have played with a few other similar libraries in C++ and Haskell. I've taken a quick glance at the documentation here, but since I don't know this library specifically, apologies in advance if I make a mistake.
For `OneOrMore(Word(alphanums)) + OneOrMore(Char(printables))`, it looks like it matches as many alphanum Words as it can (whitespace sequences being an acceptable separator between tokens by default), and when it hits `(` it cannot continue with that, so it tries to match the next expression in the sequence (i.e. `OneOrMore(Char(printables))`). The documentation says that `Char` matches any single character from the given string of characters.
Presumably, that means it will not group the characters together, which is why you get individual character matches after that point for all the remaining non-whitespace characters. (Your result also seems to imply there was a semicolon at the end of your input?)
For `OneOrMore(Word(alphanums)) + OneOrMore(Char(string.punctuation))`, it looks like it cannot match further than `(` since `1` is not a punctuation character; so, you got the tokens for the parts of the string that matched. (If you chained the parser expression with something like `+ Word(alphanums)`, I'd expect you'd get another token [i.e. `"1"`] added onto the end of your result.) You may eventually want StringEnd/LineEnd or something like that -- I'd expect they'd fail the parser expression if there's unconsumed input (for error detection), but again, I haven't used this specific library, so it may work differently than I expect.
There appears to be a `Combine` class you can use to join string results together; that might be useful for future reference.
Have not tested it (since I don't have a copy of the library installed anywhere and can't set up an environment for it easily right now) but perhaps something like
OneOrMore(Word(alphanums)|Char(string.punctuation))
would be more like what you are looking for?
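To check that suggestion, here is a small sketch assuming pyparsing 3.x (for the snake_case `parse_string` and `as_list` names). The `identifier` variant that adds `_` to the word characters is an illustrative assumption for keeping `ULLONG_MAX` in one piece, not something taken from the posts above.

```python
import string

from pyparsing import Char, OneOrMore, Word, alphanums

source = "return abc(1, ULLONG_MAX);"

# At each position, try a whole word first, then fall back to one punctuation character.
tokenizer = OneOrMore(Word(alphanums) | Char(string.punctuation))
tokens = tokenizer.parse_string(source).as_list()
print(tokens)

# Assumed variant: allow "_" inside words so identifiers are not split.
identifier = Word(alphanums + "_")
tokenizer_keep_underscore = OneOrMore(identifier | Char(string.punctuation))
# parse_all=True makes the parse fail if any input is left unconsumed.
tokens_keep_underscore = tokenizer_keep_underscore.parse_string(
    source, parse_all=True
).as_list()
print(tokens_keep_underscore)
```

The first tokenizer should split `ULLONG_MAX` into `ULLONG`, `_`, `MAX`, since the underscore is punctuation and not part of `alphanums`; the second should keep it as a single token.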