Python regex to find multiple consecutive punctuation characters
I am streaming plain text records via MapReduce and need to check each plain text record for 2 or more consecutive punctuation symbols. The 12 symbols I need to check for are: -/\()!"+,'&.
I have tried translating the punctuation list into an array like this:
punctuation = [r'-', r'/', r'\\', r'\(', r'\)', r'!', r'"', r'\+', r',', r"'", r'&', r'\.']
I can find the individual characters with nested loops, for example:
import re

for t in test_cases:
    print(t)
    for p in punctuation:
        print(p)
        if re.search(p, t):
            print('found match!', p, t)
        else:
            print('no match')
However, the single backslash character is not found when I test this, and I don't know how to get results for 2 or more consecutive occurrences in a row. I've read that I need to use the + symbol, but I don't know the correct syntax for this.
Here are my test cases:
the quick '''brown fox
the &&quick brown fox
the quick\brown fox
the quick\\brown fox
the -quick brown// fox
the quick--brown fox
the (quick brown) fox,,,
the quick ++brown fox
the "quick brown" fox
the quick/brown fox
the quick&brown fox
the ""quick"" brown fox
the quick,, brown fox
the quick brown fox...
the quick-brown fox
the ((quick brown fox
the quick brown)) fox
the quick brown fox!!!
the 'quick' brown fox
Translated into a Python list, these look like this:
test_cases = [
    "the quick '''brown fox",
    'the &&quick brown fox',
    'the quick\\brown fox',
    'the quick\\\\brown fox',
    'the -quick brown// fox',
    'the quick--brown fox',
    'the (quick brown) fox,,,',
    'the quick ++brown fox',
    'the "quick brown" fox',
    'the quick/brown fox',
    'the quick&brown fox',
    'the ""quick"" brown fox',
    'the quick,, brown fox',
    'the quick brown fox...',
    'the quick-brown fox',
    'the ((quick brown fox',
    'the quick brown)) fox',
    'the quick brown fox!!!',
    "the 'quick' brown fox",
]
How do I use a Python regex to identify and report all matches where a punctuation symbol appears 2 or more times in a row?
The punctuation characters can be put into a character class in square brackets. What follows depends on whether the series of 2 or more punctuation characters may consist of any punctuation characters or whether the punctuation characters must all be the same.
In the first case, curly braces can be appended to specify the minimum (2) and maximum number of repetitions. The latter is unbounded and is left empty:
[...]{2,} # min. 2 or more
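As a minimal sketch of this first case, the character class can be filled with the 12 symbols from the question. The names `ANY_PUNCTUATION_RUN` and `find_mixed_runs` are made up for illustration and are not part of the original answer:

```python
import re

# Character class with the 12 symbols from the question; {2,} means "two or more".
# The hyphen comes first so it is taken literally, and the backslash is doubled
# once for the string literal and once for the regular expression.
ANY_PUNCTUATION_RUN = re.compile('[-/\\\\()!"+,&\'.]{2,}')

def find_mixed_runs(text):
    """Return every run of 2 or more punctuation characters, mixed or not."""
    return ANY_PUNCTUATION_RUN.findall(text)

print(find_mixed_runs('the (quick brown) fox,,,'))  # [',,,']
print(find_mixed_runs('the "quick", brown fox'))    # ['",']
```

Note that the second call reports a run made of two different characters, which may or may not be what is wanted.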
If only repetitions of the same character need to be found, the first matched punctuation character is put into a group. The same group (= same character) then follows 1 or more times:
([...])\1+
The back reference \1 means the first group in the expression. Groups, represented by their opening parentheses, are numbered from left to right.
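A short sketch of the back reference in action, assuming the same 12 symbols; `SAME_PUNCTUATION_RUN` and `find_same_runs` are hypothetical names chosen for this example. It uses `finditer` so that every run in a record is reported, not only the first:

```python
import re

# Group 1 captures one punctuation character; \1+ requires that same
# character to repeat at least once more directly afterwards.
SAME_PUNCTUATION_RUN = re.compile('([-/\\\\()!"+,&\'.])\\1+')

def find_same_runs(text):
    """Return each run of one punctuation character repeated 2 or more times."""
    # group(0) is the whole run; group(1) would be just the single character.
    return [m.group(0) for m in SAME_PUNCTUATION_RUN.finditer(text)]

print(find_same_runs('the ""quick"" brown fox'))  # ['""', '""']
print(find_same_runs('the "quick", brown fox'))   # []
```

Here the pattern has only one pair of parentheses, so the back reference is `\1`. Wrapping the whole expression in an additional outer group would shift the inner group to number 2.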
The next issue is escaping. There are escaping rules for Python strings, and additional escaping is needed in the regular expression. The character class does not require much escaping, but the backslash must be doubled. The following example therefore quadruples the backslash: one doubling because of the string, the second because of the regular expression.
Raw strings (r'...') are useful for patterns, but here both single and double quotation marks are needed, so the example uses a plain string with an escaped single quote.
>>> import re
>>> test_cases = [
...     "the quick '''brown fox",
...     'the &&quick brown fox',
...     'the quick\\brown fox',
...     'the quick\\\\brown fox',
...     'the -quick brown// fox',
...     'the quick--brown fox',
...     'the (quick brown) fox,,,',
...     'the quick ++brown fox',
...     'the "quick brown" fox',
...     'the quick/brown fox',
...     'the quick&brown fox',
...     'the ""quick"" brown fox',
...     'the quick,, brown fox',
...     'the quick brown fox...',
...     'the quick-brown fox',
...     'the ((quick brown fox',
...     'the quick brown)) fox',
...     'the quick brown fox!!!',
...     "the 'quick' brown fox",
... ]
>>> pattern_any_punctuation = re.compile('([-/\\\\()!"+,&\'.]{2,})')
>>> pattern_same_punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
>>> for t in test_cases:
...     match = pattern_same_punctuation.search(t)
...     if match:
...         print("{:24} => {}".format(t, match.group(1)))
...     else:
...         print(t)
...
the quick '''brown fox   => '''
the &&quick brown fox    => &&
the quick\brown fox
the quick\\brown fox     => \\
the -quick brown// fox   => //
the quick--brown fox     => --
the (quick brown) fox,,, => ,,,
the quick ++brown fox    => ++
the "quick brown" fox
the quick/brown fox
the quick&brown fox
the ""quick"" brown fox  => ""
the quick,, brown fox    => ,,
the quick brown fox...   => ...
the quick-brown fox
the ((quick brown fox    => ((
the quick brown)) fox    => ))
the quick brown fox!!!   => !!!
the 'quick' brown fox
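As a side note on the escaping discussed above: a triple-quoted raw string is one possible way to keep both kinds of quotation marks in the pattern without the second level of backslash doubling. This is an alternative spelling offered as a sketch, not something the original answer uses; the variable names `escaped` and `raw` are made up for the comparison:

```python
import re

# Doubly escaped form, as used in the interactive session.
escaped = '(([-/\\\\()!"+,&\'.])\\2+)'

# The same pattern as a triple-quoted raw string: both quotation marks can
# appear unescaped, and only the regular expression's own escaping remains.
raw = r'''(([-/\\()!"+,&'.])\2+)'''

print(escaped == raw)  # True

# The test string below contains two literal backslashes.
match = re.search(raw, 'the quick\\\\brown fox')
print(match.group(1) == '\\\\')  # True
```

Both spellings produce the identical pattern string, so the choice between them is purely a matter of readability.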