^^Regex. Regular expression, grammar, language.

espressione regolare (di un linguaggio)

e' un procedimento applicabile ad ogni stringa, con un risultato binario si/no, interpretato come: la stringa ha la struttura descritta dall'espressione, oppure no.

alfabeto di un'espressione regolare  i caratteri con cui viene formulata.

linguaggio di un'espressione regolare  l'insieme delle stringhe che la soddisfano.

es regular expression

regex     matches, generate
a|b a    b
b*   ε      "b"      "bb"     "bbb"    ...
b+           "b"      "bb"     "bbb"    ...
a.b axb  x is any character
   
ab*c "ac"  "abc" "abbc"  "abbbc"    ...
ab+c         "abc", "abbc"  "abbbc"    ...
a*bc* the set of all strings consisting of: arbitrarily many "a"s,
followed by a single "b", followed by arbitrarily many "c"s
(ab)* ε    ab    abab    ababab    ...
(a|b)(a|b) aa, ab, ba, bb
(a|b)* ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, …

Operations to construct regular expressions

Most formalisms provide

Boolean "or"

A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".

Grouping

Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" or "grey".

Quantification

A quantifier after an element (such as a token, character, or group) specifies how many times the preceding element is allowed to repeat.

? question mark indicates 0 or 1 occurrences of the preceding element.

ex: colou?r matches both "color" and "colour".

* asterisk (derived from the Kleene star) indicates 0 or more occurrences of the preceding element.

ex: ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

+ plus sign (Kleene plus) indicates 1 or more occurrences of the preceding element.

ex: ab+c matches "abc", "abbc", "abbbc", ... , but not "ac".

. full stop is wildcard

ex:  a.b matches any string that contains an "a", and then any character and then "b"

ex:  a.*b matches any string that contains an "a", and then the character "b" at some later point.

{n}  The preceding item is matched exactly n times

Storia

1968 Regular expressions entered popular use in pattern matching in a text editor.

“Regular expressions aren’t random jumbles of punctuation—they’re carefully thought-out jumbles of punctuation!” -- The Perl Cookbook

Linguaggio regolare

puo' essere definito in 3 modi

  1. espressione regolare
  2. grammatica regolare
  3. definizione formale

Strictly regular grammar.

Examples

Grammar  Grammar
A → ε

A → aA

A → a

A → aA

Productions Productions
ε

aA → a

aA → aaA → aa

aA → aaA → aaaA → aaa

aA → aaA → aaaA → aaaaA → aaaa

ecc ...

a

aA → aa

aA → aaA → aaa

aA → aaA → aaaA → aaaa

aA → aaA → aaaA → aaaaA → aaaaa

ecc ...

Language Language
L= {an | n≥0 } L= {an | n>0 }
regex regex
a* a+

 

quello che voglio notare e':   la produzione  A → aA  

  1. lascia invariata il nr di lettere sostituibili, poiche' i 2 lati della produzione hanno lo stesso nr di lettere sostituibili, quindi
  2. la stringa aggiunge sempre all'estremo destro.

I 3 tipi di produzione di una grammatica regolare destra

A → ε

A → a
A → aB

parte sx:  solo 1 non terminale, come per Context-free grammar.

parte dx:  max 1 non terminale, sempre a dx;  o 1 terminale, o stringa vuota.

Esempi

Grammar  Grammar
A → a

A → aA

A → b

A → bA

A → a

A → aA

A → bB

B → bB

B → ε

Productions Productions
a

b

 

aA → aa

aA → ab

bA → ba

bA → bb

 

aA → aaA → aaa

aA → abA → aba

bA → baA → baa

bA → bbA → bba

 

aA → aaA → aab

aA → abA → abb

bA → baA → bab

bA → bbA → bbb

 

ecc ...

A → a

A → bB → b

 

A → aA → aa

A → aA → abB → ab

A → bB → bbB → bb

 

A → aA → aaA → aaa

A → aA → aaA → aabB → aab

A → aA → abB → abbB → abb

A → bB → bbB → bbbB → bbb

 

A → aA → aaA → aaaA → aaaa

A → aA → aaA → aaaA → aaabB → aaab

A → aA → aaA → aabB → aabbB → aabb

A → aA → abB → abbB → abbbB → abbb

A → bB → bbB → bbbB → bbbbB → bbbb

 

ecc ...

regex regex
(a|b)+ a*b*

 

Grammatica che produce 1 sola voluta stringa

o un nr qualsiasi di volte

Grammar  Grammar 
A → xB

B → yC

C → z

A → xB

B → yC

C → zA

C → z

Productions Productions
A → xB → xyC → xyz A → xB → xyC → xyzA →

xyzxB → xyzxyC → xyzxyz

regex regex
xyz (xyz)+

 

Extended Regular grammar.

per poter scrivere la grammatica in modo piu' compatto, si introduce una notazione piu' compatta-potente, che pero' genera gli stessi linguaggi. Analogo a BNF Backus–Naur Form . EBNF Extended BNF.

2 regole

A → w     A, B non-terminal,  w∈Σ*
A → wB

In particolare e' possibile scrivere una produzione del tipo

A → B

 

Grammar  Grammar              Grammar
A → a

A → aA
 

 

 

 

extended Strict
A → a

A → aA


A → B

B → b
B → bB

A → a

A → aA
 

A → bB

B → b
B → ε

extended Strict
A → a

A → aA
 

A → B

B → b
B → bB
 

B → C

C → c
C → cC

A → a

A → aA
 

A → bB

B → b
B → ε

 

B → cC

C → c

C → ε

Language Language Language
L= {an

      | n>0 }

L= {anbm
      | n,m>0 }
L= {aibjck
      | i,j,k>0 }
regex regex regex
a+ a+b+ a+b+c+

 

Progressivo

Grammar  Language regex
A → ε L= {ε}  
A → aA L= {an | n≥0 } a*
A → B

B → ε

idem  
B → bB L= {anbm | n,m≥0 } a*b*
B → C

C → ε

idem  
C → cC L= {aibjck | i,j,k≥0 } a*b*c*

 

Grammar  Grammar  Language regex
A → ε

A → aA
A → B

B → bC
C → ε
C → cC

A → ε

A → aA
A → B

B → bC
C → ε
C → cC

L= {aibck
      | i,k≥0 }
a*bc*

 

Grammar  Language regex
A1 → aA2
A2 → aA3

A3 → a

L= {a3}   a{3}
A1 → aA2
A2 → aB1

B1 → bB2
B2 → bB3

B3 → bC1

C1 → c

L= {a2b3c}   a{2}b{3}c

Definizione formale

wp/Regular_expression#Formal_definition

 

 

Links

  1. wp/Regular_expression
  2. wp/Regular grammar
  3. wp/Regular_language
  4. wp/Stephen_Cole_Kleene
  5. regex-based-lexical-analysis-in-python-and-javascript  Eli Bendersky
  6. gnosis.cx/regular_expressions
  7. Why are regular expressions difficult? johndcook 14 comments
  8. johndcook/regex-approximation

Approfond

Pattern recognition ≠ pattern matching

Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to perform "most likely" matching of the inputs, taking into account their statistical variation.

es: riconoscere la foto di un cane.

pattern matching algorithms look for exact matches in the input with pre-existing patterns.

es: pattern-matching by regex.

 

Talk

Alter

espressione regolare (di un linguaggio) descrive un tipo di struttura di stringhe tramite un procedimento che permette di calcolare se una qualsiasi stringa ha o no tale struttura.

 

 

A → ε

A → aA
 

L=

{an |n≥0 }

 

regex

a*

A → ε

A → aA


A → B

B → ε
B → bB
 

L= {anbm
      | n,m≥0 }

 

regex

a*b*

 

A → ε

A → aA
 

A → B

B → ε
B → bB
 

B → C

C → ε
C → cC
 

L= {aibjck
      | i,j,k≥0 }

 

regex

a*b*c*

 

Grammar  Grammar Grammar
A → a

A → aA
 

L= {an

      | n>0 }

 

regex

a+

A → a

A → aA


A → B

B → b
B → bB
 

L= {anbm
      | n,m>0 }

 

regex

a+b+

 

A → a

A → aA
 

A → B

B → b
B → bB
 

B → C

C → c
C → cC
 

L= {aibjck
      | i,j,k>0 }

 

regex

a+b+c+