Table 1 Orthographical features.

From: Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization

Feature name    Regular expression pattern

FG              ^\d(,\d)*\-?\w+
INITCAPS        ^[A-Z].+
CAPWORD         ^[A-Z][a-z]+$
ALLCAPS         ^[A-Z]+$
CAPSMIX         ^[A-z]*([A-Z][a-z]|[a-z][A-Z])[A-z]*$
ALPHANUMMIX     ^[A-z0-9]*([0-9][A-z]|[A-z][0-9])[A-z0-9]*$
ALPHANUM        ^[A-z]+[0-9]+$
UPPERCHAR       ^[A-Z]$
LOWERCHAR       ^[a-z]$
SHORTNUM        ^[0-9]?$
INTEGER         ^-?[0-9]+$
REAL            ^-?[0-9]\.[0-9]+$
ROMAN           ^[IVX]+$
HASDASH         -
INITDASH        ^-
ENDDASH         -$
PUNCTUATION     ^[,.;:?!]$
QUOTE           ^[\"`']$
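The patterns in Table 1 can be tried out on individual tokens with a short script. The following is a minimal sketch, not the authors' implementation: it assumes Python's `re` module, that a pattern counts as matched when it is found anywhere in the token (the anchors in each pattern do the positioning), and that a token may carry several features at once. The names `ORTHOGRAPHIC_PATTERNS` and `orthographic_features` are hypothetical and chosen here for illustration only.

```python
import re

# Patterns copied verbatim from Table 1, keyed by the table's feature names.
# The dictionary name is an assumption made for this sketch.
ORTHOGRAPHIC_PATTERNS = {
    "FG": r"^\d(,\d)*\-?\w+",
    "INITCAPS": r"^[A-Z].+",
    "CAPWORD": r"^[A-Z][a-z]+$",
    "ALLCAPS": r"^[A-Z]+$",
    "CAPSMIX": r"^[A-z]*([A-Z][a-z]|[a-z][A-Z])[A-z]*$",
    "ALPHANUMMIX": r"^[A-z0-9]*([0-9][A-z]|[A-z][0-9])[A-z0-9]*$",
    "ALPHANUM": r"^[A-z]+[0-9]+$",
    "UPPERCHAR": r"^[A-Z]$",
    "LOWERCHAR": r"^[a-z]$",
    "SHORTNUM": r"^[0-9]?$",
    "INTEGER": r"^-?[0-9]+$",
    "REAL": r"^-?[0-9]\.[0-9]+$",
    "ROMAN": r"^[IVX]+$",
    "HASDASH": r"-",
    "INITDASH": r"^-",
    "ENDDASH": r"-$",
    "PUNCTUATION": r"^[,.;:?!]$",
    "QUOTE": r"^[\"`']$",
}

# Compile once so repeated lookups over many tokens stay cheap.
_COMPILED = {name: re.compile(p) for name, p in ORTHOGRAPHIC_PATTERNS.items()}


def orthographic_features(token):
    """Return the Table 1 feature names whose pattern is found in `token`.

    Uses re.search, so unanchored patterns such as HASDASH match anywhere
    in the token. Names come back in table order.
    """
    return [name for name, rx in _COMPILED.items() if rx.search(token)]


if __name__ == "__main__":
    for tok in ["Aspirin", "2,4-dinitrophenol", "H2O", "IV", "5", "3.14", "-"]:
        print(tok, orthographic_features(tok))
```

Two properties of the patterns as printed are worth keeping in mind when reading the output. The class `[A-z]` covers the ASCII range from `A` to `z`, which also contains the six non-letter characters between `Z` and `a` (square brackets, backslash, caret, underscore, backtick). The features also overlap: a capitalized word such as "Aspirin" satisfies INITCAPS, CAPWORD and CAPSMIX at the same time.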