Named Entity Recognition with Transformers
Named Entity Recognition is the computational engine that turns unstructured text into structured knowledge, powering everything from intelligent search engines to automated compliance systems. By identifying and classifying key information—like people, organizations, and dates—NER provides the foundational data layer for more complex AI tasks. Mastering modern NER, particularly with transformer models, is essential for anyone building intelligent text applications that require precise information extraction.
What is NER and Sequence Labeling?
At its core, Named Entity Recognition is a sequence labeling task. This means you take a sequence of tokens (words or subwords) as input and assign a label to each token that describes its role within an entity. For example, in the sentence "Apple launched the iPhone in Cupertino," a model would label "Apple" as an organization, "iPhone" as a product, and "Cupertino" as a location. The challenge lies in understanding context; "Apple" could also be a fruit, but the surrounding words "launched" and "iPhone" provide the necessary clues. The goal is to map from a sequence of words to a corresponding sequence of labels.
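In code, this word-to-label mapping is just two parallel sequences of equal length. A minimal sketch using the example sentence (the tag names ORG, PRODUCT, and LOC are illustrative):

```python
# NER as sequence labeling: exactly one label per token.
tokens = ["Apple", "launched", "the", "iPhone", "in", "Cupertino"]
labels = ["B-ORG", "O", "O", "B-PRODUCT", "O", "B-LOC"]

for token, label in zip(tokens, labels):
    print(f"{token:10s} -> {label}")
```

The `B-` prefix and the `O` tag come from the BIO tagging scheme, covered below.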
Traditional Powerhouse: The BiLSTM-CRF Architecture
Before the dominance of transformers, a hybrid architecture combining a Bidirectional LSTM with a Conditional Random Field was the state-of-the-art for NER. A Bidirectional Long Short-Term Memory network processes the input sequence in both forward and backward directions, allowing each word's representation to be informed by its full context. The LSTM generates a set of scores for each possible label for each word.
However, labeling each word independently can lead to incoherent sequences, like B-Organization I-Person. A Conditional Random Field layer is added on top to model dependencies between adjacent labels. It applies a transition matrix that scores the likelihood of moving from one label to the next (e.g., an I-Location is very likely to follow a B-Location). During training, the CRF learns global constraints, and during prediction, it uses the Viterbi algorithm to find the single most likely label sequence rather than just the most likely label for each independent position. This combination of contextual word encoding (BiLSTM) and structured prediction (CRF) made this architecture highly effective.
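The Viterbi search at the heart of CRF decoding can be sketched in a few lines. This is a simplified illustration, assuming scores are already computed; a real CRF layer also learns start/end transition scores and runs in log-space with batching:

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence given per-token emission
    scores (T lists of L floats) and an L x L label transition matrix."""
    num_labels = len(emissions[0])
    # score[j]: best score of any path ending in label j so far
    score = list(emissions[0])
    backptrs = []
    for emission in emissions[1:]:
        step_ptrs, new_score = [], []
        for j in range(num_labels):
            # Pick the best previous label to transition into label j
            cand = [score[i] + transitions[i][j] for i in range(num_labels)]
            best_prev = max(range(num_labels), key=cand.__getitem__)
            step_ptrs.append(best_prev)
            new_score.append(cand[best_prev] + emission[j])
        backptrs.append(step_ptrs)
        score = new_score
    # Trace the best path backwards from the final position
    best = max(range(num_labels), key=score.__getitem__)
    path = [best]
    for step_ptrs in reversed(backptrs):
        path.append(step_ptrs[path[-1]])
    return path[::-1]

# Tiny example: 2 labels (0 = O, 1 = B-LOC), 3 tokens, no learned transitions.
emissions = [[2.0, 0.5], [0.1, 1.5], [1.0, 0.2]]
transitions = [[0.0, 0.0], [0.0, 0.0]]
print(viterbi_decode(emissions, transitions))  # [0, 1, 0]
```

With a trained transition matrix, an impossible move like B-Organization to I-Person receives a large negative score, so the decoded path never contains it.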
The Transformer Revolution: Fine-Tuning for NER
Transformer models like BERT, RoBERTa, and ELECTRA revolutionized NER by providing deeply contextualized word representations from pre-training on massive text corpora. The standard approach is fine-tuning, where you take a pre-trained transformer and add a simple classification layer (often just a linear layer) on top of its output representations to predict the NER label for each token.
Fine-tuning for NER has a key nuance: tokenization. Transformers use subword tokenizers (like WordPiece or SentencePiece), meaning a single word like "iPhone" might be split into ["i", "##Phone"]. You must decide on a strategy for aligning these subword tokens to the single word-level NER label. The common practice is to use the representation of the first subtoken for classification and ignore the subsequent subtokens (##Phone), or to feed all subtoken representations through the classifier and then use the first subtoken's prediction as the label for the whole word.
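The first-subtoken strategy can be sketched as follows. To stay self-contained, this example simulates a tokenizer's subword output rather than loading a real one; the `-100` sentinel is the ignore index that PyTorch's cross-entropy loss skips by convention:

```python
def align_labels_to_subwords(words, word_labels, subword_splits):
    """Align word-level labels to subword tokens: the first subword keeps
    the word's label, later subwords get -100 so the loss ignores them.
    `subword_splits` stands in for a real tokenizer's output."""
    subwords, subword_labels = [], []
    for word, label, pieces in zip(words, word_labels, subword_splits):
        for i, piece in enumerate(pieces):
            subwords.append(piece)
            subword_labels.append(label if i == 0 else -100)
    return subwords, subword_labels

# "iPhone" splits into two WordPiece-style tokens in this toy example.
words = ["Apple", "launched", "iPhone"]
labels = [1, 0, 2]  # e.g. 1 = B-ORG, 0 = O, 2 = B-PRODUCT
splits = [["Apple"], ["launched"], ["i", "##Phone"]]
print(align_labels_to_subwords(words, labels, splits))
# (['Apple', 'launched', 'i', '##Phone'], [1, 0, 2, -100])
```

At prediction time the same mapping is applied in reverse: only the first subtoken's predicted label is read off for each word.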
Tagging Schemes: BIO and BILOU
To distinguish between single-word entities and multi-word entities, we use standardized tagging schemes. The most common is the BIO scheme:
- B- (Beginning): The first token of a multi-token entity.
- I- (Inside): A subsequent token within a multi-token entity.
- O (Outside): A token that is not part of any entity.
For the sentence "New York City," the labels would be B-Location, I-Location, I-Location. A more informative variant is the BILOU scheme (Begin, Inside, Last, Outside, Unit):
- B-: First token of a multi-token entity.
- I-: Middle token(s).
- L-: Last token of a multi-token entity.
- U-: A single-token entity.
- O: Outside.
BILOU provides more explicit structural information (e.g., an L- tag must follow a B- or I-), which can help the model learn constraints more easily, often leading to slightly better performance.
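Because BILOU is a deterministic refinement of BIO, converting between the two is mechanical. A sketch of the BIO-to-BILOU direction, assuming the input is already valid BIO:

```python
def bio_to_bilou(tags):
    """Convert a valid BIO tag sequence to BILOU."""
    bilou = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            bilou.append("O")
        elif tag.startswith("B-"):
            # A B- with no same-type I- continuation is a single-token Unit
            bilou.append(tag if nxt == "I-" + tag[2:] else "U-" + tag[2:])
        else:  # I- tag
            # An I- with no same-type I- continuation is the Last token
            bilou.append(tag if nxt == "I-" + tag[2:] else "L-" + tag[2:])
    return bilou

print(bio_to_bilou(["B-LOC", "I-LOC", "I-LOC", "O", "B-PER"]))
# ['B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-PER']
```

Running this on the "New York City" example turns its final I-Location into L-Location, making the entity boundary explicit in the tag itself.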
Advanced Challenges: Nested Entities and Domain Adaptation
Real-world text introduces complexities that simple sequence labeling must address. Nested entities occur when one entity sits inside another. For example, in "The University of California Berkeley library," "University of California Berkeley" is an organization, and "Berkeley" is also a city. A flat BIO/BILOU scheme cannot capture this. Solutions include:
- Span-based approaches: Instead of labeling tokens, classify all possible text spans.
- Stacked models: Use one model to detect outer entities and another to detect inner ones.
- Hypergraphs: Use more complex CRFs that can represent overlapping structures.
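The core idea behind span-based approaches is easy to see in code: enumerate every candidate span and let a classifier score each one independently, so overlapping spans can both be predicted. A minimal sketch of the enumeration step (the classifier itself is omitted):

```python
def enumerate_spans(tokens, max_len=4):
    """Enumerate all candidate (start, end) spans up to max_len tokens.
    A span-based NER model scores each span independently, which is what
    allows nested entities to coexist in the output."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans

tokens = ["University", "of", "California", "Berkeley"]
for start, end, text in enumerate_spans(tokens):
    print(start, end, text)
```

Here both the full four-token span and the single-token span "Berkeley" appear as separate candidates, so an organization and a nested location can each receive their own label.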
Another major challenge is adapting to specialized domains. Domain-specific entity types are critical in fields like medicine (e.g., "Disease," "Dosage," "Body Part") and law (e.g., "Clause," "Statute," "Party"). The vocabulary and context in these domains differ vastly from general news text, on which models like BERT are pre-trained. Successful adaptation often requires continued pre-training of the transformer on in-domain text (like medical journals or legal contracts) before fine-tuning on the smaller, annotated NER dataset for that domain.
Few-Shot NER with Prompt-Based Approaches
Annotating thousands of sentences for a new entity type is expensive. Few-shot NER aims to learn from just a handful of examples. A promising modern approach uses prompt-based learning. Instead of adding a classifier head that predicts B-Person/I-Person, you reformulate the task to match the model's pre-training. For example, you might create a prompt: "John Smith worked at Google. John Smith is a [MASK] entity." You then train the model to fill the [MASK] with a word from a defined set, like "person" or "organization". By leveraging knowledge already in the pre-trained model, this method can achieve surprising performance with very little labeled data.
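The mechanical part of this approach is just template construction. A hedged sketch; the template wording and label-word set are illustrative, and real prompt-based systems tune both:

```python
def build_ner_prompt(sentence, candidate, mask_token="[MASK]"):
    """Build a cloze-style prompt asking the model to type a candidate
    span. The masked-language model is then trained to fill the mask
    with a label word such as "person" or "organization"."""
    return f"{sentence} {candidate} is a {mask_token} entity."

# Mapping from label words (the model's vocabulary) to entity types.
label_words = {"person": "PER", "organization": "ORG", "location": "LOC"}

prompt = build_ner_prompt("John Smith worked at Google.", "John Smith")
print(prompt)
# John Smith worked at Google. John Smith is a [MASK] entity.
```

Because the fill-in-the-blank format matches masked-language-model pre-training, the model needs far fewer labeled examples than a freshly initialized classifier head would.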
Evaluation: Precision, Recall, and F1 at the Entity Level
You don't evaluate NER on individual tokens; you evaluate on full, extracted entities. An entity is correct only if its span (start and end indices) and its type both match the gold-standard annotation.
The standard metrics are entity-level precision, recall, and the F1 score.
- Precision (P = TP / (TP + FP)): Of all the entities the system extracted, how many were correct?
- Recall (R = TP / (TP + FN)): Of all the entities that exist in the text, how many did the system find?
- F1 Score (F1 = 2PR / (P + R)): The harmonic mean of precision and recall, providing a single balanced metric.
A strict evaluation requires an exact match of the entity boundary. A looser "partial match" evaluation may give credit for overlapping spans, but the F1 score based on exact matches is the most common benchmark for reporting model performance.
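Strict entity-level scoring is straightforward to implement once entities are represented as (start, end, type) tuples. A minimal sketch of the exact-match criterion:

```python
def entity_f1(gold, predicted):
    """Entity-level precision/recall/F1 under strict exact matching:
    an entity counts only if (start, end, type) all agree with gold.
    `gold` and `predicted` are collections of (start, end, type) tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 2, "PER"), (5, 6, "ORG")]
pred = [(0, 2, "PER"), (5, 6, "LOC")]  # right span, wrong type
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

Note how the second prediction gets no credit despite the correct span: under strict matching, a wrong type is as bad as a missed entity.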
Common Pitfalls
- Ignoring Tokenization Mismatch: Directly applying word-level labels to subword tokens without a proper alignment strategy (like using the first subword) is a frequent error that corrupts your training data. Always implement a robust mapping from tokens to labels during data preprocessing.
- Misunderstanding Evaluation Metrics: Reporting token-level accuracy is misleading, as most tokens are "O". A model that labels everything as "O" would have high accuracy but zero recall for entities. Always use entity-level precision, recall, and F1 to get a true picture of performance.
- Overfitting on Small, Specialized Datasets: When fine-tuning a large transformer (like BERT) on a small medical NER dataset, the model can quickly memorize the limited examples and fail to generalize. Counter this with strong regularization (e.g., dropout, weight decay), using a smaller learning rate, and employing techniques like early stopping.
- Neglecting Post-Processing: Raw model output can have invalid sequences (e.g., I-Organization following an O). While a CRF layer learns to avoid this, if you're not using one, you must implement post-processing rules to clean up the label sequence based on the constraints of your chosen tagging scheme (BIO or BILOU).
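One common repair rule can be sketched in a few lines: rewrite any orphan I- tag as a B- tag, so it starts a new entity instead of continuing a nonexistent one. This is one convention among several; dropping the orphan tag entirely is an equally valid choice:

```python
def repair_bio(tags):
    """Fix invalid BIO sequences: an I- tag that does not continue an
    entity of the same type is rewritten as a B- tag."""
    repaired = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if prev not in ("B-" + entity_type, "I-" + entity_type):
                tag = "B-" + entity_type  # orphan I- starts a new entity
        repaired.append(tag)
        prev = tag
    return repaired

print(repair_bio(["O", "I-ORG", "I-ORG", "O"]))
# ['O', 'B-ORG', 'I-ORG', 'O']
```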
Summary
- Named Entity Recognition is a fundamental sequence labeling task that extracts structured information (persons, organizations, etc.) from unstructured text.
- The BiLSTM-CRF architecture was a pre-transformer standard, using bidirectional networks for context and a conditional random field for label sequence coherence.
- Modern NER typically involves fine-tuning pre-trained transformer models (like BERT), carefully handling subword tokenization alignment.
- Tagging schemes like BIO and BILOU provide a framework for labeling multi-word entities, with BILOU offering more granular structural signals.
- Advanced challenges include detecting nested entities and adapting models to specialized domains (medical, legal) through continued pre-training and fine-tuning.
- Few-shot learning techniques, such as prompt-based approaches, are emerging to reduce the need for vast amounts of annotated data.
- Model performance is rigorously evaluated using entity-level precision, recall, and the F1 score, which require both the span and type of an entity to be correct.