Named Entity Recognition Guidelines

This document provides instructions on the accepted formats for entity descriptions, documents, and their annotations for the Named Entity Recognition use case. Properly formatted annotations are crucial for accurately representing document layouts and identifying entity names.

Models Supported

The NER use case provides multiple models for your needs.

We support learned models, which are trained on your annotated data and then make NER predictions.

We also provide a state-of-the-art, LLM-backed NER service that doesn't require any training and serves zero-shot predictions. Optionally, you can also provide examples to this LLM for few-shot predictions.

Feature group specifications

Input Feature Groups:
- Documents with Annotations (Optional)
- Entity Description (Optional)

Note: At least one of these feature groups is required as input.

Entity Description Feature Group

This feature group provides the details of the entities the NER model is expected to identify. Each row should contain the following columns:
- Entity Tag: A unique tag for the entity. The corresponding feature mapping is ENTITY_ID.
- Entity Description: A description of when to label a word or phrase as this entity. You can add examples here for better performance. The corresponding feature mapping is ENTITY_DESCRIPTION.

An example feature group looks like this:

| Entity Tag | Entity Description |
| --- | --- |
| Plot | This refers to the main events or storyline of a film. It's represented by descriptions summarizing key events and major twists. Examples: "boy meets girl", "heist-gone-wrong", "alien invasion". |
| Actor | This is an individual who plays a character in a movie. It's represented by the real names of people playing roles in films. Examples: "Brad Pitt", "Meryl Streep", "Leonardo DiCaprio". |
| Genre | This represents the category to which a film belongs. It's represented by specific types of films based on mood, format, or subject matter. Examples: "comedy", "thriller", "action", "romantic". |
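If you assemble this feature group programmatically, a minimal sketch (assuming pandas and a CSV upload; the file name entity_descriptions.csv is only illustrative) might look like this:

import pandas as pd

# Columns mirror the ENTITY_ID and ENTITY_DESCRIPTION feature mappings.
entity_descriptions = pd.DataFrame([
    {
        "entity_tag": "Plot",
        "entity_description": (
            "The main events or storyline of a film. "
            'Examples: "boy meets girl", "heist-gone-wrong", "alien invasion".'
        ),
    },
    {
        "entity_tag": "Actor",
        "entity_description": (
            "A person who plays a character in a movie, identified by their real name. "
            'Examples: "Brad Pitt", "Meryl Streep".'
        ),
    },
])

# Illustrative file name; upload the CSV as the Entity Description feature group.
entity_descriptions.to_csv("entity_descriptions.csv", index=False)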

Document Feature Group

Document annotations help capture the precise positions of tokens or elements within documents. This guide outlines the required format for annotations for document-related tasks.

Each row in the feature group should contain (as columns):
- A unique Document ID or Row ID [Optional]
- Document
- List of annotations

As described further down in the annotations column section, multiple annotation styles are supported.

An example feature group that uses the Entity Annotations format looks like this:

| document_id | document | annotations |
| --- | --- | --- |
| 0 | "steve mcqueen provided a thrilling motorcycle chase in this greatest of all ww 2 prison escape movies" | [{"entity_tag": "Actor", "entity_name": "steve", "explanation": ""}, {"entity_tag": "Plot", "entity_name": "thrilling", "explanation": ""}] |
| 1 | "what is the movie where a group of 3 male friends try to plot and murder their employers" | [{"entity_tag": "Plot", "entity_name": "3 male friends try to plot and murder their employers", "explanation": ""}] |
| 2 | "whats the name of the sergeant" | [{"entity_tag": "Character Name", "entity_name": "sergeant", "explanation": ""}] |

An example feature group that uses the Text Based Annotations format looks like this:

| document_id | document | annotations |
| --- | --- | --- |
| 0 | "a master martial artist with impeccable capabilities in his craft" | [{"displayName":"Plot","textExtraction":{"textSegment":{"endOffset":65,"startOffset":0}}}] |
| 1 | "what is the movie where a group of 3 male friends try to plot and murder their employers" | [{"displayName":"Plot","textExtraction":{"textSegment":{"endOffset":88,"startOffset":26}}}] |
| 2 | "whats the name of the sergeant" | [{"displayName":"Character_Name","textExtraction":{"textSegment":{"endOffset":30,"startOffset":22}}}] |

Document column

The document column can contain either the raw text or the extracted features. When raw documents are imported into the platform, a column containing page information is created, which can be used as well. The corresponding feature mapping is DOCUMENTS.

Annotations column

Each row in the annotations column comprises a list of annotations, which are structured JSON elements. These correspond to the ANNOTATIONS feature mapping. Three types of annotations are currently supported:
- Entity Annotations for few-shot examples
- Text Based Annotations
- Bounding Box Annotations

Note: A feature group must have all annotations of the same type. We currently do not support mixing annotation types in the same feature group.

Entity Annotations for few-shot examples

This is a simple annotation type, meant for providing few-shot examples to the LLM-backed NER model.

A single annotation contains these fields:
- entity_tag: The tag of the entity being labeled (e.g., "Actor", "Plot").
- entity_name: The word or phrase from the document to label with this tag.
- explanation: An optional free-text explanation; it can be left empty, as in the examples above.

Text Based Annotations

This annotation type is more detailed and is primarily meant for training models on user-provided documents. However, it can also be used to provide few-shot examples for the LLM-backed NER model.

A single annotation contains these fields:
- displayName: The label (entity tag) assigned to the text segment.
- textExtraction: An object containing a textSegment with:
  - startOffset: The character index at which the labeled segment starts.
  - endOffset: The character index at which the labeled segment ends (exclusive).

Example

If the text (in the DOCUMENT column) is as follows:

This is an example of the text in a document

And if we want to label "example" with the label "NAME", then the corresponding annotation would be:

{
  "displayName":"NAME",
  "textExtraction": {
                      "textSegment":{
                                      "endOffset":11,
                                      "startOffset":17
                                    }
                    }
}

Bounding Box Annotations

This is similar to text-based annotations but has one extra field for bounding boxes. Generally, this is used when raw documents are imported into the platform and bounding boxes are generated in the feature group.

Example

[
  {
     "displayName":"Customer Signature",
     "textExtraction": {
                         "textSegment":{
                                         "endOffset":142,
                                         "startOffset":124
                                       }
                       },
    "boundingBoxes":[
                      {
                        "page":2,
                        "boundingBox":[130,497,541,573],
                        "boundingBoxIds":[28,29,30,31,32,33]
                      }
                    ]
  }
]

This example represents an annotation with the label "Customer Signature" that uses bounding boxes. The labeled segment is located on page 2 of the document and is enclosed by a bounding box defined by the coordinates [130, 497, 541, 573]. It is associated with bounding box IDs 28 through 33.
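A bounding box annotation is, in effect, a text-based annotation with an extra boundingBoxes list. A minimal sketch for assembling one (the values are copied from the example above; how the coordinates and box IDs are produced depends on your extracted document features):

def bounding_box_annotation(label, start, end, page, box, box_ids):
    """Combine a text segment with the bounding boxes that enclose it."""
    return {
        "displayName": label,
        "textExtraction": {
            "textSegment": {"startOffset": start, "endOffset": end}
        },
        "boundingBoxes": [
            {"page": page, "boundingBox": box, "boundingBoxIds": box_ids}
        ],
    }

annotation = bounding_box_annotation(
    "Customer Signature", 124, 142,
    page=2, box=[130, 497, 541, 573], box_ids=[28, 29, 30, 31, 32, 33],
)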

Supported Document Formats

For NER tasks, we support the following document formats:

TOKENS

This format is used when each token in the text is represented individually with its character offsets, optionally including bounding box information, typically extracted from OCR.

Example with Bounding Boxes (OCR extracted format)

[
  {
    "page": 0,
    "block": 1,
    "content": "UNITED",
    "line": 0,
    "boundingBox": [274, 41, 305, 51],
    "startOffset": 0,
    "endOffset": 6
  },
  {
    "page": 0,
    "block": 1,
    "content": "STATES",
    "line": 0,
    "boundingBox": [306, 41, 338, 51],
    "startOffset": 7,
    "endOffset": 13
  }
  ...
]

Example without Bounding Boxes

[
  {
    "page": 0,
    "block": 1,
    "content": "UNITED",
    "line": 0,
    "startOffset": 0,
    "endOffset": 6
  },
  {
    "page": 0,
    "block": 1,
    "content": "STATES",
    "line": 0,
    "startOffset": 7,
    "endOffset": 13
  }
  ...
]
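If you are starting from plain text rather than OCR output, the TOKENS format without bounding boxes can be approximated by tokenizing on whitespace and recording character offsets. This is only a sketch; setting page, block, and line to 0 is an assumption for single-block plain text, not a requirement of the format:

import re

def to_tokens(text, page=0, block=0, line=0):
    """Split text on whitespace and record character offsets for each token."""
    tokens = []
    for match in re.finditer(r"\S+", text):
        tokens.append({
            "page": page,
            "block": block,
            "content": match.group(),
            "line": line,
            "startOffset": match.start(),
            "endOffset": match.end(),  # exclusive, consistent with the examples above
        })
    return tokens

print(to_tokens("UNITED STATES"))
# The two tokens get offsets (0, 6) and (7, 13), matching the example above.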

TEXT

This format is used when you have raw text and corresponding annotations that include the start and end character indices of entities. It's suitable for JSON input and is often used in web-based annotation tools or when annotations need to be serialized in a text-based format that is both human-readable and machine-parseable.

Example

{
  "text": "im thinking of a 2010 fantasy adventure film starring nicolas cage as a sorcerer",
  "annotations": [
    {
      "displayName": "Year",
      "textExtraction": {
        "textSegment": {
          "endOffset": 21,
          "startOffset": 17
        }
      }
    },
    {
      "displayName": "Genre",
      "textExtraction": {
        "textSegment": {
          "endOffset": 39,
          "startOffset": 22
        }
      }
    },
    {
      "displayName": "Actor",
      "textExtraction": {
        "textSegment": {
          "endOffset": 66,
          "startOffset": 54
        }
      }
    },
    {
      "displayName": "Plot",
      "textExtraction": {
        "textSegment": {
          "endOffset": 80,
          "startOffset": 72
        }
      }
    }
  ]
}
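Because the offsets are plain character indices, a TEXT-format record can be sanity-checked by slicing the text with each segment. A small sketch:

def check_annotations(record):
    """Print the text span each annotation points at, so labels can be eyeballed."""
    text = record["text"]
    for annotation in record["annotations"]:
        segment = annotation["textExtraction"]["textSegment"]
        span = text[segment["startOffset"]:segment["endOffset"]]
        print(f'{annotation["displayName"]}: "{span}"')

record = {
    "text": "im thinking of a 2010 fantasy adventure film starring nicolas cage as a sorcerer",
    "annotations": [
        {"displayName": "Year",
         "textExtraction": {"textSegment": {"startOffset": 17, "endOffset": 21}}},
        {"displayName": "Actor",
         "textExtraction": {"textSegment": {"startOffset": 54, "endOffset": 66}}},
    ],
}
check_annotations(record)
# Year: "2010"
# Actor: "nicolas cage"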

For additional information on the supported document formats and how to work with extracted features from PDFs, please refer to our Extracted Features from PDF Documentation.