labelbuddy documentation

This document describes labelbuddy version 0.1.0.

1. Introduction

labelbuddy is an open-source desktop Graphical User Interface (GUI) application for annotating documents. It can be used for example for Named Entity Recognition, sentiment analysis, document classification, etc.

It is easy to install and use, and can efficiently handle many documents and annotations.

1.1. labelbuddy compared to other annotation tools

There are many tools for annotating documents. Most of them, such as doccano and labelstudio, are meant to run on a web server and be used through a web browser. If you are crowdsourcing annotations and want many users to annotate documents online, this is necessary.

However, if you do not plan to host such a tool on a server, labelbuddy may be a simpler and more lightweight solution.

A labelbuddy database is just an ordinary file that you can copy, share or delete like any other file. Therefore managing annotations created with labelbuddy is just as simple as if the annotations were stored in plain text formats or LibreOffice Calc or Microsoft Excel spreadsheets. But being a dedicated tool, labelbuddy provides a convenient interface for creating and editing the annotations.

labelbuddy works completely offline, making it suitable to annotate confidential data.

1.2. Quick start

Start by downloading labelbuddy. Then to give it a try, start labelbuddy and select File > Demo in the menu. This will open a temporary demo database pre-loaded with a few example documents and labels. You can play around with labelbuddy’s features in this temporary database. You can also open the demo database by starting labelbuddy with:

labelbuddy --demo

If you decide to start creating annotations that you want to keep, read on to the next section. It explains how to open a new (persistent) database and import your documents and labels. The short version is:

Create a file for your documents that looks like this.
Create a file for your labels that looks like this (optional — labels can be created from within labelbuddy).
Start labelbuddy and open a new database (File > Open or pass a file name on the command line).
Import the labels and documents and start labelling!

For a short description of the command line interface run:

labelbuddy -h

2. Using labelbuddy

This section describes how to annotate text with labelbuddy. We start with a list of terms used in the rest of the documentation:

Document	A piece of text that we want to annotate, along with optional attributes such as a `display_title`.
Label	A term that can be used to annotate (tag) documents or portions of text. Labels can be for example types of named entities such as “City”, Part-Of-Speech tags such as “Verb”, etc. Labels can also be given a `color` and a `shortcut_key`.
Character	A Unicode code point (labelbuddy provides annotation positions both as Unicode character indices and as byte indices in the UTF-8 encoded text, see details).
Annotation	A label attached to a specific portion of text in a document. For example, the first 5 characters of document #2 tagged with the label “City” constitute an annotation. Additional free-form text can also be attached to annotations.
Database	A binary file, created and modified by labelbuddy, containing labels, documents and annotations.
Importing and exporting data	Reading (writing) labels, documents and annotations from (to) files in non-binary formats that can easily be read by humans or other tools, such as JSON.

The labelbuddy interface is organized around three tabs: the “Import & Export” tab for importing and exporting labels, documents and annotations, the “Labels & Documents” tab for getting an overview of the documents and labels in the database and managing them, and the “Annotate” tab for editing annotations.

Documents and labels can be imported into a labelbuddy database from plain text, JSON or JSONLines, as described in Appendix A. Once they are imported, you can annotate the documents. Finally, you can export your annotations.

The import and export formats are the same, meaning that exported labels, documents or annotations can readily be imported back into any labelbuddy database.

Importing and exporting data can be done from the graphical or the command line interface. Labels, documents and annotations that are already in the database are skipped if you try to import them again (but you can import new annotations for an existing document).

2.1. Importing documents (and annotations)

In the “Import & Export” tab, click Import docs & annotations and select a file containing the documents you plan to annotate. If you want to try this before creating your own, you can download example documents from the labelbuddy website.

When importing a new document into labelbuddy, several attributes can be specified:

`text`	The text (content) of the document (mandatory).

All other attributes are optional:

`metadata`	A mapping of user-defined metadata. You can use it to associate some information with the document, for example an identifier, DOI, author… This data is not used by labelbuddy. It is stored and bundled with the document when you export it.
`display_title`	Displayed in the “Annotate” tab when annotating the document.
`list_title`	Displayed in the document list in the “Labels & Documents” tab.

You can use the display_title to display essential metadata or short instructions specific to a document. It can contain links by using an HTML <a> tag.

This information can be provided in several formats. The format is deduced from the filename extension.

Importing documents and annotations can also be done from the command line, for example:

labelbuddy mydatabase.labelbuddy --import-docs mydocs.jsonl

2.2. Importing or creating labels

To import labels, click Import labels in the “Import & Export” tab and select a file. If you want to try this before creating your own labels, you can download example labels from the labelbuddy examples page.

Labels have three attributes: a mandatory name, and an optional color and shortcut_key. The shortcut_key is a lower-case letter (a-z), an upper-case letter (A-Z) or a digit (0-9) that helps you quickly annotate text with that label.

As for documents, the format is deduced from the filename extension when importing labels, and is described in Appendix A.

It is also possible to manually enter a new label or to change the labels’ name, color and shortcut key from within the GUI application, in the “Labels & Documents” tab.

Labels can also be imported using the command line, for example:

labelbuddy mydatabase.labelbuddy --import-labels mylabels.json

In the “Labels & Documents” tab, labels can be reordered by dragging and dropping them.

2.3. Annotating documents

Once you have imported labels and documents you can see them in the “Labels & Documents” tab.

You can filter which documents are shown by the labels they have been annotated with. In addition there is a search bar to filter documents based on their content (possibly combined with the filter set on their labels). The search is performed on the titles, the metadata, and the documents' text. At the moment the search is very basic. It looks for exact matches. By default, search is case-insensitive (although unfortunately that only applies to ASCII characters), and leading and trailing whitespace is stripped from the search term or phrase. If the search term contains any upper-case letters, the search becomes case-sensitive. If the search term is wrapped in "" or '', whitespace is preserved and the search is case-sensitive. The surrounding quotes will be removed but can be doubled to search for a term wrapped in literal quotes in the documents.

You can delete labels or documents, add labels and change the color and shortcut associated with each label. You can drag and drop labels to change their order. You then go to the “Annotate” tab. (To jump to annotating a specific document in the list you can double-click it or select it and either press Enter or click Annotate.)

Once in the “Annotate” tab, use the mouse to select the region you want to annotate and click on the appropriate label. It is also possible to do the same thing with the keyboard. Press / or Ctrl+F to search for the term you want to annotate and the first match will be selected. The selection can be adjusted with the keyboard using the bindings described below. Then press the shortcut key associated with the label you want to set.

You can also attach additional information to the annotation by typing it in the “Extra annotation data” box. You can use this to add a comment to the annotation. You can also use the extra data for free-form labelling. For example, you can enter a number, the normalized name of an entity, a URI, etc. — anything that an Information Extraction system should find in the labelled region. The “Extra annotation data” box offers a list of possible completions: all the extra data values for annotations with the same label in the current document. This makes it easy to link different annotations, by giving them the same extra data, which thus acts as an identifier.

Once you have created annotations, you can select any of them by clicking it. It becomes bold and underlined and you can edit its additional data, change its label by clicking on a different one or remove the annotation by clicking Delete annotation. You can also do this with the keyboard: jump to the next annotation with the Space key and change its label with a label shortcut or remove it with Backspace.

You can control whether the selected annotation is displayed in a bold font by checking or unchecking Show selected annotation in bold font.

If you are doing document classification and need global labels for the documents, just annotate any arbitrary portion of text. If you need to tag some document status such as "approved", "in progress", etc., add a label for that! You can then use it to filter documents in the “Annotate” tab. If you need free-form labels, use a generic label name and enter the free-form annotation in the “Extra annotation data” box.

A list of all annotations for the current document is shown on the right. The label is displayed first, then the selected text surrounded by its context. Any extra text attached to the annotation (from the “Extra annotation data” box) is shown in an italic font next to the label name. You can use this list to navigate the annotations in the document: if you click an annotation in the list it becomes active and the text is scrolled so that it appears on the screen. Conversely, if you create a new annotation or click on an annotation in the text it also becomes selected in the annotations list on the right.

2.3.1. Overlapping annotations

When two or more annotations overlap, the whole group is shown in white text on a gray background. As you click the gray region or press the Space key, each annotation is selected in turn and shown in its label’s color. You can also see and select these annotations from the annotations list on the right of the “Annotate” tab.

The status bar on the bottom of the window shows a caret (“^”) next to the label name when the selected annotation is the first of its overlapping group (and “^^” when it is the first in the document).

If we click a new label while an annotation is selected, a new annotation is created on top, at the same position as the previously selected annotation, with the new label. It is not possible to have two different annotations with the same label at the exact same (start and end) position. If we click a label corresponding to an annotation that already exists at the current position, the existing annotation is selected instead of creating a new one.

2.3.2. Summary of key bindings in the “Annotate” tab

Searching and navigation
`Ctrl` and scroll the mouse	zoom the text in or out (for persistent settings, use Preferences > Choose font)
`Ctrl`+`F`, `/`	search
`Enter`	next search match
`Shift`+`Enter`	previous search match
`Ctrl`+`J`, `Ctrl`+`N`, `Down`	scroll down one line
`Ctrl`+`K`, `Ctrl`+`P`, `Up`	scroll up one line
`Ctrl`+`D`, `PageDown`	scroll down one page
`Ctrl`+`U`, `PageUp`	scroll up one page
`Ctrl`+`L`	cycle between placing the cursor at the center, top and bottom of the window
`Home`	go to start of document
`End`	go to end of document

Manipulating annotations
`A-Z`, `0-9`, `Shift`+`A-Z` (label’s `shortcut_key`)	set corresponding label for the currently selected region or annotation
`Backspace`	remove selected annotation
`Alt`+`E`	edit the extra annotation data (then press `Enter` to return focus to the text)
`Space`	jump to next annotation and select it
`Shift`+`Space`	jump to previous annotation and select it
`Esc`	un-select selected annotation

Manipulating the text selection
`]`	move the end of the selection by one word to the right
`[`	move the end of the selection by one word to the left
`}`	move the beginning of the selection by one word to the right
`{`	move the beginning of the selection by one word to the left
`Ctrl`+`]`	move the end of the selection by one character to the right
`Ctrl`+`[`	move the end of the selection by one character to the left
`Ctrl`+`}`	move the beginning of the selection by one character to the right
`Ctrl`+`{`	move the beginning of the selection by one character to the left
`Shift`+`Right`	select next character
`Shift`+`Left`	select previous character
`Ctrl`+`Shift`+`Right`	select next word
`Ctrl`+`Shift`+`Left`	select previous word
`Shift`+`Down`	select next line
`Shift`+`Up`	select previous line

Navigating documents
`>`	go to next document
`<`	go to previous document

Underlined letters in the GUI indicate Alt-key shortcuts: for example, in the “Labels & Documents” tab, the underlined “N” in “New label” indicates that you can jump to creating a new label by pressing Alt+N. (These shortcuts are not available on macOS.)

2.4. Exporting documents and annotations

Once you are satisfied with your annotations you can export them in order to share them or use them in other applications.

Back in the “Import & Export” tab, click Export docs & annotations. You can choose to export all documents or only those that have annotations. You can choose to export the text of the documents or not. If you do not export the text, the documents can be identified from metadata you may have associated with them, or by the MD5 checksum of the text that is always exported. You can also choose to only export the documents, without the annotations.

When clicking Export docs & annotations you are asked to select a file and the resulting format will depend on the filename extension. The export format is the same as the import format. Exported documents and annotations can thus be imported back into a labelbuddy database. See Appendix A for details.

Exporting documents and annotations can also be done from the command line, for example:

labelbuddy mydatabase.labelbuddy --export-docs my-annotated-docs.jsonl

2.5. Exporting labels

You may also want to export labels, especially if you created them or edited their colors or shortcut keys within labelbuddy. In the “Import & Export” tab, click Export labels. Here also, the format is deduced from the filename extension.

This can also be done from the command line:

labelbuddy mydatabase.labelbuddy --export-labels mylabels.json

2.6. Managing projects

Each labelbuddy database (containing labels, documents and annotations) is an SQLite database. That is a single regular file on your disk that you can copy, backup, or share, like any other file. Therefore managing labelbuddy data is as simple as if you were storing annotations in a LibreOffice Calc or Microsoft Excel spreadsheet, for example.

Using SQLite (with the SQLite command-line interface or one of the many programming languages that provide SQLite bindings) you can also open a connection directly to the database to query it or even modify it. If you do so, set PRAGMA foreign_keys = ON.
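As a minimal sketch of such a direct connection, here is how it might look with Python's standard sqlite3 module. The function name list_tables is hypothetical, and the query only reads SQLite's built-in sqlite_master catalog, so it assumes nothing about the names of labelbuddy's tables.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of the tables found in an SQLite database file."""
    connection = sqlite3.connect(db_path)
    try:
        # Enforce foreign-key constraints for this connection.
        connection.execute("PRAGMA foreign_keys = ON")
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        connection.close()
```

For example, list_tables("/path/to/my_project.labelbuddy") would list the tables of that database. The pragma applies per connection, so it has to be set again each time a new connection is opened.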

After starting labelbuddy, you can create a new database or open an existing one by selecting File > Open. If you used labelbuddy before, by default at startup it opens the last database that you used. The database to open can also be specified when invoking labelbuddy from the command line:

labelbuddy /path/to/my_project.labelbuddy

The path to the current database is displayed in the “Import & Export” tab.

If you just want to give labelbuddy a try and do not have documents or labels yet, you can also select File > Demo to open a temporary database pre-loaded with a few examples.

As it is easy to create, copy and delete databases (an empty labelbuddy database is just 60K), and to copy labels, documents and annotations from one to another, you have some freedom in the organization of annotation work. For example, you can break down the annotations into several files to reflect the structure of your project or to limit the number of documents in each labelbuddy file.

2.7. Command-line interface

labelbuddy can also be used from the command line to create databases, import and export labels, documents and annotations without opening the GUI. See the labelbuddy(1) man page, or labelbuddy -h for a short list of options reproduced here:

Usage: labelbuddy [options] [database]
Annotate documents.

Options:
  -h, --help                              Displays this help.
  -v, --version                           Displays version information.
  --demo                                  Open a temporary demo database with
                                          pre-loaded docs.
  --import-labels <labels file>           Labels file to import in database.
  --import-docs <docs file>               Docs & annotations file to import in
                                          database.
  --export-labels <exported labels file>  Labels file to export to.
  --export-docs <exported docs file>      Docs & annotations file to export to.
  --labelled-only                         Export only labelled documents.
  --no-text                               Do not include doc text when
                                          exporting.
  --no-annotations                        Do not include annotations when
                                          exporting.
  --vacuum                                Repack database into minimal amount
                                          of disk space.

Arguments:
  database                                Database to open.

If any of the import or export options are used, labelbuddy does not start a GUI but performs the required import or export operations and exits. It is possible to specify these options several times. To use these options, the database path must be provided explicitly.

Labels are imported first, then documents, then export operations are performed. Therefore it is possible to import documents and then export them in one execution of labelbuddy. As an example, to strip the annotations from previously exported documents you could run:

labelbuddy tempdb --import-docs docs.jsonl --export-docs unlabelled-docs.jsonl --no-annotations; rm tempdb

Or even (using SQLite's in-memory database):

labelbuddy :memory: --import-docs docs.jsonl --export-docs unlabelled-docs.jsonl --no-annotations
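If this step is part of a larger script, the same command can be launched from Python. The sketch below only uses the options listed in the help text above; the function names are hypothetical, and it assumes the labelbuddy executable is on the PATH.

```python
import subprocess


def strip_annotations_command(docs_in, docs_out, executable="labelbuddy"):
    """Build the argument list for the in-memory import/export command."""
    return [
        executable,
        ":memory:",  # SQLite in-memory database: nothing is left on disk
        "--import-docs", docs_in,
        "--export-docs", docs_out,
        "--no-annotations",
    ]


def strip_annotations(docs_in, docs_out):
    """Run labelbuddy; subprocess raises CalledProcessError on a non-zero exit status."""
    subprocess.run(strip_annotations_command(docs_in, docs_out), check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.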

Regarding vacuum: when data is deleted from an SQLite database, the file does not shrink. The freed up space is not lost; it is kept and reused when new data is added to the database. To shrink the database to occupy a minimal amount of disk space after deleting some documents, we can use:

labelbuddy --vacuum /path/to/db.labelbuddy

or equivalently:

sqlite3 /path/to/db.labelbuddy 'VACUUM;'

See the SQLite documentation on VACUUM for more details. When the vacuum option is used, other options are ignored and labelbuddy shrinks the database then exits without starting the GUI.

3. Conclusion

labelbuddy was created using C++, Qt, SQLite, tools from the GNU project, and more. The documentation relies on asciidoctor and antora.

If you find a bug or have suggestions to improve labelbuddy, kindly open an issue on the labelbuddy GitHub repository.

Appendix A: Documents and labels file formats

A.1. General remarks

Documents, annotations and labels can be imported (exported) from (to) several simple formats, described below. In all cases, the format is deduced from the filename extension.

The examples shown in this section can be found as separate files on the labelbuddy examples page. Annotations are imported or exported together with information about the document to which they belong, so they are discussed together in the next section. However note that it is possible to export document metadata and annotations without the documents' text to save storage space.

labelbuddy expects all the files it imports to be encoded with UTF-8. Note that ASCII files are also valid UTF-8.

All files produced by labelbuddy are UTF-8 encoded.

The import and export formats of labelbuddy are the same, meaning that once you have exported documents (and annotations) or labels to a file, you can always import that file back into any labelbuddy database.

Note that the content (text) of the documents must be in plain text. Other formats will need to be converted to text, for example using pdftotext if they are in PDF, or pandoc (specifying the target format as "plain") for many other formats.

A.2. File formats for documents and annotations

Documents and annotations are kept together in imported or exported files. When importing new documents (and possibly their annotations), each document will have a text attribute containing its content. For documents that are already in the database, it is possible to import new annotations from a file where the text is omitted, if the documents have the utf8_text_md5_checksum attribute instead. The associated value must be the hexadecimal representation of the MD5 checksum of the (UTF-8 encoded) text of the document. Typically, you will not add this attribute yourself. It is added by labelbuddy when exporting documents and annotations. It allows you to export annotations without the document’s text, and then import them into a database that already contains the document.
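Should you need to compute this checksum yourself, for example to match exported annotations with texts stored elsewhere, the description above can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes the checksum covers exactly the document text as stored, without any normalization.

```python
import hashlib


def utf8_text_md5_checksum(text):
    """Hexadecimal MD5 digest of the UTF-8 encoded document text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

A dictionary mapping utf8_text_md5_checksum(text) to text can then be used to look up the original text of each exported document.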

When exporting documents and annotations, the document’s metadata (metadata, if you provided it), and utf8_text_md5_checksum are always included. You can choose not to include the document’s text (in which case display_title and list_title are not included either). This is useful to export annotations without storing the text several times. Exported annotations can be linked to the original text using the MD5 checksum, or information you stored in the documents’ metadata. You can also choose not to include the annotations, and export only the documents.

Here is a full list of attributes a document can have:

`text`	The text (content) of the document
`metadata`	A mapping of user-defined metadata. You can use it to associate some information with the document, for example an identifier, DOI, author… This data is not used by labelbuddy. It is stored and bundled with the document when you export it.
`display_title`	Shown in the “Annotate” tab when the document is displayed and being annotated.
`list_title`	Shown in the document list in the “Labels & Documents” tab.
`annotations`	A list of annotations.

Each annotation has the following attributes (the last one, extra_data, is optional):

`label_name`	the label name.
`start_char`	the position of the first character (Unicode code point), starting from 0 at the beginning of the text.
`end_char`	the position of one past the last character.
`start_byte`	the position of the first byte, starting from 0 at the beginning of the UTF-8 encoded text.
`end_byte`	the position of one past the last UTF-8 byte.
`extra_data`	additional text associated with the annotation.

Annotation positions: characters, code points and bytes

To represent text on a computer, all characters such as "a" (LATIN SMALL LETTER A), "α" (GREEK SMALL LETTER ALPHA) and "🤩" (GRINNING FACE WITH STAR EYES) are associated with an integer number, known as a Unicode code point. Text can be thought of in an abstract way as a sequence of Unicode code points (often called Unicode characters).

In practice, data in a file or in a computer’s memory is stored as a sequence of bytes (groups of 8 bits, which can each represent one of 2⁸ = 256 different values). In order to store text in a file, the code points need to be represented as a sequence of bytes. As there are far more than 256 characters (close to 150,000), each code point cannot be stored directly in one byte. A transformation that maps each code point to a string of bytes is called a character encoding. Several encodings exist, but the most widely used (and required for JSON files as specified in the JSON standard, rfc8259) is UTF-8, which encodes each Unicode character as a sequence of 1 to 4 bytes depending on the character. For example, "a" (LATIN SMALL LETTER A) corresponds to the Unicode code point 97, and is encoded in UTF-8 as a single byte, 0x61. "α" (GREEK SMALL LETTER ALPHA) corresponds to the Unicode code point 945, and is encoded in UTF-8 as the sequence of 2 bytes 0xCE 0xB1.

In order to specify the portion of text covered by an annotation, labelbuddy provides the positions in the text where the annotation starts and ends, in 2 equivalent ways:

As indices into the sequence of Unicode code points, start_char and end_char.
As indices into the sequence of bytes that results from encoding the text with UTF-8, start_byte and end_byte.

For example if the text starts with “hello 😀” and you highlighted exactly that phrase, the associated annotation positions will be:

start_char: 0
end_char: 7 (because there are 7 code points, "h", "e", "l", "l", "o", " ", and "😀")
start_byte: 0
end_byte: 10 (because there are 10 bytes in the annotated text once it is encoded as UTF-8: "😀" is represented by the 4-byte sequence 0xF0 0x9F 0x98 0x80 and the other characters are each represented with 1 byte)
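The relation between the two kinds of positions can be illustrated with a short Python sketch. Python strings are indexed by code point, so the character positions can be used directly as string indices. The function names are hypothetical and this is not labelbuddy's own implementation.

```python
def char_to_byte_span(text, start_char, end_char):
    """Convert code-point indices into indices in the UTF-8 encoded text."""
    start_byte = len(text[:start_char].encode("utf-8"))
    end_byte = start_byte + len(text[start_char:end_char].encode("utf-8"))
    return start_byte, end_byte


def byte_to_char_span(text, start_byte, end_byte):
    """Convert UTF-8 byte indices into code-point indices.

    Raises UnicodeDecodeError if a position falls inside a multi-byte character.
    """
    encoded = text.encode("utf-8")
    start_char = len(encoded[:start_byte].decode("utf-8"))
    end_char = start_char + len(encoded[start_byte:end_byte].decode("utf-8"))
    return start_char, end_char
```

With the example above, char_to_byte_span("hello 😀", 0, 7) returns (0, 10) and byte_to_char_span("hello 😀", 0, 10) returns (0, 7).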

When importing annotations, either Unicode character positions or UTF-8 byte positions can be provided. When both are present, the character positions are used and the byte positions are ignored. Annotations at invalid positions (for example starting past the end of the text or in the middle of a multi-byte character) are ignored.

Finally, note that Unicode characters do not necessarily correspond to graphemes in the displayed text — a single grapheme may be represented by several Unicode characters, for example when combining characters such as diacritical marks are used.

Documents and annotations can be imported from and exported to several formats. Below, two examples are shown for each format (except .txt which has only one). The first illustrates a file that you may create from your data to import it into labelbuddy. The second one is a file exported from labelbuddy. It is also a valid input file that can be imported into labelbuddy, and it additionally contains annotations.

There is no exported example for .txt because this format is only available for importing.

A.2.1. Import only: plain text (`.txt`)

The simplest format you can use is .txt. In this case, the file must contain the text of one document per line. The newlines that separate documents are not considered part of the document and are discarded.

While convenient, this format has some limitations: you cannot specify any other document attributes than the text, and the documents cannot contain newlines. The other import formats do not have these limitations.
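If your texts do contain newlines and you still want to use .txt, they have to be flattened first. Here is one possible sketch in Python; the function names are hypothetical. Note that this changes the text (line breaks become spaces), so prefer JSON or JSON Lines when the original layout matters.

```python
def to_single_line(text):
    """Replace the line breaks inside a document with single spaces."""
    return " ".join(text.splitlines())


def write_txt_documents(documents, path):
    """Write one document per line to a UTF-8 encoded .txt file."""
    with open(path, "w", encoding="utf-8") as f:
        for text in documents:
            f.write(to_single_line(text) + "\n")
```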

Here is an example of a .txt file we can use to import documents into a labelbuddy database:

Importing 2 documents from a .txt file

😀 the text of document 1 some text the end
the text of document 2 more text the end

Exporting to .txt is not supported.

A.2.2. JSON (`.json`)

When using this format, the import or export file is a JSON file containing one JSON array. Each element of the array is a JSON object representing a document and its attributes. If provided, metadata must be a JSON object (mapping) containing user data about the document.

Here is an example of a .json file we can use to import documents into a labelbuddy database:

Importing 2 documents from a .json file.

[
  {
    "text": "😀 the text of document 1\nsome text\nthe end\n"
  },
  {
    "text": "the text of document 2\nmore text\nthe end\n",
    "display_title": "title 2",
    "list_title": "the title of document 2",
    "metadata": {
      "id": "doc-2",
      "source": "example.org"
    }
  }
]

Note that display_title, list_title, metadata are optional — and indeed the first document does not provide them.

Below are the same documents, exported by labelbuddy after being annotated. Each document is always on a separate line. This makes it easy to parse the file incrementally. Moreover, as the documents are always in the same order (the order in which they were imported), this gives line-oriented tools such as diff or git a better chance of producing useful output.

Another example of 2 documents in a .json file, with annotations (exported by labelbuddy)

[
{"annotations":[{"end_byte":13,"end_char":10,"label_name":"Word","start_byte":9,"start_char":6},{"end_byte":27,"end_char":24,"extra_data":"1","label_name":"Number","start_byte":26,"start_char":23}],"metadata":{},"text":"😀 the text of document 1\nsome text\nthe end\n","utf8_text_md5_checksum":"baa2e7f8d7efb87cc449f5f1d3ecd6aa"},
{"annotations":[{"end_byte":20,"end_char":20,"label_name":"Word","start_byte":12,"start_char":12}],"display_title":"title 2","list_title":"the title of document 2","metadata":{"id":"doc-2","source":"example.org"},"text":"the text of document 2\nmore text\nthe end\n","utf8_text_md5_checksum":"c27b0dbfabf831afaf8bafa0c727cf97"}
]

Here, the text and titles are included, but labelbuddy provides the option to omit them. Note that as the export format is the same as the import format, this file can be imported into a labelbuddy database.
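To use such an export in another program, the annotated snippets can be recovered from the character positions. Below is a minimal Python sketch; the function name is hypothetical, and it assumes the export included the documents' text.

```python
import json


def annotated_snippets(exported_json):
    """Yield (label_name, snippet) pairs from the content of an exported .json file."""
    for doc in json.loads(exported_json):
        text = doc["text"]  # only present if the text was exported
        for anno in doc.get("annotations", []):
            # Python strings are indexed by code point, like start_char and end_char.
            yield anno["label_name"], text[anno["start_char"]:anno["end_char"]]
```

For the file shown above, this yields ("Word", "text") and ("Number", "1") for the first document, then ("Word", "document") for the second.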

For large files (containing many documents or large documents), the JSONL format described in the next section is very similar to JSON but more memory-efficient.

A.2.3. JSON Lines (`.jsonl`)

When importing a JSON file the whole file is read into memory before inserting the documents in the database. To read documents one by one and reduce memory usage, you can use JSON Lines format. It is very similar to the JSON format, but instead of having one JSON array, the file must contain one JSON document per line. For example:

Importing 2 documents from a .jsonl file.

{"text": "😀 the text of document 1\nsome text\nthe end\n"}
{"text": "the text of document 2\nmore text\nthe end\n", "display_title": "title 2", "list_title": "the title of document 2", "metadata": {"id": "doc-2", "source": "example.org"}}

(We can see that the first document omits the optional display_title, list_title and metadata.)
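A file like this can be produced with any JSON library by writing one object per line. Here is a minimal Python sketch; the file name docs.jsonl and the helper write_jsonl are illustrative choices, and UTF-8 encoding is assumed.

```python
import json

docs = [
    {"text": "😀 the text of document 1\nsome text\nthe end\n"},
    {
        "text": "the text of document 2\nmore text\nthe end\n",
        "display_title": "title 2",
        "list_title": "the title of document 2",
        "metadata": {"id": "doc-2", "source": "example.org"},
    },
]


def write_jsonl(records, path):
    """Write one JSON object per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines inside strings,
            # so each record stays on a single line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


write_jsonl(docs, "docs.jsonl")
```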

It is also possible to export documents and annotations to a .jsonl file:

Another example of 2 documents in a .jsonl file, with annotations (exported by labelbuddy)

{"annotations":[{"end_byte":13,"end_char":10,"label_name":"Word","start_byte":9,"start_char":6},{"end_byte":27,"end_char":24,"extra_data":"1","label_name":"Number","start_byte":26,"start_char":23}],"metadata":{},"text":"😀 the text of document 1\nsome text\nthe end\n","utf8_text_md5_checksum":"baa2e7f8d7efb87cc449f5f1d3ecd6aa"}
{"annotations":[{"end_byte":20,"end_char":20,"label_name":"Word","start_byte":12,"start_char":12}],"display_title":"title 2","list_title":"the title of document 2","metadata":{"id":"doc-2","source":"example.org"},"text":"the text of document 2\nmore text\nthe end\n","utf8_text_md5_checksum":"c27b0dbfabf831afaf8bafa0c727cf97"}

Note that a JSON Lines file is not a valid JSON document. If you exported your annotations to a JSON Lines file, parse each line separately as a JSON document rather than parsing the whole file at once.
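As an illustration of line-by-line parsing, the Python sketch below counts annotations per label. count_labels is a hypothetical helper, and the sample stands in for an exported file, with made-up documents whose offsets are omitted for brevity.

```python
import io
import json
from collections import Counter


def count_labels(lines):
    """Count annotations per label, parsing one JSON document per line."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        doc = json.loads(line)  # parse this line only, never the whole file
        for anno in doc.get("annotations", []):
            counts[anno["label_name"]] += 1
    return counts


# Stand-in for open("exported.jsonl", encoding="utf-8").
sample = io.StringIO(
    '{"annotations":[{"label_name":"Word"},{"label_name":"Number"}],"text":"a"}\n'
    '{"annotations":[{"label_name":"Word"}],"text":"b"}\n'
)
counts = count_labels(sample)
print(counts)
```

Since a file object yields its lines lazily, only one document is held in memory at a time.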

A.3. File formats for labels

Labels can have the following attributes:

`name`	The label’s name (mandatory).
`color`	A hexadecimal color string, such as `"#aec7e8"`, or an SVG color name, such as `"yellow"`. If missing, a default color will be assigned to the label.
`shortcut_key`	An ASCII letter (a-z, A-Z) or digit (0-9) that can be used to quickly annotate text with that label. If missing or already used by another label, the label will not have any `shortcut_key`.

color and shortcut_key are optional. Both can be set from within the labelbuddy GUI.
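The shortcut_key rule can be checked before importing a labels file. The Python sketch below mirrors the behavior described above (an invalid or already-used key is dropped while the label itself is kept). It is not labelbuddy's own code: drop_unusable_shortcuts is a hypothetical helper, the labels "Noun" and "Symbol" are made up, and keys that differ only in case are assumed to be distinct.

```python
import string

# ASCII letters (a-z, A-Z) and digits (0-9).
VALID_KEYS = set(string.ascii_letters + string.digits)


def drop_unusable_shortcuts(labels):
    """Return copies of the labels without invalid or duplicate shortcut keys."""
    used = set()
    cleaned = []
    for label in labels:
        label = dict(label)  # do not modify the caller's data
        key = label.get("shortcut_key")
        if key is not None:
            if key in VALID_KEYS and key not in used:
                used.add(key)
            else:
                del label["shortcut_key"]
        cleaned.append(label)
    return cleaned


labels = [
    {"name": "Word", "shortcut_key": "w"},
    {"name": "Number", "shortcut_key": "n"},
    {"name": "Noun", "shortcut_key": "n"},    # key already used: dropped
    {"name": "Symbol", "shortcut_key": "#"},  # not a letter or digit: dropped
]
cleaned = drop_unusable_shortcuts(labels)
for label in cleaned:
    print(label)
```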

As for documents, each section below except the one on .txt shows two examples: a simple, manually created labels file that can be imported, and a file containing labels exported by labelbuddy. The exported file is in the same format, so it can also be imported into a labelbuddy database.

A.3.1. Import only: plain text (`.txt`)

Labels can be imported from a text file containing one label name per line (end-of-line characters are discarded):

Importing 2 labels from a .txt file

Word
Number

This format does not allow specifying color or shortcut_key. Exporting labels to .txt is not possible.
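When colors or shortcut keys are needed, a list of label names in this format can be converted to the JSON format described in the next section. Here is a minimal Python sketch; labels_from_txt is a hypothetical helper, and skipping empty lines is a choice made by this sketch rather than documented labelbuddy behavior.

```python
import json


def labels_from_txt(txt, colors=None):
    """Build label objects from one-name-per-line text, adding colors if given."""
    colors = colors or {}
    labels = []
    for name in txt.splitlines():  # splitlines() drops end-of-line characters
        if not name:
            continue  # skip empty lines
        label = {"name": name}
        if name in colors:
            label["color"] = colors[name]
        labels.append(label)
    return labels


labels = labels_from_txt("Word\nNumber\n", colors={"Number": "orange"})
print(json.dumps(labels, indent=2))
```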

A.3.2. JSON (`.json`)

When labels are stored in a JSON file, the file must contain a JSON array containing one JSON object per label. Each label’s object must have the key name and optionally color and shortcut_key.

For example:

Importing 2 labels from a .json file

[
  {
    "name": "Word"
  },
  {
    "name": "Number",
    "shortcut_key": "n",
    "color": "orange"
  }
]

Unlike exported documents, labels exported to JSON are not written on a single line each. This makes them easier to read, and labels files are short enough that there is no need to parse them incrementally.

Another example of 2 labels in a .json file (exported by labelbuddy)

[
    {
        "color": "#aec7e8",
        "name": "Word"
    },
    {
        "color": "#ffa500",
        "name": "Number",
        "shortcut_key": "n"
    }
]

A.3.3. JSON Lines (`.jsonl`)

In the JSON Lines format, each line in the file contains a JSON document: the JSON object representing a label. For example:

Importing 2 labels from a .jsonl file.

{"name": "Word"}
{"name": "Number", "shortcut_key": "n", "color": "orange"}

And here are the same labels exported by labelbuddy:

Another example of 2 labels in a .jsonl file (exported by labelbuddy)

{"color":"#aec7e8","name":"Word"}
{"color":"#ffa500","name":"Number","shortcut_key":"n"}

