labelbuddy documentation

This document describes labelbuddy version 0.0.7.

1. Introduction

labelbuddy is an open-source desktop Graphical User Interface (GUI) application for annotating documents. It can be used for example for Named Entity Recognition, sentiment analysis, document classification, etc.

It is easy to install and use, and can efficiently handle many documents and annotations.

1.1. labelbuddy compared to other annotation tools

There exist several tools for annotating documents. Most of them, such as doccano and labelstudio, are meant to run on a web server and be used online. If you are crowdsourcing annotations and want users to annotate documents online you should turn to one of these tools.

However if you do not plan to host such a tool on a server, it may not be convenient for each annotator to install a rather complex program and run a local server and database management system on their own machine. In this case, it may be easier to rely on a desktop application such as labelbuddy, which is a more lightweight solution.

A labelbuddy database is just an ordinary file that you can copy, share or delete like any other file. Therefore managing annotations created with labelbuddy is just as simple as if the annotations were stored in plain text formats or LibreOffice Calc or Microsoft Excel spreadsheets. But being a dedicated tool, labelbuddy provides a convenient interface for creating and editing the annotations.

labelbuddy works completely offline, making it suitable to annotate confidential data.

labelbuddy supports the input and output formats of doccano so it is possible to switch from one to the other or to combine the work of annotators that use either.

1.2. Quick start

Start by installing labelbuddy. Then, to give it a try, start labelbuddy and select File > Demo in the menu. This will open a temporary demo database pre-loaded with a few example documents and labels. You can play around with labelbuddy’s features in this temporary database. If you decide to start creating annotations that you want to keep, read on to the next section. It explains how to open a new (persistent) database and import your documents and labels.

If you start labelbuddy from the command line, you can also open the demo database at startup with the --demo option:

labelbuddy --demo

2. Using labelbuddy

This section describes how to annotate text with labelbuddy. We start with a list of terms used in the rest of the documentation:

Document	A piece of text that we want to annotate, along with optional attributes such as a `short_title` or an `id`.
Label	A term that can be used to annotate (tag) documents or portions of text. Labels can be for example types of named entities such as “City”, Part-Of-Speech tags such as “Verb”, etc. Labels can also be given a `color` and a `shortcut_key`.
Character	A Unicode code point.
Annotation	A label attached to a specific portion of text in a document. For example, the first 5 characters of document #2 tagged with the label “City” constitute an annotation. Additional free-form text can also be attached to annotations.
Database	A binary file, created and modified by labelbuddy, containing documents, labels and annotations.
Importing and exporting data	Reading (writing) documents, labels and annotations from (to) files in non-binary formats that can easily be read by humans or other tools, such as JSON, XML or CSV.

The labelbuddy interface is organized around three tabs: the “Import / Export” tab for importing and exporting documents, labels and annotations, the “Dataset” tab for getting an overview of the documents and labels in the database and managing them, and the “Annotate” tab for editing annotations.

Documents and labels can be imported into a labelbuddy database from several simple formats such as JSON or CSV, described in Appendix A. Once they are imported, you can annotate the documents. Finally, you can export your annotations.

The import and export formats are the same, meaning that exported labels, documents or annotations can readily be imported back into the same or another labelbuddy database.

Importing and exporting data can be done from the graphical or the command line interface. Documents, labels and annotations that are already in the database are skipped if you try to import them again (but you can import new annotations for an existing document).

2.1. Importing documents (and annotations)

In the “Import / Export” tab, click Import docs & annotations and select a file containing the documents you plan to annotate. If you want to try this before creating your own, you can download example documents from the labelbuddy website.

When importing a new document into labelbuddy, several attributes can be specified:

text

The text (content) of the document — mandatory.

All other attributes are optional:

`meta`	A mapping of user-defined metadata. You can use it to associate some information with the document, for example an identifier, DOI, author… This data is not used by labelbuddy. It is stored and bundled with the document when you export it.
`id`	A string attached to the document — a simpler alternative to `meta` if you only need a way to identify exported documents. Just as `meta`, `id` is stored but not used internally by labelbuddy.
`short_title`	Displayed in the “Annotate” tab when annotating the document.
`long_title`	Displayed in the document list in the “Dataset” tab.

You can use the short_title to display essential metadata or short instructions specific to a document. It can contain links written with an HTML <a> tag.

This information can be provided in several formats. The format is deduced from the filename extension.

Moreover, annotations (previously exported from labelbuddy or doccano) can also be imported into a database as described in Appendix A.

Importing documents and annotations can also be done from the command line, for example:

labelbuddy mydatabase.labelbuddy --import-docs mydocs.jsonl
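If your documents live in a program rather than in files, a short script can produce the file to import. The following is a minimal sketch, assuming the JSON Lines layout described in Appendix A (one JSON object per line, with a mandatory `text` attribute). The function name `docs_to_jsonl` is hypothetical and not part of labelbuddy.

```python
import json


def docs_to_jsonl(docs):
    """Serialize a list of document dicts as JSON Lines (one object per line)."""
    lines = []
    for doc in docs:
        # "text" is the only mandatory attribute of a document.
        if not isinstance(doc.get("text"), str):
            raise ValueError("every document needs a 'text' string")
        # ensure_ascii=False keeps non-ASCII characters readable in the output.
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


docs = [
    {"text": "the text of document 1\nsome text\nthe end\n"},
    {
        "text": "the text of document 2\nmore text\nthe end\n",
        "short_title": "title 2",
        "long_title": "the title of document 2",
        "meta": {"id": "doc-2", "source": "example.org"},
    },
]
jsonl = docs_to_jsonl(docs)
```

Writing `jsonl` to a file opened with `encoding="utf-8"` (for example `mydocs.jsonl`) should give a file that can be passed to `--import-docs`.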

2.2. Importing or creating labels

To import labels, click Import labels in the “Import / Export” tab and select a file. If you want to try this before creating your own labels, you can download example labels from the labelbuddy examples page.

Labels have three attributes: a mandatory text (label name), and an optional color and shortcut_key. The shortcut_key is a lower-case letter (a-z) that lets you quickly annotate text with that label.

As for documents, the format is deduced from the filename extension when importing labels, and is described in Appendix A.

It is also possible to manually enter a new label or to change the labels’ color and shortcut key from within the GUI application, in the “Dataset” tab.

Labels can also be imported using the command line, for example:

labelbuddy mydatabase.labelbuddy --import-labels mylabels.json

In the “Dataset” tab, labels can be reordered by dragging and dropping them.

2.3. Annotating documents

Once you have imported labels and documents you can see them in the “Dataset” tab. You can filter which documents are shown by the labels they have been annotated with. You can delete labels or documents, add labels and change the color and shortcut associated with each label. You can drag and drop labels to change their order. You then go to the “Annotate” tab. (If you double-click a document, or press Enter after selecting it, it will be opened in the “Annotate” tab.)

To annotate a document, select the region you want to label with the mouse and click on the appropriate label. It is also possible to do the same thing with the keyboard. Press / or Ctrl+F to search for the term you want to annotate and the first match will be selected. The selection can be adjusted with the keyboard using the bindings described below. Then press the shortcut key associated with the label you want to set.

You can also attach additional information to the annotation by typing it in the “Extra annotation data” box. You can use this to add a comment to the annotation. You can also use the extra data for free-form labelling. For example, you can enter a number, the normalized name of an entity, a URI, etc. — anything that an Information Extraction system should find in the labelled region.

Once you have created annotations, you can select any of them by clicking it. It becomes bold and underlined and you can edit its additional data, change its label by clicking on a different one or remove the annotation by clicking Remove. You can also do this with the keyboard: jump to the next annotation with the Space key and change its label with a label shortcut or remove it with Backspace.

You can control whether the selected annotation is displayed in a bold font by checking or unchecking Show selected annotation in bold font.

If you are doing document classification and need global labels for the documents, just annotate any arbitrary portion of text. If you need to tag some document status such as "approved", "in progress", etc., add a label for that! You can then use it to filter documents in the “Annotate” tab. If you need free-form labels, use a generic label name and type the free-form annotation in the “Extra annotation data” box.

2.3.1. Overlapping annotations

When two or more annotations overlap, the whole group is shown in white text on a gray background. As you click the gray region or press the Space key, each annotation is selected in turn and shown in its label’s color.

The status bar on the bottom of the window shows a caret (“^”) next to the label name when the selected annotation is the first of its overlapping group (and “^^” when it is the first in the document).

2.3.2. Summary of key bindings in the “Annotate” tab

Searching and navigation
`Ctrl` and scroll the mouse	zoom the text in or out (for persistent settings, use Preferences > Choose font)
`Ctrl`+`F`, `/`	search
`Enter`	next search match
`Shift`+`Enter`	previous search match
`Ctrl`+`J`, `Ctrl`+`N`, `Down`	scroll down one line
`Ctrl`+`K`, `Ctrl`+`P`, `Up`	scroll up one line
`Ctrl`+`D`, `PageDown`	scroll down one page
`Ctrl`+`U`, `PageUp`	scroll up one page
`Ctrl`+`L`	cycle between placing the cursor at the center, top and bottom of the window
`Home`	go to start of document
`End`	go to end of document

Manipulating annotations
`A-Z` (label’s `shortcut_key`)	set corresponding label for the currently selected region or annotation
`Backspace`	remove selected annotation
`Alt`+`E`	edit the extra annotation data (then press `Enter` to return focus to the text)
`Space`	jump to next annotation and select it
`Shift`+`Space`	jump to previous annotation and select it
`Esc`	un-select selected annotation

Manipulating the text selection
`]`	move the end of the selection by one word to the right
`[`	move the end of the selection by one word to the left
`}`	move the beginning of the selection by one word to the right
`{`	move the beginning of the selection by one word to the left
`Ctrl`+`]`	move the end of the selection by one character to the right
`Ctrl`+`[`	move the end of the selection by one character to the left
`Ctrl`+`}`	move the beginning of the selection by one character to the right
`Ctrl`+`{`	move the beginning of the selection by one character to the left
`Shift`+`Right`	select next character
`Shift`+`Left`	select previous character
`Ctrl`+`Shift`+`Right`	select next word
`Ctrl`+`Shift`+`Left`	select previous word
`Shift`+`Down`	select next line
`Shift`+`Up`	select previous line

Navigating documents
`>`	go to next document
`<`	go to previous document

Underlined letters in the GUI indicate Alt-key shortcuts: for example in the “Dataset” tab “New label” indicates you can jump to creating a new label by pressing Alt+N.

2.4. Exporting documents and annotations

Once you are satisfied with your annotations you can export them in order to share them or use them in other applications.

Back in the “Import / Export” tab, click Export docs & annotations. You can choose to export all documents or only those that have annotations. You can choose to export the text of the documents or not. If you don’t export the text, the documents can be identified from metadata you may have associated with them, or by the MD5 checksum of the text that is always exported. You can also provide an “annotation approver” (user name), that will be exported as the annotation_approver (used by doccano). You can also choose to only export the documents, without the annotations.

When clicking Export docs & annotations you are asked to select a file and the resulting format will depend on the filename extension. The export format is the same as the import format. Exported documents and annotations can thus be imported back into a labelbuddy database. See Appendix A for details.

Exporting documents and annotations can also be done from the command line, for example:

labelbuddy mydatabase.labelbuddy --export-docs my-annotated-docs.jsonl

2.5. Exporting labels

You may also want to export labels, especially if you created them or edited their colors or shortcut keys within labelbuddy. In the “Import / Export” tab, click Export labels. Here also, the format is deduced from the filename extension.

This can also be done from the command line:

labelbuddy mydatabase.labelbuddy --export-labels mylabels.json

2.6. Managing projects

Each labelbuddy database (containing documents, labels and annotations) is an SQLite database. That is a single regular file on your disk that you can copy, backup, or share, like any other file. Therefore managing labelbuddy data is as simple as if you were storing annotations in a LibreOffice Calc or Microsoft Excel spreadsheet, for example.

Using SQLite (with the SQLite command-line interface or one of the many programming languages that provide SQLite bindings) you can also open a connection directly to the database to query it or even modify it. If you do so, set PRAGMA foreign_keys = ON.
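As an illustration, here is a minimal sketch that uses Python's standard `sqlite3` module to open a database and list its tables. It makes no assumption about labelbuddy's table or column names: it only queries SQLite's built-in `sqlite_master` catalog. The helper name `list_tables` is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of the tables found in an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        # Foreign key enforcement is off by default in SQLite and has to be
        # enabled separately for every connection.
        conn.execute("PRAGMA foreign_keys = ON")
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()
```

For example, `list_tables("/path/to/my_project.labelbuddy")` returns the table names, which is a reasonable starting point before writing queries. Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path first, and work on a copy of the database if you intend to modify it.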

After starting labelbuddy, you can create a new database or open an existing one by selecting File > Open. If you used labelbuddy before, by default at startup it opens the last database that you used. The database to open can also be specified when invoking labelbuddy from the command line:

labelbuddy /path/to/my_project.labelbuddy

The path to the current database is displayed in the “Import / Export” tab.

If you just want to give labelbuddy a try and don’t have documents or labels yet, you can also select File > Demo to open a temporary database pre-loaded with a few examples.

As it is easy to create, copy and delete databases (an empty labelbuddy database is just 60K), and to copy documents, labels and annotations from one to another, you have some freedom in the organization of annotation work. For example, you can break down the annotations into several files to reflect the structure of your project or to limit the number of documents in each labelbuddy file.

2.7. Command-line interface

labelbuddy can also be used from the command line to create databases, import and export documents, labels and annotations without opening the GUI. See the labelbuddy(1) man page, or labelbuddy -h for a short list of options reproduced here:

Usage: labelbuddy [options] [database]
Annotate documents.

Options:
  -h, --help                              Displays this help.
  -v, --version                           Displays version information.
  --demo                                  Open a temporary demo database with
                                          pre-loaded docs
  --import-labels <labels file>           Labels file to import in database.
  --import-docs <docs file>               Docs & annotations file to import in
                                          database.
  --export-labels <exported labels file>  Labels file to export to.
  --export-docs <exported docs file>      Docs & annotations file to export to.
  --labelled-only                         Export only labelled documents
  --no-text                               Do not include doc text when
                                          exporting
  --no-annotations                        Do not include annotations when
                                          exporting
  --approver <name>                       User or 'annotations approver' name
  --vacuum                                Repack database into minimal amount
                                          of disk space.

Arguments:
  database                                Database to open.

If any of the import or export options are used, labelbuddy doesn’t start a GUI but performs the required import or export operations and exits. It is possible to specify these options several times. To use these options, the database path must be provided explicitly.

Labels are imported first, then documents, then export operations are performed. Therefore it is possible to import documents and then export them in one execution of labelbuddy. As an example, to strip the annotations from previously exported documents you could run:

labelbuddy tempdb --import-docs docs.jsonl --export-docs unlabelled-docs.jsonl --no-annotations; rm tempdb

Or even (using SQLite's in-memory database):

labelbuddy :memory: --import-docs docs.jsonl --export-docs unlabelled-docs.jsonl --no-annotations
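The same operation can be driven from a script. The sketch below assumes that the `labelbuddy` executable is on the `PATH`. It only uses the options listed above, and the function names are hypothetical.

```python
import subprocess


def strip_annotations_command(docs_in, docs_out):
    """Build the argument list for the in-memory import/export shown above."""
    return [
        "labelbuddy",
        ":memory:",
        "--import-docs", docs_in,
        "--export-docs", docs_out,
        "--no-annotations",
    ]


def strip_annotations(docs_in, docs_out):
    """Run labelbuddy to copy docs_in to docs_out without annotations."""
    # check=True raises CalledProcessError if the process exits with a
    # non-zero status.
    subprocess.run(strip_annotations_command(docs_in, docs_out), check=True)
```

Calling `strip_annotations("docs.jsonl", "unlabelled-docs.jsonl")` is then equivalent to the shell command above. Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.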

Regarding vacuum: when data is deleted from an SQLite database, the file doesn’t shrink. The freed up space is not lost; it is kept and reused when new data is added to the database. To shrink the database to occupy a minimal amount of disk space, we can use:

labelbuddy --vacuum /path/to/db.labelbuddy

or equivalently:

sqlite3 /path/to/db.labelbuddy 'VACUUM;'

See the SQLite documentation on VACUUM for more details. When the vacuum option is used, other options are ignored and labelbuddy shrinks the database, then exits without starting the GUI.
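If you already manage your databases from a script, the same statement can be issued through SQLite bindings. This is a minimal sketch using Python's standard `sqlite3` module. The helper name `vacuum` is hypothetical, and it should be equivalent to the `sqlite3` command shown above.

```python
import os
import sqlite3


def vacuum(db_path):
    """Run VACUUM on an SQLite file; return (size_before, size_after) in bytes."""
    size_before = os.path.getsize(db_path)
    # isolation_level=None puts the connection in autocommit mode: VACUUM
    # cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return size_before, os.path.getsize(db_path)
```

Comparing the two returned sizes shows how much space was reclaimed. Make sure the database is not open in labelbuddy while it is being vacuumed.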

3. Conclusion

labelbuddy was created using C++, Qt, SQLite, tools from the GNU project, and more. The documentation relies on asciidoctor and antora.

If you find a bug or have suggestions to improve labelbuddy, kindly open an issue on the labelbuddy GitHub repository.

Appendix A: Documents and labels file formats

A.1. General remarks

Documents, annotations and labels can be imported (exported) from (to) several simple formats, described below. In all cases, the format is deduced from the filename extension.

Several formats are supported to facilitate integrating labelbuddy in your project. For example, if you work with spreadsheet software such as Microsoft Excel or LibreOffice Calc, you may want to use the CSV format. If your documents are in XML you can easily transform them into the labelbuddy XML format with XSLT. Of course, you can import data from several files in any formats into the same database. If you are not sure which format to choose, use JSONL for documents and JSON for labels.

The examples shown in this section can be found as separate files on the labelbuddy examples page. Annotations are imported or exported together with information about the document to which they belong, so they are discussed together in the next section. However note that it is possible to export document metadata and annotations without the documents' text to save storage space.

At the moment, labelbuddy expects all the files it imports to be encoded with UTF-8 (or UTF-16, or UTF-32). Note that ASCII files are also valid UTF-8. One exception is that if you use the XML format, you can specify a different encoding in the XML prolog.

All files produced by labelbuddy are UTF-8 encoded.

The import and export formats of labelbuddy are the same, meaning that once you have exported documents (and annotations) or labels to a file, you can always import that file back into any labelbuddy database. Some formats are also compatible with doccano; see Section A.4.

Note that the content (text) of the documents must be in plain text. Other formats will need to be converted to text, for example using pdftotext if they are in PDF, or pandoc (specifying the target format as "plain") for many other formats.

A.2. File formats for documents and annotations

Documents and annotations are kept together in imported or exported files. When importing new documents (and possibly their annotations), each document will have a text attribute containing its content. For documents that are already in the database, it is possible to import new annotations from a file where the text is omitted, if the documents have the utf8_text_md5_checksum attribute instead. The associated value must be the hexadecimal representation of the MD5 checksum of the (UTF-8 encoded) text of the document. Typically, you will not add this attribute yourself. It is added by labelbuddy when exporting documents and annotations. It allows you to export annotations without the document’s text (or titles), and then import them into a database that already contains the document.
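To match exported annotations with texts you hold elsewhere, you can recompute the checksum yourself. The sketch below follows the description above (MD5 of the UTF-8 encoded text, in hexadecimal) using Python's standard `hashlib`. The function name is chosen here for illustration only.

```python
import hashlib


def utf8_text_md5_checksum(text):
    """Hexadecimal MD5 digest of the UTF-8 encoding of a document's text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def index_by_checksum(texts):
    """Map each checksum to its text, to look up documents from exported files."""
    return {utf8_text_md5_checksum(text): text for text in texts}
```

With such an index, the `utf8_text_md5_checksum` value found in a file exported without text can be used to retrieve the original document. The text must be exactly the one that was imported, including trailing newlines, for the checksums to match.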

When exporting documents and annotations, the document’s metadata (meta and id, if you provided them), and utf8_text_md5_checksum are always included. You can choose not to include the document’s text (in which case short_title and long_title are not included either). This is useful to export annotations without storing the text several times. Exported annotations can be linked to the original text using the MD5 checksum, or information you stored in the documents’ id or meta. You can also choose not to include the annotations, and export only the documents.

Here is a full list of attributes a document can have:

`text`	The text (content) of the document
`meta`	A mapping of user-defined metadata. You can use it to associate some information with the document, for example an identifier, DOI, author… This data is not used by labelbuddy. It is stored and bundled with the document when you export it.
`id`	A string attached to the document — a simpler alternative to `meta` if you only need a way to identify exported documents. Just as `meta`, `id` is stored but not used internally by labelbuddy.
`short_title`	Displayed in the “Annotate” tab when annotating the document.
`long_title`	Displayed in the document list in the “Dataset” tab.
`labels`	A list of annotations.

A document can have a labels attribute, which is a list of annotations (the name is labels and not annotations for compatibility with doccano). Each annotation is a list of 3 or 4 elements (the last one, extra_data, is optional):

`start_char`	the position of the first character (Unicode code point), starting from 0 at the beginning of the text
`end_char`	the position of one past the last character
`label`	the label name
`extra_data`	additional text associated with the annotation.

For example if the text starts with “hello 😀” and you highlighted exactly that phrase, and labelled it with label_1, the associated annotation will be [0, 7, "label_1"]. If you also typed “some more info” in the “Extra annotation data” box, the annotation will be [0, 7, "label_1", "some more info"]. Here, a “character” means a Unicode code point. Note this does not correspond to byte offsets in the encoded text, nor necessarily to graphemes in the displayed text — a single grapheme may be represented by several Unicode characters, for example when combining characters such as diacritical marks are used.
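The difference between code points, bytes and UTF-16 code units matters when annotations are consumed by other tools. The sketch below illustrates it in Python, whose `str` type is indexed by code point, so `start_char` and `end_char` can be used directly as slice bounds. The helper `char_span_to_utf8_bytes` is hypothetical and shows one way to convert to byte offsets if a downstream tool needs them.

```python
text = "hello 😀 world"
start_char, end_char, label = 0, 7, "label_1"

# Python strings are sequences of code points, so slicing works directly.
selected = text[start_char:end_char]  # "hello 😀"

# The same span is longer when measured in encoded units.
utf8_length = len(selected.encode("utf-8"))          # 6 ASCII bytes + 4 for the emoji
utf16_length = len(selected.encode("utf-16-le")) // 2  # 6 units + a surrogate pair


def char_span_to_utf8_bytes(text, start_char, end_char):
    """Convert code point offsets into byte offsets in the UTF-8 encoded text."""
    byte_start = len(text[:start_char].encode("utf-8"))
    byte_end = len(text[:end_char].encode("utf-8"))
    return byte_start, byte_end
```

In languages whose strings are indexed by UTF-16 code units (such as JavaScript or Java), the offsets need a similar conversion before they can be used for slicing.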

Below, two examples are shown for each format (except .txt which has only one). The first illustrates a file that you may create from your data to import it into labelbuddy. The second one is a file exported from labelbuddy. It is also a valid input file that can be imported into labelbuddy, and it additionally contains annotations.

There is no exported example for .txt because this format is only available for importing.

A.2.1. Import only: plain text (`.txt`)

The simplest format you can use is .txt. In this case, the file must contain the text of one document per line. The newlines that separate documents are not considered part of the document and are discarded.

While convenient, this format has some limitations: you cannot specify any other document attributes than the text, and the documents cannot contain newlines. The other import formats do not have these limitations.

Here is an example of a .txt file we can use to import documents into a labelbuddy database:

Importing 2 documents from a .txt file

the text of document 1 some text the end
the text of document 2 more text the end

Exporting to .txt is not supported.

A.2.2. JSON (`.json`)

When using this format, the import or export file is a JSON file containing one JSON array. Each element of the array is a JSON object representing a document and its attributes. If provided, meta must be a JSON object (mapping) containing user data about the document.

Here is an example of a .json file we can use to import documents into a labelbuddy database:

Importing 2 documents from a .json file.

[
  {
    "text": "the text of document 1\nsome text\nthe end\n"
  },
  {
    "text": "the text of document 2\nmore text\nthe end\n",
    "short_title": "title 2",
    "long_title": "the title of document 2",
    "meta": {
      "id": "doc-2",
      "source": "example.org"
    }
  }
]

Note that short_title, long_title, meta are optional — and indeed the first document does not provide them.

Below are the same documents, exported by labelbuddy after being annotated. Each document is always on a separate line. This makes it easy to read the file incrementally. Moreover, as the documents are always in the same order (the order in which they were imported), this gives line-oriented tools such as diff or git a better chance of producing useful output.

Another example of 2 documents in a .json file, with annotations (exported by labelbuddy)

[
{"labels":[[4,8,"Word"],[21,22,"Number","1"]],"meta":{},"text":"the text of document 1\nsome text\nthe end\n","utf8_text_md5_checksum":"872edcd008dee45d894d5d3c9143f96b"},
{"labels":[[12,20,"Word"]],"long_title":"the title of document 2","meta":{"id":"doc-2","source":"example.org"},"short_title":"title 2","text":"the text of document 2\nmore text\nthe end\n","utf8_text_md5_checksum":"c27b0dbfabf831afaf8bafa0c727cf97"}
]

Here, the text and titles are included, but labelbuddy provides the option to omit them. Note that as the export format is the same as the import format, this file can be imported into a labelbuddy database.

For large files (containing many documents or large documents), the JSONL format described in the next section is very similar to JSON but more memory-efficient.

A.2.3. JSON Lines (`.jsonl`)

When importing a JSON file the whole file is read into memory before inserting the documents in the database. To read documents one by one and reduce memory usage, you can use JSON Lines format. It is very similar to the JSON format, but instead of having one JSON array, the file must contain one JSON document per line. For example:

Importing 2 documents from a .jsonl file.

{"text": "the text of document 1\nsome text\nthe end\n"}
{"text": "the text of document 2\nmore text\nthe end\n", "short_title": "title 2", "long_title": "the title of document 2", "meta": {"id": "doc-2", "source": "example.org"}}

(We can see that the first document omits the optional short_title, long_title and meta.)

It is also possible to export documents and annotations to a .jsonl file:

Another example of 2 documents in a .jsonl file, with annotations (exported by labelbuddy)

{"labels":[[4,8,"Word"],[21,22,"Number","1"]],"meta":{},"text":"the text of document 1\nsome text\nthe end\n","utf8_text_md5_checksum":"872edcd008dee45d894d5d3c9143f96b"}
{"labels":[[12,20,"Word"]],"long_title":"the title of document 2","meta":{"id":"doc-2","source":"example.org"},"short_title":"title 2","text":"the text of document 2\nmore text\nthe end\n","utf8_text_md5_checksum":"c27b0dbfabf831afaf8bafa0c727cf97"}

Note that JSON Lines is not valid JSON. If you exported your annotations in a JSON Lines file, when parsing it each line must be parsed separately as a JSON document — not the whole file.
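As a minimal sketch of that line-by-line parsing, the following Python generator reads exported JSON Lines and yields the label, the annotated text and the extra data of each annotation. The function name `iter_annotated_spans` is hypothetical. The two sample lines are the exported example above.

```python
import json

EXPORTED_LINES = [
    r'{"labels":[[4,8,"Word"],[21,22,"Number","1"]],"meta":{},"text":"the text of document 1\nsome text\nthe end\n","utf8_text_md5_checksum":"872edcd008dee45d894d5d3c9143f96b"}',
    r'{"labels":[[12,20,"Word"]],"long_title":"the title of document 2","meta":{"id":"doc-2","source":"example.org"},"short_title":"title 2","text":"the text of document 2\nmore text\nthe end\n","utf8_text_md5_checksum":"c27b0dbfabf831afaf8bafa0c727cf97"}',
]


def iter_annotated_spans(lines):
    """Yield (label, annotated text, extra_data) for each annotation."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each line is parsed on its own: the file as a whole is not valid JSON.
        doc = json.loads(line)
        text = doc.get("text")  # absent if exported without the text
        for annotation in doc.get("labels", []):
            start_char, end_char, label = annotation[:3]
            extra_data = annotation[3] if len(annotation) > 3 else None
            span = text[start_char:end_char] if text is not None else None
            yield label, span, extra_data


spans = list(iter_annotated_spans(EXPORTED_LINES))
```

An open file can be passed instead of the list, for example `open("my-annotated-docs.jsonl", encoding="utf-8")`, since iterating over a file yields its lines one at a time without loading the whole file in memory.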

A.2.4. XML (`.xml`)

If your source documents are stored in XML, it may be easier to transform them to labelbuddy's XML format than to JSON. A RELAX NG schema is provided on the labelbuddy XML schema page. The root element must be document_set and contain any number of document elements. Each document contains children that have the names of the attributes described above: text, short_title, etc. User metadata is provided in the attributes of an element named meta or by using the id element’s text. A document’s children can appear in any order. The children of an annotation must respect the order: start_char, end_char, label, and optionally extra_data.

For example:

Importing 2 documents from a .xml file.

<?xml version='1.0' encoding='utf-8'?>
<document_set>
  <document>
    <text>the text of document 1
some text
the end
</text>
  </document>
  <document>
    <meta id="doc-2" source="example.org"/>
    <text>the text of document 2
more text
the end
</text>
    <short_title>title 2</short_title>
    <long_title>the title of document 2</long_title>
  </document>
</document_set>

Here are the same documents exported with some annotations:

Another example of 2 documents in a .xml file, with annotations (exported by labelbuddy)

<?xml version="1.0" encoding="UTF-8"?>
<document_set>
    <document>
        <utf8_text_md5_checksum>872edcd008dee45d894d5d3c9143f96b</utf8_text_md5_checksum>
        <meta/>
        <text>the text of document 1
some text
the end
</text>
        <labels>
            <annotation>
                <start_char>4</start_char>
                <end_char>8</end_char>
                <label>Word</label>
            </annotation>
            <annotation>
                <start_char>21</start_char>
                <end_char>22</end_char>
                <label>Number</label>
                <extra_data>1</extra_data>
            </annotation>
        </labels>
    </document>
    <document>
        <utf8_text_md5_checksum>c27b0dbfabf831afaf8bafa0c727cf97</utf8_text_md5_checksum>
        <meta id="doc-2" source="example.org"/>
        <short_title>title 2</short_title>
        <long_title>the title of document 2</long_title>
        <text>the text of document 2
more text
the end
</text>
        <labels>
            <annotation>
                <start_char>12</start_char>
                <end_char>20</end_char>
                <label>Word</label>
            </annotation>
        </labels>
    </document>
</document_set>
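
An exported file such as this one can be read back with Python's standard library. The sketch below assumes, consistently with the example above, that start_char and end_char count characters from 0 and that end_char is exclusive; the function name read_annotations is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_annotations(xml_text):
    """Yield (document id, label, annotated text, extra data) tuples."""
    root = ET.fromstring(xml_text)
    for document in root.iter("document"):
        text = document.findtext("text", default="")
        meta = document.find("meta")
        doc_id = meta.get("id") if meta is not None else None
        for annotation in document.iter("annotation"):
            start = int(annotation.findtext("start_char"))
            end = int(annotation.findtext("end_char"))
            yield (
                doc_id,
                annotation.findtext("label"),
                text[start:end],
                annotation.findtext("extra_data"),  # None when absent
            )


SAMPLE = """<document_set>
  <document>
    <meta id="doc-2" source="example.org"/>
    <text>the text of document 2
more text
the end
</text>
    <labels>
      <annotation>
        <start_char>12</start_char>
        <end_char>20</end_char>
        <label>Word</label>
      </annotation>
    </labels>
  </document>
</document_set>"""

print(list(read_annotations(SAMPLE)))
```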

Caveats when importing from Text, JSON, JSONL or CSV and exporting to XML

In the XML format, the document’s metadata is stored as attributes of the meta element. This means the values in meta must be literal values (ie not arrays or objects). Therefore, if you plan to export documents and annotations to XML, do not use arrays or objects for values in the meta attribute of documents imported from other formats. For example, if you import a document from JSON with meta equal to {"authors": ["author 1", "author 2"], "id": "doc-1"}, "authors" will not be correctly exported to XML (but "id" will, and both would be correctly exported to other formats than XML).

JSON and other formats can represent some characters that are invalid in XML. Invalid characters (for example form feed, 0xC) are not allowed to appear at all in XML (unlike special characters such as < that simply need to be escaped), and they will not be written to the output. Thus if you imported documents from another format and there is a chance that they contain invalid XML characters, it is safer to choose a format other than XML for the export, so that no characters are omitted from the output.
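
Both caveats can be checked before choosing an export format. The sketch below is not part of labelbuddy: the function name xml_export_problems and the dictionary layout of the document are assumptions. The character ranges are those allowed by the XML 1.0 specification.

```python
import re

# Matches any character outside the ranges allowed by XML 1.0:
# tab, line feed, carriage return, U+0020-U+D7FF, U+E000-U+FFFD,
# U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_export_problems(document):
    """Return reasons why `document` (a dict) may lose data when exported to XML."""
    problems = []
    for key, value in document.get("meta", {}).items():
        # Arrays and objects cannot be stored in an XML attribute.
        if isinstance(value, (list, dict)):
            problems.append(f"meta[{key!r}] is not a literal value")
    if _INVALID_XML_CHARS.search(document.get("text", "")):
        problems.append("text contains characters that are invalid in XML")
    return problems
```

An empty list means that neither problem was found in the document's text and metadata; the check does not look at titles or at the content of metadata strings.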

A.2.5. CSV (`.csv`)

Documents and annotations can also be imported and exported using CSV files.

The file must be in the default format produced by most tools. More precisely, it must respect the conventions described in rfc4180: , is the separator, fields that contain newlines (\r\n or \n) or double quotes (") must be enclosed in double quotes, and " must be escaped by doubling it (replacing it with ""). Line endings can be either \r\n or \n (in files exported by labelbuddy they will always be \r\n, as specified by rfc4180). This is the default CSV format used by tools such as Microsoft Excel, LibreOffice Calc, Google Sheets, write.csv in R, the csv Python standard library module or DataFrame.to_csv in the Pandas library, etc. so most likely you will not need to configure anything.

The file should be encoded with UTF-8 (or UTF-16, or UTF-32).

The CSV file must start with a header row and the column names must correspond to the document attributes described above (other columns will be ignored). If present (and not empty), meta must be the JSON serialization of the metadata (eg {"key": "value"}), which is why it may be more convenient to rely only on id instead, as shown below.

Here is an example:

Importing 2 documents from a .csv file.

text,short_title,long_title,id
"the text of document 1
some text
the end
",,,
"the text of document 2
more text
the end
",title 2,the title of document 2,doc-2

And here is the same file represented as a table:

Table 1. Importing 2 documents from a `.csv` file.
text	short_title	long_title	id
the text of document 1 some text the end
the text of document 2 more text the end	title 2	the title of document 2	doc-2
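
A file in this format can be produced with the csv module of the Python standard library, whose default settings follow the conventions listed above (comma separator, double quotes doubled, \r\n line endings). This is a minimal sketch; the function name documents_to_csv and the dictionary layout of the input are assumptions.

```python
import csv
import io


def documents_to_csv(documents):
    """Serialize document dicts to CSV text with a header row."""
    columns = ["text", "short_title", "long_title", "id"]
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for document in documents:
        # Missing attributes become empty fields.
        writer.writerow({name: document.get(name, "") for name in columns})
    return buffer.getvalue()
```

When writing directly to a file instead, open it with newline="" and encoding="utf-8" so that Python does not translate the line endings.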

When documents contain annotations, their representation is slightly different than in the JSON, JSONL or XML formats. Documents do not have a labels attribute. Instead, each row corresponds to one annotation and contains the attributes of both the document and the annotation, as shown in the examples below. To avoid storing multiple copies of the documents’ text and titles, it is recommended to export the text and annotations separately. That is,

export the documents with the text but without the annotations,
then export to a different file without the text but with the annotations.

The two resulting tables can then be joined on the utf8_text_md5_checksum column (or on the id column if you provided a distinct id for each document when importing them).
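
The join can be sketched with the Python standard library as follows. The function name join_annotations is hypothetical, and the sketch assumes that both inputs are already decoded CSV text, that the first file has the columns utf8_text_md5_checksum and text, and that the second has utf8_text_md5_checksum, start_char, end_char and label.

```python
import csv
import io


def join_annotations(documents_csv, annotations_csv):
    """Return (label, annotated text) pairs by joining on the checksum column."""
    # Map each checksum to the text of the corresponding document.
    texts = {
        row["utf8_text_md5_checksum"]: row["text"]
        for row in csv.DictReader(io.StringIO(documents_csv, newline=""))
    }
    joined = []
    for row in csv.DictReader(io.StringIO(annotations_csv, newline="")):
        text = texts[row["utf8_text_md5_checksum"]]
        start, end = int(row["start_char"]), int(row["end_char"])
        joined.append((row["label"], text[start:end]))
    return joined
```

The same join can be done with any tool that merges tables on a key column, such as a spreadsheet lookup or a data frame merge.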

Another example of 2 documents in a .csv file, with annotations (exported by labelbuddy)

ignore_this_column,utf8_text_md5_checksum,meta,id,short_title,long_title,text,start_char,end_char,label,extra_data
,872edcd008dee45d894d5d3c9143f96b,{},,,,"the text of document 1
some text
the end
",4,8,Word,
,872edcd008dee45d894d5d3c9143f96b,{},,,,"the text of document 1
some text
the end
",21,22,Number,1
,c27b0dbfabf831afaf8bafa0c727cf97,{},doc-2,title 2,the title of document 2,"the text of document 2
more text
the end
",12,20,Word,

Here is the same file presented as a table (the empty first column is omitted):

utf8_text_md5_checksum	meta	id	short_title	long_title	text	start_char	end_char	label	extra_data
872edcd008dee45d894d5d3c9143f96b	{}				the text of document 1 some text the end	4	8	Word
872edcd008dee45d894d5d3c9143f96b	{}				the text of document 1 some text the end	21	22	Number	1
c27b0dbfabf831afaf8bafa0c727cf97	{}	doc-2	title 2	the title of document 2	the text of document 2 more text the end	12	20	Word

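`ignore_this_column`	the first column of a CSV file exported by labelbuddy. It is empty and can be ignored.
`utf8_text_md5_checksum`	a checksum of the document's text. It can be used to join tables as described above.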

A.3. File formats for labels

Labels can have the following attributes:

`text`	the label’s name, mandatory
`color`	A hexadecimal color string, such as `"#aec7e8"`, or an SVG color name, such as `"yellow"`. If missing, a default color will be assigned to the label.
`shortcut_key`	A lower-case ASCII letter (a-z) that helps to quickly annotate text with that label. If missing or already used by another label, the label will not have any `shortcut_key`.

color and shortcut_key are optional. Both can be set from within the labelbuddy GUI.

As for documents, each section below except for .txt shows two examples: a simple manually created labels file that can be imported, and a file containing labels exported by labelbuddy. Note that the exported file is in the same format and can therefore also be imported into a labelbuddy database.

A.3.1. Import only: plain text (`.txt`)

Labels can be imported from a text file containing one label name per line (end-of-line characters are discarded):

Importing 2 labels from a .txt file

Word
Number

This format does not allow specifying color or shortcut_key. Exporting labels to .txt is not possible.

A.3.2. JSON (`.json`)

When labels are stored in a JSON file, the file must contain a JSON array containing one JSON object per label. Each label’s object must have the key text and optionally color and shortcut_key. For compatibility with doccano, color and shortcut_key can also be specified as background_color and suffix_key. If both are present, color has higher precedence than background_color and shortcut_key has higher precedence than suffix_key.

For example:

Importing 2 labels from a .json file

[
  {
    "text": "Word"
  },
  {
    "text": "Number",
    "shortcut_key": "n",
    "color": "orange"
  }
]

When exporting labels to JSON, both keys color and background_color are set to the label’s color. Both keys shortcut_key and suffix_key are set to the label’s shortcut key if it has one. Unlike documents, each exported label is not on a single line:

Another example of 2 labels in a .json file (exported by labelbuddy)

[
    {
        "background_color": "#aec7e8",
        "color": "#aec7e8",
        "text": "Word"
    },
    {
        "background_color": "#ffa500",
        "color": "#ffa500",
        "shortcut_key": "n",
        "suffix_key": "n",
        "text": "Number"
    }
]
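
The precedence between the labelbuddy keys and their doccano aliases, described at the start of this section, can be illustrated in a few lines of Python. This is not labelbuddy's implementation; the function name normalize_label is hypothetical, and empty or null values are treated as missing.

```python
def normalize_label(label):
    """Resolve doccano key aliases in a label object (a dict)."""
    normalized = {"text": label["text"]}
    # color takes precedence over background_color.
    color = label.get("color") or label.get("background_color")
    # shortcut_key takes precedence over suffix_key.
    shortcut_key = label.get("shortcut_key") or label.get("suffix_key")
    if color:
        normalized["color"] = color
    if shortcut_key:
        normalized["shortcut_key"] = shortcut_key
    return normalized
```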

A.3.3. JSON Lines (`.jsonl`)

In the JSON Lines format, each line in the file contains a JSON document: the JSON object representing a label. For example:

Importing 2 labels from a .jsonl file.

{"text": "Word"}
{"text": "Number", "shortcut_key": "n", "color": "orange"}

And here are the same labels exported by labelbuddy:

Another example of 2 labels in a .jsonl file (exported by labelbuddy)

{"background_color":"#aec7e8","color":"#aec7e8","text":"Word"}
{"background_color":"#ffa500","color":"#ffa500","shortcut_key":"n","suffix_key":"n","text":"Number"}
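
As an illustration, the following Python sketch writes and reads label files in this format with the standard json module. The function names are hypothetical.

```python
import json


def labels_to_jsonl(labels):
    """Serialize label dicts to JSON Lines text: one JSON object per line."""
    return "".join(json.dumps(label) + "\n" for label in labels)


def labels_from_jsonl(text):
    """Parse JSON Lines text into a list of label dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Because json.dumps escapes newlines inside strings, each label is guaranteed to occupy exactly one line.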

A.3.4. XML (`.xml`)

Labels can also be stored in XML. As for documents, a RELAX NG schema is provided on the labelbuddy XML schema page. Here is an example:

Importing 2 labels from a .xml file.

<?xml version='1.0' encoding='utf-8'?>
<label_set>
  <label>
    <text>Word</text>
  </label>
  <label>
    <text>Number</text>
    <shortcut_key>n</shortcut_key>
    <color>orange</color>
  </label>
</label_set>

Labels exported by labelbuddy:

Another example of 2 labels in a .xml file (exported by labelbuddy)

<?xml version="1.0" encoding="UTF-8"?>
<label_set>
    <label>
        <text>Word</text>
        <color>#aec7e8</color>
    </label>
    <label>
        <text>Number</text>
        <color>#ffa500</color>
        <shortcut_key>n</shortcut_key>
    </label>
</label_set>

Caveat when exporting labels to XML

The same caveat as for documents applies: if the label names contain invalid characters (for example form feed, 0xC), they cannot be correctly exported to XML.

A.3.5. CSV (`.csv`)

Labels can also be imported and exported from and to CSV. The CSV format is the same as for documents, ie the one described in rfc4180: , is the separator, fields that contain newlines (\r\n or \n) or double quotes (") must be enclosed in double quotes, and " must be escaped by doubling it (replacing it with ""). Line endings can be either \r\n or \n (they are always \r\n in CSV files created by labelbuddy). This is the default format used by many tools including Microsoft Excel, LibreOffice Calc, and Google Sheets.

Here is an example:

Importing 2 labels from a .csv file

text,shortcut_key,color
Word,,
Number,n,orange

Here is the same file presented as a table:

Table 2. Importing 2 labels from a `.csv` file
text	shortcut_key	color
Word
Number	n	orange

As for documents, when exporting labels to CSV the first column is empty and can be ignored. This allows inserting a Byte Order Mark (BOM), so that tools such as Microsoft Excel or LibreOffice Calc that need it to detect UTF-8 can parse the file correctly, while at the same time ensuring that the file will also be parsed correctly by tools that do not expect or do not skip the BOM. For example:

Another example of 2 labels in a .csv file (exported by labelbuddy)

ignore_this_column,text,color,shortcut_key
,Word,#aec7e8,
,Number,#ffa500,n

Here is the same file presented as a table:

Table 3. Another example of 2 labels in a `.csv` file (exported by **labelbuddy**)
ignore_this_column	text	color	shortcut_key
	Word	#aec7e8
	Number	#ffa500	n
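
The role of the empty first column can be seen when reading an exported file in Python. The sketch below assumes that the file starts with a UTF-8 BOM; the function name read_exported_labels is hypothetical. Decoding with "utf-8-sig" removes the BOM. Even if the BOM were left in place (for example by decoding with plain "utf-8"), it would only alter the name of the ignored column, so the text, color and shortcut_key columns would still be found.

```python
import csv
import io


def read_exported_labels(raw_bytes):
    """Parse the bytes of an exported labels CSV file into a list of dicts."""
    text = raw_bytes.decode("utf-8-sig")  # drops the BOM if there is one
    labels = []
    for row in csv.DictReader(io.StringIO(text, newline="")):
        labels.append(
            {
                name: value
                for name, value in row.items()
                # Skip the placeholder column and empty fields.
                if not name.endswith("ignore_this_column") and value != ""
            }
        )
    return labels
```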

A.4. Compatibility with doccano

Labels exported from labelbuddy in the JSON format can be imported into doccano. Labels exported from doccano and saved in a .json file can be imported into labelbuddy.

Documents and annotations exported from doccano can also be imported into a labelbuddy database. To do so, when exporting from doccano select the format “jsonl (text label)”. Make sure to save them in a file with the .jsonl extension (not .json, otherwise labelbuddy will try to parse it as JSON and JSON Lines is not valid JSON).

Caveats when importing doccano documents and annotations into labelbuddy

doccano strips leading and trailing whitespace from documents when importing them. Therefore if you import the result into a labelbuddy database that already contains the original documents, the stripped documents may not be recognized as being the same as the originals (labelbuddy doesn’t modify the imported documents) and you might end up with (near) duplicate documents in the database.

Annotations exported from labelbuddy in the .jsonl format together with the document’s text can also be imported into doccano (selecting the “jsonl” import format).

Caveats when importing labelbuddy documents and annotations into doccano

If the original document contained leading whitespace, labelbuddy annotations will appear shifted when doccano removes the whitespace. Moreover, doccano allows duplicate documents so if the documents were already in the doccano database, they will appear as new (duplicate) documents rather than new annotations for existing documents.

If you used labelbuddy features that do not exist in doccano, you will not be able to import the resulting annotations into doccano:

If you have attached extra data to the annotations, doccano will not recognize the format and will not import the annotations.
doccano does not allow overlapping annotations. Therefore if you try to import overlapping annotations (created with labelbuddy) into doccano the results will be incorrect; annotated text will appear duplicated and jumbled.
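
These caveats can be checked before exporting annotations for doccano. The following Python sketch is not part of labelbuddy or doccano. It assumes that each annotation is a list of the form [start_char, end_char, label], optionally followed by extra data, and that Python's notion of whitespace matches what doccano strips. The function name doccano_import_problems is hypothetical.

```python
def doccano_import_problems(document):
    """List reasons why `document` (a dict) may not import cleanly into doccano."""
    problems = []
    text = document["text"]
    annotations = sorted(document.get("labels", []), key=lambda a: (a[0], a[1]))
    if text != text.lstrip():
        problems.append("leading whitespace: annotations will appear shifted")
    if any(len(annotation) > 3 for annotation in annotations):
        problems.append("extra data attached to an annotation")
    # Annotations are sorted by start position, so two of them overlap
    # exactly when one starts before the previous one has ended.
    previous_end = 0
    for start, end, *_ in annotations:
        if start < previous_end:
            problems.append("overlapping annotations")
            break
        previous_end = end
    return problems
```

An empty list means that none of the three problems was detected.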
