Turin University Treebank

Treebanks: data sets and download

TUT

TUT is a morpho-syntactically annotated collection of Italian sentences. It includes texts from different genres and domains and is released in several annotation formats.

Size

section        sentences  tokens in TUT format  tokens in CoNLL format
CODICECIVILE   1,100      30,669                28,048
NEWS           700        19,134                18,044
VEDCH          400        14,390                12,508
EUDIR          201        7,955                 7,426
WIKI           459        15,766                14,746
COSTITA        687        15,643

Licence for TUT

Turin University Treebank (TUT) by Cristina Bosco, Leonardo Lesmo, Vincenzo Lombardo, Alessandro Mazzei and Livio Robaldo is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.

Formats of TUT

raw texts
parsed sentences in native dependency TUT format
parsed sentences TUT-structured, but with the annotation expressed in the CoNLL format
parsed sentences in the Penn Treebank phrase structure format, adapted for Italian
parsed sentences in CCG format (see the CCG-TUT web page).
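
As an illustration of how the CoNLL-style files listed above might be read, here is a minimal Python sketch. It assumes the ten tab-separated columns of the CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and blank lines between sentences; whether the TUT files follow exactly this layout is an assumption and should be checked against the downloaded data. The sample sentence and its tags are invented for the example and are not taken from TUT.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CoNLL-X layout. ASSUMPTION: the TUT CoNLL files use
# these ten tab-separated columns; verify against the actual data.
CONLL_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        sentence.append(dict(zip(CONLL_FIELDS, values)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented sample; the tags are placeholders, not the TUT tagset.
sample = (
    "1\tIl\til\tART\tART\t_\t2\tdet\t_\t_\n"
    "2\tgatto\tgatto\tNOUN\tNOUN\t_\t3\tsubj\t_\t_\n"
    "3\tdorme\tdormire\tVERB\tVERB\t_\t0\troot\t_\t_\n"
)

sentences = list(read_conll(sample.splitlines()))
for token in sentences[0]:
    print(token["id"], token["form"], "->", token["head"], token["deprel"])
```

To read a downloaded file instead of the sample, pass an open file object, for example `read_conll(open(path, encoding="utf-8"))`, where `path` is whatever location the archive was extracted to.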

Documents about TUT

Linguistic notes: a detailed description of all the syntactic structures of the treebank
Syntactic categories: the Part of Speech tagset of the TUT corpus
Labels of the edges: the list of the grammatical relations labeling the dependency edges of TUT
Parallel annotations in TUT formats: a rich selection of examples available in the native TUT format, in ConsTUT and in Penn format (with comments in Italian only)

Download TUT

The available data can be downloaded from the following table (as .zip archives):

section        raw text      native TUT    TUT in CoNLL  TUT-Penn     CCG-TUT
NEWS           NEWS.raw      NEWS.tut      NEWS.conl     NEWS.penn    CCG-TUT
VEDCH          VEDCH.raw     VEDCH.tut     VEDCH.conl    VEDCH.penn
CODICECIVILE   CODCIV.raw    CODCIV.tut    CODCIV.conl   CODCIV.penn
EUDIR          EUDIR.raw     EUDIR.tut     EUDIR.conl    EUDIR.penn
WIKI           WIKI.raw      WIKI.tut      WIKI.conl     WIKI.penn
COSTITA        COSTITA.raw   COSTITA.tut

ParallelTUT

ParallelTUT (ParTUT) is a morpho-syntactically annotated collection of Italian/French/English parallel sentences. It includes texts from different sources, representing different genres and domains, and is released in several formats.

Size

corpus      sentences  tokens
JRCAcquis   540        19,268
UDHR        230        6,791
CC          291        9,241
FB          341        5,697
Europarl    1,505      43,479
WIT3        287        4,715
WIKI        1,168      28,044
PS          1,859      32,835
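
The figures in the size table can be aggregated with a few lines of Python. The sketch below copies the sentence and token counts from the table and computes the overall totals and the average sentence length per corpus; the dictionary name and the helper function are introduced here for illustration only.

```python
from typing import Dict, Tuple

# (sentences, tokens) per ParTUT corpus, copied from the size table.
PARTUT_SIZES: Dict[str, Tuple[int, int]] = {
    "JRCAcquis": (540, 19268),
    "UDHR": (230, 6791),
    "CC": (291, 9241),
    "FB": (341, 5697),
    "Europarl": (1505, 43479),
    "WIT3": (287, 4715),
    "WIKI": (1168, 28044),
    "PS": (1859, 32835),
}


def summarize(sizes: Dict[str, Tuple[int, int]]):
    """Return total sentences, total tokens and tokens per sentence by corpus."""
    total_sentences = sum(s for s, _ in sizes.values())
    total_tokens = sum(t for _, t in sizes.values())
    avg_len = {name: t / s for name, (s, t) in sizes.items()}
    return total_sentences, total_tokens, avg_len


total_sentences, total_tokens, avg_len = summarize(PARTUT_SIZES)
print(f"total: {total_sentences} sentences, {total_tokens} tokens")
for name, length in sorted(avg_len.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {length:5.1f} tokens per sentence")
```

With the counts listed in the table, the totals come to 6,221 sentences and 150,070 tokens.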

Licence of ParTUT

ParTUT by Manuela Sanguinetti and Cristina Bosco is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Formats of ParTUT

All the data currently included in the treebank are in UTF-8 encoding and are delivered in various formats:

The raw input texts
The parsed sentences in the standard native dependency TUT annotation
The parsed sentences TUT-structured, but with the annotation expressed in the CoNLL format
The parsed sentences in the Penn Treebank phrase structure format. The conversion has been obtained by means of a script that is described at: http://www.di.unito.it/~tutreeb/TUTtoPENNconverter/
The parsed sentences in the TIGER-XML format
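
To give an idea of what working with the Penn Treebank phrase structure format involves, the following Python sketch parses a bracketed tree string into nested tuples and extracts its words. It is a generic reader for Penn-style bracketing, not the converter mentioned above, and it does not handle words that themselves contain parentheses. The example tree and its labels are invented and are not taken from ParTUT.

```python
from typing import List, Tuple, Union

# A tree is either a word (str) or a pair (label, children).
Tree = Union[str, Tuple[str, list]]


def parse_tree(text: str) -> Tree:
    """Parse one Penn-style bracketed tree, e.g. "(S (NP ...) (VP ...))"."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse() -> Tree:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        # An opening bracket directly followed by another one has no label
        # (some Penn-style files wrap each tree in an unlabeled bracket).
        label = ""
        if pos < len(tokens) and tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children: list = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ')'
        return (label, children)

    tree = parse()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Return the words of the tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1]:
        words.extend(leaves(child))
    return words


# Invented example; the labels are placeholders, not the TUT-Penn tagset.
example = "(S (NP (ART Il) (NOUN gatto)) (VP (VERB dorme)))"
tree = parse_tree(example)
print(tree[0], leaves(tree))
```

Nested tuples keep the sketch free of dependencies; a dedicated treebank library would be the more practical choice for real processing.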

Download ParTUT

The available data can be downloaded from the following table:

corpus      raw texts  native TUT  TUT-CoNLL  TUT-Penn  TUT-Tiger

JRCAcquis   Ita It Ita Ita Ita
            En En En
            Fr Fr Fr Fr Fr
UDHR        It It It It It
            En En En En En
            Fr Fr Fr Fr Fr
CC          It It It It It
            En En En En
            Fr Fr Fr Fr Fr
FB          It It It It
            En En En En
            Fr Fr Fr Fr
Europarl    It It It
            En En En
            Fr Fr Fr Fr
WIT3        It It It It It
            En En En
            Fr Fr Fr
PS          It It It
            En En En
            Fr to do to do
WIKI        It It It
            En En En
            to do to do to do

See publications for more information about each corpus, and annotations for a description of the annotation format applied in each corpus.