Treebanks: data sets and download

  • TUT

    is a morpho-syntactically annotated collection of Italian sentences covering several text genres and domains, released in several annotation formats.

    Size

    section       sentences  tokens (TUT format)  tokens (CoNLL format)
    CODICECIVILE  1,100      30,669               28,048
    NEWS          700        19,134               18,044
    VEDCH         400        14,390               12,508
    EUDIR         201        7,955                7,426
    WIKI          459        15,766               14,746
    COSTITA       687        15,643
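The difference between the two token counts can be made explicit with a little arithmetic. The following is a minimal sketch that copies the figures from the size table and reports, per section, how many fewer tokens the CoNLL version contains. The names `SIZES` and `token_gap` are illustrative only, and COSTITA is left out because the table lists a single token count for it.

```python
# Figures copied from the size table:
# section -> (sentences, tokens in TUT format, tokens in CoNLL format).
# COSTITA is omitted because only one token count is listed for it.
SIZES = {
    "CODICECIVILE": (1100, 30669, 28048),
    "NEWS": (700, 19134, 18044),
    "VEDCH": (400, 14390, 12508),
    "EUDIR": (201, 7955, 7426),
    "WIKI": (459, 15766, 14746),
}


def token_gap(section):
    """Return TUT-format tokens minus CoNLL-format tokens for one section."""
    _, tut_tokens, conll_tokens = SIZES[section]
    return tut_tokens - conll_tokens


# Totals over the five sections that have both token counts.
total_sentences = sum(s for s, _, _ in SIZES.values())
total_tut = sum(t for _, t, _ in SIZES.values())
total_conll = sum(c for _, _, c in SIZES.values())

if __name__ == "__main__":
    for name in SIZES:
        print(f"{name}: {token_gap(name)} fewer tokens in CoNLL format")
    print(total_sentences, total_tut, total_conll)
```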

    Licence for TUT

    Turin University Treebank (TUT) by Cristina Bosco, Leonardo Lesmo, Vincenzo Lombardo, Alessandro Mazzei, Livio Robaldo is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.

    Formats of TUT

    • raw texts
    • parsed sentences in native dependency TUT format
    • parsed sentences TUT-structured, but with the annotation expressed in the CoNLL format
    • parsed sentences in the Penn Treebank phrase structure format, adapted for Italian
    • parsed sentences in CCG format (see CCG-TUT web page).
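As an illustration of working with the CoNLL version, the sketch below reads sentences from tab-separated lines. It assumes the common ten-column CoNLL-X layout with a blank line between sentences; whether the TUT files follow exactly this layout is an assumption and should be checked against the downloaded data. The sample sentence and its labels are invented for the example and are not taken from the TUT tagset.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CoNLL-X layout (assumed here, not confirmed for TUT files).
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented two-sentence sample; the labels are placeholders, not TUT labels.
SAMPLE = (
    "1\tIl\til\tART\tART\t_\t2\tDET\t_\t_\n"
    "2\tgatto\tgatto\tNOUN\tNOUN\t_\t3\tSUBJ\t_\t_\n"
    "3\tdorme\tdormire\tVERB\tVERB\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tPiove\tpiovere\tVERB\tVERB\t_\t0\tROOT\t_\t_\n"
)

sentences = list(read_conll(SAMPLE.splitlines()))

if __name__ == "__main__":
    for sent in sentences:
        print([(tok["form"], tok["head"], tok["deprel"]) for tok in sent])
```

For a real file, the same function can be fed an open file object, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to an extracted CoNLL file.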

    Documents about TUT

    • Linguistic notes: a detailed description of all the syntactic structures of the treebank
    • Syntactic categories: the Part of Speech tagset of the TUT corpus
    • Labels of the edges: the list of the grammatical relations labeling the dependency edges of TUT
    • Parallel annotations in TUT formats: a rich selection of examples available in the native TUT format, in ConsTUT and in Penn format (with comments in Italian only)

    Download TUT

    The available data can be downloaded as .zip archives from the following table:
    section       raw text     native TUT   TUT in CoNLL  TUT-Penn     CCG-TUT
    NEWS          NEWS.raw     NEWS.tut     NEWS.conl     NEWS.penn    CCG-TUT
    VEDCH         VEDCH.raw    VEDCH.tut    VEDCH.conl    VEDCH.penn
    CODICECIVILE  CODCIV.raw   CODCIV.tut   CODCIV.conl   CODCIV.penn
    EUDIR         EUDIR.raw    EUDIR.tut    EUDIR.conl    EUDIR.penn
    WIKI          WIKI.raw     WIKI.tut     WIKI.conl     WIKI.penn
    COSTITA       COSTITA.raw  COSTITA.tut
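Since the files are distributed as .zip archives, a downloaded archive has to be unpacked before use. The sketch below shows one way to do that with the Python standard library. The function name `extract_section` and the archive name used in the usage note are hypothetical; the actual archive names and their internal layout are not stated on this page.

```python
import zipfile
from pathlib import Path


def extract_section(archive, dest_dir):
    """Unpack one downloaded .zip archive and return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        names = zf.namelist()
    # Paths are returned in sorted order so the result is reproducible.
    return sorted(dest / name for name in names)
```

Assuming a file saved locally under the hypothetical name `NEWS.conl.zip`, calling `extract_section("NEWS.conl.zip", "tut_data")` would unpack it into the `tut_data` directory and return the list of extracted paths.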

  • ParallelTUT

    is a morpho-syntactically annotated collection of Italian/French/English parallel sentences, drawn from different sources and representing different genres and domains, released in several formats.

    Size

    corpus     sentences  tokens
    JRCAcquis  540        19,268
    UDHR       230        6,791
    CC         291        9,241
    FB         341        5,697
    Europarl   1,505      43,479
    WIT3       287        4,715
    WIKI       1,168      28,044
    PS         1,859      32,835
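The figures in the size table give a rough idea of how sentence length varies across the corpora. The following sketch, with the illustrative names `PARTUT` and `mean_sentence_length`, copies the counts from the table and computes tokens per sentence for each corpus. The table does not say whether the counts refer to one language or to all three together, so the script makes no assumption about that.

```python
# Figures copied from the ParTUT size table: corpus -> (sentences, tokens).
PARTUT = {
    "JRCAcquis": (540, 19268),
    "UDHR": (230, 6791),
    "CC": (291, 9241),
    "FB": (341, 5697),
    "Europarl": (1505, 43479),
    "WIT3": (287, 4715),
    "WIKI": (1168, 28044),
    "PS": (1859, 32835),
}


def mean_sentence_length(corpus):
    """Return the average number of tokens per sentence for one corpus."""
    sentences, tokens = PARTUT[corpus]
    return tokens / sentences


partut_sentences = sum(s for s, _ in PARTUT.values())
partut_tokens = sum(t for _, t in PARTUT.values())

if __name__ == "__main__":
    for name in PARTUT:
        print(f"{name}: {mean_sentence_length(name):.1f} tokens per sentence")
    print(partut_sentences, partut_tokens)
```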

    Licence of ParTUT

    ParTUT by Manuela Sanguinetti, Cristina Bosco is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    Formats of ParTUT

    All the data currently included in the treebank are in UTF-8 encoding and are delivered in various formats:

    • The raw input texts
    • The parsed sentences in the standard native dependency TUT annotation
    • The parsed sentences TUT-structured, but with the annotation expressed in the CoNLL format
    • The parsed sentences in the Penn Treebank phrase structure format. The conversion is performed by a script described at: http://www.di.unito.it/~tutreeb/TUTtoPENNconverter/
    • The parsed sentences in the TIGER-XML format
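To illustrate the Penn-style version, the sketch below parses one bracketed tree into nested tuples and recovers its words. It assumes the usual Penn Treebank bracketing, `(LABEL child child ...)`, and is not the converter script mentioned above. The function names `parse_penn` and `leaves` are illustrative, and the sample tree and its labels are invented rather than taken from the ParTUT data.

```python
def parse_penn(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        # Some Penn-style files wrap each tree in an unlabeled bracket.
        label = ""
        if pos < len(tokens) and tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = parse()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Invented sample tree; the labels are placeholders, not TUT-Penn labels.
SAMPLE_TREE = "(S (NP (ART Il) (NOUN gatto)) (VP (VERB dorme)))"
tree = parse_penn(SAMPLE_TREE)

if __name__ == "__main__":
    print(tree[0], leaves(tree))
```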

    Download ParTUT

    The available data can be downloaded from the following table:

    corpus     raw texts  native TUT  TUT-CoNLL  TUT-Penn  TUT-Tiger
    JRCAcquis  It         It          It         It        It
               En         En          En
               Fr         Fr          Fr         Fr        Fr
    UDHR       It         It          It         It        It
               En         En          En         En        En
               Fr         Fr          Fr         Fr        Fr
    CC         It         It          It         It        It
               En         En          En         En
               Fr         Fr          Fr         Fr        Fr
    FB         It         It          It         It
               En         En          En         En
               Fr         Fr          Fr         Fr
    Europarl   It         It          It
               En         En          En
               Fr         Fr          Fr         Fr
    WIT3       It         It          It         It        It
               En         En          En
               Fr         Fr          Fr
    PS         It         It          It
               En         En          En
               Fr         to do       to do
    WIKI       It         It          It
               En         En          En
               to do      to do       to do

See publications for more information about each corpus, and annotations for a description of the annotation format applied in each corpus.