The CoRoLa corpus was developed over four years and has been available for research since 2017. This study focuses on Romanian, a language that is still young in this domain and for which big steps are needed before it can be considered well resourced, both qualitatively and quantitatively. Great efforts are being made to increase the utility of linguistic resources in language processing applications by interconnecting them. It is important to mention that all the decisions we took with respect to these operations were dictated by the ultimate goal of CoRoLa: a corpus of extracts of the Romanian language.

- [...] year) or are left unmarked, just as in the original, with no attempt to link them explicitly to a reference in the bibliography list
- footnotes are marked as text (see Figure 3) and automatically moved from the bottom of the page to their corresponding places of reference in the page
- citations appearing in footnotes are not marked
- endnotes are left in their places, and only their references in the text are marked, automatically (see Figure 4)
- some files received from the publishers do not contain the full text (to ensure that they cannot be reproduced); most frequently, entire pages are skipped. In these cases we remove the interrupted/truncated sentences before and after the missing page and include the XML tag, as can be seen in Figure 6.
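The handling of skipped pages described above can be illustrated with a minimal Python sketch. It is not the CoRoLa pipeline itself: the function names, the punctuation-based sentence boundary rule, and the `<gap/>` tag used in the example are all assumptions made for illustration, since the actual tag name is not given in the text. The sketch also assumes that the text after the missing page always starts in the middle of a sentence.

```python
import re

# Assumed boundary rule: sentence-final punctuation, optionally followed by
# closing quotes or brackets. Abbreviations such as "dr." would need extra care.
_SENTENCE_END = re.compile(r'[.!?…]["”»)\]]*')


def trim_tail(text: str) -> str:
    """Drop the unfinished sentence at the end of the text before the gap."""
    last_end = 0
    for match in _SENTENCE_END.finditer(text):
        last_end = match.end()
    return text[:last_end].rstrip()


def trim_head(text: str) -> str:
    """Drop the sentence fragment at the start of the text after the gap."""
    match = _SENTENCE_END.search(text)
    if match is None:
        return ""
    return text[match.end():].lstrip()


def join_around_gap(before: str, after: str, gap_tag: str) -> str:
    """Join the two trimmed sides with a marker for the missing page.

    gap_tag is supplied by the caller; the real tag name is not known here.
    """
    parts = [trim_tail(before), gap_tag, trim_head(after)]
    return "\n".join(part for part in parts if part)


if __name__ == "__main__":
    before = "Prima frază este completă. A doua frază se întrer"
    after = "upe aici. A treia frază este completă."
    # "<gap/>" is a hypothetical placeholder tag, used only for this example.
    print(join_around_gap(before, after, "<gap/>"))
```

Passing the tag as a parameter keeps the sketch independent of the markup actually used in the corpus. A production version would more likely rely on a proper sentence splitter for Romanian than on punctuation alone.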