


The Data Format for Digital Linguistics (Daffodil)

Introduction

This project aims to create a standardized, human-readable, web-compatible format for storing linguistic data, following best practices for managing data on the modern web. It is part of a broader project called Digital Linguistics (DLx), which has the goal of creating web tools for managing linguistic data. This project will be useful for anyone who manages a linguistic database.

This repository contains the specification for the Data Format for Digital Linguistics (abbreviated as DaFoDiL). This specification is a recommendation for how to store linguistic data in a way that is standardized, human-readable, and web-compatible, using a popular data storage format on the web known as JSON. Tools which follow this recommended format will be interoperable, allowing users to migrate their data easily from one tool to another. In addition, this format is compatible with the modern web platform, making it easy to manage linguistic data online or in a browser. All Digital Linguistics projects utilize this data format.

This format also facilitates adherence to the Austin Principles for Data Citation in Linguistics by supporting the use of persistent identifiers, fields for identifying contributors to the data and their role(s), easy searchability, human-readability (in the form of human-readable keys in addition to opaque database IDs), and interoperability between different tools and web technologies more generally.

Please consider citing this specification in scholarly articles using this repository’s Zenodo DOI:

DOI: 10.5281/zenodo.1438589

About the Format: learn what the DLx format is and how it works

Schemas: read the schemas and get started using the DLx format in your own projects

Bugs & Feature Requests: Need to report a bug or suggest a feature? Open an issue on GitHub. Check out the contributing guidelines for information on the best way to report a bug or request a feature.

Contributing: Want to contribute to this project? :star2: Awesome! :star2: Check out the contributing guidelines to get started.

Developer Readme: Are you a developer who wants to work with the data format programmatically? Check out the Developer Readme.

The canonical way that linguists represent linguistic data in their publications is with an interlinear gloss. This is typically a 3- or 4-line format that shows a phrase in the language of interest, the words and morphemes inside the phrase, what each of those morphemes means, and its overall translation. Here is a short example of an interlinear gloss for a phrase in a language called Chitimacha:

Wetkx hus naancaakamankx weyt hi hokmiqi.
Wetkx hus naancaaka-mank-x weyt hi hok-mi-qi (Morpheme Breakdown)
Then his brother-PL-TOP he there leave-PL-3sg (Glosses)
'Then he left his brothers there.' (Translation)

While humans look at a representation like this and can see which glosses are associated with which morphemes, computers cannot rely on visual layouts in this way, and require more explicit structure. The purpose of the Digital Linguistics data format is to define a standard for representing interlinear glosses (as well as other linguistic information, such as dictionary entries) in a digital, computer-readable way.

There are many ways a linguist could choose to represent their data in digital form. Not only are many formats available (a relational database, XML, a tabular spreadsheet, JSON, etc.), but there is significant flexibility in deciding what properties to include in your data and what to call them. For example, does the data about a text have a property specifying the language it was spoken in, and should that property be represented as "lang" or "language"?

The Digital Linguistics (DLx) project recommends a data format called JSON (JavaScript Object Notation) for digitally representing your linguistic data. Moreover, the DLx project has drafted recommendations for how to structure linguistic data using JSON. This recommended format was designed to capture hierarchical linguistic data in a way that aligns with the descriptive categories that linguists actually use, relying on fundamental linguistic notions such as text, morpheme, orthography, etc. For instance, this format is capable of capturing the fact that a text contains utterances, utterances contain words, words contain morphemes, and morphemes contain phonemes.

This functionality turns out to be a crucial factor in inputting, editing, searching, and analyzing linguistic data.

At the same time, the DLx format is computer-readable, easily searchable, and, because it is plain JSON, natively supported by modern web browsers and web-based tools.
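
As a rough illustration of the hierarchy described above (a text contains utterances, utterances contain words, words contain morphemes), the Chitimacha example could be written as nested JSON along the following lines. This is only a sketch under stated assumptions, not the DLx schema itself: the property names (`language`, `utterances`, `transcription`, `translation`, `words`, `morphemes`, `form`, `gloss`) and the helper `build_utterance` are hypothetical and chosen for readability. The DLx schemas are the place to look for the actual field names.

```python
import json

# Illustrative only: these property names are assumptions, not the official DLx schema.
WORDS = ["Wetkx", "hus", "naancaaka-mank-x", "weyt", "hi", "hok-mi-qi"]
GLOSSES = ["Then", "his", "brother-PL-TOP", "he", "there", "leave-PL-3sg"]


def build_utterance(transcription, words, glosses, translation):
    """Pair each morpheme with its gloss explicitly, instead of relying on visual alignment."""
    if len(words) != len(glosses):
        raise ValueError("each word needs exactly one gloss string")
    word_objects = []
    for word, gloss in zip(words, glosses):
        forms = word.split("-")      # morpheme forms, e.g. ["hok", "mi", "qi"]
        meanings = gloss.split("-")  # matching glosses, e.g. ["leave", "PL", "3sg"]
        if len(forms) != len(meanings):
            raise ValueError(f"morpheme/gloss mismatch in {word!r}")
        word_objects.append({
            "transcription": word.replace("-", ""),
            "morphemes": [{"form": f, "gloss": g} for f, g in zip(forms, meanings)],
        })
    return {
        "transcription": transcription,
        "translation": translation,
        "words": word_objects,
    }


# A text contains utterances; each utterance contains words; each word contains morphemes.
text = {
    "language": "Chitimacha",
    "utterances": [
        build_utterance(
            "Wetkx hus naancaakamankx weyt hi hokmiqi.",
            WORDS,
            GLOSSES,
            "Then he left his brothers there.",
        )
    ],
}

# Serialize to JSON; ensure_ascii=False keeps any non-ASCII characters readable.
dlx_json = json.dumps(text, ensure_ascii=False, indent=2)

# Because the structure is explicit, a program can search it directly,
# for example collecting every morpheme glossed as plural.
plural_morphemes = [
    m["form"]
    for u in text["utterances"]
    for w in u["words"]
    for m in w["morphemes"]
    if m["gloss"] == "PL"
]
```

Printing `dlx_json` shows the nested structure as text. The explicit morpheme-to-gloss pairing is what lets the final query find `mank` and `mi` without depending on how the lines happen to be aligned on the page.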
