The special interest group deals with the computational-linguistic, linguistic and text-technological foundations needed to build annotated corpora of language use in social media and in Internet-based communication, as well as corresponding data in web corpora. Internet-based communication (also known as “computer-mediated communication”) comprises dialogic forms of communication that use the Internet as a communication infrastructure, for example communication in online forums, chats, instant messaging applications and via Skype, on wiki discussion pages, in blog and video blog comment threads, on Twitter, on social network profile pages, and in multimodal interaction spaces (learning environments, MMORPGs, and “virtual worlds”).
National and international initiatives already exist in the special interest group's subject areas (e.g. as part of the Text Encoding Initiative). The special interest group builds on these and, in collaboration with researchers from linguistics, computational linguistics and language technology, develops solutions specifically for German-language data.
The special interest group consolidates topics, projects and lines of discussion with computational-linguistic, linguistic and text-technological aspects that were addressed within the DFG Network Empirical Research on Internet-based Communication (Empirikom) and that are of central importance for developing methods for processing and annotating language data from social media and from genres of Internet-based communication. This includes:
- anchoring the topic of “Social Media / Internet-based Communication” on the agenda of national and international standardization initiatives in the field of language and text technology;
- the documentation of annotation guidelines, gold standards and results from projects that adapt existing NLP methods for the automatic linguistic annotation of language data from social media and from genres of Internet-based communication;
- the creation of standardized components for the automatic processing of language data from social media and from genres of Internet-based communication, e.g. in cooperation with the development teams of Apache UIMA and the DKPro framework; the plan is to develop the components according to the UIMA standard and make them freely available as part of DKPro;
- the documentation of legal issues relating to the collection, annotation and provision of language data from the treated genres in corpora and their use for empirical language analysis and in the field of language technology;
- the establishment of a network of researchers in Germany and abroad who work on the issues addressed by the AK (Arbeitskreis, i.e. the special interest group), based on existing contacts and cooperations.
Planned activities include regular workshops on changing key topics, exchange via a mailing list and a digital newsletter, and documentation of current projects and events related to the AK's topics on the GSCL website.
- Workshop of the AK as part of KONVENS 2014: “NLP 4 CMC: Natural Language Processing for Computer-Mediated Communication / Social Media”
- University of Hildesheim, October 6, 2014
- Website for the workshop and call for papers: sites.google.com/site/nlp4cmc
- Workshop “Social Media Corpora for the eHumanities: Standards, Challenges, and Perspectives”
- [TU Dortmund, February 20-21, 2014](https://sites.google.com/view/empirikom/aktivit%C3%A4ten/7-tagung-2014)
The workshop focuses on topics that have been focal points of the DFG Network “Empirical Research on Internet-based Communication” over the past three and a half years. Using examples of corpus projects from Germany, France, the Netherlands, Italy and Switzerland, it will address questions of the linguistic description of language use in social media as well as corpus-linguistic and computational-linguistic aspects of building, annotating and processing corpora of language on the Internet and in social media.
Networking and cooperation
The AK draws on existing contacts and cooperations from the DFG Network Empirical Research on Internet-based Communication as well as with the development teams of Apache UIMA and the DKPro framework.
For the development and standardization of representation schemes, close collaboration is planned with the Special Interest Group “Computer-Mediated Communication” of the Text Encoding Initiative (TEI).
To adapt tagsets for German-language data to the specific features of the treated genres, the AK cooperates with the special interest group on the revision of the Stuttgart-Tübingen-Tagset (STTS).
Existing contacts with comparable networks in other European countries (e.g. the French network Nouvelles formes de communication (Nouv-com)) and with projects from “Building and Annotating Corpora of Computer-Mediated Communication” will be further developed within the framework of the AK. Among other things, joint workshops are planned to discuss issues of processing and annotating data from social media and from genres of Internet-based communication for different languages.