The press is considered to play a fundamental social role, as it shapes public opinion. In
this regard, critical discourse analysis (CDA) has as a primary aim to study “the way
social power abuse, dominance, and inequality are enacted, reproduced, and resisted by text
and talk in the social and political context” with the purpose of resisting “social inequality”
(van Dijk, 1991: 353). The analysis of ideologies in news discourse has a long tradition,
but only recently have linguists started to use large corpora and corpus techniques to study
them.
This presentation describes the process of developing a large corpus of digital press
news on violence against women (VAW) in English, Spanish, and Catalan, which
currently contains over 80,000 texts and 70 million words. This corpus is part
of the NEWSGEN project of the University of Valencia, which aims to document and
investigate the historical evolution and the political, cultural, social, and ideological
impact of discourses on VAW in recent times.
Methodologically, the presentation describes the three phases of corpus creation:
design, compilation, and annotation. In the design phase, the seed words on VAW were
defined. In the compilation phase, the corpus was compiled using the Factiva database;
the texts were then cleaned of irrelevant data, and duplicates were eliminated. Finally,
the texts were annotated with metadata such as the article's date, title, and body. A
statistical analysis of the corpus was conducted, and case studies showing its potential
and possible applications will be presented.
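
As an illustration only, the cleaning, duplicate elimination, and basic statistics steps described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual pipeline: the `Article` record, its field names, the whitespace and case normalization rule, and the sample texts are all hypothetical, and exact-match hashing is just one simple way to detect duplicates.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical record holding the annotated fields of one text."""
    date: str   # publication date as an ISO string, e.g. "2021-03-08"
    title: str
    body: str


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(articles):
    """Keep the first article seen for each distinct normalized body."""
    seen = set()
    unique = []
    for art in articles:
        digest = hashlib.sha256(normalize(art.body).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(art)
    return unique


def corpus_stats(articles):
    """Count texts and whitespace-delimited words in the article bodies."""
    return {
        "texts": len(articles),
        "words": sum(len(art.body.split()) for art in articles),
    }


# Invented sample data, for illustration only.
sample = [
    Article("2021-03-08", "A", "Two women  were attacked."),
    Article("2021-03-09", "B", "two women were attacked."),
    Article("2021-03-10", "C", "A new law was passed."),
]
unique = deduplicate(sample)
stats = corpus_stats(unique)
```

Exact hashing only catches copies that are identical after normalization; near-duplicates, such as the same agency story with a different headline or lead, would need a fuzzier comparison.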