This small project prepares a language model from the research papers and articles currently available on the ongoing global Covid-19 pandemic. Several datasets are used, the main source being the COVID-19 Open Research Dataset Challenge (CORD-19), which collates the relevant and useful research papers on the topic. I have also drawn on two Kaggle notebooks: Xhlulu’s notebook provides the cleaned data that I required for this work, and Daniel Wolffram’s notebook covers topic modelling and finding related articles and papers.
First, the language in which each paper is written is detected using the detect function of the langdetect library. It turns out that all but four papers are written in English; the remaining four are in Spanish. Since I am not proficient in Spanish and translating only four papers is a small task, I translated them using the Googletrans library.
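A minimal sketch of this step, assuming the papers sit in a pandas DataFrame named papers with a body_text column (both names are assumptions, not the exact setup used in the project):

```python
from langdetect import detect
from googletrans import Translator

# Detect the language of every paper's body text
papers["language"] = papers["body_text"].apply(detect)

# Translate the handful of Spanish papers into English
translator = Translator()
spanish = papers["language"] == "es"
papers.loc[spanish, "body_text"] = papers.loc[spanish, "body_text"].apply(
    lambda text: translator.translate(text, src="es", dest="en").text
)
```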
The language model starts out with a basic understanding of English thanks to the Fastai library. The paper Universal Language Model Fine-tuning for Text Classification argues that, just as in many computer vision applications, language processing tasks can also benefit from transfer learning. The model therefore comes pre-trained on Wikipedia text, and to make it specific to my task I fine-tune it on the Covid-19 papers so that it picks up the nuances of this kind of text. An AWD-LSTM is used for this task in particular because it employs DropConnect and a variant of averaged SGD (NT-ASGD) along with several other well-known regularisation strategies.
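A minimal sketch of this setup using fastai v2's high-level text API (the exact version and the DataFrame and column names here are assumptions; language_model_learner with AWD_LSTM loads weights pre-trained on Wikipedia text by default):

```python
from fastai.text.all import TextDataLoaders, language_model_learner, AWD_LSTM, accuracy

# Build language-model dataloaders from the cleaned paper texts
dls_lm = TextDataLoaders.from_df(papers, text_col="body_text", is_lm=True, valid_pct=0.1)

# AWD-LSTM learner pre-trained on Wikipedia text; drop_mult=0.5 corresponds
# to the 50% dropout rate described in the next paragraph
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
```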
Training is done with a 50% dropout rate for a total of 14 epochs using the cosine cyclical annealing schedule proposed in Cyclical Learning Rates for Training Neural Networks by Leslie Smith. Briefly, cyclically raising and lowering the learning rate lets the model explore more of the loss surface before settling on where to optimise. After 14 epochs, the model's accuracy comes out to around 48%. For a language model this is quite good: it means the model predicts the correct next word almost half the time.
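In fastai this schedule is available through fit_one_cycle; a minimal sketch continuing the code above (the learning rate shown is an assumption, not the value used in the project):

```python
# One-cycle training: the learning rate is annealed up and then back down
# (cosine annealing) over the cycle
learn.fit_one_cycle(14, 1e-2)

# Validation reports the loss and the next-word prediction accuracy
print(learn.validate())
```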
The language model can further be used for document clustering and classification, helping frontline workers quickly find the research relevant to the specific problem at hand.
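As a hypothetical example of that downstream use, the fine-tuned encoder could be reused for a text classifier in fastai (the label column topic is an assumption; no such labels exist in this project yet):

```python
from fastai.text.all import TextDataLoaders, text_classifier_learner, AWD_LSTM, accuracy

# Save the fine-tuned encoder from the language model
learn.save_encoder("covid_lm_encoder")

# Classification dataloaders that share the language model's vocabulary
dls_clf = TextDataLoaders.from_df(
    papers, text_col="body_text", label_col="topic", text_vocab=dls_lm.vocab
)

# Classifier built on the same AWD-LSTM encoder
clf = text_classifier_learner(dls_clf, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
clf.load_encoder("covid_lm_encoder")
clf.fit_one_cycle(4, 1e-2)
```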
The entire code can be viewed in the GitHub repository.