Today, Majlinda Llugiqi and I released the first Albanian News Articles Dataset on Kaggle. It includes over 3 million articles from various news sites in Kosovo and can be used for category prediction and other machine learning/NLP tasks.

After scraping all of the news sites through their WordPress APIs, we merged the data into a single file. To identify the origin of each news article, we also added the source to each post.
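The scrape-and-merge step can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: it assumes each site exposes the standard WordPress REST endpoint `/wp-json/wp/v2/posts`, and the helper names (`fetch_page`, `collect_posts`) and the output fields (`source`, `date`, `title`, `content`) are hypothetical and may not match the columns in the published file.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_page(base_url, page, per_page=100):
    """Fetch one page of posts from a site's WordPress REST API (assumed endpoint)."""
    query = urlencode({"per_page": per_page, "page": page})
    url = f"{base_url.rstrip('/')}/wp-json/wp/v2/posts?{query}"
    request = Request(url, headers={"User-Agent": "news-dataset-sketch"})
    try:
        with urlopen(request, timeout=30) as response:
            return json.load(response)
    except HTTPError as error:
        # WordPress answers HTTP 400 when the page number is past the last page.
        if error.code == 400:
            return []
        raise


def collect_posts(sources, fetch=fetch_page, max_pages=None):
    """Walk every source page by page and merge all posts into one list.

    Each row is tagged with the site it came from, so the origin of an
    article survives the merge.
    """
    merged = []
    for source in sources:
        page = 1
        while max_pages is None or page <= max_pages:
            posts = fetch(source, page)
            if not posts:
                break
            for post in posts:
                merged.append({
                    "source": source,  # origin of the article
                    "date": post.get("date"),
                    "title": post.get("title", {}).get("rendered"),
                    "content": post.get("content", {}).get("rendered"),
                })
            page += 1
    return merged
```

Passing `fetch` as a parameter keeps the merge logic testable without network access. For a real run, something like `collect_posts(["https://kallxo.com/"], max_pages=2)` would be a cautious starting point before crawling a whole site.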

The file contains all available articles published between 25.09.2007 and 27.08.2020.

The articles were taken from the following news sites:

  • https://www.gazetaexpress.com/
  • https://insajderi.com/
  • https://gazetablic.com
  • https://ballkani.info/
  • https://indeksonline.net/
  • https://klankosova.tv/
  • https://kallxo.com/
  • https://lajmi.net/
  • https://telegrafi.com/
  • https://www.kungulli.com/

The dataset and further details can be found here: https://www.kaggle.com/gentrexha/kosovo-news-articles-dataset

Tools: Python