arXiv is owned and operated by Cornell University, a private not-for-profit educational institution / Public domain

On August 4th, the data science community Kaggle announced a free, open pipeline to a machine-readable dataset of the open-access repository arXiv.

“Having the entire arXiv corpus on Kaggle grows the potential of arXiv articles immensely,” said Eleonora Presani, arXiv Executive Director, in Kaggle’s Medium article. “By offering the dataset on Kaggle we go beyond what humans can learn by reading all these articles and we make the data and information behind arXiv available to the public in a machine-readable format.”

Kaggle said its hope was to “empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.”

The dataset is now available on Kaggle and will be updated weekly.

Read the full article on Medium or on the arXiv blog.
