This repository contains Softcatalà website content (articles and program descriptions).
The datasets are available in the dataset directory.
Dataset size:
- articles.json contains 623 articles with 373233 words
- programes.json contains 330 program descriptions with 49868 words
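
The figures above can be checked with a short script. This is only a minimal sketch: it assumes each file holds a JSON array of objects, and the field names `title` and `content` are hypothetical, so adjust them to the actual structure of the files. The counting method behind the numbers above is not documented here, so the totals may differ slightly.

```python
import json
from pathlib import Path


def dataset_stats(items, text_fields=("title", "content")):
    """Return (number of items, total whitespace-separated words).

    text_fields are hypothetical field names; change them to match
    the real keys used in articles.json and programes.json.
    """
    words = 0
    for item in items:
        for field in text_fields:
            value = item.get(field, "")
            if isinstance(value, str):
                words += len(value.split())
    return len(items), words


def load_items(path):
    """Load a dataset file, assumed to hold a JSON array of objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    path = Path("dataset") / "articles.json"
    if path.exists():
        count, words = dataset_stats(load_items(path))
        print(f"{path.name}: {count} items, {words} words")
```

Words are counted by splitting on whitespace, which is the simplest possible definition and ignores any HTML markup that the text may contain.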
The license of the data is Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
How to update the dataset:
- Export the programes and articles items from the WordPress admin interface
- Save the raw files into the /raw directory
- Run ./filter.sh to filter out sensitive data
- Run pip install -r requirements.txt to install the dependencies
- Run python wp-to-json.py
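
Once the WordPress export has been saved by hand, the remaining steps could be chained in a small helper. This is a sketch, not part of the repository: the `run_update` function and the `UPDATE_COMMANDS` list are hypothetical names, and `python -m pip` is used in place of the bare `pip` command so that the same interpreter installs the requirements and runs the conversion.

```python
import subprocess
import sys

# Commands taken from the steps above, in the order they are listed.
# The WordPress export itself is manual and must be done beforehand.
UPDATE_COMMANDS = [
    ["./filter.sh"],
    [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
    [sys.executable, "wp-to-json.py"],
]


def run_update(commands=UPDATE_COMMANDS, runner=subprocess.run):
    """Run each command in order, stopping at the first failure.

    check=True makes subprocess.run raise CalledProcessError when a
    command exits with a non-zero status, so later steps are skipped.
    """
    for command in commands:
        runner(command, check=True)
```

Calling `run_update()` from the repository root runs the three commands in sequence. The `runner` parameter exists only so that the helper can be exercised without executing anything.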