Open Crawl Vietnamese Text

Last update: Jan 05, 2022

This repo contains crawled Vietnamese text from multiple sources.

This is a topic-centric list of high-quality public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
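
The four cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the script used to build the datasets: the regular expressions, the order of the steps, the final whitespace collapse, and the function name `clean_text` are all assumptions, and the emoji and emoticon patterns are not exhaustive.

```python
import re

# Assumed patterns: the repository does not document its exact rules.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Common Unicode emoji blocks (not exhaustive).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)
# A small set of ASCII emoticons such as :) ;-( :D =P, matched only when
# they stand alone between whitespace, so times like 10:30 are untouched.
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][\-o*']?[)\](\[dDpP/\\|}{@]+(?!\S)")


def clean_text(text: str) -> str:
    """Remove HTML tags, URLs, emojis and emoticons, then tidy whitespace."""
    text = HTML_TAG_RE.sub(" ", text)   # HTML tags
    text = URL_RE.sub(" ", text)        # URLs
    text = EMOJI_RE.sub(" ", text)      # emojis
    text = EMOTICON_RE.sub(" ", text)   # emoticons
    return " ".join(text.split())       # collapse the gaps left behind
```

Calling `clean_text` on each line of a raw file before writing it out would approximate the difference between the raw and clean versions of a corpus.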

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news sites and contains 50 GB of text.

link_raw, link_clean

2. OSCAR corpus, Vietnamese crawl:

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The Vietnamese portion of OSCAR is about 32 GB of text after duplicates are discarded.

link_raw, link_clean
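
To illustrate what discarding duplicates can look like, here is a minimal sketch of exact line-level deduplication. It is not necessarily the method used by OSCAR or by this repository; the function name `dedup_lines` and the choice of SHA-1 hashes as keys are assumptions made for the example.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, keeping the first occurrence."""
    seen = set()
    for line in lines:
        # Hash the stripped line so the set stores fixed-size digests
        # instead of full strings (helps with multi-gigabyte corpora).
        key = hashlib.sha1(line.strip().encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield line
```

Because the function is a generator, it can be fed an open file object and streamed to an output file without loading the whole corpus into memory; only the set of digests grows.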

3. Vietnamese story dataset:

Short and long stories crawled from the internet by QAI, 10 GB of text in total.

link_clean

4. Vietnamese poem dataset:

More than 1 million sentences collected by QAI on the internet.

link_clean

Owner

QAI Research
