Open Crawl Vietnamese Text

Last update: Jan 05, 2022


This repo contains crawled Vietnamese text from multiple sources.

This is a list of high-quality, topic-centric public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
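
The four cleaning steps above can be sketched in Python as follows. This is only an illustration: the repo does not publish its exact rules, so the regular expressions, the emoji ranges, and the function name `clean_text` are all assumptions, and the emoji and emoticon patterns are deliberately non-exhaustive.

```python
import re

# Assumed patterns; the actual cleaning rules used for these datasets are not documented.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# A small, non-exhaustive set of Unicode blocks that hold most emojis.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticon faces, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)
# Text emoticons such as :) ;-( =D, matched only when surrounded by whitespace
# so that strings like "10:30" are left alone.
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][-']?[)(DPp](?!\S)")


def clean_text(text: str) -> str:
    """Remove HTML tags, URLs, emojis and emoticons, then collapse whitespace."""
    # Tags go first so that URLs inside attributes disappear with them.
    for pattern in (HTML_TAG_RE, URL_RE, EMOJI_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "<p>Xin chào 😀 :) xem https://example.com/a?b=1 nhé</p>"
    print(clean_text(sample))
```

Vietnamese letters with diacritics fall outside the emoji ranges above, so they pass through unchanged.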

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news sites and contains 50 GB of text.

link_raw, link_clean

2. OSCAR corpus Vietnamese crawl:

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Its Vietnamese portion contains roughly 32 GB of text after duplicates are discarded.

link_raw, link_clean
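
For readers who want to discard duplicates in their own crawl, here is a minimal sketch of exact-match, line-level deduplication. It is not necessarily the procedure OSCAR or this repo used; the function name `dedup_lines` and the normalization (whitespace collapsing plus lowercasing) are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line the first time it is seen, skipping blanks and repeats."""
    seen = set()
    for line in lines:
        # Normalize so lines differing only in spacing or case count as duplicates.
        key = " ".join(line.split()).lower()
        if not key:
            continue
        # Store a short 8-byte digest instead of the full line to limit memory use
        # (a rare hash collision would drop a non-duplicate line).
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line


if __name__ == "__main__":
    for kept in dedup_lines(["Xin chào", "xin  chào ", "Tạm biệt", ""]):
        print(kept)
```

Because the function is a generator, it can stream a large file line by line instead of loading it into memory.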

3. Vietnamese story dataset:

Short and long stories crawled from the internet by QAI, 10 GB of text in total.

link_clean

4. Vietnamese poem dataset:

More than 1 million sentences collected from the internet by QAI.

link_clean

Owner

QAI Research
