Automatically download the daily arXiv papers and crop key information from them.

Last update: Jul 30, 2022

Related tags

Web Crawling paper deeplearning arxiv

Overview

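Arxiv daily quick overview

Features: filter the latest daily arXiv papers by keyword, automatically fetch their abstracts, and automatically crop the tables and figures from each paper.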
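The keyword filtering step can be pictured with a minimal sketch like the one below. It is not the project's actual code: the `Paper` record and the `filter_by_keywords` helper are hypothetical names introduced for illustration, and the match is assumed to be a case-insensitive substring test on the title and abstract.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Paper:
    # Hypothetical record; the project's real data structure may differ.
    title: str
    abstract: str


def filter_by_keywords(papers: Iterable[Paper], keywords: Iterable[str]) -> List[Paper]:
    """Keep papers whose title or abstract contains any keyword (case-insensitive)."""
    # Ignore blank keywords so an empty string does not match everything.
    lowered = [k.lower() for k in keywords if k.strip()]
    selected = []
    for paper in papers:
        text = f"{paper.title} {paper.abstract}".lower()
        if any(k in text for k in lowered):
            selected.append(paper)
    return selected


if __name__ == "__main__":
    sample = [
        Paper("Diffusion models for images", "We study image generation."),
        Paper("Graph theory notes", "Trees and cycles."),
    ]
    for paper in filter_by_keywords(sample, ["diffusion"]):
        print(paper.title)
```

With no keywords the sketch returns an empty list rather than every paper, which is one possible design choice; the project may behave differently.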

1 Test environment

Ubuntu 16+
Python3.7
torch 1.9
Colab GPU
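A quick sanity check of the local setup might look like the sketch below. It assumes the listed Python version is a minimum (the README only states what was tested), and `check_environment` is a hypothetical helper, not part of the project. It only checks whether `torch` is installed and does not import it.

```python
import importlib.util
import sys


def check_environment():
    """Report whether Python is at least 3.7 and whether torch is installed."""
    return {
        # Assumption: 3.7 is treated as a lower bound, not an exact requirement.
        "python_ok": sys.version_info >= (3, 7),
        # find_spec returns None when the package cannot be found.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```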

2 Usage demo

First, download the weights from baiduyun (extraction code: il87) and place them at code/ParseServer/models/PubLayNet/faster_rcnn_R_50_FPN_3x/model_final.pth
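Before starting the notebook, it can help to confirm the weights file is where it is expected. The sketch below uses the path given above, relative to the project root; `weights_ready` is a hypothetical helper name, not something the project provides.

```python
from pathlib import Path

# Path taken from the README, relative to the project root.
WEIGHTS_PATH = Path(
    "code/ParseServer/models/PubLayNet/faster_rcnn_R_50_FPN_3x/model_final.pth"
)


def weights_ready(project_root="."):
    """Return True if the model weights file exists under project_root."""
    return (Path(project_root) / WEIGHTS_PATH).is_file()


if __name__ == "__main__":
    if weights_ready():
        print("Weights found.")
    else:
        print(f"Weights missing, expected at: {WEIGHTS_PATH}")
```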

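2.1 Environment setup

You can run the project locally or on Colab; the local setup is used as the example here.

1. Install the GPU version of PyTorch beforehand.
2. Start jupyter notebook in the project root directory and run Overview_RUNME_Local.ipynb.
3. On the first run, install the environment first.

4. Start the document layout analysis service, and confirm that it has started normally before moving on to the next step.

5. Fill in the keywords to filter by as needed. Set needPDF=True if you need the PDF files, and needZip=True if you need the results packaged.

6. Once started, downloading and document layout analysis run at the same time, cropping the required content. Downloaded files are saved in the ./arxiv directory; if needZip=True, a ./arxiv.zip file is also produced.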
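The packaging behaviour of needZip can be approximated with the standard library, as in the sketch below. This is an illustration under assumptions, not the notebook's code: `pack_results` and its `need_zip` parameter are hypothetical names standing in for the needZip switch.

```python
import shutil
from pathlib import Path


def pack_results(output_dir="./arxiv", need_zip=True):
    """Zip output_dir into <output_dir>.zip; return the archive path or None."""
    out = Path(output_dir)
    # Nothing to do when packaging is disabled or no results exist yet.
    if not need_zip or not out.is_dir():
        return None
    # The base name "arxiv" yields "arxiv.zip" next to the directory, so the
    # archive never ends up inside the folder it is compressing.
    archive = shutil.make_archive(str(out), "zip", root_dir=str(out))
    return Path(archive)


if __name__ == "__main__":
    result = pack_results("./arxiv", need_zip=True)
    print(result if result else "No archive created.")
```

Writing the archive beside the output directory rather than inside it mirrors the ./arxiv and ./arxiv.zip layout described above.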

2.2 Colab

Compress the code directory and upload it to the root of Google Drive.
Run Overview_RUNME_Colab.ipynb in Colab; the remaining steps are the same as in 2.1.

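3 Results

After unzipping locally, the results can be viewed with Typora, a markdown reader.

The abs.md file in each folder holds the description of the corresponding PDF.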
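To browse the results without a markdown reader, the abs.md files can be collected with a short script. The sketch below assumes the output lives under ./arxiv with one folder per paper; `find_abstract_files` is a hypothetical helper, and the exact folder depth in the real output is not specified, so it searches recursively.

```python
from pathlib import Path


def find_abstract_files(root="./arxiv"):
    """Return a sorted list of every abs.md file found below root."""
    base = Path(root)
    if not base.is_dir():
        return []
    # rglob searches all nesting levels, since the folder depth is an assumption.
    return sorted(base.rglob("abs.md"))


if __name__ == "__main__":
    for path in find_abstract_files():
        print(path.parent.name, "->", path)
```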

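ps: Irregular typesetting leads to messy crops, which also indirectly reflects the quality of the paper.

Other

ps: The code was piled up in an "as long as it works" spirit. If you find a bug, describe it clearly and open an issue; the project is maintained periodically.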

Owner

HeoLis

Interested in generative methods.
