An Arxiv Spider

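As a CS student, 杰出男孩 (Outstanding Boy) knows well that the efficient way for a kernel to manage the hardware devices attached to a computer is interrupts, not polling. Whenever a friend sends over a good paper "hot off" arXiv, Outstanding Boy sighs: "You must be glued to arXiv every day, you're so fast~". So Outstanding Boy looked around GitHub, borrowed from scripts written by others, and built a spider that emails him the papers of interest from ('cs.CV','cs.AI','stat.ML','cs.LG','cs.RO') every day, with support for custom keywords and authors of interest.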

How to run

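Set the email username and password in main.py, and remember to enable POP3 authentication for the mailbox.
Edit run.sh so that it points to the code directory and to the path of the Python environment used to run it.
Use crontab to set up a scheduled task: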
```
crontab -e
```
The crontab entry is:
```
0 10 * * 1,2,3,4,5 bash your_dir/arxiv_spider/run.sh
```
That is, the day's arXiv updates are pushed to your mailbox at 10:00 a.m. every Monday through Friday.
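
The first step above puts mailbox credentials into main.py. As a rough illustration of what a script can do with them, here is a minimal sketch of assembling and sending a digest email with the Python standard library. It is not the code in main.py: the function names `build_digest` and `send_digest`, the subject line, and the use of an implicit-TLS SMTP connection are all assumptions made for this example, and the host and port depend on your mail provider.

```python
import smtplib
from email.message import EmailMessage


def build_digest(sender, recipient, papers):
    """Assemble a plain-text digest from (arxiv_id, title) pairs.

    Hypothetical helper: the layout only imitates the sample output
    shown in the Result section.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"arXiv digest: {len(papers)} matching paper(s)"
    lines = []
    for index, (arxiv_id, title) in enumerate(papers, start=1):
        lines.append(f"------------{index}------------")
        lines.append(f"arXiv:{arxiv_id}")
        lines.append(f"Title: {title}")
        lines.append(f"https://arxiv.org/abs/{arxiv_id}")
        lines.append("")
    msg.set_content("\n".join(lines))
    return msg


def send_digest(msg, host, port, username, password):
    """Send the message over an implicit-TLS SMTP connection.

    host, port, username and password are placeholders that you
    would take from your own mail provider's settings.
    """
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(username, password)
        server.send_message(msg)
```

Keeping message construction separate from sending makes it possible to check the digest text locally before any credentials are involved.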

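arXiv is a wonderful website, and hammering it with a high-frequency crawler is definitely behavior to be condemned. Since the listings are only updated once a day, it is recommended to run the script once a day, which amounts to browsing arXiv once a day~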
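
If you want the once-a-day rule enforced in code as well as in cron, one option is a small guard that refuses to run twice on the same date. The sketch below is an assumption for illustration, not part of this repository: `should_run`, `mark_done`, and the stamp file are hypothetical names, and the weekday check simply mirrors the Monday-to-Friday cron schedule above.

```python
import datetime
from pathlib import Path


def should_run(stamp_file, today=None):
    """Return True on a weekday if no run has been recorded for today."""
    today = today or datetime.date.today()
    if today.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    stamp = Path(stamp_file)
    if stamp.exists() and stamp.read_text().strip() == today.isoformat():
        return False  # already ran today, do not hit arXiv again
    return True


def mark_done(stamp_file, today=None):
    """Record today's date so later calls on the same day are skipped."""
    today = today or datetime.date.today()
    Path(stamp_file).write_text(today.isoformat())
```

A script would call `should_run(...)` before fetching anything and `mark_done(...)` after the email has been sent.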

Result

Today arxiv has 338 new papers in ['cs.CV', 'cs.AI', 'stat.ML', 'cs.LG', 'cs.RO'] area, and 127 of them is about CV, 2/2 of them contain your keywords.

Ensure your keywords is ['(?i)offline.*(RL|reinforcement learning)', '(?i)(RL|reinforcement learning).*offline'].

This is your paperlist.Enjoy!

------------1------------
arXiv:2110.12468
Title: SCORE: Spurious COrrelation REduction for Offline Reinforcement Learning
['Machine Learning (cs.LG)', 'Artificial Intelligence (cs.AI)']
https://arxiv.org/abs/2110.12468

------------2------------
arXiv:2110.13060
Title: Safely Bridging Offline and Online Reinforcement Learning
['Machine Learning (cs.LG)', 'Machine Learning (stat.ML)']
https://arxiv.org/abs/2110.13060

Ensure your authors is ['Sergey Levine', 'Song Han'].

This is your paperlist.Enjoy!

------------1------------
arXiv:2110.12080
Title: C-Planning: An Automatic Curriculum for Learning Goal-Reaching Tasks
['Machine Learning (cs.LG)', 'Artificial Intelligence (cs.AI)']
https://arxiv.org/abs/2110.12080

------------2------------
arXiv:2110.12543
Title: Understanding the World Through Action
['Machine Learning (cs.LG)']
https://arxiv.org/abs/2110.12543
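
The keyword entries in the output above are regular expressions, and the author entries are plain names. A minimal sketch of how such filters could be applied is shown here. It assumes that keywords are matched against the paper title and that author names are compared case-insensitively; the function names are hypothetical and the repository's actual matching logic may differ.

```python
import re

# Patterns and names copied from the sample output above.
KEYWORD_PATTERNS = [
    r"(?i)offline.*(RL|reinforcement learning)",
    r"(?i)(RL|reinforcement learning).*offline",
]
AUTHORS_OF_INTEREST = ["Sergey Levine", "Song Han"]


def matches_keywords(title, patterns=KEYWORD_PATTERNS):
    """True if any keyword regex is found anywhere in the title."""
    return any(re.search(pattern, title) for pattern in patterns)


def matches_authors(paper_authors, wanted=AUTHORS_OF_INTEREST):
    """True if any author of the paper is in the list of interest."""
    wanted_lower = {name.lower() for name in wanted}
    return any(author.lower() in wanted_lower for author in paper_authors)
```

The `(?i)` prefix makes each pattern case-insensitive, and listing both orders ("offline ... RL" and "RL ... offline") catches titles regardless of which term comes first. Note that a case-insensitive `RL` also matches the letters "rl" inside ordinary words such as "World", so a stricter pattern like `\bRL\b` may be worth considering.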

Acknowledgement

This code is built upon the implementation from https://github.com/ZihaoZhao/Arxiv_daily

Owner

Jie Liu
