Build a multi-source (WeChat official accounts, RSS), clean, personalized reading environment

Overview

2C


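As a heavy user of WeChat official accounts, I have always treated them as a place to absorb knowledge. The more you use them, though, the more likely you are to run into one particular headache: advertising.

Suppose you follow about ten official accounts and each one takes an ad every two weeks. In theory that is already twenty-odd ads a month (about two per account), and in practice it is more; on a bad day, everything you scroll through may be an ad. If you follow twenty or thirty accounts, the ad bombardment of today's official-account ecosystem is hard to avoid.

Worse, most of these ads sell anxiety and create a negative mood, which I find unbearable and which seriously affects my state of mind. Yet some official accounts do publish genuinely good articles. So how can you read the articles without the ads? If reading official accounts has left you feeling helpless about ads, this project is what you need.

That is why this project exists: to build a multi-source (official accounts, RSS), clean, personalized reading environment.

PS: To be clear, viewing ads is a way of supporting authors, and to some extent it encourages them to produce better work. But when I see something I like, I tip the author directly, so the free-rider argument does not hold for me. Thanks.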

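Implementation

My approach is simple. The overall flow is:

[Figure: 2c_process flow diagram]

A brief explanation:

  • Collector: monitors the official accounts or blog sources you follow and builds a feed stream as the input source;
  • Classifier (ads): builds an ad classifier with machine learning from historical ad data (custom rules are supported), automatically tags every article, and persists the result to MongoDB;
  • Distributor: handles requests and responses through the API layer, offers users personalized configuration, and distributes automatically according to that configuration, sending the clean article stream to WeChat, DingTalk, TG, or even a self-hosted website.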
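The classifier step can be sketched as below. This is only an illustration and not the project's actual model: it assumes an article is tagged as an ad when the cosine similarity between its title and any historical ad title reaches a threshold. The function names, the character-bigram features, and the 0.6 default threshold are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import List


def bigrams(text: str) -> Counter:
    """Count character bigrams; works for Chinese and English titles alike."""
    text = text.strip().lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_ad(title: str, ad_titles: List[str], threshold: float = 0.6) -> bool:
    """Hypothetical rule: flag the title if it is close to any known ad title."""
    vec = bigrams(title)
    return any(cosine(vec, bigrams(known)) >= threshold for known in ad_titles)
```

A title-only rule like this only catches ads that are re-posted with near-identical titles; a trained model over the article body would be needed for anything more general.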

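This gives you a clean reading environment. Extending the idea, you could also build a personal knowledge base with features such as tag management and knowledge-graph construction, all of which can be implemented in the API layer.

For implementation details, see the article [打造一个干净且个性化的公众号阅读环境] (Building a clean and personalized official-account reading environment).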

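Usage

This project uses pipenv for project management. Installation and usage are as follows:

# Make sure a Python 3.6+ environment is available
git clone https://github.com/howie6879/2c.git
cd 2c

# Create the base environment
pipenv install --python={your_python3.6+_path}  --skip-lock --dev
# Configure .env; see doc/00.环境变量.md for details
# Start
pipenv run dev

Reading the documentation before use is recommended.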

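Help

To improve the model's recognition accuracy, I hope everyone will contribute ad samples where they can. See the sample file .files/datasets/ads.csv, which uses the following format:

title                 url
Ad article title      Ad article link

An example:

[Figure: ads_demo]

The same ad is usually placed in several official accounts, so before adding an entry, please check whether the record already exists. I really do hope everyone will pitch in; please open a PR and contribute!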
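The duplicate check before adding a sample could be scripted roughly as follows. This is a sketch under assumptions: it treats ads.csv as a comma-delimited file with a `title,url` header row (the delimiter is not stated above), and `add_ad_sample` is a hypothetical helper name, not part of the project.

```python
import csv
from pathlib import Path


def add_ad_sample(csv_path: str, title: str, url: str) -> bool:
    """Append (title, url) unless the same title or URL is already recorded.

    Returns True when a new row was written, False for a duplicate.
    Assumes the existing file, if any, ends with a newline.
    """
    path = Path(csv_path)
    rows = []
    if path.exists():
        with path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
    if any(r.get("title") == title or r.get("url") == url for r in rows):
        return False
    needs_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if needs_header:
            writer.writerow(["title", "url"])
        writer.writerow([title, url])
    return True
```

Calling `add_ad_sample(".files/datasets/ads.csv", title, url)` before opening a PR would skip entries that are already present.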

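Acknowledgments

Many thanks to the following projects:

Thanks to the following developers for their contributions (in no particular order):

About

Feel free to get in touch (follow to join the group):

[image]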
Comments
  • After a one-click Docker install, running fails with ERROR Liuli 执行失败!'doc_source'

    The run log is below ("执行失败" means "execution failed"). What is the problem?

    [2022:02:18 10:51:54] INFO  Liuli Schedule(v0.2.1) task(default@liuli_team) started successfully :)
    [2022:02:18 10:51:54] INFO  Liuli Task(default@liuli_team) schedule time:
     00:10
     12:10
     21:10
    [2022:02:18 10:51:54] ERROR Liuli 执行失败!'doc_source'
    opened by GuoZhaoHui628 24
  • Collecting official accounts whose names contain spaces always fails

    [2022:05:27 08:11:47] INFO Request <GET: https://weixin.sogou.com/weixin?type=1&query=丁爸20%情报分析师的工具箱&ie=utf8&s_from=input&sug=n&sug_type=>
    liuli_schedule | [2022:05:27 08:11:48] ERROR SGWechatSpider <Item: Failed to get target_item's value from html.>
    liuli_schedule | Traceback (most recent call last):
    liuli_schedule |   File "/root/.local/share/virtualenvs/code-nY5aaahP/lib/python3.9/site-packages/ruia/spider.py", line 197, in _process_async_callback
    liuli_schedule |     async for callback_result in callback_results:
    liuli_schedule |   File "/data/code/src/collector/wechat/sg_ruia_start.py", line 58, in parse
    liuli_schedule |     async for item in SGWechatItem.get_items(html=html):
    liuli_schedule |   File "/root/.local/share/virtualenvs/code-nY5aaahP/lib/python3.9/site-packages/ruia/item.py", line 127, in get_items
    liuli_schedule |     raise ValueError(value_error_info)
    liuli_schedule | ValueError: <Item: Failed to get target_item's value from html.>

    bug 
    opened by hackdoors 7
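For reference, the logged request URL shows `20%` where a percent-encoded space would normally be `%20`. Whether that is the cause of the failure is not established by the log. The sketch below only shows how an account name containing spaces can be percent-encoded with the standard library; the helper name is hypothetical and the query parameters are copied from the logged URL.

```python
from urllib.parse import quote, urlencode


def build_sogou_query_url(account_name: str) -> str:
    """Build the search URL with the account name percent-encoded."""
    params = {
        "type": "1",
        "query": account_name,
        "ie": "utf8",
        "s_from": "input",
        "sug": "n",
        "sug_type": "",
    }
    # quote_via=quote encodes a space as %20 (the default quote_plus uses "+").
    return "https://weixin.sogou.com/weixin?" + urlencode(params, quote_via=quote)
```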
  • liuli_schedule exited with code 0

    I installed following the instructions at https://mp.weixin.qq.com/s/rxoq97YodwtAdTqKntuwMA.

    The actual files and code are as follows.

    Contents of pro.env:

    PYTHONPATH=${PYTHONPATH}:${PWD}
    LL_M_USER="liuli"
    LL_M_PASS="liuli"
    LL_M_HOST="liuli_mongodb"
    LL_M_PORT="27017"
    LL_M_DB="admin"
    LL_M_OP_DB="liuli"
    LL_FLASK_DEBUG=0
    LL_HOST="0.0.0.0"
    LL_HTTP_PORT=8765
    LL_WORKERS=1
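    # None of the settings above need changing; only the ones below need to be configured
    # Fill in your actual IP
    LL_DOMAIN="http://172.17.0.1:8765"
    # Fill in the WeChat distribution settings
    LL_WECOM_ID="custom"
    LL_WECOM_AGENT_ID="custom"
    LL_WECOM_SECRET="custom"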
    

    Contents of default.json:

    {
        "name": "default",
        "author": "liuli_team",
        "collector": {
            "wechat_sougou": {
                "wechat_list": [
                    "老胡的储物柜"
                ],
                "delta_time": 5,
                "spider_type": "playwright"
            }
        },
        "processor": {
            "before_collect": [],
            "after_collect": [{
                "func": "ad_marker",
                "cos_value": 0.6
            }, {
                "func": "to_rss",
                "link_source": "github"
            }]
        },
        "sender": {
            "sender_list": ["wecom"],
            "query_days": 7,
            "delta_time": 3
        },
        "backup": {
            "backup_list": ["mongodb"],
            "query_days": 7,
            "delta_time": 3,
            "init_config": {},
            "after_get_content": [{
                "func": "str_replace",
                "before_str": "data-src=\"",
                "after_str": "src=\"https://images.weserv.nl/?url="
            }]
        },
        "schedule": {
            "period_list": [
                "00:10",
                "12:10",
                "21:10"
            ]
        }
    }
    

    Contents of docker-compose.yml:

    version: "3"
    services:
      liuli_api:
        image: liuliio/api:v0.1.3
        restart: always
        container_name: liuli_api
        ports:
          - "8765:8765"
        volumes:
          - ./pro.env:/data/code/pro.env
        depends_on:
          - liuli_mongodb
        networks:
          - liuli-network
      liuli_schedule:
        image: liuliio/schedule:v0.2.4
        restart: always
        container_name: liuli_schedule
        volumes:
          - ./pro.env:/data/code/pro.env
          - ./liuli_config:/data/code/liuli_config
        depends_on:
          - liuli_mongodb
        networks:
          - liuli-network
      liuli_mongodb:
        image: mongo:3.6
        restart: always
        container_name: liuli_mongodb
        environment:
          - MONGO_INITDB_ROOT_USERNAME=liuli
          - MONGO_INITDB_ROOT_PASSWORD=liuli
        ports:
          - "27027:27017"
        volumes:
          - ./mongodb_data:/data/db
        command: mongod
        networks:
          - liuli-network
    
    networks:
      liuli-network:
        driver: bridge
    

    The error output is:

    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule  | Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
    liuli_schedule  | Loading .env environment variables...
    liuli_schedule exited with code 0
    

    I suspect it is a Python path problem. My Python path is:

    which python3 # /usr/bin/python3

    My VPS has no ${PYTHONPATH} environment variable:

    echo ${PYTHONPATH} # NULL

    How should I fix this?

    opened by huangwb8 7
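  • The Liuli project needs a logo

    Origin of the project name, provided by group member @ Sngxpro:

    Code name: 琉璃 (Liuli)

    English: RuriElysion
     or: RuriWorld

    Slogan: 琉璃开净界,薜荔启禅关 ("Lapis lazuli opens the pure realm; climbing fig unlocks the gate of Zen"), from Mei Yaochen (梅尧臣), 《缑山子晋祠 会善寺》

    Meaning: to build a pure land like the Eastern Pure Lapis Lazuli World. The Medicine Buddha Sutra (《药师经》) says: "That Buddha land is wholly pure; there are no women, no evil destinies, and no sounds of suffering."

    help wanted 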
    opened by howie6879 7
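  • Request: include the original article link in the RSS feed

    [image]

    I am planning to write a script that fetches the full text through a full-text extraction API and then mails it to my Gmail in a custom format. That way, besides newsletters, RSS feeds and WeChat official accounts could all be read directly in Spark.

    However, the paid full-text API I found is fairly strict: it does not accept the link format in the RSS feed, and even after conversion with decodeURIComponent the format is still incorrect.

    If the RSS feed contained a link to the original web page, the full text could be fetched from that original link without errors.

    I hope the author can support this. Thanks :)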

    opened by CenBoMin 5
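  • Feature request: stop changing updated in the generated RSS

    Below is an excerpt of the generated RSS. The updated date here is the time at which liuli refreshed the entry during a periodic run; even for an RSS entry from long ago, updated is reset to the current time.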

    <entry>
        <id>liuli_wechat - 谷歌开发者 - 社区说|TensorFlow 在工业视觉中的落地</id>
        <title>社区说|TensorFlow 在工业视觉中的落地 </title>
        <updated>2022-05-28T13:17:35.903720+00:00</updated>
        <author>
            <name>liuli_wechat - GDG</name>
        </author>
        <content/>
        <link href="https://ddns.ysmox.com:8766/backup/liuli_wechat/谷歌开发者/%E7%A4%BE%E5%8C%BA%E8%AF%B4%EF%BD%9CTensorFlow%20%E5%9C%A8%E5%B7%A5%E4%B8%9A%E8%A7%86%E8%A7%89%E4%B8%AD%E7%9A%84%E8%90%BD%E5%9C%B0" rel="alternate"/>
        <published>2022-05-25T17:30:46+08:00</published>
    </entry>
    

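    This causes problems: some RSS readers (such as Tiny Tiny RSS) sort the timeline by updated rather than published, so it is impossible to tell which entries in the feed were generated recently and which were generated earlier.

    So I would like updated to stay unchanged (for example, record the current time when the entry is first stored in MongoDB and leave it untouched on periodic updates), or to match published.

    I hope I have expressed the problem and the request clearly. Thanks!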

    enhancement 
    opened by YsMox 3
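The requested behaviour, keeping the first-seen updated time across periodic refreshes, can be sketched with an in-memory dict standing in for the database. This is not liuli's code; `upsert_entry` and the `store` layout are hypothetical. With MongoDB, an upsert that sets the timestamp through the `$setOnInsert` update operator has the same effect.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def upsert_entry(store: Dict[str, dict], entry_id: str, fields: dict,
                 now: Optional[datetime] = None) -> dict:
    """Insert or refresh an entry while keeping its first-seen `updated` time."""
    now = now or datetime.now(timezone.utc)
    existing = store.get(entry_id)
    # Only a brand-new entry receives `now`; refreshes keep the stored value.
    updated = existing["updated"] if existing else now
    store[entry_id] = {**fields, "updated": updated}
    return store[entry_id]
```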
  • The demo for crawling WeChat official accounts fails

    Following https://mp.weixin.qq.com/s/rxoq97YodwtAdTqKntuwMA, I just started the demo to try crawling official-account content, but the log shows that execution failed.

    Loading .env environment variables...
    [2022:05:09 10:55:45] INFO  Liuli Schedule(v0.2.4) task(default@liuli_team) started successfully :)
    [2022:05:09 10:55:45] INFO  Liuli Task(default@liuli_team) schedule time:
     00:10
     12:10
     21:10
    [2022:05:09 10:55:45] ERROR Liuli 执行失败!'doc_source'
    

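    The liuli schedule image version used in the article's docker compose file does not include playwright, yet the default JSON provided in the article specifies playwright for crawling WeChat content. I switched to the image version that includes playwright, and it still reports that execution failed.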

    opened by Colin-XKL 3
  • Time format cleaning fails when crawling official-account articles

    Test script:

    from src.collector.wechat_feddd.start import WeiXinSpider
    WeiXinSpider.request_config = {"RETRIES": 3, "DELAY": 5, "TIMEOUT": 20}
    WeiXinSpider.start_urls = ['https://mp.weixin.qq.com/s/OrCRVCZ8cGOLRf5p5avHOg']
    WeiXinSpider.start()
    

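    Cause: during data cleaning the expected format is 2022-03-21 20:59, but the data actually crawled is 2022-03-22 20:37:12, which makes the clean_doc_ts function raise an error, as shown in the figure. [image]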

    opened by showthesunli 3
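One way to tolerate both formats mentioned in this report is to try each known pattern in turn. The sketch below is an illustration only: `parse_doc_ts` is a hypothetical name, and the real `clean_doc_ts` signature and return type are not shown in the report, so a Unix timestamp in the machine's local time zone is assumed here.

```python
from datetime import datetime

# Both formats from the report: with and without seconds.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M")


def parse_doc_ts(raw: str) -> int:
    """Return a Unix timestamp, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return int(datetime.strptime(raw.strip(), fmt).timestamp())
        except ValueError:
            continue
    raise ValueError(f"unrecognised time format: {raw!r}")
```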
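  • Dynamically obtain the WeCom (企业微信) distribution department ID parameter

    Two new configuration options:

    # WeCom users to distribute to (user account names, case-insensitive); separate multiple users with ;
    CC_WECOM_TO_USER=""
    # WeCom departments to distribute to (department names); separate multiple departments with ;
    CC_WECOM_PARTY=""

    If both are left empty, messages go to all users in all departments of the current application by default; if they are filled in, distribution follows the configured values.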

    opened by zyd16888 1
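How the two options might be read can be sketched as follows. This is not the code from the pull request: `split_config` and `resolve_targets` are hypothetical helpers, and resolving department names to department IDs (the point of the change) is not shown.

```python
import os
from typing import List, Mapping


def split_config(value: str) -> List[str]:
    """Split a ;-separated setting, dropping blanks and surrounding spaces."""
    return [item.strip() for item in value.split(";") if item.strip()]


def resolve_targets(env: Mapping[str, str] = os.environ) -> dict:
    """Read both options; fall back to 'everyone' when neither is set."""
    users = split_config(env.get("CC_WECOM_TO_USER", ""))
    parties = split_config(env.get("CC_WECOM_PARTY", ""))
    return {
        "users": users,
        "parties": parties,
        "send_to_all": not users and not parties,
    }
```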
  • Version 0.24: schedule cannot be started by following the tutorial

    If I add the pro.env file manually as the tutorial says, Docker fails to start. If I do not add it manually, starting Docker automatically creates a pro.env folder, and Docker then prints the following log in a loop:

    Loading .env environment variables...
    Start schedule(pro) serve: PIPENV_DOTENV_LOCATION=./pro.env pipenv run python src/liuli_schedule.py
    Warning: file PIPENV_DOTENV_LOCATION=./pro.env does not exist!!
    Not loading environment variables.
    Process Process-1:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/data/code/src/liuli_schedule.py", line 84, in run_liuli_schedule
        ll_config = json.load(load_f)
      File "/usr/local/lib/python3.9/json/__init__.py", line 293, in load
        return loads(fp.read(),
      File "/usr/local/lib/python3.9/json/__init__.py", line 346, in loads
        return _default_decoder.decode(s)
      File "/usr/local/lib/python3.9/json/decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/local/lib/python3.9/json/decoder.py", line 355, in raw_decode
        raise JSONDecodeError("Expecting value", s, err.value) from None

    opened by zhyueyueniao 1
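The traceback above ends inside `json.load`, which says little about which file was at fault. As a sketch only (it does not diagnose this report, and `load_liuli_config` is a hypothetical helper rather than liuli's loader), a config reader can turn the common failure modes into clearer messages:

```python
import json
from pathlib import Path


def load_liuli_config(path: str) -> dict:
    """Load a JSON config file, with explicit errors for common mistakes."""
    p = Path(path)
    if p.is_dir():
        # A bind mount of a missing host file can show up as a directory.
        raise ValueError(f"{path} is a directory; expected a JSON file")
    if not p.is_file():
        raise FileNotFoundError(path)
    text = p.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError(f"{path} is empty; expected a JSON object")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{path} is not valid JSON: {exc}") from exc
```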
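  • Distributor support

    The current plan is to support sending articles to the following endpoints:

    • [x] DingTalk: fairly open and easy to integrate, recommended @howie6879
    • [x] WeChat: WeCom or a hook could be considered @howie6879
    • [x] RSS generator module @howie6879
    • [x] TG @123seven
    • [x] Bark @LeslieLeung
    • [ ] Feishu (飞书)

    Requests for more distribution endpoints are welcome in the comments.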

    enhancement help wanted 
    opened by howie6879 14
Releases(v0.2.0)
  • v0.2.0(Feb 10, 2022)

    v0.2.0 2022-02-11

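    liuli v0.2.0 👏 has been released. The plan is tracked on the project board; new features and improvements are described below.

    Improvements:

    • Partial code refactoring; project renamed to liuli
    • Faster deployment, with docker-compose support #17
    • Project size reduced from 100 MB to 3 MB (model removed)

    Fixes:

    • Distributor: WeCom distribution department ID parameter was not fixed #16 @zyd16888
    • Fixed connection failure for passwords containing special characters #35 @gclm

    Features: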

Owner
howie.hu
奇文共欣赏,疑义相与析 ("Fine writing is to be enjoyed together; doubtful points are to be worked out together")