Snowflake Stage Exporter

Overview

Snowflake database loading utility with Scrapy integration.
Meant for streaming ingestion of JSON-serializable objects into Snowflake stages and tables.

Installation

> pip install git+https://github.com/hermit-crab/snowflake-stage-exporter.git

Basic example

from snowflake_stage_exporter import SnowflakeStageExporter

with SnowflakeStageExporter(
    user='...',
    password='...',
    account='...',
    table_path='MY_DATABASE.PUBLIC.{item_type_name}',
) as exporter:
    exporter.export_item({'name': 'Jack', 'salary': 100}, item_type_name='employee')
    exporter.export_item({'name': 'Sal', 'salary': 90, 'extra_info': {'age': 20}}, item_type_name='employee')
    exporter.export_item({'title': 'Steel Grill', 'price': 5.5}, item_type_name='product')
    exporter.finish_export()  # flushes all stage buffers, creates tables and populates them with data inside stages

After you call finish_export(), two tables will be created: EMPLOYEE (2 rows, 3 columns) and PRODUCT (1 row, 2 columns), located inside the database MY_DATABASE and the schema PUBLIC (Snowflake's default schema).
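
As a quick sanity check, the created tables can be queried with the standard Snowflake Python connector. A minimal sketch, assuming the same credentials as above; the column names follow the item fields from the example, while the exact column types and casing are determined by the exporter:

import snowflake.connector

# Connect with the same credentials the exporter used (placeholders here).
conn = snowflake.connector.connect(user='...', password='...', account='...')
try:
    cur = conn.cursor()
    # EMPLOYEE should now contain the two employee items exported above.
    cur.execute('SELECT name, salary, extra_info FROM MY_DATABASE.PUBLIC.EMPLOYEE')
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()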

The same result can be achieved via the Scrapy integration (see "Configuration (Scrapy)" below).

How this works

Each object that you feed into the exporter is written into a local buffer (a temporary JSON file). Once a configurable maximum buffer size is reached, the file is uploaded to a Snowflake internal stage via a PUT statement. At the end of the execution the exporter creates all specified tables and then instructs Snowflake to populate each table from every staged JSON file via COPY INTO statements.

  • If you output to multiple tables then a buffer is maintained for each.
  • Alternatively you can create / populate tables as soon as the buffers are flushed via the *_on parameters described below (see the sketch after this list).
  • The *_on parameters also allow you to disable table creation / population entirely and just deal with the stages yourself.
  • For table creation the exporter will try to figure out column types dynamically during execution; otherwise you can pass them explicitly via the predefined_column_types parameter.
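
For illustration, a hedged sketch of an exporter configured to create and populate tables every time a buffer is flushed rather than at the end (parameter names as listed in the Configuration section; credentials and sizes are placeholders):

from snowflake_stage_exporter import SnowflakeStageExporter

with SnowflakeStageExporter(
    user='...',
    password='...',
    account='...',
    table_path='MY_DATABASE.PUBLIC.{item_type_name}',
    max_file_size=50 * 1024 ** 2,   # flush each buffer at ~50 MiB instead of the 1 GiB default
    create_tables_on='flush',       # create the target table as soon as a file is staged
    populate_tables_on='flush',     # run the COPY INTO right after each flush as well
    clear_stage_on='finish',        # remove the staged files once the export finishes
) as exporter:
    exporter.export_item({'name': 'Jack', 'salary': 100}, item_type_name='employee')
    exporter.finish_export()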

Why "Stages"?

The use of local buffers and stages, as opposed to typical SQL INSERT statements, is motivated largely by Snowflake's performance implications and billing model (see https://community.snowflake.com/s/question/0D50Z00008JpBymSAF/implications-of-multiple-insert-statements-vs-copy-into).

An illustrative example is a long-running Scrapy / ScrapyCloud job that constantly outputs data. If the job were to keep a connection constantly executing INSERTs, Snowflake would keep the warehouse running (and consuming credits) for the entire duration of the job.

A secondary consideration was allowing the user to work with just the stages, the way one would work with S3 or a similar blob file storage. This covers cases where the user needs their own, more sophisticated table management approach and simply wants a convenient place to store raw data.

IMPORTANT NOTE: it won't make much sense to use this library if you're already working with S3 or a similar storage (including just your local machine) where your data is in one of the Snowflake-supported file formats. Snowflake has built-in support for ingesting several third-party blob storages, and local files can be uploaded via PUT statements.

Configuration

All of the configuration is done via arguments to the main exporter class SnowflakeStageExporter (a combined sketch follows the list below).

  • user/password/account - Snowflake account credentials, passed as is to snowflake.connector.connect.
  • connection_kwargs - any additional parameters to snowflake.connector.connect.
  • table_path - table path to use.
    • If you specify database / database schema in connection_kwargs you won't need to specify them in the table path.
    • The path can include template variables which are expanded when you feed an item to the exporter. By default only the item variable is passed (e.g. 'MY_DB.PUBLIC.TABLE_{item[entity_type]}', which assumes all of your items have an "entity_type" field).
    • You can pass any additional variables yourself as keyword arguments when calling exporter.export_item().
    • Additionally in Scrapy integration the following fields are passed:
      • spider - spider instance.
      • item_type_name - type(item).__name__. In the basic example above you passed this explicitly yourself.
  • stage - which internal stage to use. By default user stage ("@~") is used.
  • stage_path - naming for the files being uploaded to the stage.
    • By default it's "{table_path}/{instance_ms}_{batch_n}.jl" where table_path is the table path with all variables resolved, instance_ms is the epoch milliseconds at the moment the exporter was instantiated, and batch_n is the sequential number of the buffer.
    • In Scrapy integration by default this is "{table_path}/{job}/{instance_ms}_{batch_n}.jl" where job is the key of the ScrapyCloud job or "local" if spider ran locally.
  • max_file_size - maximum buffer size in bytes. 1GiB by default.
  • predefined_column_types - dictionary of table_path to Snowflake columns types for table creation.
    • e.g. {"MY_DB.PUBLIC.PRODUCT": {"title": "STRING", "price": "NUMBER"}, "MY_DB.PUBLIC.EMPLOYEE": {"name": "STRING", "salary": "NUMBER", "extra_info": "OBJECT"}}.
  • ignore_unexpected_fields - ignore fields not passed in predefined_column_types during table creation / population.
    • True by default, but it only takes effect when the table does have predefined column types.
    • The data is still exported in full to the staged files.
  • allow_varying_value_types - if False, during table creation / population skip columns that had multiple value types.
    • True by default: the VARIANT type is assigned to such columns.
    • An error is logged when this is False and such a column is encountered.
    • Takes effect only when there is a need for exporter to figure out the column type.
    • The data is still exported in full to the staged files.
  • create_tables_on - one of "finish" / "flush" / "never"; "finish" by default. "flush" means each time a file is staged.
  • populate_tables_on - ditto.
  • clear_stage_on - same as above, but "never" is the default. When enabled, each file is removed from the stage individually.
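
Putting several of the options above together, a hedged sketch of a more fully configured exporter (the database name, the "entity_type" field, and the column types here are illustrative, not required names):

from snowflake_stage_exporter import SnowflakeStageExporter

with SnowflakeStageExporter(
    user='...',
    password='...',
    account='...',
    connection_kwargs={'database': 'MY_DB', 'schema': 'PUBLIC'},  # extra arguments for snowflake.connector.connect
    table_path='MY_DB.PUBLIC.{item[entity_type]}',  # expanded from each item's "entity_type" field
    max_file_size=256 * 1024 ** 2,                  # flush buffers at ~256 MiB instead of the 1 GiB default
    predefined_column_types={
        'MY_DB.PUBLIC.PRODUCT': {'title': 'STRING', 'price': 'NUMBER'},
    },
    ignore_unexpected_fields=True,                  # unexpected fields stay only in the staged files
) as exporter:
    exporter.export_item({'entity_type': 'PRODUCT', 'title': 'Steel Grill', 'price': 5.5})
    exporter.finish_export()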

Configuration (Scrapy)

All of the exporter instance parameters are exposed as Scrapy settings with a SNOWFLAKE_ prefix (e.g. SNOWFLAKE_MAX_FILE_SIZE).

Once a Scrapy job ends, all remaining buffers are flushed. If the job outcome is not "finished" (i.e. something went wrong), then no table creation / table population / stage clearing takes place.
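
A hedged sketch of what the equivalent configuration could look like in a Scrapy project's settings.py, assuming the integration itself is enabled as described in the project's documentation; the setting names simply mirror the exporter parameters with the SNOWFLAKE_ prefix:

# settings.py (sketch) -- enable the snowflake-stage-exporter integration per the project docs,
# then mirror the exporter parameters through SNOWFLAKE_* settings:
SNOWFLAKE_USER = '...'
SNOWFLAKE_PASSWORD = '...'
SNOWFLAKE_ACCOUNT = '...'
SNOWFLAKE_TABLE_PATH = 'MY_DATABASE.PUBLIC.{item_type_name}'
SNOWFLAKE_MAX_FILE_SIZE = 512 * 1024 ** 2   # mirrors max_file_size
SNOWFLAKE_CREATE_TABLES_ON = 'finish'       # mirrors create_tables_on
SNOWFLAKE_POPULATE_TABLES_ON = 'finish'     # mirrors populate_tables_on
SNOWFLAKE_CLEAR_STAGE_ON = 'never'          # mirrors clear_stage_on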

TODO

  • Unit tests >_>.
  • Test on Windows?