Search for documents in a domain through Google. The objective is to extract their metadata.

Last update: Dec 16, 2022

MetaFinder - Metadata search through Google

   _____               __             ___________ .__               .___                   
  /     \     ____   _/  |_  _____    \_   _____/ |__|   ____     __| _/   ____   _______  
 /  \ /  \  _/ __ \  \   __\ \__  \    |    __)   |  |  /    \   / __ |  _/ __ \  \_  __ \ 
/    Y    \ \  ___/   |  |    / __ \_  |     \    |  | |   |  \ / /_/ |  \  ___/   |  | \/ 
\____|__  /  \___  >  |__|   (____  /  \___  /    |__| |___|  / \____ |   \___  >  |__|    
        \/       \/               \/       \/               \/       \/       \/          
        
|_ Author: @JosueEncinar
|_ Description: Search for documents in a domain through Google. The objective is to extract metadata
|_ Usage: python3 metafinder.py -d domain.com -l 100 -o /tmp

Installation:

> pip3 install metafinder

Upgrades are also available using:

> pip3 install metafinder --upgrade

Usage

CLI

metafinder -d domain.com -l 20 -o folder [-t 10] [-v]

Parameters:

-d: Target domain.
-l: Maximum number of results to search.
-o: Path where the report is saved.
-t: Optional. Number of threads (4 by default).
-v: Optional. Also display the results on the screen.
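When MetaFinder is driven from another script, the command line above can be assembled programmatically. The following is a minimal sketch, not part of MetaFinder itself: `build_metafinder_command` and `run_metafinder` are hypothetical helper names, and the sketch assumes the `metafinder` executable is on the PATH after installation. It uses only the flags listed above.

```python
import subprocess


def build_metafinder_command(domain, limit, output_dir, threads=None, verbose=False):
    """Build the argument list for the metafinder CLI (hypothetical helper)."""
    cmd = ["metafinder", "-d", domain, "-l", str(limit), "-o", output_dir]
    if threads is not None:
        # -t is optional; when omitted, the tool uses its default of 4 threads.
        cmd += ["-t", str(threads)]
    if verbose:
        # -v also prints the results to the screen.
        cmd.append("-v")
    return cmd


def run_metafinder(*args, **kwargs):
    """Run the CLI; assumes the `metafinder` executable is installed and on PATH."""
    return subprocess.run(build_metafinder_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_metafinder(...) to execute it.
    print(" ".join(build_metafinder_command("domain.com", 20, "folder", threads=10, verbose=True)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the output path contains spaces.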

In Code

import metafinder.extractor as metadata_extractor

# Search Google for documents in the domain and extract their metadata
documents_limit = 5
domain = "target_domain"
data = metadata_extractor.extract_metadata_from_google_search(domain, documents_limit)
# Each entry holds the document URL and a dictionary of metadata fields
for k, v in data.items():
    print(f"{k}:")
    print(f"|_ URL: {v['url']}")
    for metadata, value in v['metadata'].items():
        print(f"|__ {metadata}: {value}")

# Extract metadata from a single local document
document_name = "test.pdf"
try:
    metadata_file = metadata_extractor.extract_metadata_from_document(document_name)
    for k, v in metadata_file.items():
        print(f"{k}: {v}")
except FileNotFoundError:
    print("File not found")
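Once the results are in hand, it is often useful to group documents by a single metadata field, for example to see which author names appear across a domain. The sketch below assumes the result has the shape used in the loop above (each entry has a `url` and a `metadata` dictionary). The helper `collect_field_values`, the `Author` field name, and the sample data are all hypothetical and only illustrate the idea; real field names depend on the documents found.

```python
from collections import defaultdict


def collect_field_values(data, field):
    """Map each value of `field` to the list of documents that contain it."""
    values = defaultdict(list)
    for name, info in data.items():
        value = info.get("metadata", {}).get(field)
        if value:  # skip documents where the field is missing or empty
            values[str(value)].append(name)
    return dict(values)


# Hypothetical sample mimicking the structure printed in the example above.
sample = {
    "a.pdf": {"url": "https://example.com/a.pdf", "metadata": {"Author": "alice"}},
    "b.pdf": {"url": "https://example.com/b.pdf", "metadata": {"Author": "bob"}},
    "c.docx": {"url": "https://example.com/c.docx", "metadata": {"Author": "alice"}},
    "d.pdf": {"url": "https://example.com/d.pdf", "metadata": {}},
}

for author, documents in collect_field_values(sample, "Author").items():
    print(f"{author}: {', '.join(documents)}")
```

In real use, `sample` would be replaced by the dictionary returned from the Google search call.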

Author

This project has been developed by:

Josué Encinar García -- @JosueEncinar

Contributors

Félix Brezo Fernández -- @febrezo

Disclaimer!

This software has been developed for teaching purposes and for use with the permission of a potential target. The author is not responsible for any illegitimate use.
