Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:52628
	🔒 Private network:	192.168.178.31:52628
	🌐 Public address:	217.70.138.123:52628
⠹       DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sister’s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
RedTeam-Security - In this repo you will get the information of Red Team Security related links

OSINT Passive Discovery Amass - https://github.com/OWASP/Amass (Attack Surface M

Abhinav Pathak 5 May 18, 2022
Detection tool of malware(s) by checksum (useful for forensic)

🐍 malware_checker.py Detection tool of malware(s) by checksum (useful for forensic) 📦 Dependencies installation $ pip3 install -r requirements.txt

Fayred 1 Jan 30, 2022
proof-of-concept running docker container from omero web

docker-from-omero-poc proof-of-concept running docker container from omero web How-to Edit test_script.py so that the BaseClient is created pointing t

Erick Martins Ratamero 2 Jan 22, 2022
Glass是一款针对资产列表的快速指纹识别工具,通过调用Fofa/ZoomEye/Shodan/360等api接口

Glass是一款针对资产列表的快速指纹识别工具,通过调用Fofa/ZoomEye/Shodan/360等api接口快速查询资产信息并识别重点资产的指纹,也可针对IP/IP段或资产列表进行快速的指纹识别。

s7ck Team 764 Jan 05, 2023
Security offerings for AWS Control Tower

Caylent Security Catalyst Reference Architecture Examples This repository contains solutions for Caylent's Security Catalyst. The Security Catalyst is

Steven Connolly 1 Oct 22, 2021
Python directory buster, multiple threads, gobuster-like CLI, web server brute-forcer, URL replace pattern feature.

pybuster v1.1 pybuster is a tool that is used to brute-force URLs of web servers. Features Directory busting (URI) URL replace patterns (put PYBUSTER

Glaukio 1 Jan 05, 2022
Get important strings inside [Info.plist] & and Binary file also all output of result it will be saved in [app_binary].json , [app_plist_file].json file

Get important strings inside [Info.plist] & and Binary file also all output of result it will be saved in [app_binary].json , [app_plist_file].json file

12 Sep 28, 2022
The Multi-Tool Web Vulnerability Scanner.

🟥 RapidScan v1.2 - The Multi-Tool Web Vulnerability Scanner RapidScan has been ported to Python3 i.e. v1.2. The Python2.7 codebase is available on v1

skavngr 1.3k Dec 31, 2022
Searches filesystem for CVE-2021-44228 and CVE-2021-45046 vulnerable instances of log4j library, including embedded (jar/war/zip) packaged ones.

log4shell_finder Python port of https://github.com/mergebase/log4j-detector log4j-detector is copyright (c) 2021 - MergeBase Software Inc. https://mer

Hynek Petrak 33 Jan 04, 2023
Pre-Auth Blind NoSQL Injection leading to Remote Code Execution in Rocket Chat 3.12.1

CVE-2021-22911 Pre-Auth Blind NoSQL Injection leading to Remote Code Execution in Rocket Chat 3.12.1 The getPasswordPolicy method is vulnerable to NoS

Enox 47 Nov 09, 2022
LeLeLe: A tool to simplify the application of Lattice attacks.

LeLeLe is a very simple library (300 lines) to help you more easily implement lattice attacks, the library is inspired by Z3Py (python interfa

Mathias Hall-Andersen 4 Dec 14, 2021
Course: Information Security with Python

Curso: Segurança da Informação com Python Curso realizado atravès da Plataforma da Digital Innovation One Prof: Bruno Dias Conteúdo: Introdução aos co

Elizeu Barbosa Abreu 1 Nov 28, 2021
Windows Server 2016, 2019, 2022 Extracter & Recovery

Parsing files from Deduplicated volumes. It can also recover deleted files from NTFS Filesystem that were deduplicated. Installation git clone https:/

0 Aug 28, 2022
PKUAutoElective for 2021 spring semester

PKUAutoElective 2021 Spring Version Update at Mar 7 15:28 (UTC+8): 修改了 get_supplement 的 API 参数,已经可以实现课程列表页面的正常跳转,请更新至最新 commit 版本 本项目基于 PKUAutoElectiv

Zihan Mao 84 Sep 09, 2022
Argument Injection in Dragonfly Ruby Gem

CVE-2021-33564 PoC Exploit script for CVE-2021-33564 (Argument Injection in Dragonfly Ruby Gem). Usage Arbitrary File Read python3 poc.py -u https://

Michael Tsai 12 Nov 09, 2022
PyExtractor is a decompiler that can fully decompile exe's compiled with pyinstaller or py2exe

PyExtractor is a decompiler that can fully decompile exe's compiled with pyinstaller or py2exe with additional features such as malware checker/detector! Also checks file(s) for suspicious words, dis

Rdimo 56 Jul 31, 2022
DNSpooq - dnsmasq cache poisoning (CVE-2020-25686, CVE-2020-25684, CVE-2020-25685)

dnspooq DNSpooq PoC - dnsmasq cache poisoning (CVE-2020-25686, CVE-2020-25684, CVE-2020-25685) For educational purposes only Requirements Docker compo

Teppei Fukuda 80 Nov 28, 2022
S2-062 (CVE-2021-31805) / S2-061 / S2-059 RCE

CVE-2021-31805 Remote code execution S2-062 (CVE-2021-31805) Due to Apache Struts2's incomplete fix for S2-061 (CVE-2020-17530), some tag attributes c

warin9 31 Nov 22, 2022
PortSwigger Burp Plugin for the Log4j (CVE-2021-44228)

yLog4j This is Y-Sec's @PortSwigger Burp Plugin for the Log4j CVE-2021-44228 vulnerability. The focus of yLog4j is to support mass-scanning of the Log

Y-Security 1 Jan 31, 2022
🔍 IRIS: An open-source intelligence framework

IRIS is an open-source OSINT framework, consisting of modules to find information about a target by scraping sites and fetching data from APIs.

IRIS 79 Dec 20, 2022