Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Overview

FeatureHasher

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Note, this requires Jina>=2.2.4.

Example

Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find top-K similar sentences.

from jina import Document, DocumentArray, Flow

# load 
   
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# cut into non-empty sentences store in a DA
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()
           [email protected][I]:๐ŸŽ‰ Flow is ready to use!
	๐Ÿ”— Protocol: 		GRPC
	๐Ÿ  Local access:	0.0.0.0:52628
	๐Ÿ”’ Private network:	192.168.178.31:52628
	๐ŸŒ Public address:	217.70.138.123:52628
โ น       DONE โ”โ•ธโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences:  12153
๏ปฟThe Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

   
     *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

    
      *** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***

     
       production, promotion and distribution of Project Gutenberg-tm

      
        Pride and Prejudice

       
         By Jane Austen This eBook is for the use of anyone anywhere in the United States and 
        
          This eBook is for the use of anyone anywhere in the United States and 
         
           by the awkwardness of the application, and at length wholly 
          
            Elizabeth passed the chief of the night in her sisterโ€™s room, and 
           
             the happiest memories in the world. Nothing of the past was 
            
              charities and charitable donations in all 50 states of the United 
            
           
          
         
        
       
      
     
    
   

In practice, you can implement matching and storing via an indexer inside Flow. This example is only for demo purpose so any non-feature hashing related ops are implemented outside the Flow to avoid distraction.

Owner
Jina AI
A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.
Jina AI
A simple linux keylogger project.

The project This project is a simple linux keylogger. When activated, it registers all the actions made with the keyboard. The log files are registere

1 Oct 24, 2021
A proof-of-concept exploit for Log4j RCE Unauthenticated (CVE-2021-44228)

CVE-2021-44228 โ€“ Log4j RCE Unauthenticated About This is a proof-of-concept exploit for Log4j RCE Unauthenticated (CVE-2021-44228). This vulnerability

Pedro Havay 20 Nov 11, 2022
A deobfuscator for multiple python obfuscators

PY4COC A deobfuscator for multiple python obfuscators, supports exe's packed with pyinstaller too. How to use python3 py4coc.py exe file or py file o

svenskithesource 16 Dec 03, 2022
Execution After Redirect (EAR) / Long Response Redirection Vulnerability Scanner written in python3

Execution After Redirect (EAR) / Long Response Redirection Vulnerability Scanner written in python3, It Fuzzes All URLs of target website & then scan them for EAR

Pushpender Singh 9 Dec 12, 2022
Growtopia Save.dat Stealer

savedat-stealer Growtopia Save.dat Stealer (Auto Send To Webhook) How To Use After Change Webhook URL Compile script to exe Give to target Done Info C

NumeX 9 May 01, 2022
A toolkit for web reconnaissance, it's fast and easy to use.

A toolkit for web reconnaissance, it's fast and easy to use. File Structure httpsuite/ main.py init.py db/ db.py init.py subdomains_db directories_db

whoami security 22 Jul 22, 2022
A knockoff social-engineer toolkit

The Python SE Dopp Kit is a social engineering toolkit with many purposes. It contains 5 different modules designed to be of assistance in different s

48 Nov 26, 2022
ไฟกๆฏๆ”ถ้›†่‡ชๅŠจๅŒ–ๅทฅๅ…ท

ๆฐดๆณฝ-ไฟกๆฏๆ”ถ้›†่‡ชๅŠจๅŒ–ๅทฅๅ…ท ้ƒ‘้‡ๅฃฐๆ˜Ž๏ผšๆ–‡ไธญๆ‰€ๆถ‰ๅŠ็š„ๆŠ€ๆœฏใ€ๆ€่ทฏๅ’Œๅทฅๅ…ทไป…ไพ›ไปฅๅฎ‰ๅ…จไธบ็›ฎ็š„็š„ๅญฆไน ไบคๆตไฝฟ็”จ๏ผŒไปปไฝ•ไบบไธๅพ—ๅฐ†ๅ…ถ็”จไบŽ้žๆณ•็”จ้€”ไปฅๅŠ็›ˆๅˆฉ็ญ‰็›ฎ็š„๏ผŒๅฆๅˆ™ๅŽๆžœ่‡ช่กŒๆ‰ฟๆ‹…ใ€‚ 0x01 ไป‹็ป ไฝœ่€…๏ผšSke ๅ›ข้˜Ÿ๏ผš0x727๏ผŒๆœชๆฅไธ€ๆฎตๆ—ถ้—ดๅฐ†้™†็ปญๅผ€ๆบๅทฅๅ…ท๏ผŒๅœฐๅ€๏ผšhttps://github.com/0x727 ๅฎšไฝ๏ผšๅๅŠฉ

0x727 2.7k Jan 09, 2023
Visius Heimdall is a tool that checks for risks on your cloud infrastructure

Heimdall Cloud Checker ๐Ÿ‡ง๐Ÿ‡ท About Visius is a Brazilian cybersecurity startup that follows the signs of the crimson thunder ;) ๐ŸŽธ ! As we value open s

visius 48 Jun 20, 2022
MainCoon - an automated recon framework

MainCoon is an automated recon framework meant for gathering information during penetration testing of web applications.

Md. Nur habib 8 Aug 26, 2022
"Video Moment Retrieval from Text Queries via Single Frame Annotation" in SIGIR 2022.

ViGA: Video moment retrieval via Glance Annotation This is the official repository of the paper "Video Moment Retrieval from Text Queries via Single F

Ran Cui 38 Dec 31, 2022
Fat-Stealer is a stealer that allows you to grab the Discord token from a user and open a backdoor in his machine.

Fat-Stealer is a stealer that allows you to grab the Discord token from a user and open a backdoor in his machine.

Jet Berry's 21 Jan 01, 2023
CTF framework and exploit development library

pwntools - CTF toolkit Pwntools is a CTF framework and exploit development library. Written in Python, it is designed for rapid prototyping and develo

Gallopsled 9.8k Dec 31, 2022
A collection of write-ups and solutions for Cyber FastTrack Spring 2021.

IMPORTANT: Please contact us before you use any styling or content shown here! Cyber FastTrack Spring 2021 / National Cyber Scholarship Competition -

Alice 48 Aug 28, 2022
A fast sub domain brute tool for pentesters

subDomainsBrute 1.4 A fast sub domain brute tool for pentesters. It works with P

Oliver 2 Oct 18, 2022
Bypass ReCaptcha: A Python script for dealing with recaptcha

Bypass ReCaptcha Bypass ReCaptcha is a Python script for dealing with recaptcha.

Marcos Camargo 1 Jan 11, 2022
A dynamic multi-STL, multi-process OpenSCAD build system with autoplating support

scad-build This is a multi-STL OpenSCAD build system based around GNU make. It supports dynamic build targets, intelligent previews with user-defined

Jordan Mulcahey 1 Dec 21, 2021
Seamless deployment and management of cybersecurity solutions ๐Ÿ—๏ธ

Description ๐Ÿ–ผ๏ธ Background ๐Ÿ‘ด๐Ÿผ Vision ๐Ÿ“œ Concepts ๐Ÿ’ฌ Solutions' Lifecycle. Operations โญ• Functionalities ๐Ÿš€ Supported Cybersecurity Solutions ๐Ÿ“ฆ Insta

MutableSecurity 36 Nov 10, 2022
๐™พ๐š™๐šŽ๐š— ๐š‚๐š˜๐šž๐š›๐šŒ๐šŽ ๐š‚๐šŒ๐š›๐š’๐š™๐š - ๐™ฝ๐š˜ ๐™ฒ๐š˜๐š™๐šข๐š›๐š’๐š๐š‘๐š - ๐šƒ๐šŽ๐šŠ๐š– ๐š†๐š˜๐š›๐š” - ๐š‚๐š’๐š–๐š™๐š•๐šŽ ๐™ฟ๐šข๐š๐š‘๐š˜๐š— ๐™ฟ๐š›๐š˜๐š“๐šŽ๐šŒ๐š - ๐™ฒ๐š›๐šŽ๐šŠ๐š๐šŽ๐š ๐™ฑ๐šข : ๐™ฐ๐š•๐š• ๐šƒ๐šŽ๐šŠ๐š– - ๐™ฒ๐š˜๐š™๐šข๐™ฟ๐šŠ๐šœ๐š ๐™ฒ๐šŠ๐š— ๐™ฝ๐š˜๐š ๐™ผ๐šŠ๐š”๐šŽ ๐šˆ๐š˜๐šž ๐š๐šŽ๐šŠ๐š• ๐™ฟ๐š›๐š˜๐š๐š›๐šŠ๐š–๐š–๐šŽ๐š›

๐™พ๐š™๐šŽ๐š— ๐š‚๐š˜๐šž๐š›๐šŒ๐šŽ ๐š‚๐šŒ๐š›๐š’๐š™๐š - ๐™ฝ๐š˜ ๐™ฒ๐š˜๐š™๐šข๐š›๐š’๐š๐š‘๐š - ๐šƒ๐šŽ๐šŠ๐š– ๐š†๐š˜๐š›๐š” - ๐š‚๐š’๐š–๐š™๐š•๐šŽ ๐™ฟ๐šข๐š๐š‘๐š˜๐š— ๐™ฟ๐š›๐š˜๐š“๐šŽ๐šŒ๐š - ๐™ฒ๐š›๐šŽ๐šŠ๐š๐šŽ๐š ๐™ฑ๐šข : ๐™ฐ๐š•๐š• ๐šƒ๐šŽ๐šŠ๐š– - ๐™ฒ๐š˜๐š™๐šข๐™ฟ๐šŠ๐šœ๐š ๐™ฒ๐šŠ๐š— ๐™ฝ๐š˜๐š ๐™ผ๐šŠ๐š”๐šŽ ๐šˆ๐š˜๐šž ๐š๐šŽ๐šŠ๐š• ๐™ฟ๐š›๐š˜๐š๐š›๐šŠ๐š–๐š–๐šŽ๐š›

CodeX-ID 2 Oct 27, 2022
Complet and easy to run Port Scanner with Python

Port_Scanner Complet and easy to run Port Scanner with Python Installation 1- git clone https://github.com/s120000/Port_Scanner 2- cd Port_Scanner 3-

1 May 19, 2022