Inverted index creation and query search mechanism on Wikipedia pages.

Overview

WikiPedia Search Engine

Step 1 : Installing Requirements

Install "stemming" module for python using pip.

Step 2 : Parsing the Data

To parse the data, run the file "wikiIndexer.py"

To run the file, the syntax is: "python WikipediaIndexer.py "

It will parse the whole dump and file the index files in the 'indexFiles' directory. It also creates the document to title mapping file in the current directory named 'docTitleMap.txt' which will be used by the search module later. (NOTE : As of current code, it pushes the index to disk for every 5000 documents encountered. This can be changed by changing the value of ' WRITE_PAGES_TO_FILE' macro in config.py)

Step 3 : Merging the Indexes and Creating Secondary Indexes

This task is done by WikipediaIndexer.py itself. There isn't any need for separate command. It takes the index files from 'pathOfFolder' directory and populates the 'finalIndex' directory with indexes of given chunk size and creates a secondary index named 'secondaryIndex.txt' in the same folder. (NOTE : As of the current code, the chunk size is kept 20000)

Step 4 : Running the Search Engine

To search for queries, run the file 'search.py'. It loads the index from 'finalIndex' (both primary and secondary). It also uses the file 'titleoffset.txt' (which must be in the working directory) to display titles corresponding to the docIDs. After it loads up, it gives the user a prompt to enter the query. After that the result of query is displayed. (top K results).

The format of result is:

	
   
     : 
    
      (Here the DocID is from the XML Dump)
	...
	
     
    
   

To specify normal queries, type them normally. For Field Queries, follow the format:

	f1:
   
     f2:
    
      ...
	where f1,f2 are fields : t - title, b - body, r - references, i - infobox, c - categories, e - external links

    
   
Owner
Piyush Atri
Piyush Atri
PwnWiki 数据库搜索命令行工具;该工具有点像 searchsploit 命令,只是搜索的不是 Exploit Database 而是 PwnWiki 条目

PWSearch PwnWiki 数据库搜索命令行工具。该工具有点像 searchsploit 命令,只是搜索的不是 Exploit Database 而是 PwnWiki 条目。

K4YT3X 72 Dec 20, 2022
Es-schema - Common Data Schemas for Elasticsearch

Common Data Schemas for Elasticsearch The Common Data Schema for Elasticsearch i

Tim Schnell 2 Jan 25, 2022
a Telegram bot writen in Python for searching files in Drive. Based on SearchX-bot

Drive Search Bot This is a Telegram bot writen in Python for searching files in Drive. Based on SearchX-bot How to deploy? Clone this repo: git clone

Hafitz Setya 25 Dec 09, 2022
MeiliSearch FastAPI provides FastAPI routes for interacting with MeiliSearch.

MeiliSearch FastAPI MeiliSearch FastAPI provides FastAPI routes for interacting with MeiliSearch. Installation Using a virtual environmnet is recommen

Paul Sanders 29 Nov 18, 2022
A play store search application programming interface ( API )

Play-Store-API A play store search application programming interface ( API ) Made with Python3

Fayas Noushad 8 Oct 21, 2022
User-friendly, tiny source code searcher written by pure Python.

User-friendly, tiny source code searcher written in pure Python. Example Usages Cat is equivalent in the regular expression as '^Cat$' bor class Cat

Furkan Onder 106 Nov 02, 2022
This is a Telegram Bot written in Python for searching data on Google Drive.

This is a Telegram Bot written in Python for searching data on Google Drive. Supports multiple Shared Drives (TDs). Manual Guide for deploying the bot

Levi 158 Dec 27, 2022
GitScanner is a script to make it easy to search for Exposed Git through an advanced Google search.

GitScanner Legal disclaimer Usage of GitScanner for attacking targets without prior mutual consent is illegal. It is the end user's responsibility to

Kaio Gomes 3 Oct 28, 2022
Home for Elasticsearch examples available to everyone. It's a great way to get started.

Introduction This is a collection of examples to help you get familiar with the Elastic Stack. Each example folder includes a README with detailed ins

elastic 2.5k Jan 03, 2023
A library for fast parse & import of Windows Prefetch into Elasticsearch.

prefetch2es Fast import of Windows Prefetch(.pf) into Elasticsearch. prefetch2es uses C library libscca. Usage When using from the commandline interfa

S.Nakano 5 Nov 24, 2022
Image search service based on imgsmlr extension of PostgreSQL. Support image search by image.

imgsmlr-server Image search service based on imgsmlr extension of PostgreSQL. Support image search by image. This is a sample application of imgsmlr.

jie 45 Dec 12, 2022
A Python web searcher library with different search engines

Robert A simple Python web searcher library with different search engines. Install pip install roberthelper Usage from robert import GoogleSearcher

1 Dec 23, 2021
A library for fast import of Windows NT Registry(REGF) into Elasticsearch.

A library for fast import of Windows NT Registry(REGF) into Elasticsearch.

S.Nakano 3 Apr 01, 2022
ForFinder is a search tool for folder and files

ForFinder is a search tool for folder and files. You can use that when you Source Code Analysis at your project's local files or other projects that you are download. Enter a root path and keyword to

Çağrı Aliş 7 Oct 25, 2022
Pythonic Lucene - A simplified python impelementaiton of Apache Lucene

A simplified python impelementaiton of Apache Lucene, mabye helps to understand how an enterprise search engine really works.

Mahdi Sadeghzadeh Ghamsary 2 Sep 12, 2022
Modular search for Django

Haystack Author: Daniel Lindsley Date: 2013/07/28 Haystack provides modular search for Django. It features a unified, familiar API that allows you to

Haystack Search 3.4k Jan 04, 2023
PwnWiki Telegram database searching bot

pwtgbot PwnWiki Telegram database searching bot. Screenshots How it looks like in the terminal when running How it looks like in Telegram Run Directly

K4YT3X 3 Jan 25, 2022
Full text search for flask.

flask-msearch Installation To install flask-msearch: pip install flask-msearch # when MSEARCH_BACKEND = "whoosh" pip install whoosh blinker # when MSE

honmaple 197 Dec 29, 2022
TG-searcherBot - Search any channel/chat from keyword

TG-searcherBot Search any channel/chat from keyword. Commands /start - Starts th

TechiError 12 Nov 04, 2022
Yuno is context based search engine for anime.

Yuno yuno.mp4 Table of Contents Introduction Power Of Yuno Try Yuno How Yuno was created? References Introduction Yuno is a context based search engin

IAmParadox 354 Dec 19, 2022