Apache Lucene

Lucene is the tool used for advanced matching and filtering of services.

It is an open-source project hosted by Apache that provides a Java-based, high-performance, full-featured text search engine library. To search large amounts of text quickly, one must first convert the text into a format that can be searched rapidly, eliminating the slow sequential scanning of each file for the given word or phrase. This conversion process is called indexing, and its output is called an index. Searching is the process of looking up words in an index to find the documents in which they appear.
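The difference between sequential scanning and an index lookup can be sketched with a very small inverted index. This is only an illustration in plain Java under simplifying assumptions, not Lucene's API or its actual index format; the class name `TinyIndex` and its methods `add` and `search` are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of an inverted index; this is NOT Lucene's API.
class TinyIndex {
    // Maps each lowercased word to the sorted ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexing: split the text into words once and record where each word occurs.
    public void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Searching: a single map lookup instead of scanning every document.
    public Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "An index makes search fast");
        System.out.println(index.search("search")); // prints [1, 2]
        System.out.println(index.search("lucene")); // prints [1]
    }
}
```

The cost of splitting the text is paid once, at indexing time; each later search is a lookup whose cost does not depend on how much text was indexed.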

Lucene lets developers add indexing and searching capabilities to their applications, and it can index and make searchable any data that can be converted to a textual format. This means Lucene can be used to index and search information kept in torrents, files, web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML files, PDF documents, or any other format from which textual information can be extracted. It is used by many well-known websites, such as the online encyclopedia Wikipedia, as well as in many Java applications.

To build an index, Lucene uses different types of analyzers, such as StandardAnalyzer, WhitespaceAnalyzer, StopAnalyzer and SnowballAnalyzer. The analyzer breaks text fields up into indexable tokens and is a core part of Lucene. For example, StandardAnalyzer is a sophisticated general-purpose analyzer, WhitespaceAnalyzer is a very simple analyzer that just separates tokens at white space, and StopAnalyzer removes common English words that are not usually useful for indexing.
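To make the contrast between a whitespace-based analyzer and a stop-word-removing analyzer concrete, the following sketch mimics the two behaviours in plain Java. It does not use or reproduce Lucene's analyzer classes; the class `AnalyzerSketch`, its two methods, and the short stop-word list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of two analysis strategies; this is NOT Lucene's API.
class AnalyzerSketch {
    // A tiny illustrative stop-word list (an assumption, not Lucene's actual list).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "is", "of", "to"));

    // Whitespace-style analysis: split on whitespace only, keeping case and punctuation.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // Stop-word-style analysis: lowercase, split on anything that is not a-z,
    // then drop the words found in STOP_WORDS.
    static List<String> stopFilteredTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The index is the core of Lucene.";
        System.out.println(whitespaceTokens(text));   // prints [The, index, is, the, core, of, Lucene.]
        System.out.println(stopFilteredTokens(text)); // prints [index, core, lucene]
    }
}
```

The same input yields seven tokens under the whitespace strategy and only three under the stop-word strategy, which shows how the choice of analyzer determines which terms end up in the index and can therefore be searched.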