Hobbit spaCy

Welcome to Hobbit spaCy, a custom natural language processing pipeline built on top of the spaCy library. It is designed for working with Middle-earth data and provides custom NER, tokenization, and other NLP components tailored to texts from the world of J.R.R. Tolkien.

Learn More

Leet Topic

LeetTopic builds upon Top2Vec, BERTopic, and other transformer-based topic modeling Python libraries. Unlike BERTopic and Top2Vec, LeetTopic allows users to control the degree to which outliers are resolved into neighboring topics.

Learn More

GLiNER spaCy

This project is a wrapper for integrating GLiNER, a Named Entity Recognition (NER) model, with the spaCy Natural Language Processing (NLP) library. GLiNER is a generalist NER model that can recognize entity types specified at inference time. The wrapper lets you use GLiNER as a component inside a spaCy pipeline, so its predictions are available through spaCy's usual entity interface.

Learn More

spaCy Whisper

spaCy Whisper is a Python package that integrates Whisper transcriptions with the natural language processing (NLP) capabilities of spaCy. It allows users to process and analyze Whisper-transcribed text with spaCy's tools, including tokenization, entity recognition, and part-of-speech tagging.

Learn More

BioSpaCy

BioSpaCy is a spaCy pipeline for processing biology texts. Currently, the pipeline uses rulers and heuristics to identify the genus and family of plants.

Learn More

spaCyEx

spaCyEx is an extension for spaCy designed to make pattern matching as flexible and easy as using regular expressions. It builds upon spaCy's Matcher, adding a more accessible syntax for defining complex patterns. This makes it well suited to extracting fine-grained linguistic features from texts.

Learn More

GLiNER Bird

GLiNER Bird is a fine-tuned version of the GLiNER model gliner_large-v2.1 that targets data found in descriptions of birds. The model improves recognition of detailed aspects of avian life, with a particular focus on nesting and dietary habits.

Learn More

Weaviate Filter

Weaviate Filter provides a convenient way to build GraphQL filters for Weaviate. The main class, GraphQLFilter, allows you to create complex filters by adding conditions and operands, and then retrieve the final filter object.
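The idea of adding conditions and combining them under an operand can be sketched in plain Python. This is only an illustration: the helpers `condition` and `combine` are hypothetical names and are not the GraphQLFilter API, and the dictionary layout (`path`, `operator`, `operands`, and a typed value key such as `valueText`) is an assumption based on the general shape of a Weaviate `where` filter.

```python
from typing import Any


def condition(path: list[str], operator: str, value_key: str, value: Any) -> dict:
    """Build one leaf condition (hypothetical helper, not the library's API)."""
    return {"path": path, "operator": operator, value_key: value}


def combine(operator: str, *operands: dict) -> dict:
    """Join conditions under a logical operator such as "And" or "Or"."""
    return {"operator": operator, "operands": list(operands)}


# Assumed example fields: an "author" text property and a "year" integer property.
where_filter = combine(
    "And",
    condition(["author"], "Equal", "valueText", "Tolkien"),
    condition(["year"], "GreaterThan", "valueInt", 1950),
)
```

Because `combine` returns the same kind of dictionary it accepts, nested filters can be built by passing one combined filter as an operand of another.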

Learn More

Bagpipes spaCy

Bagpipes spaCy is a collection of custom spaCy pipeline components designed to enhance text processing capabilities.

Learn More

Keyword spaCy

Keyword spaCy is a spaCy pipeline component for extracting keywords from text using cosine similarity. The basis for this comes from KeyBERT: A Minimal Method for Keyphrase Extraction using BERT, a transformer-based approach to keyword extraction. The methods employed by Keyword spaCy follow this methodology closely. It allows users to specify the range of n-grams to consider and can operate in a strict mode, which limits results to the specified n-gram range.
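The ranking idea can be illustrated with a small standard-library sketch. It is not the Keyword spaCy implementation: simple word-count vectors stand in for transformer embeddings, whitespace splitting stands in for spaCy tokenization, and the function names (`candidates`, `cosine`, `top_keywords`) are hypothetical. It shows candidate n-grams within a chosen range being scored by cosine similarity against the whole document.

```python
import math
from collections import Counter


def candidates(tokens, min_n, max_n):
    """Return every n-gram (as a token tuple) with min_n <= n <= max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_keywords(text, min_n=1, max_n=2, top_k=3):
    """Rank candidate n-grams by similarity to the whole document."""
    tokens = text.lower().split()
    doc_vec = Counter(tokens)  # toy stand-in for a document embedding
    scored = {
        " ".join(c): cosine(Counter(c), doc_vec)
        for c in set(candidates(tokens, min_n, max_n))
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


keywords = top_keywords("the ring the ring bearer carried the ring", top_k=2)
```

With real embeddings the scores reflect semantic closeness rather than word overlap, but the candidate generation and ranking steps have the same shape.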

Learn More

Date spaCy

Date spaCy is a custom spaCy pipeline component that enables you to easily identify date entities in a text and fetch the parsed date values using spaCy's token extensions. It uses regular expressions to find dates and then uses the dateparser library to convert those dates into structured datetime data.
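The two-step approach (find date strings with a regular expression, then parse them into datetime values) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the component's code: the pattern below only covers the "Month D, YYYY" form, `datetime.strptime` stands in for the dateparser library, and `find_dates` is a hypothetical helper name.

```python
import re
from datetime import datetime

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches dates such as "September 22, 1954" (one assumed format only).
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")


def find_dates(text):
    """Return (matched string, parsed datetime) pairs for each date found.

    Assumes every match is a valid calendar date; strptime raises
    ValueError otherwise. Month names are parsed with the default locale.
    """
    return [
        (m.group(0), datetime.strptime(m.group(0), "%B %d, %Y"))
        for m in DATE_PATTERN.finditer(text)
    ]


found = find_dates(
    "The first volume appeared on July 29, 1954 and the last on October 20, 1955."
)
```

In a pipeline component, each match would be turned into an entity span and the parsed value attached through a custom extension rather than returned as a tuple.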

Learn More

DNA spaCy

DNA spaCy is a machine learning spaCy pipeline for processing DNA sequences found in FASTA files. By treating the classification of DNA as an NLP problem, we can leverage NLP libraries, such as spaCy, to classify DNA sequences. This methodology has already been explored by other scholars. To our knowledge, however, these methods have not been applied in a spaCy pipeline. We selected spaCy because of its wide use in industry and academia, its speed, and its accessible and extensible Python API.

Learn More

Streamlit Pandas

Streamlit Pandas is a component for the Streamlit library. It allows users to load a Pandas DataFrame and automatically generate Streamlit widgets in the sidebar. These widgets trigger filtering events within the Pandas DataFrame.

Learn More

Vulgata spaCy

Vulgata spaCy is a library built upon spaCy to automate the identification and extraction of potential Biblical quotes in medieval Latin texts. The Vulgate version used is the Clementine version available from The Clementine Text Project, which was cleaned and structured into a CSV file. I would like to thank Marjorie Burghart for drawing this dataset to my attention. A future update will include the Stuttgart version and give the user the ability to switch between the two versions.

Learn More
