Open-Source Software
-
Hobbit spaCy
Welcome to Hobbit spaCy, a custom Natural Language Processing pipeline built on top of the spaCy library. The pipeline is designed for working with Middle-earth data and provides custom NER, tokenization, and other NLP components tailored to texts from the world of J.R.R. Tolkien.
-
Leet Topic
LeetTopic builds upon Top2Vec, BERTopic, and other transformer-based topic modeling Python libraries. Unlike BERTopic and Top2Vec, LeetTopic allows users to control the degree to which outliers are resolved into neighboring topics.
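To make the idea of controlling outlier resolution concrete, here is a minimal sketch in plain Python. It is not LeetTopic's implementation: the function `resolve_outliers`, the `max_distance` threshold, the use of 2D points, and the convention that label `-1` marks an outlier are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def resolve_outliers(points: List[Point], labels: List[int], max_distance: float) -> List[int]:
    """Hypothetical helper: move outliers (label -1) into the nearest topic
    when that topic's centroid lies within max_distance."""
    # Group the already-labeled points by topic.
    groups: Dict[int, List[Point]] = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            groups.setdefault(lab, []).append(p)

    # Centroid of each topic.
    centroids = {
        lab: (sum(x for x, _ in ps) / len(ps), sum(y for _, y in ps) / len(ps))
        for lab, ps in groups.items()
    }

    resolved = []
    for p, lab in zip(points, labels):
        if lab == -1 and centroids:
            nearest = min(centroids, key=lambda k: math.dist(p, centroids[k]))
            # Only resolve the outlier if it is close enough to that topic.
            if math.dist(p, centroids[nearest]) <= max_distance:
                lab = nearest
        resolved.append(lab)
    return resolved


points = [(0, 0), (0, 1), (10, 10), (10, 11), (0.5, 2), (5, 5)]
labels = [0, 0, 1, 1, -1, -1]
print(resolve_outliers(points, labels, max_distance=2.0))
```

In this sketch, a larger threshold pulls more outliers into neighboring topics, while a threshold of zero leaves every outlier unassigned.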
-
GLiNER spaCy
This project is a wrapper for integrating GLiNER, a Named Entity Recognition (NER) model, with the spaCy Natural Language Processing (NLP) library. GLiNER, short for Generalist Model for Named Entity Recognition, is an advanced model for recognizing entities in text. The wrapper enables easy integration and use of GLiNER within the spaCy environment, extending spaCy's NER capabilities with GLiNER's features.
-
spaCy Whisper
spaCy Whisper is a Python package designed to integrate Whisper transcriptions with the natural language processing (NLP) capabilities of spaCy. It allows users to process and analyze Whisper-transcribed text with the tools offered by spaCy, including tokenization, entity recognition, part-of-speech tagging, and more.
-
BioSpaCy
BioSpaCy is a spaCy pipeline for processing biology texts. Currently, the pipeline uses rulers and heuristics to identify the genus and family of plants.
-
spaCyEx
spaCyEx is an extension for spaCy, designed to make pattern matching as flexible and easy as using regular expressions. It builds upon the existing capabilities of spaCy's Matcher, enhancing it with a more accessible syntax for defining complex patterns. spaCyEx allows for intuitive and detailed text pattern specifications, perfect for extracting detailed linguistic features from texts.
-
GLiNER Bird
GLiNER Bird is a fine-tuned version of the GLiNER gliner_large-v2.1 model that targets specific types of data found in descriptions of birds. The model improves recognition of detailed aspects of avian life, focusing particularly on nesting and dietary habits.
-
Weaviate Filter
Weaviate Filter provides a convenient way to build GraphQL filters for Weaviate. The main class, GraphQLFilter, allows you to create complex filters by adding conditions and operands, and then retrieve the final filter object.
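As a rough illustration of the kind of object such a builder produces, the sketch below assembles a nested `where` filter in the shape Weaviate's GraphQL API uses (`operator`, `operands`, `path`, `valueText`). The class `WhereFilterSketch` and its methods are hypothetical and written only for this example; they are not the package's actual `GraphQLFilter` interface.

```python
from typing import Any, Dict, List


class WhereFilterSketch:
    """Hypothetical builder; NOT the Weaviate Filter package's real API."""

    def __init__(self, operator: str = "And") -> None:
        # Operator that combines the conditions, e.g. "And" or "Or".
        self.operator = operator
        self.operands: List[Dict[str, Any]] = []

    def add_condition(self, path: List[str], operator: str, value: str) -> "WhereFilterSketch":
        # Each condition becomes one leaf operand of the where filter.
        self.operands.append({"path": path, "operator": operator, "valueText": value})
        return self

    def build(self) -> Dict[str, Any]:
        # A single condition needs no wrapping And/Or node.
        if len(self.operands) == 1:
            return self.operands[0]
        return {"operator": self.operator, "operands": list(self.operands)}


where = (
    WhereFilterSketch("And")
    .add_condition(["author"], "Equal", "Tolkien")
    .add_condition(["title"], "Like", "*Ring*")
    .build()
)
print(where)
```

Returning `self` from `add_condition` lets conditions be chained before the final filter object is retrieved with `build()`.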
-
Bagpipes spaCy
Bagpipes spaCy is a collection of custom spaCy pipeline components designed to enhance text processing capabilities.
-
Keyword spaCy
Keyword spaCy is a spaCy pipeline component for extracting keywords from text using cosine similarity. The basis for this comes from KeyBERT: A Minimal Method for Keyphrase Extraction using BERT, a transformer-based approach to keyword extraction. The methods employed by Keyword spaCy follow this methodology closely. It allows users to specify the range of n-grams to consider and can operate in a strict mode, which limits results to the specified n-gram range.
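The ranking step can be sketched in a few lines of plain Python: generate candidate n-grams, embed the candidates and the document, and sort candidates by cosine similarity to the document. This is only an illustration of the general approach, not Keyword spaCy's code. In particular, the `embed` function below uses word counts as a stand-in for transformer embeddings, and all function names and parameters are assumptions made for this example.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    # Stand-in embedding (assumption): word counts instead of BERT vectors.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidates(text: str, ngram_range: Tuple[int, int]) -> List[str]:
    # All distinct n-grams whose length falls inside ngram_range (inclusive).
    words = text.lower().split()
    low, high = ngram_range
    grams = {
        " ".join(words[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(words) - n + 1)
    }
    return sorted(grams)


def top_keywords(text: str, ngram_range: Tuple[int, int] = (1, 2), top_n: int = 3) -> List[Tuple[str, float]]:
    # Score every candidate against the whole document and keep the best ones.
    doc_vec = embed(text)
    scored = [(c, cosine(embed(c), doc_vec)) for c in candidates(text, ngram_range)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


text = "spacy pipeline extracts keywords spacy pipeline"
for phrase, score in top_keywords(text):
    print(f"{phrase}: {score:.3f}")
```

With real sentence embeddings in place of `embed`, the same loop ranks phrases by semantic closeness to the document rather than by word overlap.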
-
Date spaCy
Date spaCy is a custom spaCy pipeline component that lets you identify date entities in a text and fetch the parsed date values using spaCy's token extensions. It uses regular expressions to find dates and then uses the dateparser library to convert those dates into structured datetime data.
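The two-step approach (find with a regular expression, then parse into a datetime) can be sketched with the standard library alone. The pattern below is an illustrative assumption that covers only two English date formats, and `datetime.strptime` stands in for dateparser; neither reflects the component's actual patterns or internals.

```python
import re
from datetime import datetime
from typing import List, Tuple

# Illustrative pattern (assumption): "June 6, 1944" or "25 August 1944".
MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
DATE_RE = re.compile(
    rf"\b(?:(?:{MONTHS}) \d{{1,2}}, \d{{4}}|\d{{1,2}} (?:{MONTHS}) \d{{4}})\b"
)


def find_dates(text: str) -> List[Tuple[str, datetime]]:
    """Hypothetical helper: return (matched text, parsed datetime) pairs."""
    results = []
    for match in DATE_RE.finditer(text):
        raw = match.group(0)
        # Stand-in for dateparser: pick the format the pattern allowed.
        fmt = "%B %d, %Y" if "," in raw else "%d %B %Y"
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue  # matched the pattern but is not a real calendar date
        results.append((raw, parsed))
    return results


found = find_dates("The landing began on June 6, 1944 and Paris was freed on 25 August 1944.")
for raw, parsed in found:
    print(raw, "->", parsed.date())
```

A dedicated parsing library handles far more formats and languages than this sketch, which is the reason to delegate the second step rather than extend the format list by hand.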
-
DNA spaCy
DNA spaCy is a machine learning spaCy pipeline for processing DNA sequences found in FASTA files. By treating the classification of DNA as an NLP problem, we can leverage NLP libraries such as spaCy to classify DNA sequences. This methodology has already been explored by other scholars. To our knowledge, however, these methods have not been applied in a spaCy pipeline. We selected spaCy because of its wide use in industry and academia, its speed, and its accessible and extensible Python library.
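One common way to frame DNA as text is to read the sequences from a FASTA file and split each one into overlapping k-mers that play the role of words. The sketch below shows that framing only; whether DNA spaCy tokenizes this way is an assumption, and the helpers `parse_fasta` and `kmer_tokens` are hypothetical names introduced for this example.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Hypothetical helper: map each FASTA header to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap, so append them to the current record.
            records[header] += line.upper()
    return records


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    # Overlapping windows of length k act as word-like tokens (assumption).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


fasta = ">seq1 example\nACGT\nAC\n"
records = parse_fasta(fasta)
for name, seq in records.items():
    print(name, kmer_tokens(seq))
```

Once a sequence is a list of tokens, it can be passed to text classification tooling in the same way as a tokenized sentence.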
-
Streamlit Pandas
Streamlit Pandas is a component for the Streamlit library. It allows users to load a Pandas DataFrame and automatically generate Streamlit widgets in the sidebar. These widgets trigger filtering events within the Pandas DataFrame.
-
Vulgata spaCy
Vulgata spaCy is a library built upon spaCy to automate the identification and extraction of potential Biblical quotes in medieval Latin texts. The Vulgate version used is the Clementine version available from The Clementine Text Project. This text was cleaned and structured into a CSV file. I would like to thank Marjorie Burghart for drawing this dataset to my attention. A future update will include the Stuttgart version and give the user the ability to switch between the two versions.