What Works When for Whom?
Advancing therapy change process research
A flexible solution to build text mining workflows that allows you to quickly combine Natural Language Processing tools from different sources.
Digital Humanities research often involves Natural Language Processing (NLP), in which a body of natural language text, or corpus, is analyzed using software. While many software packages are available, constructing new research analyses by combining (parts of) existing packages remains challenging. Individual packages are designed to do one task and to do it well; they are not primarily designed to interact with other, complementary packages. Another problem is that many tools are available for English, but far fewer for other languages.
nlppln (pronounced 'NLP pipeline') is an open-source Python package that helps address these problems by making it easy to package existing tools in a uniform way, as defined in the Common Workflow Language (CWL) standard for describing data analysis workflows. nlppln includes components for common NLP tasks, such as tokenization (multiple languages), lemmatization (for Dutch), and named entity recognition (for Dutch). These components are based on existing tools. Users can construct new analysis workflows by combining these ready-made components with tools of their own creation.
Besides improving interoperability, nlppln also keeps a formal record of all steps taken in a workflow. This makes the research more transparent, and improves reproducibility.
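To make the idea concrete, here is a minimal Python sketch of what a uniform step interface plus a record of the steps taken could look like. It is an illustration only and does not use the nlppln or CWL API: the `Workflow` class, the `lowercase` and `tokenize` steps, and the JSON record format are all hypothetical names chosen for this example.

```python
import json
import re
from typing import Callable, List

# Hypothetical uniform interface (an assumption for this sketch, not nlppln's
# API): every step takes a list of documents and returns a list of documents.
Step = Callable[[List[str]], List[str]]


def lowercase(docs: List[str]) -> List[str]:
    """Convert every document to lower case."""
    return [doc.lower() for doc in docs]


def tokenize(docs: List[str]) -> List[str]:
    """Naive tokenizer: keep runs of word characters, joined by spaces."""
    return [" ".join(re.findall(r"\w+", doc)) for doc in docs]


class Workflow:
    """Chains steps that share one interface and remembers their order."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def add(self, step: Step) -> "Workflow":
        self.steps.append(step)
        return self  # returning self allows chained calls

    def run(self, docs: List[str]) -> List[str]:
        # Feed the output of each step into the next one.
        for step in self.steps:
            docs = step(docs)
        return docs

    def describe(self) -> str:
        # A simple, serializable record of the steps in execution order.
        return json.dumps({"steps": [s.__name__ for s in self.steps]}, indent=2)


wf = Workflow().add(lowercase).add(tokenize)
result = wf.run(["Hello, World!", "NLP pipelines are fun."])
record = wf.describe()

print(result)
print(record)
```

Because every step has the same input and output shape, steps from different sources can be swapped or reordered without glue code, and the stored record can be shared alongside the results so that others can see exactly which steps were run and in what order.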