October 11, 2016 at 5-6:30pm in BIDS, 190 Doe Library

Text data requires a separate preprocessing stage, often referred to as the ‘NLP pipeline’. One popular library for implementing it is Python’s NLTK (Natural Language Toolkit). This talk will cover how to clean text data, tag parts of speech (POS), recognize named entities (NER), and quantify sentiment beyond dictionary look-up. While not explored in this talk, these preprocessing steps are often critical to developing higher-level models, such as document classifiers, topic models, and network models, because they provide targeted feature sets.
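As a rough illustration of the cleaning step, here is a minimal sketch that uses only NLTK components that work without extra data downloads. The example sentence and the tiny stopword set are made up for illustration and are not taken from the talk's notebook, which may approach cleaning differently. POS tagging, NER, and sentiment scoring need additional model data and are left out.

```python
# A sketch of text cleaning with NLTK pieces that need no extra data downloads.
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical example sentence and stopword set, chosen for illustration only.
raw = "The talks were GREAT, and the speakers are returning!"
stopwords = {"the", "and", "were", "are"}

# Split on word/punctuation boundaries, drop punctuation tokens, and lowercase.
tokens = [t.lower() for t in wordpunct_tokenize(raw) if t.isalpha()]

# Remove stopwords, then reduce each remaining word to its stem.
stemmer = PorterStemmer()
content = [t for t in tokens if t not in stopwords]
stems = [stemmer.stem(t) for t in content]

print(tokens)
print(stems)
```

In practice a fuller stopword list (for example, the one shipped in NLTK's corpus data) would replace the hand-written set used here.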


We are using this Jupyter notebook, located in the nltk folder on the master branch of the thehackerwithin/berkeley repo.

To install Python and NLTK, follow these instructions.

If you installed Anaconda:

conda install nltk


Otherwise, with pip:

pip install nltk
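To confirm that the installation worked, a quick check along these lines may help. It is only a sketch: the helper name `nltk_status` is hypothetical, and the check reports whether the package can be found rather than whether any particular NLTK data set is present.

```python
import importlib.util


def nltk_status():
    """Return the installed NLTK version string, or None if NLTK is missing."""
    # find_spec looks the package up without importing it, so a missing
    # installation is reported cleanly instead of raising ImportError.
    if importlib.util.find_spec("nltk") is None:
        return None
    import nltk
    return nltk.__version__


version = nltk_status()
if version is None:
    print("NLTK not found; rerun one of the install commands above.")
else:
    print("NLTK version:", version)
```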

Lastly, the NER wrapper requires the Java-based Stanford NER, available here:

Note: do not download the extension; download only Stanford Named Entity Recognizer version 3.6.0.

