ANLP_AuthorshipAttribution

Authorship Attribution classification project for the Advanced Natural Language Processing course (6 ECTS).

March 1st, 2019, University of Potsdam

Final project for the graded course Advanced Natural Language Processing by Lisa Becker, Nina Harlacher, and Joceline Ziegler, supervised by Professor Tatjana Scheffler at the University of Potsdam. The report can be found in this repository as ANLP_report_paper.pdf.

Just download all files.

We implemented three models for the attribution of Elizabethan plays to their authors:

  • Bag of Words (word frequency with Naive Bayes)
  • N-gram Tracing (relative frequency of word or character n-grams)
  • Generative Model (with Naive Bayes and SVM classifiers)

The data (classified plays in .txt format) is contained in the EL folder.

Bag of Words

- bagOfWordsLOO.ipynb

Running the notebook creates a file in the current directory listing the plays and their attributed authors; the overall accuracy is printed as well. A minimal sketch of the approach is shown at the end of this section.

The output file created by the notebook is also provided; it contains the names of the plays and their attributions, with the overall accuracy at the bottom.
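
For orientation, here is a minimal sketch of the pipeline, assuming scikit-learn and that each play's author can be derived from its filename; the label extraction below is a hypothetical stand-in for the notebook's actual loading code:

```python
# Minimal sketch: word-frequency features + Naive Bayes, evaluated with
# leave-one-out cross-validation. The label extraction from the filename
# is hypothetical; adapt it to how the plays in EL/ are actually named.
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.naive_bayes import MultinomialNB

paths = sorted(Path("EL").glob("*.txt"))
texts = [p.read_text(encoding="utf-8") for p in paths]
labels = [p.stem.split("_")[0] for p in paths]  # hypothetical: author before "_"

X = CountVectorizer().fit_transform(texts)      # raw word counts per play

# Leave-one-out: each play is classified by a model trained on all others.
preds = cross_val_predict(MultinomialNB(), X, labels, cv=LeaveOneOut())

for path, pred in zip(paths, preds):
    print(path.name, "->", pred)
print("accuracy:", accuracy_score(labels, preds))
```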

N-gram Tracing

- n_gram_tracing.ipynb

Implementation of the n-gram tracing approach (Grieve et al. 2018). In the very last cell, you can choose whether word or character n-grams are used, as well as their order. The accuracy is calculated and printed at the end.
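
For reference, a minimal sketch of the tracing idea, under the simplifying assumption that each candidate's known texts are concatenated into a single string (the notebook's preprocessing may differ):

```python
# Minimal sketch of n-gram tracing: attribute a disputed text to the
# candidate whose known writing shares the largest fraction of the
# disputed text's distinct n-grams.
def ngram_set(text, n, unit="char"):
    """Distinct n-grams of the text; unit is 'char' or 'word'."""
    tokens = text if unit == "char" else text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def trace(disputed, candidates, n=4, unit="char"):
    """candidates maps author -> concatenated known texts."""
    grams = ngram_set(disputed, n, unit)
    scores = {author: len(grams & ngram_set(known, n, unit)) / len(grams)
              for author, known in candidates.items()}
    return max(scores, key=scores.get), scores
```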

Generative Model

- Generative Model (Sentence as instance + SVM cls).ipynb

Uses each sentence as an instance of training data; this approach turned out to be unsuccessful. Two data frames are provided: one with features based on two sets of stop words, the other with the model's predictions (a minimal sketch of the setup follows the file list below):

- DataFrame_SVM.xlsx
- DataFrame_SVM_Preds.xlsx
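
A minimal sketch of the sentence-as-instance setup, assuming scikit-learn; the stop-word list, the naive sentence splitter, and the placeholder corpus below are illustrative stand-ins for what the notebook actually uses:

```python
# Minimal sketch: every sentence becomes one training instance, with
# stop-word counts as features, fed to a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

STOP_WORDS = ["the", "and", "of", "to", "a", "in", "that", "is"]  # illustrative subset

def sentences(text):
    # naive splitting on end punctuation; the notebook may use a real tokenizer
    return [s.strip() for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]

def sentence_instances(plays):
    """plays maps author -> list of play texts; returns sentences and labels."""
    X_text, y = [], []
    for author, texts in plays.items():
        for text in texts:
            for sent in sentences(text):
                X_text.append(sent)
                y.append(author)
    return X_text, y

# Restricting the vocabulary to stop words makes the features reflect
# function-word usage rather than topic.
plays = {"A": ["The dog and the cat. It is in a box."],       # placeholder corpus
         "B": ["To be or not to be! That is the question."]}
X_text, y = sentence_instances(plays)
vectorizer = CountVectorizer(vocabulary=STOP_WORDS)
clf = LinearSVC().fit(vectorizer.fit_transform(X_text), y)
```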


- generative_model.ipynb

compares the results of a generative model using either a Naive Bayes classifier or an SVM when different sets of stop words are used for feature generation (a sketch of this comparison is given at the end of this section). The initial data frame is provided as

- DataFrame.xlsx

and can be read in directly (rather than running the cell that creates the data frame). Results are displayed when the notebook is run; the prediction always takes a couple of minutes.

- DataFrame_imp.xlsx

is also provided and contains the data created with a different set of stop words.

Computed accuracies for the classifiers and stop-word sets can be found in the following file, which is created at the end of the notebook and also provided:

- Stopword_results.xlsx 
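
As a rough sketch of the comparison, assuming the data frame holds one label column plus one column per stop-word feature (the column names below are assumptions, not the actual layout of DataFrame.xlsx):

```python
# Minimal sketch: the same leave-one-out evaluation run once with Naive
# Bayes and once with an SVM over the stop-word features.
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

df = pd.read_excel("DataFrame.xlsx")           # provided in this repo
X = df.drop(columns=["author"]).to_numpy()     # assumed: remaining columns are features
y = df["author"].to_numpy()                    # assumed label column

for name, clf in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
    acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(f"{name}: {acc:.3f}")
```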

Naive Bayes

- NB_feature_engineering.ipynb

contains the NB generative model on its own as well as the implementation of different features. The function get_features_GM_imp(X_train, X_test) was modified for each feature: which columns of the data frame are accessed, the creation of count vectorizers where necessary, and the stacking of the original and added data (this pattern is sketched after the file list below). Different combinations of features were tested and documented; adding keywords as a feature improved our model the most. The model is used to attribute two additional plays that are not included in the Fox et al. corpus:

- yorkshire.txt
- puritan.txt
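
The feature-stacking pattern is roughly the following; the keywords column and the helper name are hypothetical, not the notebook's exact identifiers:

```python
# Minimal sketch of the pattern in get_features_GM_imp: build a count
# vectorizer for an extra text column (here a hypothetical "keywords"
# column) and stack it onto the existing feature matrices.
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

def add_keyword_features(X_train, X_test, train_df, test_df):
    """Append keyword counts to the existing train/test feature matrices."""
    vec = CountVectorizer()
    kw_train = vec.fit_transform(train_df["keywords"])  # hypothetical column
    kw_test = vec.transform(test_df["keywords"])        # vocabulary fixed on train
    return (hstack([csr_matrix(X_train), kw_train]),
            hstack([csr_matrix(X_test), kw_test]))
```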

SVM

- SVM_feature_engineering.ipynb

Explores the SVM generative model (though not as thoroughly, because it does not seem to perform better than Naive Bayes anyway).

Results

- Results.pdf

contains a summary of the results of the different models in the form of a table.
