Commit

Fixing NaN issue with sentiment, documenting and testing the Pipeline version
John Hawkins authored and John Hawkins committed May 25, 2021
1 parent 4d5fe32 commit 848c52d
Showing 4 changed files with 27 additions and 10 deletions.
18 changes: 15 additions & 3 deletions docs/source/usage.rst
@@ -44,12 +44,24 @@ Python Package Usage
 You can import the texturizer package within python and then make use of the
 SciKit Learn Compatible Transformer for your ML Pipeline.
 In the example below we initialise a TextTransform object that will generate
-the part of speech (pos), sentiment an topics indicator variables for any
+the literacy and topics indicator variables for any
 dataframe that has a column of text named 'TEXT_COL_NAME'


 .. code-block:: python
-    import texturizer as txzr
-    textTransformer = txzr.TextTransform(['TEXT_COL_NAME'],['pos','sentiment','topics'])
+    from texturizer.pipeline import TextTransform
+    from sklearn.linear_model import SGDClassifier
+    from sklearn.pipeline import Pipeline
+    pipeline = Pipeline([
+        ('texttransform', TextTransform(['TEXT_COL_NAME'],['literacy','topics']) ),
+        ('clf', SGDClassifier(loss='log') ),
+    ])
+Note that the transformer version of texturizer will remove the original text columns
+so that the resulting data set can be fed into an algorithm that requires numerical
+columns only. This means that if you need to do any other text feature engineering it
+must be placed earlier in the pipeline.
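
The ordering constraint described in the note can be sketched as follows. This is a minimal illustration, not texturizer's own code: `add_char_count` and `drop_text` are hypothetical helpers, and `drop_text` only stands in for the column removal that `TextTransform` is documented to perform. It assumes pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_char_count(df):
    # Hypothetical custom feature that needs the raw text column.
    out = df.copy()
    out["TEXT_COL_NAME_chars"] = out["TEXT_COL_NAME"].fillna("").str.len()
    return out


def drop_text(df):
    # Stand-in for the step that removes the text column (assumption:
    # this mimics what the note says TextTransform does, nothing more).
    return df.drop(columns=["TEXT_COL_NAME"])


df = pd.DataFrame({"TEXT_COL_NAME": ["hello world", None], "x": [1, 2]})

# The custom text feature runs first; swapping the two steps would raise a
# KeyError because the text column would already be gone.
features = Pipeline([
    ("custom", FunctionTransformer(add_char_count)),
    ("drop", FunctionTransformer(drop_text)),
]).fit_transform(df)
```

Separately, recent scikit-learn releases spell the logistic loss as `loss='log_loss'`; the `'log'` spelling used in the documentation snippet was deprecated in 1.1 and removed in 1.3.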


6 changes: 3 additions & 3 deletions texturizer/literacy.py
@@ -12,9 +12,9 @@
 """
 texturizer.literacy: Literacy feature flags
-Simple word matching to generate features for common literacy problems,
-this includes typos or spelling mistakes and some simple grammar problems
-like not capitalizing the first word of a sentence.
+Simple word matching to generate features for common literacy problems.
+This includes typos or spelling mistakes and some simple grammar problems,
+for example, not capitalizing the first word of a sentence.
 """

5 changes: 5 additions & 0 deletions texturizer/pipeline.py
@@ -22,6 +22,7 @@ def __init__(self, columns, transforms=['simple']):
         self.columns = columns
         self.config = self.generate_feature_config(columns, transforms)
         self.func = generate_feature_function(self.config)
+

     def fit(self, X, y=None, **fit_params):
         return self
@@ -37,6 +38,10 @@ def transform(self, X, y=None, **transform_params):
         #if X.__class__.__name__ == "DataFrame":
         # X = X.values

+        # REMOVE THE TEXT COLUMNS -- PARAMETERIZE THIS LATER
+        for col in self.columns:
+            rez.drop(col, inplace=True, axis=1)
+
         return rez
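
The "PARAMETERIZE THIS LATER" comment suggests making the column removal optional. One way that could look is sketched below, under stated assumptions: `DropTextColumns` and its `drop_text` flag are hypothetical names, the `_len` feature is a placeholder for texturizer's real features, and only pandas is required.

```python
import pandas as pd


class DropTextColumns:
    """Hypothetical stand-in for illustration; not texturizer's TextTransform."""

    def __init__(self, columns, drop_text=True):
        self.columns = columns
        self.drop_text = drop_text  # hypothetical switch, not in the commit

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rez = X.copy()
        # Placeholder feature so the sketch returns something numeric.
        for col in self.columns:
            rez[col + "_len"] = rez[col].fillna("").str.len()
        if self.drop_text:
            # A single call drops every text column and leaves X untouched.
            rez = rez.drop(columns=self.columns)
        return rez


df = pd.DataFrame({"TEXT_COL_NAME": ["some text", None], "x": [1, 2]})
numeric_only = DropTextColumns(["TEXT_COL_NAME"]).fit(df).transform(df)
with_text = DropTextColumns(["TEXT_COL_NAME"], drop_text=False).transform(df)
```

Defaulting the flag to `True` would keep the behaviour this commit introduces, while `drop_text=False` would let later pipeline steps still see the raw text.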


8 changes: 4 additions & 4 deletions texturizer/sentiment.py
@@ -63,8 +63,8 @@ def add_text_sentiment_features(df, columns):
 def add_textblob_features(df, col):
     def tb_features(x, col):
         if x[col]!=x[col]:
-            subjectivity = np.nan
-            polarity = np.nan
+            subjectivity = 0.0 #np.nan
+            polarity = 0.0 #np.nan
         else:
             text = ( x[col] )
             blob = TextBlob(text)
@@ -83,8 +83,8 @@ def add_sentiment_features(df, col):
     add simple text match features for sentiment.
     """
     wc_col = col+'_wc' # This is ALWAYS computed first
-    df[col+'_positive'] = df[col].str.count(positive_pat, flags=re.IGNORECASE)
-    df[col+'_negative'] = df[col].str.count(negative_pat, flags=re.IGNORECASE)
+    df[col+'_positive'] = df[col].str.count(positive_pat, flags=re.IGNORECASE).fillna(0)
+    df[col+'_negative'] = df[col].str.count(negative_pat, flags=re.IGNORECASE).fillna(0)
     df[col+'_sentiment'] = (df[col+'_positive'] - df[col+'_negative'] )/df[wc_col]

     return df
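
The two NaN behaviours these hunks deal with can be reproduced in isolation. The sketch below uses a made-up `text` column and a toy pattern (`\bgood\b`) in place of the package's `positive_pat`; it assumes only pandas.

```python
import re

import pandas as pd

df = pd.DataFrame({"text": ["good good bad", None, "nothing here"]})

# NaN is the only float that is not equal to itself, which is what the
# `x[col] != x[col]` test in the first hunk relies on.
nan = float("nan")
is_missing = nan != nan

# pd.isna is a more explicit alternative that treats None and NaN alike.
both_missing = pd.isna(None) and pd.isna(nan)

# str.count propagates missing values, so the raw counts contain a NaN.
raw = df["text"].str.count(r"\bgood\b", flags=re.IGNORECASE)

# fillna(0) turns the missing count into a plain zero, as in the second hunk.
filled = raw.fillna(0)
```

Whether the `_sentiment` ratio is also free of NaN depends on how the `_wc` column handles empty or missing text, which this diff does not show.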