forked from randomjohn/project
* Basic stuff
  x add titles of blogs to the labels (tried adding an attribute to the networkx graph, but GML didn't like it) - completed
  x try scraping the front page of each blog for blogroll links - need to add replacements to title strings so they can be valid file names
    (I tried this, and didn't get anything very interesting)
  x complete manual blogroll process (start with allendowney.blogspot.com)
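Note on the title-string replacements: a minimal sketch of one way to turn a blog title into a valid file name, using only the standard library. The function name title_to_filename, the max_length parameter, and the "untitled" fallback are hypothetical and not taken from the project code.

```python
import re
import unicodedata


def title_to_filename(title, max_length=80):
    """Turn a blog title into a string that is safe to use as a file name."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace each run of characters other than letters, digits, '.', '_', '-'.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_title)
    # Cap the length, then trim separators left at either end.
    cleaned = cleaned[:max_length].strip("._-")
    # Fall back to a placeholder if nothing usable is left (assumption).
    return cleaned or "untitled"
```

Two different titles can still collapse to the same name, so a real run would probably want to append the blog's host name or a counter.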
* Implement the NLP stuff
  x create term document matrix
  x convert to tf-idf
  * maybe make the similarity more efficient
  * cluster blogs based on similarity (review Programming Collective Intelligence)
  * fancier stuff (nice to have): named entity extraction
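Note on the term-document / tf-idf / similarity steps: a rough standard-library sketch of how they fit together. It assumes each blog is already tokenized into a list of words, uses the plain tf * log(N / df) weighting (one of several common variants, not necessarily the one in the project), and the names tf_idf and cosine are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this weighting a term that appears in every blog gets weight zero, so it contributes nothing to the similarity.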
* Implement the SNA stuff
  x build graph
  x display it
  x pickle the graph for further exploration in networkx
  * analyze it
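Note on "analyze it": one possible first pass is to rank blogs by how many other blogs list them in a blogroll. The sketch below works on a plain list of (source, target) pairs rather than the pickled networkx graph, so it runs without extra packages; rank_by_in_degree and the edge-list layout are assumptions for illustration. For a simple directed graph, networkx's DiGraph.in_degree should report the same counts.

```python
from collections import Counter


def rank_by_in_degree(edges):
    """Rank nodes by how many distinct blogs link to them."""
    # Deduplicate so a repeated blogroll entry is counted once.
    unique_edges = set(edges)
    in_degree = Counter(target for _source, target in unique_edges)
    # Sort by count (descending), then by name, for a stable ordering.
    return sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))
```

Blogs that nobody links to do not appear in the result, since they have no incoming edges.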
* Compare the two
  * maybe add topic clusters as attributes? that way Gephi can color them (would be nice)
  * try k-means on similarity, and compute how often the clustering and community detection agree that pairs are in the same group
    use Spearman's rho, or something like that (the Rand index measures exactly this kind of pair agreement)
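Note on the pair-agreement idea: counting the pairs on which two groupings agree (both put the pair together, or both keep it apart) and dividing by the number of pairs gives the Rand index. A small sketch, assuming both label lists are in the same blog order; pair_agreement is a hypothetical name, and the labels can come from any clustering or community-detection step.

```python
from itertools import combinations


def pair_agreement(labels_a, labels_b):
    """Fraction of item pairs on which two groupings agree (Rand index)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # Agreement: both group the pair together, or both keep it apart.
        agree += same_a == same_b
        total += 1
    # With fewer than two items there are no pairs to disagree on.
    return agree / total if total else 1.0
```

The label values themselves don't matter, only which items share a label, so cluster 0 in one grouping does not need to correspond to community 0 in the other.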
* Efficiency
  * Normalize TF-IDF vectors before analyzing them.
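Note on normalizing: if every TF-IDF vector is scaled to unit length once up front, cosine similarity between any two of them reduces to a plain dot product, which saves recomputing the norms for every pair. A minimal sketch, assuming sparse vectors stored as {term: weight} dicts; l2_normalize and dot are hypothetical names.

```python
import math


def l2_normalize(vector):
    """Scale a sparse {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0.0:
        # An all-zero (or empty) vector cannot be scaled; return a copy.
        return dict(vector)
    return {term: w / norm for term, w in vector.items()}


def dot(u, v):
    """Dot product of two sparse vectors, iterating over the shorter one."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())
```

For unit-length inputs, dot(u, v) is the cosine similarity of the original vectors.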