@dannguyen
Last active February 10, 2022 16:45
Using scikit-learn to classify NYT columnists
WASHINGTON
It’s a lost art, slinking away.
Now the fashion is slinking back.
Nobody wants to simply admit they made a mistake and disappear for awhile. Nobody even wants to use the weasel words: “Mistakes were made.” No, far better to pop right back up and get in the face of those who were savoring your absence.
We should think of a name for this appalling modern phenomenon. Kissingering, perhaps.
In Las Vegas, there’s the loathsome O.J., a proper candidate for shunning and stun-gunning, barging back into the picture.
And on Capitol Hill, Larry Craig shocked mortified Republicans by bounding into their weekly lunch. You’d think the conservative 62-year-old Idaho senator would have some shame, going from fervently opposing gay rights to provocatively tapping his toe in a Minneapolis airport toilet. (The toilet stall, now known as the Larry Craig bathroom, has become a hot local tourist attraction.)
But no.
As though Republicans don’t have enough problems, Mr. Craig said he is ready to go back to work while the legal hotshots he hired appeal his case. He even cast a couple votes, one against D.C. voting rights. (This creep gets to decide about my representation?)
Even if President Bush is “the cockiest guy” around, as the former Mexican President Vicente Fox writes in a new memoir critical of W.’s “grade-school-level” Spanish and his grade-school-level Iraq policy, he can’t be feeling good about the barbs being hurled his way by former supporters and enablers.
Rummy’s back in the news, giving interviews about a planned memoir and foundation designed to encourage “reasoned and civil debate” about global challenges and to spur more young people to go into government.
It’s rich. Maybe more young people would go into government if they didn’t have to work for devious bullies like Rummy who make huge life-and-death mistakes and then don’t apologize.
In The Washington Post, he blamed the press and Congress for creating an inhospitable atmosphere that drives good people away from public service. Maybe that’s why he and his evil twin, Dick Cheney, did their best to undermine the constitutional system of checks and balances so they could get more fine young people to serve.
Does the man blamed for creating civil disorder in Iraq even know what the word “civil” means? Wasn’t he the prickly Pentagon chief who got furious with anyone who didn’t agree with him on “global challenges”?
He shoved Gen. Eric Shinseki into retirement — and failed to show up at his retirement party — after the good general correctly told Congress that it would take several hundred thousand troops to invade and control Iraq. And he snubbed the German defense minister when Germany joined the Coalition of the Unwilling.
Interviewed by GQ’s Lisa DePaulo on his ranch in Taos, N.M., with another mule named Gus nearby, the “75-year-old package of waning testosterone,” as the writer called him, was asked if he misses W. Offering a wry smile, he replied, “Um, no.”
He now treats the son with the same contempt he treated the father with, which is why it’s so odd that the son hired his dad’s nemesis in the first place.
He actually had the gall to imply to Ms. DePaulo that he was out of the loop on Iraq and dragged out a copy of a memo he had written outlining all the things that could go wrong.
In fact, he was the one, right after 9/11, who began pushing to go after Saddam. He and Cheney were orchestrating the invasion from the start, guiding the dauphin with warnings about how weak he would seem if he let Saddam mock him.
The ultimate bureaucratic infighter wrote the memo as part of his Socratic strategy, asking a lot of questions when he was already pushing to go into Iraq. He never did any contingency planning in case those things went wrong; the memo was there simply so that someday he could pull it out for a reporter.
In the same issue of GQ, Colin Powell tried to build up the objections he made to the president, too, in an interview with Walter Isaacson. But nobody’s buying.
Even though he rubber-stamped W.’s tax cuts, Alan Greenspan is now upbraiding the president and vice president for profligate spending and putting politics ahead of sound economics.
He also says in his new memoir that “the Iraq war is largely about oil,” telling Bob Woodward that he had privately told W. and Cheney that ousting Saddam was “essential” to keeping world oil supplies safe.
Irrational exuberance, indeed.

Machine learning fun with scikit-learn and NYT columnists

The use of TF-IDF and LinearSVC is copied verbatim from the scikit-learn text analysis tutorial and applied to about 5,000 columns gathered across 11 NYT columnists; for example, Maureen Dowd's columns as listed on /column/maureen-dowd.
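To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw term count and followed by L2 normalization. The whitespace tokenizer, the `tfidf` helper, and the toy sentences are assumptions made for illustration. The real vectorizer uses a regex tokenizer, and in this gist it also drops terms that appear in fewer than 3 documents (`min_df=3`) or in more than 95% of them (`max_df=0.95`).

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Sketch only: whitespace tokenization, smoothed IDF, L2 normalization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical toy corpus, not the columnist data.
toy_docs = [
    "the president said the war is about oil",
    "the economy and the financial crisis",
    "the president and the economy",
]
toy_vectors = tfidf(toy_docs)
```

In the first toy document, "oil" and "president" each occur once, but "oil" appears in only one document, so it ends up with the larger weight. Terms that are distinctive for a document are weighted up, which is what lets a linear classifier separate writers by vocabulary.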

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import load_files
# train_test_split used to live in sklearn.cross_validation,
# which was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# load_files treats each subfolder name as a class label (here, one per columnist)
data_folder = "./data-hold/cleaned/"
sh_dataset = load_files(data_folder, shuffle=True)

# hold out 25% of the columns for testing
sh_docs_train, sh_docs_test, sh_y_train, sh_y_test = train_test_split(
    sh_dataset.data, sh_dataset.target, test_size=0.25, random_state=None)

# TF-IDF features feeding a linear support vector classifier
sh_pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
    ('clf', LinearSVC(C=1000)),
])

sh_pipeline.fit(sh_docs_train, sh_y_train)
sh_y_predicted = sh_pipeline.predict(sh_docs_test)

# print the results
print(metrics.classification_report(sh_y_test, sh_y_predicted,
                                    target_names=sh_dataset.target_names))

Initial results:

                   precision    recall  f1-score   support

   charles-m-blow       0.99      0.94      0.96        81
     david-brooks       0.98      0.98      0.98       169
      frank-bruni       1.00      0.98      0.99        64
     gail-collins       0.99      0.98      0.98       167
       joe-nocera       0.95      0.95      0.95        76
     maureen-dowd       0.95      0.98      0.96       125
 nicholas-kristof       0.93      0.96      0.95       134
     paul-krugman       0.98      0.99      0.98       157
      roger-cohen       0.99      0.99      0.99       115
     ross-douthat       1.00      0.94      0.97        49
thomas-l-friedman       0.98      0.98      0.98       126

      avg / total       0.97      0.97      0.97      1263
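
For readers unfamiliar with the columns in this report, the following is a small pure-Python sketch of how per-class precision, recall, F1, and support are derived from true and predicted labels. The `per_class_report` helper and the six toy labels are hypothetical and are not taken from the gist's predictions; `metrics.classification_report` does this work in the real script.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    true_count = Counter(y_true)   # support: test documents per class
    pred_count = Counter(y_pred)   # how often each class was predicted
    # Correct predictions per class (true positives).
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    report = {}
    for label in sorted(true_count):
        precision = hits[label] / pred_count[label] if pred_count[label] else 0.0
        recall = hits[label] / true_count[label]
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (precision, recall, f1, true_count[label])
    return report


# Hypothetical labels for illustration only.
toy_true = ["dowd", "dowd", "dowd", "brooks", "brooks", "krugman"]
toy_pred = ["dowd", "dowd", "brooks", "brooks", "brooks", "krugman"]
toy_report = per_class_report(toy_true, toy_pred)
```

Read this way, the ross-douthat row above (precision 1.00, recall 0.94, support 49) says that every test column attributed to Douthat was in fact his, while roughly 6% of his 49 test columns were attributed to another columnist.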

Finding the top 20 features

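import numpy as np

# pull the fitted classifier and vectorizer out of the pipeline trained above
clf = sh_pipeline.steps[1][1]
vect = sh_pipeline.steps[0][1]
# get_feature_names() was removed in scikit-learn 1.2
feature_names = vect.get_feature_names_out()

class_labels = sh_dataset.target_names
for i, class_label in enumerate(class_labels):
    # indices of the 20 largest coefficients for this class, smallest first
    topt = np.argsort(clf.coef_[i])[-20:]
    print("%s: %s" % (class_label,
          " ".join(feature_names[j] for j in topt)))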

Results:

charles-m-blow: zimmerman sequester week pew thankful gallup trayvon wednesday those pointed officer president continued nearly report furthermore poll must released according
david-brooks: moral series each these few speech then self cooper he culture lewinsky percent will past kerry people sort they are
frank-bruni: ones less monday there just he zelizer whose wasn evangelical isn colorado its many or last re them gay which
gail-collins: idea since perhaps giuliani all been guy ginsburg actually totally quiz who definitely was presidential going nobody pretty everybody really
joe-nocera: luke course money caro executive thus which article though indeed gun athletes retirement detainees joe football its company instance had
maureen-dowd: noting rice mushy put up poppy wrote old who christmas adding replied cheney tuesday hillary white even president said washington
nicholas-kristof: jesus isn notes my girls often united sudan then moldova one mr sometimes year found partly also yet may likewise
paul-krugman: thing which investors mainly aren isn answer even bad large claim administration example financial declared insurance fact what however mr
roger-cohen: french from century where obama course holbrooke minister perhaps land cannot words adderall before must states me has united london
ross-douthat: christian promise though post internet last critics liberals liberalism rather sweeping religious might instance instead kind well daniels liberal era
thomas-l-friedman: therefore will simon how watson putin just sandel arab more their anymore need regime israel our energy america added today
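
A note on reading these lists: `np.argsort` sorts in ascending order, so slicing off the last 20 indices yields the 20 highest-weighted terms with the strongest one last (for example, "washington" for maureen-dowd). Fragments such as "isn", "wasn", and "aren" are what remains of contractions, because the vectorizer's default tokenizer keeps only runs of two or more word characters. The sketch below reproduces the selection in plain Python; the `top_k_features` helper, the feature names, and the weights are made up for illustration and stand in for one row of `clf.coef_`.

```python
def top_k_features(weights, feature_names, k):
    """Return the k highest-weighted feature names, weakest of them first."""
    # Equivalent in spirit to np.argsort(weights): indices sorted by ascending weight.
    order = sorted(range(len(weights)), key=lambda j: weights[j])
    return [feature_names[j] for j in order[-k:]]


# Hypothetical weights for a single class.
toy_names = ["cheney", "financial", "said", "the", "washington"]
toy_weights = [0.8, -1.2, 1.1, 0.0, 1.5]

top3 = top_k_features(toy_weights, toy_names, 3)
# Reverse the slice to rank the strongest feature first.
top3_strongest_first = top3[::-1]
```

Reversing the slice, as in the last line, is a small change that would print each columnist's most characteristic word first instead of last.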