ACA Public Sentiment Project Proposal
Christopher G. Healey

Project Description

We will collect and analyze recent social network discussions about the Affordable Care Act (a.k.a. ObamaCare). Specifically, we will collect tweets from Twitter, a social network that allows users to post short text messages of up to 140 characters. We will apply topic clustering and sentiment analysis to the tweets, then interpret the results to provide a summary of the current major topics related to the ACA and their associated sentiment.

Data Source

We will use Twitter's real-time streaming API to collect tweets that contain the keywords:

We will use the TweetCapture program provided to us to connect to Twitter's real-time stream (the firehose) and collect tweets by keyword. Based on a check of Twitter's recent tweet activity, we anticipate being able to collect approximately 24,000 tweets per day (see the Data Source Justification section below for more details), or roughly 168,000 tweets over a 1-week period. Again based on recent tweet activity, we can observe topics like:

Analysis

We will perform topic clustering on the tweets to identify major topics of discussion. We will then perform sentiment estimation on each major topic to determine a general sentiment (specifically, positive, neutral, or negative pleasure) for the topic's tweets.
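The two analysis steps can be sketched as follows. The proposal does not name specific algorithms, so this is only a minimal illustration under stated assumptions: a greedy word-overlap (Jaccard) clustering stands in for topic clustering, and a tiny made-up word list (LEXICON) stands in for a real sentiment dictionary. The function names, thresholds, and sample tweets are all hypothetical.

```python
# Hypothetical pleasure lexicon: word -> score in [-1, 1]. Illustration only.
LEXICON = {"love": 0.8, "great": 0.7, "helps": 0.5,
           "hate": -0.8, "broken": -0.6, "costly": -0.5}


def jaccard(a, b):
    """Word-set overlap: |a & b| / |a | b|, or 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(tweets, threshold=0.2):
    """Greedy single-pass clustering (an assumed stand-in for topic clustering).

    A tweet joins the first cluster whose vocabulary overlaps it by at least
    `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # each entry: {"vocab": set of words, "tweets": list of str}
    for tweet in tweets:
        words = set(tweet.lower().split())
        for c in clusters:
            if jaccard(words, c["vocab"]) >= threshold:
                c["tweets"].append(tweet)
                c["vocab"] |= words
                break
        else:
            clusters.append({"vocab": set(words), "tweets": [tweet]})
    return clusters


def sentiment(tweets, neutral_band=0.1):
    """Average the lexicon scores of all matched words and bucket the mean."""
    scores = [LEXICON[w] for t in tweets for w in t.lower().split() if w in LEXICON]
    mean = sum(scores) / len(scores) if scores else 0.0
    if mean > neutral_band:
        return "positive"
    if mean < -neutral_band:
        return "negative"
    return "neutral"


# Made-up sample tweets, not collected data.
sample = [
    "website enrollment is broken again",
    "love the new premium subsidy",
    "enrollment website still broken",
    "premium subsidy helps my family",
]
topics = cluster(sample)
labels = [sentiment(c["tweets"]) for c in topics]
print(labels)
```

On the made-up sample, the two website tweets fall into one cluster and the two subsidy tweets into another. A real analysis would replace both the clustering rule and the lexicon with properly validated methods.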

Challenges

We anticipate a number of challenges we will need to overcome as part of our project.

  1. Differentiating tweets that discuss the ACA from tweets that match one of our keywords but are not talking about the ACA. Fortunately, these situations should be fairly rare, since our keywords are unlikely to be used in unrelated tweets.
  2. Performing standard stop word removal, stemming, and topic clustering on short text snippets that are not grammatically correct, that do not use correct spelling, that contain numerous abbreviations, that contain shortened URLs, and so on: RT @mr_prez What r u talkin 'bout, ur ACA sounds bo-gus!  :'(   >:O  http://bit.ly/1eYmVWG.
  3. Estimating sentiment on short, possibly ungrammatical text snippets where punctuation, emoticons, and abbreviations can have a significant impact: RT @fuma ur ACA idea is teh sh*te!!!! #urbandictionary.
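To make the second and third challenges concrete, here is a minimal sketch of how a noisy tweet might be normalized before clustering. It assumes a hand-written abbreviation map and stop word list (both hypothetical and far too small for real use), omits stemming, and sets emoticons aside rather than discarding them, since they may carry sentiment.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"r": "are", "u": "you", "ur": "your", "teh": "the",
                 "talkin": "talking", "bout": "about"}
STOP_WORDS = {"what", "are", "you", "your", "the", "is", "about"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Covers simple emoticons such as :'(  >:O  :-)  ;D (a deliberately narrow pattern).
EMOTICON_RE = re.compile(r">?[:;]'?-?[()DOoPp]")
WORD_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def normalize(tweet):
    """Return (tokens, emoticons) for one raw tweet."""
    text = URL_RE.sub(" ", tweet)            # drop shortened URLs first
    emoticons = EMOTICON_RE.findall(text)    # keep emoticons for sentiment
    text = EMOTICON_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)         # drop @user mentions
    text = re.sub(r"^\s*RT\b", " ", text)    # drop a leading retweet marker
    tokens = []
    for token in WORD_RE.findall(text.lower()):
        word = token.replace("-", "")        # "bo-gus" -> "bogus"
        word = ABBREVIATIONS.get(word, word) # expand known abbreviations
        if word not in STOP_WORDS:
            tokens.append(word)
    return tokens, emoticons


raw = "RT @mr_prez What r u talkin 'bout, ur ACA sounds bo-gus!  :'(   >:O  http://bit.ly/1eYmVWG."
tokens, emoticons = normalize(raw)
print(tokens)     # ['talking', 'aca', 'sounds', 'bogus']
print(emoticons)  # [":'(", '>:O']
```

Masked or creatively spelled words, such as the one in the second example tweet, would still be split or missed by this pattern, which is part of why these steps are listed as challenges.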

Data Source Justification

Although the ACA was passed in March 2010, public sentiment continues to polarize around the Act and its provisions. The upcoming midterm elections in November 2014 have provided an opportunity for both supporters and opponents to re-energize arguments for and against the Act (1, 2). In addition, a number of legal challenges to the Act are working their way through the lower courts (1, 2, 3, 4), with an expectation that the conflicting decisions will be referred to the Supreme Court in the near future.

Given current interest in the ACA, and the differing opinions on the pros and cons of the Act, we believe a sufficient number of tweets, with appropriate sentiment and topic variability, will be available through Twitter. Preliminary investigation indicates an available rate of approximately 1000 tweets/hour, with a wide range of comments and opinions embedded in the tweets we previewed. Based on these findings, we feel confident we can collect the raw data needed to support our goals and analysis plan for this project.
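As a quick check on the expected volume, the short calculation below projects the preliminary rate forward. It assumes the rate of 1000 tweets/hour holds constant around the clock, which is a simplification, since real activity will likely vary by time of day and news cycle. The function name is hypothetical.

```python
def projected_tweets(rate_per_hour, days):
    """Project a constant hourly tweet rate over a number of whole days."""
    return rate_per_hour * 24 * days


per_day = projected_tweets(1000, 1)
per_week = projected_tweets(1000, 7)
print(per_day, per_week)  # 24000 168000
```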

Deliverables

We will provide the following deliverables at the end of the project.

  1. A dataset containing tweets with various ACA keywords.
  2. A set of topics and associated sentiment derived from the tweet dataset.
  3. A short in-class presentation of our findings, discussions of their meaning, and general "lessons learned" from our project.