
Tutorial 4: Public Opinion on the Climate Emergency and Why it Matters#

Week 2, Day 3: IPCC Socio-economic Basis

Content creators: Maximilian Puelma Touzel

Content reviewers: Peter Ohue, Derick Temfack, Zahra Khodakaramimaghsoud, Peizhen Yang, Younkap Nina Duplex, Laura Paccini, Sloane Garelick, Abigail Bodner, Manisha Sinha, Agustina Pesce, Dionessa Biton, Cheng Zhang, Jenna Pearson, Chi Zhang, Ohad Zivan

Content editors: Jenna Pearson, Chi Zhang, Ohad Zivan

Production editors: Wesley Banfield, Jenna Pearson, Chi Zhang, Ohad Zivan

Our 2023 Sponsors: NASA TOPS and Google DeepMind

Tutorial Objectives#

In this tutorial, we will explore a dataset derived from Twitter, focusing on public sentiment surrounding the Conference of the Parties (COP) climate change conferences. We will use data from a published study by Falkenberg et al. (Nature Climate Change, 2022). This dataset encompasses tweets mentioning the COP conferences, which bring together world governments, NGOs, and businesses to discuss and negotiate progress on climate change. Our main objective is to understand public sentiment about climate change and how it has evolved over time through an analysis of changing word usage on social media. In the process, we will also learn how to manage and analyze large quantities of text data.

The tutorial is divided into sections, where we first delve into loading and inspecting the data, examining the timing and languages of the tweets, and analyzing sentiments associated with specific words, including those indicating ‘hypocrisy’. We’ll also look at sentiments regarding institutions within these tweets and compare the sentiment of tweets containing ‘hypocrisy’-related words versus those without. This analysis is supplemented with visualization techniques like word clouds and distribution plots.

By the end of this tutorial, you will have developed a nuanced understanding of how text analysis can be used to study public sentiment on climate change and other environmental issues, helping us to navigate the intricate and evolving landscape of climate communication and advocacy.

Setup#

# imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# notebook config
from IPython.display import display, HTML
import datetime
import re
import nltk
from nltk.corpus import stopwords
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
import urllib.request  # the lib that handles the url stuff
from afinn import Afinn
import pooch
import os
import tempfile

Figure settings#

# @title Figure settings
import ipywidgets as widgets  # interactive display

%config InlineBackend.figure_format = 'retina'
plt.style.use(
    "https://raw.githubusercontent.com/ClimateMatchAcademy/course-content/main/cma.mplstyle"
)

sns.set_style("ticks", {"axes.grid": False})
display(HTML("<style>.container { width:100% !important; }</style>"))

Video 2: A Simple Greenhouse Model#

# @title Video 2: A Simple Greenhouse Model
# Tech team will add code to format and display the video
# helper functions


def pooch_load(filelocation=None, filename=None, processor=None):
    shared_location = "/home/jovyan/shared/Data/tutorials/W2D3_FutureClimate-IPCCII&IIISocio-EconomicBasis"  # this is different for each day
    user_temp_cache = tempfile.gettempdir()

    if os.path.exists(os.path.join(shared_location, filename)):
        file = os.path.join(shared_location, filename)
    else:
        file = pooch.retrieve(
            filelocation,
            known_hash=None,
            fname=os.path.join(user_temp_cache, filename),
            processor=processor,
        )

    return file

Section 1: Data Preprocessing#

We have performed the following preprocessing steps for you (simply follow along; there is no need to execute any commands in this section):

Every Twitter message (hereon called tweets) has an ID. IDs of all tweets mentioning COPx (x=20-26, which refers to the session number of each COP meeting) used in Falkenberg et al. (2022) were placed by the authors in an osf archive. You can download the 7 .csv files (one for each COP) here

The twarc2 program serves as an interface with the Twitter API, allowing users to retrieve full tweet content and metadata by providing the tweet ID. Similar to GitHub, you need to create a Twitter API account and configure twarc on your local machine by providing your account authentication keys. To rehydrate a set of tweets using their IDs, you can use the following command: twarc2 hydrate source_file.txt store_file.jsonl. In this command, each line of the source_file.txt represents a Twitter ID, and the hydrated tweets will be stored in the store_file.jsonl.

  • First, format the downloaded IDs and split them into separate files (batches) so that each hydration call to the API takes a manageable amount of time (hours rather than days; hydration is slow because of an API-imposed limit of 100 tweets/min.).

# import os
# dir_name='Falkenberg2022_data/'
# if not os.path.exists(dir_name):
#     os.mkdir(dir_name)
# batch_size = int(1e5)
# download_pathname=''#~/projects/ClimateMatch/SocioEconDay/Polarization/COP_Twitter_IDs/
# for copid in range(20,27):
#     df_tweetids=pd.read_csv(download_pathname+'tweet_ids_cop'+str(copid)+'.csv')
#     for batch_id,break_id in enumerate(range(0,len(df_tweetids),batch_size)):
#         file_name="tweetids_COP"+str(copid)+"_b"+str(batch_id)+".txt"
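#         # .loc slices include both end points, so subtract 1 to keep batches from overlapping by one ID
#         df_tweetids.loc[break_id:break_id+batch_size-1,'id'].to_csv(dir_name+file_name,index=False,header=False)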
  • Make the hydration calls for COP26 (this took 4 days to download 50GB of data for COP26).

# import glob
# import time
# copid=26
# filename_list = glob.glob('Falkenberg2022_data/'+"tweetids_COP"+str(copid)+"*")
# dir_name='tweet_data/'
# if not os.path.exists(dir_name):
#     os.mkdir(dir_name)
# file_name="tweetids_COP"+str(copid)+"_b"+str(batch_id)+".txt"
# for itt,tweet_id_batch_filename in enumerate(filename_list):
#     strvars=tweet_id_batch_filename.split('/')[1].split('.')[0].split('_')
#     tweet_store_filename = dir_name+'tweets_'+strvars[1]+'_'+strvars[2]+'.json'
#     if not os.path.exists(tweet_store_filename):
#         st=time.time()
#         os.system('twarc2 hydrate '+tweet_id_batch_filename+' '+tweet_store_filename)
#         print(str(itt)+' '+str(strvars[2])+" "+str(time.time()-st))
  • Load the data, then inspect and pick a chunk size. Note, by default, there are 100 tweets per line in the .json files returned by the API. Given we asked for 1e5 tweets/batch, there should be 1e3 lines in these files.

# copid=26
# batch_id = 0
# tweet_store_filename = 'tweet_data/tweets_COP'+str(copid)+'_b'+str(batch_id)+'.json'
# num_lines = sum(1 for line in open(tweet_store_filename))
# num_lines
  • Now we read in the data, iterating over chunks in each batch and storing only the needed columns in a dataframe (takes 10-20 minutes to run). We keep when the tweets were posted, what language they are in, and the tweet text:

# selected_columns = ['created_at','lang','text']
# st=time.time()
# filename_list = glob.glob('tweet_data/'+"tweets_COP"+str(copid)+"*")
# df=[]
# for tweet_batch_filename in filename_list[:-1]:
#     reader = pd.read_json(tweet_batch_filename, lines=True,chunksize=1)
# #     df.append(pd.DataFrame([item[selected_columns] for sublist in reader.data.values.tolist()[:-1] for item in sublist] )[selected_columns])
#     dfs=[]
#     for chunk in reader:
#         if 'data' in chunk.columns:
#             dfs.append(pd.DataFrame(list(chunk.data.values)[0])[selected_columns])
#     df.append(pd.concat(dfs,ignore_index=True))
# #     df.append(pd.DataFrame(list(reader.data)[0])[selected_columns])
# df=pd.concat(df,ignore_index=True)
# df.created_at=pd.to_datetime(df.created_at)
# print(str(len(df))+' tweets took '+str(time.time()-st))
# df.head()
  • Finally, store the data in the efficiently compressed feather format

# df.to_feather('stored_tweets')
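
The batching arithmetic used in the steps above can be checked on a small scale. The following is a minimal sketch, assuming made-up integer IDs in place of real tweet IDs; the helper `split_into_batches` is hypothetical and not part of the original preprocessing code.

```python
import math


def split_into_batches(ids, batch_size):
    """Split a list of IDs into consecutive, non-overlapping batches."""
    return [ids[start:start + batch_size] for start in range(0, len(ids), batch_size)]


# made-up IDs standing in for the tweet IDs of one COP
tweet_ids = list(range(250_000))
batch_size = 100_000  # same value as batch_size = int(1e5) above
tweets_per_line = 100  # tweets per line in the hydrated files, as noted above

batches = split_into_batches(tweet_ids, batch_size)
batch_lengths = [len(batch) for batch in batches]
print(batch_lengths)

# a full batch should hydrate into batch_size / tweets_per_line lines
lines_per_full_batch = math.ceil(batch_size / tweets_per_line)
print(lines_per_full_batch)
```

Only the last batch is shorter than `batch_size`, and every ID appears in exactly one batch.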

Section 2: Load and Inspect Data#

Now that we have reviewed the steps that were taken to generate the preprocessed data, we can load the data. It may take a few minutes to download the data.

filename_tweets = "stored_tweets"
url_tweets = "https://osf.io/download/8p52x/"
df = pd.read_feather(
    pooch_load(url_tweets, filename_tweets)
)  # takes a couple minutes to download

Let’s check the timing of the tweets relative to the COP26 event (duration shaded in blue in the plot you will make) to see how the number of tweets varies over time.

total_tweetCounts = (
    df.created_at.groupby(df.created_at.dt.date)  # group tweets by calendar day
    .count()
    .rename("counts")
)
fig, ax = plt.subplots()
total_tweetCounts.reset_index().plot(
    x="created_at", y="counts", figsize=(20, 5), style=".-", ax=ax
)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
ax.set_yscale("log")
COPdates = [
    datetime.datetime(2021, 10, 31),
    datetime.datetime(2021, 11, 12),
]  # shade the duration of the COP26 to guide the eye
ax.axvspan(*COPdates, alpha=0.3)  # shaded region marking COP26

In addition to assessing the number of tweets, we can also explore who was tweeting about this COP. Look at how many tweets were posted in various languages:

counts = df.lang.value_counts().reset_index()

The language name of the tweet is stored as a code name. We can pull a language code dictionary from the web and use it to translate the language code to the language name.

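target_url = "https://gist.githubusercontent.com/carlopires/1262033/raw/c52ef0f7ce4f58108619508308372edd8d0bd518/gistfile1.txt"
# the downloaded file defines the variable `iso_639_choices`; note that exec() runs whatever code the URL returns
exec(urllib.request.urlopen(target_url).read())
lang_code_dict = dict(iso_639_choices)
# the column holding the language codes is named differently across pandas versions, so select it by position
counts = counts.replace({counts.columns[0]: lang_code_dict})
counts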

Coding Exercise 2#

Run the following cell to print the dictionary for the language codes:

lang_code_dict

Find your native language code in the dictionary you just printed and use it to select the COP tweets that were written in your language!

language_code = ...
df_tmp = df.loc[df.lang == language_code, :].reset_index(drop=True)
pd.options.display.max_rows = 100  # see up to 100 entries
pd.options.display.max_colwidth = 250  # widen how much text is presented of each tweet
samples = ...
samples

Click for solution

df = df_tmp

Section 3: Word Set Prevalence#

Falkenberg et al. investigated the hypothesis that public sentiment around the COP conferences has increasingly framed them as hypocritical (“political hypocrisy as a topic of cross-ideological appeal”). The authors operationalized hypocrisy language as any tweet containing any of the following words:

selected_words = [
    "hypocrisy",
    "hypocrite",
    "hypocritical",
    "greenwash",
    "green wash",
    "blah",
]  # the last 3 words don't add much. They refer to Greta Thunberg's 'blah, blah, blah' speech on Sept. 28th 2021.

Questions 3#

  1. How might this matching procedure be limited in its ability to capture this sentiment?

Click for solution

The authors then searched for these words within a distinct dataset across all COP conferences (this dataset was not made openly accessible but the figure using that data is here). They found that hypocrisy has been mentioned more in recent COP conferences.

Here, we will shift our focus to their accessible COP26 dataset and analyze the nature of comments related to specific topics, such as political hypocrisy. First, let’s look through the whole dataset and pull tweets that mention any of the selected words.

selectwords_detector = re.compile(
    r"\b(?:{0})\b".format("|".join(re.escape(word) for word in selected_words)),
    flags=re.IGNORECASE,  # flags must be set here: the second argument of a compiled pattern's search() is a start position
)  # to make a word detector for a wordlist faster to run, compile it!
df["select_talk"] = df.text.apply(
    lambda x: selectwords_detector.search(x)
)  # look through whole dataset, flagging tweets with select_talk (computes in under a minute)
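
To get a feel for what such a detector does and does not catch, here is a minimal sketch on a few made-up tweets (they are not from the dataset). It assumes the same word list as above, escapes each word, and sets the ignore-case flag when compiling.

```python
import re

# same word list as selected_words above
hypocrisy_words = ["hypocrisy", "hypocrite", "hypocritical", "greenwash", "green wash", "blah"]

# escape each word, join the alternatives with "|", and require word boundaries on both sides
detector = re.compile(
    r"\b(?:{0})\b".format("|".join(re.escape(word) for word in hypocrisy_words)),
    flags=re.IGNORECASE,
)

# made-up example tweets for illustration only
toy_tweets = [
    "The HYPOCRISY of flying in on private jets",
    "So many hypocrites at the summit",
    "Stop the greenwashing now",
    "Great progress on renewables today",
]
matches = [detector.search(tweet) is not None for tweet in toy_tweets]
print(matches)
```

Only the first tweet is flagged. The word boundaries mean that inflected forms such as "hypocrites" or "greenwashing" are missed unless they are added to the list explicitly.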

Let’s extract these tweets and examine their occurrence statistics in relation to the entire dataset that we calculated above.

selected_tweets = df.loc[~df.select_talk.isnull(), :]
selected_tweet_counts = (
    selected_tweets.created_at.groupby(
        selected_tweets.created_at.dt.date  # group tweets by calendar day
    )
    .count()
    .rename("counts")
)
selected_tweet_fraction = selected_tweet_counts / total_tweetCounts
fig, ax = plt.subplots(figsize=(20, 5))
selected_tweet_fraction.reset_index().plot(
    x="created_at", y="counts", style=[".-"], ax=ax
)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
ax.axvspan(*COPdates, alpha=0.3)  # shaded region marking COP26
ax.set_ylabel("fraction talking about hypocrisy")

Please note that these values are fractions of each day’s total number of tweets. Because the total number of tweets is orders of magnitude larger close to the COP26 dates (shaded in blue), the larger fractions there correspond to a far greater absolute number of tweets talking about hypocrisy.
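
As a quick worked example of this point, the sketch below converts fractions back to absolute counts. All numbers are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
# hypothetical daily totals and hypocrisy fractions (illustrative numbers only)
daily_totals = {"2021-10-01": 2_000, "2021-11-05": 200_000}
daily_fractions = {"2021-10-01": 0.01, "2021-11-05": 0.02}

# absolute number of hypocrisy tweets = fraction x total tweets on that day
absolute_counts = {
    day: round(daily_totals[day] * daily_fractions[day]) for day in daily_totals
}
print(absolute_counts)

# the fraction only doubles, but the absolute count grows far more
ratio = absolute_counts["2021-11-05"] / absolute_counts["2021-10-01"]
print(ratio)
```

In this toy case the fraction doubles while the absolute number of tweets grows by a factor of 200.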

Now, let’s examine the content of these tweets by randomly sampling 100 of them.

selected_tweets.text.sample(100).values

Coding Exercise 3#

  1. Please select another topic and provide a list of topic words. We will then conduct the same analysis for that topic. For example, if the topic is “renewable technology,” please provide a list of relevant words.

selected_words_2 = [..., ..., ..., ..., ...]

selectwords_detector_2 = re.compile(
    r"\b(?:{0})\b".format("|".join([re.escape(str(word)) for word in selected_words_2])),
    flags=re.IGNORECASE,  # flags must be set here, not in search()
)
df["select_talk_2"] = df.text.apply(
    lambda x: selectwords_detector_2.search(x)
)

selected_tweets_2 = df.loc[~df.select_talk_2.isnull(), :]
selected_tweet_counts_2 = (
    selected_tweets_2.created_at.groupby(
        selected_tweets_2.created_at.dt.date  # group tweets by calendar day
    )
    .count()
    .rename("counts")
)
selected_tweet_fraction_2 = ...

samples = ...
samples

Click for solution

Section 4: Sentiment Analysis#

Let’s test this hypothesis from Falkenberg et al. (that public sentiment around the COP conferences has increasingly framed them as political hypocrisy). To do so, we can use sentiment analysis, which is a method for computing the proportion of words that have positive connotations, negative connotations or are neutral. Some sentiment analysis systems can measure other word attributes as well. In this case, we will analyze the sentiment of the subset of tweets that mention international organizations central to globalization (e.g., G7), focusing specifically on the tweets related to hypocrisy.

Note: part of the computation flow in what follows is from Neal Caren’s tutorial.

We’ll assign tweets a sentiment score using a dictionary method (i.e. based on the sentiment scores of the words in the tweet that appear in a given word-sentiment score dictionary). The particular word-sentiment score dictionary we will use is compiled in the AFINN package and scores each word between -5 (negative connotation) and 5 (positive connotation). The English language dictionary consists of 2,477 coded words.
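
To make the dictionary method concrete, here is a minimal sketch of the idea. The tiny lexicon and its values are made up for illustration and are not taken from AFINN, and the function `dictionary_score` is a hypothetical helper, not the AFINN package's implementation.

```python
import re

# toy word-score dictionary; the values are illustrative and NOT taken from AFINN
toy_lexicon = {"good": 3, "hope": 2, "bad": -3, "fraud": -4}


def dictionary_score(text, lexicon):
    """Sum the scores of all lexicon words that occur in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    # words that are not in the lexicon contribute 0
    return sum(lexicon.get(word, 0) for word in words)


positive = dictionary_score("Good news gives hope", toy_lexicon)  # 3 + 2
negative = dictionary_score("Bad deal, total fraud", toy_lexicon)  # -3 - 4
neutral = dictionary_score("Delegates arrive in Glasgow", toy_lexicon)  # no lexicon words
print(positive, negative, neutral)
```

A tweet without any dictionary words gets a score of 0, which is why many tweets end up exactly neutral under this kind of method.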

Let’s initialize the dictionary for the selected language. For example, the language code for English is ‘en’.

afinn = Afinn(language=language_code)

Now we can load the dictionary:

filename_afinn_wl = "AFINN-111.txt"
url_afinn_wl = (
    "https://raw.githubusercontent.com/fnielsen/afinn/master/afinn/data/AFINN-111.txt"
)

afinn_wl_df = pd.read_csv(
    pooch_load(url_afinn_wl, filename_afinn_wl),
    header=None,  # no column names
    sep="\t",  # tab separated
    names=["term", "value"],
)  # new column names
seed = 808  # seed for sample so results are stable
afinn_wl_df.sample(10, random_state=seed)

Let’s look at the distribution of scores over all words in the dictionary

fig, ax = plt.subplots()
afinn_wl_df.value.value_counts().sort_index().plot.bar(ax=ax)
ax.set_xlabel("AFINN score")
ax.set_ylabel("dictionary counts")

These scores were assigned to words based on labeled tweets (validation paper).

Before focussing on sentiments about institutions within the hypocrisy tweets, let’s look at the hypocrisy tweets in comparison to non-hypocrisy tweets. This will take some more intensive computation, so let’s only perform it on a 1% subsample of the dataset.

smalldf = df.sample(frac=0.01)
smalldf["afinn_score"] = smalldf.text.apply(
    afinn.score
)  # intensive computation! We have reduced the data set to frac=0.01 of its size so it takes ~1 min. (the full dataset takes 1 hr 50 min.)
smalldf["afinn_score"].describe()  # generate descriptive statistics.

From this, we can see that the maximum score is 24 and the minimum score is -33 (your values may differ somewhat, since the 1% subsample is drawn at random). The score is computed by summing up the scores of all dictionary words present in the tweet, which means that longer tweets tend to have scores of larger magnitude.

To make the scores comparable across tweets of different lengths, a rough approach is to convert them to a per-word score. This is done by normalizing each tweet’s score by its word count. It’s important to note that this per-word score is not specific to the dictionary words used, so this approach introduces a bias that depends on the proportion of dictionary words in each tweet. We will refer to this normalized score as afinn_adjusted.

def word_count(text_string):
    """Calculate the number of words in a string"""
    return len(text_string.split())


smalldf["word_count"] = smalldf.text.apply(word_count)
smalldf["afinn_adjusted"] = (
    smalldf["afinn_score"] / smalldf["word_count"]
)  # note this isn't a percentage
smalldf["afinn_adjusted"].describe()
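
The bias of the per-word score can be seen in a small example. This is a minimal sketch with a made-up one-word lexicon (the value is illustrative, not an AFINN score) and a hypothetical helper `per_word_score`.

```python
def word_count(text_string):
    """Calculate the number of words in a string"""
    return len(text_string.split())


# toy lexicon with a single entry; the value is illustrative, NOT an AFINN score
toy_lexicon = {"hypocrisy": -2}


def per_word_score(text, lexicon):
    """Sum the lexicon scores in the text and divide by the total word count."""
    total = sum(lexicon.get(word, 0) for word in text.lower().split())
    return total / word_count(text)


short_tweet = "pure hypocrisy"
long_tweet = "this summit is pure hypocrisy from start to finish"

short_score = per_word_score(short_tweet, toy_lexicon)
long_score = per_word_score(long_tweet, toy_lexicon)
print(short_score, long_score)
```

Both tweets contain the same single dictionary word, yet the longer tweet gets a score much closer to zero because the non-dictionary words dilute it.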

After normalizing the scores, we find that the maximum score is now 2 and the minimum score is now -1.5.

Now let’s look at the sentiment of tweets with hypocrisy words versus those without those words. For reference, we’ll first make cumulative distribution plots of score distributions for some other possibly negative words: fossil, G7, Boris and Davos.

for sel_words in [["Fossil"], ["G7"], ["Boris"], ["Davos"], selected_words]:
    sel_name = sel_words[0] if len(sel_words) == 1 else "select_talk"
    selectwords_detector = re.compile(
        r"\b(?:{0})\b".format("|".join(re.escape(word) for word in sel_words)),
        flags=re.IGNORECASE,  # flags must be set here, not in search()
    )  # compile for speed!
    smalldf[sel_name] = smalldf.text.apply(
        lambda x: selectwords_detector.search(x) is not None
    )  # flag if tweet has word(s)
for sel_words in [["Fossil"], ["G7"], ["Boris"], ["Davos"], selected_words]:
    sel_name = sel_words[0] if len(sel_words) == 1 else "select_talk"
    fig, ax = plt.subplots()
    ax.set_xlim(-1, 1)
    ax.set_xlabel("adjusted AFINN score")
    ax.set_ylabel("cumulative probability")
    score_bins = np.linspace(-1, 1, 101)
    bin_width = score_bins[1] - score_bins[0]
    counts, bins = np.histogram(
        smalldf.loc[smalldf[sel_name], "afinn_adjusted"],
        bins=score_bins,
        density=True,
    )
    # density times bin width is the probability per bin, so the cumulative sum ends at 1
    ax.plot(
        bins[1:], np.cumsum(counts) * bin_width, color="C0", label=sel_name + " tweets"
    )
    counts, bins = np.histogram(
        smalldf.loc[~smalldf[sel_name], "afinn_adjusted"],
        bins=score_bins,
        density=True,
    )
    ax.plot(
        bins[1:],
        np.cumsum(counts) * bin_width,
        color="C1",
        label="non-" + sel_name + " tweets",
    )
    ax.axvline(0, color=[0.7] * 3, zorder=1)
    ax.legend()
    ax.set_title("cumulative AFINN score distribution for " + sel_name + " occurrence")

Recall from our previous calculations that the tweets containing the selected hypocrisy-associated words have a minimum adjusted score of -1.5. This score is much more negative than the scores of all four reference words we just plotted. So what is the content of these selected tweets that is causing them to be so negative? To explore this, we can use word clouds to assess the usage of specific words.

Section 5: Word Clouds#

To analyze word usage, let’s first vectorize the text data. Vectorization (which builds on tokenization, i.e. splitting text into words) here means giving each word in the vocabulary an index and transforming each word sequence into a sequence of the corresponding word indices (e.g. the response ['I','love','icecream'] maps to something like [34823,5937,79345]).
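
The index mapping can be sketched in a few lines. This is a minimal illustration with a hypothetical helper `build_vocabulary` that assigns indices in order of first appearance; scikit-learn’s vectorizers build their vocabulary differently, so the actual index values will not match.

```python
def build_vocabulary(documents):
    """Assign each distinct word an integer index, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for word in document:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


# two made-up, already tokenized documents
documents = [["I", "love", "icecream"], ["I", "love", "climate", "science"]]
vocabulary = build_vocabulary(documents)

# each document becomes a sequence of word indices
encoded = [[vocabulary[word] for word in document] for document in documents]
print(encoded)

# a count vector has one entry per vocabulary word
count_vector = [documents[1].count(word) for word in vocabulary]
print(count_vector)
```

The count vector is the representation that the word-frequency analysis below works with: one column per vocabulary word and one row per document.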

We’ll use and compare two methods: term-frequency (\(\mathrm{tf}\)) and term-frequency inverse document frequency (\(\mathrm{Tfidf}\)). Both of these methods measure how important a term is within a document relative to a collection of documents by using vectorization to transform words into numbers.

Term Frequency (\(\mathrm{tf}\)): the number of times the word appears in a document compared to the total number of words in the document.

\[\mathrm{tf}=\frac{\mathrm{number \; of \; times \; the \; term \; appears \; in \; the \; document}}{\mathrm{total \; number \; of \; terms \; in \; the \; document}}\]

Inverse Document Frequency (\(\mathrm{idf}\)): reflects how rare the term is across the collection of documents; it decreases as the proportion of documents that contain the term increases. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).

\[\mathrm{idf}=\log\left(\frac{\mathrm{number \; of \; documents \; in \; the \; collection}}{\mathrm{number \; of \; documents \; in \; the \; collection \; containing \; the \; term}}\right)\]

Thus the overall term-frequency inverse document frequency can be calculated by multiplying the term-frequency and the inverse document frequency:

\[\mathrm{Tfidf}=\mathrm{tf} \times \mathrm{idf}\]

\(\mathrm{Tfidf}\) aims to add more discriminability to frequency as a word relevance metric by downweighting words that appear in many documents since these common words are less discriminative. In other words, the importance of a term is high when it occurs a lot in a given document and rarely in others.
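
Here is a minimal worked sketch of this weighting on three made-up, already tokenized documents. It assumes the common textbook definition \(\mathrm{idf}=\log(N/n_t)\) with the natural logarithm, where \(N\) is the number of documents and \(n_t\) the number of documents containing the term. scikit-learn’s TfidfVectorizer uses a smoothed and normalized variant by default, so its numbers will differ.

```python
import math

# three made-up tokenized documents
toy_docs = [
    ["private", "jet", "hypocrisy"],
    ["climate", "hypocrisy"],
    ["climate", "action", "now", "hypocrisy"],
]


def term_frequency(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)


def inverse_document_frequency(term, docs):
    """log(N / n_t); assumes the term occurs in at least one document."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tfidf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)


score_hypocrisy = tfidf("hypocrisy", toy_docs[0], toy_docs)
score_jet = tfidf("jet", toy_docs[0], toy_docs)
print(score_hypocrisy, score_jet)
```

Because "hypocrisy" appears in every document, its idf is log(1) = 0 and its weight vanishes, while "jet", which appears in only one document, keeps a positive weight.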

If you are interested in learning more about the mathematical equations used to develop these two methods, please refer to the additional details in the “Further Reading” section for this day.

Let’s run both of these methods and store the vectorized data in a dictionary:

vectypes = ["counts", "Tfidf"]


def vectorize(doc_data, ngram_range=(1, 1), remove_words=[], min_doc_freq=1):
    """
    Vectorizes doc_data (an iterable of strings, or a dataframe with a "text" column)
    with each vectorizer type in vectypes
    """
    if isinstance(doc_data, pd.DataFrame):
        doc_data = doc_data["text"].values

    vectorized_data_dict = {}
    for vectorizer_type in vectypes:
        if vectorizer_type == "counts":
            vectorizer = CountVectorizer(
                stop_words=remove_words, min_df=min_doc_freq, ngram_range=ngram_range
            )
        elif vectorizer_type == "Tfidf":
            vectorizer = TfidfVectorizer(
                stop_words=remove_words, min_df=min_doc_freq, ngram_range=ngram_range
            )

        # use the function argument rather than the global variable `data`
        vectorized_doc_list = vectorizer.fit_transform(doc_data).todense().tolist()
        feature_names = (
            vectorizer.get_feature_names_out()
        )  # or  get_feature_names() depending on scikit learn version
        print("vocabulary size:" + str(len(feature_names)))
        wdf = pd.DataFrame(vectorized_doc_list, columns=feature_names)
        vectorized_data_dict[vectorizer_type] = wdf
    return vectorized_data_dict, feature_names


def plot_wordcloud_and_freqdist(wdf, title_str, feature_names):
    """
    Plots a word cloud
    """
    pixel_size = 600
    x, y = np.ogrid[:pixel_size, :pixel_size]
    mask = (x - pixel_size / 2) ** 2 + (y - pixel_size / 2) ** 2 > (
        pixel_size / 2 - 20
    ) ** 2
    mask = 255 * mask.astype(int)
    wc = WordCloud(
        background_color="rgba(255, 255, 255, 0)", mode="RGBA", mask=mask, max_words=50
    )  # ,relative_scaling=1)
    wordfreqs = wdf.T.sum(axis=1)  # total weight of each term, summed over tweets
    num_show = min(50, len(wordfreqs))  # cannot show more terms than the vocabulary holds
    sorted_ids = np.argsort(wordfreqs.values)[::-1]  # term positions, most frequent first
    sorted_freqs = wordfreqs.values[sorted_ids]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(x=range(num_show), height=sorted_freqs[:num_show])
    ax.set_xticks(range(num_show))
    ax.set_xticklabels(
        np.asarray(feature_names)[sorted_ids][:num_show],
        rotation=45,
        fontsize=8,
        ha="right",
    )
    ax.set_ylabel("total frequency")
    ax.set_title(title_str + " vectorizer")
    ax.set_ylim(0, 10 * sorted_freqs[num_show // 2])

    ax_wc = inset_axes(ax, width="90%", height="90%")
    wc.generate_from_frequencies(wordfreqs)
    ax_wc.imshow(wc, interpolation="bilinear")
    ax_wc.axis("off")


nltk.download(
    "stopwords"
)  # downloads basic stop words, i.e. words with little semantic value  (e.g. "the"), to be used as words to be removed
remove_words = stopwords.words("english")

We can now vectorize and look at the wordclouds for single word statistics. Let’s explicitly exclude some words and implicitly exclude ones that appear in fewer than some threshold number of tweets.

data = (
    selected_tweets["text"].sample(frac=0.1).values
)  # reduce size since the vectorization computation transforms the corpus into an array of large size (vocabulary size x number of tweets)
# let's add some more words that we don't want to track (you can generate this kind of list iteratively by looking at the results and adding to this list):
remove_words += [
    "cop26",
    "http",
    "https",
    "30",
    "000",
    "je",
    "rt",
    "climate",
    "limacop20",
    "un_climatetalks",
    "climatechange",
    "via",
    "ht",
    "talks",
    "unfccc",
    "peru",
    "peruvian",
    "lima",
    "co",
]
print(str(len(data)) + " tweets")
min_doc_freq = 5 / len(data)
ngram_range = (1, 1)  # start and end number of words
vectorized_data_dict, feature_names = vectorize(
    data,  # the subsampled tweet texts defined above
    ngram_range=ngram_range,
    remove_words=remove_words,
    min_doc_freq=min_doc_freq,
)
for vectorizer_type in vectypes:
    plot_wordcloud_and_freqdist(
        vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names
    )
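
The effect of the document-frequency threshold can be illustrated with a minimal sketch. It assumes that a fractional threshold means "keep terms that occur in at least this fraction of the documents"; the helper `document_frequency_filter` and the toy documents are hypothetical and only mimic the idea, not scikit-learn’s implementation.

```python
from collections import Counter


def document_frequency_filter(docs, min_doc_fraction):
    """Keep terms that occur in at least min_doc_fraction of the documents."""
    doc_counts = Counter()
    for doc in docs:
        doc_counts.update(set(doc))  # count each term once per document
    threshold = min_doc_fraction * len(docs)
    return sorted(term for term, count in doc_counts.items() if count >= threshold)


# four made-up tokenized documents
toy_docs = [
    ["private", "jet"],
    ["private", "jet", "hypocrisy"],
    ["hypocrisy", "typo0"],
    ["hypocrisy"],
]

# require a term to appear in at least 2 of the 4 documents
kept_terms = document_frequency_filter(toy_docs, 2 / 4)
print(kept_terms)
```

The rare term "typo0" appears in only one document and is dropped, which is how the threshold removes misspellings and other one-off tokens without listing them explicitly.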

Note in the histograms how the \(\mathrm{Tfidf}\) vectorizer has scaled down the hypocrisy words such that they are less prevalent relative to the count vectorizer.

There are some words here (e.g. private and jet) that look like they likely would appear in pairs. Let’s tell the vectorizer to also look for high frequency pairs of words.

ngram_range = (1, 2)  # start and end number of words
vectorized_data_dict, feature_names = vectorize(
    data,  # the subsampled tweet texts defined above
    ngram_range=ngram_range,
    remove_words=remove_words,
    min_doc_freq=min_doc_freq,
)
for vectorizer_type in vectypes:
    plot_wordcloud_and_freqdist(
        vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names
    )
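
To see what an n-gram range of (1, 2) produces, here is a minimal sketch on one made-up tokenized tweet. The helper `ngrams` is hypothetical and only illustrates the idea of combining single words and adjacent word pairs.

```python
def ngrams(tokens, n_min, n_max):
    """Return all n-grams with n_min <= n <= n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# made-up tokenized tweet
tokens = ["leaders", "arrive", "by", "private", "jet"]
unigrams_and_bigrams = ngrams(tokens, 1, 2)
print(unigrams_and_bigrams)
```

Five words give five single-word terms plus four adjacent pairs, including "private jet", so the vocabulary grows quickly as the range widens.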

The hypocrisy words take up so much frequency that it is hard to see what the remaining words are. To clear this list a bit more, let’s also remove the hypocrisy words altogether.

remove_words += selected_words
ngram_range = (1, 2)  # start and end number of words
vectorized_data_dict, feature_names = vectorize(
    data,  # the subsampled tweet texts defined above
    ngram_range=ngram_range,
    remove_words=remove_words,
    min_doc_freq=min_doc_freq,
)
for vectorizer_type in vectypes:
    plot_wordcloud_and_freqdist(
        vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names
    )

Observe that terms we might have expected to be associated with hypocrisy, e.g. “flying”, are still present. Even when allowing for pairs, the semantics are hard to extract from this analysis, which ignores the correlations in usage among multiple words.

To further assess the statistics, one approach is to use a generative model with latent structure.

Topic models (the structural topic model in particular) are a nice modelling framework to start analyzing those correlations.

For a modern introduction to text analysis in the social sciences, I recommend the textbook:

Text as Data: A New Framework for Machine Learning and the Social Sciences (2022) by Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart

Summary#

In this tutorial, you’ve learned how to analyze large amounts of text data from social media to understand public sentiment about climate change. You’ve been introduced to the process of loading and examining Twitter data, specifically relating to the COP climate change conferences. You’ve also gained insights into identifying and analyzing sentiments associated with specific words, with a focus on those indicating ‘hypocrisy’.

We used techniques to normalize sentiment scores and to compare sentiment among different categories of tweets. You have also learned about text vectorization methods, term-frequency (tf) and term-frequency inverse document frequency (tfidf), and their applications in word usage analysis. This tutorial provided you with a valuable stepping stone to delve further into text analysis, which could help deepen our understanding of public sentiment on climate change. Such analysis helps us track how global perceptions and narratives about climate change evolve over time, which is crucial for policy planning and climate communication strategies.

This tutorial therefore not only provided you with valuable tools for text analysis but also demonstrated their potential in contributing to our understanding of climate change perceptions, a key factor in driving climate action.

Resources#

The data for this tutorial can be accessed from Falkenberg et al. Nature Clim. Chg. 2022.