Project Demo

Below is a demo of all five methods explored in the Project Details section. Enter a note that an organizer might write after a conversation and see whether each model accurately categorizes the constituent's main issue or issues. A few sample field notes are provided at the bottom of this page, but you're encouraged to test the models' limitations with your own examples.


# All Models Demo
# Runs 5 models on a single input and compares their issue tag outputs.
# KeyBERT | Claude | SpaCy | Zero-Shot Classification | Sentence Transformers

# ── KeyBERT ───────────────────────────────────────────────────────────────────
# from keybert import KeyBERT
# kw_model = KeyBERT()

doc = input("Enter a sample field note: ")
candidates = ["Economy",
             "Democracy in the US",
             "Terrorism",
             "Immigration",
             "Education",
             "Healthcare",
             "Gun Policy",
             "Abortion",
             "Taxes",
             "Crime",
             "Foreign Affairs",
             "Energy Policy",
             "Race Relations",
             "LGBTQ+ rights",
             "Housing"]
keywords = kw_model.extract_keywords(doc, candidates=candidates)
kb_result = ", ".join(kw for kw, _ in keywords) if keywords else "No match"
print(f"{'KeyBERT:':<22}{kb_result}")

# ── Claude ────────────────────────────────────────────────────────────────────
# import anthropic
# client = anthropic.Anthropic()
client = claude_client
system_info = "Using the following list of potential keyword output, output 1 keyword (2 if both are very relevant) to the input organizer notes that summarize what the main issue of the constituent is. Potential Keyword Bank: ['Economy', 'Democracy in the US', 'Terrorism', 'Immigration', 'Education', 'Healthcare', 'Gun Policy', 'Abortion', 'Taxes', 'Crime', 'Foreign Affairs', 'Energy Policy', 'Race Relations', 'LGBTQ+ rights', 'Housing']"
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_info,
    messages=[
        {"role": "user", "content": doc}
    ]
)
print(f"{'Claude:':<22}{message.content[0].text.strip()}")

# ── SpaCy ─────────────────────────────────────────────────────────────────────
# import numpy as np
# import spacy
# from spacy.matcher import Matcher, PhraseMatcher

# ── Config ────────────────────────────────────────────────────────────────────

# Minimum vector cosine similarity to count as a match (Layer 3 only)
VECTOR_THRESHOLD = 0.55

# Second issue is only returned if its hit count is at least this fraction
# of the top issue's hit count. 0.5 = needs half as many hits as the top.
SECOND_ISSUE_MIN_RATIO = 0.5

# ── Keyword Bank ──────────────────────────────────────────────────────────────

KEYWORDS = {
   "The Economy": [
       "inflation", "prices", "affordability", "wages", "jobs",
       "unemployment", "cost of living", "grocery", "gas prices", "childcare costs",
       "recession", "economic growth", "small business", "supply chain", "poverty",
       "financial strain", "minimum wage", "debt", "layoffs", "outsourcing",
       "manufacturing", "consumer prices", "interest rates", "federal reserve", "stock market",
       "retirement savings", "pension", "economic inequality", "purchasing power", "labor market",
       "cost too high", "stagnation", "bankruptcy"
   ],
   "Democracy": [
       "voting", "election", "ballot", "voter rights", "voter suppression",
       "gerrymandering", "election integrity", "campaign finance", "lobbying", "misinformation",
       "voter ID", "mail-in voting", "polling", "civic engagement", "election security",
       "disenfranchisement", "democracy", "autocracy", "representation", "redistricting",
       "dark money", "PAC", "transparency", "accountability", "term limits",
       "filibuster", "Senate", "House", "constitutional rights", "political corruption",
       "polarization", "gridlock", "oligarchy", "partisanship"
   ],
   "Terrorism & National Security": [
       "terrorism", "national security", "homeland security", "threat", "surveillance",
       "intelligence", "military", "attack", "extremism", "radicalization",
       "counterterrorism", "border security", "cyber attack", "ISIS", "safety",
       "al-Qaeda", "domestic terrorism", "bioterrorism", "Pentagon", "defense spending",
       "FBI", "CIA", "NSA", "patriot act", "war on terror",
       "sleeper cell", "chemical weapons", "nuclear threat", "mass casualty", "security breach",
       "not safe anymore", "infiltration", "propaganda", "espionage"
   ],
   "Immigration": [
       "immigration", "immigrant", "deportation", "border", "asylum",
       "undocumented", "refugee", "citizenship", "visa", "DACA",
       "pathway", "migrant", "ICE", "detention", "naturalization",
       "sanctuary city", "border wall", "illegal immigration", "legal immigration", "green card",
       "work permit", "family separation", "border patrol", "immigrant labor", "xenophobia",
       "overcrowding", "displacement", "smuggling"
   ],
   "Education": [
       "school", "public school", "teacher", "curriculum", "tuition",
       "student loans", "voucher", "school choice", "charter school", "literacy",
       "higher education", "college", "classroom", "school board", "special education", "early childhood", "preschool",
       "teacher pay", "textbooks", "school safety", "homeschooling", "AP courses",
       "college affordability", "trade school", "library", "academic freedom", "overcrowding",
       "illiteracy", "dropout", "indoctrination"
   ],
   "Healthcare": [
       "healthcare", "insurance", "prescription", "hospital", "medical costs",
       "insulin", "mental health", "Medicare", "Medicaid", "coverage",
       "premiums", "copay", "chronic illness", "drug prices", "universal healthcare",
       "doctor", "treatment", "patient", "ACA", "preexisting condition",
       "emergency room", "healthcare access", "rural healthcare", "nursing home", "disability",
       "opioid", "therapy", "telehealth", "pharmaceutical", "out-of-pocket",
       "can't afford medication", "insurance denied claim", "healthcare is unaffordable", "medical bills",
       "uninsured", "misdiagnosis"
   ],
   "Gun Policy": [
       "gun", "firearm", "2nd amendment", "NRA", "background check",
       "gun control", "gun rights", "concealed carry", "assault weapon", "shooting",
       "gun violence", "open carry", "weapon", "ammunition", "gun ownership",
       "rifle", "handgun", "red flag law", "mass shooting", "school shooting",
       "gun safety", "magazine capacity", "silencer", "gun registry", "gun buyback", "pistol", "shotgun", "self-defense", "gun shop",
       "bloodshed", "loopholes"
   ],
   "Abortion": [
       "abortion", "pro-choice", "pro-life", "reproductive rights", "Planned Parenthood",
       "Roe v. Wade", "women's health", "clinic", "contraception", "pregnancy",
       "fetus", "bodily autonomy", "abortion access", "ballot initiative", "family planning",
       "abortion ban", "trimester", "late-term abortion", "abortion pill", "miscarriage",
       "rape exception", "incest exception", "parental consent", "abortion clinic", "maternal health",
       "birth control", "unintended pregnancy", "sex education", "adoption", "viability",
       "no access to care", "clinics are closing",
       "criminalization", "autonomy", "restriction"
   ],
   "Taxes": [
       "taxes", "tax cut", "tax increase", "IRS", "income tax",
       "property tax", "sales tax", "tax break", "deduction", "tax reform",
       "taxpayer", "corporate tax", "tax credit", "refund", "tax code",
       "audit", "loophole", "estate tax", "capital gains", "tax evasion",
       "flat tax", "progressive tax", "tariff", "tax shelter", "write-off",
       "fiscal policy", "tax burden", "payroll tax", "wealth tax", "tax filing",
       "overtaxed", "misappropriation", "exemptions"
   ],
   "Crime": [
       "crime", "safety", "police", "law enforcement", "incarceration",
       "prison", "arrest", "violence", "theft", "homicide",
       "drug crime", "recidivism", "criminal justice", "sentencing", "bail",
       "neighborhood safety", "patrol", "victim", "gang", "drug trafficking",
       "white collar crime", "fraud", "juvenile crime", "parole", "probation",
       "police reform", "defund police", "overcrowding", "reentry", "rehabilitation",
       "justice system",
       "lawlessness", "impunity"
   ],
   "Foreign Affairs": [
       "foreign policy", "diplomacy", "ally", "NATO", "United Nations",
       "sanctions", "war", "conflict", "Ukraine", "China",
       "Israel", "trade agreement", "ambassador", "international relations", "aid",
       "treaty", "geopolitics", "Russia", "Middle East", "Taiwan",
       "human rights", "foreign aid", "peacekeeping", "embargo", "coup",
       "regime", "nuclear deal", "Gaza", "refugee crisis", "military alliance",
       "why are we sending money abroad", "focus on home first", "foreign wars",
       "isolationism", "entanglement", "overreach", "weakness", "instability"
   ],
   "Energy Policy": [
       "energy", "solar", "wind power", "fossil fuels", "oil",
       "natural gas", "renewable", "power grid", "electricity costs", "pipeline",
       "carbon", "clean energy", "coal", "nuclear", "energy independence",
       "utility bills", "sustainability", "composting", "fracking", "offshore drilling",
       "energy storage", "battery", "EV", "electric vehicle", "carbon emissions",
       "energy subsidy", "OPEC", "refinery", "greenhouse gas", "energy transition",
       "utility bills too high", "gas prices are crushing us",
       "blackouts", "dependency", "pollution"
   ],
   "Race Relations": [
       "race", "racism", "equity", "diversity", "civil rights",
       "discrimination", "systemic racism", "police brutality", "DEI", "minority",
       "reparations", "racial justice", "hate crime", "inclusion", "implicit bias",
       "equality", "affirmative action", "racial profiling", "segregation", "redlining",
       "white supremacy", "NAACP", "Black Lives Matter", "racial wealth gap", "voting rights",
       "intersectionality", "anti-racism", "racial disparity",
       "hate is on the rise", "oppression", "tokenism", "erasure", "backlash"
   ],
   "LGBTQ+ Rights": [
       "LGBTQ", "gay rights", "transgender", "same-sex marriage", "gender identity",
       "pronouns", "pride", "discrimination", "conversion therapy", "queer",
       "non-binary", "bathroom bill", "adoption rights", "equality act", "sexual orientation",
       "inclusion", "drag", "gender affirming care", "trans youth", "hate crime",
       "coming out", "lesbian", "bisexual", "gay marriage", "domestic partnership",
       "LGBTQ employment", "military ban", "rainbow", "safe space", "gender expression"
   ],
   "Housing": [
       "housing", "rent", "landlord", "mortgage", "affordable housing",
       "eviction", "homelessness", "property", "rent hike", "condo",
       "homeowner", "zoning", "housing costs", "shelter", "tenant rights",
       "down payment", "gentrification", "housing market", "housing shortage", "section 8",
       "public housing", "rent control", "housing voucher", "foreclosure", "property taxes",
       "accessory dwelling", "mixed income housing", "first time homebuyer"
   ],
}

# ── Load model ────────────────────────────────────────────────────────────────

# nlp already preloaded by server

# ── Layer 1: PhraseMatcher ────────────────────────────────────────────────────

phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for issue, terms in KEYWORDS.items():
   patterns = [nlp.make_doc(term) for term in terms]
   phrase_matcher.add(issue, patterns)

# ── Layer 2: Token Matcher (LEMMA, POS+LEMMA, LOWER) ─────────────────────────

token_matcher = Matcher(nlp.vocab)

for issue, terms in KEYWORDS.items():
    single_word_terms = [t for t in terms if len(t.split()) == 1]
    if not single_word_terms:
        continue

    lemma_patterns = []
    pos_patterns   = []
    lower_patterns = []

    for term in single_word_terms:
        term_doc = nlp(term)
        if not term_doc:
            continue
        token     = term_doc[0]
        lemma     = token.lemma_
        pos       = token.pos_
        lower_str = term.lower()

        # LEMMA: catches inflected forms, e.g. "teachers" → "teacher"
        lemma_patterns.append([{"LEMMA": lemma}])

        # POS + LEMMA: same base form AND part of speech, reduces false positives
        if pos in ("NOUN", "VERB", "PROPN", "ADJ"):
            pos_patterns.append([{"LEMMA": lemma, "POS": pos}])

        # LOWER: catches caps variants, e.g. "INSULIN", "Medicaid"
        lower_patterns.append([{"LOWER": lower_str}])

    if lemma_patterns:
        token_matcher.add(f"LEMMA_{issue}", lemma_patterns)
    if pos_patterns:
        token_matcher.add(f"POS_{issue}", pos_patterns)
    if lower_patterns:
        token_matcher.add(f"LOWER_{issue}", lower_patterns)

# ── Layer 3: Issue vectors ────────────────────────────────────────────────────

issue_vectors: dict[str, np.ndarray] = {}
for issue, terms in KEYWORDS.items():
   vecs = [nlp(term).vector for term in terms if nlp(term).has_vector]
   if vecs:
       issue_vectors[issue] = np.mean(vecs, axis=0)

def _cosine_similarity(doc, issue_vec: np.ndarray) -> float:
   if not doc.has_vector:
       return 0.0
   denom = np.linalg.norm(doc.vector) * np.linalg.norm(issue_vec)
   return float(np.dot(doc.vector, issue_vec) / denom) if denom else 0.0

# ── Classifier ────────────────────────────────────────────────────────────────

def classify_note(note_text: str) -> list[str]:
    doc = nlp(note_text)
    hit_counts: dict[str, int] = {}

    # Layer 1: exact phrase hits
    for match_id, _s, _e in phrase_matcher(doc):
        label = nlp.vocab.strings[match_id]
        hit_counts[label] = hit_counts.get(label, 0) + 1

    # Layer 2: token attribute hits
    for match_id, _s, _e in token_matcher(doc):
        raw_key = nlp.vocab.strings[match_id]
        for prefix in ("LEMMA_", "POS_", "LOWER_"):
            if raw_key.startswith(prefix):
                label = raw_key[len(prefix):]
                hit_counts[label] = hit_counts.get(label, 0) + 1
                break

    # Layer 3: vector fallback (only when nothing matched above)
    if not hit_counts:
        scores = {
            issue: _cosine_similarity(doc, vec)
            for issue, vec in issue_vectors.items()
        }
        for issue, score in scores.items():
            if score >= VECTOR_THRESHOLD:
                hit_counts[issue] = round(score * 2)

    if not hit_counts:
        return ["Unclassified"]

    # Rank by hit count; add a second issue only if it is close enough to the top
    ranked = sorted(hit_counts.items(), key=lambda x: x[1], reverse=True)
    top_issue, top_count = ranked[0]
    result = [top_issue]
    if len(ranked) > 1:
        second_issue, second_count = ranked[1]
        if second_count >= top_count * SECOND_ISSUE_MIN_RATIO:
            result.append(second_issue)
    return result

sp_result = classify_note(doc)
print(f"{'SpaCy:':<22}{', '.join(sp_result)}")

# ── Zero-Shot Classification ──────────────────────────────────────────────────
# from transformers import pipeline
# classifier = pipeline('zero-shot-classification', model='cross-encoder/nli-MiniLM2-L6-H768')
classifier = zero_shot

pred_cat = classifier(doc, candidates)['labels'][0]
print(f"{'Zero-Shot:':<22}{pred_cat}")

# ── Sentence Transformers ─────────────────────────────────────────────────────
# from sentence_transformers import SentenceTransformer, util
# model = SentenceTransformer("all-MiniLM-L6-v2")
model = sentence_model

keyword_embeddings = model.encode(candidates, convert_to_tensor=True)

SECOND_KEYWORD_THRESHOLD = 0.40

def classify_st(note, threshold=SECOND_KEYWORD_THRESHOLD):
    note_embedding = model.encode(note, convert_to_tensor=True)
    scores = util.cos_sim(note_embedding, keyword_embeddings)[0]

    top_indices = scores.topk(2).indices.tolist()
    top_scores = scores.topk(2).values.tolist()

    primary = candidates[top_indices[0]]
    secondary = candidates[top_indices[1]] if top_scores[1] >= threshold else None

    return primary, secondary

primary, secondary = classify_st(doc)
keywords = [primary] + ([secondary] if secondary else [])
print(f"{'Sentence Transformer:':<22}{', '.join(keywords)}")


Sample Notes

Talked to Lou at the door. He supports our candidate for state house but it's clear his top issue is affordability. Talked about grocery and gas prices rising plus childcare costs with a baby on the way. Gave some resources on local childcare and charities with lower-cost supplies.
Diabetic, insulin costs rising is taking a toll on her and her family. Interested in our international cuisines event so left some information with her and will follow-up.
Talked to him on Monday 3/16 morning, isn't affected by the recent rent hikes in the area since he owns his condo. He says his main issue is the 2nd amendment and the worry that progressive leadership might take away his guns. Feels strongly about gun ownership, big NRA member
Has a lot of friends involved in Planned Parenthood work. Interested in getting involved in ballot initiative work for this year. Maybe can connect us with her network.
Has a 7 and 9 year old and thinking about moving schools because they're worried about the quality of education at the public school. Lots of religion being pushed that makes them uncomfortable. Was asking about our candidate's stance on school choice and resources about vouchers.
Talked to Christine on Saturday afternoon. Interested in volunteering with us. Cares a lot about sustainability and wants to learn about composting. Needs more voter education on voting by mail.

Project Details

Rather than developing an entirely new tool, this project compiles research so that existing field tools can build a similar model into their systems. Each approach explored has its own demo for trying out the model's performance, followed by some comparative analysis.

Logistics

Implementing topic modeling with Natural Language Processing will ideally look something like this:

Start with a list of pre-determined candidates for tags, like "Economy", "Healthcare", or "Immigration".
The model runs in the background and assigns issue tags to each note based on its content.
One tag, or two if both are relevant, is assigned to summarize the constituent's current concerns.

Issue Tags: Candidates

These are the tags chosen for this project, using this Gallup poll's list of the most important issues influencing the 2024 election as a starting point. A predetermined list of keyword candidates was chosen, instead of letting the model come up with topics on its own, to ensure data cleanliness.

Economy
Democracy
Terrorism
Immigration
Education
Healthcare
Gun Policy
Abortion
Taxes
Crime
Foreign Affairs
Energy Policy
Race Relations
LGBTQ+ Rights
Housing

Approaches + Models

Keyword Extraction (KeyBERT)

The first and simplest approach to this problem is keyword extraction. It requires the least processing power and coding effort, but it isn't very intelligent: the exact keyword must be present in the note text to match the list of keywords. Here are two short and relatively easy examples to demonstrate the model's pros and cons.

KeyBERT Demo Keyword Extraction


Examples

"Talked to Lou at the door. He supports our candidate for state house but it's clear his top issue is affordability. Talked about grocery and gas prices rising plus childcare costs with a baby on the way. Gave some resources on local childcare and charities with lower-cost supplies."
It's clear the main tag here should be "Economy", but the model turns up nothing. Now try changing the word "affordability" to "the economy", and it quickly returns the correct tag.
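The behavior described above can be mimicked in a few lines of plain Python. This is a simplified stand-in for literal keyword matching, not KeyBERT's actual implementation; the helper name `literal_matches` and the shortened candidate list are assumptions for illustration.

```python
import re

# Shortened candidate list, for illustration only
CANDIDATES = ["Economy", "Healthcare", "Housing"]

def literal_matches(note: str, candidates: list[str]) -> list[str]:
    """Return candidates that appear verbatim (case-insensitive, whole word) in the note."""
    found = []
    for tag in candidates:
        pattern = r"\b" + re.escape(tag.lower()) + r"\b"
        if re.search(pattern, note.lower()):
            found.append(tag)
    return found

original = "It's clear his top issue is affordability."
edited = "It's clear his top issue is the economy."

print(literal_matches(original, CANDIDATES) or "No match")  # nothing matches literally
print(literal_matches(edited, CANDIDATES))                  # the literal word is now present
```

The note about affordability is plainly an economic concern, but a purely literal search has nothing to latch onto until the tag word itself appears.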

Pros & Cons

Pros

Simple Implementation
Free

Cons

Can't interpret note as a whole thought, only searches for a word
Not a real option for nuanced organizer conversations

Out-of-the-Box LLM (Claude)

At the opposite end of the model spectrum is a fully developed, pre-packaged LLM. This is a far more intelligent approach, but it demands much more computing power, money, and energy. It's also likely to face a steeper buy-in process with organizers.
My colleague Aaliyah Wood conducted a survey to gather some anecdotal feedback from organizers on using AI for field work. Here are some of their thoughts:

"It can definitely be useful and I do personally use it sometimes, but the litany of moral issues around it (impact on the workforce, slop, intellectual property issues, energy and environmental impacts, data privacy issues, etc.) make me feel generally pretty negative about it."
"From what I know about data centers and how harmful they are to the surrounding community, how much water they require, the data privacy issues around it etc, I don't have the best opinion about it."
"Bad for the environment, can stunt human learning, needs extensive human oversight. Unsure of its net positives on the world."

Aside from organizer buy-in, AI companies generally have different, more corporate priorities, which often lead to ethical positions that differ from those of the progressive movement. For example, OpenAI, the company behind ChatGPT (another leading LLM), recently struck a deal with the Trump administration's Pentagon to provide it with the company's data and tools. Claude was chosen in this case because of its established use cases in the progressive community (for example here), its more neutral brand perception, and the refusal of its developer, Anthropic, to agree to the same Trump administration requests that OpenAI accepted.
Ultimately, though, corporate LLMs remain ethically ambiguous and ever-changing - privacy, governance, and corporate social responsibility must be considered.

Unsurprisingly, an LLM like Claude can produce correct output for simple organizer notes. Try this example to test it: "Has a 7 and 9 year old and thinking about moving schools because they're worried about the quality of education at the public school. Lots of religion being pushed that makes them uncomfortable. Was asking about our candidate's stance on school choice and resources about vouchers."

Claude Demo LLM


Pros & Cons

Pros

Simple to implement
Stronger understanding of nuance and overall themes
Can attempt to handle Spanglish or English misspellings

Cons

Costly, especially to ensure data privacy
Unknown Environmental Impact
Steeper organizer buy-in

Tingum's Tagging

Logistics

It's clear, then, that a happy medium is needed that incorporates the simplicity of keyword extraction and the intelligence of an LLM. We'll explore three options below - SpaCy, Zero-Shot Classification, and Sentence Transformers.

SpaCy

SpaCy is an open-source NLP library, developed by Explosion and released under the MIT license, that behaves most like keyword extraction of our three options. It uses three layers to understand the input and produce one or two keyword tags. Each layer tries to match the note against the list of trigger terms associated with each tag:

Phrase Matcher - looks for exact matches of the keywords or phrases in the note
Token Matcher - looks for the single-word trigger terms, allowing for some flexibility with word forms and capitalization (e.g., "teachers" vs. "teacher")
Vector Similarity - used only when the first two layers find nothing; compares the overall meaning of the note to the meaning of each tag's trigger terms using word vectors, allowing for more nuanced matches
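To make the one-or-two tag output concrete, here is a small worked sketch of how hits from the layers could be tallied and ranked. It assumes the same ratio rule as the demo code on this page (a second tag needs at least half as many hits as the top tag). The hit list is made up for illustration, and the helper name `tags_from_hits` is hypothetical.

```python
from collections import Counter

SECOND_ISSUE_MIN_RATIO = 0.5  # same value as the demo code on this page

def tags_from_hits(hits: list[tuple[str, str]]) -> list[str]:
    """Tally (layer, issue) hits and return one tag, or two if the runner-up is close."""
    hit_counts = Counter(issue for _layer, issue in hits)
    if not hit_counts:
        return ["Unclassified"]
    ranked = hit_counts.most_common()
    top_issue, top_count = ranked[0]
    tags = [top_issue]
    if len(ranked) > 1:
        second_issue, second_count = ranked[1]
        if second_count >= top_count * SECOND_ISSUE_MIN_RATIO:
            tags.append(second_issue)
    return tags

# Made-up hits for one note. In the demo code a single word can register in
# several layers, which is why "Healthcare" appears four times here.
hits = [
    ("phrase", "Healthcare"), ("lemma", "Healthcare"),
    ("pos", "Healthcare"), ("lower", "Healthcare"),
    ("phrase", "Housing"), ("lemma", "Housing"),
]

print(tags_from_hits(hits))      # Housing has 2 hits, which meets 4 * 0.5
print(tags_from_hits(hits[:5]))  # Housing has 1 hit, which falls short
```

Because every hit is traceable to a specific trigger term and layer, a result like this is easy to explain to an organizer who asks why a note received a given tag.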


# import numpy as np
# import spacy
# from spacy.matcher import Matcher, PhraseMatcher

# Minimum vector cosine similarity to count as a match (Layer 3 only)
VECTOR_THRESHOLD = 0.55

# Second issue is only returned if its hit count is at least this fraction
# of the top issue's hit count. 0.5 = needs half as many hits as the top.
SECOND_ISSUE_MIN_RATIO = 0.5

# nlp already preloaded by server
print("Using preloaded spaCy model...")

# ── Layer 1: PhraseMatcher ────────────────────────────────────────────────────

phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for issue, terms in KEYWORDS.items():
   patterns = [nlp.make_doc(term) for term in terms]
   phrase_matcher.add(issue, patterns)

# ── Layer 2: Token Matcher (LEMMA, POS+LEMMA, LOWER) ─────────────────────────

token_matcher = Matcher(nlp.vocab)

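for issue, terms in KEYWORDS.items():
    single_word_terms = [t for t in terms if len(t.split()) == 1]
    if not single_word_terms:
        continue

    lemma_patterns = []
    pos_patterns   = []
    lower_patterns = []

    for term in single_word_terms:
        term_doc = nlp(term)
        if not term_doc:
            continue
        token     = term_doc[0]
        lemma     = token.lemma_
        pos       = token.pos_
        lower_str = term.lower()

        # LEMMA: catches inflected forms, e.g. "teachers" → "teacher"
        lemma_patterns.append([{"LEMMA": lemma}])

        # POS + LEMMA: same base form AND part of speech, reduces false positives
        if pos in ("NOUN", "VERB", "PROPN", "ADJ"):
            pos_patterns.append([{"LEMMA": lemma, "POS": pos}])

        # LOWER: catches caps variants, e.g. "INSULIN", "Medicaid"
        lower_patterns.append([{"LOWER": lower_str}])

    # Register each pattern family under a prefixed key so hits can be traced to an issue
    if lemma_patterns:
        token_matcher.add(f"LEMMA_{issue}", lemma_patterns)
    if pos_patterns:
        token_matcher.add(f"POS_{issue}", pos_patterns)
    if lower_patterns:
        token_matcher.add(f"LOWER_{issue}", lower_patterns)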

# ── Layer 3: Issue vectors ────────────────────────────────────────────────────

print("Building issue vectors...")
issue_vectors: dict[str, np.ndarray] = {}
for issue, terms in KEYWORDS.items():
   vecs = [nlp(term).vector for term in terms if nlp(term).has_vector]
   if vecs:
       issue_vectors[issue] = np.mean(vecs, axis=0)

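def _cosine_similarity(doc, issue_vec: np.ndarray) -> float:
    if not doc.has_vector:
        return 0.0
    denom = np.linalg.norm(doc.vector) * np.linalg.norm(issue_vec)
    return float(np.dot(doc.vector, issue_vec) / denom) if denom else 0.0

def classify_note(note_text: str) -> list[str]:
    """
    Takes a paragraph of canvassing note text.
    Returns a list with 1 issue label, or 2 if a strong second issue is present.
    Returns ["Unclassified"] if nothing matches.
    """
    doc = nlp(note_text)
    hit_counts: dict[str, int] = {}

    # Layer 1: exact phrase hits
    for match_id, _s, _e in phrase_matcher(doc):
        label = nlp.vocab.strings[match_id]
        hit_counts[label] = hit_counts.get(label, 0) + 1

    # Layer 2: token attribute hits
    for match_id, _s, _e in token_matcher(doc):
        raw_key = nlp.vocab.strings[match_id]
        for prefix in ("LEMMA_", "POS_", "LOWER_"):
            if raw_key.startswith(prefix):
                label = raw_key[len(prefix):]
                hit_counts[label] = hit_counts.get(label, 0) + 1
                break

    # Layer 3: vector fallback (only when nothing matched above)
    if not hit_counts:
        for issue, vec in issue_vectors.items():
            score = _cosine_similarity(doc, vec)
            if score >= VECTOR_THRESHOLD:
                hit_counts[issue] = round(score * 2)

    if not hit_counts:
        return ["Unclassified"]

    # Rank by hit count
    ranked = sorted(hit_counts.items(), key=lambda x: x[1], reverse=True)
    top_issue, top_count = ranked[0]

    result = [top_issue]

    # Only add a second issue if it has enough hits relative to the top
    if len(ranked) > 1:
        second_issue, second_count = ranked[1]
        if second_count >= top_count * SECOND_ISSUE_MIN_RATIO:
            result.append(second_issue)

    return result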

# ── Run ───────────────────────────────────────────────────────────────────────

note = input("Note: ").strip()
if note:
    result = classify_note(note)
    print(", ".join(result))


Pros & Cons

Pros

Speedy
Most transparent/explainable
Handles typos well

Cons

These "trigger" lists need to be maintained
Doesn't handle implied topics or context-dependent topics as well

Zero Shot Classification

Zero-Shot Classification is a type of NLP model that can classify text into categories it hasn't seen before. It uses Natural Language Inference (NLI) to judge whether a note logically implies each tag in the keyword candidate list. It asks the question:
"Does this note have to do with [keyword tag]?"
Then it scores each result and outputs the highest score(s).
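The question-per-tag framing can be sketched as follows. Each candidate tag is turned into a hypothesis sentence, a scorer rates how strongly the note supports that hypothesis, and the tags are ranked by score. The template wording, the `rank_labels` helper, and the `toy_entailment` scorer with its made-up scores are all assumptions for illustration; a real setup would call an NLI model where the toy scorer is.

```python
from typing import Callable

HYPOTHESIS_TEMPLATE = "This note is about {}."  # assumed wording, for illustration

def rank_labels(note: str,
                labels: list[str],
                entailment_score: Callable[[str, str], float]) -> list[tuple[str, float]]:
    """Build one hypothesis per label, score it against the note, and sort best-first."""
    scored = [
        (label, entailment_score(note, HYPOTHESIS_TEMPLATE.format(label)))
        for label in labels
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-in for an NLI model: looks up made-up entailment scores.
MADE_UP_SCORES = {
    "This note is about Education.": 0.91,
    "This note is about Taxes.": 0.12,
    "This note is about Housing.": 0.05,
}

def toy_entailment(premise: str, hypothesis: str) -> float:
    return MADE_UP_SCORES.get(hypothesis, 0.0)

note = "Thinking about moving schools, asked about vouchers."
ranking = rank_labels(note, ["Taxes", "Education", "Housing"], toy_entailment)
print(ranking[0][0])  # the highest-scoring tag
```

Because the model reasons about whether the note implies the tag, rather than looking for the tag word, this framing can pick up topics the note never names directly.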


Pros & Cons

Pros

Understands implied topics
Simple set-up

Cons

Bulky model options
Slower run time
Doesn't handle typos well

Sentence Transformers

Sentence Transformers is a type of NLP model that tests the similarity between the field note and each of the keyword candidates. It turns the note and each candidate into vector embeddings, measures the cosine similarity between the note vector and each keyword vector, and outputs the keyword(s) with the highest similarity score(s).
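As a rough illustration of the similarity step, the sketch below computes cosine similarity by hand on made-up three-dimensional vectors. Real sentence embeddings have hundreds of dimensions and come from the model; the vectors, the `top_keywords` helper, and the shortened keyword list here are assumptions for illustration. The 0.40 threshold for a second keyword matches the demo code on this page.

```python
import math

SECOND_KEYWORD_THRESHOLD = 0.40  # same value as the demo code on this page

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up "embeddings" for three candidate keywords
keyword_vectors = {
    "Healthcare": [0.9, 0.1, 0.0],
    "Economy":    [0.2, 0.9, 0.1],
    "Housing":    [0.0, 0.2, 0.9],
}

def top_keywords(note_vector: list[float],
                 threshold: float = SECOND_KEYWORD_THRESHOLD) -> list[str]:
    """Return the most similar keyword, plus the runner-up if it clears the threshold."""
    scored = sorted(
        ((cosine_similarity(note_vector, vec), kw) for kw, vec in keyword_vectors.items()),
        reverse=True,
    )
    result = [scored[0][1]]
    if len(scored) > 1 and scored[1][0] >= threshold:
        result.append(scored[1][1])
    return result

# A note that leans toward healthcare but also touches on the economy
print(top_keywords([0.8, 0.5, 0.1]))
# A note that points almost entirely toward healthcare
print(top_keywords([1.0, 0.0, 0.0]))
```

The first note vector sits between the Healthcare and Economy vectors, so both clear the threshold; the second is close to Healthcare only, so a single keyword is returned.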


# from sentence_transformers import SentenceTransformer, util
# model = SentenceTransformer("all-MiniLM-L6-v2")
model = sentence_model

KEYWORD_BANK = ["Economy",
             "Democracy in the US",
             "Terrorism",
             "Immigration",
             "Education",
             "Healthcare",
             "Gun Policy",
             "Abortion",
             "Taxes",
             "Crime",
             "Foreign Affairs",
             "Energy Policy",
             "Race Relations",
             "LGBTQ+ rights",
             "Housing"]

keyword_embeddings = model.encode(KEYWORD_BANK, convert_to_tensor=True)

SECOND_KEYWORD_THRESHOLD = 0.40

def classify_note(note, threshold=SECOND_KEYWORD_THRESHOLD):
    note_embedding = model.encode(note, convert_to_tensor=True)
    scores = util.cos_sim(note_embedding, keyword_embeddings)[0]

    top_indices = scores.topk(2).indices.tolist()
    top_scores = scores.topk(2).values.tolist()

    primary = KEYWORD_BANK[top_indices[0]]
    secondary = KEYWORD_BANK[top_indices[1]] if top_scores[1] >= threshold else None

    return primary, secondary

note = input("Enter field note: ")
primary, secondary = classify_note(note)
keywords = [primary] + ([secondary] if secondary else [])
print(f"Keywords: {keywords}")


Pros & Cons

Pros

Understands implied topics
Small model size
Can handle multilingual notes if a different, multilingual model is swapped in

Cons

Doesn't handle typos well

Recommendations

Developers looking to implement a model like this should consider layering these approaches to find a happy medium of accuracy, cost, and simplicity. For example, using SpaCy as a first layer to catch the low-hanging fruit and then passing unclassified notes to the Sentence Transformer model for further analysis could be a good way to maximize accuracy while minimizing cost and complexity.
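A minimal sketch of that layering, assuming each model is wrapped as a function that takes a note and returns a list of tags (or `["Unclassified"]`). The `layered_classify` helper and the two toy layers are hypothetical stand-ins; in practice the first would wrap the SpaCy classifier and the second the Sentence Transformer classifier.

```python
from typing import Callable

Classifier = Callable[[str], list[str]]

def layered_classify(note: str, layers: list[Classifier]) -> list[str]:
    """Try each classifier in order and return the first real result."""
    for classify in layers:
        tags = classify(note)
        if tags and tags != ["Unclassified"]:
            return tags
    return ["Unclassified"]

# Hypothetical stand-in for the fast, keyword-based first layer
def keyword_layer(note: str) -> list[str]:
    return ["Housing"] if "rent" in note.lower().split() else ["Unclassified"]

# Hypothetical stand-in for the slower, embedding-based second layer
def embedding_layer(note: str) -> list[str]:
    return ["Economy"] if "afford" in note.lower() else ["Unclassified"]

layers = [keyword_layer, embedding_layer]
print(layered_classify("Rent went up again this year", layers))   # caught by the first layer
print(layered_classify("Can't afford groceries lately", layers))  # falls through to the second
print(layered_classify("Nice chat about the weather", layers))    # nothing matches
```

Ordering the layers from cheapest to most expensive means the heavier model only runs on the notes the simpler one could not classify.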
Other important takeaways from this project development:

NLP Models - for task simplification + organizing ease
AI Skepticism - among organizers, a long-term hurdle
Organizer Involvement in Dev - buy-in and quick use-case testing
