
Corporate Data Breaches — Cheat Sheet

645 U.S. corporate data breaches, 29,000+ 10-K filings, and what changes in the language of an annual report once a company has been hacked. Bachelor's thesis at Universidad Carlos III (9.75/10, INNCYBER award).

Updated June 2026
1

The question

After a public company is hacked, how does the language of its annual report change — and does that change carry information about future performance?

  • Companies signal calm to markets.
  • Disclosure rules force them to mention the breach.
  • They have an incentive to mention it in the softest possible language.

The hypothesis: language reveals what the numbers hide.

2

The data

  • 645 confirmed U.S. corporate data breaches between 2005 and 2019.
  • 29,000+ 10-K annual reports filed with the SEC over the same window, linked to the breach records by CIK (the SEC's Central Index Key).
  • Breach data: Privacy Rights Clearinghouse + manual cross-references.
  • 10-K texts: SEC EDGAR full-text downloads.

Sampling design: each breached firm gets a matched control firm — same sector, similar size, no recorded breach in the window.
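
The matching step can be sketched as a nearest-neighbour search within each sector. This is only an illustration, not the thesis's actual procedure: the `Firm` record, total assets as the size proxy, the log-distance metric, and greedy matching without replacement are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Firm:
    cik: str        # SEC Central Index Key
    sector: str     # hypothetical sector label
    assets: float   # hypothetical size proxy: total assets
    breached: bool  # True if a breach is recorded in the window


def match_controls(firms):
    """Pair each breached firm with the closest-sized unbreached firm in its sector.

    Greedy, without replacement (an assumption of this sketch).
    Returns {treated_cik: control_cik}.
    """
    treated = [f for f in firms if f.breached]
    pool = [f for f in firms if not f.breached]
    pairs = {}
    for t in treated:
        candidates = [c for c in pool if c.sector == t.sector]
        if not candidates:
            continue  # no eligible control: the firm stays unmatched
        best = min(
            candidates,
            key=lambda c: abs(math.log(c.assets) - math.log(t.assets)),
        )
        pairs[t.cik] = best.cik
        pool.remove(best)  # each control is used at most once
    return pairs


# Made-up firms, for illustration only.
firms = [
    Firm("0001", "retail", 5_000.0, True),
    Firm("0002", "retail", 4_500.0, False),
    Firm("0003", "retail", 90_000.0, False),
    Firm("0004", "banking", 5_000.0, False),
]
pairs = match_controls(firms)
print(pairs)
```

Here the breached retailer is paired with the similarly sized retailer rather than the much larger one or the bank.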

3

Linguistic features

Standard finance-NLP measures from the Loughran–McDonald financial-tone dictionaries:

Measure                    Captures
Positive / negative tone   Optimism vs pessimism vocabulary.
Uncertainty                "Maybe", "approximately", "uncertain"...
Litigious                  Legal-defensive vocabulary.
Modal-weak                 "Could", "might", "perhaps".
Modal-strong               "Must", "will", "definitely".
Document length            Total word count. Some of the literature treats longer filings as a sign of obfuscation.

Each firm-year filing is reduced to a vector of these scores.
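
A minimal sketch of turning one filing into such a vector, assuming each score is the share of tokens that fall in a category. The word sets below are tiny stand-ins invented for the example, not the real Loughran–McDonald lists, and the tokenizer is deliberately crude.

```python
import re

# Tiny illustrative stand-ins, NOT the real Loughran-McDonald word lists.
CATEGORIES = {
    "positive": {"strong", "improved", "achieved"},
    "negative": {"loss", "breach", "adverse"},
    "uncertainty": {"maybe", "approximately", "uncertain"},
    "litigious": {"litigation", "plaintiff", "lawsuit"},
    "modal_weak": {"could", "might", "perhaps"},
    "modal_strong": {"must", "will", "definitely"},
}


def score_filing(text):
    """Return the token count plus, per category, the share of tokens in it."""
    tokens = re.findall(r"[a-z]+", text.lower())
    n = len(tokens)
    scores = {"length": n}
    for name, words in CATEGORIES.items():
        hits = sum(1 for tok in tokens if tok in words)
        scores[name] = hits / n if n else 0.0
    return scores


sample = "We achieved strong results. A breach could result in a loss."
scores = score_filing(sample)
print(scores)
```

Normalizing by length keeps the tone scores comparable across filings of very different sizes, while the raw `length` stays available as its own feature.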

4

The methodology

A difference-in-differences design:

  • Pre-period: filings before the breach.
  • Post-period: filings after the breach.
  • Treated group: breached firms.
  • Control group: matched non-breached firms.

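The difference in language change (post − pre) between the treated and control groups estimates the causal effect of the breach on tone, net of time trends and industry shifts common to both groups. The estimate is only as good as the parallel-trends assumption: absent the breach, both groups' tone would have moved the same way.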
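
The simplest version of this estimator is the 2×2 comparison of group means, sketched below. The row layout (`treated`, `post`, and an outcome key) and all the numbers are made up for illustration; a full analysis would more likely run a regression with fixed effects and standard errors, which this sketch does not attempt.

```python
from statistics import mean


def did_estimate(rows, outcome="positive"):
    """2x2 difference-in-differences on group means.

    rows: list of dicts with boolean keys 'treated' and 'post',
    plus a numeric outcome such as a tone score.
    """
    def cell_mean(treated, post):
        return mean(
            r[outcome]
            for r in rows
            if r["treated"] == treated and r["post"] == post
        )

    treated_change = cell_mean(True, True) - cell_mean(True, False)
    control_change = cell_mean(False, True) - cell_mean(False, False)
    return treated_change - control_change


# Made-up positive-tone scores, for illustration only.
rows = [
    {"treated": True, "post": False, "positive": 0.010},
    {"treated": True, "post": False, "positive": 0.012},
    {"treated": True, "post": True, "positive": 0.016},
    {"treated": True, "post": True, "positive": 0.018},
    {"treated": False, "post": False, "positive": 0.011},
    {"treated": False, "post": False, "positive": 0.013},
    {"treated": False, "post": True, "positive": 0.012},
    {"treated": False, "post": True, "positive": 0.014},
]
effect = did_estimate(rows)
print(round(effect, 4))
```

In this toy data the treated group's tone rises by 0.006 and the control group's by 0.001, so the estimate is roughly 0.005: the part of the change not shared with the controls.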

Then a second regression: does abnormal positive tone predict future ROA / stock returns?
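
A one-variable sketch of that second stage: ordinary least squares of future ROA on abnormal tone. How "abnormal" tone is defined is not spelled out here, so the example simply takes it as a given input. The data are invented and do not reproduce any result from the thesis, and a real specification would add controls.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Closed-form simple OLS: y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up data: abnormal positive tone this year, ROA the following year.
abnormal_tone = [-0.002, -0.001, 0.000, 0.001, 0.002]
future_roa = [0.06, 0.05, 0.05, 0.04, 0.03]

slope, intercept = ols_slope_intercept(abnormal_tone, future_roa)
print(f"slope={slope:.2f}, intercept={intercept:.3f}")
```

A negative slope is the pattern the question is probing: more upbeat language than expected, followed by weaker performance.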

5

The headline findings

After a breach, treated firms' 10-Ks show:

  • More positive tone (a statistically significant increase).
  • Less uncertainty language.
  • Longer documents overall.
  • More litigious language (this is forced: firms have to disclose lawsuits).

This pattern is consistent with strategic obfuscation: drowning the bad news in softer, longer prose.

The kicker: abnormally positive tone predicts worse future earnings. The language contains information the numbers don't yet show.

6

What I learned

  • Match-and-difference designs are how you get causal claims from observational financial data.
  • Loughran–McDonald is the right dictionary for finance — general-purpose sentiment (VADER, etc.) underperforms.
  • Document length is a feature, not a control. Long 10-Ks correlate with bad outcomes.
  • Awards opened doors — INNCYBER Innovation Award 2019 + 2020. The methodology mattered more than the headline.