Information Content of SEC filings: text compression for annual reports

One of the functions of the SEC is to make sure people have consistent "information" about public companies.

Of course, "information" means something else to computer scientists and people who study text compression. The more a piece of text repeats itself, the less information it carries in that sense, and the smaller its compressed file will be. So naturally I wanted to see how well SEC filings would compress.
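
A quick sketch of that idea, assuming Python's standard lzma module as a stand-in for a real archiver. The sample inputs and the function name compressed_fraction are made up for illustration. Repetitive text should shrink to almost nothing, while random bytes should barely shrink at all.

```python
import lzma
import os


def compressed_fraction(data: bytes) -> float:
    """Return compressed size divided by original size."""
    return len(lzma.compress(data)) / len(data)


# Hypothetical inputs: one sentence repeated many times, and an equally
# long run of random bytes that gives the compressor nothing to exploit.
repetitive = b"The Company may be adversely affected by risk factors. " * 2000
random_bytes = os.urandom(len(repetitive))

print(f"repetitive: {compressed_fraction(repetitive):.2%}")
print(f"random:     {compressed_fraction(random_bytes):.2%}")
```

Real filings land somewhere between these two extremes.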

Here's what I did:
  1. I went to the SEC's company search page and found the filing pages for some big tech companies.
  2. I used Firefox's "save as text" to convert the huge HTML pages to plaintext.
  3. Then I ran 7-Zip to compress each text file.
Of the four companies I tried, Google's 10-K reports are the least redundant, and Yahoo's the most.

Here are the specific ratios I calculated: compressed size as a percentage of the plaintext size, so a higher number means less redundancy (the "terseness" factor?).
  • Google: 22.34%
  • Microsoft: 21.26%
  • Apple: 20.92%
  • Yahoo: 20.37%
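
For anyone who wants to try this at home, here is a minimal sketch of the calculation. It assumes the ratio is compressed size divided by original size, which is what the ranking above implies. It also uses Python's lzma module (the xz container) rather than 7-Zip itself, so the numbers will not match the ones above exactly. The function name terseness and any file names you pass in are hypothetical.

```python
import lzma
import sys
from pathlib import Path


def terseness(path: str) -> float:
    """Compressed size as a percentage of the original file size."""
    raw = Path(path).read_bytes()
    if not raw:
        raise ValueError(f"{path} is empty")
    # preset=9 asks for the strongest standard LZMA compression level.
    packed = lzma.compress(raw, preset=9)
    return 100.0 * len(packed) / len(raw)


if __name__ == "__main__":
    # Usage (file names are placeholders): python terseness.py a.txt b.txt
    for name in sys.argv[1:]:
        print(f"{name}: {terseness(name):.2f}%")
```

Different compressors and settings give different percentages, so only compare numbers produced the same way.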
Probably someone is strange enough to be tracking metrics like this, but it's kind of fun to look at the 10,000-foot view. Maybe tracking compression over time would give useful insights, like who's covering up bad results with flowery language.

Of course, this "terseness" number is only make-believe information theory. It probably matters more whether the information you're reading is any good.
