What research topics are covered in your favorite sci-fi show?

Why would you want to do that? Well, if you like a show enough to binge-watch it (say, The Big Bang Theory, Grey's Anatomy, Star Trek, Silicon Valley, etc.), you can dig deeper into the ideas and topics presented there (the ones that fascinated you so much) to find newer alternatives, recent extensions, occasional fallacies and, hopefully, a few very insightful ideas for your own research.

So if you have the same question as the title of this article, here's a brief write-up of the steps needed to answer it reasonably:

  1. download subtitle files
  2. extract words from the file
  3. change all words to lowercase
  4. remove numbers
  5. remove common English words
  6. find unique words
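To make steps 2 to 6 concrete before touching real subtitles, here is a minimal sketch that runs them on a tiny inline sample. The names extract_keywords, STOPWORDS and sample are made up for this illustration, and the stopword list is a hypothetical stand-in for a real common-word list. It filters with grep -vxF rather than a sed script, which is just one possible way to do step 5.

```shell
#!/bin/sh
# Sketch of steps 2-6 on an inline sample (nothing to download).
# STOPWORDS is a tiny, hypothetical stand-in for a real common-word list.
STOPWORDS='the
is
a
to
we
need
in
now'

extract_keywords() {
  grep -oE '[[:alnum:]_]+' |       # step 2: one word per line
    tr '[:upper:]' '[:lower:]' |   # step 3: lowercase everything
    grep -v '[0-9]' |              # step 4: drop tokens containing digits
    grep -vxF -e "$STOPWORDS" |    # step 5: drop lines equal to a stopword
    sort -u                        # step 6: keep unique words
}

# A fake two-line subtitle entry in .srt layout.
sample='1
00:00:01,000 --> 00:00:03,000
The patient is in surgery now.
We need a scalpel, the SCALPEL!'

keywords=$(printf '%s\n' "$sample" | extract_keywords)
printf '%s\n' "$keywords"
```

The subtitle index and timestamps disappear in step 4, because every token in them contains a digit, so only dialogue words reach the stopword filter.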

Here’s an illustration with “Grey’s Anatomy”, one of my favorite shows.

  1. download subtitles from here: http://www.tvsubtitles.net/tvshow-7-1.htm
  2. download common English words from here (the page also shows a technique for using them): https://gist.github.com/JKirchartz/ff4106c07fed1ae9c223b90a245241e0#file-rm-en-google10000-anywhere-sed

Some preparatory steps are here:

$ ls | xargs -I {} unzip {}
$ mv "Grey s Anatomy - Grey s Anatomy - 1x04 - No Man's Land.en.srt" 1x04-No_Man_s_Land.en.srt

The remaining steps can be executed in one pipe:

$ grep -oE '\w+' 1x04-No_Man_s_Land.en.srt | tr 'A-Z' 'a-z' | grep -v '[0-9]' | sed -f rm-en-google10000-anywhere.sed | sort -u > 1x04-No_Man_s_Land.en.srt.keywords

Here are some results:

$ head -n 10 1x04-No_Man_s_Land.en.srt.keywords


And more:

$ wc -l *.keywords
 237 1x01-A_Hard_Day_s_Night.en.srt.keywords
 224 1x02-The_First_Cut_Is_the_Deepest.en.srt.keywords
 214 1x03-Winning_a_Battle_Losing_the_War.en.srt.keywords
 230 1x04-No_Man_s_Land.en.srt.keywords
 905 total

Afterthoughts:

  • Some noise words do exist.
  • It can definitely be further simplified with a loop for all zip files.
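Both afterthoughts can be sketched in a few lines. The function names srt_keywords and process_all are made up for this illustration, and the rule that tokens shorter than three characters are noise is only an assumption, not something measured on the real subtitles. The sed script argument is expected to be the common-word script downloaded earlier.

```shell
#!/bin/sh
# Hypothetical helper: the pipeline from this post wrapped in a function.
# $1 = subtitle file, $2 = sed script that deletes common English words.
srt_keywords() {
  grep -oE '\w+' "$1" |
    tr 'A-Z' 'a-z' |
    grep -v '[0-9]' |
    sed -f "$2" |
    grep -E '^.{3,}$' |   # assumption: tokens under 3 characters are noise
    sort -u
}

# Hypothetical driver: $1 = directory holding the zips, $2 = sed script.
process_all() {
  for f in "$1"/*.zip; do
    [ -e "$f" ] || continue          # skip when no zip file matches
    unzip -o -q "$f" -d "$1"         # extract next to the archive
  done
  for f in "$1"/*.srt; do
    [ -e "$f" ] || continue          # skip when no subtitle file matches
    srt_keywords "$f" "$2" > "$f.keywords"
  done
}
```

Calling process_all . rm-en-google10000-anywhere.sed from the download directory would then write one .keywords file per episode. The length filter also drops the empty lines that sed leaves behind when it deletes a whole word.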

This post was later published on LinkedIn.

