Why would you want to do that? Well, if you like a show enough to binge-watch it (say The Big Bang Theory, Grey’s Anatomy, Star Trek, Silicon Valley, etc.), you can dig deeper into the ideas and topics that fascinated you so much, to find newer alternatives, recent extensions, occasional fallacies, and hopefully a few very insightful ideas for your own research.
So if you have the same need as the title of this article, here’s a brief write-up of the steps needed to answer it reasonably:
- download subtitle files
- extract words from the file
- change all words to lowercase
- remove numbers
- remove common English words
- find unique words
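The steps above can be tried end to end on a few made-up subtitle lines before touching any real files. This is only a rough sketch: the sample text, the three "common words", and the function name `keywords_demo` are invented for illustration, and `grep -vxF` with inline words stands in for a real stopword list.

```shell
#!/usr/bin/env bash
# Toy run of the steps on an inline snippet (no downloads needed).
# The subtitle text and the stopwords below are made up for illustration.

keywords_demo() {
  # Made-up snippet in .srt layout: index, timestamps, text.
  local sample='1
00:00:01,000 --> 00:00:03,000
The patient is afebrile.

2
00:00:03,500 --> 00:00:05,000
Is the CT abdominal?'

  printf '%s\n' "$sample" |
    grep -oE '\w+' |                      # extract words, one per line
    tr '[:upper:]' '[:lower:]' |          # change all words to lowercase
    grep -v '[0-9]' |                     # remove tokens containing digits
    grep -vxF -e the -e is -e patient |   # remove the (made-up) common words
    sort -u                               # keep unique words
}

keywords_demo
```

On this sample it prints `abdominal`, `afebrile` and `ct`, one per line. The timestamps and cue numbers disappear in the digit filter, which is why no special handling of the subtitle layout is needed.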
Here’s an illustration with “Grey’s Anatomy”, one of my favorite shows:
- download subtitles from here: http://www.tvsubtitles.net/tvshow-7-1.htm
- download common English words from here (the gist also shows a technique for using them): https://gist.github.com/JKirchartz/ff4106c07fed1ae9c223b90a245241e0#file-rm-en-google10000-anywhere-sed
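The gist above ships a ready-made sed script. If all you have is a plain word list (one word per line), a script for `sed -f` can be generated from it. This is a minimal sketch that assumes lowercase, purely alphabetic words. The function name `make_stopword_sed` and the file names in the usage comment are hypothetical, and the generated script is not necessarily in the same form as the one in the gist.

```shell
#!/usr/bin/env bash
# Sketch: turn a plain word list on stdin into a sed script on stdout.
# Each word "foo" becomes the sed command /^foo$/d, which deletes a line
# consisting of exactly that word. Assumes words contain no regex
# metacharacters (plain lowercase letters only).

make_stopword_sed() {
  sed 's|.*|/^&$/d|'
}

# Example use (file names are placeholders):
#   make_stopword_sed < common-words.txt > rm-common-words.sed
#   ... | sed -f rm-common-words.sed | ...
```

Because the pipeline feeds sed one word per line, deleting whole matching lines is enough, and a word like `theory` is not touched by the rule for `the`.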
Some preparatory steps (unzip the archives, then rename the subtitle file to something easier to type):
$ ls | xargs -I {} unzip {}
$ mv "Grey s Anatomy - Grey s Anatomy - 1x04 - No Man's Land.en.srt" 1x04-No_Man_s_Land.en.srt
The remaining steps can be executed in one pipe:
$ grep -oE '\w+' 1x04-No_Man_s_Land.en.srt | tr '[:upper:]' '[:lower:]' | grep -v '[0-9]' | sed -f rm-en-google10000-anywhere.sed | sort -u > "1x04-No_Man_s_Land.en.srt.keywords"
Here are some results
$ head -n 10 1x04-No_Man_s_Land.en.srt.keywords
abdominal
accidentally
activating
adenocarcinoma
afebrile
airbrush
alzheimer
amish
apologize
and more
$ wc -l *.keywords
 237 1x01-A_Hard_Day_s_Night.en.srt.keywords
 224 1x02-The_First_Cut_Is_the_Deepest.en.srt.keywords
 214 1x03-Winning_a_Battle_Losing_the_War.en.srt.keywords
 230 1x04-No_Man_s_Land.en.srt.keywords
 905 total
Afterthoughts:
- Some noise words still make it into the output.
- It can definitely be simplified further with a loop over all the zip files.
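One possible shape for that loop is sketched below. It is an assumption-laden sketch rather than a tested recipe for the real archives: the function names `unpack_all` and `extract_all_keywords` are made up, the archives are assumed to contain `.srt` files at their top level, and the output is written with `>` so that reruns overwrite rather than append.

```shell
#!/usr/bin/env bash
# Sketch: unpack every zip in a directory, then build one .keywords file
# per subtitle file. Function names and paths are hypothetical.

unpack_all() {
  local dir="$1" archive
  for archive in "$dir"/*.zip; do
    [ -e "$archive" ] || continue        # skip when no zip files are present
    unzip -o -q "$archive" -d "$dir"     # -o: overwrite, -q: quiet, -d: target dir
  done
}

extract_all_keywords() {
  local dir="$1" stopwords_sed="$2" srt
  for srt in "$dir"/*.srt; do
    [ -e "$srt" ] || continue            # skip when no subtitle files are present
    grep -oE '\w+' "$srt" |
      tr '[:upper:]' '[:lower:]' |
      grep -v '[0-9]' |
      sed -f "$stopwords_sed" |
      sort -u > "$srt.keywords"          # overwrite, so reruns do not add duplicates
  done
}

# Example call (paths are placeholders):
#   unpack_all .
#   extract_all_keywords . rm-en-google10000-anywhere.sed
```

Since every path is quoted, file names with spaces and apostrophes work as they are, so the manual `mv` renaming becomes optional.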
This post was later published on LinkedIn here.