Introduction
- Let’s say you have two files, A and B.
- Let’s say the size of file A, in terms of line count, is m and the size of file B is n.
- Both files are reasonably large, especially file B. Let’s say file B is 180 GB.
- Both tend to grow over time, especially file A. Let’s say file A gets updated every week.
- Each line of file A (or a part of it) needs to be searched against the lines of file B (or parts of them).
Problem modeling
- It’s a search problem.
- It’s the same as this problem.
Solution Approach
Approach 1 : Brute force
- Iterate through one file say A.
- Read each line of A
- Iterate through the other file B.
- Read each line of B
- See if there is a match
- Report
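The steps above can be sketched in Python as follows; the function name and file paths are illustrative, not from the original post. For each line of A, the whole of B is rescanned, which is what makes this O(m * n).

```python
def brute_force_matches(path_a, path_b):
    """Yield each line of file A that also appears as a line in file B.

    File B is re-read from the start for every line of A (brute force).
    """
    with open(path_a) as fa:
        for line_a in fa:
            key = line_a.rstrip("\n")
            # Rescan B from the beginning for this line of A.
            with open(path_b) as fb:
                for line_b in fb:
                    if key == line_b.rstrip("\n"):
                        yield key
                        break
```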
Analysis
- Big-O time complexity is m * n.
- Will take a very long time
Approach 2 : Sort file B and do binary search for file A’s contents
- Sort file B
- Iterate through A.
- Read each line
- Do a binary search in file B
- See if there is a match
- Report
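A minimal sketch of this approach using Python’s `bisect` module; names are illustrative. Loading B’s lines into a sorted in-memory list is what gives the O(n) space cost noted below.

```python
import bisect

def binary_search_matches(path_a, path_b):
    """Yield each line of file A found in file B via binary search.

    File B's lines are sorted and held in memory (O(n) space);
    each line of A then costs O(log n) to look up.
    """
    with open(path_b) as fb:
        b_lines = sorted(line.rstrip("\n") for line in fb)
    with open(path_a) as fa:
        for line_a in fa:
            key = line_a.rstrip("\n")
            i = bisect.bisect_left(b_lines, key)
            if i < len(b_lines) and b_lines[i] == key:
                yield key
```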
Analysis
- Big-O time complexity is : n * log(n) + m * log(n)
- Big-O space complexity is : n (you need to host file B in memory)
- Sorting file B ahead will save time.
- However, each search operation is only log(n); the m searches together cost m * log(n).
- Will need a lot of space
- Will still be slow
- Will break in the future as m and n grow
Approach 3 : Sort file A and B and do file co-parsing by streaming both files
- Sort file A
- Sort file B
- Iterate through A while simultaneously reading through contents of B
- See if there is a match.
- If both are equal report.
- If file B’s current line is ‘smaller’ than file A’s, read the next line of file B. Continue till you find a match or move past it.
- If file A’s current line is ‘smaller’ than file B’s, read the next line of file A. Continue till you find a match or move past it.
- If ‘end of file’ is reached on either A or B, then report and exit.
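The co-parsing steps above amount to a sort-merge join; here is a minimal Python sketch under the assumption that both files are already sorted line-wise (names and paths are illustrative). Only one line of each file is held in memory at a time.

```python
def merge_matches(path_a, path_b):
    """Stream two pre-sorted files and yield their common lines.

    Classic sort-merge co-parsing: advance whichever file holds the
    'smaller' line; yield on equality; stop at either EOF.
    """
    with open(path_a) as fa, open(path_b) as fb:
        line_a = fa.readline()
        line_b = fb.readline()
        while line_a and line_b:  # readline() returns "" at EOF
            key_a = line_a.rstrip("\n")
            key_b = line_b.rstrip("\n")
            if key_a == key_b:
                yield key_a
                line_a = fa.readline()
                line_b = fb.readline()
            elif key_b < key_a:
                line_b = fb.readline()
            else:
                line_a = fa.readline()
```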
Analysis
- Big-O time complexity is : n * log(n) + m * log(m) + m + n.
- Sorting files A and B ahead of time saves considerable time.
- Slightly more time consuming than Approach 2 when all steps are done for the first time. However, if sorting is done a priori, subsequent searches take linear time : m + n.
- No additional space requirements: the merge streams both files one line at a time.
- The approach is future-proof as the sizes of A and B grow/change.
Author : Madhulatha Mandarapu
Context:
- Search Pubmed article ids linked to any clinical trial against NLM’s MRCOC detailed MeSH term co-occurrence file.
- The detailed MeSH term co-occurrence file is very large, 183 GB with 1.7B rows, and will grow.
- The number of global clinical trials done/in-flight is approximately 760k. As trials grow, Pubmed article ids linked to any clinical trial will also grow.
- The brute-force approach for this problem will take a long time, and the binary search approach will need considerable memory, especially if you search in MRCOC.
- Here’s an implementation
This post was later published on LinkedIn here.