- Let’s say you have two files A and B.
- Let’s say the size of file A, in terms of line count, is m, and the
size of file B is n.
- Both files are reasonably large, especially file B. Let’s say file B is 180 GB.
- Both tend to grow over time, especially file A. Let’s say file A gets
updated every week.
- Each line of file A (or a part of it) needs to be searched for among
the lines of file B (or parts of them).
- It’s a search problem: essentially, finding the lines common to both files.
Approach 1 : Brute force
- Iterate through one file, say A.
- Read each line of A
- Iterate through the other file B.
- Read each line of B
- See if there is a match
- Big-O time complexity is O(m * n).
- Will take a very long time (a short sketch of this scan follows below).
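To make the cost concrete, here is a minimal Python sketch of the brute-force scan; the file paths and the whole-line match rule are illustrative assumptions, not part of the original problem:

```python
# Brute-force sketch: for every line of A, re-scan all of B.
# O(m * n) line comparisons; fine for tiny files, hopeless at scale.

def brute_force_matches(path_a: str, path_b: str) -> list[str]:
    matches = []
    with open(path_a, encoding="utf-8") as fa:
        for line_a in fa:
            key = line_a.rstrip("\n")
            # Re-open and re-read all of B for every single line of A.
            with open(path_b, encoding="utf-8") as fb:
                for line_b in fb:
                    if line_b.rstrip("\n") == key:
                        matches.append(key)
                        break
    return matches
```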
Approach 2 : Sort file B and do binary search for file A’s contents
- Sort file B
- Iterate through A.
- Read each line
- Do a binary search in file B
- See if there is a match
- Big-O time complexity is O(n * log(n) + m * log(n)).
- Big-O space complexity is O(n) (you need to hold file B in memory for the binary search).
- Sorting file B ahead of time saves the O(n * log(n)) cost on repeat runs.
- However, the lookups still cost O(m * log(n)) in total on every run.
- Will need a lot of memory, since file B has to be held in memory.
- Will still be slow.
- Will break in the future as m and n grow (a sketch of this approach follows below).
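A minimal sketch of this approach using Python’s bisect module; note that it assumes file B fits in memory, which is exactly the weakness called out above:

```python
import bisect

def binary_search_matches(path_a: str, path_b: str) -> list[str]:
    # Load and sort all of B in memory: O(n log n) time, O(n) space.
    with open(path_b, encoding="utf-8") as fb:
        b_lines = sorted(line.rstrip("\n") for line in fb)

    matches = []
    with open(path_a, encoding="utf-8") as fa:
        for line_a in fa:
            key = line_a.rstrip("\n")
            i = bisect.bisect_left(b_lines, key)  # O(log n) per lookup
            if i < len(b_lines) and b_lines[i] == key:
                matches.append(key)
    return matches
```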
Approach 3 : Sort files A and B and do file co-parsing by streaming both
- Sort file A
- Sort file B
- Iterate through A while simultaneously reading through the contents of B.
- Compare the current line of A with the current line of B.
- If both are equal, report a match and advance both files.
- If file B’s line is ‘smaller’ than file A’s, read the next line of file
B and compare again.
- If file A’s line is ‘smaller’ than file B’s, read the next line of file
A and compare again.
- If ‘end of file’ is reached on either A or B, report the matches found and stop.
- Big-O time complexity is O(n * log(n) + m * log(m) + m + n).
- Sorting files A and B ahead of time will save considerable time.
- Slightly more time-consuming than Approach 2 when all steps are done
for the first time. However, if sorting is done a priori, subsequent
searches take linear time: O(m + n).
- No additional space requirements: only the current line of each file
needs to be in memory at any time.
- The approach is future-proof as the sizes of A and B grow/change (a
streaming sketch follows below).
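Here is a minimal Python sketch of this merge, assuming both files are already sorted in the same byte-wise order and that the whole line is the match key:

```python
def merge_matches(path_a: str, path_b: str):
    """Yield lines common to two sorted files, reading each file once."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        line_a, line_b = fa.readline(), fb.readline()
        while line_a and line_b:                 # stop at EOF on either side
            key_a, key_b = line_a.rstrip("\n"), line_b.rstrip("\n")
            if key_a == key_b:
                yield key_a                      # match: report, advance both
                line_a, line_b = fa.readline(), fb.readline()
            elif key_a < key_b:
                line_a = fa.readline()           # A is 'smaller': advance A
            else:
                line_b = fb.readline()           # B is 'smaller': advance B
```

Only the current line of each file is held in memory at any point, which is where the constant extra space comes from.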
An application : Search Pubmed article ids linked to any clinical trial against NLM’s detailed MeSH term co-occurrence file
- The detailed MeSH term co-occurrence file is very large (183 GB, 1.7B
rows) and will keep growing.
- The number of global clinical trials completed/in-flight is approximately
760k. As trials grow, the Pubmed article ids linked to clinical trials
will also grow.
- The brute-force approach for this problem would take a very long time
(roughly m * n comparisons, on the order of 10^15 here), and the binary
search approach would need considerable memory, especially if you hold
the co-occurrence file in memory.
- Here’s what an implementation could look like.
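Below is a minimal end-to-end sketch under stated assumptions: trial_pmids.txt (one Pubmed id per line for trial-linked articles) and mesh_pmids.txt (the Pubmed id column pre-extracted from the co-occurrence file) are made-up file names, and merge_matches() is the streaming merge sketched under Approach 3. GNU sort does an external, disk-backed merge sort, so the 183 GB file never has to fit in RAM; LC_ALL=C forces byte-wise ordering so the shell sort agrees with Python’s string comparison.

```python
import os
import subprocess

def external_sort(src: str, dest: str) -> None:
    # GNU sort spills to temporary files on disk, so even a 183 GB
    # input never has to fit in memory. LC_ALL=C gives byte-wise
    # ordering, matching Python's default string comparison.
    subprocess.run(["sort", "-o", dest, src],
                   env={**os.environ, "LC_ALL": "C"}, check=True)

if __name__ == "__main__":
    # Hypothetical inputs: one Pubmed id per line in each file.
    external_sort("trial_pmids.txt", "trial_pmids.sorted.txt")
    external_sort("mesh_pmids.txt", "mesh_pmids.sorted.txt")

    # merge_matches() is the streaming merge from the Approach 3 sketch.
    for pmid in merge_matches("trial_pmids.sorted.txt",
                              "mesh_pmids.sorted.txt"):
        print(pmid)
```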