Why algorithmic thinking is needed

Introduction

  • Let’s say you have two files A and B.
  • Let’s say the size of file A, in terms of line count, is m and the size of file B is n.
  • Both files are reasonably large, especially file B. Let’s say file B is 180 GB.
  • Both tend to grow over time, especially file A. Let’s say file A gets updated every week.
  • Each line of file A (or a part of it) needs to be searched against the lines of file B (or parts of them).

Problem modeling

  • It’s a search problem.
  • It’s the same as this problem.

Solution Approach

Approach 1 : Brute force

  • Iterate through one file, say A.
  • Read each line of A.
  • For each line of A, iterate through the other file, B.
  • Read each line of B.
  • See if there is a match.
  • Report the match (a minimal sketch of this approach follows below).
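Here is a minimal Python sketch of the brute-force approach. It assumes exact whole-line matching, and the function and file names (brute_force_matches, a.txt, b.txt) are hypothetical; this is an illustration, not the implementation referenced later in the post.

    def brute_force_matches(path_a, path_b):
        """For every line of A, scan all of file B looking for an exact match."""
        matches = []
        with open(path_a) as fa:
            for line_a in fa:
                key_a = line_a.rstrip("\n")
                with open(path_b) as fb:  # file B is re-read once per line of A
                    for line_b in fb:
                        if key_a == line_b.rstrip("\n"):
                            matches.append(key_a)
                            break
        return matches

    print(brute_force_matches("a.txt", "b.txt"))

The nested loops are what make this quadratic: each of A’s m lines triggers, in the worst case, a full scan over B’s n lines.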

Analysis

  • Big-O time complexity is O(m * n).
  • Will take a very long time.

Approach 2 : Sort file B and do binary search for file A’s contents

  • Sort file B
  • Iterate through A.
  • Read each line of A.
  • Do a binary search for it in file B.
  • See if there is a match.
  • Report the match (a minimal sketch follows below).
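Here is a minimal Python sketch of this approach, assuming file B fits in memory and exact whole-line matching (the function and parameter names are hypothetical):

    import bisect

    def binary_search_matches(path_a, path_b):
        # Load all of file B into memory and sort it: O(n) space, O(n log n) time.
        with open(path_b) as fb:
            b_sorted = sorted(line.rstrip("\n") for line in fb)

        matches = []
        with open(path_a) as fa:
            for line_a in fa:                          # m searches ...
                key = line_a.rstrip("\n")
                i = bisect.bisect_left(b_sorted, key)  # ... each costing log(n)
                if i < len(b_sorted) and b_sorted[i] == key:
                    matches.append(key)
        return matches

For a 180 GB file B, the sorted(...) step is exactly where this approach runs into trouble: the whole file has to be held in memory.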

Analysis

  • Big-O time complexity is: n * log(n) + m * log(n) (sorting file B, then one binary search per line of A).
  • Big-O space complexity is: n (you need to hold file B in memory).
  • Sorting file B ahead of time will save the n * log(n) cost on subsequent runs.
  • However, each binary search is log(n), so searching for all m lines of file A still costs m * log(n) in total.
  • Will need a lot of memory, since the 180 GB file B must be held in memory.
  • Will still be slow.
  • Will break in the future as m and n grow.

Approach 3 : Sort files A and B and do co-parsing by streaming both files

  • Sort file A
  • Sort file B
  • Iterate through A while simultaneously reading through the contents of B.
  • Compare the current line of A with the current line of B.
  • If both are equal, report the match and advance both files.
  • If file B’s current line is ‘smaller’ than file A’s, read the next line of file B and compare again.
  • If file A’s current line is ‘smaller’ than file B’s, read the next line of file A and compare again.
  • If ‘end of file’ is reached on either A or B, report the results and exit (a minimal sketch follows below).
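Here is a minimal Python sketch of the co-parsing step, assuming both files are already sorted lexicographically and lines are compared exactly (the names are hypothetical):

    def streaming_matches(path_a_sorted, path_b_sorted):
        """Merge-style co-parse of two sorted files: O(m + n) time after sorting."""
        matches = []
        with open(path_a_sorted) as fa, open(path_b_sorted) as fb:
            line_a, line_b = fa.readline(), fb.readline()
            while line_a and line_b:                  # stop when either file is exhausted
                key_a, key_b = line_a.rstrip("\n"), line_b.rstrip("\n")
                if key_a == key_b:
                    matches.append(key_a)             # report the match
                    line_a, line_b = fa.readline(), fb.readline()
                elif key_a < key_b:
                    line_a = fa.readline()            # A’s line is ‘smaller’: advance A
                else:
                    line_b = fb.readline()            # B’s line is ‘smaller’: advance B
        return matches

Because both files are read strictly forward, this step never needs more than one line of each file in memory, however large A and B grow.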

Analysis

  • Big-O time complexity is: n * log(n) + m * log(m) + m + n (sorting both files, then one linear pass over each).
  • Sorting files A and B ahead of time will save considerable time.
  • Slightly more time-consuming than Approach 2 when all steps are done for the first time. However, if sorting is done a priori, subsequent searches take linear time: m + n.
  • No additional space requirements (both files are streamed line by line).
  • The approach is future-proof as the sizes of files A and B grow or change.
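To make the comparison between the three approaches concrete, here is a back-of-envelope calculation. The sizes are illustrative assumptions: n roughly mirrors the 1.7B rows mentioned in the Context section below, while m is an arbitrary stand-in.

    import math

    m = 1_000_000          # hypothetical number of lines in file A
    n = 1_700_000_000      # hypothetical number of lines in file B

    approach_1 = m * n                                               # brute force
    approach_2 = n * math.log2(n) + m * math.log2(n)                 # sort B + binary search
    approach_3_first_run = n * math.log2(n) + m * math.log2(m) + m + n
    approach_3_presorted = m + n                                     # both files already sorted

    for name, ops in [("Approach 1", approach_1),
                      ("Approach 2", approach_2),
                      ("Approach 3, first run", approach_3_first_run),
                      ("Approach 3, pre-sorted", approach_3_presorted)]:
        print(f"{name}: ~{ops:.2e} operations")

The gap between the quadratic figure and the linear one is what makes Approach 3 viable at this scale.
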
Author : Madhulatha Mandarapu

Context:

  • Search PubMed article IDs linked to any clinical trial against NLM’s MRCOC detailed MeSH term co-occurrence file.
  • The detailed MeSH term co-occurrence file is very large (183 GB, 1.7B rows) and will grow.
  • The number of global clinical trials completed or in flight is approximately 760k. As trials grow, the PubMed article IDs linked to any clinical trial will also grow.
  • The brute-force approach to this problem will take a long time, and the binary search approach will need considerable memory, especially if you search in MRCOC.
  • Here’s an implementation.

This post was later published on LinkedIn here.
