Introduction
- Let’s say you have two files, A and B.
- Let’s say the size of file A, in terms of line count, is m and the size of file B is n.
- Both files are reasonably large, especially file B. Let’s say file B is 180 GB.
- Both tend to grow over time, especially file A. Let’s say file A gets updated every week.
- Each line of file A (or a part of it) needs to be searched against the lines of file B (or parts of them).
Problem modeling
- It’s a search problem.
- It’s the same as this problem.
Solution Approach
Approach 1 : Brute force
- Iterate through one file say A.
- Read each line of A
- Iterate through the other file B.
- Read each line of B
- See if there is a match
- Report
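The steps above can be sketched in Python as follows; the function name and file paths are illustrative, not from the original post. For each line of A, the whole of B is rescanned, which is what makes this O(m * n).

```python
def brute_force_matches(path_a, path_b):
    """Yield each line of file A that also appears as a line in file B.

    File B is re-read from the start for every line of A (brute force).
    """
    with open(path_a) as fa:
        for line_a in fa:
            key = line_a.rstrip("\n")
            # Rescan B from the beginning for this line of A.
            with open(path_b) as fb:
                for line_b in fb:
                    if key == line_b.rstrip("\n"):
                        yield key
                        break
```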
Analysis
- Big-O time complexity is m * n.
- Will take a very long time
Approach 2 : Sort file B and do binary search for file A’s contents
- Sort file B
- Iterate through A.
- Read each line
- Do a binary search in file B
- See if there is a match
- Report
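A minimal sketch of this approach using Python’s `bisect` module; names are illustrative. Loading B’s lines into a sorted in-memory list is what gives the O(n) space cost noted below.

```python
import bisect

def binary_search_matches(path_a, path_b):
    """Yield each line of file A found in file B via binary search.

    File B's lines are sorted and held in memory (O(n) space);
    each line of A then costs O(log n) to look up.
    """
    with open(path_b) as fb:
        b_lines = sorted(line.rstrip("\n") for line in fb)
    with open(path_a) as fa:
        for line_a in fa:
            key = line_a.rstrip("\n")
            i = bisect.bisect_left(b_lines, key)
            if i < len(b_lines) and b_lines[i] == key:
                yield key
```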
Analysis
- Big-O time complexity is : n * log(n) + m * log(n)
- Big-O space complexity is : n (you need to host file B in memory)
- Sorting file B ahead will save time.
- However, each search operation is only log(n); the m searches together cost m * log(n).
- Will need a lot of space
- Will still be slow
- Will break in the future as m and n grow
Approach 3 : Sort file A and B and do file co-parsing by streaming both files
- Sort file A
- Sort file B
- Iterate through A while simultaneously reading through contents of B
- See if there is a match.
- If both are equal report.
- If file B’s current line is ‘smaller’ than file A’s, read the next line of file B. Continue till you find a match or move past it.
- If file A’s current line is ‘smaller’ than file B’s, read the next line of file A. Continue till you find a match or move past it.
- If ‘end of file’ is reached on either A or B, then report and exit.
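The co-parsing steps above amount to a sort-merge join; here is a minimal Python sketch under the assumption that both files are already sorted line-wise (names and paths are illustrative). Only one line of each file is held in memory at a time.

```python
def merge_matches(path_a, path_b):
    """Stream two pre-sorted files and yield their common lines.

    Classic sort-merge co-parsing: advance whichever file holds the
    'smaller' line; yield on equality; stop at either EOF.
    """
    with open(path_a) as fa, open(path_b) as fb:
        line_a = fa.readline()
        line_b = fb.readline()
        while line_a and line_b:  # readline() returns "" at EOF
            key_a = line_a.rstrip("\n")
            key_b = line_b.rstrip("\n")
            if key_a == key_b:
                yield key_a
                line_a = fa.readline()
                line_b = fb.readline()
            elif key_b < key_a:
                line_b = fb.readline()
            else:
                line_a = fa.readline()
```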
Analysis
- Big-O time complexity is : n * log(n) + m * log(m) + m + n.
- Sorting files A and B ahead of time saves considerable time.
- Slightly more time consuming than Approach 2 when all steps are done for the first time. However, if sorting is done a priori, subsequent searches take linear time : m + n.
- No additional space requirements: the merge streams both files one line at a time.
- The approach is future-proof as the sizes of A and B grow/change.
Author : Madhulatha Mandarapu
Context:
- Search Pubmed article ids linked to any clinical trial against NLM’s MRCOC detailed MeSH term co-occurrence file.
- The detailed MeSH term co-occurrence file is very large, 183 GB with 1.7B rows, and will grow.
- The number of global clinical trials done/in-flight is approximately 760k. As trials grow, Pubmed article ids linked to any clinical trial will also grow.
- The brute-force approach for this problem will take a long time, and the binary search approach will need considerable memory, especially if you search in MRCOC.
- Here’s an implementation
This post was later published on LinkedIn here.