Coding Challenge  #75 - Duplicate File Finder

This challenge is to build your own version of a file deduplication tool. These tools are useful for finding duplicate files that can be deleted to free up storage space.

For many of us this probably isn’t a problem we face often, but it’s enough of a problem that several tools exist, including fdupes written in C. It’s fairly readable code, so even if you don’t fancy writing your own tool, you could learn more about the techniques used by reading the code for fdupes.

There are several uses for deduplication and for the techniques used to identify duplicate files. The first, obviously, is to find duplicate files and remove them, freeing up storage space. Another common use is to exclude duplicate files from backups.

Enterprise-level backup systems extend this further, using the same techniques to find duplicate blocks within files, reducing the storage (and network bandwidth) requirements even more.
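To make these techniques concrete, here is a minimal sketch in Python of the usual two-stage approach (the function name and structure are illustrative, not prescribed by the challenge): group files by size first, since files of different sizes cannot be duplicates, then confirm the remaining candidates by hashing their contents.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Return groups of duplicate files under root.

    Stage 1: bucket files by size (cheap, no I/O on contents).
    Stage 2: within each bucket of 2+ files, bucket by SHA-256 of contents.
    """
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_hash[digest].append(path)

    return [group for group in by_hash.values() if len(group) > 1]
```

For large files you would read and hash in fixed-size chunks rather than calling `f.read()` on the whole file; the sketch above keeps things simple for small test files.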

Aside: Performance Quiz - The Answer

Last week I asked readers of Coding Challenges this question: if we re-wrote Docker in Python, what impact would it have on the performance of the containers you run?

The correct answer is that it would have no impact on the performance of the containers that you run. Docker is simply doing some configuration with Linux namespaces and cgroups and then running another process (your container) within those namespaces. The runtime is therefore completely unaffected by the language used to develop Docker.

To really understand Docker, check out the build your own Docker challenge. By building your own Docker, you’ll gain a deeper understanding of Docker and become a better software engineer.

Question: Should I Build A Course On How To Build Your Own Docker?

If you think so, and would be interested, please sign up to the waitlist. If there is enough interest I’ll build a course that explains how to create your own solution to the Docker Coding Challenge in Python, Go and Rust.

Anyone on the waitlist will be offered a 50% discount.

If You Enjoy Coding Challenges Here Are Three Ways You Can Help Support It

  1. Refer a friend or colleague to the newsletter.
  2. Sign up for a paid subscription - think of it as buying me a coffee twice a month, with the bonus that you also get 20% off any of my courses.
  3. Buy one of my courses that walk you through a Coding Challenge.

The Challenge - Building A Duplicate File Finder

In this coding challenge we’ll be building a command line tool that can scan a directory to identify and remove duplicate files.

Step Zero

Like all the best programming languages, we’re zero indexed!

For this step, I’ll leave you to set up your IDE / editor of choice and programming language of choice. After that, here’s what I’d like you to do to be ready to test your solution:

for i in {1..20}; do dd if=/dev/urandom bs=100 count=1 of=file$i; done
cp file1 file21

You can tweak this to create more files (change the 1..20 to use a bigger range) or different sized files (change the value of bs). The cp command then ensures we have a duplicate; again, feel free to create more duplicates for testing.

Please make your testing more complete by creating subdirectories and ensuring there are duplicate files at different levels of the directory hierarchy you create.
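If you would rather script the test fixture, here is one way to do it in Python (the file and directory names below are just examples): it builds a nested tree of random files with duplicates placed at different depths.

```python
import os
import shutil

def make_test_tree(root):
    """Create 20 random files under root, plus duplicates of file1
    at two different depths of a nested directory hierarchy."""
    nested = os.path.join(root, "subdir", "anothersubdir")
    os.makedirs(nested, exist_ok=True)
    for i in range(1, 21):
        with open(os.path.join(root, f"file{i}"), "wb") as f:
            f.write(os.urandom(100))  # 100 random bytes, like dd bs=100 count=1
    # Duplicate file1 at two different levels of the hierarchy
    shutil.copy(os.path.join(root, "file1"), os.path.join(root, "subdir", "file21"))
    shutil.copy(os.path.join(root, "file1"), os.path.join(nested, "file22"))
```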

Step 1

In this step your goal is to accept a directory on the command line, then scan that directory and list all the files in it recursively. That should look something like this:

% ccdupe .
file1
file2
subdir/file11
subdir/anothersubdir/file21
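A minimal sketch of this step in Python might use `os.walk` to handle the recursion into subdirectories; the function below yields each file as a path relative to the scanned root, matching the output format shown above.

```python
import os

def list_files(root):
    """Yield every file under root, as a path relative to root."""
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.relpath(os.path.join(dirpath, name), root)
```

Printing each path yielded by `list_files(sys.argv[1])` would produce a listing like the example output; later steps can reuse the same traversal to feed candidate files into the duplicate check.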

Continued...

You can find the remaining steps for this challenge on the Coding Challenges Substack here.


John D. Wilson

Experienced Principal Software Engineer/Architect | Cloud Technologies | Hardware-Software Integration | Embedded Systems

4 weeks

Thank you for this. I have been wanting to address a duplication issue with post production of photos from Lightroom (Lr) and the initial backup of the raw files on a NAS. The issue is that after post production, photos are culled and I typically remove the discarded photos from Lr, but the backup on the NAS will still have the full backup. At some point the backup needs to be culled too, and this coding challenge should help!

Esthi Feldman

Software Developer at EY | Java | SQL | HTML | CSS | Spring

1 month

Very useful!

Okoro Wisdom

First Class, B.ENG MECHANICAL ENGINEERING

1 month

Very informative

Udayakarthik Janarthanan

Bachelor Of Engineering

1 month

I agree

Konstantin Yovkov

Senior Software Engineer at TIS (Treasury Intelligence Solutions)

1 month

An interesting one, indeed! I would also try extending it with something like a Bloom filter, in case the file space is enormous. But this is applicable only when the definition of duplicate is "100% matching". This challenge could also be extended with logic that calculates how "close" two files are to one another. I would approach that by playing with algorithms like Levenshtein distance (at the byte level). This could be taken even further, to an ML model that scores the "closeness" of files.
