Merge 2 Huge Files & Remove Duplicates in 5 Steps (2025)

Tired of crashing Excel? Learn how to merge two huge files and remove duplicates in 5 simple, memory-efficient steps using powerful command-line tools in 2025.

Alex Donovan

A data engineer and writer passionate about making complex tech accessible to everyone.

You’ve been there. Two massive files sit on your desktop, mocking you. Maybe they're customer lists from different campaigns, server logs from two periods, or just gigantic piles of text data. Your mission, should you choose to accept it, is to merge them and hunt down every last duplicate entry.

So you double-click the first one. Your computer groans. You try to copy and paste it into a spreadsheet. The rainbow wheel of death starts spinning its hypnotic, soul-crushing dance. Excel freezes, Google Sheets gives up, and your trusty text editor just displays a blank screen, having fainted under the pressure.

Trying to wrangle gigabyte-sized files with tools designed for kilobytes is like trying to bail out a cruise ship with a teacup. It’s not going to work. But what if I told you there’s a powerful, lightning-fast method to do this that’s probably already built into your computer? Forget fancy software. In 2025, the most elegant solution is still the command line.

The Problem with Pretty Interfaces

The reason your favorite apps choke on huge files is simple: they try to load the entire file into your computer's memory (RAM). When a file is bigger than your available RAM, your system grinds to a halt. It’s a brute-force approach, and it’s deeply inefficient.

Command-line tools, on the other hand, are masters of efficiency. They work like a conveyor belt, processing files line by line in a continuous stream. They read a little, process it, write it out, and move on, barely touching your system's memory. This is the secret to handling datasets of virtually any size.
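
You can feel this streaming approach for yourself. Tools like head and less open a multi-gigabyte file instantly because they only read the part you ask for (huge.txt here is just a stand-in for any large file of yours):

head -n 5 huge.txt   # prints the first five lines immediately
less huge.txt        # pages through the file without loading it all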

The 5-Step Game Plan to Merge and Deduplicate

We're going to use a trio of powerful yet simple commands: cat, sort, and uniq. Our battle plan looks like this:

  1. The Setup: Open your command-line terminal.
  2. The Merge: Combine your two files into one giant file using cat.
  3. The Order: Sort the combined file alphabetically with sort. This is the crucial step.
  4. The Purge: Remove all duplicate lines with uniq.
  5. The Verify: Check your work and clean up.

Let's get our hands dirty. For this guide, we'll assume you have two files named list1.txt and list2.txt.
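
Don't have files that large handy? You can generate two overlapping test files with seq; these sample files are purely illustrative, though they happen to produce the exact line counts you'll see in Step 5:

seq 1 5000000 > list1.txt          # 5,000,000 lines: "1" through "5000000"
seq 3500001 6500000 > list2.txt    # 3,000,000 lines, 1,500,000 of which also appear in list1.txt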

Step-by-Step: The Definitive Guide

Step 1: Get Your Tools Ready

First, you need to open your command-line interface. Don't be intimidated; it’s just a blank screen waiting for your commands.

  • On macOS or Linux: Open the Terminal app.
  • On Windows: You can use PowerShell, but for the commands we're using, the best experience is with WSL (Windows Subsystem for Linux). If you have it, open your Linux distribution (like Ubuntu).
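
If you don't have WSL yet, recent builds of Windows 10 and 11 can usually install it with a single command from an administrator PowerShell (a reboot is required afterward):

wsl --install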

Next, navigate to the folder where your files are saved. Use the cd (change directory) command. For example, if they're in a folder called "data" on your Desktop:

cd ~/Desktop/data
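
A quick listing confirms you're in the right place and shows just how big these files are:

ls -lh list1.txt list2.txt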

Step 2: Combine the Files (The Merge)

The cat command (short for "concatenate") reads files and prints their content. We can redirect that output into a new file.

Run this command:

cat list1.txt list2.txt > combined.txt

Let's break it down:

  • cat list1.txt list2.txt tells the computer to read list1.txt first, then immediately read list2.txt.
  • The > symbol is a redirect. It takes all the output from the cat command and, instead of printing it to the screen, saves it into a new file.
  • combined.txt is the name of our new, merged file.

Even for multi-gigabyte files, this step is quick: cat streams the data through a small buffer instead of loading either file into memory, so the only real limit is your disk speed.
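
One caution: > overwrites its target without asking, so never redirect into one of your own input files (cat list1.txt list2.txt > list1.txt would wipe out list1.txt before it was even read). To peek at the result without opening it:

head -n 3 combined.txt   # the first three lines
tail -n 3 combined.txt   # the last three lines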

Step 3: Sort the Combined File (The Secret Sauce)

This is the most important step. The next command, uniq, is powerful but a little dumb—it can only detect duplicate lines if they are right next to each other. Sorting groups all identical lines together, preparing them for removal.

Think of it like a deck of cards. You can't easily find all the duplicate Kings if they're scattered throughout the deck. But if you sort the deck by rank, all four Kings will end up in a neat little pile, easy to spot.

Use the sort command:

sort combined.txt > sorted_combined.txt

Like cat, the sort command is designed for huge files. If it runs out of memory, it will cleverly use temporary space on your hard drive to get the job done. This might take a minute for truly enormous files, but it won't crash.
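
If you're using GNU sort (Linux or WSL), you can optionally tune that behavior. These flags are GNU-specific extras, not required for the recipe:

sort -S 2G -T /tmp combined.txt > sorted_combined.txt   # use up to 2 GB of RAM, spill overflow to /tmp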

Step 4: Find the Uniques (The Purge)

Now that our file is perfectly sorted, we can unleash uniq. This command will go through the file, and for every group of identical, adjacent lines, it will print only the first one.

Run the final command:

uniq sorted_combined.txt > final_unique_list.txt

And that’s it! The file final_unique_list.txt now contains every unique line from both list1.txt and list2.txt, with all duplicates banished.
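
If you're curious about what got removed, uniq can report on the duplicates instead of silently dropping them:

uniq -d sorted_combined.txt | head              # a sample of lines that appeared more than once
uniq -c sorted_combined.txt | sort -rn | head   # every line with its count, most frequent first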

Step 5: Verify and Clean Up

How can you be sure it worked without opening the files? Use wc -l, which counts the number of lines in a file.

Run this to see the line counts for all your files:

wc -l list1.txt list2.txt combined.txt final_unique_list.txt

You'll see something like this:

  5000000 list1.txt
  3000000 list2.txt
  8000000 combined.txt
  6500000 final_unique_list.txt

This confirms that combined.txt is the sum of the first two, and final_unique_list.txt is smaller, indicating duplicates were successfully removed. Once you're satisfied, you can delete the intermediate files:

rm combined.txt sorted_combined.txt
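
For one last sanity check: since final_unique_list.txt is already sorted, running uniq over it again should remove nothing, so these two numbers must match:

wc -l < final_unique_list.txt
uniq final_unique_list.txt | wc -l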

The One-Liner Power Move

Once you're comfortable with the steps, you can chain them all into a single, beautiful command using pipes (|). A pipe takes the output of one command and feeds it directly as the input to the next, without creating any intermediate files.

This one command does everything we just did:

cat list1.txt list2.txt | sort | uniq > final_unique_list.txt

This is the epitome of command-line elegance. It’s faster, cleaner, and leaves no intermediate files cluttering your folder. This is how the pros do it.
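
There's an even shorter variant worth knowing: sort accepts multiple input files and a -u flag that deduplicates during the sort, so you can drop cat and uniq entirely:

sort -u list1.txt list2.txt > final_unique_list.txt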

A Note for Windows Users

If you're on Windows without WSL, you can achieve the same result using PowerShell's native cmdlets. The philosophy is the same, but the names are different.

Task                 Linux/macOS/WSL Command      Windows PowerShell Command
Merge Files          cat list1.txt list2.txt      Get-Content list1.txt, list2.txt
Sort & Deduplicate   sort | uniq                  Sort-Object -Unique
Save to File         > final_list.txt             | Set-Content final_list.txt

So, the PowerShell one-liner equivalent would be:

Get-Content list1.txt, list2.txt | Sort-Object -Unique | Set-Content final_unique_list.txt

While this works well for moderately sized files, note that Sort-Object gathers its entire input in memory before sorting, so it can bog down on truly huge files. That's one reason many data professionals on Windows install WSL to get the standard Linux toolchain, which remains the lingua franca for this kind of data wrangling.

You're Now a Data-Wrangling Pro

There you have it. You’ve tamed massive datasets that would bring most programs to their knees. You didn't need expensive software or a supercomputer—just a few timeless commands that prioritize efficiency over a flashy interface.

The next time you're faced with a terrifyingly large file, don't panic. Just open your terminal, remember the magic mantra—cat | sort | uniq—and get the job done in seconds. Welcome to the world of efficient data manipulation.
