Extracting passwords from data leak dump files

Recently I read about this data leak: “COMB: largest breach of all time leaked online with 3.2 billion records”.

According to the article, it is known as the “Compilation of Many Breaches” (COMB). The data was leaked on a popular hacking forum. It contains billions of user credentials, as email and password pairs, from past leaks from Netflix, LinkedIn, Exploit.in, Bitcoin and more.

The data dump is structured something like this:

CompilationOfManyBreaches
  folderdata
    folder1
       file0
       file1
    folder2
       file0
       file1

Each file contains one credential per line, in the format email:password
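
To make the format concrete, here is a minimal sketch of how one such line could be split. It assumes the first colon separates the email from the password, since a password may contain colons of its own. The function name split_credential and the sample lines are made up for illustration; they are not taken from the dump.

```python
def split_credential(line):
    """Split one 'email:password' line at the first colon.

    Returns (email, password), or None when the line has no colon.
    """
    email, sep, password = line.rstrip("\r\n").partition(":")
    if not sep:
        return None
    return email, password


# Made-up sample lines, not taken from the dump.
print(split_credential("user@example.com:hunter2\n"))
print(split_credential("user@example.com:pass:with:colons\n"))
print(split_credential("no separator here\n"))
```

Splitting at the first colon only keeps the whole password intact, and returning None for malformed lines lets the caller skip them instead of crashing.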

So I wondered… what if we extract only the emails or only the passwords from all those files? We could maybe create a password list from that. Or we could analyze the password trends. See what the top passwords being used are & stuff.

So… we’re not going through hundreds of files totaling 100GB+ to extract the passwords manually… That’s crazy ma man!

To make it easier, I’ve created a Python script to extract the passwords from all the dump files recursively. The code is below:

#!/usr/bin/env python
import os
from timeit import default_timer as timer
from datetime import timedelta

inputfile = "/Desktop/test/data"  # change this to your dump files location

print("\nStart extracting...")
start = timer()

with open("extracted_password.txt", "w", encoding="utf-8") as outputfile:
    for path, dirs, files in os.walk(inputfile):
        for filename in files:
            fullpath = os.path.join(path, filename)
            # dump files may contain odd bytes, so replace anything undecodable
            with open(fullpath, "r", encoding="utf-8", errors="replace") as f:
                for line in f:
                    # split on the first ":" only; a password may contain ":" too
                    email, sep, password = line.rstrip("\r\n").partition(":")
                    if not sep:
                        continue  # skip lines that are not email:password
                    outputfile.write(password + "\n")

end = timer()
print("Finish!\n")
print("Time Taken: ", end='')
print(timedelta(seconds=end-start))

Save the code above as password_extractor.py & run the script:

$ python password_extractor.py

It may take some time, depending on your hardware resources and the size of the dump files. Once the script finishes, it prints “Finish!” followed by the time taken.

When it’s done, you should see a new file named “extracted_password.txt”. It contains all the passwords from all the dump files, consolidated into one single big-ass file.

Now we can start analyzing the password patterns. We can use the command below to see the top 10 passwords:

$ time sort extracted_password.txt | uniq -c | sort -bgr | head -10
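
If you’d rather stay in Python, the same count can be sketched with collections.Counter. This is only a rough equivalent of the pipeline above, not a replacement for it: the function name top_passwords is made up for illustration, and it assumes the extracted file has one password per line and that all the distinct passwords fit in memory.

```python
from collections import Counter
import os


def top_passwords(path, n=10):
    """Return the n most common lines in the file at `path` with their counts."""
    counts = Counter()
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            password = line.rstrip("\r\n")
            if password:  # ignore empty lines
                counts[password] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Only runs when the extracted file from the script above is present.
    if os.path.exists("extracted_password.txt"):
        for password, count in top_passwords("extracted_password.txt"):
            print(f"{count:>10} {password}")
```

For a really huge password file, the sort | uniq pipeline is probably the safer choice, since sort can spill to temporary files on disk while the Counter keeps every distinct password in RAM.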

Happy hunting & analyzing! 🙂
