Recently I read about this data leak: “COMB: largest breach of all time leaked online with 3.2 billion records”.
According to the article, the leak is known as the “Compilation of Many Breaches” (COMB). It was posted on a popular hacking forum and contains billions of user credentials from past leaks, including Netflix, LinkedIn, Exploit.in, Bitcoin and more. The data is stored as email and password pairs.
Inside the data dump, the files are structured something like this:
CompilationOfManyBreaches/
    data/
        folder1/
            file0
            file1
        folder2/
            file0
            file1
Each file contains lines like this:
[email protected]:15935755b
[email protected]:jumpjet1111
[email protected]:beamerbum2
[email protected]:dmitri79
[email protected]:7210996
Each line is in the format email:password.
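To make the format concrete, here is a minimal sketch of splitting one such line in Python. The helper name split_credential and the sample addresses are made up for illustration and are not from the dump. It assumes the first colon is the separator, so a password that itself contains a colon stays in one piece.

```python
def split_credential(line):
    """Split one 'email:password' line; return (email, password) or None."""
    # Strip only the line ending, so passwords with trailing spaces survive.
    line = line.rstrip("\r\n")
    # partition() splits on the FIRST colon only, so a password that
    # itself contains ':' is kept whole.
    email, sep, password = line.partition(":")
    if not sep or not email:
        return None  # no separator or empty email: treat the line as malformed
    return email, password


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the dump.
    samples = [
        "alice@example.com:hunter2\n",
        "bob@example.com:pa:ss\n",
        "garbage line\n",
    ]
    for sample in samples:
        print(split_credential(sample))
```

Returning None for malformed lines lets the caller decide whether to skip or log them, instead of crashing halfway through a big file.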
So I wondered… What if we extract only the emails or only the passwords from all those files? We could build a password list from that, or analyze password trends, like which passwords are used the most.
But we’re not going through hundreds of files totalling 100GB+ to extract the passwords manually… That’s crazy, my man!
To make it easier, I’ve created a Python script that extracts the passwords from all the dump files recursively. The code is below:
#!/usr/bin/env python3
import os
from timeit import default_timer as timer
from datetime import timedelta

inputfile = "/Desktop/test/data"  # change this to your dump files location
outputpath = "extracted_password.txt"

print("\nStart extracting...")
start = timer()

with open(outputpath, "w") as outputfile:
    # walk through every sub-folder and file under inputfile
    for path, dirs, files in os.walk(inputfile):
        for filename in files:
            fullpath = os.path.join(path, filename)
            # errors="ignore" drops bytes that are not valid text instead of crashing
            with open(fullpath, "r", errors="ignore") as f:
                for line in f:
                    # split on the first ":" only, so passwords containing ":" stay whole
                    email, sep, password = line.rstrip("\n").partition(":")
                    if not sep:
                        continue  # skip lines that are not email:password
                    outputfile.write(password + "\n")

print("Finish!\n")
end = timer()
print("Time Taken: ", end='')
print(timedelta(seconds=end-start))
Save the code above as password_extractor.py and run the script:
$ python password_extractor.py
It may take some time depending on your hardware resources and the size of the dump files. Once the script finishes, it prints “Finish!” followed by the time taken.
When it completes, you should see a new file named “extracted_password.txt”. It contains all the passwords from all the dump files, consolidated into one single big-ass file.
Now we can start analyzing the password patterns. We can use the command below to see the top 10 passwords:
$ time sort extracted_password.txt | uniq -c | sort -bgr | head -10
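If you’d rather stay in Python, the same top 10 count can be sketched with collections.Counter. This is only a rough equivalent of the shell pipeline above, and the function name top_passwords is made up for this example. It assumes one password per line and that all the unique passwords fit in memory, which may not hold for the full dump.

```python
import os
from collections import Counter


def top_passwords(path, n=10):
    """Return the n most common lines in the file at `path` with their counts."""
    counts = Counter()
    # errors="replace" keeps the loop going if a line is not valid UTF-8.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            password = line.rstrip("\r\n")
            if password:  # skip empty lines
                counts[password] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Uses the output file produced by the extraction script, if it exists.
    if os.path.exists("extracted_password.txt"):
        for password, count in top_passwords("extracted_password.txt"):
            print(f"{count:>10}  {password}")
```

The file is read line by line, so only the counts are held in memory and not the whole file. For a really huge list, the sort and uniq pipeline may still cope better, since sort can spill to disk.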
Happy hunting & analyzing! 🙂