File System Scanning


A while ago I wrote a simple file system scanning tool using Python. The article can be found here. The tool was a bit rough, but its only purpose was to see what was possible with Python.

I’ve recently dusted off that tool to hopefully improve it. I’ve broken the tool into two pieces: (1) a very simple tool to gather the data, and (2) a tool that processes the scan(s) and creates a statistical report.

Data gathering tool

This tool is a simple Python script that walks a file system tree and gathers data on every file using the os module, particularly the os.stat function. It gathers the following information:

  • Full file path
  • Size of file in bytes
  • The mtime (modify time)
  • The ctime (change time)
  • The atime (access time)
  • The file owner (uid and gid)

The data is gathered in Python lists and then written to a Python pickle file.
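To make the gathering step concrete, here is a minimal sketch of the idea, not the actual script. The function names scan_tree and save_scan and the order of the fields in each record are my assumptions for illustration; only the use of os.stat, Python lists, and a pickle file comes from the description above.

```python
import os
import pickle


def scan_tree(top):
    """Walk the tree under `top` and return one list per file.

    Assumed record layout (hypothetical):
    [path, size_bytes, mtime, ctime, atime, uid, gid]
    """
    records = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # Broken symlink, vanished file, or permission problem: skip it.
                continue
            records.append([
                path,
                st.st_size,
                st.st_mtime,
                st.st_ctime,
                st.st_atime,
                st.st_uid,
                st.st_gid,
            ])
    return records


def save_scan(records, out_path):
    """Write the gathered records to a pickle file."""
    with open(out_path, "wb") as fh:
        pickle.dump(records, fh)
```

Calling save_scan(scan_tree("/home"), "/tmp/home.pickle") would scan /home and write the result to /tmp/home.pickle.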

You can specify the starting directory for the scan using the “-d ” option. If you don’t specify the path it will default to the current working directory (cwd). You can also specify the name of the output pickle file using the “-o ” option. If you don’t specify one then the default is “file.pickle”.

You can download the file from here. I apologize for the link but WordPress doesn’t allow me to post scripts because of security concerns.

To run the tool you can just use the command:

% ./ -d <directory -o 

The first option allows you to specify the full path to the file tree you want to scan. For example, it could be something like “-d /home” to scan the entire /home subdirectory.

The second option allows you to specify the location and name of the output file (pickle file). For example you could use the option “-o /tmp/home.pickle” if you are scanning /home.
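For readers curious how two options like these can be handled, the following is a rough sketch using argparse from the standard library. I am not claiming the real script parses its arguments this way; the function name parse_args and the attribute names directory and output are made up for the example. The defaults (current working directory and “file.pickle”) are the ones described above.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse -d and -o; argv=None means 'use the real command line'."""
    parser = argparse.ArgumentParser(
        description="Scan a file system tree (illustrative sketch)."
    )
    parser.add_argument(
        "-d", dest="directory", default=os.getcwd(),
        help="directory where the scan starts (default: cwd)",
    )
    parser.add_argument(
        "-o", dest="output", default="file.pickle",
        help="name of the output pickle file (default: file.pickle)",
    )
    return parser.parse_args(argv)


# Same options as in the examples above, passed explicitly.
args = parse_args(["-d", "/home", "-o", "/tmp/home.pickle"])
print(args.directory, args.output)
```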

Statistical report tool

The second tool takes one or more scan files (pickles) and performs a statistical analysis on them. This analysis has some output written to stdout but it also creates an HTML report that can be viewed with a browser. The HTML version also includes some data plots.

The current version of the tool, mdpostp, can be found here.

There aren’t too many options to use with this script. The easiest option is to pass the name of a file that contains a list of the pickle files to be analyzed. For example,

% ./

where “” is a file with just a list of the pickle files.
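As an illustration of what processing such a list file might look like, here is a small sketch. It is not taken from mdpostp: the function names, the one-file-name-per-line format, the record layout (path first, size in bytes second), and the particular statistics are all assumptions on my part.

```python
import pickle
import statistics


def load_scans(list_path):
    """Read pickle file names (assumed one per line) and merge their records."""
    records = []
    with open(list_path) as fh:
        for line in fh:
            name = line.strip()
            if not name:
                continue  # ignore blank lines
            with open(name, "rb") as pf:
                records.extend(pickle.load(pf))
    return records


def size_summary(records):
    """Return a few basic file-size statistics (hypothetical selection)."""
    sizes = [rec[1] for rec in records]  # assumes size is the second field
    if not sizes:
        return {"count": 0, "total_bytes": 0, "mean_bytes": 0, "median_bytes": 0}
    return {
        "count": len(sizes),
        "total_bytes": sum(sizes),
        "mean_bytes": statistics.mean(sizes),
        "median_bytes": statistics.median(sizes),
    }
```

One caution that applies to any pickle-based workflow: unpickling can execute arbitrary code, so only load scan files you created yourself or otherwise trust.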

By default the analysis tool does not do a duplicate file search, but you can use the “-dup” option to force it to do one.
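I haven’t described how the duplicate search works, so the sketch below shows just one common approach rather than what mdpostp necessarily does: group files by size, then compare SHA-256 digests within each group. The function names and the record layout (path first, size second) are assumptions for the example.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(records):
    """Return lists of paths whose contents hash to the same digest."""
    by_size = defaultdict(list)
    for rec in records:
        by_size[rec[1]].append(rec[0])  # assumes [path, size, ...]

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[file_digest(path)].append(path)
            except OSError:
                continue  # file moved or unreadable since the scan
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Note that this approach has to reopen and read every candidate file in full, so it can be slow on large trees.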

