
Open Source Data Recovery

The purpose of this project was to apply digital forensics principles to search for and locate lost data on storage devices of historical significance to the author, with the hope of recovering media. Attempting to uncover lost data provided an opportunity to parallel a real forensic investigation, similar to an intellectual property theft or industrial espionage case. Actually recovering information offered the reward of artifacts of historical interest that would otherwise have been lost. As an additional consideration, the author found that many digital forensics tools have limited utility for free-tier users or require expensive licenses to unlock essential capabilities. The cost of these systems puts them within reach of a digital forensics professional but not of an occasional practitioner. The author found these costs, and the requirement to periodically reinstall software to remain in the free tier, a limitation on long-term usage. For these reasons the author chose to explore this data recovery project using open source utilities, maintaining an enduring data retrieval capability that could prove useful in the future while avoiding the costs associated with commercial digital forensics tooling. This project resulted in the recovery of deleted files from two hard drives in the author’s possession, one 500 GB and the other 2 TB.

Requirements and Environment Setup

This analysis leveraged Kali Linux, an open source, Debian-based Linux distribution focused on information security, ethical hacking, and security research. Kali is free and comes with a variety of forensically sound packages out of the box, making it a natural choice for this investigation. Kali Linux ran in a virtual machine under Oracle VirtualBox. VirtualBox enforces a soft limit of 2 TB on virtual disks created through its GUI, though this limitation can be overcome by creating the disk manually from the command line with expanded parameters. See figures A-1 and A-2, “Creating a Large Virtual Machine,” for details on how to create a larger volume. The author used a 12 TB volume for this analysis.

Figure A-1  “Creating a Large Virtual Machine”
Figure A-2 “Creating a Large Virtual Machine VirtualBox Config”
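
As a rough sketch of the command-line approach, a 12 TB disk can be created with VBoxManage, which accepts sizes in megabytes, and then attached to the machine. The file path, machine name, and controller name below are illustrative assumptions, not the author’s exact values:

    REM create a 12 TB virtual disk (12 * 1024 * 1024 MB); the GUI wizard caps out well below this
    VBoxManage createmedium disk --filename "E:\VMs\kali-forensics.vdi" --size 12582912 --format VDI

    REM attach the new disk to an existing VM's SATA controller
    VBoxManage storageattach "kali-forensics" --storagectl "SATA" --port 1 --device 0 --type hdd --medium "E:\VMs\kali-forensics.vdi"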

The physical hardware used for this investigation was a Windows 11 machine with a 3.5 GHz processor and a 2 TB SSD. Local storage proved insufficient for this project since one of the target storage devices was 2 TB and would produce a ~2 TB forensic image, maxing out the computer’s internal drive. The author purchased and installed a 16 TB expansion drive to increase storage capacity and accommodate larger forensic images. It was initially unclear whether an HDD would be adequate for running a large virtual machine. The alternative was an SSD, but large SSDs are expensive and top out at lower capacities than HDDs. Investigating this point revealed that booting a virtual machine purely from an HDD posed no problem, which is intuitive given that HDDs were the only option before SSDs existed. Another potential complication was that VirtualBox was installed on the main system drive while the author needed to install and run a virtual machine on the empty expansion drive. This also proved to be no problem; VirtualBox handled these situations with minimal configuration changes, which impressed the author. The final configuration was a 12 TB virtual machine stored on a 16 TB HDD expansion drive, running Kali Linux on VirtualBox inside a Windows 11 PC.
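
VirtualBox can be pointed at an expansion drive without reinstalling anything. A minimal sketch of how that might look, assuming an illustrative drive letter and VM name:

    REM store all new machines on the expansion drive rather than the system SSD
    VBoxManage setproperty machinefolder "E:\VirtualBox VMs"

    REM or place a single machine there explicitly at creation time
    VBoxManage createvm --name "kali-forensics" --basefolder "E:\VirtualBox VMs" --register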

Imaging and Data Extraction

The target drives for this analysis were two hard drives used during the author’s previous career as a U.S. Marine Corps helicopter pilot. One of these drives (HD1), a 500 GB NTFS drive, was used for everyday tasks such as creating flight briefs, performing imagery analysis, and planning flights. The other drive (HD2) was primarily for personal use, containing books, movies, and assorted documents, and was 2 TB in size. HD2 was dual partitioned, part APFS and part NTFS, and was used to transfer documents between the author’s Mac and work PC systems. The hope was that, through forensic analysis and operations like file carving, forgotten files could be recovered.

Creating the forensic images

The software used to generate forensic images was guymager, which comes preinstalled with Kali Linux. Guymager includes a basic UI, enough to fulfill the functions of image extraction and basic folder navigation. One notable challenge was that the tool required root access to reach the full folder structure of the VM; otherwise it crashed with no error message when a drive was selected for imaging. Launching it with sudo was sufficient to run guymager as root, giving it free rein to search the system for drives to image.
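
A minimal sketch of the workaround, launching the tool as root from a terminal:

    # run guymager with root privileges so all attached devices are visible
    sudo guymager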

When selecting a drive for imaging it was not always clear which entry was the target, since every drive the virtual machine could access was listed as available for imaging. To identify the correct drive the author used lsblk, a utility that disambiguates the target drive from the other drives on the host machine by listing the size and human readable name of each device. See figure B for a sample output.

Figure B “lsblk Identifying Hard Drives”
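
A hedged example of the kind of invocation involved (the column list is illustrative; the default output also works):

    # show device names, sizes, and model strings to tell the target drive apart from other disks
    lsblk -o NAME,SIZE,TYPE,MODEL,MOUNTPOINT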

Disk imaging came with a few important configuration options. The default guymager setting split the image into smaller files of 2 GB each. The author initially performed an extraction with this configuration, but it proved problematic for such a large drive: the first attempt created ~250 x 2 GB image files from the 500 GB drive, which were difficult to search since each segment needed to be loaded. Splitting may be useful for smaller drives or when storing media on DVDs, but it was a hindrance in this analysis, so the author reran the imaging operation configured to write a single image file. Additionally, when initially attempting to back up an image file using VirtualBox’s shared folder function, the author encountered a data transfer bottleneck between the Windows host and the Linux VM. Transfers topped out at ~30 MB/s, which would have required 4-5 hours to move a large forensic image. This was a factor that led the author to explore a very large VM, so all forensic files could be stored on the same drive and slow transfer speeds avoided.
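
As a rough sanity check on that estimate (illustrative arithmetic only, not the author’s measurement):

    # hours to move a ~500 GB image at ~30 MB/s
    echo "500 * 1024 / 30 / 3600" | bc -l    # ~4.7 hours; a ~2 TB image would be closer to 19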

Guymager provided support for image hashing. The author explored the use of hashing when creating hard drive images (MD5, SHA-1, SHA-256) to allow a validity check of the extracted image. This process verifies that the extraction matches the original, which is important for forensic analysis and chain of custody. The author found that performing a hash and validity check approximately doubled the time spent in image extraction. Metadata related to the extraction was written to a file ending in .info, created alongside the image. This file provided details of the validity check against data corruption and imaging problems, an example of which can be seen in figure C.

Figure C “Validity Checks”
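
Those recorded hashes can also be re-checked later, outside guymager. A minimal sketch, assuming the image and its metadata were written as HD1.dd and HD1.info (hypothetical names):

    # recompute the image hash and compare it against the value guymager recorded
    md5sum HD1.dd
    grep -i md5 HD1.info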

Imaging such large hard drives took a long time: 2.5 hours for HD1 (500 GB) and 10 hours for HD2 (2 TB); see figures D-1 and D-2. Because of these durations the author set the drives to image overnight. In the case of HD2 the author forgot to disable sleep mode, and as a result the physical machine, and thus the virtual machine, went to sleep during the night before imaging could complete. Surprisingly this did not negatively impact the imaging beyond delaying its completion: when the computer woke from sleep, the imaging process resumed right where it left off, and the validity check confirmed the extraction suffered no issues from this unplanned interruption.

Figure D-1 “Imaging Timing HD1”
Figure D-2 “Imaging Timing HD2”

Analysis

The author used Autopsy to mount and search the extracted drive images. Autopsy, like guymager, comes standard in Kali Linux and provides the ability to mount an image, view the file tree, search using regex, and filter for deleted files. These capabilities provided a way to perform exploratory analysis and determine if a drive was worth pursuing as a cache of important information.  See figure E for images of Autopsy being used to explore the target hard drives. 

Figure E “Autopsy File Tree”
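
The Autopsy build bundled with Kali is the browser-based front end to The Sleuth Kit, so starting it is a one-line affair. A minimal sketch (the port and URL are the tool’s defaults, not a project-specific setting):

    # start the Autopsy Forensic Browser, then open the URL it prints (typically http://localhost:9999/autopsy)
    sudo autopsy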

Autopsy provided many forensically relevant pieces of information about each hard drive including the serial number and drive name. Using Autopsy, the author was able to locate several caches of image files; the Autopsy UI provided the full path while searching which allowed the author to easily find clusters of images and the corresponding parent folders by filtering for images and sorting alphabetically. 

While searching the folder tree the author noticed a large quantity of deleted files, displayed in bright red text, which validated the idea of pursuing file carving as the next step in the analysis. See figure F for examples of deleted files that attracted the author’s interest. In addition to displaying deleted files, Autopsy provided metadata about when files were created and modified. The author filtered and sorted by these values, allowing him to focus on specific time periods of interest on the hard drives. See figure G for an example. Interestingly, metadata was present even for deleted files that could not be recovered: while not directly useful, filenames could be retrieved from the metadata even when the file contents could not. For example, HD1 did not yield any deleted files from file carving but still included many filenames of deleted files, which could provide clues as to what was once present on the drive. This can be seen in figure F.

Figure F “Deleted Files in Autopsy”
Figure G “Sorting and Searching Deleted Files”
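
Because Autopsy sits on top of The Sleuth Kit, the same deleted-file listing can be reproduced from the command line when the UI is not convenient. A hedged sketch, assuming an image named HD1.dd with a filesystem starting at sector 2048 (both the name and the offset are illustrative):

    # show the partition layout, then recursively list deleted entries within one partition
    mmls HD1.dd
    fls -r -d -o 2048 HD1.dd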

File Carving 

The author chose scalpel, a command line tool that comes preinstalled with Kali Linux and is a standard choice in digital forensics, to attempt file carving. Scalpel has no UI, and configuring it requires editing a configuration file named scalpel.conf. Changes to the configuration determine which file types scalpel will attempt to recover, as well as the maximum size of files it should attempt to carve. Initially the author attempted to carve all file types, but it quickly became evident that scalpel would take days, possibly a full week, to carve a 2 TB image with every file type enabled. A paper on file carving times written by a security researcher from the University of Otago suggested that certain file types are particularly difficult to carve due to their tendency to trigger “false positives” in the file carver, meaning the carver returns objects that appear to be files but are not and cannot be opened by the user. Carving file types with a high tendency for false positives drives up carving time substantially. By eliminating these types, namely mpg, zip, and bespoke file formats, the author was able to decrease the projected duration of file carving from roughly a week to a matter of days. A sample file carving configuration is available in figure H.

Figure H “Scalpel File Carving Configuration”
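
A minimal sketch of the configuration and invocation involved; the entries below are adapted from the stock scalpel.conf, and the image and output names are illustrative, not the author’s exact setup:

    # excerpt of /etc/scalpel/scalpel.conf -- uncomment only the types worth carving
    #   extension  case-sensitive  max-size(bytes)  header                    footer
    jpg   y   200000000   \xff\xd8\xff\xe0\x00\x10   \xff\xd9
    gif   y     5000000   \x47\x49\x46\x38\x37\x61   \x00\x3b
    pdf   y    50000000   %PDF                        %EOF\x0d   REVERSE

    # run the carver against a forensic image, writing results to a new output folder
    sudo scalpel -c /etc/scalpel/scalpel.conf -o carve-output HD2.dd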

File carving large drives resulted in dozens of folders containing possible file-objects, many of which were not usable. This is a byproduct of the file carving process and required the author to scan the returned folders in search of genuinely recovered files. By quickly scanning file thumbnails the author was able to determine whether a given block of files was junk or actual data and thereby narrow the search. This was still a slow, manual process, but ultimately a fruitful one, although the number of false positive returns from file carving was surprising. HD1, despite the initial promise of many deleted files, returned no carved files. The author speculates this is because that drive endured many more write operations as part of daily use, which would have overwritten previously deleted files. HD2 yielded many files from file carving, though not without complication. Carving on HD2 proceeded for approximately 36 hours before hanging at ~76% completion, after the first phase of carving had taken 12 hours to complete. Once the file carving ETA started to increase (figure I), the author decided to abort the run with a keyboard interrupt. Thankfully, usable files were still retrieved from HD2, validating the process despite this added complication. It is unclear what caused the file carving operation to hang, but the author speculates it was related to HD2 being dual partitioned: the two partitions (one APFS, the other NTFS) were carved with a single command, which may have caused an issue if the file carver expected a consistent format.

Figure I – “Scalpel File Carving Complications with HD2”
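
One way to triage that volume of output before opening folders by hand is to count what landed in each per-type subfolder. A small sketch, assuming the output directory name used in the earlier scalpel example:

    # count carved objects in each per-type subfolder of the scalpel output
    for d in carve-output/*/; do
        printf '%6d  %s\n' "$(find "$d" -type f | wc -l)" "$d"
    done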

Results

HD2 provided a trove of deleted files. Many images and PDFs were extracted, much of it imagery from books and album art from music files. Some videos were also recovered, including parts of movies that had been stored on the drive temporarily. HD2 was used as a “morale drive,” holding data like books and movies, and was not subject to the deluge of write operations that comes with constantly changing day-to-day files. For this reason it makes sense that HD2 offered up files to carving while HD1 did not. The recovery folder included a series of subfolders organized by carved file type, shown in figure K. An example of a “false positive” file is visible in figure L: an object that looks like a file but cannot be opened. Lastly, examples of recovered files are visible in figure M; they are easily identifiable because they produce thumbnails.

The author was able to recover a particular file that he suspected existed but thought lost, a picture of an old friend with whom he flew many times as a Marine aviator. Ultimately this project provided many learning experiences but was more satisfying for turning up this piece of history the author would otherwise have lost. 

Figure K – “Recovered Folder List”
Figure L – “False Positives”
Figure M – “Recovered Images”

Additional Considerations and Learning Points

The time and cost associated with file carving were high. Carving small drives may not take long, but file carving a 2 TB hard drive was time intensive enough that a rerun (in case of misconfiguration) was a serious hurdle. Using an expansion drive so that all data could live on a single virtual machine, avoiding the transfer times associated with copying a ~2 TB file, was expensive in terms of both money and time.

Ultimately file carving proved to be a powerful technique even when performed purely with open source methods. It requires that an investigator have some idea of what he or she is looking for ahead of time, or risk carving too many problematic file types; excluding those types improved the speed of file carving and kept runtimes under control. In the end, the author was impressed with how much can be accomplished using open source systems alone.
