Hi Everyone,


I am Michael. I blog about helpful tips, from the technical (e.g. how to remove a computer virus) to hobbies (e.g. how to solve the Rubik's Cube) to my favorite crafts, like how to make an origami swan. I hope they help you solve an issue you've long had or simply inspire you to enjoy life more!

Google has been a great friend to me for helping me find solutions to my problems, but it is not omnipotent. Quite often I just cannot find the solution regardless of how I tweak the search query. It is frustrating and disappointing.

When that happens it doesn't mean that the solution does not exist. It may mean that it exists but for some reason you cannot find it on Google or any other search engine.

To that end, I try to make every one of my blog posts search engine friendly so that you have the best chance of finding useful information. This is the purpose behind www.oneminuteinfo.com!

Feel free to leave a comment in the comment box below each post!


May 6, 2017

Unix Shell Script to Find Big Unreferenced or Unused Files on Your Website
If you maintain a website, you may want to find big unreferenced files on the web server, such as images, videos, and PDF documents, so that you can remove them to save disk space. This is especially useful when other people maintained the site before you and you have no idea whether they forgot to delete unused files.

First, let's define what an unreferenced file is. Suppose you have a video.mp4 in your video directory, but no webpage on your website uses this file. Then video.mp4 is an unreferenced file, and you can remove it without breaking your website.

Let's go over a few Unix commands that help you identify big unreferenced static files, such as large images and videos, so you can remove them and reduce disk usage.

We target big files because removing them frees the most disk space.

List the base file names of big files recursively

Suppose $1 is the document root, $2 is the minimum file size in MB, and $3 is the extension of the target files. In our examples, the values are:

$1 = /usr/share/nginx/
$2 = 5
$3 = mp4

The Unix command to list base names of files with a specific extension above a certain file size in MB looks like the following:

find "$1" -type f -size +"$2"M -name "*.$3" -exec basename {} \; | sort -u

The result looks like the following:

video.mp4
video2.mp4

You can append the output to some file, say /tmp/big-files.txt. This example uses the .mp4 extension, which is a common extension for video files. In your case, you may want to run the same command with many extensions such as .jpg, .png, .pdf, .doc, and so on.
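As a rough sketch of running the same command for several extensions at once, the function below loops over a list of extensions and merges the results. The name list_big_basenames and its argument order are my own assumptions for illustration and are not part of the original command.

```shell
#!/bin/sh
# Sketch: collect the base names of big files for several extensions.
# list_big_basenames is a hypothetical helper name.
# Usage: list_big_basenames DOCROOT SIZE_MB EXT [EXT...]
list_big_basenames() {
    docroot=$1
    size_mb=$2
    shift 2
    for ext in "$@"; do
        # Quote the -name pattern so the shell passes it to find unexpanded.
        find "$docroot" -type f -size +"${size_mb}"M -name "*.$ext" \
            -exec basename {} \;
    done | sort -u    # merge all extensions and drop duplicate names
}
```

For example, list_big_basenames /usr/share/nginx/ 5 mp4 jpg png pdf >> /tmp/big-files.txt would append the base names for four extensions in one go.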

Do NOT include every extension, though. Leave out the extensions of files that serve your website's content, such as .html, .css, .js, and .php, to name a few.

However, base names alone can give misleading results, because the same file name may exist in more than one directory. So we need the next step.

List the relative paths of big files recursively

We are interested in the paths of the big files relative to the document root, not their absolute paths, because your webpages won't reference absolute filesystem paths. For example, suppose one big file's absolute path is:

/usr/share/nginx/project1/video2.mp4

Since the document root is /usr/share/nginx/, chances are a webpage references the file with the following text inside an HTML tag such as <video> or <a>:

project1/video2.mp4

Therefore, we want to know the paths relative to document root of the big files, too. The following command will list the relative paths of the big files recursively.

find "$1" -type f -size +"$2"M -name "*.$3" | sed -r "s|^$1||" | sort -u

The output looks like this:

project1/video2.mp4
en/project1/video.mp4

You can append the output to the same file, say /tmp/big-files.txt.
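The sed step above depends on $1 ending with a slash and treats $1 as a regular expression. One possible alternative, sketched below, is to run find from inside the document root so the printed paths are already relative. The function name list_big_relpaths is hypothetical.

```shell
#!/bin/sh
# Sketch: print the paths of big files relative to the document root.
# list_big_relpaths is a hypothetical helper name.
# Usage: list_big_relpaths DOCROOT SIZE_MB EXT
list_big_relpaths() {
    docroot=$1
    size_mb=$2
    ext=$3
    # The subshell keeps the cd from changing the caller's directory.
    ( cd "$docroot" && find . -type f -size +"${size_mb}"M -name "*.$ext" ) |
        sed 's|^\./||' |   # strip the leading "./" that find prints
        sort -u
}
```

This works the same whether the document root is given as /usr/share/nginx or /usr/share/nginx/.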

Identifying unreferenced files

Now go through each entry in /tmp/big-files.txt and report it if the text string does not appear in any file under your website's document root. The grep command is well suited to this:

while IFS= read -r f; do
    grep -RF -- "$f" "$1" > /dev/null || echo "$f"
done < /tmp/big-files.txt

Reading the file line by line keeps entries that contain spaces intact, and the -F option makes grep treat each entry as a literal string rather than a regular expression, so the dot in video.mp4 matches only a dot.

In this loop, if the current entry is found in some file under the document root, grep exits with a status of 0 (success), so the following command is true:

grep -RF -- "$f" "$1" > /dev/null

The echo after || is therefore skipped and the entry is not printed. Otherwise grep exits with a nonzero status, the command is false, and the entry is printed to the screen, which means it was not found anywhere under the document root.
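Here is a minimal sketch of the same check wrapped in a function so it can be reused with any list file. The name find_unreferenced is an assumption of mine. It uses grep -q, which prints nothing and stops at the first match, in place of the redirect to /dev/null.

```shell
#!/bin/sh
# Sketch: print every entry in LIST_FILE that no file under DOCROOT mentions.
# find_unreferenced is a hypothetical helper name.
# Usage: find_unreferenced LIST_FILE DOCROOT
find_unreferenced() {
    list_file=$1
    docroot=$2
    while IFS= read -r entry; do
        [ -n "$entry" ] || continue    # skip blank lines
        # -R: recurse, -q: quiet, -F: match the entry as a literal string
        grep -RqF -- "$entry" "$docroot" || printf '%s\n' "$entry"
    done < "$list_file"
}
```

Keep the list file outside the document root (as /tmp/big-files.txt is), otherwise grep would find every entry in the list itself.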

The final output is a list of big files that are not referenced anywhere under your website's document root, but don't delete them yet. Go through them one by one to confirm they really are unreferenced, because false results are possible depending on how your website is written.

For example, if you see the following in the output:

tutorial.mp4
en/video/tutorial.mp4
tc/video/tutorial.mp4

You can rest assured that tutorial.mp4 is not used anywhere on your website, and you can safely remove en/video/tutorial.mp4 and tc/video/tutorial.mp4.

However, if you see the following output instead:

en/video/tutorial.mp4

Then you must double-check, because the base name tutorial.mp4 is missing from the output, which means it is referenced somewhere in your document root.

Simply do a grep to see where tutorial.mp4 is referenced to determine if it can be deleted.
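The rule in the two examples above can be sketched as a small function. It assumes you saved the loop's output to a file, and both the function name classify_unreferenced and the labels it prints are hypothetical. A relative path is labeled "unreferenced" when its base name also appears in the output, and "check" when it does not.

```shell
#!/bin/sh
# Sketch: label each relative path in the loop's output.
# classify_unreferenced and its labels are hypothetical.
# Usage: classify_unreferenced RESULT_FILE
classify_unreferenced() {
    result_file=$1
    while IFS= read -r entry; do
        case $entry in
            */*) ;;           # a relative path: look up its base name below
            *) continue ;;    # a bare base name: nothing to classify
        esac
        base=${entry##*/}     # strip everything up to the last slash
        # -x matches whole lines only, -F treats the name literally
        if grep -qxF -- "$base" "$result_file"; then
            printf 'unreferenced\t%s\n' "$entry"
        else
            printf 'check\t%s\n' "$entry"
        fi
    done < "$result_file"
}
```

For an entry labeled "check", a command such as grep -RnF -- tutorial.mp4 /usr/share/nginx/ prints each matching file and line number so you can see where the name is used.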

Now you can easily write a script that combines these commands. You may have noticed that the loop is case-sensitive. To make it case-insensitive, simply add the -i option to the grep command.
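One way such a script could look is sketched below. It takes the same three values as $1, $2, and $3 in the examples, plus an optional fourth argument for extra grep options such as -i. The function name and the fourth argument are assumptions of mine, and the sketch handles one extension per call.

```shell
#!/bin/sh
# Sketch: list big files under DOCROOT that nothing under DOCROOT mentions.
# find_unreferenced_big_files and its fourth argument are hypothetical.
# Usage: find_unreferenced_big_files DOCROOT SIZE_MB EXT [GREP_OPTS]
find_unreferenced_big_files() {
    docroot=$1 size_mb=$2 ext=$3 grep_opts=${4:-}
    candidates=$(mktemp) || return 1
    # Step 1: collect each big file's relative path and its base name.
    ( cd "$docroot" && find . -type f -size +"${size_mb}"M -name "*.$ext" ) |
        sed 's|^\./||' |
        while IFS= read -r path; do
            printf '%s\n' "$path" "${path##*/}"
        done | sort -u > "$candidates"
    # Step 2: print the candidates that no file mentions.
    # $grep_opts is left unquoted on purpose so an empty value disappears.
    while IFS= read -r entry; do
        grep -RqF $grep_opts -- "$entry" "$docroot" || printf '%s\n' "$entry"
    done < "$candidates"
    rm -f "$candidates"
}
```

Calling find_unreferenced_big_files /usr/share/nginx/ 5 mp4 -i runs the case-insensitive check. The output is still only a list of candidates, so review it by hand before deleting anything.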

Questions? Let me know!
One Minute Information - by Michael Wen