Lewis' Blog Tales from the trenches of information technology

30 Jun 2013

Yet another method to grab download-disabled slideshows from SlideShare


Yes, I know. Horrible, horrible subject. The thought of stealing jpgs which are publicly viewable... Oh, well.

Standard disclaimer applies: Teaching someone how to steal a book does not make the teacher guilty of theft. If you get in trouble for following these directions, shame on you, not on me.

So, as a proof-of-concept, I was curious as to what SlideShare does to inhibit downloading of presentations. Apparently, all they do is not provide the (original?) PowerPoint document for download [1][2]. However, if one examines the source of the page, it is fairly easy to determine the filename of each slide image, and then automate fetching each one.

Requirements

  • Web browser or something to retrieve the source of one of the slideshow's pages (well, since you're reading this, I suppose we have this one covered)
  • cURL (look for a version compatible with your OS; the cURL project's download page is the place to start)

That's it.

Steps

  1. Open the page containing any slide in the set you want to download.
  2. View the source of the page (in Mozilla-based browsers, this is usually accomplished with Ctrl-U).
  3. Search for "og:image" in the source, and copy the URL from the content attribute that follows.
  4. Note the slide count in the lower left of the presentation.
  5. Open a terminal (command prompt or window session).
  6. Navigate to where you would like to save the downloaded images.
  7. Run the following cURL command:

     

    curl -O "http://image.slidesharecdn.com/<name-of-presentation-including-numeric-string>-phpapp02/95/slide-[1-n]-<resolution>.jpg"

An illustration

Searching for og:image in the source, we find:

<!-- fb open graph meta tags -->

  <meta name="fb_app_id" property="fb:app_id" class="fb_og_meta" content="7890123456" />
  <meta name="og_type" property="og:type" class="fb_og_meta" content="slideshare:presentation" />
  <meta name="og_url" property="og:url" class="fb_og_meta" content="http://www.slideshare.net/somedirectory/some-presentation" />
  <meta name="og_image" property="og:image" class="fb_og_meta" content="http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg" />

The URL given in the content attribute of the og:image tag is:

http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg
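If you would rather not dig through the page source by hand, the lookup can be sketched with sed. This is only a rough sketch: extract_og_image is a name I made up, and it assumes the og:image tag sits on a single line with property before content, exactly as in the excerpt above. A page laid out differently would need a different pattern.

```shell
#!/bin/sh
# Print the content attribute of the og:image meta tag read from stdin.
# Assumption: the tag is on one line and property= comes before content=,
# as in the excerpt above. extract_og_image is a hypothetical helper name.
extract_og_image() {
    sed -n 's/.*property="og:image"[^>]*content="\([^"]*\)".*/\1/p' | head -n 1
}

# Offline demonstration using the meta tags from the illustration.
extract_og_image <<'EOF'
<meta name="og_type" property="og:type" class="fb_og_meta" content="slideshare:presentation" />
<meta name="og_image" property="og:image" class="fb_og_meta" content="http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg" />
EOF
```

Against a live page, the same function could be fed by cURL itself, e.g. curl -s "http://www.slideshare.net/somedirectory/some-presentation" | extract_og_image, where -s merely silences the progress meter.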

Assume that the slide count is 55 (i.e., on the first slide, the lower left indicates "1/55"). Once in the directory where I want to save the images, I simply tell cURL:

curl -O "http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-[1-55]-1024.jpg"

and cURL will retrieve each jpg in the deck. The quotes keep the shell from trying to expand the square brackets itself.
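Building the range URL by hand is easy enough, but it can also be derived from the og:image URL and the slide count. The following is a minimal sketch; make_range_url is a hypothetical helper, and it assumes the og:image URL points at slide 1 and contains the text "slide-1-" exactly once, as in this example.

```shell
#!/bin/sh
# Turn the og:image URL for slide 1 into a cURL range URL covering all slides.
# make_range_url is a hypothetical helper: $1 = og:image URL, $2 = slide count.
# Assumption: the URL contains "slide-1-" as in the example above.
make_range_url() {
    printf '%s\n' "$1" | sed "s/slide-1-/slide-[1-$2]-/"
}

# Prints the range URL for a 55-slide deck.
make_range_url "http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg" 55
```

The result can be handed straight to cURL, for instance curl -O "$(make_range_url "$url" 55)", where $url holds the og:image URL. Keep the double quotes so the brackets reach cURL untouched.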

How it works

The -O option (long form: --remote-name) tells cURL to save each download to a local file named after the remote file. Without it, cURL dutifully writes the retrieved data to standard output, which is of little use for images.
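To see which local filename to expect, here is a small sketch that approximates the choice -O makes: it keeps only the part of the URL after the last slash. remote_name is a made-up helper, and the approximation assumes a URL without a query string, like the slide URLs in this post.

```shell
#!/bin/sh
# Approximate the local filename that curl -O would use: the part of the
# URL after the last slash. remote_name is a hypothetical helper.
# Assumption: the URL has no query string, like the slide URLs here.
remote_name() {
    basename "$1"
}

# Prints the filename expected for the first slide.
remote_name "http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg"
```

If you would rather pick the names yourself, cURL also accepts -o with a "#1" placeholder, which it replaces with the current value of the first range in the URL, e.g. -o "page_#1.jpg".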

The [1-55] is a cURL URL range: cURL requests the URL once for each number from 1 to 55, substituting that number for the bracketed expression (between the dashes in this example). The single command is therefore equivalent to:

curl -O http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-1-1024.jpg
curl -O http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-2-1024.jpg
curl -O http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-3-1024.jpg
[...]
curl -O http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-55-1024.jpg

Frustration with wget

My natural inclination was to use wget for this. However, wget supports globbing only for FTP, not for HTTP (no wildcards), and while I could have fed it a generated list of URLs, one after the other, that is a horribly clumsy way of accomplishing the task.
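For the curious, the clumsy route might look something like the sketch below: generate one URL per slide and hand the list to wget. list_slide_urls is a hypothetical helper of my own, and splitting the URL into a prefix and a suffix around the slide number is an assumption that happens to fit the SlideShare URLs above.

```shell
#!/bin/sh
# Print one URL per slide. list_slide_urls is a hypothetical helper:
# $1 = URL prefix up to and including "slide-",
# $2 = slide count,
# $3 = suffix after the slide number, such as "-1024.jpg".
list_slide_urls() {
    i=1
    while [ "$i" -le "$2" ]; do
        printf '%s%d%s\n' "$1" "$i" "$3"
        i=$((i + 1))
    done
}

# Offline demonstration: print the URLs for the first three slides.
list_slide_urls "http://image.slidesharecdn.com/somepresentation-1234567890-phpapp02/95/slide-" 3 "-1024.jpg"
```

Piping the full list into wget -i - makes wget read its URLs from standard input. It works, but compared with a single bracketed range it is a lot of ceremony.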

Apply the concepts

The point of all of this is not to go and rip off every download-disabled presentation on SlideShare, but rather to present a working example of how to use cURL to retrieve sequential filenames via http (or ftp). If you find another good use for this one-liner, please post a comment to let me know.

  1. Point of fact #1: I don't use PowerPoint, and I absolutely go ballistic when someone emails me one of those disgustingly-huge files which I must then convert to something readable (i.e., pray that it will open in Impress and then allow me to save it to an Impress file - or better, a pdf).
  2. Point of fact #2: I do not (yet) have an account on SlideShare, which is apparently required to download any presentations from their site.