Remove content creator and encoding data in metadata

9/3/2023

The only difference, as pointed out by the Tweet I've linked below, is the hash in the metadata. I have only tested downloads from Elsevier, and other publishers may be using different techniques (if anyone knows, please let us know).

Update 1: I have additionally tested the same article downloaded from four different countries, and there does not appear to be any identifying metadata embedded other than the hash in question. This might change in the future, just FYI.

Update 2: I tested a Nature article downloaded from Switzerland and the UK, and it did not have any identifying hash embedded in the PDF. Both documents were byte-for-byte identical despite having different origins.

Update 3: I have checked IEEE articles, which do have unique identifying data in many parts of the PDF. These are not erased by ExifTool + QPDF since they are not in the metadata. This is why step 4 is so important: the identifying elements may not be in the metadata only. You can try to use a hex editor to replace the differences in such PDFs with garbage values, so that their origins become untraceable.

Fast forward to removing the data by using ExifTool + QPDF (method listed in the problem description below): the linearized files output from QPDF differ ONLY in the UUID in the lines found at the beginning and the end of the document.

Update 3: According to the QPDF documentation, this string appears to depend on the file name and the timestamp at runtime. In this case, the file name was the same but the timestamp was different, resulting in a different string each time.

1. Identify the fingerprint in the PDF which relates to the file origin; in this case, a hash in the metadata.
2. Use ExifTool to expunge the metadata. Doing this still keeps the hash inside the file; it just doesn't show up when checking the metadata.
3. Use QPDF to linearize the stripped file from ExifTool. This erases the metadata (including the hash).
4. Find any other unique identifiers of the processed file.
5. Remove those identifiers as well without breaking the PDF.

I used Visual Studio to compare the bit-for-bit differences between files obtained at two different times (t1, t2) from two different institutional accesses (i1, i2), four files altogether: i1t1.pdf, i1t2.pdf, i2t1.pdf, i2t2.pdf.
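The byte-level comparison of the four downloads can also be done with a short script instead of Visual Studio. The following is a minimal sketch in Python, not the tool the author used. The helper name `diff_ranges` is hypothetical, and the file names are the ones listed above. It compares files position by position, so it is only meaningful when the files are aligned (same length, or differing only in fixed-width fields such as a hash); a single inserted byte would shift everything after it.

```python
import itertools
from pathlib import Path


def diff_ranges(a: bytes, b: bytes) -> list:
    """Return half-open (start, end) byte ranges where a and b differ.

    Hypothetical helper for illustration: a position-by-position comparison,
    not an alignment-aware diff.
    """
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i  # a differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    longest = max(len(a), len(b))
    if start is not None:
        # The run reaches the end of the shared prefix; include any extra tail.
        ranges.append((start, longest))
    elif longest > common:
        # Same content up to the shorter length; the extra tail counts as a difference.
        ranges.append((common, longest))
    return ranges


if __name__ == "__main__":
    names = ["i1t1.pdf", "i1t2.pdf", "i2t1.pdf", "i2t2.pdf"]
    files = {n: Path(n).read_bytes() for n in names if Path(n).exists()}
    for x, y in itertools.combinations(files, 2):
        print(x, "vs", y, "->", diff_ranges(files[x], files[y]))
```

An empty list for a pair means the two files are byte-for-byte identical, which is the condition the principle above relies on.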

I saw that Elsevier "embeds a hash in the PDF metadata that is unique for each download" so they can easily identify the violating party. So I wanted to know how to circumvent this issue and eliminate all traces of the origin of the PDF file, so this doesn't come back to bite me again.

The principle I have gone by here is: if two files from different sources are bit-for-bit identical, then by definition there is nothing in them that can be used to tell them apart.

I uploaded a few articles months ago and I got a notice saying Elsevier detected that the file uploaded to libgen was originally downloaded from Elsevier at XX:XX:XX time and date, along with the relevant IP it was downloaded from, which links to XX institutional access. It also had info in a format like what we would get from Grabify.

To create an encoded XML document, you add an XDeclaration to the XML tree, setting the encoding to the desired code page name. Any value returned by Encoding.WebName is a valid value. If you read an encoded document, the Encoding property will be set to the code page name. If you set Encoding to a valid code page name, LINQ to XML will serialize with the specified encoding.

Example: Create two documents that have different encoding and identify the encoding. The following example creates two documents, one with utf-8 encoding, and one with utf-16 encoding. It then loads the documents and prints the encoding to the console.

Console.WriteLine("Creating a document with utf-8 encoding")
Console.WriteLine("Encoding is:",
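Since the example listing for the XML encoding passage did not survive intact, here is a rough analogue of the same idea: write one document as utf-8 and one as utf-16, then read back the declared encoding. It uses Python's standard `xml.etree.ElementTree` rather than LINQ to XML, so it is a sketch of the concept, not the documented .NET code. The helper names `serialize` and `declared_encoding` are hypothetical, and the `xml_declaration` argument assumes Python 3.8 or later.

```python
import re
import xml.etree.ElementTree as ET


def serialize(root: ET.Element, encoding: str) -> bytes:
    """Serialize the tree with an XML declaration naming the given encoding."""
    return ET.tostring(root, encoding=encoding, xml_declaration=True)


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None if absent.

    Hypothetical helper: ElementTree does not expose the declared encoding,
    so the declaration is read directly from the decoded text.
    """
    # A UTF-16 byte order mark means the declaration itself is UTF-16 text.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = data.decode("utf-16")
    else:
        text = data.decode("utf-8")
    match = re.match(r"<\?xml[^>]*encoding=['\"]([^'\"]+)['\"]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    root = ET.Element("Root")
    root.text = "Content"
    for enc in ("utf-8", "utf-16"):
        print("Creating a document with", enc, "encoding")
        data = serialize(root, enc)
        loaded = ET.fromstring(data)  # the parser honors the declared encoding
        print("Encoding is:", declared_encoding(data), "| root tag:", loaded.tag)
```

The two serialized documents carry the same content but different bytes, and the declaration is what tells a reader which decoding to apply.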
